
Archive for the 'Yellowstone' Category

Yellowstone 3.5 Inch Drive Support!

Success! My Yellowstone disk controller card for the Apple II now works with 3.5 inch floppy drives! Along with the previously-implemented 5.25 inch floppy and Smartport HD support, this completes the triumvirate of Apple II disk drives. While it’s still very rough around the edges, I now have a working universal disk controller for Apple II that can handle any type of disk drive. This is exciting, because existing disk controllers for two of the three drive types are rare, expensive, or both. After a very long period of slow progress, I feel that everything’s finally starting to come together.

I’m especially pleased to see 3.5 inch floppies working, because 1 MHz Apple II machines like my Apple IIe theoretically aren’t fast enough to keep up with the higher bit rate of 3.5 inch disks. There’s not enough time for the CPU to poll for a new byte, store it, and get ready for the next byte before it’s already passed by. The official 3.5 inch disk controller card from Apple solves this problem by placing an entire second computer on the disk controller, with its own 2 MHz 6502 CPU, RAM, and ROM. But Yellowstone uses some Very Special Tricks in hardware to achieve 3.5 inch floppy support on the 1 MHz CPU. It borrows a technique from the UDC disk controller, and forces the computer’s READY signal low to halt the CPU until a new disk byte is ready. This eliminates the need for software polling, and shaves just enough cycles to make everything work.
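To put rough numbers on that timing squeeze, here is a back-of-the-envelope sketch. It assumes the commonly quoted bit cell times of about 4 µs for 5.25 inch disks and 2 µs for 3.5 inch disks, rounds the CPU clock to 1 MHz, and assumes a polling loop made of one absolute-addressed LDA plus a taken BPL. None of these figures come from the Yellowstone design itself.

```python
# Rough cycle budget per disk byte. All constants are illustrative assumptions.
CPU_MHZ = 1.0             # Apple IIe clock, rounded to 1 MHz
BITS_PER_DISK_BYTE = 8

# One pass of an assumed polling loop: LDA abs (4 cycles) + BPL taken (3 cycles)
POLL_LOOP_CYCLES = 4 + 3


def cycles_per_byte(bit_cell_us, cpu_mhz=CPU_MHZ):
    """CPU cycles that elapse between consecutive disk bytes."""
    return BITS_PER_DISK_BYTE * bit_cell_us * cpu_mhz


for name, bit_cell_us in (("5.25 inch", 4.0), ("3.5 inch", 2.0)):
    budget = cycles_per_byte(bit_cell_us)
    print(f"{name}: {budget:.0f} cycles per byte, "
          f"room for {budget / POLL_LOOP_CYCLES:.1f} polling passes")
```

Under these assumptions the 3.5 inch budget is only half of the 5.25 inch one, so every pass through a polling loop costs a large share of it. Stalling the CPU with READY removes that loop entirely.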

So now what? This is just the beginning; a proof of concept more than a finished project. I’ve only done the most cursory testing, and I’m sure there are many compatibility problems still to address, and devilish bugs to find and fix. I know about a few of them already. Here’s some of my planned testing:

  • Test all disk types with more thorough read and write tests
  • Try all the copy-protected 5.25 inch disks that rely on weird behaviors
  • Test formatting disks
  • Test with other cards installed in every slot
  • Test with an NMOS 6502 CPU
  • Boot from another disk controller and access Yellowstone as a secondary
  • Use with Apple IIGS at fast speed
  • Boot GSOS from 3.5 inch floppy and Smartport HD

Yay for testing. It’s important, but not very fun, and I may enlist some beta testers to help. The other big task still ahead is to design the second hardware version of the Yellowstone card. The current prototype has sprouted extra patch wires and even an extra chip glued to the board, but it’s still missing some key features. Here are some of the items on my to-do list:

  • Remove SPI ROM
  • Add external SRAM
  • Add a second disk connector
  • Add a switch to select between 2-port operation or daisy-chain 1-port
  • Buffer to isolate FPGA from disk signals
  • Connect Q3 and IOSTROBE to FPGA clock pins
  • Connect the upper 4 address lines
  • Connect RDY and PHI1
  • Add output buffer for RDY
  • Make better / wider GND connections
  • Improve the bypass capacitors
  • Allowance for in-circuit JTAG / SPI reprogramming
  • Allowance for self-test or external test
  • Add open-drain buffers or inline resistors for disk signals with multiple drivers
  • Label all the unlabeled pins and ports on the card
  • Switch to a bigger LDO voltage regulator
  • Add more power and ground test points
  • Round the corners on PCB

That ought to keep me busy for a while.


Yellowstone 5.25 Inch and Smartport Support

Way back in 2017, I began development of Yellowstone, an FPGA-based disk controller card for the Apple II. It’s hard to believe it’s been almost four years. Sometimes I feel like I’m moving inch by inch along a journey of a thousand miles. But today I made some significant progress, and a milestone of sorts.

My earlier efforts were focused on duplicating the functionality of the Liron disk controller card (for Unidisk 3.5 and Smartport hard drives), and separately on the Disk II controller card for 5.25 inch drives. More recently I’ve been working on mirroring the functionality of the UDC disk controller, because it supports all of those types of drives as well as standard 3.5 inch drives, and it can automatically detect drive types. Today’s good news: when configured as a UDC, Yellowstone is now able to auto-detect, read, and write Unidisk, Smartport, and 5.25 inch drives! That’s basically everything the UDC offers except for 3.5 inch drive support, which is my next goal. It may sound like I’ve merely duplicated what I had before, but the UDC works quite differently from other disk controllers, and reaching this point means I’m getting closer to unlocking the full potential of a “universal” disk controller.

I’ve discovered some interesting things along the way. Many readers probably know that the Disk II controller card was Apple’s first disk controller, designed by Woz and built entirely from simple off-the-shelf parts. Later Apple developed the IWM or “Integrated Wozniak Machine”, which did everything the Disk II controller could do, all in a single chip. But the IWM wasn’t just a replacement for the Disk II circuitry; it modified and extended it in a backwards-compatible way. The differences are subtle, but important. For example, if you write a byte to the Disk II controller while the disk is turned off, it has no effect. But if you write a byte to the IWM while the disk is turned off, it updates a configuration register that controls some interesting extra features.

I had been working towards creating an accurate IWM model, but what I discovered is that the code in the UDC’s ROM doesn’t really work with an IWM. It expects to be paired with some custom logic whose behavior is much closer to the original Disk II controller. It does strange-seeming things, like reading from and writing to Smartport hard drives without ever asserting the drive’s enable signal. And it has a few other mysterious behaviors where I must guess at what’s intended. I’ve had to modify my Verilog design, breaking some of the IWM behaviors in order to match the UDC expectations. That makes me a little uneasy; I don’t really want to maintain two different designs. Fortunately the differences aren’t extensive.

3.5 inch drive support is the next step, but there’s so much to do beyond that. The real UDC card supports two independent disk connectors, but Yellowstone only has one. Some later versions of the UDC also support daisy-chaining drives, which I’d love to get working for Yellowstone too. Unfortunately there are UDC versions with Smartport support, and versions with daisy-chaining support, but none with both. Combining the two may be a major challenge.

After Yellowstone is functionally complete, there will still be plenty more work to do. I need to redesign the board to better isolate the FPGA from any 5V signals, and generally make it as robust and foolproof as possible. I need to decide what I’m doing about the required DB-19 female connectors, which are stupidly difficult to find, though I do have some. I need to revisit all that ROM packing stuff I described here recently, to see if I can’t squeeze everything into a more common FPGA with a lower cost. I probably need to build some kind of end-user reprogramming capability too, to allow for bug fixes or new features. I’m using a JTAG programmer and the Lattice development software, but that’s not a user-friendly solution. That could become a major project in its own right. And finally I need to design a self-test capability, or an external testing rig, that can be used to verify large numbers of boards (relatively speaking) after they’ve been assembled. At the rate I’m going, I’ll be busy for a long time!


Yellowstone Progress Update

I’m still working on development of an FPGA-based disk controller card for the Apple II – Yellowstone. Over the past couple of months, I spent a long while analyzing the design of the UDC disk controller. The UDC supports all three major types of Apple II disk drives, making it a promising place to begin learning. After that I spent a long while more exploring how I might squeeze the UDC’s 8K of ROM and 2K of RAM into the limited resources of Yellowstone’s FPGA. Just recently I finally finished up those investigations and returned to actively building and testing the Yellowstone card. Unfortunately it still doesn’t work.

I built a second Yellowstone prototype, identical to the original except for selecting a Lattice MachXO2-2000 FPGA instead of a MachXO2-1200. This new chip is just barely large enough to hold the necessary ROM and RAM for my UDC pseudo-work-alike Verilog code. I’m not sure if I’ll use this solution for the final edition of Yellowstone, or if I’ll use a smaller MachXO2 version paired with a separate ROM or RAM, but at least I’m up and running again.

The card seems to work as expected when I probe its memory space from the Apple II monitor. I can access all 8K of ROM via its custom bank-switching logic, and its 2K of RAM also through bank switching. I can probe its soft-IWM and watch the disk I/O lines change. Everything looks OK. But when I try to actually boot a 5.25 inch disk, it just freezes the computer.

It’s not completely dead; it does do *something*. The disk drive turns on and spins. Using a logic analyzer, I can see some brief activity on the disk I/O lines that I interpret as “hello, are you a 3.5 inch drive?” before it goes silent. If I then reset the Apple II and examine some memory locations where I know the UDC stores status info, I can see that it detected one disk drive. But why didn’t it boot? More importantly, why did it freeze?

If this were a normal software program, I could use a debugger to interrupt the program and see where it’s frozen. That alone might be enough to reveal what’s wrong. If not, I could restart the program from the beginning, and step through it line-by-line until I found the problem. But nothing like that is possible here. There’s no facility for Apple II breakpoints or single-stepping through code that’s in ROM, and even if there were, the I/O code is timing-dependent and would likely break when run in the debugger. The poor man’s debugger is printed log messages, flashing LEDs, and similar indicators, but even that will be difficult. I can’t easily add or edit code in the UDC ROM, because it contains lots of absolute address references as well as implicit assumptions about certain chunks of code and data avoiding page boundaries.

I wish I still had my old HP 1631D logic analyzer. Then I could hook up 24 probes to the Apple II’s address bus and data bus and then let the computer run, examining the logged CPU cycles afterwards using the HP’s state listing view. My Saleae logic analyzer is nice for many tasks, but even if it had 24 probes, it’s basically only designed for timing / waveform views. I guess not many people look at parallel busses anymore.


FPGA Block RAM Packing

In an earlier blog post, I was lamenting how one-ninth of an FPGA block RAM was wasted when storing 8-bit ROM data, because there’s no simple way to make use of the 9th parity bit in each word of a block RAM. Horrors! To fight this injustice, I’ve developed a solution that I call packed ROM. It stores nine 8-bit bytes in eight 9-bit words of block RAM, and provides an interface to read the data as if it were an 8-bit memory with a larger depth. Using this method, I’m able to store 1152 bytes of read-only data per block RAM instead of only 1024. The solution relies on the fact that the block RAMs are dual port – you can read from two different addresses simultaneously. Compared with using the same number of block RAMs as a standard 8-bit wide ROM, this solution consumes an extra 54 LUT4s in a MachXO2-1200 FPGA – about 4 percent of the total. It increases the MachXO2-1200’s effective capacity for this type of 8-bit ROM data from 7168 to 8064 bytes.

Here’s the Verilog code, as well as a Python program that reads a plain binary file and writes a “packed” file in .mem format. The code assumes 7 block RAMs, but should be easily adaptable to other numbers.

module packedROM #(parameter NUM_BLOCK_RAMS = 7) (
    input [12:0] addr,
    input clk,
    output reg [7:0] Q
);
    // packs 1152*NUM_BLOCK_RAMS 8-bit data bytes into 1024*NUM_BLOCK_RAMS 9-bit words
    // uses 54 LUT4s of the MachXO2
    // may need to change addr width depending on NUM_BLOCK_RAMS. Use $clog2()?
    // nine bytes A-I are packed into eight 9-bit words as follows:
    // 0: I3 I2 I1 I0 A4 A3 A2 A1 A0
    // 1: I7 I6 I5 I4 B4 B3 B2 B1 B0
    // 2: F7 F6 E7 E6 C4 C3 C2 C1 C0
    // 3: H7 H6 G7 G6 D4 D3 D2 D1 D0
    // 4: A7 A6 A5 E5 E4 E3 E2 E1 E0
    // 5: B7 B6 B5 F5 F4 F3 F2 F1 F0
    // 6: C7 C6 C5 G5 G4 G3 G2 G1 G0
    // 7: D7 D6 D5 H5 H4 H3 H2 H1 H0
    // bytes A-H are sequential in the byte-oriented address space below addr 1024*NUM_BLOCK_RAMS
    // byte I is one of the "extra" bytes, in byte-oriented address space beyond addr 1024*NUM_BLOCK_RAMS
    reg [12:0] wordAddressA;
    reg [12:0] wordAddressB;
    wire [8:0] QA;
    wire [8:0] QB;

    // dualPortROM is a wrapper for the MachXO2 block RAMs, created by the Lattice IP Express tool.
    // it is actually a dual port RAM with the write input unused
    // port names below are assumed (typical IP Express true dual port RAM wrapper);
    // match them to the module that the tool actually generates
    dualPortROM myDualPortROM(
        .DataInA(9'b0), .DataInB(9'b0),
        .AddressA(wordAddressA), .AddressB(wordAddressB),
        .ClockA(clk), .ClockB(clk),
        .ClockEnA(1'b1), .ClockEnB(1'b1),
        .WrA(1'b0), .WrB(1'b0),
        .ResetA(1'b0), .ResetB(1'b0),
        .QA(QA), .QB(QB)
    );

    wire [12:0] overflowAddr = addr - (NUM_BLOCK_RAMS * 1024);

    // purely combinational: choose the two word addresses, then assemble the byte from QA and QB
    always @* begin
        if (addr < NUM_BLOCK_RAMS * 1024) begin
            // packed area, bytes A-H
            wordAddressA = addr;
            // word address for the upper bits depends on low three bits of the byte address
            case (addr[2:0])
                0: begin // A
                    wordAddressB = { addr[12:3], 3'b100 };
                    Q = { QB[8:6], QA[4:0] };
                end
                1: begin // B
                    wordAddressB = { addr[12:3], 3'b101 };
                    Q = { QB[8:6], QA[4:0] };
                end
                2: begin // C
                    wordAddressB = { addr[12:3], 3'b110 };
                    Q = { QB[8:6], QA[4:0] };
                end
                3: begin // D
                    wordAddressB = { addr[12:3], 3'b111 };
                    Q = { QB[8:6], QA[4:0] };
                end
                4: begin // E
                    wordAddressB = { addr[12:3], 3'b010 };
                    Q = { QB[6:5], QA[5:0] };
                end
                5: begin // F
                    wordAddressB = { addr[12:3], 3'b010 };
                    Q = { QB[8:7], QA[5:0] };
                end
                6: begin // G
                    wordAddressB = { addr[12:3], 3'b011 };
                    Q = { QB[6:5], QA[5:0] };
                end
                7: begin // H
                    wordAddressB = { addr[12:3], 3'b011 };
                    Q = { QB[8:7], QA[5:0] };
                end
            endcase
        end
        else begin
            // overflow area, byte I
            // word address is byte overflow address times 8 for the lower bits, and times 8 plus 1 for the upper bits
            wordAddressA = { overflowAddr[9:0], 3'b000 };
            wordAddressB = { overflowAddr[9:0], 3'b001 };
            Q = { QB[8:5], QA[8:5] };
        end
    end
endmodule

import os
from array import array

infile = "coderom.bin"
outfile = "coderom.mem"

num_block_rams = 7
outsize = 1024 * num_block_rams    # number of 9-bit words to write

inputData = array('B')
insize = os.path.getsize(infile)
with open(infile, 'rb') as f:
    inputData.fromfile(f, insize)

# the packed layout reads 1152 bytes per block RAM; pad a shorter input with zeros
inputData.extend([0] * (outsize + outsize // 8 - insize))

with open(outfile, "w") as out:
    for x in range(outsize):
        baseAddr = x & ~7                              # first byte (A) of this group of eight
        extra = inputData[outsize + baseAddr // 8]     # byte I, taken from the overflow area
        k = x & 7                                      # which of the eight words in the group
        if k == 0:
            word = ((extra & 0x0F) << 5) | (inputData[baseAddr] & 0x1F)
        elif k == 1:
            word = ((extra & 0xF0) << 1) | (inputData[baseAddr+1] & 0x1F)
        elif k == 2:
            word = ((inputData[baseAddr+5] & 0xC0) << 1) | ((inputData[baseAddr+4] & 0xC0) >> 1) | (inputData[baseAddr+2] & 0x1F)
        elif k == 3:
            word = ((inputData[baseAddr+7] & 0xC0) << 1) | ((inputData[baseAddr+6] & 0xC0) >> 1) | (inputData[baseAddr+3] & 0x1F)
        else:
            # words 4-7: top three bits of A-D above the low six bits of E-H
            word = ((inputData[baseAddr+k-4] & 0xE0) << 1) | (inputData[baseAddr+k] & 0x3F)
        out.write('{:02X}\n'.format(word))
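One way to gain confidence in the packing layout before touching hardware is a software round trip. The sketch below is not part of the original tool: `pack` and `read_byte` are hypothetical helper names, with `pack` restating the layout from the Verilog comments and `read_byte` mimicking the two-word read that the packedROM module performs.

```python
import random


def pack(data, num_block_rams=7):
    """Pack 1152*n bytes into 1024*n nine-bit words using the A-I layout."""
    outsize = 1024 * num_block_rams
    assert len(data) == outsize + outsize // 8
    words = []
    for x in range(outsize):
        base, k = x & ~7, x & 7
        extra = data[outsize + base // 8]            # byte I for this group
        if k == 0:
            w = ((extra & 0x0F) << 5) | (data[base] & 0x1F)
        elif k == 1:
            w = ((extra & 0xF0) << 1) | (data[base + 1] & 0x1F)
        elif k in (2, 3):
            hi = data[base + 2 * k + 1]              # F for word 2, H for word 3
            mid = data[base + 2 * k]                 # E for word 2, G for word 3
            w = ((hi & 0xC0) << 1) | ((mid & 0xC0) >> 1) | (data[base + k] & 0x1F)
        else:
            w = ((data[base + k - 4] & 0xE0) << 1) | (data[base + k] & 0x3F)
        words.append(w)
    return words


def read_byte(words, addr, num_block_rams=7):
    """Model of the read path: fetch two words, then select and join bit fields."""
    limit = 1024 * num_block_rams
    if addr >= limit:                                # overflow area, byte I
        group = (addr - limit) * 8
        qa, qb = words[group], words[group + 1]
        return ((qb >> 5) << 4) | (qa >> 5)
    group, k = addr & ~7, addr & 7
    qa = words[addr]
    if k < 4:                                        # bytes A-D: top 3 bits live in words 4-7
        qb = words[group + 4 + k]
        return ((qb >> 6) << 5) | (qa & 0x1F)
    qb = words[group + 2 + (k - 4) // 2]             # bytes E-H: top 2 bits live in words 2-3
    top2 = (qb >> 5) & 0x3 if k % 2 == 0 else (qb >> 7) & 0x3
    return (top2 << 6) | (qa & 0x3F)


random.seed(0)
data = bytes(random.randrange(256) for _ in range(1152 * 7))
words = pack(data)
unpacked = bytes(read_byte(words, a) for a in range(len(data)))
print("round trip OK" if unpacked == data else "round trip FAILED")
```

If the round trip fails for some address, the low three bits of that address point to the case in the Verilog that deserves a closer look.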


When 64 Kbits Is Not 8 Kbytes

This FPGA-based disk controller project is going to need every byte of on-chip memory that I can scrounge up. The datasheet says my Lattice MachXO2-1200 has 64 Kbits of embedded RAM (EBR). See the shaded column in the table above. 64 Kbits is 8 Kbytes, and I plan to store 8 KB of 6502 program code, so that looks perfect. Except that I misinterpreted the table in two different ways.

Looking more closely at the table, there are 7 EBR blocks and each block is 9 Kbits. That’s a total of 63 Kbits, not 64. The datasheet is just wrong here, or they’re using some very liberal rounding method. I just lost 1 Kbit!

That’s not the worst of it. Later in the datasheet, it mentions that each EBR block can be configured as 8192 x 1, 4096 x 2, 2048 x 4, or 1024 x 9. Only the last of those configurations represents 9 Kbits of data. If you want to store 8-bit wide data, you have to use the 1024 x 9 configuration and throw away the extra bit. So that’s another 1 Kbit lost from each block.

When you add everything up, if you’re storing 8-bit wide data, the MachXO2-1200 can only store 56 Kbits in EBR rather than the advertised 64 or 63 Kbits. That may sound like a small difference, but it will have a big impact on my design.
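The arithmetic is simple enough to check in a few lines. This is only a sketch of the numbers quoted above; the constant and function names are made up for illustration.

```python
KBIT = 1024
EBR_BLOCKS = 7            # block count from the datasheet table
WORDS_PER_BLOCK = 1024    # the 1024 x 9 configuration


def ebr_bits(bits_used_per_word):
    """Total EBR bits available when each word contributes this many bits."""
    return EBR_BLOCKS * WORDS_PER_BLOCK * bits_used_per_word


raw_kbits = ebr_bits(9) // KBIT          # every bit of every word
byte_wide_kbits = ebr_bits(8) // KBIT    # 9th bit of each word thrown away
usable_bytes = ebr_bits(8) // 8
shortfall = 8 * 1024 - usable_bytes      # versus the planned 8 KB of 6502 code

print(f"raw: {raw_kbits} Kbits, byte-wide: {byte_wide_kbits} Kbits")
print(f"usable: {usable_bytes} bytes, short of 8 KB by {shortfall} bytes")
```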

Sure I could add an external SRAM, and maybe I’ll do that eventually, but I really want to squeeze all the advertised memory space from this chip. The wasted area offends my engineering sensibilities. So I’ve been brainstorming a couple of crazy solutions, and wondering if anyone else has ever tried something similar.

I could pack the 8-bit byte data like a bitstream into consecutive 9-bit words of an EBR block, so nine bytes A through I would be stored in eight words:

EBR0[0]: B0 A7 A6 A5 A4 A3 A2 A1 A0
EBR0[1]: C1 C0 B7 B6 B5 B4 B3 B2 B1
EBR0[2]: D2 D1 D0 C7 C6 C5 C4 C3 C2
EBR0[3]: E3 E2 E1 E0 D7 D6 D5 D4 D3
EBR0[4]: F4 F3 F2 F1 F0 E7 E6 E5 E4
EBR0[5]: G5 G4 G3 G2 G1 G0 F7 F6 F5
EBR0[6]: H6 H5 H4 H3 H2 H1 H0 G7 G6
EBR0[7]: I7 I6 I5 I4 I3 I2 I1 I0 H7

To read a byte would require reading two separate words, and then doing some bit shifting and masking with the results. I’d need some kind of state machine and an appropriate clock to handle the two separate reads. And I’d need some kind of divide by nine logic (or eight-ninths?) to convert a byte-oriented address to the corresponding 9-bit word address.
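A small software model makes the address math for this bitstream idea concrete. It is only a sketch, with hypothetical function names, and it ignores the clocking and state machine that real hardware would need. Byte n starts at bit 8n of the stream, so the word index and bit offset come from dividing 8n by 9.

```python
def pack_bitstream(data):
    """Lay bytes end to end, LSB first, and slice the stream into 9-bit words."""
    stream = int.from_bytes(data, "little")      # byte n occupies bits 8n .. 8n+7
    nwords = (8 * len(data) + 8) // 9            # ceil(8 * len / 9)
    return [(stream >> (9 * i)) & 0x1FF for i in range(nwords)]


def read_bitstream_byte(words, n):
    """Recover byte n using at most two word reads plus shifting and masking."""
    index, offset = divmod(8 * n, 9)             # the "divide by nine" step
    low = words[index] >> offset
    # the second word is only needed when the byte straddles a word boundary
    high = words[index + 1] << (9 - offset) if index + 1 < len(words) else 0
    return (low | high) & 0xFF


data = bytes(range(9))                           # stands in for bytes A through I
words = pack_bitstream(data)
print(len(words), "words for", len(data), "bytes")
print([read_bitstream_byte(words, n) for n in range(len(data))])
```

The `divmod` by 9 is the awkward part to build in logic, which is the concern raised above.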

A second idea is to leverage the fact that these are seven separate EBR blocks, and to read from them all in parallel, combining one or two bits from each to reassemble the byte:

EBR0[0]: I0 H0 G0 F0 E0 D0 C0 B0 A0
EBR1[0]: I1 H1 G1 F1 E1 D1 C1 B1 A1
EBR2[0]: I2 H2 G2 F2 E2 D2 C2 B2 A2
EBR3[0]: I3 H3 G3 F3 E3 D3 C3 B3 A3
EBR4[0]: I4 H4 G4 F4 E4 D4 C4 B4 A4
EBR5[0]: I5 H5 G5 F5 E5 D5 C5 B5 A5
EBR6[0]: E6 D7 D6 C7 C6 B7 B6 A7 A6 

The intended advantage here was that I would only need to do one read from each EBR block, but since there are one fewer EBR blocks than bits in a byte, one of the blocks must perform double-duty and this advantage is lost. And I would still need some kind of divide by nine address logic, with different logic for some EBRs than others. I’m actually not sure whether this approach would even work.

I feel like there should be some not-too-complex scheme to store the full 63 Kbits of data in a way allowing for 8-bit byte retrieval, but I can’t quite find it.


Squeezing FPGA Memory

I’m developing an Apple II disk controller that’s based on the UDC disk controller design. The original UDC card had 8K of ROM and 2K of RAM, so it needs 10K of combined memory. The FPGA device I’m using for prototyping, a Lattice MachXO2-1200, has 8K of embedded block RAM and 1.25K of distributed RAM. It also has 8K of “user flash memory”. So will the UDC design fit? It’s close, but I think the answer is no.

At first I thought I could store the ROM data in the FPGA’s UFM section, but that doesn’t look promising. I can store the data there, but compared to embedded block RAM, accessing UFM is inconvenient and probably impractical for live execution of 6502 code. Accessing the UFM requires setting up a Wishbone interface in the FPGA’s Verilog code, starting a memory transaction, and reading out an entire page of flash (16 bytes). It’s also pretty slow. I don’t think it’ll be possible to read an arbitrary byte of UFM and return it to the CPU within ~500 ns, as would be required for directly executing code from it.

OK, so no UFM. Maybe I can store the 8K of ROM data in EBR, using RAM to hold what’s technically ROM? That would work, but it would leave only 1.25K of distributed FPGA RAM remaining to implement the required 2K of RAM for the disk controller. It’s 768 bytes short. No good.
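The budget can be tallied in a few lines, using the capacities as stated in this post. The variable names are only for illustration.

```python
K = 1024

ROM_NEEDED = 8 * K             # UDC ROM
RAM_NEEDED = 2 * K             # UDC RAM
EBR = 8 * K                    # embedded block RAM, as stated above
DISTRIBUTED = 5 * K // 4       # 1.25K of distributed RAM

ebr_left = EBR - ROM_NEEDED                            # EBR remaining after the ROM image
ram_shortfall = RAM_NEEDED - (ebr_left + DISTRIBUTED)  # what the RAM still lacks

print(f"EBR left after ROM: {ebr_left} bytes")
print(f"RAM shortfall: {ram_shortfall} bytes")
```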

I could switch to a larger FPGA with more memory, or add a separate RAM or ROM chip. But that would increase cost and complexity, and anyway wouldn’t help with my prototype board that’s already built.

Stupid Idea #1

From my analysis of the UDC ROM, I think the upper half of the card’s RAM is only used when communicating with Smartport drives. So I might be able to reduce the RAM from 2K to 1K, and at least I’d be able to test whether 3.5 inch and 5.25 inch drive support works. Using 8K of EBR and 1K of distributed RAM, I’d have a whopping 256 bytes of RAM left. Will it work? I think distributed RAM just means using the FPGA’s logic resources as RAM, so this approach would use 80% of the FPGA’s logic resources and only leave 20% remaining for the actual card functionality, like the IWM model and other logic. It might work, it might not.

Stupid Idea #2

The 8K of ROM isn’t one large chunk. It’s divided into 1K banks that can be mapped into a single 1K region of the computer’s address space. There’s already a small code routine to facilitate the bank switching. What if I could somehow make this routine copy the desired 1K block from UFM to EBR at the moment it’s needed? Then I’d have 8K in UFM, with a 1K cache in EBR, and the 2K of RAM also in EBR.

This would definitely fit, but there would be a delay every time code execution moved to a different 1K ROM page. How long does it take to move 1024 bytes from UFM to EBR? I’m not sure, but I’ll guess it’s tens to hundreds of microseconds. Will that cause problems? Maybe. Will this approach be a pain to implement? Definitely.
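A toy software model shows the shape of the idea: eight banks in slow storage, one 1K bank held in fast memory, and a copy whenever a different bank is selected. The class and its names are hypothetical, and it says nothing about how long a real UFM copy would take.

```python
BANK_SIZE = 1024


class BankCache:
    """Toy model: ROM banks in slow storage, one bank cached in fast RAM."""

    def __init__(self, rom):
        assert len(rom) % BANK_SIZE == 0
        self.rom = rom
        self.cache = bytearray(BANK_SIZE)
        self.cached_bank = None
        self.reloads = 0                 # counts the slow copies

    def select_bank(self, bank):
        """Copy the requested bank into the cache unless it is already there."""
        if bank != self.cached_bank:
            start = bank * BANK_SIZE
            self.cache[:] = self.rom[start:start + BANK_SIZE]
            self.cached_bank = bank
            self.reloads += 1

    def read(self, offset):
        """Read a byte from the currently cached bank."""
        return self.cache[offset]


# fill each bank with its own bank number so reads are easy to recognize
rom = bytes(i // BANK_SIZE for i in range(8 * BANK_SIZE))
cache = BankCache(rom)
for bank in (1, 1, 3, 3, 7, 1):
    cache.select_bank(bank)
print("reloads:", cache.reloads)
```

Counting reloads for a realistic sequence of bank switches would give a feel for how often the copy delay would be paid.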

Stupid Idea #3

From what I’ve observed of the ROM code, bank 1 contains 5.25 inch functions plus 3.5 inch formatting. Bank 3 is exclusively for 3.5 inch stuff, and bank 7 is exclusively for Smartport drives. Maybe I could temporarily remove some parts of the ROM, in order to make it all fit? Then I might be able to test all the different types of supported disk drives, just not all at the same time.

Stupid Idea #4

Maybe I can modify the ROM code to use 2K of the Apple II’s own RAM instead of 2K of onboard RAM? Then everything would fit in the FPGA. But there must be a good reason the UDC designers didn’t do this. What 2K region of Apple II RAM is safe to use, and wouldn’t get overwritten by running software? I’m not sure.

Stupid Idea #5

Maybe I can modify the prototype board somehow, and graft an extra RAM or ROM chip on there for testing purposes? Maybe I can add a second peripheral card and somehow use its RAM or ROM? Now these ideas are getting crazy.

What’s the Long-Term Solution?

None of these ideas except #2 are workable as a long-term solution, if I eventually move ahead with manufacturing this disk controller card. So what path makes the most sense in the long-term?

Stepping up to the MachXO2-2000 would add about $2 in parts cost, which maybe doesn’t sound like much, but it’s significant. The XO2-2000 has 9.25K of EBR and 2K of distributed RAM, so the design should fit with a small amount of room to spare. That’s surely the least-effort solution.

I could keep the MachXO2-1200 and add a separate 2K RAM chip. The 8K of ROM would fit in the 1200’s EBR. The combined cost might be slightly lower than the MachXO2-2000, but the design and layout would become more complex, and I’m not sure it’s worth it.

I could step down to the MachXO2-640 (2.25K EBR, 640 bytes distributed RAM) and add a separate ROM chip. Total cost would be slightly less than a MachXO2-1200, and I’d also gain lots of extra ROM space for implementing extra features or modes. That would be great. Like adding a separate RAM chip, the extra ROM would make the board design and layout somewhat more complex. But the biggest drawback would be for manufacturing or reprogramming, because both the FPGA and the ROM would need to be programmed separately before the card could be used. Or maybe the FPGA could program the ROM somehow, but it would still be cumbersome and far less attractive than a single-chip programming process.

I never imagined a shortage of just 768 bytes could make such a difference. What an adventure!

