BMOW title
Floppy Emu banner

FPGA Block RAM Packing

In an earlier blog post, I was lamenting how one-ninth of an FPGA block RAM was wasted when storing 8-bit ROM data, because there’s no simple way to make use of the 9th parity bit in each word of a block RAM. Horrors! To fight this injustice, I’ve developed a solution that I call packed ROM. It stores nine 8-bit bytes in eight 9-bit words of block RAM, and provides an interface to read the data as if it were an 8-bit memory with a larger depth. Using this method, I’m able to store 1152 bytes of read-only data per block RAM instead of only 1024. The solution relies on the fact that the block RAMs are dual port – you can read from two different addresses simultaneously. Compared with using the same number of block RAMs as a standard 8-bit wide ROM, this solution consumes an extra 54 LUT4s in a MachXO2-1200 FPGA – about 4 percent of the total. It increases the MachXO2-1200’s effective capacity for this type of 8-bit ROM data from 7168 to 8064 bytes.

Here’s the Verilog code, as well as a Python program that reads a plain binary file and writes a “packed” file in .mem format. The code assumes 7 block RAMs, but should be easily adaptable to other numbers.

module packedROM #(parameter NUM_BLOCK_RAMS = 7) (
    input [12:0] addr,
	input clk,
	output reg [7:0] Q
    );
	
	// packs 1152*NUM_BLOCK_RAMS 8-bit data bytes into 1024*NUM_BLOCK_RAMS 9-bit words 
	// uses 54 LUT4s of the MachXO2
	// may need to change addr width depending on NUM_BLOCK_RAMS. Use $clog2()? 
	
	// nine bytes A-I are packed into eight 9-bit words as follows:
	// 0: I3 I2 I1 I0 A4 A3 A2 A1 A0
	// 1: I7 I6 I5 I4 B4 B3 B2 B1 B0
	// 2: F7 F6 E7 E6 C4 C3 C2 C1 C0
	// 3: H7 H6 G7 G6 D4 D3 D2 D1 D0
	// 4: A7 A6 A5 E5 E4 E3 E2 E1 E0
	// 5: B7 B6 B5 F5 F4 F3 F2 F1 F0
	// 6: C7 C6 C5 G5 G4 G3 G2 G1 G0
	// 7: D7 D6 D5 H5 H4 H3 H2 H1 H0
	// bytes A-H are sequental in the byte-oriented address space below addr 1024*NUM_BLOCK_RAMS
	// byte I is one of the "extra" bytes, in byte-oriented address space beyond addr 1024*NUM_BLOCK_RAMS
	
	reg [12:0] wordAddressA;
	reg [12:0] wordAddressB;
	
	wire [8:0] QA;
	wire [8:0] QB; 
	
	// dualPortROM is a wrapper for the MachXO2 block RAMs, created by the Lattice IP Express tool.
	// it is actually a dual port RAM with the write input unused
	dualPortROM myDualPortROM(
		.DataInA(9'b000000000),
		.DataInB(9'b000000000),
		.AddressA(wordAddressA),
		.AddressB(wordAddressB),
		.ClockA(clk),
		.ClockB(clk),
		.ClockEnA(1'b1),
		.ClockEnB(1'b1),
		.WrA(1'b0),
		.WrB(1'b0),
		.ResetA(1'b0),
		.ResetB(1'b0), 
		.QA(QA),
		.QB(QB)
	);
	
	wire [12:0] overflowAddr = addr - (NUM_BLOCK_RAMS * 1024);
			
	always @* begin
		if (addr < NUM_BLOCK_RAMS * 1024) begin
			// packed area, bytes A-H
			wordAddressA <= addr;
			// word address for the upper bits depends on low three bits of the byte address
			case (addr[2:0])
				0: begin // A
					wordAddressB <= { addr[12:3], 3'b100 };
					Q <= { QB[8:6], QA[4:0] };	
					end
				1: begin // B
					wordAddressB <= { addr[12:3], 3'b101 };
					Q <= { QB[8:6], QA[4:0] };	
					end
				2: begin // C
					wordAddressB <= { addr[12:3], 3'b110 };
					Q <= { QB[8:6], QA[4:0] };	
					end
				3: begin // D
					wordAddressB <= { addr[12:3], 3'b111 };
					Q <= { QB[8:6], QA[4:0] };	
					end	
				4: begin // E
					wordAddressB <= { addr[12:3], 3'b010 };
					Q <= { QB[6:5], QA[5:0] };	
					end		
				5: begin // F
					wordAddressB <= { addr[12:3], 3'b010 };
					Q <= { QB[8:7], QA[5:0] };	
					end											
				6: begin // G
					wordAddressB <= { addr[12:3], 3'b011 };
					Q <= { QB[6:5], QA[5:0] };	
					end						
				7: begin // H
					wordAddressB <= { addr[12:3], 3'b011 };
					Q <= { QB[8:7], QA[5:0] };	
					end
			endcase
		end
		else begin
			// overflow area, byte I
			// word address is byte overflow address times 8 for the lower bits, and times 8 plus 1 for the upper bits
			wordAddressA <= { overflowAddr[9:0], 3'b000 };
			wordAddressB <= { overflowAddr[9:0], 3'b001 };
			Q <= { QB[8:5], QA[8:5] };
		end
	end
endmodule




import os
from array import array

infile = "coderom.bin"
outfile = "coderom.mem"

inputData = array('B')

insize = os.path.getsize(infile)
with open(infile, 'rb') as f:
    inputData.fromfile(f, insize)
    out = open(outfile,"w") 
    
    num_block_rams = 7
    outsize = 1024 * num_block_rams
   
    for x in range(0,outsize):
        baseAddr = x & ~7
        if x & 7 == 0:
            out.write('{:02X}\n'.format( (((inputData[outsize+baseAddr//8])&0xF)<<5) | ((inputData[baseAddr])&0x1F)))
        elif x & 7 == 1:
            out.write('{:02X}\n'.format( (((inputData[outsize+baseAddr//8])&0xF0)<<1) | ((inputData[baseAddr+1])&0x1F)))
        elif x & 7 == 2:
            out.write('{:02X}\n'.format( (((inputData[baseAddr+5])&0xC0)<<1) | (((inputData[baseAddr+4])&0xC0)>>1) | ((inputData[baseAddr+2])&0x1F)))
        elif x & 7 == 3:
            out.write('{:02X}\n'.format( (((inputData[baseAddr+7])&0xC0)<<1) | (((inputData[baseAddr+6])&0xC0)>>1) | ((inputData[baseAddr+3])&0x1F)))
        elif x & 7 == 4:
            out.write('{:02X}\n'.format( (((inputData[baseAddr])&0xE0)<<1) | ((inputData[baseAddr+4])&0x3F)))
        elif x & 7 == 5:
            out.write('{:02X}\n'.format( (((inputData[baseAddr+1])&0xE0)<<1) | ((inputData[baseAddr+5])&0x3F)))
        elif x & 7 == 6:
            out.write('{:02X}\n'.format( (((inputData[baseAddr+2])&0xE0)<<1) | ((inputData[baseAddr+6])&0x3F)))
        elif x & 7 == 7:
            out.write('{:02X}\n'.format( (((inputData[baseAddr+3])&0xE0)<<1) | ((inputData[baseAddr+7])&0x3F)))

Read 4 comments and join the conversation 

4 Comments so far

  1. Newsdee - February 5th, 2021 7:23 pm

    Do you think this disk controller could be integrated to an Apple II+ FPGA core? The following has most of an Apple II implemented, but it’s missing hard disk (and floppy disk write) support: https://github.com/MiSTer-devel/Apple-II_MiSTer

  2. David Brown - February 22nd, 2021 2:54 pm

    As you are having issues with the boot up, it could be that the issue is when the address crosses to the “overflow address” in the logic above. Looks like none of the logic is clocked, so if the address changes before the data is stable, it could be getting an invalid value. The two demux paths (branches of the of) likely have different delays, especially since the overflow address involves a subtractor feeding the address input to the BRAM. The data path from address through the Block RAM through the demuxing (the case + if) could be quite long, no to mention any propagation delay from the logic to the pins and bus.
    Might need to latch/register the address bus to ensure there’s enough time to settle. Ideally, the data would be registered at the I/O pins to ensure consistent timing from the I/O to the bus. If you can’t run a faster clock, then maybe use opposite edges or slightly delayed clock to do this.

  3. Steve - February 22nd, 2021 3:43 pm

    Thanks! Good theory, maybe you’re right. The clock I’m using for the packed ROM is the 7 MHz fclk, not the 1 MHz main clock, so I think that will help. I’ve examined packed ROM memory locations using the Apple II system monitor, and everything looks OK for both the normal and overflow address ranges. But maybe the timing is slightly different when actually executing code from the packed ROM instead of examining the ROM values in the monitor.

  4. Steve - February 22nd, 2021 3:46 pm

    To clarify, dualPortROM is a wrapper for the MachXO2 block RAMs, whose address input is clocked using the 7 MHz fclk.

Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.