BMOW title

Archive for the 'Nibbler' Category

Programming the 4-Bit CPU

Instructions for the Nibbler 4-bit CPU come in two types: immediate and addressed. Both types begin with 4 bits of instruction opcode, to identify the specific instruction. Immediate instructions like ADDI include a 4-bit operand value embedded in the instruction, and so are 8 bits in total size. Addressed instructions like JMP include a 12-bit address in data space (RAM) or program space (ROM), and are 16 bits in total size. The instruction encoding is very simple:
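The two instruction formats can be sketched in a few lines of Python. This is only an illustration, assuming the opcode sits in the high nibble of the first byte and that an addressed instruction carries the top 4 address bits in the same byte, followed by the low 8 bits. The helper names are made up for this example; the opcode values for ADDI and JE are the ones given later in this post.

```python
ADDI = 0xA  # opcode given in the text
JE = 0x8    # opcode given in the text


def encode_immediate(opcode, operand):
    """Pack a 4-bit opcode and a 4-bit operand into one program byte."""
    assert 0 <= opcode <= 0xF and 0 <= operand <= 0xF
    return bytes([(opcode << 4) | operand])


def encode_addressed(opcode, address):
    """Pack a 4-bit opcode and a 12-bit address into two program bytes."""
    assert 0 <= opcode <= 0xF and 0 <= address <= 0xFFF
    return bytes([(opcode << 4) | (address >> 8), address & 0xFF])


def decode_address(code):
    """Recover the 12-bit address from a two-byte addressed instruction."""
    return ((code[0] & 0xF) << 8) | code[1]
```

For example, `encode_immediate(ADDI, 3)` gives the single byte $A3, and `encode_addressed(JE, 0x123)` gives the two bytes $81 $23.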

The four instruction opcode bits i[0..3] are combined with the ALU carry flag C and equal flag E, as well as the Phase bit, in order to form a 7-bit address for the two microcode ROMs. The ROMs’ outputs form the 16 control signals needed to orchestrate the behavior of all the other chips. The contents of the microcode ROMs are shown in the table below.
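As a rough sketch of how the seven address bits come together, the function below packs the opcode, flags, and phase into one number. The bit ordering here (phase on top, then C, then E, then the opcode) is an assumption made for illustration only; the text does not say which ROM address pin each signal uses.

```python
def microcode_address(opcode, carry, equal, phase):
    """Form a 7-bit microcode ROM address.

    Assumed layout (not from the schematic): bit 6 = Phase, bit 5 = C,
    bit 4 = E, bits 3..0 = opcode i[3..0].
    """
    return ((phase & 1) << 6) | ((carry & 1) << 5) | ((equal & 1) << 4) | (opcode & 0xF)


def all_addresses():
    """Enumerate every combination of opcode, flags, and phase."""
    return {
        microcode_address(op, c, e, p)
        for op in range(16)
        for c in (0, 1)
        for e in (0, 1)
        for p in (0, 1)
    }
```

Whatever the real pin assignment is, the count is the same: 16 opcodes x 2 x 2 x 2 gives 128 distinct microcode words per ROM.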


The first line of the table shows that phase 0 is the same for all instructions. That’s good, because at this point the instruction hasn’t been fetched from memory yet, so the CPU doesn’t even know what instruction it’s executing! Nothing interesting happens as far as the microcode is concerned, while the Fetch register is loaded with the program ROM byte, and the program counter is advanced to the next byte.

Phase 1 is where all the interesting work happens. ADDI is a good example of a typical instruction. Its instruction opcode is 1010 binary, or $A hex. For this instruction, the same control signals will be asserted regardless of the C or E flags. /oeOprnd will be 0, enabling the bus driver to drive the immediate operand value onto the data bus, where it connects to one of the ALU inputs. The other ALU input is always connected to the accumulator register. The /carryIn, M, and S[0..3] control signals will be set to put the ALU into addition mode, with no carry-in. /loadA will be 0, so the ALU result will be stored to the accumulator. /loadFlags will also be 0, so the carry and equal flags will be updated with the results from the addition.

JE (jump if equal) is another good example. Its instruction opcode is 1000 binary, or $8 hex. If the E flag is 1, then incPC will be 0 (PC will not be incremented) and /loadPC will be 0 (PC will be loaded with the new address). So if E is 1, the jump will be taken. If the E flag is 0, then incPC will be 1 (PC will be incremented) and /loadPC will be 1 (PC will not be loaded). So if the E flag is 0, the CPU just skips over the destination address byte and advances to the next instruction, without taking the jump. Jump instructions don’t involve the ALU, so the ALU control signals are unimportant and are shown in gray.
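The JE behavior can be modeled in a few lines. This is a simplified sketch, not the actual microcode contents: it only covers the two PC-related control signals, and it assumes that during phase 1 the PC already points at the second byte of the instruction (the low 8 address bits). The function names are hypothetical.

```python
def je_phase1_controls(e_flag):
    """Return the PC control signals for JE in phase 1 (/loadPC is active low)."""
    taken = e_flag == 1
    return {"incPC": 0 if taken else 1, "/loadPC": 0 if taken else 1}


def next_pc(pc, target, e_flag):
    """Model the PC update at the end of phase 1 of a JE instruction."""
    controls = je_phase1_controls(e_flag)
    if controls["/loadPC"] == 0:
        return target                # jump taken: load the 12-bit target
    if controls["incPC"] == 1:
        return (pc + 1) & 0xFFF      # not taken: step past the address byte
    return pc
```

With `pc = 0x101` and `target = 0x200`, the result is `0x200` when E is 1 and `0x102` when E is 0.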

Notice that some gray control signals are labeled “dc” for don’t care, but others have a 0 or 1 value. This is because I’ve chosen specific values for those dc’s in order to accomplish another goal. Look carefully, and you’ll see that /carryIn is everywhere equal to i3, and M is everywhere equal to i2. That means /carryIn and M don’t really need to be in the microcode at all: anything that needs them can just connect to i3 and i2 instead. This idea could probably be extended to more bits, but I stopped at two. If you look at the circuit schematic I posted previously, you’ll see that I’m not currently taking advantage of this, but it could be helpful later if I find a need to add another control signal or two.

Instruction Set

With only 4 bits for the instruction opcode, there’s a maximum of 16 different instructions, so the choice of which instructions to implement is challenging. It would be possible to have more than 16 instructions, but only by shrinking the address size or immediate operand size. Since this is a 4-bit CPU, a 4-bit opcode size just makes sense.

There are five types of jump instructions:

JMP – unconditional jump
JC – jump if carry
JNC – jump if no carry
JE – jump if equal
JNE – jump if not equal

This is a good set of jump instructions for convenient programming. Although they’re convenient, the negative jumps JNC and JNE aren’t absolutely necessary, because you can always rewrite them with a positive jump and an unconditional jump:

JNC noCarry
   do stuff
noCarry:
   do other stuff

can be rewritten as:

JC carry
JMP noCarry
carry:
   do stuff
noCarry:
   do other stuff

LD and ST load a nibble from data space (RAM) to the accumulator, or store a nibble from the accumulator to data space.

LIT loads a literal immediate nibble value to the accumulator. It might also have been named LDI.

OUT stores a nibble from the accumulator to one of the eight output ports, while IN loads a nibble from one of the eight input ports to the accumulator. These instructions are intended to be used for I/O.

ADDI adds a literal immediate nibble value to the accumulator, while ADDM adds a nibble in data space to the accumulator. In either case, the carry-in for the addition is 0, and the carry-out and equal flags are set by the results of the addition.

CMPI compares a literal immediate nibble value to the accumulator (subtracts the value from the accumulator), while CMPM compares a nibble in data space to the accumulator. The accumulator value is not modified, but the carry-out and equal flags are set by the results of the subtraction. Because the ALU is performing a subtraction, the carry-in is always 1 in order to make the math work out correctly.
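To see why the carry-in has to be 1, here is a small model of the compare arithmetic. It is a sketch of the math only: subtraction is done as A plus the complement of the value plus 1, and the `equal` result is computed here simply as "the 4-bit difference is zero", which is an assumption about the flag's meaning rather than a model of how the hardware derives it.

```python
def compare(acc, value):
    """Return (carry_out, equal) for the 4-bit subtraction acc - value."""
    # A - B is computed as A + ~B + 1; the +1 is the carry-in.
    total = acc + (~value & 0xF) + 1
    carry_out = total >> 4           # 1 when acc >= value (no borrow)
    equal = (total & 0xF) == 0       # the difference is zero
    return carry_out, equal
```

Without the carry-in, the sum would be A + ~B, which is one less than the true difference, so equal inputs would produce 1111 instead of 0000.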

NORI performs a NOR with an immediate value, while NORM performs a NOR with a data space value. The carry-out and equal flags are set by the results of the NOR.

Synthetic Instructions

What about the instructions that aren’t here, like subtraction, AND, OR, NOT? Happily, all of these can be synthesized from the existing instructions. The assembler I’m working on includes some of these as macros, so they can be used as if they were built-in instructions.

ORI x: NORI x; NORI #0
ORM x: NORM x; NORI #0
ANDI x: NORI #0; NORI ~x
SUBI x: ADDI (~x+1)
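These identities are easy to check exhaustively, since there are only 16 x 16 combinations of accumulator and operand. The Python below is a quick sanity check of the arithmetic, not part of the assembler; the function names are invented for the example, and flags are not modeled.

```python
MASK = 0xF  # everything is 4 bits wide


def nor(a, b):
    return ~(a | b) & MASK


def ori(a, x):
    """ORI x  ->  NORI x; NORI #0"""
    return nor(nor(a, x), 0)


def andi(a, x):
    """ANDI x  ->  NORI #0; NORI ~x"""
    return nor(nor(a, 0), ~x & MASK)


def subi(a, x):
    """SUBI x  ->  ADDI (~x+1)"""
    return (a + ((~x + 1) & MASK)) & MASK


def check_all():
    """Compare each synthetic instruction against the direct operation."""
    for a in range(16):
        for x in range(16):
            assert ori(a, x) == a | x
            assert andi(a, x) == a & x
            assert subi(a, x) == (a - x) & MASK
    return True
```

ORM follows the same pattern as ORI, with the first operand coming from data space instead of the instruction.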

The only common instructions that can’t be trivially synthesized are ANDM and SUBM, because they require the use of a temporary register to hold the complement of the data space value. A dedicated location in data space could be used for this purpose, since Nibbler only has one register.

In many cases, with some thought you can eliminate the need for these synthetic instructions entirely. Take the common case of checking a specific bit in an input. On a CPU with an ANDI instruction, this might look like:

#define BIT2 $4 ; $4 = 0100
IN #0 ; load accumulator with nibble from port 0
ANDI BIT2 ; keep only bit 2
JNE bit2is1

With Nibbler, you might think to rewrite this using the synthetic ANDI mentioned above:

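#define NOT_BIT2 $B ; $B = 1011
IN #0 ; load accumulator with nibble from port 0
NORI #0 ; synthetic ANDI, step 1: invert the accumulator
NORI NOT_BIT2 ; synthetic ANDI, step 2: NOR with the inverted mask
JNE bit2is1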

But this can be shortened by simplifying the two negatives NORI and JNE into a single positive JE:

#define NOT_BIT2 $B ; $B = 1011
IN #0 ; load accumulator with nibble from port 0
NORI NOT_BIT2 ; result is zero only when bit 2 is 1
JE bit2is1
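The shortening rests on a small identity: a NOR with the mask $B leaves a zero result exactly when bit 2 of the input is 1. The check below verifies only that arithmetic over all 16 input values. How the E flag is derived from the result is not modeled here, and the function name is made up for the example.

```python
NOT_BIT2 = 0xB  # 1011, same value as the #define in the text


def nor(a, b):
    return ~(a | b) & 0xF


def bit2_is_set_via_nor(port_value):
    """True when NOR with the inverted mask yields zero."""
    # NOR(a, 1011) equals (~a AND 0100), which is zero only if bit 2 of a is 1.
    return nor(port_value, NOT_BIT2) == 0
```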

More to Come

Next time I’ll post more about the software tools I’ve written for simulating Nibbler and assembling programs. Until then, questions and comments are always welcome!


Custom 4-Bit CPU Schematic and Control

Enough with the vague design talk – here’s the circuit schematic for the Nibbler 4-bit CPU! Click the image to zoom in to a full size view. The whole system fits on a single page, including the CPU itself and the I/O devices, so it’s easy to wrap your head around.

Except for RAM and ROM, all the chips shown here are common 7400 series parts. I haven’t selected a logic family yet, but most likely they’ll be 74HC or 74HCT, which require less power while offering similar speed to the more common 74LS family.

Program Data

The parts on the schematic are arranged in the same relative positions as in the architecture diagram from my previous post. At the middle-right is the program ROM, where the currently running program is stored. This is an 8Kx8 EEPROM, but Nibbler’s address size only allows for 4K programs, so one of the address inputs is unused and is hard-wired to 0. Program memory is 8 bits wide, and so all 8 of the ROM’s I/O lines are used. Depending on the type of instruction, these may be 4 bits of instruction opcode and 4 bits of immediate operand, or 4 bits of instruction opcode and 4 bits of address, followed by 8 more bits of address. At the start of execution of each instruction, this program byte is loaded into the Fetch register.

The address of the program instruction that’s currently being executed is stored in the program counter. The PC consists of three ‘163 4-bit counters, chained together to make a 12-bit logical register. After most instructions, the PC will increment to point to the next instruction. For jump instructions, the PC can also be loaded with a new address. The address comes from the Fetch register operand value (highest 4 bits) and the program ROM byte (lowest 8 bits).

Control and Microcode

At the top left of the schematic are the three chips pertaining to the execution of the current instruction. The Fetch register is a ‘377, an 8-bit register that holds the current instruction opcode in the high 4 bits and immediate operand or address data in the low 4 bits. ALU flags are stored in the 4-bit Flags register, a ‘173. There are only two flags, carry and equal, so two of the four bits are unused. The last chip in this group is a ‘175, a quad flip-flop. One flip-flop is used to synchronize the reset signal, and another is the Phase bit, which constantly toggles between 0 and 1 to indicate which of the two clock cycles of an instruction’s execution is currently underway. Fetch is loaded at the end of the clock cycle when Phase is 0. The other two flip-flops are unused.

With two chips that are only half-used, is there a way to combine the functions of the ‘173 and the ‘175 into a single chip? Probably not: flip-flops load data on every clock, but the ‘173 needs a load enable for the ALU flags.

The instruction opcode, ALU flags, and phase are combined to form a 7-bit address for the two microcode ROMs, shown at the mid-left. The output of the two ROMs constitutes the 16 control signals needed to orchestrate the behavior of all the other chips. The microcode is stored in two 2Kx8 EEPROMs, so four of the eleven address inputs on each ROM are unused and hard-wired to 0.

ALU Datapath

At the bottom-left of the schematic are the ‘181 ALU and the ‘173 accumulator register “A”. The ALU (arithmetic and logic unit) can perform any common arithmetic or logical operation on its two inputs. In this case, one input always comes from the accumulator, while the other is supplied from the data bus. The ALU result is stored back into the accumulator. The ALU, accumulator, and data bus are all 4 bits wide, which is what makes Nibbler a 4 bit CPU.

Carry-In and Carry Flag

If you look carefully, you’ll see that the ALU’s carry-in bit is a control signal provided by microcode, not the carry flag from the Flags register. This is a subtle but important point: the carry flag is an output from an arithmetic instruction, and can be used to make a conditional jump if the carry flag is/isn’t set, but it doesn’t feed back into the ALU to affect later calculations. This means that when performing multi-nibble pair-wise additions, the program must check the carry flag after each nibble addition, and add an extra 1 into the next addition if it’s set.
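Here is a small model of what that means for a multi-nibble addition. It is only a sketch in Python of the arithmetic a Nibbler program would have to perform: each 4-bit add has its carry-in fixed at 0, and the carry from the previous nibble is folded in with an extra add of 1. On the real CPU the "add 1 if carry" step would be a conditional jump around an ADDI; the function names here are hypothetical.

```python
def add_nibble(acc, value):
    """Model ADDI/ADDM: 4-bit add with carry-in fixed at 0. Returns (result, carry)."""
    total = acc + value
    return total & 0xF, total >> 4


def add_multi(a_nibbles, b_nibbles):
    """Add two equal-length numbers stored least-significant nibble first."""
    result = []
    carry = 0
    for a, b in zip(a_nibbles, b_nibbles):
        acc, c1 = add_nibble(a, carry)   # fold in the carry from the previous nibble
        acc, c2 = add_nibble(acc, b)     # then add the other operand's nibble
        result.append(acc)
        carry = c1 | c2                  # at most one of the two adds can carry
    return result, carry
```

For example, `add_multi([0xF, 0x2], [0x1, 0x1])` computes $2F + $11 and returns `([0x0, 0x4], 0)`, which is $40.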

This was a conscious design choice. If the carry flag did connect to the ALU’s carry-in bit, then the program would need to clear it before performing any single-nibble additions, and those are much more common than multi-nibble additions. Also the carry-in bit can’t simply be hard-wired to 0, because as you’ll see later, the CMP (compare) instruction requires carry-in to be 1 in order to work properly. So carry-in must be provided by the microcode.


RAM is shown at the bottom-center. Its I/O lines are connected to the data bus, and the address comes from the Fetch register operand value (highest 4 bits) and the program ROM byte (lowest 8 bits). Ideally the system would use a 4Kx4 SRAM, to match Nibbler’s address size and data width, but the closest match I could readily find was a 2Kx8 SRAM. That means there will only be 2048 addressable nibbles instead of 4096, and half of the RAM I/O lines will be unused.

Notice the CLK signal is connected to the RAM’s /CE (chip enable) input. This means the RAM will only be enabled during the second half of each clock cycle. This is a simple way of preventing erroneous writes to RAM during the early part of the clock cycle, when the /WE (write enable) signal and RAM address may not yet be valid.

IN and OUT Ports

The IN and OUT ports are also connected to the data bus, and are shown on the schematic at bottom-right. IN0 is a ‘125 4-bit bus driver, which outputs the state of four pushbuttons connected to pull-up resistors. Because there’s only a single IN port, no decoding of the port number is done, and this ‘125 will actually respond to any port number with the IN instruction. If more IN ports were added, then additional port number decoding logic would be needed.

The two OUT ports are ‘173 4 bit registers. OUT1 connects to databus[4..7] of a 16×2 character LCD display using the common HD44780 controller. Although this LCD controller has an 8 bit interface, it can also operate in 4 bit mode, in which case only the highest 4 LCD databus lines are used. OUT0 connects two more lines to the LCD, for the RS and E signals needed to control LCD data transfers. The other two lines from OUT0 connect to an LED, which can be toggled on/off as a basic debugging aid, and to a speaker, which can be bit-banged in software to generate simple square-wave tones at different frequencies.

Notice that the ‘173s have two load enable inputs, /G1 and /G2, and both must be low in order to load data to the chip. /G1 of both chips is connected to the /LOADOUT control signal. But as with the IN port, the OUT port number is not fully decoded, in order to avoid needing extra decoding logic. Instead, bit 0 of the port number is connected to OUT0 /G2, and bit 1 to OUT1 /G2. This means that OUT0 will actually respond to any port number where bit 0 is 0, and OUT1 to any port number where bit 1 is 0. It would even be possible to load both OUT ports simultaneously by using a port number where both bits 1 and 0 were 0, although that probably wouldn’t be useful.
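The partial decoding is easy to tabulate. The sketch below assumes /LOADOUT is already asserted by an OUT instruction and looks only at the two low bits of the port number, as described above; the function name is made up for this example.

```python
def out_ports_loaded(port):
    """Return which OUT registers load for a given port number.

    Assumes /LOADOUT (/G1 on both chips) is low. /G2 is active low too,
    so a register loads when its port-number bit is 0.
    """
    loaded = []
    if (port & 0b01) == 0:   # bit 0 drives OUT0 /G2
        loaded.append("OUT0")
    if (port & 0b10) == 0:   # bit 1 drives OUT1 /G2
        loaded.append("OUT1")
    return loaded
```

So a port number ending in binary 10 selects only OUT0, one ending in 01 selects only OUT1, one ending in 00 loads both, and one ending in 11 loads neither.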

Bus Drivers

The last two components on the data bus are a pair of 4-bit bus drivers, shown at the center and at the bottom-center of the schematic. These are two halves of a single ‘244 octal driver. One drives the ALU result onto the data bus, which is necessary when storing data to RAM or an OUT port. The other drives the operand value from the Fetch register onto the data bus, which is necessary for instructions that involve an immediate constant value.

More to Come

Next time I’ll post more details about the control signals, microcode, and instruction set. Until then, questions and comments are always welcome!



Hello Nibbler!

Say hello to Nibbler, the 4-bit homemade CPU! Ever since I built BMOW1, people have written to me asking how to make their own homebrew computers. BMOW is a complex design that can be difficult to comprehend, so I decided it was time to create a minimal CPU that’s easy to understand, easy to build, but still capable of running interesting programs. Ideas for Nibbler began percolating in my brain, and after a few weeks of pencil sketches and hand simulation, it’s finally ready to share. And if you’ve forgotten, a nibble is half a byte or 4 bits, so the name fits the CPU.

Some of you may be thinking: “4-bit CPU? BORING!” I agree that many of the 4-bit CPU designs on the web aren’t very exciting, though that’s not an inherent problem with their 4-bitness, but is caused by shortcomings in the computer that surrounds the CPU. Most designs are limited to 256 nibbles of memory, which just isn’t enough to fit a program that does anything very interesting. I/O is often limited to basic LEDs and switches, further reducing the scope of what’s possible.

My goals for Nibbler are:

  • Only use commonly-available 7400 series chips and RAM/ROM. No programmable logic or other goodies.
  • Keep the total number of chips as few as possible.
  • Employ a simple, straightforward design that’s easy to understand.
  • Maintain a clean logical separation between the CPU and the computer surrounding it.
  • Run interesting, interactive programs involving several I/O devices.
  • NOT: Be the most powerful CPU, or the easiest to write programs for.



The architecture of Nibbler is shown above. The CPU core is just eleven 7400 series chips, plus the clock crystal. RAM and ROM add two more chips, and peripheral I/O in “the computer” adds three more, for a total of sixteen chips overall. Compared to BMOW’s 65 chips and multiple clocks, that’s very lightweight.

Instruction opcodes are 4 bits wide, which allows for 16 possible types of instructions. All instructions require exactly two clock cycles to execute. During the first clock cycle, called phase 0, the instruction opcode and operand are retrieved from memory and stored in a register called Fetch. The second clock cycle, called phase 1, performs the calculation or operation needed to execute the instruction.

A pair of microcode ROMs is used to generate the sixteen internal control signals needed to load, enable, and increment the other chips in the CPU at the appropriate times. The microcode ROM address is formed from the instruction opcode, the phase, and the CPU carry and equal flags. Each microcode ROM outputs a different group of eight of the sixteen total control signals.

A load-store design is used, with all arithmetic and logical computation results being stored into the single 4-bit accumulator register named “A”.  Data can be moved between A and memory locations in RAM, but otherwise all the CPU instructions operate only on A. This greatly simplifies the hardware requirements, at the cost of some decrease in flexibility when writing programs.

In contrast to most modern CPUs, the Nibbler design uses a Harvard Architecture. That means programs and data are stored in separate address spaces, and travel on separate busses. The data bus is 4 bits wide, as one should expect for a 4-bit CPU. The program bus is 8 bits wide: 4 bits for the instruction opcode, and 4 bits for an immediate operand.

Program and data addresses are both 12 bits wide, resulting in total addressable storage of 4096 bytes for programs and 4096 nibbles for data. A 12 bit program counter holds the current instruction address. Since instruction opcodes are 4 bits wide, that makes instructions involving absolute memory addresses 4 + 12 = 16 bits in size, or two program bytes.

Nibbler is notable for a few things it does NOT have. There’s no address decoder, because there’s not more than one chip mapped into different regions of the same address space. Program ROM occupies all of the program address space, and RAM occupies all of the data address space. As you’ll see later, I/O peripherals aren’t memory-mapped, but instead use port-specific IN and OUT instructions to transfer data.

Nibbler also lacks any address registers, which means it can’t support any form of indirect addressing, nor a hardware-controlled stack. All memory references must use absolute addresses. That’s a significant limitation, but it’s in keeping with the project’s K.I.S.S. design goals. With the use of jump tables and dedicated memory locations, a simple call/return mechanism can be implemented without a true stack.
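One way such a call/return mechanism might look is sketched below as a tiny Python state machine. This is an illustration of the idea only, not a scheme taken from the Nibbler design: the caller stores a return-site id in a dedicated RAM location before jumping, and the subroutine "returns" by comparing that id against each known call site. The RAM address, labels, and subroutine name are all hypothetical; the comments show the kind of Nibbler instructions each step would correspond to.

```python
RETURN_ID = 0x000  # hypothetical dedicated RAM location for the return-site id


def run_demo():
    ram = {RETURN_ID: 0}
    trace = []
    label = "main"
    while label != "halt":
        if label == "main":
            ram[RETURN_ID] = 0           # LIT #0 ; ST RETURN_ID
            label = "beep"               # JMP beep
        elif label == "after_call_0":
            trace.append("back at site 0")
            ram[RETURN_ID] = 1           # LIT #1 ; ST RETURN_ID
            label = "beep"               # JMP beep
        elif label == "after_call_1":
            trace.append("back at site 1")
            label = "halt"
        elif label == "beep":
            trace.append("beep")
            # Return through a jump table:
            # LD RETURN_ID ; CMPI #0 ; JE after_call_0 ; CMPI #1 ; JE after_call_1
            label = {0: "after_call_0", 1: "after_call_1"}[ram[RETURN_ID]]
    return trace
```

The cost is that every subroutine needs to know all of its call sites, and nested calls need one dedicated location per subroutine, but no stack or address register is required.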

Up to sixteen distinct I/O devices can be supported by the CPU, but the planned I/O devices require just one IN port and two OUT ports. The computer’s input comes from four momentary pushbuttons, arranged in a left/right/select/back cross configuration, and connected to the IN port. Output utilizes one of the two OUT ports, and includes the obligatory LEDs used for debugging, as well as a piezo speaker for software-controlled sound, and a two-line character-based LCD display.

The specific 7400 logic family and chips to be used aren’t yet finalized, but in back-of-the-envelope calculations, it looks like the CPU should support a speed of just over 4 MHz. The longest path is for a write to RAM during phase 1: clock-to-Q delay for the Fetch register, plus propagation delay for the microcode ROMs, ALU, and bus driver, plus data setup time for the RAM. At two clock cycles per instruction, 4 MHz operation would result in 2 MIPS, which is the same as or better than BMOW.

I’ll write more about the instruction set and programming model next time. Until then, if you have any comments or questions, I’d love to hear them!
