BMOW title
Floppy Emu banner

Programming the 4-Bit CPU

Instructions for the Nibbler 4-bit CPU come in two types: immediate and addressed. Both types begin with 4 bits of instruction opcode, to identify the specific instruction. Immediate instructions like ADDI include a 4-bit operand value embedded in the instruction, and so are 8 bits in total size. Addressed instructions like JMP include a 12-bit address in data space (RAM) or program space (ROM), and are 16 bits in total size. The instruction encoding is very simple:

The four instruction opcode bits i[0..3] are combined with the ALU carry flag C and equal flag E, as well as the Phase bit, in order to form a 7-bit address for the two microcode ROMs. The ROMs’ outputs form the 16 control signals needed to orchestrate the behavior of all the other chips. The contents of the microcode ROMs are shown in the table below.


The first line of the table shows that phase 0 is the same for all instructions. That’s good, because at this point the instruction hasn’t been fetched from memory yet, so the CPU doesn’t even know what instruction it’s executing! Nothing interesting happens as far as the microcode is concerned, while the Fetch register is loaded with the program ROM byte, and the program counter is advanced to the next byte.

Phase 1 is where all the interesting work happens. ADDI is a good example of a typical instruction. Its instruction opcode is 1010 binary, or $A hex. For this instruction, the same control signals will be asserted regardless of the C or E flags. /oeOprnd will be 0, enabling the bus driver to drive the immediate operand value onto the data bus, where it connects to one of the ALU inputs. The other ALU input is always connected to the accumulator register. The /carryIn, M, and S[0..3] control signals will be set to put the ALU into addition mode, with no carry-in. /loadA will be 0, so the ALU result will be stored to the accumulator. /loadFlags will also be 0, so the carry and equal flags will be updated with the results from the addition.

JE (jump if equal) is another good example. Its instruction opcode is 1000 binary, or $8 hex. If the E flag is 1, then incPC will be 0 (PC will not be incremented) and /loadPC will be 0 (PC will be loaded with the new address). So if E is 1, the jump will be taken. If the E flag is 0, then incPC will be 1 (PC will be incremented) and /loadPC will be 1 (PC will not be loaded). So if the E flag is 1, the CPU just skips over the destination address byte and advances to the next instruction, without taking the jump. Jump instructions don’t involve the ALU, so the ALU control signals are unimportant and are shown in gray.

Notice that some gray control signals are labeled “dc” for don’t care, but others have a 0 or 1 value. This is because I’ve chosen specific values for those dc’s in order to accomplish another goal. Look carefully, and you’ll see that /carryIn is everywhere equal to i3, and M is everywhere equal to i2. That means /carryIn and M don’t really need to be in the microcode at all: anything that needs them can just connect to i3 and i2 instead. This idea could probably be extended to more bits, but I stopped at two. If you look at the circuit schematic I posted previously, you’ll see that I’m not currently taking advantage of this, but it could be helpful later if I find a need to add another control signal or two.

Instruction Set

With only 4 bits for the instruction opcode, there’s a maximum of 16 different instructions, so the choice of which instructions to implement is challenging. It would be possible to have more than 16 instructions, but only by shrinking the address size or immediate operand size. Since this is a 4-bit CPU, a 4-bit instruction size just makes sense.

There are five types of jump instructions:

JMP – unconditional jump
JC – jump if carry
JNC – jump if no carry
JE – jump if equal
JNE – jump if not equal

This is a good set of jump instructions for convenient programming. Although they’re convenient, the negative jumps JNC and JNE aren’t absolutely necessary, because you can always rewrite them with a positive jump and an unconditional jump:

JNC noCarry
   do stuff
   do other stuff


JC carry
JMP noCarry
   do stuff
   do other stuff

LD and ST load a nibble from data space (RAM) to the accumulator, or store a nibble from the accumulator to data space.

LIT loads a literal immediate nibble value to the accumulator. It might also have been named LDI.

OUT stores a nibble from the accumulator to one of the eight output ports, while IN loads a nibble from one of the eight input ports to the accumulator. These instructions are intended to be used for I/O.

ADDI adds a literal immediate nibble value to the accumulator, while ADDM adds a nibble in data space to the accumulator. In either case, the carry-in for the addition is 0, and the carry-out and equal flags are set by the results of the addition.

CMPI compares a literal immediate nibble value to the accumulator (subtracts the value from the accumulator), while CMPM compares a nibble in data space to the accumulator. The accumulator value is not modified, but the carry-out and equal flags are set by the results of the subtraction. Because the ALU is performing a subtraction, the carry-in is always 1 in order to make the math work out correctly.

NORI performs a NOR with an immediate value, NORM performs a NOR with a data space value. The carry-out and equal flags are set by the results of the NOR.
Synthetic Instructions

What about the instructions that aren’t here, like subtraction, AND, OR, NOT? Happily, all of these can be synthesized from the existing instructions. The assembler I’m working on includes some of these as macros, so they can be used as if they were built-in instructions.

ORI x: NORI x; NORI #0
ORM x: NORM x; NORI #0
ANDI x: NORI #0; NORI ~x
SUBI: ADDI (~x+1)

The only common instructions that can’t be trivially synthesized are ANDM and SUBM, because they require the use of a temporary register to hold the complement of the data space value. A dedicated location in data space could be used for this purpose, since Nibbler only has one register.

In many cases, with some thought you can eliminate the need for these synthetic instructions entirely. Take the common case of checking a specific bit in an input. On a CPU with an ANDI instruction, this might look like:

#define BIT2 $4 ; $4 = 0100
IN #0 ; load accumulator with nibble from port 0
JNE bit2is1

With Nibbler, you might think to rewrite this using the synthetic ANDI mentioned above:

#define NOT_BIT2 $B ; $B = 1011
IN #0 ; load accumulator with nibble from port 0
JNE bit2is1

But this can be shortened by simplifying the two negatives NORI and JNE into a single positive JE:

#define NOT_BIT2 $B ; $B = 1011
IN #0 ; load accumulator with nibble from port 0
JE bit2is1

More to Come

Next time I’ll post more about the software tools I’ve written for simulating Nibbler and assembling programs. Until then, questions and comments are always welcome!

Read 14 comments and join the conversation 

14 Comments so far

  1. Hans Franke August 30th, 2013 5:01 am

    Nice. As you already pointed out, the complementary jups are not realy needed. I would drop them and only keep JNC and JE (or to be consitent JNE). I would vote for the negative versions, since JNC is used almost all multi word instructions to jump arround +/-1 correctiond (since there’s no ADC).

    This frees two opcodes for other usage. My first candidate would have been an ADC, but due the fact, that most arithmetic instructions are to be synthesized and thus not sigle step, it looses value. Beside a use for indirect addressing, I would go for implementing at least SUBM (SUBI isn’t that important, since constants can be made up from RAM *1) and a DJZ – decrement and jump when zero. Counting loops is something realy often needed – be it colours, rater lines or timing. Using DJZ instead of the more common DJNZ is based on the single register structure


    LD counter
    DJZ end_of_loop
    ST counter
    JMP loop

    Then again, since we have only one register, tradeof might be marginal.

    *1 speaking of constant values – the CMPI, ADDI and NORI could be done away by using 16 Bytes of RAM. Either be done by hardware (interpreting x00 as x) which costs us a little decoding (8 input NAND), or nothing (even freeing up a buffer) by having software seting up the 16 locations.

    Sure, this would increase every *I instruction from 1 Byte to two bytes, but at the same time freeing up 3 (or 4 in the hardware case) opcodes, which could save a lot more code on the long run.

  2. Hans Franke August 30th, 2013 5:48 am

    LAst post, But I have to share :))

    Thinking of it, with a somewhat dramatic change in design, the instuction set could not only be more straightened out, but also made more powerful – and eventually faster:

    Go RISC

    And no, the actual design is not RISC, it’s minimalistic. RISC for this CPU would mean adding at least a second register (B), have a set of load lnstructions for B and make all operationd between A and B. Thus instead of having two versions of each instruction to handle immediate and Memory, there is only one with implied A and B as source. On the other hand, there coudl be a plenty of instructions to load data in whatever way (immediate, memory, memory indirect, …) from whatever source (memory, port). Since there is no need for an operand encoding within an instruction anymore, all arithmetic instruction could be either reduced to a single nibbel or there would be only one opcode for any of the 16 possible operations.

    The opcodetable would look something like this:
    (And no, the numbers are only for operations, no suggestions, sorting should be done in a way to please the logic needed)

    0x Load B with immediate
    1xxx Load B from Memory
    2x Load B with A
    3x Load B from Port (In B)
    4xxx Jump Carry
    5xxx Jump Equal
    6xxx Jump Not Carry
    7xxx Jump Not Equal
    9xxx Store A to Memory
    Bx Store A to Port (Out A)
    Cxxx Jump
    Ex Logic Operation A x B -> A
    Fx Arithmetic Operation A x B -> A

    (As I said, it may need reordering)
    We end up with only 13 opcodes used (3 free), and at the same time not only eliminating the need for operation synthesis, but get access to all 32 functions of the 74181. Given some are weired, and (depending on your point of view) only some 8-12 are usefull, thats way more than the 3 before (ADD/CMP/NOR).

    Some operation may now require more than one operation (notably all with immediate), but I think this will be outwighted by simplicity and additional functions. Since we done even away with most immidiate opcode forms (except for load), it’s even ‘RISCier’ than most of todays RISC CPUs (to be super pure, we coudl do away with the Load Immediate at all, since the 181 supplies us with the constant values of 0 and F, and all other values could be build from there on)

    The ‘free’ Opcodes can be nicely used for additional Jumps or better some kind of skip logic from the operations – something like

    ‘Perform OP and skip next instruction if ‘

    that could do away with all jumps (well except the basic) while compacting code. OPR would be 8 opcodes now with the ability to skip the next instruction if Carry or Equal or both is are set.

    Another candidate for the free opcodes would be son SET CARRY – this would not only be useful for chained calculation (yes, I do understand your logic about not routing them) or to make the JC ind a JUMP, and so do away with this singular case.

    I realy need to see if this realy bloats the chipcount much past the added B.

    Speaking of chips, have you considered using a chip with seperate read and write lines (but common address) or even one with complete seperate read and write ports? At least the seperated data lines for read and write where not uncommon in the early days, and did make away with the need for some buffers.

    (now I shut up)

  3. Steve Chamberlin August 30th, 2013 8:30 am

    I agree instructions like DJZ would be helpful. I don’t think they could be done without extending instruction execution beyond two clock cycles, and making the number of clocks per instruction variable instead of fixed.

    I also agree about dropping JNC and JNE, and replacing them with something else. But when I looked at it, I couldn’t think of anything very exciting to add in their place that would still work with the existing data and control paths, at two clocks per instruction. SUBM seems like the best option, but I like the orthogonality of the current instruction set. Happily it’s easy to change this by modifying the microcode, and no hardware modification is needed, so I can always change it later.

  4. Steve Chamberlin August 30th, 2013 8:36 am

    I like your “RISC” idea with the B register. Although it adds one additional chip, it works nicely with the existing data and control paths, and in many ways it’s conceptually simpler than the current design. The major downside is that all instructions that load A, like LIT, LD, IN, ADDM, would now take two instructions instead of one. Maybe with more tweaking of the specific instructions chosen, that could be avoided. I’ll definitely give this some more thought.

  5. Hans Franke August 30th, 2013 11:19 am

    Agreed, you will always a second instruction to load something into A. Even compared to classic RISC. It’s due the fact that B/A act more like a pipeline than regular (free addressable) registers. A constrain because we want to keep it as simple as possible. In a more free environment, I would replace A/B (or at least B) by a register file with four 4 Bit registers (single write and dual (single) read port). But this would add at least two chips.

    In reality, some of the additional instructions can be saved again by reordering the operands. I think we need some real world examples (past adding two numbers) to see how much it realy adds (it does like any RISC).

    Code size, word size, instruction complexity and cycles used are exchangeable variables – short, more compact code means more complex instructions and more cycles per instruction and vice versa. So if we want a machine as simple as possible, having short operations with a single operation structure (or at least not many) is desirable, but it will result in quite verbose code – after all, what we do is replacing hardware (and/or additional micro programm cycles) by software.

    Also, we should keep in mind,4K programm address space is a huge resource. This is anywhere between 2 and 4K instructions. Even without subroutine handling it will be hard to fill.

    If you realy want, a special load for A and B would make sense. The value get loaded into B (as usual), while the ALU gets a C=B command fed. It would also only make sense for MEM and IMM. For IN it doesn’t realy matter, and adding an address feature for any operation would not only kill all RISC advantage, but also not fit the opcode space. A backdraw is that there’s nnow a need to feed the ALU-op now also from some other source than the lower instruction nibble. I wouldn’t go the extra length.

  6. Hans Franke August 30th, 2013 11:38 am

    Maybe some notes about programm memory size. The classic HP35 had 768 Words of 10 bit each. a lot of 80s handheld games where based arround the TMS1000 family with usually less than 1K code. A lot of the early Atari 2600 games needed only 2K (and the median 6502 instruction length is above 2 bytes/instruction).

    So I think with 2-4 k instructions we’re comfortable.

    Recently someone did take the sinclair scientific calculator apart:
    The calculator is based arround a 4 Bit CPU from TI with 32 instructions in a 10 bit word and 320 words of ROM

  7. Erik Petrich August 30th, 2013 11:53 am

    If you only plan on a maximum of 8 input ports and 8 output ports, then bit 3 of the instruction for IN and OUT isn’t really being used for anything. You could connect OPRND3 to the microcode ROMs to distinguish between IN and OUT, use the opcode $5 or $b for both IN and OUT, and then the unused opcode ($b or $5) would be free for another instruction.

    Right shift seems rather cumbersome to implement with these instructions; I think it’s the only fundamental ALU operation that seems unreachable by a pair or trio of the existing instructions.

    To efficiently handle the jump tables needed to make up for the lack of a stack, a decrement-jump-if-zero (DJZ?) instruction would be handy. If understand the timing problem for DJZ an DJNZ correctly, the problem is that you need to know whether to take the jump or not, but this information won’t be available until the ALU operation completes which then turns this into a 3-cycle instruction. Since you have 4 address bits of the microcode ROMs unused, you could connect the accumulator there and know whether the jump will be taken or not (A==1?) at the start of the cycle thus keeping the timing exactly the same as the other jump instructions.

  8. Steve Chamberlin August 30th, 2013 3:20 pm

    > So if we want a machine as simple as possible, having short operations with a single operation structure (or at least not many) is desirable, but it will result in quite verbose code

    Right. Although I never said it specifically, that’s the main reason I chose a comparatively large 4K program size. I’ve written some test programs and run them in the simulator I made (more on this later), and the programs are definitely verbose. A test that verifies every instruction and exercises the LCD and buttons is about 650 bytes. It might be possible to shrink the program space to 2K, but I think anything less than that would start to get uncomfortable.

  9. Steve Chamberlin August 30th, 2013 3:38 pm

    Erik, good idea about IN and OUT. I actually meant to say 16 ports rather than 8, but since I’m only using 2 OUTS (or at most 5 with some of the other proposed changes), a max of 8 should be fine.

    Right shift does indeed stink. That would be true even if I had every possible ALU operation at my command: the ‘181 just doesn’t do it. I think a 16-entry lookup table would be the easiest solution. Or since there’s no indirect addressing and hence no tables, I guess it would be a 16-way if/then/else routine. 🙂

    Your statement of the DJNZ problem is correct. But on top of that, the ALU actually doesn’t have a Z flag, it has an E flag, which isn’t the same thing. The ‘181 won’t tell you when its output is zero; if you want that information you have to add your own comparator or a 4-input NOR.

    Connecting all 4 bits of the accumulator to microcode inputs is an interesting idea. It could function as a substitute for the missing Z flag, at least for instructions where the result is written to A. It seems a little strange, but opens up some new possibilities.

  10. Hans Franke August 30th, 2013 5:33 pm

    > It might be possible to shrink the program space to 2K, but I think anything less than that would start to get uncomfortable.

    Don’t get me wrong, I’m all in favour for a 4K programm space – and it comes (somewhat) natural when operating with nibbles and byte organized programm storage.

    My comment was rather to underline that this 4K space offers the flexibility to unload hardware into software.

  11. Hans Franke August 31st, 2013 6:50 pm

    I did just go thru your ROM table to key it in an old generator tool i made a long long time ago, and I got puzzled by the way Equal and Carry is handled. I interprete the header that /C = 0 does tell Carry is set. And E = 1 is Equal – right?

    Just the code of both micro instructions is the same – thus either the JE/JNE or the JC/JNC are to be switched. Or is my reading mixed up?

  12. Steve Chamberlin August 31st, 2013 8:18 pm

    Good catch! The entries for JC and JNC in the microcode table were shown incorrectly. I updated the table to show the right microcode values.

  13. Hans Franke September 1st, 2013 7:03 am

    Ok, I did a bit more fiddle on the RISC idea (*1) and took your button check example and converted it to RISC. The result is somewhat surprising. As a measurement I just counted the number of instructions in the programm and the bytes needed to encode. Ofc, this will by no means thell anything about speed, just about code size, but it can be taken as a hint.

    I did go thru 4 variations:
    A) Original Nibbler (CISC)
    B) RISC-Nibbler as described in my post
    C) RISC-Nibbler with an addition of a STB (and maybe OUTB) instruction
    D) RISC-Nibbler with STB and reordered structure

    (C did only reorder/change instructions to fit the B register, while D did also take advantage of tha capability to hold information in A for longer than the actual operation)


    For A: 54 Bytes, 33 Instructions
    For B: 61 Bytes, 43 Instructions
    For C: 55 Bytes, 37 Instructions
    For D: 52 Bytes, 34 Instructions

    1. B) did quite fit my expectations by having ~30% more instructions and code size.

    2. A real surprise was that his increase in instructions comes with an equal increase in execution time. Sure, it’s based in the two cycle structure of each instruction, thus more instructions mean slower speed … it just wasn’t that present in my mind (and to be honest, if it wouldn’t be for the chip count, we would see a different execution time for single or doule byte instructions – after all, ROM access isn’t free, and accessing one or two bytes does make a difference)

    3. Without a way to store B directly (most notable used when bringing constants to RAM) it realy isn’t the improvement I expected it to be – less because the increase size and decreased speed, but it would take away the benefit of having a secondary register to keep information.

    4. Unlike I expected it, the biggest improvement by adding a B isn’t to hold information in B, but to keep previous results in A. The other (as noted in 2) is the ability to move constants to RAM (or Port) without destroying A.

    5. It is surprising that D is not only in the same range as A, but in fact two bytes _SHORTER_ than the CISC version. I would put this to the count of no longer needed systhesize of operations.

    I think the above noted improvements will gain even more code reduction (and maybe even speed up) on longer/more complex programms. the above example is just a first glimpse.

    While going thru I did also find, that it might be a good idea to again reduce the operation availabe from all 32 ‘181 Variations back to like 16. We have to go thru the microcode ROM anyway, since it has to supply ALU-codes for other (non arith/logic) operations. So we can select ‘only’ like 16 ‘usefull’ operations.

    The opcode freed can now be used for doing exactly the same, just without the /loadA signal – meaning, we can perform all (available) operations without destroying A. So we not only have a special case of a non destructive subtraction (SMP), but also non destructiv AND/OR/NOT/… Quite handy for consecutive tests of a input value etc.

    Well, unless I’m complete off track now, RISC seams to perform way better than expected.

    *1 – had quite a disapointing time with my old microcode tool – somehow Windows 7 does not realy like a 1986 16 bit Assembly programm. Maybe I rewrite it – it does ease ROM creation a lot (not only by avoiding switching lines 🙂 )

  14. Steve Chamberlin September 2nd, 2013 6:32 am

    Interesting! I need to find time to look into the B register idea more.

Leave a reply. For customer support issues, use the Contact page instead of comments.