Emulating microprocessors with macros

Whenever I work on an emulator (having written several in the past), I try to make my life as interesting as possible. After all, implementing hundreds of opcodes can be a very dull task.

Most recently, I joked that C macros were powerful enough for it to be feasible to implement an simple architecture in them. One thing led to another, with the result being an Intel 8080 emulator core implemented purely with C macros.

In this post, I'll go over the awful hacks that helped make this monstrosity a reality… and why perhaps it's not such a bad idea to write an emulator in macros.

Quick Intel 8080 Refresher

The Intel 8080 is an 8-bit microprocessor with a 16-bit address bus, released back in 1974. It features 8-bit general-purpose registers B, C, D, E, H, and L, alongside an accumulator A. The general-purpose registers can be treated as 16-bit "pairs" BC, DE, and HL, allowing for 16-bit operations to be performed. Memory is addressed through the register pair HL.

A table of supported opcodes and their mnemonics may be found here.

Motivation for Macros

Macros don't just needlessly complicate the development of an emulator. There is, in fact, a very tangible benefit to using macros in implementing instructions — performance.

A typical 8080 instruction is encoded as an 8-bit value, with register operands embedded in the opcode. The encoding for all MOV instructions is shown below.

01 aaa bbb
|   |   └── source register
|   └── destination register
└── MOV prefix   

For instance, the assembly MOV A, B is encoded as 01 111 000; 111 identifies register A, and 000 identifies register B. A traditional emulator might implement MOV with a generic, lookup-based approach:

void MOV(uint8_t opcode) {
  uint8_t src = opcode & 0x7;
  uint8_t dst = (opcode >> 3) & 0x7;
  set_reg(dst, get_reg(src));  // set_reg and get_reg are switch-based lookups
}

From a software engineering standpoint, this is an ideal implementation. MOV is succint, and set_reg/get_reg are dedicated helpers that can be reused in future code. A+ for code quality.

And yet, this approach is suboptimal in the event that we're optimizing for performance, rather than readability. There is a large overhead (in terms of host machine cycles) for the set_reg and get_reg routines that can't easily be eliminated. The compiler might end up inlining the routines and removing the overhead of the 2 calls, but the approach still requires mapping a 3-bit register ID to the actual register, twice per instruction.

Instead, what if we used macros to generate code for all possible variants of an instruction? Illustrating with an example, MOV can be implemented like this:

#define MOV(X, Y) \
{                 \
    X = Y;        \
}

Of course, we need to make sure that we generate code for all valid permutations of X and Y, but this is easy to do programmatically.

Dispatching Instructions to Macros

At the core of every emulator is a tight loop that increments the program counter, fetches an opcode, and transfers control to the appropriate opcode handler. If all out opcodes are implemented as macros, how can we achieve this?

One option is we can make use of GCC's support for taking the address of labels by generating a label for each opcode, and storing its address in a lookup table that we can later branch into. A straightforward application of this idea would look something like this:

#define DONE goto done_opcode;
#define MOV(X, Y) \
{                 \
  X = Y;          \
  DONE;           \
}

void run_forever(void) {
  // Declare all registers, memory, etc.
  ...
  
  static void *ops[] = {
    ... &&MOV_A_B, &&MOV_A_C, &&MOV_A_D, ...
  };

  while (1) {
    goto *ops[memory[PC++]];
    done_opcode: ;
  }

  ...
  MOV_A_B:
    MOV(A, B)
  MOV_A_C:
    MOV(A, C)
  MOV_A_D:
    MOV(A, D)
  ...
}

The observant reader will notice that it's almost as if we're treating MOV as a function, substituting its invocation and return with gotos. If we didn't want to rely on GCC-specific extensions, we could instead implement this functionality with regular functions. That said, by branching within the same function, we save the overhead stack frame management imposes.

You can check out this example on godbolt.org to view a colorized disassembly of the above code. It's worth noting that in the end, the greatest overhead for our MOV instructions becomes goto *ops[memory[PC++]], which is an operation we'd have to perform regardless of how our opcode was implemented — good!

Handling Register Pairs

As I mentioned earlier, the 8080 is an 8-bit processor that can work on 16-bit data in the form of "register pairs" BC, DE, and HL.

If we want to be able to use macros everywhere, we need some efficient way to enforce consistency between the values assigned to BC, and the values assigned to the individual B and C registers. A simple solution here is to make use of a union of two 8-bit values and a 16-bit value, and some macros to hide the underlying complexity. Thankfully, GCC does a good job at optimizing these accesses.

typedef union {
  uint16_t pair;
  uint8_t reg[2];
} regpair_t;

register uint8_t A;
register regpair_t bc, de, hl;

#define B bc.reg[1]
#define C bc.reg[0]
#define BC bc.pair
#define D de.reg[1]
#define E de.reg[0]
#define DE de.pair
#define H hl.reg[1]
#define L hl.reg[0]
#define HL hl.pair

This will magically work on little-endian systems, due to the order in which the bytes of pair are stored. You're out of luck on a big-endian system, but those are pretty rare to come by these days.

Putting it All Together

With all the plumbing done, all that's left is to implement the rest of the opcodes. This isn't particularly hard, nor is it enlightening. One thing that deserves special attention is that due to our macros resembling functions, it is easy to forget that they are, in fact, macros — and that any use of a parameter re-evaluates it. This can lead to very subtle, hard-to-find bugs.

In the end, a purely macro-based approach appears to be about an order of magnitude faster than a traditional function approach. This was tested (admittedly not rigorously) by measuring the time it took to complete all the test runs of the famous 8080EXER.COM program between macro8080 (the embodiment of the ideas expressed in this post) and several other hobby emulators found on GitHub.

I guess you can sacrifice readability for performance!