Building a Text Editor: the Gap Buffer // Work Dad Dev

Prologue: The Agon 8-bit computer

In December of 2022 I came across a new 8-bit computer called Agon. It was built around the eZ80 CPU, a modern version of the venerable Z80 processor that was popular in many 1980s microcomputers, and an ESP32 microcontroller for video generation using FabGL (now vdp-gl, a fork of FabGL maintained by me). The original Z80 was an 8 bit CPU with a 16 bit address bus, allowing it to access up to 64KiB of memory at a time. There were machines that had more memory available, accessible via banking.

The eZ80 is still an 8-bit CPU but it is pipelined and runs at 18Mhz whilst most of the Z80 computers of the era ran at 3.5MHz. The address bus was widened to 24-bits, allowing up to 16MiB of memory to be accessed without resorting to banking. The Agon takes advantage of this and provides 512KiB of memory on-board which is huge for an 8-bit computer. The Agon can run BBC BASIC, the first BASIC dialect to introduce structured programming constructs like functions, procedures, labels for GOTO instead of line numbers. Originally it was limited to accessing up to 64KiB but the developer has since released BBC BASIC ADL which can access the full 512KiB of on board memory directly.

Over the year of 2023 a small but growing community of developers has been improving the software for the machine and in July 2023 a company from the UK, Herber, released a console version of the machine called Agon Console8. 100% backwards compatible, it adds joystick and mouse ports, a nice case and more importantly has developers working on improving the operating system (MOS) and the video generation co-processor (VDP).

As for myself, I bought the machine to learn to program in eZ80 assembly and as my first project I chose a text editor. I had never written one and was always curious to understand how it was done. I spent some time learning and writing assembly on the machine itself (it has a text editor based on GNU nano and a native assembler) and while it was really fun, it proved to be quite a task for someone that had zero previous experience with Z80 assembly.

So I decided to abandon it and restart from scratch, this time in C, using AgDev, an LLVM-based C compiler that targets the Agon, and an emulator so I can quickly iterate on the code. It is currently very functional and being used by members of the community and I plan to keep improving it over time.

The design of AED

AED (Another text EDitor) is a text editor, a simple tool that everyone that has ever used a computer is familiar with. There’s a screen area with a cursor, you type letters / symbols and they show up, you navigate the screen with arrow keys, create / delete lines, etc.

One way to design an editor is to think of it as having a Model (the representation of the text in the computer memory), a View (the screen area) and the Presenter (gathers user input and updates the model and screen accordingly). This is called MVP and is an architectural pattern useful for building applications that have a user interface.

In AED, the model is called text_buffer and is actually composed of a character buffer and a line buffer. But we are getting a bit ahead of ourselves..

Choosing the right model data structure

A text editor can be viewed as a list of lines, with each line having a different size. So a good first approach when considering the model is to create exactly a list of strings! In C it would be something like this:

typedef struct _line {
  char* buf_;
  int   bsz_;
  int   lsz_;
  int   pos_;
  struct line* prev_;
  struct line* next_;
} line;

line has a character buffer (buf_) that is dynamically allocated (size stored in bsz_). lsz_ stores the number of characters currently on the line and pos_ stores the current position of the cursor on the line.

prev_ and next_ are pointers to the previous and next lines on the list, allowing us to navigate up and down the lines from wherever we are.

Looking at a single line for a moment, let’s consider how we would edit text using the character buffer. Consider this text is in the buffer, with B a pointer to the beginning, E a pointer to the end and C the current position of the cursor:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E R E _ _  _  _  _  _  _ ]
 B                 C                   E

This line has 15 characters (bsz_), but has only used 9 so far (lsz_) and the cursor is at position 9 (pos_).

Adding a letter at the cursor is fast and easy, as is backspacing and moving the cursor left and right. But what happens if we move the cursor to the left and it is in the middle of the text, like this:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E R E _ _  _  _  _  _  _ ]
 B       C                             E

In this case, if I want to add a new letter I would first have to shift each of the letters from THERE one position to the right and then add the character at position 4:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   S T H E R E _  _  _  _  _  _ ]
 B         C                           E

This means the longer the line and the farther away the cursor is from the end of the line, the more work it takes (shifting characters) to insert a character in the buffer. It could be argued that the number of characters in a line is small and the screen imposes a hard limit on the number of characters displayed so this isn’t such a big deal.

Typically, when you use a text editor, you tend to limit yourself to the visible horizontal area, so 80 columns in the case of the Agon’s default 640x480 resolution with 8x8 characters. At 18MHz, it is fast enough to shift characters around so this is a reasonable representation of a line.

Using a list is also not a big issue because there is no cache in the Agon, so accessing a pointer to any random position always takes the same amount of time. And with the exception of full page updates (PAGE UP / PAGE DOWN), edits tend to be localized so moving across lines is fast enough as well.

Dynamic memory / overhead.

However, a bigger issue is allocating per-line memory. Typically we would have to guess an initial value, say 25 characters, and keep managing that memory; increasing it as needed and either keep the last allocated size, having unused memory allocated, or shrinking it, creating more work. This can lead to fragmentation, when the computer has enough memory but the memory allocator can’t find large enough contiguous memory blocks.

Also, doing the math there is quite a lot of overhead per line. We have:

the character buffer pointer (buf_, 3 bytes)
Its size (bsz_, 3 bytes)
The used character count (lsz_, 3 bytes)
The position index (pos, 3 bytes)
The prev_ and next_ pointers (6 bytes)

A total of 18 bytes per line! If you have an average of 25 characters per line (reasonable when programming in C or BASIC), you actually need 43 bytes, an overhead of ~92%! If all 80 cols are used, that’s 98 bytes or ~22% of overhead.

We could reduce the overhead by removing the prev_/next_ pointers, allocating an array of lines and keeping a global pos_ index. But that introduces other problems i.e when we exhaust the text buffer, we need to copy all those little line buffers to a new text buffer and our friend memory fragmentation is still an issue.

On a machine with 512KiB of RAM, all this overhead adds up. Ideally we want to reduce it in order to fit as much text in memory as possible.

Single character and line buffers

One way to address this would be to have single character and line buffers. In C, it could be something like this:

typedef struct _char_buffer {
  char* buf_;
  int sz_;
  int pos_;
  int cnt_;
} char_buffer;

typedef struct _line {
  int* line_size_;
  int sz_;
  int pos_;
  int cnt_;
} line;

With these data structures, we reduced our per-line overhead to 3 bytes: just an entry in line_size_. The other fields are an editor-level overhead: they remain constant throughout the use of the program.

We now have to allocate an initial size for both the character and line buffers. Let’s say 256KiB and 12KiB (4096 lines) respectively. If either of these are exhausted then we reallocate and copy. But, now we only have to manage 2 buffers instead of per-line ones; memory fragmentation no longer plays a significant role.

However, we introduced a new problem: editing became really expensive. If you have a 10KiB file and you want to add a new character at the beginning, that’s 10240 characters you have to shift in memory instead of the 20 or so of a line, plus the line buffer shifts. That will be really slow and it will be noticeable on an 18MHz machine.

If we want to keep our low per-line overhead, we need a different data structure that can allow us to keep our fast edits with a low overhead.

The Gap Buffer

It wasn’t until some time in 2020, when I first started to research how to write a text editor, that I learned about this data structure called gap buffer. You can think of it as a string with two pointers. Moving around the string means moving characters between the two pointers so that there is always a gap between them. Let’s look at an example to better understand.

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E R E _ _  _  _  _  _  _ ]
 B                 C                   CE
                                       E

Compared to our simple string buffer, the gap buffer has 4 instead of 3 pointers:

B: this always points to the beginning of the buffer.
E: this always points to the end of the buffer.
C: This is the current position of the cursor
CE: this is the cursor end pointer.

C and CE are the key to making this work. Notice that there is a gap between them. If we move C we also need to move CE in order to keep the gap a constant size. The gap is only reduced by adding new characters. You can think of this as the buffer having a prefix (from B to C) and a suffix (from CE to E) and therefore getting the full text requires concatenating them both.

If we move the cursor 2 positions to the left, the buffer will be updated to:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E R E _ _  _  _  _  R  E ]
 B             C                 CE
                                       E

In order to move C, we moved CE and copied the characters as we moved. The prefix is now “HI, THE” and the suffix is “RE”. Joining them, you still get the original “HI, THERE” string. There is no need to erase “RE” from before moving C as the prefix will only consider the characters in the C-B range.

What happens if we add a new character at C?

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E I E _ _  _  _  _  R  E ]
 B               C               CE
                                       E

We now have “HI, THEI” as the prefix and still have “RE” as the suffix, with the full text now being “HI, THEIRE”. Notice that no shifting was involved. We just wrote the character to C and then moved it to the next position.

Deleting a character is just moving CE to the right:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E I E _ _  _  _  _  R  E ]
 B               C                  CE
                                       E

Now we have the string “HI, THEIE”. And backspace is just moving C to the left without copying and moving CE:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[H I ,   T H E I E _ _  _  _  _  R  E ]
 B             C                    CE
                                       E

Giving us “HI, THEE”.

The gap buffer makes navigating and editing O(1) for all single character operations at the cost of 1 extra pointer (3 bytes) compared to the original string buffer. The cost per character is higher as there is almost always copying involved but it is still very low.

Wrapping it up

Now that we know how the gap buffer works, here is how we would represent it in C:

typedef struct _char_buffer {
  char* buf_;
  char* ccur_;
  char* cend_;
  int size_;   // we always do buf_ + size_ to know E.
} char_buffer;

// line_buffer is also a gap buffer, but to track the line size.
typed struct _line_buffer {
  int* line_size_;
  char* ccur_;
  char* cend_;
  int size_;
} line_buffer;

typedef struct _text_editor {
  char_buffer cb_;
  line_buffer lb_;
} text_editor;

Now our text buffer has a constant overhead of 24 bytes regardless of how big our text is plus 3 bytes per line for the size and we only need to manage 2 buffers in order to keep track of the state.

I was surprised I hadn’t heard of the gap buffer before and it was quite hard to find a good explanation of how it worked on the internet, even though it is really simple to understand. It is to me quite an elegant data structure that solves a very practical problem and I can see why many text editors of the early computers would use it.