Line-column tracking in Emacs for tree-sitter

ep1.5, line-column tracking

(This is a follow-up of In-depth Review of Emacs tree-sitter integration.)

As mentioned previously, Emacs doesn’t keep track of the line and column number in a buffer and passes dummy values to tree-sitter. This works fine most of the time, but breaks down when the grammar actually needs the column information for parsing, like Haskell.

So now we have to have line and column tracking in Emacs. Since Emacs uses gap buffer for storing text, there’s no efficient way to calculate the line and column number of arbitrary point, and we have to use cache. Fortunately cache works very well for text editors because text edits made by humans are predominately linear, local edits, not random access.

The idea is simple: since most edits must occur around point, we just cache the line and column number (henceforth linecol) of point. Whenever there’s an edit and we need the linecol for the beginning, old end, and new end of that edit, we can just scan from point to these positions and calculate the linecol of them by counting newlines. In reality, we need to cache the linecol of quite a few more positions, and be super careful about which cache is valid at what time. Nonetheless, in the end we were able to implement the tracking system on top of the existing delicate code synchronizing visible ranges, and the performance with linecol tracking is virtually the same as without it (in my crude benchmarks).

Data structures

We cache linecol in both the buffer and each parser. In a buffer, we cache point and the beginning and end of the visible portion (henceforth PT, BEGV, ZV). In each parser, we cache the visible beginning and end of the parser (henceforth visi_beg, visi_end).

Caches are stored in a struct ts_linecol, which is passed around in various functions. Emacs has rich and complex definition of “column”, but here the column is what tree-sitter uses: byte offset from the beginning of the line.

struct ts_linecol
{
  ptrdiff_t bytepos;  /* Byte position in buffer (1-based) */
  ptrdiff_t line;     /* Line number (1-based) */
  ptrdiff_t col;      /* Column in bytes (0-based from BOL) */
};

One thing of note is that there are two levels of invalidness involved. For a linecol, it becomes invalid when the bytepos doesn’t match line and col anymore, and it happens when a buffer edit occurs in front of it or contains it. It can also happen that the bytepos stored in the linecol doesn’t match the actual position of BEGV, ZV, etc; let’s call this kind of situation "misaligned". We’ll need this distinction later.

Scanning

Calculating the linecol of a position is simple: we start from a known valid ts_linecol (or beginning of buffer), scan the text between it and the target position either forward or backward, and count the number of newlines in-between. If scanning-forward, remember the position of last newline to calculate the column number; if scanning backward, scan a bit further to find the next newline to calculate the column. (See treesit_linecol_of_pos.)

Handling all the edge cases correctly requires a bit of patience. Harder still is to find a way to use existing functions (display_count_lines) to implement the scanning instead of writing a simple scanner to count newlines. I had to countermand a specific cases where display_count_lines’s behavior doesn’t match what we want (see treesit-count-lines). But who knows what crazy edge case would come up when scanning the buffer, I’d rather hide behind display_count_lines and bear with its quirks.

We also need a function that, given the old linecol of a position, calculate the new linecol of a position after a buffer change. I won’t go too deep into it right now, but enough to say that without it, we’d have to update the linecol of visi_beg and visi_end by scanning from point every time; that distance can be far, defeating the usefulness of the cache.

The trick is to realize that when you know the content of the change, you know how many newlines are added/removed. And you can simply add/subtract lines from the old linecol to get the new linecol, even if the position is far away from the change.

There’s also a small twist: instead of saving the text before and after change and scan through them, we use treesit_linecol_of_pos to calculate linecol of start, old_end and new_end of the change. The line and column change can be inferred from these linecols. And because it’s a common pattern for a cache to be both invalid and misaligned, we also add a target_bytepos: after we got the new linecol of the position (fixes invalidness), we scan to target_bytepos to get the final linecol (fixes misalignment).

static struct ts_linecol
compute_new_linecol_by_change (struct ts_linecol pos_linecol,
                               struct ts_linecol start_linecol,
                               struct ts_linecol old_end_linecol,
                               struct ts_linecol new_end_linecol,
                               ptrdiff_t target_bytepos);

Here’s how it works: First of all, if the change is completely after the position, the linecol is unchanged. If the changed region is completely before the position, the line difference between the old linecol (pos) and new linecol (pos') is the line difference between old_end (oe) and new_end (ne):

Insert:               Delete:

|     |   |           |     |   |
s     oe  pos         s     oe  pos

|  |   |              |         |   |
s  ne  pos'           s         ne  pos'

As for column, notice that the column of pos' is just the column of new_end plus the distance between old_end and pos:

Suppose # is text, | is cursor:

################
########|########|
        oe        pos

Now, if we insert something:

################
########|OOOOO
OOOOOOOOOO|########|
          ne       pos'

And the same goes for deletion.

Finally, if the changed region includes the position, we just give up and scan from either start or new_end, whichever is closer. Most changes in text editor are not very large, so this is fine in practice.

Tracking

Remember the lifecycle of parser updates for tree-sitter in Emacs: for each user edit, we report the change to each parser but don’t reparse yet. Only when the user requests nodes from the parser do we sync up the visible beg and end of the parser with the visible portion of the buffer, and reparse.

For the purpose of tracking lincol, the work is also split in two. When recording user edits, we now need to first update the linecol of our cache, then compute the correct lincol for start, old_end, and new_end to send to the parser when reporting the change. We also take the opportunity to update cached lincol for BEGV, ZV, and PT, so their lincol are always valid. (But the lincols of buffer-level caches can get misaligned when we later use them, since other things can move them.)

Then, before reparse, we need to update BEGV and ZV’s linecol, and use them for computing the pseudo edits we use to sync the visible region of the parser and buffer.

During buffer change

The most important thing to keep in mind is that, at each point in time, which linecols are valid.

Before the buffer change is applied, the linecol of PT is still valid (although we don’t know if it’s misaligned or not, and we don’t care). We use it to compute the linecol of start and old_end (by scanning from it). The linecol of start will remain valid after the change and will be our source of truth in the post-change world.

After the change is applied in the buffer, we compute the linecol of new_end by scanning from start. Now we have the linecol of start, old_end, and new_end, that we can use to convert any old linecol into a valid one by compute_new_linecol_by_change. We use them to update the linecol of visi_beg and visi_end of the parser.

For reporting the edit to the parser, because the parser’s visible region starts at visi_beg, we need to send it the delta between old_end and old_visi_beg rather than just old_end. Similarly we need to calculate the delta between start and new_visi_beg, and between new_end and new_visi_beg. Note that old_end must use old_visi_beg, because they are in the same past buffer state.

Finally, we use the linecol of start, old_end, and new_end to update the linecol of BEGV, ZV, and simply set the linecol of PT to that of new_end. Maybe PT’s position doesn’t really match new_end (ie, misaligned), but that’s fine; we only care that PT’s linecol cache is valid and nearby when next change happens.

During reparse

Remember that we need to send a series of pseudo edits to the parser to synchronize the visible region of the buffer and parser before we reparse. And each of the edits needs linecols paired with start, old_end, and new_end. The good news is, there’s no buffer change happening, so all the cached linecols are valid, and we can scan around to get new ones easily.

We first update the linecol of BEGV and ZV by scanning from the cached linecol. The parser’s visi_beg and visi_end are already valid and aligned. Then we just compute the start, old_end, and new_end and make the 3-step edits. Finally we store the updated linecol of visi_beg and visi_end to the parser.

Epilogue

There you have it. By exploiting human’s text editing pattern and careful cache management, we added linecol tracking to Emacs at almost no cost. I did some rough benchmarks and there’s virtually no difference between tracking/no-tracking. But in the good ol’ tradition of “you don’t pay for what you don’t use”, linecol tracking is only enabled if the parser opts into tracking linecol. And by default we only opt-in Haskell parsers. You can opt-in by adding the language to treesit-languages-require-line-column-tracking.