Practical Intro to Operational Transformation
Can’t you just give me the best algorithm?
Unfortunately, there isn’t a single best algorithm that you can just learn and implement. There are many different algorithms, each with their own trade-offs and subtleties, none of which I found completely satisfactory. In this article, I hope to introduce basic ot concepts in simple terms, draw a rough picture of how it generally works, and point readers to materials for further reading. I’ve also included a simple ot algorithm that should give the reader something concrete to look at.
A huge disclaimer upfront: I’m not an expert, only someone who has read all the papers and blog posts I could find, so take what I say with a grain of salt. There’s a frustrating lack of material on ot online (which is part of what prompted this article), so I can’t say I’ve grasped the whole picture regarding ot. If there’s any mistake, or anything I missed, please do let me know. Thanks!
What’s OT & CRDT
Nowadays you can’t talk about ot (operational transformation) without mentioning crdt (conflict-free replicated data types). Broadly speaking, they are the two approaches to collaborative editing. ot came first, conceived and researched since the 90s. crdt is a database/distributed-systems concept that was brought into collaborative editing around 2006. I’ll briefly introduce both and compare them, before delving into ot.
ot works by transforming operations like insert, delete, update, etc. Consider this example:
- Two users A, B work on a document containing only “x”;
- A inserts “a” before “x” (insert “a” at position 0), propagates its operation to B;
- Not yet aware of A’s modification, B deletes “x” (delete at position 0), propagates its operation to A.
Now if A applies B’s operation verbatim, it would delete the character at position 0, which is “a” rather than the intended “x”. The correct way is to transform B’s deletion so that it deletes at position 1, which then deletes “x”.
This is a trivial example, but things get very complicated very fast once you mix more concurrent operations and have 3+ users.
The problem, in essence, is that B’s operation is based on document state “x”, so A can’t apply it verbatim on document “ax”. Once we transform the operation to be based on document “ax”, the operation can apply cleanly.
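To make this concrete, here’s a minimal sketch of such a transformation function, assuming single-character ops; it glosses over ties like two concurrent inserts at the same position, and the names are mine, not from any particular algorithm:

```typescript
// Single-character operations; a real algorithm would also carry site
// ids to break ties between concurrent inserts at the same position.
type Op =
  | { kind: "ins"; pos: number; ch: string }
  | { kind: "del"; pos: number };

// Transform `a` so that it can be applied after `b`.
function transform(a: Op, b: Op): Op {
  if (b.kind === "ins") {
    // b inserted a character at or before a's position: shift a right.
    return b.pos <= a.pos ? { ...a, pos: a.pos + 1 } : a;
  } else {
    // b deleted a character before a's position: shift a left.
    return b.pos < a.pos ? { ...a, pos: a.pos - 1 } : a;
  }
}

// The example above: B's delete at 0, transformed against A's insert
// at 0, becomes a delete at 1, which removes "x" as intended.
const bDelete: Op = { kind: "del", pos: 0 };
const aInsert: Op = { kind: "ins", pos: 0, ch: "a" };
console.log(transform(bDelete, aInsert)); // { kind: "del", pos: 1 }
```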
crdt is a data type plus a set of operations, such that operations can be transmitted in different orders and eventually still produce the same result, as long as everybody eventually receives all the operations. Then you don’t need central synchronization: a distributed gossip protocol can ensure that every node in the system eventually reaches the same state. (I omitted some details for brevity.)
It turns out you can design a crdt that represents a document, plus a set of operations that covers all text editing operations (on plain text). You model the document as a sequence of characters with unique ids, and have two operations: insert after the character with id x, and hide the character with id x.
Take the same example above. A’s operation is now “insert after the beginning of the document”, and B’s operation is now “hide the character with id 1” (that is, the character “x”). Note that the operations are not position-based anymore. When A applies B’s operation, it first finds the position of the character with id 1, and deletes it.
You’ll notice that it only has “hide” and no “delete”, because crdt handles delete as hide. crdt can’t delete a character since other operations might need that character as an anchor. (The hidden characters are called tombstones.)
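Here’s a naive sketch of that character-list crdt; it uses a plain array and glosses over how concurrent inserts after the same character are ordered (real crdts break such ties with ids or timestamps):

```typescript
// Every character carries a unique id; deletion only hides, leaving a
// tombstone that other sites can still use as an anchor.
type Char = { id: string; ch: string; hidden: boolean };

class CrdtDoc {
  chars: Char[] = [];

  // Insert `ch` with id `id` right after the character with `afterId`
  // ("" means the beginning of the document).
  insertAfter(afterId: string, id: string, ch: string) {
    const i = afterId === "" ? -1 : this.chars.findIndex(c => c.id === afterId);
    this.chars.splice(i + 1, 0, { id, ch, hidden: false });
  }

  // "Delete" = hide: the character stays behind as a tombstone.
  hide(id: string) {
    const c = this.chars.find(c => c.id === id);
    if (c) c.hidden = true;
  }

  text(): string {
    return this.chars.filter(c => !c.hidden).map(c => c.ch).join("");
  }
}

// The same example: a document containing "x" (id "1"); A inserts "a"
// at the beginning, B hides "x". Neither op is position-based, so they
// give the same result in either order.
const doc = new CrdtDoc();
doc.insertAfter("", "1", "x");
doc.insertAfter("", "2", "a"); // A's op
doc.hide("1");                 // B's op
console.log(doc.text());       // "a"
```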
OT vs CRDT
Both ot and crdt are viable approaches, with their own trade-offs.
Collaborative editing is a distributed-systems problem, and as with everything else in distributed systems, you can simplify your system either by centralizing things (making it less distributed), or by limiting the number of things you can do (limiting the functionality). crdt and ot move along these axes of trade-offs.
The advantage of crdt is that it’s relatively simple to make fully distributed. There are many ot algorithms that claim to be distributed, but they usually come with some catch: vector timestamps, a requirement of global synchronization, bad space/time complexity that scales with the number of nodes in the system, no mention of handling join/leave, etc. (Not to say there are no truly distributed ot algorithms; there are.)
People often credit crdt for being simpler than ot, which it is, on paper. The papers and algorithms only talk about the concept; if you actually implement it, and do it efficiently, things get complicated.
Recall that we described a crdt operation as “find the position of the character with id 1”; a lot of complexity is hidden in the “find”. How do you find the character with id 1? Are you going to scan the whole document every time?
Also, to use crdt in the real world with an editor, you need to translate “delete the character with id 1” into something an editor can actually apply, namely, delete the character at position x. And once you find the position of the character with id 1 in the crdt data structure, you need to subtract the hidden characters from that position, because the editor’s document doesn’t contain those. Same for the reverse direction: you need to translate an editor operation made by the user into a crdt operation.
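As an illustration, here’s what those translations look like on the naive CrdtDoc sketched above. Both directions are linear scans, which is exactly the cost the tricks mentioned below try to avoid:

```typescript
// Editor position of the character with `id`: count only visible
// characters before it, since the editor never sees tombstones.
function editorPosOf(doc: CrdtDoc, id: string): number {
  let pos = 0;
  for (const c of doc.chars) {
    if (c.id === id) return pos;
    if (!c.hidden) pos++;
  }
  throw new Error("no character with id " + id);
}

// Reverse direction: find the internal index for an editor position,
// so a user edit can be turned into a crdt operation.
function internalIndexOf(doc: CrdtDoc, editorPos: number): number {
  let visible = 0;
  for (let i = 0; i < doc.chars.length; i++) {
    if (!doc.chars[i].hidden) {
      if (visible === editorPos) return i;
      visible++;
    }
  }
  return doc.chars.length; // past the end
}
```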
Also, since crdt lives in a different world from what the user and editor see, the operations crdt produces might not be “intuitive” to the user, even though the document ends up in a consistent state. A big part of collaborative editing is capturing and preserving “user intent”, and that’s harder to do when your core operations are slightly different from what the user actually does, and you see a different document internally than what the user sees.
There are tricks1 that we can do to speed up the translation, but it’s still a lot of complexity. On the other hand, a basic ot algorithm is much easier to implement.
Even though ot do is simple, ot undo is very complicated and inefficient. We’ll expand on this in a later section. Undo in crdt is simple and can be handled as a normal operation.
I hope I’ve demonstrated that neither ot nor crdt is strictly superior to the other. Some further comparison can be found in Appendix A.
Intro to OT
An ot algorithm is made of three parts: a set of basic operations; a transformation function that transforms operations against each other; and a control algorithm that determines which operation to transform against which.
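As rough TypeScript interfaces (the slicing into these three methods is my own sketch; real algorithms divide the work differently):

```typescript
interface OtAlgorithm<Doc, Op> {
  // The basic operations: how an op changes a document.
  apply(doc: Doc, op: Op): Doc;
  // The transformation function: rewrite `a` to apply after `b`.
  transform(a: Op, b: Op): Op;
  // The control algorithm: given a remote op and the local history,
  // pick which ops to transform it against, and in what order.
  integrate(remote: Op, history: Op[]): Op;
}
```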
Usually there are several editors collaborating on a document; each editor is commonly called a site. What we want to achieve is eventual consistency, that is, the document on each site should end up the same; but we also want to preserve user intent—if the algorithm messes up the sequence of the text and makes the document unreadable, then even if all the sites have the same garbled document, it’s not of much use to users.
To nobody’s surprise, user intent is fuzzy and hard to pin down formally. Nevertheless, ot researchers over the years came up with formal concepts and properties that formalize the problem and can be used to prove the correctness of algorithms. Most notably, tp1 and tp2.
TP1 and TP2
TP stands for transformation property. These are properties that a transformation function has to satisfy in order to preserve consistency and user intent. tp1 is basic and easily satisfied. tp2 is rarely satisfied; most algorithms use a control algorithm to arrange operations such that tp2 is never needed.
tp1 says: given two concurrent operations o1 and o2 created on the same state, you can get the same document state through two routes. First route: transform o2 against o1 to get o2’, and apply it on top of o1, so the history is [o1, o2’]. Second route: transform o1 against o2 to get o1’, and apply it on top of o2, so the history is [o2, o1’].
tp2 involves three operations: given any operation o3, you should be able to transform it against [o1, o2’] or against [o2, o1’], and the result o3’ should be the same. This is very hard to achieve. But luckily, you only need tp2 if your algorithm needs to transform operations that are not created on the same context. Most algorithms simply arrange the operations so they never need to do this transformation. Instead, they pre-transform the operations so that they’re based on the same context, then transform them against each other. Indeed, how to correctly and efficiently track and store the context of each operation, find the operation with the right context, etc., is the job of the control algorithm.
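Written as checkable properties, reusing the single-character Op and transform from the earlier sketch, plus a small apply helper (all names are mine):

```typescript
// Apply a single-character op to a string document.
function apply(doc: string, op: Op): string {
  return op.kind === "ins"
    ? doc.slice(0, op.pos) + op.ch + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + 1);
}

// tp1: both routes must end in the same document state.
function tp1Holds(doc: string, o1: Op, o2: Op): boolean {
  const route1 = apply(apply(doc, o1), transform(o2, o1)); // [o1, o2']
  const route2 = apply(apply(doc, o2), transform(o1, o2)); // [o2, o1']
  return route1 === route2;
}

// tp2: transforming o3 along either route must yield the same op.
function tp2Holds(o1: Op, o2: Op, o3: Op): boolean {
  const via1 = transform(transform(o3, o1), transform(o2, o1));
  const via2 = transform(transform(o3, o2), transform(o1, o2));
  return JSON.stringify(via1) === JSON.stringify(via2);
}
```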
Control algorithm
The principle of a control algorithm is simple. Suppose there are two editors, and you’re the control algorithm on editor 1: the user has made a series of edits o1, o2, o3, and in comes an op o8 from editor 2. Op o8 was made on a document state different from the one currently in your editor, so it can’t be applied verbatim. Let’s suppose our transformation function satisfies tp1 but not tp2, which is typical. How do you transform o8 so that it avoids tp2 and the result applies cleanly to the current document?
Note that a document state is simply made of all the ops applied to it. So we can define the document state, aka the context of any op generated on that state, as a set of ops. Our current document state is {o1, o2, o3}. Conveniently, the three ops were generated by the user sequentially, so o1’s context is {}, o2’s context is {o1}, and o3’s context is {o1, o2}; each can be directly applied in the order o1, o2, o3.
Now if we look back at o8, how we transform it depends on its context. Suppose by the time editor 2 generated o8, it had already received o1; then the context of o8 is {o1}. We just need to transform it in some way such that o8’s context becomes {o1, o2, o3}, so that it applies to our document. And with our current history, there is a valid path to do it: first transform o8 against o2, since both have context {o1}, to get o8’. Note that o8’ now has context {o1, o2}. Then transform o8’ against o3 to get o8’’, which has context {o1, o2, o3} and can be applied to our document. Remember, because our transformation function doesn’t satisfy tp2, we can only transform an op against another when they have the same context.
So that’s the rough idea. In real algorithms, there are many ways to play with the context. You can enforce a global sequence so that the context is implied by the position of an op in the sequence. Or you might find yourself with an op whose context you can’t handle (maybe o8’s context is {o1, o3}; how do you transform that?).
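Under the “context is a prefix of a global sequence” scheme, that walk is just a loop. A sketch reusing the pairwise transform from earlier; representing an op’s context as a prefix length is my simplification:

```typescript
// Transform a remote op, based on the first `contextLen` ops of our
// history, until it's based on the whole history.
function integrate(remote: Op, history: Op[], contextLen: number): Op {
  let op = remote;
  for (const local of history.slice(contextLen)) {
    // `op` and `local` share a context here, so a tp1-only transform
    // is safe; afterwards `op`'s context has grown by one op.
    op = transform(op, local);
  }
  return op; // now based on the full history; safe to apply
}

// o8 with context {o1} against history [o1, o2, o3]:
// const o8pp = integrate(o8, [o1, o2, o3], 1);
```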
For concrete algorithms, I recommend reading Google Wave’s implementation2 and the cot paper3, and maybe the uno4 paper for a blend of ot and crdt. I also came up with an algorithm myself; it’s basically a blend of Google Wave and cot, plus tombstones (see below).
Undo
On the surface, undo with ot seems simple: for an op o, you just generate its inverse I(o), treat it as a concurrent operation after o, transform it against all ops after o, and apply it to the document. But this won’t work. Suppose you insert "AAA", undo it, then redo the undo. Now the history consists of [ins(0, "AAA"), del(0, "AAA"), ins(0, "AAA")]. Let’s try to undo the first insertion—make an inverse of it, which is del(0, "AAA"); transform it against the second op, and you get del(0, ""); then transform it against the third op, and of course, nothing changes, so you still get del(0, ""). That’s wrong! We expect to get del(0, "AAA"). Our transformation function doesn’t know that the third op undoes the second op, so after the content in our op was removed in the first transformation, it didn’t get added back during the second transformation.
What I just described is basically ip2: transforming an op o1 against another op o2 followed by its inverse I(o2) should give you o1 back.
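Here is the failing scenario as a runnable toy. The ops carry their content strings, and the transform covers only the cases this example hits; everything is illustrative, not taken from any real library:

```typescript
type StrOp = { kind: "ins" | "del"; pos: number; str: string };

// A toy transform covering only the cases this demo needs:
// delete-vs-identical-delete, and delete-vs-insert.
function transformStr(a: StrOp, b: StrOp): StrOp {
  if (a.kind === "del" && b.kind === "del") {
    if (a.pos === b.pos && a.str === b.str) {
      // b already deleted exactly what a wanted to delete.
      return { kind: "del", pos: a.pos, str: "" };
    }
  }
  if (a.kind === "del" && b.kind === "ins") {
    // An insert at or before the deletion shifts it right.
    const pos = b.pos <= a.pos ? a.pos + b.str.length : a.pos;
    return { kind: "del", pos, str: a.str };
  }
  return a; // other cases don't occur in this demo
}

// History: insert "AAA", undo it, redo it.
const o2: StrOp = { kind: "del", pos: 0, str: "AAA" }; // undo of o1
const o3: StrOp = { kind: "ins", pos: 0, str: "AAA" }; // redo

// Naive undo of o1 = ins(0, "AAA"): invert, then transform past o2, o3.
const inv: StrOp = { kind: "del", pos: 0, str: "AAA" }; // I(o1)
const step1 = transformStr(inv, o2);   // del(0, "") -- content already gone
const step2 = transformStr(step1, o3); // del(3, ""), still deletes nothing
console.log(step2); // not the expected del(0, "AAA"): ip2 is violated
```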
In total, there are three inverse properties: ip1, ip2, and ip3. ip1 is basic and, like tp1, is always satisfied. It says that applying an op followed by its inverse gives you the same document back.
ip3 says: given two concurrent ops o1 and o2, if you transform o2 against o1 to get o2’ and apply it to the document, then generate an inverse of o1 and transform it against o2’ to get I(o1)’, the result should be the same as transforming o1 against o2 to get o1’, then taking its inverse to get I(o1’).
Again, ip2 and ip3 are generally avoided with control algorithms. Earlier ot algorithms try to associate each undo with its original op and skip them in pairs when transforming the inverse op. Later ot algorithms like cot use increasingly complicated and expensive procedures to generate inverse operations.
But if you’re willing to introduce tombstones into the algorithm, the control algorithm suddenly becomes easy: with tombstones, ip1, ip2, and ip3 can all be satisfied, and you can just use the naive algorithm described at the beginning of this section. But obviously tombstones come with significant drawbacks: increased storage, the need to convert between internal and editor positions, etc.
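A sketch of that naive undo, assuming hypothetical tombstone-aware invert and transform functions (declared here, not implemented) over the Op type from earlier:

```typescript
// Hypothetical tombstone-aware primitives: `invert` turns an insert
// into a tombstone-ing delete and vice versa.
declare function invert(op: Op): Op;
declare function transform(a: Op, b: Op): Op;

// Undo the n-th op: invert it, treat the inverse as concurrent with
// everything after it, and transform it up to the present.
function undoNth(history: Op[], n: number): Op {
  let inverse = invert(history[n]);
  for (const later of history.slice(n + 1)) {
    inverse = transform(inverse, later);
  }
  return inverse; // apply to the document, then append to the history
}
```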
If you want to learn the proper way to do undo with ot, I recommend reading the cot paper3. If I remember correctly, its undo is exponential time.
All in all, I don’t think undo is “solved”. Even Google Doc has cracks. Try this: open the same doc in two tabs, in one tab type AABB, in the other insert XX between AA and BB. Now undo in both tabs. Redo in the first tab, you get AABB back. Now redo in the second tab, you get XXAABB.
A simple OT algorithm
This is a very simple ot algorithm; in fact, simplicity is the whole point here. I tried to come up with an algorithm that’s as simple to implement as possible. As a result, it doesn’t support rich text, nor is it distributed, and it uses tombstones. But it’s as simple as you can get.
For the transform function, we just use the most basic string-based insert and delete. Each site has an assigned site id, which is used to break ties.
There needs to be a central server that receives ops from each site, transforms them, and broadcasts them to every site. Each op’s context is represented by a sequence number in a global sequence: its context is simply all the ops before it in that sequence, combined. We make sure all ops are generated from, or transformed to have, this linear, continuous context. This is the key to simplifying the control algorithm.
The server’s job is simple: it maintains the global history and receives ops from each site. The ops sent from a site don’t always have the latest context; if the current global history is [o1,...,o9], a site might send an op with context [o1,...,o6]. That’s fine: the server just transforms it against o7,...,o9, stores it in the global history, and broadcasts it out. The important thing is that a client op’s context is always a prefix of the global history. E.g., a context like [o1, o2, o5, o6] can never happen.
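A sketch of the server, reusing the earlier Op and transform, with an op’s context encoded as the length of the global-history prefix it was based on (the encoding is my assumption, in the spirit of the text):

```typescript
class Server {
  history: Op[] = [];

  // `baseSeq` = how many ops of the global history the client had seen
  // when it created `op`. The context is always a prefix of our
  // history, so we can transform against the rest, in order.
  receive(op: Op, baseSeq: number): Op {
    let transformed = op;
    for (const o of this.history.slice(baseSeq)) {
      transformed = transform(transformed, o);
    }
    this.history.push(transformed);
    return transformed; // broadcast to every site, including the sender
  }
}
```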
On the editor side, we maintain a history made of two segments: the global history L1 and the local history L2. The current editor document state is the result of applying all of L1 + L2. L1 contains all the ops sent from the server, and L2 contains all the ops the user created that haven’t been sent to the server yet. If the user keeps adding ops, we put them at the end of L2. The ops we receive from the server always have continuous global sequence numbers, and the context of a new op from the server is always all the ops in L1. That means whenever an op comes from the server, we just need to transform it against all the ops in L2 sequentially, and it’ll have the correct context to be applied to the current local document. The untransformed server op is then appended to the end of L1.
To send ops to the server, we always send the whole of L2 together, because the ops in L2 have a continuous context. We send L2 to the server; the server transforms the ops against the global history it has and broadcasts them out. Note that the server can always transform the L2 we send, because L2’s context is L1, which is always a prefix of the global history that the server maintains. Then we receive L2’ back from the server, which is the transformed version of L2, and append it to the end of L1. We also remove those ops from L2 since they’re now in L1. Waiting for this ack from the server ensures that our local L1 is always a prefix of the global history on the server.
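Putting the editor side together in one sketch, again reusing Op and transform. The symmetric transformation of L2 against an incoming server op, and the handling of ops made while waiting for an ack, are among the details the prose glosses over:

```typescript
class Client {
  l1: Op[] = [];          // acked prefix of the global history
  l2: Op[] = [];          // local ops not yet acked by the server
  waitingForAck = false;

  localEdit(op: Op) {
    this.l2.push(op);     // its context is everything in L1 + L2 so far
  }

  // A server op whose context is exactly L1: transform it past L2 so it
  // fits the editor document, and keep L2 based on the new L1.
  receiveRemote(op: Op, applyToEditor: (op: Op) => void) {
    let incoming = op;
    const newL2: Op[] = [];
    for (const local of this.l2) {
      newL2.push(transform(local, incoming));
      incoming = transform(incoming, local);
    }
    this.l1.push(op);     // the untransformed op extends L1
    this.l2 = newL2;
    applyToEditor(incoming);
  }

  // Send the whole of L2; its context (L1) is a prefix of the server's
  // global history, so the server can always transform it.
  maybeSend(send: (ops: Op[], baseSeq: number) => void) {
    if (!this.waitingForAck && this.l2.length > 0) {
      send(this.l2, this.l1.length);
      this.waitingForAck = true;
    }
  }

  // The ack carries L2' (our ops transformed onto the global history);
  // they join L1. Ops made since the send stay in L2.
  receiveAck(l2prime: Op[]) {
    this.l1.push(...l2prime);
    this.l2 = this.l2.slice(l2prime.length);
    this.waitingForAck = false;
  }
}
```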
In case of disconnection or lost messages, a site can simply re-request ops from the server and resend its local ops; there’s no special synchronization or fixup required.
That’s about it for the do part. This is basically what Google Wave does. I omitted some details, but readers should be able to see how it manages contexts to ensure ops always have the correct context for the transformations we need to do.
For undo, I used tombstones. Now I need to translate between the internal document position (which counts tombstones) and the external editor position. This is still vastly simpler than trying to use a control algorithm to avoid ip2 and ip3. If someone has a good way to implement undo without tombstones, I’m eager to hear it.
Real world implementations
Most of the algorithms mentioned in the history section have an accompanying implementation. Most notably, Sun et al implemented CoWord and CoMaya, which bring collaborative editing to Word and Maya. Very impressive work. The latest algorithm used in CoWord and CoMaya is cot. cot is also used in CodoxWord and ibm OpenCoWeb, according to the pot paper6.
The cke editor team based their editor on this paper.7 They show some of the data structures they use in another blog post8, and they look just like the ones in the paper.
Appendix A, OT vs CRDT cont.
People also say crdt can’t work well with rich text9, because it can only handle insertion and deletion, so it’s difficult for it to preserve user intent on complicated operations like “add table row”. And all the commercial rich text editors are implemented with ot (the TinyMCE editor9, cke editor11, Codox, Google Doc, etc.). But I’m pretty sure that crdt can handle rich text. The author of CoWord, CoPPT, and CoMaya12 uses ta (transparent adaptation)13, which works by converting high-level operations in the application into three primitive operations: insert, delete, and update. You’d just need to properly implement a rich-text crdt, rather than encoding rich text in json and using a crdt on the plain json.
Though to be fair, ot algorithms rarely have more than a handful of basic operations either, because you need to define a transformation for each pair of basic operations, so the number of transformations you need to define is the number of basic operations squared. The common approach is ta (transparent adaptation): translating high-level operations into basic operations. I didn’t look into this very deeply.
Space complexity: a crdt document keeps all the characters ever inserted, both visible characters and tombstones. ot needs to store all the concurrent operations, but it doesn’t need to store the full history: once an operation is known to have been applied at all nodes, it can be discarded.
So it seems ot has better space complexity in general? Not quite. If you want to make the ot system fully distributed, you’ll have to keep the full history, since it’s impossible to tell whether an operation has been applied at every node when you don’t even know all the nodes. And if you think about it, “full history” and “all characters ever inserted into the document” sound familiar, no?14
On the other hand, if you do know when all sites have applied an operation, you can garbage-collect tombstones in crdt.
Time complexity: it varies greatly depending on the algorithm, the optimizations, etc. Suffice it to say that both can be efficient; it really depends.
Finally, there is a detailed paper15 that compares ot and crdt. It was later expanded into a three-part series: Ⅰ16, Ⅱ17, Ⅲ18. The main idea is that crdt in its essence is still based on transformations, and there are a lot of hidden complications in applying the algorithm-on-paper to editors. The paper makes crdt sound inferior to ot, but the bias towards ot is pretty palpable, so I’d take it with a grain of salt :-)
Appendix B, OT history
dOPT (grove)19 is (I think) the first ot algorithm, created in 1989. Later, people found a flaw in it: the dOPT puzzle.
DistEdit20 in 1994 explores selective undo and undo ui.
Jupiter21 in 1995 is the basis of Google Wave2.
The Sun et al lineage starts from got23 in 1998. got aims to solve the dOPT puzzle. Then there is goto, also in 1998, by the same author.
After that came anyundo24 in 2002, which can undo any operation at any time. Previous algorithms usually had some restriction on which operations could be undone and when. anyundo concerns itself only with undo, so it pairs with goto to form a complete solution.
cot3 is the final algorithm, iterated upon goto-anyundo and published in 2009. pot6 is the final final algorithm, proposed in a theoretical paper in 2016. (They also proposed an improved version of tibot, tibot 2.0, in that paper.)
Control algorithm-wise, it seems to end there for Sun et al. After cot, they went on to research collaborative editing for 3D modeling software (e.g., CoMaya).
tibot27 by Li & Li and Sun in 2004 is an interesting algorithm that uses logical time intervals; it needs neither central ordering nor vector timestamps. (Still, every node needs to constantly sync with every other node, so join/leave would be a challenge.)
There are some other algorithms proposed by other researchers: soct3/4, nice, etc. You can check out the wiki page on operational transformation. It has a nice table summarizing a lot of the algorithms.
Post cot, the trend seems to have shifted in two major directions. One is ttf28 (by hal), which incorporates tombstones into ot to improve undo complexity. The other is abt29 (by Li & Li), which rearranges operations in the history buffer such that an operation that inserts a character always comes before an operation that deletes that character.
Along the ttf line, you have uno4, ST-Undo31, etc. Along the abt line, you have abtu32.
Here I quote the author of ST-Undo:
The follow-up OT-based undo solutions invent functional components to address the abnormal ordering problem. TTF [36] introduces an object sequence to keep deleted objects. ABT [39] introduces a special history buffer in which insert operations are placed before delete operations. UNO [25, 26] is a selective undo algorithm built on TTF. As deleted objects are never lost, UNO can preserve the ordering relations among objects. Except for the object sequence, UNO stores both do and undo operations in the history buffer. The time and space complexity of UNO is linear in the size of the object sequence plus the number of operations in the history buffer. ABTU [28] is developed from ABT [39]. In ABTU, undo operations are stored in the form of inverse operations of the corresponding do operations in the history buffer. As an operation may be transformed with both do and undo operations, ABTU arranges the operations in the history buffer according to their effect positions. ABTU has a linear time and space complexity in the size of history buffer.
I only looked into ST-Undo and abtu closely. abtu is very complicated and uses vector timestamps; ST-Undo is moderately complicated, but to me, part of its complexity (using a tree to store tombstones) is unnecessary in practice. They could’ve just used a cache around the cursor: the performance would be even better in practice, and it’s dead simple to implement. But I guess caching isn’t interesting for an academic paper.
I also found optic33, which seems to be a “truly distributed” algorithm that handles nodes joining and leaving.