Practical Intro to Operational Transformation

Can’t you just give me the best algorithm?

Unfortunately, there isn’t a single best algorithm that you can just learn and implement. There are many different algorithms, each with its own trade-offs and subtleties, none of which I found completely satisfactory. In this article, I hope to introduce basic ot concepts in simple terms, draw a rough picture of how it generally works, and point readers to materials for further reading. I’ve also included a simple ot algorithm to give the reader something concrete to look at.

A huge disclaimer upfront: I’m not an expert, only someone who has read all the papers and blog posts I could find, so take what I say with a grain of salt. There’s a frustrating lack of material on ot online (which is part of what prompted this article), so I can’t say I’ve grasped the whole picture regarding ot. If there’s any mistake, or anything I missed, please do let me know. Thanks!

What’s OT & CRDT

Nowadays you can’t talk about ot (operational transformation) without mentioning crdt (conflict-free replicated data types). Broadly speaking, they are the two approaches to collaborative editing. ot was the first to be conceived, and has been researched since the 90s. crdt is a database/distributed-systems concept that was brought into collaborative editing around 2006. I’ll briefly introduce both and compare them, before delving into ot.

ot works by transforming operations like insert, delete, update, etc. Consider this example:

  1. Two users A, B work on a document containing only “x”;
  2. A inserts “a” before “x” (insert “a” at position 0), propagates its operation to B;
  3. Not yet aware of A’s modification, B deletes “x” (delete at position 0), propagates its operation to A.

Now if A applies B’s operation verbatim, it would delete the character at position 0, which is “a” rather than the intended “x”. The correct way is to transform B’s deletion so that it deletes at position 1, which then deletes “x”.

This is a trivial example, but things get very complicated very fast once you mix more concurrent operations and have 3+ users.

The problem, in essence, is that B’s operation is based on document state “x”, so A can’t apply it verbatim on document “ax”. Once we transform the operation to be based on document “ax”, the operation can apply cleanly.
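To make this concrete, here’s a minimal sketch in Python (hypothetical op representation, single-character ops only) of the transformation in this example:

```python
# Ops are (kind, pos, ch) tuples: ("ins", 0, "a") or ("del", 0, None).
# A hypothetical, minimal transform for single-character ops: shift the
# position of `op` to account for a concurrent `against` op.

def transform(op, against):
    kind, pos, ch = op
    a_kind, a_pos, _ = against
    if a_kind == "ins" and a_pos <= pos:
        return (kind, pos + 1, ch)   # an insertion before us shifts us right
    if a_kind == "del" and a_pos < pos:
        return (kind, pos - 1, ch)   # a deletion before us shifts us left
    return op

def apply_op(doc, op):
    kind, pos, ch = op
    if kind == "ins":
        return doc[:pos] + ch + doc[pos:]
    return doc[:pos] + doc[pos + 1:]

# The example: document "x"; A inserts "a" at 0, B deletes at 0.
a_op = ("ins", 0, "a")
b_op = ("del", 0, None)

# On A's site: apply a_op, then B's op transformed against a_op.
doc_a = apply_op(apply_op("x", a_op), transform(b_op, a_op))
# On B's site: apply b_op, then A's op transformed against b_op.
doc_b = apply_op(apply_op("x", b_op), transform(a_op, b_op))

print(doc_a, doc_b)  # both sites converge on "a"
```

Both sites converge on “a”: on A’s site the delete is shifted to position 1, while on B’s site the insert passes through unchanged.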

crdt is a data type plus a set of operations, designed such that operations can be transmitted in different orders and still eventually produce the same result, as long as everybody eventually receives all the operations. Then you don’t need central synchronization: a distributed gossip protocol can ensure that every node in the system eventually reaches the same state. (I omitted some details for brevity.)

It turns out you can design a crdt that represents a document, with a set of operations that covers all text editing operations (on plain text). You model the document as a sequence of characters with unique ids, and have two operations: insert after the character with id x, and hide the character with id x.

Take the same example above. A’s operation is now “insert after the beginning of the document”, and B’s operation is now “hide the character with id 1” (which is character x). Note that the operations are no longer position-based. When A applies B’s operation, it first finds the position of the character with id 1, then deletes it.

You’ll notice that there’s only “hide” and no “delete”, because crdt handles delete as hide. crdt can’t truly delete a character, since other operations might need that character as an anchor. (The hidden characters are called tombstones.)
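A toy sketch of this id-based model (hypothetical Python, ignoring id generation and networking):

```python
# A toy id-based document: a list of [uid, char, visible] entries.
# "insert after id x" and "hide id x" are the only two operations.

class IdDoc:
    def __init__(self):
        self.chars = []  # each entry: [uid, ch, visible]

    def insert_after(self, anchor_id, uid, ch):
        # anchor_id None means "the beginning of the document"
        idx = 0
        if anchor_id is not None:
            idx = next(i for i, e in enumerate(self.chars) if e[0] == anchor_id) + 1
        self.chars.insert(idx, [uid, ch, True])

    def hide(self, uid):
        # delete = hide: keep the entry as a tombstone so later ops
        # can still anchor on it
        for e in self.chars:
            if e[0] == uid:
                e[2] = False

    def text(self):
        return "".join(ch for _, ch, vis in self.chars if vis)

# The running example: document "x" (id 1).
doc = IdDoc()
doc.insert_after(None, 1, "x")
# A: insert "a" after the beginning of the document (id 2).
doc.insert_after(None, 2, "a")
# B: hide the character with id 1 -- no positions involved.
doc.hide(1)
print(doc.text())  # "a"
```

Because “hide id 1” names a character rather than a position, the two operations can be applied in either order and still converge.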

OT vs CRDT

Both ot and crdt are viable approaches, with their own trade-offs.

Collaborative editing is a distributed systems problem, and as with everything else in distributed systems, you can simplify your system either by centralizing things (making it less distributed) or by limiting what you can do (limiting the functionality). crdt and ot move along these axes of trade-offs.

The advantage of crdt is that it’s relatively simple to make fully distributed. There are many ot algorithms that claim to be distributed, but they usually have some catch: vector timestamps, a requirement of global synchronization, bad space/time complexity that scales with the number of nodes in the system, no mention of handling join/leave, etc. (Not to say there are no truly distributed ot algorithms; there are.)

People often credit crdt for being simpler than ot, which it is, on paper. But the papers and algorithms only talk about the concept; if you actually implement it, and do it efficiently, things get complicated.

Recall that we described a crdt operation as “find the position of the character with id 1”; a lot of complexity hides in that “find”. How do you find the character with id 1? Are you going to scan the whole document every time?

Also, to use crdt in the real world with an editor, you need to translate “delete character with id 1” into something an editor can actually apply, namely “delete character at position x”. And once you find the position of the character with id 1 in the crdt data structure, you need to subtract the hidden characters from that position, because the editor’s document doesn’t contain those. The same goes for the reverse direction: you need to translate an editor operation made by the user into a crdt operation.
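A sketch of that position translation (hypothetical helper names; the character list holds (char, visible) pairs):

```python
# The crdt keeps tombstones, the editor does not, so positions must be
# translated by skipping hidden characters in both directions.

def internal_to_editor(chars, internal_idx):
    """Count only the visible characters before internal_idx."""
    return sum(1 for ch, vis in chars[:internal_idx] if vis)

def editor_to_internal(chars, editor_pos):
    """Find the internal index of the editor_pos-th visible character."""
    seen = 0
    for i, (ch, vis) in enumerate(chars):
        if vis:
            if seen == editor_pos:
                return i
            seen += 1
    return len(chars)

# Internal doc "a[b]c" where [b] is a tombstone; the editor sees "ac".
chars = [("a", True), ("b", False), ("c", True)]
print(internal_to_editor(chars, 2))  # internal index of "c" -> editor position 1
print(editor_to_internal(chars, 1))  # editor position 1 -> internal index 2
```

This naive version scans the whole list on every call; the “tricks” mentioned below are about making this lookup fast.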

Also, since crdt lives in a different world from what the user and editor see, the operations crdt produces might not be “intuitive” to the user, even though the document ends up in a consistent state. A big part of collaborative editing is capturing and preserving “user intent”, and that’s harder to do when your core operations are slightly different from what the user actually does, and you see a different document internally than what the user sees.

There are tricks1 we can use to speed up the translation, but it’s still a lot of complexity. On the other hand, a basic ot algorithm is much easier to implement.

Even though ot do is simple, ot undo is very complicated and inefficient. We’ll expand on this in later sections. Undo in crdt is simple and can be handled as normal operations.

I hope I’ve demonstrated that neither ot nor crdt is strictly superior to the other. Some further comparison can be found in Appendix A.

Intro to OT

An ot algorithm is made of three parts: a set of basic operations; a transformation function that transforms operations against each other; and a control algorithm that determines which operation to transform against which.

Usually there are several editors collaborating on a document; each editor is commonly called a site. What we want to achieve is eventual consistency, that is, the document on each site should end up the same. But we also want to preserve user intent: if the algorithm messes up the sequence of text and makes the document unreadable, then even if all the sites have the same garbled document, it’s not of much use to users.

To nobody’s surprise, user intent is fuzzy and hard to pin down formally. Nevertheless, ot researchers over the years came up with formal concepts and properties that formalize the problem and can be used to prove the correctness of algorithms. Most notably, tp1 and tp2.

TP1 and TP2

TP stands for transformation property. These are properties that a transformation function has to satisfy in order to preserve consistency and user intent. tp1 is basic and easily satisfied. tp2 is rarely satisfied; most algorithms instead use a control algorithm to rearrange operations such that tp2 is never needed.

tp1 says: given two concurrent operations o1 and o2 created on the same state, you can get the same document state through two routes. First route: transform o2 against o1 to get o2’, and apply it on top of o1, so the history is [o1, o2’]. Second route: transform o1 against o2 to get o1’, and apply it on top of o2, so the history is [o2, o1’].

tp2 involves three operations: given any operation o3, you should be able to transform it against [o1, o2’] or against [o2, o1’], and the result o3’ should be the same. This is very hard to achieve. But luckily, you only need tp2 if your algorithm needs to transform operations that are not created on the same context. Most algorithms simply arrange the operations so they never need to do this transformation. Instead, they pre-transform the operations so that they’re based on the same context, then transform them against each other. Indeed, how to correctly and efficiently track and store the context of each operation, find the operation with the right context, etc., is the job of the control algorithm.
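tp1 can be checked mechanically. Below is a hedged sketch with single-character ops of the form (kind, position, char, site id), where the site id breaks ties between concurrent inserts at the same position:

```python
# TP1: apply(apply(S, o1), T(o2, o1)) == apply(apply(S, o2), T(o1, o2)).
# Minimal single-character ops: (kind, pos, ch, site). The site id
# breaks ties between concurrent inserts at the same position.

def transform(op, against):
    kind, pos, ch, site = op
    a_kind, a_pos, _, a_site = against
    if a_kind == "ins":
        # A delete targeting pos is always pushed right by an insert at
        # the same position; insert-vs-insert ties go to the lower site id.
        if a_pos < pos or (a_pos == pos and (kind == "del" or a_site < site)):
            return (kind, pos + 1, ch, site)
    else:  # against a delete
        if a_pos < pos:
            return (kind, pos - 1, ch, site)
    return op

def apply_op(doc, op):
    kind, pos, ch, _ = op
    return doc[:pos] + ch + doc[pos:] if kind == "ins" else doc[:pos] + doc[pos + 1:]

def check_tp1(doc, o1, o2):
    route1 = apply_op(apply_op(doc, o1), transform(o2, o1))  # [o1, o2']
    route2 = apply_op(apply_op(doc, o2), transform(o1, o2))  # [o2, o1']
    return route1 == route2

# Two concurrent inserts at the same position, ordered by site id.
print(check_tp1("x", ("ins", 0, "a", 1), ("ins", 0, "b", 2)))  # True
print(check_tp1("x", ("ins", 0, "a", 1), ("del", 0, "", 2)))   # True
```

check_tp1 walks both routes from the same starting document and compares the results; a transformation function fails tp1 if some pair of concurrent ops makes the two routes diverge.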

Control algorithm

The principle of a control algorithm is simple. Suppose there are two editors, and you’re the control algorithm on editor 1. The user has made a series of edits o1, o2, o3, and in comes an op o8 from editor 2. Op o8 was made on a document state different from the one currently in your editor, so it can’t be applied verbatim. Let’s suppose our transformation function satisfies tp1 but not tp2, which is typical. How do you transform o8 so that it avoids tp2 and the result applies cleanly to the current document?

Note that a document state is simply the accumulation of all the ops applied to it. So we can define a document state, aka the context of any op generated on that state, as a set of ops. Our current document state is then {o1, o2, o3}. Conveniently, the three ops were generated by the user sequentially, so o1’s context is {}, o2’s context is {o1}, and o3’s context is {o1, o2}; each can be directly applied in the order o1, o2, o3.

Now if we look back at o8, how we transform it depends on its context. Suppose by the time editor 2 generates o8, it has already received o1, then the context of o8 is {o1}. We just need to transform it in some way such that o8’s context becomes {o1, o2, o3} to apply it on our document. And with our current history, there is a valid path to do it: first transform o8 against o2, since both have context {o1}, to get o8’. Note that now o8’ has context {o1, o2}. Then transform o8’ against o3 and get o8” which has context {o1, o2, o3}, and it can be applied to our document. Remember, because our transformation function doesn’t satisfy tp2, we can only transform an op against another when they have the same context.
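The bookkeeping in this walkthrough can be sketched by tracking only the contexts (hypothetical shape; the actual position math is elided):

```python
# Sketch of the context bookkeeping above. An op can only be transformed
# against another op with the same context; afterwards, that other op
# joins the transformed op's context. Names and the dict shape are
# hypothetical; position shifting is elided.

def transform(op, against):
    assert op["ctx"] == against["ctx"], "contexts must match"
    return {"name": op["name"] + "'", "ctx": op["ctx"] | {against["name"]}}

# Local history: o1 on {}, o2 on {o1}, o3 on {o1, o2}.
o2 = {"name": "o2", "ctx": frozenset({"o1"})}
o3 = {"name": "o3", "ctx": frozenset({"o1", "o2"})}

# o8 arrives from editor 2, created when it had only seen o1.
o8 = {"name": "o8", "ctx": frozenset({"o1"})}

o8p = transform(o8, o2)    # o8': context is now {o1, o2}
o8pp = transform(o8p, o3)  # o8'': context is now {o1, o2, o3}
print(sorted(o8pp["ctx"]))  # ['o1', 'o2', 'o3'] -- safe to apply locally
```

The assert is the whole point: with a tp1-only transformation function, the control algorithm’s job is to make sure that assert never fires.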

So that’s the rough idea, in real algorithms, there are many ways to play with the context. You can enforce a global sequence so that the context is implied by the position of an op in the sequence. Or you might find yourself with an op with a context that you can’t transform (maybe o8’s context is {o1, o3}, how do you transform that?)

For concrete algorithms, I recommend reading Google Wave’s implementation2 and the cot paper3, and maybe the uno4 paper for a blend of ot and crdt. I also came up with an algorithm myself, it’s basically a blend of Google Wave and cot, plus tombstones (see below).

Undo

On the surface, undo with ot seems simple: for an op o, you just generate the inverse of it I(o), and treat it as a concurrent operation after o, transform it against all ops after o, and apply it to the document. But this won’t work. Suppose you insert "AAA", undo it, then redo the undo. Now the history consists of [ins(0, "AAA"), del(0, "AAA"), ins(0, "AAA")]. Let’s try to undo the first insertion—make an inverse of it, which is del(0, "AAA"); transform it against the second op, now you get ins(0, ""); then you transform it against the third op, of course, nothing changes, and you still get ins(0, ""). That’s wrong! We expect to get del(0, "AAA"). Our transformation function doesn’t know that the third op undoes the second op, so after the content in our op was removed in the first transformation, it didn’t get added back during the second transformation.

What I just described is basically ip2: transforming an op o1 against another op o2 followed by its inverse I(o2) should give you o1 back.
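The broken trace above can be reproduced with a minimal hypothetical transform (a single character “A” instead of “AAA” for brevity):

```python
# Reproducing the broken naive undo with single-character ops.
# An op is (kind, pos) with kind in {"ins", "del", "nop"}; a minimal
# hypothetical transform that knows nothing about undo.

def transform(op, against):
    kind, pos = op
    a_kind, a_pos = against
    if kind == "nop":
        return op
    if a_kind == "ins" and a_pos <= pos:
        return (kind, pos + 1)
    if a_kind == "del":
        if a_pos < pos:
            return (kind, pos - 1)
        if a_pos == pos and kind == "del":
            return ("nop", pos)  # the char we wanted to delete is already gone
    return op

history = [("ins", 0), ("del", 0), ("ins", 0)]  # type "A", undo, redo

# Naive undo of history[0]: invert it, transform through the rest.
inverse = ("del", 0)                      # inverse of ins(0)
inverse = transform(inverse, history[1])  # -> ("nop", 0): content already gone
inverse = transform(inverse, history[2])  # stays ("nop", 0)

print(inverse)  # ('nop', 0), but the correct undo op is ('del', 0)
```

The transform correctly turns a delete of an already-deleted character into a no-op, which is exactly why the naive invert-and-transform undo produces a no-op instead of the expected delete: it never learns that the third op restored the content.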

In total, there are three inverse properties: ip1, ip2, ip3. ip1 is basic and, like tp1, easily satisfied. It says applying an op followed by its inverse gives you back the same document.

ip3 says: given two concurrent ops o1 and o2, if you transform o2 against o1 to get o2’ and apply it to the document, then generate the inverse of o1 and transform it against o2’ to get I(o1)’, the result should be the same as transforming o1 against o2 to get o1’ and then taking its inverse I(o1’).

Again, ip2 and ip3 are generally avoided with control algorithms. Earlier ot algorithms try to associate each undo with its original op and skip them in pairs when transforming the inverse op. Later ot algorithms like cot use increasingly complicated and expensive procedures to generate inverse operations.

But if you’re willing to introduce tombstones into the algorithm, the control algorithm suddenly becomes easy: with tombstones, ip1, ip2, and ip3 can all be satisfied, and you can just use the naive algorithm described at the beginning of this section. But tombstones obviously come with significant drawbacks: increased storage, the need to convert between internal and editor positions, etc.

If you want to learn the proper way to do undo with ot, I recommend reading the cot paper3. If I remember correctly, its undo is exponential time.

All in all, I don’t think undo is “solved”. Even Google Doc has cracks. Try this: open the same doc in two tabs, in one tab type AABB, in the other insert XX between AA and BB. Now undo in both tabs. Redo in the first tab, you get AABB back. Now redo in the second tab, you get XXAABB.

A simple OT algorithm

This is a very simple OT algorithm, in fact, simplicity is the whole point here. I tried to come up with an algorithm that’s as simple to implement as possible. As a result, it doesn’t support rich text, nor is it distributed, and it uses tombstones. But it’s as simple as you can get.

For the transform function, we just use the most basic string-based insert and delete. Each site has an assigned site id, which is used to break ties.

There needs to be a central server that receives ops from each site, transforms them, and broadcasts them to every site. Each op’s context is represented by a sequence number in a global sequence: its context is simply all the ops before it combined. We make sure all the ops are generated from, or transformed to have, this linear, continuous context. This is the key to simplifying the control algorithm.

The server’s job is simple: it maintains the global history and receives ops from each site. The ops sent from a site don’t always have the latest context; if the current global history is [o1,...,o9], a site might send an op with context [o1,...,o6]. That’s fine: the server just transforms it against o7,...,o9, stores it in the global history, and broadcasts it out. The important thing is that a client op’s context is always a prefix of the global history. Eg, a context like [o1, o2, o5, o6] would never happen.

On the editor side, we maintain a history made of two segments: the global history L1 and the local history L2. The current editor document state is the result of applying L1 + L2. L1 contains all the ops received from the server, and L2 contains all the ops the user created that haven’t been sent to the server yet. If the user keeps adding ops, we put them at the end of L2. The ops we receive from the server always have continuous global sequence numbers, and the context of a new op from the server is always exactly the ops in L1. That means whenever an op comes from the server, we just need to transform it against all the ops in L2 sequentially, and it’ll have the correct context to be applied to the current local document. The untransformed server op is then appended to the end of L1.

To send ops to the server, we always send the whole of L2 together, because the ops in L2 have continuous context. We send L2 to the server; the server transforms the ops against the global history it has and broadcasts them out. Note that the server can always transform the L2 we send, because L2’s context is L1, which is always a prefix of the global history the server maintains. Then we receive L2’ back from the server, which is the transformed version of L2, and append it to the end of L1. We also remove the ops from L2 since they’re now in L1. Waiting for this ack from the server ensures that our local L1 is always a prefix of the global history on the server.
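A sketch of the client-side bookkeeping (hypothetical names; a minimal single-character transform stands in for a real one):

```python
# Client-side history bookkeeping sketch: L1 = acknowledged global
# history, L2 = local unsent ops. Ops are (kind, pos, ch) tuples;
# a minimal hypothetical transform for single-character ops.

def transform(op, against):
    kind, pos, ch = op
    a_kind, a_pos, _ = against
    if a_kind == "ins" and a_pos <= pos:
        return (kind, pos + 1, ch)
    if a_kind == "del" and a_pos < pos:
        return (kind, pos - 1, ch)
    return op

def apply_op(doc, op):
    kind, pos, ch = op
    return doc[:pos] + ch + doc[pos:] if kind == "ins" else doc[:pos] + doc[pos + 1:]

class Client:
    def __init__(self):
        self.L1 = []  # ops confirmed by the server, a prefix of global history
        self.L2 = []  # local ops not yet acknowledged

    def local_edit(self, op):
        self.L2.append(op)  # its context is L1 + L2 so far

    def recv_server_op(self, op):
        # op's context is exactly L1, so thread it through all of L2;
        # each L2 op is symmetrically transformed so L2 keeps up too.
        self.L1.append(op)
        incoming = op
        for i, local in enumerate(self.L2):
            incoming, self.L2[i] = transform(incoming, local), transform(local, incoming)
        return incoming  # apply this to the editor document

    def recv_ack(self, transformed_ops):
        # the server broadcast our own ops back, transformed: move them to L1
        self.L1.extend(transformed_ops)
        self.L2 = []

c = Client()
c.local_edit(("ins", 0, "a"))                  # user types "a" before "x": doc "ax"
to_apply = c.recv_server_op(("del", 0, None))  # another site deleted "x"
print(apply_op("ax", to_apply))                # "a": the delete shifted past our insert
```

Note that recv_server_op also transforms each op in L2 against the incoming op, which is what keeps L2’s context equal to L1 at all times.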

In case of disconnection or lost messages, a site can simply re-request ops from the server and resend its local ops; there’s no special synchronization or fixup required.

That’s about it for the do part. This is basically what Google Wave does. I omitted some details but readers should be able to see how it manages context to ensure ops always have the correct context for the transformation we need to do.

For undo, I used tombstones. Now I need to translate between internal document positions (which count tombstones) and external editor positions. This is still vastly simpler than trying to use a control algorithm to avoid ip2 and ip3. If someone has a good way to implement undo without tombstones, I’m eager to hear it.

Real world implementations

Most of the algorithms mentioned in the history section have an accompanying implementation. Most notably, Sun et al implemented CoWord and CoMaya, which bring collaborative editing to Word and Maya. Very impressive work. The latest algorithm used in CoWord and CoMaya is cot. cot is also used in CodoxWord and ibm OpenCoWeb, according to the pot paper6.

The cke editor team based their editor on this paper7. They showed some of the data structures they use in another blog post8, and they look just like the ones in the paper.

Appendix A, OT vs CRDT cont.

People also say crdt can’t work well with rich text9, because it can only handle insertion and deletion, so it’s difficult for it to preserve user intent on complicated operations like “add table row”. And all the commercial rich text editors are implemented with ot (TinyMCE editor9, cke editor11, Codox, Google Doc, etc). But I’m pretty sure crdt can handle rich text. The author of CoWord, CoPPT, and CoMaya12 uses ta (transparent adaptation)13, which works by converting high-level operations in the application into three primitive operations: insert, delete, and update. You just need to properly implement crdt with rich text, rather than encoding rich text in json and running crdt on the plain json.

A group led by Chengzheng Sun, a prominent figure in ot research.

Though to be fair, ot algorithms rarely have more than a handful of basic operations either, because you need to define a transformation between each pair of basic operations, so the number of transformations you need to define is the number of basic operations squared. The common approach is ta (transparent adaptation): translating high-level operations into basic operations. I didn’t look into this very deeply.

Space complexity: a crdt document keeps all the characters ever inserted, both visible characters and tombstones. ot needs to store all the concurrent operations. It doesn’t need to store all operations forever: once an operation is known to have been applied at all nodes, it can be discarded.

So it seems ot has better space complexity in general? Not quite. If you want to make the ot system fully distributed, you have to keep all the history, since it’s impossible to tell whether an operation has been applied at every node when you don’t even know all the nodes. And if you think about it, “full history” and “all characters ever inserted into the document” sound familiar, no?14

Unless a user makes a million undos and redos. You can’t just store undo operations as references to their original operations, since they need to be transformed and thus become different.

On the other hand, if you do know when all sites have applied an operation, you can garbage-collect tombstones in crdt.

Time complexity: it varies greatly depending on the algorithm, optimization, etc. Suffice to say both can be efficient and it really depends.

Finally, there is a detailed paper15 that compares ot and crdt. It was later expanded into a three-part series: 16, 17, 18. The main idea is that crdt, in its essence, is still based on transformations, and there are a lot of hidden complications in applying the on-paper algorithm to real editors. The paper makes crdt sound inferior to ot, but the bias towards ot is pretty palpable, so I’d take it with a grain of salt :-)

Appendix B, OT history

deOPT (grove)19 is (I think) the first ot algorithm, created in 1989. Later, people found a flaw in it: the deOPT puzzle.

DistEdit20 in 1994 explores selective undo and undo ui.

Jupiter21 in 1995 is the basis of Google Wave2.

The Sun et al lineage starts with got23 in 1998. got aims to solve the deOPT puzzle. Then there is goto, also in 1998, by the same authors.

After that came anyundo24 in 2002, which can undo any operation at any time. Previous algorithms usually had some restriction on which operations could be undone and when. anyundo concerns itself only with undo, so it pairs with goto to form a complete solution.

cot3 is the final algorithm, iterated upon goto-anyundo and published in 2009. pot6 is the final final algorithm, proposed in a theoretical paper in 2016. (They also proposed an improved version of tibot, tibot 2.0, in that paper.)

Control algorithm-wise, it seems to end there for Sun et al. After cot, they went on to research collab editing for 3D modeling software (e.g., Co-Maya).

tibot27 by Li & Li and Sun in 2004 is an interesting algorithm that uses logical time intervals; it needs neither central ordering nor vector timestamps. (Still, every node needs to constantly sync with every other node, so join/leave will be a challenge.)

There are some other algorithms proposed by other researchers: soct3/4, nice, etc. You can check out the wiki page on operational transformation. It has a nice table summarizing a lot of the algorithms.

Post cot, the trend seems to have shifted in two major directions. One is ttf28 (by hal), which incorporates tombstones into ot to improve undo complexity. The other is abt29 (by Li & Li), which rearranges operations in the history buffer such that an operation that inserts a character always comes before an operation that deletes that character.

Along the ttf line, you have uno4, ST-Undo31, etc. Along the abt line, you have abtu32.

Here I quote the author of ST-Undo:

The follow-up OT-based undo solutions invent functional components to address the abnormal ordering problem. TTF [36] introduces an object sequence to keep deleted objects. ABT [39] introduces a special history buffer in which insert operations are placed before delete operations. UNO [25, 26] is a selective undo algorithm built on TTF. As deleted objects are never lost, UNO can preserve the ordering relations among objects. Except for the object sequence, UNO stores both do and undo operations in the history buffer. The time and space complexity of UNO is linear in the size of the object sequence plus the number of operations in the history buffer. ABTU [28] is developed from ABT [39]. In ABTU, undo operations are stored in the form of inverse operations of the corresponding do operations in the history buffer. As an operation may be transformed with both do and undo operations, ABTU arranges the operations in the history buffer according to their effect positions. ABTU has a linear time and space complexity in the size of history buffer.

I only looked into ST-Undo and abtu closely. abtu is very complicated and uses vector timestamps; ST-Undo is moderately complicated, but to me, part of the complexity (using a tree to store tombstones) is unnecessary in practice. They could’ve just used a cache around the cursor; the performance would be even better in practice, and it’s dead simple to implement. But I guess caching isn’t interesting for an academic paper.

I also found optic33, which seems to be a “truly distributed” algorithm that handles nodes joining and leaving.

Resources

deOPT: Concurrency control in groupware systems
DistEdit: A framework for undoing actions in collaborative systems
Jupiter: High-latency, low-bandwidth windowing in the Jupiter collaboration system
got: Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems
anyundo: Undo as concurrent inverse in group editors
cot: Context-based Operational Transformation for Distributed Collaborative Editing Systems
pot: Conditions and Patterns for Achieving Convergence in OT-Based Co-Editors
tibot: A Time Interval Based Consistency Control Algorithm for Interactive Groupware Applications
optic: Coordination Model for Real-Time Collaborative Editors
uno: An Undo Framework for P2P Collaborative Editing
abtu: An Algorithm for Selective Undo of Any Operation in Collaborative Applications
ot for xml: Generalizing Operational Transformation to the Standard General Markup Language
CoWord, CoPPT, ta: Transparent adaptation of single-user applications for multi-user real-time collaboration. Extends the xml paper7
ttf: Tombstone Transformation Functions for Ensuring Consistency in Collaborative Editing Systems
abt: An Admissibility-Based Operational Transformation Framework for Collaborative Editing Systems