A technique to store rich formatting separately from text

David Chester

Apr 05, 2022 · 4 min read

Some time ago, we had built a collaborative whiteboard product in which text was stored as plain text. Then we wanted to add a limited set of rich text formatting options (bold, italics, and first- and second-level headings), so plain text would no longer suffice.

Should we use markdown?

How should we store the rich text? We first pursued markdown as a storage format. It has the nice quality that if we end up rendering it as text inadvertently somewhere (in an admin tool, in a link preview, etc.) it will mostly look okay. A stray raw # here or there would not be so bad.

As always, the initial implementation came together relatively quickly and seemed promising enough. But there was trouble lurking. At the time of our investigations, we did not find great options for libraries that would both encode and decode markdown. The marked library was great for turning markdown into HTML, and turndown was good enough at turning HTML back into markdown, but there were edge cases where the round-trip was not perfect, and the performance overhead was a concern when dealing with lots of data.

Also, while it was a nifty idea that our existing plain text content was already just markdown with no formatting yet, that wasn't quite true. It was mostly true, but what about all those times a user put square brackets around some text to mean something other than a hyperlink, or used underscores to mean something other than emphasis? We'd have to escape all of those markdown special characters.
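To make that escaping cost concrete, here is a minimal sketch of what such a migration helper might look like. The function name escapeMarkdown and the set of escaped characters are assumptions for illustration, not something we shipped; the set is deliberately small and not exhaustive. It relies on the fact that markdown lets you backslash-escape punctuation characters.

```javascript
// Hypothetical helper (an illustration, not code from our product):
// backslash-escape a small, non-exhaustive set of characters that
// markdown treats as syntax, so legacy plain text renders literally.
function escapeMarkdown(text) {
  return text.replace(/[\\`*_\[\]#<>]/g, '\\$&');
}

console.log(escapeMarkdown('see [draft] in my_file_name'));
// → see \[draft\] in my\_file\_name
```

Every existing record would need to pass through something like this once, and every plain-text consumer would then see the added backslashes.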

What about HTML?

There are some benefits to storing rich text as HTML. Parsers are ubiquitous and high-quality. HTML also offers much more flexibility in formatting than markdown's limited feature set.

Then there are the downsides. HTML showing up in places where plain text was previously expected is a worse kind of failure. While a stray # might be acceptable, <h1></h1> tags would be less so. Security is a concern as well. We can sanitize on the way in and on the way out, but that requires vigilance forevermore in every place we deal with this content.

The answer: plain text + rich formatting metadata

In the end, we put together a solution that let us continue storing plain text as plain text, while storing rich formatting metadata completely separately from the text. Rich formatting start and end positions are specified using character offsets from the beginning of the text. We encode to and from HTML, but this technique could work with any other rich text format just as well.

const { encode, decode } = require('html-text-weaver');

const html = '<h1>Hey, you! <b><i>Get out of there!</i></b></h1>';

encode(html);

// {
//   text: 'Hey, you! Get out of there!',
//   meta: [ [ 'h1', 0, 27 ], [ 'b', 10, 27 ], [ 'i', 10, 27 ] ]
// }

decode(encode(html));

// '<h1>Hey, you! <b><i>Get out of there!</i></b></h1>'
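For a feel of how the offsets come about, here is a rough sketch of the encoding direction. This is not the html-text-weaver implementation; encodeSketch is a hypothetical name, and the sketch assumes well-formed, properly nested, attribute-free tags and decodes only a handful of entities. The idea is to walk the HTML token by token, append text chunks to the plain-text output, and record the current text length whenever a tag opens or closes.

```javascript
// A simplified sketch, NOT the library's actual code. Assumes well-formed,
// properly nested tags without attributes.
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"' };

function encodeSketch(html) {
  // Matches either an opening/closing tag or a run of non-tag text.
  const tokenPattern = /<(\/?)([a-z][a-z0-9]*)>|([^<]+)/gi;
  let text = '';
  const meta = [];      // [tag, start, end] ranges, in opening order
  const openStack = []; // ranges whose closing tag we have not seen yet

  for (const [, slash, tag, chunk] of html.matchAll(tokenPattern)) {
    if (chunk !== undefined) {
      // Text chunk: decode a few common entities and append.
      text += chunk.replace(/&(amp|lt|gt|quot);/g, (_, name) => ENTITIES[name]);
    } else if (!slash) {
      // Opening tag: the range starts at the current text length.
      const range = [tag.toLowerCase(), text.length, text.length];
      meta.push(range);
      openStack.push(range);
    } else {
      // Closing tag: assume it matches the most recently opened tag.
      const range = openStack.pop();
      if (range) range[2] = text.length;
    }
  }
  return { text, meta };
}

console.log(encodeSketch('<h1>Hey, you! <b><i>Get out of there!</i></b></h1>'));
```

A production encoder would lean on a real HTML parser instead of a regular expression, but the bookkeeping is the same: offsets are always measured against the plain text, never against the markup.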

This gives us the trade-offs we're looking for. Text continues to be stored as plain text, so we don't have to escape or re-encode any existing data. We get to use the flexibility and stability of HTML, but don't have the security risk, since our own decoder is always the thing that is producing the HTML we serve to the page.
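The security property follows from how a decoder can be built. The sketch below is again not the library's code: renderSketch, escapeHtml, and the ALLOWED_TAGS list are assumptions for illustration, and it assumes ranges are properly nested and listed outermost first. Because the stored text is always escaped and tags come only from an allowlist, nothing in the stored data can inject markup of its own.

```javascript
// A simplified sketch, NOT the library's actual code. Assumes ranges are
// properly nested and listed outermost first.
const ALLOWED_TAGS = new Set(['h1', 'h2', 'b', 'i']); // hypothetical allowlist

function escapeHtml(s) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return s.replace(/[&<>"']/g, (c) => map[c]);
}

function renderSketch(text, meta) {
  const opens = new Map();  // offset -> tags to open, outermost first
  const closes = new Map(); // offset -> tags to close, innermost first
  for (const [tag, start, end] of meta) {
    // Skip unknown tags and empty or out-of-bounds ranges.
    if (!ALLOWED_TAGS.has(tag)) continue;
    if (start < 0 || end > text.length || start >= end) continue;
    if (!opens.has(start)) opens.set(start, []);
    opens.get(start).push(tag);
    if (!closes.has(end)) closes.set(end, []);
    closes.get(end).unshift(tag);
  }

  let html = '';
  for (let i = 0; i <= text.length; i++) {
    // Close ranges ending here before opening ranges that start here.
    for (const tag of closes.get(i) || []) html += `</${tag}>`;
    for (const tag of opens.get(i) || []) html += `<${tag}>`;
    if (i < text.length) html += escapeHtml(text[i]);
  }
  return html;
}

console.log(renderSketch('Hey, you! Get out of there!',
  [['h1', 0, 27], ['b', 10, 27], ['i', 10, 27]]));
// → <h1>Hey, you! <b><i>Get out of there!</i></b></h1>
```

With this shape, a stored string such as '<script>' comes out as '&lt;script&gt;', and a metadata entry naming a tag outside the allowlist is simply ignored.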

We put this into an npm module, with source available at github.com/frameable/html-text-weaver.
