Working on TWC Formatting
Last year I started writing TWC, a TUI wrapper for Taskwarrior, a great command-line todo manager. I wrote it because I couldn’t find any similar interactive application and because I really like Taskwarrior but quite dislike its tabular presentation of tasks.
TWC groups tasks in agendas. An agenda is basically a set of task filters displayed together on a single page/tab. I call each of these filters a block. Each block has a separate definition of how it should render tasks on the screen. This definition is called a formatting string.
HTML Era
Prior to the 0.9 release, users defined formatting strings as a mixture of HTML-like markup and the Python format mini-language. TWC allowed two kinds of HTML tags:
- definitions of styles (like background and foreground colors);
- pre-defined tags with special semantics of adding or changing the text wrapped by them:
  - <sr> to surround non-empty text;
  - <ind> to change non-empty text to some other text (indicator).
Below is an example of such a string, prettified for the needs of this article. In real life it would be wrapped in quotes and assigned to a variable, with line endings escaped.
<comment>
<sr right=" ">
<sr left="[" right="]">
<ind value="A">{annotations}</ind>
<ind value="D">{due}</ind>
<ind value="S">{scheduled}</ind>
</sr>
</sr>
<sr right=" ">{id}</sr>
</comment>
{description}
<sr left=" "><info>{tags}</info></sr>
It should display something like [AD] 5 Paint the room @home:errands, but of course it didn’t because the HTML parser was buggy. We’ll get back to this later.
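The intended semantics of the two special tags can be sketched in a few lines of plain Python. This is only an illustration, not TWC's parser: the functions sr and ind are hypothetical stand-ins for the tags, the task values are made up, and the tags field is assumed to arrive already rendered as a single string.

```python
def sr(text, left="", right=""):
    """Surround text with left/right, but only when text is non-empty."""
    return f"{left}{text}{right}" if text else ""


def ind(text, value):
    """Replace non-empty text with a fixed indicator; keep empty text empty."""
    return value if text else ""


# Made-up task: it has an annotation and a due date, but is not scheduled.
task = {
    "annotations": "buy brushes first",
    "due": "2019-05-01",
    "scheduled": "",
    "id": "5",
    "description": "Paint the room",
    "tags": "@home:errands",
}

flags = (
    ind(task["annotations"], "A")
    + ind(task["due"], "D")
    + ind(task["scheduled"], "S")
)
line = (
    sr(sr(flags, left="[", right="]"), right=" ")
    + sr(task["id"], right=" ")
    + task["description"]
    + sr(task["tags"], left=" ")
)
print(line)  # [AD] 5 Paint the room @home:errands
```

Styles (comment, info) are left out here on purpose; only the text transformations are shown.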
Replacing HTML with DSL
TWC 0.9 introduced a huge change: instead of HTML users can pass a formatted list of items.
A single item now has the form name:style:fmt: where both style and fmt are optional. Multiple items are separated by a comma or a plus sign.
The choice of separator is important because they have different meanings:
- , produces a space between two items;
- + glues items together.
Users can pass an item’s style name after a colon and Python formatting after another one. As in sed, the style and formatting strings must be terminated with a colon. This is the biggest quirk, one I couldn’t get rid of.
And here’s an example of an items definition that actually displays what we intended with that overly verbose HTML-like thing before:
[flags:comment:%a%d%s:],id:comment:,description,tags:info:
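How the two separators could combine already-rendered items is sketched below. It is a guess at the behaviour, not TWC's code: join_items is a hypothetical helper, and the rule that an empty item disappears together with its separator is an assumption based on the "no spaces between empty items" convention mentioned later in this article.

```python
def join_items(items):
    """Join (text, sep) pairs; sep is the separator written AFTER the item.

    "," puts one space before the next rendered item, "+" glues it on.
    Assumption: an empty item is skipped together with its own separator.
    """
    out = ""
    pending = ""  # what to emit before the next non-empty item
    for text, sep in items:
        if not text:
            continue
        out += pending + text
        pending = " " if sep == "," else ""
    return out


full = join_items([
    ("[AD]", ","),
    ("5", ","),
    ("Paint the room", ","),
    ("@home:errands", ""),
])
no_flags = join_items([
    ("", ","),  # no flags set: no stray leading space
    ("5", ","),
    ("Paint the room", ","),
    ("", ""),  # no tags: no trailing space
])
print(full)
print(no_flags)
```

The point of the convention is visible in the second call: no <sr>-style wrapping is needed to avoid stray spaces around empty items.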
Why not HTML
HTML formatting was the single thing I disliked the most in TWC. So I finally gave it some thought and after a few days concluded that HTML might not be the best choice for TWC formatting strings.
User-wise it’s hard to read and reason about because of its verbose markup.
Configuration files looked horrible because of it. The <sr> tag was the worst. Because it had surround-if-not-empty semantics, it was actually the only way to conditionally separate optionally empty items, which in turn made each formatting string a never-ending river of <sr>'s. Because many blocks usually need slightly different formatting, this horrible pattern was repeated over and over again. Add completely non-standard HTML tags, unfamiliar to users who now have to constantly check the documentation (I, the author, had to check it from time to time), and the disaster is ready.
Code-wise the situation was not much better. It required an HTML parser. This alone should say everything, but let’s elaborate on that thought, shall we?
The parser was implemented on top of the html.parser module. It wasn’t particularly long or complex, and its test suite was more than satisfactory, to the point that I was quite confident modifying it. It covered many utterly stupid corner cases, but that’s fine as long as they’re tested. And yet, after 2 days of work I somehow wasn’t able to fix one silly bug with the nesting of particular tags.
See, the HTML parser had to do a little more than just parse HTML. It had to parse it in a particular way, compatible with formatted text, the data format understood by Python Prompt Toolkit, the library which TWC uses for drawing the interface. Basically it’s a list of tuples, but it must be kept in a rather simple, flat form, because otherwise Prompt Toolkit fails miserably to render it.
So what could I do with a DOM tree, which is nested by definition? I defined each tag as a Python class that, when appropriate, received parts of formatted text, which could be modified, replaced and recreated. To keep things simple, each such entity had only limited knowledge of a small part of the formatting string and nothing else. So it had to deduce what to do from the presence (or absence) of arbitrary strings and from the number of tuples it received, all of that just to flatten some HTML. No surprise that this is the part where bugs occurred.
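To make the flattening problem concrete, here is a minimal sketch. It does not reproduce TWC's tag classes: the tree shape, the flatten function and the class:comment / class:info style names are all assumptions made for illustration, and joining nested styles with a space is just one plausible way to combine them. The result is a flat list of (style, text) tuples, the shape described above.

```python
def flatten(node, inherited=""):
    """Turn a nested (style, children) tree into a flat list of (style, text).

    Each child is either a plain string or another (style, children) node.
    """
    style, children = node
    # Assumption: a nested node adds its style to the styles of its parents.
    combined = " ".join(s for s in (inherited, style) if s)
    flat = []
    for child in children:
        if isinstance(child, str):
            flat.append((combined, child))
        else:
            flat.extend(flatten(child, combined))
    return flat


# Hypothetical tree for the example line used earlier in this article.
tree = ("", [
    ("class:comment", ["[AD] ", "5 "]),
    "Paint the room",
    ("class:info", [" @home:errands"]),
])
fragments = flatten(tree)
for fragment in fragments:
    print(fragment)
```

A single recursive pass like this sees the whole tree at once. The tag classes described above each saw only their own fragment, which is what made the conditional cases so hard to get right.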
Alternative
So the alternative to HTML that I came up with is a custom domain-specific language. It was designed from scratch with ease of use in mind. It’s, I think, opinionated (a popular trend I’ve observed recently, which stands in opposition to generic but complex programs):
- it deliberately drops some of the features of HTML, like tag nesting or drop-in style definitions, and replaces them with simpler, more limited counterparts;
- it also creates some arbitrary conventions like removing spaces between empty items.
Even so, it remains quite flexible without losing readability. At least it’s more readable than its HTML counterpart; I think we can all agree on that. A little writing goes a long way, which actually invites users to experiment with it. Users shouldn’t need the documentation to jump in.
Code-wise it’s implemented on top of a simple regular expression and even simpler tokenization. You know: split on commas, partition on colons… Its simplicity works not only for end-users, but also for implementation. There are not many things to break. That’s another win.
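A minimal sketch of that split-and-partition idea follows. It is not TWC's actual implementation: Item and parse_items are hypothetical names, the sketch ignores literal characters placed around an item (such as the square brackets in the earlier example), and it assumes that separators never appear inside style or fmt.

```python
import re
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    style: str  # "" when omitted
    fmt: str    # "" when omitted
    sep: str    # separator following the item: ",", "+", or "" for the last


def parse_items(spec):
    """Tokenize 'name:style:fmt:' items separated by ',' or '+'."""
    # A capturing group makes re.split keep the separators in the result,
    # so items sit at even indexes and separators at odd ones.
    chunks = re.split(r"([,+])", spec)
    chunks.append("")  # the last item has no separator after it
    items = []
    for chunk, sep in zip(chunks[0::2], chunks[1::2]):
        name, _, rest = chunk.partition(":")
        style, _, rest = rest.partition(":")
        # Whatever is left is fmt plus its terminating colon.
        fmt = rest[:-1] if rest.endswith(":") else rest
        items.append(Item(name, style, fmt, sep))
    return items


items = parse_items("flags:comment:%a%d%s:+id:comment:,description,tags:info:")
for item in items:
    print(item)
```

Because only the first two colons are used for partitioning, a format such as %Y-%m-%d %H:%M survives intact as long as its terminating colon is present.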
Is it perfect? No. There’s still room for improvement. For example, I’d like to implement a kind of “intelligent” mistake fixing, especially for a missing terminating colon, a common mistake I make all the time. I tried to fix it before the release, but I failed, as the lack of that colon introduces many interesting corner cases. Like, what about datetime string formatting, which typically uses a colon to separate hours from minutes (due::%Y-%m-%d %H:%M:)? What about spaces? What about surrounding characters?
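The datetime case is easy to reproduce. The sketch below is not TWC code: split_fmt is a hypothetical helper that takes the text following name:style: and assumes that the last colon in it is the terminator.

```python
def split_fmt(rest):
    """Split the text after 'name:style:' into (fmt, leftover).

    Hypothetical rule: the LAST colon is taken as the fmt terminator.
    """
    fmt, colon, leftover = rest.rpartition(":")
    if not colon:
        return rest, ""  # no colon at all: treat everything as fmt
    return fmt, leftover


terminated = split_fmt("%Y-%m-%d %H:%M:")
unterminated = split_fmt("%Y-%m-%d %H:%M")
print(terminated)    # ('%Y-%m-%d %H:%M', '')
print(unterminated)  # ('%Y-%m-%d %H', '%M')
```

With the terminator missing, the colon between hours and minutes is mistaken for it and the format is cut short, so the colons alone cannot tell a forgotten terminator from a legitimate time format.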
I hope to figure it out one day. End of story.