Cleaning up erlang_markdown

October 10, 2009. Filed under erlang 20 markdown 6

After a couple more hours, erlang_markdown has cleaned up nicely, and is supporting pretty much all of the Markdown spec except for the multi-line header syntax. You can look at the current testsuite to check its comprehensiveness.

Here I'll discuss and reflect on the script's design, and end with a look at performance tuning the library.

Single-Pass Parsing

From the beginning the goal of this project was to create a single-pass Markdown interpreter. A greater man than I would undoubtedly have a justification for using this approach, but just happened to be the approach I wanted to take. Having done so, here are some thoughts on using single-pass versus multi-pass approaches for translating Markdown into html:

  1. The more irregular the language, the more painful to write a single-pass interpreter. In particular, languages with implicit termination pose problems for the single-pass approach.
  2. A single-pass interpreter to parse an irregular language will become more complicated, more brittle, and less extendable than a multi-pass parser.
  3. Markdown is both irregular and has implicit termination.
  4. The single-pass approach allows stream processing, and can thus facilitates constant memory use.
  5. Most usecases of Markdown don't readily facilitate stream processing (rendering comments for a website, blog entries, notes, etc), and instead expect to pass in a full string/binary and be returned the full translation as a string/binary.

With those comments in mind, in terms of long-term sustainability and flexibility, I would always recommend a multi-pass approach over a single-pass one. Only in situations where a multi-pass approach has proven to have inadequate performance characteristics (such as SAX parsing for massive XML documents where DOM parsing becomes unfeasible due to memory consumption) should a single-pass approach be attempted.

Or if your system already operates on data as streams. Or if you're bored.

The actual implementation of erlang_markdown is fairly straightforward, although filled with enough special cases to break a reader down into tears. The core functions are

line_start(<<Binary/binary>>, OpenTags, Acc, LinkContext, MultiContext).
single_line(<<Binary/binary>>, OpenTags, Acc, LinkContext, MultiContext).

markdown:line_start/5 is called at the beginning of each line, and is responsible for managing all multi-line constructs like paragraphs, lists, blockquotes and pre blocks. MultiContext contains a stack of open multi-line tags (ul, ol, li, p, pre, blockquote, etc).

line_start/5 is undoubtedly the ugliest portion of the code, as it is responsible for detecting the termination of multi-level indents (comparing the indent depth against the number of ol and ul tags open in the MultiContext stack) and other logic like determining is a line without any syntax should be treated as a new paragraphs, a continuing paragraph, or part of an already opened li tag.

markdown:single_line/5 handles the rendering of a line after all the multi-line syntax has been stripped out. For example, * *this is a test* would have already been reduced to *this is a test* by the time it reaches single_line/5.

It uses LinkContext to store previously declared links using the [test]: "test" syntax for links. After studying the xmerl:eventp a bit, I would re-implement the function definitions as either

line_start(<<Binary/binary>>, PropList).
single_line(<<Binary/binary>>, PropList).


line_start(<<Binary/binary>>, #markdown_parser{}).
single_line(<<Binary/binary>>, #markdown_parser{}).

The current approach makes it extremely awkward to extend erlang_markdown, but using a property list or record would really improve on this. For the time being there probably isn't a strong incentive to make this refactor, but if/when I do a third rewrite this would be the core of it.

Performance Tuning

One of the acute lesson's I've learned with Erlang performance is that talk is cheap. There is a great deal of generic advice on tuning Erlang applications for performance--and much of it is generically applicable--but it is challenging to distinguish the real overhead from the imagined overhead for your specific task before you actually write the script and generate some numbers.

(Of course, this is always true, regardless of language, but I--perhaps due to my relative inexperience--have found a great deal of confusion among those proffering Erlang performance advice.)

I used this function to generate the test input of variable length.

gen_string(N) ->
    Str1 = "* **this is a test** *and so is this*\n* another line\n\n1. a line\n2. a line2\n3. another line\n\n",
    Str2 = ">> blockquote\n>> blockquote2\n\n    pre block1\n    same pre block\n\n",
    Str3 = "[test]( \"this\")\ this out\n ``code block``\n\n",
    Str4 = "1. This is a test\n    2. so is this\n    3. yep...n4. yep\n\n* hi\n    * there\n    * ayep..\n * the end\n\n",
    Str = lists:append([Str1, Str2, Str3, Str4]),
    lists:append(lists:map(fun(_n) -> Str end, lists:seq(0,N))).

For profiling, I used the markdown_tests:test_performance/1 function:

test_performance(N) ->
    test_performance(N, 10).

test_performance(N, Runs) ->
    LongStr = gen_string(N),
    Times = lists:map(fun(_X) ->
                              timer:tc(markdown, markdown, [LongStr])
                      end, lists:seq(1,Runs)),
    Time = lists:foldr(fun({Micros, _}, Acc) ->
                               Micros + Acc
                       end, 0, Times),
    Seconds = (Time / N) / 1000000.0,
    Length = length(LongStr),
    io:format("~p chars, ~p seconds, ~p us/char~n", [Length, Seconds, (Time/N)/Length]).

It starts by generating the test string, and then runs the concatenated string through the markdown:markdown/1 function. It runs each test Run times, which by default is 10, and then outputs the averaged results. It would certainly be more interesting to calculate the standard deviation, identify and exempt outliers and so on, but I decided not to get too carried away with it.

Now, let's take a look at the performance numbers we are starting with:

Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.2  (abort with ^G)
1>  l(markdown), l(markdown_tests).
2> lists:foreach(fun(X) -> markdown_tests:test_performance(X) end,  lists:seq(100,500,100)).
31714 chars, 0.008108770000000001 seconds, 0.2556842403985622 us/char
63114 chars, 0.01380413 seconds, 0.21871740025984726 us/char
94514 chars, 0.007689983333333333 seconds, 0.08136343116716395 us/char
125914 chars, 0.010926575000000001 seconds, 0.08677807868862875 us/char
157314 chars, 0.042361656000000004 seconds, 0.26928090316182923 us/char

The performance is looks pretty good here, which I found fairly surprising, as I was concerned about using the idiom use collecting html fragments as they are generated:

Fragments = lists:foldr(fun(X,Acc) -> [magic(X) | Acc] end, [], Input)
InOrder = lists:reverse(Fragments),
AsStrings = lists:map(fun(X) -> binary_to_list(X) end, InOrder),
Merged = lists:append(AsStrings),

The collect-by-prepending-then-reverse idiom is common in Lisp code, but the awkward merging of binary fragments into a single binary was most definitely not standard.

As such, my first attempt at optimization was to remove the conversion to lists, and instead attempt this algorithm:

Fragments = lists:foldr(fun(X,Acc) -> [magic(X) | Acc] end, [], Input)
InOrder = lists:reverse(Fragments),
lists:foldr(fun(X, Acc) -> <<Acc/binary, X/binary>> end, <<"">>, InOrder).

The results were

Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.2  (abort with ^G)
1> l(markdown), l(markdown_tests).
2> lists:foreach(fun(X) -> markdown_tests:test_performance(X) end,  lists:seq(100,500,100)).
31714 chars, 0.0064280200000000004 seconds, 0.20268714132559754 us/char
63114 chars, 0.00546126 seconds, 0.08653008841144595 us/char
94514 chars, 0.007205226666666667 seconds, 0.07623449083380945 us/char
125914 chars, 0.009586265 seconds, 0.0761334323427101 us/char
157314 chars, 0.048146004 seconds, 0.30605034516953356 us/char

After a wee bit of munging we can compare those numbers and see how they stack up

>>> def clean_line(x):
...     nums = [ y.split(" ")[0] for y in x ]
...     return [ float(y) for y in nums ]
>>> def manage(x):
...     a = x.split("\n")
...     a = [ b.split(", ") for b in a ]
...     return [ clean_line(b) for b in a ]
>>> one ="first data set..."
"first data set..."
>>> two = "second data set..."
"second data set..."
>>> for a,b in zip(one,two):
...     print "%s / %s -> %s" % (a[2], b[2], a[2]/b[2])
0.255684240399 / 0.202687141326 -> 1.26147242852
0.21871740026 / 0.0865300884114 -> 2.52764563489
0.0813634311672 / 0.0762344908338 -> 1.06727847563
0.0867780786886 / 0.0761334323427 -> 1.13981566335
0.269280903162 / 0.30605034517 -> 0.879858191346

In general the conversion-less approach (the 2nd) was a bit faster, although not universally so. This suggests that the binary to string conversion is not excessively expensive.

Next I was curious to see how badly the test was penalizing the code by passing in the input as a string instead of as binary, as it will always convert list input into binary:

markdown(Text) when is_list(Text) ->
markdown(Binary) when is_binary(Binary) ->
    line_start(Binary, [], [], [], []).

After changing the test to use binary, I reran the numbers (using the second version with binary concatenation as opposed to converting to lists for concatenation):

Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.2  (abort with ^G)
1> l(markdown), l(markdown_tests).
2> lists:foreach(fun(X) -> markdown_tests:test_performance(X) end,  lists:seq(100,500,100)).
31714 chars, 0.00505559 seconds, 0.15941193163902379 us/char
63114 chars, 0.00465163 seconds, 0.07370203124504865 us/char
94514 chars, 0.005229263333333334 seconds, 0.055327923200090286 us/char
125914 chars, 0.006679399999999999 seconds, 0.05304731801070572 us/char
157314 chars, 0.034418122 seconds, 0.21878613473689565 us/char

And the comparison with the second run looks like:

>>> for a,b in zip(two,three):
...     print "%s / %s -> %s" % (a[2], b[2], a[2]/b[2])
0.202687141326 / 0.159411931639 -> 1.27146782077
0.0865300884114 / 0.073702031245 -> 1.17405296638
0.0762344908338 / 0.0553279232001 -> 1.37786648087
0.0761334323427 / 0.0530473180107 -> 1.43519852083
0.30605034517 / 0.218786134737 -> 1.39885621883

It would seem--although I would strongly recommend running much more strenuous tests than this one--that the cost of converting the strings into binary isn't huge, but nor can't be dismissed entirely.

My next thought was to examine if there was a substantial improvement from aggregating the binary in-step instead of as a final action across all the data, thus the above formula shrinks to:

lists:foldr(fun(X,Acc) -> X2 = magic(X), <<Acc/binary, X2/binary>> end, [], Input)

However, this change involves quite a bit of code-change, so I decided to forgo testing it using the erlang_markdown code. It should improve memory usage a bit, but I suspect overall would be a fairly minor improvement.

Anyway, at ~4 million characters per second, erlang_markdown is already fast enough for any purposes I can imagine using it for.