A Failed Stack-based Markdown Interpreter
One of my weekend projects was to write a stack-based Markdown interpreter in Erlang. Let me start by acknowledging that the project failed in both terms of completeness and performance.
The in-line aspects of Markdown are very straight-forward to parse with a stack:
*test*
**test**
`test`
[test][http://test.com/ "test")
[test]: http://test.com/ "test"
[test][test]
Even fairly complex combinations like *test **this** out `please *if you won't mind* it`*
is easy with a stack-based interpreter, although it
might be fairly complex in a solution which relied on regular expressions.
However, Markdown's multi-line constructs become very awkward to handle with a naive implementation, or at least with a naive implementation and a lack of planning.
Consider the syntax for unordered lists:
- this is a list
- yep, still a list
- still a list
Before creating each <li>
element, you need to check if there is already a <ul>
element on the stack. If not, add it, otherwise, proceed. Closing is a bit more complex,
and requires checking the next line when you close the <li>
element making
a judgement.
Then there are blockquotes, which behave the same way as the <ul>
quotes, except
that there is no <li>
to place on the stack to use for keeping track of the current
position.
Then there are code blocks, which behave like blockquotes.
Then there are paragraphs, which appear by default for text without an enclosing tag.
Combine these things with multiple indention levels and inconsistent treatment of newlines (a newline in a blockquote becomes a space, whereas in a pre block remains a newline, etc), and the lack of planning in this first implementation was pretty damning.
Below are a few more details on the progress this first implementation made, but I'm back to the drawing board for the multi-line aspect. I'm still undecided if I'll stick with the stack-based concept for multi-line parsing, but I'm thinking I make perform a pre-compile stage which builds a DOM of all multi-line constructs and then uses a stack-parser exclusively for individual lines.
Completeness
The test battery for erlang_markdown
is currently:
testcases() ->
[{"*test*", <<"<p><em>test</em></p>">>},
{"**test**", <<"<p><strong>test</strong></p>">>},
{"``test``", <<"<p><code>test</code></p>">>},
{"`test`", <<"<p><code>test</code></p>">>},
{"[test](http://test.com/)", <<"<p><a href=\"http://test.com/\">test</a></p>">>},
{"[test](http://test.com/ \"test2\")", <<"<p><a href=\"http://test.com/\" title=\"test2\">test</a></p>">>},
{"[test]: http://test.com/\n[test][test]", <<"<p><a href=\"http://test.com/\">test</a></p>">>},
{"[test]: http://test.com/ \"test2\"\n[test][test]", <<"<p><a href=\"http://test.com/\" title=\"test2\">test</a></p>">>},
{"out", <<"<p>out</p>">>},
{"out\n", <<"<p>out</p>">>},
{"out\nwoot\n", <<"<p>out woot</p>">>},
{"out\nwoot", <<"<p>out woot</p>">>},
{">> this is\n>> a test\n", <<"<blockquote>this is a test</blockquote>">>},
{" this is\n a test\n", <<"<pre>this is\na test</pre>">>},
{" this is a\n test\n yep\n>> and this\n>> is too", <<"<pre>this is a\ntestyep</pre><blockquote>and this is too</blockquote>">>},
{"this is a test\nand another\nand *another*\n", <<"<p>this is a test and another and <em>another</em></p>">>},
{" test\n this\n out\n\n **yep**", <<"<pre>test\nthisout</pre><p> <strong>yep</strong></p>">>},
{"this is *a test*\n\n import test\n import test2\nstill a **test**\n", <<"<p>this is <em>a test</em></p><pre>import test\nimport test2</pre><p>still a <strong>test</strong></p>">>},
{"test\nthis\n\n>> out", <<"<p>test this</p><blockquote>out</blockquote>">>},
{"test\nthis\n\n>> out\n>> again\n", <<"<p>test this</p><blockquote>out again</blockquote>">>},
{"test this\n**out**\n\n>> and this\n>> as well\n\n woo\n hoo\n", <<"<p>test this <strong>out</strong></p><blockquote>and this as well</blockquote><pre>woo\nhoo</pre>">>},
{" this is a test\n of pre\n\n* test this\n* out\n\n woohoo", <<"<pre>this is a test\nof pre</pre><ul><li>test this</li><li>out</li></ul><p>woohoo</p>">>},
{" *test this ``out``*", <<"<p> <em>test this <code>out</code></em></p>">>},
{" *test this **out***", <<"<p> <em>test this <strong>out</strong></em></p>">>},
{"* this is a test\n* so is this\n* and this\n", <<"<ul><li>this is a test</li><li>so is this</li><li>and this</li></ul>">>},
{"* *test this **out***\n", <<"<ul><li><em>test this <strong>out</strong></em></li></ul>">>},
{"1. test\n2. test", <<"<ol><li>test</li><li>test</ol></li>">>},
{"1. this is a test\n2. so is this\n\n* and so on\n* and further\n\na paragraph\n", <<"<ol><li>this is a test</li><li>so is this</li></ol><ul><li>and so on</li><li>and further</li></ul><p>a paragraph</p>\
">>},
% failing tests
{"* *test this **out***", <<"<ul><li><em>test this <strong>out</strong></em></li></ul>">>},
{"this is really just a test\nis that okay?\n\nand what about\nthis\n", <<"<p>this is really just a test is that okay?</p><p>and what about this</p>">>}
].
And the output of running the test battery:
62> timer:tc(markdown, test, []).
Running tests:
EE............................
Failed: <<"<ul><li><em>test this <strong>out</strong></em></ul></li>">> != <<"<ul><li><em>test this <strong>out</strong></em></li></ul>">> (input: "* *test this **out***")
Failed: <<"<p>this is really just a test is that okay? and what about this</p>">> != <<"<p>this is really just a test is that okay?</p><p>and what about this</p>">> (input: "this is really just a test\nis that okay?\n\nand what about\nthis\n")
28 Tests Passed / 2 Tests Failed / 30 Total Tests
{1885,ok}
The first issue can be resolved by adding a newline to the end of the input, but it really should work as is. The second issue reveals a fundamental lack of planning, which is has caused this partial implementation to become extremely awkward to modify, and also precludes it ever being fully compliant.
You'll notice that the tests don't cover multi-level constructs. The design takes them into account, and I'm fairly certain it would be straight-forward to get them working, but honestly the overall design for multi-line parsing is sufficiently flawed to make this not worth bothering with.
Performance
test_performance(N) ->
Str1 = "* **this is a test** *and so is this*\n* another line\n\n1. a line\n2. a line2\n3. another line\n\n",
Str2 = ">> blockquote\n>> blockquote2\n\n pre block1\n same pre block\n\n",
Str3 = "[test](http://test.com \"this\")\ this out\n ``code block``\n\n",
Str = lists:append([Str1, Str2, Str3]),
LongStr = lists:append(lists:map(fun(_n) -> Str end, lists:seq(0,100))),
{Time, _} = timer:tc(markdown, markdown, [LongStr]),
Seconds = Time / 1000000.0,
io:format("~p chars took ~p seconds~n", [length(LongStr), Seconds]).
The performance certainly isn't pretty for large documents either. There are some obvious O(n^2) operations (merging the accumulated binaries into a string and then reconverting into a binary can't be an inexpensive operation), and probably some less obvious performance implications resulting from the ad-hoc implementation.
8> markdown:test_performance(1).
428 chars took 5.04e-4 seconds
ok
9> markdown:test_performance(10).
2354 chars took 0.002643 seconds
ok
10> markdown:test_performance(100).
21614 chars took 0.040283 seconds
ok
11> markdown:test_performance(1000).
214214 chars took 58.638143 seconds
ok
Yep, pretty much abysmal. Out of curiosity I removed the final step which calls list_to_binary
on the entire structure and reran the test:
18> markdown:test_performance(1000).
214214 chars took 106.470855 seconds
ok
Which somehow managed to take nearly twice as long. I'll just go ahead and consider that a fluke, but it certainly didn't magically improve things.
After Some Additional Work...
I reworked the multi-line logic into something more systematic (similar concept to the previous logic, but more systematically implemented) and have it passing all the tests now. The only missing functionality is indented lists. Which is admittedly fairly significant.
Performance has improved by a fair clip as well:
71> markdown:test_performance(1).
428 chars took 3.04e-4 seconds
ok
72> markdown:test_performance(1).
428 chars took 3.04e-4 seconds
ok
73> markdown:test_performance(10).
2354 chars took 0.001699 seconds
ok
74> markdown:test_performance(100).
21614 chars took 0.033964 seconds
ok
75> markdown:test_performance(200).
43014 chars took 0.121854 seconds
ok
76> markdown:test_performance(500).
107214 chars took 4.124487 seconds
ok
77> markdown:test_performance(1000).
214214 chars took 25.825798 seconds
ok
This time I was a bit more systematic testing without the final list_to_binary
call,
and the performance seems totally inconsistent. Although it is always quicker than with it,
sometimes it is extremely quick (10x faster) than other attempts...
Since everything is in memory it doesn't seem like caching should have a major impact. Perhaps I am missing something.
78> l(markdown).
{module,markdown}
79> markdown:test_performance(1).
428 chars took 3.84e-4 seconds
ok
80> markdown:test_performance(10).
2354 chars took 0.001953 seconds
ok
81> markdown:test_performance(100).
21614 chars took 0.019525 seconds
ok
82> markdown:test_performance(200).
43014 chars took 0.449793 seconds
ok
83> markdown:test_performance(200).
43014 chars took 0.054803 seconds
ok
84> markdown:test_performance(200).
43014 chars took 0.052132 seconds
ok
85> markdown:test_performance(200).
43014 chars took 0.160802 seconds
ok
86> markdown:test_performance(500).
107214 chars took 0.566199 seconds
ok
87> markdown:test_performance(500).
107214 chars took 0.417525 seconds
ok
88> markdown:test_performance(500).
107214 chars took 0.245372 seconds
ok
89> markdown:test_performance(1000).
214214 chars took 7.128518 seconds
ok
90> markdown:test_performance(1000).
214214 chars took 11.816736 seconds
ok
91> markdown:test_performance(1000).
214214 chars took 2.119201 seconds
ok
92> markdown:test_performance(1000).
214214 chars took 17.814619 seconds
ok
Hmm. More work to come. Will push changes to GitHub when it returns to life.