After a couple more hours, erlang_markdown has cleaned up nicely and now supports pretty much all of the Markdown spec
except for the multi-line header syntax. You can look at the current test suite to check its comprehensiveness.
Here I'll discuss and reflect on the script's design, and end with a look at performance tuning the library.
From the beginning the goal of this project was to create a single-pass Markdown interpreter.
A greater man than I would undoubtedly have a justification for using this approach, but it just
happened to be the approach I wanted to take. Having done so, here are some thoughts
on using single-pass versus multi-pass approaches for translating Markdown into HTML:
The more irregular the language, the more painful to write a single-pass interpreter.
In particular, languages with implicit termination pose problems for the single-pass approach.
A single-pass interpreter to parse an irregular language will become more complicated, more brittle,
and less extendable than a multi-pass parser.
Markdown is both irregular and has implicit termination.
The single-pass approach allows stream processing, and can thus facilitate constant memory use.
Most use cases of Markdown don't readily facilitate stream processing (rendering comments for a website,
blog entries, notes, etc), and instead expect to pass in a full string/binary and be returned the full translation
as a string/binary.
With those comments in mind, in terms of long-term sustainability and flexibility,
I would always recommend a multi-pass approach over a single-pass one.
Only in situations where a multi-pass approach has proven to have inadequate
performance characteristics (such as SAX parsing
for massive XML documents where DOM parsing
becomes unfeasible due to memory consumption) should a single-pass approach be considered.
Or if your system already operates on data as streams. Or if you're bored.
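To make the stream-processing point concrete, here is a minimal sketch of what a constant-memory, single-pass driver could look like. It is not taken from erlang_markdown: the module name, `render_lines/4`, and the `RenderLine` callback are all hypothetical, and the sketch assumes each line can be rendered knowing only a small state term carried over from the previous lines.

```erlang
-module(stream_sketch).
-export([render_lines/4]).

%% Hypothetical single-pass driver (not part of erlang_markdown).
%% Reads one line at a time from the io device In, renders it with
%% RenderLine(Line, State) -> {Html, NewState}, and writes the result to
%% Out straight away. Only the current line and State are kept in memory.
render_lines(In, Out, RenderLine, State) ->
    case io:get_line(In, "") of
        eof ->
            {ok, State};
        {error, Reason} ->
            {error, Reason};
        Line ->
            {Html, NewState} = RenderLine(Line, State),
            ok = io:put_chars(Out, Html),
            render_lines(In, Out, RenderLine, NewState)
    end.
```

A multi-pass translator, by contrast, typically needs the whole document (or an intermediate tree) in memory before it can emit anything, which is exactly the trade-off described above.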
The actual implementation of erlang_markdown is fairly straightforward, although filled
with enough special cases to break a reader down into tears. The core functions are
markdown:line_start/5 is called at the beginning of each line, and is
responsible for managing all multi-line constructs like paragraphs, lists,
blockquotes and pre blocks. MultiContext contains a stack of open
multi-line tags (ul, ol, li, p, pre, blockquote, etc).
line_start/5 is undoubtedly the ugliest portion of the code, as it is responsible
for detecting the termination of multi-level indents (comparing the indent depth against
the number of ol and ul tags open in the MultiContext stack) and
other logic like determining if a line without any syntax should be treated as a new
paragraph, a continuing paragraph, or part of an already opened li tag.
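The indent-depth comparison can be sketched roughly as follows. This is an illustration rather than the actual line_start/5 code: it assumes the MultiContext stack is a plain list of atoms with the most recently opened tag at the head, and `close_to_depth/2` is a hypothetical helper name.

```erlang
-module(indent_sketch).
-export([close_to_depth/2]).

%% Count the list tags (ul/ol) currently open on the stack.
open_lists(Stack) ->
    length([T || T <- Stack, T =:= ul orelse T =:= ol]).

%% Pop tags, most recently opened first, until no more than Depth lists
%% remain open. Returns {ClosingHtml, RemainingStack}.
close_to_depth(Depth, Stack) ->
    close_to_depth(Depth, Stack, []).

close_to_depth(Depth, [Tag | Rest] = Stack, Acc) ->
    case open_lists(Stack) > Depth of
        true  -> close_to_depth(Depth, Rest, [close_tag(Tag) | Acc]);
        false -> {lists:flatten(lists:reverse(Acc)), Stack}
    end;
close_to_depth(_Depth, [], Acc) ->
    {lists:flatten(lists:reverse(Acc)), []}.

%% Render the closing tag for an atom such as ul or li.
close_tag(Tag) ->
    "</" ++ atom_to_list(Tag) ++ ">".
```

With two nested lists open, a line indented at depth 1 would close the inner `li` and `ul` and leave the outer pair on the stack.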
markdown:single_line/5 handles the rendering of a line after all the
multi-line syntax has been stripped out. For example, * *this is a test*
would have already been reduced to *this is a test* by the time it reaches single_line/5.
It uses LinkContext to store previously declared links using the [test]: http://test.com/ "test" syntax for links.
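As a rough illustration of how such link declarations might be collected, here is a sketch that recognises a definition line with the `re` module and stores it in a map. The real LinkContext data structure is not shown here, so the map, the module name, and both function names are assumptions made for this example.

```erlang
-module(link_sketch).
-export([parse_link_def/1, remember/2]).

%% Try to read a reference definition such as
%%   [test]: http://test.com/ "test"
%% Returns {ok, {Id, Url, Title}} or nomatch. The title is optional.
parse_link_def(Line) ->
    Pattern = "^\\[([^\\]]+)\\]:\\s*(\\S+)(?:\\s+\"([^\"]*)\")?\\s*$",
    case re:run(Line, Pattern, [{capture, all_but_first, list}]) of
        {match, [Id, Url, Title]} -> {ok, {Id, Url, Title}};
        {match, [Id, Url]}        -> {ok, {Id, Url, ""}};
        nomatch                   -> nomatch
    end.

%% Add the definition on Line (if any) to the link context, here a map
%% from link id to {Url, Title}. Non-definition lines leave it unchanged.
remember(Line, LinkContext) ->
    case parse_link_def(Line) of
        {ok, {Id, Url, Title}} -> maps:put(Id, {Url, Title}, LinkContext);
        nomatch                -> LinkContext
    end.
```

A later `[some text][test]` reference could then be resolved with `maps:find("test", LinkContext)`.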
After studying the xmerl:eventp a bit, I would re-implement the function definitions as either
The current approach makes it extremely awkward to extend erlang_markdown,
but using a property list or record would really improve on this. For the time being
there probably isn't a strong incentive to make this refactor, but if/when I do a third
rewrite this would be the core of it.
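For what the record-based variant might look like, here is a small sketch. The record name, its fields, and the helper functions are hypothetical; the idea is only that bundling the parser state into one term lets new fields be added without touching the arity of every function clause.

```erlang
-module(state_sketch).
-export([new/0, push_tag/2, add_link/3, open_tags/1, links/1]).

%% Hypothetical parser state: the stack of open multi-line tags plus the
%% declared links. New fields can be added here with defaults, and
%% existing function heads keep matching.
-record(md_state, {multi_context = [], link_context = []}).

new() ->
    #md_state{}.

%% Open a multi-line tag by pushing it onto the stack.
push_tag(Tag, #md_state{multi_context = Stack} = S) ->
    S#md_state{multi_context = [Tag | Stack]}.

%% Remember a declared link as an {Id, Url} pair.
add_link(Id, Url, #md_state{link_context = Links} = S) ->
    S#md_state{link_context = [{Id, Url} | Links]}.

open_tags(#md_state{multi_context = Stack}) -> Stack.

links(#md_state{link_context = Links}) -> Links.
```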
One of the acute lessons I've learned with Erlang performance is that
talk is cheap. There is a great deal of generic advice on tuning Erlang
applications for performance--and much of it is generically applicable--but
it is challenging to distinguish the real overhead from the imagined overhead
for your specific task before you actually write the script and generate some
(Of course, this is always true, regardless of language, but I--perhaps due to my relative inexperience--have
found a great deal of confusion among those proffering Erlang performance advice.)
I used this function to generate the test input of variable length.
gen_string(N) ->
    %% Four Markdown fragments covering lists, blockquotes, pre blocks,
    %% links, inline code and nested lists.
    Str1 = "* **this is a test** *and so is this*\n* another line\n\n1. a line\n2. a line2\n3. another line\n\n",
    Str2 = ">> blockquote\n>> blockquote2\n\n pre block1\n same pre block\n\n",
    Str3 = "[test](http://test.com \"this\")\ this out\n ``code block``\n\n",
    Str4 = "1. This is a test\n 2. so is this\n 3. yep...\n4. yep\n\n* hi\n * there\n * ayep..\n * the end\n\n",
    Str = lists:append([Str1, Str2, Str3, Str4]),
    %% Repeat the combined fragment once per element of lists:seq(0, N).
    lists:append(lists:map(fun(_n) -> Str end, lists:seq(0, N))).
For profiling, I used the markdown_tests:test_performance/1 function:
It starts by generating the test string, and then
runs the concatenated string through the markdown:markdown/1 function. It runs each
test Run times, which by default is 10, and then outputs the averaged results. It would certainly be more
interesting to calculate the standard deviation, identify and exempt outliers and so on, but I decided not
to get too carried away with it.
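The averaging step described above can be sketched with `timer:tc/1`. This is not the markdown_tests:test_performance/1 function itself, just a minimal stand-in under the assumption that a plain mean over a fixed number of runs is all that is wanted; the module and function names are made up for the example.

```erlang
-module(perf_sketch).
-export([average_micros/2]).

%% Run the zero-argument Fun Runs times and return the mean wall-clock
%% time per run in microseconds, as measured by timer:tc/1.
average_micros(Fun, Runs) when is_function(Fun, 0), is_integer(Runs), Runs > 0 ->
    Times = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, Runs)],
    lists:sum(Times) / Runs.
```

With a pre-generated `Input` string, a call such as `perf_sketch:average_micros(fun() -> markdown:markdown(Input) end, 10)` would mirror the ten-run default mentioned above.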
Now, let's take a look at the performance numbers we are starting with:
However, this change would require rewriting quite a bit of code, so I decided to forgo testing
it using the erlang_markdown code. It should improve memory usage a bit, but
I suspect it would be a fairly minor improvement overall.
Anyway, at ~4 million characters per second, erlang_markdown is already fast enough for any purposes I can imagine using it for.