Peter Hinchley

Handling Line Breaks With Markdown and Regular Expressions in PHP

Tagged: php, regexp

I use Markdown to convert the comments entered on this blog into HTML. Markdown is surprisingly feature-rich, but its primary purpose, at least when it comes to formatting comments, is to wrap text in paragraph tags. This works as expected when blocks of text are separated by two newline characters, but when two blocks are separated by a single newline character, Markdown merges them into one contiguous block.

For example, if a user enters a comment as follows:

Hello.

You have a nice blog.

Markdown will convert the plain text comment into the following HTML markup:

<p>Hello.</p>
<p>You have a nice blog.</p>

This is just what we want. But if a user enters a comment as follows, which typically happens when the comment is a code snippet:

My code looks like this:

WScript.Echo "Hello"
WScript.Echo "Goodbye"

Markdown will convert the plain text into the following HTML markup:

<p>My code looks like this:</p>
<p>WScript.Echo "Hello"
WScript.Echo "Goodbye"</p>

Which renders on the page like this:

My code looks like this:

WScript.Echo "Hello" WScript.Echo "Goodbye"

Now this is clearly not what was intended.

To resolve the issue I use the PHP function preg_replace to insert a line break (<br />) after each newline character that is not immediately followed by another newline character, and then pass the output to Markdown for final processing. The end result is this:

<p>My code looks like this:</p>
<p>WScript.Echo "Hello"<br />
WScript.Echo "Goodbye"</p>

Which renders on the page like this:

My code looks like this:

WScript.Echo "Hello"
WScript.Echo "Goodbye"

Just what we wanted.

The preg_replace function performs a regular expression search and replace and looks like this:

$comment = preg_replace("/(?<=\S)[ \t]*\n(?![ \t]*\n)/", "<br />\n", $comment);

The regular expression begins with a lookbehind (?<=\S), which requires a non-whitespace character immediately before the match. This is necessary to ensure we don't add a line break to an empty line, and it also works for lines that end in punctuation, such as the closing quote in the example. It then matches zero or more trailing spaces or tabs [ \t]*, followed by the newline character \n, but only if that newline is not followed by another newline, optionally preceded by spaces or tabs (?![ \t]*\n). If a match is made, the newline character and any whitespace that directly precedes it are replaced with <br /> followed by a newline.
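As a rough illustration, the pre-processing step could be wrapped in a small function and tried on the sample comment. This is only a sketch: the function name add_line_breaks is hypothetical, and the line-ending normalization is an assumption (browsers commonly submit form text with \r\n line endings, which the pattern would otherwise not match). The pattern anchors on a preceding non-whitespace character with (?<=\S), so lines ending in punctuation, such as a closing quote, are matched as well.

```php
<?php
// Hypothetical helper: the name and the \r\n normalization are assumptions,
// not part of the original code.
function add_line_breaks(string $comment): string
{
    // Normalize Windows and old Mac line endings to \n before matching.
    $comment = str_replace(["\r\n", "\r"], "\n", $comment);

    // Append <br /> to any non-empty line that is directly followed by
    // another non-empty line. Blank lines (paragraph breaks) are untouched.
    $result = preg_replace("/(?<=\S)[ \t]*\n(?![ \t]*\n)/", "<br />\n", $comment);

    // preg_replace returns null on error; fall back to the normalized input.
    return $result ?? $comment;
}

$comment = "My code looks like this:\r\n\r\nWScript.Echo \"Hello\"\r\nWScript.Echo \"Goodbye\"";

// The result would then be handed to Markdown for final processing.
echo add_line_breaks($comment);
```

Note that a single newline at the very end of a comment also receives a <br />, so trimming the input first may be worthwhile.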

If you are not familiar with regular expressions, this may seem a little weird, but once you read up on the details of regular expression patterns, it will hopefully be self-explanatory.
