Convert Characters in Code Blocks to HTML Entities Using PHP
I often write blog posts that include code listings. Sometimes the listings contain HTML tags, or other programming constructs, that use special characters which the browser should display literally rather than interpret. To achieve this, the special characters must be written as encoded HTML entities. For example, instead of writing <, it is necessary to write the encoded form of the character: &lt;.
When the browser displays the blog post, the encoded entities contained within the code listings are rendered in their decoded form. For example, &lt; is displayed as <. This is just what was intended, and the process works perfectly until the blog post is loaded into a textarea for editing. The textarea shows the decoded characters, so when the modified post is resubmitted, it is those decoded characters, not the encoded entities, that are committed to the database. This is not the intended outcome: the special characters that should be treated as "content" will now be processed as HTML markup, and the blog post will not render correctly.
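The round trip described above can be imitated in a few lines. This is only a rough sketch: the $stored string is a made-up example, and html_entity_decode stands in for what the browser does when it shows the post inside a textarea and sends it back unchanged.

```php
<?php
// What sits in the database: the code listing is entity-encoded.
$stored = '<p>Use a break:</p><pre>&lt;br /&gt;</pre>';

// Stand-in for the browser: the textarea shows decoded characters,
// and those decoded characters are what gets resubmitted.
$resubmitted = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $resubmitted, PHP_EOL;
// <p>Use a break:</p><pre><br /></pre>
```

After the resubmission the <br /> inside the listing is live markup again, which is exactly the problem.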
The solution to this issue is to automatically encode special characters as HTML entities before storing the blog post in the database. The trick, however, is to only encode the special characters that exist within <pre> or <code> blocks (if we were to encode the entire blog post, all legitimate tags would cease to be interpreted as HTML markup).
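To see why encoding the entire post is not an option, consider this small sketch. The $post string is a hypothetical example, not taken from a real post.

```php
<?php
// A post with one legitimate paragraph tag and one code listing.
$post = '<p>Hello</p><pre><br /></pre>';

// Encoding everything also encodes the <p> and <pre> tags themselves,
// so the browser would print them as text instead of treating them as markup.
$encodedPost = htmlentities($post);

echo $encodedPost, PHP_EOL;
// &lt;p&gt;Hello&lt;/p&gt;&lt;pre&gt;&lt;br /&gt;&lt;/pre&gt;
```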
I use the <code> tag on this blog to encapsulate inline code fragments, and the <pre> tag to encapsulate blocks of code. If I want to use syntax highlighting, I include a class attribute within the <pre> tag as follows:
<pre class="brush: php"> code...
The class attribute shown above renders the code with syntax highlighting specific to PHP.
So how do we encode special characters that exist within <pre> and <code> tags, and cater for the condition that the tags may include class or id attributes? I do this by using two PHP functions: preg_replace_callback and htmlentities. The first function performs a regular expression search and replace using a callback, and the second converts special characters to their equivalent HTML entity. The resultant code is shown below:
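function encode_html($tag, $content) {
    return preg_replace_callback(
        // $matches[1] = tag name, [2] = attributes, [3] = text between the tags
        '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/s',
        // An anonymous function is used here because create_function()
        // was deprecated in PHP 7.2 and removed in PHP 8.0.
        function ($matches) {
            return "<".$matches[1].$matches[2].">".htmlentities($matches[3])."</".$matches[1].">";
        },
        $content
    );
}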
To call the function, pass both the name of the tag (without angle brackets) that defines the encoding scope, and the content to be processed. For example, to encode all special characters between <pre> tags, within a text stream assigned to $content, call the function as follows:
$content = encode_html("pre", $content);
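Here is a small worked example of that call. The function is restated with a closure as the callback so that the snippet runs on its own under current PHP versions, and the sample $content string is invented for illustration.

```php
<?php
// Restated so this snippet is self-contained; same pattern as above,
// with a closure as the callback.
function encode_html($tag, $content) {
    return preg_replace_callback(
        '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/s',
        function ($matches) {
            return '<'.$matches[1].$matches[2].'>'.htmlentities($matches[3]).'</'.$matches[1].'>';
        },
        $content
    );
}

// Hypothetical post: one paragraph plus a highlighted code block.
$content = '<p>Intro</p><pre class="brush: php"><b>bold</b> & more</pre>';
$encoded = encode_html('pre', $content);

echo $encoded, PHP_EOL;
// <p>Intro</p><pre class="brush: php">&lt;b&gt;bold&lt;/b&gt; &amp; more</pre>
```

The <p> tag and the <pre> tag with its class attribute are left alone. Only the text between the <pre> tags is encoded.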
The code works by applying a regular expression to find the content between the requested tags. It first matches the opening angle bracket and the tag name: \<('.$tag.') (the brackets around the tag name cause it to be saved as $matches[1]). The tag name may be followed by anything other than a closing angle bracket, such as a class or id attribute: ([^\>]*) (saved as $matches[2], which is empty when the tag has no attributes). This must be followed by a closing angle bracket and then at least one character: \>(.+?) (saved as $matches[3]). The + symbol means one or more, while the ? makes the match non-greedy (in other words, it matches everything up to the first closing tag, not a closing tag that may appear later in the document). The final segment, \<\/'.$tag.'\>, matches the closing tag, and the trailing s modifier lets the . match newline characters, so the match can span multiple lines.
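The capture groups can be inspected directly with preg_match. This is an illustrative sketch only: the sample string is made up, and the second pattern simply drops the s modifier to show what it contributes.

```php
<?php
$tag = 'pre';
$pattern = '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/s';

// Two blocks: the first has a class attribute and spans two lines.
$content = "<pre class=\"brush: php\">line 1\nline 2</pre><pre>later</pre>";

preg_match($pattern, $content, $matches);
// $matches[1] is 'pre'
// $matches[2] is ' class="brush: php"' (including the leading space)
// $matches[3] is "line 1\nline 2" (stops at the first closing tag)

// Same pattern without the s modifier: . no longer matches a newline,
// so the first block cannot match and the search moves on to the second.
$patternNoS = '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/';
preg_match($patternNoS, $content, $matchesNoS);
// $matchesNoS[3] is 'later'

var_dump($matches, $matchesNoS);
```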
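The callback passed to preg_replace_callback is an anonymous function that reconstructs the matched content, replacing the text between the matched tags with its encoded equivalent, as processed by the htmlentities function. On PHP 8.0 and later the callback has to be written as a closure, because create_function was deprecated in PHP 7.2 and removed in PHP 8.0.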
By passing all edited blog posts through this function, I can rest easy in the knowledge that any special characters used within code listings will be properly encoded, and not only that, but stay encoded even after repeated editing.
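The "stays encoded after repeated editing" claim can be checked with a simulated edit cycle. This is a sketch under assumptions: html_entity_decode stands in for the browser showing the post in a textarea, the post is resubmitted unchanged, and the function is restated with a closure so the snippet is self-contained.

```php
<?php
function encode_html($tag, $content) {
    return preg_replace_callback(
        '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/s',
        function ($matches) {
            return '<'.$matches[1].$matches[2].'>'.htmlentities($matches[3]).'</'.$matches[1].'>';
        },
        $content
    );
}

// Hypothetical post as first typed by the author.
$post = '<p>Tip:</p><code><br /></code>';

// First save: the listing is encoded before it is stored.
$stored = encode_html('code', $post);

// Edit cycle: the textarea shows decoded text, which is resubmitted
// and passed through the function again before being stored.
$resubmitted = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
$storedAgain = encode_html('code', $resubmitted);

var_dump($stored === $storedAgain); // bool(true)
```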
One other tip: if you need to include <pre> or <code> tags within your code listings, always wrap the code in the tag that is processed first. For example, in this post, because all <pre> and <code> tags are written inline, and hence wrapped in the <code> tag, I call encode_html with an argument of code before calling the function with an argument of pre. This ensures that all <pre> tags, and all opening <code> tags inside other <code> tags, are properly encoded.
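The effect of the call order can be seen with a short experiment. Treat it as a sketch: the $post string is a hypothetical example, and the function is restated with a closure so the snippet runs by itself.

```php
<?php
function encode_html($tag, $content) {
    return preg_replace_callback(
        '/\<('.$tag.')([^\>]*)\>(.+?)\<\/'.$tag.'\>/s',
        function ($matches) {
            return '<'.$matches[1].$matches[2].'>'.htmlentities($matches[3]).'</'.$matches[1].'>';
        },
        $content
    );
}

// An inline mention of the <pre> tag, followed by a real code block.
$post = '<p>Use <code><pre></code> for blocks:</p><pre>echo 1 < 2;</pre>';

// code first, then pre: the inline <pre> is encoded before the pre pass
// runs, so the pre pass only sees the real block.
$codeFirst = encode_html('pre', encode_html('code', $post));
echo $codeFirst, PHP_EOL;
// <p>Use <code>&lt;pre&gt;</code> for blocks:</p><pre>echo 1 &lt; 2;</pre>

// pre first, then code: the pre pass starts at the inline <pre> and
// swallows everything up to the real closing </pre>, including </code>.
$preFirst = encode_html('code', encode_html('pre', $post));
echo $preFirst, PHP_EOL;
```

With pre processed first, the closing </code> tag ends up encoded as text, so the markup of the paragraph is damaged.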