Highlighting Code Using Pygments and Beautiful Soup

November 12th, 2006

Syntax highlighting in blog posts is something that has always bugged me. I don’t like JavaScript-based solutions, so I wrote a quick-and-dirty function that highlights Python code in my blog posts on the server side. The following examples are written for Django, but they should work in any Python software.

The problem

I want to use Markdown and still have automatic syntax highlighting for Python code embedded in my blog posts. Markdown alone tends to break HTML-formatted source code (because of indentation, etc.), so a fully working solution needs a bit of tweaking.

The Solution

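We’ll need:

  • Pygments, for the syntax highlighting
  • Beautiful Soup, for finding the code blocks in the HTML
  • Python Markdown, for the Markdown filtering

With these tools we’re able to build a helper function that looks for source code in a given text, highlights its syntax and applies Markdown filtering to the rest of the text without messing up the highlighted code.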

The Code

My (simplified) BlogEntry model looks like this:

class BlogEntry(models.Model):
    title = models.CharField(maxlength=500)
    body = models.TextField(
        help_text='Use <a href="http://daringfireball.net/projects/markdown/syntax">Markdown-syntax</a>')
    body_html = models.TextField(blank=True, null=True)
    pub_date = models.DateTimeField(default = datetime.datetime.now)
    use_markdown = models.BooleanField(default=True)

    class Admin:
        fields = (
            (None, {
                'fields' : ('title', 'body', 'pub_date', 'use_markdown')
            }),
        )

The redundant body_html field is there for performance: instead of running Markdown and syntax highlighting on the body for every request, we run them only on save. (Yes, it could also be done on the body field itself, but I prefer that the content I’m editing does not change every time I save it.)

Next the highlighting function:

    def _highlight_python_code(self):
        from pygments import highlight
        from pygments.lexers import PythonLexer
        from pygments.formatters import HtmlFormatter
        from unessanet.misc.BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(self.body)
        # Collect every <code class="python"> element in the body.
        python_code = soup.findAll("code", "python")

        if self.use_markdown:
            import markdown

            # Swap each code block for a numbered placeholder so that
            # Markdown never sees the code itself.
            for index, code in enumerate(python_code):
                code.replaceWith('<p class="python_mark">mark %i</p>' % index)

            markdowned = markdown.markdown(str(soup))
            soup = BeautifulSoup(markdowned)
            markdowned_code = soup.findAll("p", "python_mark")

            # Put the highlighted code back in place of the placeholders.
            for index, code in enumerate(markdowned_code):
                code.replaceWith(highlight(python_code[index].renderContents(), PythonLexer(), HtmlFormatter()))
        else:
            # No Markdown: highlight each code block in place.
            for code in python_code:
                code.replaceWith(highlight(code.string, PythonLexer(), HtmlFormatter()))

        return str(soup)

This function looks for <code> blocks that have a class="python" attribute. It first replaces them with placeholder text, then applies Markdown if necessary, and finally replaces the placeholders with syntax-highlighted code. It may not be the most beautiful code, but it works :)
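To make the placeholder trick easier to see in isolation, here is a minimal sketch that uses only the standard library. Note that fake_markdown and fake_highlight are hypothetical stand-ins for markdown.markdown() and pygments.highlight(), and the CODEMARK token is just an illustrative placeholder; none of them are part of the function above.

```python
import html
import re

# Matches <code class="python">...</code>, including multi-line content.
CODE_BLOCK = re.compile(r'<code class="python">(.*?)</code>', re.DOTALL)


def fake_markdown(text):
    """Hypothetical stand-in for markdown.markdown(): wraps paragraphs
    in <p> tags and turns *word* into <em>word</em>."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n".join(
        "<p>%s</p>" % re.sub(r"\*(.+?)\*", r"<em>\1</em>", p)
        for p in paragraphs
    )


def fake_highlight(code):
    """Hypothetical stand-in for pygments.highlight(): escapes the code
    and wraps it in a highlight container."""
    return '<div class="highlight"><pre>%s</pre></div>' % html.escape(code)


def render(body):
    blocks = []

    def stash(match):
        # Remember the raw code and leave a numbered placeholder behind,
        # separated from the surrounding text as its own paragraph.
        blocks.append(match.group(1))
        return "\n\nCODEMARK%d\n\n" % (len(blocks) - 1)

    protected = CODE_BLOCK.sub(stash, body)
    rendered = fake_markdown(protected)

    def restore(match):
        # Replace the placeholder paragraph with the highlighted code.
        return fake_highlight(blocks[int(match.group(1))])

    return re.sub(r"<p>CODEMARK(\d+)</p>", restore, rendered)
```

Because the code is stashed away before the text filter runs, characters such as asterisks inside the code block are never interpreted as markup.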

And finally the save method:

def save(self):
    self.body_html = self._highlight_python_code()
    super(BlogEntry,self).save()

The body_html field is updated on every save. On the template side you can simply use {{ entry.body_html }} without applying any additional filters.

The CSS needed for syntax coloring can be printed out with Pygments, for example like this: css = HtmlFormatter().get_style_defs('.highlight'). It may be wise to save the output and put it in a static CSS file.
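Saving the stylesheet could look roughly like the sketch below. The helper name write_pygments_css and the example file name are my own assumptions; only HtmlFormatter and get_style_defs come from Pygments itself.

```python
from pathlib import Path

from pygments.formatters import HtmlFormatter


def write_pygments_css(path, selector=".highlight", style="default"):
    """Write the CSS rules for a Pygments style to `path` and return them.

    `selector` is prepended to every rule so the styles only apply
    inside the highlighted code container.
    """
    css = HtmlFormatter(style=style).get_style_defs(selector)
    Path(path).write_text(css, encoding="utf-8")
    return css
```

Calling write_pygments_css("pygments.css") once and serving the resulting file statically avoids generating the rules on every request.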

Known Limitations

  • Not a bug but a feature: every code tag that has class="python" will be replaced. This was a bit annoying when trying to document this particular function…
  • Unicode strings break the highlighter. Any help on this is appreciated!
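
I haven't pinned down the cause of the Unicode problem. Assuming it comes from byte strings being mixed with text, one possible workaround (sketched here for Python 3, with a hypothetical helper named to_text) is to normalise the code to text before handing it to the highlighter:

```python
def to_text(source, encoding="utf-8"):
    """Return `source` as text.

    Bytes are decoded with `encoding`; bytes that cannot be decoded are
    replaced with U+FFFD instead of raising an error. Text is returned
    unchanged.
    """
    if isinstance(source, bytes):
        return source.decode(encoding, errors="replace")
    return source
```

The result of to_text(...) can then be passed to the highlighter in place of the raw string. Whether this actually cures the problem described above is untested.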

This code is published under Creative Commons License. Please share any comments! :)


7 comments

1. Jyrki November 12th, 2006

What you could do is to try decoding the code before formatting. I don't exactly know what kind of errors you end up with unicode formatted code, but you could try code.decode(errors='replace') or code.decode(errors='ignore'). We used this approach to avoid problems having unicode strings in sqlite databases (for some reason, postgres -> sqlite -> postgres broke when it had non decoded unicode along).

Decode decodes the string using the codec registered for encoding. It defaults to default encoding (probably utf-8). If you want to force different encoding, use parameter encoding in the function call.

2. Jyrki November 12th, 2006

And by the way, this fancy commenting system of yours hangs when pressing Send Comment on FF 2.0 & Linux :)

3. Ville Säävuori November 18th, 2006

Thanks for the comments, Jyrki!

I'll have to look into the utf-8 problem. Maybe when I get bored to the fact that I can't have scandinavic characters (or any other fancy things) in blog examples =)

..and the ajaxified comment system, well, that's like a "feature". (In other words, a bug I seem not to know how to fix.)

4. simon, eight media November 18th, 2006

Concerning unicode, have you tried the source-code encoding definitions? http://www.python.org/dev/peps/pep-0263/

5. Ville Säävuori December 30th, 2006

Simon, the encoding definitions seemed not to help in this case.

And the comments are fixed now :)

Original encoding problem still lives, however.

6. Gavi January 16th, 2007

Check out http://www.pygments.com

A similar idea but instead of using page parsing to render the code, uses javascript to send the code to the server (cross domain aka JSON) and gets back rendered code.

The only condition is put code in pre tag and give it a unique id which includes the type of the language like python_1, html_mycode etc etc

7. Adam Blinkinsop January 31th, 2007

Found you via Google, while attempting to set up syntax highlighting for my own site (http://www.personal-api.com/).

I found your example extremely useful, but I wanted to be able to highlight more than just python. To this end, I modified it to take any language that Pygments supports:

formatter = HtmlFormatter(cssclass='source')
code_blocks = soup.findAll("code")
for block in code_blocks:
    # The tricky part:
    lexer = get_lexer_by_name(block['class'], stripall=True)
    for code in block:
        code.replaceWith(highlight(code.string, lexer, formatter))

Unessa.net © 2000-2008 Ville Säävuori. All lefts reversed.