HTML to reStructuredText in Python using Pandoc

During the conversion of my blog from WordPress to a custom Django-based system, I wanted to move from HTML markup to reStructuredText (partly to make it easier to publish Sphinx documentation to my blog).

While it is dead simple to convert reStructuredText to HTML, going the other way is more difficult. Luckily, Pandoc, the Swiss Army knife for converting between markup formats, does a nice job of converting HTML to reStructuredText.

I wrote a custom Django management command to parse a WordPress XML export file and store the blog entries. The code that converts HTML to reStructuredText is short: it makes a subprocess call to the pandoc command and returns the command's output. Make sure you have Pandoc installed (on Ubuntu, sudo apt-get install pandoc will work).

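import subprocess

def html2rst(html):
    # Pipe the HTML through the pandoc command-line tool and capture its output.
    p = subprocess.Popen(['pandoc', '--from=html', '--to=rst'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # On Python 3, communicate() works with bytes: encode the input, decode the result.
    return p.communicate(html.encode('utf-8'))[0].decode('utf-8')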
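
The function above silently returns whatever Pandoc writes to stdout, even if Pandoc is missing or exits with an error. When importing a whole blog, it can help to fail loudly instead. Here is a minimal sketch of a more defensive variant, assuming Python 3.6 or later. The name html2rst_checked and its pandoc parameter are hypothetical, introduced only for this illustration; the Pandoc flags are the same ones used above.

```python
import shutil
import subprocess


def html2rst_checked(html, pandoc="pandoc"):
    """Convert an HTML string to reStructuredText, raising on failure.

    `pandoc` is the executable name or path (hypothetical parameter,
    added so a non-default install location can be passed in).
    """
    # Look the executable up first so a missing install gives a clear message.
    if shutil.which(pandoc) is None:
        raise RuntimeError(f"{pandoc!r} not found on PATH; is Pandoc installed?")

    result = subprocess.run(
        [pandoc, "--from=html", "--to=rst"],
        input=html.encode("utf-8"),   # Pandoc reads UTF-8 on stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,                   # raise CalledProcessError on non-zero exit
    )
    return result.stdout.decode("utf-8")
```

Calling html2rst_checked('<p>Hello <em>world</em></p>') should return the paragraph as reStructuredText, and a failed conversion raises subprocess.CalledProcessError with Pandoc's error text available on the exception's stderr attribute.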