HTML to reStructuredText in Python using Pandoc
During the conversion of my blog from WordPress to a custom Django-based system, I wanted to move from HTML markup to reStructuredText (partly to make it easier to publish Sphinx documentation to my blog).
While it is dead simple to convert reStructuredText to HTML, going the other way is more difficult. Luckily, Pandoc, the Swiss Army knife for converting between markup formats, does a nice job of converting HTML to reStructuredText.
I wrote a custom Django management command to parse a WordPress XML export file and store the blog entries. The code that converts HTML to reStructuredText is very simple: it makes a subprocess call to the pandoc command and retrieves the command's output. Make sure you have Pandoc installed (on Ubuntu, sudo apt-get install pandoc will work).
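import subprocess

def html2rst(html):
    """Convert an HTML string to reStructuredText by piping it through pandoc."""
    p = subprocess.Popen(['pandoc', '--from=html', '--to=rst'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # On Python 3, communicate() expects bytes: encode the input, decode the output.
    return p.communicate(html.encode('utf-8'))[0].decode('utf-8')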
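
In a one-off migration script it can help to fail loudly when Pandoc is missing or exits with an error, rather than quietly storing an empty string for a post. The following is a minimal sketch of such a variant, assuming Python 3.5 or later (for subprocess.run). The function name html2rst_checked and its pandoc parameter are hypothetical names chosen for illustration, not part of the original command.

```python
import shutil
import subprocess


def html2rst_checked(html, pandoc="pandoc"):
    """Convert an HTML string to reStructuredText, raising on failure.

    `pandoc` is the executable name or path (an illustrative parameter).
    """
    # Fail early with a readable message if the executable cannot be found.
    if shutil.which(pandoc) is None:
        raise RuntimeError("pandoc executable not found: %r" % pandoc)

    result = subprocess.run(
        [pandoc, "--from=html", "--to=rst"],
        input=html.encode("utf-8"),  # the pipe carries bytes, so encode first
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

With Pandoc installed, calling html2rst_checked('<p>Hello <em>world</em></p>') should return the paragraph with the emphasis written in reStructuredText form. If the conversion fails, the CalledProcessError carries Pandoc's error output in its stderr attribute, which is handy for finding the blog entry that caused the problem.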