<p><a href="http://www.crummy.com/software/BeautifulSoup/" rel="noreferrer">BeautifulSoup</a> gets you almost all the way there:</p>
<pre><code>>>> import BeautifulSoup
>>> f = open('a.html')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> f.close()
>>> g = open('a.xml', 'w')
>>> print >> g, soup.prettify()
>>> g.close()
</code></pre>
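<p>This closes all tags properly. The only remaining issue is that the <code>doctype</code> is still <code>HTML</code>. To change it to the doctype of your choice, you only need to replace the first line, which is not hard. For example, instead of printing the prettified text directly:</p>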
<pre><code>>>> lines = soup.prettify().splitlines()
>>> lines[0] = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
...             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
>>> print >> g, '\n'.join(lines)
</code></pre>
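<p>The first-line swap can also be wrapped in a small helper. This is only a sketch using plain string handling: the names <code>set_doctype</code> and <code>XHTML_TRANSITIONAL</code> are made up for illustration, and it assumes the doctype, if present, sits on a single first line, as in the prettified output above. If the first line is not a doctype, the helper prepends one instead of overwriting real content.</p>

```python
# Hypothetical helper, not part of BeautifulSoup: it only manipulates text.
XHTML_TRANSITIONAL = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def set_doctype(markup, doctype=XHTML_TRANSITIONAL):
    """Return markup with its first-line doctype replaced by `doctype`.

    Assumes any existing doctype occupies exactly the first line.
    If the first line is not a doctype, `doctype` is prepended instead.
    """
    lines = markup.splitlines()
    if lines and lines[0].lstrip().lower().startswith('<!doctype'):
        lines[0] = doctype          # overwrite the existing declaration
    else:
        lines.insert(0, doctype)    # no declaration found: add one on top
    return '\n'.join(lines)


if __name__ == '__main__':
    # Stand-in for the text returned by soup.prettify()
    sample = '<!DOCTYPE html>\n<html>\n <body>\n </body>\n</html>'
    print(set_doctype(sample))
```

<p>With this in place, you would write <code>set_doctype(soup.prettify())</code> to the output file rather than the raw prettified text.</p>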