<p>You can use XPath's <code>string()</code> function, which recursively converts a single node to a string (the optional <code>.</code> argument refers to the current node):</p>
<pre class="lang-py prettyprint-override"><code>from scrapy.selector import HtmlXPathSelector

def node_to_string(node):
    # string(.) yields the text of the node and of all its descendants, concatenated
    return node.xpath("string(.)").extract()[0]

body = """<body>
<div style="clear:both" id="novelintro" itemprop="description">you are foolish!<font color=red size=4>I am superman!</font></div>
<div style="clear:both" id="novelintro2" itemprop="description">hi girl<legend >I love you!</legend></div>
<div style="clear:both" id="novelintro3" itemprop="description">If I<legend > marry your mother<div>then I am your father!</div></legend></div>
</body>"""
hxs = HtmlXPathSelector(text=body)
# single target use
print node_to_string(hxs.xpath('//div[@id="novelintro"]'))
print
# multi target use
for div in hxs.xpath('//body/div'):
    print node_to_string(div)
print
# alternatively
print [node_to_string(n) for n in hxs.xpath('//body/div')]
print
</code></pre>
<p>Output:</p>
^{pr2}$
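<p>Note that some spaces are missing from the output because they are missing from the source HTML. <code>string()</code> handles whitespace the same way a browser does.</p>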
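<p>To see why the text runs together, here is a minimal sketch of what <code>string(.)</code> computes, written with only the standard library instead of Scrapy. It assumes a well-formed XML snippet modelled on the first <code>div</code> above; the helper name <code>string_value</code> is hypothetical, and <code>itertext()</code> is used as a stand-in for the XPath function rather than as the way Scrapy implements it.</p>

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Join every descendant text node in document order.

    Hypothetical helper: it mimics XPath string(.) for an element and,
    like string(.), inserts no separator between text nodes.
    """
    return "".join(element.itertext())


# Well-formed stand-in for the first div above (attribute values quoted for XML).
div = ET.fromstring(
    '<div id="novelintro">you are foolish!'
    '<font color="red" size="4">I am superman!</font></div>'
)

joined = string_value(div)
print(joined)  # you are foolish!I am superman!
```

<p>If you want a space between the pieces, one option is to select the text nodes separately (for example with <code>.//text()</code>) and join them with <code>" "</code> yourself.</p>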