<p>Let's look at the different extraction patterns in the Scrapy shell, building a selector from the sample HTML:</p>
<pre><code>>>> import scrapy
>>> t = '''<td width="25%" valign="top" align="center">
... <h2 class="video"><img src="content/pl_makingfood_mjadrah.jpg" alt="Thumbnail image from video" width="160" height="120" /><br /><br />
... <i>Mjadra</i></h2> <p class="video">Video <br />
...
... <a href="content/pl_makingfood_mjadrah.rm" class="main">real</a>&nbsp;&nbsp;
... <a href="content/pl_makingfood_mjadrah.mp4" class="main" target="_blank">mp4</a><br /><br />
...
... Palestinian Arabic &amp; English <br />
... <a href="content/pl_makingfood_mjadrah.doc" target="_blank" class="main"> doc </a>&nbsp; &nbsp;
... <a href="content/pl_makingfood_mjadrah.pdf" target="_blank" class="main"> pdf </a></p>
... </td>'''
>>> selector = scrapy.Selector(text=t, type="html")
</code></pre>
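<p>First, let's loop over the <code><h2 class="video"></code> elements (using a CSS selector) and extract the string representation of each heading with XPath's <code>string()</code>, which gives <code>'\nMjadra'</code>.</p>
<p>We lost the <code><i></code> information.</p>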
<p>Let's try getting only the text nodes (using the <code>text()</code> node test):</p>
<pre><code>>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('text()').extract())
...
['\n']
</code></pre>
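<p>Even worse: we don't get the text node inside the <code><i></code> element. (In fact, <code>text()</code> selects only direct child text nodes, not the children of child nodes.)</p>
<p>Let's try <code>.//</code>, which is shorthand for <code>./descendant-or-self::node()/</code>:</p>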
<pre><code>>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('.//text()').extract())
...
['\n', 'Mjadra']
</code></pre>
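<p>That's not much better than using XPath's <code>string()</code>: the <code><i></code> markup is still gone.</p>
<p>Now, let's use the <code>node()</code> node test, which captures both elements and text nodes:</p>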
<pre><code>>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('node()').extract())
...
['<img src="content/pl_makingfood_mjadrah.jpg" alt="Thumbnail image from video" width="160" height="120">', '<br>', '<br>', '\n', '<i>Mjadra</i>']
</code></pre>
<p>That's better, but we also get the <code><img></code> tags, which you probably don't want. So let's select only the text nodes and the <code><i></code> elements:</p>
<pre><code>>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('./node()[self::text() or self::i]').extract())
...
['\n', '<i>Mjadra</i>']
>>>
</code></pre>
<p>You probably want a single string from each heading, so Python's <code>join()</code> is one option:</p>
<pre><code>>>> for h2 in selector.css('h2.video'):
...     print("".join(h2.xpath('./node()[self::text() or self::i]').extract()))
...
<i>Mjadra</i>
>>>
</code></pre>
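<p>The joined string still begins with the newline from the text node that sits between the <code><br></code> tags and the <code><i></code> element. One possible cleanup, sketched here on the fragment list from the previous query (the names <code>fragments</code> and <code>title</code> are illustrative), is to strip the surrounding whitespace after joining:</p>

```python
# Fragments as returned by the node()[self::text() or self::i] query.
fragments = ['\n', '<i>Mjadra</i>']

# Join into one string, then trim the leading newline contributed by
# the whitespace-only text node.
title = "".join(fragments).strip()
print(title)
```

<p>Stripping only touches the ends of the string, so whitespace between inline elements in the middle of a heading is left alone.</p>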