How to extract raw HTML from a Scrapy selector?



1 answer

Anonymous, 1 day ago


<p><strong>Short answer:</strong></p>
<ul>
<li>The <code>.re()</code> and <code>.re_first()</code> methods of Scrapy/Parsel selectors replace HTML entities (except <code>&lt;</code> and <code>&amp;</code>).</li>
<li>Instead, use <code>.extract()</code> or <code>.extract_first()</code> to get the raw HTML (or the raw JavaScript instructions), then apply Python's <code>re</code> module to the extracted strings.</li>
</ul>
<p><strong>Long answer:</strong></p>
<p>Let's look at a sample input and the various ways of extracting JavaScript data from HTML.</p>
<p>Sample HTML:</p>
<pre><code><html lang="en">
<body>
<div>
 <script type="text/javascript">
 var i = {a:['O&#39;Connor Park']}
 </script>
</div>
</body>
</html>
</code></pre>
<p>With a Scrapy selector (which uses the <a href="https://github.com/scrapy/parsel">parsel</a> library underneath), there are several ways to extract the JavaScript snippet:</p>
<pre><code>>>> import scrapy
>>> t = """<html lang="en">
... <body>
... <div>
...  <script type="text/javascript">
...  var i = {a:['O&#39;Connor Park']}
...  </script>
...
... </div>
... </body>
... </html>
... """
>>> selector = scrapy.Selector(text=t, type="html")
>>>
>>> # extracting the <script> element as raw HTML
>>> selector.xpath('//div/script').extract_first()
u'<script type="text/javascript">\n var i = {a:[\'O&#39;Connor Park\']}\n </script>'
>>>
>>> # only getting the text node inside the <script> element
>>> selector.xpath('//div/script/text()').extract_first()
u"\n var i = {a:['O&#39;Connor Park']}\n "
>>>
</code></pre>
<p>Now, using <code>.re</code> (or <code>.re_first</code>) gives a different result:</p>
<pre><code>>>> # I'm using a very simple "catch-all" regex
>>> # you are probably using a regex to extract
>>> # that specific "O'Connor Park" string
>>> selector.xpath('//div/script/text()').re_first('.+')
u" var i = {a:['O'Connor Park']}"
>>>
>>> # .re() on the element itself, one needs to handle newlines
>>> selector.xpath('//div/script').re_first('.+')
u'<script type="text/javascript">'  # only first line extracted
>>> import re
>>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
u'<script type="text/javascript">\n var i = {a:[\'O\'Connor Park\']}\n </script>'
>>>
</code></pre>
<p>The HTML entity <code>&#39;</code> has been replaced by an <a href="https://en.wikipedia.org/wiki/Apostrophe#Unicode">apostrophe</a>. This comes from an entity-replacement call in the <code>.re/re_first</code> implementation, using a helper from <a href="https://w3lib.readthedocs.org/en/latest/w3lib.html#w3lib.html.remove_entities">w3lib.html</a> (see the <code>parsel</code> source code, in the regex-extraction function in <a href="https://github.com/scrapy/parsel/blob/master/parsel/utils.py#L59">parsel/utils.py</a>). That call is not made when you simply call <code>extract()</code> or <code>extract_first()</code>.</p>
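<p>The recommended approach (extract the raw string first, then run Python's <code>re</code> on it) can be sketched with the standard library alone. This is a minimal illustration, not part of the original answer: <code>raw_script</code> stands in for the text node that <code>.extract_first()</code> returned for the sample HTML, and the regex <code>pattern</code> is a hypothetical one written only for this sample. Decoding with <code>html.unescape</code> is an optional extra step, shown for the case where you do want the apostrophe after all.</p>

```python
import html
import re

# Stand-in for the raw text node returned by .extract_first() on the
# sample HTML (assumption: copied from the session output, not fetched live).
raw_script = "\n var i = {a:['O&#39;Connor Park']}\n "

# Hypothetical pattern for this sample only: capture whatever sits
# between the quotes of the single array item.
pattern = re.compile(r"\['(.*?)'\]")

match = pattern.search(raw_script)
raw_value = match.group(1)                # entity is still intact here
decoded_value = html.unescape(raw_value)  # decode only when you choose to

print(raw_value)
print(decoded_value)
```

<p>Because the regex runs on the untouched string, the entity survives until you decide to decode it, which also keeps the quote characters in the JavaScript unambiguous while matching.</p>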
