XPath for getting information from the next sibling tag with Scrapy

Published 2024-09-25 08:37:26


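I'm trying to get to grips with Scrapy, and right now I'm trying to extract information from an etymology website: http://www.etymonline.com. For now I only want the words and their original descriptions. This is how a typical block of HTML looks on etymonline: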

<dt>
  <a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a>
  <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com">
    <img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com"/>
  </a>
</dt>
<dd>
  1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).
</dd>

The word is contained in the <dt> tag and its description in the next sibling <dd> tag. To get the list of words on a page like http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, you can write word = sel.xpath('//dl/dt/a/text()').extract()

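I then tried to loop over this list of words and extract the related information with this line of code: info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd"). But it doesn't seem to work. Any ideas?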


3 answers

A solution using the following-sibling axis.

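import scrapy


class SingleSpider(scrapy.Spider):
    name = "etym"
    allowed_domains = ["etymonline.com"]
    start_urls = [
        "http://www.etymonline.com/index.php?l=d&allowed_in_frame=0"]

    def parse(self, response):
        for nodes in response.xpath('//dl'):
            # Visit each <dt> (headword) inside the definition list
            for i in nodes.xpath('dt'):
                # Text of the link(s) inside the <dt>: the word itself
                print(i.xpath('a/text()').extract())
                # Direct text nodes of the first <dd> following this <dt>
                print(i.xpath('following-sibling::dd[1]/text()').extract())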

Basically:

  • You get the dt elements one by one
  • Print the text contained in the link
  • Move to the next sibling and print the text it contains

Here is an excerpt of the output:

[u'daiquiri (n.)'] [u'type of alcoholic drink, 1920 (first recorded in F. Scott Fitzgerald), from ', u', name of a district or village in eastern Cuba.']

[u'dairy (n.)'] [u'late 13c., "building for making butter and cheese; dairy farm," formed with Anglo-French ', u' affixed to Middle English ', u' (in ', u' "dairymaid"), from Old English ', u' "kneader of bread, housekeeper, female servant" (see ', u' (n.1)). The purely native word was ', u'.']

[u'dais (n.)'] [u'mid-13c., from Anglo-French ', u', Old French ', u' "table, platform," from Latin ', u' "disk-shaped object," also, by medieval times, "table," from Greek ', u' "quoit, disk, dish" (see ', u' (n.)). Died out in English c.1600, preserved in Scotland, revived 19c. by antiquarians.']
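
Each definition above comes back as several fragments because text() only returns the text nodes that are direct children of the dd, so the words inside nested a and span elements are left out. Below is a minimal sketch of the difference, using lxml on a made-up fragment shaped like the question's markup (the fragment and the variable names are illustrative assumptions, not taken from the site). The same XPath expressions should behave the same way in a Scrapy selector.

```python
from lxml import html

# Hypothetical fragment modelled on the <dd> markup shown in the question.
dd = html.fromstring(
    '<dd>1530s, from <a href="#">address</a> (v.) and from French '
    '<span class="foreign">adresse</span>.</dd>'
)

# text() selects only the text nodes that are direct children of <dd>,
# so the contents of the nested <a> and <span> are skipped.
pieces = dd.xpath('text()')

# string(.) concatenates every descendant text node in document order.
full = dd.xpath('string(.)')

print(pieces)
print(full)
```

If the full sentence is wanted, string(...) is usually the simpler choice; joining the text() fragments would still lose the cross-referenced words.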

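To reach the <dd> that comes after a <dt>, using the following-sibling axis is correct.

following-sibling::dd selects all dd elements that come after the context node. So you need the positional predicate [1] to restrict the XPath to the first one only.

For each dt element you get from //dl/dt, you select following-sibling::dd[1]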
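
To make the effect of the [1] predicate concrete, here is a small sketch using lxml on a hypothetical two-entry list (the markup and variable names are assumptions made for illustration). It counts what each expression selects when evaluated from the first dt.

```python
from lxml import html

# Hypothetical two-entry snippet modelled on the etymonline markup in the question.
SNIPPET = """
<dl>
  <dt><a href="#">address (n.)</a></dt>
  <dd>1530s, from <a href="#">address</a> (v.).</dd>
  <dt><a href="#">addressee (n.)</a></dt>
  <dd>1810; see address (v.) + -ee.</dd>
</dl>
"""

root = html.fromstring(SNIPPET)
first_dt = root.xpath('//dl/dt')[0]

# Without a predicate, every later <dd> sibling is selected.
all_following = first_dt.xpath('following-sibling::dd')

# With [1], only the nearest following <dd> is selected.
only_next = first_dt.xpath('following-sibling::dd[1]')

print(len(all_following))
print(len(only_next))
print(only_next[0].text_content())
```

This is likely why the expression in the question returned more than expected: without [1] it matches the description of every later entry as well.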

Here is a sample scrapy shell session for the term "address":

$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s]   item       {}
[s]   request    <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   response   <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   settings   <scrapy.settings.Settings object at 0x7f1397399bd0>
[s]   spider     <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: for dt in response.xpath('//dl/dt'):
    print "Word:", dt.xpath('string(a)').extract()
    print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
    print
   ...:     
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']

Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']

Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']

Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']

...

Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']

Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']


In [2]: 

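The idea with XPath is not to loop over the extracted list, but to loop over the parent nodes and query relative to each of them.

I don't have Scrapy on my Mac at the moment, but the same technique should apply here, for example: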

# I use lxml for loose html string parsing
from lxml import html

s = '''<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt>
<dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>'''

sel = html.fromstring(s)

# rather than extracting the words straight away, you loop from the parent xpath
for nodes in sel.xpath('//dt'):
    # then access a node to get the text
    print(nodes.xpath('a/text()'))
    # and move to the first following dd sibling ('../dd' would match every dd under the parent)
    print(nodes.xpath('following-sibling::dd[1]/text()'))

# sample results
['address (n.)']
['1530s, "dutiful or courteous approach," from ', ' (v.) and from French ', '. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
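
If the goal is still to look up a description by word, as in the question, one option is to collect the pairs into a dict while looping over the dt nodes. The following is only a sketch under assumptions: the markup is a made-up two-entry list, and build_glossary is a hypothetical helper name, not part of lxml or Scrapy.

```python
from lxml import html

# Hypothetical markup with two entries, shaped like the question's HTML.
SNIPPET = """
<dl>
  <dt><a href="#">address (n.)</a> <a class="dictionary" href="#"><img alt="Look up"/></a></dt>
  <dd>1530s, from <a href="#">address</a> (v.).</dd>
  <dt><a href="#">addressee (n.)</a></dt>
  <dd>1810; see address (v.) + -ee.</dd>
</dl>
"""


def build_glossary(markup):
    """Map each headword to the text of the <dd> directly following its <dt>."""
    root = html.fromstring(markup)
    glossary = {}
    for dt in root.xpath('//dl/dt'):
        # string(a[1]) takes the full text of the first link in the <dt>.
        word = dt.xpath('string(a[1])').strip()
        # string(...) over the first following <dd> keeps nested link text.
        definition = dt.xpath('string(following-sibling::dd[1])').strip()
        glossary[word] = definition
    return glossary


glossary = build_glossary(SNIPPET)
print(glossary['addressee (n.)'])
```

Building the mapping in a single pass also avoids pasting each word into an XPath string, which breaks as soon as a word contains a quote character.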

Hope this helps.
