My web scraping code currently works, but I'd like to make it more specific: can I tell it what data to scrape by pointing at the specific heading above it?

Published 2024-09-30 01:37:18


Here is my current code, which extracts every <li> from a specific div:

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

class recursiveSpider(BaseSpider):
    name = 'usw'
    allowed_domains = ['http://international.southwales.ac.uk']
    start_urls = ['http://international.southwales.ac.uk/countries']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = []
        xpath = '/html/body/div[1]/div[4]/div[2]/ul/li/a/@href'
        link = ['http://international.southwales.ac.uk' + x for x in hxs.select(xpath).extract()]
        links.extend(link)

        for link in links:
            yield Request(link,
                          headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'},
                          callback=self.parse_linkpage,
                          dont_filter=True)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = QualificationItem()
        result = hxs.select('/html/body/div[1]/div[4]/div[2]/div/ul[*]/li/text()').extract()
        item['Qualification'] = result
        return item

I'd like this code to be more specific. If you look at the page http://eu.southwales.ac.uk/country/cyprus/en/ and its HTML, I want to extract the undergraduate entry requirements, i.e. the ul/li below the heading <h4>Undergraduate Entry Requirements</h4>. I could just hard-code an XPath, but the list changes position on the different country pages, which is why I'm asking whether the data can be selected via the heading above it.


2 Answers

Sure. You just need to modify your XPath to find the first ul that follows the h4 you are interested in, e.g. Undergraduate Entry Requirements. For example:

//ul[preceding-sibling::h4[text() = "Undergraduate Entry Requirements"]][1]

The key here is the preceding-sibling axis.
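As a quick way to check the expression outside Scrapy, you can run it against a minimal HTML snippet with lxml. The snippet below is an invented stand-in for the country page, not the real markup:

```python
from lxml import html

# Minimal stand-in for a country page: two h4 headings, each followed by a ul.
doc = html.fromstring("""
<div>
  <h4>Undergraduate Entry Requirements</h4>
  <ul><li>GCE A Levels</li><li>IB Diploma</li></ul>
  <h4>Postgraduate Entry Requirements</h4>
  <ul><li>Bachelors Degree</li></ul>
</div>
""")

# Select the first ul that has the target h4 among its preceding siblings.
items = doc.xpath(
    '//ul[preceding-sibling::h4[text() = '
    '"Undergraduate Entry Requirements"]][1]/li/text()')
print(items)  # ['GCE A Levels', 'IB Diploma']
```

Note that the trailing `[1]` matters: the second ul also has the target h4 among its preceding siblings (the axis covers all earlier siblings, not just the adjacent one), so without it you would pick up the postgraduate list as well.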

An alternative to @adamretter's answer is to use following-sibling:

//h4[normalize-space(.)="Undergraduate Entry Requirements"]
 /following-sibling::ul[1]

Also, a few things I noticed in your code:

  • allowed_domains should contain domain names, not URLs
  • building full URLs with urlparse.urljoin() is safer (you already import urljoin, so you might as well use it)
  • the temporary links list in the parse method is unnecessary
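On the urljoin point, a short illustration of why it beats string concatenation. It uses Python 3's urllib.parse here; the spider above imports the same function from Python 2's urlparse module:

```python
from urllib.parse import urljoin

base = 'http://international.southwales.ac.uk/countries'

# A root-relative href is resolved against the page URL...
print(urljoin(base, '/country/cyprus/en/'))
# http://international.southwales.ac.uk/country/cyprus/en/

# ...and an absolute href passes through untouched, which naive
# 'prefix + href' concatenation would mangle.
print(urljoin(base, 'http://eu.southwales.ac.uk/country/cyprus/en/'))
# http://eu.southwales.ac.uk/country/cyprus/en/
```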

With that, your spider becomes:

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin


USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'

class recursiveSpider(BaseSpider):
    name = 'usw'
    allowed_domains = ['international.southwales.ac.uk']
    start_urls = ['http://international.southwales.ac.uk/countries']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        xpath = '/html/body/div[1]/div[4]/div[2]/ul/li/a/@href'
        for link in hxs.select(xpath).extract():
            yield Request(urljoin(response.url, link),
                          headers={'User-Agent': USER_AGENT},
                          callback=self.parse_linkpage,
                          dont_filter=True)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = QualificationItem()
        xpath = """
                //h4[normalize-space(.)="Undergraduate Entry Requirements"]
                 /following-sibling::ul[1]/li/text()
                """
        item['Qualification'] = hxs.select(xpath).extract()
        return item
