My web scraping code currently works, but I'd like to make it more specific: can I tell it what data to scrape by pointing at the specific heading above it?

Published 2024-09-30 01:37:18


Here is my current code, which extracts every <li> from a specific div:

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

class recursiveSpider(BaseSpider):
    name = 'usw'
    allowed_domains = ['http://international.southwales.ac.uk']
    start_urls = ['http://international.southwales.ac.uk/countries']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = []
        xpath = '/html/body/div[1]/div[4]/div[2]/ul/li/a/@href'
        link = ['http://international.southwales.ac.uk' + x for x in hxs.select(xpath).extract()]
        links.extend(link)

        for link in links:
            yield Request(link,
                          headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'},
                          callback=self.parse_linkpage,
                          dont_filter=True)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = QualificationItem()
        result = hxs.select('/html/body/div[1]/div[4]/div[2]/div/ul[*]/li/text()').extract()
        item['Qualification'] = result
        return item

I'd like this code to be more specific. If you look at the page http://eu.southwales.ac.uk/country/cyprus/en/ and its HTML, I want to extract the undergraduate entry requirements, i.e. the ul/li below the heading <h4>Undergraduate Entry Requirements</h4>. I could just hard-code an XPath, but the list changes position on the different country pages, which is why I'm asking whether the data can be selected via the heading above it.


2 Answers

Sure. You just need to modify your XPath to find the first ul that follows the h4 you are interested in, e.g. Undergraduate Entry Requirements. For example:

//ul[preceding-sibling::h4[text() = "Undergraduate Entry Requirements"]][1]

The key here is the preceding-sibling axis.
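As a quick way to check the expression outside Scrapy, you can run it against a minimal HTML snippet with lxml. The snippet below is an invented stand-in for the country page, not the real markup:

```python
from lxml import html

# Minimal stand-in for a country page: two h4 headings, each followed by a ul.
doc = html.fromstring("""
<div>
  <h4>Undergraduate Entry Requirements</h4>
  <ul><li>GCE A Levels</li><li>IB Diploma</li></ul>
  <h4>Postgraduate Entry Requirements</h4>
  <ul><li>Bachelors Degree</li></ul>
</div>
""")

# Select the first ul that has the target h4 among its preceding siblings.
items = doc.xpath(
    '//ul[preceding-sibling::h4[text() = '
    '"Undergraduate Entry Requirements"]][1]/li/text()')
print(items)  # ['GCE A Levels', 'IB Diploma']
```

Note that the trailing `[1]` matters: the second ul also has the target h4 among its preceding siblings (the axis covers all earlier siblings, not just the adjacent one), so without it you would pick up the postgraduate list as well.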

An alternative to @adamretter's answer is to use following-sibling:

//h4[normalize-space(.)="Undergraduate Entry Requirements"]
 /following-sibling::ul[1]

Also, a few things I noticed in your code:

  • allowed_domains should contain domain names, not URLs
  • building full URLs with urlparse.urljoin() is safer (you already import urljoin, so you might as well use it)
  • the temporary links list in the parse method is unnecessary
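On the urljoin point, a short illustration of why it beats string concatenation. It uses Python 3's urllib.parse here; the spider above imports the same function from Python 2's urlparse module:

```python
from urllib.parse import urljoin

base = 'http://international.southwales.ac.uk/countries'

# A root-relative href is resolved against the page URL...
print(urljoin(base, '/country/cyprus/en/'))
# http://international.southwales.ac.uk/country/cyprus/en/

# ...and an absolute href passes through untouched, which naive
# 'prefix + href' concatenation would mangle.
print(urljoin(base, 'http://eu.southwales.ac.uk/country/cyprus/en/'))
# http://eu.southwales.ac.uk/country/cyprus/en/
```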

With that, your spider becomes:

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin


USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'

class recursiveSpider(BaseSpider):
    name = 'usw'
    allowed_domains = ['international.southwales.ac.uk']
    start_urls = ['http://international.southwales.ac.uk/countries']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        xpath = '/html/body/div[1]/div[4]/div[2]/ul/li/a/@href'
        for link in hxs.select(xpath).extract():
            yield Request(urljoin(response.url, link),
                          headers={'User-Agent': USER_AGENT},
                          callback=self.parse_linkpage,
                          dont_filter=True)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = QualificationItem()
        xpath = """
                //h4[normalize-space(.)="Undergraduate Entry Requirements"]
                 /following-sibling::ul[1]/li/text()
                """
        item['Qualification'] = hxs.select(xpath).extract()
        return item
