How to write an XPath that covers two columns

Posted 2024-07-01 07:25:15


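I'm using Scrapy to scrape a site. I've tried many ways to scrape this website, whose listing is laid out in two columns. The site's markup: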

<div>
    <div class="something">
        <article>
            <h2>
                <a href="somelinks">
        <article>
            <h2>
                <a href="somelinks">
        <article>
            <h2>
                <a href="somelinks">
    <div class="something">
        <article>
            <h2>
                <a href="somelinks">
        <article>
            <h2>
                <a href="somelinks">
        <article>
            <h2>
                <a href="somelinks">
</div>

My code:

for href in response.xpath("//div[@class='something']/article/h2/a/@href"):
    url = response.urljoin(href.extract())
    yield scrapy.Request(url, callback=self.parse_dir_contents)

Is my code wrong? I can't seem to get it to run: the spider closes automatically.
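One way to check the expression in isolation, without running a crawl, is to evaluate it against a closed-up copy of the markup above. The sketch below uses the standard library's ElementTree (which supports only a subset of XPath) as a stand-in for Scrapy's selectors, and `urllib.parse.urljoin` in place of `response.urljoin`. The `href` values, link texts, and `BASE_URL` are placeholders, not taken from the real site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Closed-up copy of the markup from the question. The href values and
# link texts are placeholders, not taken from the real site.
MARKUP = """
<div>
    <div class="something">
        <article><h2><a href="/post-1">one</a></h2></article>
        <article><h2><a href="/post-2">two</a></h2></article>
        <article><h2><a href="/post-3">three</a></h2></article>
    </div>
    <div class="something">
        <article><h2><a href="/post-4">four</a></h2></article>
        <article><h2><a href="/post-5">five</a></h2></article>
        <article><h2><a href="/post-6">six</a></h2></article>
    </div>
</div>
"""
BASE_URL = "http://example.com/blog/"  # hypothetical page URL

root = ET.fromstring(MARKUP)

# ElementTree paths are relative to the element, hence the leading ".".
# It cannot select attributes directly, so @href is read with .get().
links = root.findall(".//div[@class='something']/article/h2/a")
urls = [urljoin(BASE_URL, a.get("href")) for a in links]

for url in urls:
    print(url)
```

On this simplified markup the path reaches the links in both `div` columns, so the expression itself does not need a separate branch per column. If it matches nothing on the live page, the real HTML probably differs from the outline: for instance, `@class='something'` is an exact string comparison, so an attribute such as `class="something wide"` would not match, and `contains(@class, 'something')` is a common workaround.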


1 answer
User
#1 · Posted 2024-07-01 07:25:15

You can use the spider below to scrape all the blog posts from http://www.bebizzy.com/the-bebizzy-blog/

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from check_site.items import YourItem


class StackSpider(CrawlSpider):
    name = 'stack'
    allowed_domains = ['bebizzy.com']
    start_urls = ['http://www.bebizzy.com/the-bebizzy-blog/']

    rules = (
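        # Follow links matching a.more-link and pass each page to parse_item.
        Rule(LinkExtractor(restrict_css='a.more-link'), callback='parse_item', follow=True),
        # Follow pagination links. No callback here: CrawlSpider uses parse()
        # internally, so it should not be named as a rule callback.
        Rule(LinkExtractor(restrict_css='div.pagination>div>a'), follow=True),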
    )

    def parse_item(self, response):
        self.logger.info(response.url)
        i = YourItem()
        #TODO: fill your item
        #i['title'] = ...
        return i

Log output from the spider:

2016-05-15 21:45:18 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-15 21:45:18 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-15 21:45:18 [scrapy] INFO: Enabled item pipelines: 
2016-05-15 21:45:18 [scrapy] INFO: Spider opened
2016-05-15 21:45:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/04/12/learn-smartphone-features-spring/
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/04/why-you-need-a-responsive-website/
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/14/samsung-galaxy-s7-s7-edgereview/
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/10/marketing-your-business-online/
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/16/demographics-of-social-media-users/
2016-05-15 21:45:27 [stack] INFO: http://www.bebizzy.com/2016/03/02/websites-launched-creekside-farmstands-and-mandan-farmers-market/
2016-05-15 21:45:27 [stack] INFO: http://www.bebizzy.com/2016/03/01/what-is-wordpress/
2016-05-15 21:45:32 [stack] INFO: http://www.bebizzy.com/2016/03/18/mobile-friendly-sites-increase-seo-rank-google/
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/02/21/manage-multiple-wordpress-installations-with-managewp/
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/03/24/buy-laptop-tablet-2/
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/03/30/customizing-android-smartphone-screens/
2016-05-15 21:45:34 [stack] INFO: http://www.bebizzy.com/2015/09/18/vzwbuzz-recap-show-mobile-music/
2016-05-15 21:45:34 [stack] INFO: http://www.bebizzy.com/2015/09/03/choosing-a-new-logo/
2016-05-15 21:45:37 [stack] INFO: http://www.bebizzy.com/2015/10/16/best-android-apps-for-your-ghost-hunting-adventure/
2016-05-15 21:45:38 [stack] INFO: http://www.bebizzy.com/2015/10/21/samsung-note-5/
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/10/22/ue-roll-bluetooth-speaker/
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/11/17/best-apps-for-the-upcoming-election/
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/12/07/best-star-wars-android-apps/
2016-05-15 21:45:40 [stack] INFO: http://www.bebizzy.com/2016/02/19/using-microsoft-office-on-your-mobile-device/
2016-05-15 21:45:41 [stack] INFO: http://www.bebizzy.com/2016/01/08/best-android-business-apps-for-2016/
2016-05-15 21:45:41 [stack] INFO: http://www.bebizzy.com/2015/09/01/best-games-for-your-android-phone-essentialapps/
2016-05-15 21:45:44 [stack] INFO: http://www.bebizzy.com/2015/03/12/android-apps-for-your-spring-to-do-list/
2016-05-15 21:45:44 [stack] INFO: http://www.bebizzy.com/2015/02/02/mobile-technology-for-a-better-valentines-day/
2016-05-15 21:45:45 [stack] INFO: http://www.bebizzy.com/2015/03/18/logitech-k480-bluetooth-keyboard/
2016-05-15 21:45:45 [stack] INFO: http://www.bebizzy.com/2015/03/01/the-samsung-s6-and-the-htc-one-m9/
2016-05-15 21:45:47 [stack] INFO: http://www.bebizzy.com/2015/07/07/i-had-switchersremorse-once-once/
2016-05-15 21:45:47 [stack] INFO: http://www.bebizzy.com/2015/04/10/best-android-fishing-apps/
2016-05-15 21:45:48 [stack] INFO: http://www.bebizzy.com/2015/05/17/htcs-new-flagship-the-htc-one-m9/
2016-05-15 21:45:48 [stack] INFO: http://www.bebizzy.com/2015/07/28/windows10-twitter-stream/
2016-05-15 21:45:49 [stack] INFO: http://www.bebizzy.com/2015/01/06/my-3-words/

Just add your item-population logic after the #TODO: comment.
