A boredom project: scraping a schedule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["http://stats.swehockey.se/"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tbody/tr')

        for row in rows:
            date = row.select('/td[1]/div/span/text()').extract()
            teams = row.select('/td[2]/text()').extract()

            print date, teams

1 answer

User

#1 · posted 2024-06-28 11:37:06

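There are two problems:

tbody is a tag that modern browsers add when they render the page. It is not in the HTML that Scrapy downloads, so an XPath that goes through tbody matches nothing.
The XPaths for the date and the teams are wrong: inside the loop they should be relative (.//), and the td indices are off too; they should be 2 and 3, not 1 and 2.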
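
Both points can be checked offline with a small sketch. This is not Scrapy: it uses the standard library's xml.etree.ElementTree as a stand-in, and the markup and cell values below are hypothetical, only assumed to resemble the real page (a table with class tblContent, the date in the second cell, the teams in the third).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the downloaded page.
# Note that there is no <tbody>, just as in the raw HTML a scraper receives.
RAW_HTML = """
<html><body>
<table class="tblContent">
  <tr>
    <td>1</td>
    <td><div><span>2024-01-01</span></div></td>
    <td>Team A - Team B</td>
  </tr>
  <tr>
    <td>2</td>
    <td><div><span>2024-01-02</span></div></td>
    <td>Team C - Team D</td>
  </tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Problem 1: a path that goes through tbody matches nothing,
# while the same path without tbody finds the rows.
rows_with_tbody = root.findall(".//table[@class='tblContent']/tbody/tr")
rows = root.findall(".//table[@class='tblContent']/tr")

# Problem 2: query relative to each row, and use cells 2 and 3
# (cell 1 holds something else in this assumed layout).
schedule = []
for row in rows:
    date = row.findtext("td[2]/div/span")
    teams = row.findtext("td[3]")
    schedule.append((date, teams))

print(len(rows_with_tbody), len(rows))
print(schedule)
```

ElementTree paths are always relative to the element they are called on, which is the behaviour the leading .// gives you in a Scrapy selector.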

Here is the complete code with some modifications (it works):

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["stats.swehockey.se"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item

Hope that helps.
