Scraping data from a div as it appears on the page

Posted on 2024-10-03 19:26:11


I am trying to extract data from this URL https://eksisozluk.com/mortingen-sitraze--1277239 : the title, and then all of the comments under it. If you open the site you will see that the first comment under the title is (bkz: mortingen). The problem is that "(bkz:" sits directly in a div, while "mortingen" inside that div is wrapped in an anchor link, so it is hard to scrape the text the way it is displayed on the website. Can anyone help me with a CSS selector or XPath that scrapes every comment as it appears on the page? My code is below, but it gives me "(bkz:" in one column, then "akhisar", and so on, in three separate columns instead of one.

def parse(self, response):
    data = {}
    title = response.css('[itemprop="name"]::text').get()
    data["title"] = title
    count = 0
    for content in response.css('li .content ::text'):
        text = content.get().strip()
        key = "content" + str(count)
        data[key] = text
        count = count + 1
    yield data
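
To illustrate why the output comes back in pieces: every ::text match is a separate text node, so a comment that mixes plain text with anchor links is split up. A minimal, self-contained sketch, assuming the comment markup looks roughly like the (bkz: mortingen) example (the real page's HTML may differ):

from scrapy.selector import Selector

# hypothetical markup approximating one comment; the real site's HTML may differ
html = '<li><div class="content">(bkz: <a href="/mortingen">mortingen</a>)</div></li>'

sel = Selector(text=html)

# '::text' matches every text node separately, so the comment is split into pieces
print(sel.css('li .content ::text').getall())
# expected: ['(bkz: ', 'mortingen', ')']

# selecting the element first and joining its text nodes restores the full comment
print("".join(sel.css('li .content')[0].css('::text').getall()))
# expected: (bkz: mortingen)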

Tags: data, text, https, div, url, title, data, get
1 Answer
User
#1 · Posted on 2024-10-03 19:26:11

You should first get all the .content elements without ::text and handle each .content separately in a for loop. For each .content you then run ::text to get only the text nodes inside that element, collect them in a list, and join them into a single string.

        for count, content in enumerate(response.css('li .content')):
            text = []

            # get all `::text` nodes in the current `.content`
            for item in content.css('::text'):
                item = item.get()
                # put it on the list
                text.append(item)

            # join all items into a single string
            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text

Minimal working code

You can put all of the code in a single file and run it with python script.py, without creating a project in Scrapy.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = ['https://eksisozluk.com/mortingen-sitraze--1277239']

    def parse(self, response):
        print('url:', response.url)

        data = {}  # PEP8: spaces around `=`

        title = response.css('[itemprop="name"]::text').get()
        data["title"] = title

        for count, content in enumerate(response.css('li .content')):
            text = []

            for item in content.css('::text'):
                item = item.get()
                text.append(item)

            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text

        yield data
    
# --- run the spider without creating a project and save the results in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
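
After the run, output.csv holds one row per scraped page, with the title plus one "content N" column per comment. A minimal sketch for inspecting that file, assuming it was written next to the script:

import csv

# read the row(s) produced by the spider and print each field on its own line
with open('output.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        for key, value in row.items():
            print(key, '|', value)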

EDIT:

A version with getall() is slightly shorter:

        for count, content in enumerate(response.css('li .content')):

            text = content.css('::text').getall()

            text = "".join(text)
            text = text.strip()

            print(count, '|', text)
            data[f"content {count}"] = text
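
As a further alternative (not from the original answer): the XPath string() function already concatenates every text node under a node in document order, so the explicit join can be dropped. A sketch under that assumption:

        for count, content in enumerate(response.css('li .content')):
            # string(.) concatenates every text node under the current element
            text = content.xpath('string(.)').get().strip()

            print(count, '|', text)
            data[f"content {count}"] = text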
