Scrapy: removing HTML from the array returned into an items.py dict

Published 2024-10-01 09:25:55


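First of all, thanks for your help.

…new to stackoverflow (& python), so sorry for any wrong terminology :)

I'm using Scrapy to extract data from an html source, using Scrapy's selectors to fill a dict field defined in items.py: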

def parse_item(self, response):

    item = SiennaautoItem()  # instantiating the dict-like item
    item['attributes'] = response.css('p.attrgroup').extract()

    yield item

This returns a dict holding an array/list with multiple values:

> ['<p class="attrgroup">\n\n\n\n            <span><b>2014 honda odyssey
> touring elite</b></span>\n            <br>\n\n    </p>', '<p
> class="attrgroup">\n\n\n\n            <span>VIN:
> <b>5FNRL5H66EB107700</b></span>\n            <br>\n\n\n\n\n           
> <span>condition: <b>like new</b></span>\n            <br>\n\n\n\n\n   
> <span>cylinders: <b>6 cylinders</b></span>\n            <br>\n\n\n\n\n
> <span>drive: <b>fwd</b></span>\n            <br>\n\n\n\n\n           
> <span>fuel: <b>gas</b></span>\n            <br>\n\n\n\n\n           
> <span>odometer: <b>99000</b></span>\n            <br>\n\n\n\n\n       
> <span>paint color: <b>white</b></span>\n            <br>\n\n\n\n\n    
> <span>size: <b>full-size</b></span>\n            <br>\n\n\n\n\n       
> <span>title status: <b>clean</b></span>\n            <br>\n\n\n\n\n   
> <span>transmission: <b>automatic</b></span>\n           
> <br>\n\n\n\n\n            <span>type: <b>mini-van</b></span>\n        
> <br>\n\n    </p>']

Here is the rendered html:

['\n\n\n\n 2014 honda odyssey touring elite\n
\n\n

', '\n\n\n\n VIN: 5FNRL5H66EB107700\n
\n\n\n\n\n
condition: like new\n
\n\n\n\n\n
cylinders: 6 cylinders\n
\n\n\n\n\n drive: fwd\n
\n\n\n\n\n
fuel: gas\n
\n\n\n\n\n
odometer: 99000\n
\n\n\n\n\n
paint color: white\n
\n\n\n\n\n
size: full-size\n
\n\n\n\n\n
title status: clean\n
\n\n\n\n\n
transmission: automatic\n

\n\n\n\n\n type: mini-van\n

\n\n ']

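My question is how to remove the html tags, and how to create keys from the span tags, those being:

condition, drive, odometer, etc.

I'd like the values returned in item['attributes'] to become dict values of their own, e.g.:

item['odometer'], item['condition'], etc.

Any help is much appreciated, as I've been stuck on this for a while.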


1 answer
User
Answer 1 · posted 2024-10-01 09:25:55

My xpath is a bit rusty, but here is a way to do it without xpath, using only the w3lib library:

from w3lib.html import remove_tags, replace_escape_chars

html_array=['<p class="attrgroup">\n\n\n\n            <span><b>2014 honda odyssey > touring elite</b></span>\n            <br>\n\n    </p>', '<p > class="attrgroup">\n\n\n\n            <span>VIN: > <b>5FNRL5H66EB107700</b></span>\n            <br>\n\n\n\n\n            > <span>condition: <b>like new</b></span>\n            <br>\n\n\n\n\n    > <span>cylinders: <b>6 cylinders</b></span>\n            <br>\n\n\n\n\n > <span>drive: <b>fwd</b></span>\n            <br>\n\n\n\n\n            > <span>fuel: <b>gas</b></span>\n            <br>\n\n\n\n\n            > <span>odometer: <b>99000</b></span>\n            <br>\n\n\n\n\n        > <span>paint color: <b>white</b></span>\n            <br>\n\n\n\n\n     > <span>size: <b>full-size</b></span>\n            <br>\n\n\n\n\n        > <span>title status: <b>clean</b></span>\n            <br>\n\n\n\n\n    > <span>transmission: <b>automatic</b></span>\n            > <br>\n\n\n\n\n            <span>type: <b>mini-van</b></span>\n         > <br>\n\n    </p>']
# Strip the tags from every fragment, join them, then drop \n, \t and \r
html = replace_escape_chars(' '.join(remove_tags(fragment) for fragment in html_array))

data = {}
wanted = ['condition', 'cylinders', 'drive', 'fuel']  # put in this list the elements you need
# NOTE: splitting on '>' only works because the sample above was pasted with stray '>' quote markers
for chunk in html.split('>'):
    parts = [part.strip() for part in chunk.split(':')]
    if len(parts) > 1 and parts[0] in wanted:
        data[parts[0]] = parts[1]

print(data)

Output:

   
{'condition': 'like new', 'cylinders': '6 cylinders', 'drive': 'fwd', 'fuel': 'gas'}
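
The loop above only finds its chunks because the pasted sample still carries stray '>' quote markers; the raw list from response.css('p.attrgroup').extract() would not contain them. As a minimal sketch (not the answerer's method, and using only the standard library rather than w3lib), the spans could instead be parsed directly and each one split at its first colon. The names SpanTextCollector and attributes_to_dict, and the shortened sample list, are made up for illustration:

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the text content of every <span> element, one string per span."""

    def __init__(self):
        super().__init__()
        self.spans = []    # finished span texts
        self._depth = 0    # greater than 0 while inside a <span>
        self._buffer = []  # text pieces of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.spans.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def attributes_to_dict(fragments):
    """Turn 'key: value' spans into a dict; spans without a colon are skipped."""
    collector = SpanTextCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()
    result = {}
    for text in collector.spans:
        key, sep, value = text.partition(":")
        if sep:
            # Collapse any internal whitespace runs left over from the markup
            result[key.strip()] = " ".join(value.split())
    return result


# Shortened, hypothetical sample in the shape of item['attributes'] from the question
sample = [
    '<p class="attrgroup">\n<span><b>2014 honda odyssey touring elite</b></span>\n<br>\n</p>',
    '<p class="attrgroup">\n<span>condition: <b>like new</b></span>\n<br>\n'
    '<span>odometer: <b>99000</b></span>\n<br>\n'
    '<span>title status: <b>clean</b></span>\n<br>\n</p>',
]

attributes = attributes_to_dict(sample)
print(attributes)
```

This prints {'condition': 'like new', 'odometer': '99000', 'title status': 'clean'}; the first span has no colon, so it is skipped. To store the values as item['odometer'], item['condition'] and so on, each key would also need a matching scrapy.Field() declared in items.py, and a key such as 'title status' would probably be normalised first (for example by replacing the space with an underscore). Inside the spider, the same pairs could likely be read without a separate parser by iterating over response.css('p.attrgroup span').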
