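First of all, thanks for your help.
I'm new to Stack Overflow (and Python), so apologies if I use the wrong terminology :)
I'm using Scrapy to extract data from an HTML source, using Scrapy selectors to populate the dict-like fields defined in items.py: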
def parse_item(self, response):
    item = SiennaautoItem()  # instantiate the dict-like item
    item['attributes'] = response.css('p.attrgroup').extract()
    yield item
This returns a dict whose value is a list with multiple entries:
> ['<p class="attrgroup">\n\n\n\n <span><b>2014 honda odyssey
> touring elite</b></span>\n <br>\n\n </p>', '<p
> class="attrgroup">\n\n\n\n <span>VIN:
> <b>5FNRL5H66EB107700</b></span>\n <br>\n\n\n\n\n
> <span>condition: <b>like new</b></span>\n <br>\n\n\n\n\n
> <span>cylinders: <b>6 cylinders</b></span>\n <br>\n\n\n\n\n
> <span>drive: <b>fwd</b></span>\n <br>\n\n\n\n\n
> <span>fuel: <b>gas</b></span>\n <br>\n\n\n\n\n
> <span>odometer: <b>99000</b></span>\n <br>\n\n\n\n\n
> <span>paint color: <b>white</b></span>\n <br>\n\n\n\n\n
> <span>size: <b>full-size</b></span>\n <br>\n\n\n\n\n
> <span>title status: <b>clean</b></span>\n <br>\n\n\n\n\n
> <span>transmission: <b>automatic</b></span>\n
> <br>\n\n\n\n\n <span>type: <b>mini-van</b></span>\n
> <br>\n\n </p>']
Here is the rendered HTML:
['\n\n\n\n 2014 honda odyssey touring elite\n
', '\n\n\n\n VIN: 5FNRL5H66EB107700\n
\n\n
\n\n\n\n\n
condition: like new\n
\n\n\n\n\n
cylinders: 6 cylinders\n
\n\n\n\n\n drive: fwd\n
\n\n\n\n\n
fuel: gas\n
\n\n\n\n\n
odometer: 99000\n
\n\n\n\n\n
paint color: white\n
\n\n\n\n\n
size: full-size\n
\n\n\n\n\n
title status: clean\n
\n\n\n\n\n
transmission: automatic\n
\n\n\n\n\n type: mini-van\n
\n\n ']
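My question is: how do I strip the HTML tags, and how do I create keys from the span tags? These are:
condition, drive, odometer, etc.
I'd like the values returned in item['attributes'] to become dict entries of their own, for example:
item['odometer'], item['condition'], etc.
Any help is much appreciated, as I've been stuck on this for a while.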
My XPath is a bit rusty, but here's one way to do it without XPath, using only the w3lib library.
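A minimal sketch of that idea, not necessarily the exact code this answer used. It assumes `remove_tags` from `w3lib.html` for stripping the markup and assumes that every attribute sits on its own line as `key: value`, as in the rendered text above. The helper name `attributes_to_dict` and the shortened `sample` list are made up for illustration, and the regex fallback is only there so the sketch still runs where w3lib is not installed.

```python
import re

try:
    from w3lib.html import remove_tags
except ImportError:
    # Assumption: crude stand-in so the sketch runs without w3lib installed.
    def remove_tags(text):
        return re.sub(r"<[^>]+>", "", text)


def attributes_to_dict(fragments):
    """Turn extracted <p class="attrgroup"> fragments into a {key: value} dict."""
    result = {}
    for fragment in fragments:
        text = remove_tags(fragment)  # drop <p>, <span>, <b>, <br> and so on
        for line in text.splitlines():
            line = line.strip()
            if ":" not in line:
                continue  # blank lines and the title line carry no key
            key, value = line.split(":", 1)
            result[key.strip()] = value.strip()
    return result


# Shortened version of the list returned by response.css('p.attrgroup').extract()
sample = [
    '<p class="attrgroup">\n\n <span><b>2014 honda odyssey touring elite</b></span>\n <br>\n\n </p>',
    '<p class="attrgroup">\n\n <span>VIN: <b>5FNRL5H66EB107700</b></span>\n <br>\n\n'
    ' <span>condition: <b>like new</b></span>\n <br>\n\n'
    ' <span>odometer: <b>99000</b></span>\n <br>\n\n </p>',
]

attrs = attributes_to_dict(sample)
print(attrs)
```

In the spider, `attributes_to_dict(item['attributes'])` would be called inside `parse_item`. Note that a Scrapy `Item` only accepts keys declared as fields, so each key (for example `odometer` or `condition`) would need its own `scrapy.Field()` in items.py, and keys containing spaces such as `paint color` would likely need renaming first.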
Output: