I am trying to scrape an item's name, price, and description from a web page.
Here is the HTML:
...
<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...
Here is the code I have so far:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc
It outputs:
Split Sport Longsleeve T-shirt
$42.00
and then fails with this error:
Traceback (most recent call last):
...
File "/home/myfile.py", line 35, in siftInfo
print line.get_text()
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
strip, types=types)])
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
for descendant in self.descendants:
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
The output I want is:
Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.
Note: if I use print line instead of print line.get_text(), I get:
Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
<a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"></a>
<br style="clear:both;"/><br/>
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
Edit 1:
If I leave out the two lines that handle the price and add some whitespace normalization, I get the following.
New code:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print (' ').join(desc.split())
Output:
Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.
So the second line.h5.extract() call somehow changes the state of line, while the first one does not.
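One possible workaround, sketched here under a few assumptions, is to avoid extract() altogether so the tree is never modified: look up the two h5 elements by their ids, then build the description from the text nodes that sit directly inside the div. The function name sift_info (adapted from siftInfo in the traceback), the choice of the "html.parser" parser, and the Python 3 print calls are assumptions made for illustration. This does not explain why the second extract() breaks get_text(); it only sidesteps it.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The HTML fragment from the question, reproduced so the sketch is self-contained.
HTML = """<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>"""


def sift_info(html):
    """Return (name, price, description) without modifying the parsed tree."""
    # Parser choice is an assumption; the question does not say which one was used.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find(id="ProductDesc")

    # Read the title and price by id instead of extracting them from the tree.
    name = div.find(id="productTitle").get_text(strip=True)
    price = div.find(id="productPrice").get_text(strip=True)

    # The description is loose text directly inside the div, so keep only
    # the div's immediate children that are text nodes, not tags.
    loose_text = "".join(
        child for child in div.children if isinstance(child, NavigableString)
    )
    # Collapse newlines and repeated spaces into single spaces.
    desc = " ".join(loose_text.split())
    return name, price, desc


name, price, desc = sift_info(HTML)
print(name)
print(price)
print(desc)
```

Because nothing is removed from the tree, the order of the lookups does not matter, and print(div) still shows the original markup afterwards.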
This would not format well as a comment, so I am posting it here. This is the code I ran and the output I got:
Output: