Beautiful Soup 4's extract() changes a tag to NoneType

Posted 2024-10-02 12:23:33

I am trying to scrape an item's name, price, and description from a web page.

Here is the HTML:

...
<div id="ProductDesc">
                            <a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
                            <h5 id="productPrice">$42.00</h5>
                            <br style="clear:both;" /><br />
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...

Here is the code I have so far:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc

It prints:

Split Sport Longsleeve T-shirt
$42.00

Then it fails with this error:

Traceback (most recent call last):
  ...
  File "/home/myfile.py", line 35, in siftInfo
    print line.get_text()
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
    strip, types=types)])
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
    for descendant in self.descendants:
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

The output I want is:

Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.

Note:

If I use print line instead of print line.get_text(), it prints:

Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
                            <a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"></a>

                            <br style="clear:both;"/><br/>
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>

Edit 1:

If I leave out the two lines that deal with the price and add some whitespace normalization, I get the following.

New code:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print (' ').join(desc.split())

Output:

Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.

So the second line.h5.extract() somehow changes the type of line, but the first one does not.
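One way to narrow this down is to run the same two extract() calls against a static copy of the markup. The following is a minimal sketch, assuming a recent bs4 release and the built-in "html.parser"; the post does not say which parser or bs4 version was in use, and the behavior may differ with them. The helper name extract_twice is hypothetical, and the onClick attribute from the original HTML is omitted for brevity.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Static copy of the snippet from the question (onClick attribute omitted).
HTML = '''<div id="ProductDesc">
<a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>'''


def extract_twice(html):
    """Hypothetical helper: remove both <h5> tags and return what is left."""
    # The parser is named explicitly (an assumption; the question never names one).
    soup = BeautifulSoup(html, "html.parser")
    line = soup.find(id="ProductDesc")
    name = line.h5.extract()   # removes the first <h5> and returns it
    price = line.h5.extract()  # removes the next remaining <h5>
    return line, name, price


line, name, price = extract_twice(HTML)
print(type(line).__name__)                 # type of the parent after both calls
print(name.get_text())
print(price.get_text())
print(" ".join(line.get_text().split()))   # remaining text, whitespace collapsed
```

If this runs cleanly while the live page still fails, the difference probably lies in how the installed parser built the tree rather than in extract() itself.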


1 answer
Answer 1 · Posted 2024-10-02 12:23:33

This would not format well as a comment, so I am posting it here. This is the code I ran and the output I got:

from bs4 import BeautifulSoup
from urllib.request import urlopen

def mainTest():
    url = "http://10deep.com/store/split-sport-longsleeve-t-shirt"
    print("url is: " + url)
    page = urlopen(url)  # urlopen is imported by name above, so call it directly

    soup = BeautifulSoup(page.read())
    line = soup.find(id="ProductDesc")
    name = line.h5.extract()
    print(name.get_text())
    price = line.h5.extract()
    print(price.get_text())
    desc = line.get_text()
    print(desc)

mainTest()

Output:

C:\Python34\python.exe C:/{path}/testPython.py
url is: http://10deep.com/store/split-sport-longsleeve-t-shirt
Split Sport Longsleeve T-shirt
$42.00




                        Style # 53TD4141 Screenprinted longsleeve cotton tee.

Process finished with exit code 0
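
For comparison, the same three values can be read without modifying the tree at all, which sidesteps whatever extract() does to the element links. This is only a sketch under assumptions: it uses the ids productTitle and productPrice from the question's HTML, names "html.parser" explicitly, and works on a static copy of the markup rather than the live page. The function name sift_info_readonly is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Static copy of the snippet from the question (onClick attribute omitted).
HTML = '''<div id="ProductDesc">
<a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>'''


def sift_info_readonly(html):
    """Hypothetical helper: return (name, price, description) without extract()."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(id="ProductDesc")

    # Look the two headings up by their ids instead of removing them.
    name = block.find(id="productTitle").get_text(strip=True)
    price = block.find(id="productPrice").get_text(strip=True)

    # The description is the text sitting directly inside the <div>,
    # so keep only the direct children that are plain strings.
    pieces = [
        child.strip()
        for child in block.children
        if isinstance(child, NavigableString) and child.strip()
    ]
    return name, price, " ".join(pieces)


name, price, desc = sift_info_readonly(HTML)
print(name)
print(price)
print(desc)
```

Because nothing is removed, the result does not depend on the order of the lookups, and the parsed tree can still be reused afterwards.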
