BeautifulSoup: <span>the text I want</span>

<div class="itemText">
  <div class="wrapper">
    <span class="itemPromo">Customer Choice Award Winner</span>
    <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
      <span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
      <span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
    </a>
  </div>

from bs4 import BeautifulSoup as bs

f = open('egg.data', 'rb')
content = f.read()
content = content.decode('utf-8', 'replace')
# drop every non-ASCII character
content = ''.join([x for x in content if ord(x) < 128])
soup = bs(content)
for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
            print(child, "\n")

3 Answers

User

#1 · edited 2024-09-28 23:30:36

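wrapper_href.descendants also yields NavigableString objects, and that is what you are tripping over. A NavigableString is essentially a string object, and the child['id'] line tries to index it with a string key: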

>>> next(wrapper_href.descendants)
u'\n'
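
If you would rather keep the loop over descendants, one option is to skip everything that is not a tag before looking at its attributes. The following is only a sketch: it assumes Beautiful Soup 4 on Python 3, uses a shortened stand-in for the page markup, and `find_title` is a hypothetical helper name, not something from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the markup in the question (assumption).
html = """
<div class="itemText"><div class="wrapper">
<a href="#">
<span class="itemDescription" id="titleDescriptionID">Intel Core i7-3770K</span>
<span class="itemDescription" id="lineDescriptionID">Intel Core i7-3770K</span>
</a>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("div", class_="itemText").div.a

# The descendants are a mix of tags and the strings between and inside them.
kinds = [type(child).__name__ for child in link.descendants]


def find_title(anchor):
    """Return the first descendant tag whose id is titleDescriptionID, or None."""
    for child in anchor.descendants:
        # Only Tag objects carry attributes; strings are skipped here.
        if isinstance(child, Tag) and child.get("id") == "titleDescriptionID":
            return child
    return None


title = find_title(link)
print(kinds)
print(title.get_text())
```

Using `child.get("id")` instead of `child["id"]` also avoids a KeyError on tags that have no id attribute at all.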

Why not just load the tag directly with itemText.find('span', id='titleDescriptionID')?

Demo:

>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
... 
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
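
One thing to watch for with find(): it returns None when nothing matches, so calling .text on the result fails for any item block that lacks the span. A minimal sketch of a guarded version follows, assuming Beautiful Soup 4 on Python 3. The two-item markup is made up for illustration, with the second block deliberately missing the span.

```python
from bs4 import BeautifulSoup

# Illustrative markup (assumption): the second item has no description span.
html = """
<div class="itemText"><div class="wrapper"><a href="#">
<span class="itemDescription" id="titleDescriptionID"> Intel Core i7-3770K </span>
</a></div></div>
<div class="itemText"><div class="wrapper"><a href="#">no description here</a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for item_text in soup.find_all("div", class_="itemText"):
    span = item_text.find("span", id="titleDescriptionID")
    if span is None:
        # find() found nothing in this block, so there is no text to read.
        continue
    # strip=True trims the whitespace around the text.
    titles.append(span.get_text(strip=True))

print(titles)
```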

User

#2 · edited 2024-09-28 23:30:36

spans = soup.find_all('span', attrs={'id':'titleDescriptionID'})
for span in spans:
    print(span.string)

In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings enclosed by those span tags. It walks through its children recursively.
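
The difference between direct children and the recursive descendants can be seen with a small sketch like the one below. It assumes Beautiful Soup 4 on Python 3 and uses a compact stand-in for the anchor, with no whitespace between the tags, so the counts match the "at least 4" above.

```python
from bs4 import BeautifulSoup

# Compact stand-in for the <a> element (assumption): no whitespace between tags.
html = (
    '<a href="#">'
    '<span id="titleDescriptionID">A</span>'
    '<span id="lineDescriptionID">B</span>'
    '</a>'
)
anchor = BeautifulSoup(html, "html.parser").a

# .children stops at the first level: just the two span tags.
children = list(anchor.children)

# .descendants recurses, so each span is followed by the string inside it.
descendants = list(anchor.descendants)
descendant_kinds = [type(node).__name__ for node in descendants]

print(len(children), len(descendants))
print(descendant_kinds)
```

With real page markup the count is usually higher, because the line breaks between tags show up as extra strings.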

User

#3 · edited 2024-09-28 23:30:36

from bs4 import BeautifulSoup  # on the legacy BeautifulSoup 3 package: from BeautifulSoup import BeautifulSoup
pool = BeautifulSoup(html, 'html.parser') # where html contains the whole html as string

for item in pool.find_all('span', attrs={'id' : 'titleDescriptionID'}):
    print(item.string)

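When we search for a tag with BeautifulSoup, we get back a Tag object, which can be used directly to access its other properties, such as its inner content, style, href, and so on.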
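
As a rough illustration of what that looks like in practice, here is a sketch that reads a few properties off the matched tag. It assumes Beautiful Soup 4 on Python 3 and a shortened copy of the markup from the question. The `data-price` attribute is a made-up name, used only to show a lookup that misses.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: description text trimmed).
html = (
    '<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" '
    'title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" '
    'style="display:inline">Intel Core i7-3770K</span>'
    '</a>'
)
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="titleDescriptionID")

text = span.string                 # inner content of the tag
style = span["style"]              # a single attribute, by name
classes = span["class"]            # class is multi-valued, so this is a list
missing = span.get("data-price")   # hypothetical attribute; .get() gives None
href = span.parent["href"]         # the enclosing <a> tag holds the link

print(text, style, classes, missing, href)
```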