Python BeautifulSoup: extracting the text between elements

<html> <body> <table> <td class="MYCLASS">  <a hef="xy">Text</a> <p>something</p> THIS IS MY TEXT <p>something else</p> </br> </td> </table> </body> </html>

3 Answers

User

Answer 1 · edited 2024-09-20 22:51:50

Use .children instead:

from bs4 import NavigableString, Comment

# `hit` is one of the tags matched by the search shown in the output below.
print(''.join(str(child) for child in hit.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment)))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.find_all(attrs={'class': 'MYCLASS'}):
...     print(''.join(str(child) for child in hit.children
...         if isinstance(child, NavigableString) and not isinstance(child, Comment)))
...




      THIS IS MY TEXT
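
For readers who want to run this approach end to end, here is a minimal sketch. It assumes Python 3 with the bs4 package and the built-in html.parser. The markup is adapted from the question (the HTML comment is added for illustration), and direct_text is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Sample markup adapted from the question; the HTML comment is an assumption.
HTML = """
<table>
<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of `tag`, skipping comments."""
    return "".join(
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


soup = BeautifulSoup(HTML, "html.parser")

# Strip the surrounding newlines that separate the child tags.
texts = [direct_text(hit).strip() for hit in soup.find_all(class_="MYCLASS")]
print(texts)
```

Text nested inside the child tags (Text, something, something else) is left out because only direct string children of the matched tag are joined.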

User

Answer 2 · edited 2024-09-20 22:51:50

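Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree contains Tags and NavigableStrings (THIS IS MY TEXT is a NavigableString, since it is plain text). An example: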

from bs4 import BeautifulSoup

# The closing </p> tags are added so that every parser builds the same tree.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree, you have contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.
If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].

For the document above, that means you can write:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

With several child nodes, you get:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can work with contents and pick out the item at the index you need.
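
The difference between string and contents can be checked with a short sketch. It assumes Python 3 with bs4 and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="firstpara">This is paragraph <b>one</b>.</p>', "html.parser"
)

# <b> has exactly one child and it is a string, so .string is available.
single = soup.b.string
same = soup.b.contents[0]

# <p> has three children, so .string is None and .contents must be used.
multi = soup.p.string
parts = [str(child) for child in soup.p.contents]

print(single, same, multi, parts)
```

When a tag mixes text and child tags, as the td in the question does, string is None, which is why the answers here fall back to children or contents.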

You can also iterate over a Tag directly, which is a shortcut. For example:

for i in soup.body:
    print(i)
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
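
Iterating over a tag yields the same nodes as contents, so text nodes show up next to child tags. The sketch below assumes Python 3 with bs4 and html.parser, and uses made-up markup to show how the two node types can be told apart.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup: one child tag followed by a bare text node.
soup = BeautifulSoup(
    "<body><p>This is paragraph <b>one</b>.</p>tail text</body>", "html.parser"
)

kinds = []
for node in soup.body:
    if isinstance(node, Tag):
        kinds.append(("tag", node.name))
    elif isinstance(node, NavigableString):
        kinds.append(("text", str(node)))

# Iterating over a tag walks its direct children in document order.
same_as_contents = list(soup.body) == soup.body.contents
print(kinds, same_as_contents)
```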

User

Answer 3 · edited 2024-09-20 22:51:50

You can use .contents:

>>> for hit in soup.find_all(attrs={'class': 'MYCLASS'}):
...     print(hit.contents[6].strip())
...
THIS IS MY TEXT
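
A fixed index such as 6 depends on the exact whitespace and comments in the markup. The sketch below lists the children with their indexes and, as an alternative that is not part of this answer, reaches the same text through next_sibling of the first p tag. It assumes Python 3, bs4 with html.parser, and markup adapted from the question with an HTML comment added.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question; the HTML comment is an assumption.
HTML = """<td class="MYCLASS">
<!-- a comment -->
<a href="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</td>"""

soup = BeautifulSoup(HTML, "html.parser")
hit = soup.find(class_="MYCLASS")

# Show each direct child with its index to see where the text node sits.
for index, child in enumerate(hit.contents):
    print(index, repr(child))

# Index-based access, as in the answer above.
by_index = hit.contents[6].strip()

# Alternative: the text node is the sibling right after the first <p>.
by_sibling = hit.find("p").next_sibling.strip()
print(by_index, by_sibling == by_index)
```

If the markup has no comment or different line breaks, the index changes, while the sibling lookup keeps working as long as the text directly follows the first p tag.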
