Python BeautifulSoup 提取元素间的文本

2024-09-20 22:51:50 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图从以下HTML中提取“这是我的文本”:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

我这样试过:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

但是我得到所有嵌套标记之间的所有文本加上注释。

有谁能帮我把“这是我的短信”从这里面拿出来吗?


Tags: text文本htmltablemyclasscommentbodysomething
3条回答

请改用^{}

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

是的,这有点像跳舞。

输出:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT

了解有关如何导航through the parse tree in ^{}的详细信息。解析树得到了tagsNavigableStrings(因为这是一个文本)。一个例子

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

要向下移动解析树,有contentsstring

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

对于以上,也就是说你可以

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

对于多个子节点,可以有

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

因此,您可以在这里玩contents,并在所需索引处获取内容。

您还可以在标记上迭代,这是一个快捷方式。例如

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

您可以使用^{}

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT

相关问题 更多 >

    热门问题