Child div with the same class name as its parent div

Posted 2024-09-30 16:20:44


The HTML is structured as follows:

    <div class="my_class">
        <div>important text</div>
        <div class="my_class">
            <div>not important</div>
        </div>
    </div>
    <div class="my_class">
        <div>important text</div>
        <div class="my_class">
            <div>not important</div>
        </div>
    </div>
    ...

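Basically, there are many divs whose child divs have the same class name as they do, and ultimately I want to get the "important text" under each parent div.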

When I try to find all the divs with this class, the nested child divs are returned too.

Here is the code that gets all the divs with class="my_class" and looks for the important text:

my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
    text_item = my_div.find('div') # to get to the div that contains the important text
    print(text_item.getText())

Obviously, the output is:

important text
not important
important text
not important
...

Whereas what I need is:

 important text
 important text
 ...

3 answers

With bs4 4.7.1, you can use :has and :first-child:

from bs4 import BeautifulSoup as bs

html = '''<div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>
   <div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>'''

soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])
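
To make the selector easier to follow, here is a minimal self-contained sketch of the same idea. It assumes a bs4 version with CSS selector support for :has (4.7 or later) and uses the built-in html.parser instead of lxml, purely so that no extra parser needs to be installed. The variable names selector and texts are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="my_class">
    <div>important text</div>
    <div class="my_class">
        <div>not important</div>
    </div>
</div>
<div class="my_class">
    <div>important text</div>
    <div class="my_class">
        <div>not important</div>
    </div>
</div>
"""

# Assumption: html.parser is used here; the answer above uses lxml.
soup = BeautifulSoup(html, "html.parser")

# .my_class:has(> .my_class)  matches a my_class div that has a direct
#                             my_class child, i.e. only the outer divs
# > div:first-child           then takes that outer div's first child element
selector = ".my_class:has(> .my_class) > div:first-child"
texts = [tag.get_text() for tag in soup.select(selector)]
print(texts)
```

The inner my_class divs are skipped because they contain no my_class child of their own, so :has does not match them.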

You can iterate over soup.contents:

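from bs4 import BeautifulSoup, Tag
# html is the string from the question; keep only tags so whitespace text nodes between the divs are skipped
r = [i.div.text for i in BeautifulSoup(html, 'html.parser').contents if isinstance(i, Tag)]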

Output:

['important text', 'important text']
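
Iterating over the top-level contents only works when the outer divs sit at the very top of the parsed document. If they are nested deeper, one possible alternative (a different technique from the one above, shown only as a sketch) is to keep every my_class div that has no my_class ancestor, using find_parent. The wrapping section element and the names outer_divs and texts are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the outer divs are wrapped in a <section>,
# so they are no longer top-level nodes of the document.
html = """
<section>
  <div class="my_class">
    <div>important text</div>
    <div class="my_class">
      <div>not important</div>
    </div>
  </div>
  <div class="my_class">
    <div>important text</div>
    <div class="my_class">
      <div>not important</div>
    </div>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the my_class divs that are not inside another my_class div.
outer_divs = [
    d for d in soup.find_all("div", class_="my_class")
    if d.find_parent("div", class_="my_class") is None
]

# The first div inside each outer div holds the text we want.
texts = [d.find("div").get_text() for d in outer_divs]
print(texts)
```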

From the find_all() documentation:

recursive is a boolean argument (defaulting to True) which tells Beautiful Soup whether to go all the way down the parse tree, or whether to only look at the immediate children of the Tag or the parser object.

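So, assuming the first-level divs sit directly under <html><body>, you can set recursive=False and then read the first div of each result: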

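top_divs = soup.html.body.find_all('div', attrs={'class': 'my_class'}, recursive=False)
# each top-level div's first nested div holds the important text
print([d.find('div').getText() for d in top_divs])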

Output:

 ['important text', 'important text']
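
Note that soup.html.body only exists when the parser adds those wrapper tags (lxml and html5lib do, html.parser does not). As a minimal sketch under the assumption that the fragment is parsed with html.parser, recursive=False can be applied to the soup object itself, since the outer divs are then its direct children. The names top_level and texts are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="my_class">
    <div>important text</div>
    <div class="my_class">
        <div>not important</div>
    </div>
</div>
<div class="my_class">
    <div>important text</div>
    <div class="my_class">
        <div>not important</div>
    </div>
</div>
"""

# Assumption: html.parser, which does not wrap the fragment in <html><body>.
soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children of the soup,
# so the nested my_class divs are never visited.
top_level = soup.find_all("div", class_="my_class", recursive=False)

# Again look only at direct children to pick the first plain div.
texts = [d.find("div", recursive=False).get_text() for d in top_level]
print(texts)
```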
