Python BeautifulSoup changes behaviour when a string is provided in find_all()

from bs4 import BeautifulSoup
import re

text = """
<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code><br/>
<br/><code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020 <br/>  BECMG 2816/2818 34015KT <br/>  TEMPO 2909/2912 34015G25KT</code><br/>
<hr width="65%"/>
"""

soup = BeautifulSoup(text, 'html.parser')
info = soup.find_all("code")
value = soup.find_all('code', string=re.compile('LGEL'))

print(value)  # This will not find the second code tag
print(info)   # This finds all code tags successfully

2 Answers

User

#1 · edited 2024-07-04 09:05:47

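An answer has already been given that helps the developer move forward, but I believe the why of this issue is still open. It can be answered by referring to the BeautifulSoup documentation, in particular this section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument.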

I believe that section explains that when you use string="some text" in find/find_all, it finds the tags whose .string attribute matches.

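The .string attribute is described here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string. Essentially, .string only returns something when the tag has exactly one child and that child is text (or a single tag that itself has a .string); otherwise it is None.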
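To make the .string rule concrete, here is a minimal sketch. The two-tag HTML snippet is made up for illustration and is not the data from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one code tag with only text, one with a <br/> inside.
html = "<code>plain text</code><code>first line<br/>second line</code>"
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("code")

# A single text child: .string returns that text.
plain_string = plain.string
# Three children (text, <br/>, text): .string is None.
mixed_string = mixed.string
# get_text() concatenates all descendant text, whatever the child count.
mixed_text = mixed.get_text()

print(repr(plain_string))
print(repr(mixed_string))
print(repr(mixed_text))
```

Since string= is compared against .string, the second tag has nothing for a pattern to match, while get_text() still sees all of its text.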

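So the reason it does not work for every code tag is that some of the code tags contain more than text: in your example, br tags. Providing your own filter gets you what you need: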

from bs4 import BeautifulSoup
import re

text = """<!-- Data starts here -->
<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code><br/>
<br/><code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020 <br/>  BECMG 2816/2818 34015KT <br/>  TEMPO 2909/2912 34015G25KT</code><br/>
<hr width="65%"/>
<!-- Data ends here -->"""

my_pattern = re.compile('LGEL')

def my_filter(tag):
    """Filter the tag."""

    return tag.name == 'code' and my_pattern.search(tag.get_text()) is not None


soup = BeautifulSoup(text, 'html.parser')
value = soup.find_all(my_filter)

print(value)  # The custom filter finds both code tags

Output

[<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code>, <code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020 <br/>  BECMG 2816/2818 34015KT <br/>  TEMPO 2909/2912 34015G25KT</code>]

I believe this answers the why as well as showing how to solve it.
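If the same kind of search is needed for other tag names or patterns, the filter can be produced by a small factory. This is only a sketch: tag_text_filter is a hypothetical helper name, not part of bs4, and the sample HTML is shortened and invented for the example:

```python
import re
from bs4 import BeautifulSoup


def tag_text_filter(name, pattern):
    """Build a find_all() filter (hypothetical helper, not a bs4 API)."""
    regex = re.compile(pattern)

    def _filter(tag):
        # Match on the tag name and on all text inside the tag,
        # so nested tags such as <br/> do not hide the text.
        return tag.name == name and regex.search(tag.get_text()) is not None

    return _filter


# Made-up sample data: two tags mention LGEL, one does not.
html = "<code>LGEL A</code><code>TAF LGEL B<br/>C</code><code>EGLL D</code>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(tag_text_filter("code", "LGEL"))
print(len(matches))
```

Passing a function to find_all() means it is called once per tag, so the name check and the text check both live in one place.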

User

#2 · edited 2024-07-04 09:05:47

You first have to extract() the br tags: they split the contents of the code tag into several children. Then your code works.

from bs4 import BeautifulSoup
import re

text = """<!-- Data starts here -->
<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code><br/>
<br/><code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020  <br/>  BECMG 2816/2818 34015KT  <br/>  TEMPO 2909/2912 34015G25KT</code><br/>
<hr width="65%"/>
<!-- Data ends here -->"""


soup = BeautifulSoup(text, 'html.parser')
for br in soup.find_all('br'):
    br.extract()

info = soup.find_all("code")
value = soup.find_all('code', string = re.compile('LGEL'))

print(value)  # code tags matched by the string filter
print(info)   # all code tags

Output:

[<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code>, <code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020   BECMG 2816/2818 34015KT   TEMPO 2909/2912 34015G25KT</code>]
[<code>LGEL 281220Z 33010G20KT CAVOK 32/11 Q1013</code>, <code>TAF LGEL 281100Z 2812/2912 34018G28KT 9999 FEW020   BECMG 2816/2818 34015KT   TEMPO 2909/2912 34015G25KT</code>]
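One caveat worth checking if the string= search still misses the second tag after the br tags are removed: as far as I know, extract() does not merge the text nodes left on either side of a removed tag, so the code tag can still have several children and .string can still be None. A minimal sketch of one way around that, assuming bs4 4.8.0 or later (where smooth() is available) and using shortened, made-up sample data:

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the data in the question.
html = "<code>LGEL A</code><br/><code>TAF LGEL B<br/>C<br/>D</code>"
soup = BeautifulSoup(html, "html.parser")

# Remove every br tag from the tree.
for br in soup.find_all("br"):
    br.extract()

# Consolidate adjacent text nodes so each code tag has one text child again.
soup.smooth()

value = soup.find_all("code", string=re.compile("LGEL"))
print(len(value))
print(value[1].string)
```

Unlike the custom filter, this modifies the parse tree, so it only suits cases where the br tags are not needed afterwards.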
