Searching for HTML elements based on a specific word in the element's string

Posted 2024-10-03 00:25:39


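I'm trying to write a program that uses the Beautiful Soup module to find and replace the tags of specific elements. However, I'm having trouble "searching" for those elements by a specific word found in the element's string. Assuming I can get the code to "find" the elements by the specified word in their string, I would then "unwrap" the elements' "p" tags and "wrap" them in new "h1" tags.

Here is some sample HTML to use as input: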

<p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
<p> Example#2  this element ignored </p>
<p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>

Here is my code so far (searching by "ExampleStringWord#1"):

[code block not preserved in the source]

Given the sample HTML input above, I would like the output to look like this:

<h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1>
<p> Example#2  this element ignored </p>
<h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>

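However, my code only finds elements whose string is exactly "ExampleStringWord#1" and excludes elements that have any wording after it. I'm sure I need a regular expression to find elements by the specified word (regardless of whatever wording follows it). However, I'm not very familiar with regular expressions, so I don't know how to approach this in combination with the Beautiful Soup module.

I have also gone through the Beautiful Soup documentation on passing a regular expression as a filter (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression), but I couldn't get it to work in my case. I've also reviewed other posts here about passing regular expressions to Beautiful Soup, but I haven't found anything that adequately solves my problem. Any help is appreciated!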
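To illustrate the difference described in the question, here is a minimal sketch using only the standard library. The `texts` list is a shortened, hypothetical stand-in for the text of the three `p` elements; it is not the code from the question. It compares an exact string comparison with a regular-expression search for the same word.

```python
import re

# Shortened stand-ins for the text inside the three <p> elements (assumed data).
texts = [
    " ExampleStringWord#1 needs to find this entire element ",
    " Example#2  this element ignored ",
    " ExampleStringWord#1 different wording after the first word ",
]
word = "ExampleStringWord#1"

# Exact comparison: only succeeds if the whole text equals the word.
exact = [t == word for t in texts]

# Regex search: succeeds if the word appears anywhere in the text.
# re.escape() guards against characters that are special in a pattern.
pattern = re.compile(re.escape(word))
searched = [bool(pattern.search(t)) for t in texts]

print(exact)     # [False, False, False]
print(searched)  # [True, False, True]
```

The exact comparison rejects every element because each text carries extra wording, while the pattern search accepts the first and third.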


1 answer
User
#1 · Posted 2024-10-03 00:25:39

You can locate the p elements by the specified substring (note the re.compile() part) and then change the element name to h1:

import re

from bs4 import BeautifulSoup

data = """
<body>
    <p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p>
    <p> Example#2  this element ignored </p>
    <p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>
</body>
"""

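soup = BeautifulSoup(data, "html.parser")
# A compiled regex passed as string= matches every <p> whose text contains the pattern
for p in soup.find_all("p", string=re.compile("ExampleStringWord#1")):
    p.name = 'h1'  # renaming the tag turns <p>...</p> into <h1>...</h1>
print(soup)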

Prints:

[output not preserved in the source]
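As a possible variation on the approach above, the sketch below matches the word only when it is the first word of the element's text, and uses the wrap/unwrap steps mentioned in the question instead of renaming the tag. The helper `starts_with_word` and the shortened sample HTML are assumptions made for illustration. A function filter with `get_text()` is used because, as far as I know, `string=` does not match a tag that contains nested tags (its `.string` is `None`).

```python
import re
from bs4 import BeautifulSoup

# Shortened sample input (assumed); the third <p> contains a nested <b> tag.
html = """
<p> ExampleStringWord#1 first paragraph </p>
<p> Example#2 this element ignored </p>
<p> ExampleStringWord#1 a <b>bold</b> paragraph </p>
"""

soup = BeautifulSoup(html, "html.parser")
word = "ExampleStringWord#1"

# Optional leading whitespace, then the literal word, then whitespace or end of text.
pattern = re.compile(r"\s*" + re.escape(word) + r"(?!\S)")

def starts_with_word(tag):
    """Hypothetical helper: True for <p> tags whose text begins with the word."""
    return tag.name == "p" and pattern.match(tag.get_text()) is not None

for p in soup.find_all(starts_with_word):
    p.wrap(soup.new_tag("h1"))  # <h1><p>...</p></h1>
    p.unwrap()                  # drop the inner <p>, leaving <h1>...</h1>

print(soup)
```

Setting `p.name = "h1"` as in the answer gives the same result with less code; wrap/unwrap is shown only because the question describes it.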
