How do I write a BeautifulSoup filter (SoupStrainer) that only parses objects with specific text between the tags?

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])

    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

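3 Answers

User

Reply 1 · edited 2024-10-03 11:13:28

It looks like you are trying to loop over the soup's elements inside the my_custom_strainer method.

To do that, you can proceed as follows: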

soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup, attrs)

Then modify my_custom_strainer slightly to meet that requirement, as follows:

[code block missing from the source]

This way you can walk through the soup object iteratively.

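User

Reply 2 · edited 2024-10-03 11:13:28

I recently wrote an lxml/BeautifulSoup parser for HTML files that can also search between specific tags.

The function I wrote opens a file dialog that lets you choose the specific HTML file to parse.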

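# Assumes PyQt5 and bs4 are installed; openFile is a method of a QWidget subclass.
from bs4 import BeautifulSoup
from PyQt5.QtWidgets import QFileDialog

def openFile(self):
    options = QFileDialog.Options()
    options |= QFileDialog.DontUseNativeDialog
    fileName, _ = QFileDialog.getOpenFileName(self, "QFileDialog.getOpenFileName()", "",
                                              "All Files (*);;Python Files (*.py)", options=options)
    if fileName:
        # Read the chosen file and close it again right away.
        with open(fileName) as file:
            data = file.read()
        soup = BeautifulSoup(data, "lxml")
        # Collect the text of every <strong> tag as a float.
        results = [float(item.text) for item in soup.find_all('strong')]
        print('Score =', results[1])
        print('Fps =', results[0])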

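As you can see, the tag I specified is "strong", and I look for the text inside that tag.

Hope this helps.

User

Reply 3 · edited 2024-10-03 11:13:28

TL;DR: No, this is currently not easy to do in BeautifulSoup (it would require modifying the BeautifulSoup and SoupStrainer objects).

Explanation:

The problem is that the function passed to the strainer is called from the handle_starttag() method. As you may have guessed, at that point you only have the values from the opening tag (the element name and its attributes).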

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

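As you can see, if your strainer function returns False, the element is discarded immediately, with no chance to consider its inner text (unfortunately).

On the other hand, if you add "text" to the search: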

[code block missing from the source]

it will start searching the text inside the tags, but it has no context about the element or its attributes. You can see the irony :/

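Combining the two finds nothing at all. And you cannot even access the parent, as shown here in the find function: https://gist.github.com/RichardBronosky/4060082

So at the moment strainers are only good for filtering on elements and attributes. You would have to modify a lot of the Beautiful Soup code to make this work.

If you really need to do this, I suggest subclassing the BeautifulSoup and SoupStrainer objects and modifying their behavior.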
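If subclassing is more than you need, one lighter alternative (not part of the answer above, just a sketch under assumptions) is to let the strainer do what it can, which is filter on tag names, and to apply the text condition after parsing, once the inner text actually exists. The sample HTML and the helper name `wanted` are hypothetical.

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, not taken from the question.
html = """
<div class="score">42</div>
<div class="other">ignore me</div>
<span>my text here</span>
<span>something else</span>
<p>my text in a paragraph</p>
"""

# Step 1: the strainer only sees tag names here, so keep every <div> and <span>.
div_or_span = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=div_or_span)

# Step 2: after parsing, tag name, attributes and text are all available.
pattern = re.compile("my text")

def wanted(tag):
    """Hypothetical helper: a div with class "score", or a span whose text matches."""
    if tag.name == "div":
        return "score" in tag.get("class", [])
    return tag.name == "span" and pattern.search(tag.get_text()) is not None

matches = soup.find_all(wanted)
print([tag.get_text() for tag in matches])
```

The trade-off is that every `<div>` and `<span>` is still parsed, so this saves less memory than a strainer that could reject on text, but it needs no changes to Beautiful Soup itself.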
