I have some simple code:
from bs4 import BeautifulSoup, SoupStrainer
text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""
for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    print(div)
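I expected this to print the two divs with the 'detail' class. Instead, I get the two divs and, for some reason, the doctype as well.
What is going on here? How can I avoid matching the doctype?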
Edit
I found a way to filter it out:
from bs4 import BeautifulSoup, SoupStrainer, Doctype
...
for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    # Skip the doctype node that comes back alongside the matching divs.
    if isinstance(div, Doctype):
        continue
I would still like to know how to avoid having to filter out the doctype when using SoupStrainer.
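A slightly more general form of the filter above, as a minimal sketch rather than a confirmed fix: keep only `Tag` objects from the strained soup, so a `Doctype` (or any other non-tag node) is skipped without being named explicitly. The helper name `detail_tags`, the constant `SAMPLE`, and the `parser` parameter are made up for illustration; the code in this post uses 'lxml'.

```python
from bs4 import BeautifulSoup, SoupStrainer, Tag

# Same markup as in the question.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""


def detail_tags(markup, parser="lxml"):
    """Hypothetical helper: parse only div.detail and return Tag nodes."""
    strainer = SoupStrainer("div", attrs={"class": "detail"})
    soup = BeautifulSoup(markup, parser, parse_only=strainer)
    # Iterating the soup yields its top-level nodes; keep only real tags,
    # which drops a Doctype node if the parser let one through.
    return [node for node in soup if isinstance(node, Tag)]
```

This still filters after parsing, so it does not answer why the doctype gets past the strainer; it only makes the filter independent of the node type being skipped.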
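The reason I want to use SoupStrainer rather than find_all is that SoupStrainer is almost twice as fast, which adds up to a difference of nearly 30 seconds over 1000 parsed pages: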
def soup_strainer(text):
    [div for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})) if type(div) is not Doctype]

def find_all(text):
    [div for div in BeautifulSoup(text, 'lxml').find_all('div', {'class': 'detail'})]

from timeit import timeit
# Total seconds for 1000 runs of each approach.
print(timeit('soup_strainer(text)', number=1000, globals=globals()))  # 38.091634516923584
print(timeit('find_all(text)', number=1000, globals=globals()))  # 65.1686057066947
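I don't think you need SoupStrainer for this task. The built-in findAll method should do what you need. I tested it and it seems to work fine: it creates a list of the divs you are looking for, not including the doctype. Hope this helps.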
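The code tested for this answer is not included here, so the following is only a sketch of what a `find_all` version might look like, not the answerer's original. The function name `find_detail_divs` and the `parser` parameter are assumptions for illustration. A search by tag name only matches `Tag` objects, so the doctype cannot end up in the result.

```python
from bs4 import BeautifulSoup


def find_detail_divs(markup, parser="lxml"):
    """Hypothetical helper: parse the whole document, then search it."""
    soup = BeautifulSoup(markup, parser)
    # class_ is bs4's keyword for matching the HTML class attribute.
    return soup.find_all("div", class_="detail")
```

The trade-off is the one measured in the question: the whole document is parsed before searching, which took about 65 seconds per 1000 runs there, against about 38 seconds with SoupStrainer.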