Python BeautifulSoup: parsing specific text

<HTML> <HEAD><TITLE></TITLE></HEAD> <BODY> <DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of large accelerated filer, accelerated filer and smaller reporting company. (Check one): </DIV> <DIV align="center"> <TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">  <TR valign="bottom"> <TD width="22%"> </TD> <TD width="3%"> </TD> <TD width="22%"> </TD> <TD width="3%"> </TD> <TD width="22%"> </TD> <TD width="3%"> </TD> <TD width="22%"> </TD> </TR> <TR></TR>   <TR valign="bottom"> <TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT> </TD> <TD> </TD> <TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT> </TD> <TD> </TD> <TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT> <FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT> </TD> <TD> </TD> <TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD> </TR>  </TABLE> </DIV></BODY></HTML>

3 Answers

User

Answer 1 · edited 2024-10-02 22:31:42

If you know the position of the Wingdings character won't change, you can use .next:

>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next  # last item in list is the only good one... kinda crap
u'&#254;'
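
With the current bs4 package, the same forward lookup might look like the sketch below. The markup is a shortened stand-in for the filing, and `find_next` is used in place of the chained `.next` calls; both are assumptions made for illustration, not the answerer's code.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the filing markup (an assumption for illustration).
html = (
    '<div>See the definitions of accelerated filer and smaller reporting company.</div>'
    '<table><tr>'
    '<td><font style="white-space: nowrap">Accelerated filer '
    '<font style="font-family: Wingdings">o</font></font></td>'
    '<td><font style="white-space: nowrap"> Smaller reporting company '
    '<font style="font-family: Wingdings">&#254;</font></font></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

# string= is the bs4 spelling of the old text= keyword.
# The pattern also matches the introductory sentence, not only the table label.
nodes = soup.find_all(string=re.compile(r"[sS]maller.*[rR]eporting.*[cC]ompany"))

# As in the answer, the last match is the label inside the table.
label = nodes[-1]

# Jump to the next <font> tag in document order and read its text.
mark = label.find_next("font").get_text()

print(repr(label.strip()), repr(mark))
```

Relying on the last match is as fragile here as it is in the original snippet, so this only holds while the page layout stays the same.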

Or you can go up to the parent and then use find from there.

[code not preserved in the source]

Or you can do it the other way around:

>>> soup.findAll(text='&#254;')[0].previous.previous
u' Smaller reporting company '

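This assumes you know which Wingdings character you are looking for.

The last strategy has the added benefit of filtering out the other junk the regex catches, which I imagine you don't really want. You can then loop over the results knowing you are only working with the right items, and apply whatever if checks you like at your leisure.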
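
A minimal sketch of that reverse lookup with bs4, assuming the checked box is always the character that `&#254;` decodes to (þ) and that the label is the text node directly before the Wingdings tag. The sample markup and the helper name `checked_labels` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the checkbox table (an assumption, not the full page).
html = """
<table><tr>
<td><font style="white-space: nowrap"> Large accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td><font style="white-space: nowrap">Accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
</tr></table>
"""

CHECKED = "\u00fe"  # the character that &#254; decodes to

def checked_labels(soup):
    """Return the label text next to every checked mark (hypothetical helper)."""
    labels = []
    for mark in soup.find_all(string=CHECKED):
        # mark.parent is the Wingdings <font>; the label is the text node before it.
        label = mark.parent.previous_sibling
        labels.append(label.strip())
    return labels

soup = BeautifulSoup(html, "html.parser")
print(checked_labels(soup))
```

Because the search starts from the mark itself, the introductory sentence and the "(Do not check ...)" note never enter the result list.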

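User

Answer 2 · edited 2024-10-02 22:31:42

You could try walking the structure and checking either the value inside the inner tag or the value in the outer tag. I can't remember exactly how to do it; I ended up using lxml for this, but I think BeautifulSoup can do it.

If you can't get BeautifulSoup to do it, take a look at lxml. It may be faster, depending on what you are doing. It also has hooks that let you use BeautifulSoup from within lxml.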
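
One way to read "walking the structure" is to visit each table cell and compare the outer label with the inner mark. The sketch below does that with bs4; the sample markup, the function name `filer_status`, and the choice of þ as the checked mark are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the checkbox table, including empty spacer cells.
html = """
<table><tr>
<td><font style="white-space: nowrap"> Large accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td> </td>
<td><font style="white-space: nowrap">Accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td> </td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
</tr></table>
"""

def filer_status(html, checked="\u00fe"):
    """Map each label to True/False depending on its mark (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    status = {}
    for td in soup.find_all("td"):
        outer = td.find("font")
        if outer is None:
            continue  # spacer cell without any label
        # stripped_strings yields the outer label first, then the inner mark.
        parts = list(outer.stripped_strings)
        if len(parts) < 2:
            continue  # no mark found in this cell
        status[parts[0]] = parts[-1] == checked
    return status

status = filer_status(html)
print(status)
```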

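User

Answer 3 · edited 2024-10-02 22:31:42

lxml has a tolerant HTML parser. You don't need BeautifulSoup (version 3 has been deprecated by its author), and you should avoid using regex to parse HTML.

Here is a rough first outline of what you are looking for: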

guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring

doc = fromstring(guff)
for td_el in doc.iter('td'):
    # Collect every <font> element nested inside this cell.
    font_els = list(td_el.iter('font'))
    if not font_els:
        continue  # spacer cell with no label
    print()  # blank line between cells
    for el in font_els:
        print(el.text, el.attrib)

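For each table cell that contains FONT elements, this prints the text and attributes of every one of them.

[sample output not preserved in the source]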
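
Building on that outline, a small sketch of how the checked label could be pulled out with lxml. It assumes the mark is the character that `&#254;` decodes to and that the label is the text the parent FONT holds before its first child; the sample markup and the helper name `checked_labels` are hypothetical.

```python
from lxml.html import fromstring

# Trimmed stand-in for the filing markup (an assumption, not the full page).
html = """
<html><body><table><tr>
<td><font style="white-space: nowrap"> Large accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td> </td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
</tr></table></body></html>
"""

def checked_labels(doc, mark="\u00fe"):
    """Return labels whose Wingdings mark equals `mark` (hypothetical helper)."""
    labels = []
    for font in doc.iter("font"):
        is_wingdings = "Wingdings" in font.get("style", "")
        if is_wingdings and (font.text or "").strip() == mark:
            # parent.text is the text before the first child element: the label.
            parent = font.getparent()
            labels.append((parent.text or "").strip())
    return labels

doc = fromstring(html)
print(checked_labels(doc))
```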
