Regex/BeautifulSoup HTML parsing - Q&A - Python中文网

Posted 2024-06-26 13:50:19

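Context: I have a large HTML document containing business data that I want to extract. I chose to use regular expressions, but I am open to BeautifulSoup if someone would rather provide BS logic to solve the problem. Below is a snippet of the document. The document contains a series of repeating HTML sections that follow the pattern shown. The text in bold marks the regex targets I want to extract. Also below is a snippet of the Python script I started while trying to extract the transaction description, which is the first field in the snippet (ISSUEMO).

The first function scans the document for transaction descriptions and prints the index position of each match: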

String match "ISSUEMO" at 15102:15109

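What I want to do in the second function is extract and print the transaction ID (1MOI-00237) that always follows the transaction description found by the first function.

HTML snippet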

<tr class="style_12" valign="top" align="left">
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_51" style=" text-align:left;">ISSUEMO</div>
                                                </td>
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_51" style=" text-align:left;">1MOI-00237</div>
                                            ...
                                            ...
                                            ...

                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_97" style=" text-align:right;">12.86</div>
                                                </td>
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_98" style=" text-align:right;">-64.30</div>
                                                </td>
                                            </tr>

Python

import re

# html_doc is assumed to already hold the full HTML document as a string.

def find_transaction_desc():
    # Alternation of the known transaction descriptions, compiled once.
    regex_pattern = re.compile(r'ADJQTY|ADJCST|ISSUEPAO|TRNFLOC|RCPTMISC|ISSUEPO|TRNFPAO|RESVN|ISSUEMO|RCPTMO|ADJSCRAP|TRNFRCPT|TRNFINSP|PO|RETVEND|TRNFMRB|PHYSCNT|REQ|SO|MO|APLYPOINFO|GENPO|STDCSTTVAR')

    # Print every match together with its start:end index in the document.
    for match in regex_pattern.finditer(html_doc):
        start = match.start()
        end = match.end()
        print('String match "%s" at %d:%d' % (html_doc[start:end], start, end))

find_transaction_desc()

#def extract_transaction_ids():

#extract_transaction_ids()

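Question: I am not a Python expert. Could someone provide some pointers, or a new pattern, for capturing and printing the ID, or the equivalent BS logic?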

1 answer

Answer #1 · posted 2024-06-26 13:50:19

If I understand you correctly, this is how it can be done with BeautifulSoup, at least with the sample HTML in your question (it may or may not work on your actual file):

from bs4 import BeautifulSoup as bs

soup = bs(html_doc, 'html.parser')
for item in soup.select('td'):
    if 'ISSUEMO' in item.text:
        # The transaction ID sits in the <td> right after the description cell.
        target = item.find_next_sibling('td')
        print(target.text.strip())
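
The loop above handles a single description. A minimal sketch of how it might be extended to several transaction descriptions follows. The function name `extract_transaction_ids` is borrowed from the stub in the question, while `TRANSACTION_CODES` (a small subset of the codes in the question's pattern) and the trimmed `sample` markup are assumptions made for illustration. It compares the whole cell text against the codes rather than using a substring test, so a short code such as MO does not match inside an ID such as 1MOI-00237.

```python
from bs4 import BeautifulSoup

# Hypothetical subset of the transaction codes listed in the question's pattern.
TRANSACTION_CODES = {"ISSUEMO", "RCPTMO", "ISSUEPO", "ADJQTY"}


def extract_transaction_ids(html_doc):
    """Return (description, transaction_id) pairs, one per matching cell."""
    soup = BeautifulSoup(html_doc, "html.parser")
    pairs = []
    for cell in soup.find_all("td"):
        description = cell.get_text(strip=True)
        # Exact comparison: the cell text must be one of the known codes.
        if description not in TRANSACTION_CODES:
            continue
        # The ID is assumed to live in the next <td> of the same row.
        id_cell = cell.find_next_sibling("td")
        if id_cell is not None:
            pairs.append((description, id_cell.get_text(strip=True)))
    return pairs


# Trimmed-down version of the row from the question (styles omitted).
sample = """
<table>
<tr><td><div class="style_51">ISSUEMO</div></td>
    <td><div class="style_51">1MOI-00237</div></td>
    <td><div class="style_97">12.86</div></td></tr>
</table>
"""

print(extract_transaction_ids(sample))
```

On the trimmed sample this should print a single pair for ISSUEMO; whether the sibling assumption holds for every row of the real document would need checking.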

Using lxml and XPath is actually easier:

import lxml.html as lh

doc = lh.fromstring(html_doc)
# Select the <td> whose text is ISSUEMO, then the first <td> that follows it.
target = doc.xpath('//td[normalize-space(.)="ISSUEMO"]/following-sibling::td[1]')
print(target[0].text_content().strip())

or

# Same selection, but going straight to the <div> inside the following cell.
target = doc.xpath('//td[normalize-space(.)="ISSUEMO"]/following-sibling::td[1]/div')
print(target[0].text.strip())

In both cases, the output is

1MOI-00237
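
Since the question started from regular expressions, here is a hedged regex-only sketch of the second function. It assumes, as in the sample, that every description and every ID is the sole content of a `<div>`, and that the ID is in the first `<div>` after the description. The name `ROW_PATTERN`, the trimmed `sample` markup, and the second row (RCPTMO / 1MOR-00012) are made up for illustration. Anchoring the codes between `>` and `</div>` keeps short codes such as MO from matching inside an ID such as 1MOI-00237, which the bare alternation in the question would do.

```python
import re

# Alternation copied from the question's pattern.
CODES = (
    r"ADJQTY|ADJCST|ISSUEPAO|TRNFLOC|RCPTMISC|ISSUEPO|TRNFPAO|RESVN|ISSUEMO|"
    r"RCPTMO|ADJSCRAP|TRNFRCPT|TRNFINSP|PO|RETVEND|TRNFMRB|PHYSCNT|REQ|SO|MO|"
    r"APLYPOINFO|GENPO|STDCSTTVAR"
)

# A <div> holding exactly one code, followed (lazily) by the next <div>,
# whose text is captured as the transaction ID.
ROW_PATTERN = re.compile(
    r"<div[^>]*>\s*(?P<desc>" + CODES + r")\s*</div>"
    r".*?<div[^>]*>\s*(?P<id>[^<]*?)\s*</div>",
    re.DOTALL,
)


def extract_transaction_ids(html_doc):
    """Return (description, transaction_id) pairs found in html_doc."""
    return [(m.group("desc"), m.group("id")) for m in ROW_PATTERN.finditer(html_doc)]


# First row trimmed from the question; the second row is hypothetical.
sample = """
<tr><td><div class="style_51" style=" text-align:left;">ISSUEMO</div></td>
    <td><div class="style_51" style=" text-align:left;">1MOI-00237</div></td>
    <td><div class="style_97" style=" text-align:right;">12.86</div></td></tr>
<tr><td><div class="style_51" style=" text-align:left;">RCPTMO</div></td>
    <td><div class="style_51" style=" text-align:left;">1MOR-00012</div></td></tr>
"""

for desc, transaction_id in extract_transaction_ids(sample):
    print(desc, transaction_id)
```

A regex like this depends on the exact markup layout, so the parser-based versions above are likely to be more robust if the document structure varies.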