Extracting text from "pseudo" HTML

Posted 2024-10-03 19:22:23


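I'm trying to regenerate work orders from a Manufacturing Execution System (MES) SQL database as .pdf files so that they can be printed in bulk rather than one at a time (the MES only allows one at a time).

I'm stuck on work instructions that contain links and the like (pseudo-HTML... I don't know what else to call it). I run a SQL query for the data I need and load the result into a DataFrame. Here is a sample from the DataFrame's "Text" column (the work instructions):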

"DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123 ,REF_ID=REFID))""><#Tab> MOA DWG: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))""> <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK ,REF_ID=REFID))""><#Tab>"

The data I'm trying to get back should look like this:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:

DWG 123456-123

MOA DWG:

DWG 98765 Plate
DWG 9999-8888 ARTWORK

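The information in there tends to be heavily copy-pasted, so finding a pattern is beyond my regex skills. Essentially, I think this would work if everything between "<" and ">" were removed, except for whatever sits between "@Caption=" and ",".

I also tried extracting the text with BeautifulSoup, but the captions never came through.

Any suggestions or help would be greatly appreciated.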


1 answer
User
#1 · Posted 2024-10-03 19:22:23

This can be done with plain string operations (rather than regular expressions), for example:

work = '''DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123 ,REF_ID=REFID))""><#Tab>
MOA DWG:
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECT ID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: FACEPLATES |',@Caption=DWG 98765 Plate,REF_ID=REFID))"">
<#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,@GlyphName=@InlineText,@Classification=General,@RenderDescText=True,'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK ,REF_ID=REFID))""><#Tab>"
'''

for line in work.splitlines():
    # Link lines look like "...'@Desc=| description |',@Caption=TEXT ,REF_ID=...",
    # so splitting on '|' yields three parts; plain text lines yield one.
    parts = line.split('|')
    if len(parts) > 1:
        # The caption sits between the first '=' and the next ',' of the last part.
        print(parts[2].split('=')[1].split(',')[0].strip())
    else:
        # No link on this line: print it unchanged.
        print(line)

Output:

DWG/TECH DATA: ALL TASK WITHIN THIS WORK ORDER ARE TO BE ACCOMPLISHED IAW:
DWG 123456-123
MOA DWG:
DWG 98765 Plate
DWG 9999-8888 ARTWORK
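
The snippet above relies on the text already being split into one link per line. If the value comes out of the DataFrame as a single line, as in the sample in the question, the rule described there (drop everything between "<" and ">" except the text between "@Caption=" and the next ",") can be sketched with the standard `re` module. This is only a rough sketch under some assumptions: tags never contain ">" inside them, every caption is followed by a comma, and each tag should act as a line break. The names `extract_text` and `sample` are made up for this illustration, and `sample` is a shortened version of the text from the question.

```python
import re

# Any <...> tag; assumes a tag never contains '>' inside it.
TAG = re.compile(r"<[^>]*>")
# The text between "@Caption=" and the next comma.
CAPTION = re.compile(r"@Caption=([^,]*),")


def extract_text(raw):
    """Return the visible lines of a pseudo-HTML work instruction."""

    def replace_tag(match):
        caption = CAPTION.search(match.group(0))
        if caption:
            # Link tag: keep only its caption, on a line of its own.
            return "\n" + caption.group(1).strip() + "\n"
        # Any other tag (e.g. <#Tab>) is treated as a line break.
        return "\n"

    text = TAG.sub(replace_tag, raw)
    # Drop the empty lines left behind by consecutive tags.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Shortened, single-line version of the sample from the question.
sample = (
    """DWG/TECH DATA: ACCOMPLISHED IAW: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJECTID,"""
    """'@Desc=| Description: PANEL, |',@Caption=DWG 123456-123 ,REF_ID=REFID))""><#Tab>"""
    """ MOA DWG: <#Tab><UT=""LinkInvoke(Slide(OBJECT_ID=OBJID,"""
    """'@Desc=| Description: ARTWORK |',@Caption=DWG 9999-8888 ARTWORK ,REF_ID=REFID))""><#Tab>"""
)

for line in extract_text(sample):
    print(line)
```

If this fits the real data, it could presumably be applied to the whole column with something like `df["Text"].apply(extract_text)`, where `df` and the column name are assumptions based on the description in the question.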
