Python: Using RegEx to find the full text only after a specific word in a string

Published 2024-09-30 01:29:42


text = list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment 
slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated 
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka 
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order 
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no 
vill bitta ta naliya abadasa despatched through destination march 18 terms of

Goal: I want to extract the text that follows the word "invoice", in particular the text after the second occurrence of "invoice".

My approach:

txt = re.findall('invoice (.*)',text)

With the approach above, I expected a list of strings like this:

txt = ['in favour of company z 02 cjpc abstract sheet weighment 
    slip goods receipt note iz checklist creator id name 30009460 xyz@abc.com
    checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
    written manually on the checklist will not be considered','parth enterprise â invoice no dated 
    kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment 
    taluka ..... #rest of the string]

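But what I get back is the whole string given in text, i.e. the original string. If I use text.partition('invoice'), I don't get the strings shown in txt either.

Any help would be greatly appreciated.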


3 answers

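The regex invoice (.*) matches the first literal invoice followed by a space, and then (.*) greedily captures all the remaining text into group 1. That is what is happening here, and it is the expected, correct behaviour.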
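A minimal sketch of the difference between greedy and lazy capture, using a short made-up string. The sample text and the simpler lookahead `(?= invoice|$)` are illustrative assumptions, not the pattern proposed in this answer.

```python
import re

# Hypothetical short sample, not the text from the question.
sample = "pay invoice alpha invoice beta"

# Greedy: (.*) runs to the end of the line, so there is only one match.
greedy = re.findall(r"invoice (.*)", sample)
print(greedy)  # ['alpha invoice beta']

# Lazy: (.*?) stops as soon as the lookahead sees the next " invoice"
# or the end of the string, so every occurrence starts a new match.
lazy = re.findall(r"invoice (.*?)(?= invoice|$)", sample)
print(lazy)  # ['alpha', 'beta']
```

Note that this simplified lazy pattern splits at every occurrence of invoice, which is not quite the output asked for in the question.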

If you want the output you described, however, you have to write the regex accordingly. The following regex gives the desired result:

invoice (.*?)(?=(?:(?:invoice.*){2,}|$))

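Regex explanation:

  • invoice  - matches the literal text invoice followed by a space
  • (.*?) - matches the following text lazily, i.e. as little as possible
  • (?=(?:(?:invoice.*){2,}|$)) - a lookahead that stops the lazy match at a position where invoice begins and at least one more invoice follows later (two or more occurrences from that point), or at the end of the input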
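To see how the lookahead behaves, here is a small sketch on made-up strings (the sample strings are assumptions, not taken from the question). It suggests that the pattern always keeps the last two segments together, which happens to give the requested result when invoice occurs exactly three times.

```python
import re

pattern = r"invoice (.*?)(?=(?:(?:invoice.*){2,}|$))"

# Hypothetical inputs with three and four occurrences of "invoice".
three = "x invoice a invoice b invoice c"
four = "x invoice a invoice b invoice c invoice d"

# With three occurrences the last one stays inside the second match,
# because only one "invoice" remains at that point and {2,} fails.
result_three = re.findall(pattern, three)
print(result_three)  # ['a ', 'b invoice c']

# With four occurrences the same rule applies to the final pair.
result_four = re.findall(pattern, four)
print(result_four)  # ['a ', 'b ', 'c invoice d']
```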


Python demo

import re

s = '''list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of'''
print(re.findall(r'invoice (.*?)(?=(?:(?:invoice.*){2,}|$))', s))

This prints the output you want:

['in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered ', 'parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of']
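One caveat, in case the real text contains newline characters (the question shows it wrapped over several lines, though that may only be display wrapping): by default `.` does not match a newline, so the pattern behaves differently. The sketch below uses a made-up three-line string and shows two possible workarounds, passing `re.DOTALL` or collapsing whitespace first. Both are suggestions, not part of the original answer.

```python
import re

pattern = r"invoice (.*?)(?=(?:(?:invoice.*){2,}|$))"

# Hypothetical multi-line sample with three occurrences of "invoice".
multiline = "original invoice in favour\nof company invoice parth\nenterprise invoice no dated"

# Without DOTALL, "." cannot cross "\n", so only the last line matches.
without_flag = re.findall(pattern, multiline)
print(without_flag)  # ['no dated']

# With DOTALL, "." also matches newlines and the matches span lines.
with_flag = re.findall(pattern, multiline, flags=re.DOTALL)
print(with_flag)  # ['in favour\nof company ', 'parth\nenterprise invoice no dated']

# Alternative: collapse all whitespace runs to single spaces first.
normalised = " ".join(multiline.split())
flattened = re.findall(pattern, normalised)
print(flattened)  # ['in favour of company ', 'parth enterprise invoice no dated']
```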

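If you want the 2 matches from your question, you can use 2 capturing groups.

First match up to the first occurrence of invoice. Then capture everything up to the second occurrence of invoice in group 1.

Then match invoice again and capture the rest of the string in group 2.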

^.*? invoice (.*?) invoice (.*)


For example:

import re

text = "list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of"
regex = r"^.*? invoice (.*?) invoice (.*)"

matches = re.search(regex, text)

if matches:
    print(matches.group(1))
    print('\n')
    print(matches.group(2))

Output:

in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 xyz@abc.comchecklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered


parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of
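The two-group idea can be wrapped in a small helper. The sketch below is one possible variant, not the answer's exact pattern: the function name `two_parts_after` is hypothetical, the leading `^.*? ` is dropped so that a text starting with the keyword also matches, and `re.DOTALL` is added on the assumption that the input may contain newlines.

```python
import re

def two_parts_after(text, word):
    """Return (text between 1st and 2nd word, text after 2nd word), or None."""
    # re.escape keeps any regex metacharacters in `word` literal.
    escaped = re.escape(word)
    pattern = rf"{escaped} (.*?) {escaped} (.*)"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None

# Hypothetical short inputs for illustration.
print(two_parts_after("x invoice a invoice b invoice c", "invoice"))
# ('a', 'b invoice c')
print(two_parts_after("invoice a invoice b", "invoice"))
# ('a', 'b')
print(two_parts_after("only one invoice here", "invoice"))
# None
```

Returning None when the keyword occurs fewer than twice mirrors the `if matches:` check in the example above.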

This can be done easily with the split() method. For example:

myText = "jhon is going abroad jhon is thinking about future jhon is absent"

# 1) text after the first occurrence of jhon
print(myText.split('jhon', 1)[1])
# output -> " is going abroad jhon is thinking about future jhon is absent"

# 2) text after the second occurrence of jhon
print(myText.split('jhon', 2)[2])
# output -> " is thinking about future jhon is absent"

# 3) text after the third occurrence of jhon
print(myText.split('jhon', 3)[3])
# output -> " is absent"

Each result keeps the space that followed jhon; call .strip() on it if that space is not wanted.
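Indexing the split result raises IndexError when the word occurs fewer times than requested. A minimal sketch of a guarded helper follows; the name `text_after_nth` and the choice to return None are assumptions made for illustration.

```python
def text_after_nth(text, word, n):
    """Return the text after the n-th occurrence of word, or None if absent."""
    # maxsplit=n yields at most n + 1 pieces; the last one is the remainder.
    parts = text.split(word, n)
    if len(parts) <= n:
        return None  # fewer than n occurrences of word
    return parts[n].strip()

my_text = "jhon is going abroad jhon is thinking about future jhon is absent"
print(text_after_nth(my_text, "jhon", 2))  # is thinking about future jhon is absent
print(text_after_nth(my_text, "jhon", 4))  # None

# Applied to a hypothetical short version of the question's text:
print(text_after_nth("x invoice a invoice b invoice c", "invoice", 2))  # b invoice c
```

Keep in mind that str.split matches plain substrings, so a word such as "invoices" would also count as an occurrence.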
