Extracting the text that follows a specific set of characters from a text file using regular expressions in Python

Posted 2024-09-27 22:21:55


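Hello, I have text in the format shown below. I would like to save each name (for example: 2ND COMPLEX OF NEURAL SCIENCES) together with its a.k.a. names and the original name in a dictionary, in the following format.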

I tried the following code, but it fails to extract the pattern:

re.findall(r'[a-z A-z 0-9 /n/-]+', ^[a.k.a.][a-z A-z 0-9 /n/-]+', textData)
re.findall(r'a.k.a. : (\S+)', textData)

I have no idea how to do this. Can anyone help?


#Expected output

"2ND COMPLEX OF NEURAL SCIENCES":["2ND COMPLEX OF NATURAL NEURAL", "ACADEMY OF NEURAL 
SCIENCES", "CHE 2 CHAON KWAHAK-WON", "KUKPAN KAHAK-WON", "SECOND COMPLEX OF NEURAL SCIENCES 
RESEARCH INSTITUTE"]

"LOSTIK VE HAVAIK HIZMETLARI LTD":["LOSTIK VE HAVAIK HIZMETLARI LTD"]

"7 KARNES":["7 KARNES"]

"SWING OF TIR":["7TH OF TIR COMPLEX", "7TH OF TIR INDUSTRIAL COMPLEX", "7TH OF TIR 
INDUSTRIES", "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN", "MOJTAMAE SANATE HAFTOME TIR" etc]

#textData.txt

2ND COMPLEX OF NEURAL SCIENCES (a.k.a. ACADEMY OF NEURAL 
SCIENCES; a.k.a. CHE 2 CHAON KAHAK-WON; a.k.a. CHE 2 CHAYON KAHAK-WON;
a.k.a. KUKPAN KAHAK-WON; a.k.a. NATIONAL DEFENSE ACADEMY; a.k.a.
SANSRI; a.k.a. SECOND COMPLEX OF NEURAL SCIENCES; a.k.a. SECOND
COMPLEX OF NEURAL SCIENCES RESEARCH INSTITUTE), Pyongyang, Korea,
North; Secondary sanctions risk: North Korea Sanctions Regulations,
sections 510.201 and 510.210; Transactions Prohibited For Persons
Owned or Controlled By U.S. Financial Institutions: North Korea
Sanctions Regulations section 510.214.

LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions.
[IFSR] (Linked To: MAHAN AIR).

7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia).

SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL
COMPLEX; a.k.a. 7TH OF TIR INDUSTRIES; a.k.a. 7TH OF TIR INDUSTRIES
OF ISFAHAN/ESFAHAN; a.k.a. MOJTAMAE SANATE HAFTOME TIR; a.k.a.
SANAYE HAFTOME TIR; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,
Isfahan, Iran; P.O. Box 81465-478, Isfahan, Iran; Additional
Sanctions Information - Subject to Secondary Sanctions.


2 answers

You seem to be confused about what square brackets mean. It may help to review "What is the difference between square brackets and parentheses in a regex?"
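As a rough illustration of that difference, here is a minimal sketch. The `sample` string is an abbreviated record made up from the question's data, and the variable names are only for illustration.

```python
import re

# Abbreviated record, shortened from the question's data for illustration.
sample = "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. SEVENTH OF TIR)"

# [a.k.a.] is a character class: it matches ONE character that is
# either 'a', 'k' or '.', not the literal text "a.k.a.".
char_class_hits = re.findall(r"[a.k.a.]", sample)

# a\.k\.a\. is a literal sequence (dots escaped); the parentheses form a
# capture group that collects everything up to the next ';' or ')'.
group_hits = re.findall(r"a\.k\.a\. ([^;)]+)", sample)

print(char_class_hits)  # single characters only
print(group_hits)       # the alias names
```

The character class returns a list of single characters, while the capture group returns the alias names themselves.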

Your requirements are not entirely clear, but perhaps something like this:

import re

with open('textData.txt', 'r') as lines:
    text = lines.read()

for segment in text.split('\n\n'):
    para = ' '.join(segment.splitlines())
    if para:
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))

This assumes that each record is separated from the others by an empty line, and that the file is small enough to fit in memory.
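If the file were too large to read in one go, one possible variation is to walk it line by line and emit a record whenever a blank line is reached. The sketch below reuses the two regular expressions from the code above; the helper names `iter_records` and `parse_record` and the `sample_lines` data are hypothetical, and the result is collected in a dictionary instead of being printed.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one record at a time, joining its lines with single spaces."""
    buffer: List[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            # A blank line ends the current record.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Last record when the input does not end with a blank line.
        yield " ".join(buffer)


def parse_record(para: str) -> Optional[Tuple[str, List[str]]]:
    """Return (name, [name, alias, ...]) or None if no name is found."""
    name = re.match(r'^[^,()]+(?=, | \()', para)
    if not name:
        return None
    akas = [name.group(0)]
    akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
    return name.group(0), akas


# Stand-in for an open file object; shortened from the question's data.
sample_lines = [
    "SWING OF TIR (a.k.a. 7TH OF TIR INDUSTRIAL\n",
    "COMPLEX; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,\n",
    "Isfahan, Iran.\n",
    "\n",
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota,\n",
    "Colombia.\n",
]

result = {}
for para in iter_records(sample_lines):
    parsed = parse_record(para)
    if parsed:
        result[parsed[0]] = parsed[1]

print(result)
```

With a real file, `iter_records` can be given the file object directly, for example inside `with open('textData.txt') as f:`, since a file object yields its lines lazily.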

You can use 2 capture groups, and split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.

Using re.findall will return the capture group values.

^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?

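The pattern matches:

  • ^ Start of string (start of each line when re.M is used)
  • ( Capture group 1
    • [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b Match uppercase letters, digits and spaces, starting and ending with a letter or digit (so not ending on a space), followed by a word boundary
  • ) Close group 1
  • (?: Non-capture group
    • \( Match (
    • ( Capture group 2
      • a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* Match a.k.a. followed by any characters other than ( and ), optionally repeated
    • ) Close group 2
    • \) Match )
  • )? Close the non-capture group and make it optional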
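To see how two capture groups and an optional part behave with re.findall, here is a small sketch. It uses a deliberately simplified pattern (`simple_pattern`, not the one above) and made-up sample text, so treat the names as assumptions for illustration only.

```python
import re

# Made-up two-line sample: the first record has aliases, the second has none.
text = "ALPHA (a.k.a. A1; a.k.a. FIRST)\nBETA, Somewhere"

# Simplified stand-in for the full pattern: group 1 is the name,
# group 2 is the optional parenthesised alias block.
simple_pattern = r"^([A-Z]+)(?: \(([^()]+)\))?"

# With two capture groups, re.findall returns a list of 2-tuples.
# A group that did not take part in the match comes back as ''.
matches = re.findall(simple_pattern, text, re.M)
print(matches)

# Splitting group 2 on the "a.k.a. " separator leaves an empty first item,
# which the list comprehension filters out.
aliases = [
    [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", group2) if p]
    for _, group2 in matches
]
print(aliases)
```

The empty string for a missing group 2 is why the filtering step `if p` is needed before building the list of names.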


For example:

import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

with open('textData.txt') as f:
    textData = f.read()
    d = {}
    for t in re.findall(pattern, textData, re.M):
        parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
        parts.insert(0, (t[0]))
        d[t[0]] = parts

    pprint.pprint(d)

Output:

{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR COMPLEX',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
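Some of the values in this output still contain the line breaks from the source file (for example '7TH OF TIR INDUSTRIAL\nCOMPLEX'), because [^()]+ also matches newlines. If single-line names are wanted, one possible post-processing step is to collapse every run of whitespace into one space. The `normalize` helper and the shortened dictionary `d` below are assumptions for illustration.

```python
def normalize(name: str) -> str:
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(name.split())


# Shortened copy of the dictionary shown in the output above.
d = {
    "2ND COMPLEX OF NEURAL SCIENCES": [
        "2ND COMPLEX OF NEURAL SCIENCES",
        "ACADEMY OF NEURAL \nSCIENCES",
    ],
    "SWING OF TIR": [
        "SWING OF TIR",
        "7TH OF TIR INDUSTRIAL\nCOMPLEX",
    ],
}

cleaned = {key: [normalize(v) for v in values] for key, values in d.items()}
print(cleaned)
```

Calling `str.split()` without arguments splits on any whitespace, so this handles both the "space plus newline" and the plain newline cases.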
