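Hello, I have text in the format shown below. From it I want to save each name (for example: 2ND COMPLEX OF NEURAL SCIENCES) together with its a.k.a. names and the original name, in a dictionary of the following format.
I tried the following code, but it fails to extract the pattern: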
re.findall(r'[a-z A-z 0-9 /n/-]+', ^[a.k.a.][a-z A-z 0-9 /n/-]+', textData)
re.findall(r'a.k.a. : (\S+)', textData)
I have no idea how to approach this. Can anyone help?
# Expected output
"2ND COMPLEX OF NEURAL SCIENCES":["2ND COMPLEX OF NATURAL NEURAL", "ACADEMY OF NEURAL
SCIENCES", "CHE 2 CHAON KWAHAK-WON", "KUKPAN KAHAK-WON", "SECOND COMPLEX OF NEURAL SCIENCES
RESEARCH INSTITUTE"]
"LOSTIK VE HAVAIK HIZMETLARI LTD":["LOSTIK VE HAVAIK HIZMETLARI LTD"]
"7 KARNES":["7 KARNES"]
"SWING OF TIR":["7TH OF TIR COMPLEX", "7TH OF TIR INDUSTRIAL COMPLEX", "7TH OF TIR
INDUSTRIES", "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN", "MOJTAMAE SANATE HAFTOME TIR" etc]
#textData.txt
2ND COMPLEX OF NEURAL SCIENCES (a.k.a. ACADEMY OF NEURAL
SCIENCES; a.k.a. CHE 2 CHAON KAHAK-WON; a.k.a. CHE 2 CHAYON KAHAK-WON;
a.k.a. KUKPAN KAHAK-WON; a.k.a. NATIONAL DEFENSE ACADEMY; a.k.a.
SANSRI; a.k.a. SECOND COMPLEX OF NEURAL SCIENCES; a.k.a. SECOND
COMPLEX OF NEURAL SCIENCES RESEARCH INSTITUTE), Pyongyang, Korea,
North; Secondary sanctions risk: North Korea Sanctions Regulations,
sections 510.201 and 510.210; Transactions Prohibited For Persons
Owned or Controlled By U.S. Financial Institutions: North Korea
Sanctions Regulations section 510.214.
LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions.
[IFSR] (Linked To: MAHAN AIR).
7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia).
SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL
COMPLEX; a.k.a. 7TH OF TIR INDUSTRIES; a.k.a. 7TH OF TIR INDUSTRIES
OF ISFAHAN/ESFAHAN; a.k.a. MOJTAMAE SANATE HAFTOME TIR; a.k.a.
SANAYE HAFTOME TIR; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,
Isfahan, Iran; P.O. Box 81465-478, Isfahan, Iran; Additional
Sanctions Information - Subject to Secondary Sanctions.
You seem to be confused about what square brackets mean. It may help to review "What is the difference between square brackets and parentheses in a regex?"
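To make the comment above concrete, here is a small sketch of the difference. The sample string is taken from the question's data; the variable names are only illustrative.

```python
import re

sample = "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX)"

# [a.k.a.] is a character class: it matches ONE character that is
# 'a', 'k' or '.', not the text "a.k.a.".
char_class_hits = re.findall(r"[a.k.a.]", sample)

# a\.k\.a\. matches the literal text "a.k.a." (dots escaped).
literal_hits = re.findall(r"a\.k\.a\.", sample)

# Parentheses create a capture group; findall then returns only the group.
group_hits = re.findall(r"a\.k\.a\. ([^;)]+)", sample)

print(char_class_hits)
print(literal_hits)
print(group_hits)
```

This is why the attempt with [a.k.a.] in the question cannot find the alias text: the brackets describe a set of single characters rather than a sequence.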
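Your requirements are not entirely clear, but something like this might work.
This assumes that each record is separated from the others by a blank line, and that the file is small enough to fit in memory.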
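The approach described here could be sketched roughly as follows. This is not the answerer's original code. The function name parse_records, the rule that the primary name ends at the first " (" or ",", and the stripping of a trailing "." are all assumptions made for illustration.

```python
import re

def parse_records(text):
    """Map each record's primary name to its list of a.k.a. names."""
    result = {}
    # Assumption: records are separated by one or more blank lines.
    for record in re.split(r"\n\s*\n", text.strip()):
        flat = " ".join(record.split())  # undo line wrapping inside a record
        # Assumption: the primary name ends at the first " (" or ",".
        name = re.split(r" \(|,", flat, maxsplit=1)[0].rstrip(".")
        aliases = [a.strip() for a in re.findall(r"a\.k\.a\.\s+([^;)]+)", flat)]
        result[name] = aliases or [name]  # no aliases: fall back to the name
    return result

# Shortened sample in the question's format, with a blank line between records.
sample = """SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR
INDUSTRIES), Mobarakeh Road Km 45, Isfahan, Iran.

7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota, Colombia."""

records = parse_records(sample)
print(records)
```

For the real file, the text could be read in one go, for example with open("textData.txt").read(), which is where the "fits in memory" assumption comes from.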
You can use 2 capture groups, and split the value of group 2 on
(?:;\s)?a\.k\.a\.\s
to get the separate values. Using re.findall will return the capture group values.
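As a small illustration of the split step on its own, the sketch below applies that separator pattern to a hand-written group 2 value (abbreviated from the question's data). Whitespace normalization is added here because the aliases can wrap across lines.

```python
import re

# A hand-written example of what capture group 2 might contain.
group2 = (
    "a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL\n"
    "COMPLEX; a.k.a. SEVENTH OF TIR"
)

# Split on an optional "; " followed by "a.k.a." and one whitespace character.
parts = re.split(r"(?:;\s)?a\.k\.a\.\s", group2)

# The first part is empty because the string starts with "a.k.a. ";
# drop empty parts and collapse line breaks inside each alias.
aliases = [" ".join(p.split()) for p in parts if p.strip()]
print(aliases)
```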
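The pattern matches:
^  Start of string
(  Capture group 1
[A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b  Match uppercase letters, digits and spaces, starting and ending with a letter or digit (so the match cannot end in a space), followed by a word boundary
)  Close group 1
(?:  Non-capture group
\(  Match (
(  Capture group 2
a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*  Match the repeated parts that start with a.k.a., each followed by any characters except ( and )
)  Close group 2
\)  Match )
)?  Close the non-capture group and make it optional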
For example:
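The following is a minimal sketch of how the pattern and the split could be combined. It is not the answerer's original code. It applies the pattern to one record at a time with match rather than findall, and the \s* before \( is an assumption added so that the space between the name and the opening parenthesis is accepted. The extract helper and the abbreviated sample records are hypothetical.

```python
import re

# Pattern from the answer; the \s* before \( is an assumption added here.
PATTERN = re.compile(
    r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)"
    r"(?:\s*\((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"
)
SPLIT = re.compile(r"(?:;\s)?a\.k\.a\.\s")

def extract(record):
    """Return (name, aliases) for one record, or None if it does not match."""
    m = PATTERN.match(record)
    if m is None:
        return None
    name, group2 = m.group(1), m.group(2)
    if group2 is None:
        return name, [name]  # no a.k.a. section: use the name itself
    aliases = [" ".join(p.split()) for p in SPLIT.split(group2) if p.strip()]
    return name, aliases

records = [  # hypothetical: one string per record, already separated
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a.\n"
    "SEVENTH OF TIR), Mobarakeh Road Km 45, Isfahan, Iran.",
    "LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe, Istanbul 34840, Turkey.",
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota, Colombia.",
]
result = dict(filter(None, map(extract, records)))
print(result)
```

Matching per record avoids a pitfall of running the pattern over the whole file with re.M: wrapped continuation lines such as "SCIENCES; a.k.a. ..." also begin with uppercase text and would be picked up as names.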