How do I fine-tune regex syntax to parse tricky, dirty text?

Posted 2024-09-29 19:18:59


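I have an OCR'ed .txt file that is one volume of book review data (it is a book review index). I am trying to separate out the authors, titles, and review data. I have been able to separate the authors cleanly, but I still cannot cleanly separate the titles from the review data. Here is a sample from the .txt file: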

MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630 MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396 MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 
1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325 c MA De - Crickets Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta

Here is a cleaner version, to give a better idea of what I am trying to separate:

MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183
c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630
MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396
MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074
MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681
MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285
MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39
MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39
MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38
MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757
MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68
MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164
MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624
r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796
MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294
y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555
MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651
MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55
MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62
c MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325
c MA De - Crickets Cur R - v20 - S '81 - p325
c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325
c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325
c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325
MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71
r

Here is my code:

# read in review volume .txt file
import pandas as pd
import numpy as np
import re

file = '/Users/sinykin/Dropbox/US_LIT_PRODUCTION_DATA/REVIEWS_DATA/BOOK_REVIEWS_INDEX_TEXTS/1965_1984_Vol_5_M-P.txt'

with open(file) as f:
    content = f.readlines()
    content = [x.strip() for x in content]
    content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Now replace all found authors with XXX_XXX
# re.sub returns the input unchanged when nothing matches, so content2 is
# always defined (the original search guard left it unset on no match)
content2 = re.sub(pattern, 'XXX_XXX', content)

# Now get all the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))"
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
print(titles[:1000])

What I am struggling with is the regex in pattern2. For example, it currently gives me titles like these:

[('The Overall Survey Of The Ocean\xe2\x80\x99s Shores 1433', '3'),
('Ying-Yai Sheng-Lan AHR', 'H'),
('Commercial Development And Urban Change In Sung China 960-1279 JAS', 'A'),
('Pac A', ' '),
('Summer \xe2\x80\x9972', '7'),
('The Environment JAS', 'A'),
('Urban Development In Modern China', 'n'),
('Cook Chinese AB', 'A'),
('Don\xe2\x80\x99t Lick The Chopsticks CSM', 'S'),
('Mrs. Ma\xe2\x80\x99s Japanese Cooking VQR', 'Q'),
('Spring \xe2\x80\x9982', '8'),
('Microscale Manipulations In Chemistry', 'r')

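As you can see, I am getting extra data after the titles, specifically the capital letters that mark the abbreviations of the reviewing periodicals.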

Can you help me refine my regex to capture just the titles?


2 Answers

Question: ... refine my regex to capture just the titles?

Split content2 on these patterns to get a list of books:

books = []
for c in re.split(r'(XXX_XXX|MA, | MA, | MA )', content2):
    if c and c not in ['MA, ', ' MA, ', ' MA ', 'XXX_XXX']:
        books.append(c.strip())
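
The filter in the loop above is needed because `re.split` returns the separators themselves whenever the pattern contains a capturing group. A minimal sketch of that behavior follows; the `sample` string is made up for illustration and is not taken from the real file:

```python
import re

# Made-up stand-in for content2: two entries, each preceded by the marker.
sample = "XXX_XXXHuan Chiu Hsin Ying BL - v70XXX_XXXPei Niang BL - v77"

# With a capturing group, the separators appear in the result list.
with_group = re.split(r"(XXX_XXX)", sample)

# Without a capturing group, only the text between separators is returned.
without_group = re.split(r"XXX_XXX", sample)

# Either way, empty strings (and separators, if present) must be dropped.
books = [part.strip() for part in with_group if part and part != "XXX_XXX"]
print(books)
```

If the separators are not needed afterwards, dropping the parentheses from the pattern leaves only the empty strings to filter out.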

Loop over books to extract the title:

reObj = re.compile(r'(.+?)(( [A-Z\&]{1,4})?| Choice| Cur ?R) ?- ?[’v][l]?(\d{1,2}|O )')
for book in books:
    match = reObj.match(book)
    if match:
        title = match.groups()[0]
        print('{}'.format(title))
    else:
        print('FAIL:{}'.format(book))

Output:

Chi-Hua - Huan Chiu Hsin Ying
Ching-Hsien - Pei Niang Niang Ti Ku Shih
Hsin-Teh - Chinese Women In The Great Leap Forward
The Overall Survey Of The Ocean’s Shores 1433
Ying-Yai Sheng-Lan
Commercial Development And Urban Change In Sung China 960-1279
The Environment
Urban Development In Modern China
Cook Chinese
Don’t Lick The Chopsticks
... (omitted for brevity)
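
To try the same regex without the original file, here is a self-contained sketch that runs it on a few entries copied from the sample in the question. The helper name `extract_title` is hypothetical, and the entries are assumed to have the author prefix already stripped:

```python
import re

# Regex from the loop above: a lazy title, an optional periodical code
# (up to four capitals, "Choice", or "Cur R"), then the start of the first
# citation such as "- v70" or "- vl3".
TITLE_RE = re.compile(
    r"(.+?)(( [A-Z\&]{1,4})?| Choice| Cur ?R) ?- ?[’v][l]?(\d{1,2}|O )"
)

def extract_title(book):
    """Return the title part of one entry, or None if no citation is found."""
    match = TITLE_RE.match(book)
    return match.group(1) if match else None

entries = [
    "Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183",
    "The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074",
    "Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578",
    "The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325",
    "MAAILMA Katettu Poyta",  # no citation, so no match is expected
]

for entry in entries:
    title = extract_title(entry)
    print(title if title is not None else "FAIL:" + entry)
```

Because the title group is lazy, the match stops at the first place where a periodical code followed by a volume marker can be found, which is why hyphens inside a title such as Ying-Yai Sheng-Lan do not cut it short.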

Tested with Python: 3.4.2 - re: 2.2.1

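You can keep building on the last positive lookahead by adding a few specific expressions that filter out the unwanted trailing characters. For example, expanding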

(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))

into

(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))

will eliminate Cur R and the codes at the end (AB, JAS, and so on):

('The Environment', 'n')
('Urban Development In Modern China', 'n')
('Cook Chinese', 's')
('Nancy Chih \xe2\x80\xa2 Don\xe2\x80\x99t Lick The Chopsticks', 'k')
('Mrs. Ma\xe2\x80\x99s Japanese Cooking', 'n')
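
As a complementary sketch (a separate cleanup pass, not the regex change described above), similar alternatives can be applied to titles that have already been extracted. The helper name `strip_periodical_code` and the 2-to-4-letter range are assumptions chosen to fit this sample; a title that genuinely ends in a short all-caps word would be trimmed as well:

```python
import re

# Assumed shape of a trailing periodical code: "Cur R" / "CurR", "Choice",
# or two to four capitals (AB, JAS, AHR, ARBA, ...), anchored at the end.
TRAILING_CODE = re.compile(r"\s+(?:Cur ?R|Choice|[A-Z&]{2,4})$")

def strip_periodical_code(title):
    """Remove one trailing periodical abbreviation, if present."""
    return TRAILING_CODE.sub("", title)

candidates = [
    "Ying-Yai Sheng-Lan AHR",
    "The Environment JAS",
    "Cook Chinese AB",
    "The Cowherd And The Weaving Maid Cur R",
    "Urban Development In Modern China",  # already clean, left unchanged
]

for candidate in candidates:
    print(strip_periodical_code(candidate))
```

Keeping the cleanup in its own step leaves the main title pattern untouched, at the cost of a second pass over the results.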
