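<p>I have an OCR'ed .txt file that contains one volume of book review data (it is a book review index). I am trying to separate the author, title, and review data. I have been able to separate the authors cleanly, but I still cannot cleanly separate the titles from the review data. Here is a sample of the .txt file:</p>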
<blockquote>
<p>MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih
BL - v77 - Ja 1 ’81 - p630 MA, Hsin-Teh - Chinese Women In The Great Leap Forward
Choice - v20 - N ’82 - p396 MA, Huan • The Overall Survey Of The Ocean’s Shores 1433
Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279
JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China
Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking
VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry
Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320
Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation
Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid
Cur R - v20 - S '81 -p325 c MA De - Crickets
Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun
S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob
Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta</p>
</blockquote>
<p>Here is a cleaner version, to give a better idea of what I am trying to separate:</p>
<blockquote>
<p>MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c
MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630
MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396
MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074
MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681
MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285
MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39
MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39
MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38
MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757
MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68
MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164
MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r
MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796
MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y
MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555
MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651
MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55
MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c
MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325 c
MA De - Crickets Cur R - v20 - S '81 - p325 c
MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c
MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c
MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325
MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r </p>
</blockquote>
<p>Here is my code:</p>
<pre><code># Read in the review volume .txt file
import pandas as pd
import numpy as np
import re

file = '/Users/sinykin/Dropbox/US_LIT_PRODUCTION_DATA/REVIEWS_DATA/BOOK_REVIEWS_INDEX_TEXTS/1965_1984_Vol_5_M-P.txt'
with open(file) as f:
    content = f.readlines()
content = [x.strip() for x in content]
content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Replace every matched author with the marker XXX_XXX
# (re.sub leaves the text unchanged when nothing matches)
content2 = re.sub(pattern, 'XXX_XXX', content)

# Split on the marker to get the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))"
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
print(titles[:1000])
</code></pre>
<p>The part I am struggling with is the regex in pattern2. For example, this is what it currently returns as titles:</p>
<blockquote>
<p>[('The Overall Survey Of The Ocean\xe2\x80\x99s Shores 1433', '3'),<br/>
('Ying-Yai Sheng-Lan AHR', 'H'),<br/>
('Commercial Development And Urban Change In Sung China 960-1279 JAS', 'A'),<br/>
('Pac A', ' '),<br/>
('Summer \xe2\x80\x9972', '7'),<br/>
('The Environment JAS', 'A'),<br/>
('Urban Development In Modern China', 'n'),<br/>
('Cook Chinese AB', 'A'),<br/>
('Don\xe2\x80\x99t Lick The Chopsticks CSM', 'S'),<br/>
('Mrs. Ma\xe2\x80\x99s Japanese Cooking VQR', 'Q'),<br/>
('Spring \xe2\x80\x9982', '8'),<br/>
('Microscale Manipulations In Chemistry', 'r')</p>
</blockquote>
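<p>A side note on the shape of that output: the second element of each tuple appears to come from the inner capturing group <code>((?!...).)</code> in pattern2. When a pattern contains more than one capturing group, <code>re.findall</code> returns one tuple per match, and a repeated group keeps only its last iteration. A minimal sketch using toy patterns (not the real pattern2) to show the effect:</p>

```python
import re

text = "abc de"

# Two capturing groups: findall returns (outer, inner) tuples, and the
# repeated inner group keeps only the last character it matched.
with_inner_group = re.findall(r"(\w(\w)*)", text)

# Making the inner group non-capturing leaves a single group, so findall
# returns plain strings instead of tuples.
without_inner_group = re.findall(r"(\w(?:\w)*)", text)

print(with_inner_group)     # [('abc', 'c'), ('de', 'e')]
print(without_inner_group)  # ['abc', 'de']
```

<p>Writing the inner group as <code>(?:(?!...).)</code> should therefore remove the stray second element, although it does not by itself stop the title from running into the journal abbreviation.</p>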
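<p>As you can see, I am getting extra data after the titles, specifically the capital letters that mark the review abbreviations.</p>
<p>Can you help me improve my regex so that it captures only the titles?</p>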
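<p>One possible direction, offered as a sketch under assumptions rather than a definitive fix: anchor on the start of the review citations instead of on the title, and treat everything before the first citation as the title. This assumes the journal abbreviations are known in advance. The <code>JOURNALS</code> list and the <code>split_title_reviews</code> helper are hypothetical names, and the list only covers abbreviations visible in the sample above.</p>

```python
import re

# Assumption: the set of journal abbreviations is known. This hypothetical
# list only covers abbreviations that occur in the sample above.
JOURNALS = [
    "AB", "AHR", "ANQ", "ARBA", "BF", "BL", "Choice", "CRL", "CSM",
    "Cur R", "CurR", "Econ Bks", "GJ", "HR", "JAL", "JAS", "JLH", "JMH",
    "Kliatt", "LJ", "LQ", "Pac A", "PW", "Rel St Rev", "S&T", "TLS",
    "VQR", "WIT", "WP",
]

# Try longer abbreviations first so multi-word entries such as "Cur R"
# are preferred over shorter ones.
alternation = "|".join(
    re.escape(j) for j in sorted(JOURNALS, key=len, reverse=True)
)

# A review block starts at whitespace, then a known abbreviation, then a
# dash or bullet (U+2022) separator, with or without spaces around it.
REVIEW_START = re.compile(r"\s(?:" + alternation + r")\s*[-\u2022]")


def split_title_reviews(record):
    """Split one 'title + reviews' string at the first known abbreviation."""
    match = REVIEW_START.search(record)
    if match is None:
        # No known abbreviation found: treat the whole record as the title.
        return record.strip(), ""
    return record[:match.start()].strip(), record[match.start():].strip()


samples = [
    "Ying-Yai Sheng-Lan AHR - v76 - D '71 - pl578 GJ - vl37 - Je '71 - p213",
    "Chinese Women In The Great Leap Forward Choice - v20 - N '82 - p396",
    "Microscale Manipulations In Chemistry Choice-vl3-N '76 -pi 164",
]
for sample in samples:
    title, reviews = split_title_reviews(sample)
    print(repr(title), "|", repr(reviews))
```

<p>The trade-off is that a title whose last word happens to be on the list and is followed by a dash would be cut short, and OCR variants (such as <code>CurR</code> for <code>Cur R</code>) each need their own entry.</p>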