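I have an OCR'ed .txt file of a volume of book review data (it is a book review index). I am trying to separate the author, title, and review data. I have been able to separate the authors cleanly, but I still can't cleanly separate the titles from the review data. Here is a sample of the .txt file: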
MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630 MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396 MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 
1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325 c MA De - Crickets Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta
Here is a cleaner version, to give a better idea of what I am trying to separate:
MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183
c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630
MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396
MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074
MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681
MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285
MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39
MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39
MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38
MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757
MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68
MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164
MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624
r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796
MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294
y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555
MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651
MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55
MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62
c MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325
c MA De - Crickets Cur R - v20 - S '81 - p325
c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325
c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325
c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325
MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71
r
Here is my code:
# read in review volume .txt file
import pandas as pd
import numpy as np
import re

file = '/Users/sinykin/Dropbox/US_LIT_PRODUCTION_DATA/REVIEWS_DATA/BOOK_REVIEWS_INDEX_TEXTS/1965_1984_Vol_5_M-P.txt'
with open(file) as f:
    content = f.readlines()
content = [x.strip() for x in content]
content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Now replace all found authors with XXX_XXX
if re.search(pattern, content):
    r = re.compile(pattern)
    content2 = r.sub(r'XXX_XXX', content)

# Now get all the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))"
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
print(titles[:1000])
What I am struggling with is the regular expression in pattern2. For example, this is what it currently returns as titles:
[('The Overall Survey Of The Ocean\xe2\x80\x99s Shores 1433', '3'),
('Ying-Yai Sheng-Lan AHR', 'H'),
('Commercial Development And Urban Change In Sung China 960-1279 JAS', 'A'),
('Pac A', ' '),
('Summer \xe2\x80\x9972', '7'),
('The Environment JAS', 'A'),
('Urban Development In Modern China', 'n'),
('Cook Chinese AB', 'A'),
('Don\xe2\x80\x99t Lick The Chopsticks CSM', 'S'),
('Mrs. Ma\xe2\x80\x99s Japanese Cooking VQR', 'Q'),
('Spring \xe2\x80\x9982', '8'),
('Microscale Manipulations In Chemistry', 'r')
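As you can see, I am getting extra data after the titles, specifically the capital letters that mark the review abbreviations.
Can you help me improve my regular expression so that it captures only the titles?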
Split content2 on these patterns to get a list of books, then loop over books to extract the title from each one.
Tested with Python 3.4.2, re 2.2.1.
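A minimal sketch of this split-then-loop idea, not the answerer's original code. It assumes the journal abbreviations are known in advance: JOURNALS (filled here from the sample only) and split_title are hypothetical names, the author pattern is the asker's with the OCR bullet characters written as \u escapes, and the sample text is abridged with plain ASCII apostrophes.

```python
import re

# The asker's author pattern; \u2022 and \u25a0 are the bullet and square
# characters that OCR sometimes produces instead of the dash.
author_pattern = r"[A-Z\-]{2,},+\s[A-Za-z\s,().]+\s[-*.\u2022\u25a0 ]"

# Hypothetical list of journal abbreviations, taken from the sample only.
# A real run would need the full abbreviation list of the index.
JOURNALS = ["BL", "ANQ", "JMH", "Choice", "Cur R", "Pac A"]

# Longest abbreviations first, so a longer name wins over a shorter one.
journal_alt = "|".join(re.escape(j) for j in sorted(JOURNALS, key=len, reverse=True))

# A review citation starts with a known abbreviation followed by a dash.
citation_start = re.compile(r"\s+(?:" + journal_alt + r")\s*-")


def split_title(book):
    """Split the text that follows one author heading into (title, reviews)."""
    m = citation_start.search(book)
    if m is None:
        # No known abbreviation found: treat the whole chunk as the title.
        return book.strip(), ""
    return book[:m.start()].strip(), book[m.start():].strip()


content = ("MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 "
           "y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 '79 - p30 "
           "MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71")

# Splitting on the author headings leaves one chunk per book.
books = re.split(author_pattern, content)[1:]
results = [split_title(book) for book in books]
for title, reviews in results:
    print(title, "|", reviews)
```

The result is only as good as the abbreviation list. A single trailing letter at the end of a reviews string (c, r, y) is the flag of the next entry and may need to be stripped separately.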
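You can keep building on the final group, the one that contains the positive lookahead, by adding a few specific alternatives that filter out the unwanted trailing characters. For example, extending
(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))
to
(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))
eliminates Cur R and the trailing review codes (AB, JAS, etc.).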
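To see the effect of the two extra alternatives, here is a small comparison on single book chunks. ORIGINAL is the asker's pattern2 and EXTENDED is the same pattern with the additions above; first_title is a hypothetical helper, and the chunks are taken from the sample with plain ASCII apostrophes.

```python
import re

# The asker's pattern2, unchanged.
ORIGINAL = (r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)"
            r"(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))")

# The same pattern with the two extra alternatives in the final group.
EXTENDED = (r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)"
            r"(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))")


def first_title(pattern, chunk):
    """Return group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, chunk)
    return m.group(1) if m else None


chunks = [
    " Cook Chinese AB - v45 - My 25 '70 - pl786",
    " Crickets Cur R - v20 - S '81 - p325",
    " Chinese Women In The Great Leap Forward Choice - v20 - N '82 - p396",
]

for chunk in chunks:
    print(repr(first_title(ORIGINAL, chunk)), "->", repr(first_title(EXTENDED, chunk)))
```

Because \s+[A-Z]{2,3} has no way of knowing what a review code is, it also cuts a title at any word inside it that begins with two capital letters, and mixed-case abbreviations such as Econ Bks or Kliatt would each need their own alternative, as Cur R does here.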