
Published 2024-09-29 19:18:59



MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630 MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396 MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 
1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325 c MA De - Crickets Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta




# read in review volume .txt file
import pandas as pd
import numpy as np
import re

file = '/Users/sinykin/Dropbox/US_LIT_PRODUCTION_DATA/REVIEWS_DATA/BOOK_REVIEWS_INDEX_TEXTS/1965_1984_Vol_5_M-P.txt'

with open(file) as f:
    content = f.readlines()
    content = [x.strip() for x in content]
    content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Now replace all found authors with XXX_XXX
# (re.sub returns the text unchanged when nothing matches, so content2 is always defined)
content2 = re.sub(pattern, 'XXX_XXX', content)

# Now get all the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
print(titles[:1000])


[('The Overall Survey Of The Ocean\xe2\x80\x99s Shores 1433', '3'),
('Ying-Yai Sheng-Lan AHR', 'H'),
('Commercial Development And Urban Change In Sung China 960-1279 JAS', 'A'),
('Pac A', ' '),
('Summer \xe2\x80\x9972', '7'),
('The Environment JAS', 'A'),
('Urban Development In Modern China', 'n'),
('Cook Chinese AB', 'A'),
('Don\xe2\x80\x99t Lick The Chopsticks CSM', 'S'),
('Mrs. Ma\xe2\x80\x99s Japanese Cooking VQR', 'Q'),
('Spring \xe2\x80\x9982', '8'),
('Microscale Manipulations In Chemistry', 'r')
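
The two-element tuples in this output come from how re.findall treats capturing groups: with more than one group it returns one tuple per match, and a group that sits under a repetition keeps only the last character it matched. The sketch below shows this on a single entry. The patterns two_groups and one_group are simplified stand-ins written for illustration, not the pattern from the question.

```python
import re

text = " The Environment JAS - v42"

# Two capturing groups: findall returns (group 1, group 2) tuples, and the
# repeated inner group keeps only the last character it consumed.
two_groups = r"\s+([A-Z][a-z]+((?!\s+[A-Z]{2,}\s+-).)*)"
print(re.findall(two_groups, text))   # [('The Environment', 't')]

# Making the inner group non-capturing leaves a single group, so findall
# returns plain strings instead of tuples.
one_group = r"\s+([A-Z][a-z]+(?:(?!\s+[A-Z]{2,}\s+-).)*)"
print(re.findall(one_group, text))    # ['The Environment']
```

Switching the inner group to (?:...) is usually enough to get a flat list of titles without changing what the pattern matches.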




Question: ... refine my regex to capture just the titles?


books = []
# The capturing group makes re.split return the separators too, so filter them out
for c in re.split('(XXX_XXX|MA, | MA, | MA )', content2):
    if c and c not in ['MA, ', ' MA, ', ' MA ', 'XXX_XXX']:
        books.append(c)


# Lazily capture the title, then require a journal code followed by a volume marker
reObj = re.compile(r'(.+?)(( [A-Z\&]{1,4})?| Choice| Cur ?R) ?- ?[’v][l]?(\d{1,2}|O )')
for book in books:
    match = reObj.match(book)
    if match:
        title = match.group(1)
        print(title)


Chi-Hua - Huan Chiu Hsin Ying
Ching-Hsien - Pei Niang Niang Ti Ku Shih
Hsin-Teh - Chinese Women In The Great Leap Forward
The Overall Survey Of The Ocean’s Shores 1433
Ying-Yai Sheng-Lan
Commercial Development And Urban Change In Sung China 960-1279
The Environment
Urban Development In Modern China
Cook Chinese
Don’t Lick The Chopsticks
... (omitted for brevity)
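
A self-contained variant of the same idea, as a minimal sketch rather than a finished parser: split the text at each author heading, then cut every entry at its first citation. It assumes a heading is a capitalized surname followed by a comma, and that a citation looks like a journal code followed by "- v" and a volume number (with the OCR sometimes printing the digit 1 as the letter l). The names sample, ENTRY_START, ENTRY and extract_titles are hypothetical, and the list of two-word journal codes is only a guess that would need extending for the full file.

```python
import re

# Three entries copied from the sample data in the question.
sample = (
    "MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396 "
    "MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 "
    "MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38"
)

# Split before every "SURNAME, " heading; the lookahead keeps the heading
# attached to the entry that follows it.
ENTRY_START = re.compile(r"\s+(?=[A-Z]{2,}, )")

# author, separator, lazily matched title, then the first "<journal> - v<volume>".
# Two-word journal codes are listed explicitly (assumed, incomplete list).
ENTRY = re.compile(
    r"^(?P<author>[A-Z]{2,}, .+?) [-•] "
    r"(?P<title>.+?) "
    r"(?P<journal>Cur ?R|Pac A|\S+) ?- ?v[l\d]+"
)

def extract_titles(text):
    """Return (author, title) pairs for every entry that matches ENTRY."""
    results = []
    for entry in ENTRY_START.split(text.strip()):
        m = ENTRY.match(entry)
        if m:
            results.append((m.group("author"), m.group("title")))
    return results

pairs = extract_titles(sample)
for author, title in pairs:
    print(author, "|", title)
```

Keeping the author in its own named group avoids the given names leaking into the title, as happens in the first three lines of the output above. Entries that cite a review without a volume number would not match and are skipped by this sketch.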





Ending the pattern with this alternation:

(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))

will eliminate Cur R and the trailing journal codes (AB, JAS, and so on):

('The Environment', 'n')
('Urban Development In Modern China', 'n')
('Cook Chinese', 's')
('Nancy Chih \xe2\x80\xa2 Don\xe2\x80\x99t Lick The Chopsticks', 'k')
('Mrs. Ma\xe2\x80\x99s Japanese Cooking', 'n')

