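<p>You can join your text back into a single string and use a <em>regex</em> to extract the information you need. Each line appears to follow a loose structure:</p>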
<pre><code>up to the 1st "-": authors
after the authors: some unwanted stuff, followed by
year: 4 digits with spaces around them, before the last "-"
from the last "-": publisher
</code></pre>
<p>I would use the following expression, compiled with the <code>re.M</code> flag: <code>r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$'</code></p>
<pre><code>'^(?P<author>[^-]+)'   capture from the start of the line up to the first - into group author
'(.+?)'                capture as little as possible into an unnamed group
' (?P<year>\d{4}).*-'  match a space, capture 4 digits into group year,
                       then skip everything up to the last -
'(?P<pub>.+)$'         capture everything after that until the end of the line into group pub
</code></pre>
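<p>To see what each named group ends up holding, here is a minimal sketch that applies the same pattern to a single line taken from the question's data. The variable names <code>line</code> and <code>fields</code> are only illustrative. Note that the unnamed group does not appear in <code>groupdict()</code>.</p>

```python
import re

# Same pattern as described above; re.M lets ^ and $ match at every line break.
pattern = re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M)

# One sample line from the question's data.
line = 'RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'

m = pattern.match(line)
# groupdict() returns only the named groups: author, year and pub.
fields = m.groupdict()
print(fields)
```

<p>The author value keeps its trailing space and the publisher value keeps its leading space, because the spaces next to the hyphens fall inside those groups.</p>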
<p>Then iterate over the matches in the joined text:</p>
<pre><code>text=[['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'],
['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'],
['SJ Oldenburg, JB Jackson, SL Westcott… - Applied Physics …, 1999 - aip.scitation.org'],
['RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'],
['LR Hirsch, JB Jackson, A Lee, NJ Halas… - Analytical …, 2003 - ACS Publications'],
['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'],
['AM Gobin, MH Lee, NJ Halas, WD James… - Nano …, 2007 - ACS Publications'],
['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… - Nano …, 2008 - ACS Publications'],
['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'],
['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS']]
# until 1st "-" : authors
# from last "-" : publisher
# year: 4 digit with spaces around it
import re
# re.M == multiline
pattern = re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$',re.M)
t = '\n'.join(a for b in text for a in b)
auth = []
year = []
pub = []
for p in pattern.finditer(t):
    auth.append(p.group("author"))
    year.append(p.group("year"))
    pub.append(p.group("pub"))
print("Authors: ",auth)
print("Years: ",year)
print("Publishers: ",pub)
</code></pre>
<p>Output:</p>
<pre><code>Authors: ['LR Hirsch, AM Gobin, AR Lowery, F Tam… ',
'C Loo, A Lowery, N Halas, J West, R Drezek ',
'SJ Oldenburg, JB Jackson, SL Westcott… ',
'RD Averitt, SL Westcott, NJ Halas ',
'LR Hirsch, JB Jackson, A Lee, NJ Halas… ',
'SJ Oldenburg, RD Averitt, NJ Halas ',
'AM Gobin, MH Lee, NJ Halas, WD James… ',
'JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… ',
'JB Jackson, NJ Halas ',
'RD Averitt, D Sarkar, NJ Halas ']
Years: ['2006', '2005', '1999', '1999', '2003', '2002', '2007', '2008', '2001', '1997']
Publishers: [' Springer', ' ACS Publications', ' aip.scitation.org',
' osapublishing.org', ' ACS Publications', ' Google Patents',
' ACS Publications', ' ACS Publications', ' ACS Publications', ' APS']
</code></pre>
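<p>The captures can be improved: feel free to tweak the pattern and trim some of the whitespace here and there. I suggest treating this as a starting point and refining the pattern on <a href="http://regex101.com" rel="nofollow noreferrer">http://regex101.com</a> (set to Python) until you are fully satisfied.</p>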
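<p>One simple way to drop the stray whitespace, without touching the pattern itself, is to call <code>str.strip()</code> on each captured value. The sketch below is only one possible refinement; the helper name <code>parse_citations</code> and the variable names <code>sample</code> and <code>records</code> are hypothetical and not part of the original code.</p>

```python
import re

# Pattern copied unchanged from the answer above.
pattern = re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M)

def parse_citations(lines):
    """Return one dict per matching line, with surrounding whitespace stripped."""
    joined = '\n'.join(lines)
    return [
        {name: value.strip() for name, value in m.groupdict().items()}
        for m in pattern.finditer(joined)
    ]

# Two lines from the question's data, flattened to plain strings.
sample = [
    'C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications',
    'RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS',
]
records = parse_citations(sample)
for record in records:
    print(record)
```

<p>Keeping each entry as a single dict also avoids having three parallel lists that must stay in sync. Be aware that the pattern still assumes no hyphen appears inside the author list, so a hyphenated name would cut the author group short.</p>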