Nested list to small lists (split by category)

3 answers

User

Answer 1 · edited 2024-10-01 11:38:47

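You can join your text back into a single string and use a regex to extract the information you need. The data looks fairly regular (per line):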

up to the 1st "-": authors
after the authors comes some unwanted stuff, followed by
year: 4 digits with spaces around them, before the next "-", and
from the last "-": publisher

I would use the following pattern: re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M):

'^(?P<author>[^-]+)'       Capture from the start of the line up to the first "-" into group author
'(.+?)'                    Lazily capture the text in between into an unnamed group
' (?P<year>\d{4})'         Match a space, then capture 4 digits into group year
'.*-'                      Skip everything up to the last "-" on the line
'(?P<pub>.+)$'             Capture everything after that until the end of the line into group pub
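
To see what each group captures, the pattern can be tried on a single line first. This is a minimal sketch using one line from the sample data; the names `line`, `m` and `fields` are only for illustration.

```python
import re

pattern = re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M)

line = 'RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS'
m = pattern.search(line)

# groupdict() returns only the named groups; the unnamed middle group is left out
fields = m.groupdict()
print(fields)
# {'author': 'RD Averitt, D Sarkar, NJ Halas ', 'year': '1997', 'pub': ' APS'}
```

Note the trailing space in the author and the leading space in the publisher: the pattern captures the whitespace next to each "-".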

Then iterate over the joined text:

text=[['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'], 
 ['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'], 
 ['SJ Oldenburg, JB Jackson, SL Westcott… - Applied Physics …, 1999 - aip.scitation.org'], 
 ['RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'], 
 ['LR Hirsch, JB Jackson, A Lee, NJ Halas… - Analytical …, 2003 - ACS Publications'], 
 ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'], 
 ['AM Gobin, MH Lee, NJ Halas, WD James… - Nano …, 2007 - ACS Publications'], 
 ['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… - Nano …, 2008 - ACS Publications'], 
 ['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'], 
 ['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS']]

# until 1st "-" : authors
# from last "-" : publisher
# year: 4 digit with spaces around it
import re

# re.M == multiline
pattern = re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$',re.M)

t = '\n'.join(a for b in text for a in b)

auth = []
year = []
pub = []
for p in pattern.finditer(t):
    auth.append(p.group("author"))
    year.append(p.group("year"))
    pub.append(p.group("pub"))

print("Authors: ",auth)
print("Years: ",year)
print("Publishers: ",pub)

Output:

Authors:  ['LR Hirsch, AM Gobin, AR Lowery, F Tam… ',
           'C Loo, A Lowery, N Halas, J West, R Drezek ',
           'SJ Oldenburg, JB Jackson, SL Westcott… ', 
           'RD Averitt, SL Westcott, NJ Halas ', 
           'LR Hirsch, JB Jackson, A Lee, NJ Halas… ', 
           'SJ Oldenburg, RD Averitt, NJ Halas ', 
           'AM Gobin, MH Lee, NJ Halas, WD James… ', 
           'JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… ', 
           'JB Jackson, NJ Halas ', 
           'RD Averitt, D Sarkar, NJ Halas ']

Years:  ['2006', '2005', '1999', '1999', '2003', '2002', '2007', '2008', '2001', '1997']

Publishers:  [' Springer', ' ACS Publications', ' aip.scitation.org', 
              ' osapublishing.org', ' ACS Publications', ' Google Patents', 
              ' ACS Publications', ' ACS Publications', ' ACS Publications', ' APS']

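Your captures can be refined: feel free to tweak the pattern and trim some of the whitespace here and there. I suggest treating this as a starting point and refining the pattern at http://regex101.com (set to Python) until you are completely satisfied.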
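
One possible refinement along those lines is sketched below. It assumes the fields are always separated by " - " (a hyphen with whitespace on both sides) and matches each string on its own instead of the joined text. The name `tight` and this tightened pattern are illustrative assumptions, not part of the original answer.

```python
import re

# Assumed tightening: the " - " separators and surrounding whitespace are
# matched outside the named groups, so the captures need no strip().
tight = re.compile(r'(?P<author>[^-]+?)\s+-\s.+?\s(?P<year>\d{4})\s+-\s+(?P<pub>.+)')

sample = [
    'C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications',
    'SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents',
]

# fullmatch() returns None for a string that does not fit the pattern,
# so real data should be checked before calling groupdict()
rows = [tight.fullmatch(s).groupdict() for s in sample]
for row in rows:
    print(row)
```

Like the original pattern, this still stops the author group at the first hyphen, so a hyphenated author name would need a different approach.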

User

Answer 2 · edited 2024-10-01 11:38:47

big_list = [['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'], 
 ['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'], 
 ['SJ Oldenburg, JB Jackson, SL Westcott… - Applied Physics …, 1999 - aip.scitation.org'], 
 ['RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'], 
 ['LR Hirsch, JB Jackson, A Lee, NJ Halas… - Analytical …, 2003 - ACS Publications'], 
 ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'], 
 ['AM Gobin, MH Lee, NJ Halas, WD James… - Nano …, 2007 - ACS Publications'], 
 ['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… - Nano …, 2008 - ACS Publications'], 
 ['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'], 
 ['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS']]

authors_list = [[n.strip() for n in l[0].split('-')[0].split(',')] for l in big_list]
years_list = [int(l[0].split('-')[1].split(',')[-1]) for l in big_list]
publishers_list = [l[0].split('-')[2].strip() for l in big_list]
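
The comprehensions above split on every "-", which works for this data but would misplace fields if an author name or a journal title contained a hyphen. A hedged variant, assuming the separator is always " - " with a space on each side, splits once from the left and once from the right. The two-entry `sample` list is just a subset of the data for illustration.

```python
sample = [
    ['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'],
    ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'],
]

authors_list, years_list, publishers_list = [], [], []
for (entry,) in sample:
    # The first " - " ends the authors, the last " - " starts the publisher
    authors, rest = entry.split(' - ', 1)
    middle, publisher = rest.rsplit(' - ', 1)
    authors_list.append([name.strip() for name in authors.split(',')])
    # The year is the text after the last comma of the middle part
    years_list.append(int(middle.rsplit(',', 1)[-1]))
    publishers_list.append(publisher.strip())

print(authors_list)
print(years_list)
print(publishers_list)
```

Using `split(' - ', 1)` and `rsplit(' - ', 1)` keeps any extra hyphens inside the middle part, where they do not affect the three fields being extracted.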

User

Answer 3 · edited 2024-10-01 11:38:47

This produces nested lists of small lists (split by category), the same type as the desired output.

import re

authors = []
years = []
publications = []

text=[['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'], 
 ['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'], 
 ['SJ Oldenburg, JB Jackson, SL Westcott… - Applied Physics …, 1999 - aip.scitation.org'], 
 ['RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'], 
 ['LR Hirsch, JB Jackson, A Lee, NJ Halas… - Analytical …, 2003 - ACS Publications'], 
 ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'], 
 ['AM Gobin, MH Lee, NJ Halas, WD James… - Nano …, 2007 - ACS Publications'], 
 ['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… - Nano …, 2008 - ACS Publications'], 
 ['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'], 
 ['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS']]

# Raw string; no trailing comma after the closing "']", so the last entry also matches
regex = r"\['(?P<author>[A-Za-z\s,]+)(.*?),\s+(?P<year>\d{4})\s+-\s+(?P<publication>.*?)'\]"

# str(text) turns the nested list into its printed form, which the pattern scans
matches = re.finditer(regex, str(text))

for match in matches:
    authors.append([match.group('author').strip()])
    years.append([match.group('year').strip()])
    publications.append([match.group('publication').strip()])

print('Authors = ', authors)
print('Year = ', years)
print('Publisher =', publications)

Output:

Authors =  [['LR Hirsch, AM Gobin, AR Lowery, F Tam'], ['C Loo, A Lowery, N Halas, J West, R Drezek'], ['SJ Oldenburg, JB Jackson, SL Westcott'], ['RD Averitt, SL Westcott, NJ Halas'], ['LR Hirsch, JB Jackson, A Lee, NJ Halas'], ['SJ Oldenburg, RD Averitt, NJ Halas'], ['AM Gobin, MH Lee, NJ Halas, WD James'], ['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl'], ['JB Jackson, NJ Halas'], ['RD Averitt, D Sarkar, NJ Halas']]
Year =  [['2006'], ['2005'], ['1999'], ['1999'], ['2003'], ['2002'], ['2007'], ['2008'], ['2001'], ['1997']]
Publisher = [['Springer'], ['ACS Publications'], ['aip.scitation.org'], ['osapublishing.org'], ['ACS Publications'], ['Google Patents'], ['ACS Publications'], ['ACS Publications'], ['ACS Publications'], ['APS']]
