Here is my code, inside a function:
    xArray = []
    for t in range(npapers):
        nHeader = []
        headers = browser.find_elements_by_xpath("(//div[@class='gs_a'])[%s]"%(t+1))
        for nheaders in headers:
            nHeader.append(nheaders.text)
        xArray.append(nHeader)
    return xArray
It prints one big list, which looks like this:
[['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'],
['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'],
['SJ Oldenburg, JB Jackson, SL Westcott… - Applied Physics …, 1999 - aip.scitation.org'],
['RD Averitt, SL Westcott, NJ Halas - JOSA B, 1999 - osapublishing.org'],
['LR Hirsch, JB Jackson, A Lee, NJ Halas… - Analytical …, 2003 - ACS Publications'],
['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'],
['AM Gobin, MH Lee, NJ Halas, WD James… - Nano …, 2007 - ACS Publications'],
['JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl… - Nano …, 2008 - ACS Publications'],
['JB Jackson, NJ Halas - The Journal of Physical Chemistry B, 2001 - ACS Publications'],
['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS']]
I tried to split it up to get smaller lists out of the big one, for example:
Authors = [[LR Hirsch, AM Gobin, AR Lowery, F Tam],[C Loo, A Lowery, N Halas, J West, R Drezek],[SJ Oldenburg, JB Jackson, SL Westcott],[RD Averitt, SL Westcott, NJ Halas],[LR Hirsch, JB Jackson, A Lee, NJ Halas],[SJ Oldenburg, RD Averitt, NJ Halas],[AM Gobin, MH Lee, NJ Halas, WD James],[JB Lassiter, J Aizpurua, LI Hernandez, DW Brandl],[JB Jackson, NJ Halas],[RD Averitt, D Sarkar, NJ Halas]]
Year = [[2006],[2005],[1999],[1999],[2003],[2002],[2007],[2008],[2001],[1997]]
Publisher = [[Springer],[ACS Publications],[aip.scitation.org],[osapublishing.org],[ACS Publications],[Google Patents],[ACS Publications],[ACS Publications],[ACS Publications],[APS]]
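You can join your strings back into a single text and use a regex to extract the information you need. The data looks reasonably structured (one record per line).
I would use the following expression: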
    re.compile(r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M)
Then iterate over the matches in the joined text.
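A minimal sketch of that approach, assuming the scraped rows look like the sample in the question. The names `x_array`, `pattern`, and `records` are illustrative, and only three of the ten rows are reproduced to keep it short. The captured groups carry stray whitespace, so they are stripped here.

```python
import re

# Three rows copied from the question's output; each inner list holds one string.
x_array = [
    ['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'],
    ['C Loo, A Lowery, N Halas, J West, R Drezek - Nano letters, 2005 - ACS Publications'],
    ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'],
]

# The expression from the answer, compiled in multiline mode so that
# ^ and $ anchor at the start and end of every line.
pattern = re.compile(
    r'^(?P<author>[^-]+)(.+?) (?P<year>\d{4}).*-(?P<pub>.+)$', re.M)

# Join the one-element lists back into a single text, one record per line.
text = "\n".join(row[0] for row in x_array)

records = []
for match in pattern.finditer(text):
    records.append({
        "author": match.group("author").strip(),
        "year": match.group("year"),
        "publisher": match.group("pub").strip(),
    })

for record in records:
    print(record)
```

Because the author group is `[^-]+`, an author name that itself contains a hyphen would be cut short; that is one of the cases worth testing before relying on the pattern.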
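The captures can be improved; feel free to fiddle with the pattern here and there and trim some of the whitespace. I suggest taking this as a starting point and refining the pattern at http://regex101.com (set to python) until you are fully satisfied.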
Another option produces a list of small nested lists of the same shape as the desired output, split by category.
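One way to build the nested lists per category is sketched below. It uses plain string splitting instead of a regex, which is a substitute technique and not necessarily what the answer originally showed. The function name `split_by_category` is hypothetical, and the sketch assumes every row has the form `authors - venue, year - publisher` with the year as the last comma-separated field of the middle part.

```python
def split_by_category(rows):
    """Split rows of 'authors - venue, year - publisher' into three nested lists."""
    authors, years, publishers = [], [], []
    for row in rows:
        line = row[0]
        # The first " - " ends the author list; the last " - " starts the publisher.
        author_part, rest = line.split(" - ", 1)
        venue_year, publisher = rest.rsplit(" - ", 1)
        # Assumption: the year is the last comma-separated field before the publisher.
        year = int(venue_year.rsplit(",", 1)[-1].strip())
        # Drop the trailing ellipsis that marks a truncated author list.
        names = [name.strip() for name in author_part.rstrip("…").split(",")]
        authors.append(names)
        years.append([year])
        publishers.append([publisher.strip()])
    return authors, years, publishers


# Three rows copied from the question's output.
rows = [
    ['LR Hirsch, AM Gobin, AR Lowery, F Tam… - Annals of biomedical …, 2006 - Springer'],
    ['SJ Oldenburg, RD Averitt, NJ Halas - US Patent 6,344,272, 2002 - Google Patents'],
    ['RD Averitt, D Sarkar, NJ Halas - Physical Review Letters, 1997 - APS'],
]

authors, years, publishers = split_by_category(rows)
print(authors)
print(years)
print(publishers)
```

A row that lacks a year in the expected position would raise a `ValueError` at the `int(...)` call, so real scraped data may need a guard around that step.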