Using Python to match paper IDs in scholarly

Posted 2024-09-30 16:23:26

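I have the following list of Google Scholar authors: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print the titles of the papers on which at least three of them appear.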

You can use scholarly to get the paper information for each author:

from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)

The output looks something like this (just a small excerpt from one author):

                  {'bib': {'cites': '69',
         'title': 'Chalearn looking at people and faces of the world: Face '
                  'analysis workshop and challenge 2016',
         'year': '2016'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
 'source': 'citations'},
                  {'bib': {'cites': '21',
         'title': 'The NoXi database: multimodal recordings of mediated '
                  'novice-expert interactions',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
 'source': 'citations'},
                  {'bib': {'cites': '11',
         'title': 'Automatic habitat classification using image analysis and '
                  'random forest',
         'year': '2014'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
 'source': 'citations'},
                  {'bib': {'cites': '10',
         'title': 'AutoRoot: open-source software employing a novel image '
                  'analysis approach to support fully-automated plant '
                  'phenotyping',
         'year': '2017'},
 'filled': False,
 'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
 'source': 'citations'}

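How can I collect the bib, and in particular the title, of the papers that appear for three or more of the four authors?

Edit: it has been pointed out that id_citations is in fact not unique per paper, so I was wrong about that. It is better to use the title itself.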


2 Answers

Expanding on my comment, you can do this with a Pandas groupby:

import pandas as pd
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []

for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with authors
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())

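# joining all author DataFrames; ignore_index gives every row a unique label
df = pd.concat(frames, axis=0, ignore_index=True)

# taking bib dict into separate columns, selected by name so they line up
bib = pd.DataFrame(df.bib.to_list())
df[['title', 'cites', 'year']] = bib[['title', 'cites', 'year']]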

# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index

This finds 202 papers (out of 774) that have 2 or more of the listed authors. Here is a sample of the output:

Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
       '722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
       'A Luminescent One-Dimensional Copper (I) Polymer',
       'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
       'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
       'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
       'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
       'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
       'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
       'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
       ...
       'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
       'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
       'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
       'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
       'Virginia review of sociology',
       'Wildlife use of log landings in the White Mountain National Forest',
       'Yttrium 1995',
       'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
       '[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
       'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
      dtype='object', name='title', length=202)

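Since all the data is in Pandas, you can also explore which of the listed authors are attached to each paper, along with all the other information available in the author.publications array from scholarly.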
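As a rough illustration of that kind of exploration, here is a minimal sketch that runs offline on made-up rows instead of querying Google Scholar. The titles, author names, and the `min_authors` variable are placeholders I introduced; the only assumption carried over from the answer is that the DataFrame has one row per author/publication pair with `title` and `author` columns.

```python
import pandas as pd

# Made-up rows standing in for the concatenated DataFrame above
# (one row per author/publication pair); names and titles are placeholders.
df = pd.DataFrame(
    {
        "title": ["Paper A", "Paper A", "Paper A", "Paper B", "Paper B", "Paper C"],
        "author": ["Author 1", "Author 2", "Author 3", "Author 1", "Author 2", "Author 3"],
    }
)

# For each title, collect the sorted list of distinct authors it was found under.
authors_per_title = df.groupby("title")["author"].agg(lambda s: sorted(s.unique()))

# Keep only the titles shared by at least `min_authors` of the listed authors.
min_authors = 3
shared = authors_per_title[authors_per_title.map(len) >= min_authors]
print(shared)
```

Setting `min_authors` to 3 matches the threshold asked for in the question, while 2 corresponds to the filter used in the answer above.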

First, let's convert this into a friendlier format. You say id_citations is unique to each paper, so we'll use it as the hash table/dict key.

Then we can map each id_citation to the bib dicts and authors it appears with, as a list of tuples (bib, author_name):

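from scholarly import scholarly

author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    # take the first matching profile and fill it, as in the question
    author = next(search_query).fill()
    for pub in author.publications:
        # vars(pub) gives the dict form shown in the question's output
        bib = vars(pub)
        bibs.setdefault(bib['id_citations'], []).append((bib, author_name))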

After that, we can sort the keys by the number of authors attached to each one in bibs:

most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of tuples (key, value)
# which maps to (id_citation, [(bib1, author1), (bib2, author2), ...])

and/or filter that list down to the citations that appear three or more times:

cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# using key [0] in the middle is arbitrary. It can be anything in the 
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# otherwise, the first index is to grab the list rather than the id_citation,
# and the last index is to grab the bib, rather than the author_name

Now we can retrieve the paper titles from there:

paper_titles = [bib['bib']['title'] for bib in cited_enough]
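Given the edit to the question (id_citations turned out not to be unique per paper, and the title is the better key), the same idea can be keyed by title instead. The following is only a sketch under stated assumptions: `normalize_title` and `titles_shared_by` are hypothetical helpers, not part of scholarly, and the input is placeholder data shaped like the question's output (each publication is a dict with `pub['bib']['title']`) rather than real Google Scholar results.

```python
from collections import defaultdict


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences do not split a paper."""
    return " ".join(title.lower().split())


def titles_shared_by(pubs_by_author, min_authors=3):
    """Return titles that appear under at least `min_authors` distinct authors.

    `pubs_by_author` maps an author name to a list of publication dicts,
    each assumed to carry its title at pub['bib']['title'].
    """
    authors_for = defaultdict(set)  # normalized title -> set of author names
    display = {}                    # normalized title -> first spelling seen
    for author_name, pubs in pubs_by_author.items():
        for pub in pubs:
            title = pub["bib"]["title"]
            key = normalize_title(title)
            authors_for[key].add(author_name)
            display.setdefault(key, title)
    return sorted(
        display[key] for key, names in authors_for.items() if len(names) >= min_authors
    )


# Placeholder data, not real Google Scholar results.
sample = {
    "Author 1": [{"bib": {"title": "Shared Paper"}}, {"bib": {"title": "Solo paper"}}],
    "Author 2": [{"bib": {"title": "shared  paper"}}],
    "Author 3": [{"bib": {"title": "Shared Paper"}}, {"bib": {"title": "Pair paper"}}],
    "Author 4": [{"bib": {"title": "Pair paper"}}],
}

print(titles_shared_by(sample))
```

Using a set per title means an author who lists the same paper twice is still counted once. The normalization here is deliberately light; titles that differ in punctuation or wording between profiles would still be treated as separate papers.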
