<p>I think I got it, though it took a while. To make the exercise more interesting, I made a few changes.</p>
<p>First, the XML in the question is not valid; <a href="https://codebeautify.org/xmlvalidator" rel="nofollow noreferrer">you can check it here, for example</a>.</p>
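<p>The same kind of check can also be run locally. The following is a minimal sketch, assuming the standard library's <code>xml.etree.ElementTree</code> is acceptable for the purpose; it only tests well-formedness (not validity against the PubMed DTD), and the helper name <code>is_well_formed</code> and both sample strings are made up for illustration.</p>

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical snippets: the second one never closes its <Author> element
good = "<AuthorList><Author><LastName>Guarner</LastName></Author></AuthorList>"
bad = "<AuthorList><Author><LastName>Guarner</LastName></AuthorList>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

<p>The message returned for the broken string includes the line and column where the parser gave up, which is usually enough to locate the problem.</p>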
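<p>So I first fixed the XML. I also converted it into a PubmedArticleSet containing two articles, the first with three authors and the second with two (obviously dummy information), to make sure the code captures all of the authors. To keep things simpler, I removed some irrelevant information, such as affiliations.</p>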
<p>So here is what we are left with.
First, the modified XML:</p>
<pre><code>source = """
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">2844048</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Guarner</LastName>
<ForeName>J</ForeName>
<Initials>J</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Cohen</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Mushi</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">123456</PMID>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Jones</LastName>
<ForeName>E</ForeName>
<Initials>F</Initials>
</Author>
</AuthorList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
"""
</code></pre>
<p>Next, import what needs to be imported:</p>
<pre><code>from lxml import etree
import pandas as pd
</code></pre>
<p>Next, the code:</p>
<pre><code>doc = etree.fromstring(source)
art_loc = '..//*/PubmedArticle'  # XPath to all the articles
# Count the articles in the set; XPath count() returns a float, so convert it to an integer before use
num_arts = int(doc.xpath(f'count({art_loc})'))  # or: len(doc.xpath(art_loc))
grand_inf = []  # accumulates the information for all articles
for art in range(1, num_arts + 1):  # XPath positions start at 1, so range(num_arts) would not work
    loc_path = f'{art_loc}[{art}]/*/'  # path to the children of each article
    # grab the article id
    id_path = loc_path + 'PMID'
    pmid = doc.xpath(id_path)[0].text
    art_inf = []  # holds the information for this article
    art_inf.append(pmid)
    art_path = loc_path + '/Author'  # path to this article's authors
    # count the authors of this article; again a float that needs converting to an integer
    num_auths = int(doc.xpath(f'count({art_path})'))  # or: len(doc.xpath(art_path))
    auth_inf = []  # holds the full name of each author
    for auth in range(1, num_auths + 1):
        auth_path = f'{art_path}[{auth}]'  # path to each author
        LastName = doc.xpath(f'{auth_path}/LastName')[0].text
        FirstName = doc.xpath(f'{auth_path}/ForeName')[0].text
        Middle = doc.xpath(f'{auth_path}/Initials')[0].text
        full_name = LastName + ' ' + FirstName + ' ' + Middle
        auth_inf.append(full_name)
    art_inf.append(auth_inf)
    grand_inf.append(art_inf)
</code></pre>
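<p>A possible alternative to the index-based XPath loop above is to iterate over the elements themselves. This is a minimal sketch, not the approach used in this answer: it relies on the standard library's <code>xml.etree.ElementTree</code> instead of lxml (the <code>iter</code> and <code>findtext</code> calls used here also exist on lxml elements), and the shortened <code>sample</code> string and the function name <code>authors_by_article</code> are made up for illustration.</p>

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the `source` string used in the answer
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2844048</PMID>
      <AuthorList>
        <Author><LastName>Guarner</LastName><ForeName>J</ForeName><Initials>J</Initials></Author>
        <Author><LastName>Cohen</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">123456</PMID>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>C</ForeName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def authors_by_article(xml_text):
    """Return one [pmid, [author names]] entry per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter('PubmedArticle'):
        pmid = article.findtext('.//PMID')
        names = []
        for author in article.iter('Author'):
            # A missing child becomes an empty string and is skipped in the join
            parts = [author.findtext(tag, default='')
                     for tag in ('LastName', 'ForeName', 'Initials')]
            names.append(' '.join(p for p in parts if p))
        rows.append([pmid, names])
    return rows


rows = authors_by_article(sample)
print(rows)
```

<p>Working relative to each <code>PubmedArticle</code> element avoids counting nodes and rebuilding path strings, and an author without, say, a <code>ForeName</code> no longer raises an <code>IndexError</code>.</p>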
<p>Finally, load this information into a DataFrame:</p>
<pre><code>df=pd.DataFrame(grand_inf,columns=['PMID','Author(s)'])
df
</code></pre>
<p>Output:</p>
<pre><code>      PMID                            Author(s)
0  2844048  [Guarner J J, Cohen C C, Mushi E F]
1   123456               [Smith C C, Jones E F]
</code></pre>
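<p>Each cell of the <code>Author(s)</code> column holds a Python list. If one row per author is more convenient, the list column can be expanded with <code>DataFrame.explode</code>. This is a sketch that assumes pandas 0.25 or later; <code>grand_inf</code> is retyped from the output above, and the name <code>long_df</code> is made up for illustration.</p>

```python
import pandas as pd

# Same structure the loop in the answer produces: [PMID, [author names]]
grand_inf = [
    ['2844048', ['Guarner J J', 'Cohen C C', 'Mushi E F']],
    ['123456', ['Smith C C', 'Jones E F']],
]
df = pd.DataFrame(grand_inf, columns=['PMID', 'Author(s)'])

# explode() repeats the PMID once for every element of the list column
long_df = df.explode('Author(s)').reset_index(drop=True)
print(long_df)
```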
<p>We can now take a break...</p>