Getting text from an XML document

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet SYSTEM "http://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">2844048</PMID>
        <DateCompleted>
            <Year>1988</Year>
            <Month>10</Month>
            <Day>26</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2010</Year>
            <Month>11</Month>
            <Day>18</Day>
        </DateRevised>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <LastName>Guarner</LastName>
                <ForeName>J</ForeName>
                <Initials>J</Initials>
                <AffiliationInfo>
                    <Affiliation>Department of Pathology and Laboratory Medicine, Emory University Hospital, Atlanta, Georgia.</Affiliation>
                </AffiliationInfo>
            </Author>
            <Author ValidYN="Y">
                <LastName>Cohen</LastName>
                <ForeName>C</ForeName>
                <Initials>C</Initials>
            </Author>
        </AuthorList>
    </MedlineCitation>

tree = ET.parse('x.xml')
root = tree.getroot()

pid = []
for pmid in root.iter('PMID'):
    pid.append(pmid.text)

lastname = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for ln in id.findall("./Author/LastName"):
        lastname.append(ln.text)

forename = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for fn in id.findall("./Author/ForeName"):
        forename.append(fn.text)

initialname = []
for id in root.findall("./PubmedArticle/MedlineCitation/Article/AuthorList"):
    for i in id.findall("./Author/Initials"):
        initialname.append(i.text)

2 Answers

User

#1 · edited 2024-09-28 20:53:37

The XPath 1.0 data model is defined in the specification:

3.3 Node-sets
3.4 Booleans
3.5 Numbers
3.6 Strings

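Node-sets are proper sets: deduplicated and unordered. What you need is a sequence, an ordered list of data (for example, an ordered list of node-sets). That data type is part of XPath 2.0 and later.

To do grouping with XPath 1.0 as an embedded language, you can select the "first of a kind" and then use the host language to walk the document and collect the items of each group, possibly with another XPath expression. XSLT itself works this way.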
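A minimal sketch of that idea with Python as the host language, using the standard library's `xml.etree.ElementTree`. The `SAMPLE` string and the `authors_by_article` helper are hypothetical and only mirror the shape of the question's data: the outer expression picks one node per group (each article), and the inner expressions are evaluated relative to that node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the shape of the question's data.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2844048</PMID>
      <AuthorList>
        <Author><LastName>Guarner</LastName><Initials>J</Initials></Author>
        <Author><LastName>Cohen</LastName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>123456</PMID>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def authors_by_article(xml_text):
    """Map each article's PMID to the last names of its own authors."""
    root = ET.fromstring(xml_text)
    grouped = {}
    # Outer expression: one node per group (each article).
    for article in root.findall("./PubmedArticle"):
        pmid = article.findtext("./MedlineCitation/PMID")
        # Inner expression: evaluated relative to the current article only,
        # so authors of different articles are never mixed together.
        grouped[pmid] = [ln.text for ln in article.findall(".//Author/LastName")]
    return grouped


groups = authors_by_article(SAMPLE)
print(groups)
```

The grouping lives in the Python loop, not in the path expression, so the same pattern should carry over to any XPath 1.0 engine that supports relative evaluation.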

User

#2 · edited 2024-09-28 20:53:37

I think I've got it, although it took a while. To make the exercise more interesting, I made a few changes.

First, the XML in the question is invalid; you can check it here, for example.
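If no online checker is at hand, a rough local check is to try parsing the text and catch the parser's error. This is only a sketch: `well_formedness_error` is a hypothetical helper, the two fragments are made-up stand-ins for the question's data, and the check covers well-formedness only, not validation against the PubMed DTD.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical fragments: the first one never closes <PubmedArticleSet>.
broken = "<PubmedArticleSet><PubmedArticle><PMID>2844048</PMID></PubmedArticle>"
fixed = broken + "</PubmedArticleSet>"

broken_error = well_formedness_error(broken)
fixed_error = well_formedness_error(fixed)
print("broken:", broken_error)
print("fixed:", fixed_error)
```

The question's snippet stops after `</MedlineCitation>`, so its enclosing elements are left unclosed in the same way as the `broken` fragment.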

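So first I fixed the XML. I also turned it into a PubmedArticleSet containing 2 articles, the first with 3 authors and the second with 2 authors (obviously dummy information), to make sure the code catches all the authors. To keep things simpler, I removed some irrelevant information, such as the affiliations.

So here is where we stand. First, the modified XML: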

source = """
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">2844048</PMID>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <LastName>Guarner</LastName>
                <ForeName>J</ForeName>
                <Initials>J</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Cohen</LastName>
                <ForeName>C</ForeName>
                <Initials>C</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Mushi</LastName>
                <ForeName>E</ForeName>
                <Initials>F</Initials>
            </Author>
        </AuthorList>
    </MedlineCitation>
</PubmedArticle>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">123456</PMID>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <LastName>Smith</LastName>
                <ForeName>C</ForeName>
                <Initials>C</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Jones</LastName>
                <ForeName>E</ForeName>
                <Initials>F</Initials>
            </Author>
        </AuthorList>
    </MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
"""

Next, import what is needed:

from lxml import etree
import pandas as pd

Next, the code:

doc = etree.fromstring(source)
art_loc = '..//*/PubmedArticle' #this is the path to all the articles
#count the number of articles in the article set - XPath count() returns a float, so it has to be converted to an integer before use:
num_arts = int(doc.xpath(f'count({art_loc})')) # or could use len(doc.xpath(f'({art_loc})')) 
grand_inf = [] #this list will hold the accumulated information at the end
for art in range(1,num_arts+1): #can't do range(num_arts) because XPath positions start at 1, while Python counts from 0
    loc_path = (f'{art_loc}[{art}]/*/') #locate the path to each article
    #grab the article id:
    id_path = loc_path+'PMID'
    pmid = doc.xpath(id_path)[0].text
    art_inf = [] #this list holds the information for each article
    art_inf.append(pmid)
    art_path = loc_path+'/Author' #locate the path to the author group
    #determine the number of authors for this article; again, it's a float which needs to be converted to an integer
    num_auths = int(doc.xpath(f'count({art_path})')) #again: could use len(doc.xpath(f'({art_path})'))

    auth_inf = [] #this will hold the full name of each of the authors

    for auth in range(1,num_auths+1):
        auth_path = (f'{art_path}[{auth}]') #locate the path to each author
        LastName = doc.xpath((f'{auth_path}/LastName'))[0].text
        FirstName = doc.xpath((f'{auth_path}/ForeName'))[0].text
        Middle = doc.xpath((f'{auth_path}/Initials'))[0].text
        full_name = LastName+' '+FirstName+' '+Middle
        auth_inf.append(full_name)
    art_inf.append(auth_inf)
    grand_inf.append(art_inf)

Finally, load this information into a DataFrame:

df=pd.DataFrame(grand_inf,columns=['PMID','Author(s)'])
df

Output:

     PMID       Author(s)
 0   2844048    [Guarner J J, Cohen C C, Mushi E F]
 1   123456     [Smith C C, Jones E F]
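If one row per author is easier to work with than a list inside a cell, the nested `grand_inf` structure can be flattened first. The sketch below hard-codes the values from the output above to stay self-contained; the name `long_rows` is hypothetical.

```python
# Same nested shape as grand_inf above: [PMID, [author, ...]] per article.
grand_inf = [
    ["2844048", ["Guarner J J", "Cohen C C", "Mushi E F"]],
    ["123456", ["Smith C C", "Jones E F"]],
]

# One (PMID, author) pair per author, keeping document order.
long_rows = [(pmid, author) for pmid, authors in grand_inf for author in authors]

for pmid, author in long_rows:
    print(pmid, author)
```

A list of tuples like this can be passed to `pd.DataFrame(long_rows, columns=['PMID', 'Author'])` in the same way as `grand_inf`.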

Now we can take a break...
