Parsing XML with Python

<PubmedArticle> <MedlineCitation Owner="NLM" Status="MEDLINE"> <PMID Version="1">8981971</PMID> <Article PubModel="Print"> <Journal> <ISSN IssnType="Print">0002-9297</ISSN> <JournalIssue CitedMedium="Print"> <Volume>60</Volume> <Issue>1</Issue> <PubDate> <Year>1997</Year> <Month>Jan</Month> </PubDate> </JournalIssue> <Title>American journal of human genetics</Title> <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation> </Journal> <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle> <Pagination> <MedlinePgn>241-4</MedlinePgn> </Pagination> <AuthorList CompleteYN="Y"> <Author ValidYN="Y"> <LastName>Scozzari</LastName> <ForeName>R</ForeName> <Initials>R</Initials> </Author> </AuthorList> <MeshHeadingList> <MeshHeading> <DescriptorName MajorTopicYN="N">Alleles</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName> </MeshHeading> </MeshHeadingList> <OtherID Source="NLM">PMC1712541</OtherID> </MedlineCitation> </PubmedArticle>

3 Answers

User

#1 · edited 2024-06-28 19:47:34

Try the lxml module.

To locate the titles, you can either use XPath with lxml, or use lxml's XML object structure to "index" your way down to the Title element.
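A rough sketch of those two routes, not necessarily what was meant above: it uses a cut-down copy of the record from the question, a path expression for the first route, and step-by-step child lookups for the second. The variable names are made up for illustration. The import falls back to the standard library's xml.etree.ElementTree when lxml is not installed, since both provide find() and findtext(); lxml's full XPath engine is reached separately through its xpath() method.

```python
# Prefer lxml if it is installed; otherwise fall back to the standard library.
# Both modules provide fromstring(), find() and findtext() as used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Trimmed copy of the record from the question (illustrative only).
record = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <Title>American journal of human genetics</Title>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

root = etree.fromstring(record)

# Route 1: a path expression (the XPath subset that find/findtext understand).
title_by_path = root.findtext(".//Journal/Title")

# Route 2: walk the element structure one child at a time.
citation = root.find("MedlineCitation")
journal = citation.find("Article").find("Journal")
title_by_walk = journal.find("Title").text

print(title_by_path)
print(title_by_walk)
```

Both routes end up at the same element; the path form is shorter, while the step-by-step form makes it easier to check for a missing child at each level.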

User

#2 · edited 2024-06-28 19:47:34

Try Beautiful Soup; I have found the library very handy. As was just pointed out, BeautifulStoneSoup is meant specifically for parsing XML.

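User

#3 · edited 2024-06-28 19:47:34

I'm not sure why you would want each title in its own list, which is what your question led me to believe.

How about all the titles in one list? The example below uses a trimmed-down version of your sample XML, plus a second <Article/> that I copied in, to show that lxml.etree's xpath() builds the list of <Title/> elements for you: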

>>> import lxml.etree

>>> xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American journal of human genetics</Title>
        <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">9297-0002</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American Journal of Pediatrics</Title>
        <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

XPath is used to select the nodes, and lxml.etree's xpath() hands them back as a Python list of node objects.
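A minimal sketch of what such a call could look like. The XPath //Article/Journal/Title is an assumption, inferred from the later remark that the ElementTree version differs only by a leading period. The helper name journal_titles and the SAMPLE string are hypothetical, and the standard-library fallback is only there so the snippet still runs where lxml is not installed.

```python
import xml.etree.ElementTree as ET

try:
    import lxml.etree as LET  # third-party; optional for this sketch
except ImportError:
    LET = None

# Hypothetical, shortened stand-in for xml_text with two <Article/> elements.
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <Article PubModel="Print">
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <Title>American Journal of Pediatrics</Title>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def journal_titles(xml_text):
    """Return the text of every Article/Journal/Title element as a list."""
    if LET is not None:
        root = LET.fromstring(xml_text)
        # lxml: xpath() returns a plain Python list of element objects.
        nodes = root.xpath("//Article/Journal/Title")
    else:
        root = ET.fromstring(xml_text)
        # Standard library: findall() needs the relative './/' form.
        nodes = root.findall(".//Article/Journal/Title")
    return [node.text for node in nodes]


print(journal_titles(SAMPLE))
```

Because the result is an ordinary list, it can be sliced, counted with len(), or looped over like any other Python list.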

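Edit 1: Now using Python's xml.etree.ElementTree

I wanted to show this solution with the module that ships with Python, in case installing a third-party module is impossible or unappealing.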

>>> import xml.etree.ElementTree as ETree
>>> element = ETree.fromstring(xml_text)
>>> xml_obj = ETree.ElementTree(element)
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'):
    print(title_obj.text)


American journal of human genetics
American Journal of Pediatrics

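It's a small thing, but this XPath differs from the one in the lxml example: it has a period ('.') at the start. Without the period, I get the following warning (using Python 2.7.2):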

>>> xml_obj.findall('//Article/Journal/Title')

Warning (from warnings module):
  File "__main__", line 1
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/Article/Journal/Title'
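To see the difference for yourself, here is a small sketch that runs both path forms through ElementTree.findall() and records any warnings instead of printing them. The helper titles_and_warnings and the DOC string are made up for this illustration, and what the second call reports may vary with your Python version, so no output is shown for it.

```python
import warnings
import xml.etree.ElementTree as ET

# Hypothetical, minimal document with two journal titles.
DOC = """<PubmedArticle><MedlineCitation>
  <Article><Journal><Title>American journal of human genetics</Title></Journal></Article>
  <Article><Journal><Title>American Journal of Pediatrics</Title></Journal></Article>
</MedlineCitation></PubmedArticle>"""


def titles_and_warnings(tree, path):
    """Run tree.findall(path); return (title texts, warning category names)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        titles = [el.text for el in tree.findall(path)]
    return titles, [w.category.__name__ for w in caught]


tree = ET.ElementTree(ET.fromstring(DOC))

# Relative path, the form recommended by the warning text.
rel_titles, rel_warnings = titles_and_warnings(tree, ".//Article/Journal/Title")
print(rel_titles, rel_warnings)

# Path without the leading period, as in the call that triggered the warning.
abs_titles, abs_warnings = titles_and_warnings(tree, "//Article/Journal/Title")
print(abs_titles, abs_warnings)
```

Sticking to the './/' form keeps the search relative to the element you start from, which is the form the warning message itself recommends.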
