Parsing XML with Python

Posted 2024-06-28 19:47:34

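I have several large .xml files and want to parse them to do a few things.

I only want to extract:

  • XML /title1 and save it to list A (for example)
  • XML /title2 and save it to list B
  • XML /title3 and save it to list C
  • and so on

Using Python 2.x, which library is best to import and use? How should I set this up? Any suggestions?

For example: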

 <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID Version="1">8981971</PMID>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0002-9297</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>60</Volume>
                    <Issue>1</Issue>
                    <PubDate>
                        <Year>1997</Year>
                        <Month>Jan</Month>
                    </PubDate>
                </JournalIssue>
                <Title>American journal of human genetics</Title>
                <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
            </Journal>
            <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
            <Pagination>
                <MedlinePgn>241-4</MedlinePgn>
            </Pagination>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Scozzari</LastName>
                    <ForeName>R</ForeName>
                    <Initials>R</Initials>
                </Author>
            </AuthorList>
        </Article>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName MajorTopicYN="N">Alleles</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
        <OtherID Source="NLM">PMC1712541</OtherID>
    </MedlineCitation>
</PubmedArticle>

3 Answers

Try the lxml module.

To locate the titles, you can use XPath with lxml, or you can use lxml's XML object structure to "index" into the title element.
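A minimal sketch of both approaches, assuming lxml is installed and Python 3 is used. The `sample` fragment is a hypothetical cut-down version of the XML in the question, and the variable names are illustrative only:

```python
from lxml import etree

# Hypothetical fragment, trimmed from the question's sample for illustration.
sample = b"""<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

root = etree.fromstring(sample)

# Approach 1: an XPath query returns a list of matching elements.
titles_by_xpath = [el.text for el in root.xpath("//Journal/Title")]

# Approach 2: an element behaves like a sequence of its children, so the
# same node can be reached by position:
# MedlineCitation -> Article -> Journal -> Title.
title_by_index = root[0][0][0][0].text

print(titles_by_xpath)
print(title_by_index)
```

Positional indexing is brittle when the document layout varies between records, so the XPath form is usually the safer choice for real PubMed exports.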

Try using Beautiful Soup. I've found this library very handy. As was just pointed out, BeautifulStoneSoup is the part intended specifically for parsing XML.

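I'm not sure why you would want each title in its own list, which is what your question led me to believe.

How about all the titles in one list? The example below uses a trimmed-down version of your sample XML, in which I also duplicated an <Article/> to show that the xpath() method in lxml.etree builds a list of <Title/> elements for you: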

>>> import lxml.etree

>>> xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American journal of human genetics</Title>
        <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">9297-0002</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American Journal of Pediatrics</Title>
        <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

XPath is meant to return nodes, and the xpath() method in lxml.etree hands them back as a Python list of node objects.
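A minimal sketch of such a call (not necessarily the answerer's original listing), assuming lxml is installed and Python 3 is used. The path `//Article/Journal/Title` is an assumption inferred from the later remark that the ElementTree path differs only by a leading period, and `xml_text` is shortened here so the block runs on its own:

```python
import lxml.etree

# Shortened stand-in for the xml_text shown above (two articles).
xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <Article PubModel="Print">
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <Title>American Journal of Pediatrics</Title>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

xml_obj = lxml.etree.fromstring(xml_text)

# xpath() returns a plain Python list of element objects.
title_nodes = xml_obj.xpath("//Article/Journal/Title")
titles = [node.text for node in title_nodes]

for title in titles:
    print(title)
```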

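Edit 1: Now using Python's xml.etree.ElementTree

I wanted to show this solution with the bundled module as well, in case installing a third-party module is impossible or unappealing.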

>>> import xml.etree.ElementTree as ETree
>>> element = ETree.fromstring(xml_text)
>>> xml_obj = ETree.ElementTree(element)
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'):
    print(title_obj.text)


American journal of human genetics
American Journal of Pediatrics
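
The same approach extends to the question's original request for one list per tag. A minimal sketch, assuming Python 3 and the standard library only; the `paths` mapping and the list names are hypothetical and should be adapted to the tags actually needed:

```python
import xml.etree.ElementTree as ETree

# Shortened stand-in for the sample XML (two articles).
xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <Title>American journal of human genetics</Title>
      </Journal>
      <ArticleTitle>First article</ArticleTitle>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">9297-0002</ISSN>
        <Title>American Journal of Pediatrics</Title>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

root = ETree.fromstring(xml_text)

# Hypothetical mapping: result name -> path of the elements to collect.
paths = {
    "journal_titles": ".//Article/Journal/Title",
    "article_titles": ".//Article/ArticleTitle",
    "issns": ".//Article/Journal/ISSN",
}

# One list of text values per path, in document order.
lists = {
    name: [el.text for el in root.findall(path)]
    for name, path in paths.items()
}

for name, values in lists.items():
    print(name, values)
```

Keeping the lists in a dict keyed by name avoids a separate variable per tag, which helps when the set of tags grows.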

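It's a small thing, but this XPath differs from the one in the lxml example: it has a period ('.') at the beginning. Without the period, I get the following warning (using Python 2.7.2):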

>>> xml_obj.findall('//Article/Journal/Title')

Warning (from warnings module):
  File "__main__", line 1
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/Article/Journal/Title'
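
The difference between the two path forms can also be checked programmatically. A sketch assuming Python 3's standard library, where, as far as I know, `ElementTree.findall` still rewrites a leading '/' to './' and emits the same FutureWarning; the one-line `xml_text` is a hypothetical stand-in:

```python
import warnings
import xml.etree.ElementTree as ETree

# Hypothetical minimal document with a single title.
xml_text = "<Root><Article><Journal><Title>T1</Title></Journal></Article></Root>"
xml_obj = ETree.ElementTree(ETree.fromstring(xml_text))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is suppressed

    # Relative form: no warning expected.
    relative = xml_obj.findall(".//Article/Journal/Title")
    n_after_relative = len(caught)

    # Absolute form: accepted, but reported through a FutureWarning.
    absolute = xml_obj.findall("//Article/Journal/Title")
    n_after_absolute = len(caught)

print(n_after_relative, n_after_absolute)
print([el.text for el in relative] == [el.text for el in absolute])
```

Both calls return the same elements here; only the absolute form is flagged, so writing the leading period explicitly keeps the output clean.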
