Nested XML file to pandas DataFrame

Posted 2024-10-01 17:27:52


I am having trouble parsing an XML file for conversion to a pandas DataFrame. A sample entry looks like this:

<p>


 <persName id="t17200427-2-defend31" type="defendantName">
 Alice 
 Jones 
 <interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
 <interp inst="t17200427-2-defend31" type="given" value="Alice"/>
 <interp inst="t17200427-2-defend31" type="gender" value="female"/>
 </persName> 

 , of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName> 
 <interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
 <interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
 <join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
 <interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
 <interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
 privately stealing a Bermundas Hat, value 10 s. out of the Shop of 

 <persName id="t17200427-2-victim33" type="victimName">
 Edward 
 Hillior 
 <interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
 <interp inst="t17200427-2-victim33" type="given" value="Edward"/>
 <interp inst="t17200427-2-victim33" type="gender" value="male"/>
 <join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
 </persName> 



 </rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs> 
 <join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
 <interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
 <interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
 Guilty to the value of 10 d.
 </rs> 
 <rs id="t17200427-2-punish11" type="punishmentDescription">
 <interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
 <join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
 Transportation
 </rs> .</p>

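I want a DataFrame with columns for gender, offence, and trial text. I have previously extracted all of the data into a DataFrame, but I cannot get the text that sits between the tags.

Here is the sample code: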

^{pr2}$

**I have roughly 1,000 or more files in this same XML format to combine into a single DataFrame.


1 answer
User
#1 · Posted 2024-10-01 17:27:52

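Because the XML is fairly complex, with text values spilling across nodes, consider XSLT, a special-purpose language designed to transform XML files (complex ones especially) into simpler XML files.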

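Python's third-party module lxml can run XSLT 1.0, and also XPath 1.0 to parse the transformed result for migration into a pandas DataFrame. Alternatively, you can use an external XSLT processor, which Python can call with subprocess.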
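As a rough illustration of that lxml workflow, here is a minimal sketch that runs a small XSLT 1.0 stylesheet and then reads the transformed result with XPath. The miniature input document, the `root` wrapper element, and the output element names `gender` and `text` are made up for this example and are not taken from the real files.

```python
from lxml import etree

# Hypothetical miniature input; the real trial files are far larger.
XML = b"""<root><p><persName type="defendantName">Alice Jones<interp type="gender" value="female"/></persName> was indicted.</p></root>"""

# Hypothetical reduced stylesheet: one attribute value and the paragraph text.
XSL = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:template match="/*">
    <data>
      <gender><xsl:value-of select="descendant::interp[@type='gender']/@value"/></gender>
      <text><xsl:value-of select="normalize-space(p)"/></text>
    </data>
  </xsl:template>
</xsl:stylesheet>"""

doc = etree.fromstring(XML)
transform = etree.XSLT(etree.fromstring(XSL))  # compile the stylesheet once
result = transform(doc)                        # run the transformation

# XPath 1.0 on the flattened result; string() returns the text content
gender = result.xpath("string(/data/gender)")
text = result.xpath("string(/data/text)")
print(gender, "|", text)
```

Compiling the stylesheet once and reusing the `transform` object is the usual choice when many documents share one stylesheet.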

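Specifically, the XSLT below uses XPath's descendant::* from the root to extract the necessary attributes for both the defendant and the victim, plus the text value of the whole paragraph, assuming {} is a child of the root.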
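To see why the descendant axis matters here, the following sketch compares a plain child step with `descendant::` using lxml's XPath support directly on a source fragment. The `trial` wrapper element and the trimmed-down fragment are assumptions made only for this illustration.

```python
from lxml import etree

# Hypothetical fragment: interp is nested inside persName, not directly under p.
root = etree.fromstring(
    b'<trial><p><persName type="defendantName"> Alice \n Jones '
    b'<interp type="gender" value="female"/></persName></p></trial>'
)
p = root.find("p")

# A bare child step only looks one level down, so it finds nothing here
direct = p.xpath("interp")

# descendant:: searches every level below the context node
nested = p.xpath("descendant::interp[@type='gender']/@value")

# normalize-space() trims and collapses the whitespace around the name parts
name = p.xpath("normalize-space(descendant::persName[@type='defendantName'])")

print(direct, nested, name)
```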

XSLT (save as an .xsl file, which is a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>

      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>

      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>       

</xsl:stylesheet>

Python

^{pr2}$

For the 1,000 similar XML files, run this process in a loop, append each one-row trial DataFrame to a list, and stack them with pd.concat.
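The loop described above might look roughly like the sketch below. It is not the original code from this answer: the reduced stylesheet, the helper name `trial_to_frame`, the `trial` wrapper element, and the two in-memory sample documents (stand-ins for file contents) are all assumptions for illustration. With real files you would read each path's bytes, for example from `glob.glob`, instead of using the `docs` list.

```python
import pandas as pd
from lxml import etree

# Hypothetical reduced stylesheet with three of the output columns.
XSL = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>
  <xsl:template match="p">
    <data>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(XSL))  # compiled once, reused per file


def trial_to_frame(xml_bytes):
    """Transform one trial document and return a one-row DataFrame."""
    result = transform(etree.fromstring(xml_bytes))
    # Each child of <data> becomes one column
    row = {child.tag: child.text for child in result.getroot()}
    return pd.DataFrame([row])


# Made-up stand-ins for the contents of two files
docs = [
    b'<trial><p><persName type="defendantName">Alice Jones<interp type="gender" value="female"/></persName> was indicted for <rs type="offenceDescription"><interp type="offenceCategory" value="theft"/>stealing a Hat</rs>.</p></trial>',
    b'<trial><p><persName type="defendantName">John Smith<interp type="gender" value="male"/></persName> was indicted for <rs type="offenceDescription"><interp type="offenceCategory" value="deception"/>forging a Note</rs>.</p></trial>',
]

frames = [trial_to_frame(d) for d in docs]
df = pd.concat(frames, ignore_index=True)  # stack the one-row frames
print(df)
```

Collecting the frames in a list and calling `pd.concat` once at the end avoids repeatedly copying a growing DataFrame inside the loop.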

XML output

<?xml version="1.0"?>
<data>
  <defendantName>Alice Jones</defendantName>
  <defendantGender>female</defendantGender>
  <offenceCategory>theft</offenceCategory>
  <offenceSubCategory>shoplifting</offenceSubCategory>
  <victimName>Edward Hillior</victimName>
  <victimGender>male</victimGender>
  <verdictCategory>guilty</verdictCategory>
  <verdictSubCategory>theftunder1s</verdictSubCategory>
  <punishmentCategory>transport</punishmentCategory>
  <trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

DataFrame output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting   

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...   

#   verdictCategory verdictSubCategory victimGender      victimName  
# 0          guilty       theftunder1s         male  Edward Hillior  
