如何在python中将XML转换为Pandas

2024-05-05 11:09:56 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个XML格式如下。你知道吗

<Data>
    <Info InfoId="ABC123456">
        <Period duration="PT3600S" endTime="2015-06-14T22:00:00+07:30"/>
        <rPeriod duration="PT3600S"/>
        <Types>1116722205 1116722216 1116722227 1116722238 1116722249 1116722250 </Types>
        <Value Obj="YYYYYYYYY">
            <Results>123234 0 10 4657 1212 766 </Results>
        </Value>
    </Info>
    <Info InfoId="XYZ123456">
        <Period duration="PT3600S" endTime="2015-06-14T22:00:00+07:30"/>
        <rPeriod duration="PT3600S"/>
        <Types>1222455217 1222455228 1222455239 1222455340 1222455351 1222455362 1222455373 1222455384 1222455395 1222455316 1222455327 1222460236 1222460247 </Types>
        <Value Obj="XXXXXXXX">
            <Results>1 0 100 20 50 70 80 90 0 10 1000 0 0 </Results>
        </Value>
        <Value Obj="FFFFFFFFF">
            <Results>13223 0 4354 11 0 1111 0 1 1 1 1 0 0 </Results>
        </Value>
    </Info>
</Data> 

如何使用python以最快和高效的方式将XML转换为数据帧。我尝试过使用xmltodict和json normalize,但是如何处理info块中的多个值

期望输出

Period  InfoID  Object  Types   Value
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722205  123234
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722216  0
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722227  10
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722238  4657
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722249  1212
2015-06-14T22:00:00+07:30   ABC123456   YYYYYYYYY   1116722250  766
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    1222455217  1
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    1222455228  0
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    1222455239  100
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    …   …
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    …   …
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    …   …
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    …   …
2015-06-14T22:00:00+07:30   XYZ123456   XXXXXXXX    1222460247  0
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   1222455217  13223
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   1222455228  0
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   1222455239  4354
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   …   …
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   …   …
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   …   …
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   …   …
2015-06-14T22:00:00+07:30   XYZ123456   FFFFFFFFF   1222460247  0

Tags: infoobjdatavaluexmlresultsperiodtypes
1条回答
网友
1楼 · 发布于 2024-05-05 11:09:56

好吧,我不知道你是否会喜欢,但我很开心。。:)

使用BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml)

df = pd.DataFrame(columns=['Period','InfoID','Object','Types','Value'])
pos = 0
l = []
for i in a.find_all('info'):
    p = i.find('period')
    l.append(p.get('endtime'))
    l.append(i.get('infoid'))
    v = i.find('value')
    l.append(v.get('obj'))
    l.append(i.find_all('types')[0].text)
    l.append(v.find_all('results')[0].text)
    df.loc[pos] = l
    pos+=1
    l = []

打印(df)

                      Period     InfoID     Object  \
0  2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY   
1  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX   

                                               Types  \
0  1116722205 1116722216 1116722227 1116722238 11...   
1  1222455217 1222455228 1222455239 1222455340 12...   

                                   Value  
0             123234 0 10 4657 1212 766   
1  1 0 100 20 50 70 80 90 0 10 1000 0 0  

拆分数据,堆叠数据,然后合并dfs

df1 = df.set_index(['Period','InfoID','Object'])['Types'].str.split().apply(pd.Series).stack().reset_index().rename({0:'Types'}, axis=1).drop('level_3', axis=1)

df2 = df.set_index(['Period','InfoID','Object'])['Value'].str.split().apply(pd.Series).stack().reset_index().rename({0:'Values'}, axis=1).drop('level_3', axis=1)

pd.merge(a,b, right_index=True, left_index=True).drop(['Period_y','InfoID_y','Object_y'], axis=1).rename({'Period_x':'Period','InfoID_x':'InfoID','Object_x':'Object'}, axis=1)

输出

                       Period     InfoID     Object       Types  Values
0   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722205  123234
1   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722216       0
2   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722227      10
3   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722238    4657
4   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722249    1212
5   2015-06-14T22:00:00+07:30  ABC123456  YYYYYYYYY  1116722250     766
6   2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455217       1
7   2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455228       0
8   2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455239     100
9   2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455340      20
10  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455351      50
11  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455362      70
12  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455373      80
13  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455384      90
14  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455395       0
15  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455316      10
16  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222455327    1000
17  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222460236       0
18  2015-06-14T22:00:00+07:30  XYZ123456   XXXXXXXX  1222460247       0

相关问题 更多 >