Problem converting the following XML elements to a DataFrame?

Posted 2024-09-29 23:30:16


I am using Beautiful Soup to parse a batch of XML files and extract some information from them, like this:

import os
from glob import glob

from bs4 import BeautifulSoup

a_lis = []
for filepath in glob(os.path.join('../data/trainingFiles/', '*.xml')):
    with open(filepath) as f:
        content = f.read()
        # The 'lxml' HTML parser lowercases tag and attribute names,
        # so the lookups below use lowercase names.
        results = BeautifulSoup(content, 'lxml')
        #print(results)
        for LabelInteractions in results.find_all("labelinteractions"):
            #print(LabelInteractions)
            for labelinteractions in LabelInteractions.findAll('labelinteraction'):
                print(labelinteractions)

Output:

<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
....
<labelinteraction precipitant="riociguat" precipitantcode="N0000188995" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>

How can I convert these XML attributes into a DataFrame? The columns would be as follows:

precipitant  precipitantcode type effect

2 Answers

You can collect the values for each column in a list and then create the DataFrame:

from collections import defaultdict

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup("""
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
<LabelInteraction type="Pharmacodynamic interaction" precipitant="alcohol" precipitantCode="N0000007432" effect=" 25064002: Headache (finding)"/>
""", 'lxml')  # name the parser explicitly; its HTML mode lowercases tag and attribute names

columns = ['precipitant', 'precipitantcode', 'type', 'effect']
d = defaultdict(list)

for labelinteraction in soup.find_all('labelinteraction'):
    for col in columns:
        # Tag.get returns None when the attribute is absent (e.g. no 'effect')
        d[col].append(labelinteraction.get(col))

df = pd.DataFrame(d)

Output:

     precipitant precipitantcode                         type                         effect
0      ritonavir     N0000007423      Unspecified interaction                           None
1  gc stimulator          NO MAP      Unspecified interaction                           None
2        alcohol     N0000007432  Pharmacodynamic interaction   25064002: Headache (finding)
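
As a variation on the approach above, each tag's attributes are already available as a dictionary through `tag.attrs`, so the rows can be handed to pandas directly. The following is only a minimal sketch: the sample markup is copied from the question's output, and `html.parser` is used here (an assumption, chosen to avoid the lxml dependency) rather than the `'lxml'` parser from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup copied from the output shown in the question.
xml = """
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>
"""

# Assumption: html.parser instead of lxml, so only bs4 and pandas are required.
soup = BeautifulSoup(xml, "html.parser")

# One dictionary of attributes per element.
records = [tag.attrs for tag in soup.find_all("labelinteraction")]

# Passing columns fixes the column order; attributes missing from a row become NaN.
columns = ["precipitant", "precipitantcode", "type", "effect"]
df = pd.DataFrame(records, columns=columns)
print(df)
```

Listing the columns explicitly keeps the layout stable even when an attribute such as `effect` appears on only some of the elements.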

If you have a list of the columns you need:

cols = ['precipitant', 'precipitantcode', 'type']

you can then iterate over them and append the values to lists in a dictionary:

d = {}
for labelinteractions in LabelInteractions.findAll('labelinteraction'):
    for c in cols:
        # setdefault creates the list the first time a column is seen.
        # Indexing with [c] raises KeyError if the attribute is missing.
        d.setdefault(c, []).append(labelinteractions[c])

Once that is done, you can build the DataFrame:

df = pd.DataFrame(d)

This is what I get from your sample:

     precipitant precipitantcode                         type
0      ritonavir     N0000007423      Unspecified interaction
1  gc stimulator          NO MAP      Unspecified interaction
2      riociguat     N0000188995      Unspecified interaction
3        alcohol     N0000007432  Pharmacodynamic interaction
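
Two details are worth noting about the loop above: indexing a tag with a missing attribute (for example `effect`, which only some elements carry) raises `KeyError`, and the question reads many files rather than one. A possible end-to-end sketch follows. The function name `load_interactions`, the temporary sample file, and its wrapping `LabelInteractions` element are all hypothetical, introduced only for illustration, and `html.parser` is assumed instead of the question's `'lxml'` parser.

```python
import os
import tempfile
from glob import glob

import pandas as pd
from bs4 import BeautifulSoup

COLS = ["precipitant", "precipitantcode", "type", "effect"]


def load_interactions(directory, cols=COLS):
    """Collect labelinteraction attributes from every *.xml file in a directory.

    Hypothetical helper for illustration; not part of bs4 or pandas.
    """
    rows = []
    for filepath in sorted(glob(os.path.join(directory, "*.xml"))):
        with open(filepath, encoding="utf-8") as f:
            # html.parser lowercases tag and attribute names.
            soup = BeautifulSoup(f.read(), "html.parser")
        for tag in soup.find_all("labelinteraction"):
            # Tag.get returns None instead of raising KeyError when absent.
            rows.append({c: tag.get(c) for c in cols})
    return pd.DataFrame(rows, columns=cols)


# Demonstration with a made-up file standing in for ../data/trainingFiles/.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.xml"), "w", encoding="utf-8") as f:
        f.write(
            "<LabelInteractions>"
            '<LabelInteraction type="Unspecified interaction" '
            'precipitant="ritonavir" precipitantCode="N0000007423"/>'
            '<LabelInteraction type="Pharmacodynamic interaction" '
            'precipitant="alcohol" precipitantCode="N0000007432" '
            'effect=" 25064002: Headache (finding)"/>'
            "</LabelInteractions>"
        )
    df = load_interactions(tmp)

print(df)
```

Collecting one dictionary per element and building the DataFrame once at the end avoids re-creating the frame inside the file loop.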
