Beautiful Soup nested loops

Posted 2024-10-06 11:18:58


My XML has a nested structure similar to this:

<xml>
<top>
<main_record attr1="val1" attr2 = "val2" attr3="val3">
    <sub_record attrx="valx" attry="valy" />
</main_record>
<main_record attr1="val4" attr2 = "val5" attr3="val6">
    <sub_record attrx="valx2" attry="valy2" />
</main_record>
<main_record attr1="val7" attr2 = "val8" attr3="val9">
    <sub_record attrx="valx3" attry="valy3" />
</main_record>
</top>
</xml>

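I am trying to use Beautiful Soup to extract the attribute data from each main_record and its sub_record, so that I can write it out row by row to a CSV file.

I can get a loop to print all of the attr1, attr2 and attr3 values in the file, but when I add a nested loop inside it to get attrx and attry, it does not work properly.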

from bs4 import BeautifulSoup

f = open("C:\\tracker.log", "r")
x = f.read()

soup = BeautifulSoup(x, 'html.parser')

for entity in soup.find_all('main_record'):
    print(entity.get('attr1'))
    print(entity.get('attr2'))
    print(entity.get('attr3'))
    for positions in soup.find('sub_record'):
        print(positions.get('attrx'))
        print(positions.get('attry'))

Thanks for any help or pointers.


3 answers

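Use entity.find_all for the second for loop, so the search is limited to the current main_record instead of the whole document.

Check the following code: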

for entity in soup.find_all('main_record'):
    print(entity.get('attr1'))
    print(entity.get('attr2'))
    print(entity.get('attr3'))
    for positions in entity.find_all('sub_record'):
        print(positions.get('attrx'))
        print(positions.get('attry'))
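
To see why the original inner loop misbehaves: soup.find('sub_record') searches the whole document and returns a single Tag (the first match), and looping over a Tag walks its children rather than a list of matching tags. A minimal sketch of the difference, assuming a shortened inline copy of the XML from the question (the names xml_text, first_sub and pairs are only for illustration):

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's XML, inlined so the sketch is self-contained.
xml_text = """
<xml>
<top>
<main_record attr1="val1" attr2="val2" attr3="val3">
    <sub_record attrx="valx" attry="valy" />
</main_record>
<main_record attr1="val4" attr2="val5" attr3="val6">
    <sub_record attrx="valx2" attry="valy2" />
</main_record>
</top>
</xml>
"""

soup = BeautifulSoup(xml_text, "html.parser")

# soup.find looks at the whole document and returns only the first match,
# no matter which main_record the outer loop is currently on.
first_sub = soup.find("sub_record")

pairs = []
for entity in soup.find_all("main_record"):
    # entity.find_all restricts the search to descendants of this main_record.
    for sub in entity.find_all("sub_record"):
        pairs.append((entity.get("attr1"), sub.get("attrx")))

print(first_sub.get("attrx"))
print(pairs)
```

Scoping the inner search to entity is what pairs each main_record with its own sub_record values.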

You can try the following approach:

for index,entity in enumerate(soup.find_all('main_record')):
    attr1 = entity.get('attr1')
    attr2 = entity.get('attr2')
    attr3 = entity.get('attr3')
    attrx = entity.find('sub_record').get('attrx')
    attry = entity.find('sub_record').get('attry')
    print(f'{index}) attr1 is {attr1}, attr2 is {attr2}, attr3 is {attr3}, attrx is {attrx}, attry is {attry}')

Output:

0) attr1 is val1, attr2 is val2, attr3 is val3, attrx is valx, attry is valy
1) attr1 is val4, attr2 is val5, attr3 is val6, attrx is valx2, attry is valy2
2) attr1 is val7, attr2 is val8, attr3 is val9, attrx is valx3, attry is valy3
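
Since the goal in the question is CSV rows, the same loop can feed Python's csv module. This is only a sketch under a few assumptions: the XML is inlined and shortened, the helper records_to_rows and the column order are my own choices, and the rows go to an in-memory buffer (swap it for open("records.csv", "w", newline="") with a file name of your choosing to write a real file). It also guards against a main_record with no sub_record, where entity.find returns None.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the question's XML, inlined so the sketch is self-contained.
xml_text = """
<xml>
<top>
<main_record attr1="val1" attr2="val2" attr3="val3">
    <sub_record attrx="valx" attry="valy" />
</main_record>
<main_record attr1="val4" attr2="val5" attr3="val6">
    <sub_record attrx="valx2" attry="valy2" />
</main_record>
</top>
</xml>
"""


def records_to_rows(text):
    """Return one row per main_record, using its first sub_record if present."""
    soup = BeautifulSoup(text, "html.parser")
    rows = []
    for entity in soup.find_all("main_record"):
        sub = entity.find("sub_record")  # None when the record has no sub_record
        rows.append([
            entity.get("attr1"),
            entity.get("attr2"),
            entity.get("attr3"),
            sub.get("attrx") if sub is not None else "",
            sub.get("attry") if sub is not None else "",
        ])
    return rows


# In-memory buffer standing in for a CSV file (hypothetical destination).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["attr1", "attr2", "attr3", "attrx", "attry"])
writer.writerows(records_to_rows(xml_text))
csv_text = buffer.getvalue()
print(csv_text)
```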

You can convert the XML to a JSON-like dictionary and then let pandas flatten it. You will need to pip install xmltodict.

Given a file named xml_file.xml with these contents:

<xml>
<top>
<main_record attr1="val1" attr2 = "val2" attr3="val3">
    <sub_record attrx="valx" attry="valy" />
</main_record>
<main_record attr1="val4" attr2 = "val5" attr3="val6">
    <sub_record attrx="valx2" attry="valy2" />
</main_record>
<main_record attr1="val7" attr2 = "val8" attr3="val9">
    <sub_record attrx="valx3" attry="valy3" />
</main_record>
</top>
</xml>

Code:

import xmltodict
import pandas as pd

with open("xml_file.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

df = pd.json_normalize(data_dict, record_path=['xml','top', 'main_record'])

Output:

print(df)
  @attr1 @attr2 @attr3 sub_record.@attrx sub_record.@attry
0   val1   val2   val3              valx              valy
1   val4   val5   val6             valx2             valy2
2   val7   val8   val9             valx3             valy3

If you want to get rid of the '@', just replace it with an empty string:

df.columns = [x.replace('@','') for x in df.columns]

print(df)
  attr1 attr2 attr3 sub_record.attrx sub_record.attry
0  val1  val2  val3             valx             valy
1  val4  val5  val6            valx2            valy2
2  val7  val8  val9            valx3            valy3
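
One caveat worth sketching: when the file holds only a single main_record, xmltodict returns a dict rather than a list for it, and json_normalize's record_path expects a list. Passing force_list to xmltodict.parse avoids that. The snippet below is a rough sketch, assuming an inline one-record XML string and assuming you also want the sub_record. prefix dropped before writing CSV (that renaming and the name csv_text are my additions, not part of the answer above):

```python
import pandas as pd
import xmltodict

# One-record variant of the question's XML, inlined so the sketch is self-contained.
xml_text = """
<xml>
<top>
<main_record attr1="val1" attr2="val2" attr3="val3">
    <sub_record attrx="valx" attry="valy" />
</main_record>
</top>
</xml>
"""

# force_list keeps main_record a list even when there is only one record,
# which is what record_path needs.
data_dict = xmltodict.parse(xml_text, force_list=("main_record",))

df = pd.json_normalize(data_dict, record_path=["xml", "top", "main_record"])

# Strip the '@' attribute marker and the 'sub_record.' prefix from the column names.
df.columns = [c.replace("@", "").replace("sub_record.", "") for c in df.columns]

# to_csv with no path returns the CSV as a string; pass a file name to write to disk.
csv_text = df.to_csv(index=False)
print(csv_text)
```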
