Parsing and converting nested XML in Python

+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
| date     | ticket | value | notenders | tendertype | tenderamt | receipeno | price | qty |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 1096      | 7     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 786       | 8     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 599       | 0     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 605       | 0     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 608       | 0     | 4   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 143       | 0     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 381       | 7     | 1   |
| 20190101 | 12345  | 15    | 1         | 0          | 15        | 607       | 0     | 1   |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+

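2 answers

User

#1 · edited 2024-09-29 01:36:41

You can try the following code to get all the data out of the nested XML file, although I suspect there is a more elegant way to achieve this result: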

import pandas as pd, numpy as np
import xml.etree.ElementTree as ET

xml_data = 'your xml data'

# Prepare for the list of variable to save XML data
date=[]
ticket=[]
value=[]
notenders=[]
tendertype=[]
tenderamt=[]
receipeno=[]
price=[]
qty=[]

# Parse the XML File to get the desired data
root = ET.fromstring(xml_data)
# Get header data from XML (date, ticket, value, notenders, tenderdetail)
date.append(root.find('date').text)
ticket.append(root.find('ticket').text)
value.append(root.find('value').text)
notenders.append(int(root.find('notenders').text))
nested_node0=root.findall('tenderdetail')
for child0 in nested_node0:
    tendertype.append(int(child0.find('tendertype').text))
    tenderamt.append(int(child0.find('tenderamt').text))
# Get all data under first item tag
nested_node1 = root.findall('item') #1
for child in nested_node1:
    receipeno.append(int(child.find('receipeno').text))
    price.append(int(child.find('price').text))
    qty.append(int(child.find('qty').text))

    # Get all data under first items tag
    nested_node2 = child.findall('items') #2
    for child2 in nested_node2:
        # Get all data under second item tag
        nested_node3 = child2.findall('item') #3
        for child3 in nested_node3:
            receipeno.append(int(child3.find('receipeno').text))
            price.append(int(child3.find('price').text))
            qty.append(int(child3.find('qty').text))
            # Get all data under second items tag
            nested_node4 = child3.findall('items') #4
            for child4 in nested_node4:
                # Get all data under third item tag
                nested_node5 = child4.findall('item') #5
                for child5 in nested_node5:
                    receipeno.append(int(child5.find('receipeno').text))
                    price.append(int(child5.find('price').text))
                    qty.append(int(child5.find('qty').text))

# Make the same length of every list of data with the max length
date.extend([np.nan]*(len(receipeno)-len(date)))
ticket.extend([np.nan]*(len(receipeno)-len(ticket)))
value.extend([np.nan]*(len(receipeno)-len(value)))
notenders.extend([np.nan]*(len(receipeno)-len(notenders)))
tendertype.extend([np.nan]*(len(receipeno)-len(tendertype)))
tenderamt.extend([np.nan]*(len(receipeno)-len(tenderamt)))
data={'date':date,
      'ticket':ticket,
      'value':value,
      'notenders':notenders,
      'tendertype':tendertype,
      'tenderamt':tenderamt,
      'receipeno': receipeno,
      'price': price,
      'qty':qty}

# Create DataFrame from data
df = pd.DataFrame(data)
# Forward-fill the header columns; fillna(method='ffill') is deprecated in recent pandas
df = df.ffill()
df

Output: (image not preserved)

Hope this helps.
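
As a hedged sketch of the more elegant approach hinted at in this answer, the three hard-coded nesting levels can be replaced by a small recursive helper. The sample XML is an assumption reconstructed from the tag names used in the code (the question's real XML is not shown here); the root tag `transaction` and the helper name `walk_items` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure is assumed from the tag names in the answer.
SAMPLE_XML = """
<transaction>
  <date>20190101</date>
  <ticket>12345</ticket>
  <value>15</value>
  <notenders>1</notenders>
  <tenderdetail>
    <tendertype>0</tendertype>
    <tenderamt>15</tenderamt>
  </tenderdetail>
  <item>
    <receipeno>1096</receipeno>
    <price>7</price>
    <qty>1</qty>
    <items>
      <item>
        <receipeno>786</receipeno>
        <price>8</price>
        <qty>1</qty>
        <items>
          <item>
            <receipeno>599</receipeno>
            <price>0</price>
            <qty>1</qty>
          </item>
        </items>
      </item>
    </items>
  </item>
</transaction>
"""


def walk_items(node):
    """Yield (receipeno, price, qty) for every <item> below node, at any depth."""
    for item in node.findall('item'):
        yield (int(item.find('receipeno').text),
               int(item.find('price').text),
               int(item.find('qty').text))
        # Descend into each nested <items> container, if there is one.
        for container in item.findall('items'):
            yield from walk_items(container)


root = ET.fromstring(SAMPLE_XML)

# Header values are read once and copied into every row.
header = {
    'date': root.find('date').text,
    'ticket': root.find('ticket').text,
    'value': root.find('value').text,
    'notenders': int(root.find('notenders').text),
    'tendertype': int(root.find('tenderdetail/tendertype').text),
    'tenderamt': int(root.find('tenderdetail/tenderamt').text),
}
rows = [dict(header, receipeno=r, price=p, qty=q) for r, p, q in walk_items(root)]

for row in rows:
    print(row)
```

Because every row already carries the header values, `pd.DataFrame(rows)` should give the flat table directly, with no NaN padding or forward-fill step.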

User

#2 · edited 2024-09-29 01:36:41

Start with the necessary imports:

import pandas as pd
import xml.etree.ElementTree as et
import re

Then, to strip leading zeros from some of the tags that will be read, define the following function:

def stripLZ(src):
    return re.sub(r'^0+(?=\d)', '', src)
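
A quick check of what this regex does may help. The lookahead `(?=\d)` requires a digit after the removed zeros, so a value made only of zeros keeps its last `0`. The input strings below are made-up examples, not values from the original file.

```python
import re


def stripLZ(src):
    # Remove leading zeros, but only while a digit still follows them,
    # so '000' becomes '0' rather than an empty string.
    return re.sub(r'^0+(?=\d)', '', src)


for sample in ['0015', '000', '7', '0.50']:
    print(sample, '->', stripLZ(sample))
```

Note that `'0.50'` is left unchanged, because the character after the leading zero is a dot rather than a digit.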

To read the source file and get its root element, run:

tree = et.parse('transaction.xml')
root = tree.getroot()

To read the tags at the root level (as opposed to those inside items), run:

dt = root.find('date').text
tck = root.find('ticket').text
val = root.find('value').text
notend = stripLZ(root.find('notenders').text)

The remaining two tags are one level down, so start by reading their parent tag:

tdet = root.find('tenderdetail')

Then read these two tags from it:

tendtyp = stripLZ(tdet.find('tendertype').text)
tendamt = tdet.find('tenderamt').text

Note that I used the stripLZ function here (and will use it a few more times).
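
One caveat with the `find(...).text` pattern: `find` returns `None` when a tag is absent, so `.text` then raises `AttributeError`. A minimal sketch of a more tolerant read uses `Element.findtext` with a `default`. The empty-string fallback and the sample fragment are my assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as et

# Hypothetical fragment with no <tenderamt> tag, to show the fallback.
root = et.fromstring(
    '<transaction><date>20190101</date>'
    '<tenderdetail><tendertype>00</tendertype></tenderdetail></transaction>'
)

dt = root.findtext('date', default='')
# findtext accepts a path, so a nested tag can be read in one call.
tendtyp = root.findtext('tenderdetail/tendertype', default='')
tendamt = root.findtext('tenderdetail/tenderamt', default='')  # tag is absent

print(dt, tendtyp, repr(tendamt))
```

Whether a silent default is preferable to an exception depends on how strictly the input should be validated.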

Now it is time to create the result DataFrame:

df_cols = ['date', 'ticket', 'value', 'notenders',
    'tendertype', 'tenderamt', 'receipeno', 'price', 'qty']
df = pd.DataFrame(columns = df_cols)

The loading loop can be written with the iter method:

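rows = []
for it in root.iter('item'):
    rcp = it.find('receipeno').text
    prc = it.find('price').text
    qty = stripLZ(it.find('qty').text)
    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    rows.append([dt, tck, val, notend, tendtyp, tendamt, rcp, prc, qty])
# Build the result in one step from the collected rows
df = pd.DataFrame(rows, columns=df_cols)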

This loop:

Iterates over all item tags, regardless of their depth.
Reads 3 tags from the current item.
Adds a row to the result.
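
Putting the steps of this answer together, here is a self-contained sketch that can be run without `transaction.xml`. The sample XML, including its leading zeros and the root tag name `transaction`, is an assumption made for illustration, since the real file is not shown in this thread. Rows are collected in a list and the DataFrame is built once at the end.

```python
import re
import xml.etree.ElementTree as et

import pandas as pd

# Hypothetical input; the real transaction.xml is not shown in this thread.
SAMPLE_XML = """
<transaction>
  <date>20190101</date>
  <ticket>12345</ticket>
  <value>15</value>
  <notenders>01</notenders>
  <tenderdetail>
    <tendertype>00</tendertype>
    <tenderamt>15</tenderamt>
  </tenderdetail>
  <item>
    <receipeno>1096</receipeno>
    <price>7</price>
    <qty>0001</qty>
    <items>
      <item>
        <receipeno>608</receipeno>
        <price>0</price>
        <qty>0004</qty>
      </item>
    </items>
  </item>
</transaction>
"""


def stripLZ(src):
    # Strip leading zeros while keeping at least one digit.
    return re.sub(r'^0+(?=\d)', '', src)


root = et.fromstring(SAMPLE_XML)
tdet = root.find('tenderdetail')

# Values shared by every output row.
header = [
    root.find('date').text,
    root.find('ticket').text,
    root.find('value').text,
    stripLZ(root.find('notenders').text),
    stripLZ(tdet.find('tendertype').text),
    tdet.find('tenderamt').text,
]

df_cols = ['date', 'ticket', 'value', 'notenders',
           'tendertype', 'tenderamt', 'receipeno', 'price', 'qty']

rows = []
for it in root.iter('item'):  # visits <item> tags at every depth
    rows.append(header + [
        it.find('receipeno').text,
        it.find('price').text,
        stripLZ(it.find('qty').text),
    ])

df = pd.DataFrame(rows, columns=df_cols)
print(df.to_string(index=False))
```

All values are still strings at this point. If numeric columns are needed, something like `df[['price', 'qty']].apply(pd.to_numeric)` is one way to convert them.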
