Parsing deeply nested XML into a DataFrame with Python: the deeper elements are not extracted

Posted 2024-09-27 00:20:32


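I am trying to parse a fairly deeply nested XML file. I have spent several hours looking for a solution without success. I am not sure whether the problem is the namespace or whether I need to search inside a loop.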

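I can extract the higher-level elements, but not the more deeply nested ones. I want to export part_number, manufacturer_name, name, product, and retail to a DataFrame.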

Here is a sample of the XML (not every record is complete; some fields are missing):

<?xml version="1.0" encoding="UTF-8"?><merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd"><header><merchantId>35386</merchantId><merchantName>Rock Bottom Golf</merchantName><createdOn>10/13/2021 14:01:49</createdOn></header>
<product product_id='15' name='Champ Golf- Max Pro Spike Wrench' sku_number='19CHPSPWRCH1111111111101' manufacturer_name='Champ Golf' part_number='19CHPSPWRCH1111111111101'><category><primary>Sporting Goods</primary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.15&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fother%2Fchamp-golf-max-pro-spike-wrench%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19CHPSPWRCH1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19chpspwrch1111111111101.jpg</productImage></URL><description><short>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</short><long>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>9.99</retail></price><brand>Champ Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00036504884013</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.15&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='21' name='Stinger Tees- 3&quot; Stinger Pro XL Competition Camo Mid Pack Poly Bag [125 Count]' sku_number='19STGTEEMID3CO1111111101' manufacturer_name='Stinger Tees' part_number='19STGTEEMID3CO1111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.21&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Ftees%2Fstinger-tees-3-stinger-pro-xl-competition-camo-mid-pack-poly-bag-125-count%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19STGTEEMID3CO1111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/3%20tees%20125%20count.jpg</productImage></URL><description><short>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</short><long>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>7.99</retail></price><brand>Stinger Tees</brand><shipping><availability>in-stock</availability></shipping><upc>00853190005047</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.21&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='23' name='Vegas Golf- Original Game' sku_number='19VEGORIGIN1111111111101' manufacturer_name='Vegas Golf' part_number='19VEGORIGIN1111111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.23&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fother%2Fvegas-golf-original-game%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19VEGORIGIN1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19vegorigin1111111111101.jpg</productImage></URL><description><short>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</short><long>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>14.99</retail></price><brand>Vegas Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00689076007030</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.23&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='28' name='Ray Cook Golf- 12&apos; Compact Cup Ball Retriever' sku_number='19RAYBALRET1111111111201' manufacturer_name='Ray Cook Golf' part_number='19RAYBALRET1111111111201'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.28&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fball-retrievers%2Fray-cook-golf-12-compact-cup-ball-retriever%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19RAYBALRET1111111111201</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19raybalret12.jpg</productImage></URL><description><short>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</short><long>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>19.99</retail></price><brand>Ray Cook Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00840254178410</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.28&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>

I wrote the Python code below. It extracts part_number, manufacturer_name, and name, but the other two fields remain elusive.

My code:

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse(r"file.xml")
xroot = xtree.getroot() 

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []

for node in xroot: 
    part_number = node.attrib.get("part_number")
    manufacturer_name = node.attrib.get("manufacturer_name")
    name = node.attrib.get("name")  
    product = node.findall("product") if node is not None else None
    retail = node.findall("retail") if node is not None else None

    rows.append({"part_number": part_number, "manufacturer": manufacturer_name, "name": name, "retail": retail, "product": product,})


out_df = pd.DataFrame(rows, columns = df_cols)

out_df.head()

My current output (retail and product are empty):

                part_number   manufacturer  ... retail product
0                      None           None  ...     []      []
1  19CHPSPWRCH1111111111101     Champ Golf  ...     []      []
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     []      []
3  19VEGORIGIN1111111111101     Vegas Golf  ...     []      []
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     []      []

My desired output (URLs are shortened here for readability; the product column should hold the full URL):

                part_number   manufacturer  ... retail product
0                      None           None  ...     9.99     https://click.linksynergy.com/link?id=83...
1  19CHPSPWRCH1111111111101     Champ Golf  ...     7.99      https://click.linksynergy.com/link?id=83...
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     14.99      https://click.linksynergy.com/link?id=83...
3  19VEGORIGIN1111111111101     Vegas Golf  ...     19.99      https://click.linksynergy.com/link?id=83...
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     6.99      https://click.linksynergy.com/link?id=83...

Any help would be greatly appreciated.


2 Answers

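This assumes the XML structure is constant, so that the XPath expression returns the elements and attributes in the same order for every product.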

from lxml import etree
import pandas as pd

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
tree = etree.parse('/home/luis/tmp/tmp.xml')
root = tree.getroot()
steps = tree.xpath('//product/attribute::*[name()="name" or name()="part_number" or name()="manufacturer_name"] | //product/URL/product/text() | //product/price/retail/text()')
i = 0
d = dict()
for s in steps:
    # Results arrive in document order for each product:
    # name, manufacturer_name, part_number, URL/product text, price/retail text
    if i == 0:
        d[df_cols[2]] = s  # name
    if i == 1:
        d[df_cols[1]] = s  # manufacturer
    if i == 2:
        d[df_cols[0]] = s  # part_number
    if i == 3:
        d[df_cols[4]] = s  # product (URL)
    if i == 4:
        d[df_cols[3]] = s  # retail
        rows.append(d)
        i = 0
        d = dict()
        continue
    i += 1


out_df = pd.DataFrame(rows, columns = df_cols)

print(out_df.head())

Result:

                part_number   manufacturer                                               name  retail                                            product
0  19CHPSPWRCH1111111111101     Champ Golf                   Champ Golf- Max Pro Spike Wrench    9.99  https://click.linksynergy.com/link?id=83wh4zNK...
1  19STGTEEMID3CO1111111101   Stinger Tees  Stinger Tees- 3" Stinger Pro XL Competition Ca...    7.99  https://click.linksynergy.com/link?id=83wh4zNK...
2  19VEGORIGIN1111111111101     Vegas Golf                          Vegas Golf- Original Game   14.99  https://click.linksynergy.com/link?id=83wh4zNK...
3  19RAYBALRET1111111111201  Ray Cook Golf      Ray Cook Golf- 12' Compact Cup Ball Retriever   19.99  https://click.linksynergy.com/link?id=83wh4zNK...
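Since the question mentions that some fields are missing in some records, a positional approach like the one above can drift out of step as soon as one product lacks a value. As a hedged alternative, here is a minimal sketch that looks up each field per product using only the standard library. The inline sample, the example.com URLs, and the extract_rows helper are assumptions made for illustration and are not part of the original feed or answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the feed in the question.
SAMPLE = """<merchandiser>
  <header><merchantId>35386</merchantId></header>
  <product name="Spike Wrench" manufacturer_name="Champ Golf" part_number="A1">
    <URL><product>https://example.com/a1</product></URL>
    <price currency="USD"><retail>9.99</retail></price>
  </product>
  <product name="Tees" manufacturer_name="Stinger Tees" part_number="B2">
    <URL><product>https://example.com/b2</product></URL>
  </product>
</merchandiser>"""


def extract_rows(xml_text):
    """Return one dict per top-level <product>, with None for missing fields."""
    root = ET.fromstring(xml_text)
    rows = []
    # findall("product") matches only direct children of the root, so
    # <header> and the nested <URL>/<product> elements are skipped.
    for p in root.findall("product"):
        rows.append({
            "part_number": p.get("part_number"),
            "manufacturer": p.get("manufacturer_name"),
            "name": p.get("name"),
            # findtext returns the default when the path does not match.
            "retail": p.findtext("price/retail", default=None),
            "product": p.findtext("URL/product", default=None),
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) as in the code above. The second sample product has no price element, so its retail value comes back as None instead of shifting the remaining columns.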

See below:

import requests
import xml.etree.ElementTree as ET
import pandas as pd

r = requests.get('https://raw.githubusercontent.com/dgs2021/golfdeals/main/35386_3864840_mp_delta.xml')
attrb_fields =  {'manufacturer_name': 'manufacturer','name':'name','part_number':'part_number'}
sub_elements = {'retail':'retail','product':'product'}

root = ET.fromstring(r.content)

data = []
for p in root.findall('product'):
  entry = {v:p.attrib.get(k,'NA') for k,v in attrb_fields.items()}
  for k,v in sub_elements.items():
    e = p.find(f'.//{v}')
    entry[k] = e.text if e is not None else 'NA'
  data.append(entry)
columns = list(attrb_fields.values()) + list(sub_elements.values())
df = pd.DataFrame(data,columns= columns)
print(df)

Output:

          manufacturer  ...                                            product
0           Champ Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
1         Stinger Tees  ...  https://click.linksynergy.com/link?id=83wh4zNK...
2           Vegas Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
3        Ray Cook Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4     Rock Bottom Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
...                ...  ...                                                ...
4100     Callaway Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4101        Cobra Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4102      Odyssey Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4103   TaylorMade Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4104     Titleist Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...

[4105 rows x 5 columns]
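The './/' prefix in the find call above is what reaches the nested elements. In ElementTree, a bare tag name passed to find or findall matches direct children only, which would explain the empty lists in the question, and iterating over the root also yields the header element, which would explain the first row of None values. The sketch below illustrates both points on a small inline sample; the sample XML and the example.com URL are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-product sample shaped like the feed in the question.
root = ET.fromstring(
    "<merchandiser>"
    "<header><merchantId>35386</merchantId></header>"
    "<product part_number='A1'>"
    "<URL><product>https://example.com/a1</product></URL>"
    "<price currency='USD'><retail>9.99</retail></price>"
    "</product>"
    "</merchandiser>"
)

# Iterating the root yields every direct child, including <header>,
# which carries none of the product attributes.
child_tags = [child.tag for child in root]

product = root.find("product")

# A bare tag name matches direct children only; <retail> sits under <price>.
direct = product.findall("retail")

# ".//" searches all descendants; an explicit path names the parent instead.
deep = product.find(".//retail")
explicit = product.find("price/retail")

print(child_tags)                 # ['header', 'product']
print(direct)                     # []
print(deep.text, explicit.text)   # 9.99 9.99
```

Where the parent element is known, an explicit path such as price/retail or URL/product is the narrower choice, because './/product' would also match any other descendant that happens to be named product.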
