如何将以下多维类xml数据解析为datafram

2024-09-27 19:27:44 发布

您现在位置:Python中文网/ 问答频道 /正文

我已经为工作相关的事情探索了一个基于XML的API,它来自仓库数据。理想情况下,我想用python和pandas做一些分析。在

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) , None,  StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) ,  

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) ,  StringAggregateDimensionValue(value=u'PPTransMergeNonCon') ,  StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15)   

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= UTC )) ,  StringAggregateDimensionValue(value=u'PPTransFRA1') ,  StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=8) ,  

在我用VIM进行了一些查找和替换之后,数据看起来像上面的流(我知道我可以用python编写脚本)。我怎样才能最好地把这种奇怪的格式变成熊猫呢?理想情况下,我需要datetime、String aggregatedimension值和数量。但是有很多没有,在这个解析中需要的数据。 在数据帧中,做一些分析会很容易,但我在这里有点困惑(感觉很像n00b)。在

编辑: 这是我得到并想要解析的未经修正和未替换的数据。它不是真正的XML,所以XML不起作用。在

^{pr2}$

Tags: 数据datetimevalue情况xmlquantitylistaggregate
2条回答

如果您更喜欢类似于解析器的东西,下面是针对您的问题的pyparsing工具:

from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional
import datetime

# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
    UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
    UTC = None

_ = Suppress
evaltokens = lambda s,l,t: eval(''.join(t))

timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False)
timevalue.setParseAction(evaltokens)

strvalue = 'u' + QuotedString("'", unquoteResults=False)
strvalue.setParseAction(evaltokens)

nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None])
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0]))

COMMA = Optional(_(","))

valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4])

lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") +
            valuedexpr(timevalue)("timestamp") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s1") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s2") + COMMA +
        "]" + COMMA +
        "quantity=" + intvalue("qty"))

使用lineexpr.searchString从每个聚合中提取数据:

^{pr2}$

给予:

[datetime.datetime(2013, 8, 28, 19, 30), None, u'VIRTUALLY_LABELED_CASE', ']', 'quantity=', 127]
- qty: 127
- s1: None
- s2: VIRTUALLY_LABELED_CASE
- timestamp: 2013-08-28 19:30:00
127

[datetime.datetime(2013, 8, 28, 19, 30), u'PPTransMergeNonCon', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 15]
- qty: 15
- s1: PPTransMergeNonCon
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
15

[datetime.datetime(2013, 8, 27, 21, 0), u'PPTransFRA1', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 8]
- qty: 8
- s1: PPTransFRA1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 21:00:00
8

dump()将显示所有可用的命名结果值-请注意如何使用data.qty直接访问quantity属性。这是用"quantity=" + intvalue("qty")中的结果名“qty”的定义为您设置的。timestamps1和{}可以类似地访问。(这里还有一点eval,清理这些内容是留给读者的练习。)

编辑:

这里是修改后的pyparsing解析器,用于处理原始的XML类内容。变化真的很小:

^{4}$

从您粘贴的文本(其中一些文本格式错误)中可以看到:

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 992]
- qty: 992
- s1: PPTransCGN1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 20:30:00
992

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), None, u'TOTE', 'quantity=', 87]
- qty: 87
- s1: None
- s2: TOTE
- timestamp: 2013-08-23 19:30:00
87

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'PPTransMUC3', u'TOTE', 'quantity=', 14]
- qty: 14
- s1: PPTransMUC3
- s2: TOTE
- timestamp: 2013-08-27 17:30:00
14

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'PPTransEUK5', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 339]
- qty: 339
- s1: PPTransEUK5
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 20:30:00
339

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'TOTE', 'quantity=', 1731]
- qty: 1731
- s1: PPTransCGN1
- s2: TOTE
- timestamp: 2013-08-26 20:30:00
1731

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransEUK5', u'TOTE', 'quantity=', 28]
- qty: 28
- s1: PPTransEUK5
- s2: TOTE
- timestamp: 2013-08-26 19:30:00
28

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'PPTransORY1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 69]
- qty: 69
- s1: PPTransORY1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
69

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransMAD4', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 47]
- qty: 47
- s1: PPTransMAD4
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 19:30:00
47

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), None, None, 'quantity=', 78]
- qty: 78
- s1: None
- s2: None
- timestamp: 2013-08-26 21:00:00
78

您可以为AggregateDateAggregateDimensionValue、和{}定义最小类,然后eval依次为每一行定义:

import datetime

# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
    UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
    UTC = None

# define minimal classes to eval initializers
class AggregateDimensionValue(object):
    def __init__(self, value):
        self.value = value
class StringAggregateDimensionValue(AggregateDimensionValue): pass
class DateAggregateDimensionValue(AggregateDimensionValue): pass
class Aggregate(object):
    def __init__(self, aggregate_dimension_value_list, quantity):
        self.timestamp, self.s1, self.s2 = aggregate_dimension_value_list
        # pull values out of parsed "aggregate" instances
        self.timestamp = self.timestamp.value
        if self.s1 is not None:
            self.s1 = self.s1.value
        if self.s2 is not None:
            self.s2 = self.s2.value
        self.quantity = quantity

使用这些最小类来eval输入字符串:

^{pr2}$

给出:

{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': None, 'quantity': 127, 's2': u'VIRTUALLY_LABELED_CASE'}
{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': u'PPTransMergeNonCon', 'quantity': 15, 's2': u'PRIME_BIN_RANDOM_STOW'}
{'timestamp': datetime.datetime(2013, 8, 27, 21, 0), 's1': u'PPTransFRA1', 'quantity': 8, 's2': u'PRIME_BIN_RANDOM_STOW'}

当然,这与使用eval有关的所有常见警告一起出现,例如小心任何可能的恶意代码注入。但是我怀疑你自己已经控制了这个输入文件,所以如果你注入你自己的恶意代码,你只能怪你自己。在

相关问题 更多 >

    热门问题