How do I access data from HDF5 using hierarchical keys?

Posted 2024-09-29 21:40:04


I have created a store in HDF5 with hierarchical keys, structured as follows

<class 'pandas.io.pytables.HDFStore'>
File path: path-analysis/data/store.h5
/attribution/attr_000000            frame        (shape->[1,5])
/attribution/attr_000001            frame        (shape->[1,5])
/attribution/attr_000002            frame        (shape->[1,5])
/attribution/attr_000003            frame        (shape->[1,5])
.....
/impression/imp_000000              frame        (shape->[1,5])
/impression/imp_000001              frame        (shape->[1,5])
/impression/imp_000002              frame        (shape->[1,5])
/impression/imp_000003              frame        (shape->[1,5])
.....

From what I read in the documentation, I should be able to access the impressions and attributions like this

store.select('impression')
store.select('attribution')

However, I get an error: TypeError: cannot create a storer if the object is not existing nor a value are passed

To store the data, I have to iterate over it

store.put('impression/imp_' + name, df)

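Initially I used the append API to create a single table 'impression', but that took 80 seconds per DataFrame, and given that I have nearly 200 files to process, append seemed too slow.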

In contrast, put takes less than a second to add to the store, but it does not let me select the data afterwards.

Given the structure above, how should I access my data?

Also, why is append so much slower than put? Can it be made faster?
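On the append question, the pandas documentation notes that put writes the non-queryable 'fixed' format by default, while append writes the queryable 'table' format and, by default, updates the table index on every call. A minimal sketch of the documented workaround, assuming small hypothetical chunks in place of the real per-file DataFrames, is to append with index=False and build the index once at the end:

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'append_demo.h5')

# Hypothetical chunks standing in for the ~200 per-file DataFrames.
chunks = [pd.DataFrame({'channel': ['online'] * 4,
                        'winning_price': np.arange(4) + 10 * i})
          for i in range(3)]

with pd.HDFStore(path) as store:
    for chunk in chunks:
        # index=False skips updating the PyTables index on every append.
        store.append('impression', chunk, index=False,
                     data_columns=['channel'])
    # Build the index once, after all rows have been written.
    store.create_table_index('impression', columns=['channel'], kind='full')
    nrows = store.get_storer('impression').nrows

print(nrows)
```

Whether this closes the gap to put depends on the data; the 'table' format still carries some write overhead compared with 'fixed'.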

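Later on, I would like to group the impression data by a certain column and do the same grouping for the attribution data, so I also need to be able to select columns.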
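Column selection and where-queries need the 'table' format, which put can also write when asked to. The sketch below assumes hypothetical rows that reuse the column names from this question (the values, including the 'mobile' channel, are invented for illustration) and shows selecting a subset of columns before grouping:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'table_demo.h5')

# Hypothetical rows; only the column names come from the question.
df = pd.DataFrame({'channel': ['online', 'online', 'mobile'],
                   'winning_price': [100, 250, 300],
                   'ipaddress': ['10.0.0.1', '10.0.0.2', '10.0.0.3']})

with pd.HDFStore(path) as store:
    # format='table' makes the node queryable;
    # data_columns allows `where` conditions on channel.
    store.put('impression', df, format='table', data_columns=['channel'])

    # Read only the columns needed for the grouping.
    subset = store.select('impression', columns=['channel', 'winning_price'])
    # Filter rows on disk using the data column.
    online = store.select('impression', where='channel == "online"')

totals = subset.groupby('channel')['winning_price'].sum()
print(totals)
```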

Am I taking the wrong approach in structuring my data?

Here is the df info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 251756 entries, 0 to 257114
Data columns (total 5 columns):
pmessage_type       251756 non-null object
channel             251756 non-null object
source_timestamp    251756 non-null object
winning_price       251756 non-null int64
ipaddress           251756 non-null object
dtypes: int64(1), object(4)None

Here is a sample of the data

,pmessage_type,channel,source_timestamp,winning_price,ipaddress
0,impression,online,14007920990001800,99.34.198.9
1,impression,online,1401587896000200,99.60.68.61
2,impression,online,1400873222000735,65.96.72.183
3,impression,online,140076855560005550,73.182.225.30
4,...
5,impression,online,140099277000,88,73.182.225.30
6,impression,online,14007099948000290162.228.58.98
7,impression,online,14006346070001720162.228.58.98
8,impression,online,139920156800710108.206.240.138

This data was exported using df.to_csv(...) for simplicity.

Here is the code snippet showing how the raw data is loaded into the DataFrame.

data = pd.read_csv(events_csv_file,
                   delimiter='\x01',
                   header=None,
                   names=my_columns.keys(),
                   dtype=my_columns,
                   usecols=my_subset_columns,
                   iterator=True,
                   chunksize=1e6)
df = pd.concat(data)

where my_columns is a dictionary:

{'attribution_strategy': object,
 'channel': object,
'flight_uid': object,
'ipaddress': object,
'pixel_id': object,
'pmessage_type': object,
'source_timestamp': object,
'source_unique_id': object,
'unique_id': object,
'user_id': object,
'winning_price': numpy.int64}

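The purpose of specifying the types manually was to improve speed. (I read somewhere that it helps with processing, but I did not observe such an improvement.)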

Also, here is my pandas version, in case it makes a difference

>>> pandas.__version__
'0.14.0'
>>>

====================

Here is an easily reproducible example

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                     'bar', 'bar', 'bar', 'bar',
                     'foo', 'foo', 'foo'],
               'B': ['one', 'one', 'one', 'two',
                     'one', 'one', 'one', 'two',
                     'two', 'two', 'one'],
               'C': ['dull', 'dull', 'shiny', 'dull',
                     'dull', 'shiny', 'shiny', 'dull',
                     'shiny', 'shiny', 'shiny'],
               'D': np.random.randn(11),
               'E': np.random.randn(11),
               'F': np.random.randn(11)})

store = pd.HDFStore('mystore.h5')
store.put('data/01', df)
store.put('data/02', df)

print store

<class 'pandas.io.pytables.HDFStore'>
File path: mystore.h5
/data/01            frame        (shape->[11,6])
/data/02            frame        (shape->[11,6])

This is what I would like to run

store.select('data')

But I get an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-60360d11cde5> in <module>()
----> 1 store.select('data')

/Users/sshegheva/anaconda/envs/numba/lib/python2.7/site-packages/pandas/io/pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    650         # create the storer and axes
    651         where = _ensure_term(where, scope_level=1)
--> 652         s = self._create_storer(group)
    653         s.infer_axes()
    654 

/Users/sshegheva/anaconda/envs/numba/lib/python2.7/site-packages/pandas/io/pytables.pyc in _create_storer(self, group, format, value, append, **kwargs)
   1157                 else:
   1158                     raise TypeError(
-> 1159                         "cannot create a storer if the object is not existing "
   1160                         "nor a value are passed")
   1161             else:

TypeError: cannot create a storer if the object is not existing nor a value are passed

By contrast, removing data by selecting the top-level key in the hierarchy works as expected

store.remove('data')
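One possible way to read everything under a top-level key, given that '/data' is only a group node and not a stored object, is to list the leaf keys with store.keys() and concatenate the frames. This is a sketch under that assumption, using a smaller stand-in DataFrame than the one above, and not necessarily the intended API for hierarchical keys:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small store shaped like the example above (two frames under /data).
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                   'D': np.arange(3.0)})
path = os.path.join(tempfile.mkdtemp(), 'mystore.h5')

with pd.HDFStore(path) as store:
    store.put('data/01', df)
    store.put('data/02', df)

    # '/data' is a group, so select each leaf key beneath it instead.
    leaf_keys = sorted(k for k in store.keys() if k.startswith('/data/'))
    combined = pd.concat([store.select(k) for k in leaf_keys],
                         ignore_index=True)

print(leaf_keys)
print(combined.shape)
```

Reading each frame and concatenating in memory works for frames written with put, but it loads every frame in full; filtering on disk would require the 'table' format.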
