Storing a pandas DataFrame with mixed data and categorical columns in HDF5

import pandas as pd

mydf = pd.DataFrame({
    'endTime': pd.Series([1443525810, 1443540836, 1443609470]),
    'distance': pd.Series([454.75, 477.25, 242.12]),
    'signature': pd.Series(['ab', 'cd', 'ab']),
    'anchorName': pd.Series(['tec', 'ing', 'pol']),
    'stationList': pd.Series([['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']]),
})

# this works fine (no category)
mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# this crashes now because of category data
# mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

# switching to format='t'
# this caused problems because of "mixed data" in column stationList
mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

mydf.pop('stationList')

# this again works fine
mydf.to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

1 Answer

Answer 1 · posted 2024-09-21 01:14:29

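You have two problems:

1. You want to store categorical data in an HDF5 file.
2. You are trying to store arbitrary objects (here, the lists in stationList) in an HDF5 file.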

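As you discovered, categorical data is (currently?) supported only by the "table" format for HDF5, not by the default "fixed" format.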

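Storing arbitrary objects (lists of strings and so on), however, is not something the HDF5 format itself really supports. Pandas works around this for you by serializing those objects with pickle and then storing the pickle as an arbitrary-length string (which, I believe, not every HDF5 storage format supports). That is going to be slow and inefficient, and it will never be well supported by HDF5.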

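As I see it, you have two options:

1. Pivot your data so that you have one row per station name. You can then store everything in the table format of the HDF5 file. (This is good practice in general; see Hadley Wickham on Tidy Data.)
2. If you really want to keep this layout, you might as well save the whole DataFrame with to_pickle(). It has no trouble with any kind of object (lists of strings, etc.) you throw at it.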
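
For the to_pickle() route, a minimal round-trip sketch might look like the following. The file name journeys.pkl and the temporary directory are assumptions made for illustration, not part of the original question.

```python
import os
import tempfile

import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# Write the whole frame (categoricals and list cells included) and read it back.
# 'journeys.pkl' is a hypothetical file name chosen for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'journeys.pkl')
    mydf.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Keep in mind that pickle files are Python-specific and should only be loaded from sources you trust.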

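Personally, I would recommend option 1. You get to use a fast binary file format, and the pivot will also make other operations on your data easier.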
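
To make option 1 concrete, here is one possible sketch of the reshaping step using DataFrame.explode (available in pandas 0.25 and later). The column names journey and station, the helper write_table, and the output file name are assumptions introduced for illustration. The HDF5 write is wrapped in a function and not called here, because it needs PyTables to be installed.

```python
import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})

# One row per (journey, station). The 'journey' column (hypothetical name)
# keeps the original row label so the lists can be rebuilt later.
long_df = (
    mydf.rename_axis('journey')
        .reset_index()
        .explode('stationList')
        .rename(columns={'stationList': 'station'})
        .reset_index(drop=True)
)
for col in ['anchorName', 'signature']:
    long_df[col] = long_df[col].astype('category')

# The per-journey station lists are still recoverable from the long layout.
station_lists = long_df.groupby('journey')['station'].agg(list)


def write_table(df, path):
    """Hypothetical helper: write df in HDF5 table format (requires PyTables)."""
    df.to_hdf(path, key='journeys', mode='w', format='table',
              complevel=9, complib='bzip2')


# Example call, left commented out because it depends on PyTables:
# write_table(long_df, 'journeys_long.hdf5')
```

Every cell in the long layout is a scalar, so no column holds Python lists any more, which is what the table format objected to in the question.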
