Python/NumPy: Handling structured arrays compared with ordinary NumPy arrays (Python 2.7)

import numpy as np
import h5py as hdf5

mydict = {'name0': 'haloid', 'name1': 'hostid', ...}  # dictionary of column names
nr_rows = 200000    # approximated
nr_files = 100      # up to 2200
nr_entries = 10     # up to 50
size = 0
size_before = 0
new_size = 0

# normal array:
data_array = np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array = np.zeros((nr_rows,), dtype=dt)

i = 0
while i < nr_files:
    size_before = new_size
    f = hdf5.File(path, "r")
    size = f[mydict['name0']].size
    new_size += size
    a = 0
    while a < nr_entries:
        name = mydict['name' + str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a += 1
    i += 1

# normal array:
print data_array[0:3,:]
[[  1.21080866e+10   1.21080866e+10   0.00000000e+00   5.69363234e+08   1.28992369e+03
    1.28894614e+03   1.32171442e+03  -1.08210000e+02   4.92900000e+02   6.50400000e+01]
 [  1.21080711e+10   1.21080711e+10   0.00000000e+00   4.76329837e+06   1.29058079e+03
    1.28741361e+03   1.32358059e+03  -4.23130000e+02   5.08720000e+02  -6.74800000e+01]
 [  1.21080700e+10   1.21080700e+10   0.00000000e+00   2.22978043e+10   1.28750287e+03
    1.28864306e+03   1.32270418e+03  -6.13760000e+02   2.19530000e+02  -2.28980000e+02]]

# structured array:
print data_array[0:3]
# it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
   (12108071103L, 12108071103L, 0, 0.0, ... more data ... ... 228.02)
   ... more data ...
   (8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]

# normal array:
print data_array[0:3, [0,2,1]]
[[  1.21080866e+10   0.00000000e+00   1.21080866e+10]
 [  1.21080711e+10   0.00000000e+00   1.21080711e+10]
 [  1.21080700e+10   0.00000000e+00   1.21080700e+10]]

# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]

[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340  9248632230 12108066342 10878169355 10077026070]
 [ 6093565531 10077025463  8046772253  7871669276  5558161476  5558161473 12108068704 12108068708 12108077435 12108066338]
 [ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312 12108075900  9248643751  6630111058 12108074389]]

# NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid

# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]

# structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]

3 Answers

User

Answer 1 · edited 2024-09-24 22:22:51

First, a point of confusion. You show a dt definition with field names such as dt=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),... But the h5 load is

data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]

In other words, the file contains datasets named name0, name1, and so on, and you are loading them into an array whose fields have those same names.

You can iterate over the field names of an array defined with dt through its dtype.names attribute, for example:

In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), 
    ...: ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), 
    ...: ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]: 
array([(0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.),
       (0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.),
       (0, 0, 0,  0.,  0.,  0.,  0.,  0.,  0.,  0.)], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
    ...:     print(name)
    ...:     arr[name] = 1
    ...:     
haloid
hostid
 ....
In [24]: arr
Out[24]: 
array([(1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.),
       (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.),
       (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.)], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0]     # get one record
Out[25]: (1, 1, 1,  1.,  1.,  1.,  1.,  1.,  1.,  1.)
In [26]: arr[0]['hostid']     # get one field, one record
Out[26]: 1
In [27]: arr['hostid']       # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2]    # subset of records
Out[28]: array([1, 1], dtype=uint64)

So filling a structured array by field name should work fine:

arr[name][n1:n2] = file[dataset_name]
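Here is a minimal sketch of that fill pattern. It uses plain dictionaries of arrays as stand-ins for the opened HDF5 files, so it runs without any files on disk. The names fake_files, field_names and offset are illustrative assumptions and do not come from the question; with h5py the right-hand side would be the dataset read from the file.

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8')]
field_names = [name for name, _ in dt]

# Hypothetical stand-ins for two opened files: dataset name -> 1-D data.
fake_files = [
    {'haloid': np.array([10, 11, 12]), 'hostid': np.array([10, 10, 12]),
     'type': np.array([0, 1, 0]), 'mstar': np.array([1.5, 2.5, 3.5])},
    {'haloid': np.array([20, 21]), 'hostid': np.array([20, 20]),
     'type': np.array([0, 2]), 'mstar': np.array([4.5, 5.5])},
]

# 1-D structured array: one record per row, deliberately over-allocated.
data_array = np.zeros((10,), dtype=dt)

offset = 0
for f in fake_files:
    n = f['haloid'].size
    for name in field_names:
        # Assign one field for a contiguous slice of records.
        data_array[name][offset:offset + n] = f[name]
    offset += n

# Drop the records that were never filled.
data_array = data_array[:offset]
print(data_array.shape)
print(data_array['haloid'])
```

Note that the array stays 1-D throughout: the number of fields lives in the dtype, not in the shape.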

Output like this:

structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]

and

[[ (12108086595L, 12108086595L, 0,

To me this looks as though the structured data_array is actually 2-D, created with something like the following (see question 8):

data_array = np.zeros((10, nr_rows), dtype=dt)

That is the only way the [0][0:3] indexing could work.
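To check which layout you actually have, compare the shapes. This is a small sketch under the assumption that the array was created in one of these two ways; the names one_d and two_d are mine.

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
nr_rows = 5

one_d = np.zeros((nr_rows,), dtype=dt)      # intended layout: one record per row
two_d = np.zeros((10, nr_rows), dtype=dt)   # suspected layout: 10 x nr_rows records

# Selecting a field keeps the shape of the array it came from.
print(one_d['haloid'].shape)   # (5,)
print(two_d['haloid'].shape)   # (10, 5)

# On the 2-D array, [0] picks the first row and [0:3] takes three records from it.
sub = two_d[['haloid', 'type', 'hostid']][0][0:3]
print(sub.shape)               # (3,)
print(sub.dtype.names)         # ('haloid', 'type', 'hostid')

# On the 1-D array the same three records are simply one_d[0:3];
# one_d[0] is already a single record, with nothing left to slice.
print(one_d[0:3].shape)        # (3,)
print(one_d[0].shape)          # ()
```

If data_array.shape prints two numbers, the array has been made 2-D somewhere, most likely at creation or by a later resize.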

For the 2-D array:

mask = np.where(data[:,col2] > data[:,col1])

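This compares two columns. When in doubt, look first at the boolean array data[:,col2] > data[:,col1]. where simply returns the indices at which that boolean array is True.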

A simple example of masked indexing:

In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False,  True,  True,  True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]: 
array([[3, 2],
       [4, 1],
       [5, 0]])
In [38]: x[idx,:]
Out[38]: 
array([[[3, 2],
        [4, 1],
        [5, 0]]])

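In this structured example, data['x_pos'] selects the field. The [0] is needed to select the first row of that 2-D array (the dimension of size 10). The rest of the comparison and the where should work just as they do for the plain 2-D array: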

mask = np.where(data['x_pos'][0] > data['y_pos'][0])

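The mask[:][0] applied to the where tuple is probably not needed. mask is a tuple, [:] makes a copy of it, and [0] selects the first element, which is an array. Occasionally you may need arr[idx[0],:] rather than arr[idx,:], but do not do that as a matter of routine.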
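A short sketch of the alternatives, reusing the x array from the session above. The structured part (rec and sel) is my own toy data, assuming a properly 1-D structured array; it mirrors the "where type == 2, copy hostid into haloid" assignment from the question.

```python
import numpy as np

x = np.array((np.arange(6), np.arange(6)[::-1])).T   # shape (6, 2)
mask = x[:, 0] > x[:, 1]      # boolean array, one entry per row
idx = np.where(mask)          # 1-tuple holding an integer index array

rows_bool = x[mask]           # boolean indexing, no where needed
rows_int = x[idx[0]]          # integer indexing with the unpacked array
print(rows_bool.shape, rows_int.shape)

# Passing the tuple itself inside another index adds a leading axis.
print(x[idx, :].shape)

# Structured, 1-D case: a boolean mask on one field drives an
# assignment between two other fields.
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
rec = np.zeros((4,), dtype=dt)
rec['haloid'] = [1, 2, 3, 4]
rec['hostid'] = [1, 1, 3, 3]
rec['type'] = [0, 2, 0, 2]

sel = rec['type'] == 2
rec['haloid'][sel] = rec['hostid'][sel]   # field access is a view, so rec is updated
print(rec['haloid'])
```

The boolean mask is usually enough on its own; where is only worth calling when you need the integer positions themselves.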

My first comment suggested using separate arrays:

 dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
 data_id = np.zeros((n,), dtype=dt1)

 data = np.zeros((n,m), dtype=float)    # m float columns

or even

 haloid = np.zeros((n,), '<u8')
 hostid = np.zeros((n,), '<u8')
 type = np.zeros((n,), 'i1')

With these arrays, data_array['hostid'][0], data_id['hostid'], and hostid should all return the same 1-D array, and they are equally usable in the mask expression.

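Sometimes it is convenient to keep the ids in the same data structure. That is especially true when writing to or reading from csv-format files. But for masked selection it does not help much, and for calculations across data fields it can be a pain.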
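As a rough illustration of the separate-arrays layout, assuming the ids and the float columns are filled row for row in the same order (the values below are made up):

```python
import numpy as np

n, m = 5, 3
dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]

data_id = np.zeros((n,), dtype=dt1)
data_id['haloid'] = np.arange(100, 100 + n)
data_id['hostid'] = data_id['haloid']
data_id['type'] = [0, 2, 0, 1, 2]

data = np.arange(n * m, dtype=float).reshape(n, m)   # m float columns

# A mask built from an id field selects rows of the float block ...
is_type2 = data_id['type'] == 2
print(data[is_type2])

# ... and a mask built from a float column selects id records.
big = data[:, 0] > 5.0
print(data_id['haloid'][big])

# Arithmetic across float columns is a single expression on the 2-D array.
row_sums = data.sum(axis=1)
print(row_sums)
```

Because both arrays share the same row order, one boolean mask of length n works on either of them.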

I could also suggest a compound dtype, one like this:

dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]

In [41]: np.zeros((4,), dtype=dt2)
Out[41]: 
array([(0, 0, 0, [ 0.,  0.,  0.]), (0, 0, 0, [ 0.,  0.,  0.]),
       (0, 0, 0, [ 0.,  0.,  0.]), (0, 0, 0, [ 0.,  0.,  0.])], 
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

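Is it better to access the float data by column number or by a name such as 'x_coor'? Do you need to do calculations with several float columns at once, or do you always access them individually?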
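To make that trade-off concrete, here is a sketch with the compound dtype, assuming m = 3 float columns and made-up values. The ids are reached by name and the floats by column number inside the 'data' field.

```python
import numpy as np

m = 3
dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]

arr = np.zeros((4,), dtype=dt2)
arr['haloid'] = [1, 2, 3, 4]
arr['data'] = np.arange(12.0).reshape(4, m)

block = arr['data']               # ordinary 2-D float view, shape (4, 3)

# Calculations across several float columns work on the block directly.
row_sums = block.sum(axis=1)
print(row_sums)

# A mask from one float column selects whole records, ids included.
keep = block[:, 2] > 6
print(arr[keep]['haloid'])

# Writing through the field view changes the structured array itself.
arr['data'][:, 0] = -1.0
print(arr['data'][:, 0])
```

If you mostly compute with several float columns together, this layout (or a separate 2-D float array) is likely more comfortable than one named field per column.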

User

Answer 2 · edited 2024-09-24 22:22:51

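From your description, I think the simple way is to read only the useful data into arrays with different names (perhaps one array per type?). If you want to read all the data into a single array, pandas may be your choice: http://pandas.pydata.org http://pandas.pydata.org/pandas-docs/stable/ I have not tried it myself, though. Give it a try.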
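For what it is worth, a structured array can be handed to pandas directly. This is only a sketch with made-up values, assuming pandas is installed; whether it suits a table of this size is something you would have to measure.

```python
import numpy as np
import pandas as pd

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8')]
arr = np.zeros((4,), dtype=dt)
arr['haloid'] = [1, 2, 3, 4]
arr['hostid'] = [1, 1, 3, 3]
arr['type'] = [0, 2, 0, 2]
arr['mstar'] = [1.0e9, 2.0e8, 5.0e9, 3.0e7]

# Each field of the structured array becomes one named column.
df = pd.DataFrame(arr)
print(list(df.columns))

# Boolean row selection by column name.
type2 = df[df['type'] == 2]
print(type2['haloid'].tolist())
```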

User

Answer 3 · edited 2024-09-24 22:22:51

Answer to question 11:

Question 11: Can I still use np.resize() like: data_array = np.resize(data_array,(new_size, nr_entries)) to resize/reshape my array?

If I resize the array like this, I create 10 columns for each field in dt. That is how I got the "weird" result of question 8b: a (10x3) structure of haloids.

The correct way to trim the array is:

data_array = data_array[:newsize]

print np.info(data_array)

class:  ndarray
shape:  (42648,)
strides:  (73,)
type: [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'), 
('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), 
('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
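The difference between the two approaches can be seen on a small array. This is a sketch with 8 records standing in for the real data; the dtype is the one printed above.

```python
import numpy as np

dt = [('haloid', '<u8'), ('hostid', '<u8'), ('orphan', 'i1'),
      ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
      ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
nr_entries = len(dt)                 # 10 fields

data_array = np.zeros((8,), dtype=dt)
data_array['haloid'] = np.arange(8)
new_size = 5

# np.resize with a 2-D shape returns a 2-D array of records,
# so every field becomes a (new_size, nr_entries) block.
resized = np.resize(data_array, (new_size, nr_entries))
print(resized.shape, resized['haloid'].shape)

# Slicing keeps the array 1-D: one record per row.
trimmed = data_array[:new_size]
print(trimmed.shape, trimmed['haloid'].shape)

# Bytes per record, which is the stride reported by np.info above.
print(data_array.dtype.itemsize)
```

The slice is a view on the original buffer; if the over-allocated part should be released, taking trimmed.copy() and dropping the original is one option.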
