PYTHON/NUMPY: Working with structured arrays compared to plain numpy arrays (Python 2.7)



I am asking this mainly because I don't fully understand how structured arrays work compared to normal arrays, and I could not find examples online that fit my case. On top of that, I may be filling my structured array incorrectly in the first place.

So here I present the "normal" numpy array version (and what I need to do with it) alongside the new "structured" array version. My (largest) data set contains about 200e6 objects/rows with up to 40-50 properties/columns. All of them have the same data type except for a few special columns: 'haloid', 'hostid', 'type'. Those are ID numbers or flags, and I have to keep them together with the rest of the data because I need them to identify my objects.

Data set name:

<pre><code>data_array: ndarray shape: (42648, 10)
</code></pre>

Data types: <code>dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ...</code>

Reading data from .hdf5 files into the array

Most of the data is stored in hdf5 files (2000 of them correspond to one snapshot that I have to process at once), and these should be read into a single array.

<pre><code>import numpy as np
import h5py as hdf5

mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50
size = 0
size_before = 0
new_size = 0

# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)

i=0
while i<nr_files:
    size_before=new_size
    f = hdf5.File(path, "r")
    size=f[mydict['name0']].size
    new_size+=size
    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a+=1
    i+=1
</code></pre>

EDIT: I edited the code above because hpaulj fortunately commented the following:

<blockquote> First point of confusion. You show a dt definition with names like <code>dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),....</code> But the h5 load is data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
</blockquote>

This was a copy/paste error made while simplifying the code, and I have corrected it!

Question 1: Is this the right way to fill a structured array?

<pre><code>data_array[name][size_before:new_size] = f[name]
</code></pre>

Question 2: How do I address a column in a structured array?

<pre><code>data_array[name] #--> column with a certain name
</code></pre>

Question 3: How do I address an entire row in a structured array?

<pre><code>data_array[0] #--> first row
</code></pre>

Question 4: How do I address 3 rows and all columns?

<pre><code># normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08 1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02 4.92900000e+02 6.50400000e+01]
 [ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06 1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02 5.08720000e+02 -6.74800000e+01]
 [ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10 1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02 2.19530000e+02 -2.28980000e+02]]

# structured array:
print data_array[0:3] #it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
   (12108071103L, 12108071103L, 0, 0.0, ... more data ...
   ... 228.02)
   ... more data ...
   (8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
</code></pre>

Question 5: Why does <code>data_array[0:3]</code> not return just the first 3 rows with 10 columns?

Question 6: How do I address the first two elements of the first column?

<pre><code># normal array:
print data_array[0:2,0]
[ 1.21080866e+10 1.21080711e+10]

# structured array:
print data_array['haloid'][0][0:2]
[12108086595 12108071103]
</code></pre>

OK! I get it!

Question 7: How do I address three specific columns by name, and the first 3 rows of those columns?

<pre><code># normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
 [ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
 [ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]

# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
</code></pre>

OK, the last example seems to work!!!

Question 8: What is the difference between (a) <code>data_array['haloid'][0][0:3]</code> and (b) <code>data_array['haloid'][0:3]</code>? (a) returns the first three haloids, while (b) returns a lot of haloids (10x3).

<pre><code>[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340 9248632230 12108066342 10878169355 10077026070]
 [ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473 12108068704 12108068708 12108077435 12108066338]
 [ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312 12108075900 9248643751 6630111058 12108074389]]
</code></pre>

Question 9: What is actually being returned?

Question 10: How do I mask a structured array using <code>np.where()</code>?

<pre><code># NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid

# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]

#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]
mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
</code></pre>

But I am not sure whether this works!

Question 11: Can I still use <code>np.resize()</code> to resize the array?

Question 12: How do I sort a structured array?

<pre><code># normal array:
data_sorted = data[np.argsort(data[:,col2])]

# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
</code></pre>

Thank you, I appreciate any help or advice!

1 Answer


First point of confusion. You show a <code>dt</code> definition with names like <code>dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),...</code>. But the <code>h5</code> load is

<pre><code>data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)]
</code></pre>

In other words, the file has datasets with names like <code>name0</code>, <code>name1</code>, and you are loading those into an array with fields of the same names.

You can iterate over the field names of the dtype. For example:

<pre><code>In [20]: dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
    ...:       ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'),
    ...:       ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')]
In [21]: arr = np.zeros((3,), dtype=dt)
In [22]: arr
Out[22]:
array([(0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
       (0, 0, 0, 0., 0., 0., 0., 0., 0., 0.),
       (0, 0, 0, 0., 0., 0., 0., 0., 0., 0.)],
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [23]: for name in arr.dtype.names:
    ...:     print(name)
    ...:     arr[name] = 1
    ...:
haloid
hostid
....
In [24]: arr
Out[24]:
array([(1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
       (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.),
       (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)],
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8'), ('z_pos', '<f8'), ('x_vel', '<f8'), ('y_vel', '<f8'), ('z_vel', '<f8')])
In [25]: arr[0] # get one record
Out[25]: (1, 1, 1, 1., 1., 1., 1., 1., 1., 1.)
In [26]: arr[0]['hostid'] # get one field, one record
In [27]: arr['hostid'] # get all values of a field
Out[27]: array([1, 1, 1], dtype=uint64)
In [28]: arr['hostid'][:2] # subset of records
Out[28]: array([1, 1], dtype=uint64)
</code></pre>

So filling the structured array by field name should work:

<pre><code>arr[name][n1:n2] = file[dataset_name]
</code></pre>

Prints like this:

<blockquote> structured array: print data_array[['haloid','type','hostid']][0][0:3] [(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)] </blockquote>

and

<blockquote> [[ (12108086595L, 12108086595L, 0, </blockquote>

suggest to me that the structured <code>data_array</code> is actually 2d, created with something like this (see question 8):

<pre><code>data_array = np.zeros((10, nr_rows), dtype=dt)
</code></pre>

That is the only way the <code>[0][0:3]</code> indexing could work.

For the 2d array:

<pre><code>mask = np.where(data[:,col2] > data[:,col1])
</code></pre>

compares two columns. When in doubt, look at the boolean <code>data[:,col2] > data[:,col1]</code> first. <code>where</code> just returns the indices where that boolean array is true.

A simple example of masked indexing:

<pre><code>In [29]: x = np.array((np.arange(6), np.arange(6)[::-1])).T
In [33]: mask = x[:,0]>x[:,1]
In [34]: mask
Out[34]: array([False, False, False, True, True, True], dtype=bool)
In [35]: idx = np.where(mask)
In [36]: idx
Out[36]: (array([3, 4, 5], dtype=int32),)
In [37]: x[mask,:]
Out[37]:
array([[3, 2],
       [4, 1],
       [5, 0]])
In [38]: x[idx,:]
Out[38]:
array([[[3, 2],
        [4, 1],
        [5, 0]]])
</code></pre>

In the structured example, <code>data['x_pos']</code> selects the field. The <code>[0]</code> is needed to select the first row of that 2d array (the dimension of size 10). The rest of the comparison and the <code>where</code> should work just as they do with the 2d array.

<pre><code>mask = np.where(data['x_pos'][0] > data['y_pos'][0])
</code></pre>

The <code>mask[:][0]</code> applied to the <code>where</code> tuple is probably not needed. <code>mask</code> is a tuple, <code>[:]</code> makes a copy, and <code>[0]</code> selects the first element, which is an array. Sometimes you may need <code>arr[idx[0],:]</code> instead of <code>arr[idx,:]</code>, but not often.

My first comment suggested using separate arrays

<pre><code>dt1 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1')]
data_id = np.zeros((n,), dtype=dt1)
data = np.zeros((n,m), dtype=float) # m float columns
</code></pre>

or even

<pre><code>haloid = np.zeros((n,), '<u8')
hostid = np.zeros((n,), '<u8')
type = np.zeros((n,), 'i1')
</code></pre>

With these arrays, <code>data_array['hostid'][0]</code>, <code>data_id['hostid']</code> and <code>hostid</code> should all return the same 1d array, and be equally usable in the <code>mask</code> expressions.

Sometimes it is convenient to keep the ids together with the data in one structure. That is especially true when writing/reading <code>csv</code> format files. But for masked selection it does not help much, and for calculations across data fields it can be a pain.

I could also suggest a compound dtype, one with

<pre><code>dt2 = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', 'f8', (m,))]

In [41]: np.zeros((4,), dtype=dt2)
Out[41]:
array([(0, 0, 0, [ 0., 0., 0.]),
       (0, 0, 0, [ 0., 0., 0.]),
       (0, 0, 0, [ 0., 0., 0.]),
       (0, 0, 0, [ 0., 0., 0.])],
      dtype=[('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('data', '<f8', (3,))])
In [42]: _['data']
Out[42]:
array([[ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.]])
</code></pre>

Is it better to access the float data by column number, or by names like 'x_coor'? Do you need to do calculations with several float columns at once, or do you always access them individually?
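To tie the points above together, here is a minimal sketch of the 1d structured-array workflow: fill by field name, slice records, mask, and sort. It makes several assumptions that are not in the original post: the `chunks` list is a hypothetical in-memory stand-in for the HDF5 files, all values are synthetic, the field list is shortened, and the "satellite" reading of `type == 2` is only an illustrative guess at what the flag means. It is written for Python 3 and a recent NumPy rather than Python 2.7.

```python
import numpy as np

# Shortened field layout based on the answer's example; values below are synthetic.
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
      ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8')]

# Hypothetical stand-in for two HDF5 files: each maps a dataset name to a 1d array.
chunks = [
    {'haloid': np.array([11, 12, 13], dtype='u8'),
     'hostid': np.array([11, 12, 11], dtype='u8'),
     'type':   np.array([0, 0, 2], dtype='i1'),
     'mstar':  np.array([5.0, 1.0, 9.0]),
     'x_pos':  np.array([1.0, 4.0, 2.0]),
     'y_pos':  np.array([3.0, 2.0, 2.5])},
    {'haloid': np.array([14, 15], dtype='u8'),
     'hostid': np.array([13, 15], dtype='u8'),
     'type':   np.array([2, 0], dtype='i1'),
     'mstar':  np.array([3.0, 7.0]),
     'x_pos':  np.array([0.5, 6.0]),
     'y_pos':  np.array([0.1, 8.0])},
]

nr_rows = sum(len(c['haloid']) for c in chunks)
data = np.zeros((nr_rows,), dtype=dt)    # 1d: one record per object

# Fill field by field, chunk by chunk (questions 1 and 2).
start = 0
for chunk in chunks:
    stop = start + len(chunk['haloid'])
    for name in data.dtype.names:
        data[name][start:stop] = chunk[name]
    start = stop

first_rows = data[0:3]                             # three records, all fields
first_ids = data['haloid'][0:2]                    # two values of one field
subset = data[['haloid', 'type', 'hostid']][0:3]   # three fields, three records

# A boolean mask indexes the records directly; np.where is optional here.
mask = data['x_pos'] > data['y_pos']
selected = data[mask]                              # boolean indexing returns a copy

# Select the field first, then the records, so the assignment writes into `data`.
sat = data['type'] == 2
data['haloid'][sat] = data['hostid'][sat]

# Sort whole records by the values of one field.
data_sorted = data[np.argsort(data['mstar'])]
```

With a 1d structured array there is no extra `[0]` in any of these expressions. `np.sort(data, order='mstar')` is an alternative to the `argsort` line when a sorted copy is all that is needed.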
