PYTHON/NUMPY: Handling structured arrays compared to normal numpy arrays (Python 2.7)



I am asking mainly because I do not fully understand how structured arrays work compared to normal arrays, and I could not find examples online that fit my case. I may also be filling my structured array incorrectly in the first place.

So here I present the "normal" numpy array version (and what I need to do with it) alongside the new "structured" array version. My (largest) dataset contains about 200e6 objects/rows with up to 40-50 properties/columns. They all share the same data type except for a few special columns: "haloid", "hostid", "type". Those are ID numbers or flags, and I have to keep them with the rest of the data because I need them to identify my objects.

Dataset name:
<pre><code>data_array: ndarray shape: (42648, 10)
</code></pre>

Data types:
<pre><code>dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ...]
</code></pre>

Reading data from .hdf5 files into an array

Most of the data is stored in hdf5 files (2000 of them correspond to one snapshot that I have to process at once), and they should be read into a single array.
<pre><code>import numpy as np
import h5py as hdf5

mydict={'name0': 'haloid', 'name1': 'hostid', ...} #dictionary of column names
nr_rows = 200000 # approximated
nr_files = 100 # up to 2200
nr_entries = 10 # up to 50

size = 0
size_before = 0
new_size = 0

# normal array:
data_array=np.zeros((nr_rows, nr_entries), dtype=np.float64)
# structured array:
data_array=np.zeros((nr_rows,), dtype=dt)

i=0
while i<nr_files:
    size_before=new_size
    f = hdf5.File(path, "r")
    size=f[mydict['name0']].size
    new_size+=size

    a=0
    while a<nr_entries:
        name=mydict['name'+str(a)]
        # normal array:
        data_array[size_before:new_size, a] = f[name]
        # structured array:
        data_array[name][size_before:new_size] = f[name]
        a+=1
    i+=1
</code></pre>

EDIT: I edited the code above because hpaulj helpfully commented the following:
<blockquote> First point of confusion. You show a dt definition with names like <code>dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),....</code> But the h5 load is data_array['name'+str(a)][size_before:new_size] = f['name'+str(a)] In other words, the file has datasets with names like name0, name1, and you are downloading those to an array with fields with the same names.
</blockquote>
That was a copy/paste error I made while simplifying the code, and I have corrected it!

Question 1: Is this the right way to fill a structured array?
<pre><code>data_array[name][size_before:new_size] = f[name]
</code></pre>

Question 2: How do I address a column in a structured array?
<pre><code>data_array[name] #--> column with a certain name
</code></pre>

Question 3: How do I address a whole row in a structured array?
<pre><code>data_array[0] #--> first row
</code></pre>

Question 4: How do I address 3 rows and all columns?
<pre><code># normal array:
print data_array[0:3,:]
[[ 1.21080866e+10 1.21080866e+10 0.00000000e+00 5.69363234e+08 1.28992369e+03 1.28894614e+03 1.32171442e+03 -1.08210000e+02 4.92900000e+02 6.50400000e+01]
 [ 1.21080711e+10 1.21080711e+10 0.00000000e+00 4.76329837e+06 1.29058079e+03 1.28741361e+03 1.32358059e+03 -4.23130000e+02 5.08720000e+02 -6.74800000e+01]
 [ 1.21080700e+10 1.21080700e+10 0.00000000e+00 2.22978043e+10 1.28750287e+03 1.28864306e+03 1.32270418e+03 -6.13760000e+02 2.19530000e+02 -2.28980000e+02]]

# structured array:
print data_array[0:3] #it returns a lot of data ...
[[ (12108086595L, 12108086595L, 0, 105676938.02998888, 463686295.4907876,.7144191943337, -108.21, 492.9, 65.04)
 (12108071103L, 12108071103L, 0, 0.0, ... more data ... ... 228.02)
 ... more data ...
 (8394715323L, 8394715323L, 2, 0.0, 823505.2374262045, 0798, 812.0612163877823, -541.61, 544.44, 421.08)]]
</code></pre>

Question 5: Why does <code>data_array[0:3]</code> not return just the first 3 rows with their 10 columns?

Question 6: How do I address the first two elements of the first column?
<pre><code># normal array:
print data_array[0:2,0]
[ 1.21080866e+10 1.21080711e+10]

# structured array:
print data_array['haloid'][0][0:2]
[12108086595 12108071103]
</code></pre>
OK! I got that one!

Question 7: How do I address three specific columns by name, and the first 3 rows of those columns?
<pre><code># normal array:
print data_array[0:3, [0,2,1]]
[[ 1.21080866e+10 0.00000000e+00 1.21080866e+10]
 [ 1.21080711e+10 0.00000000e+00 1.21080711e+10]
 [ 1.21080700e+10 0.00000000e+00 1.21080700e+10]]

# structured array:
print data_array[['haloid','type','hostid']][0][0:3]
[(12108086595L, 0, 12108086595L) (12108071103L, 0, 12108071103L) (12108069992L, 0, 12108069992L)]
</code></pre>
OK, the last example seems to work!

Question 8: What is the difference between (a) <code>data_array['haloid'][0][0:3]</code> and (b) <code>data_array['haloid'][0:3]</code>, where (a) returns the first three haloid values and (b) returns a lot of haloid values (10x3)?
<pre><code>[[12108086595 12108071103 12108069992 12108076356 12108075899 12108066340 9248632230 12108066342 10878169355 10077026070]
 [ 6093565531 10077025463 8046772253 7871669276 5558161476 5558161473 12108068704 12108068708 12108077435 12108066338]
 [ 8739142199 12108069995 12108069994 12108076355 12108092590 12108066312 12108075900 9248643751 6630111058 12108074389]]
</code></pre>

Question 9: What is actually being returned there?

Question 10: How do I mask a structured array using <code>np.where()</code>?
<pre><code># NOTE: col0,1,2 are some integer values of the column I want to address
# col_name0,1,2 are corresponding names e.g. mstar, type, haloid

# normal array
mask = np.where(data[:,col2] > data[:,col1])
data[mask[:][0]]

mask = np.where(data[:,col2]==2)
data[:,col0][[mask[:][0]]]=data[:,col2][[mask[:][0]]]

#structured array
mask = np.where(data['x_pos'][0] > data['y_pos'][0])
data[mask[:][0]]

mask = np.where(data[:,col2]==2)
data['haloid'][:,col0][[mask[:][0]]]=data['hostid'][:,col1][[mask[:][0]]]
</code></pre>
But I am not sure whether this works!

Question 11: Can I still use <code>np.resize()</code> to resize the array?

Question 12: How do I sort a structured array?
<pre><code># normal array:
data_sorted = data[np.argsort(data[:,col2])]

# structured array:
data_sorted = data[np.argsort(data['mstar'][:,col3])]
</code></pre>

Thanks, I appreciate your help and advice!
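As a rough illustration of Questions 1 to 9, here is a minimal sketch of filling and indexing a 1-D structured array. It rests on several assumptions: only the first three dtype fields come from the question, the `mstar` field and all values are made up, and the `chunks` list is a hypothetical stand-in for the HDF5 files (plain dicts of numpy arrays instead of `h5py` datasets). It is written to run under Python 3 as well as 2.7.

```python
import numpy as np

# Assumed dtype: the first three fields appear in the question,
# 'mstar' is added only for illustration.
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'), ('mstar', '<f8')]

# Hypothetical stand-ins for two HDF5 files: dataset name -> 1-D array.
chunks = [
    {'haloid': np.array([101, 102], dtype='u8'),
     'hostid': np.array([101, 101], dtype='u8'),
     'type': np.array([0, 1], dtype='i1'),
     'mstar': np.array([5.5, 1.5])},
    {'haloid': np.array([103, 104, 105], dtype='u8'),
     'hostid': np.array([103, 103, 105], dtype='u8'),
     'type': np.array([0, 2, 0], dtype='i1'),
     'mstar': np.array([4.5, 2.5, 3.5])},
]

nr_rows = 8  # deliberately over-allocated, like the approximated nr_rows
data_array = np.zeros((nr_rows,), dtype=dt)  # 1-D: one record per row

new_size = 0
for chunk in chunks:
    size_before = new_size
    new_size += chunk['haloid'].size
    for name in data_array.dtype.names:
        # Q1: select the field first, then assign into a row slice
        data_array[name][size_before:new_size] = chunk[name]

data_array = data_array[:new_size]  # drop the unused preallocated rows

column = data_array['haloid']              # Q2: one field, shape (5,)
first_row = data_array[0]                  # Q3: a single record
first_three = data_array[0:3]              # Q4/Q5: three records, all fields
first_two_ids = data_array['haloid'][0:2]  # Q6: no extra [0] needed in 1-D
subset = data_array[['haloid', 'type', 'hostid']][0:3]  # Q7: several fields

print(first_three)
print(first_two_ids)
print(subset)
```

With a shape of `(nr_rows,)`, the fields play the role of columns, so `data_array[0:3]` yields exactly three records. If `data_array['haloid'][0:3]` comes back as three rows of ten values, the array was presumably created with a 2-D shape such as `(nr_rows, nr_entries)` together with the structured dtype, so every cell holds a complete record; that would also explain why the extra `[0]` appears to be needed in Questions 6 to 8.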

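For Questions 10 to 12, the following sketch shows one possible way to mask, resize, and sort a 1-D structured array. The dtype beyond the first three fields and all values are assumptions chosen for illustration. Boolean masks are used instead of `np.where()` index tuples; both should select the same rows, the mask is simply shorter to write.

```python
import numpy as np

# Assumed dtype and made-up values, for illustration only.
dt = [('haloid', '<u8'), ('hostid', '<u8'), ('type', 'i1'),
      ('mstar', '<f8'), ('x_pos', '<f8'), ('y_pos', '<f8')]
data = np.zeros(5, dtype=dt)
data['haloid'] = [10, 11, 12, 13, 14]
data['hostid'] = [10, 10, 12, 12, 12]
data['type'] = [0, 2, 0, 2, 1]
data['mstar'] = [5.0, 1.0, 4.0, 2.0, 3.0]
data['x_pos'] = [1.0, 9.0, 3.0, 7.0, 5.0]
data['y_pos'] = [2.0, 8.0, 4.0, 6.0, 5.0]

# Q10: compare two fields and select the matching records (returns a copy)
mask = data['x_pos'] > data['y_pos']
selected = data[mask]

# Q10: conditional assignment. Select the field first, then apply the mask;
# data[is_type2]['haloid'] = ... would write into a temporary copy instead.
is_type2 = data['type'] == 2
data['haloid'][is_type2] = data['hostid'][is_type2]

# Q11: slicing trims the array; np.resize returns a new array of the
# requested length (repeating the data if the new length is larger).
trimmed = data[:3].copy()
resized = np.resize(data, 3)

# Q12: argsort on one field reorders whole records;
# np.sort with order= is an equivalent alternative.
data_sorted = data[np.argsort(data['mstar'])]
data_sorted_alt = np.sort(data, order='mstar')

print(selected['haloid'])
print(data['haloid'])
print(data_sorted['mstar'])
```

Note that a field of a 1-D structured array is itself 1-D, so expressions like `data['mstar'][:,col3]` from the question would only apply if the array had been created with a 2-D shape.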