Python PyTables: parallelizing the processing and creation of an H5 file

Published 2024-05-17 08:20:47

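I need to speed up or parallelize this code. What does it do?
It reads a large text file line by line,
then initializes an H5 file,
then parses the values, holds them in variables, and finally stores them in the H5 file.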

Here is a sample of the huge text file:

4478597 1:0.251805 2:0.219186 3:0.232865 4:0.141475 5:0.160595 7:0.112843 8:0.175104 9:0.124815 11:0.167695 13:0.106979 15:0.217335 17:0.149643 18:0.136473 19:0.181927 20:0.136473 22:0.131012 23:0.167695 24:0.0845032 26:0.167695 29:0.149643 31:0.149643 33:0.249998 38:0.110253 40:0.167695 41:0.157844 42:0.113761 43:0.26752 46:0.142617 47:0.149643 49:0.095709 51:0.167695 53:0.232865 55:0.101021 56:0.106979 63:0.142617 65:0.142617 70:0.126099
4644503 3:0.236699 4:0.125176 9:0.115716 23:0.236699 24:0.119275 49:0.170561 86:0.236699 87:0.224719 88:0.222794 90:0.266532 91:0.254156 93:0.211218 94:0.222794 95:0.201302 96:0.236699 99:0.236699 101:0.254114 102:0.211218 103:0.184922 104:0.236699 106:0.146668 107:0.236699
5570870 4:0.147005 7:0.0801011 9:0.0834675 24:0.108624 91:0.183326 117:0.298427 119:0.348945 120:0.215562 121:0.202898 122:0.156352 124:0.15109 125:0.168409 126:0.15109 128:0.231421 130:0.177332 132:0.348945 134:0.215562 137:0.103672 139:0.175428 141:0.360613 148:0.21717 149:0.162093 150:0.156352 152:1
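
Each line in the sample appears to hold a document id followed by `index:value` pairs, with absent indices simply left out. As a minimal sketch of that reading (the function name `parse_line` is hypothetical and not part of the original code), one line could be split like this using only the standard library:

```python
def parse_line(line):
    """Split one 'doc_id idx:val idx:val ...' line into (doc_id, pairs)."""
    doc_id, *fields = line.split()
    pairs = []
    for field in fields:
        index, value = field.split(":")
        pairs.append((int(index), float(value)))
    return doc_id, pairs


# Shortened version of the second sample line
doc_id, pairs = parse_line("4644503 3:0.236699 4:0.125176 9:0.115716")
print(doc_id, pairs[:2])
```

Plain `str.split` is used here instead of building NumPy string arrays per line, on the assumption that the lines are short enough for array construction to cost more than it saves.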

The code is as follows:

import tables as tb
import numpy as np
from scipy.sparse import rand  # assumed source of rand(..., format='csr') used below

with open('myHugeTextFile', 'r') as data_file:
    # Read every line into memory, stripped of surrounding whitespace
    full_data = [line.strip() for line in data_file]


h5file = tb.open_file("myOutputfile.h5", 'w')
filters = tb.Filters(complevel=5, complib='blosc')
group = h5file.create_group("/", 'data', 'Data Version Alpha')
# `self` presumably refers to an enclosing class instance not shown in this excerpt
a = rand(self.my_data_row_number, self.my_data_col_number, format='csr')
l, m = a.shape
full_matrix_data = h5file.create_carray(group, 'full', tb.Float32Atom(), shape=(l, m), filters=filters, title="My Data")

number_docs = 0
for line in full_data:
    my_line = np.array(line.split())
    id_document = str(my_line[0])
    my_line = np.char.split(my_line[1:], ":")  # public alias of np.core.defchararray.split
    self.matrix_data_index[id_document] = number_docs
    self.matrix_data_index_search_by_value[number_docs] = id_document
    for element in my_line:
        if int(element[0]) in self.ListOfIdsToKeep:
            column = self.index_word_to_keep[str(element[0])]
            full_matrix_data[number_docs, column] = float(element[1])

    number_docs += 1

h5file.close()  # flush pending data and close the file

How can I do this? The file is huge, so this takes a long time. Would MapReduce help, or some other algorithm?
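
One possible direction, offered as a sketch rather than a definitive answer: the loop above writes one cell at a time into a compressed CArray, and each such write likely touches a whole compressed chunk. Building full rows in memory and assigning them in blocks usually reduces that overhead without needing MapReduce. The helper `iter_row_blocks` and the dict `keep_columns` (word id to column, standing in for `self.ListOfIdsToKeep` plus `self.index_word_to_keep`) are hypothetical names, not part of the original code.

```python
def iter_row_blocks(lines, keep_columns, n_cols, block_rows=1000):
    """Yield (start_row, doc_ids, block); block is a list of dense float rows."""
    start, doc_ids, block = 0, [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = [0.0] * n_cols
        for pair in fields[1:]:
            word_id, value = pair.split(":")
            column = keep_columns.get(int(word_id))
            if column is not None:  # keep only the selected word ids
                row[column] = float(value)
        doc_ids.append(fields[0])
        block.append(row)
        if len(block) == block_rows:
            yield start, doc_ids, block
            start += len(block)
            doc_ids, block = [], []
    if block:  # final, partially filled block
        yield start, doc_ids, block


# Assumed usage with the CArray from the question (not executed here):
# with open('myHugeTextFile') as data_file:
#     for start, ids, block in iter_row_blocks(data_file, keep_columns, m):
#         full_matrix_data[start:start + len(block), :] = np.asarray(block, dtype=np.float32)
```

Iterating over the file object directly also avoids loading the whole text file into memory first. If parsing still dominates after this change, the parsing step could in principle be spread over worker processes while a single process performs the writes, but that is an assumption to measure before adopting.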

