Time complexity for a very large list in Python 2.7

Posted 2024-09-30 23:35:44


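I have a list with about 177071007 items, and I am trying to do the following: a) get the first and last occurrence of each unique item in the list; b) get the number of occurrences of each unique item.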

import csv

def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter = '\t')
    pc = list()
    rd = list()
    deltas = list()
    reoccurance = list()
    try:
        for row in ins:
            pc.append(int(row[0]))
            rd.append(int(row[1]))
    except:
        print row
        pass

    unique_pc = set(pc)
    unique_pc = list(unique_pc)
    print "closing file"

    #takes a long time from here!
    for a in range(0, len(unique_pc)):
        index_first_occurance = pc.index(unique_pc[a])
        index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a])
        delta_rd = rd[index_last_occurance] - rd[index_first_occurance]
        deltas.append(int(delta_rd))
        reoccurance.append(pc.count(unique_pc[a]))
        print unique_pc[a] , delta_rd, reoccurance[a]

    print "printing to file"
    map_file =  open(op_file_test,'a')
    for a in range(0, len(unique_pc)):
        print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance[a])
    map_file.close()

However, the complexity is on the order of O(n). Is there a way to make this run fast, or is there some other approach? Unfortunately, I don't have numpy.


3 answers

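While scanning the items in the input file, put them into a collections.defaultdict(list), where the key is the item and the value is the list of indices at which it occurs. Reading the file and building this data structure takes linear time; after that, getting the first or last occurrence index of an item takes constant time, and so does getting its number of occurrences.

Here is how it works: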

import collections

mydict = collections.defaultdict(list)
# itemfilereader is a placeholder for any iterable of (item, index) pairs
for item, index in itemfilereader: # O(n)
    mydict[item].append(index)

# first occurrence of item, O(1)
mydict[item][0]

# last occurrence of item, O(1)
mydict[item][-1]

# number of occurrences of item, O(1)
len(mydict[item])
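
To make this concrete for the pc and rd columns from the question, here is a minimal sketch. The function name summarize and the sample data are made up for illustration. It assumes pc and rd are parallel lists already in memory, and it sticks to syntax that runs on both Python 2.7 and Python 3.

```python
import collections

def summarize(pc, rd):
    """Return {pc_value: (rd_delta, count)} using one pass over pc."""
    positions = collections.defaultdict(list)
    for index, item in enumerate(pc):          # O(n) overall
        positions[item].append(index)

    result = {}
    for item, indexes in positions.items():
        first, last = indexes[0], indexes[-1]  # O(1) each
        result[item] = (rd[last] - rd[first], len(indexes))
    return result

# Hypothetical sample data, not taken from the question.
pc = [10, 8, 10, 7, 8, 10]
rd = [100, 105, 110, 120, 130, 145]
summary = summarize(pc, rd)
for item in sorted(summary):
    print("%d, %d, %d" % ((item,) + summary[item]))
```

The original loop calls pc.index, pc[::-1].index and pc.count once per unique value, and each of those scans (or copies) the whole list, so the total work grows with the list length times the number of unique values. The sketch above touches each row once.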

It may be worth changing the data structure you use. I would use a dict with pc as the key and its occurrences as the value.

lookup = {}
counter = 0
for line in ins:
    # one (row index, rd value) tuple per occurrence of this pc
    values = lookup.setdefault(int(line[0]), [])
    values.append((counter, int(line[1])))
    counter += 1

for key, val in lookup.iteritems():
    value_of_first_occurence = val[0][1]
    value_of_last_occurence = val[-1][1]
    first_occurence = val[0][0]
    last_occurence = val[-1][0]
    value = val[0]                    # (row index, rd) of the first occurrence
    number_of_occurences = len(val)
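
With roughly 177 million rows, keeping one tuple per row may use a lot of memory. If only the first rd, the last rd and the count are needed, a variation is to keep just those three numbers per key. This is a sketch under that assumption, not part of the answer above. The name first_last_count and the sample rows are hypothetical, and the code runs on both Python 2.7 and Python 3.

```python
def first_last_count(rows):
    """rows yields (pc, rd) pairs; returns {pc: [first_rd, last_rd, count]}."""
    stats = {}
    for pc, rd in rows:
        entry = stats.get(pc)
        if entry is None:
            stats[pc] = [rd, rd, 1]   # first sighting: first == last
        else:
            entry[1] = rd             # overwrite last rd seen so far
            entry[2] += 1             # bump the occurrence count
    return stats

# Hypothetical sample rows, not taken from the question.
rows = [(10, 100), (8, 105), (10, 110), (8, 130), (10, 145)]
stats = first_last_count(rows)
for pc in sorted(stats):
    first_rd, last_rd, count = stats[pc]
    print("%d, %d, %d" % (pc, last_rd - first_rd, count))
```

Because rows can be any iterable, the pairs can be fed straight from csv.reader without first building the pc and rd lists.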

Try the following:

import csv
from collections import defaultdict

# Keep a dictionary of our rd and pc values, with the value as a list of the line numbers each occurs on
# e.g. {10: [1, 45, 79]}
pc_elements = defaultdict(list)
rd_elements = defaultdict(list)

with open(file, 'rb') as f:
    line_number = 0
    csvin = csv.reader(f, delimiter='\t')
    for row in csvin:
        try:
            pc_elements[int(row[0])].append(line_number)
            rd_elements[int(row[1])].append(line_number)
            line_number += 1
        except ValueError:
            print("Not a number")
            print(row)
            line_number += 1
            continue

for pc, indexes in pc_elements.iteritems():
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc,
        len(indexes),
        indexes[0],
        indexes[-1]
    ))

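This works by building a dictionary while reading the TSV, with the pc value as the key and a list of occurrences as the value. By the nature of a dict, keys must be unique, so we avoid needing a set, and the list value is only used to hold the rows on which the key appears.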

Example output:

"pc 10 appears 4 times. First on row 4, last on row 101"
"pc 8 appears 3 times. First on row 3, last on row 13"
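
For a run that can be reproduced, here is a small self-contained sketch of the same approach on made-up rows (not the data behind the output above). The function name occurrences and the sample lines are hypothetical. csv.reader accepts any iterable of lines, so a plain list stands in for the file, and the code runs on both Python 2.7 and Python 3.

```python
import csv
from collections import defaultdict

def occurrences(lines):
    """Map each pc value to the list of row numbers it appears on."""
    pc_rows = defaultdict(list)
    for line_number, row in enumerate(csv.reader(lines, delimiter='\t')):
        try:
            pc_rows[int(row[0])].append(line_number)
        except (ValueError, IndexError):
            continue  # skip rows whose first field is missing or not an integer
    return pc_rows

# Hypothetical tab-separated input; the third row is deliberately malformed.
sample = ["8\t1", "10\t2", "oops\t3", "8\t4", "10\t5", "10\t6"]
pc_rows = occurrences(sample)
for pc in sorted(pc_rows):
    rows = pc_rows[pc]
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc, len(rows), rows[0], rows[-1]))
```

Row numbers here start at 0 and still advance past the malformed row, matching how line_number is incremented in the except branch above.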
