在python中，找到大字符串中所有子字符串发生的最快方法是什么

import time import numpy as np rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'} test_d = { '+' : {}, '-' : {}} nts = 'T' s = time.time() for chrom in seq_d: plus_pos, minus_pos = [], [] chrom_seq = seq_d[chrom] for pos, nt in enumerate(chrom_seq): if nt in nts: plus_pos.append(pos) if rev_comps[nt] in nts: minus_pos.append(pos) test_d['+'][chrom] = plus_pos test_d['-'][chrom] = minus_pos e = time.time() print 'The serial version took {} minutes...'.format((e-s)/60)

s = time.time() for chrom in seq_d: chrom_seq = np.array(list(seq_d[chrom])) nts = list(nts) rev_nts = [rev_comps[nt] for nt in nts] plus_pos = list(np.where(np.in1d(chrom_seq, nts) == True)[0]) minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0]) test_d['+'][chrom] = plus_pos test_d['-'][chrom] = minus_pos e = time.time() print 'The numpy version took {} minutes...'.format((e-s)/60)

def getTestHelper(chrom_seq, nts, rev_comp): rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'} if rev_comp: nts = [rev_comps[nt] for nt in nts] else: nts = list(nts) chrom_seq = np.array(list(chrom_seq)) mask = np.in1d(chrom_seq, nts) pos_l = list(np.where(mask == True)[0]) return pos_l s = time.time() pool = Pool(4) plus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=False), seq_d.values()) minus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=True), seq_d.values()) e = time.time() print 'The parallel version took {} minutes...'.format((e-s)/60)

3条回答

网友

1楼 · 编辑于 2024-09-26 18:01:49

使用内置

不要手动迭代您的长字符串，请尝试^{}或^{}。不要自己切片字符串，使用这些方法的内置切片。在

这也抛弃了enumerate-ing，尽管这应该不会很昂贵。在

另外，您可以使用set来存储索引，而不是列表添加可以更快。在

不过，你得做两次才能找到你的核苷酸和它的补体。当然，在循环外查找补码。在

试试正则表达式

您也可以尝试regular expressions来做同样的事情（如果您要尝试这样做，请同时尝试2个regex（用于“T”和“A”）和一个用于“T | A”）。在

而且，不是这样做

for chrom in seq_d:

你可以的

^{pr2}$

这与性能没有多大关系，但可以使代码更具可读性。在

网友

2楼 · 编辑于 2024-09-26 18:01:49

老实说，我认为你做的事情是对的。在

不过，您还可以对代码进行一些调整。一般来说，当性能是关键时，只在最里面的循环中执行最小限度的操作。纵观您的代码，仍有一些快速优化留在这方面：

使用if…elif而不是if…if。在
不要在不需要的地方使用列表-例如，一个字符串就足够了，反之亦然。在
不要对同一个结果多次求值-例如反向查找。在

我猜您的多处理问题归结于这些非常大的字符串的序列化，抵消了并行运行可能带来的任何性能增益。但是，可能还有另一种方法可以做到这一点-请参见Parallelizing a Numpy vector operation。我无法验证，因为我在安装numexpr时遇到困难。在

把它们放在一起，并在本试验中尝试其他一些建议，我得到以下结果：

$ python test.py
Original serial took 0.08330821593602498 minutes...
Using sets took 0.09072601397832235 minutes...
Using built-ins took 0.061421032746632895 minutes...
Using regex took 0.11649663050969442 minutes...
Optimized serial took 0.05909080108006795 minutes...
Original numpy took 0.04050511916478475 minutes...
Optimized numpy took 0.03438538312911987 minutes...

我的代码如下。在

^{pr2}$

网友

3楼 · 编辑于 2024-09-26 18:01:49

查找最长的染色体字符串，并创建一个空数组，其中每个染色体有一行，列一直到字典中最长的一行。然后它将每个染色体放入自己的行中，在那里可以对整个数组调用np.where

import numpy as np

longest_chrom = max([len(x) for x in seq_d.values()])
genome_data = np.zeros((len(seq_d), dtype=str, longest_chrom)
for index, entry in enumerate(seq_d):
    gen_string = list(seq_d[entry]
    genome_data[index, :len(gen_string) + 1] =  gen_string
nuc_search = "T"
nuc_locs = np.where(genome_data == nuc_search)

使用这种方法，对于1个以上核苷酸的字符串，它略有不同，但我确信只需再多几个步骤就可以实现。在

使用内置

试试正则表达式

相关问题更多 >

编程相关推荐

热门问题

热门文章