How can I use a sorted list as a filter?

Published 2024-09-27 00:17:10


I want to get every oneKmer that does not exist in in.txt.
in.txt is sorted by column 0 in the same order in which oneKmer is generated.

This should be doable in O(N) rather than O(N^2), because both lists are in the same order.

How can I write this?

import csv
import itertools

# Slow version: in.txt is reopened and rescanned for every k-mer.
for i in itertools.product('ACTG', repeat=18):
    oneKmer = ''.join(i)
    flag = 1
    with open('in.txt') as tsvfile:
        tsvreader = csv.reader(tsvfile, delimiter=" ")
        for line in tsvreader:
            if line[0] == oneKmer:
                flag = 0
                break
    if flag:
        print(oneKmer)

in.txt

AAAAAAAAAAAAAAAAAA 1400100
AAAAAAAAAAAAAAAAAC 37055
AAAAAAAAAAAAAAAAAT 70686
AAAAAAAAAAAAAAAAAG 192363
AAAAAAAAAAAAAAAACA 20042
AAAAAAAAAAAAAAAACC 12965
AAAAAAAAAAAAAAAACT 10596
AAAAAAAAAAAAAAAACG 1732
AAAAAAAAAAAAAAAATA 16440
AAAAAAAAAAAAAAAATC 18461
...

The whole in.txt file is 38569002592 bytes and contains 1836020688 lines.

The expected result is (4^18 - 1836020688) lines of strings. Of course, I will filter them further later in the script.
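
As a quick check of that arithmetic, here is a small sketch. The variable names are illustrative and not taken from the script above.

```python
# Number of all 18-mers over the 4-letter alphabet ACTG.
total_kmers = 4 ** 18
# Line count of in.txt as stated above.
present = 1836020688
# Number of k-mers expected in the output.
missing = total_kmers - present
print(total_kmers)  # 68719476736
print(missing)      # 66883456048
```

So the output is expected to hold roughly 66.9 billion lines, which is why the per-line cost matters.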


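As a simple example, suppose I want to print the integers < 16 that do not exist in a given sorted list, for example [3,5,6,8,10,11]. The result should be [1,2,4,7,9,12,13,14,15]. The given list is huge, so I want to read it one element at a time. When I read 3, I know I can print 1 and 2. Then I skip 3 and read the next element, 5; now I can print 4 and skip 5.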


2 answers

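First, opening the file many times is slow, so open it once and run the ACTG loop inside that single pass over the file. Second, stdout is slower than you might think, so stop calling print(oneKmer) and write directly to a file instead. Both changes are needed to get as much speed as possible.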
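
A minimal sketch of the second point, assuming the missing k-mers arrive as some iterable of strings. The helper `write_lines` and the output file name `missing.txt` are hypothetical and chosen only for illustration.

```python
import os
import tempfile

def write_lines(items, path):
    """Write each item on its own line to path; return how many were written."""
    count = 0
    # Open the output once; the file object buffers the writes.
    with open(path, "w") as out:
        for item in items:
            out.write(item + "\n")
            count += 1
    return count

# Usage with a hypothetical output file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "missing.txt")
written = write_lines(["AAAC", "AAAT"], out_path)
print(written)
```

In the real script the iterable would be whatever loop produces the missing k-mers, so the output file is opened once, just like the input file.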

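Here are several solutions. All of them walk the full sequence and the subsequence in parallel, taking linear time and constant memory.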

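Using your simple example (each solution below consumes both iterators, so re-create them before trying the next one):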

full = iter(range(1, 16))
skip = iter([3,5,6,8,10,11])

Solution 0 (the one I thought of last, but should have written first):

s = next(skip, None)
for x in full:
    if x == s:
        s = next(skip, None)
    else:
        print(x)

Solution 1:

from heapq import merge
from itertools import groupby

for x, g in groupby(merge(full, skip)):
    if len(list(g)) == 1:
        print(x)
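
To see why Solution 1 works, it may help to look at the merged stream on its own. This is a small sketch on a shortened range (1 to 7, with an assumed skip list [3, 5, 6]): values that occur in both inputs show up twice in a row, so `groupby` produces a group of length 2 for them and a group of length 1 for the values to print.

```python
from heapq import merge

full = range(1, 8)
skip = [3, 5, 6]
# merge() lazily interleaves two already-sorted inputs into one sorted stream.
merged = list(merge(full, skip))
print(merged)  # [1, 2, 3, 3, 4, 5, 5, 6, 6, 7]
```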

Solution 2:

for s in skip:
    for x in iter(full.__next__, s):
        print(x)
for x in full:
    print(x)
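
Solutions 2 and 3 rely on the two-argument form `iter(callable, sentinel)`, which keeps calling the callable until it returns the sentinel. A short sketch on a shortened range, with the sentinel 3 picked only for illustration:

```python
full = iter(range(1, 8))
# Pull values from `full` until the sentinel 3 comes back.
before = list(iter(full.__next__, 3))
# The sentinel itself was consumed but not yielded; the rest is still in `full`.
after = list(full)
print(before)  # [1, 2]
print(after)   # [4, 5, 6, 7]
```

This is why the trailing `for x in full` loop is needed: it prints whatever remains after the last skipped value.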

Solution 3:

from functools import partial

until = partial(iter, full.__next__)
for s in skip:
    for x in until(s):
        print(x)
for x in full:
    print(x)

Solution 4:

from itertools import takewhile

for s in skip:
    for x in takewhile(s.__ne__, full):
        print(x)
for x in full:
    print(x)

Output of all solutions:

1
2
4
7
9
12
13
14
15

Solution 0 applied to the actual problem:

import csv
import itertools

with open('in.txt') as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter=' ')
    skip = next(tsvreader, [None])[0]
    for i in itertools.product('ACTG', repeat=18):
        oneKmer = ''.join(i)
        if oneKmer == skip:
            skip = next(tsvreader, [None])[0]
        else:
            print(oneKmer)

A slight variation:

import csv
from itertools import product
from operator import itemgetter

with open('in.txt') as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter=' ')
    skips = map(itemgetter(0), tsvreader)
    skip = next(skips, None)
    for oneKmer in map(''.join, product('ACTG', repeat=18)):
        if oneKmer == skip:
            skip = next(skips, None)
        else:
            print(oneKmer)
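
Before running this against the 38 GB file, it may be worth checking the logic on a tiny input. The sketch below wraps the same loop in a hypothetical generator `missing_kmers` with an adjustable `k` and feeds it an in-memory sample. It assumes, as the question states, that the input lists its k-mers in the same order that `product('ACTG', ...)` generates them (A, C, T, G, which is not alphabetical). Skipping blank rows is an addition of mine.

```python
import csv
import io
from itertools import product

def missing_kmers(lines, k, alphabet="ACTG"):
    """Yield every k-mer over alphabet that is absent from column 0 of lines.

    Assumes lines lists its k-mers in the order product() generates them.
    """
    reader = csv.reader(lines, delimiter=" ")
    # Column 0 of each non-empty row is the k-mer to skip.
    skips = (row[0] for row in reader if row)
    skip = next(skips, None)
    for kmer in map("".join, product(alphabet, repeat=k)):
        if kmer == skip:
            skip = next(skips, None)
        else:
            yield kmer

# Tiny sample in the same "kmer count" layout as in.txt, with k = 2.
sample = io.StringIO("AA 10\nAT 7\nCC 3\nGG 1\n")
result = list(missing_kmers(sample, k=2))
print(result)
```

With 16 possible 2-mers and 4 of them present in the sample, the result holds the remaining 12.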
