如何更有效地读取DNA序列？

3条回答

网友

1楼 · 编辑于 2024-10-02 10:31:21

我在这里看到的主要问题是你把所有的东西都写进了一个文件。这样做没有意义。您创建的大型输出文件是非常冗余的，并且在您进行分析时将其重新加载是没有帮助的。在

最初加载文件后，您感兴趣的每个窗口都是a[x:x+100]，有些是x。实际上根本不需要显式地生成这些窗口：这样做不会有任何好处。遍历，并直接从a的每个窗口生成这些矩阵。在

如果你真的需要整个东西，把它生成一个numpy数组。另外，如果我不使用任何简并的基代码，用0,1,2,3表示A，C，G，T，将序列转换成uint8s。这有助于加快速度，尤其是如果你需要在任何点上取补码，这可以通过简单地修改位来完成。在

Numpy可以使用stride_tricks非常有效地生成数组，如前所述in this blog post：

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
b = numpy.array([x for x in a], dtype=numpy.character)
rolling_window(b,100)

或者，转换为整数：

^{pr2}$

在我的机器上这个代码比你的快十倍。在

网友

2楼 · 编辑于 2024-10-02 10:31:21

这是一个可以做一些你可能想要的事情的类。在

"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.

Usage: 100pairs [-n <pairs>] [-p] <file>

<file>                 Input file.
-n, numpairs <pairs>  Use <pairs> per iteration. [default: 100]
-p, partial           Allow partial sequences at end of genome.
"""
import docopt

class GeneBuffer:
    def __init__(self, path, bases=100, partial=True):
        self._buf = None
        self.bases = int(bases)
        self.partial = partial
        self.path = path

    def __enter__(self):
        self._file = open(self.path, 'r')
        self._header = next(self._file)
        return self

    def __exit__(self, *args):
        if self._file:
            self._file.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf is None:
            self._buf = ''

        while self._file and len(self._buf) < self.bases:
            try:
                self._buf += next(self._file).strip()
            except StopIteration:
                self._file.close()
                self._file = None
                break

        if len(self._buf) < self.bases:
            if len(self._buf) == 0 or not self.partial:
                raise StopIteration

        result = self._buf[:self.bases]
        self._buf = self._buf[1:]

        return result

def analyze(basepairs):
    """
    Dammit, Jim! I'm a computer programmer, not a geneticist!
    """
    print(basepairs)

def main(args):
    numpairs = args[' numpairs']
    partial = args[' partial']
    with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
        print("Header: ", genome._header)
        count = 0
        for basepairs in genome:
            count += 1
            print(count, end=' ')
            analyze(basepairs)

if __name__ == '__main__':
    args = docopt.docopt(__doc__)
    main(args)

网友

3楼 · 编辑于 2024-10-02 10:31:21

如果它是类似.fasta的文件，它很可能包含多个序列。在
在stackoverflow上读取python中的大文件的例子很多，给出了一些有用的方法here。我通常使用上面给出的方法来回答这个问题（with open(...) file）。它速度快，占用的内存更少。在

似乎你想用固定大小的滑动窗口来处理数据。我会这样做：

def load_fasta(fasta_file_name, sliding_window_size = 100):
  buffer = ''
  with open(fasta_file_name) as f:
    for line in f:
      if line.startswith('>'):
        #skip or get some info from comment line
        buffer = ''
      else:
        #read next line
        buffer += line.strip('\r\n')
        offset = 0 # zero-based offset for current string
        while (offset + sliding_window_size <= len(buffer)):
          next_sliding_window = buffer[offset : offset + sliding_window_size]
          yield(next_sliding_window)
          offset += 1
        buffer = buffer[offset : ]

for str in load_fasta("a.fas.txt", 100):
  #do some processing with sliding window data
  print(str)

如果您真的想处理长度小于100（或者在我的示例中，小于sliding window size）的数据部分，您将不得不稍微修改该函数（在出现新的注释行和处理结束时）。在

您也可以biopython。在

相关问题更多 >

编程相关推荐

热门问题

热门文章