如何计算fasta格式文件中的氨基酸？

def FASTA(filename): try: f = file(filename) except IOError: print "The file, %s, does not exist" % filename return order = [] sequences = {} for line in f: if line.startswith('>'): name = line[1:].rstrip('\n') name = name.replace('_', ' ') order.append(name) sequences[name] = '' else: sequences[name] += line.rstrip('\n').rstrip('*') print "%d sequences found" % len(order) return order, sequences x, y = FASTA("drosoph_b.fasta")

3条回答

网友

1楼 · 编辑于 2024-10-02 02:39:29

正如Katrielexpoints out，^{}非常适合这项任务：

In [1]: from collections import Counter

In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})

您可以将其应用于代码中sequencesdict的值。在

但是，我建议不要在现实生活中使用此代码。像BioPython这样的图书馆做得更好。例如，您展示的代码将生成相当庞大的数据结构。由于FASTA文件有时非常大，可能会耗尽内存。而且，它不能以最好的方式处理可能的异常。在

最后，使用库只会节省您的时间。在

BioPython示例代码：

^{pr2}$

网友

2楼 · 编辑于 2024-10-02 02:39:29

Katrielex在评论中提到的另一个选择是使用另一个字典，代码如下

def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}
  counts = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')
      for aa in sequences[name]:
        if aa in counts:
            counts[aa] = counts[aa] + 1
        else:
            counts[aa] = 1  


  print "%d sequences found" % len(order)
  print counts
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

该输出：

^{pr2}$

网友

3楼 · 编辑于 2024-10-02 02:39:29

这可以通过使用非常简单的bash命令获得，我的答案如下

cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
    MRMRGRRLLPIIL

cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c

输出：

^{pr2}$

fold-w1对每个字符进行拆分，对它们进行排序并计算唯一的字符数

相关问题更多 >

编程相关推荐

热门问题

热门文章