Trouble parsing this file in Python

1 answer

User

#1 · Posted on 2024-09-30 06:24:19

So, suppose you have a file that follows this rough structure:

P1;1bgxt
sequence
MRGMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDAVIVVFDAKAPSFRH*
P1;1xo1a
sequence
     RRNLMIVDGTNLGFRFP       FASSYVSTIQSLAKSYSARTTIVLGDKG-KSVFR*
P1;1bgxt
secondary structure and phi angle
CPPCCCPPPCPCPCCCCCCCCHHHHCCCCPCCCCCCCPCCCCCCCCHHHHHHHHHHCPCCCCCCCCCCCCCCCCC*
P1;1xo1a
secondary structure and phi angle
     CCEEEEEEHHHHHCCCC       CHHHHHHHHHHHHHHCPEEEEEEECCCP-CCHHH*

The "trick" to parsing this file is to read it three lines at a time, because its structure breaks down neatly into the following repeating cycle:

Protein Name
Data Descriptor
Data
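
To make the three-line cycle concrete, here is a minimal sketch of the grouping idea on an in-memory list of lines. It assumes Python 3 (where the function is named zip_longest), and the shortened sequences are made up for illustration rather than taken from the file above.

```python
from itertools import zip_longest  # Python 2 calls this izip_longest

def grouper(iterable, n, fillvalue=None):
    # The list holds n references to the SAME iterator, so each output
    # tuple consumes n consecutive items from the input.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Illustrative, shortened data: two records of three lines each.
lines = [
    "P1;1bgxt", "sequence", "MRGML*",
    "P1;1xo1a", "sequence", "RRNLM*",
]

records = list(grouper(lines, 3))
for name, descriptor, data in records:
    print(name, descriptor, data)
```

Each tuple in records is one (Protein Name, Data Descriptor, Data) cycle. If the input length is not a multiple of 3, the last tuple is padded with fillvalue.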

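I am making these basic assumptions because of what you mentioned in the comments:

The data descriptor is either sequence or something related to structure
All proteins have a sequence and some type of structure
All sequence data comes before the structure data
A file whose line count is not divisible by 6 is a bad file (because each protein needs 6 lines in total)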
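
The last assumption can be checked up front. The following is only a sketch: count_proteins is a hypothetical helper name, and it assumes the file contains no blank or trailing extra lines.

```python
def count_proteins(lines):
    """Return how many proteins the lines describe, or raise ValueError.

    Hypothetical helper: assumes every protein takes exactly 6 lines
    (3 for the sequence record, 3 for the structure record).
    """
    line_count = sum(1 for _ in lines)
    if line_count % 6 != 0:
        raise ValueError(
            "expected a multiple of 6 lines, got %d" % line_count)
    return line_count // 6

# Illustrative, shortened data for a single protein.
sample = [
    "P1;1bgxt", "sequence", "MRGML*",
    "P1;1bgxt", "secondary structure and phi angle", "CPPCC*",
]
print(count_proteins(sample))  # 1
```

Because an open file object iterates line by line, the same function can be called on the result of open('sequences.txt') without loading the whole file into memory.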

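Parsing into a `dict`

First, let me say that the structure you asked for in your original post is probably suboptimal for parsing the data, and that the data you have fits neatly into an associative data structure such as a dict.

Assuming you have a file like the one above, named sequences.txt: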

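try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # renamed in Python 3
import pprint

# This is adapted from the itertools recipes:
# https://docs.python.org/2/library/itertools.html#recipes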
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

proteins = {}
with open('sequences.txt') as datafile:

    # The following line reads the file as groups of 3 lines
    # I do this rather than read the entire file in one go because
    # you might have a lot of data in a file
    line_groups = grouper(datafile, 3)

    for block in line_groups:
        # I am adding variables here to make it clearer
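        protein_name = block[0].strip()
        descriptor = block[1].strip()
        # Remove only the line ending: the leading spaces keep the
        # sequence and structure columns aligned
        data = block[2].rstrip('\r\n')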

        if descriptor.lower() == 'sequence':
            # sequence is found, that means we haven't
            # seen this protein yet
            proteins[protein_name] = [protein_name, data]
        else:
            # This is some type of structure data, so just append to what
            # we've already seen
            try:
                protein = proteins[protein_name]
                protein.append(data)
            except KeyError:
                # Wait, how did we see a structure before a sequence?
                # This file is invalid

                raise # Or handle it however you want - exit, or discard data, etc.

pprint.pprint(proteins)

This outputs a structure like the following:

{'P1;1bgxt': ['P1;1bgxt',
              'MRGMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDAVIVVFDAKAPSFRH*',
              'CPPCCCPPPCPCPCCCCCCCCHHHHCCCCPCCCCCCCPCCCCCCCCHHHHHHHHHHCPCCCCCCCCCCCCCCCCC*'],
 'P1;1xo1a': ['P1;1xo1a',
              '     RRNLMIVDGTNLGFRFP       FASSYVSTIQSLAKSYSARTTIVLGDKG-KSVFR*',
              '     CCEEEEEEHHHHHCCCC       CHHHHHHHHHHHHHHCPEEEEEEECCCP-CCHHH*']}

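If you want a list of lists, just call .values() on proteins (on Python 3, wrap the call in list(), because .values() returns a view there):

>>> list(proteins.values())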

[['P1;1bgxt',
  'MRGMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDAVIVVFDAKAPSFRH*',
  'CPPCCCPPPCPCPCCCCCCCCHHHHCCCCPCCCCCCCPCCCCCCCCHHHHHHHHHHCPCCCCCCCCCCCCCCCCC*'],
 ['P1;1xo1a',
  '     RRNLMIVDGTNLGFRFP       FASSYVSTIQSLAKSYSARTTIVLGDKG-KSVFR*',
  '     CCEEEEEEHHHHHCCCC       CHHHHHHHHHHHHHHCPEEEEEEECCCP-CCHHH*']]

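Parsing directly into a `list` of `list`s

If you don't want to use a dict as in the previous snippet, the work is slightly different, but you only need to change a few lines.

Note that, given a large enough dataset, this version may take more time: every time it needs to find an existing protein in order to append structure data to its sublist, it scans the list (an O(n) operation), whereas looking up a protein in a dictionary is typically O(1). What works for you may vary depending on what you use the data for, but I digress.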

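try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # renamed in Python 3
import pprint

# This is adapted from the itertools recipes:
# https://docs.python.org/2/library/itertools.html#recipes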
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

proteins = []
with open('sequences.txt') as datafile:

    # The following line reads the file as groups of 3 lines
    # I do this rather than read the entire file in one go because
    # you might have a lot of data in a file
    line_groups = grouper(datafile, 3)

    for block in line_groups:
        # I am adding variables here to make it clearer
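        protein_name = block[0].strip()
        descriptor = block[1].strip()
        # Remove only the line ending: the leading spaces keep the
        # sequence and structure columns aligned
        data = block[2].rstrip('\r\n')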

        if descriptor.lower() == 'sequence':
            # sequence is found, that means we haven't
            # seen this protein yet
            proteins.append([protein_name, data])
        else:
            # This is some type of structure data, so just append to what
            # we've already seen

            # find the item in the list that contains this protein
            try:
                protein = next(x for x in proteins if x[0] == protein_name)
                protein.append(data)
            except StopIteration:
                # Hm, we couldn't find a sublist that has this protein name
                raise

pprint.pprint(proteins)

This outputs:

>>> pprint.pprint(proteins)
[['P1;1bgxt',
  'MRGMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDAVIVVFDAKAPSFRH*',
  'CPPCCCPPPCPCPCCCCCCCCHHHHCCCCPCCCCCCCPCCCCCCCCHHHHHHHHHHCPCCCCCCCCCCCCCCCCC*'],
 ['P1;1xo1a',
  '     RRNLMIVDGTNLGFRFP       FASSYVSTIQSLAKSYSARTTIVLGDKG-KSVFR*',
  '     CCEEEEEEHHHHHCCCC       CHHHHHHHHHHHHHHCPEEEEEEECCCP-CCHHH*']]
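
If the linear scan becomes a bottleneck but a list of lists is still the desired result, one possible variation is to keep a side dictionary that points at the same sublists. This is a sketch, not part of the answer's code: parse_to_lists and index are hypothetical names, the sample data is shortened for illustration, and Python 3 is assumed.

```python
def parse_to_lists(lines):
    """Build a list of [name, sequence, structure, ...] sublists."""
    proteins = []  # the result, in order of first appearance
    index = {}     # protein name -> the same sublist stored in `proteins`
    line_iter = iter(lines)
    # Zipping one iterator with itself yields consecutive groups of 3 lines;
    # unlike the grouper recipe, an incomplete final group is dropped.
    for name, descriptor, data in zip(line_iter, line_iter, line_iter):
        name = name.strip()
        data = data.rstrip("\r\n")  # keep leading spaces for alignment
        if descriptor.strip().lower() == "sequence":
            record = [name, data]
            proteins.append(record)
            index[name] = record
        else:
            # O(1) lookup; raises KeyError if a structure record
            # appears before its sequence record
            index[name].append(data)
    return proteins

# Illustrative, shortened data in the same layout as the file.
sample = [
    "P1;1bgxt\n", "sequence\n", "MRGML*\n",
    "P1;1xo1a\n", "sequence\n", "  RRN*\n",
    "P1;1bgxt\n", "secondary structure and phi angle\n", "CPPCC*\n",
    "P1;1xo1a\n", "secondary structure and phi angle\n", "  CCE*\n",
]
result = parse_to_lists(sample)
```

The trade-off is a second container to keep in sync, in exchange for avoiding a scan of the list on every structure record.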

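Note that I don't do anything resembling rigorous error checking, and for the sake of clarity I have tried to use code that is more illustrative than idiomatic. I also haven't tried to store the data in something like a class, or to use a tuple or namedtuple to hold all the related data, because that would require more preprocessing of the file for possibly minimal benefit (tuples are immutable in Python, so I can't just modify a tuple without creating a new one).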
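
One way around the immutability issue is to build the namedtuples only after parsing has finished, when every sublist is complete. A minimal sketch, assuming each protein ends up with exactly one structure string; the Protein type and its field names are my own invention, and the data is shortened for illustration.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative.
Protein = namedtuple("Protein", ["name", "sequence", "structure"])

# Stand-in for the list of lists produced by the parser.
parsed = [
    ["P1;1bgxt", "MRGML*", "CPPCC*"],
    ["P1;1xo1a", "RRNLM*", "CCEEE*"],
]

# Unpacking raises TypeError if a sublist does not have exactly 3 items.
records = [Protein(*fields) for fields in parsed]
print(records[0].sequence)
```

This keeps the mutable lists during parsing and gives named field access afterwards, at the cost of one extra pass over the data.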
