Printing the previous line in Python using enumerate

Posted 2024-10-06 13:33:22


I have a file in the following format:

OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein
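  • I need the start and end coordinates of each operon ID, plus the strand, written to its own file/string. E.g. for operon 1132034, the start coordinate is 2052, the end coordinate is 4997, and the strand is +

So far I think I can probably use enumerate, and I have the following script: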

lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''

for i,line in enumerate(lines):
      if str(operon_id) in line:
            start += line[28:33]
      else:
            end += line[i-1]
            operonline += start
            operonline += end
            operonline += '\n'

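Then, if this kind of script worked, I would edit the string "operonline" so that it contains only the start, end and strand information. Unfortunately it doesn't work, but I hope you can see my logic.

I hope someone can help.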


3 answers

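Here is a possible implementation. parse_file uses the following variables:

  • this_info: a dictionary containing the information relevant to the current line

  • previous_info: this_info from the previous iteration

  • start_info: this_info from the most recent line that starts a new operon ID

The desired output isn't entirely clear, but you can adjust the main program (at the end) to write the extracted fields in whatever form you choose.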

def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f)  # ignore first line
        for line in f:
            bits = line.split()

            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}

            if not previous_info:
                # first line of file
                start_info = this_info

            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,  
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID

            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")

            previous_info = this_info  # ready for next iteration

        if previous_info:  # guard against a file with no data lines
            _add_result(results, start_info, previous_info)  # last ID

    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])

This gives:

1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
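
The same bookkeeping can also be written with enumerate, which is what the question was reaching for. The following is only a sketch under stated assumptions: the function name operon_ranges is hypothetical, and it assumes the first six columns are separated by whitespace. The index i is used to look back at rows[i - 1] whenever the operon ID changes.

```python
def operon_ranges(lines):
    """Return (operon_id, start, end, strand) tuples, one per run of equal IDs.

    Hypothetical helper: assumes lines[0] is the header and that the first
    six columns of every data line are whitespace-separated.
    """
    rows = [line.split() for line in lines[1:] if line.strip()]
    ranges = []
    start_row = None
    for i, row in enumerate(rows):
        if i == 0:
            start_row = row
        elif row[0] != rows[i - 1][0]:
            # ID changed: the previous row closes the previous operon
            prev = rows[i - 1]
            ranges.append((start_row[0], int(start_row[3]), int(prev[4]), start_row[5]))
            start_row = row
    if rows:
        # the last operon ends on the last data row
        ranges.append((start_row[0], int(start_row[3]), int(rows[-1][4]), start_row[5]))
    return ranges


sample = [
    "OperonID GI Synonym Start End Strand Length COG_number Product",
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta",
    "1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein",
    "1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein",
    "1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein",
    "1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein",
]
for operon in operon_ranges(sample):
    print(*operon)
```

Looking back with rows[i - 1] only works because the rows are held in a list; when iterating directly over a file object, keeping a previous_info variable as in the answer above is the safer choice.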

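Maybe try this kind of logic? It just keeps a temporary variable that tracks the last operon ID you saw, and switches the start/end once the ID changes: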

In [21]: lines = open("test.csv").read().splitlines()

In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
 '1132034,397671780,RVBD_0002,2052,3260,+,402',
 '1132034,397671781,RVBD_0003,3280,4437,+,385',
 '1132034,397671782,RVBD_0004,4434,4997,+,187',
 '1132035,397671783,RVBD_0005,5123,7267,+,714',
 '1132035,397671784,RVBD_0006,7302,9818,+,838',
 '1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
 '1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
 '1132037,397671792,RVBD_0012,14089,14877,+,262']

In [23]: cur_operonid = ''

In [24]: cur_end = None

In [27]: cur_start = None
    ...: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited input
    ...:     if cur_operonid != cols[0]:  # new OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     cur_end = cols[4]  # update on every row so single-row operons work too
    ...: if cur_start is not None:  # report the last operon after the loop
    ...:     print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
1132037 went from 14089 to 14877
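
Tracking "the last ID seen" by hand is what itertools.groupby from the standard library does for consecutive runs, so the loop can be sketched more compactly with it. This is a swapped-in technique, not the code from the answer above; the function name summarize and the sep parameter are assumptions made for illustration.

```python
from itertools import groupby


def summarize(lines, sep=","):
    """Yield (operon_id, start, end) for each consecutive run of one OperonID.

    Hypothetical helper: assumes lines[0] is a header and that rows belonging
    to the same operon are adjacent, as in the sample data.
    """
    rows = (line.split(sep) for line in lines[1:])  # skip the header line
    for operon_id, group in groupby(rows, key=lambda cols: cols[0]):
        group = list(group)
        # first row supplies the start, last row supplies the end
        yield operon_id, group[0][3], group[-1][4]


lines = [
    "OperonID,GI,Synonym,Start,End,Strand,Length",
    "1132034,397671780,RVBD_0002,2052,3260,+,402",
    "1132034,397671781,RVBD_0003,3280,4437,+,385",
    "1132034,397671782,RVBD_0004,4434,4997,+,187",
    "1132035,397671783,RVBD_0005,5123,7267,+,714",
    "1132035,397671784,RVBD_0006,7302,9818,+,838",
    "1132037,397671792,RVBD_0012,14089,14877,+,262",
]
for operon_id, start, end in summarize(lines):
    print(f"{operon_id} went from {start} to {end}")
```

Because groupby emits every run, including the final one, no extra step is needed after the loop. The coordinates stay strings here; wrap them in int() if arithmetic is needed.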

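If you use pandas, and are willing to go that route, this is quite easy.

I was able to read your data into a pandas DataFrame, and then dropped the other columns: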

   Start    End Strand OperonID
0   2052   3260      +  1132034
1   3280   4437      +  1132034
2   4434   4997      +  1132034
3   5123   7267      +  1132035
4   7302   9818      +  1132035
5  11421  11528      -  1132036
6  11555  11692      -  1132036
7  14089  14877      +  1132037
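
The loading step itself is not shown. One way it might look, assuming the file is tab-delimited with the header from the question (the delimiter is an assumption, since the spacing in the post is ambiguous), is sketched below. An in-memory string stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real file; assumes tab-separated columns.
text = (
    "OperonID\tGI\tSynonym\tStart\tEnd\tStrand\tLength\tCOG_number\tProduct\n"
    "1132034\t397671780\tRVBD_0002\t2052\t3260\t+\t402\t-\tDNA polymerase III subunit beta\n"
    "1132034\t397671782\tRVBD_0004\t4434\t4997\t+\t187\t-\thypothetical protein\n"
    "1132036\t397671786\tRVBD_0007Ac\t11421\t11528\t-\t35\t-\thypothetical protein\n"
)

df = pd.read_csv(
    io.StringIO(text),  # replace with the path to the real file
    sep="\t",
    usecols=["OperonID", "Start", "End", "Strand"],  # drop the other columns
)
df = df[["Start", "End", "Strand", "OperonID"]]  # column order used above
print(df)
```

Splitting on arbitrary whitespace instead would not work directly, because the Product column itself contains spaces.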

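Then I grouped by OperonID, stored the Start, End and Strand values as lists, and created a new column holding the first Start and the last End per OperonID, together with the set of unique Strand values. You can reorganize it as needed.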

df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)

df2['result']:

OperonID
1132034      (2052, 4997, {+})
1132035      (5123, 9818, {+})
1132036    (11421, 11692, {-})
1132037    (14089, 14877, {+})
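
If separate columns are preferred over a tuple, the same summary can probably be obtained with pandas named aggregation. This is an alternative sketch, not the code from the answer above; the variable name summary is hypothetical, and taking the "first" Strand assumes the strand is consistent within each operon.

```python
import pandas as pd

# Same data as the DataFrame shown above, rebuilt inline so the sketch runs alone.
df = pd.DataFrame({
    "Start": [2052, 3280, 4434, 5123, 7302, 11421, 11555, 14089],
    "End": [3260, 4437, 4997, 7267, 9818, 11528, 11692, 14877],
    "Strand": ["+", "+", "+", "+", "+", "-", "-", "+"],
    "OperonID": [1132034] * 3 + [1132035] * 2 + [1132036] * 2 + [1132037],
})

summary = df.groupby("OperonID").agg(
    Start=("Start", "first"),    # start of the first gene in the operon
    End=("End", "last"),         # end of the last gene in the operon
    Strand=("Strand", "first"),  # assumes one strand per operon
)
print(summary)
```

The set-based version above has the advantage of exposing an operon whose rows disagree on strand, which "first" would silently hide.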
