Extracting the FASTA sequence flanking an amino acid

Input

I have two inputs: a FASTA file and a pandas DataFrame.

The FASTA file looks like this:

>sp|P00001| some text here 1
MKLLILTCLVAVALARPKHPIKKVSPTFDTNMVGKHQGLPQEVLNENLLRFFVAPFPEVFGKEKVSLDAGPGMCSRNE
>sp|P00002| some text here 2
MSSGNAKIGHPAPNFKATAVMPDGQFKDISLSDYKGKYVVFFFYPLDFTFVCPTGLGRSSYRATSCLPALCLP
>sp|P00003| some text here 3
MSVLDSGNFSWKMTEACMKVKIPLVKKKSLRQNLIENGKLKEFMRTHKYNLGSKYIREAATLVSEQPLQN

My second input is a pandas DataFrame with two columns, 'ProteinID' and 'Phosphopeptide'.

Output

My output should be a new column written to the DataFrame, like this:

ProteinID -- Phosphopeptide -- NewColumn
P00001 -- KVSPT*FDTNMVGK -- IKKVSPTFDTNMV
P00001 -- SLDAGPGMCS*R -- AGPGMCSRNE
P00003 -- LDS*GNFSWKMTEACMK -- MSVLDSGNFSWK

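Note that the peptides in the last two rows lie at the end or the start of their respective proteins, so in those cases we cannot (and do not need to) extract the full 12 flanking amino acids, 6 on each side of the residue marked with '*'.

I am having a hard time writing this program (I have very little programming experience) and would greatly appreciate any help (suggestions, hints, functions, etc.).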

2 Answers

User

#1 · edited 2024-09-29 19:34:36

Here is a function that extracts the relevant substring:

def flank(seq, pp):
    # 1: find the position of the AA preceding the '*' marker in the
    # phosphopeptide
    marked_pos = pp.find('*') - 1
    if (marked_pos < 0):
        raise ValueError("invalid phosphopeptide string")

    # 2: find the phosphopeptide (without '*') in the sequence
    pp_pos = seq.find(pp.replace('*', ''))
    if pp_pos == -1:
        raise ValueError("phosphopeptide not found in the sequence")

    # avoid a negative starting index
    start = max(0, pp_pos + marked_pos - 6)

    # 3: use slicing to produce the result
    return seq[start : pp_pos + marked_pos + 7]

Example: calling flank() with the P00001 sequence from the question and the phosphopeptide 'KVSPT*FDTNMVGK' prints:

IKKVSPTFDTNMV
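To fill the whole NewColumn, the function can be applied to every (ProteinID, Phosphopeptide) pair once the FASTA records are in a dict. The following is only a sketch under some assumptions: `parse_fasta` and the `width` parameter are hypothetical additions (not part of the answer above), the accession is taken to be the field between the first two '|' characters, and the FASTA text is held in a string rather than read from a file.

```python
def parse_fasta(text):
    """Map each accession (field between the first two '|') to its sequence."""
    sequences = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            accession = line.split('|')[1]
            sequences[accession] = ''
        elif accession is not None:
            # Sequences may be wrapped over several lines; join them.
            sequences[accession] += line
    return sequences


def flank(seq, pp, width=6):
    """Return the marked residue plus up to `width` residues on each side."""
    marked_pos = pp.find('*') - 1
    if marked_pos < 0:
        raise ValueError("invalid phosphopeptide string")
    pp_pos = seq.find(pp.replace('*', ''))
    if pp_pos == -1:
        raise ValueError("phosphopeptide not found in the sequence")
    site = pp_pos + marked_pos
    # max() keeps the start index non-negative; slicing past the end is safe.
    return seq[max(0, site - width): site + width + 1]


FASTA_TEXT = """\
>sp|P00001| some text here 1
MKLLILTCLVAVALARPKHPIKKVSPTFDTNMVGKHQGLPQEVLNENLLRFFVAPFPEVFGKEKVSLDAGPGMCSRNE
>sp|P00003| some text here 3
MSVLDSGNFSWKMTEACMKVKIPLVKKKSLRQNLIENGKLKEFMRTHKYNLGSKYIREAATLVSEQPLQN
"""

rows = [
    ('P00001', 'KVSPT*FDTNMVGK'),
    ('P00001', 'SLDAGPGMCS*R'),
    ('P00003', 'LDS*GNFSWKMTEACMK'),
]

sequences = parse_fasta(FASTA_TEXT)
new_column = [flank(sequences[pid], pep) for pid, pep in rows]
for (pid, pep), window in zip(rows, new_column):
    print(pid, pep, window)
```

With a pandas DataFrame `df` that has the two columns from the question, the same call could probably be written as `df['NewColumn'] = df.apply(lambda r: flank(sequences[r['ProteinID']], r['Phosphopeptide']), axis=1)`.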

User

#2 · edited 2024-09-29 19:34:36

Hi, please check this out. My FASTA file is named 'txt':

Code snippet:

#!/usr/bin/env python3
import re

# (ProteinID, Phosphopeptide) pairs; '*' follows the phosphorylated residue
protein_dict = [
    ('P00001', 'KVSPT*FDTNMVGK'),
    ('P00001', 'SLDAGPGMCS*R'),
    ('P00003', 'LDS*GNFSWKMTEACMK')
    ]


def prepare_structure_from_fasta(file):
    # Map each protein ID (second '|'-separated header field) to its sequence
    fasta_structure = dict()
    protein_id = None
    with open(file, 'r') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                protein_id = line.split('|')[1]
                fasta_structure[protein_id] = ''
            else:
                if not protein_id:
                    raise Exception("Wrong fasta file structure")
                # append, so sequences wrapped over several lines stay whole
                fasta_structure[protein_id] += line
    return fasta_structure


def match(pattern, string):
    matc = re.search(pattern, string)
    if matc:
        return matc.groups()[0]
    return None

fasta_struct = prepare_structure_from_fasta('txt')
final_struct = []

for pro_d in protein_dict:

    pro_id = pro_d[0]
    pep_id = pro_d[1]
    # first ends with the marked residue; second is everything after it
    first, second = pep_id.split('*')

    # keep the marked residue plus 6 residues before it; if the peptide is
    # too short, let the regex take the missing ones from the protein
    if len(first) <= 6:
        f_count = 7 - len(first)
    else:
        first = first[len(first) - 7:]
        f_count = 0
    # keep 6 residues after the marked one
    if len(second) < 6:
        s_count = 6 - len(second)
    else:
        second = second[0:6]
        s_count = 0

    _regex = '([A-Z]{0,%d}%s%s[A-Z]{0,%d})' % (f_count,first,second,s_count)
    final_struct.append((pro_id, pep_id, match(_regex, fasta_struct[pro_id])))

for pro in final_struct:
    print(pro)

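Output (for the three FASTA records shown in the question):

('P00001', 'KVSPT*FDTNMVGK', 'IKKVSPTFDTNMV')
('P00001', 'SLDAGPGMCS*R', 'AGPGMCSRNE')
('P00003', 'LDS*GNFSWKMTEACMK', 'MSVLDSGNFSWK')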
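The regular-expression idea in this answer can also be wrapped in a single function, which makes the role of the two `{0,n}` quantifiers easier to see: they only take from the protein what the peptide itself does not supply, and they match fewer residues when the protein ends. This is a sketch rather than the answerer's code; `flank_regex`, `width`, and the use of `re.escape` are assumptions added for illustration, and it assumes each peptide contains exactly one '*'.

```python
import re


def flank_regex(seq, pp, width=6):
    """Regex variant: marked residue plus up to `width` residues per side."""
    before, after = pp.split('*')        # assumes exactly one '*'
    before = before[-(width + 1):]       # marked residue and up to `width` before it
    after = after[:width]                # up to `width` residues after it
    pad_before = width + 1 - len(before) # residues still missing on the left
    pad_after = width - len(after)       # residues still missing on the right
    pattern = '([A-Z]{0,%d}%s%s[A-Z]{0,%d})' % (
        pad_before, re.escape(before), re.escape(after), pad_after)
    found = re.search(pattern, seq)
    return found.group(1) if found else None


P00001 = ('MKLLILTCLVAVALARPKHPIKKVSPTFDTNMVGKHQGLPQEVLNENLLRFFVAPFPEVFGKEKVSL'
          'DAGPGMCSRNE')
P00003 = ('MSVLDSGNFSWKMTEACMKVKIPLVKKKSLRQNLIENGKLKEFMRTHKYNLGSKYIREAATLVSEQ'
          'PLQN')

print(flank_regex(P00001, 'KVSPT*FDTNMVGK'))
print(flank_regex(P00001, 'SLDAGPGMCS*R'))
print(flank_regex(P00003, 'LDS*GNFSWKMTEACMK'))
```

One caveat of this variant: because the peptide is trimmed before searching, the pattern could match a different occurrence of the trimmed motif if it appears more than once in the protein, whereas searching for the full peptide first (as in answer #1) avoids that.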
