Python string splitting with multiple split points

Published 2024-09-30 00:23:00


OK, I'll get straight to the point. Here is my code:

import re

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input to this function is a list of several strings (seqs), for example:

["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]

and the enzymes as input:

[["TC", 1],["GC",1]]

(Note: there can be several entries, but most are written with the letters A, T, C and G.)

The function should return a list which, in this example, contains two lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]

Right now I'm having trouble splitting twice and getting the correct output.

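Some more information about the function: it goes through the string (seq) looking for recognition sites, in this case TC or GC, and splits at the position given by the second element of the enzyme entry. It should apply both enzymes to both strings in the list.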
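To make the rule concrete, here is a minimal sketch of how the cut positions follow from it: each match start plus the enzyme's offset gives one cut index. The helper name `cut_positions` is hypothetical, and `re.escape` is added on the assumption that recognition sites are literal strings rather than regex patterns.

```python
import re

def cut_positions(seq, enzymes):
    """Return the sorted cut indices for every enzyme site found in seq."""
    cuts = set()
    for site, offset in enzymes:
        # match.start() is where the site begins; offset is where to cut inside it
        for match in re.finditer(re.escape(site), seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)

print(cut_positions("AATTCCGGTCGGGGCTCGGGGG", [["TC", 1], ["GC", 1]]))
# [4, 9, 14, 16]
```

Slicing the sequence at those four indices gives the five fragments listed in the expected output for the first string.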


3 Answers

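Here is an approach using regex that should work. In this solution, I find all occurrences of the enzyme strings and split using their corresponding indices.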

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes)  # maps each enzyme string to its cut index
    enzstr = '|'.join(enz[0] for enz in enzymes)  # "TC|GC" in this case

    for seq in seqs:
        sub = []
        pos1 = 0  # start of the current fragment
        for match in re.finditer('(' + enzstr + ')', seq):
            index = dic[match.group(0)]
            pos2 = match.start() + index  # cut position inside the match
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])  # remainder after the last cut
        out.append(sub)
    # For the example input this returns:
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
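A possible refinement, not part of the original answer: `re.finditer` reports only non-overlapping matches, so with sites that overlap (say a hypothetical pair TC and CG meeting in TCG) the second site would be skipped. The sketch below assumes every site should be cut. It wraps each site in a zero-width lookahead so overlapping sites are all found, and escapes it with `re.escape` in case it contains regex metacharacters. The name `digest_overlapping` is made up for illustration.

```python
import re

def digest_overlapping(seqs, enzymes):
    """Split each sequence at every enzyme site, including overlapping sites."""
    out = []
    for seq in seqs:
        cuts = set()
        for site, offset in enzymes:
            # A zero-width lookahead can match at every start position,
            # so overlapping occurrences are all reported.
            for match in re.finditer("(?=" + re.escape(site) + ")", seq):
                cuts.add(match.start() + offset)
        # Skip cuts at the very ends, which would only create empty fragments.
        inner = sorted(c for c in cuts if 0 < c < len(seq))
        bounds = [0] + inner + [len(seq)]
        out.append([seq[a:b] for a, b in zip(bounds, bounds[1:])])
    return out

print(digest_overlapping(["TCG"], [["TC", 1], ["CG", 1]]))
# [['T', 'C', 'G']]
```

On the example input from the question, where no sites overlap, this sketch produces the same fragments as the function above.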

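Assuming the idea is to split at each enzyme, at the index point within the multi-letter enzyme, so that the split essentially falls between two letters, no regular expressions are needed.

You can do this by finding the matches and inserting a split marker at the correct index, then post-processing the result to do the actual split.

For example: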

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        # pair each site with a copy that has '|' inserted at the cut index
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])   # So AATTC becomes AATT|C
        result.append(seq.split('|'))       # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))

My solution is:

Replace TC with T C and GC with G C (this is done according to the given index), then split on the space character.

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            # insert a space at the cut index inside every occurrence of the site
            li = li.replace(en[0], en[0][:en[1]] + " " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The result is:

For [["TC", 1],["GC",1]]:

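['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]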

For [["AAT", 2],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
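One difference between the two marker-based answers is worth a small sketch: `str.split('|')` keeps empty strings, while `str.split()` with no argument splits on whitespace and drops them. This only matters when a cut lands at the very start or end of a sequence. The enzyme `["TC", 2]` and the helper name `mark_and_split` below are hypothetical, chosen only to trigger that case.

```python
def mark_and_split(seq, enzymes, marker):
    """Insert marker at each cut point, then split on it."""
    for site, offset in enzymes:
        seq = seq.replace(site, site[:offset] + marker + site[offset:])
    # split() with no argument drops empty strings; split(marker) keeps them.
    return seq.split() if marker == " " else seq.split(marker)

enzymes = [["TC", 2]]  # hypothetical: the cut falls right after the site
print(mark_and_split("AATC", enzymes, "|"))  # ['AATC', '']
print(mark_and_split("AATC", enzymes, " "))  # ['AATC']
```

Which behavior is preferable depends on whether an empty fragment at the boundary carries meaning for the caller.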
