Generate all permutations and combinations and substitute them into a string

Posted 2024-10-02 16:30:24


I have a dataframe column containing strings such as: 'TCCTGTAAATCAAAGGCCAAGRG', 'GNGCNCCNGAYATRGCNTTYCC', 'GATTTCTCTYCCTGTTCTTGCA'

I have a mapping from ambiguous letters to the letters they can stand for:

SNPs={}
SNPs["Y"] = ['C', 'T']
SNPs["R"] = ['A', 'G']
SNPs["N"] = ['C', 'G', 'A', 'T']

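Each R needs to be replaced by A or G, each Y by C or T, and so on.

For example, TCCTGTAAATCAAAGGCCAAGRG becomes TCCTGTAAATCAAAGGCCAAGAG and TCCTGTAAATCAAAGGCCAAGGG.

I want every possible combination of replacements, with the results stored in another column.

Please help me do this.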

import re, itertools

text = "GNGCNCCNGAYATRGCNTTYCC"

def getList(dict):
    return list(dict.keys())
lsources = getList(SNPs)

ldests = []
for source in lsources:
    ldests.append(SNPs[source])
    #print(ldests)

# Generate the various pairings
for lproduct in itertools.product(*ldests):
    #print(lproduct)
    for i in text:
        output = i        
        for src, dest in zip(lsources, lproduct):
        # Replace each term (you could optimise this using a single re.sub)
            output = output.replace("%s" % src, dest)
            print(output)

This is my code, but I am not getting the output I want.


1 answer
User
#1 · Posted 2024-10-02 16:30:24

Try this:

>>> import itertools
>>> text = "GNGCNCCNGAYATRGCNTTYCC"
>>> SNPs={ "Y" : ['C', 'T'] , "R" : ['A', 'G'] , "N" : ['C', 'G', 'A', 'T']}

>>> text_tmp = ""
>>> dct = {}
>>> for idx, v in enumerate(text):
...    if v in SNPs:
...        dct[idx] = SNPs.get(v)
...        text_tmp += f'_{idx}_'
...    else:
...        text_tmp += v

>>> text_tmp  
'G_1_GC_4_CC_7_GA_10_AT_13_GC_16_TT_19_CC'

>>> dct
{1: ['C', 'G', 'A', 'T'],
 4: ['C', 'G', 'A', 'T'],
 7: ['C', 'G', 'A', 'T'],
 10: ['C', 'T'],
 13: ['A', 'G'],
 16: ['C', 'G', 'A', 'T'],
 19: ['C', 'T']}

>>> per_val = list(itertools.product(*dct.values()))
>>> per_key_val = list(map(dict,[zip(dct.keys(), p) for p in per_val]))
>>> per_key_val
[{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'C', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'C', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'G', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'G', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'A', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'A', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'T', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'T', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'C', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'C', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'G', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'G', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'A', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'A', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'T', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'T', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'C', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'C', 19: 'T'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'G', 19: 'C'},
 {1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'G', 19: 'T'},
 ...
]
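
The list above is truncated because it grows quickly: every ambiguous position multiplies the number of results by its number of alternatives. As a rough sketch, the total can be checked before expanding anything. `count_variants` is a hypothetical helper name, not part of the original answer, and `math.prod` assumes Python 3.8 or later.

```python
import math

# Same ambiguity table as in the question.
SNPs = {"Y": ["C", "T"], "R": ["A", "G"], "N": ["C", "G", "A", "T"]}

def count_variants(text, table=SNPs):
    """Hypothetical helper: number of strings the expansion of `text` yields."""
    # An ambiguous letter contributes len(alternatives); any other character
    # falls back to itself (a 1-character string), contributing a factor of 1.
    return math.prod(len(table.get(ch, ch)) for ch in text)

# Four N (4 choices each), two Y and one R (2 choices each).
print(count_variants("GNGCNCCNGAYATRGCNTTYCC"))  # 4**4 * 2**2 * 2 = 2048
```

For this string that means 2048 dictionaries in `per_key_val` and 2048 strings in the final result.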

>>> out = []
>>> for pkl in per_key_val:
...    tmp = text_tmp
...    for k,v in pkl.items():
...        tmp = tmp.replace(f'_{k}_', v)
...    out.append(tmp)

>>> out
['GCGCCCCCGACATAGCCTTCCC',
 'GCGCCCCCGACATAGCCTTTCC',
 'GCGCCCCCGACATAGCGTTCCC',
 'GCGCCCCCGACATAGCGTTTCC',
 'GCGCCCCCGACATAGCATTCCC',
 'GCGCCCCCGACATAGCATTTCC',
 'GCGCCCCCGACATAGCTTTCCC',
 'GCGCCCCCGACATAGCTTTTCC',
 'GCGCCCCCGACATGGCCTTCCC',
 'GCGCCCCCGACATGGCCTTTCC',
 'GCGCCCCCGACATGGCGTTCCC',
 'GCGCCCCCGACATGGCGTTTCC',
 'GCGCCCCCGACATGGCATTCCC',
 'GCGCCCCCGACATGGCATTTCC',
 'GCGCCCCCGACATGGCTTTCCC',
 'GCGCCCCCGACATGGCTTTTCC',
 'GCGCCCCCGATATAGCCTTCCC',
 'GCGCCCCCGATATAGCCTTTCC',
 'GCGCCCCCGATATAGCGTTCCC',
 'GCGCCCCCGATATAGCGTTTCC',
 'GCGCCCCCGATATAGCATTCCC',
 'GCGCCCCCGATATAGCATTTCC',
 'GCGCCCCCGATATAGCTTTCCC',
 ...
]
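
The placeholders are needed because each ambiguous position has to vary independently, whereas calling `str.replace` on the letter itself changes every occurrence of that letter at once. A possible shorter variant, offered only as a sketch, skips the placeholders and builds one list of choices per character. `expand` is a hypothetical name and not part of the original answer.

```python
import itertools

# Same ambiguity table as in the question.
SNPs = {"Y": ["C", "T"], "R": ["A", "G"], "N": ["C", "G", "A", "T"]}

def expand(text, table=SNPs):
    """Hypothetical helper: every string obtained by resolving ambiguous letters."""
    # One list of choices per position: the alternatives for an ambiguous
    # letter, or a single-element list holding the letter itself.
    choices = [table.get(ch, [ch]) for ch in text]
    # product() picks one choice per position, varying the last position fastest.
    return ["".join(combo) for combo in itertools.product(*choices)]

print(expand("TCCTGTAAATCAAAGGCCAAGRG"))
# ['TCCTGTAAATCAAAGGCCAAGAG', 'TCCTGTAAATCAAAGGCCAAGGG']
```

Because unambiguous positions have exactly one choice, the results should come out in the same order as with the placeholder approach. It also does not depend on the input being free of underscores or digits.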

Update (running it on a DataFrame):

import itertools
import pandas as pd

def rplc_per(text):
    SNPs={ "Y" : ['C', 'T'] , "R" : ['A', 'G'] , "N" : ['C', 'G', 'A', 'T']}
    text_tmp = ""
    dct = {}
    for idx, v in enumerate(text):
        if v in SNPs:
            dct[idx] = SNPs.get(v)
            text_tmp += f'_{idx}_'
        else:
            text_tmp += v  
    per_val = list(itertools.product(*dct.values()))
    per_key_val = list(map(dict,[zip(dct.keys(), p) for p in per_val]))
    out = []
    for pkl in per_key_val:
        tmp = text_tmp
        for k,v in pkl.items():
            tmp = tmp.replace(f'_{k}_', v)
        out.append(tmp)
    return out

df = pd.DataFrame({'String': ['TCCTGTAAATCAAAGGCCAAGRG', 'GNGCNCCNGAYATRGCNTTYCC', 'GATTTCTCTYCCTGTTCTTGCA']})
df['all_per'] = df['String'].apply(rplc_per)
print(df)

Output:

    String                   all_per
0   TCCTGTAAATCAAAGGCCAAGRG [TCCTGTAAATCAAAGGCCAAGAG, TCCTGTAAATCAAAGGCCAA...
1   GNGCNCCNGAYATRGCNTTYCC  [GCGCCCCCGACATAGCCTTCCC, GCGCCCCCGACATAGCCTTTC...
2   GATTTCTCTYCCTGTTCTTGCA  [GATTTCTCTCCCTGTTCTTGCA, GATTTCTCTTCCTGTTCTTGCA]
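
If one row per expanded string is more convenient than a list in each cell, `DataFrame.explode` can flatten the list column. The following is a minimal sketch under that assumption. `expand` is a compact hypothetical stand-in for `rplc_per`, and `long_df` is an illustrative name.

```python
import itertools
import pandas as pd

SNPs = {"Y": ["C", "T"], "R": ["A", "G"], "N": ["C", "G", "A", "T"]}

def expand(text):
    # Compact stand-in for rplc_per: one list of choices per character.
    choices = [SNPs.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]

df = pd.DataFrame({"String": ["TCCTGTAAATCAAAGGCCAAGRG", "GATTTCTCTYCCTGTTCTTGCA"]})
df["all_per"] = df["String"].apply(expand)

# explode() emits one row per list element and repeats the other columns;
# reset_index() replaces the duplicated index labels with 0..n-1.
long_df = df.explode("all_per").reset_index(drop=True)
print(long_df)
```

Each source string here has one ambiguous position with two alternatives, so `long_df` ends up with four rows.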
