Extract the string between two conserved strings, allowing for mismatches, in Python or R

Posted 2024-09-28 23:26:05

I apologize if this is a duplicate. As far as I have searched, I have not found an answer to this question.

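I have thousands of DNA sequences, each about 50 bases (50 characters) long. In the middle of each there is a variable sequence of 6 to 30 bp. It is flanked by two conserved sequences, one on the left and one on the right, each about 10 bp long. My data looks like this: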

ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA
ATTGCGCGA = conserved area on the left (reference)

NAAANNNANNNNNNA = random sequence between 

CGAAAATTTA = conserved area on the right (reference)

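So far so good. I know how to extract the string between the conserved regions when they match exactly. However, I expect the conserved regions to contain errors sometimes. I would like a method that tolerates a few mismatches in each conserved region (for example, two or three) and extracts whatever sequence lies between them, provided it is 6 to 30 bp long.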
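For reference, the exact-match case mentioned above could look roughly like this in Python. This is a minimal sketch using the standard `re` module on unspaced reads; the helper name `extract_exact` is made up for illustration, and the two flank strings are the reference sequences from the example.

```python
import re

LEFT = "ATTGCGCGA"     # left conserved reference
RIGHT = "CGAAAATTTA"   # right conserved reference

# The whole read must be LEFT + 6..30 arbitrary characters + RIGHT.
EXACT = re.compile(re.escape(LEFT) + r"(.{6,30})" + re.escape(RIGHT))

def extract_exact(read):
    """Return the variable part if both flanks match exactly, else None."""
    m = EXACT.fullmatch(read)
    return m.group(1) if m else None

print(extract_exact("ATTGCGCGANAAANNNANNNNNNACGAAAATTTA"))  # flanks intact
print(extract_exact("ATASGCGCGANAAGGNNNCGAfATTTA"))         # flanks damaged
```

A read whose flanks contain even a single error is rejected by this pattern, which is exactly the limitation I am trying to get around.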

My data looks like this:

1 ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA # it looks good
2 ATTGCGCGA NAAANNNAN  CGAAAATTTA      # it looks good
3 ATTGCGCGA NAA CGAAAATTTA             # the variable sequence is too short
4 ATASGCGCGA NAAGGNNN CGAfATTTA        # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGA NAAGGNNN CGAfjfkdfTTA    # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGA NAAGGNNN CGAfjfkdfTTA      # more than 3 mismatches at the right conserved area

I would like my output to look like this:

1 NAAANNNANNNNNNA # it looks good
2 NAAANNNAN       # it looks good
4 NAAGGNNN        # two mismatches on the left and two on the right conserved sequences

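Important: my data is not actually split into blocks separated by spaces. I wrote it that way above only to make it easier to read. The raw data looks like this: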

1 ATTGCGCGANAAANNNANNNNNNACGAAAATTTA # it looks good
2 ATTGCGCGANAAANNNANCGAAAATTTA       # it looks good
3 ATTGCGCGANAACGAAAATTTA             # the variable sequence is too short
4 ATASGCGCGANAAGGNNNCGAfATTTA        # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGANAAGGNNNCGAfjfkdfTTA    # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGANAAGGNNNCGAfjfkdfTTA      # more than 3 mismatches at the right conserved area
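
To make the goal concrete, here is one possible way to tolerate errors in the flanks, sketched in plain Python. It rests on several assumptions that are not stated in the data itself: the conserved flanks sit at the very start and end of each read, errors are counted as Levenshtein edits (substitutions, insertions, deletions) because read 4 has an extra character in its left flank, and up to 3 errors per flank are acceptable. The names `prefix_edit_distances` and `extract_variable` and the `max_errors` parameter are hypothetical, chosen only for this illustration.

```python
LEFT = "ATTGCGCGA"     # left conserved reference
RIGHT = "CGAAAATTTA"   # right conserved reference

def prefix_edit_distances(pattern, text):
    """Return d where d[b] is the Levenshtein distance between
    pattern and text[:b], for every prefix length b."""
    prev = list(range(len(text) + 1))  # empty pattern vs text[:b] costs b
    for a, pc in enumerate(pattern, start=1):
        cur = [a]                      # pattern[:a] vs empty text costs a
        for b, tc in enumerate(text, start=1):
            cur.append(min(prev[b] + 1,                # pattern char missing from read
                           cur[b - 1] + 1,             # extra char in read
                           prev[b - 1] + (pc != tc)))  # match or substitution
        prev = cur
    return prev

def extract_variable(read, left=LEFT, right=RIGHT,
                     max_errors=3, min_len=6, max_len=30):
    """Return the sequence between the two flanks, or None if a flank
    has too many errors or the middle is not min_len..max_len long."""
    left_d = prefix_edit_distances(left, read)
    # Reverse both strings so that suffixes of the read become prefixes.
    right_d = prefix_edit_distances(right[::-1], read[::-1])
    n = len(read)
    # Best-fitting flank on each side (ties go to the shorter flank).
    i = min(range(n + 1), key=left_d.__getitem__)  # left flank is read[:i]
    s = min(range(n + 1), key=right_d.__getitem__) # right flank is the last s chars
    if left_d[i] > max_errors or right_d[s] > max_errors:
        return None
    middle = read[i:n - s]  # empty if the two flanks overlap
    return middle if min_len <= len(middle) <= max_len else None

reads = {
    1: "ATTGCGCGANAAANNNANNNNNNACGAAAATTTA",
    2: "ATTGCGCGANAAANNNANCGAAAATTTA",
    3: "ATTGCGCGANAACGAAAATTTA",
    4: "ATASGCGCGANAAGGNNNCGAfATTTA",
    5: "ATASjkCGCGANAAGGNNNCGAfjfkdfTTA",
    6: "ATTGCGCGANAAGGNNNCGAfjfkdfTTA",
}
for num, read in reads.items():
    middle = extract_variable(read)
    if middle is not None:
        print(num, middle)  # keeps reads 1, 2 and 4
```

In this sketch the flanks are located first, by lowest error count, and the 6 to 30 bp length filter is applied afterwards. Doing it in that order keeps read 3 from being rescued by shaving a few bases off otherwise perfect flanks to pad out the middle.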
