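My apologies if this is a duplicate. I have searched, but I have not found an answer to this question.
I have thousands of DNA sequences, each about 50 bases (50 characters) long. In the middle there is a variable sequence ranging from 6 to 30 bp. There are two conserved sequences, one to the left and one to the right of the variable sequence, each about 10 bp long. My data looks like this: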
ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA
ATTGCGCGA = conserved area on the left (reference)
NAAANNNANNNNNNA = random sequence between
CGAAAATTTA = conserved area on the right (reference)
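So far so good. I know how to extract the string between the conserved regions; however, I sometimes expect errors in the conserved regions. I would like a way to allow a few mismatches in the conserved regions (for example, two or three) and extract whatever sequence/string lies between them, provided its length is between 6 and 30 bp.
With mismatches, my data looks like this: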
1 ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA # it looks good
2 ATTGCGCGA NAAANNNAN CGAAAATTTA # it looks good
3 ATTGCGCGA NAA CGAAAATTTA # the variable sequence is too short
4 ATASGCGCGA NAAGGNNN CGAfATTTA # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGA NAAGGNNN CGAfjfkdfTTA # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGA NAAGGNNN CGAfjfkdfTTA # more than 3 mismatches at the right conserved area
I would like my output to look like this:
1 NAAANNNANNNNNNA # it looks good
2 NAAANNNAN # it looks good
4 NAAGGNNN # two mismatches on the left and two on the right conserved sequences
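!!! Important: my data is not split into blocks separated by gaps. I wrote it that way above only to make it easier to read. The raw data looks like this: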
1 ATTGCGCGANAAANNNANNNNNNACGAAAATTTA # it looks good
2 ATTGCGCGANAAANNNANCGAAAATTTA # it looks good
3 ATTGCGCGANAACGAAAATTTA # the variable sequence is too short
4 ATASGCGCGANAAGGNNNCGAfATTTA # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGANAAGGNNNCGAfjfkdfTTA # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGANAAGGNNNCGAfjfkdfTTA # more than 3 mismatches at the right conserved area
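Each line has three parts. Let's call them X, Y and Z.
You are interested in Y, but you say that X and Z sometimes vary.
You should use the Levenshtein distance to compare the actual X and Z against the official (reference) X and Z.
If the distance is small enough, you can accept Y.
See here for a Python library that will compute the distance.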
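To make the idea concrete, here is a minimal sketch in plain Python with no external library. It assumes the left reference sits at the very start of each raw sequence and the right reference at the very end, and that "mismatches" may include insertions and deletions, which is why the flank length is allowed to vary by up to `max_dist` characters. The names `levenshtein`, `best_flank` and `extract_middle` are made up for this illustration, and the threshold of 3 and the 6-30 bp limits are taken from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def best_flank(seq, ref, max_dist, from_left):
    """Find the prefix (or suffix) of seq closest to ref.

    Tries every flank length within max_dist of len(ref) and returns
    (distance, length) for the best one, or None if seq is too short.
    Ties are broken in favour of the length closest to len(ref).
    """
    best = None
    lo = max(1, len(ref) - max_dist)
    hi = min(len(seq), len(ref) + max_dist)
    for n in range(lo, hi + 1):
        piece = seq[:n] if from_left else seq[-n:]
        cand = (levenshtein(piece, ref), abs(n - len(ref)), n)
        if best is None or cand < best:
            best = cand
    return None if best is None else (best[0], best[2])


def extract_middle(seq, left_ref, right_ref, max_dist=3, min_len=6, max_len=30):
    """Return Y if both flanks are within max_dist of their references, else None."""
    left = best_flank(seq, left_ref, max_dist, from_left=True)
    right = best_flank(seq, right_ref, max_dist, from_left=False)
    if left is None or right is None:
        return None
    if left[0] > max_dist or right[0] > max_dist:
        return None
    if left[1] + right[1] > len(seq):  # flanks would overlap
        return None
    middle = seq[left[1]:len(seq) - right[1]]
    return middle if min_len <= len(middle) <= max_len else None


# The six raw rows from the question
rows = {
    1: "ATTGCGCGANAAANNNANNNNNNACGAAAATTTA",
    2: "ATTGCGCGANAAANNNANCGAAAATTTA",
    3: "ATTGCGCGANAACGAAAATTTA",
    4: "ATASGCGCGANAAGGNNNCGAfATTTA",
    5: "ATASjkCGCGANAAGGNNNCGAfjfkdfTTA",
    6: "ATTGCGCGANAAGGNNNCGAfjfkdfTTA",
}
for row_id, seq in rows.items():
    y = extract_middle(seq, "ATTGCGCGA", "CGAAAATTTA")
    if y is not None:
        print(row_id, y)
```

On the six rows above this keeps rows 1, 2 and 4 and drops the rest. If your conserved regions can sit somewhere inside a longer read rather than at the ends, you would need to slide the window along the sequence instead of only testing prefixes and suffixes.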