
Posted 2024-09-26 22:49:50




import re

# Open sequence.txt and read it into a single string with the newlines removed
seqfile = open(sequence, "r")
seqfile = seqfile.read().replace("\n", "")

# Regex for each STR
pattern1 = r"AGATC"
pattern2 = r"TTTTTTCT"
pattern3 = r"AATG"
pattern4 = r"TCTAG"
pattern5 = r"GATA"
pattern6 = r"TATC"
pattern7 = r"GAAA"
pattern8 = r"TCTG"

# Three lists of counters for the loop; outercount holds the final count for each STR, in the same order as the patterns above

outercount = [0, 0, 0, 0, 0, 0, 0, 0]
innercount = [0, 0, 0, 0, 0, 0, 0, 0]
secondcount = [0, 0, 0, 0, 0, 0, 0, 0]

# Loop through the sequence and check whether a pattern matches; if it does, increment secondcount by 1 and continue...
for i in seqfile:
    if re.match(pattern1, seqfile):
        secondcount[0] += 1
    elif re.match(pattern2, seqfile):
        secondcount[1] += 1
    elif re.match(pattern3, seqfile):
        secondcount[2] += 1
    elif re.match(pattern4, seqfile):
        secondcount[3] += 1
    elif re.match(pattern5, seqfile):
        secondcount[4] += 1
    elif re.match(pattern6, seqfile):
        secondcount[5] += 1
    elif re.match(pattern7, seqfile):
        secondcount[6] += 1
    elif re.match(pattern8, seqfile):
        secondcount[7] += 1

# Loop through outercount; if the value in innercount is less than the one in secondcount, update it.
for i in outercount:
    if secondcount[i] > innercount[i]:
        # stop counting
        innercount[i] = secondcount[i]
    # Reset secondcount so that it doesn't keep counting when the matches are not consecutive
    secondcount[i] = 0
    # If innercount is greater than outercount, set outercount[i] to innercount[i]
    if innercount[i] > outercount[i]:
        outercount[i] = innercount[i]



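Note that the real text is longer than this; the above is only for reference. In this task I need to look for up to 8 different DNA sequences and find how many times each one appears in a row. For example, look for the pattern AGATC and count the highest number of times it occurs consecutively. If it first appears 3 times in a row somewhere in the text and later appears 6 times in a row, my AGATC counter should be 6, because that is the longest consecutive run.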
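The "3 in a row, then 6 in a row" example can be made concrete with a small sketch. It uses plain string scanning rather than regular expressions; the function name `longest_run` and the sample string are made up for illustration and are not part of the original code.

```python
def longest_run(sequence, pattern):
    """Return the largest number of back-to-back copies of pattern in sequence."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        # Count copies of pattern that begin at `start` and follow one another directly
        while sequence.startswith(pattern, start + run * len(pattern)):
            run += 1
        best = max(best, run)
    return best


# Hypothetical sample: AGATC three times in a row, a gap, then six times in a row
sample = "AGATC" * 3 + "TT" + "AGATC" * 6
print(longest_run(sample, "AGATC"))  # 6
```

Trying every start position is not the fastest option, but it keeps the logic easy to check and also handles patterns that overlap with themselves.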

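To explain my code: I wanted three different arrays, which is probably not the most scalable solution, since the text can contain 3 or 8 different patterns. But I figured that starting from the largest number would make the rest easier to work out. So I tried creating a regex for each pattern and then checking whether each pattern can be found in the text; if it can, I update the second count list at the corresponding index.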
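One detail worth checking in this approach: `re.match` only tests whether the pattern matches at the very start of the string, so it does not tell you whether the pattern occurs somewhere later in the text, whereas `re.search` does. A minimal sketch, using a made-up sample string:

```python
import re

sample = "GGAGATCAGATC"  # hypothetical sequence; AGATC first appears at index 2

# re.match anchors at the beginning of the string
print(re.match(r"AGATC", sample))           # None
# re.search scans forward until it finds the first occurrence
print(re.search(r"AGATC", sample).start())  # 2
```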





import re

# Longest consecutive run found so far for each STR pattern
patterns = {"AGATC": 0, "TTTTTTCT": 0, "AATG": 0, "TCTAG": 0}  # ... plus the remaining patterns

with open(sequence, 'rt') as file:
    rows = file.readlines()

    for row in rows:
        for pattern in patterns:
            regex = r"((?:{0})+)".format(pattern)  # one or more consecutive repeats
            results = re.findall(regex, row)  # list of consecutive runs found in this row
            if results:
                longest_sequence = max(results, key=len)
                count = len(longest_sequence) // len(pattern)  # number of repeats in the longest run
                patterns[pattern] = max(count, patterns[pattern])

An example of the regex is ((?:AGATC)+), meaning: find AGATC repeated one or more times (+) in succession. The ?: makes the inner group non-capturing, so findall returns only one group, the whole match.
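To see what `findall` returns for such a pattern, here is a small self-contained sketch. It uses the form `((?:AGATC)+)`, which also counts a lone occurrence as a run of one. The helper name `longest_run_regex` and the sample string are assumptions made for illustration.

```python
import re


def longest_run_regex(sequence, pattern):
    """Longest number of consecutive repeats of pattern, found with a regex."""
    # The outer group captures the whole run; the inner (?:...) group is non-capturing
    regex = r"((?:{0})+)".format(re.escape(pattern))
    runs = re.findall(regex, sequence)
    if not runs:
        return 0
    return len(max(runs, key=len)) // len(pattern)


# Hypothetical sample: AGATC x2, a gap, AGATC x1, a gap, AATG x3
sample = "AGATCAGATC" + "G" + "AGATC" + "TT" + "AATGAATGAATG"
print(re.findall(r"((?:AGATC)+)", sample))  # ['AGATCAGATC', 'AGATC']
print(longest_run_regex(sample, "AGATC"))   # 2
print(longest_run_regex(sample, "AATG"))    # 3
```

Dividing the length of the longest run by the length of the pattern gives the repeat count directly, so no extra counters are needed.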


import re

with open(sequence_file, 'rt') as f:
    rows = f.readlines()

patterns = {
    re.compile("AGATC"): 0,
    re.compile("TCTAG"): 0,
}

for r in rows:
    for p in patterns:
        prev_end = 0
        freq = 0
        for m in p.finditer(r):
            span = m.span()
            if span[0] != prev_end:
                patterns[p] = max(freq, patterns[p])
                freq = 0

            prev_end = span[1]
            freq += 1

        if freq:
            patterns[p] = max(freq, patterns[p])


