Implementing the frequent-words pseudocode

3 answers

User

Answer 1 · edited 2024-09-20 05:40:38

Use a dict instead:

def FrequentWords(Text, k):
    # Count every substring of length k in Text, overlaps included.
    FrequentPatterns = {}
    for i in range(len(Text) - k + 1):
        Pattern = Text[i:i+k]
        if Pattern in FrequentPatterns:
            FrequentPatterns[Pattern] += 1
        else:
            FrequentPatterns[Pattern] = 1
    # Print (pattern, count) pairs, most frequent first.
    for x in sorted(FrequentPatterns.items(), key=lambda m: m[1], reverse=True):
        print(x)

User

Answer 2 · edited 2024-09-20 05:40:38

s = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
length = 4

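I would use a dictionary to accumulate the results. Use a slice to extract the next word; add one to that word's value in the dictionary; then remove the first character from the string; loop while the string still holds a full word.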

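The loop described above could look roughly like this. It is only a sketch of those steps: the names text and word are my own, and result is chosen to match the name used in the later snippets.

```python
s = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
length = 4

result = {}
text = s  # work on a second name so s itself stays available
while len(text) >= length:          # loop while a full word is left
    word = text[:length]            # slice out the next word
    result[word] = result.get(word, 0) + 1
    text = text[1:]                 # drop the first character
```

Each pass copies the remaining string, which is the inefficiency mentioned below; for short inputs it is harmless.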

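You could probably use collections.Counter or collections.defaultdict for that part. If the words must not overlap, remove length characters from the front of the string instead of one. Removing characters from the string at the bottom of the loop keeps the process simple but is inefficient, because every pass copies the rest of the string. Unless the data is very long or the process runs many times, that does not matter.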
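As a rough illustration of the defaultdict suggestion, the counting can be done by index instead of by shrinking the string, which also makes the non-overlapping case a matter of changing the step. The function name count_words and its overlap parameter are made up for this sketch.

```python
from collections import defaultdict

def count_words(text, length, overlap=True):
    # Hypothetical helper: step by 1 for overlapping words,
    # or by `length` when words must not overlap.
    step = 1 if overlap else length
    counts = defaultdict(int)  # missing keys start at 0
    for i in range(0, len(text) - length + 1, step):
        counts[text[i:i + length]] += 1
    return counts

overlapping = count_words('ABABAB', 2)
separate = count_words('ABABAB', 2, overlap=False)
```

Here overlapping counts 'AB' three times and 'BA' twice, while separate only sees the three 'AB' words.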

Then find the words with the highest frequency

most = max(result.values())
frequent = []
for key, value in result.items():
    if value == most:
        frequent.append(key)

#frequent = [key for key, value in result.items() if value == most]

Borrowing from the itertools recipes, you can create an iterator that yields words of the desired length

import itertools

def n_wise(iterable, n=2):
    '''s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2'''
    tees = itertools.tee(iterable, n)
    # Advance the i-th copy by i items (0, 1, ..., n-1) so the copies are staggered.
    for i, thing in enumerate(tees):
        for _ in range(i):
            next(thing, None)
    return zip(*tees)

The counting part of the program then becomes

words = n_wise(s, length)
result = {}
for word in words:
    result[word] = result.get(word, 0) + 1

The keys in result will be tuples, e.g. ('C', 'A', 'T', 'G'), but they can be turned back into strings with ''.join(('C', 'A', 'T', 'G')).
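If string keys are more convenient, the whole dictionary can be re-keyed in one step. In this sketch a zip over shifted slices stands in for n_wise (it also yields tuples of characters), and by_string is a name I picked for illustration.

```python
s = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
length = 4

# Stand-in for n_wise: zipping shifted slices also yields character tuples.
result = {}
for word in zip(*(s[i:] for i in range(length))):
    result[word] = result.get(word, 0) + 1

# Rebuild string keys from the tuple keys.
by_string = {''.join(word): count for word, count in result.items()}
```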

User

Answer 3 · edited 2024-09-20 05:40:38

You should consider using list comprehensions.

def pattern_count(text, pattern):
    # Compare every window of len(pattern) characters, so overlapping matches are counted.
    matches = [x for x in range(len(text) - len(pattern) + 1) if text[x:x + len(pattern)] == pattern]
    return len(matches)


def frequent_words(text, k):
    # The + 1 makes sure the final window of length k is included.
    counts = [pattern_count(text, text[x:x + k]) for x in range(len(text) - k + 1)]
    most = max(counts)
    return {text[x:x + k] for x in range(len(text) - k + 1) if counts[x] == most}

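pattern_count scans the string for the pattern. We slice the text one window at a time so that we can check whether that window matches the pattern. This lets us include overlapping matches in the result. For example:

pattern_count("ABABA", "ABA") returns 2, not 1, because the two occurrences share the middle "A".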

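To see the difference that overlapping makes, the window-by-window count can be put next to str.count, which only counts non-overlapping occurrences. The variable names here are just for illustration.

```python
def pattern_count(text, pattern):
    # Count windows equal to pattern; neighbouring windows may overlap.
    k = len(pattern)
    return sum(1 for x in range(len(text) - k + 1) if text[x:x + k] == pattern)

overlapping = pattern_count("ABABA", "ABA")   # matches at index 0 and index 2
non_overlapping = "ABABA".count("ABA")        # str.count skips past each match
```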

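frequent_words takes the same text, but instead of a pattern we give it an int that says how long the patterns should be. Once we have the list of occurrence counts for every pattern of length k, we filter it down to the entries with the highest count. Finally, to remove duplicates, we collect the results in a set, which inherently prevents duplicate entries, so only unique values are returned.

Test: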
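Since pattern_count rescans the whole text for every position, the approach above does quadratic work. If that ever matters, a single counting pass with collections.Counter gives the same set. This is a sketch, not part of the answer's code, and the name frequent_words_fast is invented here.

```python
from collections import Counter

def frequent_words_fast(text, k):
    # One pass over all windows of length k, overlaps included.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:          # text shorter than k: nothing to report
        return set()
    most = max(counts.values())
    return {word for word, n in counts.items() if n == most}

top = frequent_words_fast("BABBASDCABCBABDDASDBBCASDBAB", 3)
```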

print(frequent_words("BABBASDCABCBABDDASDBBCASDBAB", 3))
[0] {'BAB', 'ASD'}

Hope this works for you.
