Efficient way to find anagram phrases of a sentence using a dictionary?


I need to write a program that takes a dictionary file and an arbitrary string as input, and outputs every combination of words from that dictionary that together form an anagram of the given string. For example, using the 100 most popular English words and the string "i not work", I should get something like [' on it work', ' into work', ' not i work', ' know or it', ' work it no', ' to work in'], which I did.
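
To make the criterion concrete: two phrases are anagrams of each other when their letter multisets match once spaces are ignored. A minimal check, just for illustration:

def letter_key(s):
    # canonical form: the phrase's letters, spaces removed, in sorted order
    return ''.join(sorted(s.replace(' ', '')))

print(letter_key('on it work') == letter_key('i not work'))  # True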

The problem is that my program is far too inefficient: with a 100-word dictionary, the practical limit on the string length is about 7 characters; anything longer takes far too long. I have searched for all kinds of better algorithms, without success.

Here is how I search for the anagrams:

from collections import defaultdict
from string import ascii_lowercase

def sortstring(string):
    # Canonical form of a string: its letters in sorted order.
    return ''.join(sorted(string))

def simplify(all_strings):
    # Group phrases by their sorted-letter form, so that all anagrams of
    # the same letter multiset land in one bucket.
    possible_strings = defaultdict(list)
    for string in all_strings:
        possible_strings[sortstring(string).strip()].append(string)
    return possible_strings

def generate(database, length, curstring="", curdata=None):
    # Recursively extend curstring with dictionary words until the letter
    # count reaches the target length.  (A None default avoids sharing a
    # mutable set between calls.)
    if curdata is None:
        curdata = set()
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union({curstring})
    for i in database:
        if len((curstring + i).replace(" ", "")) <= length:
            curdata = curdata.union(generate(database.difference({i}),
                length, curstring + " " + i, curdata))
            database = database.difference({i})
    return curdata

def analyse(database, input_string):
    # Generate every phrase of the right length, then look up the bucket
    # whose sorted letters match the input string.
    cletters = countstring(input_string)
    strings = simplify(generate(database, cletters))
    data = list()
    sorted_string = sortstring(input_string).strip()
    if sorted_string in strings:
        data = strings[sorted_string]
    return len(strings.values()), data

def countstring(string):
    # Number of lowercase ASCII letters in the string (spaces excluded).
    a = countletters(string)
    return sum(a.values())

def countletters(string):
    # Frequency of each lowercase ASCII letter in the string.
    result = {}
    for i in ascii_lowercase:
        result[i] = string.count(i)
    return result
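
For completeness, a driver for the functions above might look like the following sketch; the filename "words.txt" and the one-word-per-line format are assumptions, not part of the original post:

# Hypothetical driver for the code above; "words.txt" is an assumed
# dictionary file with one lowercase word per line.
with open("words.txt") as f:
    database = {line.strip() for line in f if line.strip()}

count, phrases = analyse(database, "i not work")
print(count, phrases)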

Can anyone suggest improvements? Although I suspect the algorithm I am using should be thrown out entirely, since its complexity seems far too high given how slow it runs. Just in case: the program should be efficient enough to handle dictionaries of tens of thousands of words and strings of up to dozens of characters. That is much better than what I have.


2 Answers

Here is a recursive implementation of the tree approach I suggested in the comments:

def frequencyDict(s):
    s = s.lower()
    d = {}
    for c in s:
        if c.isalpha():
            if c in d:
                d[c] += 1
            else:
                d[c] = 1
    return d

def canMake(w,fdict):
    d = frequencyDict(w)
    return all(d[c] <= fdict.get(c,0) for c in d)

def candidates(wlist,fdict):
    return [w for w in wlist if canMake(w,fdict)]

def anagrams(wlist,fdict):
    if len(wlist) == 0 or len(fdict) == 0:
        return "no anagrams"
    hits = []
    firstWords = candidates(wlist,fdict)
    if len(firstWords) == 0:
        return "no anagrams"
    for w in firstWords:
        #create reduced frequency dict
        d = fdict.copy() 
        for c in w:
            d[c] -= 1
            if d[c] == 0: del d[c]
        #if d is empty, the first word is also the last word
        if len(d) == 0:
            hits.append(w)
        else:
            #create reduced word list
            rlist = [v for v in wlist if canMake(v,d)]
            tails = anagrams(rlist, d)
            if tails != "no anagrams":
                hits.extend(w + " " + t for t in tails)
    if len(hits) == 0:
        return "no anagrams"
    else:
        return hits

def findAnagrams(wlist,s):
    return anagrams(wlist,frequencyDict(s.lower()))

# Read the word list, skipping hyphenated entries.
with open("linuxwords.txt") as f:
    words = f.read().split('\n')
words = [w.strip().lower() for w in words if '-' not in w]
test = findAnagrams(words, "Donald Trump")

It takes about 20 seconds to find all 730 anagrams of "Donald Trump" from an old Linux word list. My favorite is "Lord Dampnut".
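
As an aside, the hand-rolled frequency dictionaries above can also be expressed with collections.Counter from the standard library. This is an equivalent sketch, not the answerer's code:

from collections import Counter

def can_make(word, pool):
    # Counter subtraction keeps only positive counts, so the result is
    # empty exactly when every letter of `word` is available in `pool`.
    return not (Counter(word) - pool)

pool = Counter(c for c in "Donald Trump".lower() if c.isalpha())
print(can_make("lord", pool))    # True
print(can_make("python", pool))  # False ('y' and 'h' are missing)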

I solved part of the problem myself. I fixed the for-if antipattern in the generator code:

def generate(database, length, letters, curstring="", curdata=None):
    # `letters` is the per-letter budget of the input string; prune any
    # branch that exceeds it instead of enumerating blindly.
    if curdata is None:
        curdata = set()
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union({curstring})
    t = countletters(curstring)
    for i in ascii_lowercase:
        if t[i] > letters[i]:
            return set()
    for i in database:
        t = countletters(curstring + i)
        # Skip words that would overshoot the letter budget.
        if any(t[j] > letters[j] for j in ascii_lowercase):
            continue
        if sum(t.values()) <= length:
            curdata = curdata.union(generate(database.difference({i}),
                length, letters, curstring + " " + i, curdata))
            database = database.difference({i})
    return curdata

It is now much faster, but it is still slow when the dictionary contains tens of thousands of words and/or the input string is long.
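
One further improvement worth considering (my suggestion, not part of the original thread) is to memoize subproblems on the remaining letter pool: many different word orders leave the same letters behind, so caching collapses them into one computation. A minimal sketch using collections.Counter; the WORDS list and the deduplication by word multiset are illustrative choices:

from collections import Counter

WORDS = ["on", "it", "work", "into", "not", "i", "know", "or", "to", "in", "no"]
_CACHE = {}

def solve(pool):
    # Return all multisets of words (as sorted tuples) that use up `pool`.
    key = tuple(sorted(pool.elements()))
    if key in _CACHE:
        return _CACHE[key]
    if not pool:
        return {()}
    results = set()
    for w in WORDS:
        rest = pool - Counter(w)
        # `w` fits only if the subtraction removed exactly len(w) letters.
        if sum(rest.values()) == sum(pool.values()) - len(w):
            for tail in solve(rest):
                results.add(tuple(sorted((w,) + tail)))
    _CACHE[key] = results
    return results

pool = Counter(c for c in "i not work" if c.isalpha())
for combo in sorted(solve(pool)):
    print(" ".join(combo))

Unlike the generate above, this returns each word combination once, ignoring order; permutations of a combination can be produced afterwards if the orderings are needed.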
