Arranging letters in the most pronounceable way?

2 answers

User

Answer 1 · edited 2024-09-28 23:37:23

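(For completeness, here is my original pure-Python solution, which inspired me to try machine learning.)

I agree that a reliable solution would require a sophisticated model of the English language, but maybe we can come up with a simple heuristic that is only modestly terrible.

I can think of two basic rules that most pronounceable words satisfy: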

1. contain a vowel sound
2. no more than two consonant sounds in succession

As a regular expression, this can be written as c?c?(v+cc?)*v*
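To see what the pattern accepts, here is a minimal sketch that reduces a word to a consonant/vowel skeleton and matches the skeleton against the regular expression. Treating each letter as one sound and using only a, e, i, o, u as vowels are simplifying assumptions of this sketch, and the helper names `cv_skeleton` and `fits_pattern` are made up for illustration.

```python
import re

# The abstract pattern from the text: c = one consonant sound, v = one vowel sound.
CV_PATTERN = re.compile(r"c?c?(v+cc?)*v*")

# Assumption: every letter is exactly one sound, and only these letters are vowels.
VOWEL_LETTERS = set("aeiou")


def cv_skeleton(word):
    """Map each letter to 'v' or 'c', e.g. 'banana' -> 'cvcvcv'."""
    return "".join("v" if ch in VOWEL_LETTERS else "c" for ch in word.lower())


def fits_pattern(word):
    """True when the whole skeleton matches c?c?(v+cc?)*v*."""
    return CV_PATTERN.fullmatch(cv_skeleton(word)) is not None


for w in ["banana", "tns", "angstrom"]:
    print(w, cv_skeleton(w), fits_pattern(w))
```

Note that the pattern on its own only enforces rule 2: a skeleton such as "cc" still matches, so rule 1 (at least one vowel sound) would need a separate check.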

Now a simplistic attempt at identifying sounds from spelling:

[code listing not preserved in this copy; it defined the lists `vowels` and `consonants` used below]
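The listing that defined `vowels` and `consonants` is not available in this copy. As a stand-in, here is a hypothetical pair of lists that would let the code that follows run. The specific spellings are my own guess at a reasonable starting point, not the answer's original lists, so any scores you get with them will differ from the ones quoted.

```python
# Hypothetical stand-ins: the original lists are not available here.
# Each entry is a spelling treated as ONE sound, so a digraph like "th" counts once.
vowels = "a e i o u y ai au ea ee ie oa oo ou".split()
consonants = (
    "b c d f g h j k l m n p q r s t v w x y z "
    "ch ck ng ph qu sh th wh"
).split()

print(len(vowels), "vowel spellings,", len(consonants), "consonant spellings")
```

The letter "y" appears in both lists on purpose, since it can act as either kind of sound.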

The rules can then be applied with a regular expression:

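import re

# Turn each list of spellings into one alternation group, e.g. "(a|e|i|o|u)".
# `vowels` and `consonants` are the lists of spellings from the previous step.
v = "({0})".format("|".join(vowels))
c = "({0})".format("|".join(consonants))

# c?c?(v+cc?)*v* anchored to the whole word: {0} is a vowel sound, {1} a consonant sound.
pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))

def test(w):
    return pattern.search(w)

def predict(words):
    return ["word" if test(w) else "scrambled" for w in words]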

On a test set of words and scrambled words, this scores about 74%.

             precision    recall  f1-score   support

  scrambled       0.90      0.57      0.70     52403
       word       0.69      0.93      0.79     52940

avg / total       0.79      0.75      0.74    105343
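As a quick sanity check on the report above, the f1-score column is the harmonic mean of precision and recall. This small sketch recomputes it from the rounded values in the table (the function name `f1` is just for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f1(0.90, 0.57), 2))  # scrambled row: prints 0.7, i.e. the 0.70 in the table
print(round(f1(0.69, 0.93), 2))  # word row: prints 0.79
```

The high precision but low recall on "scrambled" says the heuristic rarely flags a real word, but lets a lot of scrambled strings through as words.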

A tweaked version scored 80%.

User

Answer 2 · edited 2024-09-28 23:37:23

Start by solving a simpler problem: is a given word pronounceable?

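Supervised machine learning could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assuming the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as "tns" and "srh" are rare in dictionary words, even though each individual letter is common.

The idea is that the trained algorithm will learn to classify words containing any rare trigram as unpronounceable, and words made up only of common trigrams as pronounceable.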
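To make the trigram intuition concrete, here is a minimal pure-Python sketch that counts character n-grams over a toy word list and reports how often a candidate word's rarest trigram was seen. The six-word training list and the helper names are assumptions for illustration only; a real run would count over a full dictionary and let a classifier weigh the counts.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Yield every substring of length n_min..n_max, shortest lengths first."""
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            yield word[i:i + n]


# Toy training list (an assumption for illustration, not real training data).
training_words = ["instant", "transit", "strain", "saint", "train", "stain"]

# How often each 1-, 2- and 3-gram occurs across the training words.
counts = Counter(g for w in training_words for g in char_ngrams(w))


def rarest_trigram_count(word):
    """Smallest training count among the word's trigrams (0 = never seen)."""
    trigrams = list(char_ngrams(word, 3, 3))
    return min(counts[g] for g in trigrams) if trigrams else 0


print(rarest_trigram_count("trains"))  # every trigram was seen at least once
print(rarest_trigram_count("tnsrai"))  # contains "tns", which never occurs
```

Both candidate words use only letters that are common in the training list, yet only the second one contains a trigram that was never seen, which is the signal the classifier is meant to pick up.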

Here is an implementation using scikit-learn (http://scikit-learn.org/):

import random

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def scramble(s):
    # Random permutation of the letters in s.
    return "".join(random.sample(s, len(s)))

# Keep only all-lowercase dictionary entries.
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w == w.lower()]
scrambled = [scramble(w) for w in words]

# Every scrambled word is labelled unpronounceable (the assumption stated earlier).
X = words + scrambled
y = ['word'] * len(words) + ['unpronounceable'] * len(scrambled)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Count character 1-, 2- and 3-grams, then fit a multinomial naive Bayes classifier.
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])

text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))

It scores 92% accuracy. Given that pronounceability is subjective, this might be as good as it gets.

This agrees with your examples:

>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']

For the curious, here are 10 scrambled words that it classifies as pronounceable:

suroatipsheq (the other nine words are garbled in this copy)

And finally, 10 dictionary words misclassified as unpronounceable:

ilch tohubohu usnea (the remaining words are garbled in this copy)
