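<p>Let's try to implement this as efficiently as possible in pure Python, relying only on list and dict comprehensions.</p>
<p>Suppose we have a toy text made up of three words, "a", "b", and "c":</p>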
<pre><code>import numpy as np

np.random.seed(42)
text = " ".join([np.random.choice(list("abc")) for _ in range(100)])
text
'c a c c a a c b c c c c a c b a b b b b a a b b a a a c c c b c b b c
b c c a c a c c a a c b a b b b a b a b c c a c c b a b b b b b b b a
c b b b b b b c c b c a b a a b c a b a a a a c a a a c a a'
</code></pre>
<p>Then, to build the unigrams, bigrams, and trigrams, you can proceed as follows:</p>
<pre><code># Unigrams: the whitespace-separated tokens themselves
unigrams = text.split()
unigram_counts = dict()
for unigram in unigrams:
    unigram_counts[unigram] = unigram_counts.get(unigram, 0) + 1

# Bigrams: each token joined with its successor
bigrams = ["".join(bigram) for bigram in zip(unigrams[:-1], unigrams[1:])]
bigram_counts = dict()
for bigram in bigrams:
    bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

# Trigrams: each token joined with its next two successors
trigrams = ["".join(trigram) for trigram in zip(unigrams[:-2], unigrams[1:-1], unigrams[2:])]
trigram_counts = dict()
for trigram in trigrams:
    trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1
</code></pre>
<p>To apply the weights and normalize:</p>
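<pre><code>weights = [.2, .2, .6]
dics = [unigram_counts, bigram_counts, trigram_counts]
# Scale each count by the weight assigned to its n-gram order
weighted_counts = {k: v * w for d, w in zip(dics, weights) for k, v in d.items()}
# desired output: weighted counts normalized so they sum to 1
total = sum(weighted_counts.values())  # computed once, not once per key
freqs = {k: v / total for k, v in weighted_counts.items()}
</code></pre>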
<p>This gives:</p>
<pre><code>from pprint import pprint

pprint(freqs)
</code></pre>
<hr/>
<pre><code>{'a': 0.06693711967545637,
'aa': 0.02434077079107505,
'aaa': 0.024340770791075043,
...
</code></pre>
<p>Finally, a sanity check:</p>
<pre><code>print(sum(freqs.values()))
</code></pre>
<hr/>
<pre><code>0.999999999999999
</code></pre>
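<p>The sum comes out just below 1 because of floating-point rounding, so an exact comparison with <code>1.0</code> would fail. A minimal sketch of a tolerance-based check, where the small <code>freqs</code> dict is a hypothetical stand-in for the one built above:</p>

```python
import math

# Hypothetical stand-in for the freqs dict computed above
freqs = {"a": 0.2, "b": 0.3, "ab": 0.5}

# fsum accumulates with extended precision, reducing rounding drift
total = math.fsum(freqs.values())

# Compare with a tolerance instead of ==, since float sums rarely hit 1.0 exactly
is_normalized = math.isclose(total, 1.0, rel_tol=1e-9)
print(is_normalized)
```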
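<p>This code can be customized further, for example to incorporate your own tokenization rules, or shortened by looping over the different n-gram sizes in a single pass.</p>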
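<p>One possible way to do both is sketched here. The function name <code>weighted_ngram_freqs</code> and its <code>tokenize</code> parameter are hypothetical and not part of the code above, and <code>collections.Counter</code> is swapped in for the manual <code>dict.get</code> counting. It assumes single-character tokens as in the toy text. With longer tokens, concatenated keys from different orders could collide, so tuple keys would be safer.</p>

```python
from collections import Counter


def weighted_ngram_freqs(text, weights=(0.2, 0.2, 0.6), tokenize=str.split):
    """Return weighted, normalized n-gram frequencies for n = 1..len(weights)."""
    tokens = tokenize(text)  # swap in any callable to change tokenization
    weighted = {}
    for n, w in enumerate(weights, start=1):
        # Zipping n staggered slices of the token list yields every n-gram
        ngrams = ("".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
        for gram, count in Counter(ngrams).items():
            weighted[gram] = count * w
    total = sum(weighted.values())
    return {gram: value / total for gram, value in weighted.items()}


freqs = weighted_ngram_freqs("a b a c a b")
```

<p>Passing a different callable as <code>tokenize</code> (for instance, one that lowercases the text before splitting) changes the tokenization without touching the counting logic.</p>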