<p>不确定到目前为止您尝试了什么,下面是一个在<code>nltk</code>和<code>dict()</code>中使用<code>ngrams</code>的解决方案</p>
<pre><code>from nltk import ngrams
tweet = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"
# Your lexicons
lexicon_food = ["hot dog", "banana", "banana split"]
lexicon_beverage = ["coke", "cola", "coca cola"]
lexicon_dict = {x: [x, 'Food'] for x in lexicon_food}
lexicon_dict.update({x: [x, 'Beverage'] for x in lexicon_beverage})
# Function to extract lexicon items
def extract(g, lex):
if ' '.join(g) in lex.keys():
return lex.get(' '.join(g))
elif g[0] in lex.keys():
return lex.get(g[0])
else:
pass
# Your task
out = [[extract(g, lexicon_dict) for g in ngrams(sentence.split(), 2) if extract(g, lexicon_dict)]
for sentence in tweet.replace(',', '').lower().split('.')]
print(out)
</code></pre>
<p>输出:</p>
^{pr2}$
<hr/>
<p><strong>方法2</strong>(避免使用“可口可乐”和“可乐”)</p>
<pre><code>def extract2(sentence, lex):
extracted_words = []
words = sentence.split()
i = 0
while i < len(words):
if ' '.join(words[i:i+2]) in lex.keys():
extracted_words.append(lex.get(' '.join(words[i:i+2])))
i += 2
elif words[i] in lex.keys():
extracted_words.append(lex.get(words[i]))
i += 1
else:
i += 1
return extracted_words
out = [extract2(s, lexicon_dict) for s in tweet.replace(',', '').lower().split('.')]
print(out)
</code></pre>
<p>输出:</p>
<pre><code>[[['coca cola', 'Beverage'], ['hot dog', 'Food']],
[['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]
</code></pre>
<p>注意这里不需要<code>nltk</code>。在</p>