Extracting n-grams from tweets with Python

Posted 2024-10-01 02:38:46


Suppose I have 100 tweets.
From these tweets, I need to extract: 1) food names, and 2) beverage names.

An example tweet:

"Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"

I have two lexicons to use: one of food names and one of beverage names.

Examples from the food-name lexicon:
"hot dog"
"banana"
"banana split"

Examples from the beverage-name lexicon:
"coke"
"cola"
"coca cola"

What I should be able to extract:

[[["coca cola", "beverage"], ["hot dog", "food"], ["banana split", "food"]],
[["coke", "beverage"], ["banana", "food"], ["banana split", "food"]]]

Names in the lexicons can be 1 to 5 words long. How can I use my lexicons to extract these n-grams from the tweets?


2 Answers

Not sure what you have tried so far, but here is a solution using ngrams from nltk together with a dict():

from nltk import ngrams

tweet = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"

# Your lexicons
lexicon_food = ["hot dog", "banana", "banana split"]
lexicon_beverage = ["coke", "cola", "coca cola"]
lexicon_dict = {x: [x, 'Food'] for x in lexicon_food}
lexicon_dict.update({x: [x, 'Beverage'] for x in lexicon_beverage})

# Function to extract lexicon items
def extract(g, lex):
    # Try the full bigram first, then fall back to its first word
    if ' '.join(g) in lex:
        return lex[' '.join(g)]
    elif g[0] in lex:
        return lex[g[0]]
    return None

# Your task
out = [[extract(g, lexicon_dict) for g in ngrams(sentence.split(), 2) if extract(g, lexicon_dict)] 
        for sentence in tweet.replace(',', '').lower().split('.')]
print(out)

Output:

[[['coca cola', 'Beverage'], ['cola', 'Beverage'], ['hot dog', 'Food']],
 [['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]

Approach 2 (avoids extracting both "coca cola" and "cola"; in the code above the one-word fallback fires again on the bigram starting at "cola")

def extract2(sentence, lex):
    extracted_words = []
    words = sentence.split()
    i = 0
    while i < len(words):
        # Try the two-word phrase starting at i first, then the single word,
        # and skip past whatever matched so overlaps are not double-counted
        if ' '.join(words[i:i+2]) in lex:
            extracted_words.append(lex[' '.join(words[i:i+2])])
            i += 2
        elif words[i] in lex:
            extracted_words.append(lex[words[i]])
            i += 1
        else:
            i += 1
    return extracted_words

out = [extract2(s, lexicon_dict) for s in tweet.replace(',', '').lower().split('.')]
print(out)

Output:

[[['coca cola', 'Beverage'], ['hot dog', 'Food']], 
 [['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]

Note that nltk is not needed here.
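
Since the question says lexicon names can be 1 to 5 words long, the same greedy longest-match-first idea extends beyond bigrams. Here is a minimal sketch under that assumption (the extract_n name and the max_n parameter are my own, not from the code above):

def extract_n(sentence, lex, max_n=5):
    extracted = []
    words = sentence.split()
    i = 0
    while i < len(words):
        # Try the longest phrase starting at i first, down to a single word
        for n in range(max_n, 0, -1):
            phrase = ' '.join(words[i:i+n])
            if phrase in lex:
                extracted.append(lex[phrase])
                i += n
                break
        else:
            # No phrase of any length matched at position i
            i += 1
    return extracted

out = [extract_n(s, lexicon_dict) for s in tweet.replace(',', '').lower().split('.')]
print(out)

On the example tweet this should produce the same result as extract2, since no entry in the example lexicons is longer than two words.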

Here is a simple solution:

import re

def lexicon_by_word(lexicons):
    # Invert {category: [names]} into {name: category}
    return {word: key for key in lexicons.keys() for word in lexicons[key]}


def split_sentences(st):
    sentences = re.split(r'[.?!]\s*', st)
    if sentences[-1]:
        return sentences
    else:
        return sentences[:-1]

def ngrams_finder(lexicons, text):
    lexicons_by_word = lexicon_by_word(lexicons)
    # Sort names longest-first so "banana split" is tried before "banana",
    # escape them, and add word boundaries so e.g. "cola" cannot match
    # inside a longer word
    names = sorted(lexicons_by_word.keys(), key=len, reverse=True)
    pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, names)))
    ngrams = []
    for sentence in split_sentences(text):
        ngram = [[result, lexicons_by_word[result]]
                 for result in pattern.findall(sentence)]
        ngrams.append(ngram)
    return ngrams

# Example usage; customize the lexicons to your needs
text = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"

lexicons = {
    "food":["hot dog",
             "banana",
             "banana split"],

    "beverage":["coke",
                 "cola",
                 "coca cola"],
     }
print(ngrams_finder(lexicons, text))

The sentence-splitting function is taken from here: Splitting a sentence by ending characters
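
For reference, running ngrams_finder on the example text with the longest-first pattern should print the following (my own trace of the sketch above, not output from the original post):

[[['coca cola', 'beverage'], ['hot dog', 'food']],
 [['coke', 'beverage'], ['banana', 'food'], ['banana split', 'food']]]

As in Approach 2, "bana split" is not matched because of the typo, and only the longest lexicon name wins at each position.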
