How do I find the positive words in a set of documents after using Word2Vec?

for i in documents:  # iterating the documents
    for j in i:  # iterating the words in the document
        for k in similar_words:  # iterating the positive words
            if k[0] in j:  # k[0] is the positive word, k[1] is the positive value
                print('found word')

import re

import gensim
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS

# Assumption: the question does not show where stop_words comes from;
# gensim's built-in English stop word list is used here.
stop_words = STOPWORDS

def remove_stopwords(texts):
    # Removes stop words from each already-tokenized document
    return [[word for word in doc if word not in stop_words] for doc in texts]

def sent_to_words(sentences):
    # Tokenize each sentence into a list of words and remove unwanted characters
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

df = pd.read_excel('my_file.xlsx')
df.columns = df.columns.str.lower()
# Column names were lower-cased above, so the lookup must be lower case too
data = df['comment section'].values.tolist()

# Collapse runs of whitespace (including new lines) and remove single quotes
data = [re.sub(r'\s+', ' ', str(sent)) for sent in data]
data = [re.sub("'", "", str(sent)) for sent in data]

# Convert our data to a list of words. Now, data_words is a 2D array,
# each index contains a list of words
data_words = list(sent_to_words(data))

# Remove the stop words
data_words_nostops = remove_stopwords(data_words)

# gensim 4 parameter names; gensim 3 used size= and model.wv.vocab instead
model = gensim.models.Word2Vec(
    data_words_nostops,
    alpha=0.1,
    min_alpha=0.001,
    vector_size=250,
    window=1,
    min_count=2,
    workers=10)

# Note: the constructor above already trains once; this continues training
model.train(data_words_nostops, total_examples=len(data_words_nostops), epochs=10)

print(model.wv.key_to_index)  # At this step, the words are not stemmed

positive = ['injuries', 'fail', 'dangerous', 'oil']
negative = ['train', 'westward', 'goods', 'calgary', 'car', 'automobile', 'appliance']

# Ask for as many neighbours as there are words in the vocabulary
similar_words_size = len(model.wv.key_to_index)

risks = []
for i in model.wv.most_similar(positive=positive, negative=negative, topn=similar_words_size):
    if len(i[0]) > 2:
        risks.append(i)

print(risks)  # At this step, the words are stemmed

2 Answers

User
Answer 1 · edited 2024-09-28 22:25:59

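You can train a word2vec model on unstemmed words. In practice, though, doing so usually lowers the quality of the resulting vectors noticeably.
If you use pretrained vectors, you must use the same stemmer function that was used when they were trained.
You can build a dictionary from similar_words and then match words with stem(word) in similar_words.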
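A minimal sketch of that dictionary lookup follows. The `stem` function here is a toy suffix stripper written only for illustration (an assumption, not the stemmer from the question); in practice you would swap in whatever stemmer produced your vocabulary. The `similar_words` scores and the `documents` are made-up sample data in the `(word, similarity)` shape that `most_similar` returns.

```python
def stem(word):
    """Toy stemmer (assumption): strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# (word, similarity) pairs; the values are invented for this example.
similar_words = [("injuries", 0.91), ("leaks", 0.84), ("dangerous", 0.80)]

# Key the dictionary by stem so each lookup is a single hash probe.
similar_stems = {stem(word): score for word, score in similar_words}

documents = [
    ["the", "pipe", "leak", "was", "dangerous"],
    ["trains", "arrived"],
]

def find_matches(documents, similar_stems):
    """Return (document index, original word, similarity) for every match."""
    matches = []
    for doc_id, doc in enumerate(documents):
        for word in doc:
            stemmed = stem(word)
            if stemmed in similar_stems:
                matches.append((doc_id, word, similar_stems[stemmed]))
    return matches

print(find_matches(documents, similar_stems))
```

Because both sides go through the same `stem` function, "leak" in a document matches "leaks" from the similar-word list, while unrelated words are skipped.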

User
Answer 2 · edited 2024-09-28 22:25:59

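Much published Word2Vec work, including the original papers from Google, does not bother with stemming. If you have a large enough corpus, with many varied examples of each form of a word, then each form gets a pretty good vector (positioned close to the vectors of the other forms), even as raw unstemmed words. (On the other hand, in smaller corpora stemming is more likely to help, because it lets all the different forms of a word contribute their occurrences to a single good vector.)
During training, Word2Vec only watches the training texts go by to find the nearby-word information it needs: it does not remember the contents of individual documents. If you need that information, you have to keep it outside Word2Vec, in your own code.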
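You can iterate over all the documents to find occurrences of individual words, as in your code. (And, as @alexey's answer notes, you should compare stemmed words with stemmed words, rather than just checking for substring containment.)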
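To see why substring containment is unreliable, here is a small sketch with made-up words. The `positive_words` set and the sample `doc` are hypothetical; in real use both sides would be passed through the same stemmer first.

```python
positive_words = {"oil", "fail"}
doc = ["boiling", "water", "did", "not", "fail"]

# Substring test, as in the question's loop: "oil" is found inside "boiling".
substring_hits = [w for w in doc if any(p in w for p in positive_words)]

# Whole-token test: only exact matches count.
exact_hits = [w for w in doc if w in positive_words]

print(substring_hits)  # ['boiling', 'fail']
print(exact_hits)      # ['fail']
```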
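The other option, used in full-text search, is to build a "reverse index" that remembers which documents each word appears in (and possibly where in each document). You then essentially have a dictionary in which you look up "iced" and get back a list of documents such as "doc1, doc17, doc42". (Or possibly a list of documents plus positions, such as "doc2:pos11,pos91; doc17:pos22; doc42:pos77".) It takes more work up front, plus storage for the reverse index (which, depending on the level of detail retained, can be nearly as large as the original texts), but finding the documents that contain a word is then much faster than a full iterative search for every word.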
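A reverse index with positions can be sketched in a few lines of standard-library Python. The function names `build_reverse_index` and `lookup` and the sample documents are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def build_reverse_index(documents):
    """Map each word to {document index: [positions of the word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in enumerate(documents):
        for pos, word in enumerate(words):
            index[word][doc_id].append(pos)
    # Convert to plain dicts so lookups of unknown words do not add entries.
    return {word: dict(postings) for word, postings in index.items()}

def lookup(index, word):
    """Return the postings for a word, or an empty dict if it never occurs."""
    return index.get(word, {})

documents = [
    ["iced", "tracks", "caused", "delays"],
    ["no", "problems", "reported"],
    ["tracks", "were", "iced", "and", "iced", "again"],
]

index = build_reverse_index(documents)
print(lookup(index, "iced"))    # {0: [0], 2: [2, 4]}
print(lookup(index, "oil"))     # {}
```

The index is built once in a single pass over the corpus; after that, each lookup is a dictionary access instead of a scan of every document. If the similar words are stemmed, the index keys should be stemmed the same way.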
