<p>I'm testing Word2Vec to find words with the same meaning, and so far it has worked great: the list of positive words is accurate. However, I'd like to know where each positive word was found, i.e. in which document.</p>
<p>I tried iterating over every document and comparing every word against the list of positive words, like this:</p>
<pre><code>for i in documents: # iterating the documents
    for j in i: # iterating the words in the document
        for k in similar_words: # iterating the positive words
            if k[0] in j: # k[0] is the positive word, k[1] is the positive value
                print('found word')
</code></pre>
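<p>As a sketch, the loop above could be turned into a small helper that also records the index of the document each match came from. The helper name <code>find_word_documents</code> is hypothetical; it assumes <code>documents</code> is a list of token lists and <code>similar_words</code> is a list of <code>(word, similarity)</code> tuples, as <code>most_similar()</code> returns:</p>
<pre><code># Hypothetical helper: same substring test as the loop above,
# but it keeps the document index of every match.
def find_word_documents(documents, similar_words):
    """Return (doc_index, token, positive_word) for every substring match."""
    matches = []
    for doc_idx, doc in enumerate(documents):
        for token in doc:
            for positive_word, similarity in similar_words:
                if positive_word in token:  # substring test, as in the question
                    matches.append((doc_idx, token, positive_word))
    return matches
</code></pre>
<p>Each entry in the result then carries the document number directly.</p>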
<p>This works fine. However, the positive words are actually stemmed, which is why I use 'in'. So suppose the stemmed positive word is 'ice': many words contain 'ice' as a substring, possibly more than one per document, and only one of them is the real positive word.</p>
<p>Is there a way to avoid stemming when using Word2Vec? Or is there a way to find the document number of every positive word that is found?</p>
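<p>One way to sidestep the substring problem described above, if the vocabulary turns out not to be stemmed after all, is to compare whole tokens instead of using <code>in</code>. A minimal sketch (the helper name <code>exact_matches</code> is hypothetical, with the same assumed inputs as before):</p>
<pre><code># Hypothetical helper: whole-token comparison instead of substring matching,
# so 'ice' no longer matches 'nice' or 'police'.
def exact_matches(documents, similar_words):
    positives = {word for word, similarity in similar_words}
    return [(doc_idx, token)
            for doc_idx, doc in enumerate(documents)
            for token in doc
            if token in positives]
</code></pre>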
<p><strong>Update</strong></p>
<p>Here is the code I use to train the model and call <code>most_similar()</code>:</p>
<pre><code>import re

import gensim
import pandas as pd
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def remove_stopwords(texts):
    # Removes stopwords in a text
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def sent_to_words(sentences):
    # Tokenize each sentence into a list of words and remove unwanted characters
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

df = pd.read_excel('my_file.xlsx')
df.columns = map(str.lower, df.columns)
data = df['comment section'].values.tolist()

# Remove the new line character and single quotes
data = [re.sub(r'\s+', ' ', str(sent)) for sent in data]
data = [re.sub("\'", "", str(sent)) for sent in data]

# Convert our data to a list of words. Now, data_words is a 2D array,
# each index contains a list of words
data_words = list(sent_to_words(data))

# Remove the stop words
data_words_nostops = remove_stopwords(data_words)

model = gensim.models.Word2Vec(
    data_words_nostops,
    alpha=0.1,
    min_alpha=0.001,
    size=250,
    window=1,
    min_count=2,
    workers=10)

model.train(data_words_nostops, total_examples=len(data_words_nostops), epochs=10)

print(model.wv.vocab)  # At this step, the words are not stemmed

positive = ['injuries', 'fail', 'dangerous', 'oil']
negative = ['train', 'westward', 'goods', 'calgary', 'car', 'automobile', 'appliance']

risks = []
similar_words_size = array_length(model.wv.most_similar(positive=positive, negative=negative, topn=0))
for i in model.wv.most_similar(positive=positive, negative=negative, topn=similar_words_size):
    if len(i[0]) > 2:
        risks.append(i)

print(risks)  # At this step, the words are stemmed
</code></pre>