<p>I'm testing Word2Vec to find words with the same meaning, and so far it has worked great: the list of positive words is accurate. However, I'd like to know where each positive word was found, i.e. in which document.</p>
<p>I tried iterating over every document and comparing every word against the list of positive words, like this:</p>
<pre><code>for i in documents: # iterating the documents
    for j in i: # iterating the words in the document
        for k in similar_words: # iterating the positive words
            if k[0] in j: # k[0] is the positive word, k[1] is the positive value
                print('found word')
</code></pre>
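<p>As a sketch, the loop above could be turned into a small helper that also records the index of the document each match came from. The helper name <code>find_word_documents</code> is hypothetical; it assumes <code>documents</code> is a list of token lists and <code>similar_words</code> is a list of <code>(word, similarity)</code> tuples, as <code>most_similar()</code> returns:</p>
<pre><code># Hypothetical helper: same substring test as the loop above,
# but it keeps the document index of every match.
def find_word_documents(documents, similar_words):
    """Return (doc_index, token, positive_word) for every substring match."""
    matches = []
    for doc_idx, doc in enumerate(documents):
        for token in doc:
            for positive_word, similarity in similar_words:
                if positive_word in token:  # substring test, as in the question
                    matches.append((doc_idx, token, positive_word))
    return matches
</code></pre>
<p>Each entry in the result then carries the document number directly.</p>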
<p>This works fine. However, the positive words are actually stemmed, which is why I use 'in'. So suppose the stemmed positive word is 'ice': many words contain 'ice' as a substring, possibly more than one per document, and only one of them is the real positive word.</p>
<p>Is there a way to avoid stemming when using Word2Vec? Or is there a way to find the document number of every positive word that is found?</p>
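<p>One way to sidestep the substring problem described above, if the vocabulary turns out not to be stemmed after all, is to compare whole tokens instead of using <code>in</code>. A minimal sketch (the helper name <code>exact_matches</code> is hypothetical, with the same assumed inputs as before):</p>
<pre><code># Hypothetical helper: whole-token comparison instead of substring matching,
# so 'ice' no longer matches 'nice' or 'police'.
def exact_matches(documents, similar_words):
    positives = {word for word, similarity in similar_words}
    return [(doc_idx, token)
            for doc_idx, doc in enumerate(documents)
            for token in doc
            if token in positives]
</code></pre>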
<p><strong>Update</strong></p>
<p>Here is the code I use to train the model and call <code>most_similar()</code>:</p>
<pre><code>import re

import gensim
import pandas as pd
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def remove_stopwords(texts):
    # Removes stopwords in a text
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def sent_to_words(sentences):
    # Tokenize each sentence into a list of words and remove unwanted characters
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

df = pd.read_excel('my_file.xlsx')
df.columns = map(str.lower, df.columns)
data = df['comment section'].values.tolist()

# Remove the new line character and single quotes
data = [re.sub(r'\s+', ' ', str(sent)) for sent in data]
data = [re.sub("\'", "", str(sent)) for sent in data]

# Convert our data to a list of words. Now, data_words is a 2D array,
# each index contains a list of words
data_words = list(sent_to_words(data))

# Remove the stop words
data_words_nostops = remove_stopwords(data_words)

model = gensim.models.Word2Vec(
    data_words_nostops,
    alpha=0.1,
    min_alpha=0.001,
    size=250,
    window=1,
    min_count=2,
    workers=10)

model.train(data_words_nostops, total_examples=len(data_words_nostops), epochs=10)

print(model.wv.vocab)  # At this step, the words are not stemmed

positive = ['injuries', 'fail', 'dangerous', 'oil']
negative = ['train', 'westward', 'goods', 'calgary', 'car', 'automobile', 'appliance']

risks = []
similar_words_size = array_length(model.wv.most_similar(positive=positive, negative=negative, topn=0))
for i in model.wv.most_similar(positive=positive, negative=negative, topn=similar_words_size):
    if len(i[0]) > 2:
        risks.append(i)

print(risks)  # At this step, the words are stemmed
</code></pre>