ConceptNet Numberbatch (multilingual): OOV words

Posted 2024-09-27 00:22:18


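I am working on a text classification problem (on a French corpus) and experimenting with different word embeddings. I was very interested in what ConceptNet has to offer, so I decided to give it a try.

I could not find a tutorial dedicated to my specific task, so I followed the advice from their blog: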

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)

Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.

Below is my approach (note that 'numberbatch.txt' is the file containing the recommended multilingual version, ConceptNet Numberbatch 19.08):

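from numpy import asarray

embeddings_index = dict()

# Read the file as UTF-8 so that accented characters are decoded correctly
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        # Skip a leading "rows dimensions" header line, if the file has one
        if len(values) <= 2:
            continue
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))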

I first test whether a word exists:

word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)

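I was surprised to find that a simple word like "fille" (French for "girl") was not found. I then wrote a function that prints every OOV word in my corpus, and the results surprised me even more: over 22k words were not found (including words such as "we", "future", etc.).

I also tried the approach proposed on the GitHub page for OOV words (same result):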

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:

Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).

Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.

If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
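For reference, the three steps summarized above can be sketched roughly as follows. This is only an illustration, not the implementation from the ConceptNet code base: the function name `oov_vector` is hypothetical, and it assumes `vectors` and `english_vectors` are plain dictionaries mapping bare words (not full keys from the file) to sequences of floats.

```python
def oov_vector(word, vectors, english_vectors=None):
    """Return a vector for `word`, or None if every fallback fails.

    `vectors` and `english_vectors` are hypothetical dicts that map
    bare words to equal-length sequences of floats.
    """
    if word in vectors:
        return vectors[word]

    # Step 1: try the identically spelled English word.
    if english_vectors is not None and word in english_vectors:
        return english_vectors[word]

    # Steps 2-3: remove letters from the end until the remaining prefix
    # starts at least one known word; give up at a single character.
    prefix = word[:-1]
    while len(prefix) > 1:
        matches = [vec for known, vec in vectors.items()
                   if known.startswith(prefix)]
        if matches:
            # Component-wise average of all matching vectors (a list).
            return [sum(column) / len(matches) for column in zip(*matches)]
        prefix = prefix[:-1]
    return None
```

The linear scan over `vectors` is slow for a large vocabulary; a sorted key list or a trie would be a more practical choice, but the scan keeps the sketch short.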

Is there something wrong with my approach?


1 answer
User
#1 · Posted 2024-09-27 00:22:18

Have you taken the format of ConceptNet Numberbatch into account? As shown on the project's GitHub, it looks like this:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...

/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...

This format means that fille will not be found, but /c/fr/fille will.
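A minimal sketch of a lookup that accounts for this format, assuming `embeddings_index` was loaded as in the question, with the full `/c/<lang>/<term>` strings as keys. The helper names `concept_uri`, `lookup`, and `missing_words` are hypothetical, and the normalization here (lowercasing and replacing spaces with underscores) is a simplification; the normalization ConceptNet itself applies may do more.

```python
def concept_uri(word, lang):
    """Build a Numberbatch-style key such as '/c/fr/fille' (simplified)."""
    return '/c/{}/{}'.format(lang, word.lower().replace(' ', '_'))


def lookup(word, embeddings_index, lang='fr'):
    """Return the vector stored under the word's key, or None if absent."""
    return embeddings_index.get(concept_uri(word, lang))


def missing_words(words, embeddings_index, lang='fr'):
    """List the words whose key is not present in the index."""
    return [w for w in words if concept_uri(w, lang) not in embeddings_index]
```

With these helpers, the membership test from the question becomes `concept_uri('fille', 'fr') in embeddings_index`, and `missing_words(corpus_words, embeddings_index)` gives an OOV list that can be compared against the earlier count.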
