Multilingual BERT sentence vectors capture the language used more than the meaning: is this working as intended?



While playing around with BERT, I downloaded the Huggingface multilingual BERT model, fed it three sentences, and saved their sentence vectors (the embedding of <code>[CLS]</code>). I then translated the sentences with Google Translate, passed the translations through the model, and saved their sentence vectors as well.

I then compared the results using cosine similarity (reported here as cosine distance). I was surprised to find that each sentence vector was quite far from the vector of its translation (0.15-0.27 cosine distance), while different sentences in the same language were very close to each other (0.02-0.04 cosine distance). So instead of sentences with similar meaning (but in different languages) being grouped together in the 768-dimensional space, different sentences in the same language end up closer to each other.

As far as I understand, the whole point of multilingual BERT is cross-lingual transfer learning: for example, training a model (say, a fully connected network) on the representations of one language and then using that model on other languages.

How can that work if sentences (in different languages) with exactly the same meaning are mapped further apart than different sentences in the same language?

My code:

<pre><code>import torch
from scipy import spatial
from transformers import AutoModel, AutoTokenizer

bert_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
MBERT = AutoModel.from_pretrained(bert_name)


def embed(sentence):
    """Return the pooled [CLS] vector (model output index 1) as a 1-D NumPy array."""
    ids = tokenizer.encode(sentence, add_special_tokens=True)
    with torch.no_grad():
        out = MBERT(torch.tensor([ids]))
    # out[1] is the pooler output: the final [CLS] hidden state after a linear layer and tanh.
    # Taking row 0 gives the 1-D vector that scipy's cosine() expects.
    return out[1][0].numpy()


# Some silly sentences
ans_eng1 = embed('A cat jumped from the trees and startled the tourists')
ans_eng2 = embed('A small snake whispered secrets to large cats')
ans_eng3 = embed('A tiger sprinted from the bushes and frightened the guests')

# Translated to Hebrew with Google Translate
ans_heb1 = embed('חתול קפץ מהעץ והבהיל את התיירים')
ans_heb2 = embed('נחש קטן לחש סודות לחתולים גדולים')
ans_heb3 = embed('נמר רץ מהשיחים והפחיד את האורחים')

# Compare sentence embeddings
print('Eng1-Heb1 - Translated sentences', spatial.distance.cosine(ans_eng1, ans_heb1))
print('Eng2-Heb2 - Translated sentences', spatial.distance.cosine(ans_eng2, ans_heb2))
print('Eng3-Heb3 - Translated sentences', spatial.distance.cosine(ans_eng3, ans_heb3))
print("\n---\n")
print('Heb1-Heb2 - Different sentences', spatial.distance.cosine(ans_heb1, ans_heb2))
# Note: in the run that produced the output below, this line compared ans_eng1
# with ans_eng2, so the reported Heb1-Heb3 value duplicates Eng1-Eng2.
print('Heb1-Heb3 - Similar sentences', spatial.distance.cosine(ans_heb1, ans_heb3))
print("\n---\n")
print('Eng1-Eng2 - Different sentences', spatial.distance.cosine(ans_eng1, ans_eng2))
print('Eng1-Eng3 - Similar sentences', spatial.distance.cosine(ans_eng1, ans_eng3))

# Output:
"""
Eng1-Heb1 - Translated sentences 0.2074061632156372
Eng2-Heb2 - Translated sentences 0.15557605028152466
Eng3-Heb3 - Translated sentences 0.275478720664978

---

Heb1-Heb2 - Different sentences 0.044616520404815674
Heb1-Heb3 - Similar sentences 0.027982771396636963

---

Eng1-Eng2 - Different sentences 0.027982771396636963
Eng1-Eng3 - Similar sentences 0.024596810340881348
"""
</code></pre>

P.S. At least Heb1 is closer to Heb3 than to Heb2. The same holds for the English equivalents, but to a lesser degree.
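For readers who want to check the distance figures by hand, here is a minimal sketch of the cosine distance used in the question, written with NumPy only. The helper names `cosine_distance` and `pairwise_cosine_distance` and the toy vectors are hypothetical and are not part of the original post; the formula is the usual 1 minus cosine similarity, which is what `scipy.spatial.distance.cosine` computes for 1-D inputs.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); inputs are flattened to 1-D first."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def pairwise_cosine_distance(vectors):
    """Matrix of cosine distances between all rows of `vectors`."""
    x = np.asarray(vectors, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # normalize each row
    return 1.0 - unit @ unit.T


# Hypothetical toy vectors standing in for sentence embeddings
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
dist = pairwise_cosine_distance(toy)
print(dist)
```

Stacking the six sentence vectors into one array and calling `pairwise_cosine_distance` on it would give every English-Hebrew and same-language comparison in a single matrix, which makes it harder to mix up a pair than repeating individual calls.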


1 answer

Anonymous, 1 day ago
