Computing a similarity metric between keywords and every word in a text file

Posted 2024-07-04 07:48:29


I have two .txt files: one contains 200,000 words and the other contains 100 keywords (one per line). I want to compute the cosine similarity between each of the 100 keywords and each of the 200,000 words, and display the 50 highest-scoring words for each keyword.

Here is what I have done. Note that BertClient is what I use to extract the vectors:

from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

# Read the 200,000 words into a list
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# For each keyword, encode it and compare it against every word
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword.strip()])  # strip the trailing newline
        for w in words:
            vector_word = bc.encode([w])  # re-encodes every word on every pass
            cosine_lib = cosine_similarity(vector_key, vector_word)
            print(cosine_lib)

It just keeps running and never stops. Any idea how I can fix this?


1 Answer

#1 · Posted 2024-07-04 07:48:29

I know nothing about BERT... but the import and startup look suspicious. I don't think you have it set up correctly. I tried pip-installing it and running this:

from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print('done importing')

It never finished. Take a look at the docs for bert to see if something else needs to be done first (for instance, BertClient may need a running bert-serving server to connect to).

In your code, it is generally better to do all of your reading first and then the processing, so read both files into lists up front and spot-check a few values, like:

# check first five
print(words[:5])

Also, you need to find a different way to do the comparison than that nested loop. As you realize now, you are converting every word in words over and over again, once per keyword, which isn't necessary and is probably very slow. I'd suggest either using a dictionary to pair each word with its encoding, or making a list of (word, encoding) tuples if you're more comfortable with that; a sketch of the dictionary version follows this paragraph.
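
Something like this, as a minimal sketch (it assumes a bert-serving server is already running, and that bc.encode takes a list of strings and returns one vector per string, as in your snippet):

from bert_serving.client import BertClient

bc = BertClient()  # assumes the bert-serving server is already up

with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Encode each distinct word exactly once, in a single batched call,
# and pair every word with its vector for cheap reuse later
unique_words = list(dict.fromkeys(words))   # de-duplicate, keep order
word_vecs = bc.encode(unique_words)         # one call instead of 200,000
encoded_words = dict(zip(unique_words, word_vecs))

After that, looking up encoded_words[w] is essentially free, no matter how many keywords you compare against.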

Get back to me after you have BERT up and running if this doesn't make sense.

EDIT

Here is some code that works in a similar way to what you want to do. How you want to hold the results etc. is up to you and depends on your needs, but this should get you started with "fake bert":

from operator import itemgetter

# fake bert  ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x-y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word

for word in encoded_words.keys():
    result = []   # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
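
And for the real task, once BERT is up and running, the same shape works with actual vectors. Here is a rough sketch (again assuming a running bert-serving server and that bc.encode returns one vector per input string; the file names are the ones from your question) that batches all of the encoding and prints the 50 highest-scoring words per keyword:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()  # assumes the bert-serving server is already up

with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = [line.strip() for line in keyword_file if line.strip()]

# One batched encode call per file instead of one call per word
word_vecs = bc.encode(words)        # shape: (n_words, dim)
keyword_vecs = bc.encode(keywords)  # shape: (n_keywords, dim)

# All pairwise similarities at once: rows are keywords, columns are words
sims = cosine_similarity(keyword_vecs, word_vecs)

# For each keyword, take the 50 highest-scoring words
for kw, row in zip(keywords, sims):
    top50 = np.argsort(row)[::-1][:50]
    print(kw, [words[i] for i in top50])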
