Finding the sentence that best matches each array of tokens

Posted on 2024-10-01 02:21:44


I have the following DataFrame that I'm using for text mining:

import pandas as pd

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                     "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                     "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                     "Without EETS editions, study of medieval English texts would hardly be possible."]})



text
0   Anyone who reads Old and Middle English litera...
1   Most of the works attributed to King Alfred or...
2   all of the surviving medieval drama, most of t...
3   Without EETS editions, study of medieval Engli...

And I have this list of tokens:

tokens = [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]

I am trying to find, for each array of tokens in the list above, the sentence that fits it best.

Update: I was asked to explain my problem in more detail.

The problem is that I'm actually doing this on non-English text, so it's quite difficult for me to illustrate the problem in more detail here.

I'm looking for a function x that takes each element of the token list as input and, for each element, searches df.text for the best-fitting sentence (best in the sense of some similarity metric). That's the main idea; the exact output format doesn't matter. I just want it to work :)
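For illustration, here is a minimal sketch of such a function (not from the accepted answer below, and the name best_sentence_per_token_list is made up). Since the tokens are stems ('mediev', 'hocclev') while the sentences are unstemmed, it assumes character n-gram TF-IDF so that a stem can still overlap with the full word:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_sentence_per_token_list(texts, token_lists):
    """Return, for every token list, the text with the highest
    TF-IDF cosine similarity to the joined tokens."""
    # Character n-grams let stemmed tokens ('mediev') still share
    # features with the unstemmed words ('medieval') in the sentences
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
    text_vecs = vectorizer.fit_transform(texts)
    token_vecs = vectorizer.transform([' '.join(t) for t in token_lists])
    sims = cosine_similarity(token_vecs, text_vecs)
    return [texts[i] for i in sims.argmax(axis=1)]

matches = best_sentence_per_token_list(list(df['text']), tokens)
for toks, sent in zip(tokens, matches):
    print(toks, '->', sent[:50] + '...')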


1 Answer

#1 · Posted on 2024-10-01 02:21:44

As I said before, the text here is just an illustration of my problem. I'm actually solving a clustering problem, using LDA and the K-means algorithm. To find the sentences that best fit my token lists, I used the K-means distance parameter.

import logging

import pandas as pd
import lda
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                         "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                         "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                         "Without EETS editions, study of medieval English texts would hardly be possible."],
                  'tokens':[['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]})

# Join each token list into a single string so the vectorizers can
# treat it as one document
df['tokens'] = df.tokens.str.join(',')

# TF-IDF vectors of the token strings (input for K-means)
vectorizer = TfidfVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
vz = vectorizer.fit_transform(df['tokens'])

# Raw term counts (input for LDA)
logging.getLogger("lda").setLevel(logging.WARNING)
cvectorizer = CountVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
cvz = cvectorizer.fit_transform(df['tokens'])

# Topic model: document-topic distributions for each token string
n_topics = 4
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

# Cluster the TF-IDF vectors ...
num_clusters = 4
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
                               init_size=1000, batch_size=1000, verbose=False,
                               max_iter=1000)
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

# ... and cluster the LDA topic vectors as well
X_all = X_topics
kmeans1 = kmeans_model.fit(X_all)
kmeans_clusters1 = kmeans1.predict(X_all)
kmeans_distances1 = kmeans1.transform(X_all)

# For one target cluster, keep the document whose distance to the
# cluster centre is smallest
num = 3                        # cluster to inspect
d = dict()
best_distance = float('inf')   # running minimum distance in that cluster

for i, desc in enumerate(df.text):
    if kmeans_clusters1[i] == num:
        dist = kmeans_distances1[i][kmeans_clusters1[i]]
        # Record the text only when it improves on the current minimum,
        # so the stored distance always belongs to the stored sentence
        if dist < best_distance:
            best_distance = dist
            d['Cluster' + str(num)] = ("distance:  " + str(best_distance)
                                       + "   " + df.iloc[i]['text'])
        print("Cluster " + str(kmeans_clusters1[i]) + ": " + desc +
              " (distance: " + str(dist) + ")")
        print(' -')

print("Cluster " + str(num) + "   " + str(d.get('Cluster' + str(num))))

So, within a given cluster, the document with the smallest distance to the cluster centre is the best fit for those tokens.
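As a side note, once kmeans_clusters1 and kmeans_distances1 have been computed, the per-cluster minimum can also be read off for all clusters at once with plain NumPy indexing instead of a manual loop (a sketch building on the variables above, not part of the original answer):

import numpy as np

# Distance of every document to the centre of its own cluster
own_dist = kmeans_distances1[np.arange(kmeans_distances1.shape[0]),
                             kmeans_clusters1]

for c in range(num_clusters):
    members = np.where(kmeans_clusters1 == c)[0]
    if members.size > 0:
        best = members[own_dist[members].argmin()]
        print("Cluster %d (distance: %.4f): %s"
              % (c, own_dist[best], df.iloc[best]['text']))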
