I have the following DataFrame for text mining:
import pandas as pd
df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
"Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
"all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
"Without EETS editions, study of medieval English texts would hardly be possible."]})
                                                text
0  Anyone who reads Old and Middle English litera...
1  Most of the works attributed to King Alfred or...
2  all of the surviving medieval drama, most of t...
3  Without EETS editions, study of medieval Engli...
I also have a list of token groups:
tokens = [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]
For each array of tokens in the list above, I am trying to find the best-matching sentence.
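Update: I was asked to explain my problem in more detail.
The difficulty is that I am working on non-English text, so it is hard to illustrate the problem any further.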
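I am looking for a function x that takes each element of the tokens list as input and, for each element, searches df.text for the best-matching sentence (possibly in the sense of some metric). That is the main idea; the output format does not matter. I just want it to work :)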
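One possible shape for such a function, as a minimal sketch rather than a definitive answer. It assumes the tokens are stems, so a token counts as matched when each of its space-separated stems is a prefix of some word in the sentence, and the score is simply the number of matched tokens. The names `words`, `token_matches` and `best_sentence` are made up for this illustration, and the metric itself is an assumption; any other similarity measure could be swapped in.

```python
import re

def words(sentence):
    # Lowercase and split into words, keeping hyphenated words such as "mid-brown" whole.
    return re.findall(r"[\w-]+", sentence.lower())

def token_matches(token, sentence_words):
    # A token such as 'middl engl' matches when every stem in it
    # is a prefix of at least one word of the sentence.
    return all(
        any(word.startswith(stem) for word in sentence_words)
        for stem in token.lower().split()
    )

def best_sentence(token_group, sentences):
    # Score each sentence by how many tokens of the group it matches,
    # then return (index of the best sentence, its score).
    # Ties go to the earliest sentence; an empty `sentences` raises ValueError.
    scores = [
        sum(token_matches(token, words(sentence)) for token in token_group)
        for sentence in sentences
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

With the data from the question it could be called as `[best_sentence(group, df["text"].tolist()) for group in tokens]`, which yields one `(row index, score)` pair per token group.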
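As I said before, this text is only an illustration of my problem. I am solving a clustering problem, and I use LDA and the K-means algorithm for it. To find the sentence that best fits my token list, I use the K-means distance.
So the token with the smallest distance within a given cluster is the most suitable one.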
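The "smallest distance within a cluster" rule can be sketched as follows. This is only an illustration under assumptions: items are already vectorized, each has a cluster label, and the centroids are known. The function name `closest_to_centroid` and the toy 2-D numbers are invented for the example and are not taken from the question. With scikit-learn, the distances to every cluster center are available from `KMeans.transform`, so the same rule applies there.

```python
import math

def closest_to_centroid(vectors, labels, centroids):
    # For each cluster, keep the index of the member whose Euclidean
    # distance to that cluster's centroid is smallest.
    best = {}
    for index, (vector, label) in enumerate(zip(vectors, labels)):
        distance = math.dist(vector, centroids[label])  # Python 3.8+
        if label not in best or distance < best[label][1]:
            best[label] = (index, distance)
    return {label: index for label, (index, _) in best.items()}

# Toy data, made up purely for illustration.
vectors = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 0.9)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 0.0), (1.0, 1.0)]
representatives = closest_to_centroid(vectors, labels, centroids)
```

Here `representatives` maps each cluster id to the index of its most central member.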