Keyword extraction
Detailed description of the Python project comparativeExtraction
Introduction
This module helps you extract key terms and topics from a corpus using a comparative method.
Installation
pip install --upgrade comparativeExtraction
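Usage
Import the package
Load sample data
import pandas as pd
import numpy as np
# Path to the CSV file of reviews on the author's machine
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
# True for reviews rated 3 stars or fewer
label = [x <= 3 for x in data['stars']]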
data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')
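Here we use online Amazon reviews of the Nintendo Switch to illustrate how the module is used.
The module takes a corpus and a set of binary labels as input. The labels should be created according to the question we want to answer, and there must be exactly one label per document in the corpus.
Suppose we want to know why people dislike this product and find the keywords related to that. To answer this question, we create the label as a binary variable indicating whether the reviewer gave 3 stars or fewer.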
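The labelling step can be tried on a few rows before running it on the full data set. This is a minimal sketch with a hypothetical toy DataFrame (`toy`, its contents, and `toy_label` are illustrative and not part of the package or the original data); it only shows the one-label-per-document requirement.

```python
import pandas as pd

# Hypothetical miniature stand-in for the review data (not the real CSV)
toy = pd.DataFrame({
    "stars": [5, 3, 1, 4, 2],
    "reviews": ["love it", "it is ok", "does not work", "great fun", "wifi is bad"],
})

# True marks a review with 3 stars or fewer, the group we want to study
toy_label = [s <= 3 for s in toy["stars"]]

# The module expects exactly one label per document
if len(toy_label) != len(toy["reviews"]):
    raise ValueError("labels and corpus must have the same length")
```

A different question only needs a different rule, for example `s == 5` to study what the most satisfied reviewers mention.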
Initialize the module with the review corpus and labels
kw_init = comparative_keyword_extraction(corpus=data['reviews'], labels=label)
Extract keywords
kw = kw_init.get_distinguishing_terms(ngram_range=(1, 3), top_n=10)
# Keywords mentioned significantly more often in reviews with 3 stars or fewer
kw.incre_df
# Keywords mentioned significantly less often in reviews with 3 stars or fewer
kw.decline_df
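The statistical test behind `incre_df` and `decline_df` is not described here, so the following is only a rough illustration of the comparative idea, not the package's implementation. It compares the share of documents containing each term between the labelled and unlabelled groups, with no significance testing. The function names and toy data are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Count, for each token, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def df_rate_difference(corpus, labels):
    """Share of labelled docs containing a term minus the share of unlabelled docs."""
    pos = [d for d, flag in zip(corpus, labels) if flag]
    neg = [d for d, flag in zip(corpus, labels) if not flag]
    if not pos or not neg:
        raise ValueError("both label groups must contain at least one document")
    pos_df, neg_df = doc_freq(pos), doc_freq(neg)
    terms = set(pos_df) | set(neg_df)
    return {t: pos_df[t] / len(pos) - neg_df[t] / len(neg) for t in terms}


toy_corpus = ["wifi does not work", "great console", "does not work at all", "great fun"]
toy_labels = [True, False, True, False]

diff = df_rate_difference(toy_corpus, toy_labels)
more_in_labelled = sorted(t for t, d in diff.items() if d > 0)
less_in_labelled = sorted(t for t, d in diff.items() if d < 0)
```

Terms with a positive difference play the role of the "mentioned more" list and terms with a negative difference the "mentioned less" list; a real analysis would add a significance test on top of the raw rates.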
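If we need more context for a word, or want topics that are easier to interpret, we can:
- output the reviews that contain the term
- change the ngram_range
- use the supplementary functions module
Output reviews
Suppose we want to know more about the significant term "work". We can directly output all reviews that contain it.
The output object "kw" holds a one-hot encoded document-term matrix covering every term found in the corpus. We can use it to find the reviews that correspond to each term.
# The binary_dtm provides a convenient way to extract reviews with specific terms
print(kw.binary_dtm[['work', 'not']])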
      work  not
0        0    0
1        0    0
2        0    0
3        0    0
4        0    0
...    ...  ...
4995     1    0
4996     0    1
4997     0    0
4998     0    0
4999     0    0

[5000 rows x 2 columns]
# Keep only the reviews whose row in the binary matrix has a 1 for "work"
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]
len(reviews_contain_term_work)
557
# Print one randomly sampled review that contains "work"
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)
It's alright, only got it to give Nintendo another chance. It's a neat concept. Overall, it's aggressively mediocre, good for casual stuff, but will never get as much use as my ps4.Wi-Fi is god awful though. The worst I've dealt with. It's connection capabilities are atrocious compared with any other wireless device. Don't expect it to just work. Honestly, this singular problem is enough for me to rate it 1 star. I suppose they had to cut corners somewhere.
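The one-hot encoded document-term matrix described above can be pictured with a small hand-built example. This is a sketch under stated assumptions: it uses whitespace tokenization and hypothetical toy reviews, and the way the package actually tokenizes and builds `binary_dtm` may differ.

```python
import pandas as pd

# Hypothetical toy reviews (not taken from the real data)
toy_reviews = pd.Series([
    "does not work",
    "works great",
    "wifi will not connect",
    "games work fine",
])

# Vocabulary from simple whitespace tokenization (an assumption)
vocab = sorted({tok for r in toy_reviews for tok in r.split()})

# One row per review, one column per term, 1 if the term occurs at least once
toy_dtm = pd.DataFrame(
    [[int(term in r.split()) for term in vocab] for r in toy_reviews],
    columns=vocab,
)

# A boolean mask on one column selects the matching reviews
contains_work = toy_reviews[toy_dtm["work"] == 1]
```

Without stemming, "works" and "work" are separate columns, so the second toy review is not selected.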
Change the n-gram range to exclude unigrams
kw = kw_init.get_distinguishing_terms(ngram_range=(2, 4), top_n=10)
kw.incre_df
kw.decline_df
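To see why a range starting at 2 drops single words, the sketch below enumerates the n-grams of one sentence. It assumes `ngram_range` is an inclusive (min, max) pair, as in scikit-learn's convention, which the package description does not state explicitly. `ngrams_in_range` is a hypothetical helper.

```python
def ngrams_in_range(text, ngram_range):
    """List all word n-grams of `text` with n between min and max, inclusive."""
    tokens = text.lower().split()
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


with_unigrams = ngrams_in_range("do not work", (1, 3))
without_unigrams = ngrams_in_range("do not work", (2, 4))
```

With (2, 4) the bare word "work" is no longer a candidate, while phrases such as "not work" remain, which usually makes the extracted keywords easier to interpret.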
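Use the supplementary functions
Sometimes, when we want to dig deeper into a specific term, we can use the built-in supplementary function to find related n-grams that contain the term.
from comparativeExtraction.supplement_funcs import get_ngrams_on_term
target_term = "work"
# Reviews that contain the target term, selected through the binary matrix
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]
related_ngrams = get_ngrams_on_term(target_term, reviews_contain_term_work, filter_by_extreme=False)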
related_ngrams.related_ngrams.head()
Here, the count is also a document frequency: the number of reviews that contain the n-gram.
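The difference between a document frequency and a raw occurrence count can be checked with a small sketch. This is not the implementation of `get_ngrams_on_term`; the helper names, the whitespace tokenization, and the default range are assumptions, and `filter_by_extreme` is not modelled.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_containing(term, docs, ngram_range=(2, 3)):
    """Document frequency of every n-gram that includes `term` as a token."""
    low, high = ngram_range
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()  # each n-gram counts at most once per document
        for n in range(low, high + 1):
            seen.update(g for g in ngrams(tokens, n) if term in g.split())
        df.update(seen)
    return df


toy_docs = ["does not work does not work", "does not work", "work great"]
toy_df = ngrams_containing("work", toy_docs)
```

"not work" occurs three times in total across the toy documents, but its document frequency is 2 because it appears in only two of them.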