Classifying sentences with controlled vocabularies in Python - Q&A - Python中文网


Posted 2024-05-17 00:36:07


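I have several medical vocabularies (such as drugs, symptoms, signs, and diseases) and some free-text diagnostic reports. I would like to use TF-IDF or machine learning techniques to first split the free text into sentences and then classify the important sentences into categories. For example, "patients need to take aspirin" would be classified as "drug usage", because "aspirin" can be found in the drug vocabulary. Could you recommend some algorithms? Thanks :)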


1 answer

User

#1 · Posted 2024-05-17 00:36:07

I would suggest using CountVectorizer, since you already have lists of keywords. CountVectorizer has a vocabulary parameter, so you can restrict it to your own keyword list. It will then look only for those keywords in each document and build a feature vector from them. Let's look at an example:

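from sklearn.feature_extraction.text import CountVectorizer

# Fixed keyword list used as the vocabulary
keywords = ["aspirin", "medication", "patients"]
sen1 = "patients need to take aspirin"
sen2 = "medication required immediately"

# With a fixed vocabulary, transform() can be called without fit()
vectorizer = CountVectorizer(vocabulary=keywords)
corpus = [sen1, sen2]
X = vectorizer.transform(corpus)  # sparse matrix, one row per sentence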

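After this, printing the vectorizer's feature names with print(vectorizer.get_feature_names_out()) shows ['aspirin' 'medication' 'patients']. (Older scikit-learn versions used get_feature_names(), which was removed in version 1.2.)
Printing the vectors for each sentence with print(X.toarray()) gives the following matrix:
[[1 0 1]
 [0 1 0]]
Each row is a sentence and each column is a keyword, in the order of the vocabulary. The entries are occurrence counts, which here are all 1 (present) or 0 (absent); pass binary=True to CountVectorizer if you want strict presence/absence values.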
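To connect this back to the question, the same keyword-counting idea can be extended to several vocabularies, one per category. The following is only a minimal sketch in plain Python, not the CountVectorizer approach itself: the vocabularies, the category names, the helper functions split_sentences and classify, and the sample report are all hypothetical and chosen for illustration. It assumes single-word terms and a naive sentence splitter.

```python
import re
from collections import Counter

# Hypothetical vocabularies, for illustration only; replace with real term lists.
VOCABULARIES = {
    "drug usage": {"aspirin", "ibuprofen", "medication"},
    "symptom": {"headache", "fever", "nausea"},
}

def split_sentences(text):
    """Naively split free text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def classify(sentence):
    """Return the category whose vocabulary has the most hits, or None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = Counter()
    for category, vocab in VOCABULARIES.items():
        hits[category] = sum(1 for tok in tokens if tok in vocab)
    # On a tie, max() keeps the category listed first in VOCABULARIES.
    category, count = max(hits.items(), key=lambda kv: kv[1])
    return category if count > 0 else None

# Made-up report text used only to exercise the functions above.
report = (
    "Patients need to take aspirin. "
    "She reports fever and nausea. "
    "Follow up in two weeks."
)
labels = [(s, classify(s)) for s in split_sentences(report)]
for sentence, label in labels:
    print(f"{label}: {sentence}")
```

Sentences with no vocabulary hit are labelled None, which gives a simple way to filter out "unimportant" sentences. Multi-word terms and tie-breaking between categories would need extra handling, and the per-category counts could instead be used as features for a trained classifier.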
