Classifying sentences with controlled vocabularies in Python

Posted 2024-05-17 00:36:07


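I have several medical vocabularies (e.g. drugs, symptoms, signs, diseases) and some free-text diagnostic reports. I would like to use tf-idf or machine-learning techniques to first split the free text into sentences and then classify the important sentences into categories. For example, "the patient needs to take aspirin" would be classified as "medication use", because "aspirin" can be found in the drug vocabulary. Could you recommend some algorithms? Thanks :)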


Tags: text, machine, symptoms, patients, reports, categories, programming language, vocabulary
1 answer
User
#1 · Posted 2024-05-17 00:36:07

I would suggest using CountVectorizer, since you already have the list of keywords. CountVectorizer has a vocabulary parameter, so you can pass your keyword list as the vocabulary. CountVectorizer will then check each document for only those keywords and build a feature vector from them. Let's look at an example:

from sklearn.feature_extraction.text import CountVectorizer

# Controlled vocabulary: only these terms become features
keywords = ["aspirin", "medication", "patients"]
sen1 = "patients need to take aspirin"
sen2 = "medication required immediately"
corpus = [sen1, sen2]

# With a fixed vocabulary, transform() can be called without fit()
vectorizer = CountVectorizer(vocabulary=keywords)
X = vectorizer.transform(corpus)

After this, printing the vectorizer's feature names with print(vectorizer.get_feature_names_out()) (get_feature_names() in scikit-learn releases before 1.0) shows the three keywords in vocabulary order: aspirin, medication, patients.

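Printing the vector for each sentence with print(X.toarray()) gives the matrix [[1 0 1] [0 1 0]]. Each row is a sentence and each column is a keyword, so the vector records how many times each keyword occurs in that sentence (here 1 for present, 0 for absent). Pass binary=True to CountVectorizer if you want strictly presence/absence values.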
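To get from keyword vectors to the categories asked about in the question, one option is to keep one vocabulary per category and label each sentence with the category whose vocabulary it hits most often. The sketch below is only an illustration of that idea: it uses plain keyword counting from the standard library rather than CountVectorizer, and the vocabularies, the sample report, and the helper names split_sentences and classify are all made up for the example. The sentence splitter is deliberately naive.

```python
import re
from collections import Counter

# Hypothetical category vocabularies; replace with your own controlled lists.
VOCABULARIES = {
    "medication use": {"aspirin", "ibuprofen", "medication"},
    "symptom": {"headache", "nausea", "fever"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def classify(sentence):
    """Return the category with the most vocabulary hits, or None if none hit."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = Counter()
    for category, vocab in VOCABULARIES.items():
        hits[category] = sum(1 for t in tokens if t in vocab)
    category, count = hits.most_common(1)[0]
    return category if count > 0 else None


# Made-up report text, used only to exercise the functions above.
report = (
    "Patients need to take aspirin. "
    "She reports headache and nausea. "
    "Follow up in two weeks."
)
labels = [(s, classify(s)) for s in split_sentences(report)]
for sentence, label in labels:
    print(label, "|", sentence)
```

Sentences with no vocabulary hit get None, which is one simple way to separate "important" sentences from the rest. Exact token matching will miss multi-word terms, misspellings, and inflected forms, so a real vocabulary would likely need normalization or phrase matching on top of this.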
