I would suggest using CountVectorizer, since you already have the list of keywords.
CountVectorizer has a vocabulary parameter, and you can pass your keyword list to it. The vectorizer will then look only for those keywords in each document and build a feature vector with one column per keyword, in the order you listed them. Let's look at an example:
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary: only these words become features
keywords = ["aspirin", "medication", "patients"]
sen1 = "patients need to take aspirin"
sen2 = "medication required immediately"
vectorizer = CountVectorizer(vocabulary=keywords)
corpus = [sen1, sen2]
# No fit step is needed because the vocabulary is already given
X = vectorizer.transform(corpus)
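After this, print the feature names of the vectorizer. In recent scikit-learn versions the method is get_feature_names_out(); the older get_feature_names() was deprecated in 1.0 and removed in 1.2:
print(vectorizer.get_feature_names_out())
You will see the three keywords in vocabulary order: aspirin, medication, patients.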
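And when you look at the vectors for each sentence with print(X.toarray()), you will see the following matrix:
[[1 0 1]
 [0 1 0]]
Each row is a sentence and each column is a keyword. The entries are occurrence counts, not just flags: here no keyword repeats within a sentence, so every entry happens to be 1 (present) or 0 (absent).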
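If you need strict presence/absence values even when a keyword occurs more than once, the binary=True option of CountVectorizer should do it. Here is a minimal sketch comparing the two modes; the first sentence is a made-up example (not from the question) in which "aspirin" appears twice:

```python
from sklearn.feature_extraction.text import CountVectorizer

keywords = ["aspirin", "medication", "patients"]
# Hypothetical corpus for illustration; "aspirin" occurs twice in the first sentence
corpus = [
    "patients need aspirin and more aspirin",
    "medication required immediately",
]

# Default behaviour: each cell holds the number of occurrences
count_vec = CountVectorizer(vocabulary=keywords)
counts = count_vec.transform(corpus).toarray()

# binary=True: each cell is 1 if the keyword occurs at all, otherwise 0
binary_vec = CountVectorizer(vocabulary=keywords, binary=True)
flags = binary_vec.transform(corpus).toarray()

print(counts.tolist())  # [[2, 0, 1], [0, 1, 0]]
print(flags.tolist())   # [[1, 0, 1], [0, 1, 0]]
```

Which mode to use depends on whether keyword frequency matters for your task; for a simple "does the document mention this keyword" feature, the binary form is usually enough.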