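I am trying to train my own corpus for sentiment analysis, using NLTK for Python. I have two text files: one with 25K positive tweets, one per line, and one with 25K negative tweets.
I am following method 2 from this Stack Overflow article.
When I run this code to create the corpus: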
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk
mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
I get an error message:
[...]
Does anyone know how to fix this?
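I'm not 100% positive, since I'm not on a Windows machine to test this at the moment, but I think the problem may be a difference in path slash direction between the original example by @alvas and your adaptation of it for Windows.
Specifically, you use:
'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'
whereas his example uses '/home/alvas/my_movie_reviews'.
For the most part this is fine, but you also reuse his cat_pattern regex, r'(neg|pos)/.*', which will match the forward slashes in his paths but reject the backslashes in yours.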
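To make the suspicion concrete, here is a minimal sketch using only the standard re module, not NLTK itself. The two file ids are hypothetical, and whether the corpus reader really hands the pattern backslash-separated ids on Windows is exactly the part I could not test. The separator-tolerant pattern at the end is my own assumption, not something from the original example.

```python
import re

# The category pattern reused from the original example.
cat_pattern = r'(neg|pos)/.*'

# Hypothetical file ids: one with a forward slash, one with a backslash.
posix_style = 'pos/tweets.txt'
windows_style = 'pos\\tweets.txt'

# The pattern requires a literal '/', so only the first id matches.
print(re.match(cat_pattern, posix_style) is not None)
print(re.match(cat_pattern, windows_style) is not None)

# A separator-tolerant variant (an assumption, not part of the original
# example): the character class accepts either '/' or '\'.
tolerant_pattern = r'(neg|pos)[/\\].*'
print(re.match(tolerant_pattern, windows_style).group(1))
```

If the separator does turn out to be the cause, the i.split('/')[0] call in your documents list would be affected in the same way, since it also assumes a forward slash.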