<p>Getting exactly the same results between <code>nltk</code> and <code>tm</code> in the preprocessing step seems tricky, so I think the best approach is to use <code>rpy2</code> to run the preprocessing in R and pull the results into Python:</p>
<pre><code>import rpy2.robjects as ro

# Run the tm preprocessing in R; ro.r() returns the value of the last
# expression (the corpus), and x[0] extracts each document as a string.
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)  # tm >= 0.6 needs content_transformer(tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]
</code></pre>
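<p>Before vectorizing, it's worth eyeballing the result on the Python side; something like this (purely illustrative) should show one cleaned string per tweet:</p>
<pre><code># sanity-check the round trip from R: lowercased, stemmed, stopword-free text
print(len(preproc))   # number of tweets; 1181 for this dataset
print(preproc[:3])    # the first few preprocessed documents
</code></pre>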
<p>Then you can load it into <code>scikit-learn</code>; the only thing you need to do to get things to match between the <code>CountVectorizer</code> and the <code>DocumentTermMatrix</code> is to remove terms shorter than 3 characters. Something along these lines should do it (the token pattern enforces the 3-character minimum, and <code>min_df=0.005</code> mirrors the 0.995 sparsity cutoff used below):</p>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer

# keep only tokens of 3+ word characters (tm's default wordLengths);
# min_df=0.005 drops terms appearing in fewer than 0.5% of documents,
# matching removeSparseTerms(dtm, 0.995)
cv = CountVectorizer(min_df=0.005, token_pattern=r'\b\w\w\w+\b')
X = cv.fit_transform(preproc)
X
# 1181x309 sparse matrix with 4669 stored elements (CSR format)
</code></pre>
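<p>If you want the counts in a labeled table, the vocabulary learned by the vectorizer provides the column names (a sketch; <code>get_feature_names_out</code> is the scikit-learn 1.0+ spelling, older versions use <code>get_feature_names</code>):</p>
<pre><code>import pandas as pd

# assumes cv and X from the snippet above
terms = cv.get_feature_names_out()
counts = pd.DataFrame(X.toarray(), columns=terms)
counts.head()
</code></pre>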
<p>Let's verify that this matches with R:</p>
<pre><code>tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
#
# Non-/sparse entries: 8980/3875329
# Sparsity : 100%
# Maximal term length: 115
# Weighting : term frequency (tf)
sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
#
# Non-/sparse entries: 4669/360260
# Sparsity : 99%
# Maximal term length: 20
# Weighting : term frequency (tf)
</code></pre>
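<p>Matching dimensions could in principle hide different vocabularies, so as a final check you can compare the term names themselves. A sketch, assuming the verification block above was run through the same <code>rpy2</code> session so that <code>sparse</code>, <code>cv</code>, and <code>X</code> are all in scope:</p>
<pre><code># Terms() is tm's accessor for the terms of a document-term matrix
r_terms = set(ro.r('Terms(sparse)'))
py_terms = set(cv.get_feature_names_out())
print(r_terms == py_terms)  # True if the two pipelines agree term-for-term
</code></pre>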
<p>As you can see, the number of stored elements and terms now matches exactly between the two approaches.</p>