R tm package and Spark/Python give different vocabulary sizes for a document term-frequency task

library(RJDBC)
library(Matrix)
library(tm)
library(wordcloud)
library(devtools)
library(lsa)
library(data.table)
library(dplyr)
library(lubridate)

corpus <- read.csv(paste(inputDir, "corpus.csv", sep="/"), stringsAsFactors=FALSE)
DescriptionDocuments <- c(corpus$doc_clean)
DescriptionDocuments <- VCorpus(VectorSource(DescriptionDocuments))
DescriptionDocuments.DTM <- DocumentTermMatrix(DescriptionDocuments,
    control = list(tolower = FALSE, stopwords = FALSE, removeNumbers = FALSE,
                   removePunctuation = FALSE, stemming = FALSE))
# VOCABULARY SIZE = 83758

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer}

var corpus = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .load("/path/to/corpus.csv")

// RegexTokenizer splits by default on one or more spaces, which is ok
val rTokenizer = new RegexTokenizer().setInputCol("doc").setOutputCol("words")
val words = rTokenizer.transform(corpus)
val cv = new CountVectorizer().setInputCol("words").setOutputCol("tf")
val cv_model = cv.fit(words)
var dtf = cv_model.transform(words)
// VOCABULARY SIZE = 84290

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.read_csv("/path/to/corpus.csv")
docs = corpus.loc[:, "doc"].values

def tokenizer(text):
    # Split on whitespace; note the call parentheses on split()
    return text.split()

cv = CountVectorizer(tokenizer=tokenizer, stop_words=None)
dtf = cv.fit_transform(docs)
# The vocabulary lives on the fitted vectorizer, not on the sparse matrix
print(len(cv.vocabulary_))
# VOCABULARY SIZE = 84290

1 answer

User

#1 · Posted 2024-06-27 02:49:20

The difference comes from the default options used when creating the document-term matrix. If you check ?termFreq, you will find the option wordLengths:

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.

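The default setting of c(3, Inf) removes every word shorter than 3 characters, such as "at", "in", "I", and so on.

This default is what causes the difference between tm and Spark/Python.

The example below shows how the wordLengths setting changes the number of terms.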

library(tm)

data("crude")

dtm <- DocumentTermMatrix(crude)
nTerms(dtm)
[1] 1266

dtm2 <- DocumentTermMatrix(crude, control = list(wordLengths = c(1, Inf)))
nTerms(dtm2)
[1] 1305
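
The same effect can be reproduced outside of tm. The following is a minimal Python sketch, assuming plain whitespace tokenization; the `vocabulary` helper and the two sample documents are hypothetical and only emulate the minimum/maximum length filter described for wordLengths, not tm's actual implementation.

```python
def vocabulary(docs, min_len=1, max_len=float("inf")):
    """Collect distinct whitespace-separated tokens whose length is in [min_len, max_len]."""
    vocab = set()
    for doc in docs:
        for token in doc.split():
            if min_len <= len(token) <= max_len:
                vocab.add(token)
    return vocab


# Hypothetical toy corpus, not taken from the question's data
docs = [
    "I am at home in the city",
    "the oil price is up",
]

full = vocabulary(docs)                 # analogous to wordLengths = c(1, Inf)
filtered = vocabulary(docs, min_len=3)  # analogous to the default c(3, Inf)

print(len(full), len(filtered))   # vocabulary size without and with the length filter
print(sorted(full - filtered))    # the short tokens that the filter discards
```

With no length filter the count matches what a plain whitespace split gives, which is closer to how the Spark and scikit-learn snippets in the question build their vocabularies. Passing wordLengths = c(1, Inf) to tm, as in dtm2 above, should therefore bring the R vocabulary size closer to the other two, although other tokenization differences may still leave a small gap.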
