Java: HashingTF does not provide unique indices


I am implementing Latent Semantic Analysis (LSA) using Eclipse Mars, Java 8, and Spark (spark-assembly-1.6.1-hadoop2.4.0.jar). I pass the documents in as tokens, then compute the SVD, and so on:

HashingTF hf = new HashingTF(hashingTFSize);
JavaRDD<Vector> ArticlesAsV = hf.transform(articles.map(x -> x.tokens));
JavaRDD<Vector> ArticlesTFIDF = idf.fit(ArticlesAsV).transform(ArticlesAsV);
RowMatrix matTFIDF = new RowMatrix(ArticlesTFIDF.rdd());
double rCond = 1.0E-9d;
int k = 50;
SingularValueDecomposition<RowMatrix, Matrix> svd = matTFIDF.computeSVD(k, true, rCond);

Everything works perfectly except for one thing: getting the index of a term from the HashingTF:

int index = hf.indexOf(term);

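I found that many terms have the same index. These are some of the results I got:

0: term
1: all
1: next
2: tt
3: the
7: document
9: such
9: matrix
11: document
11: about
11: each
12: function
12: chance
14: this
14: provides

This means that when I use the index to look up the vector of a term, I may get the vector of a different term that has the same index. I did this after lemmatization and stop-word removal, but I still get the same problem. Is there something I missed, or is there a bug in a component (for example MLlib) that needs to be updated? How can I keep a unique index for each term?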

# Answer 1

Spark's HashingTF class uses the hashing trick:

A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 2^20=1,048,576.

So several different terms can end up with the same index.
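The collision behavior can be reproduced without Spark. The following is a minimal sketch, assuming the index is simply the term's hash code reduced by a non-negative modulo of the feature dimension (which, as far as I know, is what the mllib HashingTF in Spark 1.6 does). The class name `HashingCollisionSketch` and its methods are hypothetical names for illustration, not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashingCollisionSketch {

    // Hash the term, then take a non-negative modulo to get a bucket in [0, numFeatures).
    // Assumption: this mirrors the idea behind HashingTF.indexOf, not its exact source.
    static int indexOf(String term, int numFeatures) {
        int raw = term.hashCode() % numFeatures;
        return raw < 0 ? raw + numFeatures : raw;
    }

    // Group terms by bucket; any bucket holding more than one term is a collision.
    static Map<Integer, List<String>> bucketize(List<String> terms, int numFeatures) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String term : terms) {
            buckets.computeIfAbsent(indexOf(term, numFeatures), k -> new ArrayList<>()).add(term);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "term", "all", "next", "the", "document", "matrix", "about", "each");
        // Few buckets: 8 terms cannot fit into 4 buckets without sharing.
        System.out.println(bucketize(terms, 4));
        // Many buckets: collisions become unlikely, but they are never ruled out.
        System.out.println(bucketize(terms, 1 << 20));
    }
}
```

A larger feature dimension only lowers the probability of a collision. For example, the strings "Aa" and "BB" have the same `String.hashCode()` value, so under this scheme they share an index for every choice of `numFeatures`.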

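Regarding the comments below: if you need every term to be kept, you can use CountVectorizer instead of HashingTF. CountVectorizer can also be used to obtain term-frequency vectors. To use CountVectorizer followed by IDF, you must work with a DataFrame instead of a JavaRDD, because CountVectorizer is only supported in the ml package.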

Here is an example DataFrame with the columns id and words:

id | words
---|------
0  | Array("word1", "word2", "word3")
1  | Array("word1", "word2", "word2", "word3", "word1")

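So, if you convert your articles JavaRDD into a DataFrame with the columns id and words, where each row is the bag of words of one sentence or document, you can compute TF-IDF with code like this: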

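import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.sql.DataFrame;

// df is the DataFrame with the columns id and words described above.
CountVectorizerModel cvModel = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setVocabSize(100000) // Maximum size of the vocabulary.
    .setMinDF(2) // Minimum number of different documents a term must appear in to be included in the vocabulary.
    .fit(df);

// Term-frequency vectors, indexed by position in the fitted vocabulary.
DataFrame featurizedData = cvModel.transform(df);

IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);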
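To answer the original question of getting a unique index per term, the fitted model exposes its vocabulary through `cvModel.vocabulary()`, where the position of a term in the array is its index in the feature vectors. The following is a minimal sketch of turning that array into a lookup table. It uses a hard-coded stand-in array so that it runs without Spark, and the class name `VocabularyIndex` and method `termToIndex` are hypothetical helpers, not Spark API.

```java
import java.util.HashMap;
import java.util.Map;

public class VocabularyIndex {

    // Position i in the vocabulary array is taken to be the feature index of vocabulary[i].
    static Map<String, Integer> termToIndex(String[] vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.length; i++) {
            index.put(vocabulary[i], i);
        }
        return index;
    }

    public static void main(String[] args) {
        // Stand-in data (assumption). With Spark this would be:
        // String[] vocabulary = cvModel.vocabulary();
        String[] vocabulary = {"word1", "word2", "word3"};
        Map<String, Integer> index = termToIndex(vocabulary);
        System.out.println(index.get("word2"));
    }
}
```

Terms that were excluded from the vocabulary (for example because they fall below the minDF threshold or beyond the vocabulary size limit) have no entry, so the lookup returns null for them rather than a shared index.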
