What is the minDF parameter of CountVectorizer in pyspark? - Q&A - Python中文网

Posted 2024-10-02 02:36:24

I read the Spark documentation, which says:

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

Can someone explain this to me clearly?

1 answer

User

#1 · Posted 2024-10-02 02:36:24

minDF is used to remove terms that appear too infrequently.

For example: minDF=0.01 means "ignore terms that appear in less than 1% of the documents". minDF=5 means "ignore terms that appear in fewer than 5 documents".

The default minDF is 1, which means "ignore terms that appear in fewer than 1 document". The default setting therefore does not ignore any terms.
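The threshold rule described above can be sketched in plain Python. This is only an illustration of the semantics, not Spark's implementation; the function name `vocabulary_with_min_df` and the sample documents are made up for the example, and it assumes that values of 1 or more are read as a document count and values below 1 as a fraction of the corpus.

```python
from collections import Counter

def vocabulary_with_min_df(docs, min_df=1.0):
    """Return the terms that survive a minDF-style filter (illustrative only)."""
    # Document frequency: the number of documents containing each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumption: values >= 1 are an absolute document count,
    # values below 1 are a fraction of the number of documents.
    threshold = min_df if min_df >= 1.0 else min_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

# Hypothetical tokenized corpus of four documents.
docs = [
    ["spark", "ml", "vector"],
    ["spark", "ml"],
    ["spark", "rare"],
    ["spark"],
]

print(vocabulary_with_min_df(docs))               # default keeps every term
print(vocabulary_with_min_df(docs, min_df=2))     # terms in at least 2 documents
print(vocabulary_with_min_df(docs, min_df=0.75))  # terms in at least 75% of documents
```

With this corpus, "vector" and "rare" each occur in a single document, so they are dropped as soon as the threshold rises above 1 document.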

vocabSize is the maximum number of tokens the vocabulary can hold. The default is 1 << 18, i.e., 2^18 or 262144.
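As a rough sketch of what a vocabulary size cap means, the snippet below keeps only the most frequent terms of a small corpus. It is plain Python rather than Spark code; the helper `top_terms` and the sample data are hypothetical, and the alphabetical tie-breaking is an assumption added only to make the output deterministic.

```python
from collections import Counter

# The default quoted above: 1 << 18 is a left shift, equal to 2**18.
DEFAULT_VOCAB_SIZE = 1 << 18

def top_terms(docs, vocab_size=DEFAULT_VOCAB_SIZE):
    """Keep at most vocab_size terms, ordered by term frequency across the corpus."""
    term_freq = Counter(term for doc in docs for term in doc)
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(term_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:vocab_size]]

# Hypothetical tokenized corpus.
docs = [["a", "b", "a"], ["b", "a", "c"]]

print(DEFAULT_VOCAB_SIZE)            # 262144
print(top_terms(docs, vocab_size=2)) # the two most frequent terms
print(top_terms(docs))               # corpus is far below the default cap
```

A corpus with fewer distinct terms than the cap is unaffected, which is why the default rarely truncates small vocabularies.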

References:
https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L430-L435
vocabSize: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L444-L446
