Determining the optimal k for k-means with the gap statistic
Detailed description of the gapkmean Python project
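This script uses the gap statistic, running the k-means algorithm multiple times, to find the best value of k for a data set.
Because k-means depends on its initial points, the result can differ from one set of starting points to another. The script therefore uses the sklearn package to run k-means several times with different initial points, and the number of starts is exposed as a parameter of the gap statistic.
This module is meant to be imported into other Python scripts and used together with sklearn to find the best value of k.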
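The idea described above can be sketched as follows. This is a minimal illustration of the standard gap statistic on top of sklearn's KMeans, not the gapkmean source: the function names `log_dispersion` and `gap_for_k` are hypothetical, and drawing reference sets uniformly over the bounding box of the data is an assumption about how references are generated.

```python
import numpy as np
from sklearn.cluster import KMeans


def log_dispersion(points, k, n_init=10, seed=0):
    """Log of the within-cluster sum of squares, best of n_init k-means runs."""
    model = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(points)
    return np.log(model.inertia_)


def gap_for_k(data, k, n_refs=10, n_init=10, seed=0):
    """Gap value and its error term for a single k (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    lows, highs = data.min(axis=0), data.max(axis=0)
    # Assumption: reference sets are uniform samples over the data's bounding box.
    ref_logs = np.array([
        log_dispersion(rng.uniform(lows, highs, size=data.shape), k, n_init, seed)
        for _ in range(n_refs)
    ])
    # Gap = mean reference log-dispersion minus the observed log-dispersion.
    gap = ref_logs.mean() - log_dispersion(data, k, n_init, seed)
    # Standard deviation of the reference values, inflated by sqrt(1 + 1/B).
    s_k = ref_logs.std() * np.sqrt(1.0 + 1.0 / n_refs)
    return gap, s_k
```

On data with clearly separated clusters, the gap at the true number of clusters should be noticeably larger than the gap at k = 1.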
Parameters:
refs: np.array or None, the reference (replicated) data sets to compare against, if you already have them; if you have no suitable reference data, pass None and the function generates them automatically;
B: int, the number of reference samples used to compute the gap statistic; 10 is recommended, and it should not be reduced to a smaller value;
K: list, the range of k values to test;
N_init: int, the number of initial starting points for each k-means run under sklearn, used to get a stable clustering result each time; you may not need this many starting points, so it can be reduced to speed up the computation;
n_jobs: int, controls parallel computation, which can speed up the run; it can only be changed inside the script, not passed as an argument of the function.
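To illustrate what a setting like N_init controls, the sketch below fits sklearn's KMeans with several random seeds and collects the final inertia of each fit. The helper `inertia_spread` is hypothetical and is not part of gapkmean; `init="random"` is chosen here only to make the sensitivity to starting points easier to see.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_spread(points, k, n_init, seeds=range(5)):
    """Final inertia of KMeans for several random seeds (illustrative helper)."""
    return [
        KMeans(n_clusters=k, n_init=n_init, init="random", random_state=s)
        .fit(points)
        .inertia_
        for s in seeds
    ]
```

Comparing `inertia_spread(data, k, n_init=1)` with `inertia_spread(data, k, n_init=10)` on the same data typically shows less variation across seeds for the larger value, at the cost of more computation.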
- To install:
    pip install gapkmean
- To use as a module in Python:
    from gapkmean import gap
    # Find the optimal k for the k-means algorithm
    # Note: data should be a numpy.array
    gaps, s_k, K = gap.gap_statistic(data, refs=None, B=10, K=range(1,11), N_init=10)
    bestKValue = gap.find_optimal_k(gaps, s_k, K)
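For readers wondering how a best k might be picked from the gaps and s_k values, here is a sketch of the usual selection rule for the gap statistic: choose the smallest k whose gap is at least the next gap minus the next error term. The function `first_k_within_one_se` is hypothetical, and whether gap.find_optimal_k applies exactly this criterion is an assumption.

```python
def first_k_within_one_se(gaps, s_k, ks):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); else the k with the largest gap."""
    ks = list(ks)
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - s_k[i + 1]:
            return ks[i]
    # No k satisfied the rule, so fall back to the largest observed gap.
    return ks[max(range(len(ks)), key=lambda i: gaps[i])]
```

The rule prefers a smaller k when a larger one improves the gap by less than the estimated error, which guards against over-splitting.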