How to use Pearson correlation as the distance measure in scikit-learn AgglomerativeClustering

I have the following data:

State   Murder  Assault UrbanPop    Rape
Alabama 13.200  236 58  21.200
Alaska  10.000  263 48  44.500
Arizona 8.100   294 80  31.000
Arkansas    8.800   190 50  19.500
California  9.000   276 91  40.600
Colorado    7.900   204 78  38.700
Connecticut 3.300   110 77  11.100
Delaware    5.900   238 72  15.800
Florida 15.400  335 80  31.900
Georgia 17.400  211 60  25.800
Hawaii  5.300   46  83  20.200
Idaho   2.600   120 54  14.200
Illinois    10.400  249 83  24.000
Indiana 7.200   113 65  21.000
Iowa    2.200   56  57  11.300
Kansas  6.000   115 66  18.000
Kentucky    9.700   109 52  16.300
Louisiana   15.400  249 66  22.200
Maine   2.100   83  51  7.800
Maryland    11.300  300 67  27.800
Massachusetts   4.400   149 85  16.300
Michigan    12.100  255 74  35.100
Minnesota   2.700   72  66  14.900
Mississippi 16.100  259 44  17.100
Missouri    9.000   178 70  28.200
Montana 6.000   109 53  16.400
Nebraska    4.300   102 62  16.500
Nevada  12.200  252 81  46.000
New Hampshire   2.100   57  56  9.500
New Jersey  7.400   159 89  18.800
New Mexico  11.400  285 70  32.100
New York    11.100  254 86  26.100
North Carolina  13.000  337 45  16.100
North Dakota    0.800   45  44  7.300
Ohio    7.300   120 75  21.400
Oklahoma    6.600   151 68  20.000
Oregon  4.900   159 67  29.300
Pennsylvania    6.300   106 72  14.900
Rhode Island    3.400   174 87  8.300
South Carolina  14.400  279 48  22.500
South Dakota    3.800   86  45  12.800
Tennessee   13.200  188 59  26.900
Texas   12.700  201 80  25.500
Utah    3.200   120 80  22.900
Vermont 2.200   48  32  11.200
Virginia    8.500   156 63  20.700
Washington  4.000   145 73  26.200
West Virginia   5.700   81  39  9.300
Wisconsin   2.600   53  66  10.800
Wyoming 6.800   161 60  15.600

I use it to perform hierarchical clustering of the states. Here is the complete working code:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# read the tab-separated data and keep the numeric columns as a matrix
df = pd.read_table("http://dpaste.com/031VZPM.txt")
samples = df["State"].tolist()
ndf = df[["Murder", "Assault", "UrbanPop", "Rape"]]
X = ndf.to_numpy()  # as_matrix() was removed in recent pandas versions

cluster = AgglomerativeClustering(n_clusters=3, linkage='complete',
                                  affinity='euclidean').fit(X)
label = cluster.labels_
outclust = list(zip(label, samples))  
outclust_df = pd.DataFrame(outclust,columns=["Clusters","Samples"])  

for clust in outclust_df.groupby("Clusters"):
    print (clust)

Note that in that approach I used euclidean distance. What I would like to do instead is use 1 - Pearson correlation as the distance. In R it would look like this:

dat <- read.table("http://dpaste.com/031VZPM.txt",sep="\t",header=TRUE)
dist2 = function(x) as.dist(1-cor(t(x), method="pearson"))
dat = dat[c("Murder","Assault","UrbanPop","Rape")]
hclust(dist2(dat), method="ward.D")

How can I achieve this with scikit-learn's AgglomerativeClustering? I know there is a 'precomputed' option for the affinity parameter, but I don't know how to use it to solve my problem.


1 Answer

You can define a custom affinity as a function that takes the data and returns the affinity (distance) matrix:

from scipy.stats import pearsonr
import numpy as np

def pearson_affinity(M):
    # pairwise 1 - Pearson correlation between all rows of M
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])
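
As a side note (not in the original answer): since np.corrcoef correlates the rows of its input by default, the same affinity matrix can be obtained without the explicit double loop. A minimal equivalent sketch:

def pearson_affinity_vectorized(M):
    # np.corrcoef(M) returns the pairwise Pearson correlations between the rows of M,
    # so this produces the same matrix as the loop-based version above
    return 1 - np.corrcoef(M)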

Then you can call the agglomerative clustering with this as the affinity function (you have to change the linkage, since 'ward' only works with euclidean distances):

cluster = AgglomerativeClustering(n_clusters=3, linkage='average',
                                  affinity=pearson_affinity)
cluster.fit(X)
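
Regarding the 'precomputed' option mentioned in the question: an alternative is to compute the 1 - Pearson correlation distance matrix yourself and hand it to fit. This is a sketch of that approach, assuming scipy is available; note that recent scikit-learn releases renamed the affinity parameter to metric, so you may need metric='precomputed' instead:

from scipy.spatial.distance import pdist, squareform

# full 1 - Pearson correlation distance matrix between the rows of X
D = squareform(pdist(X, metric='correlation'))

precomputed = AgglomerativeClustering(n_clusters=3, linkage='average',
                                      affinity='precomputed')
precomputed.fit(D)  # with affinity='precomputed', fit expects the distance matrix, not X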

Note that, for some reason, the clustering with the custom affinity doesn't seem to work very well on your data:

cluster.labels_
Out[107]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0])
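
If you want to stay closer to the R hclust(dist2(dat), method="ward.D") call from the question, scipy's hierarchical clustering accepts precomputed correlation distances directly. A rough equivalent sketch (not part of the original answer; like R's ward.D, it applies the Ward formula to non-Euclidean distances):

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# condensed 1 - Pearson correlation distances, like the dist2 function in the R snippet
d = pdist(X, metric='correlation')

# Ward linkage on the precomputed distances, analogous to hclust(..., method="ward.D")
Z = linkage(d, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters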
