Finding the row in a DataFrame that is most similar to the user input

raw_data['distance'] = distance.cdist(raw_data, raw_user.values.reshape(1, -1), metric='euclidean')
# Sort the rows of the DataFrame by the 'distance' column
raw_data = raw_data.sort_values(by='distance')
print(raw_data.distance)

155    3.047796e+09
177    3.047797e+09
162    3.047797e+09
23     3.047797e+09
192    3.047797e+09
           ...
72     3.047931e+09
104    3.047931e+09
Name: distance, Length: 203, dtype: float64

1 answer

User

#1 · Posted on 2024-10-02 12:26:56

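Plain Euclidean distance should not be used here, because the features in the raw data vary on different scales: a binary feature changes by at most 1 unit, while a continuous feature can change by many units, so the continuous features dominate the sum. I therefore suggest measuring the similarity between records with a standardized Euclidean distance instead. You could try this: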

# Store the column-wise standard deviation (pandas uses the sample std, ddof=1, by default)
columnWiseStandardDeviation = raw_data.std()

# Compute the normalized Euclidean distance: each delta is divided by the
# standard deviation of its column.
# .values is used to get the underlying NumPy arrays, which can broadcast
# differently shaped operands in the binary '-' operation below.
# The deltas between corresponding columns of raw_data and raw_user are divided
# by the column-wise standard deviations to normalize them. The normalized
# deltas are then squared and summed, and the square root of that sum is the
# distance. I call it the "normalized Euclidean distance" in this context; it is
# usually known as the standardized Euclidean distance.

distance = ((((raw_data.values - raw_user.values) / columnWiseStandardDeviation.values) ** 2).sum(axis=1)) ** 0.5

# Get the record closest to the user input record.
# df.iloc has to be used here because distance is a plain NumPy array and does
# not carry the index of the original DataFrame.
# argmin() returns the position of the first minimum.
closestRecord = raw_data.iloc[distance.argmin()]
print(closestRecord)
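As a cross-check, the same quantity appears to be available in SciPy through `cdist` with `metric='seuclidean'`, which takes the per-column variance `V` rather than the standard deviation. The sketch below uses a small made-up `raw_data` and `raw_user` (the column names `is_member` and `income` are hypothetical, since the real data is not shown) and compares the manual formula with the SciPy call.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Hypothetical stand-ins for raw_data and raw_user; the real data is not shown.
raw_data = pd.DataFrame({
    "is_member": [0, 1, 0, 1],                        # binary feature
    "income": [52000.0, 48000.0, 91000.0, 50500.0],   # continuous feature
})
raw_user = pd.DataFrame({"is_member": [1], "income": [50000.0]})

# Manual version: deltas divided by the column-wise sample std (ddof=1).
std = raw_data.std()
manual = (((raw_data.values - raw_user.values) / std.values) ** 2).sum(axis=1) ** 0.5

# SciPy version: 'seuclidean' expects the per-column variance V.
# ddof=1 is passed explicitly so it matches pandas' default for std().
V = raw_data.var(ddof=1).values
scipy_dist = distance.cdist(
    raw_data.values, raw_user.values, metric="seuclidean", V=V
).ravel()

# Position of the closest row, then look it up with iloc.
closest_pos = int(np.argmin(scipy_dist))
print(raw_data.iloc[closest_pos])
```

Passing `V` explicitly matters: if it is left out, SciPy estimates the variance from the rows of both inputs stacked together, which would include the user record and give slightly different numbers than the manual formula.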

Since I don't have the actual data, I generated a DataFrame of random numbers to test the script:

import random
import pandas as pd

rows, cols = 50, 10
# Per-column upper bound, so that the columns vary on different scales
_m = [5 * random.randint(1, cols) for c in range(cols)]
print(_m)

# Random test DataFrame: column i holds integers in [0, _m[i]]
df = pd.DataFrame(data={i: [random.randint(0, _m[i]) for j in range(rows)] for i in range(cols)})
print(df)

columnWiseStandardDeviation = df.std()
print(columnWiseStandardDeviation.values)

# A single random record that plays the role of the user input
df1 = pd.DataFrame(data=[[random.randint(0, _m[i]) for i in range(cols)]])
distance = ((((df.values - df1.values) / columnWiseStandardDeviation.values) ** 2).sum(axis=1)) ** 0.5
print(df1)

# (row position, distance) pairs sorted from closest to farthest
print(sorted(list(enumerate(distance)), key=lambda d: d[1]))
print('Closest Record: ', df.iloc[distance.argmin()].values)
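If the original index labels are needed (for example, to sort the rows by distance as in the question), one option is to stay in pandas instead of dropping to `.values`. The following is only a sketch under a few assumptions: the helper name `standardized_distances` is hypothetical, the query is a Series keyed by the same column names, and constant columns (zero standard deviation) are left unscaled to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

def standardized_distances(df: pd.DataFrame, query: pd.Series) -> pd.Series:
    """Standardized Euclidean distance from every row of df to query.

    Hypothetical helper; the result keeps df's index labels.
    """
    std = df.std()                 # sample std (ddof=1), pandas' default
    std = std.replace(0, 1.0)      # assumption: leave constant columns unscaled
    # Subtraction and division align on column labels, row by row.
    deltas = (df - query[df.columns]) / std
    return np.sqrt((deltas ** 2).sum(axis=1))

# Made-up example data with a non-default index, a binary column,
# a continuous column, and a constant column.
df = pd.DataFrame(
    {"a": [0, 1, 0, 1], "b": [10.0, 200.0, 35.0, 185.0], "c": [7, 7, 7, 7]},
    index=[155, 177, 162, 23],
)
query = pd.Series({"a": 1, "b": 190.0, "c": 7})

dist = standardized_distances(df, query)
ranked = dist.sort_values()        # closest rows first, original labels kept
closest = df.loc[dist.idxmin()]    # label-based lookup, so loc instead of iloc
print(ranked)
print(closest)
```

Because the distances come back as a Series indexed like `df`, `idxmin()` and `loc` can replace the positional `iloc` lookup used with the NumPy array.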
