Finding the rows in a DataFrame most similar to a user's input

Posted 2024-10-02 12:26:56


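I want to find the rows in my dataset that are most similar to a user's input.

My dataset looks like this:

[image: screenshot of the dataset, not preserved]

This is the user input:

[image: screenshot of the user input, not preserved]

I have tried many distance metrics from scipy and sklearn (Euclidean, Hamming, city block, correlation, cosine, ...), but none of them gave good results.

My dataset has shape (400, 70). Of the 70 features, 25 are binary and 45 are continuous.

Here is my Python code: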

from scipy.spatial import distance

raw_data['distance'] = distance.cdist(raw_data,
                                      raw_user.values.reshape(1, -1),
                                      metric='euclidean')

# Sort the rows of the dataframe by the 'distance' column
raw_data = raw_data.sort_values(by='distance')
print(raw_data.distance)

The result looks like this:

155    3.047796e+09
177    3.047797e+09
162    3.047797e+09
23     3.047797e+09
192    3.047797e+09
       ...     
72     3.047931e+09
104    3.047931e+09
Name: distance, Length: 203, dtype: float64

If you have another approach or any tips for solving this problem, please don't hesitate to share your suggestions. Thanks.


1 answer
User
#1 · Posted 2024-10-02 12:26:56

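Plain Euclidean distance is not a good fit here, because the features in the raw data vary on very different scales: a binary feature changes by at most 1 unit, whereas a continuous feature can change by many units, so the continuous features dominate the distance. I would therefore suggest using a standardized Euclidean distance to measure the similarity between records. You could try this: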

# Per-column standard deviation, used to put every feature on a comparable scale
columnWiseStandardDeviation = raw_data.std()

# Standardized Euclidean distance:
# the difference between each column of raw_data and the matching raw_user value
# is divided by that column's standard deviation; the scaled differences are then
# squared, summed across columns, and the square root of the sum is taken.
# .values gives NumPy arrays, so the single user row broadcasts against every row
# of raw_data in the '-' operation.
# Note: a constant column has a standard deviation of 0 and would yield NaN or inf here.

distance = ((((raw_data.values - raw_user.values)/columnWiseStandardDeviation.values)**2).sum(axis=1))**0.5

# Get the record closest to the user input.
# distance is a plain NumPy array without the index of the original dataframe,
# so a positional lookup (df.iloc) has to be used here.
closestRecord = raw_data.iloc[distance.argmin()]
print(closestRecord)
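
As a side note that is not part of the original answer: this quantity is commonly called the standardized Euclidean distance, and SciPy offers it as `metric='seuclidean'` in `scipy.spatial.distance.cdist`. The sketch below uses made-up data (`data` and `user` are hypothetical stand-ins for `raw_data` and `raw_user`) and passes the per-column variance explicitly through `V`, so the result should agree with the manual formula above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical stand-ins for raw_data / raw_user: two binary columns
# and two continuous columns on very different scales.
data = pd.DataFrame({
    "flag_a": rng.integers(0, 2, size=50),
    "flag_b": rng.integers(0, 2, size=50),
    "small": rng.normal(0.0, 1.0, size=50),
    "large": rng.normal(0.0, 1e6, size=50),
})
user = data.iloc[[7]].reset_index(drop=True)  # one-row DataFrame copied from row 7

# Manual formula from the answer (pandas std uses ddof=1 by default).
std = data.std().values
manual = ((((data.values - user.values) / std) ** 2).sum(axis=1)) ** 0.5

# Same quantity via SciPy; V is the per-column variance (also ddof=1 here).
variance = data.var().values
via_scipy = cdist(data.values, user.values, metric="seuclidean", V=variance).ravel()

# Position of the closest row, and a check that both computations agree.
closest_position = int(via_scipy.argmin())
print(closest_position, np.allclose(manual, via_scipy))
```

Passing `V` explicitly keeps the scaling tied to the dataset alone. If `V` is left out, SciPy estimates the variance itself, which may not match `raw_data.std()` exactly.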

Since I don't have the actual data, I generated a DataFrame of random numbers to test the script:

import random

import pandas as pd

rows, cols = 50, 10
# Random upper bound for each column, so the columns end up on different scales
_m = [5*random.randint(1, cols) for c in range(cols)]
print(_m)

df = pd.DataFrame(data={i: [random.randint(0, _m[i]) for j in range(rows)] for i in range(cols)})
print(df)

columnWiseStandardDeviation = df.std()
print(columnWiseStandardDeviation.values)

# A single random "user" row drawn from the same per-column ranges
df1 = pd.DataFrame(data=[[random.randint(0, _m[i]) for i in range(cols)]])
distance = ((((df.values - df1.values)/columnWiseStandardDeviation.values)**2).sum(axis=1))**0.5
print(df1)

# All (row position, distance) pairs, closest first
print(sorted(list(enumerate(distance)), key=lambda d: d[1]))
print('Closest Record: ', df.iloc[distance.argmin()].values)
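
The `enumerate` output above lists row positions, not the DataFrame's own index labels. If the labels matter (the distances printed in the question are keyed by row index), one option is to wrap the distances in a `pd.Series` that reuses the frame's index and then take the smallest few. This is only a sketch on made-up data; `frame`, `query`, and `k` are hypothetical names that do not appear in the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data with a non-default index, to show that labels survive.
frame = pd.DataFrame(
    rng.integers(0, 20, size=(30, 5)).astype(float),
    index=range(100, 130),
    columns=list("abcde"),
)
query = frame.loc[[112]].to_numpy()  # shape (1, 5); an exact copy of row 112

# Standardized Euclidean distance, computed as in the answer.
scaled_diff = (frame.to_numpy() - query) / frame.std().to_numpy()
distances = pd.Series(np.sqrt((scaled_diff ** 2).sum(axis=1)), index=frame.index)

# The k nearest rows, closest first, keyed by the original index labels.
k = 3
nearest = distances.nsmallest(k)
print(nearest)
print(frame.loc[nearest.index])
```

Because `nearest` carries the original labels, `frame.loc[nearest.index]` returns the matching rows directly, with no need to translate positions back through `iloc`.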
