Applying a user-defined function to a groupby in pandas

review_id  user_id  prod_id  review
0          10       5        this restaurant is the best.
1          30       10       Worst food.
2          10       15       Best place!
3          30       5        the food is too expensive.
4          30       10       Yummy! I love it.

import numpy as np

def ACS(rvw1, rvw2):
    # Strip punctuation and lowercase both reviews
    rvw1 = rvw1.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw2 = rvw2.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw1words = rvw1.split()
    rvw2words = rvw2.split()
    # Shared vocabulary of both reviews
    allwords = list(set(rvw1words) | set(rvw2words))
    rvw1freq = []
    rvw2freq = []
    # Word-count vector for each review over the shared vocabulary
    for word in allwords:
        rvw1freq.append(rvw1words.count(word))
        rvw2freq.append(rvw2words.count(word))
    # Cosine similarity of the two count vectors
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))
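To make clear what ACS returns, here is a small worked sketch for reviews 0 and 2 from the table, using only the standard library. The helper name `word_counts` is made up for this illustration and is not part of the question's code; it is assumed to strip the same four punctuation marks as ACS.

```python
import math
from collections import Counter

def word_counts(text):
    # Hypothetical helper: strip the punctuation ACS strips, lowercase, split.
    for ch in ",.?!":
        text = text.replace(ch, "")
    return Counter(text.lower().split())

a = word_counts("this restaurant is the best.")  # 5 distinct words, each once
b = word_counts("Best place!")                   # 2 distinct words, each once

# Dot product over the shared words (only "best" appears in both reviews).
dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
norm_a = math.sqrt(sum(c * c for c in a.values()))  # sqrt(5)
norm_b = math.sqrt(sum(c * c for c in b.values()))  # sqrt(2)
similarity = dot / (norm_a * norm_b)                # 1 / sqrt(10)
```

The value is 1/sqrt(10), roughly 0.316, which is the figure reported for the pair (0, 2) in the answer's output.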

1 answer

User

#1 · Posted 2024-10-04 03:22:12

You can use pd.merge to build the Cartesian product of the rows within each group, then apply the function to every pair with pd.DataFrame.apply:

import pandas as pd

# Helper function to get combinations of a dataframe and their cosine similarity
def groupSimilarity(df):
    combinations = (df.assign(dummy=1)
                     .merge(df.assign(dummy=1), on="dummy")
                     .drop("dummy", axis=1))
    similarity = combinations.apply(lambda x: ACS(x["review_x"], x["review_y"]), axis=1)
    combinations.loc[:, "similarity"] = similarity
    return combinations

# Apply the helper to each user's group of reviews
grouped = (df.groupby("user_id")
            .apply(groupSimilarity)
            .reset_index())

# >>> grouped[["review_id_x", "review_id_y", "user_id_x", "user_id_y", "similarity"]]
#     review_id_x  review_id_y  user_id_x  user_id_y  similarity
# 0             0            0         10         10  1.000000
# 1             0            2         10         10  0.316228
# 2             2            0         10         10  0.316228
# 3             2            2         10         10  1.000000
# 4             1            1         30         30  1.000000
# 5             1            3         30         30  0.316228
# 6             1            4         30         30  0.000000
# 7             3            1         30         30  0.316228
# 8             3            3         30         30  1.000000
# 9             3            4         30         30  0.000000
# 10            4            1         30         30  0.000000
# 11            4            3         30         30  0.000000
# 12            4            4         30         30  1.000000
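An alternative sketch, not part of the original answer: because pairs are only needed within a single user, a self-merge on user_id should yield the same per-user Cartesian product without going through groupby and apply. The data frame below is rebuilt from the question's table, and `cosine_similarity` is a hypothetical stand-in that is assumed to behave like ACS.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "review_id": [0, 1, 2, 3, 4],
    "user_id": [10, 30, 10, 30, 30],
    "prod_id": [5, 10, 15, 5, 10],
    "review": [
        "this restaurant is the best.",
        "Worst food.",
        "Best place!",
        "the food is too expensive.",
        "Yummy! I love it.",
    ],
})

def cosine_similarity(rvw1, rvw2):
    """Hypothetical stand-in for ACS: cosine similarity of word counts."""
    def words(text):
        for ch in ",.?!":
            text = text.replace(ch, "")
        return text.lower().split()

    w1, w2 = words(rvw1), words(rvw2)
    vocab = sorted(set(w1) | set(w2))
    v1 = np.array([w1.count(w) for w in vocab], dtype=float)
    v2 = np.array([w2.count(w) for w in vocab], dtype=float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Self-merge on user_id: each review is paired with every review
# written by the same user (including itself).
pairs = df.merge(df, on="user_id", suffixes=("_x", "_y"))
pairs["similarity"] = [
    cosine_similarity(a, b)
    for a, b in zip(pairs["review_x"], pairs["review_y"])
]
result = pairs[["user_id", "review_id_x", "review_id_y", "similarity"]]
```

With this approach user_id stays a single column rather than being split into user_id_x and user_id_y. Like the groupby version, it keeps self-pairs and both orderings of each pair; filtering on review_id_x < review_id_y would leave one row per unordered pair.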
