I have two PySpark DataFrames and want to compute the sum of the points column in the second DataFrame based on column values in the first

Posted 2024-10-01 04:51:17


This is my first DataFrame, which contains the players' points:

[first dataframe: table not preserved; the code below reads its Playername and points columns]

This is my second DataFrame. I want the total for each row to be computed from the players that appear in that row:

second dataframe

The output I want looks like this:

Desired output

I have a working solution, but I would like an efficient way to do this in PySpark:

## Return the corresponding points for player x from df1
def replacepoints(x):
    return df1['points'].where(df1['Playername']==x).sum()

## Iterating through All Names and replacing with their points so that we can sum whole single row to get total points

df3 = df2[['p1','p2','p3','p4','p5','p6','p7','p8','p9','p10','p11']].copy()
# df3
length = len(df3)
for i in range(length):
    j_len = df3.shape[1]
    for j in range(j_len):
        name = df3.iat[i, j]
        # .iat writes into df3 itself; chained df3.iloc[i][j] = ... may only modify a temporary copy
        df3.iat[i, j] = replacepoints(name)
#         print(name)
#         print(replacepoints(name))

## now df3 contains points only
# df3

## storing the sum in points
points = df3.sum(axis=1)
points

# Add the row totals to df2 as the points column
df2['points'] = points

2 answers

Code

import pandas as pd

df1 = pd.read_csv('player_points.csv')
df2 = pd.read_csv('small_input_spark.csv')

player_name = list(df1['Playername'])
points = list(df1['points'])

# Columns from index 7 onward hold player names; replace each name with its points
for name, pts in zip(player_name, points):
    df2.iloc[:, 7:] = df2.iloc[:, 7:].replace([name], int(pts))

# Sum the replaced columns across each row
df2['points'] = df2.iloc[:, 7:].sum(axis=1)

df2.head()

This avoids the nested loops. We could also work on a copy so that the original dataset is not changed.

Note: it uses the same approach, replacing player names with their points and then summing across each row.
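The copy-based variant mentioned above could be sketched as follows. This is a minimal illustration, not the answerer's code: the toy rows, the `match` column and the `player_cols` list are assumptions standing in for the real CSV files, and `Series.map` is swapped in for the repeated `replace` calls.

```python
import pandas as pd

# Toy stand-ins for player_points.csv and small_input_spark.csv (assumed data)
df1 = pd.DataFrame({"Playername": ["A", "B", "C"], "points": [10, 20, 30]})
df2 = pd.DataFrame({"match": [1, 2], "p1": ["A", "C"], "p2": ["B", "A"]})

# Lookup Series indexed by player name
name_to_points = df1.set_index("Playername")["points"]

# Hypothetical list of the columns that hold player names
player_cols = ["p1", "p2"]

# Work on a copy so df2 keeps the original player names
scored = df2.copy()

# Map every name column to points, then sum across each row
scored["points"] = (
    scored[player_cols].apply(lambda s: s.map(name_to_points)).sum(axis=1)
)

totals = scored["points"].tolist()
print(scored)
```

A name that is missing from `df1` maps to NaN here and is skipped by `sum`, so it may be worth checking for unmatched names before trusting the totals.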

This is how I did it:

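from functools import reduce
from operator import add

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

sc = SparkContext('local[*]')
spark = SparkSession(sparkContext=sc)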

df2 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\small_input_spark.csv")
df1 = spark.read.options(inferSchema='True',delimiter=',',header='True').csv("D:\\bop\\player_points.csv")

# start = time.time()

# Collect the lookup table to the driver as {player name: points}
dictn = {row['Playername']: row['points'] for row in df1.collect()}

print(dictn)

# na.replace needs replacement values of the same type as the keys, so convert the points to strings
dictn = {k: str(v) for k, v in dictn.items()}

df3 = df2.na.replace(dictn,1,("captain","v-captain","MoM","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11"))

integer_type = ["captain","v-captain","MoM","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","p11"]

for c in integer_type:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))

numeric_col_list=df3.schema.names
numeric_col_list=numeric_col_list[4:]   


df3 = df3.withColumn('v-captain', ((col('v-captain') / 2 )))
df3 = df3.withColumn('MoM', ((col('MoM') * 2 )))


df3 = df3.withColumn('points',reduce(add, [col(x) for x in numeric_col_list]))
