How to get the SQL equivalent of cume_dist in pandas?

Posted 2024-10-01 13:36:06


I tried different pandas methods such as rank, qcut, and quantile, but could not reproduce SQL's cume_dist(). How can I get the result below in pandas?

The full problem, solved in SQL, can be found on this site: https://www.windowfunctions.com/questions/ranking/4

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'breed': ['Persian', 'Persian', 'Persian', 'British Shorthair',
              'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon',
              'Maine Coon', 'Maine Coon', 'Maine Coon', 'British Shorthair'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': ['Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell',
              'Brown', 'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell',
              'Brown', 'Black'],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]})

SQL code for cume_dist

select name, weight, round(100 * cume_dist() over (order by weight)) as percent from cats order by weight

Desired output (SQL produces this; how can it be done in pandas?)

name     weight  percent
Tigger   3.8     8
Molly    4.2     17
Ashes    4.5     25
Charlie  4.8     33
Smudge   4.9     42
Felix    5.0     50
Puss     5.1     58
Millie   5.4     67
Alfie    5.5     75
Misty    5.7     83
Oscar    6.1     100
Smokey   6.1     100

Question: how can this be done in pandas?

Is there a way to do this using only numpy and pandas?
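A pandas-only sketch, assuming the `df` from the setup above: `Series.rank(method='max', pct=True)` gives, for each row, the fraction of rows with a value less than or equal to it, which is exactly what SQL's `cume_dist()` returns (tied rows share the highest rank, so both 6.1 rows get 100%).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

out = df[['name', 'weight']].copy()
# rank(method='max', pct=True) == share of rows with weight <= this row's
# weight, i.e. SQL's cume_dist(); ties all receive the same (highest) rank.
out['percent'] = (df['weight'].rank(method='max', pct=True) * 100).round().astype(int)
out = out.sort_values('weight').reset_index(drop=True)
print(out)
```

This reproduces the desired output table, including the tied 6.1 rows both at 100.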


Tags: name, pandas, sql, dist, as, black, weight, persian
2 Answers

Here is a Python (PySpark) version:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Define two windows for cumulating weight
win = Window().orderBy('weight') # rolling sum window
win2 = Window().orderBy(F.lit(1)) # total sum window

# get cumulative distribution
df = df.withColumn('cume_dist', F.sum('weight').over(win)*100./F.sum('weight').over(win2))
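For comparison, the same cumulative-weight computation (a running sum of `weight` over the total weight, which is not the count-based `cume_dist()`) can be sketched in plain pandas; the column name `pct_of_total` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

s = df.sort_values('weight').reset_index(drop=True)
# Running sum of weight divided by the total; note this differs from the
# SQL window above on ties, where the range frame includes all peer rows.
s['pct_of_total'] = s['weight'].cumsum() * 100.0 / s['weight'].sum()
print(s[['name', 'weight', 'pct_of_total']])
```

Note that this measures cumulative share of total weight, so its percentages differ from the cume_dist percentages in the question.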

Create the Spark DataFrame

from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, IntegerType)

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")

Use the SQL window function on the Spark DataFrame

from pyspark.sql.window import Window
from pyspark.sql.functions import cume_dist
from pyspark.sql.types import IntegerType

w = Window.orderBy(sdf['weight'])

sdf.select("weight", (cume_dist().over(w) * 100).cast(
    IntegerType()).alias("percentile")).show()

Output

+------+----------+
|weight|percentile|
+------+----------+
|   3.8|         8|
|   4.2|        16|
|   4.5|        25|
|   4.8|        33|
|   4.9|        41|
|   5.0|        50|
|   5.1|        58|
|   5.4|        66|
|   5.5|        75|
|   5.7|        83|
|   6.1|       100|
|   6.1|       100|
+------+----------+
