How to get the SQL equivalent of cume_dist in pandas?

Posted 2024-10-01 13:36:06


I tried different pandas methods such as rank, qcut, and quantile, but could not reproduce SQL's cume_dist(). How can I get the result below in pandas?

The full problem, solved in SQL, can be found on this site: https://www.windowfunctions.com/questions/ranking/4

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'breed': ['Persian', 'Persian', 'Persian', 'British Shorthair',
              'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon',
              'Maine Coon', 'Maine Coon', 'Maine Coon', 'British Shorthair'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': ['Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell',
              'Brown', 'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell',
              'Brown', 'Black'],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]})

SQL code for cume_dist

select name, weight, round(100 * cume_dist() over (order by weight)) as percent from cats order by weight

Desired output (SQL produces this; how can it be done in pandas?)

name     weight  percent
Tigger   3.8     8
Molly    4.2     17
Ashes    4.5     25
Charlie  4.8     33
Smudge   4.9     42
Felix    5.0     50
Puss     5.1     58
Millie   5.4     67
Alfie    5.5     75
Misty    5.7     83
Oscar    6.1     100
Smokey   6.1     100

Question: how can this be done in pandas?

Is there a way to do this using only numpy and pandas?
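A pandas-only sketch, assuming the `df` from the setup above: `Series.rank(method='max', pct=True)` gives, for each row, the fraction of rows with a value less than or equal to it, which is exactly what SQL's `cume_dist()` returns (tied rows share the highest rank, so both 6.1 rows get 100%).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

out = df[['name', 'weight']].copy()
# rank(method='max', pct=True) == share of rows with weight <= this row's
# weight, i.e. SQL's cume_dist(); ties all receive the same (highest) rank.
out['percent'] = (df['weight'].rank(method='max', pct=True) * 100).round().astype(int)
out = out.sort_values('weight').reset_index(drop=True)
print(out)
```

This reproduces the desired output table, including the tied 6.1 rows both at 100.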


Tags: name, pandas, sql, dist, as, black, weight, persian
2 Answers

Here is a Python (PySpark) version:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Define two windows for cumulating weight
win = Window().orderBy('weight') # rolling sum window
win2 = Window().orderBy(F.lit(1)) # total sum window

# get cumulative distribution
df = df.withColumn('cume_dist', F.sum('weight').over(win)*100./F.sum('weight').over(win2))
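For comparison, the same cumulative-weight computation (a running sum of `weight` over the total weight, which is not the count-based `cume_dist()`) can be sketched in plain pandas; the column name `pct_of_total` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

s = df.sort_values('weight').reset_index(drop=True)
# Running sum of weight divided by the total; note this differs from the
# SQL window above on ties, where the range frame includes all peer rows.
s['pct_of_total'] = s['weight'].cumsum() * 100.0 / s['weight'].sum()
print(s[['name', 'weight', 'pct_of_total']])
```

Note that this measures cumulative share of total weight, so its percentages differ from the cume_dist percentages in the question.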

Create the Spark DataFrame

from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, IntegerType)

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")

Use the SQL window function on the Spark DataFrame

from pyspark.sql.window import Window
from pyspark.sql.functions import cume_dist
from pyspark.sql.types import IntegerType

w = Window.orderBy(sdf['weight'])

sdf.select("weight", (cume_dist().over(w) * 100).cast(
    IntegerType()).alias("percentile")).show()

Output

+------+----------+
|weight|percentile|
+------+----------+
|   3.8|         8|
|   4.2|        16|
|   4.5|        25|
|   4.8|        33|
|   4.9|        41|
|   5.0|        50|
|   5.1|        58|
|   5.4|        66|
|   5.5|        75|
|   5.7|        83|
|   6.1|       100|
|   6.1|       100|
+------+----------+
