百分位_近似值不适用于Pyspark,返回相同的值

2024-05-06 08:56:46 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图得到25、50和75个百分位数,但我的代码工作得不太好,我试图将一些代码从SAS翻译成pyspark,但这让我抓狂

我试图以1,2的比例删除我想要的列名称的组,但无论如何都不起作用

spark.sql('select distinct tAuxiliar.canalidad, \
                          case \
                              when tAuxiliar.percentil < 0.25 then "p25" \
                              when tAuxiliar.percentil >= 0.25 and tAuxiliar.percentil < 0.50 then "p50" \
                              when tAuxiliar.percentil >= 0.50 and tAuxiliar.percentil < 0.75 then "p75" \
                          end as grupo, \
                          max(corte_max) as percentil \
                   from (select "02_Ambos" as canalidad, \
                                 tPercen.percentil, \
                                 min(tPercen.operaciones) as corte_min, \
                                 max(tPercen.operaciones) as corte_max \
                         from (select operaciones, \
                                      percentile_approx(operaciones, 1) as percentil \
                               from base_canalidad_unica \
                               where canalidad = "02_Ambos" \
                               group by operaciones) as tPercen group by operaciones,percentil \
                               ) as tAuxiliar \
                                   where tAuxiliar.percentil < 0.75 \
                                       group by distinct tAuxiliar.canalidad , grupo')

原始SAS查询:

      select distinct
            tAuxiliar.canalidad,
            case 
                when tAuxiliar.percentil < 25 then 'p25'
                when tAuxiliar.percentil >= 25  and tAuxiliar.percentil < 50 then 'p50'
                when tAuxiliar.percentil >= 50 and tAuxiliar.percentil < 75 then 'p75'
            end as grupo,
            max(corte_max) as percentil
        From
        (       
                select '00_Solo_Internet' as canalidad, 
                        tPercen.percentil, 
                        min(tPercen.operaciones) as corte_min, 
                        max(tPercen.operaciones) as corte_max
                from
                (
                    select operaciones, 
                           Quantile(100,operaciones) as 
                           percentil
                    from BASE_CANALIDAD_UNICA
                    where canalidad = '00_Solo_Internet'
                ) as tPercen
                group by 1,2
            )
            as tAuxiliar
        where tAuxiliar.percentil < 75
        group by 1,2

Tags: andfrombyasgroupselectmaxwhen