Pandasql with conditions

Posted 2024-10-16 17:24:22

I have two dataframes:

  • First, I have student information. I call it df1

    user_id  | plan      | subplan | matrix_code | student_semester 
    102532   | GADMSSP   | GSP10   | 1501        | 8
    106040   | GRINTSP   |         | 1901        | 4
    106114   | GCSOSSULA |         | 1901        | 4
    106504   | GCSOSSP   |         | 1902        | 3
    106664   | GCINESP   |         | 1901        | 4
    
  • Second, I have an institution's elective course requirements. I call it df2

    plan      | subplan | matrix_code | semester  | credits| cumulative_credits
    
    GADMSSP   |         | 1501        | 5         | 4      | 4 
    GADMSSP   |         | 1501        | 6         | 4      | 8
    GADMSSP   |         | 1501        | 7         | 4      | 12 
    GADMSSP   |         | 1501        | 8         | 0      | 12
    
    GRINTSP   |         | 1901        | 7         | 2      | 2
    GRINTSP   |         | 1901        | 8         | 0      | 2
    
    GCSOSSULA |         | 1901        | 3         | 4      | 4
    GCSOSSULA |         | 1901        | 4         | 0      | 4
    GCSOSSULA |         | 1901        | 5         | 0      | 4
    GCSOSSULA | GSUL5   | 1901        | 5         | 4      | 8
    GCSOSSULA |         | 1901        | 6         | 0      | 4
    GCSOSSULA | GSUL5   | 1901        | 6         | 0      | 8
    GCSOSSULA |         | 1901        | 7         | 0      | 4
    GCSOSSULA | GSUL5   | 1901        | 7         | 0      | 8
    GCSOSSULA |         | 1901        | 8         | 0      | 4
    GCSOSSULA | GSUL5   | 1901        | 8         | 0      | 8
    
    GCSOSSP   |         | 1902        | 5         | 4      | 4
    GCSOSSP   |         | 1902        | 6         | 4      | 8
    GCSOSSP   |         | 1902        | 7         | 4      | 12
    GCSOSSP   |         | 1902        | 8         | 0      | 12
    
    GCINESP   |         | 1901        | 2         | 4      | 4
    GCINESP   |         | 1901        | 3         | 4      | 8
    GCINESP   |         | 1901        | 4         | 4      | 12
    GCINESP   |         | 1901        | 5         | 4      | 16
    GCINESP   |         | 1901        | 6         | 4      | 24
    GCINESP   |         | 1901        | 7         | 4      | 32
    GCINESP   |         | 1901        | 8         | 4      | 40
    
    

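So I have to merge the dfs subject to a few conditions:

  1. plan and matrix_code must be the same in df1 and df2

  2. Either df1.subplan equals df2.subplan, or df2.subplan is empty. So user_id 102532 in row 1 of df1 gets the requirements where df2.subplan is null, because no specific subplan requirement is given for this plan and matrix_code

  3. Take student_semester + 1, but treat max df2.semester as the upper limit for student_semester. So user_id 102532 in row 1 must stay in semester 8. I cannot add +1 semester for this one, but I want to flag that this is a user who has not met the requirements and is in the last semester

  4. I am only interested in the cumulative credits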
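
The four conditions above can be sketched in plain pandas. This is a minimal sketch, not the asker's final code: it uses a truncated sample of the two tables (two plans only), and the `_student` / `_matrix` suffixes and the `next_semester` column name are assumptions made for illustration.

```python
import pandas as pd

# Truncated sample of the tables above (two plans only, to keep it short).
df1 = pd.DataFrame({
    "user_id": [102532, 106114],
    "plan": ["GADMSSP", "GCSOSSULA"],
    "subplan": ["GSP10", ""],
    "matrix_code": [1501, 1901],
    "student_semester": [8, 4],
})
df2 = pd.DataFrame({
    "plan": ["GADMSSP"] * 4 + ["GCSOSSULA"] * 4,
    "subplan": ["", "", "", "", "", "", "", "GSUL5"],
    "matrix_code": [1501] * 4 + [1901] * 4,
    "semester": [5, 6, 7, 8, 3, 4, 5, 5],
    "cumulative_credits": [4, 8, 12, 12, 4, 4, 4, 8],
})

# Condition 1: plan and matrix_code must match.
merged = df1.merge(df2, on=["plan", "matrix_code"],
                   suffixes=("_student", "_matrix"))

# Condition 2: requirement has no subplan, or it equals the student's subplan.
merged = merged[(merged["subplan_matrix"] == "")
                | (merged["subplan_matrix"] == merged["subplan_student"])].copy()

# Condition 3: next semester, capped at the last semester listed for the plan.
last_semester = merged.groupby(["plan", "matrix_code"])["semester"].transform("max")
merged["next_semester"] = (merged["student_semester"] + 1).clip(upper=last_semester)

# Condition 4: keep only the cumulative credits of that semester.
result = merged.loc[merged["semester"] == merged["next_semester"],
                    ["user_id", "plan", "subplan_student", "matrix_code",
                     "student_semester", "next_semester", "cumulative_credits"]]
result = result.reset_index(drop=True)
print(result)
```

Because the last step keeps only rows where a requirement exists for the target semester, a student whose next semester comes before the first listed semester of the plan would simply drop out of `result`.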

For these two dfs, the result should look like this:

user_id  | plan      | subplan | matrix_code | semester | student_semester | cumulative_credits
102532   | GADMSSP   | GSP10   | 1501        | 8        | 9                | 12               
106040   | GRINTSP   |         | 1901        | 5        | 4                | 0
106114   | GCSOSSULA |         | 1901        | 5        | 4                | 4
106504   | GCSOSSP   |         | 1902        | 4        | 3                | 0
106664   | GCINESP   |         | 1901        | 5        | 4                | 16

However, if it is not possible to include the students with 0 cumulative credits, the result should be:

user_id  | plan      | subplan | matrix_code | semester | student_semester | cumulative_credits
102532   | GADMSSP   | GSP10   | 1501        | 8        | 9                | 12               
106114   | GCSOSSULA |         | 1901        | 5        | 4                | 4
106664   | GCINESP   |         | 1901        | 5        | 4                | 16
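
The difference between the two expected results comes down to whether students without a requirement row for their target semester are kept. A minimal sketch, assuming the target semester has already been computed into a hypothetical `next_semester` column and using a small hand-made subset of the data:

```python
import pandas as pd

# Hypothetical prepared inputs: one row per student with the target semester,
# plus a few requirement rows taken from df2.
students = pd.DataFrame({
    "user_id": [106040, 106114],
    "plan": ["GRINTSP", "GCSOSSULA"],
    "matrix_code": [1901, 1901],
    "next_semester": [5, 5],
})
requirements = pd.DataFrame({
    "plan": ["GRINTSP", "GRINTSP", "GCSOSSULA"],
    "matrix_code": [1901, 1901, 1901],
    "semester": [7, 8, 5],
    "cumulative_credits": [2, 2, 4],
})

# A left merge keeps students that have no requirement row for that semester.
with_zeros = students.merge(
    requirements,
    how="left",
    left_on=["plan", "matrix_code", "next_semester"],
    right_on=["plan", "matrix_code", "semester"],
)
with_zeros["cumulative_credits"] = (
    with_zeros["cumulative_credits"].fillna(0).astype(int)
)

# Filtering on the credits gives the shorter result without 0-credit students.
without_zeros = with_zeros[with_zeros["cumulative_credits"] > 0]
print(with_zeros[["user_id", "next_semester", "cumulative_credits"]])
print(without_zeros[["user_id", "next_semester", "cumulative_credits"]])
```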

This is what I have done so far:

pip install -U pandasql

import pandas as pd
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())


df2 = df2.groupby(['plan', 'subplan', 'matrix_code', 'semester']).cumulative_credits.max()
df2 = df2.to_frame()
df2 = df2.reset_index()
electives = """ 
SELECT user_id
  ,a.plan
  ,a.subplan as "student_subplan"
  ,a.matrix_code
  ,a.student_semester
  ,b.subplan as "matrix_subplan"
  ,b.semester
  ,cumulative_credits
FROM df1 a
LEFT JOIN df2 b 
ON a.plan = b.plan
AND a.matrix_code = b.matrix_code
WHERE (b.subplan = '' OR a.subplan = b.subplan)
""" 
electives = pysqldf(electives)
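
One thing worth checking in a query like the one above: a WHERE filter on columns of the right-hand table discards the unmatched rows that the LEFT JOIN would otherwise keep, because comparisons against NULL are not true. The sketch below uses the standard-library `sqlite3` module (pandasql runs on SQLite by default) with made-up plans `PLAN_A` and `PLAN_B` to compare filtering in WHERE with filtering in ON.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE df1 (user_id INTEGER, plan TEXT, subplan TEXT);
CREATE TABLE df2 (plan TEXT, subplan TEXT, cumulative_credits INTEGER);
-- PLAN_A / PLAN_B are hypothetical; PLAN_B has no requirement rows.
INSERT INTO df1 VALUES (1, 'PLAN_A', ''), (2, 'PLAN_B', '');
INSERT INTO df2 VALUES ('PLAN_A', '', 4);
""")

# Filter in WHERE: the unmatched student (user 2) is dropped.
where_rows = con.execute("""
    SELECT a.user_id, b.cumulative_credits
    FROM df1 a LEFT JOIN df2 b ON a.plan = b.plan
    WHERE (b.subplan = '' OR a.subplan = b.subplan)
    ORDER BY a.user_id
""").fetchall()

# Filter in ON: the unmatched student is kept, with NULL credits.
on_rows = con.execute("""
    SELECT a.user_id, b.cumulative_credits
    FROM df1 a LEFT JOIN df2 b
      ON a.plan = b.plan AND (b.subplan = '' OR a.subplan = b.subplan)
    ORDER BY a.user_id
""").fetchall()

print(where_rows)  # [(1, 4)]
print(on_rows)     # [(1, 4), (2, None)]
```

Which behaviour is wanted depends on whether students without any matching requirement should appear in the output.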

Then I tried to implement the third condition, but I have no clue about the right way to do it. I think I could use a lambda, but I don't know how

df_s['semester_x'] = df_s['student_semester'] +1 | df_s['student_semester'] == df_s['semester'].max()
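
The line above does not do what was intended: Python evaluates `+` first, then `|`, then `==`, so the expression becomes `((student_semester + 1) | student_semester) == semester.max()` and yields a boolean column. A vectorised alternative, sketched here on a hypothetical `df_s` with one row per student and requirement semester, is `Series.where`; grouping by `user_id` to find the last semester is an assumption made for this small example.

```python
import pandas as pd

# Hypothetical merged frame: one row per (student, requirement semester).
df_s = pd.DataFrame({
    "user_id": [102532, 102532, 106664, 106664],
    "student_semester": [8, 8, 4, 4],
    "semester": [7, 8, 5, 8],
})

# Last requirement semester available to each student.
last_semester = df_s.groupby("user_id")["semester"].transform("max")

# Keep student_semester where it already equals the last semester,
# otherwise move on to the next semester.
df_s["semester_x"] = df_s["student_semester"].where(
    df_s["student_semester"] == last_semester,
    df_s["student_semester"] + 1,
)
print(df_s)
```

No lambda is needed, since `where` applies the condition to the whole column at once.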

Also, if there is a better way to do the earlier condition steps using merge conditions, that would be even better

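Edit - Solution:

I used part of Parfait's solution. I just added conditional logic to get the cumulative credits for the student's next semester, instead of the maximum cumulative credits for the matrix code

Here is what I did:

Part one - Parfait's solution: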

# Merge on the shared keys; df2 columns that clash with df1 get a "_" suffix
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan': 'student_subplan', 'subplan_': 'matrix_subplan', 'semester': 'matrix_semester'}, axis='columns')
      )

Part two:

y = """
with a as 
(
SELECT DISTINCT plan
  ,CASE 
        WHEN plan LIKE '%SULB%' OR plan LIKE '%SULC%' THEN 10
        WHEN plan LIKE '%SULD%' OR plan LIKE '%SULE%' THEN 12
        ELSE 8
      END as "semester_max"
FROM agg
)
SELECT DISTINCT 
  user_id
  ,student_semester
  ,plan
  ,student_subplan
  ,matrix_code
  ,matrix_subplan
  ,cumulative_credits
  ,matrix_semester
  ,semester_max
  ,CASE 
      WHEN student_semester < semester_max THEN (student_semester)+1
      WHEN student_semester = semester_max THEN student_semester
  END as "next_semester"
FROM 
(
    SELECT DISTINCT 
      user_id
      ,student_semester
      ,b.plan
      ,student_subplan
      ,matrix_code
      ,matrix_subplan
      ,cumulative_credits
      ,matrix_semester
      ,semester_max
    FROM agg b
    INNER JOIN a ON b.plan = a.plan
) x
WHERE matrix_semester = next_semester
"""
z = pysqldf(y)


1 answer

Answer 1 · posted 2024-10-16 17:24:22

Consider adding a CASE statement to the SQL query:

SELECT d1.user_id
       , d1.plan
       , d1.subplan AS student_subplan
       , d1.matrix_code
       , d1.student_semester
       , d2.subplan AS matrix_subplan
       , CASE 
             WHEN d1.student_semester = MAX(d2.semester)
             THEN d1.student_semester
             ELSE d1.student_semester + 1
         END AS semester
       , MAX(d2.cumulative_credits) AS cumulative_credits
FROM df1 d1
LEFT JOIN df2 d2 
   ON d1.plan = d2.plan
   AND d1.matrix_code = d2.matrix_code
WHERE (d2.subplan IS NULL OR d1.subplan = d2.subplan)
GROUP BY d1.user_id
       , d1.plan
       , d1.subplan
       , d1.matrix_code
       , d1.student_semester
       , d2.subplan;


In pandas, the translation would use merge + groupby + Series.where for the CASE conditional logic:

# MERGE
agg = (pd.merge(df1, df2, on=['plan', 'matrix_code'], suffixes=["", "_"])
         .fillna('')
         .query("(subplan_ == '') | (subplan == subplan_)")
         .rename({'subplan':'student_subplan', 'subplan_':'matrix_subplan'}, axis='columns')
      )

# AGGREGATION
agg = (agg.groupby(['user_id', 'plan', 'student_subplan', 'matrix_code', 
                    'student_semester', 'matrix_subplan'], as_index=False)
          .agg({'semester':'max', 'cumulative_credits':'max'})
      )

# CONDITIONAL LOGIC
agg['semester'] = agg['student_semester'].where(agg['semester'] == agg['student_semester'], 
                                                agg['student_semester'].add(1))

agg
#    user_id       plan student_subplan  matrix_code  student_semester matrix_subplan  semester  cumulative_credits
# 0   102532    GADMSSP           GSP10         1501                 8                        8                  12
# 1   106040    GRINTSP                         1901                 4                        5                   2
# 2   106114  GCSOSSULA                         1901                 4                        5                   4
# 3   106504    GCSOSSP                         1902                 3                        4                  12
# 4   106664    GCINESP                         1901                 4                        5                  40
