Python data pipeline with merge, melt, and pivot_table

Posted 2024-09-27 07:34:31


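I wrote a data pipeline with Python and Pandas. It works, and I like it quite a bit. I'm decent with Python, but not so good with Pandas at this point. As a student of Python, I always want to be more idiomatic rather than rely on loops and logic I might use in other languages. Specifically, I'm always amazed at how Pandas methods can iterate over Series and DataFrames without my having to build loops around them.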

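This is the source file. I also have additional demographic information from an API call. That code is truncated from my example, but assume data['record'] contains my demographic data, which I can join with pd.merge().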

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM   |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                |                 |               |100             |1    |LNF   |2017      |29            |9/11/2017              |14                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |100             |1    |LSF   |2017      |31            |9/11/2017              |51                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |100             |1    |M-COMP|2017      |8             |9/20/2017              |48                                  |15            |2/5/2018               |17                                  |42        |5/8/2018           |65                              |
|                |                 |               |100             |1    |MNM   |2017      |9             |9/11/2017              |36                                  |7             |2/1/2018               |3                                   |5         |5/8/2018           |1                               |
|                |                 |               |100             |1    |NIM   |2017      |35            |9/11/2017              |34                                  |62            |2/1/2018               |52                                  |51        |5/8/2018           |18                              |
|                |                 |               |100             |1    |NWF   |2017      |28            |9/11/2017              |37                                  |69            |2/1/2018               |71                                  |31        |5/8/2018           |5                               |
|                |                 |               |100             |1    |OCM   |2017      |62            |9/11/2017              |30                                  |89            |2/1/2018               |58                                  |94        |5/8/2018           |51                              |
|                |                 |               |100             |1    |PSF   |2017      |14            |9/11/2017              |10                                  |              |                       |                                    |49        |5/8/2018           |33                              |
|                |                 |               |100             |1    |QDM   |2017      |23            |9/11/2017              |57                                  |31            |2/1/2018               |46                                  |26        |5/8/2018           |15                              |
|                |                 |               |100             |1    |R-CBM |2017      |8             |9/11/2017              |36                                  |17            |2/1/2018               |22                                  |29        |5/8/2018           |15                              |
|                |                 |               |200             |1    |LNF   |2017      |47            |9/11/2017              |51                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |LSF   |2017      |47            |9/11/2017              |86                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |M-COMP|2017      |27            |9/22/2017              |92                                  |34            |2/2/2018               |67                                  |47        |5/8/2018           |88                              |
|                |                 |               |200             |1    |MNM   |2017      |11            |9/11/2017              |48                                  |23            |2/1/2018               |80                                  |21        |5/9/2018           |55                              |
|                |                 |               |200             |1    |NIM   |2017      |56            |9/11/2017              |81                                  |80            |2/1/2018               |95                                  |80        |5/9/2018           |92                              |
|                |                 |               |200             |1    |NWF   |2017      |63            |9/11/2017              |87                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |OCM   |2017      |107           |9/11/2017              |                                    |109           |2/1/2018               |                                    |109       |5/9/2018           |                                |
|                |                 |               |200             |1    |PSF   |2017      |50            |9/11/2017              |73                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |QDM   |2017      |28            |9/11/2017              |75                                  |38            |2/1/2018               |78                                  |40        |5/9/2018           |84                              |
|                |                 |               |200             |1    |R-CBM |2017      |40            |9/11/2017              |80                                  |76            |2/1/2018               |80                                  |84        |5/9/2018           |65                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
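For the join against the demographic data, here is a minimal sketch, not the actual pipeline. The two toy frames and all their values are made up; only the column names follow the source file above and the `demographics` frame in the script further down. It assumes the goal is to keep only score rows that match a known student, which is what the `_merge` column produced by `indicator=True` makes possible.

```python
import pandas as pd

# Toy stand-ins for the score file and the API demographics (values are made up).
scores = pd.DataFrame({
    "UniqueIdentifier": [100, 200, 999],
    "GOM": ["LNF", "LNF", "LNF"],
    "Fall_September": [29, 47, 12],
})
demographics = pd.DataFrame({
    "student_number": [100, 200],
    "last_name": ["Doe", "Roe"],
    "first_name": ["Jane", "Richard"],
})

# A left join keeps every score row; indicator=True adds a "_merge" column
# that records whether a matching student was found.
merged = scores.merge(
    demographics,
    how="left",
    left_on="UniqueIdentifier",
    right_on="student_number",
    indicator=True,
)

# Keep only the rows that matched a student, then drop the helper column.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(matched)
```

A left join on its own does not filter anything, so the explicit `_merge == "both"` filter (or an inner join) is what actually removes unmatched rows.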

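Because of where this data is going, I need to split it into two files, one for reading data and one for math data. Which file a row goes into is determined by the GOM column. About half of the GOM values are for reading and the other half are for math.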
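One way to express the reading/math split, sketched under assumptions: the subject lists are the ones from the script further down, while the `scores` frame, its values, and the `gom_to_subject` and `by_subject` names are made up for illustration.

```python
import pandas as pd

# Subject-to-GOM lists, as defined later in the script.
subjects = {
    "Reading": ["LNF", "LSF", "PSF", "NWF", "R-CBM", "MAZE"],
    "Mathematics": ["OCM", "NIM", "QDM", "MNM", "M-COMP", "M-CAP"],
}

# Invert the mapping so each GOM points at its subject.
gom_to_subject = {gom: subject for subject, goms in subjects.items() for gom in goms}

# Toy score rows (values are made up).
scores = pd.DataFrame({
    "UniqueIdentifier": [100, 100, 200],
    "GOM": ["LNF", "OCM", "R-CBM"],
    "Fall_September": [29, 62, 40],
})

# Label each row with its subject, then split into one frame per subject.
scores["Subject"] = scores["GOM"].map(gom_to_subject)
by_subject = {subject: part for subject, part in scores.groupby("Subject")}

for subject, part in by_subject.items():
    print(subject, part["GOM"].tolist())
```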

Here is what the two output files look like. Reading:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date      |LNF 1|LNF 2|LNF 3    |LSF 1|LSF 2|LSF 3   |PSF 1|PSF 2|PSF 3   |NWF 1|NWF 2|NWF 3  |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100           |         |          |1    |Fall  |9/11/2017 |29   |14   |Level 2  |31   |51   |Level 3 |14   |10   |Level 1 |28   |37   |Level 3|8      |36     |Level 3|      |      |      |
|100           |         |          |1    |Spring|5/8/2018  |     |     |         |     |     |        |49   |33   |Level 3 |31   |5    |Level 1|29     |15     |Level 2|      |      |      |
|100           |         |          |1    |Winter|2/1/2018  |     |     |         |     |     |        |     |     |        |69   |71   |Level 3|17     |22     |Level 2|      |      |      |
|200           |         |          |1    |Fall  |9/11/2017 |47   |51   |Level 3  |47   |86   |Level 4 |50   |73   |Level 3 |63   |87   |Level 4|40     |80     |Level 4|      |      |      |
|200           |         |          |1    |Spring|5/9/2018  |     |     |         |     |     |        |     |     |        |     |     |       |84     |65     |Level 3|      |      |      |
|200           |         |          |1    |Winter|2/1/2018  |     |     |         |     |     |        |     |     |        |     |     |       |76     |80     |Level 4|      |      |      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And math:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date     |OCM 1|OCM 2|OCM 3  |NIM 1|NIM 2|NIM 3  |QDM 1|QDM 2|QDM 3  |MNM 1|MNM 2|MNM 3  |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100           |         |          |1    |Fall  |9/11/2017|62   |30   |Level 3|35   |34   |Level 3|23   |57   |Level 3|9    |36   |Level 3|8       |48      |Level 3 |       |       |       |
|100           |         |          |1    |Spring|5/8/2018 |94   |51   |Level 3|51   |18   |Level 2|26   |15   |Level 2|5    |1    |Level 1|42      |65      |Level 3 |       |       |       |
|100           |         |          |1    |Winter|2/1/2018 |89   |58   |Level 3|62   |52   |Level 3|31   |46   |Level 3|7    |3    |Level 1|15      |17      |Level 2 |       |       |       |
|200           |         |          |1    |Fall  |9/11/2017|107  |     |       |56   |81   |Level 4|28   |75   |Level 3|11   |48   |Level 3|27      |92      |Level 5 |       |       |       |
|200           |         |          |1    |Spring|5/8/2018 |109  |     |       |80   |92   |Level 5|40   |84   |Level 4|21   |55   |Level 3|47      |88      |Level 4 |       |       |       |
|200           |         |          |1    |Winter|2/1/2018 |109  |     |       |80   |95   |Level 5|38   |78   |Level 4|23   |80   |Level 4|34      |67      |Level 3 |       |       |       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

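The trickiest part was finding the right date for each subject. Kids take every GOM on roughly the same day, but the source file gives a separate date for each GOM. When the data goes into our other system, there is only one date per subject, and a subject is made up of several GOMs. I resorted to writing a function that iterates through the dates, but there is probably an easier way.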
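A possible sketch of picking one date per row without a custom function. It assumes the per-GOM dates are strings in `%m/%d/%Y` format with empty strings for missing values; the `wide` frame and its values are made up. The dates are parsed before taking the minimum because a plain string comparison of `%m/%d/%Y` values is not chronological (for example, "02/01/2018" sorts before "09/11/2017").

```python
import pandas as pd

# Toy wide frame with one date column per GOM (values are made up).
wide = pd.DataFrame({
    "student_number": [100, 200],
    "LNF Date": ["09/11/2017", ""],
    "PSF Date": ["09/11/2017", "05/09/2018"],
    "R-CBM Date": ["09/20/2017", "05/08/2018"],
})
date_cols = ["LNF Date", "PSF Date", "R-CBM Date"]

# Parse each date column; empty or unparseable strings become NaT.
parsed = wide[date_cols].apply(
    lambda col: pd.to_datetime(col, format="%m/%d/%Y", errors="coerce")
)

# Row-wise minimum skips NaT by default; format back to the original style.
wide["Date"] = parsed.min(axis=1).dt.strftime("%m/%d/%Y")
print(wide[["student_number", "Date"]])
```

Rows where every GOM date is missing end up with a missing `Date`, which can then be filtered with `wide["Date"].notna()`.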

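Normally I run the script for only one season at a time and comment out the other seasons, but right now I am importing last year's data and doing all three seasons at once.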

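A quick explanation of the output columns: for each GOM, 1 is the numeric score, 2 is the percentile, and 3 is a text description of the performance level. It is a terrible way to organize a file, but it is what the vendor requires.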
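The percentile-to-level step (column 3) could be sketched with `pd.cut` instead of a row-wise `apply`. The band edges are taken from the `percentile2level` function in the script further down; the sample percentiles are made up, and this is an alternative sketch rather than the code the pipeline uses.

```python
import numpy as np
import pandas as pd

# Sample percentiles, including a missing value (values are made up).
percentiles = pd.Series([14, 51, 92, 5, np.nan, 76])

# Left-closed bins: [0, 11), [11, 26), [26, 76), [76, 91), [91, inf)
levels = pd.cut(
    percentiles,
    bins=[0, 11, 26, 76, 91, np.inf],
    right=False,
    labels=["Level 1", "Level 2", "Level 3", "Level 4", "Level 5"],
)
print(levels.tolist())
```

Missing percentiles stay missing, which matches the blank level cells in the sample output.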

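I would like to accomplish this in fewer steps, or at least in a more idiomatic way. I find myself undoing some of what pivot_table produces just so I can work with the df again in the ways I know, which are still very limited.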
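For reference, the melt, split, pivot, and flatten sequence can be sketched on a toy frame as follows. The frame and its values are made up; the column names follow the renamed columns in the script. Note that this sketch swaps in `DataFrame.pivot` instead of `pivot_table`, which assumes there is exactly one row per student, GOM, and season (it raises on duplicates).

```python
import pandas as pd

# Toy frame in the renamed wide layout (values are made up).
scores = pd.DataFrame({
    "student_number": [100, 100],
    "GOM": ["LNF", "PSF"],
    "Fall_1": [29, 14],
    "Fall_2": [14, 10],
    "Spring_1": [None, 49],
    "Spring_2": [None, 33],
})

# Columns to rows: one row per student, GOM, and season/attribute pair.
long = scores.melt(id_vars=["student_number", "GOM"])

# Split "Fall_1" into Season="Fall" and Attribute="1".
long[["Season", "Attribute"]] = long["variable"].str.split("_", n=1, expand=True)

# Rows back to columns, keyed by (GOM, Attribute).
wide = long.pivot(
    index=["student_number", "Season"],
    columns=["GOM", "Attribute"],
    values="value",
)

# Flatten the two-level column index into names like "LNF 1".
wide.columns = [" ".join(col) for col in wide.columns]
wide = wide.reset_index()
print(wide)
```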

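My biggest concern is lines 65-77 of the script below, where I melt and pivot the table. I am not sure whether that can be done more efficiently. Also, lines 108-111 are pretty silly from a Python perspective. I just didn't want to list the GOMs a second time, and I wanted a quick way to produce LNF 1, LNF 2, LNF 3, LSF 1... without retyping them. There has to be a better way.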
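As an illustration of the heading list, a nested list comprehension produces the same names as the `itertools.product` loop. The short `goms` list here is a hypothetical stand-in for one subject's GOM list.

```python
# Hypothetical stand-in for one subject's GOM list.
goms = ["LNF", "LSF", "PSF"]

# One heading per (GOM, suffix) pair, in GOM order.
headings = [f"{gom} {n}" for gom in goms for n in (1, 2, 3)]
print(headings)
```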

Please let me know if you have any suggestions for improving this process, whether in plain Python or (especially) Pandas.

import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools

# build a list of seasons to process.  usually only one.
seasons = []
seasons.append('Fall')
seasons.append('Winter')
seasons.append('Spring')

# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])

# iterate through score files
for file in glob.glob('To Do/*.csv'):
    # read score data into dataframe
    aimsweb = pd.read_csv(file)

    aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y')

    # perform left outer join of score data to demographics.  every score row is kept; the _merge column from indicator=True shows which rows matched a real student.
    merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)

    renames = {}
    # rename table
    renames['Fall_September'] = 'Fall_1'
    renames['Fall_SeptemberDateGiven'] = 'Fall_Date'
    renames['Fall_SeptemberNationalPercentileRank'] = 'Fall_2'

    renames['Winter_January'] = 'Winter_1'
    renames['Winter_JanuaryDateGiven'] = 'Winter_Date'
    renames['Winter_JanuaryNationalPercentileRank'] = 'Winter_2'

    renames['Spring_May'] = 'Spring_1'
    renames['Spring_MayDateGiven'] = 'Spring_Date'
    renames['Spring_MayNationalPercentileRank'] = 'Spring_2'    

    merged.rename(index=str, columns=renames, inplace=True)

    # function for converting percentile into level 1-5
    def percentile2level(percentile):
        if percentile >= 91:
            return 'Level 5'
        if percentile >= 76:
            return 'Level 4'
        if percentile >= 26:
            return 'Level 3'
        if percentile >= 11:
            return 'Level 2'
        if percentile >= 0:
            return 'Level 1'

    # convert percentile to level
    for season in ['Fall', 'Winter', 'Spring']:
        merged[season + '_3'] = merged[season + '_2'].apply(percentile2level)

    # melt data from columns into rows
    merged = merged.melt(id_vars = ['student_number', 'GOM', 'last_name', 'first_name', 'Grade'],value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])

    # split the variable into season and attribute
    merged[['Season', 'Attribute']] = merged['variable'].str.split('_', n=1, expand=True)

    # pivot data back to columns
    merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')

    # condense the levels and reset index to get back to flat df
    merged.columns = merged.columns.droplevel(0)
    merged.columns = [' '.join(col).strip() for col in merged.columns.values]
    merged.reset_index(inplace=True)

    # build subject-GOM mappings
    subjects = {}
    subjects['Reading'] = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
    subjects['Mathematics'] = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']   

    # function for finding the min non-null date from a list
    def mindate(row, headings):
        dates = list(row[headings])
        dates = list(filter(None, dates))
        if len(dates) > 0:
            return min(dates)
        return ""

    # iterate through two subjects
    for subject in subjects:
        # build list of date headers for all GOMs
        headings = []
        for gom in subjects[subject]:
            headings.append(gom + " Date")

        # create df for this subject
        df = merged.copy()

        # create date column with value of minimum from list of dates
        df['Date'] = df.apply(mindate, axis=1, args=(headings,))

        # filter out records with no real date
        df = df[ df['Date'] != 'NaT' ]

        # build a list of GOM headers with 1, 2, 3
        headings = []
        for combo in list(itertools.product( subjects[subject], [1,2,3] )):
            headings.append( combo[0] + " " + str(combo[1]) )

        # save csv
        df[ df['Season'].isin(seasons) ][ ['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv", index=False)
