我用Python和Pandas编写了一个数据管道。它起作用了,我很喜欢它。Python不错。在这一点上,对熊猫不太好。然而,作为一名Python的学生,我总是希望自己更地道,而不是依赖于可能在其他语言中使用的循环和逻辑。具体地说,我总是惊讶于如何使用Pandas方法迭代序列和帧,而不必围绕它们构建循环。你知道吗
这是源文件。我还有来自API调用的其他人口统计信息。这段代码在我的示例中被截断了,但是假设data['record']包含我的人口统计数据,我可以使用pd.合并(). 你知道吗
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | |100 |1 |LNF |2017 |29 |9/11/2017 |14 | | | | | | |
| | | |100 |1 |LSF |2017 |31 |9/11/2017 |51 | | | | | | |
| | | |100 |1 |M-COMP|2017 |8 |9/20/2017 |48 |15 |2/5/2018 |17 |42 |5/8/2018 |65 |
| | | |100 |1 |MNM |2017 |9 |9/11/2017 |36 |7 |2/1/2018 |3 |5 |5/8/2018 |1 |
| | | |100 |1 |NIM |2017 |35 |9/11/2017 |34 |62 |2/1/2018 |52 |51 |5/8/2018 |18 |
| | | |100 |1 |NWF |2017 |28 |9/11/2017 |37 |69 |2/1/2018 |71 |31 |5/8/2018 |5 |
| | | |100 |1 |OCM |2017 |62 |9/11/2017 |30 |89 |2/1/2018 |58 |94 |5/8/2018 |51 |
| | | |100 |1 |PSF |2017 |14 |9/11/2017 |10 | | | |49 |5/8/2018 |33 |
| | | |100 |1 |QDM |2017 |23 |9/11/2017 |57 |31 |2/1/2018 |46 |26 |5/8/2018 |15 |
| | | |100 |1 |R-CBM |2017 |8 |9/11/2017 |36 |17 |2/1/2018 |22 |29 |5/8/2018 |15 |
| | | |200 |1 |LNF |2017 |47 |9/11/2017 |51 | | | | | | |
| | | |200 |1 |LSF |2017 |47 |9/11/2017 |86 | | | | | | |
| | | |200 |1 |M-COMP|2017 |27 |9/22/2017 |92 |34 |2/2/2018 |67 |47 |5/8/2018 |88 |
| | | |200 |1 |MNM |2017 |11 |9/11/2017 |48 |23 |2/1/2018 |80 |21 |5/9/2018 |55 |
| | | |200 |1 |NIM |2017 |56 |9/11/2017 |81 |80 |2/1/2018 |95 |80 |5/9/2018 |92 |
| | | |200 |1 |NWF |2017 |63 |9/11/2017 |87 | | | | | | |
| | | |200 |1 |OCM |2017 |107 |9/11/2017 | |109 |2/1/2018 | |109 |5/9/2018 | |
| | | |200 |1 |PSF |2017 |50 |9/11/2017 |73 | | | | | | |
| | | |200 |1 |QDM |2017 |28 |9/11/2017 |75 |38 |2/1/2018 |78 |40 |5/9/2018 |84 |
| | | |200 |1 |R-CBM |2017 |40 |9/11/2017 |80 |76 |2/1/2018 |80 |84 |5/9/2018 |65 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
因为这是去,我需要分为两个文件,一个用于读取数据和一个用于数学数据。它进入哪个文件是由GOM列决定的。大约一半的GOM值用于阅读,另一半用于数学。你知道吗
下面是两个输出文件的样子。阅读:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |LNF 1|LNF 2|LNF 3 |LSF 1|LSF 2|LSF 3 |PSF 1|PSF 2|PSF 3 |NWF 1|NWF 2|NWF 3 |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017 |29 |14 |Level 2 |31 |51 |Level 3 |14 |10 |Level 1 |28 |37 |Level 3|8 |36 |Level 3| | | |
|100 | | |1 |Spring|5/8/2018 | | | | | | |49 |33 |Level 3 |31 |5 |Level 1|29 |15 |Level 2| | | |
|100 | | |1 |Winter|2/1/2018 | | | | | | | | | |69 |71 |Level 3|17 |22 |Level 2| | | |
|200 | | |1 |Fall |9/11/2017 |47 |51 |Level 3 |47 |86 |Level 4 |50 |73 |Level 3 |63 |87 |Level 4|40 |80 |Level 4| | | |
|200 | | |1 |Spring|5/9/2018 | | | | | | | | | | | | |84 |65 |Level 3| | | |
|200 | | |1 |Winter|2/1/2018 | | | | | | | | | | | | |76 |80 |Level 4| | | |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
和数学:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |OCM 1|OCM 2|OCM 3 |NIM 1|NIM 2|NIM 3 |QDM 1|QDM 2|QDM 3 |MNM 1|MNM 2|MNM 3 |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017|62 |30 |Level 3|35 |34 |Level 3|23 |57 |Level 3|9 |36 |Level 3|8 |48 |Level 3 | | | |
|100 | | |1 |Spring|5/8/2018 |94 |51 |Level 3|51 |18 |Level 2|26 |15 |Level 2|5 |1 |Level 1|42 |65 |Level 3 | | | |
|100 | | |1 |Winter|2/1/2018 |89 |58 |Level 3|62 |52 |Level 3|31 |46 |Level 3|7 |3 |Level 1|15 |17 |Level 2 | | | |
|200 | | |1 |Fall |9/11/2017|107 | | |56 |81 |Level 4|28 |75 |Level 3|11 |48 |Level 3|27 |92 |Level 5 | | | |
|200 | | |1 |Spring|5/8/2018 |109 | | |80 |92 |Level 5|40 |84 |Level 4|21 |55 |Level 3|47 |88 |Level 4 | | | |
|200 | | |1 |Winter|2/1/2018 |109 | | |80 |95 |Level 5|38 |78 |Level 4|23 |80 |Level 4|34 |67 |Level 3 | | | |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
最棘手的事情是为每个主题找到合适的日期。孩子们大致在同一天吃每一个GOM,但出来的文件给出了每个GOM的不同日期。当数据进入我们的另一个系统时,每个主题只有一个日期,它由几个gom组成。我求助于编写一个函数来遍历它们,但这可能更容易做到。你知道吗
通常,我一次只运行一个赛季的脚本,然后对其他赛季进行评论,但我现在正在导入去年的数据,同时执行所有三个赛季。你知道吗
输出列的快速解释。对于每个GOM,1是数字分数,2是百分位数,3是绩效水平的字母描述。这是一种可怕的文件组织方式,但这正是供应商所需要的。你知道吗
我希望用更少的步骤或者至少用一种更惯用的方式来实现这一点。我发现自己撤销了pivotètable生成的一些内容,只是为了能够以我知道的方式再次使用df,这仍然非常有限。你知道吗
我最担心的是第65-77行,在那里我融化和旋转桌子。不确定是否可以更有效地完成。而且,从Python的角度来看,第108-111行非常愚蠢。我只是不想第二次列出GOM,我希望能列出LNF 1,LNF 2,LNF 3,LSF 1。。。快速无需重新键入。一定有更好的办法。你知道吗
请让我知道,如果你有任何建议,如何改进这个过程,无论是直接Python或(特别是)熊猫。你知道吗
import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools
# build of list of seasons to process. usually only one.
seasons = []
seasons.append('Fall')
seasons.append('Winter')
seasons.append('Spring')
# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])
# iterate through score files
for file in glob.glob('To Do/*.csv'):
# read score data into dataframe
aimsweb = pd.read_csv(file)
aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y');
aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y');
aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y');
# perform left outer join of score data to demographics. this ensures these are real students.
merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)
renames = {}
# rename table
renames['Fall_September'] = 'Fall_1'
renames['Fall_SeptemberDateGiven'] = 'Fall_Date'
renames['Fall_SeptemberNationalPercentileRank'] = 'Fall_2'
renames['Winter_January'] = 'Winter_1'
renames['Winter_JanuaryDateGiven'] = 'Winter_Date'
renames['Winter_JanuaryNationalPercentileRank'] = 'Winter_2'
renames['Spring_May'] = 'Spring_1'
renames['Spring_MayDateGiven'] = 'Spring_Date'
renames['Spring_MayNationalPercentileRank'] = 'Spring_2'
merged.rename(index=str, columns=renames, inplace=True)
# function for converting percentile into level 1-5
def percentile2level(percentile):
if percentile >= 91:
return 'Level 5'
if percentile >= 76:
return 'Level 4'
if percentile >= 26:
return 'Level 3'
if percentile >= 11:
return 'Level 2'
if percentile >= 0:
return 'Level 1'
# convert percentile to level
for season in ['Fall', 'Winter', 'Spring']:
merged[season + '_3'] = merged[season + '_2'].apply(percentile2level);
# melt data from columns into rows
merged = merged.melt(id_vars = ['student_number', 'GOM', 'last_name', 'first_name', 'Grade'],value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])
# split the variable into season and attribute
merged['Season'], merged['Attribute'] = merged['variable'].str.split('_', 1).str
# pivot data back to columns
merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')
# condense the levels and reset index to get back to flat df
merged.columns = merged.columns.droplevel(0)
merged.columns = [' '.join(col).strip() for col in merged.columns.values]
merged.reset_index(inplace=True);
# build subject-GOM mappings
subjects = {};
subjects['Reading'] = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
subjects['Mathematics'] = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']
# function for finding the min non-null date from a list
def mindate(row, headings):
dates = list(row[headings])
dates = list(filter(None, dates))
if len(dates) > 0:
return min(dates)
return "";
# iterate through two subjects
for subject in subjects:
# build list of date headers for all GOMs
headings = [];
for gom in subjects[subject]:
headings.append(gom + " Date");
# create df for this subject
df = merged
# create date column with value of minimum from list of dates
df['Date'] = df.apply(mindate, axis=1, args=(headings,))
# filter out records with no real date
df = df[ df['Date'] != 'NaT' ]
# build a list of GOM headers with 1, 2, 3
headings = [];
for combo in list(itertools.product( subjects[subject], [1,2,3] )):
headings.append( combo[0] + " " + str(combo[1]) )
# save csv
df[ df['Season'].isin(seasons) ][ ['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv", index=False);
目前没有回答
相关问题 更多 >
编程相关推荐