Replace an entire DataFrame with another DataFrame (overwrite) (Python 3.4, pandas)

Published 2024-10-02 12:32:27


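Update to the update:

I did the following, and it worked:
1. Replaced the if-if-elif structure with if-elif-else (see below).
2. Compared dec as a string (i.e. dec == '1' rather than dec == 1).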

if len(SframeDup.index) > 0 and dec == '1':
    SframeDup.to_csv('NWEA CSVs/Students/StudentDuplicates.csv', sep=',')
    print ("%d instances of repeated student IDs detected." % len(SframeDup.index))
    print ("See StudentDuplicates.csv for duplicates.")
    print ("\nThis program will now stop.")
    raise SystemExit      

    # quit() and exit() work too, but only in the editor;
    # doing this in IPython Notebook will restart the kernel and require
    # re-running the preceding code
elif len(SframeDup.index) >0  and dec == '2':
    print ("%d instances of repeated student IDs detected." % len(SframeDup.index))
    print ("See StudentDuplicates.csv for duplicates.")
    Sframe['dup_check_1'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='first')
    Sframe['dup_check_2'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='last')
    Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
    del Sframe['dup_check_1'], Sframe['dup_check_2']

else:
    print ("No duplicates found. Oh yeah!")
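The second fix matters because in Python 3 input() always returns a str, and a str never compares equal to an int, so a test such as dec == 1 is always False. A minimal sketch of the comparison follows; classify_choice is a hypothetical helper introduced only for illustration and is not part of the original program.

```python
def classify_choice(dec):
    """Map the raw text typed at the prompt to an action name (hypothetical helper)."""
    choice = dec.strip()  # tolerate stray spaces around the typed digit
    if choice == '1':
        return 'stop'
    elif choice == '2':
        return 'drop_duplicates'
    return 'unrecognized'


# A string never equals an integer, which is why `dec == 1` never matched.
print('1' == 1)
print(classify_choice('2'))
```

Comparing against '1' and '2' keeps the prompt tolerant of non-numeric input; converting with int(dec) would also work but raises ValueError when the user types something that is not a number.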

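Update:

Although I've "moved on" as best I can, I'd still like to document this as fully as possible. I've pasted two sets of code; the first tries to use if/elif but fails to make Sframe drop the duplicates. The second successfully omits the duplicates, but to get there I had to remove the if/elif.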

[First code listing (the if/elif version) is missing from the source]

Output: 2840

import pandas as pd
import numpy as np
import glob
import csv
import os
import sys

path = r'NWEA CSVs/Students/Raw'
allFiles = glob.glob(path + "/*.csv")
Sframe = pd.DataFrame()

frames = []  # avoid shadowing the built-in name `list`
for csv_path in allFiles:
    sdf = pd.read_csv(csv_path, index_col=None, header=0)
    frames.append(sdf)
Sframe = pd.concat(frames, ignore_index=False)

Sframe.to_csv('NWEA CSVs/Students/OutStudents.csv', sep=',')

Sframe["TermSchoolStudent"]=Sframe["TermName"]+Sframe["SchoolName"]+\
Sframe["StudentID"].map(str)

SframeDup = Sframe[Sframe.duplicated("TermSchoolStudent") == True]


if len(SframeDup.index) > 0:
    SframeDup.to_csv('NWEA CSVs/Students/StudentDuplicates.csv', sep=',')
    print ("%d instances of repeated student IDs detected." % len(SframeDup.index))
    print ("See StudentDuplicates.csv for duplicates.")



Sframe['dup_check_1'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='first')
Sframe['dup_check_2'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='last')
Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
del Sframe['dup_check_1'], Sframe['dup_check_2']



print (len(Sframe))

Output: 2834
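One detail worth noting about the listing above: with its default setting, duplicated() flags only the second and later occurrences of a key, so SframeDup would not include the first copy of each repeated student, whereas the two dup_check columns together remove every copy. The sketch below uses made-up rows and assumes pandas 0.17 or later, where the keep argument is available; key_cols, later_only and all_copies are hypothetical names.

```python
import pandas as pd

# Made-up rows: one unique student and one student who appears three times.
sframe = pd.DataFrame({
    'TermName': ['Fall', 'Fall', 'Fall', 'Fall'],
    'SchoolName': ['Seattle Central', 'Ballard', 'Ballard', 'Ballard'],
    'StudentID': ['1234', '9876', '9876', '9876'],
})
key_cols = ['TermName', 'SchoolName', 'StudentID']

# Default (keep='first'): the first occurrence is NOT flagged.
later_only = sframe.duplicated(subset=key_cols)

# keep=False: every row that shares its key with another row is flagged.
all_copies = sframe.duplicated(subset=key_cols, keep=False)

print(later_only.sum(), all_copies.sum())
```

If the duplicates report is meant to list every conflicting row, filtering with the keep=False mask may be closer to the intent than the default.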

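Old stuff:

I have what I think is a simple problem, but the answer isn't obvious to me as a new programmer. Basically, I have a DataFrame (Sframe) that my program checks for duplicates. If the user indicates that the program should continue without the duplicates, the duplicates (and their unique originals) are removed from the DataFrame and the result is assigned back to Sframe (so the modified Sframe replaces the original). Later in the main program, if the user chose "2" as described above, Sframe should be the modified version. Otherwise, if no duplicates were detected in the first place (so the user was never prompted), the original Sframe should be used.

My code looks like this: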

import pandas as pd
Sframe = pd.DataFrame()

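Here, the code checks for duplicates. If they exist, the following runs. If they don't, the following is skipped and Sframe is used as originally defined.

This is the code that runs when duplicates are detected: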

dec = input("-->")
if dec == 1:
    print ("This program will now stop.")
    print ("this_file.csv to resolve a problem.")    
    raise SystemExit

elif dec == 2:       
    # add "Repeated" field to student with duplicates table. Values="NaN"
    SframeDup["Repeated"]="NaN"

    # New table joins (left, inner) Sframe with duplicates table (SframeDup) to
    # identify all rows of duplicates (including the unique values that had
    # duplicates)
    SframeWDup=pd.merge(Sframe, SframeDup, on='identifier', how='left')
    # Eliminate all repeating rows, including originals as pulled during left join
    SframeWODup=SframeWDup[SframeWDup.Repeated_y!="NaN"]
    # So here, in my mind, I should be able to just do this and the rest of
    # the code should treat replace Sframe with SframeWODup (without the found
    # duplicates)...
    Sframe = SframeWODup

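But it doesn't work. I know this because when I check len(Sframe) after choosing 2 to eliminate the duplicates (and their unique originals), I get the same number as before the duplicates were processed.

Thanks in advance for your help. I'm happy to clarify anything that is unclear.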

Update: Sframe.dtypes

TermName              object
DistrictName          object
SchoolName            object
StudentLastName       object
StudentFirstName      object
StudentMI             object
StudentID             object
StudentDateOfBirth    object
StudentEthnicGroup    object
StudentGender         object
Grade                 object
TermSchoolStudent     object
dtype: object

Sframe.head() returns the table shown in the image at the following link: https://drive.google.com/file/d/0B1cr7dwUpr_JR3d0YzlwLWFwQU0/view?usp=sharing


2 answers

I did the following, and it worked:
1. Replaced the if-if-elif structure with if-elif-else (see below).
2. Compared dec as a string (i.e. dec == '1' rather than dec == 1).

if len(SframeDup.index) > 0 and dec == '1':
    SframeDup.to_csv('NWEA CSVs/Students/StudentDuplicates.csv', sep=',')
    print ("%d instances of repeated student IDs detected." % len(SframeDup.index))
    print ("See StudentDuplicates.csv for duplicates.")
    print ("\nThis program will now stop.")
    raise SystemExit      

    # quit() and exit() work too, but only in the editor;
    # doing this in IPython Notebook will restart the kernel and require
    # re-running the preceding code
elif len(SframeDup.index) >0  and dec == '2':
    print ("%d instances of repeated student IDs detected." % len(SframeDup.index))
    print ("See StudentDuplicates.csv for duplicates.")
    Sframe['dup_check_1'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='first')
    Sframe['dup_check_2'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='last')
    Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
    del Sframe['dup_check_1'], Sframe['dup_check_2']

else:
    print ("No duplicates found. Oh yeah!")

Try Sframe = SframeWODup.copy()

Update: Can you use this code to get the result you want?

# Made-up data
Sframe = pd.DataFrame({'TermName': ['Fall', 'Fall', 'Fall', 'Fall'], 
'DistrictName': ['Downtown', 'Downtown', 'Downtown', 'Downtown'], 
'SchoolName': ['Seattle Central', 'Ballard', 'Ballard', 'Ballard'], 
'StudentLastName': ['Doe', 'Doe', 'Doe', 'Doe'], 
'StudentFirstName': ['John', 'Jane', 'Jane', 'Jane'],
'StudentMI': ['X', 'X', 'X', 'X'],
'StudentID': ['1234', '9876', '9876', '9876'],
'StudentDateOfBirth': ['2000-01-01', '2001-01-01', '2001-01-01', '2001-01-01'],
'StudentEthnicGroup': ['Asian American', 'White', 'White', 'White'],
'StudentGender': ['M', 'F', 'F', 'F'],
'Grade': ['10th', '9th', '9th', '9th'],
'TermSchoolStudent': ['Z', 'Z', 'Z', 'Z']})

# Remove duplicates based upon StudentID, in-place (i.e., modify object 'Sframe'). 
# UPDATE: I read that you want duplicates completely removed from data frame.
# Sframe.drop_duplicates(subset=['StudentID'], keep='first', inplace=True)

Sframe['dup_check_1'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='first')
Sframe['dup_check_2'] = Sframe.duplicated(subset=['TermName', 'SchoolName', 'StudentID'], keep='last')
Sframe = Sframe[(Sframe['dup_check_1'] == False) & (Sframe['dup_check_2'] == False)]
del Sframe['dup_check_1'], Sframe['dup_check_2']
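The two helper columns above keep a row only when it is neither a later nor an earlier copy of another row, which amounts to dropping every member of a duplicated group. Assuming pandas 0.17 or later, the same result can probably be had in one call with drop_duplicates(..., keep=False). The sketch below checks that equivalence on a trimmed version of the made-up data; key_cols, two_flag and one_call are hypothetical names.

```python
import pandas as pd

# Trimmed made-up data: only the columns that form the duplicate key.
sframe = pd.DataFrame({
    'TermName': ['Fall', 'Fall', 'Fall', 'Fall'],
    'SchoolName': ['Seattle Central', 'Ballard', 'Ballard', 'Ballard'],
    'StudentID': ['1234', '9876', '9876', '9876'],
})
key_cols = ['TermName', 'SchoolName', 'StudentID']

# Two-flag approach, written with boolean masks instead of temporary columns.
is_later_copy = sframe.duplicated(subset=key_cols, keep='first')
is_earlier_copy = sframe.duplicated(subset=key_cols, keep='last')
two_flag = sframe[~is_later_copy & ~is_earlier_copy]

# One-call approach: drop every row whose key occurs more than once.
one_call = sframe.drop_duplicates(subset=key_cols, keep=False)

print(two_flag.equals(one_call))
print(len(one_call))
```

Either way, the result has to be assigned back (Sframe = ...) at the same level where Sframe is used later, since both expressions return a new DataFrame rather than modifying the original.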
