Finding missing values

Posted 2024-10-03 21:35:34

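I have a huge student dataset in which every student has their own CSV file. Dataset B has 297,444 CSV files, and I want to find out which students' CSV files are missing from it.

For example, the dataset contains no u2.csv file. How can I use pandas to find all of the missing CSV files?

This is the code I have tried so far: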

import pandas as pd
import numpy as np
import glob

path = r'C:/Users/user1/Desktop/EDNET DATA/EdNet-KT4/KT4' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

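for filename in all_files:
    # Derive the user ID from the file name, e.g. ".../u2.csv" -> "u2"
    user_id = filename.replace("\\", "/").split("/")[-1].split(".")[0]
    # sep must be passed by keyword in recent pandas versions
    df = pd.read_csv(filename, sep=',', index_col=None, header=0).assign(user_iD=user_id)
    li.append(df)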

data = pd.concat(li, axis=0, ignore_index=True)
df = data.copy()

df.isnull().sum()

df.to_feather('KT4.ftr')
data1= pd.read_feather('KT4.ftr')
data1.head()

1 answer
User
Answer 1 · posted 2024-10-03 21:35:34

Solution

💡 Note: You only need the list of file names. The code you posted reads the contents of every file, which is not needed to find out which files are missing.
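As a minimal sketch of the point in the note, the IDs can be collected from directory entries alone, without opening any file. The helper name `student_ids_in` is hypothetical, and the `u<number>.csv` naming pattern is an assumption based on the `u2.csv` example in the question.

```python
import os
import re

# Assumed naming scheme: u<digits>.csv, e.g. u2.csv
_NAME_RE = re.compile(r"u(\d+)\.csv")


def student_ids_in(folder):
    """Return the set of student IDs that have a CSV file in `folder`.

    Only directory entries are inspected; no file is opened or parsed.
    """
    ids = set()
    with os.scandir(folder) as entries:
        for entry in entries:
            match = _NAME_RE.fullmatch(entry.name)
            if match and entry.is_file():
                ids.add(int(match.group(1)))
    return ids
```

Calling `student_ids_in(path)` on the dataset folder would give a set that can be compared against the expected ID range.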

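You can use either of the two methods below. For reproducibility, I created some dummy data and tested the solutions on Google Colab. I found that using pandas (Method 2) was noticeably faster.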

Common code

import glob
import pandas as pd  # used by Method 2

# `path` is the folder that holds the CSV files ('test' in the dummy data)
all_files = glob.glob(path + "/*.csv")

# I am deliberately using this for 
#   a small number of students to 
#   test the code.
num_students = 20 # 297444

Method 1: a simple Python loop

  • For 100,000 files, this took about 1 min 29 s on Google Colab
  • Run the following in a Jupyter notebook cell
%%time
missing_files = []

for i in range(num_students):
    student_file = f'u{i}.csv'
    if f'{path}/{student_file}' not in all_files:
        missing_files.append(student_file)

#print(f"Total missing: {len(missing_files)}")
#print(missing_files)

## Runtime
# CPU times: user 1min 29s, sys: 0 ns, total: 1min 29s
# Wall time: 1min 29s
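
Most of Method 1's runtime most likely comes from the `not in all_files` test, which scans a Python list on every iteration. The sketch below keeps the same loop idea but looks names up in a set instead. The function name `missing_student_files` is hypothetical, and it assumes files are named `u<i>.csv` for `i` in `range(num_students)`.

```python
import os


def missing_student_files(existing_paths, num_students):
    """Return the names u<i>.csv, for i in range(num_students), not found in existing_paths."""
    # Compare bare file names so the result does not depend on the folder prefix.
    existing_names = {os.path.basename(p) for p in existing_paths}
    return [
        f"u{i}.csv"
        for i in range(num_students)
        if f"u{i}.csv" not in existing_names  # set lookup, constant time on average
    ]
```

With the common code above, this would be called as `missing_student_files(all_files, num_students)`.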

Method 2: using the pandas library (faster) 🔥🔥🔥

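  • For 100,000 files, this took about 358 ms on Google Colab
  • Roughly 250 times faster than Method 1 (1 min 29 s versus 358 ms)
  • Run the following in a Jupyter notebook cell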
%%time
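import pandas as pd

existing_student_ids = (
    pd.DataFrame({'Filename': all_files})
      # Raw string avoids the invalid "\d" escape; the name is matched at the end of the path
      .Filename.str.extract(r'(?:^|[\\/])u(?P<StudentID>\d+)\.csv$')
      .dropna()  # ignore files that do not follow the u<id>.csv pattern
      .astype(int)
      .sort_values('StudentID')
      .StudentID.to_list()
)

missing_student_ids = sorted(set(range(num_students)) - set(existing_student_ids))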

# print(f"Total missing students: {len(missing_student_ids)}")
# print(f'missing_student_ids: {missing_student_ids}')

## Runtime
# CPU times: user 323 ms, sys: 31.1 ms, total: 354 ms
# Wall time: 358 ms

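Dummy data

Here I define some dummy data so that the solution is reproducible and easy to test. Run this section before the common code, because it defines path.

I skip the following student IDs (skip_student_ids) and do not create any .csv files for them.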

import os

NUM_STUDENTS = 20

## CREATE FILE NAMES
num_students = NUM_STUDENTS
skip_student_ids = [3, 8, 10, 13] ##  > we will skip these student-ids
skip_files = [f'u{i}.csv' for i in skip_student_ids]
all_files = [f'u{i}.csv' for i in range(num_students) if i not in skip_student_ids]

if num_students <= 20:
    print(f'skip_files: {skip_files}')
    print(f'all_files: {all_files}')

## CREATE FILES
path = 'test'
if not os.path.exists(path):
    os.makedirs(path)
for filename in all_files:
    with open(path + '/' + filename, 'w') as f:
        student_id = str(filename).split(".")[0].replace('u', '')
        # Write a header row and one data row, without stray indentation
        content = f"Filename,StudentID\n{filename},{student_id}\n"
        f.write(content)
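
As a rough end-to-end check of the idea, the sketch below recreates similar dummy files in a temporary directory and confirms that the skipped IDs are the ones reported as missing. The function name `find_missing_ids` and the use of `tempfile` are illustrative choices, not part of the original answer.

```python
import glob
import os
import tempfile


def find_missing_ids(folder, num_students):
    """Return the sorted IDs in range(num_students) that have no u<id>.csv in folder."""
    present = set()
    for file_path in glob.glob(os.path.join(folder, "u*.csv")):
        # "u12.csv" -> "12": drop the leading "u" and the ".csv" suffix
        stem = os.path.basename(file_path)[1:-len(".csv")]
        if stem.isdecimal():
            present.add(int(stem))
    return sorted(set(range(num_students)) - present)


skipped = [3, 8, 10, 13]
with tempfile.TemporaryDirectory() as folder:
    for i in range(20):
        if i not in skipped:
            # Empty files are enough here: only the names matter.
            open(os.path.join(folder, f"u{i}.csv"), "w").close()
    detected = find_missing_ids(folder, 20)

print(detected == skipped)
```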
