Finding missing values

import pandas as pd
import numpy as np
import glob

path = r'C:/Users/user1/Desktop/EDNET DATA/EdNet-KT4/KT4'  # use your path
all_files = glob.glob(path + "/*.csv")

li = []
for i, filename in enumerate(all_files):
    # Tag every row with the user ID taken from the file name (Windows-style separator).
    # `sep` is passed by keyword because recent pandas versions reject it positionally.
    df = pd.read_csv(filename, sep=',', index_col=None, header=0).assign(
        user_iD=filename.split("\\")[-1].split(".")[0])
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)
df = data.copy()
df.isnull().sum()
df.to_feather('KT4.ftr')
data1 = pd.read_feather('KT4.ftr')
data1.head()

1 answer

User

#1 · Posted on 2024-10-03 21:35:34

Solution

💡 Note: You only need the list of file names. The code you posted reads the contents of every file, which is not what you want.
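To illustrate the note above: the user ID is already part of the file name, so it can be read from the path without opening any CSV. This is a minimal sketch using the standard library's `pathlib`; the two paths are hypothetical and do not need to exist.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths, one per platform style; neither file has to exist
windows_style = PureWindowsPath(r"C:\data\KT4\u123.csv")
posix_style = PurePosixPath("/data/KT4/u456.csv")

# .stem is the final path component without its suffix
user_ids = [windows_style.stem, posix_style.stem]
print(user_ids)
```

Unlike splitting on a hard-coded `"\\"`, this does not depend on which separator the operating system uses.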

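You can use either of the two methods below. For reproducibility, I created some dummy data and tested the solution on Google Colab. In my tests, using pandas (Method 2) was faster.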

Common code

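import glob
import pandas as pd  # needed for Method 2

# `path` is the directory that holds the .csv files
# (it is set to 'test' in the dummy data section below)
all_files = glob.glob(path + "/*.csv")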

# I am deliberately using this for 
#   a small number of students to 
#   test the code.
num_students = 20 # 297444

Method 1: Simple Python loop

For 100,000 files, this took about 1 min 29 s on Google Colab.
Run the following in a Jupyter notebook cell.

%%time
missing_files = []

for i in range(num_students):
    student_file = f'u{i}.csv'
    if f'{path}/{student_file}' not in all_files:
        missing_files.append(student_file)

#print(f"Total missing: {len(missing_files)}")
#print(missing_files)

## Runtime
# CPU times: user 1min 29s, sys: 0 ns, total: 1min 29s
# Wall time: 1min 29s
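Most of the runtime in Method 1 comes from testing membership in a list, which scans the list once per student. The following is a minimal sketch of the same loop with the existing file names held in a set instead. The helper name `find_missing_files` is hypothetical, and the demonstration uses a temporary directory rather than the real data.

```python
import glob
import os
import tempfile


def find_missing_files(path, num_students):
    """Return the expected names u0.csv .. u{n-1}.csv that are absent from `path`."""
    # Compare base names only, stored in a set for constant-time average lookup
    existing = {os.path.basename(p) for p in glob.glob(os.path.join(path, "*.csv"))}
    return [f"u{i}.csv" for i in range(num_students) if f"u{i}.csv" not in existing]


# Small demonstration in a throwaway directory: create u0, u1 and u3 only
with tempfile.TemporaryDirectory() as tmp:
    for i in (0, 1, 3):
        open(os.path.join(tmp, f"u{i}.csv"), "w").close()
    missing = find_missing_files(tmp, 5)

print(missing)  # ['u2.csv', 'u4.csv']
```

Comparing base names also sidesteps the question of whether `glob` returns `/` or `\` as the separator.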

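Method 2: Using the pandas library (faster) 🔥🔥🔥

For 100,000 files, this took about 358 ms on Google Colab.
Based on the timings reported here, that is roughly 250 times faster than Method 1.
Run the following in a Jupyter notebook cell.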

%%time
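import re
import pandas as pd

existing_student_ids = (
    pd.DataFrame({'Filename': all_files})
      # Raw f-string keeps \d and \. intact; re.escape guards special characters
      # in `path`; [\\/] accepts either path separator
      .Filename.str.extract(rf'{re.escape(path)}[\\/]u(?P<StudentID>\d+)\.csv')
      .dropna()  # ignore files that do not follow the u<ID>.csv pattern
      .astype(int)
      .sort_values('StudentID')
      .StudentID.to_list()
)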

missing_student_ids = list(set(range(num_students)) - set(existing_student_ids))

# print(f"Total missing students: {len(missing_student_ids)}")
# print(f'missing_student_ids: {missing_student_ids}')

## Runtime
# CPU times: user 323 ms, sys: 31.1 ms, total: 354 ms
# Wall time: 358 ms
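To make the extraction step in Method 2 easier to follow, here is a minimal sketch on a few hard-coded, hypothetical file names. It shows that `str.extract` with a named group yields a `StudentID` column and leaves a missing value for names that do not match, which is why the non-matching rows are dropped before converting to integers.

```python
import re

import pandas as pd

path = "test"  # hypothetical directory name
files = pd.Series(["test/u0.csv", "test/u2.csv", "test/notes.csv", "test/u10.csv"])

# Named group -> column "StudentID"; "test/notes.csv" does not match and becomes NaN
pattern = rf"{re.escape(path)}/u(?P<StudentID>\d+)\.csv$"
extracted = files.str.extract(pattern)

# Drop the non-matching row, then convert the captured digits to integers
ids = extracted.dropna().astype(int)["StudentID"].tolist()

# Expected IDs 0..3 minus the IDs that actually exist
missing = sorted(set(range(4)) - set(ids))
print(ids, missing)
```

The set difference only considers the expected range, so an extra ID such as 10 does not show up as missing.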

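Dummy data

Here I define some dummy data so that the solution is reproducible and easy to test.

I skip the following student IDs (skip_student_ids) and do not create any .csv files for them.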

import os

NUM_STUDENTS = 20

## CREATE FILE NAMES
num_students = NUM_STUDENTS
skip_student_ids = [3, 8, 10, 13] ##  > we will skip these student-ids
skip_files = [f'u{i}.csv' for i in skip_student_ids]
all_files = [f'u{i}.csv' for i in range(num_students) if i not in skip_student_ids]

if num_students <= 20:
    print(f'skip_files: {skip_files}')
    print(f'all_files: {all_files}')

## CREATE FILES
path = 'test'
if not os.path.exists(path):
    os.makedirs(path)
for filename in all_files:
    with open(path + '/' + filename, 'w') as f:
        student_id = str(filename).split(".")[0].replace('u', '')
        content = f"""
        Filename,StudentID
        {filename},{student_id}
        """
        f.write(content)
