使用Python和Pandas将CSV中的时间戳列设置为索引并解析日期

2024-06-26 01:58:44 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个使用pandas的Python脚本,它从压缩在ZIP文件中的csv中获取covi-19上的web抓取数据。这是web刮取数据的原始数据源:https://github.com/statistikat/coronaDAT

我从CSV文件加载的时间戳列有问题。数据似乎与原始CSV文件中的所有五列一起正确加载到数据框中。第五列是数据的时间戳。当我使用print(df_master.columns)时,我得到了正确的五列,包括时间戳

这是我从中得到的

print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful

但是,当我尝试将数据帧索引设置为时间戳列(index_col=['Timestamp'])或分析时间戳列(parse_dates=['Timestamp'])的日期时,我会收到以下错误消息:

ValueError: Index Timestamp invalid

我试着在CSV中指定确切的列,但没有什么不同。正在读取的某些CSV文件可能没有值,或者时间戳列中没有值的字符串。我尝试用NaN替换Timestamp列中的任何空字符串,然后删除所有NaN,这将删除Timestamp列中没有值的所有行。我还尝试将Timestamp列的数据类型设置为datetime

将时间戳列中的空字符串设置为NaN并删除行:

#replace empty strings in Timestamp column with NaN values
                df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                df.dropna(subset=['Timestamp'], inplace=True)

将数据类型设置为日期时间:

pd.to_datetime(df['Timestamp'],errors='ignore')

当我执行这两项操作之一时,会收到错误消息:

KeyError: 'Timestamp'

知道为什么我不能对时间戳列做任何事情吗,比如设置为索引、解析日期或对该列中的值做任何事情吗?

以下是完整的代码:

import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile


#set root path
rootPath = r"/Users/matt/test/"

#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'

#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'

#initialize variables
df_master = pd.DataFrame()
flag = False


#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable file_name
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename), 
                    delimiter = ';', 
                    header = 0, 
                    #index_col = 'Timestamp', 
                    #parse_dates = 'Timestamp'
                    )

                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore') 

                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                #df.dropna(subset=['Timestamp'], inplace=True)


                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]

                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]

                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])

#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')

#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)


#prepare date to export to csv
frame = df_master

#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")


#verify if the dataset is present
    #if not present, download data set from GitHub
    #if present, verfify with GitHUb if dataset is updated
        #update dataset

Tags: csvtomasterdf时间nantimestampreplace