Setting a timestamp column from a CSV as the index and parsing its dates with Python and pandas

Posted 2024-06-26 01:58:44


I have a Python script using pandas that reads web-scraped COVID-19 data from CSV files packed inside ZIP archives. This is the original source of the scraped data: https://github.com/statistikat/coronaDAT

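I am having trouble with the Timestamp column loaded from the CSV files. The data appears to load correctly into the DataFrame, with all five columns from the original CSV files. The fifth column is the timestamp of the data. When I call print(df_master.columns), I get the correct five columns, including Timestamp.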

This is what I get from these calls:

print(df_master.info())
print(df_master.head(10))
print(df_master.columns)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful

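However, when I try to set the DataFrame index to the Timestamp column (index_col=['Timestamp']) or parse the dates in the Timestamp column (parse_dates=['Timestamp']), I get the following error message: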

ValueError: Index Timestamp invalid

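I tried specifying the exact columns in the CSV, but it made no difference. Some of the CSV files being read may have no values, or only empty strings, in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN values, which should remove every row that has no value in the Timestamp column. I also tried setting the data type of the Timestamp column to datetime.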

Setting empty strings in the Timestamp column to NaN and dropping those rows:

#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace('  ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
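
The two replace calls above match only the empty string and a string of exactly two spaces. A minimal sketch of a more general cleanup, assuming the column is actually present in the frame, uses a regular expression for any whitespace-only value and assigns the result back rather than relying on inplace=True. The sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one valid timestamp, one empty, one whitespace-only.
df = pd.DataFrame({
    "Bezirk": ["Wien(Stadt)", "Wien(Stadt)", "Wien(Stadt)"],
    "Timestamp": ["2020-04-22T06:00:00", "", "   "],
})

# Turn empty or whitespace-only strings into NaN in a single pass.
df["Timestamp"] = df["Timestamp"].replace(r"^\s*$", np.nan, regex=True)

# Drop the rows that have no timestamp.
df = df.dropna(subset=["Timestamp"])
```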

Setting the data type to datetime:

pd.to_datetime(df['Timestamp'],errors='ignore')
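
Note that pd.to_datetime returns a new object and leaves the frame untouched, so the call above has no visible effect unless its result is assigned back. A small sketch on made-up data, which also uses errors='coerce' so that missing or unparseable values become NaT instead of staying as strings:

```python
import pandas as pd

# Hypothetical sample data with one empty and one missing timestamp.
df = pd.DataFrame({
    "Timestamp": ["2020-04-22T06:00:00", "2020-04-22T19:00:00", "", None],
})

# Assign the converted column back; empty or missing entries become NaT.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
```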

When I do either of these, I get the error message:

KeyError: 'Timestamp'
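
A KeyError: 'Timestamp' means the frame being processed has no column with that name. One plausible explanation, suggested by the info() output above where GKZ and Timestamp have 859 non-null values against 903 rows, is that some of the CSV files lack these columns entirely, so the gaps only show up as NaN after concatenation. The sketch below checks for the column before touching it; the helper name read_bezirke and the in-memory CSVs are hypothetical, and the all-NaT placeholder is just one way to handle such files.

```python
import io
import pandas as pd

def read_bezirke(file_obj):
    """Read one semicolon-separated CSV; parse Timestamp only if it exists."""
    df = pd.read_csv(file_obj, sep=";", header=0)
    if "Timestamp" in df.columns:
        df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
    else:
        # Files without the column get an all-NaT placeholder column.
        df["Timestamp"] = pd.NaT
    return df

# Hypothetical inputs: one file with the column, one without it.
with_ts = io.StringIO("Bezirk;Anzahl;Timestamp\nWien(Stadt);2231;2020-04-22T06:00:00\n")
without_ts = io.StringIO("Bezirk;Anzahl\nWien(Stadt);2100\n")

frames = [read_bezirke(f) for f in (with_ts, without_ts)]
combined = pd.concat(frames, ignore_index=True)
```

Skipping files that lack the column would be an equally reasonable choice if rows without a timestamp are not needed.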

Any idea why I can't do anything with the Timestamp column, such as setting it as the index, parsing its dates, or working with its values?

Here is the complete code:

import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile


#set root path
rootPath = r"/Users/matt/test/"

#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'

#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'

#initialize variables
df_master = pd.DataFrame()
flag = False


#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable file_name
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename), 
                    delimiter = ';', 
                    header = 0, 
                    #index_col = 'Timestamp', 
                    #parse_dates = 'Timestamp'
                    )

                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore') 

                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                #df.dropna(subset=['Timestamp'], inplace=True)


                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]

                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]

                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])

#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')

#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)


#prepare date to export to csv
frame = df_master

#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")


#verify if the dataset is present
    #if not present, download data set from GitHub
    #if present, verify with GitHub if dataset is updated
        #update dataset


1 Answer

#1 · Posted 2024-06-26 01:58:44

Use

df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")

to convert the Timestamp column to datetime values, then carry out your processing.
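
The call above returns a new Series (df2) and does not modify df_master. A minimal sketch of how the converted values might be written back and used as a sorted index, on made-up rows shaped like the output in the question. It also avoids the pattern df_master.set_index('Timestamp').sort_index(inplace=True, ...) from the question's script, which sorts a temporary copy and then discards it.

```python
import pandas as pd

# Hypothetical rows, deliberately out of chronological order.
df_master = pd.DataFrame({
    "Bezirk": ["Wien(Stadt)"] * 3,
    "Anzahl": [2264, 2231, 2243],
    "Timestamp": [
        "2020-04-22T19:00:00",
        "2020-04-22T06:00:00",
        "2020-04-22T12:00:00",
    ],
})

# Convert the strings and store the result in the frame.
df_master["Timestamp"] = pd.to_datetime(
    df_master["Timestamp"], format="%Y-%m-%dT%H:%M:%S"
)

# set_index and sort_index both return new frames, so keep the result.
df_master = df_master.set_index("Timestamp").sort_index(na_position="first")
```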
