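I have a Python script using pandas that reads web-scraped COVID-19 data from CSVs compressed in ZIP files. This is the original source of the web-scraped data: https://github.com/statistikat/coronaDAT
I am having a problem with the Timestamp column loaded from the CSV files. The data appears to load correctly into the dataframe, with all five columns from the original CSV files. The fifth column is the timestamp of the data. When I use print(df_master.columns), I get the correct five columns, including Timestamp.
This is what I get from: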
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful
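However, when I try to set the dataframe index to the Timestamp column (index_col=['Timestamp']) or parse the dates of the Timestamp column (parse_dates=['Timestamp']), I get the following error message: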
ValueError: Index Timestamp invalid
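I tried specifying the exact columns in the CSV, but it made no difference. Some of the CSV files being read may have no values, or a string with no value, in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN values, which should remove every row that has no value in the Timestamp column. I also tried setting the data type of the Timestamp column to datetime.
Setting empty strings in the Timestamp column to NaN and dropping the rows: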
#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
Setting the data type to datetime:
pd.to_datetime(df['Timestamp'],errors='ignore')
When I do either of these, I get the error message:
KeyError: 'Timestamp'
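Any idea why I can't do anything with the Timestamp column, such as setting it as the index, parsing its dates, or working with the values in that column?
Here is the full code: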
import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile

#set root path
rootPath = r"/Users/matt/test/"
#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'
#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'
#initialize variables
df_master = pd.DataFrame()
flag = False

#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable pattern_filename
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename),
                                 delimiter = ';',
                                 header = 0,
                                 #index_col = 'Timestamp',
                                 #parse_dates = 'Timestamp'
                                 )
                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore')
                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace(' ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values
                #df.dropna(subset=['Timestamp'], inplace=True)
                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]
                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]
                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])

#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')
#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
#prepare data to export to csv
frame = df_master
#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")
#verify if the dataset is present
#if not present, download data set from GitHub
#if present, verify with GitHub if dataset is updated
#update dataset
Convert the Timestamp column to a datetime type first, then do your processing.
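The conversion step could look something like the sketch below. It rests on an assumption that is not confirmed in the question: that the KeyError comes from some CSV files having no Timestamp column at all, which would also fit the info() output showing 859 non-null Timestamp values out of 903 rows. The helper name load_bezirke and the two in-memory CSV strings are made up for illustration; in the real script the source would be zip_file.open(text_file.filename). Note that pd.to_datetime and set_index both return new objects, so their results are assigned rather than discarded.

```python
import io
import pandas as pd

def load_bezirke(csv_source, bezirk):
    """Hypothetical helper: read one semicolon-delimited CSV and return the
    rows for `bezirk`, or None when the file has no Timestamp column."""
    df = pd.read_csv(csv_source, delimiter=';', header=0)
    # Assumption: some files lack the column entirely, so check before using it
    if 'Timestamp' not in df.columns:
        return None
    df = df[df['Bezirk'] == bezirk].copy()
    # errors='coerce' turns empty or malformed values into NaT
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
    # Drop rows whose timestamp is missing or could not be parsed
    return df.dropna(subset=['Timestamp'])

# Made-up sample inputs: one file with a Timestamp column (and one empty
# value in it), one file without the column
with_ts = (
    "Bezirk;Anzahl;Anzahl_Inzidenz;GKZ;Timestamp\n"
    "Wien(Stadt);2243;118,2;900;2020-04-22T12:00:00\n"
    "Wien(Stadt);2231;117,5;900;2020-04-22T06:00:00\n"
    "Wien(Stadt);2250;118,5;900;\n"
)
without_ts = "Bezirk;Anzahl;Anzahl_Inzidenz\nWien(Stadt);2100;110,1\n"

frames = []
for text in (with_ts, without_ts):
    part = load_bezirke(io.StringIO(text), 'Wien(Stadt)')
    if part is not None:
        frames.append(part)

# Assign the result: set_index and sort_index return new dataframes
df_master = pd.concat(frames).set_index('Timestamp').sort_index()
print(df_master)
```

Skipping files without the column is only one option. If those rows matter, they could instead be kept and given a timestamp derived from the ZIP file name, which this sketch does not attempt.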