How do I clean up datetime strings in a DataFrame after exporting from an Excel sheet?

3 Answers

User

#1 · Edited 2024-09-29 22:03:52

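OK.

Edit again: I ran the code further below and it was taking a very long time! I eventually aborted it, but this version should work too and be reasonable time-wise. Good luck!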

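import pandas as pd

# Raw string, so that "\t" and "\f" in the path are not read as escape sequences
f = r"string\to\file\here.xlsx"
df = pd.read_excel(f)

def alter_date(timestamp):
    """Swap the day and month of a Timestamp; return other values unchanged."""
    try:
        # Write the value out as year-day-month, then let pandas re-parse it
        date_time = timestamp.strftime("%Y-%d-%m %H:%M:%S")
        return pd.Timestamp(date_time)
    except (AttributeError, ValueError):
        # Not a Timestamp (e.g. a str), or the swapped date does not exist
        return timestamp

df["trip_start_time"] = df["trip_start_time"].apply(alter_date)
df["trip_stop_time"] = df["trip_stop_time"].apply(alter_date)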

Edit: I did a bit of digging, and based on what I had done before, this looks possible. Here is the new code:

^{pr2}$

It is a bit slow (there is a lot of data), but my computer seems to be working through it. I will update again if it fails.
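Since the row-by-row version is slow on a large file, here is a minimal vectorized sketch of the same day/month swap. It assumes the column already has a datetime dtype; the function name `swap_day_month_vectorized` and the sample values are made up for illustration and are not from the original answer.

```python
import pandas as pd

def swap_day_month_vectorized(col: pd.Series) -> pd.Series:
    """Swap day and month wherever the swap yields a valid date."""
    # Render every timestamp as year-day-month text (NaT becomes NaN)
    as_text = col.dt.strftime("%Y-%d-%m %H:%M:%S")
    # Re-read the text as year-month-day; impossible dates become NaT
    swapped = pd.to_datetime(as_text, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Keep the original value wherever the swap was not possible
    return swapped.fillna(col)

# Hypothetical sample: 10 Jan should really be 1 Oct; 13 Oct cannot be flipped
sample = pd.Series(pd.to_datetime(["2016-01-10 08:00:00", "2016-10-13 09:30:00"]))
fixed = swap_day_month_vectorized(sample)
```

On a real frame this would be applied as `df[col] = swap_day_month_vectorized(df[col])` for each of the two trip columns.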

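Old answer: So what seems to have happened is that every date/time that cannot possibly be ambiguous is still in the original dataset's format: DD/MM/YYYY HH:MM:SS.

If it could be ambiguous (the day is 12 or lower, so it can also be read as a month), it appears to have been converted to YYYY-DD-MM HH:MM:SS.

What I would do is iterate over each column: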

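from datetime import datetime

for col in ["trip_start_time", "trip_stop_time"]:
    for idx, value in df[col].items():
        try:
            new_dt = datetime.strptime(value, "%Y-%d-%m %H:%M:%S")
            df.at[idx, col] = new_dt  # write the corrected value back to the df
        except (TypeError, ValueError):
            pass  # ignore anything that cannot be converted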
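If the values really are strings in the two layouts described above, the same idea can be sketched without an explicit loop: parse the column once per layout with `errors="coerce"` and take whichever parse succeeded. The sample values below are hypothetical, and this assumes every entry is a string.

```python
import pandas as pd

# Hypothetical sample values in the two layouts described above
raw = pd.Series(["13/10/2016 09:30:00",   # DD/MM/YYYY, unambiguous
                 "2016-01-10 08:00:00"])  # YYYY-DD-MM, i.e. 1 October

# Try each layout in turn; values that do not match a layout become NaT
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S", errors="coerce")
year_day_month = pd.to_datetime(raw, format="%Y-%d-%m %H:%M:%S", errors="coerce")

# For each row, take the first layout that parsed successfully
parsed = day_first.fillna(year_day_month)
```

Any row that matches neither layout stays NaT, which makes leftovers easy to find with `parsed.isna()`.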

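User

#2 · Edited 2024-09-29 22:03:52

Andrew observed that the DataFrame can be fixed by flipping the month and day of every date for which doing so produces a valid date.

Here is a quick way to "flip" all the dates. Invalid dates are coerced to NaT (Not-a-Time) values and then dropped. The remaining flipped dates can then be assigned back to df: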

import pandas as pd

df = pd.read_excel('2016_Bike_Share_Toronto_Ridership_Q4.xlsx')

for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col])
    swapped = pd.to_datetime({'year':df[col].dt.year, 
                              'month':df[col].dt.day, 
                              'day':df[col].dt.month,
                              'hour':df[col].dt.hour,
                              'minute':df[col].dt.minute,
                              'second':df[col].dt.second,}, errors='coerce')
    swapped = swapped.dropna()
    mask = swapped.index
    df.loc[mask, col] = swapped

# check that now all dates are in 2016Q4
for col in ['trip_start_time', 'trip_stop_time']:
    mask = (pd.PeriodIndex(df[col], freq='Q') == '2016Q4')
    assert mask.all()

# check that `trip_start_times` are in chronological order
assert (df['trip_start_time'].diff().dropna() >= pd.Timedelta(0)).all()

# check that `trip_stop_times` are always greater than `trip_start_times`
assert ((df['trip_stop_time']-df['trip_start_time']).dropna() >= pd.Timedelta(0)).all()

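The assert statements above verify that the resulting dates all fall in 2016Q4, that the trip_start_times are in chronological order, and that each trip_stop_time is never earlier than its associated trip_start_time.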
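To see the flip without the Excel file, here is a small sketch of the same technique on two made-up timestamps. The sample values and the names `flippable` and `fixed` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: the first value should be 1 Oct, the second (13 Oct) is already right
dates = pd.Series(pd.to_datetime(["2016-01-10 08:00:00", "2016-10-13 09:30:00"]))

# Rebuild each datetime with month and day exchanged; impossible dates become NaT
swapped = pd.to_datetime({"year": dates.dt.year,
                          "month": dates.dt.day,
                          "day": dates.dt.month,
                          "hour": dates.dt.hour,
                          "minute": dates.dt.minute,
                          "second": dates.dt.second}, errors="coerce")

flippable = swapped.notna()               # True only where the flip is a real date
fixed = dates.where(~flippable, swapped)  # keep the original where it is not
```

Here only the first row is flipped; the second would need a 13th month, so it is coerced to NaT and the original value is kept.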

User

#3 · Edited 2024-09-29 22:03:52

You can use the format parameter of pd.to_datetime:

>>> date = pd.Series(['2016-01-10', '2016-02-10'])
>>> pd.to_datetime(date, format='%Y-%d-%m')
0   2016-10-01
1   2016-10-02
dtype: datetime64[ns]
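As a quick illustration of why the explicit format matters, this sketch parses the same two strings with and without it. The variable names are made up for the example.

```python
import pandas as pd

date = pd.Series(['2016-01-10', '2016-02-10'])

# Without a format, the strings are read as ISO year-month-day
default = pd.to_datetime(date)

# With an explicit format, the middle field is read as the day instead
explicit = pd.to_datetime(date, format='%Y-%d-%m')

default_months = default.dt.month.tolist()    # months as parsed by default
explicit_months = explicit.dt.month.tolist()  # months with the day/month order given
```

The default parse puts the dates in January and February, while the explicit format puts both in October.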
