pandas: convert string columns to datetime, allowing missing but not invalid values

Published 2024-10-01 07:35:16


I have a pandas DataFrame with several columns of strings that represent dates, where an empty string stands for a missing date. For example:

import numpy as np
import pandas as pd

# expected date format is '%m/%d/%Y'

custId = np.array(list(range(1,6)))
eventDate = np.array(["06/10/1992","08/24/2012","04/24/2015","","10/14/2009"])
registerDate = np.array(["06/08/2002","08/20/2012","04/20/2015","","10/10/2009"])

# both date columns of dfGood should convert to datetime without error
dfGood = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate}) 

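I would like to:

  • efficiently convert every column whose strings are all either valid dates or empty into a datetime64 column (with NaT for the empty strings)
  • raise a ValueError when any non-empty string does not match the expected format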

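Example of where a ValueError should be raised: a frame such as the dfBad printed in the first answer below, where the registerDate value "20/08/2012" in row 1 is day-first and does not match the expected format.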
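The example frame itself is missing from this copy of the question. A reconstruction, assumed from the dfBad output printed in the first answer rather than taken from the asker's own code, might look like this:

```python
import pandas as pd

# Reconstructed from the dfBad output shown in the first answer
# (an assumption, not the asker's original code).
dfBad = pd.DataFrame({
    "custId": [1, 2, 3, 4, 5],
    "eventDate": ["06/10/1992", "08/24/2012", "04/24/2015", "", "10/14/2009"],
    # Row 1 is written day-first, so it does not match '%m/%d/%Y'.
    "registerDate": ["06/08/2002", "20/08/2012", "04/20/2015", "", "10/10/2009"],
})
```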

This function does what I want at the element level:

from datetime import datetime

def parseStrToDt(s, format = '%m/%d/%Y'):
    """Parse a string to datetime with the supplied format."""
    return pd.NaT if s=='' else datetime.strptime(s, format)

print(parseStrToDt("")) # correctly returns NaT
print(parseStrToDt("12/31/2011")) # correctly returns 2011-12-31 00:00:00
print(parseStrToDt("12/31/11")) # correctly raises ValueError

However, I have read that string operations shouldn't be np.vectorize-d. I figured this could be done efficiently with pandas.DataFrame.apply, as in:

dfGood[['eventDate','registerDate']].applymap(lambda s: parseStrToDt(s)) # raises TypeError

dfGood.loc[:,'eventDate'].apply(lambda s: parseStrToDt(s)) # raises same TypeError

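I'm guessing the TypeError has to do with my function returning a different dtype, but I do want to take advantage of dynamic typing and replace the strings with datetimes (unless a ValueError is raised). So how can I do this?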


2 Answers

pandas doesn't have an option that exactly replicates what you want. Here is one way to do it, which should be relatively efficient.

In [4]: dfBad
Out[4]: 
   custId   eventDate registerDate
0       1  06/10/1992   06/08/2002
1       2  08/24/2012   20/08/2012
2       3  04/24/2015   04/20/2015
3       4                         
4       5  10/14/2009   10/10/2009

In [7]: cols
Out[7]: ['eventDate', 'registerDate']

In [9]: dts = dfBad[cols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

In [10]: dts
Out[10]: 
   eventDate registerDate
0 1992-06-10   2002-06-08
1 2012-08-24          NaT
2 2015-04-24   2015-04-20
3        NaT          NaT
4 2009-10-14   2009-10-10

In [11]: mask = pd.isnull(dts) & (dfBad[cols] != '')

In [12]: mask
Out[12]: 
  eventDate registerDate
0     False        False
1     False         True
2     False        False
3     False        False
4     False        False


In [13]: mask.any()
Out[13]: 
eventDate       False
registerDate     True
dtype: bool

In [14]: is_bad = mask.any()

In [23]: if is_bad.any():
    ...:     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
    ...: else:
    ...:     dfBad[cols] = dts
    ...:     
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-579c06ce3c77> in <module>()
      1 if is_bad.any():
----> 2     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
      3 else:
      4     dfBad[cols] = dts
      5 

ValueError: bad dates in col(s) ['registerDate']
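The session above can be folded into a single helper. This is a minimal sketch, not part of the original answer: the name convert_date_cols and its parameters are made up for illustration, and it assumes missing dates are always encoded as empty strings (a real NaN in the input would be reported as invalid).

```python
import pandas as pd

def convert_date_cols(df, cols, fmt="%m/%d/%Y"):
    """Return a copy of df with cols parsed to datetime64 (hypothetical helper).

    Empty strings become NaT; any other unparseable string raises ValueError.
    """
    # Coerce everything that does not parse (including '') to NaT.
    parsed = df[cols].apply(
        lambda col: pd.to_datetime(col, errors="coerce", format=fmt)
    )
    # NaT after coercion but non-empty before means the string was invalid.
    bad = parsed.isna() & (df[cols] != "")
    bad_cols = bad.any()
    if bad_cols.any():
        raise ValueError(
            "bad dates in col(s) {0}".format(bad_cols[bad_cols].index.tolist())
        )
    # assign() returns a new frame, so the caller's frame is left untouched.
    return df.assign(**{col: parsed[col] for col in cols})

good = pd.DataFrame({
    "custId": [1, 2, 3],
    "eventDate": ["06/10/1992", "", "10/14/2009"],
    "registerDate": ["06/08/2002", "", "10/10/2009"],
})
converted = convert_date_cols(good, ["eventDate", "registerDate"])
print(converted.dtypes)
```

Returning a new frame instead of assigning in place is a design choice of this sketch: if the function raises, the input is guaranteed to be unchanged.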

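To take this a step further, I replaced every column whose strings are all valid or missing with its parsed datetimes, and then raised an error for the remaining columns that could not be parsed: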

dtCols = ['eventDate', 'registerDate']
dts = dfBad[dtCols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

mask = pd.isnull(dts) & (dfBad[dtCols] != '')
colHasError = mask.any()

invalidCols = colHasError[colHasError].index.tolist() 
validCols = [c for c in dtCols if c not in invalidCols]  # keeps the original column order

dfBad[validCols] = dts[validCols] # replace the completely valid/empty string cols with dates
if colHasError.any():
    raise ValueError("bad dates in col(s) {0}".format(invalidCols))
# raises:  ValueError: bad dates in col(s) ['registerDate']

print(dfBad) # eventDate got converted, registerDate didn't

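However, the accepted answer contains the main insight: go ahead and coerce errors to NaT, then use a mask to distinguish non-empty but invalid strings from empty strings.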
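The same mask can also report which values failed, not only which columns. A small sketch on a single Series, where find_invalid_dates is a hypothetical name and the sample values are made up for illustration:

```python
import pandas as pd

def find_invalid_dates(series, fmt="%m/%d/%Y"):
    """Return the non-empty entries of series that do not parse with fmt."""
    parsed = pd.to_datetime(series, errors="coerce", format=fmt)
    # Invalid means: coerced to NaT although the original string was not empty.
    invalid = parsed.isna() & (series != "")
    return series[invalid]

s = pd.Series(["06/08/2002", "20/08/2012", "", "12/31/11"])
offenders = find_invalid_dates(s)
print(offenders)
```

The returned Series keeps the original index, so the offending rows can be looked up in the source frame before deciding whether to raise.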
