Return all dates from a string

Posted 2024-06-16 20:01:38


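I am trying to extract every date from a list of long strings. Each string contains several dates in different formats, and I want all of them. I have tried datefinder and dateutil.parser: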

import datefinder
import dateutil.parser as dparser
input_string = 'the document is valid from 2018-11-20 until 2021-11-19, or 25 October 2020 until 25 October 2021, or 3/14/2020 to 3/13/2021, or April 4, 2015 until April 3 2018, or 3rd March 2007 to 4th March 2008'
print(list(datefinder.find_dates(input_string)))
print(dparser.parse(input_string,fuzzy=True))

Output:

[datetime.datetime(2020, 3, 14, 0, 0), datetime.datetime(2021, 3, 13, 0, 0), datetime.datetime(2007, 3, 3, 0, 0), datetime.datetime(2008, 3, 4, 0, 0)]
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-1244-b5979411a38b> in <module>
      4 print(list(datefinder.find_dates(input_string)))
      5 
----> 6 print(dparser.parse(input_string,fuzzy=True))

~\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
   1372         return parser(parserinfo).parse(timestr, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375 
   1376 

~\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    647 
    648         if res is None:
--> 649             raise ParserError("Unknown string format: %s", timestr)
    650 
    651         if len(res) == 0:

ParserError: Unknown string format: the document is valid from 2018-11-20 until 2021-11-19, or 25 October 2020 until 25 October 2021, or 3/14/2020 to 3/13/2021, or April 4, 2015 until April 3 2018, or 3rd March 2007 to 4th March 2008

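datefinder found only 4 of the 10 dates in the string. dparser recognizes each date correctly when a string contains a single date, but raises an error when a string contains more than one.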
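Since single-date strings parse fine, one thing that might be worth trying is cutting the string into single-date chunks first and parsing each chunk on its own. Below is a rough sketch of that idea using only the standard library. The connector words, the format list, and the function name find_dates_by_chunks are all my own assumptions, chosen to fit the example string. The month names also assume an English locale for %B. It is not a general solution for OCR noise.

```python
import re
from datetime import datetime

# Assumption: these words separate the dates in the example sentence.
CONNECTORS = re.compile(r"\b(?:from|until|to|or)\b", re.IGNORECASE)
# Ordinal suffixes such as "3rd" or "4th" are dropped before parsing.
ORDINAL = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b", re.IGNORECASE)
# Assumption: only the formats that appear in the example string.
FORMATS = ("%Y-%m-%d", "%d %B %Y", "%m/%d/%Y", "%B %d, %Y", "%B %d %Y")


def find_dates_by_chunks(text):
    """Split text on connector words and try each chunk against FORMATS."""
    dates = []
    for chunk in CONNECTORS.split(text):
        candidate = ORDINAL.sub("", chunk).strip(" ,.;")
        for fmt in FORMATS:
            try:
                dates.append(datetime.strptime(candidate, fmt))
                break  # stop at the first format that fits this chunk
            except ValueError:
                continue  # not this format, try the next one
    return dates


input_string = (
    'the document is valid from 2018-11-20 until 2021-11-19, '
    'or 25 October 2020 until 25 October 2021, or 3/14/2020 to 3/13/2021, '
    'or April 4, 2015 until April 3 2018, or 3rd March 2007 to 4th March 2008'
)
found = find_dates_by_chunks(input_string)
print(len(found))
for d in found:
    print(d.date())
```

Chunks that match none of the formats (such as the leading "the document is valid") are silently skipped. For formats outside the list, the strptime loop could presumably be swapped for dparser.parse(chunk, fuzzy=True) on each chunk, given that dparser handles single-date strings, but I have not verified how well that copes with OCR errors.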

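PS: The formats are not limited to those in the example. The strings are extracted with pytesseract, so they contain misrecognized characters and similar problems. That makes regex a complicated option, and I am looking for a better alternative.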

