Python: Capturing sub-patterns within a pattern using regex

Posted 2024-09-28 05:42:21


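Disclaimer: this is my first post. Feel free to give me feedback on how I should or should not format this question. Thanks!

I want to extract data from a block of text by capturing anything that matches a date followed by a colon. I have successfully used a regular expression to capture entries consisting of an observation date, a colon, and any text that follows, up to the next date.

For example:
1999-01-01: Observed 10 birds.

The problem I am running into is that some of my data contains a site name followed by a colon inside the observation data, after the observation date and its first colon. This "sitename: data" sub-pattern may occur zero or more times within the block that follows the observation date.

For example:
1999-01-01: BS-001: Observed 5 birds. Healthy. BS-002: Observed 5 birds, some in poor health.

What pattern should I use to capture all text after the date and colon, including any site names, colons, and associated data, up to the next observation date?

I currently use the following pattern to extract simple observations by date (ones without multiple sites):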

pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')  

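The pattern above lets me pull out observation dates in their various forms. Using the period as part of the pattern is difficult because the observation data may be one sentence or several.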
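To see where the current pattern stops, here is a small check against the two short examples from the question. The English wording of the example strings is an assumption; the pattern is copied unchanged. Because the character class contains neither `:` nor `.`, the match ends at the first site-name colon.

```python
import re

# Pattern copied unchanged from the question.
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')

# Example strings (English wording assumed for illustration).
simple = "1999-01-01: Observed 10 birds."
multi_site = ("1999-01-01: BS-001: Observed 5 birds. Healthy. "
              "BS-002: Observed 5 birds, some in poor health.")

simple_matches = pattern.findall(simple)
multi_matches = pattern.findall(multi_site)

print(simple_matches)  # ['1999-01-01: Observed 10 birds.']
print(multi_matches)   # ['1999-01-01: BS-001']
```

The single-sentence case is captured whole, while the multi-site case is cut off right after the first site name.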

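Below is a sample of the text I am trying to search and split. Each new match should begin with an observation date, so the data below should return 3 matches (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):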

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

Ideally, the output would look like this:

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.

2017-01-01: 23 individuals observed. Egg masses were not present.

2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.


3 Answers

You can try replacing all whitespace that is followed by a date with two newlines:

s = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)

Because the pattern requires leading whitespace, the first date at the start of the string is not matched.
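A minimal sketch of how this substitution behaves, using an abbreviated version of the sample text (the shortened `s` is an assumption made for brevity). After the substitution, each record sits in its own paragraph and can be separated with `str.split`.

```python
import re

# Abbreviated sample text (shortened here for illustration).
s = ("2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
     "2017-01-01: 23 individuals observed. Egg masses were not present. "
     "2018-07-04: BS-440: All individuals took a break.")

# Replace whitespace that is directly followed by a date-and-colon
# with a blank line.
separated = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)

# Each record is now its own paragraph.
records = separated.split("\n\n")
for record in records:
    print(record)
```

Site names such as `BS-443:` are left inside their record because they do not start with four digits.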

If you are not sure that every date is preceded by whitespace, you can also write:

[code snippet not preserved in the source]
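The snippet for this variant is not available here. One way to handle dates that are not preceded by whitespace, offered only as a sketch and not as the answerer's original code, is to make the whitespace optional and add a lookbehind so the date at the very start of the string is left alone. The stricter `\d{4}-\d{2}-\d{2}:` date form and the sample string are assumptions.

```python
import re

# Hypothetical input where one date is glued to the preceding sentence.
s = "2013-04-13: first.2017-01-01: second.  2018-07-04: third."

# (?<=\S) requires a non-whitespace character right before the match,
# so the date at position 0 is skipped; \s* also allows zero whitespace.
separated = re.sub(r'(?<=\S)\s*(?=\d{4}-\d{2}-\d{2}:)', "\n\n", s)

records = separated.split("\n\n")
print(records)
```

The lookbehind also prevents a second, empty match directly after the whitespace has been consumed, which would otherwise insert the separator twice.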

You can use split() with a regex:

output = re.compile(r" (?=\d{4}-\d{2}-\d{2})").split(text)
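A short sketch of the split approach on an abbreviated sample (the shortened `text` is an assumption). The lookahead keeps each date in its record, while the literal space directly before the date is consumed by the split.

```python
import re

# Abbreviated sample text (shortened here for illustration).
text = ("2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
        "2017-01-01: 23 individuals observed. Egg masses were not present. "
        "2018-07-04: BS-440: All individuals took a break.")

# Split on a space that is followed by a YYYY-MM-DD date; the lookahead
# does not consume the date, so it stays at the start of each piece.
output = re.compile(r" (?=\d{4}-\d{2}-\d{2})").split(text)

for piece in output:
    print(piece)
```

A raw string is used for the pattern so that `\d` is passed to the regex engine as written.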


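Basically, it sounds like you want to split your text into fields that start with a date and end just before the next date or at the end of the text. Here is one possibility: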

\d{4}-\d\d-\d\d:           # date with colon
.*?                        # the minimal amount of any characters required to match
(?=                        # positive lookahead (match text but don't consume it)
   \d{4}-\d\d-\d\d:        # date with colon
  |                        # or
   $                       # end of text
)                          # end lookahead

Use it with re.findall():

[code snippet not preserved in the source]
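The call itself is missing here. A minimal sketch, assuming the commented pattern above is compiled with `re.VERBOSE` (so its whitespace and `#` comments are ignored) and `re.DOTALL` (so `.` also crosses line breaks); the flag choice, the name `record_re`, and the abbreviated `text` are assumptions.

```python
import re

record_re = re.compile(r"""
    \d{4}-\d\d-\d\d:        # date with colon
    .*?                     # as few characters as possible
    (?=                     # lookahead: assert without consuming
        \d{4}-\d\d-\d\d:    # next date with colon
      |                     # or
        $                   # end of text
    )
""", re.VERBOSE | re.DOTALL)

# Abbreviated sample text (shortened here for illustration).
text = ("2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
        "2017-01-01: 23 individuals observed. Egg masses were not present. "
        "2018-07-04: BS-440: All individuals took a break.")

# No capture groups, so findall returns the full text of each match.
records = record_re.findall(text)
for record in records:
    print(repr(record))
```

Each record keeps the whitespace that precedes the next date; calling `record.strip()` on each item removes it if it is unwanted.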

Run against the sample text above:

['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
  Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
  old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
  in the masses were AMJE-like). BS-443: 3 egg masses observed in
  vernal pool habitat. A few egg masses may have been missed due to
  poor light conditions. Smith-019: 250 egg masses observed in
  vernal pool habitat. Observer searched only portions abutting the 
  road (SW margin of pool). Many AMJE masses observed attached
  to herbaceous vegetation and difficult to differentiate from
  one another. AMJE egg-mass count is a rough estimate within
  area searched. ',
 '2017-01-01: 23 individuals observed. Egg masses were not present. ',
 '2018-07-04: BS-440: All individuals took a break from breeding for
  the long holiday weekend.']
