Trying to extract paragraphs from an email using regex

Posted 2024-09-30 06:30:01


I am trying to use a regular expression to extract paragraphs from text of the following form:

<0.30.1.92.13.39.38.marian+@MARIAN.ADM.CS.CMU.EDU (Marian D'Amico).0>
Type:     cmu.cs.scs
Topic:    LOGIC COLLOQUIUM
Dates:    6-Feb-92
Time:     3:30
Host:     Stephen D. Brookes
PostedBy: marian+ on 30-Jan-92 at 13:39 from MARIAN.ADM.CS.CMU.EDU 
(Marian D'Amico)
Abstract: 



***********************************************************************
          Logic Colloquium
            Thursday February 6
           3:30 Wean 5409
 **********************************************************************
       On The Mathematics of Non-monotonic Reasoning
          Menachem Magidor
       Hebrew University of Jerusalem
          (Joint work with Daniel Lehman)

Non-monotonic reasoning is an attempt to develop reasoning systems
where an inference means that the conclusion holds in the "normal 
case",
in "most cases", but it does not necessarily hold in all cases. It 
seems 
that this type of reasoning is needed if one wants to model everyday
common-sense reasoning. There have been many models suggested for
non-monotonic reasoning (like circumscription, default logic, 
autoepistemic logic, etc). We study all these approaches in a more 
abstract fashion by considering the inference relation of the 
reasoning system, and clarify the role of different inference rules 
and the impact they have on the model theory of the logic. We are 
especially interested in a particular rule called "Rational Monotony" 
and the connection between it and probabilistic models.

 NOTE: Prof. Magidor will also give a Math Department Colloquium on 
Friday
 February 7.

-------------------------
 Host:  Stephen D. Brookes

Appointments can be made through Marian D'Amico, marian@cs, x7665.

I am trying: paragraphRegex=r'(?<=\n\n)(?:(?:\s*\b.+\b:(?:.|\s)+?)|(\s{0,4}A-Za-z0-9+? \s*)(?=\n\n)'

However, this regex captures the paragraphs in some cases, while in others it either fails to capture them or hangs.

Any help would be much appreciated.


1 answer
User
#1 · Posted 2024-09-30 06:30:01

I would try a different approach.

You can split the text on newlines:

texts = text.split('\n')

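From there, develop a test that decides whether a piece of text is part of the email body or something else. One option is to look for blocks of text that have a blank line before or after them. Something like this might work: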

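paragraphs = []

# Skip the first and last lines so texts[i-1] and texts[i+1] always exist
for i in range(1, len(texts) - 1):
  line = texts[i]
  # Keep a non-empty line that follows a blank line and is followed by more text
  if line != '' and texts[i-1] == '' and texts[i+1]:
     paragraphs.append(line)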
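
If whole paragraphs are wanted rather than single lines, the same blank-line idea can be sketched at block level. This is only a rough sketch: the helper names `split_blocks` and `looks_like_body` and the `HEADER_LINE` pattern are made up for illustration, and the "no `Key: value` lines, at least two lines" test is an assumption that will misfire on lines such as `NOTE: ...`.

```python
import re

def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r'\n\s*\n', text.strip())
    return [b.strip() for b in blocks if b.strip()]

# Hypothetical heuristic: a "Key: value" header line such as "Topic: ..."
HEADER_LINE = re.compile(r'^\s*[A-Za-z]+:\s')

def looks_like_body(block):
    """Treat a block as body text if it has 2+ lines and none is a header line."""
    lines = block.splitlines()
    return len(lines) >= 2 and not any(HEADER_LINE.match(l) for l in lines)

# Shortened sample in the style of the email from the question
sample = (
    "Topic:    LOGIC COLLOQUIUM\n"
    "Dates:    6-Feb-92\n"
    "\n"
    "\n"
    "Non-monotonic reasoning is an attempt to develop reasoning systems\n"
    "where an inference means that the conclusion holds in most cases.\n"
    "\n"
    " Host:  Stephen D. Brookes\n"
)

paragraphs = [b for b in split_blocks(sample) if looks_like_body(b)]
for p in paragraphs:
    print(p)
    print('---')
```

Splitting on a fixed separator avoids nested quantifiers like `(?:.|\s)+?`, so there is much less room for the regex engine to backtrack.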

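By the way, regex will only get you so far. Most text data sources vary a lot in their formatting, and your regex will never catch every edge case. I had to do this once, and building a classification model to identify paragraphs turned out to be more robust (and easier).

That is a research project of its own, but if you go that way, look at pairing term frequency-inverse document frequency (TF-IDF) with a support vector classifier (SVC), and don't let anyone talk you into using a neural network unless you have a lot of good training data :).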
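
A minimal sketch of what the TF-IDF plus SVC pairing could look like with scikit-learn, assuming that library is installed. The training blocks and the `header`/`body` labels below are made up purely for illustration; a usable model would need many hand-labelled blocks from your own emails, and the character n-gram settings are just one plausible starting point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up, hand-labelled examples for illustration only
train_blocks = [
    "Type:     cmu.cs.scs\nTopic:    LOGIC COLLOQUIUM",
    "Dates:    6-Feb-92\nTime:     3:30",
    "Host:  Stephen D. Brookes",
    "Non-monotonic reasoning is an attempt to develop reasoning systems "
    "where an inference means that the conclusion holds in most cases.",
    "We study all these approaches in a more abstract fashion by "
    "considering the inference relation of the reasoning system.",
    "Appointments can be made through the department office.",
]
train_labels = ["header", "header", "header", "body", "body", "body"]

# Character n-grams pick up layout cues such as "Key:   value" spacing
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SVC(kernel="linear"),
)
model.fit(train_blocks, train_labels)

new_blocks = [
    "PostedBy: marian+ on 30-Jan-92",
    "There have been many models suggested for non-monotonic reasoning.",
]
predictions = model.predict(new_blocks)
for block, label in zip(new_blocks, predictions):
    print(label, '|', block)
```

With this little data the predictions mean nothing; the point is only the shape of the pipeline: split the email into blocks, vectorize each block, and let the classifier decide which blocks are body paragraphs.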
