Extracting email header names with a Python regex

Published 2024-05-07 04:36:40


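Hello, I am looking for a way to extract the header names (shown in bold) from this block of text (originally from an mbox file). I tried the following regex, which works in Sublime Text's regex search but not in Python: ^\w+-?(\w+)?-?(\w+)?: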

rgex = re.findall('^\w+-?(\w+)?-?(\w+)?:', mail);

Here is the mail:

X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000
Return-Path:
Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender)
Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun
2016 13:41:21 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679;
To: test@yahoo.com
From: "Yahoo"
Reply-To: "Yahoo"
X-YMailISG: PCypxycWLDvGv4Bg8ShrtzVYi3vpFMAjYaqWyWybcVJ_ZQff eyquyqb..Qu6UKhX_Tyz5b3da2iDtRStJpVnNulZHOb8GznJQTCKk9sjvboS KsbzY4E1uScWz0Ieo0jjG0YHrB1dTCzOSeMiPNumCCFS1sR3_SkyMBGG_D2D wWtdRducxLa2YgEMMubVpMtNJMBv.bwk0.E.jQNEy8I3LnJEqcDpmIUM7bZL XgkEFz7yl1Zo6Sj4r0z6pGlVIFOql7uG9Bwq2VJoK1Q1upKJUOBfQqzf64y2 9fXLnQsWENpZloxwncGzLhdzEYGgE3xNuFV8QFxZGXyvtKZFoykH49M03URN jtx8Yg6ypjyRbBIRVJGVFbjAvW6io3yeyIFh042jlgYQtLxbneFA60hn9ifT Mit3bQ5l7Tginw0OgRM2cbqLo0tEZFt9vlN597Z3vPGwsVdBcTp9wnk6orj2 TqjEpAmODy3Yru2HzDP7Dbwq9CGaIozUm91VNWqw5Dy7AMQEsuvnBop7Fflk G21m1WKMBgrS.2bOLQ4797E09LjlyyoWI9FouUNNhDljnPPf2AeKUKzauctw ULOQPveWAm4lDsNLMp5yvXDYNIe5HMor84SVd8_xF3Icna1PAftXGzJUHrXK NZSEN_VO0GprGfaNQg4uSW_0wXFXwC6TYQ4CMjz53o0qNGpILogVfRLwFCFL DtW8nimkLLsNzmDajzJsR_juA86Orw2NE5ED4qdpPxmyxyrXYOQPu3O6zeYf 7mBzU0aX7VHJUxJ4L3HdB9qTjbTaCdnySrnjGtd7u9Cn9yRJirDNeg3UA82P PeA1ZDfc0vKdrn5QI6e6YKa2TTt7Dspy3jObgSapH5epc3LyQVyN7yjpxrq_ MXAbpqedjUfcwq3c7lpt8xxUxy.MXWg0fJO059xijvb_sYTaQTGUWAMeVU.6 IW.hSksejwpn._CgE9Kqabbk5qgYIdYRW1pmz5OBYh0skCX1TrFRuxbGvDit R_wr.wbTpJGiSST.b0ZetmgN72bVvlRtmNPw1Dk.zxaacXxhGSMWupPUDLJZ OMrap2ax8oiQrxT3jIhk8seIkaNJ.tGUhlPx6G4lJJaz0g89LmjBaEjGUG8P W3Phh9db3hjxUIX5UC0jg5ai2XZ7u_wXn2Muk61N1eRCZ0oA2S25YDPK1dh. 3VQ6pH8SSBxVkQHUJXbZUNqLAzi5V5wRS7oeitXERGgA2DiZB268.rJxS7di OMT5eGoITG4LnAo1M3nsVQ6xceHDd4v6KD9KfBgTHX_iLUv_skCv4dVUgVvj edKOFiOMHBTpJ9J9BECjTTzEUpc.fCNUcRwSsiSkqbRhUsAdCbxQZir3Nb1Z 6FzI6J2eNqpj4azjmDeI15R8MyN7VFc6bl6pCZySk2Tx5SQESDm.sVkADSVR pI2nuscEjU3xo_qGUxbh5mbAA17K2zYpcFXaOce8_9Eszos5pURCcdtBYUqI I_DOtvNe.zWY1ShRcr9ZzTj3ibmc7NBmvumhVMjqirb12mfJ6oxHv8d86gze HtAJmJghczUg5otSzdxSgEJJxjMZrzSidJ9FP.gPiPWtuukz82YpZ32MnCVs 6.V2DRxpUmZa31KH93QSEzwMlCn3FFTLBv9izcjoFP81yeAn.3QloF8XIC3K WmtXtloyeGjuygAhlkd_prXmMGGC5JmPlY8xu4k1NavkdDh6pG6zIkt83Wsd p.D.0BgM
X-Originating-IP: [75.30.245.45]
Authentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)


3 answers

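Python provides an email package that handles these low-level tasks for you, but if you want to learn about email headers the hard way, you can refer to RFC 5322 (formerly RFC 822).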

Among other useful information, it gives the definition of header fields:

Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon. A field body may be composed of printable US-ASCII characters as well as the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters (together known as the white space characters, WSP). A field body MUST NOT include CR and LF except when used in "folding" and "unfolding"

Folding is defined as follows:

the field body portion of a header field can be split into a multiple-line representation; this is called "folding". The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.

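This means:

  • When a line does not start with WSP (\s in regex), the token at the start of the line, up to the first colon, is the header name.
  • When a line starts with WSP, it is a continuation line.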

So this regex is sufficient, provided it is anchored to the start of each line (^ together with re.MULTILINE): '^([\x21-\x7e]+?):'
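As a minimal sketch of the rule above, the snippet below applies the anchored pattern with re.MULTILINE to a shortened, made-up sample built from the headers in the question. It also illustrates two likely reasons the attempt in the question behaves differently in Python than in an editor search: without re.MULTILINE, ^ only matches at the very start of the string, and re.findall returns the capture groups rather than the whole match when the pattern contains groups. The names HEADER_NAME and sample are illustrative only.

```python
import re

# Header name: printable US-ASCII characters (0x21-0x7e) up to the first
# colon, anchored to the start of a line. Continuation lines begin with a
# space or tab, which is outside the character class, so they cannot match.
HEADER_NAME = re.compile(r'^([\x21-\x7e]+?):', re.MULTILINE)

# Shortened sample; the second line is a folded continuation of "Received".
sample = (
    "Received: from 127.0.0.1 by mta1287.mail.ne1.yahoo.com with SMTPS;\n"
    "\tThu, 09 Jun 2016 13:41:21 +0000\n"
    "To: test@yahoo.com\n"
    "Reply-To: \"Yahoo\"\n"
)

# findall returns the content of the single capture group for each match.
names = HEADER_NAME.findall(sample)
print(names)  # ['Received', 'To', 'Reply-To']
```

The timestamp "13:41:21" on the continuation line is not reported, because the pattern can only start matching at the beginning of a line.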

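If you run a single regex over the whole mbox file, it will not work; you have to write a program. The reason is that a message body may contain tokens that look exactly like header tokens.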

Assuming you run the regex only on the header section of the mbox file, then, going by the email RFC (section 2.2), the following regex should work:

'^([^:]+):'

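An easier approach than designing a suitable regex may be to use a more appropriate tool that ships with Python: the email module, which is designed to parse RFC 822 messages.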

>>> from email import parser
>>> txt = """X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000 
... Return-Path: 
... Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender) 
... Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun 2016 13:41:21 +0000
... DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679; 
... To: test@yahoo.com 
... From: "Yahoo" 
... Reply-To: "Yahoo"
... X-YMailISG: PCypxy...
... X-Originating-IP: [75.30.245.45] 
... Authentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)
... """
>>> msg = parser.Parser().parsestr(txt, headersonly=True)
>>> print(msg.keys())
['X-Apparently-To', 'Return-Path', 'Received-SPF', 'Received', 'DKIM-Signature', 'To', 'From', 'Reply-To', 'X-YMailISG', 'X-Originating-IP', 'Authentication-Results']
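Since the text in the question originally came from an mbox file, the same idea can be applied to the file directly with the standard library's mailbox module, which splits the file into messages and parses each one. The following is only a sketch: the function name header_names_per_message and the path passed to it are assumptions, not something from the question.

```python
import mailbox

def header_names_per_message(mbox_path):
    """Return one list of header names per message in an mbox file."""
    # create=False makes a missing path raise an error instead of
    # silently creating an empty mailbox file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        # Each message behaves like an email.message.Message, so keys()
        # lists its header names in the order they appear.
        return [message.keys() for message in box]
    finally:
        box.close()
```

Called as header_names_per_message('inbox.mbox') with a hypothetical file name, this returns a list with one entry per message, each entry being that message's header names. Header-like lines inside message bodies are not reported, because the parser stops reading headers at the first empty line.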
