<blockquote>
<p><strong>Update:</strong> It’s 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.</p>
</blockquote>
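<p>The standard library has excellent tools both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big byte string, even though we break it across several lines for readability) that we can feed to my examples:</p>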
<pre><code>request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
</code></pre>
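<p>As @trypy points out, you can use Python's email message library to parse the headers. It is worth adding, though, that once it has been created, the resulting <code>Message</code> object acts like a dictionary of headers:</p>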
<pre><code>from email.parser import BytesParser

# Split off the request line; everything after it is the header block.
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

print(len(headers)) # -> 3
print(headers.keys()) # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host']) # -> "cm.bell-labs.com"
</code></pre>
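<p>To make the "acts like a dictionary" point concrete, here is a small sketch of a few lookups on a <code>Message</code>. The header block below is made up for illustration (the repeated <code>Accept</code> field is not part of the example request), and this is only a sample of the methods available, not a full tour:</p>

```python
from email.parser import BytesParser

# Hypothetical header block with a repeated field, for illustration only.
raw_headers = (
    b'Host: cm.bell-labs.com\r\n'
    b'Accept: text/html;q=0.9\r\n'
    b'Accept: text/plain\r\n'
    b'\r\n'
)
headers = BytesParser().parsebytes(raw_headers)

# Lookups ignore case, and a missing field gives None rather than KeyError.
host = headers['HOST']
missing = headers['X-Not-Sent']

# get() accepts a default, much like dict.get().
agent = headers.get('User-Agent', 'unknown')

# Indexing gives back only one value; get_all() returns every value.
accepts = headers.get_all('Accept')

# items() returns the (name, value) pairs as a list.
pairs = headers.items()
```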
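<p>This, of course, ignores the request line, or leaves you to parse it yourself. It turns out there is a much better solution.</p>
<p>The standard library will parse HTTP for you if you use its <code>BaseHTTPRequestHandler</code>. Its documentation is a bit obscure (a problem shared by the whole suite of HTTP and URL tools in the standard library), but all you have to do to make it parse a string is (a) wrap the string in a <code>BytesIO()</code>, (b) read the <code>raw_requestline</code> so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).</p>
<p>So here is our specialization of the standard library class:</p>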
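<pre><code>from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        # Skip the socket-based base constructor and parse from memory.
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record the error instead of writing a response to a client.
        # parse_request() can pass a third "explain" argument, so accept it.
        self.error_code = code
        self.error_message = message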
</code></pre>
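<p>Again, I wish the standard library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class:</p>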
<pre><code># Using this new class is really easy!
request = HTTPRequest(request_text)
print(request.error_code) # None (check this first)
print(request.command) # "GET"
print(request.path) # "/who/ken/trust.html"
print(request.request_version) # "HTTP/1.1"
print(len(request.headers)) # 3
print(request.headers.keys()) # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host']) # "cm.bell-labs.com"
</code></pre>
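<p>One thing the example above does not show is a request body. As far as I can tell, <code>parse_request()</code> stops reading right after the blank line that ends the headers, so any body is still waiting in <code>rfile</code>. A minimal sketch, assuming a made-up POST request and a client that sends an accurate <code>Content-Length</code> (the class is repeated so the snippet runs on its own):</p>

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """The wrapper class from this answer, repeated for a standalone run."""

    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


# Hypothetical POST request, for illustration only.
post_text = (
    b'POST /submit HTTP/1.1\r\n'
    b'Host: example.com\r\n'
    b'Content-Length: 8\r\n'
    b'\r\n'
    b'name=ken'
)
request = HTTPRequest(post_text)

# Assumption: the Content-Length header is present and trustworthy.
length = int(request.headers.get('Content-Length', 0))
body = request.rfile.read(length)
print(body)
```

<p>This does not handle chunked transfer encoding; a request using it would need extra work.</p>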
<p>If there is an error during parsing, <code>error_code</code> will not be <code>None</code>:</p>
<pre><code># Parsing can result in an error code and message
request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')
print(request.error_code) # 400 (older Python 3 releases print HTTPStatus.BAD_REQUEST)
print(request.error_message) # "Bad request syntax ('GET')"
</code></pre>
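<p>If you would rather get an exception than remember to check <code>error_code</code> every time, a thin helper can do the check for you. This is only a sketch: <code>parse_or_raise</code> is a name I made up, and <code>ValueError</code> is an arbitrary choice of exception (the class is repeated so the snippet runs on its own):</p>

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """The wrapper class from this answer, repeated for a standalone run."""

    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


def parse_or_raise(request_text):
    """Hypothetical helper: return the parsed request or raise ValueError."""
    request = HTTPRequest(request_text)
    if request.error_code is not None:
        # %d turns the status (an int or an HTTPStatus member) into a number.
        raise ValueError('%d %s' % (request.error_code, request.error_message))
    return request


try:
    parse_or_raise(b'GET\r\nHeader: Value\r\n\r\n')
except ValueError as exc:
    failure = str(exc)
else:
    failure = None
print(failure)
```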
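<p>I prefer using the standard library like this because I suspect its authors have already encountered and resolved any edge cases that might bite me if I tried to re-implement an Internet specification myself with regular expressions.</p>
<h2>Old Python 2 code</h2>
<p>Here is the original code for this answer, from when I first wrote it:</p>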
<pre><code>request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
)
</code></pre>
<p>And:</p>
<pre><code># Ignore the request line and parse only the headers
from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))
print len(headers) # -> "3"
print headers.keys() # -> ['accept-charset', 'host', 'accept']
print headers['Host'] # -> "cm.bell-labs.com"
</code></pre>
<p>And:</p>
<pre><code>from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
</code></pre>
<p>And:</p>
<pre><code># Using this new class is really easy!
request = HTTPRequest(request_text)
print request.error_code # None (check this first)
print request.command # "GET"
print request.path # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers) # 3
print request.headers.keys() # ['accept-charset', 'host', 'accept']
print request.headers['host'] # "cm.bell-labs.com"
</code></pre>
<p>And:</p>
<pre><code># Parsing can result in an error code and message
request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')
print request.error_code # 400
print request.error_message # "Bad request syntax ('GET')"
</code></pre>