Regular expressions for parsing a binary file?

Posted 2024-05-12 01:17:46


I have a file that mixes binary data and text data. I want to parse it with a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I'm guessing this message means that Python doesn't want to parse binary files. I'm opening the file with the "rb" flag.

How do I parse a binary file with regular expressions in Python?

Edit: I'm using Python 3.2.0.


3 answers

I think you are using Python 3.

1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character.

........

4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do.

http://www.diveintopython3.net/files.html#read

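So in Python 3, because the binary stream read from a file is a stream of bytes, a regex that analyzes that stream must be defined with a sequence of bytes rather than a sequence of characters.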
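A minimal sketch of that idea. The sample bytes and the `name=` field format are made up for illustration, and a temporary file stands in for the real one:

```python
import os
import re
import tempfile

# Hypothetical sample: text fields embedded between raw binary bytes.
sample = b"\x00\x01\xffname=alice\x00\x02\xfename=bob\x00"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Mode "rb" yields bytes, so the pattern below must be a bytes pattern too.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

pattern = re.compile(br"name=([a-z]+)")
names = pattern.findall(data)
print(names)  # [b'alice', b'bob']
```

Note that the matches come back as bytes objects, not strings.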

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff
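As a small illustration of the quoted point, here is a sketch of converting between a string and bytes with `str.encode` and `bytes.decode`. The sample text is an arbitrary choice:

```python
text = "caf\u00e9"  # the word "café", written with an escape for portability

# Encoding turns a str into bytes in a chosen character encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding goes the other way and needs the matching encoding.
print(utf8_bytes.decode("utf-8") == text)      # True
print(latin1_bytes.decode("latin-1") == text)  # True
```

The same four characters produce different byte sequences depending on the encoding, which is why the string itself has no encoding.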

and

4.6. Strings vs. Bytes

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

....

1. To define a bytes object, use the b' ' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from \x00 to \xff (0–255).

http://www.diveintopython3.net/strings.html#boring-stuff
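A short sketch of the byte literal syntax described above. The value of `header` is just an example:

```python
# ASCII characters and \x escapes can be mixed in one bytes literal.
header = b"\x7fELF\x01"

print(len(header))   # 5
print(header[0])     # 127 (indexing a bytes object yields an int)
print(header[1:4])   # b'ELF' (slicing yields bytes)

# bytes objects are immutable, so item assignment raises TypeError.
try:
    header[0] = 0
except TypeError as exc:
    print("immutable:", exc)
```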

So you would define your regex as follows:

pat = re.compile(b'[a-f]+\d+')

instead of

pat = re.compile('[a-f]+\d+')
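To see the two definitions side by side, here is a sketch that applies both to the same made-up bytes. It uses raw literals (`r""` and `br""`), which is my addition so that `\d` reaches the regex engine without triggering an invalid-escape warning on newer Python versions:

```python
import re

data = b"\x00\x01abc123\xff"  # made-up sample bytes

# A str pattern cannot be applied to a bytes object.
error = None
str_pat = re.compile(r"[a-f]+\d+")
try:
    str_pat.search(data)
except TypeError as exc:
    error = str(exc)
    print(error)

# A bytes pattern works on the same data.
bytes_pat = re.compile(br"[a-f]+\d+")
match = bytes_pat.search(data)
print(match.group())  # b'abc123'
```

The exact wording of the TypeError message varies slightly between Python versions.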

More explanation here:

15.6.4. Can’t use a string pattern on a bytes-like object

This works for me with Python 2.6:

>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)
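Under Python 3 the same session would presumably need the file opened with "rb" and a bytes pattern. The sketch below uses an in-memory stand-in for the first bytes of an ELF file rather than reading /bin/ls, so the sample bytes are an assumption:

```python
import re

# Stand-in for the start of an ELF file; a real file would be read via open(path, "rb").
first_line = b"\x7fELF\x02\x01\x01\x00"

r = re.compile(b".*(ELF).*")
groups = r.match(first_line).groups()
print(groups)  # (b'ELF',)
```

The captured group is a bytes object here, where the Python 2.6 session returned a plain string.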

In your re.compile you need to use a bytes object, which is indicated by an initial b:

r = re.compile(b"(This)")

This is Python 3 being picky about the difference between strings and bytes.
