获取magic line/shebang中指定的编码（从模块内）

1条回答

网友

1楼 · 发布于 2024-05-19 11:30:26

我借用Python2中的Python3^{} function，稍作调整以符合Python2的预期。我已经将函数签名更改为接受文件名，并删除了迄今为止读取的行；您的用例不需要这些行：

import re
from codecs import lookup, BOM_UTF8

cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

def _get_normal_name(orig_enc):
    """Imitates get_normal_name in tokenizer.c."""
    # Only care about the first 12 characters.
    enc = orig_enc[:12].lower().replace("_", "-")
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"
    if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
       enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
        return "iso-8859-1"
    return orig_enc

def detect_encoding(filename):
    bom_found = False
    encoding = None
    default = 'ascii'

    def find_cookie(line):
        match = cookie_re.match(line)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        try:
            codec = lookup(encoding)
        except LookupError:
            # This behaviour mimics the Python interpreter
            raise SyntaxError(
                "unknown encoding for {!r}: {}".format(
                    filename, encoding))

        if bom_found:
            if encoding != 'utf-8':
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    'encoding problem for {!r}: utf-8'.format(filename))
            encoding += '-sig'
        return encoding

    with open(filename, 'rb') as fileobj:        
        first = next(fileobj, '')
        if first.startswith(BOM_UTF8):
            bom_found = True
            first = first[3:]
            default = 'utf-8-sig'
        if not first:
            return default

        encoding = find_cookie(first)
        if encoding:
            return encoding
        if not blank_re.match(first):
            return default

        second = next(fileobj, '')

    if not second:
        return default    
    return find_cookie(second) or default

与原始函数一样，上面的函数从源文件中读取两行max，如果cookie中的编码无效或在UTF-8bom存在时不是UTF-8，则将引发SyntaxError异常。在

演示：

^{pr2}$

相关问题更多 >

编程相关推荐

热门问题

热门文章

获取magic line/shebang中指定的编码（从模块内）

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >