Get the encoding specified in the magic line/shebang (from within the module)

Posted 2024-05-19 11:30:26


If I specify the character encoding in the "magic line" or shebang of a Python module (as suggested by PEP 263):

# -*- coding: utf-8 -*-

can I retrieve this encoding from within that module?

(Running Python 2.7.9 on Windows 7 x64)


I tried (without success) to retrieve the default encoding or the shebang:

^{pr2}$

This produces:

sys.getdefaultencoding(): ascii

shebang: None

(The result is the same when the magic line declares iso-8859-1.)
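For context, `sys.getdefaultencoding()` reports the codec used for implicit str/unicode conversions, not the source encoding, so it is unaffected by the magic line. The declaration itself is an ordinary comment that is recognised by a regular expression. The following is a minimal sketch of that matching step; the pattern is written along the lines of the one PEP 263 describes, and the helper name `declared_encoding` is hypothetical.

```python
import re

# Assumed pattern, modelled on the PEP 263 description: a comment line
# containing "coding:" or "coding=" followed by an encoding name.
COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')


def declared_encoding(line):
    """Return the encoding named in a magic line, or None if there is none."""
    match = COOKIE_RE.match(line)
    return match.group(1) if match else None


samples = [
    '# -*- coding: utf-8 -*-',            # Emacs style
    '# vim: set fileencoding=latin-1 :',  # Vim style
    'import sys',                         # not a comment, so no match
]
for sample in samples:
    print(repr(sample), '->', declared_encoding(sample))
```

Only the raw name is extracted here; validating it against the available codecs is a separate step.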


1 answer
User
#1 · posted 2024-05-19 11:30:26

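I'd borrow the Python 3 `tokenize.detect_encoding()` function for use in Python 2, adjusted slightly to match Python 2 expectations. I changed the function signature to accept a filename and dropped the return of the lines read so far; your use case doesn't need them: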

import re
from codecs import lookup, BOM_UTF8

cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

def _get_normal_name(orig_enc):
    """Imitates get_normal_name in tokenizer.c."""
    # Only care about the first 12 characters.
    enc = orig_enc[:12].lower().replace("_", "-")
    if enc == "utf-8" or enc.startswith("utf-8-"):
        return "utf-8"
    if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
       enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
        return "iso-8859-1"
    return orig_enc

def detect_encoding(filename):
    """Return the source encoding of filename, reading at most two lines."""
    bom_found = False
    encoding = None
    default = 'ascii'

    def find_cookie(line):
        match = cookie_re.match(line)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        try:
            codec = lookup(encoding)
        except LookupError:
            # This behaviour mimics the Python interpreter
            raise SyntaxError(
                "unknown encoding for {!r}: {}".format(
                    filename, encoding))

        if bom_found:
            if encoding != 'utf-8':
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    'encoding problem for {!r}: utf-8'.format(filename))
            encoding += '-sig'
        return encoding

    with open(filename, 'rb') as fileobj:
        # The cookie may only appear on the first or second line.
        first = next(fileobj, '')
        if first.startswith(BOM_UTF8):
            bom_found = True
            first = first[3:]
            default = 'utf-8-sig'
        if not first:
            return default

        encoding = find_cookie(first)
        if encoding:
            return encoding
        # Only look at the second line if the first is blank or a comment.
        if not blank_re.match(first):
            return default

        second = next(fileobj, '')

    if not second:
        return default
    return find_cookie(second) or default

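Like the original function, the function above reads at most two lines from the source file, and raises a SyntaxError exception if the encoding in the cookie is invalid, or if it is not UTF-8 while a UTF-8 BOM is present.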

Demo:

^{pr2}$
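The behaviour described above (two-line limit, BOM handling, SyntaxError on a bad cookie) can also be observed with a short sketch. This one assumes Python 3 and exercises the standard library's `tokenize.detect_encoding()` rather than the Python 2 port, so the default is 'utf-8' instead of 'ascii'; the helper name `encoding_of` is hypothetical.

```python
import io
import tokenize


def encoding_of(source_bytes):
    """Return the detected encoding, or the SyntaxError message as text."""
    readline = io.BytesIO(source_bytes).readline
    try:
        encoding, _lines_read = tokenize.detect_encoding(readline)
    except SyntaxError as exc:
        return 'SyntaxError: {}'.format(exc)
    return encoding


cases = [
    b'# -*- coding: utf-8 -*-\n',                           # plain cookie
    b'#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n',  # cookie on line 2
    b'\n\n# -*- coding: latin-1 -*-\n',                     # line 3: ignored
    b'\xef\xbb\xbf# -*- coding: utf-8 -*-\n',               # BOM plus UTF-8
    b'\xef\xbb\xbf# -*- coding: latin-1 -*-\n',             # BOM mismatch
    b'# -*- coding: no-such-codec -*-\n',                   # unknown codec
]
for case in cases:
    print(repr(case), '->', encoding_of(case))
```

Catching the SyntaxError inside the helper is a choice made for this demo only, so that every case prints a result; real callers would normally let the exception propagate.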
