How do I get a reliable Unicode character count in Python?

2 answers

User

Answer 1 · edited 2024-10-04 03:26:18

I know I can just encode it to UTF-8 and then decode again

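Yes, that is the usual idiom for fixing input that has "UTF-16 surrogates in a UCS-4 string". But as Mechanical snail said, such input is malformed, and you should prefer to fix whatever produced it.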
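For readers on Python 3, a rough analogue of that round trip might look like the sketch below. It assumes Python 3.4 or later, where the `surrogatepass` error handler is available for the UTF-16 codecs, and the function name `merge_surrogate_pairs` is made up for illustration. UTF-16 is used for the round trip because it is the encoding in which a surrogate pair and the code point it stands for produce the same bytes.

```python
def merge_surrogate_pairs(text):
    """Return text with each UTF-16 surrogate pair merged into one code point.

    Sketch for Python 3.4+; merge_surrogate_pairs is a hypothetical name.
    """
    # 'surrogatepass' lets the encoder write surrogate code points as-is ...
    raw = text.encode('utf-16-le', 'surrogatepass')
    # ... and the decoder then reads each high+low pair as one code point.
    return raw.decode('utf-16-le', 'surrogatepass')


broken = 'hello \ud83d\udc4d'   # surrogate pair stored as two code points
fixed = merge_surrogate_pairs(broken)
print(len(broken), len(fixed))
```

As the answer says, this only patches up the symptom; fixing the producer of the malformed string is still preferable.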

is there a more straightforward/efficient way?

Hmm... you could do it by hand with a regex, for example:

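import re

# Python 2, wide (UCS-4) build: replace each high+low surrogate pair
# in the unicode string s with the single code point it encodes.
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
    s
)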

Certainly not more straightforward... and I doubt it is actually any more efficient either!
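The arithmetic inside that lambda can be checked on its own. The sketch below works on plain integers, so it does not depend on the interpreter build; the function name `combine_surrogates` is made up for illustration.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair (given as integers) to its code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high (lead) surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low (trail) surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000


print(hex(combine_surrogates(0xD83D, 0xDC4D)))  # 0x1f44d
```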

User

Answer 2 · edited 2024-10-04 03:26:18

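Unfortunately, in versions before 3.3, the behavior of the CPython interpreter depends on whether it was built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can give different results in different builds of the standard interpreter. See this question for an example.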

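The difference between "narrow" and "wide" is that "narrow" interpreters internally store 16-bit code units (UCS-2), whereas "wide" interpreters internally store 32-bit code units (UCS-4). Code points U+10000 and above (outside the Basic Multilingual Plane) have a len of two on "narrow" interpreters, because two UCS-2 code units are needed to represent them (using surrogates), and code units are what len measures. On "wide" builds, a single UCS-4 code unit is enough for a non-BMP code point, so on those builds len is one for such code points.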
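The gap between the two counts can be reproduced on a current interpreter. The sketch below assumes Python 3.3 or later, where len counts code points regardless of build; the helper name `utf16_code_units` is made up for illustration, and its result corresponds to what len would have reported on a "narrow" build.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    # Every UTF-16 code unit is two bytes; 'utf-16-le' adds no byte order mark.
    return len(text.encode('utf-16-le')) // 2


sample = 'hello \U0001f44d'
print(len(sample))               # code points: 7 on Python 3.3+
print(utf16_code_units(sample))  # 16-bit code units: 8
```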

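I have confirmed that the code below handles all unicode strings, whether or not they contain surrogate pairs, and works in both narrow and wide builds of CPython 2.7. (Arguably, specifying a string like u'\ud83d\udc4d' in a wide interpreter reflects an affirmative desire to represent a complete surrogate code point, as distinct from a partial-character code unit, and is therefore not automatically an error to be corrected, but I'm ignoring that here. It's an edge case and normally not a desired use case.)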

The @invoke trick used below is a way to avoid repeating the computation without adding anything to the module's __dict__.

invoke = lambda f: f()  # trick taken from AJAX frameworks

@invoke
def codepoint_count():
  testlength = len(u'\U00010000')  # pre-compute once
  assert (testlength == 1) or (testlength == 2)
  if testlength == 1:
    def closure(data):  # count function for "wide" interpreter
      u'returns the number of Unicode code points in a unicode string'
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:
    def is_high_surrogate(c):
      # High (lead) surrogates are U+D800..U+DBFF; counting only these
      # counts each surrogate pair exactly once.
      ordc = ord(c)
      return (ordc >= 0xD800) and (ordc < 0xDC00)
    def closure(data):  # count function for "narrow" interpreter
      u'returns the number of Unicode code points in a unicode string'
      return len(data) - len(filter(is_high_surrogate, data))
  return closure

assert codepoint_count(u'hello \U0001f44d') == 7
assert codepoint_count(u'hello \ud83d\udc4d') == 7
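The @invoke pattern can also be seen in isolation. The sketch below is a made-up example (the name `is_vowel` is not from the answer): the outer function runs exactly once, at definition time, and the decorated name ends up bound to the closure it returns, so the setup work is never repeated and no helper name is left behind in the module.

```python
invoke = lambda f: f()  # call the decorated function once, keep its result


@invoke
def is_vowel():
    vowels = frozenset('aeiou')  # built once, when the module is loaded

    def closure(ch):
        return ch in vowels
    return closure


# The name is_vowel is now bound to the inner closure, not the outer function.
print(is_vowel('e'), is_vowel('x'))
```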
