Why does Java char use UTF-16?


Recently I have been reading a lot about Unicode code points and how they evolved over time, and of course I also read this article.

But what I could not find is the real reason why Java uses UTF-16 for a char.

For example, suppose I have a string containing 1024 characters, all in the ASCII range. That means 1024 * 2 bytes, so the string consumes 2KB of memory no matter what.

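So if Java's base char were UTF-8, it would be just 1KB of data. Even if the string contains some multi-byte characters, for example 10 occurrences of "字" (3 bytes each in UTF-8), that naturally increases the memory consumption only slightly: (1014 * 1 byte) + (10 * 3 bytes) = 1KB + 20 bytes.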

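Isn't the result obvious? 1KB + 20 bytes vs. 2KB. I am not talking about ASCII here; what I am curious about is why it is not UTF-8, which also handles multi-byte characters. UTF-16 looks like a waste of memory for any string that consists mostly of single-byte characters.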
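The size comparison above can be checked with a small sketch. It assumes Java 11 or later (for `String.repeat`); the class and method names are made up for illustration. It measures encoded byte counts, which is only a stand-in for how a JVM actually stores strings in memory. Note that "字" (U+5B57) encodes to 3 bytes in UTF-8, so the UTF-8 total is 1014 + 30 = 1044 bytes, i.e. 1KB + 20 bytes.

```java
import java.nio.charset.StandardCharsets;

class EncodingSize {
    // Builds the string from the question: 1014 ASCII letters plus 10 copies of U+5B57.
    static String sample() {
        return "a".repeat(1014) + "\u5b57".repeat(10);
    }

    // Number of bytes needed to encode the string as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode the string as UTF-16.
    // UTF_16BE is used because plain UTF_16 prepends a 2-byte byte-order mark when encoding.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("UTF-8:  " + utf8Bytes(s) + " bytes");  // 1044
        System.out.println("UTF-16: " + utf16Bytes(s) + " bytes"); // 2048
    }
}
```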

Is there any good reason behind this?


# Answer 1

One reason is the performance characteristics of random access to, or iteration over, the characters of a String:

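UTF-8 uses a variable number of bytes (1-4) to encode a Unicode character. Accessing a character by index, String.charAt(i), would therefore be much more complicated to implement, and slower, than the plain array access that java.lang.String uses.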
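To illustrate the cost, here is a minimal sketch of index-based access over a UTF-8 byte array. `Utf8Index` and its methods are hypothetical helpers, not part of the JDK, and the code assumes well-formed UTF-8 and a valid index. Each lookup has to walk the bytes from the start, so it is O(n), whereas `charAt` on a fixed-width array is a single array read.

```java
import java.nio.charset.StandardCharsets;

class Utf8Index {
    // Returns the code point at the given code-point index of a well-formed UTF-8 array.
    // Every lookup scans from the beginning because sequences have variable length.
    static int codePointAt(byte[] utf8, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            pos += sequenceLength(utf8[pos]); // skip one whole encoded character
        }
        int len = sequenceLength(utf8[pos]);
        int first = utf8[pos] & 0xFF;
        // Keep only the payload bits of the lead byte.
        int cp = (len == 1) ? first : first & (0xFF >> (len + 1));
        for (int k = 1; k < len; k++) {
            cp = (cp << 6) | (utf8[pos + k] & 0x3F); // 6 payload bits per continuation byte
        }
        return cp;
    }

    // Length of a UTF-8 sequence, derived from its lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1; // 0xxxxxxx
        if (b < 0xE0) return 2; // 110xxxxx
        if (b < 0xF0) return 3; // 1110xxxx
        return 4;               // 11110xxx
    }

    public static void main(String[] args) {
        String s = "a\u5b57b";
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        // Both lines print the same value, but the first one had to scan the bytes.
        System.out.println(Integer.toHexString(codePointAt(bytes, 1)));
        System.out.println(Integer.toHexString(s.charAt(1)));
    }
}
```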

# Answer 2

Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

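As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things considered, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated at that stage. That then left UTF-16 as the easiest natural progression.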
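As a small illustration of what the move from UCS-2 to UTF-16 means in practice, the sketch below shows a supplementary character occupying two `char` units (a surrogate pair). The class name is made up for illustration; the `String` and `Character` methods used are standard JDK APIs.

```java
class SurrogateDemo {
    // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane,
    // so it cannot fit in a single 16-bit char.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    public static void main(String[] args) {
        // length() counts 16-bit char units, not characters.
        System.out.println(CLEF.length());                              // 2
        // codePointCount() counts actual Unicode code points.
        System.out.println(CLEF.codePointCount(0, CLEF.length()));      // 1
        // The first char unit is the high half of a surrogate pair.
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));  // true
        // codePointAt() reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));   // 1d11e
    }
}
```

So `charAt(i)` stays a constant-time array read, but for text containing supplementary characters it returns a UTF-16 code unit rather than a full character.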