Apparently, Python strings are not "born equal"



I am trying to wrap my head around "text encoding standards". When a bunch of bytes is interpreted as "text", one has to know which "encoding scheme" applies. As far as I know, the possible candidates are:

<ul>
<li>ASCII: a very basic encoding scheme that supports 128 characters.</li>
<li>CP-1252: the Windows encoding scheme for the Latin alphabet. Also known as "ANSI".</li>
<li>UTF-8: an encoding scheme for the Unicode table (1,114,112 code points). Represents each character with one byte if possible, and with more bytes if needed (up to 4 bytes).</li>
<li>UTF-16: another encoding scheme for the Unicode table (1,114,112 code points). Represents each character with a minimum of 2 bytes and a maximum of 4 bytes.</li>
<li>UTF-32: yet another encoding scheme for the Unicode table. Represents each character with 4 bytes.</li>
<li>...</li>
</ul>

Now, I would expect Python to consistently use one encoding scheme for its built-in string type. I ran the test below, and the result made me shudder. I am starting to believe that Python does not stick to a single encoding scheme for storing strings internally. In other words: Python strings seem to be "not born equal".

Edit: I forgot to mention that I am using Python 3.x. Sorry :-)

1. Test

I have two simple text files in a folder: <code>myAnsi.txt</code> and <code>myUtf.txt</code>. As you can guess, the first is encoded with the <code>CP-1252</code> encoding scheme, also known as <code>ANSI</code>. The latter is encoded in <code>utf-8</code>. In my test, I open each file and read out its content. I assign the content to a native Python string variable. Then I close the file. After that, I create a new file and write the content of the string variable to that file. Here is the code that does all this:

<pre><code>
##############################
#   TEST ON THE ANSI-coded   #
#           FILE             #
##############################
import os

file = open(os.getcwd() + '\\myAnsi.txt', 'r')
fileText = file.read()
file.close()

file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file.write(fileText)
file.close()

# A print statement here like:
# >> print(fileText)
# will raise an exception.
# But if you're typing this code in a python terminal,
# you can just write:
# >> fileText
# and get the content printed. In my case, it is the exact
# content of the file.
# PS: I use the native windows cmd.exe as my Python terminal ;-)

##############################
#   TEST ON THE Utf-coded    #
#           FILE             #
##############################
import os

file = open(os.getcwd() + '\\myUtf.txt', 'r')
fileText = file.read()
file.close()

file = open(os.getcwd() + '\\outputUtf.txt', 'w')
file.write(fileText)
file.close()

# A print statement here like:
# >> print(fileText)
# will just work fine (at least for me).

############# END OF TEST #############
</code></pre>

2. The result I expected

Let us assume that Python consistently sticks to one internal encoding scheme for all of its strings, for example <code>utf-8</code>. Assigning other content to a string would then cause some kind of implicit conversion. Under these assumptions, I would expect both output files to be of type <code>utf-8</code>:

<pre><code>
outputAnsi.txt -> utf-8 encoded
outputUtf.txt  -> utf-8 encoded
</code></pre>

3. The result I got

The result I got is:

<pre><code>
outputAnsi.txt -> CP-1252 encoded (ANSI)
outputUtf.txt  -> utf-8 encoded
</code></pre>

From these results, I have to conclude that the string variable <code>fileText</code> somehow stores the encoding scheme it adheres to.

Many people told me in their answers:

<blockquote> When no encoding is passed explicitly, <code>open()</code> uses the preferred system encoding both for reading and for writing. </blockquote>

I just cannot wrap my head around that statement. If it is true, shouldn't both output files end up encoded the same way, in the preferred system encoding?

4. Questions

My test raises several questions:

(1) When I open a file to read its content, how does Python know the encoding scheme of that file? I did not specify it when opening the file.

(2) Apparently, a Python string can adhere to any encoding scheme supported by Python. So not all Python strings are born equal. How do I find out the encoding scheme of a particular string, and how do I convert it? Or how do I make sure that a freshly created Python string is of the expected type?

(3) When I create a file, how does Python decide in which encoding scheme the file will be created? I did not specify the encoding scheme when creating these files in my test. Nevertheless, Python made a different (!) decision in each case.

5. Extra information (based on comments to this question):

<ul>
<li>Python version: Python 3.x (installed from Anaconda)</li>
<li>Operating system: Windows 10</li>
<li>Terminal: the standard Windows command prompt <code>cmd.exe</code></li>
<li>Some remarks on the temporary variable <code>fileText</code>: apparently the instruction <code>print(fileText)</code> does not work in the ANSI case. An exception is raised. But in the Python terminal window, I can simply type the variable name <code>fileText</code> and get the file content printed out.</li>
<li>File encoding detection: first checked in the bottom-right corner of Notepad, then double-checked with an online tool: <a href="https://nlp.fi.muni.cz/projects/chared/" rel="nofollow">https://nlp.fi.muni.cz/projects/chared/</a></li>
<li>The output files <code>outputAnsi.txt</code> and <code>outputUtf.txt</code> do not exist at the start of the test. They are created at the moment I issue the <code>open(..)</code> command with the <code>'w'</code> option.</li>
</ul>

6. The actual files (for completeness):

I got some suggestions encouraging me to share the actual files I am running this test on. Those files are rather big, so I trimmed them down and redid the test. The results are similar. Here are the files (note: of course, my files contain source code, what else?):

myAnsi.txt

<pre><code>/*
******************************************************************************
**
**  File        : LinkerScript.ld
**
**  Author      : Auto-generated by Ac6 System Workbench
**
**  Abstract    : Linker script for STM32F746NGHx Device from STM32F7 series
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed “as is,” without any warranty
**                of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>&copy; COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/

/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)

/* Highest address of the user mode stack */
_estack = 0x20050000;    /* end of RAM */

_Min_Heap_Size = 0;      /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 320K
  ROM (rx)  : ORIGIN = 0x8000000, LENGTH = 1024K
}
</code></pre>

A print statement on the <code>fileText</code> variable leads to the following exception:

<pre><code>>>> print(fileText)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>
</code></pre>

But just typing the name of the variable prints out the content without any problem:

<pre><code>>>> fileText
### contents of the file are printed out :-) ###
</code></pre>

myUtf.txt

<pre><code>/*--------------------------------------------------------------------------------------------------------------------*/
/* _ _ _ */
/* / -,- \ __ _ _ */
/* // | \\ / __\ | ___ ___| | __ _ _ */
/* | 0--,| / / | |/ _ \ / __| |/ / __ ___ _ _ __| |_ __ _ _ _| |_ ___ */
/* \\ // / /___| | (_) | (__| < / _/ _ \ ' \(_-< _/ _` | ' \ _(_-< */
/* \_-_-_/ \____/|_|\___/ \___|_|\_\ \__\___/_||_/__/\__\__,_|_||_\__/__/ */
/*--------------------------------------------------------------------------------------------------------------------*/

#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"

/*--------------------------------------------------------------------------------------------------*/
/*   S y s t e m   C o r e   C l o c k   i n i t i a l   v a l u e                                  */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* This variable is updated in three ways:                                                          */
/*   1) by calling CMSIS function SystemCoreClockUpdate()                                           */
/*   2) by calling HAL API function HAL_RCC_GetHCLKFreq()                                           */
/*   3) each time HAL_RCC_ClockConfig() is called to configure the system clock frequency           */
/*      Note: If you use this function to configure the system clock; then there                    */
/*            is no need to call the 2 first functions listed above, since SystemCoreClock          */
/*            variable is updated automatically.                                                    */
/*                                                                                                  */
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};

/*--------------------------------------------------------------------------------------------------*/
/*   S y s t e m   C o r e   C l o c k   v a l u e   u p d a t e                                    */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* @brief  Update SystemCoreClock variable according to Clock Register Values.                      */
/*         The SystemCoreClock variable contains the core clock (HCLK), it can                      */
/*         be used by the user application to setup the SysTick timer or configure                  */
/*         other parameters.                                                                        */
/*--------------------------------------------------------------------------------------------------*/
</code></pre>
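The statement quoted in section 3 can be checked with a small experiment. The sketch below is not the original test: it assumes `cp1252` as a stand-in for the preferred system encoding of a Western-European Windows machine (passed explicitly so it behaves the same on any platform), and `sample`, `round_trip`, and the temporary file names are hypothetical. It suggests that reading and writing with the same single-byte codec can reproduce the input bytes exactly, which would make a detection tool report the original encoding for the output file even though the `str` object itself carries no encoding. The last part mirrors the `print(fileText)` failure on a `cp850` console and the backslash-escape fallback that the interactive prompt applies when it echoes a bare expression.

```python
import locale
import os
import tempfile

# Assumption: cp1252 plays the role of the "preferred system encoding".
ASSUMED_DEFAULT = "cp1252"

# Hypothetical sample text with the curly quote from the traceback (U+201C).
sample = "distributed \u201cas is, caf\u00e9"


def round_trip(raw: bytes) -> bytes:
    """Store raw bytes in a file, read them as text and write the text to a
    second file (both with ASSUMED_DEFAULT), then return the output bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(raw)
        with open(src, "r", encoding=ASSUMED_DEFAULT, newline="") as f:
            text = f.read()
        with open(dst, "w", encoding=ASSUMED_DEFAULT, newline="") as f:
            f.write(text)
        with open(dst, "rb") as f:
            return f.read()


ansi_bytes = sample.encode("cp1252")
utf8_bytes = sample.encode("utf-8")

ansi_out = round_trip(ansi_bytes)
utf8_out = round_trip(utf8_bytes)

# What open() falls back to on this machine when no encoding is given.
print("preferred encoding here:", locale.getpreferredencoding(False))

# Both outputs are byte-for-byte copies of their inputs.
print("ANSI file unchanged: ", ansi_out == ansi_bytes)
print("UTF-8 file unchanged:", utf8_out == utf8_bytes)

# In memory, however, the UTF-8 file was decoded with the wrong codec.
misread = utf8_bytes.decode(ASSUMED_DEFAULT)
print("UTF-8 file read correctly:", misread == sample)

# print() must encode for the console; cp850 has no mapping for U+201C.
quote = "\u201cas is"
try:
    quote.encode("cp850")
    print_would_fail = False
except UnicodeEncodeError:
    print_would_fail = True

# Echoing a bare name shows repr() and falls back to backslash escapes.
console_bytes = repr(quote).encode("cp850", errors="backslashreplace")
print(print_would_fail, console_bytes)
```

If this sketch matches what happens in the test, the practical consequence would be to pass `encoding=...` explicitly to every `open()` call rather than relying on the system default.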

