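<h2>Why</h2>
<p>In Python 2, <code>string.ascii_letters</code> starts out as a byte string. When you call <code>.encode('utf-8')</code> on it, Python 2's "magic" first decodes it using the default encoding and then re-encodes it as requested. In both Python 2 and Python 3, the result of encoding is a byte string (<code>bytes</code>).</p>
<p>In Python 3, a byte string behaves differently when iterated: <em>it yields integers</em>, not byte strings of length 1:</p>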
<pre><code>In [52]: list(string.ascii_letters.encode('utf-8'))
Out[52]:
[97,
98,
99,
...
</code></pre>
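<p>For reference, here is a small standard-library sketch of the same behaviour without NumPy. The variable names are only illustrative:</p>

```python
data = "abc".encode("utf-8")

# Iterating or indexing a bytes object in Python 3 gives integers.
as_ints = list(data)
first = data[0]

# Slicing keeps the bytes type, so this recovers length-1 byte strings.
as_single_bytes = [data[i:i + 1] for i in range(len(data))]

print(as_ints)          # [97, 98, 99]
print(first)            # 97
print(as_single_bytes)  # [b'a', b'b', b'c']
```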
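<p>Therefore, in Python 3, the array returned by <code>np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15))</code> is <strong>not</strong> an array of N rows of 15 one-byte strings. It is an array of N rows of 15 integers. When you later call <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tostring.html" rel="nofollow"><code>tostring()</code></a> to get the raw bytes of the array, you get 4 or 8 bytes per integer. In your case you seem to be getting 4; on this machine it is 8.</p>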
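<p>To make the size difference concrete, here is a rough sketch that uses the standard-library <code>struct</code> module as a stand-in for how 32-bit and 64-bit integer arrays lay out their raw bytes. It is not NumPy itself, and little-endian byte order is assumed purely for the illustration:</p>

```python
import struct

codes = list(b"abcdefghijklmno")  # 15 integers in Python 3

as_int32 = struct.pack("<15i", *codes)  # 4 bytes per integer
as_int64 = struct.pack("<15q", *codes)  # 8 bytes per integer
as_uint8 = bytes(codes)                 # 1 byte per integer

print(len(as_int32), len(as_int64), len(as_uint8))  # 60 120 15
print(as_int32[:8])  # b'a\x00\x00\x00b\x00\x00\x00'
```

<p>Each letter is still there, but it is padded with zero bytes up to the integer width, which is why the raw bytes come out 4 or 8 times longer than expected.</p>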
<h2>Possible fixes</h2>
<p>One option is to add a cast:</p>
<pre><code>In [63]: [(u.tostring(),str(v)) for u, v in zip(
np.random.choice(list(string.ascii_letters.encode("utf-8")),
(N, 15)).astype('|S1'), # Cast to array-protocol type string
np.random.randint(0, 100, N))]
Out[63]:
[(b'811881611111171', '82'),
(b'816878668111171', '46'),
(b'811118881668718', '53'),
(b'971861817181818', '49'),
(b'118618991678978', '81'),
...
</code></pre>
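<p>The byte strings in the output above contain only digit characters, which suggests that casting integers to <code>'|S1'</code> keeps the first character of each integer's decimal text rather than the letter the integer encodes. If letters are what is wanted, one possible variation (an assumption, not part of the timings in this answer) is to cast to a one-byte integer type instead. A minimal sketch, using <code>tobytes()</code>, the newer name for <code>tostring()</code>, and a small <code>N</code> chosen only for illustration:</p>

```python
import string
import numpy as np

N = 3  # illustrative size only
letters = string.ascii_letters.encode("utf-8")

# Integer codes with the platform's default integer width (4 or 8 bytes each).
codes = np.random.choice(list(letters), (N, 15))

# Narrow to one byte per element before taking the raw bytes.
rows = [u.tobytes() for u in codes.astype(np.uint8)]

print(rows)
```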
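<p>Another option is to skip the encoding entirely, trust the native string type wherever possible (unless you really need byte strings), and use <code>str.join()</code>:</p>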
<pre><code>In [74]: [(''.join(u), str(v)) for u, v in zip(
np.random.choice(list(string.ascii_letters),
(N, 15)),
np.random.randint(0, 100, N))]
Out[74]: [('IJTlleYqZXmSJaW', '32')]
</code></pre>
<p>Or wrap the encoded letters in <code>bytearray()</code> instead of <code>list()</code>:</p>
<pre><code>In [95]: [(u.tostring(), str(v)) for u, v in zip(
np.random.choice(bytearray(string.ascii_letters.encode('utf-8')),
(N, 15)),
np.random.randint(0, 100, N))]
Out[95]: [(b'MPvbDEQIdAVBQVz', '83')]
</code></pre>
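<p>A likely explanation for why the <code>bytearray()</code> version yields 15 bytes per row is that NumPy converts a <code>bytearray</code> to an array of unsigned 8-bit integers, whereas a plain list of Python integers gets the platform's default integer type. The following sketch checks that assumption directly. It uses <code>np.array()</code> rather than <code>np.random.choice()</code> to keep the result deterministic:</p>

```python
import string
import numpy as np

letters = string.ascii_letters.encode("utf-8")

from_list = np.array(list(letters))            # default integer dtype
from_bytearray = np.array(bytearray(letters))  # expected: uint8

print(from_list.dtype, from_list.itemsize)
print(from_bytearray.dtype, from_bytearray.itemsize)

# With one byte per element, the raw bytes are the letters themselves.
print(from_bytearray.tobytes() == letters)
```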
<h2>Some timings</h2>
<p>Here is how they perform on this machine under Python 3 with <code>N = 2000000</code>:</p>
<p>Original, without the cast:</p>
<pre><code>In [69]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 4.62 s per loop
</code></pre>
<p>With the cast:</p>
<pre><code>In [70]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15)).astype('|S1'), np.random.randint(0, 100, N))]
1 loops, best of 3: 7.07 s per loop
</code></pre>
<p>Using the native string type and join:</p>
<pre><code>In [71]: %timeit [(''.join(u), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 12.1 s per loop
</code></pre>
<p>Wrapping with <code>bytearray()</code>:</p>
<pre><code>In [93]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(bytearray(string.ascii_letters.encode('utf-8')), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 4.56 s per loop
</code></pre>
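<p>To compare the variants outside IPython, something like the following sketch with the standard <code>timeit</code> module could be used. The function names and the much smaller <code>N</code> are assumptions made for illustration, and <code>tobytes()</code> is used in place of the older <code>tostring()</code>, so the numbers will not match the ones above:</p>

```python
import string
import timeit
import numpy as np

N = 20_000  # assumption: far smaller than the 2,000,000 used above
LETTERS = string.ascii_letters

def with_list():
    codes = np.random.choice(list(LETTERS.encode("utf-8")), (N, 15))
    return [(u.tobytes(), str(v))
            for u, v in zip(codes, np.random.randint(0, 100, N))]

def with_bytearray():
    codes = np.random.choice(bytearray(LETTERS.encode("utf-8")), (N, 15))
    return [(u.tobytes(), str(v))
            for u, v in zip(codes, np.random.randint(0, 100, N))]

def with_join():
    chars = np.random.choice(list(LETTERS), (N, 15))
    return [("".join(u), str(v))
            for u, v in zip(chars, np.random.randint(0, 100, N))]

# Best of three single runs per variant, similar in spirit to %timeit.
timings = {
    func.__name__: min(timeit.repeat(func, number=1, repeat=3))
    for func in (with_list, with_bytearray, with_join)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```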