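<p>First things first: if you want an array of "characters", you have to be careful about what you expect. In Python 3, strings are sequences of Unicode code points. In Python 2, strings are the classic "sequence of bytes" strings found in languages like C. This means that, from a memory point of view, the unicode type may take up more memory:</p>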
<pre><code>In [1]: import numpy as np
In [2]: chararray = np.zeros((4,10), dtype='S1')
In [3]: unicodearray = np.zeros((4,10), dtype='U1')
In [4]: chararray.itemsize, unicodearray.itemsize
Out[4]: (1, 4)
In [5]: chararray.nbytes
Out[5]: 40
In [6]: unicodearray.nbytes
Out[6]: 160
</code></pre>
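<p>So, if you know you will only work with ASCII characters, you can cut memory use to a quarter by using the <code>S1</code> dtype. Also note that since <code>S1</code> in Python 3 actually corresponds to the <code>bytes</code> data type (equivalent to the Python 2 <code>str</code>), the representation of each element is prefixed with a <code>b</code>, as in <code>b'this is a bytes object'</code>.</p>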
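<p>As a rough illustration of the <code>bytes</code> point, here is a minimal sketch (the names <code>byte_arr</code>, <code>element</code> and <code>as_text</code> are made up for this example) showing that elements of an <code>S1</code> array behave like <code>bytes</code> objects and can be decoded or converted when you need text:</p>

```python
import numpy as np

# A small S1 array; each element holds at most one byte.
byte_arr = np.zeros((2, 3), dtype='S1')
byte_arr[0, 0] = b'x'

# Indexing returns a NumPy scalar that is a subclass of bytes.
element = byte_arr[0, 0]
is_bytes = isinstance(element, bytes)

# Decode a single element to get a regular str.
text = element.decode('ascii')

# Or convert the whole array to a unicode dtype in one go.
as_text = byte_arr.astype('U1')
```

<p>Converting with <code>astype('U1')</code> is convenient for display, at the cost of the larger item size discussed above.</p>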
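<p>Now, suppose you have some payload, a message that you want to assign to your array. If the message consists only of characters representable in ASCII, you can play fast and loose with the dtype:</p>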
<pre><code>In [15]: message = 'This'
In [16]: unicodearray.reshape(-1)[:len(message)] = list(message)
In [17]: unicodearray
Out[17]:
array(['T', 'h', 'i', 's', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', ''],
dtype='<U1')
In [18]: chararray.reshape(-1)[:len(message)] = list(message)
In [19]: chararray
Out[19]:
array([[b'T', b'h', b'i', b's', b'', b'', b'', b'', b'', b''],
[b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
[b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
[b'', b'', b'', b'', b'', b'', b'', b'', b'', b'']],
dtype='|S1')
</code></pre>
<p>However, if that is not the case:</p>
<pre><code>In [22]: message = "กขฃคฅฆงจฉ"
In [23]: len(message)
Out[23]: 9
In [24]: unicodearray.reshape(-1)[:len(message)] = list(message)
In [25]: unicodearray
Out[25]:
array(['ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', ''],
dtype='<U1')
In [26]: chararray.reshape(-1)[:len(message)] = list(message)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-7d7cdb93de1f> in <module>()
----> 1 chararray.reshape(-1)[:len(message)] = list(message)
UnicodeEncodeError: 'ascii' codec can't encode character '\u0e01' in position 0: ordinal not in range(128)
In [27]:
</code></pre>
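<p>One way to avoid this error is to pick the dtype based on the message itself. The following is only a sketch under that assumption: <code>make_char_grid</code> is a hypothetical helper, not part of NumPy, and it relies on <code>str.isascii()</code> (Python 3.7+):</p>

```python
import numpy as np

def make_char_grid(message, shape=(4, 10)):
    """Hypothetical helper: build a character grid holding `message`.

    Uses the compact 'S1' dtype when the message is pure ASCII and
    falls back to 'U1' otherwise.
    """
    dtype = 'S1' if message.isascii() else 'U1'
    grid = np.zeros(shape, dtype=dtype)
    # Truncate so the message never overruns the grid.
    n = min(len(message), grid.size)
    # reshape(-1) on a freshly allocated array is a view, so this
    # writes into `grid` itself.
    grid.reshape(-1)[:n] = list(message[:n])
    return grid

ascii_grid = make_char_grid('This')
thai_grid = make_char_grid('กขฃ')
```

<p>Here <code>ascii_grid</code> ends up with dtype <code>S1</code> and <code>thai_grid</code> with dtype <code>U1</code>, so the assignment never has to encode a non-ASCII character.</p>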
<p>Note that if you want to initialize the array with an element other than the default that <code>np.zeros</code> gives you, you can use <code>np.full</code>:</p>
<pre><code>In [27]: chararray = np.full((4,10), '*', dtype='S1')
In [28]: chararray
Out[28]:
array([[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*']],
dtype='|S1')
</code></pre>
<p>Finally, to do this the long way, using for-loops:</p>
<pre><code>In [17]: temp = "a test"
In [18]: display = np.full((4,10), '*', dtype='U1')
In [19]: display
Out[19]:
array([['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
dtype='<U1')
In [20]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:
In [21]: display
Out[21]:
array([['a', ' ', 't', 'e', 's', 't', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
dtype='<U1')
</code></pre>
<p>Another test for good measure, this time spanning rows:</p>
<pre><code>In [36]: temp = "this is a test, a test this is"
In [37]: display = np.full((4,10), '*', dtype='U1')
In [38]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:
In [39]: display
Out[39]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' '],
['t', 'e', 's', 't', ',', ' ', 'a', ' ', 't', 'e'],
['s', 't', ' ', 't', 'h', 'i', 's', ' ', 'i', 's'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
dtype='<U1')
</code></pre>
<p><strong>Warning:</strong> the order of the arguments passed to <code>zip</code> matters, because <code>it</code> is a single-pass iterator:</p>
<pre><code>zip(range(display.shape[1]), it)
</code></pre>
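<p>It should be the last argument, otherwise characters will be skipped between rows! <code>zip</code> pulls from its arguments left to right, so if <code>it</code> came first, <code>zip</code> would consume one more character from it before discovering that the <code>range</code> is exhausted, and that character would be lost.</p>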
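<p>To see the effect of the argument order in isolation, here is a small pure-Python sketch. The function <code>fill_rows</code> and its <code>iterator_last</code> flag are invented for this illustration only:</p>

```python
def fill_rows(text, n_rows, n_cols, iterator_last=True):
    """Hypothetical helper: split `text` into rows using one shared iterator."""
    it = iter(text)  # single-pass iterator shared across all rows
    rows = []
    for _row in range(n_rows):
        if iterator_last:
            # range is exhausted first, so nothing extra is pulled from `it`
            chars = [c for _j, c in zip(range(n_cols), it)]
        else:
            # `it` is advanced once more before zip notices range is done
            chars = [c for c, _j in zip(it, range(n_cols))]
        rows.append(''.join(chars))
    return rows

good = fill_rows('abcdefgh', 2, 3, iterator_last=True)   # ['abc', 'def']
bad = fill_rows('abcdefgh', 2, 3, iterator_last=False)   # ['abc', 'efg']
```

<p>In the second call the character <code>'d'</code> is silently dropped at the row boundary, which is exactly the skipping described above.</p>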
<p>Finally, note that <code>numpy</code> provides a convenience function for iterating over an array in order:</p>
<pre><code>In [49]: temp = "this is yet another test"
In [50]: display = np.full((4,10), '*', dtype='U1')
In [51]: for c, x in zip(temp, np.nditer(display, op_flags=['readwrite'])):
    ...:     x[...] = c
    ...:
In [52]: display
Out[52]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'y', 'e'],
['t', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', ' '],
['t', 'e', 's', 't', '*', '*', '*', '*', '*', '*'],
['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
dtype='<U1')
</code></pre>
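<p>There is one slight complication: you must pass <code>op_flags=['readwrite']</code> to the function to ensure the returned iterator allows modification of the underlying array. In exchange, the code becomes much simpler and we no longer need a single-pass iterator. I still prefer slice assignment, though.</p>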
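<p>For comparison, the slice-assignment approach can be sketched as follows. This variant uses the array's <code>flat</code> attribute instead of <code>reshape(-1)</code>, and the truncation via <code>n</code> is an addition for this example so that an over-long message cannot overrun the array:</p>

```python
import numpy as np

temp = "this is yet another test"
display = np.full((4, 10), '*', dtype='U1')

# Never write more characters than the array can hold.
n = min(len(temp), display.size)

# `flat` indexes the array in row-major order, so one slice
# assignment fills it left to right, top to bottom.
display.flat[:n] = list(temp[:n])

rows = [''.join(row) for row in display]
```

<p>The resulting rows match the <code>nditer</code> version: the message wraps across rows and the remaining cells keep their <code>'*'</code> fill value.</p>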