Creating a 2D numpy array to hold characters

2 answers

User

Answer 1 · edited 2024-09-30 00:27:02

From a string, to a list, to an array in which each element holds one word:

In [402]: astr = "This is a message for the display array"
In [403]: alist = astr.split()
In [404]: alist
Out[404]: ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
In [405]: arr = np.array(alist)
In [406]: arr
Out[406]: 
array(['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array'], 
      dtype='<U7')
In [407]: arr.shape
Out[407]: (8,)

I'm using Py3, so the dtype is U7. np.array picks it automatically, making it wide enough to hold the longest string in the list.
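
One consequence of the automatically chosen width is worth seeing in action. The following is a small sketch (the array name words is made up for illustration) showing that a later assignment does not widen the array:

```python
import numpy as np

words = np.array(['This', 'is'])   # longest entry has 4 characters
print(words.dtype)                 # <U4 on a little-endian machine

# Assigning a longer string does not widen the array; the value is
# silently cut down to the existing width.
words[0] = 'message'
print(words[0])                    # mess
```

If longer strings may arrive later, passing an explicit dtype such as 'U20' when creating the array avoids the truncation.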

An array that holds single characters has dtype U1 rather than U7.
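
A minimal sketch of such an array, assuming the (4, 10) shape used later in this answer; the name grid is made up for illustration:

```python
import numpy as np

# Hypothetical 4x10 grid; every cell can hold at most one character.
grid = np.zeros((4, 10), dtype='U1')
grid[0, 0] = 'T'
grid[0, 1] = 'hi'      # only the first character, 'h', is stored
print(grid[0, :3])     # ['T' 'h' '']
```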

Generating an array of single characters from a string:

In [430]: np.array(list(astr))
Out[430]: 
array(['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'm', 'e', 's',
       's', 'a', 'g', 'e', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ',
       'd', 'i', 's', 'p', 'l', 'a', 'y', ' ', 'a', 'r', 'r', 'a', 'y'], 
      dtype='<U1')

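Mapping a list of words onto a single-character array is a bit more tedious: each word, starting with This, has to go into its own slice of the array.

Here is one way to map a list of words onto an array: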

In [462]: alist
Out[462]: ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
In [463]: ''.join(alist)                     # back to one string
Out[463]: 'Thisisamessageforthedisplayarray'
In [464]: np.array(list(''.join(alist)))     # a flat array of char
Out[464]: 
array(['T', 'h', 'i', 's', 'i', 's', 'a', 'm', 'e', 's', 's', 'a', 'g',
       'e', 'f', 'o', 'r', 't', 'h', 'e', 'd', 'i', 's', 'p', 'l', 'a',
       'y', 'a', 'r', 'r', 'a', 'y'], 
      dtype='<U1')
In [465]: _.shape
Out[465]: (32,)
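
Since the flat array happens to have 32 elements, it can also be reshaped directly into 2d. This is only a sketch: the (4, 8) shape is my choice, and reshape works here only because 4 * 8 equals the number of characters.

```python
import numpy as np

alist = ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
flat = np.array(list(''.join(alist)))   # 32 single characters
grid = flat.reshape(4, 8)               # raises ValueError if sizes differ
print(grid.shape)                       # (4, 8)
print(''.join(grid[0]))                 # Thisisam
```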

Alternatively, I can copy the list of characters into an existing array (using flat to treat it as 1d):

In [466]: arr = np.zeros((4,10), 'U1')
In [467]: arr.flat[:32] = list(''.join(alist))  
In [468]: arr
Out[468]: 
array([['T', 'h', 'i', 's', 'i', 's', 'a', 'm', 'e', 's'],
       ['s', 'a', 'g', 'e', 'f', 'o', 'r', 't', 'h', 'e'],
       ['d', 'i', 's', 'p', 'l', 'a', 'y', 'a', 'r', 'r'],
       ['a', 'y', '', '', '', '', '', '', '', '']], 
      dtype='<U1')

If I put spaces between the words:

In [471]: arr.flat[:39] = list(' '.join(alist))
In [472]: arr
Out[472]: 
array([['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' '],
       ['m', 'e', 's', 's', 'a', 'g', 'e', ' ', 'f', 'o'],
       ['r', ' ', 't', 'h', 'e', ' ', 'd', 'i', 's', 'p'],
       ['l', 'a', 'y', ' ', 'a', 'r', 'r', 'a', 'y', '']], 
      dtype='<U1')
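
The flat assignment above hard-codes the lengths 32 and 39. A sketch of how it might be wrapped so the length is computed and overlong text is dropped; fill_grid is a hypothetical helper, not a numpy function:

```python
import numpy as np

def fill_grid(shape, text, fill=''):
    """Return a 'U1' array of the given shape holding text in row-major order.

    Hypothetical helper: text that does not fit is dropped, and unused
    cells keep the fill value.
    """
    grid = np.full(shape, fill, dtype='U1')
    chars = list(text)[:grid.size]      # drop whatever does not fit
    grid.flat[:len(chars)] = chars
    return grid

alist = ['This', 'is', 'a', 'message', 'for', 'the', 'display', 'array']
grid = fill_grid((4, 10), ' '.join(alist))
print(''.join(grid[1]))                 # message fo
```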

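User

Answer 2 · edited 2024-09-30 00:27:02

First things first: if you want an array of "characters", you have to be careful about what you expect. In Python 3, strings are sequences of Unicode code points. In Python 2, strings were the classic "sequence of bytes" strings of languages like C. This means that, from a memory point of view, the unicode dtype can take up considerably more memory: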

In [1]: import numpy as np

In [2]: chararray = np.zeros((4,10), dtype='S1')

In [3]: unicodearray =  np.zeros((4,10), dtype='U1')

In [4]: chararray.itemsize, unicodearray.itemsize
Out[4]: (1, 4)

In [5]: chararray.nbytes
Out[5]: 40

In [6]: unicodearray.nbytes
Out[6]: 160

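So, if you know you only want to work with ASCII characters, you can cut your memory usage to a quarter by using the S1 dtype. Note also that, since S1 in Python 3 corresponds to the bytes type (the equivalent of the Python 2 str), its elements are displayed with a b prefix, in the same way as b'this is a bytes object'.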
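
A short sketch of what the bytes correspondence looks like in practice, reusing the chararray name from above (the variable cell is mine):

```python
import numpy as np

chararray = np.zeros((4, 10), dtype='S1')
chararray[0, 0] = b'a'

cell = chararray[0, 0]
print(isinstance(cell, bytes))     # True: S1 elements are bytes, not str
print(cell.decode('ascii'))        # a
print(chararray[0, :2].tolist())   # [b'a', b'']
```

To get a str back out of an S1 element, it has to be decoded explicitly, as the second print shows.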

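Now, suppose you have some payload, a message that you want to assign to your array. If the message consists of characters that are representable in ASCII, you can play fast and loose with the dtype: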

In [15]: message = 'This'

In [16]: unicodearray.reshape(-1)[:len(message)] = list(message)

In [17]: unicodearray.reshape(-1)
Out[17]:
array(['T', 'h', 'i', 's', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', ''],
      dtype='<U1')

In [18]: chararray.reshape(-1)[:len(message)] = list(message)

In [19]: chararray
Out[19]:
array([[b'T', b'h', b'i', b's', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b''],
       [b'', b'', b'', b'', b'', b'', b'', b'', b'', b'']],
      dtype='|S1')

However, if that is not the case:

In [22]: message = "กขฃคฅฆงจฉ"

In [23]: len(message)
Out[23]: 9

In [24]: unicodearray.reshape(-1)[:len(message)] = list(message)

In [25]: unicodearray.reshape(-1)
Out[25]:
array(['ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', ''],
      dtype='<U1')

In [26]: chararray.reshape(-1)[:len(message)] = list(message)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-26-7d7cdb93de1f> in <module>()
----> 1 chararray.reshape(-1)[:len(message)] = list(message)

UnicodeEncodeError: 'ascii' codec can't encode character '\u0e01' in position 0: ordinal not in range(128)
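
One way to avoid hitting this error is to test the message before choosing a dtype. A minimal sketch, assuming you are free to pick the dtype per message; pick_char_dtype is a hypothetical helper, not part of numpy:

```python
import numpy as np

def pick_char_dtype(message):
    """Return 'S1' if message is pure ASCII, otherwise 'U1' (hypothetical helper)."""
    try:
        message.encode('ascii')
    except UnicodeEncodeError:
        return 'U1'
    return 'S1'

for message in ('This', 'กขฃ'):
    grid = np.zeros((4, 10), dtype=pick_char_dtype(message))
    grid.reshape(-1)[:len(message)] = list(message)
    print(grid.dtype.itemsize)   # 1 for 'This', 4 for the Thai text
```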

Note that if you want to initialize the array with some element other than the default that np.zeros gives you, you can use np.full:

In [27]: chararray = np.full((4,10), '*', dtype='S1')

In [28]: chararray
Out[28]:
array([[b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*'],
       [b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*', b'*']],
      dtype='|S1')

Finally, to do this the long way with for-loops:

In [17]: temp = "a test"

In [18]: display = np.full((4,10), '*', dtype='U1')

In [19]: display
Out[19]:
array([['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

In [20]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:

In [21]: display
Out[21]:
array([['a', ' ', 't', 'e', 's', 't', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

Another test for good measure, this one spanning rows:

In [36]: temp = "this is a test, a test this is"

In [37]: display = np.full((4,10), '*', dtype='U1')

In [38]: it = iter(temp) # give us a single-pass iterator
    ...: for i in range(display.shape[0]):
    ...:     for j, c in zip(range(display.shape[1]), it):
    ...:         display[i, j] = c
    ...:

In [39]: display
Out[39]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' '],
       ['t', 'e', 's', 't', ',', ' ', 'a', ' ', 't', 'e'],
       ['s', 't', ' ', 't', 'h', 'i', 's', ' ', 'i', 's'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

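A caveat: the order of the arguments passed to zip matters, because the iterator it is single-pass:

zip(range(display.shape[1]), it)

The iterator it has to be the last argument, otherwise characters are skipped between rows! zip pulls from its arguments left to right, so if it came first, zip would take one more character from it at the end of each row, then find the range exhausted and discard that character.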
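
The effect is easy to reproduce without numpy. The following sketch uses a made-up helper, fill_rows, that builds rows of text with either argument order:

```python
def fill_rows(text, n_rows, n_cols, iterator_first):
    """Split text into rows using one shared single-pass iterator."""
    it = iter(text)
    result = []
    for _ in range(n_rows):
        if iterator_first:
            # zip advances `it` before it notices the range is exhausted
            row = [c for c, _j in zip(it, range(n_cols))]
        else:
            # the range runs out first, so `it` is left untouched
            row = [c for _j, c in zip(range(n_cols), it)]
        result.append(''.join(row))
    return result

print(fill_rows("abcdef", 2, 3, iterator_first=False))   # ['abc', 'def']
print(fill_rows("abcdef", 2, 3, iterator_first=True))    # ['abc', 'ef']
```

With the iterator first, the 'd' is consumed and thrown away at the end of the first row.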

Finally, note that numpy provides a convenience function for iterating over an array in order:

In [49]: temp = "this is yet another test"

In [50]: display = np.full((4,10), '*', dtype='U1')

In [51]: for c, x in zip(temp, np.nditer(display, op_flags=['readwrite'])):
    ...:     x[...] = c
    ...:

In [52]: display
Out[52]:
array([['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'y', 'e'],
       ['t', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', ' '],
       ['t', 'e', 's', 't', '*', '*', '*', '*', '*', '*'],
       ['*', '*', '*', '*', '*', '*', '*', '*', '*', '*']],
      dtype='<U1')

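To make sure the returned iterator allows modification of the underlying array, you have to pass op_flags=['readwrite'] to the function. That is a small complication, but it simplifies the code considerably, and we no longer need a single-pass iterator. Still, I prefer slice assignment.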
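
For comparison, here is a sketch of the slice-assignment version next to the nditer loop. It uses the flat attribute from the first answer; the names via_nditer and via_slice are mine:

```python
import numpy as np

temp = "this is yet another test"

# Element-by-element fill with nditer, as above.
via_nditer = np.full((4, 10), '*', dtype='U1')
for c, x in zip(temp, np.nditer(via_nditer, op_flags=['readwrite'])):
    x[...] = c

# The same fill as a single slice assignment on the flattened view.
via_slice = np.full((4, 10), '*', dtype='U1')
via_slice.flat[:len(temp)] = list(temp)

print((via_nditer == via_slice).all())   # True
```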
