Converting a NumPy array of strings to per-character codes in NumPy

2 answers

User

Answer 1 · edited 2024-10-03 00:18:18

Here is an approach that uses a lookup table:

>>> alphabet = np.array(list('ACGT'))
>>> alphabet
array(['A', 'C', 'G', 'T'], dtype='<U1')

To use a lookup table, we need to reinterpret the letters as indices, which is done by view casting:

>>> alph_as_num = alphabet.view(np.int32)
>>> alph_as_num
array([65, 67, 71, 84], dtype=int32)

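We can now build the lookup table. It needs 85 slots (indices 0 to 84), although only four of them are actually used: 65, 67, 71 and 84. As for the output format, we are free to choose whichever best suits our needs: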

Example 1 - output as bytestrings:

>>> lookup_1 = np.zeros((alph_as_num.max()+1), dtype='S4')
>>> lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]
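
The slice expression on the right-hand side can be hard to read at a glance. The sketch below repeats the setup from this answer in self-contained form and shows that sliding a four-character window over `b'0001000'` yields four one-hot strings, one each for A, C, G and T:

```python
import numpy as np

alphabet = np.array(list("ACGT"))
alph_as_num = alphabet.view(np.int32)      # code points of A, C, G, T

# Four-character windows starting at offsets 0, 1, 2 and 3.
patterns = [b"0001000"[i:i + 4] for i in range(4)]

lookup_1 = np.zeros(alph_as_num.max() + 1, dtype="S4")
lookup_1[alph_as_num] = patterns           # assigned in alphabet order

for letter in "ACGT":
    print(letter, lookup_1[ord(letter)])
```

Slots that were never assigned stay as empty bytestrings, so a character outside the alphabet (but below code point 85) maps to `b''` rather than raising an error.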

Example 2 - output as uint8:

>>> lookup_2 = np.zeros((alph_as_num.max()+1), dtype=np.uint8)
>>> lookup_2[alph_as_num] = 1 << np.arange(4)

Example 3 - output as four uint8 per letter:

>>> lookup_3 = np.zeros((alph_as_num.max()+1, 4), dtype=np.uint8)
>>> lookup_3[alph_as_num[::-1]] = np.identity(4)

Now let's apply this to a 100-letter sequence:

>>> seq
array(['CATTTCTCCACCATTTTGGTTTTTCATTGATCCGTTAGGTGGAGCCGGACTATGTCTACCGAAAGATGCACCTGCGCCGGGTCTGGTCTATCTCTTAATG'],
      dtype='<U100')

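The lookup is fast because it relies on only two things:

NumPy's built-in advanced indexing, which makes lookups very fast (for example, much faster than a Python dict).
View casting, which is essentially free, because all it does is reinterpret the data buffer (no copying or conversion).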
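
Both points can be checked directly. The following is a minimal sketch, not a benchmark: it uses a short made-up sequence (`GATTACA`) and a hypothetical `as_dict` mapping for comparison, confirms that the int32 view shares its buffer with the string array, and confirms that the table lookup agrees with a per-character dict lookup:

```python
import numpy as np

alphabet = np.array(list("ACGT"))
alph_as_num = alphabet.view(np.int32)          # 65, 67, 71, 84

lookup = np.zeros(alph_as_num.max() + 1, dtype=np.uint8)
lookup[alph_as_num] = 1 << np.arange(4)        # A->1, C->2, G->4, T->8

seq = np.array(["GATTACA"])                    # illustrative input
codes = seq.view(np.int32)                     # one int32 per character

# The view reinterprets the same buffer; no data is copied.
shares_buffer = np.shares_memory(seq, codes)

# Advanced indexing: one vectorized lookup for the whole sequence.
via_table = lookup[codes]

# Dict-based equivalent (hypothetical, for comparison only).
as_dict = {"A": 1, "C": 2, "G": 4, "T": 8}
via_dict = np.array([as_dict[ch] for ch in seq[0]], dtype=np.uint8)

print(shares_buffer, via_table, via_dict)
```

To measure the speed difference on real data, the two lookups could be wrapped in `timeit` calls; the numbers will depend on sequence length and hardware.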

Example 1 - bytestrings:

>>> lookup_1[seq.view(np.int32)]
array([b'0010', b'0001', b'1000', b'1000', b'1000', b'0010', b'1000',
       b'0010', b'0010', b'0001', b'0010', b'0010', b'0001', b'1000',
       b'1000', b'1000', b'1000', b'0100', b'0100', b'1000', b'1000',
       b'1000', b'1000', b'1000', b'0010', b'0001', b'1000', b'1000',
       b'0100', b'0001', b'1000', b'0010', b'0010', b'0100', b'1000',
       b'1000', b'0001', b'0100', b'0100', b'1000', b'0100', b'0100',
       b'0001', b'0100', b'0010', b'0010', b'0100', b'0100', b'0001',
       b'0010', b'1000', b'0001', b'1000', b'0100', b'1000', b'0010',
       b'1000', b'0001', b'0010', b'0010', b'0100', b'0001', b'0001',
       b'0001', b'0100', b'0001', b'1000', b'0100', b'0010', b'0001',
       b'0010', b'0010', b'1000', b'0100', b'0010', b'0100', b'0010',
       b'0010', b'0100', b'0100', b'0100', b'1000', b'0010', b'1000',
       b'0100', b'0100', b'1000', b'0010', b'1000', b'0001', b'1000',
       b'0010', b'1000', b'0010', b'1000', b'1000', b'0001', b'0001',
       b'1000', b'0100'], dtype='|S4')

If preferred, these can also be viewed as one long string:

>>> lookup_1[seq.view(np.int32)].view('S400')
array([b'0010000110001000100000101000001000100001001000100001100010001000100001000100100010001000100010000010000110001000010000011000001000100100100010000001010001001000010001000001010000100010010001000001001010000001100001001000001010000001001000100100000100010001010000011000010000100001001000101000010000100100001000100100010001001000001010000100010010000010100000011000001010000010100010000001000110000100'],
      dtype='|S400')

Example 2 - uint8:

>>> lookup_2[seq.view(np.int32)]
array([2, 1, 8, 8, 8, 2, 8, 2, 2, 1, 2, 2, 1, 8, 8, 8, 8, 4, 4, 8, 8, 8,
       8, 8, 2, 1, 8, 8, 4, 1, 8, 2, 2, 4, 8, 8, 1, 4, 4, 8, 4, 4, 1, 4,
       2, 2, 4, 4, 1, 2, 8, 1, 8, 4, 8, 2, 8, 1, 2, 2, 4, 1, 1, 1, 4, 1,
       8, 4, 2, 1, 2, 2, 8, 4, 2, 4, 2, 2, 4, 4, 4, 8, 2, 8, 4, 4, 8, 2,
       8, 1, 8, 2, 8, 2, 8, 8, 1, 1, 8, 4], dtype=uint8)

Example 3 - four uint8 per letter; but let's use a different seq with multiple rows:

>>> seq
array([['CCCT'],
       ['GCGA']], dtype='<U4')
>>> lookup_3[seq.view(np.int32)].reshape(len(seq), -1)
array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=uint8)
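
The steps of this answer can be wrapped into a single function. The sketch below is one possible packaging, not the answerer's code: `one_hot_acgt` is a hypothetical helper name, it assumes equal-length uppercase sequences, and it assigns the identity rows in A, C, G, T order (the example above assigns them in reverse order).

```python
import numpy as np

def one_hot_acgt(seqs):
    """One-hot encode equal-length DNA strings (hypothetical helper).

    Returns a uint8 array of shape (n_sequences, length, 4).
    Code points above 84 (for example lowercase letters) raise IndexError.
    """
    arr = np.ascontiguousarray(seqs, dtype=str)        # shape (n,), dtype '<Uk'
    alph_as_num = np.array(list("ACGT")).view(np.int32)

    lookup = np.zeros((alph_as_num.max() + 1, 4), dtype=np.uint8)
    lookup[alph_as_num] = np.identity(4)               # A->1000 ... T->0001

    # View the strings as code points, one row per sequence.
    codes = arr.view(np.int32).reshape(len(arr), -1)
    return lookup[codes]

encoded = one_hot_acgt(["CCCT", "GCGA"])
print(encoded.shape)
print(encoded.reshape(len(encoded), -1))
```

Keeping the trailing axis of size 4 and flattening only when needed makes it easy to switch between the per-letter and per-row layouts.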

User

Answer 2 · edited 2024-10-03 00:18:18

NumPy has a char.replace function (see the docs). All you need to do is:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

To convert it to an int array, cast the resulting digit strings to an integer dtype.
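
A minimal sketch of one way to do this conversion, assuming `astype` on the replaced strings (this is an assumption, not necessarily the call the answerer used; the input sequences are made up for illustration):

```python
import numpy as np

genes = np.array(["ACGT", "TTGA"])             # illustrative input

# Replace each base with its one-digit code.
for letter, digit in zip("ACGT", "1248"):
    genes = np.char.replace(genes, letter, digit)

# Parse each digit string as a single integer (assumed conversion step).
as_int = genes.astype(np.int64)
print(genes, as_int)
```

Each whole sequence becomes one integer here, so only sequences of up to roughly 18 bases fit into a 64-bit integer.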

You can then use bitwise operations on the array.

As pointed out in the comments, the length of the resulting sequence is limited. Here is a workaround:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

>>> genes
array([['12481248'],
       ['12481248']], dtype='|S8')

Insert commas between the digits:

genes = np.char.join(',', genes)

>>> genes
array([['1,2,4,8,1,2,4,8'],
       ['1,2,4,8,1,2,4,8']], dtype='|S15')

Split on the commas and convert back to a plain np.char.array:

genes = np.char.array(np.char.split(genes, ','))

>>> genes
chararray([[['1', '2', '4', '8', '1', '2', '4', '8']],

           [['1', '2', '4', '8', '1', '2', '4', '8']]], dtype='|S1')

Convert to an int array:

genes = np.array(genes, dtype=int)

>>> genes
array([[[1, 2, 4, 8, 1, 2, 4, 8]],

       [[1, 2, 4, 8, 1, 2, 4, 8]]])

Remove the intermediate dimension of size 1:

genes = genes.reshape(list(genes.shape[:-2]) + [genes.shape[-1]])

>>> genes
array([[1, 2, 4, 8, 1, 2, 4, 8],
       [1, 2, 4, 8, 1, 2, 4, 8]])
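
With one code per base in an integer array, bitwise operations become straightforward. The sketch below is a self-contained variant for Python 3 string arrays. It swaps in a different splitting technique: instead of the join and split steps above, it views the digit strings as single characters with `view('U1')`, which assumes all sequences have the same length. The input sequences and the purine test are illustrative.

```python
import numpy as np

genes = np.array([["ACGTACGT"], ["TTGACCAG"]])  # illustrative input

for letter, digit in zip("ACGT", "1248"):
    genes = np.char.replace(genes, letter, digit)

# Every base is now exactly one digit, so a 'U1' view splits per character.
digits = np.ascontiguousarray(genes).view("U1").reshape(len(genes), -1)
codes = digits.astype(np.int64)

# Bit flags: A=1, C=2, G=4, T=8.
A, C, G, T = 1, 2, 4, 8

# Purines are A and G; test both flags with a single mask.
is_purine = (codes & (A | G)) != 0

print(codes)
print(is_purine)
```

Because each base occupies its own bit, sets of bases can be combined with `|` and tested with `&` in one vectorized pass.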
