Python string length is 4 times too long (decoding a NumPy array to UTF-8)

Posted 2024-06-26 13:28:33


I have an np.array of characters that looks like this:

[['a' 'c' 'b' 'a' 'd' 'd' 'b' 'c']
 ['a' 'd' 'c' 'd' 'b' 'c' 'a' 'b']]

However, when I call .tostring() on it, the output is full of \x00 bytes and looks odd.

So I used .decode('utf-8'), and now the strings look exactly the way I want:

result['mytxt'].apply(lambda x: x.tostring().decode("utf-8"))

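However, when I compute their lengths with len(), the count is 4 times what it should be.

Any ideas on where best to make a change so that this doesn't happen?

This feels a bit hacky: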

result['pct_a_in_mytxt'].apply(lambda s: str(s).count('a') / (len(s) / 4 ))

Edit: adding some code to reproduce the problem

import pandas as pd
import numpy as np

fakejson = [
 {   "territory": "A",   "salesqty": 98 },
 {   "territory": "A",   "salesqty": 84 },
 {   "territory": "A",   "salesqty": 56 },
 {   "territory": "A",   "salesqty": 41 },
 {   "territory": "A",   "salesqty": 82 },
 {   "territory": "B",   "salesqty": 79 },
 {   "territory": "B",   "salesqty": 36 },
 {   "territory": "B",   "salesqty": 1 },
 {   "territory": "B",   "salesqty": 52 },
 {   "territory": "B",   "salesqty": 12 },
 {   "territory": "B",   "salesqty": 17 }
]

df = pd.DataFrame(fakejson)

grouped = df.groupby(['territory'])
dfsax = grouped[['territory','salesqty']].aggregate(lambda x: tuple(x))

dfsax['sequence_len'] = dfsax['salesqty'].apply(lambda x: len(x))


from pyts.approximation import SymbolicAggregateApproximation
n_bins = 5
sax = SymbolicAggregateApproximation(n_bins=n_bins, strategy='quantile')


unique_lens = dfsax.sequence_len.unique()

result = pd.DataFrame()

for l in unique_lens:
    if l >= n_bins:
        filtered = dfsax[(dfsax['sequence_len']==l)].copy()
        if len(filtered) > 0:
            filtered['sax_txt_array'] = filtered['salesqty'].apply(lambda x: sax.fit_transform(np.array(x).reshape(1,-1)))
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            result = pd.concat([result, filtered])

# peek at the result as an array 
result[['sax_txt_array']]

# now try to make it a string
result['sax_txt_not_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring())

# decode to make it readable
result['sax_txt_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring().decode('utf-8'))

# count each new string and get the wrong result
result['sequence_len_2'] = result['sax_txt_decoded'].apply(lambda x: len(x))

result


+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| territory | sequence_len |    sax_txt_array     |                sax_txt_not_decoded                | sax_txt_decoded | sequence_len_2 |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| A         |            5 | [[e, d, b, a, c]]    | b'e\x00\x00\x00d\x00\x00\x00b\x00\x00\x00a\x00... | edbac           |             20 |
| B         |            6 | [[e, c, a, d, a, b]] | b'e\x00\x00\x00c\x00\x00\x00a\x00\x00\x00d\x00... | ecadab          |             24 |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+


1 answer
User
Reply #1 · Posted 2024-06-26 13:28:33

Without running all of your code (I don't have pyts), it looks like each cell of the sax_txt_array column is a NumPy array of strings.

For example:

In [32]: arr = np.array([['e', 'd', 'b', 'a', 'c']])                                                 
In [33]: arr                                                                                         
Out[33]: array([['e', 'd', 'b', 'a', 'c']], dtype='<U1')
In [34]: arr.tostring()                                                                              
/usr/local/bin/ipython3:1: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
  #!/usr/bin/python3
Out[34]: b'e\x00\x00\x00d\x00\x00\x00b\x00\x00\x00a\x00\x00\x00c\x00\x00\x00'
In [35]: len(_)                                                                                      
Out[35]: 20
In [36]: arr.astype('S1')                                                                            
Out[36]: array([[b'e', b'd', b'b', b'a', b'c']], dtype='|S1')
In [37]: arr.astype('S1').tostring()                                                                 
/usr/local/bin/ipython3:1: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
  #!/usr/bin/python3
Out[37]: b'edbac'
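
The factor of 4 can be read straight off the dtype. This is a small check on a hand-built array that stands in for one sax_txt_array cell (an assumption, since pyts was not run here). It uses tobytes(), the replacement that the deprecation warning above recommends:

```python
import numpy as np

# Stand-in for a single cell of the sax_txt_array column (assumed shape and content)
arr = np.array([['e', 'd', 'b', 'a', 'c']])

chars = arr.size                      # number of characters stored in the array
bytes_per_char = arr.dtype.itemsize   # bytes used per element of the 'U1' dtype
raw = arr.tobytes()                   # raw buffer; replaces the deprecated tostring()

print(chars, bytes_per_char, len(raw))
```

The raw buffer length is the character count multiplied by the item size, which is where the extra \x00 bytes come from.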

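In Python 3, strings are Unicode, and their encoded size varies (up to 4 bytes per character in UTF-8). NumPy's 'U' dtype instead stores every character in a fixed 4 bytes, so the output of tostring() is 4 times as long as the number of characters. Decoding those bytes as UTF-8 turns each \x00 into a NUL character, which is invisible when printed but still counted by len().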
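
Here is a minimal sketch of a few ways to get a string of the expected length, again using a hand-built array as a stand-in for one sax_txt_array cell. The variable names are illustrative only, and option 3 assumes the symbols are plain ASCII letters:

```python
import numpy as np

# Stand-in for a single cell of the sax_txt_array column (assumed content)
arr = np.array([['e', 'd', 'b', 'a', 'c']])

# Option 1: join the elements directly, with no byte round-trip at all.
joined = ''.join(arr.ravel())

# Option 2: decode the raw buffer with the encoding NumPy actually uses.
# astype('<U1') pins little-endian byte order so that 'utf-32-le' matches.
decoded = arr.astype('<U1').tobytes().decode('utf-32-le')

# Option 3: convert to one-byte strings first (only safe for ASCII data).
ascii_text = arr.astype('S1').tobytes().decode('ascii')

# What the question did: the NUL characters survive the UTF-8 decode.
padded = arr.astype('<U1').tobytes().decode('utf-8')

# The share of 'a' no longer needs the "/ 4" correction.
pct_a = joined.count('a') / len(joined)

print(len(joined), len(decoded), len(ascii_text), len(padded), pct_a)
```

In the question's DataFrame, something like result['sax_txt_array'].apply(lambda x: ''.join(x.ravel())) should give strings whose len() matches sequence_len, assuming each cell really is a 'U1' array as described above.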
