Python: string length is 4 times too long (decoding from an array to UTF-8)

import pandas as pd
import numpy as np

fakejson = [
    {"territory": "A", "salesqty": 98},
    {"territory": "A", "salesqty": 84},
    {"territory": "A", "salesqty": 56},
    {"territory": "A", "salesqty": 41},
    {"territory": "A", "salesqty": 82},
    {"territory": "B", "salesqty": 79},
    {"territory": "B", "salesqty": 36},
    {"territory": "B", "salesqty": 1},
    {"territory": "B", "salesqty": 52},
    {"territory": "B", "salesqty": 12},
    {"territory": "B", "salesqty": 17},
]

df = pd.DataFrame(fakejson)
grouped = df.groupby(['territory'])
dfsax = grouped[['territory', 'salesqty']].aggregate(lambda x: tuple(x))
dfsax['sequence_len'] = dfsax['salesqty'].apply(lambda x: len(x))

from pyts.approximation import SymbolicAggregateApproximation

n_bins = 5
sax = SymbolicAggregateApproximation(n_bins=n_bins, strategy='quantile')
unique_lens = dfsax.sequence_len.unique()
result = pd.DataFrame()

for l in unique_lens:
    if l >= n_bins:
        filtered = dfsax[(dfsax['sequence_len'] == l)].copy()
        if len(filtered) > 0:
            filtered['sax_txt_array'] = filtered['salesqty'].apply(
                lambda x: sax.fit_transform(np.array(x).reshape(1, -1)))
            result = result.append(filtered)

# peek at the result as an array
result[['sax_txt_array']]

# now try to make it a string
result['sax_txt_not_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring())

# decode to make it readable
result['sax_txt_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring().decode('utf-8'))

# count each new string and get the wrong result
result['sequence_len_2'] = result['sax_txt_decoded'].apply(lambda x: len(x))
result

+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| territory | sequence_len | sax_txt_array        | sax_txt_not_decoded                               | sax_txt_decoded | sequence_len_2 |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| A         | 5            | [[e, d, b, a, c]]    | b'e\x00\x00\x00d\x00\x00\x00b\x00\x00\x00a\x00... | edbac           | 20             |
| B         | 6            | [[e, c, a, d, a, b]] | b'e\x00\x00\x00c\x00\x00\x00a\x00\x00\x00d\x00... | ecadab          | 24             |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+

1 Answer

User

#1 · Posted 2024-06-26 13:28:33

I haven't run all of the code (I don't have pyts), but it looks like each cell of the sax_txt_array column is a numpy string array.

For example:

In [32]: arr = np.array([['e', 'd', 'b', 'a', 'c']])                                                 
In [33]: arr                                                                                         
Out[33]: array([['e', 'd', 'b', 'a', 'c']], dtype='<U1')
In [34]: arr.tostring()                                                                              
/usr/local/bin/ipython3:1: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
  #!/usr/bin/python3
Out[34]: b'e\x00\x00\x00d\x00\x00\x00b\x00\x00\x00a\x00\x00\x00c\x00\x00\x00'
In [35]: len(_)                                                                                      
Out[35]: 20
In [36]: arr.astype('S1')                                                                            
Out[36]: array([[b'e', b'd', b'b', b'a', b'c']], dtype='|S1')
In [37]: arr.astype('S1').tostring()                                                                 
/usr/local/bin/ipython3:1: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
  #!/usr/bin/python3
Out[37]: b'edbac'

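In Python 3, strings are Unicode, and an encoding such as UTF-8 uses a variable number of bytes per character (up to 4). NumPy's fixed-width unicode dtype (here <U1) stores every character in 4 bytes, so the output of tostring() is 4 times as long as the number of characters. Decoding those bytes as UTF-8 turns each \x00 padding byte into a NUL character, which prints as nothing but still counts toward len(). That is why 'edbac' reports a length of 20.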
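
To make the point above concrete, here is a minimal sketch of three ways to get a string of the expected length. It assumes each cell looks like the stand-in array below (shape (1, n), dtype <U1, ASCII symbols only); I have not checked this against real pyts output, and the variable names are only illustrative.

```python
import numpy as np

# Stand-in for one cell of the sax_txt_array column (assumed shape and dtype).
arr = np.array([['e', 'd', 'b', 'a', 'c']], dtype='<U1')

raw = arr.tobytes()             # tobytes() replaces the deprecated tostring()
wrong = raw.decode('utf-8')     # the NUL padding bytes survive as '\x00' characters
print(len(raw), len(wrong))     # 20 20

# Option 1: join the array elements directly, with no byte round trip.
joined = ''.join(arr.ravel())

# Option 2: decode the buffer as little-endian UTF-32, matching the '<U' layout.
decoded = raw.decode('utf-32-le')

# Option 3: convert to one-byte strings first (only safe for ASCII symbols).
as_bytes = arr.astype('S1').tobytes().decode('ascii')

print(joined, decoded, as_bytes)  # edbac edbac edbac
print(len(joined))                # 5
```

Option 1 is probably the simplest to drop into the apply() call from the question, for example lambda x: ''.join(x.ravel()), since it never touches the underlying byte layout.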
