I have an np.array of characters that looks like this:
[['a' 'c' 'b' 'a' 'd' 'd' 'b' 'c']
['a' 'd' 'c' 'd' 'b' 'c' 'a' 'b']]
However, when I call .tostring() on them, they come out looking odd, full of \x00 bytes.
So I used .decode('utf-8'), and now they look exactly the way I want:
result['mytxt'].apply(lambda x: x.tostring().decode("utf-8"))
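However, when I compute their length with len(), the count is 4 times the original length.
Any ideas on the best place to make a change so this doesn't happen?
This feels a bit hacky: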
result['pct_a_in_mytxt'].apply(lambda s: str(s).count('a') / (len(s) / 4 ))
Edit: adding some code to reproduce the problem
import pandas as pd
import numpy as np
fakejson = [
{ "territory": "A", "salesqty": 98 },
{ "territory": "A", "salesqty": 84 },
{ "territory": "A", "salesqty": 56 },
{ "territory": "A", "salesqty": 41 },
{ "territory": "A", "salesqty": 82 },
{ "territory": "B", "salesqty": 79 },
{ "territory": "B", "salesqty": 36 },
{ "territory": "B", "salesqty": 1 },
{ "territory": "B", "salesqty": 52 },
{ "territory": "B", "salesqty": 12 },
{ "territory": "B", "salesqty": 17 }
]
df = pd.DataFrame(fakejson)
grouped = df.groupby(['territory'])
dfsax = grouped[['territory','salesqty']].aggregate(lambda x: tuple(x))
dfsax['sequence_len'] = dfsax['salesqty'].apply(lambda x: len(x))
from pyts.approximation import SymbolicAggregateApproximation
n_bins = 5
sax = SymbolicAggregateApproximation(n_bins=n_bins, strategy='quantile')
unique_lens = dfsax.sequence_len.unique()
result = pd.DataFrame()
for l in unique_lens:
    if l >= n_bins:
        filtered = dfsax[(dfsax['sequence_len']==l)].copy()
        if len(filtered) > 0:
            filtered['sax_txt_array'] = filtered['salesqty'].apply(lambda x: sax.fit_transform(np.array(x).reshape(1,-1)))
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            result = pd.concat([result, filtered])
# peek at the result as an array
result[['sax_txt_array']]
# now try to make it a string
result['sax_txt_not_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring())
# decode to make it readable
result['sax_txt_decoded'] = result['sax_txt_array'].apply(lambda x: x.tostring().decode('utf-8'))
# count each new string and get the wrong result
result['sequence_len_2'] = result['sax_txt_decoded'].apply(lambda x: len(x))
result
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| territory | sequence_len | sax_txt_array | sax_txt_not_decoded | sax_txt_decoded | sequence_len_2 |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
| A | 5 | [[e, d, b, a, c]] | b'e\x00\x00\x00d\x00\x00\x00b\x00\x00\x00a\x00... | edbac | 20 |
| B | 6 | [[e, c, a, d, a, b]] | b'e\x00\x00\x00c\x00\x00\x00a\x00\x00\x00d\x00... | ecadab | 24 |
+-----------+--------------+----------------------+---------------------------------------------------+-----------------+----------------+
Without running all of the code (I don't have pyts), it looks like each cell of the sax_txt_array column is a numpy array of strings.
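As a minimal sketch of what such a cell might look like: the array below is made up to mirror the first row of the table, not taken from real pyts output, and the name `cell` is only for illustration.

```python
import numpy as np

# Hypothetical stand-in for one cell of the sax_txt_array column.
cell = np.array([['e', 'd', 'b', 'a', 'c']])

kind = cell.dtype.kind          # 'U' means a fixed-width unicode string dtype
bytes_per_item = cell.itemsize  # each one-character item occupies 4 bytes
print(cell.shape, cell.dtype, bytes_per_item)
```

If the real cells have a different dtype (for example a bytes dtype), the byte counts discussed here would not apply.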
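In Python 3, strings are unicode, and their UTF-8 encoding uses a variable number of bytes (up to 4 per character). In numpy's version, every character uses 4 bytes, so the output of tostring contains 4 bytes per character. Decoding those bytes as UTF-8 turns every \x00 padding byte into a character of its own, which is why len() reports 4 times the expected length.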
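One possible way around this, sketched under the assumption that each cell is a native-byte-order unicode array like the made-up `cell` below, is to join the array elements directly rather than going through raw bytes. Decoding the bytes as UTF-32 also works. `tobytes()` is used here because newer NumPy versions deprecate `tostring()`; it returns the same bytes.

```python
import sys
import numpy as np

# Hypothetical stand-in for one cell of the sax_txt_array column.
cell = np.array([['e', 'd', 'b', 'a', 'c']])

raw = cell.tobytes()          # 4 bytes per character
padded = raw.decode('utf-8')  # the \x00 bytes survive as characters

# Option 1: join the elements; no byte-level decoding needed.
joined = ''.join(cell.ravel())

# Option 2: decode with the 4-byte codec that matches the machine's byte order.
codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
decoded = raw.decode(codec)

# Share of 'a' computed on the array itself, without dividing a length by 4.
share_a = float(np.mean(cell == 'a'))

print(len(padded), len(joined), len(decoded), share_a)
```

In the question's DataFrame, the first option would amount to something like `result['sax_txt_array'].apply(lambda x: ''.join(x.ravel()))`, after which len() should match sequence_len.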