Converting the R example of "binary encoding by hashing cardinalities" to Python code

Posted 2024-10-02 22:31:53


I found a great blog post about categorical, numeric, one-hot, and binary encoding on Laurae's Data Science & Design curated posts.

However, the part I am most interested in is written in R:

my_data <- c("Louise",
         "Gabriel",
         "Emma",
         "Adam",
         "Alice",
         "Raphael",
         "Chloe",
         "Louis",
         "Jeanne",
         "Arthur")
matrix(
  as.integer(intToBits(as.integer(as.factor(my_data)))),
  ncol = 32,
  nrow = length(my_data),
  byrow = TRUE
)[, 1:ceiling(log(length(unique(my_data)) + 1)/log(2))]

Any help on how to apply this in Python to a "category" column of a pandas DataFrame?

Thanks in advance.


1 answer

Answer 1 · posted 2024-10-02 22:31:53

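Categoricals are a pandas data type that corresponds to categorical variables in statistics: a variable that can take on only a limited, and usually fixed, number of possible values (categories; levels in R). See the pandas documentation for details. Here is a small example from it: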

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
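The R snippet in the question starts from as.integer(as.factor(...)). As a rough sketch of the pandas counterpart, the `.cat` accessor exposes both the categories and the integer codes behind them. Note that pandas codes start at 0, whereas R's factor integers start at 1.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct values, comparable to levels(f) in R.
categories = list(s.cat.categories)

# Zero-based integer codes, comparable to as.integer(f) - 1 in R.
codes = s.cat.codes.tolist()

print(categories)
print(codes)
```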

Or, as you asked, in a DataFrame:

[code example missing from the source]
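A minimal sketch of how a DataFrame column might be converted to the category dtype. The frame and its column names "A" and "B" are placeholders chosen for illustration, not taken from the original answer.

```python
import pandas as pd

# Hypothetical frame; the column names "A" and "B" are placeholders.
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

# Convert an existing object column into a categorical one.
df["B"] = df["A"].astype("category")

print(df.dtypes)
print(df["B"].cat.categories.tolist())
```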

Differences to R's factor:

The following differences to R's factor function can be observed:

R’s levels are named categories

R’s levels are always of type string, while categories in pandas can be of any dtype.

It’s not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.

In contrast to R’s factor function, using categorical data as the sole input to create a new categorical series will not remove unused categories but create a new categorical series which is equal to the passed in one!

R allows for missing values to be included in its levels (pandas’ categories). Pandas does not allow NaN categories, but missing values can still be in the values.
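Returning to the R snippet in the question, one possible translation is sketched below. The function name `binary_encode` and the `bit_i` column names are made up for illustration. The sketch adds 1 to the pandas codes to mimic R's one-based factor integers and emits bits least significant first, which is the order intToBits uses. Because pandas marks missing values with the code -1, they end up as an all-zero row here; that is a choice of this sketch, not something the R code does.

```python
import numpy as np
import pandas as pd


def binary_encode(column: pd.Series) -> pd.DataFrame:
    """Binary-encode a column, one 0/1 column per bit (hypothetical helper)."""
    cat = column.astype("category")

    # pandas codes are 0-based and use -1 for missing values;
    # adding 1 gives R-like 1-based codes and maps missing values to 0.
    codes = cat.cat.codes.to_numpy().astype(np.int64) + 1

    # Bits needed to represent the largest code. For n >= 1,
    # n.bit_length() equals ceiling(log2(n + 1)) from the R snippet.
    n_levels = len(cat.cat.categories)
    n_bits = max(1, n_levels.bit_length())

    # Shift each code right by 0..n_bits-1 and keep the lowest bit,
    # so column 0 holds the least significant bit.
    bits = (codes[:, None] >> np.arange(n_bits)) & 1

    col_names = [f"bit_{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, index=column.index, columns=col_names)


names = ["Louise", "Gabriel", "Emma", "Adam", "Alice",
         "Raphael", "Chloe", "Louis", "Jeanne", "Arthur"]
df = pd.DataFrame({"name": names})

encoded = binary_encode(df["name"])
print(pd.concat([df, encoded], axis=1))
```

With ten distinct names the result has four bit columns. pandas sorts string categories lexically, so the codes should match R's default factor ordering for plain ASCII names like these; for other data the ordering may depend on R's locale.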
