Multi-column factorize in pandas

User

Answer 1 · edited 2024-09-28 17:30:14

You can use drop_duplicates to remove the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
   x  y
0  1  1
1  1  2
2  2  2

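Edit

To get a code for every row, you can join the original df against its de-duplicated version, using the row label of each first occurrence as the code: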

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
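
The `index` column above holds the original row label of each first occurrence, which happens to be 0, 1, 2 for this data. The following is a minimal sketch, assuming consecutive codes are wanted even when the first occurrences sit elsewhere; the sample data and the column name `code` are illustrative and not from the original answer.

```python
import pandas as pd

# Sample data where the first occurrences are at rows 0, 2 and 3.
df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 1], 'y': [1, 1, 2, 2, 2, 1]})

# Unique (x, y) pairs in order of first appearance, relabelled 0..k-1.
uniques = df.drop_duplicates().reset_index(drop=True)
uniques['code'] = uniques.index  # 'code' is an illustrative column name

# A left merge keeps the row order of df and attaches the code to every row.
result = df.merge(uniques, on=['x', 'y'], how='left')
codes = result['code'].tolist()
print(codes)
```

Resetting the index before using it as the code is what decouples the codes from the original row labels.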

User

Answer 2 · edited 2024-09-28 17:30:14

I'm not sure whether this is an efficient solution. There may be a better way.

arr = []  # holds the unique rows of the DataFrame, in order of first appearance
for i in range(len(df)):
    row = df.iloc[i].tolist()
    if row not in arr:
        arr.append(row)

Printing arr then gives:

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

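To record the code of each row, I declare a list named ind:

ind = []  # position in arr of each row's values
for i in range(len(df)):
    ind.append(arr.index(df.iloc[i].tolist()))

Printing ind gives:

>>> print(ind)
[0, 1, 2, 2, 1, 0]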
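
The two loops above search a list for every row, so the cost grows with the number of distinct rows. As a rough sketch of the same idea in a single pass, a dict keyed by row tuples can hand out codes on first sight. The names `seen` and `ind` are illustrative, and the DataFrame is the one used elsewhere in this thread.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each distinct (x, y) tuple to its code
ind = []   # code of every row, in row order
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free code the first time a row is seen
    # and returns the stored code on every later occurrence.
    ind.append(seen.setdefault(row, len(seen)))

print(ind)
```

Because dicts preserve insertion order, `list(seen)` also yields the unique rows in order of first appearance, playing the role of `arr`.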

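User

Answer 3 · edited 2024-09-28 17:30:14

You first need to build an array of tuples; pandas.lib.fast_zip does this very quickly in a Cython loop. Note that pandas.lib is an internal module of older pandas releases and is not available in current versions.

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# zip the two columns into an array of (x, y) tuples, then factorize the tuples
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])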

输出为：

[0 1 2 2 1 0]
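
For pandas versions without `pd.lib`, the same codes can be obtained through public API calls. This is a hedged sketch and not part of the original answer: it swaps in `groupby(...).ngroup()` and a plain Series of tuples in place of `fast_zip`, and makes no claim about matching its speed.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers the groups in order of first appearance.
group_codes = df.groupby(['x', 'y'], sort=False).ngroup().tolist()

# Alternative: build a Series of (x, y) tuples and factorize it.
pairs = pd.Series(list(zip(df['x'], df['y'])), index=df.index)
tuple_codes = pd.factorize(pairs)[0].tolist()

print(group_codes)
print(tuple_codes)
```

With `sort=True` (the groupby default) the groups would instead be numbered in sorted key order, which coincides with first-appearance order only for some inputs.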
