Create Pandas DataFrames from the unique values in a column

3 answers

User

#1 · edited 2024-05-18 20:14:45

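To create a DataFrame for each unique value in a column, create a dict of DataFrames, as follows.

- Create a dict in which each key is a unique value from the chosen column and each value is a DataFrame.
- Access each DataFrame as you would an entry in a standard dict (e.g. df_names['Name1']).
- .groupby returns an iterable of (key, DataFrame) pairs, which can be unpacked.
  - k is a unique value in the column and v is the data associated with that k.

With a `for-loop` and `.groupby`: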

df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v
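
As a quick illustration of the loop above, here is a minimal sketch on a tiny DataFrame. The sample rows and the 'amount' column are made up for the example; only the 'customer name' column comes from the question.

```python
import pandas as pd

# Hypothetical sample data; 'amount' is an invented column for illustration.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3'],
    'amount': [10, 20, 30, 40],
})

# Collect one sub-DataFrame per unique customer name.
df_names = {}
for k, v in df.groupby('customer name'):
    df_names[k] = v

print(sorted(df_names))     # the unique names become the dict keys
print(df_names['Name1'])    # rows 0 and 2, with their original index labels
```

Each value keeps the original index of its rows, so call .reset_index(drop=True) on a sub-DataFrame if a fresh 0-based index is wanted.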

With a Python Dictionary Comprehension

PEP 274 -- Dict Comprehensions

With `.groupby`

df_names = {k: v for (k, v) in df.groupby('customer name')}

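This came out of a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
- With 6 unique values in the column, .groupby is faster: 104 ms versus 392 ms.
- With 26 unique values in the column, .groupby is faster: 147 ms versus 1.53 s.
Using a for-loop is slightly faster than a comprehension, particularly with more unique column values or many rows (e.g. 10M).

With `.unique`:

Use Boolean indexing to select the rows that match each unique value in the chosen column.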

df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}
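
One difference between the two approaches is worth checking on your own data: missing values. The sketch below uses made-up rows (the 'amount' column and the NaN entry are assumptions for illustration) to show that .groupby skips a NaN key by default, while .unique keeps it as a key whose Boolean filter matches nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing customer name.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', np.nan, 'Name1'],
    'amount': [10, 20, 30, 40],
})

by_groupby = {k: v for k, v in df.groupby('customer name')}
by_unique = {name: df[df['customer name'] == name]
             for name in df['customer name'].unique()}

# For the non-missing keys both approaches give identical sub-DataFrames.
for key in by_groupby:
    assert by_groupby[key].equals(by_unique[key])

# groupby drops the NaN group by default (2 keys); unique() keeps NaN (3 keys),
# but `== NaN` is never true, so that entry is an empty DataFrame.
print(len(by_groupby), len(by_unique))
```

If rows with a missing name should be kept as their own group, df.groupby('customer name', dropna=False) is one option in pandas 1.1 and later.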

Testing

The following data was used for testing

import pandas as pd
import string
import random

random.seed(365)

# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

# 26 unique values (this reassigns `data`, so run only one of the two definitions)
data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

df = pd.DataFrame(data)
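
The timings quoted in this answer can be re-measured with something like the sketch below. It is not the original benchmark: the row count is reduced to 100,000 (an assumption so it runs quickly), the helper names with_groupby and with_unique are invented here, and the absolute numbers will differ by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)
n_rows = 100_000  # assumption: smaller than the 1,000,000 rows used in the answer

df = pd.DataFrame({
    'class': [random.choice(string.ascii_lowercase) for _ in range(n_rows)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)],
})

def with_groupby():
    # One pass over the data, split by group.
    return {k: v for k, v in df.groupby('class')}

def with_unique():
    # One full Boolean scan of the column per unique value.
    return {name: df[df['class'] == name] for name in df['class'].unique()}

runs = 5
t_groupby = timeit.timeit(with_groupby, number=runs) / runs
t_unique = timeit.timeit(with_unique, number=runs) / runs
print(f'groupby: {t_groupby * 1000:.1f} ms per run')
print(f'unique:  {t_unique * 1000:.1f} ms per run')
```

The .unique version rescans the whole column once per unique value, which is a plausible reason its cost grows with the number of unique values.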

User

#2 · edited 2024-05-18 20:14:45

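Maybe I've misunderstood you, but when

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

only gives the correct output for the last list item, that is because the output is outside the indented body of the loop.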

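import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']

for name in customer_list:
    # keep the result in its own variable so the loop variable is not overwritten
    subset = customer_df.loc[customer_df['customer'] == name]
    print(subset)
    print('now I could append the data to something new')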

You get the output:

  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new

Or, if you don't like loops, you could use

import pandas as pd

# from_dict replaces DataFrame.from_items, which was removed in pandas 1.0
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']


print(customer_df[customer_df['customer'].isin(customer_list)])

Output:

  customer country
A     Jean  France
B    James     USA

df.isin is best explained in: How to implement 'in' and 'not in' for Pandas dataframe
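
For completeness, here is a small sketch of both directions of the isin filter. It rebuilds the three-row customer table from this answer with a plain dict constructor; the variable names mask, in_list and not_in_list are invented for the example.

```python
import pandas as pd

customer_df = pd.DataFrame(
    {'customer': ['Jean', 'James', 'Hans'],
     'country': ['France', 'USA', 'Germany']},
    index=['A', 'B', 'C'],
)
customer_list = ['James', 'Jean']

# Boolean Series: True where the customer appears in customer_list.
mask = customer_df['customer'].isin(customer_list)

in_list = customer_df[mask]        # the 'in' case: rows A and B
not_in_list = customer_df[~mask]   # the 'not in' case: row C
print(not_in_list)
```

The ~ operator inverts the Boolean mask, which is the usual way to express 'not in' with pandas.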

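User

#3 · edited 2024-05-18 20:14:45

Your current loop overwrites x twice on every iteration: the for loop assigns a customer name to x, and then you assign a DataFrame to it.

To be able to call each DataFrame by name later, try storing them in a dictionary: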

df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

df_dict['Name3']
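
A self-contained sketch of the same idea, in case it helps to see it run end to end. The rows, the 'amount' column and the contents of customerNames are made up here; in the question they would come from the asker's own data.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's DataFrame and list of names.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name3', 'Name3'],
    'amount': [5, 7, 11, 13],
})
customerNames = ['Name1', 'Name2', 'Name3']

df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

# Every name keeps its own DataFrame; nothing is overwritten between iterations.
for name, frame in df_dict.items():
    print(name, len(frame))

# Each entry is an ordinary DataFrame and can be used as such.
total_name3 = df_dict['Name3']['amount'].sum()
print(total_name3)
```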
