Finding values based on specific categories

Posted 2024-09-27 21:26:32


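I would like to know how to find an estimated value based on several different categories. Two of the columns are categorical, another contains the strings of interest, and the last contains numeric values. I have a csv file called sports.csv.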

import pandas as pd
import numpy as np

#loading the data into data frame
df = pd.read_csv('sports.csv')

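I am trying to find a suggested price for a Gym that offers both Baseball and Basketball and has an enrollment between 240 and 260, given that it is in Region 4 and of Type 1.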

Region  Type    enroll  estimates   price   Gym
2   1   377 0.43    40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4   2   100 0.26    37  Baseball|Tennis
4   1   347 0.65    61  Basketball|Baseball|Ballet
4   1   264 0.17    12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1   1   286 0.74    78  Swimming|Basketball
0   1   210 0.13    29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0   1   263 0.91    31  Tennis
2   2   271 0.39    54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3   3   247 0.51    33  Baseball|Hockey|Swimming|Cycling
0   1   109 0.12    17  Football|Hockey|Volleyball

I don't know how to piece everything together. Apologies if the syntax is incorrect; I have only just started using Python. This is what I have so far:

import pandas as pd
import numpy as np

#loading the data into data frame
df = pd.read_csv('sports.csv')

#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')

price = df.loc[df['Gym'], 'Price']

Should I be using groupby instead? If so, how would I include the conditions Type == 1, Region == 4, and an enrollment from 240 to 260?


2 answers

I had to add a row that actually meets your conditions, otherwise you would get an empty result. You want to use df.loc with the following conditions:

In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region  Type    enroll  estimates   price   Gym
    ...: 2   1   377 0.43    40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
    ...: 4   2   100 0.26    37  Baseball|Tennis
    ...: 4   1   247 0.65    61  Basketball|Baseball|Ballet
    ...: 4   1   264 0.17    12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
    ...: 1   1   286 0.74    78  Swimming|Basketball
    ...: 0   1   210 0.13    29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
    ...: 0   1   263 0.91    31  Tennis
    ...: 2   2   271 0.39    54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
    ...: 3   3   247 0.51    33  Baseball|Hockey|Swimming|Cycling
    ...: 0   1   109 0.12    17  Football|Hockey|Volleyball""")

In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")

In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)") 
    ...:        & (df.enroll <= 260) & (df.enroll >= 240) 
    ...:        & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]: 
2    61
Name: price, dtype: int64

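Note that I used a regex pattern with contains; the two lookaheads effectively act as an AND operator inside the regex. You could just as well combine two separate .contains conditions, one for Basketball and one for Baseball.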
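As a minimal sketch of the equivalence mentioned above, the snippet below builds a three-row toy frame (the rows are made up for illustration; only the column names come from the question's table) and compares the lookahead regex with two plain substring checks. It also uses `Series.between`, which is inclusive on both ends by default, as a possible shorthand for the two enrollment comparisons.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the question's table
df = pd.DataFrame({
    "Region": [4, 4, 4],
    "Type": [1, 1, 2],
    "enroll": [247, 264, 100],
    "price": [61, 12, 37],
    "Gym": [
        "Basketball|Baseball|Ballet",
        "Swimming|Basketball|Baseball",
        "Baseball|Tennis",
    ],
})

# Lookahead pattern: both words must appear somewhere in the string
both_regex = df["Gym"].str.contains(r"(?=.*Baseball)(?=.*Basketball)")

# Same condition written as two literal substring checks joined with &
both_plain = (
    df["Gym"].str.contains("Baseball", regex=False)
    & df["Gym"].str.contains("Basketball", regex=False)
)

# between() includes both endpoints by default (240 <= enroll <= 260)
in_range = df["enroll"].between(240, 260)

mask = both_plain & in_range & (df["Region"] == 4) & (df["Type"] == 1)
result = df.loc[mask, "price"]
print(result.tolist())  # [61]
```

Each comparison needs its own parentheses when joined with `&`, because `&` binds more tightly than `==` or `<=` in Python.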

You can build a mask from all of the specified conditions and then use that mask to subset the data:

mask = (df['Region'] == 4) & (df['Type'] == 1) & \
       (df['enroll'] <= 260) & (df['enroll'] >= 240) & \
        df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')

df.loc[mask, 'price']
# Series([], Name: price, dtype: int64)

It returns an empty Series because no record satisfies all of the conditions above.
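The question also tried splitting the Gym column on the pipe character. One possible way to make that work, sketched here against the data exactly as posted in the question, is to split each entry into a list and test whether the wanted sports form a subset of it. The names `wanted`, `strict` and `relaxed` are illustrative, not from the original post. The relaxed query drops the enrollment condition to show which Region 4, Type 1 rows come closest.

```python
import io
import pandas as pd

# The table exactly as posted in the question
raw = """Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Split each Gym entry on the pipe and require both sports as whole tokens
wanted = {"Baseball", "Basketball"}
has_both = df["Gym"].str.split("|").apply(wanted.issubset)

region_type = (df["Region"] == 4) & (df["Type"] == 1)

# All conditions from the question: empty for the posted data
strict = df.loc[has_both & region_type & df["enroll"].between(240, 260), "price"]

# Without the enrollment range: the two nearest candidates
relaxed = df.loc[has_both & region_type, ["enroll", "price"]]

print(strict.empty)                 # True
print(relaxed["enroll"].tolist())   # [347, 264]
```

In the posted data the two Region 4, Type 1 gyms with both sports have enrollments of 347 and 264, both outside the 240 to 260 range, which is why the strict query comes back empty.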
