pandas dataframe在其他列值上使用多个条件添加具有值的列

2024-09-28 22:21:51 发布

您现在位置:Python中文网/ 问答频道 /正文

我有熊猫完整的数据框叫'调整'。我想在'fyear'和'conm'上添加具有新值条件的'stage'列。在

    fyear   conm                indadjsg
1   1999    1-800-FLOWERS.COM   26.646086
2   2000    1-800-FLOWERS.COM   22.727175 
3   2001    1-800-FLOWERS.COM   7.312014
4   2002    1-800-FLOWERS.COM   4.948308
5   2003    1-800-FLOWERS.COM   6.278798
23  1996    ABERCROMBIE & FITCH -CL A   34.831691
24  1997    ABERCROMBIE & FITCH -CL A   48.053137
25  1998    ABERCROMBIE & FITCH -CL A   48.918326
26  1999    ABERCROMBIE & FITCH -CL A   46.956456
27  2000    ABERCROMBIE & FITCH -CL A   33.91436
28  2001    ABERCROMBIE & FITCH -CL A   67.23423
29  2002    ABERCROMBIE & FITCH -CL A   99.09342
11929   2006    CLIFTON BANCORP INC 0.236418
11930   2007    CLIFTON BANCORP INC -1.366626
11931   2008    CLIFTON BANCORP INC 8.564019
11932   2009    CLIFTON BANCORP INC -4.966110
11933   2010    CLIFTON BANCORP INC -4.359552
11934   2011    CLIFTON BANCORP INC -16.313852
11935   2012    CLIFTON BANCORP INC -18.193550
11936   2013    CLIFTON BANCORP INC -10.126603
11937   2014    CLIFTON BANCORP INC 4.718584
11938   2015    CLIFTON BANCORP INC -11.889065
11940   2015    CLIPPER REALTY INC  70.945767
11941   2016    CLIPPER REALTY INC  3.776001
11980   2014    CM FINANCE INC  205.894048
11981   2015    CM FINANCE INC  68.518555
121247  2009    VCA INC -5.552030
121248  2010    VCA INC -3.357275
121249  2011    VCA INC -0.930798
121250  2012    VCA INC 5.974914
121256  2007    VIASPACE INC    -50.966869
121257  2008    VIASPACE INC    149.957403
121258  2009    VIASPACE INC    197.776855
121259  2010    VIASPACE INC    -25.201733
121260  2011    VIASPACE INC    77.082624
121261  2012    VIASPACE INC    78.034233
121266  2005    YASHENG GROUP   -3.728098
121267  2006    YASHENG GROUP   -2.233927
121268  2007    YASHENG GROUP   0.349349
121279  2009    YUHE INTERNATIONAL INC  27.995324
121280  2010    YUHE INTERNATIONAL INC  34.375630

1)如果唯一公司的年度数等于或小于5,我想填写“开始”。在

^{pr2}$

2)在我用“start”条件填充完剩余数据后,我想填充另一个值。 计算出的唯一公司平均值。在

mask2 = adjusted.groupby(by=['conm'])['indadjsg']
countsg = mask2.mean().to_frame().reset_index()
c = countsg.dropna()   

数据框“c”

    conm                indadjsg
0   1-800-FLOWERS.COM   3.291539
1   ABERCROMBIE & FITCH -CL A   105.335324
2   CLIFTON BANCORP INC 22.920683
3   CLIPPER REALTY INC  36.784677
4   CM FINANCE INC  1.605919
5   VCA INC 3.116871
6   VIASPACE INC    -106.153789
7   YASHENG GROUP   -2.676296
8   YUHE INTERNATIONAL INC  12.306557

我想给出的条件是:

      indadjsg  < 0,  'decline'
 0 <= indadjsg  <= 15, 'revival'
 15< indadjsg  <= 100, 'mature'
 100< indajsg         , 'growth'

我要制作的最终数据帧是这样的

    fyear   conm                indadjsg    stage
1   1999    1-800-FLOWERS.COM   26.646086   start
2   2000    1-800-FLOWERS.COM   22.727175   start
3   2001    1-800-FLOWERS.COM   7.312014    start
4   2002    1-800-FLOWERS.COM   4.948308    start
5   2003    1-800-FLOWERS.COM   6.278798    start
23  1996    ABERCROMBIE & FITCH -CL A   34.831691  growth 
24  1997    ABERCROMBIE & FITCH -CL A   48.053137  growth    
25  1998    ABERCROMBIE & FITCH -CL A   48.918326  growth    
26  1999    ABERCROMBIE & FITCH -CL A   46.956456  growth 
27  2000    ABERCROMBIE & FITCH -CL A   33.91436  growth 
28  2001    ABERCROMBIE & FITCH -CL A   67.23423  growth 
29  2002    ABERCROMBIE & FITCH -CL A   99.09342    growth 
11929   2006    CLIFTON BANCORP INC 0.236418        mature
11930   2007    CLIFTON BANCORP INC -1.366626       mature
11931   2008    CLIFTON BANCORP INC 8.564019        mature 
11932   2009    CLIFTON BANCORP INC -4.966110       mature 
11933   2010    CLIFTON BANCORP INC -4.359552       mature 
11934   2011    CLIFTON BANCORP INC -16.313852      mature 
11935   2012    CLIFTON BANCORP INC -18.193550      mature 
11936   2013    CLIFTON BANCORP INC -10.126603      mature 
11937   2014    CLIFTON BANCORP INC 4.718584        mature 
11938   2015    CLIFTON BANCORP INC -11.889065      mature 
11940   2015    CLIPPER REALTY INC  70.945767       start
11941   2016    CLIPPER REALTY INC  3.776001        start
11980   2014    CM FINANCE INC  205.894048    start
11981   2015    CM FINANCE INC  68.518555     start
121247  2009    VCA INC -5.552030             start
121248  2010    VCA INC -3.357275             start
121249  2011    VCA INC -0.930798             start
121250  2012    VCA INC 5.974914              start
121256  2007    VIASPACE INC    -50.966869    decline
121257  2008    VIASPACE INC    149.957403    decline
121258  2009    VIASPACE INC    197.776855    decline
121259  2010    VIASPACE INC    -25.201733    decline
121260  2011    VIASPACE INC    77.082624     decline
121261  2012    VIASPACE INC    78.034233     decline 
121266  2005    YASHENG GROUP   -3.728098        start
121267  2006    YASHENG GROUP   -2.233927        start
121268  2007    YASHENG GROUP   0.349349         start
121279  2009    YUHE INTERNATIONAL INC  27.995324    start
121280  2010    YUHE INTERNATIONAL INC  34.375630    start

有什么办法我能马上做吗?我只能考虑做单独的列并合并它。你能帮我有效地思考吗?提前谢谢你。在


Tags: comclstartincgrowthvcaflowersbancorp
3条回答

开始于:

Adjusted:
        fyear   conm                        indadjsg     
0       1999    1-800-FLOWERS.COM           26.646086             
1       2000    1-800-FLOWERS.COM           22.727175             
2       2001    1-800-FLOWERS.COM           7.312014              
3       2002    1-800-FLOWERS.COM           4.948308              
4       2003    1-800-FLOWERS.COM           6.278798              
5       1996    ABERCROMBIE & FITCH -CL A   34.831691             
6       1997    ABERCROMBIE & FITCH -CL A   48.053137             
...
35      2012    VIASPACE INC                78.034233             
36      2005    YASHENG GROUP               -3.728098             
37      2006    YASHENG GROUP               -2.233927             
38      2007    YASHENG GROUP               0.349349              
39      2009    YUHE INTERNATIONAL INC      27.995324             
40      2010    YUHE INTERNATIONAL INC      34.375630             

这段代码并不特别聪明,但非常简单:

^{pr2}$

输出如下:

    fyear   conm                        indadjsg    stage  
0   1999    1-800-FLOWERS.COM           26.646086   start  
1   2000    1-800-FLOWERS.COM           22.727175   start  
2   2001    1-800-FLOWERS.COM           7.312014    start  
3   2002    1-800-FLOWERS.COM           4.948308    start  
4   2003    1-800-FLOWERS.COM           6.278798    start  
5   1996    ABERCROMBIE & FITCH -CL A   34.831691   mature 
6   1997    ABERCROMBIE & FITCH -CL A   48.053137   mature 
7   1998    ABERCROMBIE & FITCH -CL A   48.918326   mature 
8   1999    ABERCROMBIE & FITCH -CL A   46.956456   mature 
9   2000    ABERCROMBIE & FITCH -CL A   33.914360   mature 
10  2001    ABERCROMBIE & FITCH -CL A   67.234230   mature 
11  2002    ABERCROMBIE & FITCH -CL A   99.093420   mature 
12  2006    CLIFTON BANCORP INC         0.236418    decline
13  2007    CLIFTON BANCORP INC         -1.366626   decline
14  2008    CLIFTON BANCORP INC         8.564019    decline
15  2009    CLIFTON BANCORP INC         -4.966110   decline
16  2010    CLIFTON BANCORP INC         -4.359552   decline
17  2011    CLIFTON BANCORP INC         -16.313852  decline
18  2012    CLIFTON BANCORP INC         -18.193550  decline
19  2013    CLIFTON BANCORP INC         -10.126603  decline
20  2014    CLIFTON BANCORP INC         4.718584    decline
21  2015    CLIFTON BANCORP INC         -11.889065  decline
22  2015    CLIPPER REALTY INC          70.945767   start  
23  2016    CLIPPER REALTY INC          3.776001    start  
24  2014    CM FINANCE INC              205.894048  start  
25  2015    CM FINANCE INC              68.518555   start  
26  2009    VCA INC                     -5.552030   start  
27  2010    VCA INC                     -3.357275   start  
28  2011    VCA INC                     -0.930798   start  
29  2012    VCA INC                     5.974914    start  
30  2007    VIASPACE INC                -50.966869  mature 
31  2008    VIASPACE INC                149.957403  mature 
32  2009    VIASPACE INC                197.776855  mature 
33  2010    VIASPACE INC                -25.201733  mature 
34  2011    VIASPACE INC                77.082624   mature 
35  2012    VIASPACE INC                78.034233   mature 
36  2005    YASHENG GROUP               -3.728098   start  
37  2006    YASHENG GROUP               -2.233927   start  
38  2007    YASHENG GROUP               0.349349    start  
39  2009    YUHE INTERNATIONAL INC      27.995324   start  
40  2010    YUHE INTERNATIONAL INC      34.375630   start            

我相信您可以通过pd.cutnp.where来实现这一点:

adjusted # copied text from your example
Out[86]: 
    fyear               conm   indadjsg
0    1999  1-800-FLOWERS.COM   26.64609
1    2000  1-800-FLOWERS.COM   22.72717
2    2001  1-800-FLOWERS.COM    7.31201
3    2002  1-800-FLOWERS.COM    4.94831
4    2003  1-800-FLOWERS.COM    6.27880
5    1996        ABERCROMBIE   34.83169
6    1997        ABERCROMBIE   48.05314
7    1998        ABERCROMBIE   48.91833
8    1999        ABERCROMBIE   46.95646
9    2000        ABERCROMBIE   33.91436
10   2001        ABERCROMBIE   67.23423
11   2002        ABERCROMBIE   99.09342
..    ...                ...        ...
25   2015                 CM   68.51856
26   2009                VCA   -5.55203
27   2010                VCA   -3.35728
28   2011                VCA   -0.93080
29   2012                VCA    5.97491
30   2007           VIASPACE  -50.96687
31   2008           VIASPACE  149.95740
32   2009           VIASPACE  197.77686
33   2010           VIASPACE  -25.20173
34   2011           VIASPACE   77.08262
35   2012           VIASPACE   78.03423
36   2005            YASHENG   -3.72810

byyr = adjusted.groupby(by='conm')['fyear'].count().to_frame()
start = byyr.fyear[adjusted.conm]

indadjsg = adjusted.groupby(by='conm')['indadjsg'].mean().to_frame()
px = indadjsg.indadjsg[adjusted.conm]
categories = pd.cut(px.values.reshape((len(px), )), 
                    bins= [-np.inf, 0, 15, 100, np.inf], 
                    labels=['decline', 'revival', 'mature', 'growth'])

adjusted.loc[:, 'stage'] = np.where(start <= 5, 'start', categories)

adjusted # result
Out[130]: 
    fyear               conm   indadjsg   stage
0    1999  1-800-FLOWERS.COM   26.64609   start
1    2000  1-800-FLOWERS.COM   22.72717   start
2    2001  1-800-FLOWERS.COM    7.31201   start
3    2002  1-800-FLOWERS.COM    4.94831   start
4    2003  1-800-FLOWERS.COM    6.27880   start
5    1996        ABERCROMBIE   34.83169  mature
6    1997        ABERCROMBIE   48.05314  mature
7    1998        ABERCROMBIE   48.91833  mature
8    1999        ABERCROMBIE   46.95646  mature
9    2000        ABERCROMBIE   33.91436  mature
10   2001        ABERCROMBIE   67.23423  mature
11   2002        ABERCROMBIE   99.09342  mature
..    ...                ...        ...     ...
25   2015                 CM   68.51856   start
26   2009                VCA   -5.55203   start
27   2010                VCA   -3.35728   start
28   2011                VCA   -0.93080   start
29   2012                VCA    5.97491   start
30   2007           VIASPACE  -50.96687  mature
31   2008           VIASPACE  149.95740  mature
32   2009           VIASPACE  197.77686  mature
33   2010           VIASPACE  -25.20173  mature
34   2011           VIASPACE   77.08262  mature
35   2012           VIASPACE   78.03423  mature
36   2005            YASHENG   -3.72810   start

打开pd.切割,请确保使用right=Trueright=False.指定容器的边缘

有一种方法可以用一个groupby/transform操作计算stage列(请参阅下面的classify函数),但它涉及到为每个组调用一次自定义Python函数。如果有很多组,这往往是低效的。在

通常,当您替换大量Python时,您会获得更好的性能 对整个(大)数据帧或 数据帧的大列。在

因此,如果有很多conms(即很多组),可能最好 按照你的第一个想法,为每个公司计算阶段,然后合并 结果返回到adjusted。这里有一种方法合并 通过调用join完成:

import numpy as np
import pandas as pd
adjusted = pd.DataFrame({'conm': ['1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIPPER REALTY INC', 'CLIPPER REALTY INC', 'CM FINANCE INC', 'CM FINANCE INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'YASHENG GROUP', 'YASHENG GROUP', 'YASHENG GROUP', 'YUHE INTERNATIONAL INC', 'YUHE INTERNATIONAL INC'], 'fyear': [1999, 2000, 2001, 2002, 2003, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2014, 2015, 2009, 2010, 2011, 2012, 2007, 2008, 2009, 2010, 2011, 2012, 2005, 2006, 2007, 2009, 2010], 'indadjsg': [26.646085999999997, 22.727175, 7.312014, 4.948308, 6.278798, 34.831691, 48.053137, 48.918326, 46.956456, 33.914359999999995, 67.23423000000001, 99.09342, 0.236418, -1.3666260000000001, 8.564019, -4.96611, -4.359552, -16.313852, -18.19355, -10.126603, 4.718584, -11.889064999999999, 70.945767, 3.7760010000000004, 205.894048, 68.518555, -5.55203, -3.357275, -0.9307979999999999, 5.974914, -50.966869, 149.957403, 197.776855, -25.201732999999997, 77.082624, 78.034233, -3.728098, -2.233927, 0.34934899999999997, 27.995324, 34.37563]}, index=[1, 2, 3, 4, 5, 23, 24, 25, 26, 27, 28, 29, 11929, 11930, 11931, 11932, 11933, 11934, 11935, 11936, 11937, 11938, 11940, 11941, 11980, 11981, 121247, 121248, 121249, 121250, 121256, 121257, 121258, 121259, 121260, 121261, 121266, 121267, 121268, 121279, 121280])

grouped = adjusted.groupby(by=['conm'])
stage = pd.cut(grouped['indadjsg'].mean(), bins=[-np.inf,0,15,100,np.inf], labels=False)
stage.name = 'stage'
labels = np.array(['decline', 'revival', 'mature', 'growth'])
adjusted = adjusted.join(stage, on='conm')
adjusted['stage'] = labels[adjusted['stage']]
mask = (grouped['fyear'].transform('count') <= 5)
adjusted.loc[mask, 'stage'] = 'start'
print(adjusted)

收益率

^{pr2}$

这是另一种方法,当有很多组时,速度会慢一些(但如果组数很少,可能会更快)。在

您可以使用一个groupby/transform操作来计算stage列 使用自定义Python函数,classifyclassify为每个组调用一次,即为conm的每个值调用一次。在

import bisect
def classify(grp, grid=[0,15,100,np.inf], 
             labels=['decline', 'revival', 'mature', 'growth']):
    return 'start' if len(grp) <= 5 else labels[bisect.bisect_left(grid, grp.mean())]

grouped = adjusted.groupby(by=['conm'])
adjusted['stage'] = grouped['indadjsg'].transform(classify)
print(adjusted)

相关问题 更多 >