<p>有一种方法可以用一个<code>groupby/transform</code>操作计算<code>stage</code>列(请参阅下面的<code>classify</code>函数),但它涉及到为每个组调用一次自定义Python函数。如果有很多组,这往往是低效的。在</p>
<p>通常,当您替换大量Python时,您会获得更好的性能
对整个(大)数据帧或
数据帧的大列。在</p>
<p>因此,如果有很多<code>conm</code>s(即很多组),可能最好
按照你的第一个想法,为每个公司计算阶段,然后合并
结果返回到<code>adjusted</code>。这里有一种方法合并
通过调用<code>join</code>完成:</p>
<pre><code>import numpy as np
import pandas as pd
adjusted = pd.DataFrame({'conm': ['1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIPPER REALTY INC', 'CLIPPER REALTY INC', 'CM FINANCE INC', 'CM FINANCE INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'YASHENG GROUP', 'YASHENG GROUP', 'YASHENG GROUP', 'YUHE INTERNATIONAL INC', 'YUHE INTERNATIONAL INC'], 'fyear': [1999, 2000, 2001, 2002, 2003, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2014, 2015, 2009, 2010, 2011, 2012, 2007, 2008, 2009, 2010, 2011, 2012, 2005, 2006, 2007, 2009, 2010], 'indadjsg': [26.646085999999997, 22.727175, 7.312014, 4.948308, 6.278798, 34.831691, 48.053137, 48.918326, 46.956456, 33.914359999999995, 67.23423000000001, 99.09342, 0.236418, -1.3666260000000001, 8.564019, -4.96611, -4.359552, -16.313852, -18.19355, -10.126603, 4.718584, -11.889064999999999, 70.945767, 3.7760010000000004, 205.894048, 68.518555, -5.55203, -3.357275, -0.9307979999999999, 5.974914, -50.966869, 149.957403, 197.776855, -25.201732999999997, 77.082624, 78.034233, -3.728098, -2.233927, 0.34934899999999997, 27.995324, 34.37563]}, index=[1, 2, 3, 4, 5, 23, 24, 25, 26, 27, 28, 29, 11929, 11930, 11931, 11932, 11933, 11934, 11935, 11936, 11937, 11938, 11940, 11941, 11980, 11981, 121247, 121248, 121249, 121250, 121256, 121257, 121258, 121259, 121260, 121261, 121266, 121267, 121268, 121279, 121280])
grouped = adjusted.groupby(by=['conm'])
stage = pd.cut(grouped['indadjsg'].mean(), bins=[-np.inf,0,15,100,np.inf], labels=False)
stage.name = 'stage'
labels = np.array(['decline', 'revival', 'mature', 'growth'])
adjusted = adjusted.join(stage, on='conm')
adjusted['stage'] = labels[adjusted['stage']]
mask = (grouped['fyear'].transform('count') <= 5)
adjusted.loc[mask, 'stage'] = 'start'
print(adjusted)
</code></pre>
<p>收益率</p>
^{pr2}$
<hr/>
<p>这是另一种方法,当有很多组时,速度会慢一些(但如果组数很少,可能会更快)。在</p>
<p>您可以使用一个<code>groupby/transform</code>操作来计算<code>stage</code>列
使用自定义Python函数,<code>classify</code>。<code>classify</code>为每个组调用一次,即为<code>conm</code>的每个值调用一次。在</p>
<pre><code>import bisect
def classify(grp, grid=[0,15,100,np.inf],
labels=['decline', 'revival', 'mature', 'growth']):
return 'start' if len(grp) <= 5 else labels[bisect.bisect_left(grid, grp.mean())]
grouped = adjusted.groupby(by=['conm'])
adjusted['stage'] = grouped['indadjsg'].transform(classify)
print(adjusted)
</code></pre>