Python pandas: calculating the percentage of returning users per category

Posted 2024-10-02 16:28:11


I have the following pandas DataFrame:

          |   Number of visits per year  |
user id   |  2013  | 2014 | 2015 | 2016  |
   A           4       3     6      0     
   B           3       0     7      3
   C          10       6     3      0

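I want to calculate the percentage of returning users based on their number of visits. Sorry, I don't have any code yet; I don't know how to start.

This is the final result I'm looking for: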

[expected output table not preserved in the source]

So, based on the above, I could say that of the customers who visited the store 4 times in 2013, 15% came back to the store in 2014.
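
One possible reading of this request, as a rough sketch rather than a definitive answer: for each pair of consecutive years, group users by how many times they visited in the earlier year and take the share who have at least one visit in the later year. The helper name `return_rates`, the flat year column labels, and the choice to ignore users with zero visits in the earlier year are all assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question; flat integer year labels are an assumption.
visits = pd.DataFrame(
    {2013: [4, 3, 10], 2014: [3, 0, 6], 2015: [6, 7, 3], 2016: [0, 3, 0]},
    index=pd.Index(["A", "B", "C"], name="user id"),
)


def return_rates(visits):
    """Hypothetical helper: % of users who came back the following year,
    grouped by their visit count in the current year."""
    years = list(visits.columns)
    rates = {}
    for this_year, next_year in zip(years, years[1:]):
        current = visits[this_year]
        active = current > 0  # assumption: skip users with no visits this year
        # 1 if the user has at least one visit next year, else 0
        returned = (visits.loc[active, next_year] > 0).astype(int)
        # Mean of the 0/1 flag per visit count, times 100, is the percentage.
        rates[f"%returned{next_year}"] = returned.groupby(current[active]).mean() * 100
    # Rows: visit count in the earlier year; columns: one per year pair.
    return pd.DataFrame(rates)


result = return_rates(visits)
print(result)
```

With only three users every percentage is 0 or 100; on real data the same grouping would give figures such as the 15% mentioned above.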

Thank you very much.


Update: this is what I did. Maybe there is a better way to do it with a loop?

For each year I have a CSV like this:

user_id |    NR_V
   A           4      
   B           3       
   C          10 

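NR_V stands for the number of visits.

I loaded each CSV as its own DataFrame, so I have df_2009, df_2010, ... up to 2016.

To each one I added a 0/1 column indicating whether the user shopped again the following year.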

 df_2009['shopped2010'] = np.where(df_2009['user_ID'].isin(df_2010['user_ID']), 1, 0)

Then I pivot each DataFrame.

 pivot_2009 = pd.pivot_table(df_2009,index=["NR_V"],aggfunc={"NR_V":len, "shopped2010":np.sum})

Next, for each pivot table, I create a new DataFrame with a column holding the return percentage per number of visits.

p_2009 = pd.DataFrame()
p_2009['%returned2010'] = (pivot_2009['shopped2010']/pivot_2009['NR_V'])*100

Finally, I combine all of these DataFrames into one.

dfs = [p_2009, p_2010, p_2011, p_2012, p_2013, p_2014, p_2015 ]
final = pd.concat(dfs, axis=1)
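
The per-year steps above could be folded into a single loop, sketched here under some assumptions: the yearly frames are kept in a dict keyed by year (the name `frames_by_year` and the tiny stand-in data are hypothetical; in practice each frame would come from `pd.read_csv`), and the key column is called `user_ID` as in the code above.

```python
import pandas as pd

# Hypothetical stand-ins for the yearly CSV files.
frames_by_year = {
    2013: pd.DataFrame({"user_ID": ["A", "B", "C"], "NR_V": [4, 3, 10]}),
    2014: pd.DataFrame({"user_ID": ["A", "C"], "NR_V": [3, 6]}),
    2015: pd.DataFrame({"user_ID": ["A", "B", "C"], "NR_V": [6, 7, 3]}),
}

years = sorted(frames_by_year)
columns = []
for this_year, next_year in zip(years, years[1:]):
    current = frames_by_year[this_year]
    # True if the user also appears in next year's file.
    returned = current["user_ID"].isin(frames_by_year[next_year]["user_ID"])
    # Mean of the 0/1 flag per visit count, times 100, is the return percentage.
    pct = returned.astype(int).groupby(current["NR_V"]).mean() * 100
    columns.append(pct.rename(f"%returned{next_year}"))

# One row per visit count, one column per year pair.
final = pd.concat(columns, axis=1)
print(final)
```

Grouping the 0/1 flag directly replaces the separate pivot table and percentage frame for each year, so no per-year variable names are needed.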

3 answers

Consider the sample visits DataFrame df:

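import numpy as np
import pandas as pd

# 100 users with random visit counts (1-9) for each of five years
df = pd.DataFrame(
    np.random.randint(1, 10, (100, 5)),
    pd.Index(['user_{}'.format(i) for i in range(1, 101)], name='user id'),
    [
        ['Number of visits per year'] * 5,
        [2012, 2013, 2014, 2015, 2016]
    ]
)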

df.head()



An entry of 8 represents 8 separate visits, so it should be counted 8 times. I'll use repeat before value_counts to accomplish this.

[code not preserved in the source]
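
The answer's code for this step is not available here, so the following is only a guess at the idea and not the original: repeating each entry by its own value before counting makes a user with 8 visits weigh as 8 observations. The small Series is made-up data.

```python
import pandas as pd

# Made-up visit counts for one year (not taken from the answer).
visits_2013 = pd.Series([4, 3, 3], index=["A", "B", "C"])

# Plain value_counts counts users: two users visited 3 times, one visited 4 times.
by_user = visits_2013.value_counts()

# repeat() duplicates each entry by its own value, so value_counts then
# counts visits rather than users.
by_visit = visits_2013.repeat(visits_2013.to_numpy()).value_counts()

print(by_user.to_dict())
print(by_visit.to_dict())
```

The two tables differ only in weighting: the first treats every user equally, the second treats every visit equally.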


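I took each visitor's index value and checked whether the same index value (that is, the same Visitor_ID) has more than 0 visits in the following year. The result is added to a dictionary as True or False, which you can then use for a bar chart. I also built two lists (times_returned and returned_at_all) for additional data manipulation.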

import pandas as pd

# Part 1, Building the dataframe.

df = pd.DataFrame({
                   'Visitor_ID':[1,2,3],
                   '2010'      :[4,3,10],
                   '2011'      :[3,0,6],
                   '2012'      :[6,7,3],
                   '2013'      :[0,3,0]    
                   })

df.set_index("Visitor_ID", inplace=True)

# Part 2, preparing the required variables.

def dictionary (max_visitors):
    dictionary={}
    for x in range(max_visitors):
        dictionary["number_{}".format(x)] = []
#    print(dictionary)
    return dictionary

# Part 3, Figuring out if the customer returned.             

def compare_yearly_visits(current_year, next_year):    
    index = 1 
    years = df.columns
    for x in df[current_year]: 
#        print (df[years][current_year][index], 'this year.')
#        print (df[years][next_year][index], 'Next year.')
        how_many_visits = df[years][current_year][index] 
        did_he_return   = df[years][next_year][index]

        if did_he_return > 0: 
            # If the visitor returned, add to a bunch of formats:
            returned_at_all.append([how_many_visits, True])
            times_returned.append([how_many_visits, did_he_return])
            dictionary["number_{}".format(x)].append(True)
        else: 
            ## If the visitor did not return, add to a bunch of formats:
            returned_at_all.append([how_many_visits, False])
            dictionary["number_{}".format(x)].append(False)

        index = index +1 

# Part 4, The actual program:
highest_amount_of_visits = int(df.to_numpy().max()) + 1  # one key per visit count: number_0 .. number_<max>
relevant_years = len(df.columns) -1
times_returned = []
returned_at_all = []

dictionary = dictionary(highest_amount_of_visits)
for column in range(relevant_years):  
#   print (dictionary)
    this_year = df.columns[column]
    next_year = df.columns[column+1]
    compare_yearly_visits(this_year, next_year)
    print ("cumulative dictionary up to:", this_year,"\n", dictionary)
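
If percentages are wanted rather than raw True/False lists, the collected flags can be summarised in a few lines. This is a sketch using a hand-written dictionary in the same shape as the one built above; the values are illustrative, and `percent_returned` is a name introduced here.

```python
# Illustrative dictionary in the same shape as the one built above:
# one list of True/False flags per visit count.
dictionary = {
    "number_0": [True],
    "number_3": [False, True, False],
    "number_4": [True],
    "number_5": [],
}

# Percentage of True flags per key; keys with no observations are skipped
# to avoid dividing by zero.
percent_returned = {
    key: 100 * sum(flags) / len(flags)
    for key, flags in dictionary.items()
    if flags
}

print(percent_returned)
```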

Please find my solution below. As a note, I'm fairly sure it can be improved.


import pandas as pd

# step 0: create data frame
df = pd.DataFrame({'2013':[4, 3, 10], '2014':[3, 0, 6], '2015':[6, 7, 3], '2016':[0, 3, 0]}, index=['A', 'B', 'C'])

# container list of frequency tables to be concatenated
frames = []

# iterate through the dataframe one column at a time and determine its value_counts (freq table);
# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for name, series in df.items():
    # rename so each frequency table keeps its year as the label
    frames.append(series.value_counts().rename(name))

# Merge the frequency tables into one dataframe: one row per year, one column per visit count
temp_df = pd.concat(frames, axis=1).transpose().fillna(0)

# Find the range of visit counts and add the ones that never occur as zero-filled columns
cols = temp_df.columns
lowest = cols.min()
highest = cols.max()
for i in range(lowest, highest):
    if i not in cols:
        temp_df[i] = 0

# Put the visit-count columns in ascending order
temp_df = temp_df.sort_index(axis=1)

# Calculate each visit count's share of users per year (a fraction; multiply by 100 for a percentage)
final_df = temp_df.div(temp_df.sum(axis=1), axis=0)
