How do I rearrange the rows of a DataFrame using Python?

Posted 2024-05-19 08:57:46


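I have a DataFrame in which the dates sit in the same column as other data. I want to pull those dates out of that column and store them in a new Date column. Here is my sample data: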

import numpy as np
import pandas as pd

data = [['Monday, 13 January 2020','',''],['Task 1',13588,'Jack'],['','','Address 1'],['','','City 1'],['Task 2',13589,'Ammie'],['','','Address 2'],['','','City'],['Task 3',13589,'Amanda'],['','','Address 3'],['','','City 3'],['Tuesday, 14 January 2020','',''],['Task 4',13587,'Chelsea'],['','','Address 4'],['','','City 4'],['Task 5','13586','Ibrahim'],['','','Address 5'],['','','City 5'],['Task 6',13585,'Kate'],['','','Address 6'],['','','City 6']]

df = pd.DataFrame(data)
df.columns = ['Task','ID','Supervisor']
df = df.replace(np.nan,'')
df

    Task    ID  Supervisor
0   Monday, 13 January 2020     
1   Task 1  13588   Jack
2           Address 1
3           City 1
4   Task 2  13589   Ammie
5           Address 2
6           City
7   Task 3  13589   Amanda
8           Address 3
9           City 3
10  Tuesday, 14 January 2020        
11  Task 4  13587   Chelsea
12          Address 4
13          City 4
14  Task 5  13586   Ibrahim
15          Address 5
16          City 5
17  Task 6  13585   Kate
18          Address 6
19          City 6

I want to get the following output:

    Date                    Task    ID      Supervisor
0 Monday, 13 January 2020   Task 1  13588   Jack Address 1 City 1
1 Monday, 13 January 2020   Task 2  13589   Ammie Address 2 City
2 Monday, 13 January 2020   Task 3  13589   Amanda Address 3 City 3
3 Tuesday, 14 January 2020  Task 4  13587   Chelsea Address 4 City 4
4 Tuesday, 14 January 2020  Task 5  13586   Ibrahim Address 5 City 5
5 Tuesday, 14 January 2020  Task 6  13585   Kate Address 6 City 6

Here is my attempt:

def rowMerger(a,b):
    try:
        rule1 = lambda x: x not in ['']
        u = a.loc[a.iloc[:,0].apply(rule1) & a.iloc[:,1].apply(rule1) & a.iloc[:,2].apply(rule1)].index
        print(u)
        findMergerindexs = list(u)
        findMergerindexs.sort()
        a = pd.DataFrame(a)
        tabcolumns = pd.DataFrame(a.columns)
        totalcolumns = len(tabcolumns)
        b = pd.DataFrame(columns = list(tabcolumns))
        if (len(findMergerindexs) > 0):
            for m in range(len(findMergerindexs)):
                if not (m == (len(findMergerindexs)-1)): 
                    startLoop = findMergerindexs[m]
                    endLoop = findMergerindexs[m+1]
                else:
                    startLoop = findMergerindexs[m]
                    endLoop = len(a)
                listValues = []
                for i in range(totalcolumns):
                    value = ' '
                    for n in range(startLoop,endLoop):
                        value = value + ' ' + str(a.iloc[n,i])
                    listValues.insert(i,(value.strip()))
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                b = pd.concat([b, pd.DataFrame([listValues])], ignore_index = True)
        else:
            print("File is not having a row for merging instances - Please check the file manually for instance - ")
        return b
    except: 
        print("Error - While merging the rows")
    return b

This code gives the output below:

rowMerger(df,0)
       0       1             2
0   Task 1  13588   Jack Address 1 City 1
1   Task 2  13589   Ammie Address 2 City
2   Task 3 Tuesday, 14 January 2020 13589   Amanda Address 3 City 3
3   Task 4  13587   Chelsea Address 4 City 4
4   Task 5  13586   Ibrahim Address 5 City 5
5   Task 6  13585   Kate Address 6 City 6

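The problem is that this code only merges the rows. I'm not sure how to copy each date down across its rows, as shown in the desired output, and put it in a separate column. Can anyone help me achieve this?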


3 answers

You can try the following:

# Raw string avoids the invalid-escape warning; np.nan replaces np.NaN (removed in NumPy 2.0)
task_mask = df.Task.str.match(r"Task\s+\d")
df.assign(Task = df.Task[task_mask],
          Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
        .replace("", np.nan) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()

Output

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6

Explanation

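  1. Filter the Task column to tell the dates apart from the task ids.

    • One solution is to match the task ids with a regular expression; str.match does the job. The regex used is very simple: r"Task\s+\d" means "Task" + whitespace + a digit.
task_mask = df.Task.str.match(r"Task\s+\d")

  2. From this mask we can extract both the Date and the Tasks. The tasks are straightforward: df.Task[task_mask].

  3. Extracting the Date is slightly harder.

    • We use np.where to keep the Task value on the non-task rows and set NaN elsewhere.
    • Then we convert this array into a pd.Series.
    • Finally, we use shift to move all values down by 1, so each date lines up with the first task row below it. Shifting the rows also lets us drop the all-NaN rows easily in step 5.
pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()

  4. Replace empty strings with NaN using replace.

  5. Drop the rows that contain only NaN using dropna with how="all".

  6. Fill the remaining missing values forward using ffill.

  7. Group by "Task", "ID" and "Date" and aggregate the rows using agg. The aggregation function is based on str.join: lambda x: " ".join(x)

  8. Reset the index produced by the groupby using reset_index.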

Hope this is clear.
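To make step 1 more concrete, here is a minimal sketch of how str.match treats a few values. The sample strings "Task 12" and "My Task 3" are made up for illustration and do not appear in the question's data. Like re.match, str.match only matches at the start of each string, which is why date rows and empty cells come out as False.

```python
import pandas as pd

# Hypothetical sample values; only the first three resemble the question's data
tasks = pd.Series(["Monday, 13 January 2020", "Task 1", "", "Task 12", "My Task 3"])

# str.match anchors the pattern at the beginning of each string
mask = tasks.str.match(r"Task\s+\d")
print(mask.tolist())
# [False, True, False, True, False]
```

If task labels could also appear in the middle of a cell, str.contains would be the call to look at instead, but that is not needed for the data shown here.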


Code + illustration

import numpy as np
import pandas as pd

# Create dataframe
data = [['Monday, 13 January 2020', '', ''], ['Task 1', 13588, 'Jack'], ['', '', 'Address 1'], ['', '', 'City 1'], ['Task 2', 13589, 'Ammie'], ['', '', 'Address 2'], ['', '', 'City'], ['Task 3', 13589, 'Amanda'], ['', '', 'Address 3'], ['', '', 'City 3'], [
    'Tuesday, 14 January 2020', '', ''], ['Task 4', 13587, 'Chelsea'], ['', '', 'Address 4'], ['', '', 'City 4'], ['Task 5', '13586', 'Ibrahim'], ['', '', 'Address 5'], ['', '', 'City 5'], ['Task 6', 13585, 'Kate'], ['', '', 'Address 6'], ['', '', 'City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
print(df)

# Step 1
task_mask = df.Task.str.match(r"Task\s+\d")
print(task_mask)
# 0     False
# 1      True
# 2     False
# 3     False
# 4      True
# 5     False
# 6     False
# 7      True
# 8     False
# 9     False
# 10    False
# 11     True
# 12    False
# 13    False
# 14     True
# 15    False
# 16    False
# 17     True
# 18    False
# 19    False
# Name: Task, dtype: bool

# Step 2
print(df.Task[task_mask])
# 1     Task 1
# 4     Task 2
# 7     Task 3
# 11    Task 4
# 14    Task 5
# 17    Task 6
# Name: Task, dtype: object

# Step 3
print(pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
# 0                          NaN
# 1      Monday, 13 January 2020
# 2                          NaN
# 3
# 4
# 5                          NaN
# 6
# 7
# 8                          NaN
# 9
# 10
# 11    Tuesday, 14 January 2020
# 12                         NaN
# 13
# 14
# 15                         NaN
# 16
# 17
# 18                         NaN
# 19
# dtype: object

# Step 4
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
        .replace("", np.nan))
#       Task     ID Supervisor                      Date
# 0      NaN    NaN        NaN                       NaN
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 10     NaN    NaN        NaN                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 5:
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
        .replace("", np.nan)
        .dropna(how='all'))
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 6:
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
        .replace("", np.nan)
        .dropna(how='all')
        .ffill())
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2   Task 1  13588  Address 1   Monday, 13 January 2020
# 3   Task 1  13588     City 1   Monday, 13 January 2020
# 4   Task 2  13589      Ammie   Monday, 13 January 2020
# 5   Task 2  13589  Address 2   Monday, 13 January 2020
# 6   Task 2  13589       City   Monday, 13 January 2020
# 7   Task 3  13589     Amanda   Monday, 13 January 2020
# 8   Task 3  13589  Address 3   Monday, 13 January 2020
# 9   Task 3  13589     City 3   Monday, 13 January 2020
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12  Task 4  13587  Address 4  Tuesday, 14 January 2020
# 13  Task 4  13587     City 4  Tuesday, 14 January 2020
# 14  Task 5  13586    Ibrahim  Tuesday, 14 January 2020
# 15  Task 5  13586  Address 5  Tuesday, 14 January 2020
# 16  Task 5  13586     City 5  Tuesday, 14 January 2020
# 17  Task 6  13585       Kate  Tuesday, 14 January 2020
# 18  Task 6  13585  Address 6  Tuesday, 14 January 2020
# 19  Task 6  13585     City 6  Tuesday, 14 January 2020

# Step 7
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
        .replace("", np.nan)
        .dropna(how='all')
        .ffill()
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}))
#                                                      Supervisor
# Task   ID    Date
# Task 1 13588 Monday, 13 January 2020      Jack Address 1 City 1
# Task 2 13589 Monday, 13 January 2020       Ammie Address 2 City
# Task 3 13589 Monday, 13 January 2020    Amanda Address 3 City 3
# Task 4 13587 Tuesday, 14 January 2020  Chelsea Address 4 City 4
# Task 5 13586 Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# Task 6 13585 Tuesday, 14 January 2020     Kate Address 6 City 6

# Step 8
df = df.assign(Task = df.Task[task_mask],
               Date = pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
        .replace("", np.nan) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()
print(df)

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6

@Alexandre's answer is great. Here is an alternative that avoids the regex extraction and the shift:

#convert empty cells to null
(df.replace("",np.nan)
 #create a new column containing only the dates
 #we'll use the null cells in Supervisor column to pick out the dates
 .assign(Date = lambda x: x.loc[x.Supervisor.isna(),'Task'])
 .ffill()
 .dropna(subset=['ID'])
 #drop Dates found in the Task column
 .query('Task != Date')
 .groupby(['Date','Task','ID'],as_index=False)
 .Supervisor.agg(' '.join)
)

        Date                    Task      ID      Supervisor
0   Monday, 13 January 2020     Task 1  13588   Jack Address 1 City 1
1   Monday, 13 January 2020     Task 2  13589   Ammie Address 2 City
2   Monday, 13 January 2020     Task 3  13589   Amanda Address 3 City 3
3   Tuesday, 14 January 2020    Task 4  13587   Chelsea Address 4 City 4
4   Tuesday, 14 January 2020    Task 5  13586   Ibrahim Address 5 City 5
5   Tuesday, 14 January 2020    Task 6  13585   Kate Address 6 City 6
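As a rough sketch of how the chain above might be reused, it can be wrapped in a small function and tried on a cut-down sample. The function name collapse_tasks and the six-row sample are assumptions made for illustration; they are not part of the original answer. The sketch relies on the same layout as the question: date rows have an empty Supervisor cell, and continuation rows have empty Task and ID cells.

```python
import numpy as np
import pandas as pd


def collapse_tasks(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper wrapping the chain shown above."""
    return (
        raw.replace("", np.nan)
        # rows with no Supervisor are the date rows; copy their Task text to Date
        .assign(Date=lambda x: x.loc[x.Supervisor.isna(), "Task"])
        # carry Date, Task and ID down onto the rows below
        .ffill()
        # the very first date row has no ID above it to inherit
        .dropna(subset=["ID"])
        # later date rows inherited an ID, so drop them by comparing Task and Date
        .query("Task != Date")
        .groupby(["Date", "Task", "ID"], as_index=False)
        .Supervisor.agg(" ".join)
    )


# Cut-down, made-up sample in the same layout as the question's data
sample = pd.DataFrame(
    [
        ["Monday, 13 January 2020", "", ""],
        ["Task 1", 13588, "Jack"],
        ["", "", "Address 1"],
        ["Tuesday, 14 January 2020", "", ""],
        ["Task 2", 13589, "Ammie"],
        ["", "", "City 2"],
    ],
    columns=["Task", "ID", "Supervisor"],
)

result = collapse_tasks(sample)
print(result)
```

Note that groupby sorts by the key values, so the rows come back ordered by the Date text rather than by calendar date. Converting Date with pd.to_datetime before grouping would be one way to change that if chronological order matters.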

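So essentially we use a lambda to separate the date from the task number, and forward-fill the last valid date with pd.Series.fillna(method='ffill') (equivalently, pd.Series.ffill()).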

So we need to add the following lines:

# Split Date column
# ffill() is equivalent to fillna(method='ffill'), which is deprecated in recent pandas
df['Date'] = df.apply(lambda x: " ".join(x[0].split(' ')[2:]) if len(x[0].split(' ')) > 2 else np.nan,axis=1).ffill()

# Clean Task column
df['Task'] = df.apply(lambda x: " ".join(x[0].split(' ')[:2]) if len(x[0].split(' ')) > 1 else x[0],axis=1)

# Rename and reorder remaining columns
df['ID'] = df[1]
df['Supervisor'] = df[2]
df = df[['Date','Task','ID','Supervisor']]
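The two lambdas above both rely on splitting a cell on spaces. A minimal sketch of that logic as a plain function, assuming the cell is either a bare task label or a task label followed by a date (as in the merged output from the question), could look like this. The name split_task_and_date is hypothetical.

```python
from typing import Optional, Tuple


def split_task_and_date(cell: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper mirroring the two lambdas above."""
    parts = cell.split(" ")
    # the first two tokens form the task label, e.g. "Task 3"
    task = " ".join(parts[:2]) if len(parts) > 1 else cell
    # anything after the first two tokens is treated as the date
    date = " ".join(parts[2:]) if len(parts) > 2 else None
    return task, date


print(split_task_and_date("Task 3 Tuesday, 14 January 2020"))
# ('Task 3', 'Tuesday, 14 January 2020')
print(split_task_and_date("Task 1"))
# ('Task 1', None)
```

Because a forward fill only propagates values downward, any rows that come before the first cell containing a date would still have an empty Date after this approach.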
