Adding rows to a DataFrame when quarterly data is missing

Posted 2024-09-30 20:36:16


I have a DataFrame of quarterly stock portfolio holdings:

import pandas as pd
from pandas import Timestamp

# Column -> {row index -> value}; named `data` rather than `dict` to avoid shadowing the built-in
data = {
    'Period Date': {0: Timestamp('2017-12-31 00:00:00'), 1: Timestamp('2018-03-31 00:00:00'), 2: Timestamp('2018-03-31 00:00:00'), 3: Timestamp('2018-06-30 00:00:00'), 4: Timestamp('2018-06-30 00:00:00'), 5: Timestamp('2018-09-30 00:00:00'), 6: Timestamp('2018-12-31 00:00:00'), 7: Timestamp('2019-06-30 00:00:00'), 8: Timestamp('2018-06-30 00:00:00'), 9: Timestamp('2018-06-30 00:00:00'), 10: Timestamp('2018-09-30 00:00:00'), 11: Timestamp('2019-12-31 00:00:00'), 12: Timestamp('2020-03-31 00:00:00')},
    'Investor': {0: 'Dave', 1: 'Dave', 2: 'Dave', 3: 'Dave', 4: 'Dave', 5: 'Dave', 6: 'Dave', 7: 'Dave', 8: 'Faye', 9: 'Faye', 10: 'Faye', 11: 'Mike', 12: 'Mike'},
    'Symbol': {0: 'CC', 1: 'AABA', 2: 'CONN', 3: 'AABA', 4: 'CONN', 5: 'AABA', 6: 'CC', 7: 'CC', 8: 'AABA', 9: 'CC', 10: 'CC', 11: 'AMZN', 12: 'AMZN'},
    'CUSIP': {0: 163851108, 1: 21346101, 2: 208242107, 3: 21346101, 4: 208242107, 5: 21346101, 6: 163851108, 7: 163851108, 8: 21346101, 9: 163851108, 10: 163851108, 11: 45636363, 12: 45636363},
    'Company Name': {0: 'Chemours', 1: 'Yahoo', 2: 'Conn', 3: 'Yahoo', 4: 'Conn', 5: 'Altaba', 6: 'Chemours', 7: 'Chemours', 8: 'Altaba', 9: 'Chemours', 10: 'Chemours', 11: 'Amazon', 12: 'Amazon'},
    'Shares': {0: 2, 1: 9, 2: 4, 3: 9, 4: 5, 5: 7, 6: 3, 7: 7, 8: 5, 9: 7, 10: 1, 11: 2, 12: 3},
    'Value': {0: 20, 1: 12, 2: 15, 3: 12, 4: 9, 5: 20, 6: 22, 7: 12, 8: 6, 9: 6, 10: 9, 11: 10, 12: 20},
    '% Change': {0: 'New', 1: 'New', 2: 'New', 3: '0.0', 4: '0.25', 5: '-0.2222', 6: 'New', 7: 'New', 8: 'New', 9: 'New', 10: '-0.8571', 11: 'New', 12: '.5'},
}
df = pd.DataFrame.from_dict(data)

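I want to add a row that indicates when a stock was sold.

For example, if a stock (identified by CUSIP and Symbol) appears in a given Investor and Period Date group but does not appear in the next Period Date for that Investor, I want to add a row to the original DataFrame showing that the stock was sold.

So the new rows to add to the original DataFrame would look like this: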

   Period Date Investor Symbol      CUSIP  Company Name  Shares Value  % Change
0   2018-12-31     Dave   AABA   21346101        Altaba     NaN   NaN      Sold
1   2018-03-31     Dave     CC  163851108      Chemours     NaN   NaN      Sold
2   2018-09-30     Dave   CONN  208242107          Conn     NaN   NaN      Sold
3   2018-09-30     Faye   AABA   21346101        Altaba     NaN   NaN      Sold
4   2018-12-31     Faye     CC  163851108      Chemours     NaN   NaN      Sold

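I tried the approach below, but it gives an incomplete result (CUSIP and Company Name are missing on the added rows), and I also feel there should be a simpler solution.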

# Create a dataframe of all the dates.
all_dates = pd.DataFrame({"Period Date": df["Period Date"].unique()})
all_dates["one"] = 1

# Create a df of all possible records (every Investor/Symbol pair combined with every date)
name_ids = df[["Investor", "Symbol"]].drop_duplicates()
name_ids["one"] = 1

all_possible_records = pd.merge(all_dates, name_ids, on="one")
all_possible_records = pd.merge(all_possible_records, df, how ="left")
all_possible_records.drop("one", axis = "columns", inplace = True)
all_possible_records.sort_values(["Investor", "Symbol", "Period Date"], inplace=True)

# For every record, shift by 1 to see whether it had any value in the previous quarter
all_possible_records["prev_q"] = all_possible_records.groupby(["Investor", "Symbol"]).shift(1)["Shares"]
# Records where "% Change" is NaN but the previous quarter had a value are 'Sold'
all_possible_records.loc[all_possible_records["% Change"].isna() & all_possible_records.prev_q.notna(), "% Change"]="Sold"
# Drop redundant records
res = all_possible_records.dropna(axis="rows", subset=["% Change"])
del res['prev_q']

res.reset_index(drop=True, inplace=True)

So how can I do this?


1 answer
User
#1 · Posted 2024-09-30 20:36:16

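After sorting the data, you can select the last row of each group: the rows where 'Investor', 'Symbol' or 'CUSIP' differs from the next row, or where the gap to the next row in the Period Date column is larger than one quarter (I used more than 95 days). Once you have these rows, add 0.5 to their index so that you can later concat them with the original frame and sort by index, which places each new row directly after the row it was copied from. Then change the values in the required columns.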
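To show the index trick on its own, here is a minimal sketch on a made-up three-row frame. The column names `key` and `val` and the variable names are hypothetical and are not part of the portfolio data; the sketch only assumes a default integer index.

```python
import pandas as pd

# Toy frame; "key" and "val" are made-up column names for illustration only.
toy = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Last row of each key group: its key differs from the key of the next row.
# The final row is compared against NaN, so it is selected as well.
last_rows = toy[toy["key"].ne(toy["key"].shift(-1))].copy()

# Move the copies half a step down so they sort right after their source rows.
last_rows.index = last_rows.index + 0.5
last_rows["val"] = "Sold"

out = pd.concat([toy, last_rows]).sort_index().reset_index(drop=True)
print(out)
```

The copies get index 1.5 and 2.5, so after `sort_index()` each one sits immediately below the row it came from, and `reset_index(drop=True)` restores a clean integer index.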

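import numpy as np  # needed for np.nan below; pandas is assumed to be imported as pd

# sort so that shift/diff compare each row with the next row of the same holding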
df = df.sort_values(["Investor", "Symbol", "CUSIP", "Period Date"]).reset_index(drop=True)

# select the last rows per group
df_toadd = df[(df[['Investor',"Symbol", "CUSIP"]]
                 .ne(df[['Investor',"Symbol", "CUSIP"]].shift(-1)).any(axis=1))
              | df["Period Date"].diff(-1).dt.days.abs().gt(95)].copy()

# add 0.5 to index and set other values
df_toadd.index += 0.5
df_toadd['Period Date'] = (df_toadd['Period Date']
                           +pd.DateOffset(days=1)+pd.tseries.offsets.QuarterEnd())
df_toadd[['Shares', 'Value']] = np.nan
df_toadd['% Change'] = 'Sold'

# concat original df with the new rows and sort_index
df_ = pd.concat([df, df_toadd]).sort_index().reset_index(drop=True)

This is what you get:

print (df_)
   Period Date Investor Symbol      CUSIP Company Name  Shares  Value % Change
0   2018-03-31     Dave   AABA   21346101        Yahoo     9.0   12.0      New
1   2018-06-30     Dave   AABA   21346101        Yahoo     9.0   12.0      0.0
2   2018-09-30     Dave   AABA   21346101       Altaba     7.0   20.0  -0.2222
3   2018-12-31     Dave   AABA   21346101       Altaba     NaN    NaN     Sold
4   2017-12-31     Dave     CC  163851108     Chemours     2.0   20.0      New
5   2018-03-31     Dave     CC  163851108     Chemours     NaN    NaN     Sold
6   2018-12-31     Dave     CC  163851108     Chemours     3.0   22.0      New
7   2019-03-31     Dave     CC  163851108     Chemours     NaN    NaN     Sold
8   2019-06-30     Dave     CC  163851108     Chemours     7.0   12.0      New
9   2019-09-30     Dave     CC  163851108     Chemours     NaN    NaN     Sold
10  2018-03-31     Dave   CONN  208242107         Conn     4.0   15.0      New
11  2018-06-30     Dave   CONN  208242107         Conn     5.0    9.0     0.25
12  2018-09-30     Dave   CONN  208242107         Conn     NaN    NaN     Sold
13  2018-06-30     Faye   AABA   21346101       Altaba     5.0    6.0      New
14  2018-09-30     Faye   AABA   21346101       Altaba     NaN    NaN     Sold
15  2018-06-30     Faye     CC  163851108     Chemours     7.0    6.0      New
16  2018-09-30     Faye     CC  163851108     Chemours     1.0    9.0  -0.8571
17  2018-12-31     Faye     CC  163851108     Chemours     NaN    NaN     Sold
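The two date-related choices in the answer (the 95-day threshold and the `DateOffset(days=1)` plus `QuarterEnd()` step) can be checked in isolation. The sketch below assumes, as in the data above, that every Period Date falls on a calendar quarter end; the variable names are made up for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import QuarterEnd

# Nine consecutive calendar quarter ends, starting at 2017-12-31.
quarter_ends = pd.Series(pd.date_range("2017-12-31", periods=9, freq=QuarterEnd()))

# Days between consecutive quarter ends: always below the 95-day threshold,
# while skipping a quarter roughly doubles the gap.
gaps = quarter_ends.diff().dt.days.dropna()
print(gaps.min(), gaps.max())

# Step one day into the following quarter, then roll forward to its end.
dates = pd.Series(pd.to_datetime(["2017-12-31", "2018-03-31", "2018-09-30"]))
next_q = dates + pd.DateOffset(days=1) + QuarterEnd()
print(next_q.dt.strftime("%Y-%m-%d").tolist())
# ['2018-03-31', '2018-06-30', '2018-12-31']
```

Consecutive quarter ends are 90 to 92 days apart in this range, so a gap above 95 days means at least one quarter was skipped. If the dates were not aligned to quarter ends, both the threshold and the offset arithmetic would need to be revisited.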
