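<p>I ended up writing a custom data handler. In my case there are more variable columns, such as cvxh_len, and the posted solutions do not look back 2 or 3 days, so they miss cases where a higher value sits between lower values. Replacing the wrong values with NaN was also better for me than removing the rows. My solution is definitely slower, but it does work.</p>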
<pre><code>import numpy as np
import pandas as pd

# df is the existing DataFrame with "filename", "date" and the variable columns
CheckList = ["cvxh_len"]  # Can add as many variables as needed
# A value should never be lower than an earlier value for the same filename;
# if it is, we replace it with NaN instead of removing the row
for i in df["filename"].unique():  # For every unique filename
    # Rows for this filename, oldest first (the sorted result must be assigned)
    df2 = df[df["filename"] == i].sort_values(by="date")
    for pos, idx in enumerate(df2.index):  # idx is the row label in df
        for M in range(1, pos + 1):  # Look back at every earlier date
            for h in CheckList:  # Variables to check
                current = df2.at[idx, h]
                earlier = df2.at[df2.index[pos - M], h]
                if pd.isna(current) or pd.isna(earlier):
                    continue  # NaN values cannot be compared
                if current < earlier:  # Value was higher at an earlier time point
                    df.loc[idx, h] = np.nan  # Replace it with NaN in the original df
print(df)
     filename  cvxh_len        date
0   118_3.JPG     100.0  2018-12-14
1   118_3.JPG     200.0  2018-12-15
2   118_3.JPG    3000.0  2018-12-16
3   118_3.JPG       NaN  2018-12-17
4   118_3.JPG       NaN  2018-12-18
5    15_7.JPG     200.0  2018-12-14
6    15_7.JPG     400.0  2018-12-15
7    15_7.JPG       NaN  2018-12-16
8    15_7.JPG       NaN  2018-12-17
9    15_7.JPG       NaN  2018-12-18
10  203_4.JPG    5000.0  2018-12-14
11  203_4.JPG    6000.0  2018-12-15
12  203_4.JPG    9000.0  2018-12-16
13  203_4.JPG   11000.0  2018-12-17
14  203_4.JPG   15000.0  2018-12-18
</code></pre>
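<p>If the loops become too slow, the same rule (blank out any value that is lower than an earlier value for the same filename) can probably be expressed with a per-group running maximum. The following is only a sketch, not a tested drop-in replacement: the helper name <code>mask_decreasing</code> and the sample data are made up for illustration, and it assumes that <code>df</code> has unique index labels and that <code>date</code> sorts chronologically.</p>

```python
import numpy as np
import pandas as pd


def mask_decreasing(df, columns, group_col="filename", date_col="date"):
    """Return a copy of df in which any value lower than an earlier value
    of the same group is replaced with NaN (hypothetical helper)."""
    out = df.copy()
    # Order rows by group and date so "earlier" means "higher up in the group"
    ordered = df.sort_values([group_col, date_col])
    for col in columns:
        # Running maximum within each group; NaN entries are skipped
        running_max = ordered.groupby(group_col)[col].cummax()
        # True where the value dropped below something seen earlier
        drops = ordered[col] < running_max
        # Write NaN back using the original row labels
        out.loc[drops.index[drops], col] = np.nan
    return out


# Made-up sample data, only to show the call
sample = pd.DataFrame({
    "filename": ["a.JPG"] * 4 + ["b.JPG"] * 3,
    "date": pd.to_datetime([
        "2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17",
        "2018-12-14", "2018-12-15", "2018-12-16",
    ]),
    "cvxh_len": [100.0, 200.0, 150.0, 180.0, 5.0, 3.0, 6.0],
})

cleaned = mask_decreasing(sample, ["cvxh_len"])
print(cleaned)
```

<p>Here 150 and 180 for <code>a.JPG</code> are both blanked because 200 appeared earlier, even though 180 is higher than the value on the previous day. That is the multi-day look-back described above. Comparing against the running maximum of the original values gives the same result as comparing against every earlier row individually.</p>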