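I have dataframes with datetime indexes of various kinds (the data may be weekly, monthly, or yearly). I want to generate columns that hold lagged values of other columns. I import this data from a spreadsheet rather than generating the datetime index in Python.
I am struggling to find the "Pythonic" way to do this. My thinking is that if I use pandas' datetime functionality, the lagging should be more robust when the data is odd or irregular.
I made a toy example that seems to work, but the same approach fails on my real data.
The toy example that works correctly (it creates a new column containing the previous month's value of "foo"):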
import numpy as np
import pandas as pd
rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
toy2 = pd.DataFrame(pd.Series(np.random.randint(0, 50, len(rng)), index=rng, name="foo"))
foo
2012-01-31 4
2012-02-29 2
2012-03-31 27
2012-04-30 7
2012-05-31 44
2012-06-30 22
2012-07-31 16
2012-08-31 18
2012-09-30 35
2012-10-31 35
2012-11-30 16
2012-12-31 32
toy2['lag_foo']= toy2['foo'].shift(1,'m')
foo lag_foo
2012-01-31 4 NaN
2012-02-29 2 4.0
2012-03-31 27 2.0
2012-04-30 7 27.0
2012-05-31 44 7.0
2012-06-30 22 44.0
2012-07-31 16 22.0
2012-08-31 18 16.0
2012-09-30 35 18.0
2012-10-31 35 35.0
2012-11-30 16 35.0
2012-12-31 32 16.0
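To make the difference between the two kinds of shift concrete, here is a minimal sketch. The three-row series is made up for illustration, and `pd.offsets.MonthEnd()` is used in place of the `'m'` string alias on the assumption that it means the same month-end frequency.

```python
import pandas as pd

# Hypothetical three-row series with month-end dates (values are made up).
idx = pd.to_datetime(['2012-01-31', '2012-02-29', '2012-03-31'])
s = pd.Series([4, 2, 27], index=idx, name='foo')

# Positional shift: values move down one row, the index is unchanged.
by_position = s.shift(1)

# Date-based shift: values stay put, every label moves to the next month end.
by_date = s.shift(1, freq=pd.offsets.MonthEnd())

print(by_position)
print(by_date)
```

With a `freq`, the result carries a different index, so assigning it back as a column relies on pandas aligning the shifted labels with the frame's own index.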
But when I run the same thing on my real data, it fails:
toy['prev_1m']= toy['IPE m2'].shift(1,'m')
ValueError: cannot reindex from a duplicate axis
Exception trace:
ValueError Traceback (most recent call last)
<ipython-input-170-9cb57a2ed681> in <module>()
----> 1 toy['prev_1m']= toy['IPE m2'].shift(1,'m')
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
2355 else:
2356 # set column
-> 2357 self._set_item(key, value)
2358
2359 def _setitem_slice(self, key, value):
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
2421
2422 self._ensure_valid_index(value)
-> 2423 value = self._sanitize_column(key, value)
2424 NDFrame._set_item(self, key, value)
2425
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
2555
2556 if isinstance(value, Series):
-> 2557 value = reindexer(value)
2558
2559 elif isinstance(value, DataFrame):
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in reindexer(value)
2547 # duplicate axis
2548 if not value.index.is_unique:
-> 2549 raise e
2550
2551 # other
ValueError: cannot reindex from a duplicate axis
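It seems I am missing some subtlety of pandas datetime indexes. I am also not sure this is even the ideal approach. The only thing I can suspect is that toy.index has None as its freq, whereas the working toy2 example has its freq set to 'M'.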
toy.index
DatetimeIndex(['2016-04-30', '2016-03-31', '2016-02-29', '2016-01-31',
'2015-12-31', '2015-11-30', '2015-10-31', '2015-09-30',
'2015-08-31', '2015-07-31',
...
'NaT', 'NaT', 'NaT', 'NaT',
'NaT', 'NaT', 'NaT', 'NaT',
'NaT', 'NaT'],
dtype='datetime64[ns]', name=u'Date', length=142, freq=None)
toy2.index
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
'2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31',
'2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31'],
dtype='datetime64[ns]', freq='M')
============================
I dropped the NaT rows:
toy = toy.dropna()
toy['prev_1m']= toy['IPE m2'].shift(1,'m')
I do get the result I wanted. However, I also get a warning:
C:\Users\mds\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
Assigning this way suppresses the warning:
toy.loc[:,'prev_1m2']= toy['IPE m2'].shift(1,'m')
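Another commonly suggested way to avoid this kind of warning, not taken from the original post, is to take an explicit copy right after filtering. A minimal sketch, assuming a hypothetical frame `raw` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported data (values are made up).
idx = pd.to_datetime(['2016-04-30', '2016-03-31', None])
raw = pd.DataFrame({'IPE m2': [5.0, 4.0, np.nan]}, index=idx)

# dropna() removes the row whose value is NaN; copy() makes it explicit
# that 'toy' is an independent frame, not a view of 'raw'.
toy = raw.dropna().copy()
toy['prev_1m'] = toy['IPE m2'].shift(1, freq=pd.offsets.MonthEnd())
print(toy)
```

Whether the warning appears at all depends on the pandas version, so treat this as a defensive habit rather than a required fix.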
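There is another problem: the index of the DataFrame toy contains many NaT values, so the index has duplicate values. (Some of the datetimes may be duplicated as well.)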
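The failure can be reproduced on a small frame. This is only a sketch: the values are made up, `pd.offsets.MonthEnd()` stands in for the `'m'` alias, and the exact wording of the error message differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: descending month ends plus two NaT labels.
idx = pd.to_datetime(['2016-04-30', '2016-03-31', '2016-02-29', None, None])
toy = pd.DataFrame({'IPE m2': [5.0, 4.0, 3.0, np.nan, np.nan]}, index=idx)

print(toy.index.is_unique)     # NaT appears twice, so the index is not unique
print(toy.index.isna().sum())  # number of NaT labels

# Shifting by date keeps both NaT labels, so the shifted index still has duplicates.
shifted = toy['IPE m2'].shift(1, freq=pd.offsets.MonthEnd())
try:
    toy['prev_1m'] = shifted   # assignment has to align 'shifted' with toy.index
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)
```

The assignment is where it breaks: pandas cannot reindex a series whose own index contains repeated labels.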
One possible solution is to omit the freq parameter (freq=m), so that shift moves the values by position instead of by date.
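A minimal sketch of the positional variant follows. It assumes the rows are first sorted into ascending date order, since the real index shown in the question is descending, and it uses made-up values.

```python
import pandas as pd

# Hypothetical frame in descending date order, like the real index.
idx = pd.to_datetime(['2016-04-30', '2016-03-31', '2016-02-29'])
toy = pd.DataFrame({'IPE m2': [5.0, 4.0, 3.0]}, index=idx)

# Sort ascending first so that "one row earlier" means "one month earlier".
toy = toy.sort_index()
toy['prev_1m'] = toy['IPE m2'].shift(1)  # no freq: values move by position
print(toy)
```

A positional shift does not look at the dates at all, so it only yields the previous month's value when the rows are sorted and no month is missing.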
If you need to remove all records with NaN (NaT) in the index, filter those rows out of the DataFrame first.
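One way to filter on the index is a boolean mask. The sketch below uses `Index.notna()` as an assumption about which call fits here, with made-up values and `pd.offsets.MonthEnd()` standing in for the `'m'` alias.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, including two NaT index labels.
idx = pd.to_datetime(['2016-04-30', '2016-03-31', '2016-02-29', None, None])
toy = pd.DataFrame({'IPE m2': [5.0, 4.0, 3.0, np.nan, np.nan]}, index=idx)

# Keep only the rows whose index label is a real timestamp.
clean = toy[toy.index.notna()].copy()

# With a unique index, the date-based shift can be aligned and assigned.
clean['prev_1m'] = clean['IPE m2'].shift(1, freq=pd.offsets.MonthEnd())
print(clean)
```

Unlike `toy.dropna()`, which looks at the column values, this mask depends only on the index, so rows with valid dates but missing values are kept.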