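I'm seeing behavior I don't understand when I apply a function that returns a tuple to a pandas DataFrame. My goal is for df.apply()
to return a new Series, but this only works when I subset the DataFrame's columns to exclude a datetime column.
This dummy example demonstrates the behavior I'm seeing: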
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
def random(row):
    # Return a tuple with more elements than df has columns
    return (1, 2, 3, 4, 5, 6, 7, 8)

df.apply(random, axis=1)
# Output, returns new series as expected:
0 (1, 2, 3, 4, 5, 6, 7, 8)
1 (1, 2, 3, 4, 5, 6, 7, 8)
2 (1, 2, 3, 4, 5, 6, 7, 8)
3 (1, 2, 3, 4, 5, 6, 7, 8)
4 (1, 2, 3, 4, 5, 6, 7, 8)
That is what I expected. But when I add a datetime column to the DataFrame and run the same df.apply(random, axis=1) call,
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
4262 blocks = form_blocks(arrays, names, axes)
-> 4263 mgr = BlockManager(blocks, axes)
4264 mgr._consolidate_inplace()
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
2760 if do_integrity_check:
-> 2761 self._verify_integrity()
2762
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in _verify_integrity(self)
2970 if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
-> 2971 construction_error(tot_items, block.shape[1:], self.axes)
2972 if len(self.items) != tot_items:
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
4232 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4233 passed, implied))
4234
ValueError: Shape of passed values is (5, 8), indices imply (5, 5)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-29-b57dd4b93995> in <module>()
----> 1 df.apply(random,axis=1)
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4150 if reduce is None:
4151 reduce = True
-> 4152 return self._apply_standard(f, axis, reduce=reduce)
4153 else:
4154 return self._apply_broadcast(f, axis)
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4263 index = None
4264
-> 4265 result = self._constructor(data=results, index=index)
4266 result.columns = res_index
4267
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
264 dtype=dtype, copy=copy)
265 elif isinstance(data, dict):
--> 266 mgr = self._init_dict(data, index, columns, dtype=dtype)
267 elif isinstance(data, ma.MaskedArray):
268 import numpy.ma.mrecords as mrecords
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
400 arrays = [data[k] for k in keys]
401
--> 402 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
403
404 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5406 axes = [_ensure_index(columns), _ensure_index(index)]
5407
-> 5408 return create_block_manager_from_arrays(arrays, arr_names, axes)
5409
5410
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
4265 return mgr
4266 except ValueError as e:
-> 4267 construction_error(len(arrays), arrays[0].shape, axes, e)
4268
4269
/Users/jguillette/anaconda/lib/python3.5/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
4231 raise ValueError("Empty data passed with indices specified.")
4232 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 4233 passed, implied))
4234
4235
ValueError: Shape of passed values is (5, 8), indices imply (5, 5)
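The only time I don't get the error is when the function returns a tuple with exactly as many elements as the DataFrame has columns, and in that case apply returns a DataFrame rather than a Series.
Is there a way to change this behavior? In my case I don't need the datetime information inside the function, but I still don't understand how excluding it changes what apply does.
Any insight would be much appreciated.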
https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/frame.py#L236-L6142
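I tracked down the source code behind your problem (linked above); for the detailed reason you can look at GH6125, as the comment in that code says. My own workaround is admittedly a bit crude.
A second solution is to make sure func returns a Series (this appears to be slower).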
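The idea of having the function return a Series can be sketched roughly as follows. This is a minimal illustration and not the answerer's code: the dates used for column E, the helper name random_as_series, and the final column selection are all assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
# Assumed stand-in for the datetime column that triggers the problem.
df["E"] = pd.date_range("2018-01-01", periods=5)


def random_as_series(row):
    # Hypothetical helper: wrap the tuple in a one-element Series so that
    # apply receives a Series for every row instead of a bare tuple.
    return pd.Series([(1, 2, 3, 4, 5, 6, 7, 8)])


# apply now builds a one-column DataFrame; take that column as the result.
wrapped = df.apply(random_as_series, axis=1)
result = wrapped.iloc[:, 0]
```

Because the function returns a Series, apply produces a one-column DataFrame whatever dtypes df holds, and selecting that column recovers the Series of tuples. Building a Series for every row is the likely reason this route is slower.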
Hope this helps.
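pandas handles the return value of apply differently depending on the dtypes of df. In the first example every column is float, whereas after adding column E the dtypes are mixed, which leads pandas to try to rebuild a DataFrame from the returned values. I don't know the rationale behind this behavior, but it can be worked around.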
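As one possible way to sidestep the dtype-dependent path, the sketch below rebuilds the example with a datetime column E and gets a Series of tuples in two ways. The date range used for E, the variable names, and both workarounds are assumptions for illustration, not code from this answer. The result_type argument only exists in pandas 0.23 and later, so it would not be available on the version shown in the traceback.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
# Assumed stand-in for the datetime column E; the dtypes are now mixed.
df["E"] = pd.date_range("2018-01-01", periods=5)


def random(row):
    # Return a tuple with more elements than df has columns.
    return (1, 2, 3, 4, 5, 6, 7, 8)


# Option 1: build the Series by hand, so apply is never involved.
by_hand = pd.Series([random(row) for _, row in df.iterrows()], index=df.index)

# Option 2 (pandas 0.23+): ask apply not to expand list-like results.
reduced = df.apply(random, axis=1, result_type="reduce")
```

The first option works regardless of the pandas version because it bypasses apply entirely; the second keeps the apply call but states explicitly that the tuples should not be expanded into columns.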