In Dask, can a tensor be reshaped into a 2D matrix without computing its size in advance?

Posted 2024-06-16 07:05:56


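While trying to create a Python base class that can vectorize scalar functions over Dask collections, I ran into a problem reshaping a tensor into a 2D matrix. Solving it would help in building sklearn pipelines that operate interchangeably on Numpy, Pandas, and Dask data types.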

The following code works on Dask 0.18.2 but fails on Dask 0.19.4 and 0.20.0:

import dask
import dask.array
import dask.dataframe
import numpy
import pandas

def and1(x): return numpy.array([x, x+1], dtype=numpy.float32)

expected = numpy.array([[10, 11, 20, 21], 
                        [30, 31, 40, 41]], 
                       dtype=numpy.float32)

df = pandas.DataFrame.from_dict({
  'c1': [10, 30], 'c2': [20, 40]
})

ddf = dask.dataframe.from_pandas(df, npartitions=2)

# Dask generalized universal function that outputs 2 values per input value
guf = dask.array.gufunc(
    pyfunc=and1,
    signature='()->(n)',
    output_dtypes=numpy.float32,
    output_sizes={'n': 2},
    vectorize=True,
    allow_rechunk=False
)

da = guf(ddf)
da_reshaped = da.reshape((-1, numpy.prod(da.shape[1:])))
npa = da_reshaped.compute()

assert da.shape == (2, 2, 2)  # (input rows, input cols, outputs per cols)
numpy.testing.assert_array_equal(expected, npa)

In Dask 0.19.4 and 0.20.0, reshape raises a ValueError because the first element of da's shape is NaN (see the stack trace for details).

ValueError                                Traceback (most recent call last)
<ipython-input-847-ad2c41e1d88c> in <module>
     24 
     25 da = guf(ddf)
---> 26 da_r = da.reshape((-1, numpy.prod(da.shape[1:])))
     27 npa = da_r.compute()
     28 

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in reshape(self, *shape)
   1398         if len(shape) == 1 and not isinstance(shape[0], Number):
   1399             shape = shape[0]
-> 1400         return reshape(self, shape)
   1401 
   1402     def topk(self, k, axis=-1, split_every=None):

/opt/conda/lib/python3.6/site-packages/dask/array/reshape.py in reshape(x, shape)
    160         if len(shape) == 1 and x.ndim == 1:
    161             return x
--> 162         missing_size = sanitize_index(x.size / reduce(mul, known_sizes, 1))
    163         shape = tuple(missing_size if s == -1 else s for s in shape)
    164 

/opt/conda/lib/python3.6/site-packages/dask/array/slicing.py in sanitize_index(ind)
     58                      _sanitize_index_element(ind.step))
     59     elif isinstance(ind, Number):
---> 60         return _sanitize_index_element(ind)
     61     elif is_dask_collection(ind):
     62         return ind

/opt/conda/lib/python3.6/site-packages/dask/array/slicing.py in _sanitize_index_element(ind)
     20     """Sanitize a one-element index."""
     21     if isinstance(ind, Number):
---> 22         ind2 = int(ind)
     23         if ind2 != ind:
     24             raise IndexError("Bad index.  Must be integer-like: %s" % ind)

ValueError: cannot convert float NaN to integer
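The last two frames of the trace can be reproduced without Dask. The sketch below is only an illustration of the mechanism, not Dask's code: `unknown_size` and `known_trailing` are stand-in names (assumptions) for `x.size` and the product of the known sizes in `reshape.py`.

```python
import math

# Stand-ins (assumptions): Dask reports an unknown length as NaN,
# and the known trailing size here is 2 input cols * 2 outputs per col.
unknown_size = float("nan")
known_trailing = 2 * 2

# NaN propagates through the division used to infer the -1 dimension.
missing = unknown_size / known_trailing

try:
    int(missing)
    message = None
except ValueError as exc:
    # Same error text as the final line of the stack trace.
    message = str(exc)

print(message)
```

So the failure comes from inferring the `-1` dimension from a size that is not known until the partitions are computed.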

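Is there another way in Dask 0.20.0+ to reshape a Dask array without computing its size in advance? If so, is the reshape a constant-time operation, as it appears to be in Numpy?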

I want to create a matrix (shape=(R, C)) such that the first axis is left unchanged and all subsequent axes are merged in "C" order (the default in both Dask and Numpy).
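To make the target layout concrete, here is a minimal Numpy-only sketch. It assumes the array is split only along the first axis, and `flatten_trailing_axes` is a hypothetical helper name, not part of Dask or Numpy. It shows that merging the trailing axes block by block gives the same result as reshaping the whole array, so the row count of each block does not have to be known up front.

```python
import numpy as np

def flatten_trailing_axes(block):
    """Keep axis 0 and merge all later axes in C order (hypothetical helper)."""
    return block.reshape(block.shape[0], -1)

# Same values as `da` in the question: (input rows, input cols, outputs per col).
full = np.array([[[10, 11], [20, 21]],
                 [[30, 31], [40, 41]]], dtype=np.float32)

# Treat each row as a separate partition whose length is not known in advance.
blocks = [full[0:1], full[1:2]]

# Reshape every block independently, then stack the results along axis 0.
per_block = np.concatenate([flatten_trailing_axes(b) for b in blocks], axis=0)

# Reshape the whole array at once for comparison.
whole = flatten_trailing_axes(full)

print(np.array_equal(per_block, whole))
```

Whether a per-block approach like this can be expressed directly in Dask 0.20.0+ with unknown chunk sizes is exactly what I am unsure about.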

(By the way, I have already seen: Reshape a dask array (obtained from a dask dataframe column).)

