NumPy record array or structured array or recarray

2 Answers

Answer 1 · edited 2024-10-05 15:27:14

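In a nutshell, you should generally use structured arrays rather than recarrays. Structured arrays are faster, and the only advantage of recarrays is that they let you write arr.x instead of arr['x']. That can be a convenient shortcut, but it is also error-prone if your column names conflict with numpy methods/attributes.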
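To illustrate the shortcut and the name-collision pitfall described above, here is a minimal sketch. The field names are made up for the example; "mean" is chosen only because ndarray already has a method with that name.

```python
import numpy as np

# Hypothetical field names; "mean" deliberately collides with ndarray.mean().
dt = np.dtype([("x", "f8"), ("mean", "f8")])
struct_arr = np.array([(1.0, 10.0), (2.0, 20.0)], dtype=dt)
rec_arr = struct_arr.view(np.recarray)

# Attribute access and key access return the same column.
same_column = np.array_equal(struct_arr["x"], rec_arr.x)

# The collision: rec_arr.mean is the ndarray method, not the "mean" field.
attr_is_method = callable(rec_arr.mean)

# Key access still reaches the field.
field_by_key = rec_arr["mean"]

print(same_column, attr_is_method, field_by_key)
```

Key-style access works on both array types, so code written with arr['x'] does not depend on which field names happen to be free.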

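See the excerpt from @jakevdp's book for a more detailed explanation. In particular, he notes that simply accessing a column of a structured array can be around 20x to 30x faster than accessing a column of a recarray. However, his example uses a very small dataframe with just 4 rows and doesn't perform any standard operations.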

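For simple operations on larger dataframes, the difference is likely to be much smaller, although structured arrays are still faster. For example, here are a structured array and a record array, each with 10,000 rows (the code that creates the arrays from a dataframe is borrowed from @jpp's answer).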

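import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({'x': np.random.randn(n)})
df['y'] = df.x.astype(int)

# Record array (np.recarray) built directly by pandas
rec_array = df.to_records(index=False)

# Plain structured array with the same field names and dtypes
s = df.dtypes
struct_array = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))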

If we do a standard operation such as multiplying a column by 2, it is about 50% faster for the structured array:

%timeit struct_array['x'] * 2
9.18 µs ± 88.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rec_array.x * 2
14.2 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
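The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The arrays here are built directly with NumPy rather than through pandas, which is an assumption made to keep the script self-contained. Absolute numbers will vary by machine and NumPy version, so none are quoted.

```python
import timeit

import numpy as np

n = 10_000
rng = np.random.default_rng(0)

# Structured array with the same two fields as in the example above.
struct_array = np.zeros(n, dtype=[("x", "f8"), ("y", "i8")])
struct_array["x"] = rng.standard_normal(n)
struct_array["y"] = struct_array["x"].astype("i8")

# The same memory, viewed as a record array.
rec_array = struct_array.view(np.recarray)

loops = 2_000
t_struct = timeit.timeit(lambda: struct_array["x"] * 2, number=loops)
t_rec = timeit.timeit(lambda: rec_array.x * 2, number=loops)

print(f"structured: {t_struct / loops * 1e6:.2f} us per loop")
print(f"recarray:   {t_rec / loops * 1e6:.2f} us per loop")
```

Both expressions compute the same result; only the cost of looking up the column differs.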

Answer 2 · edited 2024-10-05 15:27:14

Records/recarrays are implemented in

https://github.com/numpy/numpy/blob/master/numpy/core/records.py

Some relevant quotes from this file:

Record Arrays Record arrays expose the fields of structured arrays as properties. The recarray is almost identical to a standard array (which supports named fields already) The biggest difference is that it can use attribute-lookup to find the fields and it is constructed using a record.

recarray is a subclass of ndarray (in the same way that matrix and masked arrays are). But note that its constructor is different from np.array. It is more like np.empty(size, dtype).
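A short sketch of what that constructor behaviour looks like in practice. The field names and values are placeholders chosen for the example; np.rec.array is shown alongside as the usual way to build a record array from existing data.

```python
import numpy as np

fields = [("x", "f8"), ("y", "i8")]

# Like np.empty: allocates 3 records but leaves their contents uninitialized.
rec = np.recarray((3,), dtype=fields)
rec["x"] = [1.0, 2.0, 3.0]   # fill the fields explicitly before reading them
rec["y"] = 0

# Building from data goes through np.rec.array instead of the constructor.
rec_from_data = np.rec.array([(1.0, 0), (2.0, 0), (3.0, 0)], dtype=fields)

print(rec.x, rec_from_data.x)
```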

class recarray(ndarray):
    """Construct an ndarray that allows field access using attributes.
    ...
    This constructor can be compared to ``empty``: it creates a new record
    array but does not fill it with data.
    """

The key function that implements the distinctive field-as-attribute behavior is __getattribute__ (__getitem__ implements indexing):

def __getattribute__(self, attr):
    # See if ndarray has this attr, and return it if so. (note that this
    # means a field with the same name as an ndarray attr cannot be
    # accessed by attribute).
    try:
        return object.__getattribute__(self, attr)
    except AttributeError:  # attr must be a fieldname
        pass

    # look for a field with this name
    fielddict = ndarray.__getattribute__(self, 'dtype').fields
    try:
        res = fielddict[attr][:2]
    except (TypeError, KeyError):
        raise AttributeError("recarray has no attribute %s" % attr)
    obj = self.getfield(*res)

    # At this point obj will always be a recarray, since (see
    # PyArray_GetField) the type of obj is inherited. Next, if obj.dtype is
    # non-structured, convert it to an ndarray. If obj is structured leave
    # it as a recarray, but make sure to convert to the same dtype.type (eg
    # to preserve numpy.record type if present), since nested structured
    # fields do not inherit type.
    if obj.dtype.fields:
        return obj.view(dtype=(self.dtype.type, obj.dtype.fields))
    else:
        return obj.view(ndarray)

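It first tries to get a regular attribute, such as .shape, .strides, .data, as well as all the methods (.sum, .reshape, etc.). If that fails, it then looks the name up in the dtype field names. So it really is just a structured array with some redefined access methods.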
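The lookup described above can be observed from the outside. This is a small sketch with a made-up dtype (one scalar field and one nested structured field): a non-structured field comes back as a plain ndarray, a nested structured field stays a recarray so attribute access can be chained, and a name that is neither an attribute nor a field raises AttributeError.

```python
import numpy as np

# Hypothetical dtype: a scalar field plus a nested structured field.
dt = np.dtype([
    ("x", "f8"),
    ("pos", [("lat", "f8"), ("lon", "f8")]),
])
rec = np.zeros(4, dtype=dt).view(np.recarray)

# Non-structured field: converted to a plain ndarray.
x_type = type(rec.x)

# Structured field: kept as a recarray, so .lat works on the result.
pos_type = type(rec.pos)
lat = rec.pos.lat

# Neither an ndarray attribute nor a field name: AttributeError.
try:
    rec.no_such_field
    missing_raises = False
except AttributeError:
    missing_raises = True

print(x_type, pos_type, lat.shape, missing_raises)
```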

As best I can tell, record array and recarray are the same thing.

Another file shows something of the history:

https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py

Collection of utilities to manipulate structured arrays. Most of these functions were initially implemented by John Hunter for matplotlib. They have been rewritten and extended for convenience.

Many of the functions in this file end with:

    if asrecarray:
        output = output.view(recarray)

The fact that you can return an array as a recarray view shows just how "thin" this layer is.
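A minimal sketch of that thinness, using a placeholder dtype: viewing a structured array as a recarray (and back) copies no data, so a write through one name is visible through the other.

```python
import numpy as np

base = np.zeros(3, dtype=[("x", "f8"), ("y", "i8")])

# Reinterpret the same buffer as a recarray; no data is copied.
rec = base.view(np.recarray)
shares = np.shares_memory(base, rec)

# A write through the structured array shows up in the recarray view.
base["x"][0] = 5.0
seen_through_view = rec.x[0]

# Going back to a plain ndarray is the same kind of view.
back = rec.view(np.ndarray)

print(shares, seen_through_view, type(back))
```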

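numpy has a long history and merges several independent projects. My impression is that recarray is the older idea, and that structured arrays are the current implementation, built on a generalized dtype. recarrays seem to be kept for convenience and backward compatibility. But I'd have to study the github file history, and any recent issues/pull requests, to be sure.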
