用numpy构建一个基本立方体?

2024-06-01 07:12:05 发布

您现在位置:Python中文网/ 问答频道 /正文

我想知道numpy是否可以用于构建最基本的多维数据集模型,其中存储所有交叉组合及其计算值

让我们以以下数据为例:

AUTHOR         BOOK          YEAR        SALES
Shakespeare    Hamlet        2000        104.2
Shakespeare    Hamlet        2001        99.0
Shakespeare    Romeo         2000        27.0
Shakespeare    Romeo         2001        19.0
Dante          Inferno       2000        11.6
Dante          Inferno       2001        12.6

并且能够构建类似于:

                             YEAR                  TOTAL
AUTHOR            BOOK       2000       2001         
(ALL)             (ALL)      142.8      130.6      273.4
Shakespeare       (ALL)      131.2      118.0      249.2
Dante             (ALL)      11.6       12.6       24.2
Shakespeare       Hamlet     104.2      99.0       203.2
Shakespeare       Romeo      27.0       19.0       46.0
Dante             Inferno    11.6       12.6       24.2

我希望使用像^{}这样的东西可以让我达到75%。基本上,我想看看是否有可能用numpy(不是熊猫)构建一个包含所有预计算值的结构,以构建一个结构,这样我就可以检索所有可能组合的上述结果。为了简单起见,我们只考虑^ {CD4}}作为唯一可能的计算。也许这是一种迂回的提问方式,但是numpy是做这件事的支柱吗,还是我需要使用其他的东西

最后,如果在numpy中不可能,那么如何将其存储在MDA中


Tags: 数据模型numpyall结构yearshakespeare交叉
3条回答

这是一个解决方案的示意图,显然您需要包装辅助函数和类以提供一个简单的接口。其思想是将每个唯一的名称映射到一个索引(为了简单起见,这里是顺序的),然后使用该索引将值存储在数组中。它是次优的,因为您必须将数组填充到最大数量的不同项的最大大小。数组为零,否则不要包含在和中。如果想避免添加零元素,可以考虑掩码数组和掩码和。p>

import numpy as np

def get_dict(x):
    return {a:i for i, a in enumerate(set(x))}

#Mapping name to unique contiguous numbers (obviously put in a fn or class)
author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]

#Define dictonary of indices
d = get_dict(author)
d.update(get_dict(book))
d.update(get_dict(year)) 

#Index values to put in multi-dimension array
ai = [d[i] for i in author]
bi = [d[i] for i in book]
yi = [d[i] for i in year]

#Pad array up to maximum size
A = np.zeros([np.max(ai)+1, np.max(bi)+1, np.max(yi)+1])

#Store elements with unique name as index in 3D datacube
for n in range(len(sales)):
    i = ai[n]; j = bi[n]; k = yi[n]
    A[i,j,k] = sales[n]

#Now we can get the various sums, for example all sales
print("Total=", np.sum(A))

#All shakespeare (0)
print("All shakespeare=", np.sum(A[d["Shakespeare"],:,:]))

#All year 2001
print("All year 2001", np.sum(A[:,:,d["2001"]]))

#All Shakespeare in 2000
print("All Shakespeare in 2000", np.sum(A[d["Shakespeare"],:,d["2000"]]))

我认为numpy记录数组可以用于此任务,下面是我基于记录数组的解决方案

class rec_array():
    
    def __init__(self,author=None,book=None,year=None,sales=None):
        self.dtype = [('author','<U20'), ('book','<U20'),('year','<U20'),('sales',float)]
        self.rec_array = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
        
    def add_record(self,author,book,year,sales):
        new_rec = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
        if not self.rec_array.shape == ():
            self.rec_array = np.hstack((self.rec_array,new_rec))
        else:
            self.rec_array = new_rec
    
    def get_view(self,conditions):
        """
        conditions: 
            A list of conditions, for example 
            [["author",<,"Shakespeare"],["year","<=","2000"]]
        """
        mask = np.ones(self.rec_array.shape[0]).astype(bool)
        for item in conditions:
            field,op,target = item
            field_op = "self.rec_array['%s'] %s '%s'" % (field,op,target)
            mask &= eval(field_op)
        
        selected_sales = self.rec_array['sales'][mask]
        
        return np.sum(selected_sales)

基于此rec_array,给定数据

author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]

我们创建一个实例

test = rec_array()
test.add_record(author,book,year,sales)

例如,如果你想卖掉莎士比亚的《罗密欧》,你可以这么做

test.get_view([["author","==","Shakespeare"],["book","==","Romeo"]])

输出为46.0

或者,你也可以这样做

test.get_view([["author","==","Shakespeare"],["year","<=","2000"]])

输出为131.2

对于数据结构,可以定义以下类:

class Cube:

    def __init__(self, row_index, col_index, data):
        self.row_index = {r: i for i, r in enumerate(row_index)}
        self.col_index = {c: i for i, c in enumerate(col_index)}
        self.data = data

    def __getitem__(self, item):
        row, col = item
        return self.data[self.row_index[row] , self.col_index[col]]

    def __repr__(self):
        return repr(self.data)

基本上是一个二维numpy数组的轻包装。要计算交叉列表,您可以执行以下操作:

def _x_tab(rows, columns, values):
    """Function for computing the cross-tab of simple arrays"""
    unique_values_all_cols, idx = zip(*(np.unique(col, return_inverse=True) for col in [rows, columns]))

    shape_xt = [uniq_vals_col.size for uniq_vals_col in unique_values_all_cols]

    xt = np.zeros(shape_xt, dtype=np.float)
    np.add.at(xt, idx, values)

    return unique_values_all_cols, xt


def make_index(a, r):
    """Make array of tuples"""
    l = [tuple(row) for row in a[:, r]]
    return make_object_array(l)


def make_object_array(l):
    a = np.empty(len(l), dtype=object)
    a[:] = l
    return a


def fill_label(ar, le):
    """Fill missing parts with ALL label"""
    missing = tuple(["ALL"] * le)
    return [(e + missing)[:le] for e in ar]

def x_tab(rows, cols, values):
    """Main function for cross tabulation"""
    _, l_cols = rows.shape

    total_agg = []
    total_idx = []
    for i in range(l_cols + 1):
        (idx, _), agg = _x_tab(make_index(rows, list(range(i))), cols, values)
        total_idx.extend(fill_label(idx, l_cols))
        total_agg.append(agg)

    stacked_agg = np.vstack(total_agg)
    stacked_agg_total = stacked_agg.sum(axis=1).reshape(-1, 1)

    return Cube(total_idx, list(dict.fromkeys(cols)), np.concatenate((stacked_agg, stacked_agg_total), axis=1))

假设输入一个arr数组:

[['Shakespeare' 'Hamlet' 2000 104.2]
 ['Shakespeare' 'Hamlet' 2001 99.0]
 ['Shakespeare' 'Romeo' 2000 27.0]
 ['Shakespeare' 'Romeo' 2001 19.0]
 ['Dante' 'Inferno' 2000 11.6]
 ['Dante' 'Inferno' 2001 12.6]]

那么x_tab可以这样调用:

result = x_tab(arr[:, [0, 1]], arr[:, 2], arr[:, 3])
print(result)

输出

array([[142.8, 130.6, 273.4],
       [ 11.6,  12.6,  24.2],
       [131.2, 118. , 249.2],
       [ 11.6,  12.6,  24.2],
       [104.2,  99. , 203.2],
       [ 27. ,  19. ,  46. ]])

请注意,此表示法(repr)仅用于显示结果,您可以根据需要进行更改。然后可以按如下方式访问多维数据集的单元格:

print(result[('Dante', 'ALL'), 2001])
print(result[('Dante', 'Inferno'), 2001])
print(result[('Shakespeare', 'Hamlet'), 2000])

输出

12.6
12.6
104.2

请注意,大部分操作都在x_tab函数中,该函数使用纯numpy函数。同时,它为您选择的任何聚合函数提供了灵活的接口,只需更改此行的ufunc

np.add.at(xt, idx, values)

由本list的任何其他人。有关更多信息,请参阅at运算符的文档

代码的工作副本可在here找到。以上是基于这个gist

注意 这假设您正在为索引传递多个列(rows参数)

相关问题 更多 >