xarray expand_dims to add a higher-level dimension

Posted 2024-10-05 14:32:47


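I am trying to concatenate a list of DataArrays and then add a dimension so that each concatenated DataArray is labeled. I thought this was a use case for expand_dims, but after trying the various solutions I found, I am stuck. I think I am missing something fundamental about xarray. These seem to be the closest: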

  1. Add a 'time' dimension to xarray Dataset and assign coordinates from another Dataset to it

  2. Add 'constant' dimension to xarray Dataset

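I use a pandas DataFrame to compile metadata from the filenames, then group it and iterate over the groups to build the dataset, using skimage.io.ImageCollection to load multiple image files into a NumPy array, and finally create the xarray object.

Self-contained example

Setup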

#%%  load libraries
from itertools import product
from PIL import Image
import numpy as np
import pandas as pd
import xarray as xr
import glob
import os
from skimage import io
import re

#%% Synthetic data generator
ext = 'png'
delim = '_'

datadir = os.path.join('data','syn')
os.makedirs(datadir, exist_ok=True)
cartag = ['A1', 'A2']
date = ['2020-05-31', '2020-06-01', '2020-06-02']
frame = ['Fp', 'Fmp']
parameter = ['FvFm','t40', 't60']
list_vals = [cartag, date, frame, parameter]
mesh = list(product(*list_vals))
mesh = np.array(mesh)
for entry in mesh:
    print(entry)
    img = np.random.random_sample((8, 8))*255
    img = img.astype('uint8')
    fn = delim.join(entry)+'.png'
    pimg = Image.fromarray(img)
    pimg.save(os.path.join(datadir,fn))

#%% import synthetic images
fns = [
    fn for fn in glob.glob(pathname=os.path.join(datadir, '*%s' % ext))
]
flist = list()
for fullfn in fns:
    fn = os.path.basename(fullfn)
    fn,_ = os.path.splitext(fn)
    f = fn.split(delim)
    f.append(fullfn)
    flist.append(f)

fdf = pd.DataFrame(flist,
                columns=[
                    'plantbarcode', 'timestamp',
                    'frame','parameter', 'filename'
                ])
fdf=fdf.sort_values(['timestamp','plantbarcode','parameter','frame'])

Function definition

#%%
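def get_tind_seconds(parameter):
    # Take the first run of digits in the parameter name, e.g. 't40' -> 40
    tind = re.search(r"\d+", parameter)
    if tind is not None:
        tind = int(tind.group())
    elif parameter == 'FvFm':
        # 'FvFm' contains no digits and is mapped to 0 seconds
        tind = 0
    else:
        raise ValueError("the parameter '%s' is not supported" % parameter)
    return tind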

xarray part

dfgrps = fdf.groupby(['plantbarcode', 'timestamp', 'parameter'])
ds = list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(
        parameter
    )  #tind is an integer representing seconds since start of experiment
    # print(tind)

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()  #imgstack is now 2x8x8 ndarray
    indf = grpdf.frame  #the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                       data=imgstack,
                       dims=('frame', 'y', 'x'),
                       coords={
                    #        'frame': indf,
                           'parameter': [parameter,parameter]
                    #        'tind_s': [tind,tind]
                       },
                       attrs={
                           'jobdate': grpdf.timestamp.unique()[0],
                           'plantbarcode': grpdf.plantbarcode.unique()[0]
                       })
    # arr = arr.expand_dims(
    #     dims={'tind_s': tind}
    # )  #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)

dstest = xr.concat(ds, dim='parameter')

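The goal is to have a separate file for every day and plantbarcode. So in this example there are 4 files, in which the images are indexed by parameter and frame. tind_s is often used to plot summary statistics of each image for each parameter, so I would also like to have it as a dim/coord. I am not sure when to use which. It seems a dim has to match the data that is passed in, so in this case 2 frames x 8x8 pixels.

Original

I'm using a pandas DataFrame to compile metadata from the filenames (here are the first few entries)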

   frameid  plantbarcode  experiment  datetime             jobdate     cameralabel  filename                                       frame  parameter
4  5        A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-5.png   Fp     FvFm
5  6        A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-6.png   Fmp    FvFm
6  7        A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-7.png   Fp     t40_ALon
7  8        A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-8.png   Fmp    t40_ALon
8  9        A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-9.png   Fp     t60_ALon
9  10       A1            doi         2020-05-31 21:01:55  2020-06-01  PSII0        data/psII/A1-doi-20200531T210155-PSII0-10.png  Fmp    t60_ALon
...

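Then I group it and iterate over the groups to build the dataset, using skimage.io.ImageCollection to load multiple image files into a NumPy array, and finally create the xarray object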

import os
import cppcpyutils as cppc
import re
from skimage import io
import xarray as xr
import numpy as np
import pandas as pd

delimiter = r"(.{2})-(.+)-(\d{8}T\d{6})-(.+)-(\d+)"

filedf = cppc.io.import_snapshots('data/psII', camera='psII', delimiter=delimiter)
filedf = filedf.reset_index().set_index('frameid')

pimframes_map = pd.read_csv('data/pimframes_map.csv',index_col = 'frameid')

filedf = filedf.join(pimframes_map, on = 'frameid').reset_index().query('frameid not in [3,4,5,6]')
dfgrps = filedf.groupby(['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])

ds=list()
for grp, grpdf in dfgrps:
    # print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter) #tind is an integer representing seconds since start of experiment
    # print(tind)

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate() #imgstack is now 2x640x480 ndarray
    indf = grpdf.frame #the 2 dim are frames Fp and Fmp
    # print(indf)
    arr = xr.DataArray(name=parameter,
                      data=imgstack,
                      dims=('induction frame','y', 'x'),
                      coords={'induction frame': indf},
                      attrs={'plantbarcode': grpdf.plantbarcode.unique()[0],
                            'jobdate': grpdf.jobdate.unique()[0]})
    arr = arr.expand_dims(dims = {'tind_s': tind}) #<- somehow I need to label each dataarray with another dimension assigning it the dim/coord `tind`
    ds.append(arr)

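The expand_dims line results in ValueError: dimensions ('dims',) must have the same length as the number of data dimensions, ndim=0

If I try to follow the second SO question I linked above and supply 'tind_s' as a coordinate, it complains that there are too many coords relative to the dims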

ValueError: coordinate tind_s has dimensions ('tind_s',), but these are not a subset of the DataArray dimensions ('induction frame', 'y', 'x')

Then I want to concatenate along tind_s, with tind_s as the coordinate

dstest=xr.concat(ds[0:4], dim = 'tind_s')
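
A minimal sketch of one way this could work, assuming it is acceptable for each stack to carry `tind_s` as a scalar (zero-dimensional) coordinate. A scalar coordinate is not tied to any dimension, so it does not trigger the "not a subset of the DataArray dimensions" error, and `xr.concat` along a dimension of the same name turns the scalar values into a coordinate along the new dimension. The helper `make_stack` and the synthetic 8x8 data are hypothetical stand-ins for the image loading above.

```python
import numpy as np
import xarray as xr


def make_stack(tind):
    """Hypothetical stand-in for one loaded image stack (2 frames of 8x8 pixels)."""
    data = np.full((2, 8, 8), tind, dtype="uint8")
    arr = xr.DataArray(
        data,
        dims=("frame", "y", "x"),
        coords={"frame": ["Fp", "Fmp"]},
    )
    # A scalar coordinate has no dimension of its own, so it cannot
    # conflict with the existing ('frame', 'y', 'x') dimensions.
    return arr.assign_coords(tind_s=tind)


stacks = [make_stack(tind) for tind in (0, 40, 60)]

# Concatenating along 'tind_s' creates the new dimension and promotes the
# scalar coordinate values to a coordinate along it.
combined = xr.concat(stacks, dim="tind_s")

print(combined.dims)
print(combined["tind_s"].values)
```

With the coordinate in place, a single time point can be selected by label, for example `combined.sel(tind_s=40)`.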

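Another attempt

I did find that I can use np.expand_dims() on imgstack and then specify the extra dims and coords, but the result contains an array of nan. Also, the result of xr.concat() is a DataArray rather than a Dataset, so it cannot be saved (?). Is there a direct way to do this in xarray? I also converted the attributes to dims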

dfgrps = filedf.groupby(
    ['experiment', 'plantbarcode', 'jobdate', 'datetime', 'parameter'])

dalist = list()
for grp, grpdf in dfgrps:
    print(grpdf.parameter.unique())
    parameter = grpdf.parameter.unique()[0]
    tind = get_tind_seconds(parameter)
    # print(tind)
    print(grpdf.plantbarcode.unique())
    print(grpdf.jobdate.unique()[0])

    filenames = grpdf.filename.to_list()
    imgcol = io.ImageCollection(filenames)
    imgstack = imgcol.concatenate()
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    imgstack = np.expand_dims(imgstack, axis=0)
    indf = grpdf.frame  #xr.Variable('induction frame', grpdf.frame)
    # tind = xr.Variable('tind', [tind])
    # print(indf)
    arr = xr.DataArray(data=imgstack,
                       dims=('jobdate','plantbarcode', 'tind_s', 'induction frame', 'y',
                             'x'),
                       coords={
                           'plantbarcode': grpdf.plantbarcode.unique(),
                           'tind_s': [tind],
                           'induction frame': indf,
                           'jobdate': grpdf.jobdate.unique()}
    )
    dalist.append(arr)

ds = xr.concat(dalist, dim='jobdate')

After the for loop: print(arr)

<xarray.DataArray (jobdate: 1, plantbarcode: 1, tind_s: 1, induction frame: 2, y: 640, x: 480)>
array([[[[[[0, 0, 0, ..., 0, 0, 0],
           [1, 1, 0, ..., 0, 0, 0],
           [0, 0, 2, ..., 0, 0, 0],
           ...,
           [1, 0, 0, ..., 0, 1, 0],
           [1, 0, 0, ..., 0, 0, 1],
           [1, 0, 0, ..., 1, 1, 0]],

          [[0, 0, 0, ..., 0, 1, 1],
           [2, 2, 0, ..., 0, 0, 1],
           [2, 1, 1, ..., 0, 0, 0],
           ...,
           [0, 1, 0, ..., 1, 0, 1],
           [1, 0, 0, ..., 0, 1, 1],
           [0, 0, 0, ..., 0, 0, 0]]]]]], dtype=uint8)
Coordinates:
  * plantbarcode     (plantbarcode) object 'A2'
  * tind_s           (tind_s) int64 60
  * induction frame  (induction frame) object 'Fp' 'Fmp'
  * jobdate          (jobdate) datetime64[ns] 2020-06-03
Dimensions without coordinates: y, x

print(ds)

<xarray.DataArray (jobdate: 18, plantbarcode: 2, tind_s: 3, induction frame: 2, y: 640, x: 480)>
array([[[[[[ 0.,  0.,  0., ...,  0.,  0.,  1.],
           [ 0.,  0.,  1., ...,  2.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ...,
           [ 1.,  0.,  0., ...,  7.,  0.,  0.],
           [ 0.,  2.,  4., ...,  0.,  0.,  4.],
           [ 0.,  1.,  0., ...,  1.,  0.,  0.]],

          [[ 0.,  1.,  0., ...,  0.,  1.,  0.],
           [ 0.,  0.,  1., ...,  1.,  2.,  1.],
           [ 0.,  1.,  1., ...,  1.,  0.,  0.],
           ...,
           [ 1.,  2.,  2., ...,  0.,  1.,  1.],
           [ 1.,  1.,  1., ...,  0.,  1.,  0.],
           [ 0.,  0.,  2., ...,  0.,  0.,  1.]]],


         [[[nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
...
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan]]],


         [[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 1.,  1.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  2., ...,  0.,  0.,  0.],
           ...,
           [ 1.,  0.,  0., ...,  0.,  1.,  0.],
           [ 1.,  0.,  0., ...,  0.,  0.,  1.],
           [ 1.,  0.,  0., ...,  1.,  1.,  0.]],

          [[ 0.,  0.,  0., ...,  0.,  1.,  1.],
           [ 2.,  2.,  0., ...,  0.,  0.,  1.],
           [ 2.,  1.,  1., ...,  0.,  0.,  0.],
           ...,
           [ 0.,  1.,  0., ...,  1.,  0.,  1.],
           [ 1.,  0.,  0., ...,  0.,  1.,  1.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]]]]]])
Coordinates:
  * plantbarcode     (plantbarcode) object 'A1' 'A2'
  * tind_s           (tind_s) int64 0 40 60
  * induction frame  (induction frame) object 'Fp' 'Fmp'
  * jobdate          (jobdate) datetime64[ns] 2020-06-01 ... 2020-06-03
Dimensions without coordinates: y, x

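I don't understand where the arrays of nan come from. I also find it odd that, whichever dim is used in concat, there is one coord value per entry (18 files in this case) even though the values are not unique, whereas the other dims show only unique values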
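
A likely explanation, offered as a sketch rather than a diagnosis of the real data: `xr.concat` stacks every input along the concatenation dimension without deduplicating its labels, and it outer-joins the indexes of all other dimensions by default. Each input is therefore reindexed onto the union of `tind_s` values, and the positions it does not cover are filled with NaN, which also forces a float dtype. The example below reproduces this with small synthetic arrays (the helper `make_piece` and the shapes are hypothetical) and shows a nested concatenation, first along `tind_s` and then along `jobdate`, that avoids the padding.

```python
import numpy as np
import xarray as xr


def make_piece(jobdate, tind):
    """Hypothetical length-1 block along 'jobdate' and 'tind_s'."""
    return xr.DataArray(
        np.ones((1, 1, 2), dtype="uint8"),
        dims=("jobdate", "tind_s", "frame"),
        coords={"jobdate": [jobdate], "tind_s": [tind], "frame": ["Fp", "Fmp"]},
    )


dates = ["2020-06-01", "2020-06-02"]
tinds = [0, 40]

# Flat concat: 4 entries along 'jobdate' (labels repeat), while 'tind_s' is
# outer-joined to [0, 40], so each entry is padded with NaN where it has no data.
pieces = [make_piece(d, t) for d in dates for t in tinds]
naive = xr.concat(pieces, dim="jobdate", join="outer")
print(naive.sizes, naive.dtype, int(naive.isnull().sum()))

# Nested concat: build each date along 'tind_s' first, then stack the dates.
by_date = [
    xr.concat([make_piece(d, t) for t in tinds], dim="tind_s") for d in dates
]
tidy = xr.concat(by_date, dim="jobdate")
print(tidy.sizes, tidy.dtype, int(tidy.isnull().sum()))
```

The nested form only works cleanly when every date has the same set of `tind_s` values; with ragged data some NaN padding is unavoidable in a dense array.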

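If anyone is willing to download a small dataset, here is a link (sorry about the link; following the suggestions, I will try to provide a synthetic dataset that can be generated on the fly)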


2 Answers

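Your original code contains a subtle bug (a typo) in arr.expand_dims(dims={'tind_s': tind}): I think you want dim instead of dims, since dims is interpreted by xarray as the label of a new dimension (see the doc). In addition, tind is used here as the number of elements to create along the new dimension, which is probably not what you want either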
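
A minimal sketch of the difference, using a small synthetic array in place of the real image stack (the array contents and the value of `tind` are assumptions for illustration). Passing a sequence as the dictionary value gives a new dimension of length 1 labeled with that value, whereas passing a bare integer is taken as the length of the new dimension.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.zeros((2, 8, 8), dtype="uint8"),
    dims=("frame", "y", "x"),
    coords={"frame": ["Fp", "Fmp"]},
)
tind = 40

# Sequence value: one new dimension of length 1, labeled with the value 40.
labeled = arr.expand_dims(dim={"tind_s": [tind]})

# Integer value: the new dimension has length 40, with the data repeated.
repeated = arr.expand_dims(dim={"tind_s": tind})

print(labeled.sizes["tind_s"], labeled["tind_s"].values)
print(repeated.sizes["tind_s"])
```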

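Your other attempt (i.e., expanding the data dimensions before creating the DataArray) is a better approach, but it can be improved further. Since you have multiple labels along the same concatenation dimension, I suggest creating a multi-index and assigning it to the concatenation dimension, e.g.,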

import numpy as np
import pandas as pd
import xarray as xr


da_list = []
props = []
prop_names = ['experiment', 'plantbarcode', 'tind']

for i in range(10):
    tind = i
    indf = ['Fp', 'Fmp']
    data = np.ones((2, 640, 480)) * i
    
    da = xr.DataArray(
        data=data[None, ...],
        dims=('props', 'frame', 'y', 'x'),
        coords={'frame': indf}
    )

    props.append((f'experiment{i}', i*2, i))
    da_list.append(da)


prop_idx = pd.MultiIndex.from_tuples(props, names=prop_names)

da_concat = xr.concat(da_list, 'props')
da_concat.coords['props'] = prop_idx

which gives:

<xarray.DataArray (props: 10, frame: 2, y: 640, x: 480)>
array([[[[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]]],


       [[[1., 1., 1., ..., 1., 1., 1.],
         [1., 1., 1., ..., 1., 1., 1.],
         [1., 1., 1., ..., 1., 1., 1.],
...
         [8., 8., 8., ..., 8., 8., 8.],
         [8., 8., 8., ..., 8., 8., 8.],
         [8., 8., 8., ..., 8., 8., 8.]]],


       [[[9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         ...,
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.]],

        [[9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         ...,
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.],
         [9., 9., 9., ..., 9., 9., 9.]]]])
Coordinates:
  * frame         (frame) <U3 'Fp' 'Fmp'
  * props         (props) MultiIndex
  - experiment    (props) object 'experiment0' 'experiment1' ... 'experiment9'
  - plantbarcode  (props) int64 0 2 4 6 8 10 12 14 16 18
  - tind          (props) int64 0 1 2 3 4 5 6 7 8 9
Dimensions without coordinates: y, x
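
On the side question of saving the result, here is a hedged sketch: a DataArray can be wrapped in a Dataset with `to_dataset(name=...)`. To my knowledge a MultiIndex coordinate cannot be written to netCDF directly, so a common workaround is to flatten it with `reset_index` first and rebuild it with `set_index` after loading. The variable name `images`, the label values, and the array shape below are hypothetical.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((3, 2, 8, 8), dtype="uint8"),
    dims=("props", "frame", "y", "x"),
    coords={
        "frame": ["Fp", "Fmp"],
        "plantbarcode": ("props", ["A1", "A1", "A2"]),
        "tind": ("props", [0, 40, 0]),
    },
)

# Combine the two label columns into a MultiIndex on 'props'.
indexed = da.set_index(props=["plantbarcode", "tind"])

# Flatten the MultiIndex back into ordinary coordinates before saving.
flat = indexed.reset_index("props")

# Wrap the array in a Dataset under an explicit variable name.
ds = flat.to_dataset(name="images")
print(ds)
```

The resulting Dataset could then be passed to `ds.to_netcdf(...)`, assuming a netCDF backend is installed.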

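I saw your question on the xarray mailing list. This problem is hard to debug because it is complicated and depends on your data. It would be great if you could simplify it a bit, or use synthetic data instead of data files. For suggestions on how to do that, see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

It would also help if you shared the output of print(arr), so that we can understand the contents and structure of the data array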
