Dask dataframe read parquet schema difference

Posted 2024-09-23 22:31:45


Here is what I do:

import dask.dataframe as dd
from dask.distributed import Client
client = Client()

raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

The dataset comes from a presentation made by Matthew Rocklin and is used as a dask dataframe demo. I then try to write it to parquet with pyarrow:

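(The exact snippet did not survive on this page. A minimal sketch of what the write step likely looked like, assuming dask.dataframe's to_parquet with the pyarrow engine and an output path inferred from the read call below; both are assumptions, not the poster's exact code:)

raw_data_df.to_parquet('dataset/parquet/2015.parquet', engine='pyarrow')  # assumed call, path inferred from read_parquet below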

Trying to read it back:

raw_data_df = dd.read_parquet(path='dataset/parquet/2015.parquet/')

I get the following error:

ValueError: Schema in dataset/parquet/2015.parquet//part.192.parquet was different. 

VendorID: double
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
pickup_longitude: double
pickup_latitude: double
RateCodeID: int64
store_and_fwd_flag: binary
dropoff_longitude: double
dropoff_latitude: double
payment_type: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
metadata
--------
{'pandas': '{"pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "VendorID", "name": "VendorID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tpep_pickup_datetime", "name": "tpep_pickup_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "tpep_dropoff_datetime", "name": "tpep_dropoff_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "passenger_count", "name": "passenger_count", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "trip_distance", "name": "trip_distance", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_longitude", "name": "pickup_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_latitude", "name": "pickup_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "RateCodeID", "name": "RateCodeID", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "field_name": "store_and_fwd_flag", "name": "store_and_fwd_flag", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "dropoff_longitude", "name": "dropoff_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "dropoff_latitude", "name": "dropoff_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "payment_type", "name": "payment_type", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "fare_amount", "name": "fare_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "extra", "name": "extra", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "mta_tax", "name": "mta_tax", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tip_amount", "name": "tip_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tolls_amount", "name": "tolls_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "improvement_surcharge", "name": "improvement_surcharge", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "total_amount", "name": "total_amount", "numpy_type": "float64", "pandas_type": "float64"}], "column_indexes": []}'}
vs

VendorID: double
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
pickup_longitude: double
pickup_latitude: double
RateCodeID: double
store_and_fwd_flag: binary
dropoff_longitude: double
dropoff_latitude: double
payment_type: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
metadata
--------
{'pandas': '{"pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "VendorID", "name": "VendorID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tpep_pickup_datetime", "name": "tpep_pickup_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "tpep_dropoff_datetime", "name": "tpep_dropoff_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "passenger_count", "name": "passenger_count", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "trip_distance", "name": "trip_distance", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_longitude", "name": "pickup_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_latitude", "name": "pickup_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "RateCodeID", "name": "RateCodeID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "store_and_fwd_flag", "name": "store_and_fwd_flag", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "dropoff_longitude", "name": "dropoff_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "dropoff_latitude", "name": "dropoff_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "payment_type", "name": "payment_type", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "fare_amount", "name": "fare_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "extra", "name": "extra", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "mta_tax", "name": "mta_tax", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tip_amount", "name": "tip_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tolls_amount", "name": "tolls_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "improvement_surcharge", "name": "improvement_surcharge", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "total_amount", "name": "total_amount", "numpy_type": "float64", "pandas_type": "float64"}], "column_indexes": []}'}

But looking at them, they seem identical to me. Can you help me figure out why?


2 Answers

This problem touches a deeper issue in Pandas and Dask: the nullability, or lack of it, of data types. Missing data can cause trouble, especially for dtypes that have no designated missing-data indicator, such as integers.

Floats and datetimes are not too badly off, because each has a designated placeholder for null/missing values (NaN for floats in numpy, NaT for datetimes in pandas), so they are nullable. But even those dtypes can run into trouble in some situations.

The problem can appear when you read multiple CSV files (as in your case), pull from a database, or concatenate small dataframes into a larger one. You can end up with a partition in which some or all of the values of a given field are missing. For those partitions, Dask and Pandas assign the field a dtype that can hold the missing-data indicator; for integers that new dtype is float, and when written to parquet it is stored as double.
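A minimal pandas illustration of that promotion (not from the original post, just the general rule):

    import numpy as np
    import pandas as pd

    # A column with no missing values keeps an integer dtype...
    print(pd.Series([1, 2, 3]).dtype)        # int64
    # ...but a single missing value forces promotion to float64,
    # because plain numpy integers have no placeholder for missing data.
    print(pd.Series([1, 2, np.nan]).dtype)   # float64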

Dask will happily report a somewhat misleading dtype for the field, but when you write to parquet, the partitions that contain missing data are written differently. In your case, 'int64' was written out as 'double' in at least one parquet file, and when you then try to read the whole Dask dataframe back, the mismatch raises the ValueError shown above.

Until these issues are resolved upstream, you need to make sure every Dask field holds suitable data in every row. For example, if you have an int64 field, then NaN or any other non-integer representation of missing values will not work.

Your int64 field can probably be fixed in a few steps:

  1. Import pandas:

    import pandas as pd
    
  2. Clean the field's data as float64, coercing missing values to NaN:

    df['myint64'] = df['myint64'].map_partitions(
        pd.to_numeric,
        meta='f8',
        errors='coerce'
    )
    
  3. Pick a sentinel value (e.g. -1.0) to stand in for NaN so that int64 can work:

    df['myint64'] = df['myint64'].where(
        ~df['myint64'].isna(),
        -1.0
    )
    
  4. Cast the field to int64 and persist it all:

    df['myint64'] = df['myint64'].astype('i8')
    df = client.persist(df)
    
  5. Then try the save-and-re-read round trip (sketched below).
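A sketch of that round trip, assuming a hypothetical output path and the pyarrow engine (adjust both to your setup):

    # Hypothetical path; any writable location works.
    df.to_parquet('dataset/parquet/2015_fixed.parquet', engine='pyarrow')
    df2 = dd.read_parquet('dataset/parquet/2015_fixed.parquet')
    print(df2['myint64'].dtype)   # should now be int64 in every partition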

Note: steps 1-2 are also useful for fixing float64 fields.

Finally, to fix a datetime field, try something like:

    df['mydatetime'] = df['mydatetime'].map_partitions(
        pd.to_datetime,
        meta='M8',
        infer_datetime_format=True, 
        errors='coerce'
    ).persist()

The following two numpy specifications do not agree:

{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'int64', 'pandas_type': 'int64'}

RateCodeID: int64 


{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'float64', 'pandas_type': 'float64'}

RateCodeID: double

(Look closely!)

I suggest you either supply dtypes for these columns when loading, or coerce them to float with astype before writing (see the sketch below).
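A hedged sketch of both options, reusing the read_csv call and column name from the question (the choice of float64 here is an assumption; you could instead clean the column and keep int64):

import dask.dataframe as dd

# Option 1: pin the dtype at load time so every partition agrees.
raw_data_df = dd.read_csv(
    'dataset/nyctaxi/nyctaxi/*.csv',
    assume_missing=True,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    dtype={'RateCodeID': 'float64'},
)

# Option 2: coerce the column with astype just before writing to parquet.
raw_data_df['RateCodeID'] = raw_data_df['RateCodeID'].astype('float64')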
