Why is Dask so much slower than pandas at reading the same Parquet file?

Published 2024-09-29 23:28:03


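I am testing read speed on Parquet files with Dask and Python, and I found that reading the same file with pandas is much faster than reading it with Dask. I would like to understand why that is, and whether there is a way to get the same performance.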

Versions of all relevant packages:

print(dask.__version__)
print(pd.__version__)
print(pyarrow.__version__)
print(fastparquet.__version__)

2.6.0
0.25.2
0.15.1
0.3.2

import pandas as pd 
import numpy as np
import dask.dataframe as dd

col = [str(i) for i in list(np.arange(40))]
df = pd.DataFrame(np.random.randint(0,100,size=(5000000, 4 * 10)), columns=col)

df.to_parquet('large1.parquet', engine='pyarrow')
 # Wall time: 3.86 s
df.to_parquet('large2.parquet', engine='fastparquet')
 # Wall time: 27.1 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
 # Wall time: 5.89 s
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
 # Wall time: 4.84 s
df = pd.read_parquet('large1.parquet',engine='pyarrow')
 # Wall time: 503 ms 
df = pd.read_parquet('large2.parquet',engine='fastparquet')
 # Wall time: 4.12 s

The difference is larger when using a DataFrame with mixed data types.

^{pr2}$
df.to_parquet('large1.parquet', engine='pyarrow')
 # Wall time: 9.67 s

df.to_parquet('large2.parquet', engine='fastparquet')
 # Wall time: 33.3 s

# read with Dask
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
 # Wall time: 34.5 s

df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
 # Wall time: 1min 22s

# read with pandas 
df = pd.read_parquet('large1.parquet',engine='pyarrow')
 # Wall time: 8.67 s

df = pd.read_parquet('large2.parquet',engine='fastparquet')
 # Wall time: 21.8 s
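
The wall times in the comments above look like notebook cell timings. To reproduce the comparison outside a notebook, a small timing helper along the following lines could be used. This is only a sketch: `wall_time` and its `results` argument are hypothetical names introduced here and are not part of pandas, Dask, or the original question.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_time(label, results=None):
    """Print the elapsed wall-clock time of the enclosed block.

    Hypothetical helper for illustration only. If `results` is a dict,
    the elapsed seconds are also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Example usage (assumes the files written above exist):
#
# timings = {}
# with wall_time("pandas / pyarrow", timings):
#     df = pd.read_parquet('large1.parquet', engine='pyarrow')
# with wall_time("dask / pyarrow", timings):
#     df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
```

Collecting the timings in one dict makes it easier to repeat each read several times and compare the engines side by side, since a single run can be skewed by disk caching.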

