I'm testing Parquet read speed with Dask and Python, and I've found that reading the same file with pandas is much faster than with Dask. I'd like to know why that is, and whether there is a way to get the same performance.
Versions of all relevant packages:
print(dask.__version__)         # 2.6.0
print(pd.__version__)           # 0.25.2
print(pyarrow.__version__)      # 0.15.1
print(fastparquet.__version__)  # 0.3.2
import pandas as pd
import numpy as np
import dask.dataframe as dd
col = [str(i) for i in list(np.arange(40))]
df = pd.DataFrame(np.random.randint(0,100,size=(5000000, 4 * 10)), columns=col)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 3.86 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 27.1 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 5.89 s
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 4.84 s
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 503 ms
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 4.12 s
The difference is larger with a mixed-dtype DataFrame:
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 9.67 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 33.3 s
# read with Dask
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 34.5 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 1min 22s
# read with pandas
df = pd.read_parquet('large1.parquet',engine='pyarrow')
# Wall time: 8.67 s
df = pd.read_parquet('large2.parquet',engine='fastparquet')
# Wall time: 21.8 s
My first guess is that pandas saves the Parquet dataset into a single row group, which prevents a system like Dask from parallelizing. That doesn't explain why it's slower, but it does explain why it isn't faster.
To get more information, I'd suggest profiling. You may be interested in this document: