Applying a function to a partitioned Parquet dataset without a shuffle

Published 2024-06-14 11:40:15


I have a fairly large (~1 TB) Parquet dataset, partitioned by a database_id column. I want to copy this dataset into a new one, keeping only the rows whose index column appears in a separate table of unique indexes.

My current approach is:

import pyspark.sql.functions as F

big_table = spark.read.parquet("path/to/partitioned/big_table.parquet")
unique_indexes_table = spark.read.parquet("path/to/unique_indexes.parquet")

out = (
    big_table
    # Broadcast the small index table so the join itself needs no shuffle.
    .join(F.broadcast(unique_indexes_table), on="index")
    .write
    .save(
        path="path/to/partitioned/big_table_2.parquet",
        format='parquet',
        mode='overwrite',
        partitionBy="database_id")
)

However, this causes a shuffle and fails with a java.io.IOException: No space left on device error on my 10-node cluster, where each node has only about 900 MB of disk space under SPARK_LOCAL_DIRS.
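(One thing I could try first is redirecting or spreading out the spill. A minimal sketch, assuming a larger scratch volume mounted at a hypothetical /mnt/big_scratch; spark.local.dir and spark.sql.shuffle.partitions are standard Spark settings, though spark.local.dir is ignored when SPARK_LOCAL_DIRS is set in the environment and must be configured before the session starts:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Hypothetical larger volume for shuffle spill files; only takes effect
    # at session creation, and SPARK_LOCAL_DIRS overrides it.
    .config("spark.local.dir", "/mnt/big_scratch")
    # More, smaller shuffle partitions so each spill file stays small.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)

But with ~1 TB flowing through the shuffle and only ~9 GB of scratch space across the cluster, tuning alone probably cannot save this plan.)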

I have been trying for several days without success. I am considering rewriting this in pyarrow, reading the partitions and performing the join one at a time (see the sketch below), but I do not understand why pyspark cannot do the same thing.
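For reference, here is roughly what that per-partition pyarrow loop could look like. A minimal sketch, assuming a local hive-style layout (database_id=* subdirectories, as Spark's partitionBy writes them) and the column names from the snippet above; pq.read_table, pc.is_in, and Table.filter are standard pyarrow APIs:

import pyarrow.parquet as pq
import pyarrow.compute as pc
from pathlib import Path

# The index table is small; keep only its key column in memory.
keep = pq.read_table("path/to/unique_indexes.parquet")["index"]

src = Path("path/to/partitioned/big_table.parquet")
dst = Path("path/to/partitioned/big_table_2.parquet")

for part_dir in sorted(src.glob("database_id=*")):
    table = pq.read_table(part_dir)                  # one partition at a time
    mask = pc.is_in(table["index"], value_set=keep)  # semi-join as a membership filter
    out_dir = dst / part_dir.name                    # preserve the hive-style layout
    out_dir.mkdir(parents=True, exist_ok=True)
    pq.write_table(table.filter(mask), out_dir / "part-0.parquet")

Each iteration touches only a single database_id directory, so no step ever needs more working space than one partition.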

I have posted the SQL diagram and the query execution plan below. I believe the Exchange step is what causes the problem, and I do not understand why it is necessary.


[Image: SQL diagram of the query]

== Parsed Logical Plan ==
'InsertIntoHadoopFsRelationCommand file:/scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet, false, ['database_id], Parquet, Map(path -> /scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet), Overwrite, [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
+- AnalysisBarrier
      +- RepartitionByExpression [database_id#104], 200
         +- Project [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
            +- Join Inner, (__index_level_0__#77L = __index_level_0__#222L)
               :- Relation[uniparc_id#70,sequence#71,database#72,interpro_name#73,interpro_id#74,domain_start#75L,domain_end#76L,__index_level_0__#77L,domain_length#78L,structure_id#79,model_id#80,chain_id#81,pc_identity#82,alignment_length#83,mismatches#84,gap_opens#85,q_start#86,q_end#87,s_start#88,s_end#89,evalue_log10#90,bitscore#91,qseq#92,sseq#93,... 11 more fields] parquet
               +- ResolvedHint (broadcast)
                  +- Relation[__index_level_0__#222L] parquet

== Analyzed Logical Plan ==
InsertIntoHadoopFsRelationCommand file:/scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet, false, [database_id#104], Parquet, Map(path -> /scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet), Overwrite, [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
+- RepartitionByExpression [database_id#104], 200
   +- Project [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
      +- Join Inner, (__index_level_0__#77L = __index_level_0__#222L)
         :- Relation[uniparc_id#70,sequence#71,database#72,interpro_name#73,interpro_id#74,domain_start#75L,domain_end#76L,__index_level_0__#77L,domain_length#78L,structure_id#79,model_id#80,chain_id#81,pc_identity#82,alignment_length#83,mismatches#84,gap_opens#85,q_start#86,q_end#87,s_start#88,s_end#89,evalue_log10#90,bitscore#91,qseq#92,sseq#93,... 11 more fields] parquet
         +- ResolvedHint (broadcast)
            +- Relation[__index_level_0__#222L] parquet

== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand file:/scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet, false, [database_id#104], Parquet, Map(path -> /scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet), Overwrite, [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
+- RepartitionByExpression [database_id#104], 200
   +- Project [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
      +- Join Inner, (__index_level_0__#77L = __index_level_0__#222L)
         :- Filter isnotnull(__index_level_0__#77L)
         :  +- Relation[uniparc_id#70,sequence#71,database#72,interpro_name#73,interpro_id#74,domain_start#75L,domain_end#76L,__index_level_0__#77L,domain_length#78L,structure_id#79,model_id#80,chain_id#81,pc_identity#82,alignment_length#83,mismatches#84,gap_opens#85,q_start#86,q_end#87,s_start#88,s_end#89,evalue_log10#90,bitscore#91,qseq#92,sseq#93,... 11 more fields] parquet
         +- ResolvedHint (broadcast)
            +- Filter isnotnull(__index_level_0__#222L)
               +- Relation[__index_level_0__#222L] parquet

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet, false, [database_id#104], Parquet, Map(path -> /scratch/username/datapkg_output_dir/uniparc-domain-wstructure/master/remove_duplicate_matches/adjacency_matrix.parquet), Overwrite, [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
+- Exchange hashpartitioning(database_id#104, 200)
   +- *(2) Project [__index_level_0__#77L, uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
      +- *(2) BroadcastHashJoin [__index_level_0__#77L], [__index_level_0__#222L], Inner, BuildRight
         :- *(2) Project [uniparc_id#70, sequence#71, database#72, interpro_name#73, interpro_id#74, domain_start#75L, domain_end#76L, __index_level_0__#77L, domain_length#78L, structure_id#79, model_id#80, chain_id#81, pc_identity#82, alignment_length#83, mismatches#84, gap_opens#85, q_start#86, q_end#87, s_start#88, s_end#89, evalue_log10#90, bitscore#91, qseq#92, sseq#93, ... 11 more fields]
         :  +- *(2) Filter isnotnull(__index_level_0__#77L)
         :     +- *(2) FileScan parquet [uniparc_id#70,sequence#71,database#72,interpro_name#73,interpro_id#74,domain_start#75L,domain_end#76L,__index_level_0__#77L,domain_length#78L,structure_id#79,model_id#80,chain_id#81,pc_identity#82,alignment_length#83,mismatches#84,gap_opens#85,q_start#86,q_end#87,s_start#88,s_end#89,evalue_log10#90,bitscore#91,qseq#92,sseq#93,... 11 more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/scratch/username/datapkg_output_dir/uniparc-domain-wstructure/v0.1/contru..., PartitionCount: 1373, PartitionFilters: [], PushedFilters: [IsNotNull(__index_level_0__)], ReadSchema: struct
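The Exchange hashpartitioning(database_id#104, 200) step in the physical plan is the shuffle: rows are redistributed across the cluster by database_id before the partitioned write, and the tiny SPARK_LOCAL_DIRS fill up. Since the input is already laid out as one directory per database_id, the same one-partition-at-a-time strategy can be expressed in pyspark itself. A minimal sketch, assuming the placeholder paths from the snippet above; enumerating partitions with os.listdir only works for a local filesystem, like the file:/scratch paths shown in the plan:

import os
import pyspark.sql.functions as F

src = "path/to/partitioned/big_table.parquet"
dst = "path/to/partitioned/big_table_2.parquet"

unique_indexes_table = spark.read.parquet("path/to/unique_indexes.parquet")

# One directory per partition value, as written by Spark's partitionBy.
partitions = [d for d in os.listdir(src) if d.startswith("database_id=")]

for part in partitions:
    (
        spark.read.parquet(f"{src}/{part}")            # a single partition; no shuffle
        .join(F.broadcast(unique_indexes_table), on="index")
        .write
        .mode("overwrite")
        .parquet(f"{dst}/{part}")                      # keep the hive-style layout
    )

A left semi join (how="leftsemi") would additionally keep only the big table's columns, which matches the stated goal of filtering rows rather than enriching them.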
