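Java: a Dataset is partitioned and stored with a string column whose values look numeric. When read back, the data is still "string" but has lost its leading zeros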
In Spark 3.0.2, I am writing a Dataset to a Parquet file. My code ends like this:
etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();
// Write it in a file named with the year of data, selections, and sorting in its name.
// Underlying statement writing the parquet file is :
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"},
"{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE, actifsSeulement,
communesValides);
# schema() :
|-- codeDepartement: string (nullable = true)
It is visible in the last third of the show() output (three columns before the uppercase city name) and has the value "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |01 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |01 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
The folders under my Parquet output look fine:
codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971
Note: department codes can never be converted to numeric values, because of values such as 2A (for Corse).
The snappy.parquet chunks are stored in their respective folders, for example /data/tmp/etablissements_2020_true_true/codeDepartement=01, so that part is fine.
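At read time, I load the content of that store and search for cities whose city code (which in France starts with the department code) starts with "01". The appropriate Parquet file and chunk are picked up: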
2021-03-24 07:14:33.825 INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]
When the department is displayed (it is now the last column of the Dataset's show() output), its value is "1" instead of "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |1 |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
And yet the Parquet file still declares it as StringType:
|-- codeDepartement: string (nullable = true)
What happened?
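I am inclined to blame the repartition() statement for this mess, but I cannot see how it would cause it. If that command were at fault, and partitioning by string values were not possible, how would a program ever partition data by alphabetic values such as red, blue, and yellow?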
I don't understand the overall behavior (or bug?) I am facing.
# Answer 1
You can disable the following option:
spark.sql.sources.partitionColumnTypeInference.enabled
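According to the Partition Discovery section of the Spark SQL documentation, the data types of partitioning columns are inferred automatically from the directory names, and this option is enabled by default. When inference is disabled, string type is used for the partitioning columns.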
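To make the effect concrete, here is a minimal plain-Java sketch of what a numeric-first inference does to directory values such as `01` and `2A`. It is a simplified illustration only, not Spark's actual implementation; the class and method names (`PartitionValueInference`, `infer`, `inferThenRender`) are hypothetical.

```java
// Simplified, hypothetical model of partition value inference; NOT Spark's actual code.
final class PartitionValueInference {

    /** Tries to read the directory value as a number first, falling back to the raw text. */
    static Object infer(String raw) {
        try {
            return Integer.valueOf(raw);   // "01" parses as the integer 1
        } catch (NumberFormatException notAnInteger) {
            return raw;                    // "2A" is not numeric, so it stays a string
        }
    }

    /** Renders the inferred value back as text, the way a string-typed column would show it. */
    static String inferThenRender(String raw) {
        return String.valueOf(infer(raw));
    }

    public static void main(String[] args) {
        // The directory values taken from the question's codeDepartement=... folders.
        for (String dir : new String[] {"01", "2A", "75", "971"}) {
            System.out.println("codeDepartement=" + dir + " -> \"" + inferThenRender(dir) + "\"");
        }
    }
}
```

Once a value has been read as a number, the leading zero is gone, and converting it back to a string cannot restore it. That is consistent with the `partition values: [1]` entry in the log shown in the question.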
To set the option:
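One way to do this from Java is to pass the option when building the `SparkSession`. The sketch below is an assumption-laden example rather than the original answer's code: the class name, application name, and local master are hypothetical, and the path is the one from the question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWithoutPartitionInference {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-etablissements")   // hypothetical application name
            .master("local[*]")               // assumption: local run
            // Keep partition directory values such as "01" as strings.
            .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
            .getOrCreate();

        // Path taken from the question; the partition column is read back from the folder names.
        Dataset<Row> etablissements = spark.read()
            .parquet("/data/tmp/etablissements_2020_true_true");

        etablissements.select("codeDepartement").distinct().show();
        spark.stop();
    }
}
```

On an already running session, the same key can be set with `spark.conf().set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")` before the read.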
# Answer 2
I can reproduce the problem.
To work around it, you can provide a schema when reading:
(Tested in PySpark; it should hopefully work in Java as well.)
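A Java version of that idea might look like the following. This is an untested sketch, not the answer's original PySpark code: the class name and application name are hypothetical, and only three of the question's columns are listed for brevity. Columns left out of a user-supplied schema are not returned, so a real schema would need to list every column you want to keep.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ReadWithExplicitSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-etablissements-with-schema")   // hypothetical application name
            .master("local[*]")                           // assumption: local run
            .getOrCreate();

        // Declare the partition column explicitly as a string so that its type is not inferred.
        StructType schema = new StructType()
            .add("siret", DataTypes.StringType)
            .add("codeCommune", DataTypes.StringType)
            .add("codeDepartement", DataTypes.StringType);

        Dataset<Row> etablissements = spark.read()
            .schema(schema)
            .parquet("/data/tmp/etablissements_2020_true_true");

        // Same kind of search as in the question: city codes starting with "01".
        etablissements.filter(col("codeCommune").startsWith("01")).show(false);
        spark.stop();
    }
}
```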