0 being replaced with null in PySpark

Posted 2024-09-27 21:24:41


I have some values in my PySpark DataFrame that show up as NaN, and I found that they can be converted to null. I then adjust those nulls by filling them with another value. While doing this, I noticed that it also turns the 0s in many of my columns into null. Why does this happen, and how can I convert NaN to null without affecting the 0s?

from pyspark.sql.types import StructType, StructField, LongType

cSchema = StructType([StructField("col", LongType())])
vals = [[0] for i in range(20)]
test_df = spark.createDataFrame(vals, schema=cSchema)

test_df.show(20)

+---+
|col|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

test_df = test_df.replace(float('nan'), None)

test_df.show(20)

+----+
| col|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
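For completeness, the null-filling step I mentioned looks roughly like this in my code (the actual fill value and columns differ in my real pipeline):

# after converting NaN to null, fill the resulting nulls with a placeholder value
test_df = test_df.fillna(-1, subset=["col"])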

1 Answer
User
#1 · Posted 2024-09-27 21:24:41

The schema in your example doesn't fit what you're trying to do: you are searching for a float value in a (long) integer column. I'm actually surprised replace didn't ignore that column entirely...
Here is what happens when you try to create such a DataFrame directly:

>>> cSchema = StructType([StructField("col1", LongType()),StructField("col2", LongType())])
... vals = [[0, float('nan')] for i in range(20)]
... test_df = spark.createDataFrame(vals,schema=cSchema)
...
... test_df.show(20)
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 413, in _createFromLocal
    data = list(data)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 730, in prepare
    verify_func(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1389, in verify
    verify_value(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1370, in verify_struct
    verifier(v)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1389, in verify
    verify_value(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1383, in verify_default
    verify_acceptable_types(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1278, in verify_acceptable_types
    % (dataType, obj, type(obj))))
TypeError: field col2: LongType can not accept object nan in type <class 'float'>

field col2: LongType can not accept object nan in type <class 'float'>

And here is what happens when you use a proper schema:

>>> cSchema = StructType([StructField("col1", DoubleType()),StructField("col2", DoubleType())])
... vals = [[0., float('nan')] for i in range(20)]
... test_df = spark.createDataFrame(vals,schema=cSchema)
...
... test_df.show(3)
+----+----+
|col1|col2|
+----+----+
| 0.0| NaN|
| 0.0| NaN|
| 0.0| NaN|
+----+----+
only showing top 3 rows

>>> test_df.replace(float('nan'), None).show(3)
+----+----+
|col1|col2|
+----+----+
| 0.0|null|
| 0.0|null|
| 0.0|null|
+----+----+
only showing top 3 rows

So you could either cast everything to float/double up front (if the NaN-s are mixed into integer columns), or use the subset parameter of replace so that only float columns are searched.
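A minimal sketch of both approaches, assuming a DataFrame like the example above with an integer column col1 and a double column col2 (the names and the cast target are illustrative):

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Option 1: cast integer columns to double first, then NaN can be replaced safely
df_cast = test_df.withColumn("col1", col("col1").cast(DoubleType()))
df_cast = df_cast.replace(float('nan'), None)

# Option 2: limit replace to float/double columns via the subset parameter,
# so integer columns are never considered at all
float_cols = [f.name for f in test_df.schema.fields
              if f.dataType.typeName() in ("float", "double")]
df_fixed = test_df.replace(float('nan'), None, subset=float_cols)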

