How to select only the rows where a column has NaN values in pyspark
import numpy as np
import pandas as pd

# pyspark
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

spark = SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("INFO")
# data
dft = pd.DataFrame({
    'Code': [1, 2, 3, 4, 5, 6],
    'Name': ['Odeon', 'Imperial', 'Majestic',
             'Royale', 'Paraiso', 'Nickelodeon'],
    'Movie': [5.0, 1.0, np.nan, 6.0, 3.0, np.nan]})
schema = StructType([
    StructField('Code', IntegerType(), True),
    StructField('Name', StringType(), True),
    StructField('Movie', FloatType(), True),
])
sdft = spark.createDataFrame(dft, schema)
sdft.createOrReplaceTempView("MovieTheaters")
sdft.show()
spark.sql("""
select * from MovieTheaters where Movie is null
""").show()
+----+----+-----+
|Code|Name|Movie|
+----+----+-----+
+----+----+-----+
I get empty output. How can I fix this?
Expected output:
+----+-----------+-----+
|Code| Name|Movie|
+----+-----------+-----+
| 3| Majestic| NaN|
| 6|Nickelodeon| NaN|
+----+-----------+-----+
The query returns no rows because np.nan in a FloatType column survives the pandas-to-Spark conversion as a floating-point NaN, not as a SQL NULL, so Movie is null never matches anything. If you want to select the np.nan rows from the dataframe, filter with isnan() instead of is null.
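A minimal sketch of the fix, reusing the spark session, the sdft dataframe, and the F alias defined above:

# NaN is an ordinary float value, not a SQL NULL, so isnan() is the right predicate
spark.sql("""
select * from MovieTheaters where isnan(Movie)
""").show()

# equivalent DataFrame API form
sdft.filter(F.isnan(F.col('Movie'))).show()

# to match both NaN and genuine NULLs in a single filter
sdft.filter(F.isnan('Movie') | F.col('Movie').isNull()).show()

Either query returns the two NaN rows (Code 3 and 6) shown in the expected output above.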