PySpark: the opposite of .dropna()?

Posted 2024-06-01 10:10:11


I'm trying to find out which shops had an "empty" day, i.e. a day on which no customers came.

My table has the following structure:

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Paris    | 215         | 213         | 128         | 102         | 195         | 180         |        110 |
| London   | 145         | 106         | 102         | 83          | 127         | 111         |         56 |
| Beijing  | 179         | 245         | 134         | 136         | 207         | 183         |        136 |
| Sydney   | 0           | 0           | 0           | 0           | 0           | 6           |         36 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+

With pandas I can do something like customers[customers == 0].dropna(how="all"), which keeps only the rows that contain a 0, and I get the following result (a runnable sketch follows the table below):

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Sydney   | 0           | 0           | 0           | 0           | 0           | NaN         |         NaN|
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
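For reference, a minimal runnable sketch of that pandas approach; the DataFrame construction below is my own reconstruction of the table above, with shop as the index:

import pandas as pd

# Reconstruction of the question's table, with `shop` as the index.
customers = pd.DataFrame(
    {
        "2020-10-15": [215, 145, 179, 0],
        "2020-10-16": [213, 106, 245, 0],
        "2020-10-17": [128, 102, 134, 0],
        "2020-10-18": [102, 83, 136, 0],
        "2020-10-19": [195, 127, 207, 0],
        "2020-10-20": [180, 111, 183, 6],
        "2020-10-21": [110, 56, 136, 36],
    },
    index=["Paris", "London", "Beijing", "Sydney"],
)

# customers[customers == 0] masks every non-zero cell to NaN;
# dropna(how="all") then drops the rows that are all NaN, leaving
# only the shops that had at least one zero-customer day.
print(customers[customers == 0].dropna(how="all"))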

In PySpark, I believe DataFrame.dropna() does something similar, but I want to do the opposite and keep the rows with NA/0 values. How can I do that?


Tags: pandas, nan, all, shop, things, structure, pyspark, how
1 Answer
User
#1 · Posted 2024-06-01 10:10:11

First, create a sample dataset:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

df_list = [
    {"shop": "Paris",   "2020-10-15": 215, "2020-10-16": 213, "2020-10-17": 128, "2020-10-18": 195, "2020-10-19": 195},
    {"shop": "London",  "2020-10-15": 145, "2020-10-16": 106, "2020-10-17": 102, "2020-10-18": 127, "2020-10-19": 127},
    {"shop": "Beijing", "2020-10-15": 179, "2020-10-16": 245, "2020-10-17": 136, "2020-10-18": 207, "2020-10-19": 207},
    {"shop": "Sydney",  "2020-10-15": 0,   "2020-10-16": 0,   "2020-10-17": 0,   "2020-10-18": 0,   "2020-10-19": 0},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(Row(**x) for x in df_list)
df.show()

+--------+----------+----------+----------+----------+----------+
|    shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+--------+----------+----------+----------+----------+----------+
|   Paris|       215|       213|       128|       195|       195|
|  London|       145|       106|       102|       127|       127|
| Beijing|       179|       245|       136|       207|       207|
|  Sydney|         0|         0|         0|         0|         0|
+--------+----------+----------+----------+----------+----------+

You can then apply a filter. f.col(c).isin(0) produces a boolean for each column, and f.greatest takes the row-wise maximum (true > false), so a row is kept when at least one of its daily counts equals 0:

# Exclude the string column `shop` so it is never compared against 0.
df.filter(f.greatest(*[f.col(c).isin(0) for c in df.columns if c != "shop"])).show()

Result:

+------+----------+----------+----------+----------+----------+
|  shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+------+----------+----------+----------+----------+----------+
|Sydney|         0|         0|         0|         0|         0|
+------+----------+----------+----------+----------+----------+
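An equivalent formulation, in case you prefer OR-ing explicit per-column conditions over f.greatest (a sketch making the same assumption that shop is the only non-numeric column):

from functools import reduce

# Build one boolean condition per date column and OR them together:
# the row is kept when any daily count equals 0.
has_zero_day = reduce(
    lambda a, b: a | b,
    [f.col(c) == 0 for c in df.columns if c != "shop"],
)
df.filter(has_zero_day).show()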
