如何使用spark和python访问拼花地板表中单元格内的嵌套数组？

2条回答

网友

1楼 · 编辑于 2024-09-28 01:24:39

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('Test').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema=["name", "city", "sentiment"])
df.show()

df.filter(df.city == "london").select("name", "city", F.explode("sentiment").alias("sentiment"))\
    .select("name", "city", F.col("sentiment.text").alias("sentiment")).show()

Output:
+  -+   +    -+
| name|  city|sentiment|
+  -+   +    -+
|harry|london|    happy|
|harry|london|      sad|
|harry|london|      mad|
|sally|london|      sad|
|sally|london|      mad|
|sally|london| agitated|
| gary|london|  excited|
| gary|london|     down|
| gary|london| agitated|
+  -+   +    -+

网友

2楼 · 编辑于 2024-09-28 01:24:39

让我们首先使用示例中的数据创建一个数据帧：

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode_example').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema = ["name", "city", "sentiment"])

您拥有的是以下数据帧：

df.show(truncate=False)

+  -+     +                                                                                                      -+
|name |city      |sentiment                                                                                                                                                                                                    |
+  -+     +                                                                                                      -+
|harry|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]]        |
|sally|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]     |
|gary |london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]]      |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]]    |
+  -+     +                                                                                                      -+

一旦我们有了数据帧，您需要分解sentiment列：

from pyspark.sql.functions import explode

df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))

结果是：

df_exp.show(truncate=False)

+  -+     +                                  -+
|name |city      |col                                                                  |
+  -+     +                                  -+
|harry|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy]   |
|harry|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad]     |
|harry|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]        |
|sally|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|sally|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad]     |
|sally|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|gary |london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down]    |
|gary |london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low]     |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]    |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good]    |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]        |
+  -+     +                                  -+

最后，让我们创建一个只包含文本的列，按城市筛选并获得3个想要的列：

# Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)

# Select result columns and filter city
result = df_exp.select("name", "city", "text").where("city = 'london'")

结果将是：

result.show(truncate=False)

+  -+   +    +
|name |city  |text    |
+  -+   +    +
|harry|london|happy   |
|harry|london|sad     |
|harry|london|mad     |
|sally|london|sad     |
|sally|london|mad     |
|sally|london|agitated|
|gary |london|excited |
|gary |london|down    |
|gary |london|agitated|
+  -+   +    +

相关问题更多 >

编程相关推荐

热门问题

热门文章

如何使用spark和python访问拼花地板表中单元格内的嵌套数组？

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >