How do I access a nested array within a cell of a Parquet table using Spark and Python?

Posted 2024-09-28 01:24:39

I want to extract the "text" values from the sentiment column of my table, filtered by city = london.

I have a table like this:

name    city    sentiment
harry   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='happy'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='sad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='mad')
                ]"
sally   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='mad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
gary    london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='excited'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='down'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
mary    manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='low'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='content')
                ]"
gerry   manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='ecstatic'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='good'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='bad')
                ]"

My code currently looks like this, but it doesn't work:

from pyspark.sql import functions as F
from pyspark.sql import types as T

data= spark.read.parquet("INSERT S3 TABLE").where("city LIKE 'london' AND sentiment['text=']")
df = sharethis.toPandas()
print (df)

I would like the output to look like this:

name    city    sentiment
harry   london  happy
harry   london  sad
harry   london  mad
sally   london  sad
sally   london  mad
sally   london  agitated
gary    london  sad
gary    london  low
gary    london  content

Does anyone know how I can access the array in the sentiment column to extract the text?

Thanks in advance.

2 answers
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('Test').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema=["name", "city", "sentiment"])
df.show()

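# Keep only the London rows, explode the array into one row per element,
# then pull the "text" entry out of each exploded element.
df.filter(df.city == "london").select("name", "city", F.explode("sentiment").alias("sentiment"))\
    .select("name", "city", F.col("sentiment.text").alias("sentiment")).show()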

Output:
+-----+------+---------+
| name|  city|sentiment|
+-----+------+---------+
|harry|london|    happy|
|harry|london|      sad|
|harry|london|      mad|
|sally|london|      sad|
|sally|london|      mad|
|sally|london| agitated|
| gary|london|  excited|
| gary|london|     down|
| gary|london| agitated|
+-----+------+---------+

Let's first create a DataFrame using the data from your example:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode_example').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema = ["name", "city", "sentiment"])

What you have is the following DataFrame:

df.show(truncate=False)

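+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|name |city      |sentiment                                                                                                                                                                                                    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|harry|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]]        |
|sally|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]     |
|gary |london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]]      |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]]    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+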

Once we have the DataFrame, you need to explode the sentiment column:

from pyspark.sql.functions import explode

df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))

The result is:

df_exp.show(truncate=False)

+-----+----------+---------------------------------------------------------------------+
|name |city      |col                                                                  |
+-----+----------+---------------------------------------------------------------------+
|harry|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy]   |
|harry|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad]     |
|harry|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]        |
|sally|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|sally|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad]     |
|sally|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|gary |london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down]    |
|gary |london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low]     |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]    |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good]    |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]        |
+-----+----------+---------------------------------------------------------------------+
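
To make the explode step above less of a black box, here is a rough plain-Python analogy (no Spark involved). It is only an illustration of the row-multiplying behaviour on a trimmed copy of the example data, not how Spark implements it.

```python
# Plain-Python analogy of explode: one input row with an array of N elements
# becomes N output rows, with the other columns repeated.
rows = [
    ("harry", "london", [{"text": "happy"}, {"text": "sad"}, {"text": "mad"}]),
    ("mary", "manchester", [{"text": "sad"}, {"text": "low"}, {"text": "content"}]),
]

exploded = [
    (name, city, item)
    for name, city, items in rows
    for item in items
]

for row in exploded:
    print(row)
```

As in this sketch, a row whose array is empty contributes no output rows; Spark's explode behaves the same way, and explode_outer is the variant that keeps such rows with a null instead.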

Finally, let's create a column containing only the text, filter by city, and select the 3 columns you want:

# Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)

# Select result columns and filter city
result = df_exp.select("name", "city", "text").where("city = 'london'")

The result will be:

result.show(truncate=False)

+-----+------+--------+
|name |city  |text    |
+-----+------+--------+
|harry|london|happy   |
|harry|london|sad     |
|harry|london|mad     |
|sally|london|sad     |
|sally|london|mad     |
|sally|london|agitated|
|gary |london|excited |
|gary |london|down    |
|gary |london|agitated|
+-----+------+--------+
