How do I access a nested array inside a cell of a Parquet table using Spark and Python?



I want to extract the "text" values from the sentiment column of my table and filter the rows by city = london. My table looks like this:

<pre><code>name    city         sentiment
harry   london       "[
  Row(score='0.999926', sentiment=Row(score='-0.640237'), text='happy'),
  Row(score='0.609836', sentiment=Row(score='-0.607594'), text='sad'),
  Row(score='0.58564', sentiment=Row(score='-0.6833'), text='mad')
]"
sally   london       "[
  Row(score='0.999926', sentiment=Row(score='-0.640237'), text='sad'),
  Row(score='0.609836', sentiment=Row(score='-0.607594'), text='mad'),
  Row(score='0.58564', sentiment=Row(score='-0.6833'), text='agitated')
]"
gary    london       "[
  Row(score='0.999926', sentiment=Row(score='-0.640237'), text='excited'),
  Row(score='0.609836', sentiment=Row(score='-0.607594'), text='down'),
  Row(score='0.58564', sentiment=Row(score='-0.6833'), text='agitated')
]"
mary    manchester   "[
  Row(score='0.999926', sentiment=Row(score='-0.640237'), text='sad'),
  Row(score='0.609836', sentiment=Row(score='-0.607594'), text='low'),
  Row(score='0.58564', sentiment=Row(score='-0.6833'), text='content')
]"
gerry   manchester   "[
  Row(score='0.999926', sentiment=Row(score='-0.640237'), text='ecstatic'),
  Row(score='0.609836', sentiment=Row(score='-0.607594'), text='good'),
  Row(score='0.58564', sentiment=Row(score='-0.6833'), text='bad')
]"
</code></pre>

My code currently looks like this, but it does not work:

<pre><code>from pyspark.sql import functions as F
from pyspark.sql import types as T

data= spark.read.parquet("INSERT S3 TABLE").where("city LIKE 'london' AND sentiment['text=']")
df = sharethis.toPandas()
print (df)
</code></pre>

I would like the output to look like this:

<pre><code>name    city     sentiment
harry   london   happy
harry   london   sad
harry   london   mad
sally   london   sad
sally   london   mad
sally   london   agitated
gary    london   sad
gary    london   low
gary    london   content
</code></pre>

Does anyone know how I can access the array in the sentiment column to extract the text? Thanks in advance.


1 Answer

Anonymous, 1 day ago

Let's first create a DataFrame using the data from your example:

<pre class="lang-py prettyprint-override"><code>import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode_example').getOrCreate()

# One tuple per row: (name, city, list of sentiment elements)
data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema = ["name", "city", "sentiment"])
</code></pre>

This gives you the following DataFrame. Spark infers each Python dict as a map, so the array elements are displayed as key -> value pairs:

<pre><code>df.show(truncate=False)

+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|name |city      |sentiment                                                                                                                                                                                                    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|harry|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]]        |
|sally|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]     |
|gary |london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]]      |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]]    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
</code></pre>

Once you have the DataFrame, you need to explode the <code>sentiment</code> column so that each array element becomes its own row. By default, <code>explode</code> names its output column <code>col</code>:

<pre class="lang-py prettyprint-override"><code>from pyspark.sql.functions import explode

df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))
</code></pre>

The result is:

<pre><code>df_exp.show(truncate=False)

+-----+----------+---------------------------------------------------------------------+
|name |city      |col                                                                  |
+-----+----------+---------------------------------------------------------------------+
|harry|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy]   |
|harry|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad]     |
|harry|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]        |
|sally|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|sally|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad]     |
|sally|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|gary |london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down]    |
|gary |london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low]     |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]    |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good]    |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]        |
+-----+----------+---------------------------------------------------------------------+
</code></pre>

Finally, create a column that contains only the text, filter by city, and select the three columns you want:

<pre class="lang-py prettyprint-override"><code># Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)

# Select result columns and filter city
result = df_exp.select("name", "city", "text").where("city = 'london'")
</code></pre>

The result will be:

<pre><code>result.show(truncate=False)

+-----+------+--------+
|name |city  |text    |
+-----+------+--------+
|harry|london|happy   |
|harry|london|sad     |
|harry|london|mad     |
|sally|london|sad     |
|sally|london|mad     |
|sally|london|agitated|
|gary |london|excited |
|gary |london|down    |
|gary |london|agitated|
+-----+------+--------+
</code></pre>
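If it helps to see what the explode, extract, and filter steps do to the data, here is a minimal plain-Python sketch of the same logic. It does not use Spark at all. The function name `explode_text` and the shortened `rows` list are made up for illustration, and the comments only point to the role each line plays in the answer's pipeline.

```python
# Plain-Python model of explode + field access + filter (no Spark required).
# `rows` is a shortened, hypothetical copy of the example data: each
# "sentiment" cell holds a list of dict-like elements.
rows = [
    ("harry", "london", [
        {"score": "0.999926", "text": "happy"},
        {"score": "0.609836", "text": "sad"},
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "text": "sad"},
    ]),
]


def explode_text(rows, city):
    """Yield one (name, city, text) tuple per array element for the given city."""
    for name, row_city, sentiment in rows:
        if row_city != city:       # plays the role of .where("city = 'london'")
            continue
        for element in sentiment:  # plays the role of explode(df["sentiment"])
            # plays the role of df_exp["col"].text
            yield (name, row_city, element["text"])


result = list(explode_text(rows, "london"))
for row in result:
    print(row)
```

This sketch filters on the city before unnesting, whereas the answer filters afterwards. Both orders give the same rows, because the filter only looks at the `city` value, which explode copies unchanged onto every output row.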
