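<p>Spark has no such thing as a <code>TupleType</code>. Product types are represented as <code>structs</code> with fields of specific types. For example, to return an array of (string, integer) pairs, you can use a schema like this:</p>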
<pre><code>from pyspark.sql.types import *
schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)
]))
</code></pre>
<p>Example usage:</p>
<pre><code>from pyspark.sql.functions import udf
from collections import Counter
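# Wrap Counter.most_common so it returns an array of (char, count) structs.
char_count_udf = udf(
    lambda s: Counter(s).most_common(),
    schema
)

# Assumes an active SparkContext named sc (as in the pyspark shell).
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])
df.select("*", char_count_udf(df["value"])).show(2, False)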
## +---+-----+-------------------------+
## |id |value|PythonUDF#&lt;lambda&gt;(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
</code></pre>
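<p>To see what the UDF body hands back to Spark, it can help to run the same logic in plain Python, without a cluster. The sketch below is only an illustration: <code>char_counts</code> is a hypothetical helper name, not part of PySpark, and it assumes that the tuples are matched to the struct fields by position, so each pair is ordered (<code>char</code>, <code>count</code>) like the schema above.</p>

```python
from collections import Counter


def char_counts(s):
    """Return (char, count) pairs, most frequent first.

    Hypothetical helper: the same logic as the lambda passed to udf,
    runnable without Spark.
    """
    # Position 0 lines up with the "char" field, position 1 with "count".
    return [(ch, int(n)) for ch, n in Counter(s).most_common()]


pairs = char_counts("foo")

# Every element should be a (str, int) pair, matching the declared field types.
assert all(isinstance(ch, str) and isinstance(n, int) for ch, n in pairs)

print(pairs)  # [('o', 2), ('f', 1)]
```

<p>Both fields are declared non-nullable in the schema, so a function like this should never produce <code>None</code> in either position of a pair.</p>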