Programmatically selecting columns from a DataFrame with a UDF

Spark version 2.2.0, using Python version 2.7.16 (default, Mar 18 2019 18:38:44). SparkSession available as 'spark'.

jsonDF = spark.read.json("/tmp/people.json")
jsonDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

jsonDF.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

jsonCurDF = jsonDF.filter(jsonDF.age.isNotNull()).cache()

# Define the UDF
from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
    return s * s

# Selecting the columns from a list.
colSelList = ['age', 'name', squared_udf('age')]
jsonCurDF.select(colSelList).show()
+---+------+----------------+
|age|  name|squared_udf(age)|
+---+------+----------------+
| 30|  Andy|             900|
| 19|Justin|             361|
+---+------+----------------+

# If I use an external config file
colSelListStr = ["age", "name" , "squared_udf('age')"]
jsonCurDF.select(colSelListStr).show()

1 Answer

User

#1 · Posted on 2024-06-28 20:49:41

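This happens because, when you pass it in from a list, the squared-age entry is treated as a string rather than as a function. There is a roundabout way to do this, and you don't need to import a UDF for it. Suppose the list of columns you need to select includes a squared-age entry.

Passing that list directly to select raises an error, because the DataFrame does not contain a squared-age column.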
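To make the string-versus-function point concrete, here is a minimal pure-Python sketch (no Spark involved). It assumes a hypothetical `REGISTRY` of permitted function names and a small hypothetical `resolve` helper; plain dicts stand in for DataFrame rows. It is only an illustration of how a config string could be mapped back to a callable, not the answer's method.

```python
import re

# Plain-Python stand-in for the question's UDF (hypothetical, not a Spark UDF).
def squared_udf(s):
    return s * s

# Hypothetical registry: the only function names a config entry may refer to.
REGISTRY = {"squared_udf": squared_udf}

# Matches entries such as "squared_udf('age')" or 'squared_udf("age")'.
CALL_PATTERN = re.compile(r"""^(\w+)\(\s*['"](\w+)['"]\s*\)$""")

def resolve(entry):
    """Turn a config string into (output_name, function_or_None, source_column)."""
    match = CALL_PATTERN.match(entry)
    if match is None:
        return entry, None, entry  # a plain column name
    func_name, column = match.groups()
    return entry, REGISTRY[func_name], column

def select(rows, entries):
    """Apply the resolved entries to a list of dict rows."""
    resolved = [resolve(e) for e in entries]
    return [
        {name: (row[col] if func is None else func(row[col]))
         for name, func, col in resolved}
        for row in rows
    ]

rows = [{"age": 30, "name": "Andy"}, {"age": 19, "name": "Justin"}]
col_sel_list_str = ["age", "name", "squared_udf('age')"]

# The third entry is just text until something looks it up and calls it.
print(type(col_sel_list_str[2]))
result = select(rows, col_sel_list_str)
print(result)
```

Looking names up in a fixed registry, rather than calling `eval` on config text, keeps the set of callable functions explicit.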

So first, collect all the columns of the existing df into a list with

existing_cols = df.columns

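and keep the columns you need in a second list (the col_list used in the code that follows).

Now take the intersection of the two lists. It gives you the elements they have in common: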

intersection = list(set(existing_cols) & set(col_list))
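One detail worth knowing: a Python `set` does not keep element order, so the intersection above may come back in any order. That is harmless here because the later lambda reads fields by name, but if order matters, a list comprehension is a simple alternative. In this sketch `existing_cols` mirrors the question's schema, and the contents of `col_list` (in particular the name `squared_age`) are an assumption, since the original list is not shown.

```python
# Columns of the question's DataFrame, as printSchema reports them.
existing_cols = ["age", "name"]

# Assumed target list; "squared_age" is a hypothetical name for the derived column.
col_list = ["age", "name", "squared_age"]

# Keep only the requested columns that already exist, in the order requested.
existing = set(existing_cols)
intersection = [c for c in col_list if c in existing]

print(intersection)
```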

Now try this:

newDF= df.select(intersection).rdd.map(lambda x: (x["age"], x["name"], x["age"]*x["age"])).toDF(col_list)
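As a rough check of what the map step computes, the same lambda can be applied outside Spark. In this sketch plain dicts stand in for the two Rows that remain after the question's `isNotNull` filter; this is an assumption about the input, and it shows only the mapped tuples, not actual Spark output.

```python
# Plain dicts standing in for the Rows of the filtered DataFrame (assumed input).
rows = [{"age": 30, "name": "Andy"}, {"age": 19, "name": "Justin"}]

# Same lambda as in the rdd.map call: age, name, and age squared.
to_tuple = lambda x: (x["age"], x["name"], x["age"] * x["age"])

mapped = [to_tuple(r) for r in rows]
for record in mapped:
    print(record)
```

Note that a row whose age is null would make `x["age"] * x["age"]` fail with a `TypeError` in Python, so the null filter from the question needs to run before this step.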

This gives you a new DataFrame whose columns are named after the entries of col_list.

Hope this helps.
