Function that operates on a Spark Column passed as an argument

Published 2024-09-30 14:22:50


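Edit: I finally figured it out myself. I had been calling select() on column inside the function, which is why it did not work. I have added my solution as comments in the original question, in case it is useful to others.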

I am taking an online course, and I am supposed to write the following function:

# TODO: Replace <FILL IN> with appropriate code

# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')

    return <FILL IN>

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))

I have already written code that gives the desired output when operating on the DataFrame itself:

^{pr2}$

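I just don't know how to implement this code in my function, because it does not operate on a DataFrame but only on the given column. I have tried different approaches; one was to use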

[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]

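inside the function, but it does not work: TypeError: Column is not iterable. Other approaches that tried to operate on column directly inside the function always resulted in TypeError: 'Column' object is not callable.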

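I started with (Py)Spark only a few days ago, and I still have conceptual problems with how to work with just rows and columns. I would really appreciate any help with the current problem.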
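
For anyone who wants to check the cleanup logic without a Spark session, the three steps in the commented-out solution inside removePunctuation (lower, regexp_replace, trim) can be mirrored in plain Python. This is only a rough sketch: `remove_punctuation` is a hypothetical helper that is not part of the course material, it works on a single string instead of a Column, and `str.strip()` removes all surrounding whitespace, which is slightly broader than Spark's `trim`.

```python
import re


def remove_punctuation(sentence: str) -> str:
    """Plain-Python analogue of the PySpark solution (hypothetical helper)."""
    lowered = sentence.lower()                     # mirrors lower(column)
    cleaned = re.sub(r'[^a-z\d\s]+', '', lowered)  # mirrors regexp_replace(column, r'([^a-z\d\s])+', r'')
    return cleaned.strip()                         # roughly mirrors trim(column)


# Same sample sentences as in sentenceDF above.
samples = ['Hi, you!',
           ' No under_score!',
           ' *      Remove punctuation then spaces  * ']
results = [remove_punctuation(s) for s in samples]

for original, cleaned in zip(samples, results):
    print(repr(original), '->', repr(cleaned))
```

The point of the comparison is that each step takes a value and returns a new value of the same kind. In PySpark the same holds for Column expressions: lower, regexp_replace and trim each take a Column and return a Column, so the function never needs select() or createDataFrame() internally.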

