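Edit: I finally figured it out myself. Inside the function I kept calling select() on the column, which is why it didn't work. I have added my solution as a comment in the original question in case it is useful to others.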
I am taking an online course and am supposed to write the following function:
# TODO: Replace <FILL IN> with appropriate code
# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.
    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.
    Args:
        column (Column): A Column containing a sentence.
    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')
    return <FILL IN>
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
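As a quick sanity check outside Spark, the three steps from the docstring (lower-case, drop punctuation, then trim) can be tried on plain Python strings. This is only a sketch: `remove_punctuation_plain` is a hypothetical helper, and it uses Python's `re` module rather than Spark's regex engine, so it is only meant to agree with Spark on simple ASCII inputs like the ones above.

```python
import re

def remove_punctuation_plain(sentence: str) -> str:
    """Plain-string stand-in for the Column version (hypothetical helper)."""
    lowered = sentence.lower()
    # Drop every run of characters that is not a lowercase letter, digit, or whitespace.
    cleaned = re.sub(r'[^a-z\d\s]+', '', lowered)
    # Strip leading and trailing whitespace only after punctuation is gone.
    return cleaned.strip()

samples = ['Hi, you!', ' No under_score!', ' * Remove punctuation then spaces * ']
results = [remove_punctuation_plain(s) for s in samples]
print(results)
```

The order matters: trimming before the regex would leave the spaces that surrounded the asterisks in the third sample.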
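I have already written some code that produces the desired output when operating on the DataFrame itself:
I just don't know how to implement this code in my function, since it does not operate on the DataFrame but only on the given column. I have tried different approaches. One was to use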
[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]
inside the function, but it does not work: TypeError: Column is not iterable. Other approaches that operate on the column directly inside the function always result in TypeError: 'Column' object is not callable.
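I started with (Py)Spark only a few days ago and still have conceptual problems with how to work with just rows and columns. I would really appreciate any help with the current problem.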
You can do it in one line.