My task is to build a function removePunctuation that removes punctuation so that it passes this test:
# TEST Capitalization and punctuation (4b)
testPunctDF = sqlContext.createDataFrame([(" The Elephant's 4 cats. ",)])
testPunctDF.show()
Test.assertEquals(testPunctDF.select(removePunctuation(col('_1'))).first()[0],
'the elephants 4 cats',
'incorrect definition for removePunctuation function')
This is what I managed to write:
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    return lower(trim(regexp_replace("column_name", "[\W_]+"," "))).alias("sentence")
But I still cannot get regexp_replace to work together with the alias "sentence". I get this error:
AnalysisException: u"cannot resolve 'sentence' given input columns: [_1];"
I would try:
It uses C under the hood, which is the best option in terms of efficiency.
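Your attempt does not seem to use the argument column anywhere, which probably explains the error. Surprisingly, I could only pass a Column object to regexp_replace(), not a column name.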
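Putting the points above together, a corrected function might look like the sketch below. It is not the code from the original answer. It assumes that only ASCII letters, digits, and spaces should be kept, as the docstring in the question states, and that removed characters are replaced with an empty string so that "Elephant's" becomes "elephants". The names PUNCT_PATTERN and clean_text are hypothetical: clean_text is a pure-Python helper added only so the pattern can be checked locally without a Spark session.

```python
import re

# Assumption: keep only ASCII letters, digits, and spaces (per the docstring).
PUNCT_PATTERN = r"[^a-zA-Z0-9 ]"


def clean_text(text):
    """Pure-Python reference for the same steps (hypothetical helper)."""
    without_punct = re.sub(PUNCT_PATTERN, "", text)  # drop punctuation
    return without_punct.strip(" ").lower()          # trim spaces, lower-case


def removePunctuation(column):
    """Return a Column named 'sentence' with the clean-up applied.

    The `column` argument itself is passed on, not a hard-coded name.
    """
    # Imported here so clean_text can be tried without Spark installed.
    from pyspark.sql.functions import lower, regexp_replace, trim

    return lower(trim(regexp_replace(column, PUNCT_PATTERN, ""))).alias("sentence")


print(clean_text(" The Elephant's 4 cats. "))
```

With the DataFrame from the question, testPunctDF.select(removePunctuation(col('_1'))) should then produce a column named sentence, provided col is imported from pyspark.sql.functions.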