Here is the method I use to clean the text files:
# reference: https://github.com/GongtingPeng/Spark
import re
import string

# remove punctuation, digits, and \r \t \n control characters
def remove_punct(text):
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub("", text)
    return nopunct
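For reference, this function deletes the matched characters but keeps the spaces around them, so punctuation that stands alone leaves runs of spaces behind. A quick standalone check (plain Python, no Spark needed):

```python
import re
import string

# The same character class as remove_punct above.
regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

# "- 5" disappears but its surrounding spaces remain, leaving three in a row.
print(repr(regex.sub("", "Great food - 5 stars!")))  # 'Great food   stars'
```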
# binarize rating: 4 stars and above -> 1, otherwise -> 0
def convert_rating(rating):
    rating = int(rating)
    if rating >= 4:
        return 1
    else:
        return 0
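A quick sanity check of the binarization, runnable in plain Python:

```python
def convert_rating(rating):
    # Same logic as above: 4-5 stars -> positive (1), 1-3 stars -> negative (0).
    rating = int(rating)
    return 1 if rating >= 4 else 0

print([convert_rating(s) for s in ['5', '4', '3', '1']])  # [1, 1, 0, 0]
```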
# wrap the cleaning functions as UDFs
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

punct_remover = udf(lambda x: remove_punct(x))
rating_convert = udf(lambda x: convert_rating(x))

# apply to the raw review data
review_df = review.select('review_id', punct_remover('text'), rating_convert('stars'))
review_df = review_df.withColumnRenamed('<lambda>(text)', 'text')\
    .withColumn('label', review_df['<lambda>(stars)'].cast(IntegerType()))\
    .drop('<lambda>(stars)')\
    .limit(1000000)
review_df.show(5)
And this is how I remove the stopwords:
from pyspark.ml.feature import Tokenizer, StopWordsRemover

tok = Tokenizer(inputCol="text", outputCol="words")
review_tokenized = tok.transform(review_df)
# remove stop words
stopword_rm = StopWordsRemover(inputCol='words', outputCol='words_nsw')
review_tokenized = stopword_rm.transform(review_tokenized)
review_tokenized.show(5)
But after exploding the words, I still get value counts for empty strings:
from pyspark.sql.functions import explode

dfwords_exploded = review_tokenized.withColumn('words', explode('words_nsw'))
dfwords_exploded.show(50)
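If I understand Tokenizer correctly, it lowercases the text and then splits on single whitespace characters, so every extra space left behind by the punctuation removal becomes an empty token. The same split reproduced in plain Python (my assumption about the cause, not code from the job itself):

```python
import re

cleaned = "great food   stars"  # three spaces where "- 5" used to be
tokens = re.split(r"\s", cleaned)  # mimics a split on single whitespace chars
print(tokens)  # ['great', 'food', '', '', 'stars']
```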
当我返回单词的计数时,空格是最高计数,我想删除它,所以我只计算实际单词:
我想问题出在我清理文本文件的初始代码的正则表达式中,但我不确定在哪里,这需要花费相当长的时间来运行,因此任何帮助都将不胜感激
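One possible fix, sketched in plain Python rather than the original Spark job: substitute a space for each removed character and then collapse runs of whitespace, so a whitespace split never produces empty tokens. (Alternatively, `RegexTokenizer` from `pyspark.ml.feature` with `pattern='\\s+'` splits on whitespace runs and filters zero-length tokens by default.)

```python
import re
import string

_punct = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

def remove_punct(text):
    # Replace unwanted characters with a space instead of deleting them,
    # then collapse whitespace runs and trim, so no empty tokens survive a split.
    text = _punct.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_punct("Great food - 5 stars!"))  # Great food stars
```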