Here is the method I use to clean the text files:
# reference: https://github.com/GongtingPeng/Spark
import re
import string

# remove punctuation, digits, and \r \t \n control characters
def remove_punct(text):
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub("", text)
    return nopunct
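For reference, this function deletes the matched characters but keeps the spaces around them, so punctuation that stands alone leaves runs of spaces behind. A quick standalone check (plain Python, no Spark needed):

```python
import re
import string

# The same character class as remove_punct above.
regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

# "- 5" disappears but its surrounding spaces remain, leaving three in a row.
print(repr(regex.sub("", "Great food - 5 stars!")))  # 'Great food   stars'
```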
# binarize rating: 4 stars and above -> 1, otherwise -> 0
def convert_rating(rating):
    rating = int(rating)
    if rating >= 4:
        return 1
    else:
        return 0
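A quick sanity check of the binarization, runnable in plain Python:

```python
def convert_rating(rating):
    # Same logic as above: 4-5 stars -> positive (1), 1-3 stars -> negative (0).
    rating = int(rating)
    return 1 if rating >= 4 else 0

print([convert_rating(s) for s in ['5', '4', '3', '1']])  # [1, 1, 0, 0]
```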
# wrap the cleaning functions as UDFs
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

punct_remover = udf(lambda x: remove_punct(x))
rating_convert = udf(lambda x: convert_rating(x))

# apply to the raw review data
review_df = review.select('review_id', punct_remover('text'), rating_convert('stars'))
review_df = review_df.withColumnRenamed('<lambda>(text)', 'text')\
    .withColumn('label', review_df['<lambda>(stars)'].cast(IntegerType()))\
    .drop('<lambda>(stars)')\
    .limit(1000000)
review_df.show(5)
And this is how I remove the stopwords:
from pyspark.ml.feature import Tokenizer, StopWordsRemover

tok = Tokenizer(inputCol="text", outputCol="words")
review_tokenized = tok.transform(review_df)
# remove stop words
stopword_rm = StopWordsRemover(inputCol='words', outputCol='words_nsw')
review_tokenized = stopword_rm.transform(review_tokenized)
review_tokenized.show(5)
But after exploding the words, I still get value counts for empty strings:
from pyspark.sql.functions import explode

dfwords_exploded = review_tokenized.withColumn('words', explode('words_nsw'))
dfwords_exploded.show(50)
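If I understand Tokenizer correctly, it lowercases the text and then splits on single whitespace characters, so every extra space left behind by the punctuation removal becomes an empty token. The same split reproduced in plain Python (my assumption about the cause, not code from the job itself):

```python
import re

cleaned = "great food   stars"  # three spaces where "- 5" used to be
tokens = re.split(r"\s", cleaned)  # mimics a split on single whitespace chars
print(tokens)  # ['great', 'food', '', '', 'stars']
```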
当我返回单词的计数时,空格是最高计数,我想删除它,所以我只计算实际单词:
我想问题出在我清理文本文件的初始代码的正则表达式中,但我不确定在哪里,这需要花费相当长的时间来运行,因此任何帮助都将不胜感激
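One possible fix, sketched in plain Python rather than the original Spark job: substitute a space for each removed character and then collapse runs of whitespace, so a whitespace split never produces empty tokens. (Alternatively, `RegexTokenizer` from `pyspark.ml.feature` with `pattern='\\s+'` splits on whitespace runs and filters zero-length tokens by default.)

```python
import re
import string

_punct = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

def remove_punct(text):
    # Replace unwanted characters with a space instead of deleting them,
    # then collapse whitespace runs and trim, so no empty tokens survive a split.
    text = _punct.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_punct("Great food - 5 stars!"))  # Great food stars
```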