Sentiment analysis with a dictionary in PySpark

+---+--------------------+--------------------+----+-----+--------+---------+--------+
| Id|        CreationDate|                Body|Year|Month|Day_of_Y|Week_of_Y|Year_adj|
+---+--------------------+--------------------+----+-----+--------+---------+--------+
|  1|2011-08-30 21:12:...|What open source ...|2011|    8|     242|       35|    2011|
|  2|2011-08-30 21:14:...|GPU mining is the...|2011|    8|     242|       35|    2011|
|  8|2011-08-30 21:18:...|I would like to d...|2011|    8|     242|       35|    2011|
|  9|2011-08-30 21:18:...|I didn't get it. ...|2011|    8|     242|       35|    2011|
| 10|2011-08-30 21:19:...|Poclbm: An open s...|2011|    8|     242|       35|    2011|
+---+--------------------+--------------------+----+-----+--------+---------+--------+

+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+
|     Word|Negative|Positive|Uncertainty|Litigious|Constraining|Superfluous|Interesting|Modal|
+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+
| aardvark|       0|       0|          0|        0|           0|          0|          0|    0|
| abalones|       0|       0|          0|        0|           0|          0|          0|    0|
|  abandon|    2009|       0|          0|        0|           0|          0|          0|    0|
+---------+--------+--------+-----------+---------+------------+-----------+-----------+-----+

1 answer

User

#1 · Posted 2024-09-29 02:25:21

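One approach (I'm not sure it is the most efficient) is to build an actual Python dictionary from your sentiment lexicon and apply it inside a user-defined function (UDF). Since your lexicon has only about 80,000 rows, this should be feasible. You can speed things up further by removing the neutral words first.
An outline of the code: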

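from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# Drop neutral words: keep only rows flagged as negative or positive
filtered_sentiment_df = sentiment_df.filter((f.col("Negative") > 0) | (f.col("Positive") > 0))
# Map each word to -1 (negative) or 1 (positive); assumes no word is flagged as both
sentiments = filtered_sentiment_df.select(
    f.lower(f.col("Word")).alias("word"),
    f.when(f.col("Negative") > 0, -1).otherwise(1).alias("sentiment"),
)

# Collect the (small) lexicon to the driver as a plain Python dict for use in the UDF
sentiment_dict = {row["word"]: row["sentiment"] for row in sentiments.collect()}

def calculate_sentiment_score(sentence, sentiment_dict):
    # Null bodies score 0; words missing from the dict count as neutral
    if sentence is None:
        return 0
    # Lowercase so capitalized words in the text still match the lexicon keys
    return sum(sentiment_dict.get(w, 0) for w in sentence.lower().split())

# Declare the return type explicitly; f.udf defaults to StringType otherwise
sentiment_udf = f.udf(lambda x: calculate_sentiment_score(x, sentiment_dict), IntegerType())
bodies_df = bodies_df.withColumn("total_sentiment", sentiment_udf(f.col("Body")))
bodies_df.show()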
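
The scoring logic can also be checked locally without a Spark session. The following is only a minimal sketch: `toy_lexicon` and `score_sentence` are hypothetical names made up for illustration, and the regex tokenizer is an assumption that goes beyond the whitespace split used in the answer. It strips punctuation so that a token such as "loss," still matches the lexicon entry "loss".

```python
import re

# Hypothetical toy lexicon for illustration only: -1 = negative, 1 = positive
toy_lexicon = {"abandon": -1, "loss": -1, "gain": 1}

def score_sentence(sentence, lexicon):
    """Sum lexicon scores over lowercase word tokens; unknown words count as 0."""
    if sentence is None:
        return 0
    # Assumed tokenizer: runs of letters and apostrophes, punctuation dropped
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

# One positive word (gain) and two negative words (loss, abandon)
print(score_sentence("Gain today, loss tomorrow. Abandon ship!", toy_lexicon))
```

If this tokenization suits the data, the same function body could be wrapped in the UDF in place of the plain split, at the cost of a regex call per row.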
