How to use PySpark's FP-growth with an RDD?

Posted 2024-10-06 12:39:14


I want to run FP-growth to find out whether the RDD below yields any interesting association rules. Starting from the documentation, I tried the following:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

spark_df = sqlContext.createDataFrame(pandas_df[['Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address']])

spark_df.show(2)

+--------------+--------------------+---------+----------+--------------+------------------+
|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|           Address|
+--------------+--------------------+---------+----------+--------------+------------------+
|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|OAK ST / LAGUNA ST|
|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|OAK ST / LAGUNA ST|
+--------------+--------------------+---------+----------+--------------+------------------+
only showing top 2 rows

from pyspark.mllib.fpm import FPGrowth

model = FPGrowth.train(spark_df.rdd, minSupport=0.2, numPartitions=10)

result = model.freqItemsets().collect()

for fi in result:

    print(fi)

However, I get an exception:

(traceback omitted)

So what is the correct way to use the FP-Growth implementation?


1 Answer
网友
#1 · Posted 2024-10-06 12:39:14

This line is the problem: transactions = spark_df.map(lambda line: line.strip().split(' ')). The rows of spark_df are Row objects, not strings, so there is nothing to strip() or split(). Drop that line and try:

>>> FPGrowth.train(
... spark_df.rdd.map(lambda x: list(set(x))),
... minSupport=0.2, numPartitions=10)
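
For reference, here is a minimal end-to-end sketch of the whole pipeline, assuming an existing SparkContext sc and the pandas DataFrame pandas_df from the question:

from pyspark.sql import SQLContext
from pyspark.mllib.fpm import FPGrowth

sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(
    pandas_df[['Category', 'Descript', 'DayOfWeek',
               'PdDistrict', 'Resolution', 'Address']])

# Each Row behaves like a tuple; list(set(row)) turns it into a plain
# list of unique items, which is the transaction format FPGrowth.train expects.
transactions = spark_df.rdd.map(lambda row: list(set(row)))

model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
for fi in model.freqItemsets().collect():
    print(fi)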

That should give you a working result.
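
Note that the set() call is not just cosmetic: Spark's FP-growth rejects transactions that contain duplicate items, and a duplicate appears whenever two columns of the same row happen to hold the same value. Mapping each Row through list(set(row)) removes duplicates and converts the Row into a plain Python list at the same time.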
