I want to use FP-growth to find out whether there are any association rules in the RDD below. Starting from the documentation, I tried the following:
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(pandas_df[['Category','Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address']])
spark_df.show(2)
+--------------+--------------------+---------+----------+--------------+------------------+
| Category| Descript|DayOfWeek|PdDistrict| Resolution| Address|
+--------------+--------------------+---------+----------+--------------+------------------+
| WARRANTS| WARRANT ARREST|Wednesday| NORTHERN|ARREST, BOOKED|OAK ST / LAGUNA ST|
|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday| NORTHERN|ARREST, BOOKED|OAK ST / LAGUNA ST|
+--------------+--------------------+---------+----------+--------------+------------------+
only showing top 2 rows
from pyspark.mllib.fpm import FPGrowth
model = FPGrowth.train(spark_df.rdd, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)
However, I get an exception:

…

So what is the correct way to use the FP-Growth implementation?
This line is the problem:

transactions = spark_df.map(lambda line: line.strip().split(' '))

Drop that line and try again; that should solve it.
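For context, `FPGrowth.train` expects an RDD of transactions, where each transaction is a list of distinct items, while `spark_df.rdd` yields `Row` objects. A minimal sketch of the per-row conversion, with plain Python standing in for the Spark `map` call (`row_to_transaction` is a hypothetical helper name, not from the original post):

```python
def row_to_transaction(row):
    """Convert one record (a Row or tuple of column values) into the
    list-of-distinct-items format FPGrowth.train expects."""
    # dict.fromkeys deduplicates while preserving order; MLlib's
    # FPGrowth rejects transactions that contain duplicate items.
    return list(dict.fromkeys(str(v) for v in row))

# In Spark this would be applied per record, roughly:
#   transactions = spark_df.rdd.map(row_to_transaction)
#   model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)

row = ("WARRANTS", "WARRANT ARREST", "Wednesday",
       "NORTHERN", "ARREST, BOOKED", "OAK ST / LAGUNA ST")
print(row_to_transaction(row))
```

This keeps each column value as one item, so the mined itemsets are combinations of column values rather than space-split word fragments.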