I have successfully used the Apriori algorithm in Python, as shown below:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
df = pd.read_csv('C:\\Users\\marka\\Downloads\\Assig5.csv')
df = apriori(df, min_support=0.79, use_colnames=True)
rules = association_rules(df, metric="lift", min_threshold=1)
rules[ (rules['lift'] >= 1) &
       (rules['confidence'] >= 1) ]

I want to use the FPGrowth algorithm to see whether I get the same results, but I think I am using it incorrectly, because I am not getting similar output. The Spark documentation (https://spark.apache.org/docs/1.6.0/mllib-frequent-pattern-mining.html) says:
from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

So, in turn, my code is:
from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig6.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

But instead of the real answer I get the following. What am I doing wrong?
FreqItemset(items=['1\t1\t1\t1\t1\t1\t1\t0\t0\t0\t0\t1\t1\t0\t0\t1\t1\t1\t1\t1\t0\t0'], freq=24)

To make Assig6, I simply re-saved the original csv as a txt file.

I started changing my format and updated the code per user10136092, but I still got unwanted output. Below are my code, its output, and an example picture of the new input.
from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig2.txt")
data.map(lambda line: line.strip().split())
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

Output:
FreqItemset(items=['Rock_salt\tFlashlights\tWater\tSnow_shovels\tBlankets\tCanned_food'], freq=34)

Posted on 2018-07-26 07:05:21
Your data is not valid input for Spark's FPGrowth algorithm.
In Spark, each basket should be represented as a list of unique labels, for example:

baskets = sc.parallelize([("Rock Salt", "Blankets"), ("Blankets", "Dry Fruits", "Canned Food")])

not a binary matrix like the one used by the other libraries. Please convert your data to this format first.
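A minimal sketch of that conversion, assuming a small hypothetical one-hot DataFrame (the item columns here are invented for illustration) in the same binary-matrix shape that mlxtend's apriori consumes:

```python
import pandas as pd

# Hypothetical one-hot matrix: one row per basket, one binary column per item.
df = pd.DataFrame(
    {"Rock_salt": [1, 0, 1], "Blankets": [1, 1, 0], "Water": [0, 1, 1]}
)

# Convert each row into the list of item labels present in that basket --
# the list-of-labels format Spark's FPGrowth expects.
baskets = df.apply(
    lambda row: [col for col in df.columns if row[col] == 1], axis=1
).tolist()

print(baskets)
# [['Rock_salt', 'Blankets'], ['Blankets', 'Water'], ['Rock_salt', 'Water']]
```

The resulting list of lists could then be handed to `sc.parallelize(...)` and passed to `FPGrowth.train`.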
Also, your data is tab-separated, not space-separated, so even if the input were otherwise correct, you should split it like this:

data.map(lambda line: line.strip().split())

Posted on 2018-07-26 20:29:47
I think the file is tab-delimited, so you should split it on '\t' instead of ' ':
transactions = data.map(lambda line: line.strip().split('\t'))
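A quick sanity check of the split logic itself, in plain Python with no Spark cluster needed, using a made-up tab-separated line shaped like the output above:

```python
line = "Rock_salt\tFlashlights\tWater\n"

# Splitting on a single space leaves the whole line as one item -- the
# symptom visible in the FreqItemset output above.
print(line.strip().split(' '))   # ['Rock_salt\tFlashlights\tWater']

# Splitting on '\t' (or calling split() with no argument, which splits on
# any whitespace) yields one label per item.
print(line.strip().split('\t'))  # ['Rock_salt', 'Flashlights', 'Water']
```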
https://stackoverflow.com/questions/51528769