I know machine learning well, but I'm new to Scala and Spark. I'm stuck on the Spark API, so please advise.
I have a txt file where each line has the format
# label \t # query, a string of words, delimited by spaces
1 wireless amazon kindle
2 apple iPhone 5
1 kindle fire 8G
2 apple iPad

The first field is the label and the second field is a string. My plan is to split the data into labels and features, use the built-in Word2Vec to convert each string into a vector (I assume it first uses bag-of-words to build a dictionary), and then classify with SVMWithSGD.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.regression.LabeledPoint

object QueryClassification {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Query Classification").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile("spark_data.txt")
    val word2vec = new Word2Vec()
    val parsedData = input.map { line =>
      val parts = line.split("\t")
      // How do I write the code here? I need to parse parts(1) into a
      // feature vector properly and then apply word2vec after the map
      LabeledPoint(parts(0).toDouble, ????)
    }
    // * is the item I got from parsing parts(1) above
    word2vec.fit(*)
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
  }
}

Many thanks!
Posted on 2015-03-07 16:14:59
If you use the word2vec algorithm, you should train word2vec on the words of your strings.
val vocabulary = input.map { line =>
  val parts = line.split("\t")
  val partWords = parts(1).split(" ")
  partWords.toSeq
}
val word2vec = new Word2Vec()
val wordModel = word2vec.fit(vocabulary)

For a word in the vocabulary, you can get its word vector from wordModel.transform(word). Since SVM needs a LabeledPoint whose label takes one of two values (0 or 1), I don't know how to convert the label field into those two values.
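Two pieces still need to be filled in: (a) turning the per-word vectors from wordModel.transform(word) into one fixed-length feature vector per query (averaging them is a common choice), and (b) remapping the file's {1, 2} labels to the {0.0, 1.0} that SVMWithSGD expects. A minimal sketch of both helpers in plain Scala, assuming the helper names averageVectors and toBinaryLabel (my own, not Spark API); inside Spark you would call them from the map and wrap the result in a LabeledPoint:

```scala
object FeatureHelpers {
  // Average a query's word vectors into one fixed-length feature vector.
  // Each element of `vectors` is the Word2Vec vector of one word in the query.
  def averageVectors(vectors: Seq[Array[Double]]): Array[Double] = {
    require(vectors.nonEmpty, "query must contain at least one known word")
    val dim = vectors.head.length
    val sum = vectors.foldLeft(new Array[Double](dim)) { (acc, v) =>
      acc.indices.foreach(i => acc(i) += v(i))
      acc
    }
    sum.map(_ / vectors.length)
  }

  // SVMWithSGD expects labels 0.0 or 1.0, so remap the file's {1, 2}.
  def toBinaryLabel(raw: Int): Double = if (raw == 1) 0.0 else 1.0

  def main(args: Array[String]): Unit = {
    val wordVecs = Seq(Array(1.0, 3.0), Array(3.0, 5.0))
    println(averageVectors(wordVecs).mkString(","))  // 2.0,4.0
    println(toBinaryLabel(2))                        // 1.0
  }
}
```

With these in place, the map in the question becomes roughly: split the line, look up each word of parts(1) in the trained model, average the vectors, and build LabeledPoint(toBinaryLabel(parts(0).toInt), Vectors.dense(avg)). Note this requires fitting the Word2Vec model before the map that builds LabeledPoints, since the lookup needs the trained model.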
https://stackoverflow.com/questions/27370170