I have been trying out the code for doing inference from a trained Labeled LDA model and pLDA using the TMT toolbox (Stanford NLP Group). I have gone through the examples provided at the following links: http://nlp.stanford.edu/software/tmt/tmt-0.3/ http://nlp.stanford.edu/software/tmt/tmt-0.4/
Below is the code I tried for Labeled LDA inference:
val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7");
val model = LoadCVB0LabeledLDA(modelPath);
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val text = {
  source ~>                           // read from the source file
  Column(4) ~>                        // select the column containing the text
  TokenizeWith(model.tokenizer.get)   // tokenize with the model's tokenizer
}
val labels = {
  source ~>                           // read from the source file
  Column(2) ~>                        // take column two, the year
  TokenizeWith(WhitespaceTokenizer())
}
val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv",""));
val dataset = LabeledLDADataset(text, labels, model.termIndex, model.topicIndex);
val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset);
val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
TSVFile(outputPath+"-word-topic-distributions.tsv").write({
  for ((terms,(dId,dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield {
    require(terms.id == dId);
    (terms.id,
      for ((term,dist) <- (terms.value zip dists)) yield {
        term + " " + dist.activeIterator.map({
          case (topic,prob) => model.topicIndex.get.get(topic) + ":" + prob
        }).mkString(" ");
      });
  }
});

The error:
found   : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);
I understand this is a type-mismatch error, but I don't know how to resolve it in Scala. Basically, I don't understand how, from the output of the infer command, I am supposed to extract 1. the per-document topic distributions and 2. the per-document label distributions.
Please help. The same goes for pLDA: I get as far as the inference command and then have no clue what to do with its result.
Posted 2012-08-03 17:57:20
Scala's type system is much more complex than Java's, and understanding it will make you a better programmer. The problem lies here:
val perDocTermTopicDistributions = EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

because perDocTopicDistributions has the type:

scalanlp.collection.LazyIterable[(String, Array[Double])]

while EstimateLabeledLDAPerWordTopicDistributions.apply expects an

Iterable[(String, scalala.collection.sparse.SparseArray[Double])]

The best way to investigate this kind of type error is to look at the ScalaDoc (for tmt it is here: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package ), and if you cannot easily find where the problem is, you should specify the types of your variables explicitly, like this:
val perDocTopicDistributions: LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)

If we look together at the javadoc of edu.stanford.nlp.tmt.stage:
def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]

def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]

It should now be clear that the return value of InferCVB0LabeledLDADocumentTopicDistributions cannot be fed directly to EstimateLabeledLDAPerWordTopicDistributions.
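The shape of the mismatch can be reproduced in miniature with standard collections only (a toy sketch, not tmt code: infer, estimate, and the use of Map as a sparse stand-in are all made-up names):

```scala
object ShapeSketch {
  // Hypothetical stand-in for the inference stage: lazy rows of dense values.
  def infer(): LazyList[(String, Array[Double])] =
    LazyList(("doc1", Array(0.7, 0.3)))

  // Hypothetical stand-in for the estimation stage: it demands a strict
  // Iterable whose elements carry a sparse representation, not a dense Array.
  def estimate(perDoc: Iterable[(String, Map[Int, Double])]): Int =
    perDoc.size

  def main(args: Array[String]): Unit = {
    val dists = infer()
    // estimate(dists)  // would not compile: Array[Double] is not Map[Int, Double]
    println(dists.head._1)
  }
}
```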
I have never used the Stanford NLP toolbox, but this is by design: you simply have to convert your scalanlp.collection.LazyIterable[(String, Array[Double])] into an Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.
If you look at the scaladoc for how to do this conversion, it is quite simple. In package.scala of the stage package I can read: import scalanlp.collection.LazyIterable;
So I know where to look, and indeed at http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable there is a toIterable method that turns a LazyIterable into an Iterable. You still have to convert the inner Array into a SparseArray, though.
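The lazy-to-strict half of that conversion can be sketched with the standard library alone (a LazyList of (docId, values) pairs stands in for tmt's LazyIterable here, which is an assumption of this sketch):

```scala
object LazyToStrictSketch {
  // Stand-in for a LazyIterable[(String, Array[Double])]: elements are not
  // materialized until demanded.
  val lazyDists: LazyList[(String, Array[Double])] =
    LazyList(("doc1", Array(0.7, 0.3)), ("doc2", Array(0.1, 0.9)))

  // Analogue of LazyIterable.toIterable: force the lazy sequence into a
  // strict Iterable that can be traversed repeatedly.
  def strictDists: Iterable[(String, Array[Double])] = lazyDists.toList

  def main(args: Array[String]): Unit =
    strictDists.foreach { case (id, vs) => println(id + " -> " + vs.mkString(",")) }
}
```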
Likewise, I looked at package.scala of the stage package in tmt and saw: import scalala.collection.sparse.SparseArray; then I looked up the scalala documentation:
http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray
The constructors looked complicated to me, so it sounded a lot like I would have to look for a factory method in the companion object. And indeed the method I was looking for is there; as usual in Scala, it is named apply:
def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]

Using it, you can write a function with the following signature:
def f: Array[Double] => SparseArray[Double]

Once that is done, you can convert the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy iterable of sparse arrays with one line of code:
result.toIterable.map { case (name, values) => (name, f(values)) }

https://stackoverflow.com/questions/11699404
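The whole conversion can be sketched end to end with standard collections only (a Map from index to value plays the role of scalala's SparseArray, and f and convert are names chosen for this sketch, not tmt API):

```scala
object SparseConvertSketch {
  // Stand-in for f: Array[Double] => SparseArray[Double]: keep only the
  // non-zero entries, keyed by their index.
  def f(values: Array[Double]): Map[Int, Double] =
    values.zipWithIndex.collect { case (v, i) if v != 0.0 => i -> v }.toMap

  // The answer's one-liner, with a LazyList standing in for the LazyIterable
  // returned by InferCVB0LabeledLDADocumentTopicDistributions.
  def convert(result: LazyList[(String, Array[Double])]): Iterable[(String, Map[Int, Double])] =
    result.toList.map { case (name, values) => (name, f(values)) }

  def main(args: Array[String]): Unit = {
    val result = LazyList(("doc1", Array(0.0, 0.7, 0.3)), ("doc2", Array(1.0, 0.0, 0.0)))
    convert(result).foreach { case (name, sparse) => println(name + " -> " + sparse) }
  }
}
```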