Answering my own question: this is not possible as of now.
Spark actually has two implementations of LDA model training, and one of them is <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.OnlineLDAOptimizer" rel="nofollow noreferrer">OnlineLDAOptimizer</a>. This approach is specifically designed to update the model incrementally with mini-batches of documents.
<blockquote>
The Optimizer implements the Online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration, and updates the term-topic distribution adaptively.
Original Online LDA paper: <a href="https://mimno.infosci.cornell.edu/info6150/readings/HoffmanBleiBach2010b.pdf" rel="nofollow noreferrer">Hoffman, Blei and Bach, &quot;Online Learning for Latent Dirichlet Allocation.&quot; NIPS, 2010</a>.
</blockquote>
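To make the quoted description a bit more concrete, here is a minimal pure-Python sketch of the blending step at the heart of online variational Bayes, as described in the Hoffman et al. paper: the topic-term parameters are moved toward a mini-batch estimate with a decaying step size `rho_t = (tau0 + t) ** -kappa`. This is not Spark's code. The function names are made up for illustration, the default `tau0` and `kappa` values are assumptions, and the per-document E-step that would produce the mini-batch estimate `lam_hat` is left out entirely.

```python
def learning_rate(t, tau0=1024.0, kappa=0.51):
    """Step size rho_t = (tau0 + t) ** -kappa; defaults are illustrative assumptions."""
    return (tau0 + t) ** (-kappa)


def blend_topics(lam, lam_hat, rho):
    """Move the topic-term parameters toward the mini-batch estimate.

    lam     -- current parameters, one row per topic, one column per term
    lam_hat -- estimate computed from the current mini-batch alone (hypothetical input)
    rho     -- step size in (0, 1]; a larger value trusts the new batch more
    """
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(lam, lam_hat)
    ]


if __name__ == "__main__":
    lam = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # 2 topics, 3 terms
    # Stand-ins for the estimates an E-step would return for each mini-batch.
    minibatch_estimates = [
        [[4.0, 1.0, 1.0], [1.0, 1.0, 4.0]],
        [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0]],
    ]
    for t, lam_hat in enumerate(minibatch_estimates):
        lam = blend_topics(lam, lam_hat, learning_rate(t))
    print(lam)
```

Because the step size shrinks as `t` grows, later mini-batches change the topics less than early ones, which is what lets the estimate settle while still only touching a subset of the corpus per iteration.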
Unfortunately, the current mllib API does not allow loading a previously trained LDA model and adding a batch to it.
Some mllib models support an <code>initialModel</code> as a starting point for incremental updates (see <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans" rel="nofollow noreferrer">KMeans</a> or <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture" rel="nofollow noreferrer">GMM</a>), but LDA does not currently support that. I filed a JIRA for it: <a href="https://issues.apache.org/jira/browse/SPARK-20082" rel="nofollow noreferrer">SPARK-20082</a>. Please upvote ;-)
For the record, there is also a JIRA for streaming LDA: <a href="https://issues.apache.org/jira/browse/SPARK-8696" rel="nofollow noreferrer">SPARK-8696</a>.

I don't think such a thing exists. LDA is a probabilistic parameter estimation algorithm (there is a very simplified explanation of the process here: <a href="https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation" rel="nofollow noreferrer">LDA explained</a>), and adding one or a few documents would change all previously computed probabilities, so you would effectively have to recompute the model.

I don't know your use case, but if your model converges in a reasonable time, you could consider updating it batch by batch, discarding some of the oldest documents at each re-computation to make the estimation faster.
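A rough sketch of that batch-with-discard idea, in plain Python: keep only the most recent documents in a bounded window and retrain from scratch on the window whenever a new batch arrives. The class name, `max_docs`, and the `train_fn` callback are all hypothetical; `train_fn` stands in for whatever full LDA training call you use and is not a Spark API.

```python
from collections import deque


class SlidingWindowCorpus:
    """Keeps the most recent `max_docs` documents for periodic full retraining."""

    def __init__(self, max_docs):
        # A deque with maxlen silently drops the oldest items once it is full.
        self.docs = deque(maxlen=max_docs)

    def add_batch(self, batch):
        """Append new documents; the oldest ones fall out of the window."""
        self.docs.extend(batch)

    def retrain(self, train_fn):
        """Run a full training pass over the current window.

        train_fn is a hypothetical callback that takes a list of documents
        and returns a freshly trained model.
        """
        return train_fn(list(self.docs))


if __name__ == "__main__":
    corpus = SlidingWindowCorpus(max_docs=3)
    corpus.add_batch(["doc a", "doc b"])
    corpus.add_batch(["doc c", "doc d"])  # "doc a" is discarded here
    # Placeholder "training" that just reports how many documents it saw.
    model = corpus.retrain(lambda docs: {"trained_on": len(docs)})
    print(model)
```

The window size trades freshness and training time against how much history the topics reflect, so it would need tuning for the actual corpus.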

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?

Online learning of LDA model in Spark
