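Suppose we have a text classification problem.
As we know, text data must be converted into vectors before a model can be trained on it. There are several vectorization methods, such as count vectorization, TF-IDF, and bag-of-words. How do we choose among them? Is there a principled way to pick one, or do we need to try every method, train a model with each, and compare the performance?
Please share your thoughts and help me understand this correctly.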
Posted on 2022-01-29 20:11:19
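A count vectorizer converts a given set of strings into a frequency representation. Count vectors can help in understanding the type of a text from the frequency of the words in it. Their main drawbacks are:
They cannot distinguish more important words from less important ones.
They simply treat the words that are most abundant in a corpus as the most statistically significant.
They do not capture relationships between words, such as linguistic similarity.
TF-IDF is better than count vectors because it considers not only how often a word appears in the corpus but also how important that word is. We can then remove the words that matter less for the analysis, which makes model building less complex by reducing the input dimensionality.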
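The difference between raw counts and TF-IDF weights can be seen on a toy corpus. The following is a minimal sketch in plain Python, assuming whitespace tokenization and the textbook weight tf * log(N / df). The corpus and all names are invented for illustration, and libraries such as scikit-learn use a smoothed variant of this formula, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def count_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Document frequency: in how many documents does each word occur?
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Textbook (unsmoothed) inverse document frequency.
idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}

def tfidf_vector(tokens):
    """Relative term frequency multiplied by IDF."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocab]

count_vecs = [count_vector(t) for t in tokenized]
tfidf_vecs = [tfidf_vector(t) for t in tokenized]

the_idx = vocab.index("the")
mat_idx = vocab.index("mat")
print("count  the/mat:", count_vecs[0][the_idx], count_vecs[0][mat_idx])
print("tf-idf the/mat:", tfidf_vecs[0][the_idx], tfidf_vecs[0][mat_idx])
```

In the first document, "the" has the highest count, yet its TF-IDF weight is zero because it occurs in every document, while the rarer word "mat" receives a positive weight. This is the sense in which TF-IDF adds a notion of importance on top of frequency.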
Although TF-IDF gives a good picture of how important each word is, it shares a drawback with count vectors:
It provides no linguistic information about the words, such as their actual meaning or their similarity to other words.
So you should try to match your use case to the strengths described above. In addition, if you want the vectors to carry contextual information, I would suggest exploring Word2Vec and GloVe embeddings.
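To make that shared limitation concrete, here is a small sketch in plain Python, assuming whitespace tokenization; the phrases and helper names are invented for illustration. It compares count vectors with cosine similarity and shows that two phrases with similar meaning but no words in common look completely unrelated.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    """Raw term counts over a fixed vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical example phrases.
a = "good movie"
b = "great film"   # close in meaning to a, but shares no words with it
c = "good film"    # shares one word with a

vocab = sorted(set((a + " " + b + " " + c).split()))
va, vb, vc = (count_vector(t, vocab) for t in (a, b, c))

sim_ab = cosine(va, vb)
sim_ac = cosine(va, vc)
print("similarity(a, b):", sim_ab)
print("similarity(a, c):", sim_ac)
```

Each word occupies its own dimension, so similarity here is driven only by exact word overlap: the first pair scores zero even though the phrases are near synonyms. Dense embeddings such as Word2Vec or GloVe are designed so that related words end up with nearby vectors, which is why they are worth exploring when meaning matters.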
https://datascience.stackexchange.com/questions/107611