Question: Spark NLP: unable to load a pretrained entity recognition model

Stack Overflow user

Asked 2019-12-02 20:50:39

1 answer, 2.1K views, 0 followers, 2 votes

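I have a Spark cluster set up and would like to integrate spark-nlp to run named entity recognition. I need to load the model from disk rather than download it from the internet at runtime. I downloaded the recognize_entities_dl model from the model downloads page and placed the unzipped files where Spark should be able to access them. When I run the following code: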

ner = NerDLModel.pretrained('/path/to/unzipped/files')

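I see the message "Can not find the model to download please check the name!", followed by a stack trace, indicating that it cannot find the files. I also tried the PretrainedPipeline class, with similar results.

For what it's worth, a few relevant details:

Spark version: 2.4.4

spark-nlp version: 2.3.3

Spark is running in a Docker container inside a Kubernetes pod. I can exec into the container and run commands manually to reproduce the problem. It looks like _internal._GetResourceSize is returning -1, which causes the loader to exit. I also get some warnings about HTTP, but all I am trying to do is access a local file, so I am not sure how they are related: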

>>> _internal._GetResourceSize('/path/in/container/recognize_entities_dl_en_2.1.0_2.4_1562946909722', 'en', remote_loc=None).apply()
19/12/02 20:29:03 WARN ApacheUtils: NoSuchMethodError was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
19/12/02 20:29:03 WARN ApacheUtils: NoSuchMethodError was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
'-1'
>>>

Tags: apache-spark, kubernetes, pyspark, johnsnowlabs-spark-nlp

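1 Answer

Stack Overflow user

Answered 2020-02-14 11:14:55

You are trying to load a pretrained pipeline inside an annotator. There are two types of pretrained resources: models and pipelines. A pretrained model is loaded inside an annotator, which is then used as a stage in a pipeline. A pretrained pipeline, by contrast, is simply loaded and used directly.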
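The difference between the two resource types is also visible on disk. As a rough sketch, assuming the unzipped download follows the standard Spark ML save layout (a metadata/ folder holding a JSON part file with a "class" field), the following Python helper reports which kind of resource a directory contains. The function names saved_class and is_pipeline are illustrative and not part of Spark NLP.

```python
import json
from pathlib import Path


def saved_class(model_dir):
    """Return the 'class' field stored in a Spark ML save directory's metadata.

    Assumes the standard layout: <model_dir>/metadata/part-* holds one JSON line.
    """
    metadata_dir = Path(model_dir) / "metadata"
    for part in sorted(metadata_dir.glob("part-*")):
        text = part.read_text(encoding="utf-8").strip()
        if text:
            # The metadata JSON is written as a single line of text.
            return json.loads(text.splitlines()[0])["class"]
    raise FileNotFoundError(f"no metadata part file under {metadata_dir}")


def is_pipeline(model_dir):
    """True if the directory holds a saved pipeline rather than a single model."""
    return saved_class(model_dir).endswith("PipelineModel")
```

If is_pipeline returns True for the unzipped recognize_entities_dl directory, it should be loaded as a pipeline, not through an annotator such as NerDLModel.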

Example of a pretrained pipeline (online, requires internet access):

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")

// Pay attention, for loading a pre-trained pipeline we use PretrainedPipeline
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.4.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|            sentence|               token|          embeddings|                 ner|            entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
|  2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+----------------------------------+
|result                            |
+----------------------------------+
|[Google, TensorFlow]              |
|[Donald John Trump, United States]|
+----------------------------------+
*/

Example of a pretrained pipeline (offline, loading a saved pipeline from disk):

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")

// Here we are loading a pre-trained pipeline we already downloaded manually for offline use

val pipeline = PretrainedPipeline.load("/path/in/container/recognize_entities_dl_en_2.1.0_2.4_1562946909722")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.4.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|            sentence|               token|          embeddings|                 ner|       ner_converter|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
|  2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+----------------------------------+
|result                            |
+----------------------------------+
|[Google, TensorFlow]              |
|[Donald John Trump, United States]|
+----------------------------------+
*/

Example of loading the NerDLModel pretrained model:

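import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel

// Online: downloads the model by name
val nerOnline = NerDLModel.pretrained(name="ner_dl", lang="en")
// Offline: loads a manually downloaded and unzipped model from disk
val nerOffline = NerDLModel.load("/path/ner_dl_en_2.4.0_2.4_1580251789753")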
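Since the question uses PySpark while the examples here are in Scala, a Python sketch of the same offline loading may help. It assumes an active SparkSession with the Spark NLP jar on its classpath, and that the path is a directory readable from the local filesystem. The function load_offline and its kind argument are hypothetical names for this sketch. PipelineModel.load comes from pyspark.ml and NerDLModel.load from sparknlp.annotator.

```python
import os


def load_offline(path, kind):
    """Load an unzipped Spark NLP resource from disk without any download.

    `kind` is a hypothetical switch for this sketch: "pipeline" or "ner_model".
    Assumes an active SparkSession with the Spark NLP jar on its classpath.
    """
    if kind not in ("pipeline", "ner_model"):
        raise ValueError(f"unknown kind: {kind!r}")
    if not os.path.isdir(path):
        # Fail early with a clear message (local-filesystem assumption).
        raise FileNotFoundError(f"model directory not found: {path}")

    # Imports are deferred so the checks above run even without Spark installed.
    # sparknlp is imported first so its annotator classes are available
    # when the saved stages are deserialized.
    import sparknlp  # noqa: F401
    if kind == "pipeline":
        from pyspark.ml import PipelineModel
        return PipelineModel.load(path)
    from sparknlp.annotator import NerDLModel
    return NerDLModel.load(path)
```

For the directory from the question, the call would be load_offline("/path/in/container/recognize_entities_dl_en_2.1.0_2.4_1562946909722", "pipeline"), and the result can then be applied with transform on a DataFrame that has a text column.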

If you have any questions or run into problems with your input data, let me know and I will update my answer.

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/59146469
