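The spaCy 2.0 documentation mentions that the developers have added functionality that allows spaCy to be pickled, so that it can be used by a Spark cluster through PySpark. However, they give no instructions on how to do this.
Can someone explain how I can use spaCy's English-language NE parser inside my udf functions?
This doesn't work: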
from pyspark import cloudpickle
from spacy.lang.en import English

nlp = English()
pickled_nlp = cloudpickle.dumps(nlp)

Posted on 2018-07-26 15:31:22
This is not really an answer, but the best workaround I have found is:
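from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        # Load the model lazily inside the worker process instead of pickling it.
        global nlp
        try:
            # str() on Python 3; the original post used unicode() on Python 2.
            doc = nlp(str(text))
        except NameError:
            # `nlp` is not defined yet in this process, so load it once.
            # 'en' is a spaCy 2.x shortcut link.
            nlp = spacy.load('en')
            doc = nlp(str(text))
        return [t.label_ for t in doc.ents]
    # The function returns a list of entity labels, i.e. an array of strings.
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))

Posted on 2018-07-27 13:31:30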
This worked for my needs and seems to be very quick (adapted from the end of the discussion here):
# aliases used by the snippet below
import pyspark.sql.functions as F
import pyspark.sql.types as T

# create class to wrap spacy object
class SpacyMagic(object):
    """
    Simple Spacy Magic to minimize loading time.
    >>> SpacyMagic.get("en")
    <spacy.en.English ...
    """
    _spacys = {}

    @classmethod
    def get(cls, lang):
        if lang not in cls._spacys:
            import spacy
            cls._spacys[lang] = spacy.load(lang, disable=['parser', 'tagger', 'ner'])
        return cls._spacys[lang]

# broadcast `nlp` object as `nlp_br`
# (`sc` is assumed to be an existing SparkContext)
nlp_br = sc.broadcast(SpacyMagic.get('en_core_web_lg'))

# returns the vector of phrase or word `x` as a list of floats
def get_vector(x):
    return nlp_br.value(x).vector.tolist()

get_vector_udf = F.udf(get_vector, T.ArrayType(T.FloatType()))

# create new column with word2vec vectors
new_df = df.withColumn('w2v_vectors', get_vector_udf(F.col('textColumn')))

https://stackoverflow.com/questions/50880303
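
The caching idea behind the `global nlp` trick and the `_spacys` dictionary in `SpacyMagic` can be tried without Spark or spaCy installed. The following is only a rough sketch of that idea: `ModelCache`, `fake_load`, and the capitalised-token rule are hypothetical stand-ins (for `spacy.load` and a real pipeline) and are not part of spaCy or PySpark.

```python
from typing import Callable, Dict, List


def fake_load(name: str) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for spacy.load(name); returns a toy 'pipeline'."""
    def pipeline(text: str) -> List[str]:
        # Toy rule (an assumption for this sketch): capitalised tokens are entities.
        return [tok for tok in text.split() if tok[:1].isupper()]
    return pipeline


class ModelCache:
    """Keep at most one loaded model per name in each Python process."""

    _models: Dict[str, Callable[[str], List[str]]] = {}
    load_count = 0  # exists only so the sketch can show how often loading happens

    @classmethod
    def get(cls, name: str) -> Callable[[str], List[str]]:
        if name not in cls._models:
            # First request for this name in this process: do the expensive load.
            cls._models[name] = fake_load(name)
            cls.load_count += 1
        return cls._models[name]


def get_entities(text: str) -> List[str]:
    """The kind of function one would wrap in a UDF: fetch the cached model, then use it."""
    nlp = ModelCache.get("en")
    return nlp(text)


rows = ["Alice met Bob in Paris", "nothing here", "Spark runs Python workers"]
results = [get_entities(r) for r in rows]
print(results)
print("loads:", ModelCache.load_count)
```

With a real model, one would presumably swap `fake_load` for `spacy.load` and wrap `get_entities` with `udf(...)`, so that each Python process pays the loading cost only on the first row it handles.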