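I have a file containing a large number of documents, one per line. How can I skip the lines whose length is <= 2 (two or fewer words) and then process only the lines whose length is > 2? For example: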
fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .

After skipping the short lines:
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

My code:
val Bi = text.map(sen=> sen.split(" ").sliding(2))

Is there a way to do this?
Posted on 2015-06-05 15:38:30
I would use filter:
> val text = sc.parallelize(Array("fit perfectly clie .",
"purchased not",
"instructions install helpful . improvement battery life not hoped .",
"product.",
"cable good not work . cable extremely hot not recognize devices ."))
> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
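cable good not work . cable extremely hot not recognize devices .

From here you can process the filtered data in its original form (that is, not tokenized). If you would rather tokenize first, you can do this:
text.map{_.split(" ")}.filter{_.size > 2}

So, finally, to tokenize, then filter, then use sliding to find the bigrams, you can use:
text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}

Posted on 2015-06-05 15:33:20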
How about flatMap?
text.flatMap(line=>{
val tokenized = line.split(" ")
if(tokenized.length > 2) Some(tokenized.sliding(2))
else None
})

https://stackoverflow.com/questions/30670337
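Both suggestions in this thread can be tried without a Spark context. The following is a minimal sketch, assuming a plain Scala `List[String]` stands in for the RDD `text`; the object name `BigramSketch` and the value names `viaFilter` and `viaFlatMap` are made up for illustration. Because `sliding(2)` returns an iterator, the sketch materializes each line's bigrams into a list of strings so they can be printed and compared.

```scala
// Plain-Scala sketch: a List[String] stands in for the RDD `text` (assumption).
object BigramSketch {
  val text: List[String] = List(
    "fit perfectly clie .",
    "purchased not",
    "instructions install helpful . improvement battery life not hoped .",
    "product.",
    "cable good not work . cable extremely hot not recognize devices ."
  )

  // Approach 1: tokenize, drop lines with two or fewer tokens, then build bigrams.
  // sliding(2) yields an Iterator, so each line's bigrams are collected into a List.
  val viaFilter: List[List[String]] =
    text.map(_.split(" "))
      .filter(_.length > 2)
      .map(_.sliding(2).map(_.mkString(" ")).toList)

  // Approach 2: flatMap over an Option. None drops the line, Some keeps it.
  val viaFlatMap: List[List[String]] =
    text.flatMap { line =>
      val tokens = line.split(" ")
      if (tokens.length > 2) Some(tokens.sliding(2).map(_.mkString(" ")).toList)
      else None
    }

  def main(args: Array[String]): Unit = {
    // Print one line of bigrams per surviving input line.
    viaFilter.foreach(bigrams => println(bigrams.mkString(" | ")))
  }
}
```

On this sample input the two approaches keep the same three lines. The flatMap version does the filtering and the bigram step in a single pass, while the filter version keeps the steps separate, which may be easier to read.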