I am using Apache ORC 1.8. Following the short example in the documentation (https://orc.apache.org/docs/core-java.html), I cannot write strings to an ORC file.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector
import org.apache.orc.OrcFile
import org.apache.orc.TypeDescription
import org.apache.orc.Writer
val conf = Configuration()
val schema = TypeDescription.fromString("struct<x:string,y:string>")
val writer = OrcFile.createWriter(
    Path("./my-file.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema)
)
val batch = schema.createRowBatch()
val x = batch.cols[0] as BytesColumnVector
val y = batch.cols[1] as BytesColumnVector
for (r in 0..99) {
    val row = batch.size++
    x.vector[row] = r.toString().toByteArray(Charsets.UTF_8)
    y.vector[row] = r.toString().toByteArray(Charsets.UTF_8)
    // If the batch is full, write it out and start over.
    if (batch.size == batch.maxSize) {
        writer.addRowBatch(batch)
        batch.reset()
    }
}
if (batch.size != 0) {
    writer.addRowBatch(batch)
    batch.reset()
}
writer.close()
When I read the ORC file with Spark:
val df = spark.read().format("orc").load("./my-file.orc")
df.show()
df.printSchema()
it shows:
+---+---+
|  x|  y|
+---+---+
|   |   |
|   |   |
|   |   |
|   |   |
I don't understand what is wrong here. I think the error is in this line: r.toString().toByteArray(Charsets.UTF_8). But I don't know what I can do to fix it.
Any ideas?
Posted on 2022-09-24 16:41:17
I figured it out, I think.
I made the following changes:
...
...
val schema = TypeDescription.createStruct()
    .addField("x", TypeDescription.createString())
    .addField("y", TypeDescription.createString())
...
...
for (r in 0..99) {
    val row = batch.size++
    x.setVal(row, r.toString().toByteArray(Charsets.UTF_8))
}
https://stackoverflow.com/questions/73838766
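A likely explanation for why this works: BytesColumnVector keeps three parallel arrays, vector, start and length. Assigning to vector[row] directly leaves length[row] at 0, so the writer sees zero-length strings, which matches the empty cells Spark shows. setVal copies the bytes into the vector's buffer and records start and length in one call. The schema change is probably not essential, since TypeDescription.fromString("struct<x:string,y:string>") should describe the same type. Below is a minimal sketch of the full write loop with both columns set through setVal. The function name writeStrings and its parameters are hypothetical and only serve the illustration; everything else follows the code in the question.

```kotlin
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector
import org.apache.orc.OrcFile
import org.apache.orc.TypeDescription

// Hypothetical helper (not from the original post): writes `rows` rows
// of two string columns to `path`, which must not exist yet.
fun writeStrings(path: String, rows: Int) {
    val conf = Configuration()
    val schema = TypeDescription.fromString("struct<x:string,y:string>")
    val writer = OrcFile.createWriter(
        Path(path),
        OrcFile.writerOptions(conf).setSchema(schema)
    )
    val batch = schema.createRowBatch()
    val x = batch.cols[0] as BytesColumnVector
    val y = batch.cols[1] as BytesColumnVector

    for (r in 0 until rows) {
        val row = batch.size++
        val bytes = r.toString().toByteArray(Charsets.UTF_8)
        // setVal copies the bytes and sets start[row] and length[row],
        // which a plain assignment to vector[row] does not do.
        x.setVal(row, bytes)
        y.setVal(row, bytes)
        // If the batch is full, write it out and start over.
        if (batch.size == batch.maxSize) {
            writer.addRowBatch(batch)
            batch.reset()
        }
    }
    // Flush the last, partially filled batch.
    if (batch.size != 0) {
        writer.addRowBatch(batch)
        batch.reset()
    }
    writer.close()
}
```

Calling writeStrings("./my-file.orc", 100) should reproduce the file from the question with non-empty values. If copying is a concern, setRef(row, bytes, 0, bytes.size) stores a reference to the array together with an explicit start and length instead; the array must then stay unmodified until the batch has been written.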