Apache Tika 1.16 TXTParser fails to detect the character encoding in an sbt build

Stack Overflow user
Asked 2017-11-03 16:47:13
Answers 2 · Views 3K · Followers 0 · Votes 0

I am building a project that I have in Eclipse using sbt assembly. My build.sbt file is very large and complicated because I ran into many dependency conflicts.

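Everything works correctly for pdf, pptx, odt, and docx files, using the PDF, OOXML, and OpenDocument parsers in Tika 1.16. However, when I try to parse a txt file (UTF-8 encoded) with TXTParser, I get the following error: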

org.apache.tika.exception.TikaException: Failed to detect the character encoding of a document
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:77)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:108)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:79)

at this line in my Scala code:

val content = theParser.parse(stream.open(), chandler, meta, pContext)

where stream is a PortableDataStream, chandler is a new BodyContentHandler, meta is a new Metadata, and pContext is a new ParseContext.

If I use AutoDetectParser instead, I get the following error:

org.apache.jena.shared.SyntaxError: unknown
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:73)
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:58)
    at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)

at this line in my Scala code:

val response = model.read(stream, null, "N-TRIPLES")

where stream is an InputStream.

I think this is caused by an empty response from Tika (that is, the same underlying problem).

I am fairly sure this is a dependency problem somewhere in my overly complicated build.sbt file, but after many hours of trying I really need help.

On the positive side, everything works perfectly when there are no txt files in the input, so this may be my last problem!

Finally, here is my build.sbt file, which I build with sbt clean assembly:

scalaVersion := "2.11.8"
version      := "1.0.0"
name := "crawldocs"
conflictManager := ConflictManager.strict
mainClass in assembly := Some("com.addlesee.crawling.CrawlHiccup")
libraryDependencies ++= Seq(
  "org.apache.tika" % "tika-core" % "1.16",
  "org.apache.tika" % "tika-parsers" % "1.16" excludeAll(
    ExclusionRule(organization = "*", name = "guava")
  ),
  "com.blazegraph" % "bigdata-core" % "2.0.0" excludeAll(
    ExclusionRule(organization = "*", name = "collection-0.7"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "commons-logging"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "httpmime"),
    ExclusionRule(organization = "*", name = "jackson-annotations"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-cmds"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "jena-tdb"),
    ExclusionRule(organization = "*", name = "jsonld-java"),
    ExclusionRule(organization = "*", name = "libthrift"),
    ExclusionRule(organization = "*", name = "log4j"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "xercesImpl"),
    ExclusionRule(organization = "*", name = "xml-apis")
  ),
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.apache.jena" % "apache-jena" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.jena" % "apache-jena-libs" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.noggit" % "noggit" % "0.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2" excludeAll(
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.spark" % "spark-core_2.11" % "2.2.0" excludeAll(
    ExclusionRule(organization = "*", name = "breeze_2.11"),
    ExclusionRule(organization = "*", name = "hadoop-hdfs"),
    ExclusionRule(organization = "*", name = "hadoop-annotations"),
    ExclusionRule(organization = "*", name = "hadoop-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-app"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-core"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-jobclient"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-shuffle"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-client"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-web-proxy"),
    ExclusionRule(organization = "*", name = "activation"),
    ExclusionRule(organization = "*", name = "hive-exec"),
    ExclusionRule(organization = "*", name = "scala-compiler"),
    ExclusionRule(organization = "*", name = "spire_2.11"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "guava"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "bcprov-jdk15on"),
    ExclusionRule(organization = "*", name = "jul-to-slf4j"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "curator-framework")
  ),
  "org.scala-lang" % "scala-xml" % "2.11.0-M4",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "netty")
  ),
  "org.apache.hadoop" % "hadoop-common" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-math3"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jets3t"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "commons-net"),
    ExclusionRule(organization = "*", name = "curator-recipes"),
    ExclusionRule(organization = "*", name = "jsr305")
  )
)
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}

2 Answers

Stack Overflow user

Accepted answer

Posted 2017-11-09 12:10:10

Finally fixed it.

I added a MergeStrategy line, case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate, above the discard line. The following assemblyMergeStrategy, at the bottom of my build.sbt, fixed my problem:

assemblyMergeStrategy in assembly := {
 case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}
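
A plausible reading of why this works: Tika appears to find its encoding detectors through Java's ServiceLoader, which reads a registration file under META-INF/services, and the blanket META-INF discard rule removed that file from the assembled jar. The sketch below is one way to check that from inside the jar. It uses only the JDK and the Scala standard library; ServiceFileCheck and its method names are hypothetical, and the file name org.apache.tika.detect.EncodingDetector is an assumption derived from the interface name matched by the fix above.

```scala
import java.net.URL
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical helper, not part of Tika or sbt-assembly.
object ServiceFileCheck {

  /** Every copy of a ServiceLoader registration file visible to the given class loader. */
  def serviceFiles(serviceClass: String, loader: ClassLoader): List[URL] =
    loader.getResources("META-INF/services/" + serviceClass).asScala.toList

  /** Provider class names listed in one registration file, skipping blanks and # comments. */
  def providers(url: URL): List[String] = {
    val src = Source.fromURL(url, "UTF-8")
    try src.getLines().map(_.takeWhile(_ != '#').trim).filter(_.nonEmpty).toList
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumed file name: the fully qualified name of Tika's EncodingDetector interface.
    val service = "org.apache.tika.detect.EncodingDetector"
    val files = serviceFiles(service, getClass.getClassLoader)
    if (files.isEmpty) println(s"No registration file for $service on the classpath")
    else files.foreach(u => println(s"$u -> ${providers(u).mkString(", ")}"))
  }
}
```

Running this with the assembled jar on the classpath, before and after changing the merge strategy, should show whether the registration file survives packaging.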
Votes 0

Stack Overflow user

Posted 2017-11-03 21:50:56

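The code above is calling the old N-Triples parser, which exists only for legacy reasons. The old reader only reads ASCII, so anything else will break it.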

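Either apache-jena-libs (which is type=pom) is not being processed, or you are repackaging the jars and have not handled the META-INF/services files used by ServiceLoader. Jena uses these for initialization. The META-INF/services/* files must be combined by concatenating files of the same name.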

Details: https://jena.apache.org/documentation/notes/jena-repack.html
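
In sbt-assembly terms, the concatenation described above could look roughly like the following. This is a sketch that adapts the question's assemblyMergeStrategy, not a tested configuration for this project: it assumes the question's blanket META-INF discard rule is what removes the service files, and it uses sbt-assembly's built-in MergeStrategy.concat for them.

```scala
// build.sbt fragment (requires the sbt-assembly plugin); a sketch, not a verified fix
assemblyMergeStrategy in assembly := {
  // Keep ServiceLoader registrations and join same-named files from different jars
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Discard the rest of META-INF, as the original build did
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // For all other conflicts, keep the first copy found
  case _ => MergeStrategy.first
}
```

The order of the cases matters: the services rule has to come before the general META-INF rule, because the first matching case wins. MergeStrategy.filterDistinctLines is an alternative to concat that also drops duplicate lines.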

Votes 2
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47100718
