
StormCrawler does not parse Tika metadata

Stack Overflow user
Asked 2020-12-25 15:38:58
1 answer · 77 views · 0 followers · 1 vote

After adding the Tika parser to StormCrawler, no information is extracted from the document fields and stored in Elasticsearch.

es-crawler.flux

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 1
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "tika_redirection"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "tika_parser"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
      
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE     

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "tika_redirection"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_redirection"
    to: "tika_parser"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"

  - from: "tika_parser"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_parser"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

I added these settings to crawler-conf.yaml:

crawler-conf.yaml

  parser.mimetype.whitelist:
    - application/.*pdf.*

  jsoup.treat.non.html.as.error: false

In addition, when running the topology, I see the following log entry:

16:27:29.867 [Thread-43-tika_parser-executor[22, 22]] INFO  c.d.s.t.ParserBolt - skipped_trimmed -> http://cds.iisc.ac.in/wp-content/uploads/DS256.2017.Storm_.Tutorial.pdf

I would like to extract every available field from the PDF and store the text per page in an array, so that each page becomes one element of an array in Elasticsearch.


1 Answer

Stack Overflow user

Answered 2020-12-26 09:55:00

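See ParserBolt: if the document was trimmed during fetching, parsing does not happen. That is what the skipped_trimmed entry in your log indicates.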

You can disable trimming in the configuration:

  http.content.limit: -1

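This should get the file parsed with Tika. The resulting metadata will have a `parse.` prefix. You will probably need to write a custom bolt to massage the data into the format you want, i.e. one key per page in ES.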
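As a rough illustration of the "massaging" step, the sketch below shows only the data reshaping such a custom bolt might perform, without any Storm or StormCrawler dependencies. Everything in it is an assumption made for illustration: the class name `PdfDocShaper`, the use of a plain `Map<String, String[]>` in place of StormCrawler's metadata object, and above all the page delimiter. Tika's plain-text output is not guaranteed to contain a page marker, so the form feed used in `main` is hypothetical and you would need to check what your parser actually emits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PdfDocShaper {

    // Prefix that, per the answer above, Tika-derived metadata keys carry.
    static final String PREFIX = "parse.";

    /** Keeps only the keys that start with PREFIX and strips the prefix. */
    static Map<String, String[]> tikaFields(Map<String, String[]> metadata) {
        Map<String, String[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : metadata.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                out.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return out;
    }

    /**
     * Splits the extracted text on a literal delimiter and drops blank pages.
     * The delimiter is an assumption; verify it against real parser output.
     */
    static List<String> pages(String text, String delimiter) {
        List<String> result = new ArrayList<>();
        for (String page : text.split(Pattern.quote(delimiter))) {
            String trimmed = page.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values, only to exercise the two helpers.
        Map<String, String[]> metadata = new LinkedHashMap<>();
        metadata.put("parse.title", new String[] {"Sample title"});
        metadata.put("fetch.statusCode", new String[] {"200"});

        System.out.println(tikaFields(metadata).keySet());
        System.out.println(pages("first page\fsecond page", "\f"));
    }
}
```

In a real topology this logic would sit in a bolt placed between `tika_parser` and `index`, writing the page list back into the metadata under a key that the indexer is configured to map to an Elasticsearch field.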

Votes: 0
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/65449492
