
Tidy HTML with BaseX

Stack Exchange QA user
Asked 2019-01-02 04:21:23

How do I tidy this HTML for BaseX? I am working from the BaseX examples.

It looks like the HTML is invalid, or at least BaseX has a clear problem parsing it:

thufir@dur:~/NetBeansProjects/helloWorldBaseX$ 
thufir@dur:~/NetBeansProjects/helloWorldBaseX$ gradle clean run

> Task :run FAILED
Jan 01, 2019 8:11:21 PM org.basex.examples.local.ScraperForHTML fetch
INFO: name  Resources  Size  Input Path  
---------------------------------

0 Databases.

Exception in thread "main" org.basex.core.BaseXException: "http://books.toscrape.com/" (Line 665):  found,  expected.
        at org.basex.core.Command.execute(Command.java:78)
        at org.basex.core.Command.execute(Command.java:90)
        at org.basex.examples.local.ScraperForHTML.fetch(ScraperForHTML.java:38)
        at org.basex.examples.local.App.helloWorld(App.java:24)
        at org.basex.examples.local.App.main(App.java:14)

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':run'.
> Process 'command '/usr/lib/jvm/java-8-openjdk-amd64/bin/java'' finished with non-zero exit value 1

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 2s
4 actionable tasks: 3 executed, 1 up-to-date
thufir@dur:~/NetBeansProjects/helloWorldBaseX$ 

Code:

package org.basex.examples.local;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.Properties;
import java.util.logging.Logger;
import org.basex.core.BaseXException;
import org.basex.core.Context;
import org.basex.core.Databases;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.List;
import org.basex.core.cmd.Set;
import org.basex.util.list.StringList;

public class ScraperForHTML implements Scraper{

    private static final Logger LOG = Logger.getLogger(ScraperForHTML.class.getName());
    private Properties properties = new Properties();

    private ScraperForHTML() {
    }

    public ScraperForHTML(Properties properties) {
        this.properties = properties;
        LOG.fine(properties.toString());
    }

    @Override
    public void fetch() throws BaseXException, MalformedURLException   {
        URL url = new URL(properties.getProperty("htmlURL"));
        String databaseName = properties.getProperty("databaseName");

        Context context = new Context();
        LOG.info(new List().execute(context));

        new Set("parser", "html").execute(context);
        new CreateDB(databaseName, url.toString()).execute(context);


        Databases databases = context.databases();
        StringList stringListOfDatabases = databases.listDBs();
        String currentDatabaseName = null;

        // Typed iterator so that next() returns a String without a cast.
        Iterator<String> databaseIterator = stringListOfDatabases.iterator();

        while (databaseIterator.hasNext()) {
            currentDatabaseName = databaseIterator.next();
            LOG.info(currentDatabaseName);
            //not quite sure how to query a database...
        }


      //  new DropDB(databaseName).execute(context);
        context.close();
    }


}

Notably, this XQuery seems to work fine:

thufir@dur:~/basex$ 
thufir@dur:~/basex$ cat html_fetch_parse.xq 


fetch:xml("http://books.toscrape.com/", map {
  'parser': 'html',
  'htmlparser': map { 'html': false(), 'nodefaults': true() }
})

thufir@dur:~/basex$ 

1 Answer

Stack Exchange QA user

Accepted answer

Answered 2019-01-02 23:36:09

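It looks like TagSoup is missing from your classpath. It has to be there for BaseX to import HTML input successfully: HTML is usually not well-formed XML, so it needs to be preprocessed before import.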
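One quick way to confirm this diagnosis is to ask the JVM whether TagSoup's parser class can be loaded at all. The sketch below is only an illustration: the class name `TagSoupCheck` and the helper `isOnClasspath` are hypothetical, and it assumes BaseX picks TagSoup up from the ordinary application classpath.

```java
// Minimal sketch: report whether the TagSoup parser class can be loaded.
// TagSoupCheck and isOnClasspath are hypothetical names used for illustration.
final class TagSoupCheck {

    // Fully qualified name of TagSoup's SAX parser class.
    static final String TAGSOUP_PARSER = "org.ccil.cowan.tagsoup.Parser";

    // Returns true if the named class is visible to this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, TagSoupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("TagSoup available: " + isOnClasspath(TAGSOUP_PARSER));
    }
}
```

If this reports `false` when run with the same classpath as the Gradle `run` task, adding the TagSoup jar as a runtime dependency of the project is the likely fix.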

Without TagSoup, the input is treated as XML and must be well-formed. Your example is missing closing tags, which makes it impossible to import without TagSoup.
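To see why a strict XML parser rejects such input, the following sketch feeds two small snippets to the JDK's built-in XML parser. It does not use BaseX, and the two markup strings are made-up examples rather than content from the page in the question; `WellFormedCheck` and `isWellFormed` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch: check whether a string is well-formed XML using the JDK parser.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness violation, such as a missing closing tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical inputs: the second one never closes its <p> element.
        String closed = "<html><body><p>one</p></body></html>";
        String unclosed = "<html><body><p>one</body></html>";
        System.out.println("closed:   " + isWellFormed(closed));
        System.out.println("unclosed: " + isWellFormed(unclosed));
    }
}
```

Browsers and TagSoup tolerate the second form and repair it; an XML parser, by design, does not.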

Original content provided by Stack Exchange QA.
Original link:

https://sqa.stackexchange.com/questions/37110
