Disabling XML validation in WebHarvest

Stack Overflow user
Asked 2014-08-21 18:43:33
Answers: 2 · Views: 510 · Followers: 0 · Votes: 1

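I have published a mobile app in Apple's App Store.

This client application uses a REST API on the server side to retrieve real-time information about buses arriving at a specific bus stop.

The app worked very well for 6 months.

The REST API uses WebHarvest to scrape the actual data from the website (for example: http://www.metlink.org.nz/stop/4912/departures).

A few days ago, the HTML page that my server-side code scrapes changed, and the following line was added: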

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

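Since then, my app has stopped working.

I know I could strip the line above with a regular expression, but I would like to know whether there is a way to tell WebHarvest to disable XML validation. With XML validation disabled, I would not need to change the XPath expressions to regular expressions in every configuration I have just to get rid of that line.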
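
For reference, the regular-expression workaround mentioned above could look roughly like the following Java sketch. The class and method names are hypothetical, and it assumes the converted page is available as a string before any XPath is evaluated and that the DOCTYPE has no internal subset (no `[...]` block). It is not part of WebHarvest itself.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a WebHarvest class): removes a DOCTYPE declaration
// from a document held in a string.
class DoctypeStripper {
    // Matches "<!DOCTYPE ...>" plus trailing whitespace, case-insensitively.
    // Assumption: the declaration contains no '>' before its end, which holds
    // for the HTML 4.0 Transitional DOCTYPE shown above.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    static String stripDoctype(String xml) {
        return DOCTYPE.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/REC-html40/loose.dtd\">\n"
                + "<html><body/></html>";
        System.out.println(stripDoctype(xml)); // prints <html><body/></html>
    }
}
```

With the DOCTYPE gone, an XML parser has no external DTD to fetch, so existing XPath expressions could stay as they are.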

Here is my configuration file:

<config charset="UTF-8">

    <var-def name="pageContentStr">
        <html-to-xml>
            <http url="http://www.metlink.org.nz/stop/${stationID.toString()}/departures" />
        </html-to-xml>
    </var-def>

    <var-def name="serverTime">
        <xpath expression="/html/body/ul/li/span/text()">
            <var name="pageContentStr" />
        </xpath>
    </var-def>

    <var-def name="busRTI">
            <xpath expression="//tbody/tr[@data-code]/concat(td[1]/a[starts-with(@href,'timetables/')]/span/text(),'::',td[1]/a[starts-with(@href,'timetables/bus/')]/span/attribute::style,'::',td[2]/span/text(),'::',td[3]/span/text())">
                <var name="pageContentStr" />
            </xpath>
    </var-def>
</config>

If I run the configuration file above in the WebHarvest GUI, it works fine (strangely). However, when it runs inside my REST API, I get an error. Here is the error I receive:

exception

org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.webharvest.exception.ScraperXPathException: Error parsing XPath expression (XPath = [/html/body/ul/li/span/text()])!
    org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:948)
    org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

org.webharvest.exception.ScraperXPathException: Error parsing XPath expression (XPath = [/html/body/ul/li/span/text()])!
    org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:70)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
    org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.Scraper.execute(Scraper.java:166)
    org.webharvest.runtime.Scraper.execute(Scraper.java:179)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
    com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
    com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
    org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
    org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
    org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
    org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
    org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
    org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
    org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: `http://www.w3.org/TR/REC-html40/loose.dtd`; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
    net.sf.saxon.event.Sender.sendSAXSource(Sender.java:420)
    net.sf.saxon.event.Sender.send(Sender.java:169)
    net.sf.saxon.Configuration.buildDocument(Configuration.java:3346)
    net.sf.saxon.Configuration.buildDocument(Configuration.java:3288)
    net.sf.saxon.query.StaticQueryContext.buildDocument(StaticQueryContext.java:327)
    org.webharvest.utils.XmlUtil.evaluateXPath(XmlUtil.java:77)
    org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:68)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
    org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.Scraper.execute(Scraper.java:166)
    org.webharvest.runtime.Scraper.execute(Scraper.java:179)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
    com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
    com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
    org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
    org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
    org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
    org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
    org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
    org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
    org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: http://www.w3.org/TR/REC-html40/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
    com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
    com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
    com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
    com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1562)
    com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:1964)
    com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(XMLDTDScannerImpl.java:297)
    com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1162)
    com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1049)
    com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:962)
    com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
    com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
    com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
    com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
    com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
    net.sf.saxon.event.Sender.sendSAXSource(Sender.java:396)
    net.sf.saxon.event.Sender.send(Sender.java:169)
    net.sf.saxon.Configuration.buildDocument(Configuration.java:3346)
    net.sf.saxon.Configuration.buildDocument(Configuration.java:3288)
    net.sf.saxon.query.StaticQueryContext.buildDocument(StaticQueryContext.java:327)
    org.webharvest.utils.XmlUtil.evaluateXPath(XmlUtil.java:77)
    org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:68)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
    org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
    org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
    org.webharvest.runtime.Scraper.execute(Scraper.java:166)
    org.webharvest.runtime.Scraper.execute(Scraper.java:179)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
    com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
    com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
    com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
    org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
    org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
    org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
    org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
    org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
    org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
    org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
    org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728) 

Thanks in advance.
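
For context, the innermost root cause in the trace above comes from Xerces reading the external DTD (`XMLDTDScannerImpl.scanDTDExternalSubset`) rather than from validating the page itself: `loose.dtd` is an SGML DTD for HTML 4 and is not valid XML DTD syntax, so the parser fails on its entity declarations. Below is a minimal, general JAXP sketch of parsing such a document without loading the external DTD. The class and method names are hypothetical, it requires Java 8 or later for the lambda, and it is not a WebHarvest setting; WebHarvest builds its document through Saxon internally, so applying this idea there is an assumption that would require changing how that parser is configured.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of WebHarvest): parses XML while ignoring
// whatever external DTD the DOCTYPE points at.
class NoDtdParser {
    static Document parseIgnoringDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // no DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Resolve every external entity, including the DTD, to empty content
        // so that nothing is fetched over the network or scanned.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/REC-html40/loose.dtd\">"
                + "<html><body><p>ok</p></body></html>";
        Document doc = parseIgnoringDtd(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Setting `setValidating(false)` alone is not enough here, because a non-validating parser may still read the external DTD; the entity resolver is what keeps `loose.dtd` from being scanned.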


2 Answers

Stack Overflow user

Answered 2014-08-21 20:01:39

You could try using omithtmlenvelope="true" in the html-to-xml call:

<var-def name="pageContentStr">
    <html-to-xml omithtmlenvelope="true">
        <http url="http://www.metlink.org.nz/stop/${stationID.toString()}/departures" />
    </html-to-xml>
</var-def>

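Unfortunately, as you mention, the error does not show up everywhere: I cannot reproduce the error you are getting, so I could not test the result...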

Votes: 0

Stack Overflow user

Answered 2014-12-29 05:25:52

I ran into a similar problem, where the XPath evaluator threw an org.xml.sax.SAXParseException:

White spaces are required between publicId and systemId.

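For cases where you can change the original XML, a solution to this problem has already been covered here.

WebHarvest uses HtmlCleaner behind the scenes. I work with the complete WebHarvest project, so I was able to prevent the doctype tag from being added during html-to-xml.

I use HtmlCleaner version 2.6.1 and modified org.webharvest.runtime.processors.HtmlToXmlProcessor to support this newer version: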

HtmlCleaner cleaner = new HtmlCleaner( );
CleanerProperties cleanerProperties = cleaner.getProperties();

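Since HtmlCleaner fully supports the omitDoctypeDeclaration parameter for omitting the doctype, I added the following (in the future this could perhaps be exposed through an additional attribute):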

cleanerProperties.setOmitDoctypeDeclaration(true);

Hope this helps, and thanks to the creators of WebHarvest: it is a great and very reliable tool!

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/25424305
