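I am using the Web-Harvest tool to scrape data from several websites. I have looked through the samples, but I cannot find a way to authenticate on a site and then scrape data from it.
Can anyone give a sample configuration that scrapes data behind authentication? How do I send the login parameters and then retrieve the home page content? Thanks for any help with this.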
Posted on 2014-04-15 17:44:26
I just modified one of the Web-Harvest samples (http://web-harvest.sourceforge.net/samples.php?num=4), and it runs fine with login credentials. You can take the updated code below and try it:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
    <!-- sends post request with needed login information -->
    <http method="post" url="http://www.nytimes.com/auth/login">
        <http-param name="is_continue">true</http-param>
        <http-param name="URI">http://</http-param>
        <http-param name="OQ"></http-param>
        <http-param name="OP"></http-param>
        <http-param name="USERID">web-harvest</http-param>
        <http-param name="PASSWORD">web-harvest</http-param>
    </http>
    <var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>
    <file action="write" path="D:/nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
        <template>
            <![CDATA[ <newyourk_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
        </template>
        <loop item="articleUrl" index="i">
            <!-- collects the story div elements from the front page (fetched after login) -->
            <list>
                <xpath expression="//div[@class='story']">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
            </list>
            <!-- passes each collected story node through XQuery and writes it to the output file -->
            <body>
                <xquery>
                    <xq-param name="doc">
                        <var name="articleUrl"/>
                    </xq-param>
                    <xq-expression><![CDATA[
                        declare variable $doc as node() external;
                        $doc
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </newyourk_times> ]]>
    </file>
</config>
https://stackoverflow.com/questions/23054692
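For sites other than the one in the sample, the same login-then-fetch pattern can be reduced to a minimal sketch like the following. The URLs, the form field names (`username`, `password`) and the variable name `homePage` are hypothetical placeholders, not taken from the answer; the real field names have to be read from the target site's login form. It also assumes, as the sample above does, that the session cookies set by the login response are reused by later `http` calls within the same run.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Step 1: submit the login form via POST.
         URL and parameter names are placeholders (assumptions);
         copy the real ones from the site's login form. -->
    <http method="post" url="http://example.com/login">
        <http-param name="username">my-user</http-param>
        <http-param name="password">my-password</http-param>
    </http>

    <!-- Step 2: request a page that requires the logged-in session
         and keep its cleaned-up XML in a variable for later processing. -->
    <var-def name="homePage">
        <html-to-xml>
            <http url="http://example.com/home"/>
        </html-to-xml>
    </var-def>
</config>
```

If the second request still returns the login page, the form probably expects extra hidden fields (like `is_continue` and `URI` in the sample above), which would need to be added as further `http-param` elements.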