I'm trying to build a web scraper in JavaScript, using Node packages, to get ASN prefix data from this website: prefixes.
This is what I have so far:
var request = require('request');
var cheerio = require('cheerio');
// Organization name -> AS number
var apnList = {
    'MIT': 3,
    'Dynamics': 15,
    'NYU': 12,
    'Harvard': 11,
    'Bull HN Information Sys': 6,
    'NNIC': 22,
    'Symbolics': 5,
    'University of Delaware': 2
};
for (var apn in apnList) {
    var url = 'http://bgp.he.net/AS' + apnList[apn] + '#_prefixes';
    // The IIFE captures the current apn for the asynchronous callback
    request(url, (function(apn) { return function(err, resp, body) {
        var $ = cheerio.load(body);
        console.log(body);
    }})(apn));
}
When I run the file in the terminal, I get the following:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /AS11
on this server.</p>
<hr>
<address>Apache/2.2.22 (Ubuntu) Server at bgp.he.net Port 80</address>
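</body></html>
I get this for every number. How can I fix this?
Also, a bonus question: eventually, I want to take every ASN number from this txt file and feed it into the for loop.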
[asn.txt] 2014-06-30 02:05:03Z
This file contains a list of autonomous system numbers and names of all
registered ASNs. The column on the right below contains the ARIN
database "handle" of the technical, abuse or NOC contacts for the ASN.
Additional information may be obtained about any of the ASN contacts
in this list by querying the ARIN WHOIS server. For questions or updates
on this information please contact the ARIN Registration Services Hostmaster
staff, HOSTMASTER@ARIN.NET.
AS Number AS Name POC Handles
0 IANA-RSVD-0 IANA-IP-ARIN (Abuse), IANA-ARIN (Admin), IANA-IP-ARIN (Tech)
1 LVLT-1 IPADD5-ARIN (Tech), APL8-ARIN (Abuse), NOCSU27-ARIN (NOC), APL7-ARIN (Admin)
2 UDEL-DCN CASHJ-ARIN (Tech), NSS13-ARIN (Abuse), DJG2-ARIN (Tech), DJG2-ARIN (Admin)
3 MIT-GATEWAYS MNO78-ARIN (NOC), SILIS-ARIN (Admin), MNS18-ARIN (Abuse), SILIS-ARIN (Tech)
4 ISI-AS ACT-ORG-ARIN (Admin), ACT-ORG-ARIN (Abuse), ACT-ORG-ARIN (Tech)
5 SYMBOLICS SG52-ARIN (Tech), SG52-ARIN (Admin), SG52-ARIN (Abuse)
6 BULL-HN USINT-ARIN (Admin), ZB126-ARIN (Abuse), ZB126-ARIN (Tech), JLM23-ARIN (Tech)
7 RIPE-ASNBLOCK-7 ABUSE3850-ARIN (Abuse), RNO29-ARIN (Tech), RNO29-ARIN (Admin)
8 RICE-AS RUH-ORG-ARIN (Tech), RUH-ORG-ARIN (Admin), RUH-ORG-ARIN (Abuse)
9 CMU-ROUTER CH4-ORG-ARIN (Tech), CH4-ORG-ARIN (NOC), CMA3-ARIN (Abuse), CH4-ORG-ARIN (Admin)
10 CSNET-EXT-AS CS15-ARIN (Abuse), CS15-ARIN (Tech), CS15-ARIN (Admin)
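11 HARVARD JNL17-ARIN (Admin), JNL17-ARIN (Tech
That is just a snippet of it. It goes on for several thousand numbers. Is there a way to selectively extract each number from the AS Number column?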
Posted on 2014-07-24 15:57:46
Forgot to come back and answer this.
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');              // not used yet
var filename = "/scraping/asn.txt";  // not used yet

// Organization name -> AS number
var apnList = {
    'MIT': 3,
    'Dynamics': 15,
    'NYU': 12,
    'Harvard': 11,
    'Bull HN Information Sys': 6,
    'NNIC': 22,
    'Symbolics': 5,
    'University of Delaware': 2
};

for (var apn in apnList) {
    var options = {
        url: 'http://bgp.he.net/AS' + apnList[apn] + '#_prefixes',
        // Send a browser-like User-Agent; without it the request came back 403
        headers: {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'
        }
    };
    // The IIFE captures the current apn for the asynchronous callback
    request(options, (function(apn) { return function(err, resp, body) {
        if (err) {
            return console.error(apn + ': ' + err);
        }
        var $ = cheerio.load(body);
        // Walk the rows of the IPv4 prefix table; print each .nowrap cell
        // together with the text of the cell that follows it
        $('#table_prefixes4 tr').each(function(index, prefix) {
            $(this).find('.nowrap').each(function() {
                var event = $(this).text().trim();
                var nextevent = $(this).next().text();
                console.log(apn + "," + event + "," + nextevent);
            });
        });
    }})(apn));
}
Posted on 2014-07-24 04:08:58
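bgp.he.net appears to block attempts to scrape its site, probably based on the user agent. That is why you are getting the 403 error: access denied! You could try changing the user agent that Node.js sends when retrieving the remote URL (I'm not sure how to do that offhand), but it seems Hurricane Electric does not like being scraped, so you should probably avoid doing it.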
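On the bonus question about asn.txt, which none of the posts here address: a minimal sketch of one way to pull out the AS Number column, assuming every data row starts with the number followed by whitespace and the AS name, and that no header or descriptive line starts with digits (true for the snippet in the question, unverified for the full file). The helper name `extractAsNumbers` is hypothetical and not part of any package.

```javascript
// Hypothetical helper (not from the original posts): collect the leading
// AS number from every data row of an asn.txt-style listing.
function extractAsNumbers(text) {
    var numbers = [];
    text.split(/\r?\n/).forEach(function (line) {
        // Data rows begin with digits, then whitespace, then the AS name.
        // Header and descriptive lines do not match and are skipped.
        var match = /^\s*(\d+)\s+\S/.exec(line);
        if (match) {
            numbers.push(parseInt(match[1], 10));
        }
    });
    return numbers;
}

// Sample lines copied from the snippet in the question.
var sample = [
    '[asn.txt] 2014-06-30 02:05:03Z',
    'AS Number AS Name POC Handles',
    '0 IANA-RSVD-0 IANA-IP-ARIN (Abuse), IANA-ARIN (Admin), IANA-IP-ARIN (Tech)',
    '2 UDEL-DCN CASHJ-ARIN (Tech), NSS13-ARIN (Abuse), DJG2-ARIN (Tech), DJG2-ARIN (Admin)',
    '3 MIT-GATEWAYS MNO78-ARIN (NOC), SILIS-ARIN (Admin), MNS18-ARIN (Abuse), SILIS-ARIN (Tech)'
].join('\n');

console.log(extractAsNumbers(sample)); // [ 0, 2, 3 ]
```

For the real file, something like `extractAsNumbers(fs.readFileSync(filename, 'utf8'))` should give an array that can be looped over in place of the `apnList` values. Given that the site seems to reject scrapers, firing several thousand requests at once is probably a bad idea.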
https://stackoverflow.com/questions/24543214