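Python code works with two entries in the CSV file but raises errors on the full dataset of 20,000 domains when bulk-extracting WHOIS information
I tried it with two domain names and it works. Using the full list of 20k domains raises errors.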
import whois
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import socket
import os
import csv
import datetime
import time
import requests
from ipwhois import IPWhois
from urllib import request
from ipwhois.utils import get_countries
import tldextract
from ipwhois.utils import get_countries
countries = get_countries(is_legacy_xml=True)
from ipwhois.experimental import bulk_lookup_rdap
from ipwhois.hr import (HR_ASN, HR_ASN_ORIGIN, HR_RDAP_COMMON, HR_RDAP, HR_WHOIS, HR_WHOIS_NIR)
countries = get_countries(is_legacy_xml=True)
import ipaddress
df = pd.read_csv('labelled_dataset.csv')
#TimeOut Setting
s = socket.socket()
s.settimeout(10)
#Date Processing Function
def check_date_type(d):
    if type(d) is datetime.datetime:
        return d
    if type(d) is list:
        return d[0]
for index,row in df.iterrows():
    DN = df.iloc[index]['Domains']
    df['IPaddr'] = socket.gethostbyname(DN)
    df['IPcity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['ASNumber'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['asn']
    df['NetAddr'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['address']
    df['NetCity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['NetPostCode'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['postal_code']
    W = whois.whois(DN)
    df['WebsiteName'] = W.name
    df['ASRegistrar'] = W.registrar
    df['CtryCode'] = W.country
    df['Dstatus'] = W.status[1]
    df['RegDate'] = check_date_type(W.creation_date)
    df['ExDate'] = check_date_type(W.expiration_date)
df.to_csv('extracted_dataset_1_1.csv', index=False)
Expected output: the ASN details and the WHOIS information for each domain, exported to a CSV file.
Posted on 2019-07-27 11:31:05
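You are creating a new IPWhois object for every attribute you look up. That means each iteration runs at least 5 whois queries.
This generates a lot of network traffic and is completely unnecessary: you only need to run whois once per domain and then read the fields you need from that one result.
Try changing the code in the loop to something like this: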
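# Resolve the domain once and keep the address in a plain variable
ip = socket.gethostbyname(DN)
# Write to this row only; df['IPaddr'] = ... would overwrite the whole column
df.loc[index, 'IPaddr'] = ip
# Run the IP whois lookup once and reuse the result
ipwhois = IPWhois(ip, allow_permutations=True).lookup_whois()
if ipwhois:
    net = ipwhois['nets'][0]
    df.loc[index, 'IPcity'] = net['city']
    df.loc[index, 'ASNumber'] = ipwhois['asn']
    df.loc[index, 'NetAddr'] = net['address']
    df.loc[index, 'NetCity'] = net['city']
    df.loc[index, 'NetPostCode'] = net['postal_code']
There are a few other optimizations I would suggest: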
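Use either IPWhois or whois, not both.
Use an asynchronous model. Your code has to wait for the response to each whois query before it can continue, and a network query is many orders of magnitude slower than the rest of the code that runs in each iteration of the loop. With an asynchronous model you can fire off several whois queries at once and act on each result only when it arrives, which can help make the application more efficient.
https://stackoverflow.com/questions/57231047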
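The idea of firing off several lookups and handling each result as it arrives could be sketched roughly as follows. This is a minimal sketch, not the answer author's code: it uses a thread pool from the standard library's concurrent.futures rather than a true async event loop, and lookup_all, fake_lookup and max_workers are hypothetical names introduced only for illustration. The stand-in lookup avoids network access; a real script would pass a function that wraps the whois call instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def lookup_all(domains, lookup_fn, max_workers=8):
    """Run lookup_fn(domain) for every domain concurrently.

    Returns a dict mapping each domain to either its result or the
    exception the lookup raised, so one bad domain cannot stop the run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every lookup up front; the pool runs up to max_workers at once.
        futures = {pool.submit(lookup_fn, d): d for d in domains}
        # Handle each result as soon as it arrives, not in submission order.
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:  # e.g. a DNS failure or a timeout
                results[domain] = exc
    return results


def fake_lookup(domain):
    """Stand-in for a real network lookup (hypothetical, for illustration)."""
    if domain.endswith(".invalid"):
        raise ValueError("cannot resolve " + domain)
    return {"domain": domain, "asn": "AS0"}


if __name__ == "__main__":
    out = lookup_all(["example.com", "example.org", "bad.invalid"], fake_lookup)
    for name, value in sorted(out.items()):
        print(name, "->", value)
```

Storing the exception instead of re-raising it means a single unresolvable domain in a 20k list is recorded and skipped rather than aborting the whole run. WHOIS servers may throttle or refuse rapid requests, so a modest max_workers is probably the safer starting point.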