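I am fairly new to classes, and this script uses one.
I am pretty sure this is not the correct way to do it, but I don't know what the correct way is.
For instance, you can call the hackthissite.org function on a crawler instance that was created with the certified_secure URL.
Do I have too many comments? Are they too verbose?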

#!/usr/bin/env python
import bs4
import requests

users = ['user1', 'user2']
certified_secure_url = 'https://www.certifiedsecure.com/profile?alias='
hack_this_site_url = 'https://www.hackthissite.org/user/view/'


# this function takes a string as input and outputs a list with all the integers in the string
def get_num(string):
    # get the numbers from string
    lst = ''.join([x if x.isdigit() else ' ' for x in string]).split()
    # change to list of ints instead of strings
    new_lst = []
    for item in lst:
        new_lst.append(int(item))
    return new_lst


class Crawler(object):
    def __init__(self, url):
        self.url = url

    # retrieve data from site and
    def get_site_data(self, user):
        request = requests.get(self.url + user)
        return bs4.BeautifulSoup(request.text, 'lxml')

    def certified_secure(self, user):
        experience = self.get_site_data(user).select('.level_progress_details')[0].getText()
        # get the points from the string
        return get_num(experience)[1]

    def hack_this_site(self, user):
        experience = self.get_site_data(user).select('.blight-td')[1].getText()
        return get_num(experience)[0]


# make two instances to crawl
cs = Crawler(certified_secure_url)
hts = Crawler(hack_this_site_url)
for user in users:
    print(cs.certified_secure(user))
    print(hts.hack_this_site(user))

Posted on 2016-09-07 20:00:00
You can use a list comprehension in get_num:

def get_num(string):
    """this function takes a string as input and outputs a list with all the integers in the string"""
    # get the numbers from string
    numbers = ''.join(x if x.isdigit() else ' ' for x in string).split()
    # change to list of ints instead of strings
    return [int(number) for number in numbers]
    # return map(int, numbers)  # Alternative (a list in Python 2, an iterator in Python 3)

Also note that join can take a generator expression, so there is no need to build a list first. I also chose more descriptive variable names here.
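
To see what the generator-expression version returns, here is a small self-contained check. The sample string is made up for illustration and is not real output from either site.

```python
def get_num(string):
    """Return all integers found in the string, in order of appearance."""
    # Replace every non-digit with a space, then split into runs of digits
    numbers = ''.join(x if x.isdigit() else ' ' for x in string).split()
    return [int(number) for number in numbers]


# Hypothetical sample text, not taken from a real profile page
sample = "Level 3: 120 of 500 points"
result = get_num(sample)
print(result)  # [3, 120, 500]
```

Keep in mind that this approach only looks at digit characters, so a sign is dropped and a decimal such as "1.5" comes back as two separate integers.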
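
You should also give your functions (and classes) a docstring (as I did above: a triple-quoted string as the first line of the function body). You can access it interactively via help(function_name), and many documentation-building tools use it.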
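
As a rough illustration of where docstrings end up, the sketch below attaches one to a class and one to a method and reads them back. The docstring wording is my own assumption about what the class is meant to do, not something taken from your code.

```python
import inspect


class Crawler(object):
    """Fetch a user's profile page from one site."""

    def __init__(self, url):
        """Store the base url that user names are appended to."""
        self.url = url


# A docstring is stored in the __doc__ attribute of the object it documents
print(Crawler.__doc__)

# inspect.getdoc returns the same text with the indentation cleaned up
print(inspect.getdoc(Crawler.__init__))

# In an interactive session, help(Crawler) displays both docstrings
```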

It also seems a bit too manual to have to know which method to call depending on the url. Your crawler could decide this on its own:

class Crawler(object):
    # per site: (CSS selector, index of the selected element, index of the number in its text)
    sites = {"hackthissite.org": ('.blight-td', 1, 0),
             "certifiedsecure.com": ('.level_progress_details', 0, 1)}

    def __init__(self, url):
        self.url = url
        self.options = self.get_sites_options(Crawler.sites)

    def get_sites_options(self, sites):
        for site, options in sites.items():
            # pick the site whose domain appears in the url
            if site in self.url:
                return options

    def get_site_data(self, user):
        """retrieve data from site and"""
        request = requests.get(self.url + user)
        return bs4.BeautifulSoup(request.text, 'lxml')

    def get_experience(self, user):
        select_str, index, out_index = self.options
        experience = self.get_site_data(user).select(select_str)[index].getText()
        return get_num(experience)[out_index]

https://codereview.stackexchange.com/questions/140753
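
The site lookup in the class above can also be tried on its own, without any network access. The sketch below is one possible variant, not the only way to do it: it matches on the parsed host name instead of searching the whole url, and it raises an error for an unknown site. The names SITES and options_for are hypothetical and chosen only for this example.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Same table as in the class: (CSS selector, element index, number index)
SITES = {"hackthissite.org": ('.blight-td', 1, 0),
         "certifiedsecure.com": ('.level_progress_details', 0, 1)}


def options_for(url):
    """Return the options tuple for the site that hosts url."""
    host = urlparse(url).hostname or ''
    for site, options in SITES.items():
        # Accept the bare domain or any subdomain such as "www."
        if host == site or host.endswith('.' + site):
            return options
    raise ValueError('unsupported site: {}'.format(url))


print(options_for('https://www.hackthissite.org/user/view/'))
```

Failing loudly for an unsupported url avoids a confusing error later, when the options would otherwise be unpacked from None.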