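This is a program I wrote to scrape restaurants/bars from Google, Yelp and Foursquare. It then ranks them more effectively, using a Bayesian average based on the rating, the number of ratings and the number of data sources. I'm guessing the main method could be broken down into more functions. I'm also guessing I'm missing some handy list comprehension tricks. Any suggestions?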

import csv
import time

import foursquare
import yelp
import google


def bayesian(R, v, m, C):
    """
    Computes the Bayesian average for the given parameters
    :param R: Average rating for this business
    :param v: Number of ratings for this business
    :param m: Minimum ratings required
    :param C: Mean rating across the entire list
    :returns: Bayesian average
    """
    # Convert to floating point numbers
    R = float(R)
    v = float(v)
    m = float(m)
    C = float(C)
    return ((v / (v + m)) * R + (m / (v + m)) * C)


def remove_duplicate_names(full_list):
    """
    Fixes issue with multiple API calls returning the same businesses
    :param R: The entire unfiltered list
    :returns: Filtered list
    """
    names = set()
    filtered_list = []
    for business in full_list:
        if business.name not in names:
            filtered_list.append(business)
            names.add(business.name)
    return filtered_list


def main():
    """
    Finds all the bars/restaurants in the given area. Use different
    lat/long points to cover entire town since API calls have length limits.
    """
    input_value = ''
    locations = []
    distance = input('Search Radius (meters): ')
    while input_value is not 'n':
        lat = input('Lat: ')
        lng = input('Long: ')
        locations.append((lat, lng))
        input_value = raw_input('Would you like more points? (y/n) ')

    venues, businesses, places = [], [], []
    for lat,lng in locations:
        # Retrieve all businesses for all sources
        print 'Searching lat: {} long: {} ...'.format(lat, lng)
        venues.extend(foursquare.search(lat, lng, distance))
        businesses.extend(yelp.search(lat, lng, distance))
        places.extend(google.search(lat, lng, distance))
        # Rate-limit API calls
        time.sleep(1.0)

    # Remove duplicates from API call overlap
    venues = remove_duplicate_names(venues)
    businesses = remove_duplicate_names(businesses)
    places = remove_duplicate_names(places)

    # Calculate low threshold and average ratings
    fs_low = min(venue.rating_count for venue in venues)
    fs_avg = sum(venue.rating for venue in venues) / len(venues)
    yp_low = min(business.rating_count for business in businesses)
    yp_avg = sum(business.rating for business in businesses) / len(businesses)
    gp_low = min(place.rating_count for place in places)
    gp_avg = sum(place.rating for place in places) / len(places)

    # Add bayesian estimates to business objects
    for v in venues:
        v.bayesian = bayesian(v.rating, v.rating_count, fs_low, fs_avg)
    for b in businesses:
        b.bayesian = bayesian(b.rating * 2, b.rating_count, yp_low, yp_avg * 2)
    for p in places:
        p.bayesian = bayesian(p.rating * 2, p.rating_count, gp_low, gp_avg * 2)

    # Combine all lists into one
    full_list = []
    full_list.extend(venues)
    full_list.extend(businesses)
    full_list.extend(places)
    print 'Found {} total businesses!'.format(len(full_list))

    # Combine ratings of duplicates
    seen_addresses = set()
    filtered_list = []
    for business in full_list:
        if business.address not in seen_addresses:
            filtered_list.append(business)
            seen_addresses.add(business.address)
        else:
            # Find duplicate in list
            for b in filtered_list:
                if b.address == business.address:
                    # Average bayesian ratings and update source count
                    new_rating = (b.bayesian + business.bayesian) / 2.0
                    b.bayesian = new_rating
                    b.source_count = b.source_count + 1

    # Sort by Bayesian rating
    filtered_list.sort(key=lambda x: x.bayesian, reverse=True)

    # Write to .csv file
    with open('data.csv', 'w') as csvfile:
        categories = ['Name', 'Rating', 'Number of Ratings', 'Checkins', 'Sources']
        writer = csv.DictWriter(csvfile, fieldnames=categories)
        writer.writeheader()
        for venue in filtered_list:
            writer.writerow({'Name': venue.name.encode('utf-8'),
                             'Rating': '{0:.2f}'.format(venue.bayesian),
                             'Number of Ratings': venue.rating_count,
                             'Checkins': venue.checkin_count,
                             'Sources': venue.source_count})


if __name__ == '__main__':
    main()

Posted on 2015-12-11 21:38:12
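The function signature is:

bayesian(R, v, m, C)

But you then spend a good part of the docstring explaining what these single-letter parameters mean:

:param R: Average rating for this business
:param v: Number of ratings for this business
:param m: Minimum ratings required
:param C: Mean rating across the entire list

Descriptive code is usually preferable to descriptive comments or docstrings, for a simple reason: maintaining two things (code and comments) instead of one (code) doubles the maintenance work, and once the code and the comments drift out of sync, the result is very confusing.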
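
As a rough sketch of this suggestion, the same formula could be written with self-describing parameter names, so that the docstring no longer has to decode single letters. The names used here (bayesian_average, average_rating, rating_count, minimum_ratings, overall_mean) are my own choice for illustration and do not appear in the original code.

```python
def bayesian_average(average_rating, rating_count, minimum_ratings, overall_mean):
    """Weight a business's own rating against the mean of the whole list."""
    # Convert the counts once so integer inputs cannot cause integer division.
    rating_count = float(rating_count)
    minimum_ratings = float(minimum_ratings)
    total = rating_count + minimum_ratings
    # Like the original formula, this raises ZeroDivisionError if both counts are 0.
    own_weight = rating_count / total
    prior_weight = minimum_ratings / total
    return own_weight * average_rating + prior_weight * overall_mean
```

For example, bayesian_average(4.0, 10, 10, 3.0) gives both terms a weight of 0.5 and returns 3.5.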

names = set()
filtered_list = []
for business in full_list:
    if business.name not in names:
        filtered_list.append(business)
        names.add(business.name)
return filtered_list

becomes:

return list(set(full_list))

As far as I can tell, the code does not care about the order of the restaurants, so the fact that a set changes the order should not be a problem.
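
One caveat is worth sketching here: set() removes duplicates by equality and hash, so the set-based one-liner only deduplicates by name if the business objects define __eq__ and __hash__ in terms of name. If they do not, a dict comprehension keyed by name has a similar effect. The Business class below is a hypothetical stand-in for whatever the API wrappers return, not the real class.

```python
class Business(object):
    """Hypothetical stand-in for a search result; only `name` matters here."""

    def __init__(self, name):
        self.name = name


def remove_duplicate_names(full_list):
    """Return one business per distinct name."""
    # Later entries overwrite earlier ones, so the last business seen
    # for each name is the one that is kept.
    unique_by_name = {business.name: business for business in full_list}
    return list(unique_by_name.values())
```

Unlike the original loop, this keeps the last business seen for each name rather than the first, which only matters if the duplicates differ in other fields.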
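Input

Getting user input is a detail that we do not care about when reading main to understand the overall structure of the program, so just move it into a function.

while input_value is not 'n':
    lat = input('Lat: ')
    lng = input('Long: ')
    locations.append((lat, lng))
    input_value = raw_input('Would you like more points? (y/n) ')

In Python 2, input() automatically evaluates whatever is typed, and executing arbitrary user input is dangerous and widely regarded as bad practice. Use int(raw_input(x)) instead (or float(raw_input(x)) for the coordinates, which are rarely whole numbers).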
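
A minimal sketch of moving the prompt loop into its own function follows. The read parameter is an assumption added for illustration: it defaults to Python 3's input, which behaves like Python 2's raw_input, so on Python 2 you would pass raw_input instead. Note that this version keeps asking only while the answer is 'y', which differs slightly from the original loop that stops only on 'n'.

```python
def get_location_list(read=input):
    """Prompt for (lat, lng) pairs until the user declines to add more."""
    locations = []
    while True:
        # Parse explicitly instead of evaluating whatever the user typed.
        lat = float(read('Lat: '))
        lng = float(read('Long: '))
        locations.append((lat, lng))
        # Compare strings with == or !=, not with `is`.
        answer = read('Would you like more points? (y/n) ')
        if answer.strip().lower() != 'y':
            return locations
```

Passing the reader in as a parameter also makes the function easy to exercise without a terminal, for example by feeding it answers from an iterator.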
+ has many meanings in Python, and one of them is list concatenation:

full_list = []
full_list.extend(venues)
full_list.extend(businesses)
full_list.extend(places)

becomes:

full_list = venues + businesses + places

which is noticeably clearer.
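
For illustration, with made-up placeholder data: + builds a new list and leaves its operands untouched, whereas extend mutates the list it is called on.

```python
# Placeholder data; the real lists hold business objects.
venues = ['Bar A', 'Cafe B']
businesses = ['Diner C']
places = ['Pub D']

# Concatenation returns a new list; the three source lists are unchanged.
full_list = venues + businesses + places
print(full_list)
print(venues)
```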

Posted on 2015-12-11 22:30:24

In addition to Caridorc's good comments, I have a few remarks of my own:
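
In bayesian() you convert to float, but before that point you may already have been working with ints. The arguments you pass to this function are the result of some arithmetic, which may or may not have been integer operations (in Python 2, / between two ints truncates). You may want to enforce floats at an earlier level.

Regarding main(): I would split it into more functions, so that it reads something like this:

def main():
    locations = get_location_list()
    restaurants = execute_search(locations, SEARCH_ENGINES)
    rated_restaurants = calculate_restaurant_rating(restaurants)
    write_restaurants("data.csv", rated_restaurants)

    # Or the same thing as one ugly nested call...
    write_restaurants("data.csv",
                      calculate_restaurant_rating(
                          execute_search(get_location_list(), SEARCH_ENGINES)))

With the functions defined this way, your script can also be used as a module, one logical part at a time, so you can collect and manipulate the data to suit different needs. You can still run it as a script to do a single search.

https://codereview.stackexchange.com/questions/113644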