I am trying to scrape a list of breweries, along with their latitude and longitude, from https://www.brewbound.com/breweries. This is the part of the page source I am interested in:
<script>
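var locations = [['Wolf Pack Brewing Company', 44.6620529, -111.0994608, '/breweries/Wolf_Pack_Brewing_Co'],['Defiant Brewing Company', 41.0584046, -74.022847, '/breweries/Defiant_Brewing_Co'],
followed by a long list of further breweries. Each brewery is enclosed in [] with its name, lat, long, and website. What I want to do is collect var locations and create a DataFrame from it, with one row per brewery and the listed information as columns.
I have been able to scrape everything inside <script> tags (which covers several parts of the page). I don't know what the next step should be.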
import requests
from bs4 import BeautifulSoup

url = "https://www.brewbound.com/breweries"
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
script = html_soup.find_all('script')
This is the code I used to get all the <script> tags.
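Posted on 2019-06-11 01:08:42
BeautifulSoup won't help you with the contents of a <script> tag, because it parses HTML rather than JavaScript. However, you can use re and ast.literal_eval to extract the information: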
import re
import requests
from ast import literal_eval
from pprint import pprint
url = "https://www.brewbound.com/breweries"
r = requests.get(url)
l = literal_eval(re.search(r'var locations = (\[.*?\]);', r.text, flags=re.DOTALL)[1])
pprint(l)
Prints:
[['Wolf Pack Brewing Company',
44.6620529,
-111.0994608,
'/breweries/Wolf_Pack_Brewing_Co'],
['Defiant Brewing Company',
41.0584046,
-74.022847,
'/breweries/Defiant_Brewing_Co'],
['El Toro Brewing Company',
37.1465525,
-121.6219873,
'/breweries/El_Toro_Brewing_Co'],
['Sebago Brewing Company',
43.679212,
-70.396424,
'/breweries/Sebago_Brewing_Co'],
...etc.
https://stackoverflow.com/questions/56530509
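The question asks for a DataFrame, while the snippet in the answer stops at a list of lists. A minimal sketch of the remaining step follows, assuming pandas is installed. The `html` string is a stand-in for `r.text` so the example runs offline, and the column names `name`, `lat`, `long`, and `website` are assumptions taken from the question's description, not from the site. Note that `literal_eval` only works here because this JavaScript array happens to be a valid Python literal too.

```python
import re
from ast import literal_eval

import pandas as pd

# Stand-in for r.text: a small fragment shaped like the page source in the question.
html = """
<script>
var locations = [['Wolf Pack Brewing Company', 44.6620529, -111.0994608, '/breweries/Wolf_Pack_Brewing_Co'],['Defiant Brewing Company', 41.0584046, -74.022847, '/breweries/Defiant_Brewing_Co']];
</script>
"""

# Same pattern as in the answer: capture the array literal up to the closing "];".
match = re.search(r'var locations = (\[.*?\]);', html, flags=re.DOTALL)
if match is None:
    raise ValueError("'var locations' not found; the page layout may have changed")

# Parse the captured text into a Python list of lists.
rows = literal_eval(match.group(1))

# Column names are assumed from the question (name, lat, long, website).
df = pd.DataFrame(rows, columns=['name', 'lat', 'long', 'website'])
print(df)
```

With the live page, pass `r.text` in place of `html`; each inner list becomes one row.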