I'm trying to scrape the content behind each hyperlink on the left side of this page. I can already scrape the content of a single hyperlink, so now I'm trying to run the script on every individual hyperlink on the page.
URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-3.html
I think what's needed is to make the url a dynamic variable inside a loop that iterates over all of the hyperlinks on the page above. I'm not sure that's the best approach, though, since this is my first project.
Any advice is much appreciated.
Below is the code I'm trying to adapt.
import csv
import requests
from bs4 import BeautifulSoup as bs
url = 'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')
table = soup.find(id="table_maina")
headers = []
datarows = []
#Get crypto address for the filename
item = soup.find('h1').text
newitem = item.replace('Dogecoin','')
finalitem = newitem.replace('Address','')
for row in table.find_all('tr'):
    heads = row.find_all('th')
    if heads:
        headers = [th.text for th in heads]
    else:
        datarows.append([td.text for td in row.find_all('td')])

with open(f'{finalitem}.csv', 'w', newline='') as f:
    fcsv = csv.writer(f)
    fcsv.writerow(headers)
    fcsv.writerows(datarows)

Posted on 2022-01-21 03:35:58
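One small simplification for the filename step above (a sketch, not part of the original code): since the address already appears as the last path segment of the page URL, it can be taken from the URL directly instead of parsing and cleaning the h1 text. The helper name `address_from_url` is my own.

```python
def address_from_url(url: str) -> str:
    """Return the last path segment of an address page URL.

    Assumes URLs of the form
    https://bitinfocharts.com/dogecoin/address/<address>.
    """
    return url.rstrip('/').rsplit('/', 1)[-1]

print(address_from_url(
    'https://bitinfocharts.com/dogecoin/address/DN5Hp2kCkvCsdwr5SPmwHpiJgjKnC5wcT7'))
```

This also avoids any stray whitespace that the string replaces on the h1 text can leave in the filename.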
A simple approach is to make the initial request and extract all of the links from the second column of the table.
Then loop over those links, making a request for each one and continuing with your existing code, except also handling the case where there is no table.
import csv
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-3.html')
    soup = bs(r.content, 'lxml')
    # Links to the individual address pages sit in the second column of the table
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        if table:
            # Reset per address so rows don't accumulate across files
            headers = []
            datarows = []
            # Derive the filename from the page heading
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin', '')
            finalitem = newitem.replace('Address', '')
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])
            with open(f'{finalitem}.csv', 'w', newline='') as f:
                fcsv = csv.writer(f)
                fcsv.writerow(headers)
                fcsv.writerows(datarows)
        else:
            print('no table for: ', url)

Source: https://stackoverflow.com/questions/70795634
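The answer above scrapes a single listing page; the same pattern can be extended across the whole top-100 list. This is a sketch under two assumptions of mine: that the first listing page has no numeric suffix while later pages end in `-2.html`, `-3.html`, and so on, and that a helper named `listing_url` is acceptable. Adding a short `time.sleep` between requests is also polite to the server.

```python
import time

BASE = 'https://bitinfocharts.com/top-100-richest-dogecoin-addresses'

def listing_url(page: int) -> str:
    """Build the URL for listing page `page` (1-based).

    Assumption: page 1 has no suffix; later pages use '-<n>.html'.
    """
    return f'{BASE}.html' if page == 1 else f'{BASE}-{page}.html'

# Usage sketch: collect address links from pages 1-4, reusing the
# session `s`, the `bs` import, and `address_links` from the answer above.
# for page in range(1, 5):
#     r = s.get(listing_url(page))
#     soup = bs(r.content, 'lxml')
#     address_links.extend(
#         i['href'] for i in soup.select('.table td:nth-child(2) > a'))
#     time.sleep(1)  # pause between requests to avoid hammering the site
```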