I am looking to extract the links of all the pages found under "Next" and append them to a list. Please give me some pointers on this.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
sub_link=[]
sub_link_edit=[]
def convert(url):
    if not url.startswith('http://'):
        return 'http:' + url
    return url
next_link = 'https://money.rediff.com/companies/groups/A'
while next_link:
    page = requests.get(next_link)
    soup = BeautifulSoup(page.content, 'html.parser')
    pagination_container_company = soup.find_all("table", class_="pagination-container-company")[0].text
    sub_link = re.search('href="(.*)">Next', pagination_container_company).group(1)
    sub_link_edit.append(convert(sub_link))
    next_link = convert(sub_link)
data_df = pd.DataFrame()
df = pd.DataFrame(
    {
        'Link': sub_link_edit
    })
data_df = pd.concat([data_df, df], sort=False)
print(df.shape)
tot_sub=len(sub_link_edit)
print(tot_sub)
data_df.to_csv('results_1.csv')

Posted on 2020-05-30 14:58:53
For this kind of problem you should probably use a web crawler framework, otherwise you won't be able to handle any JavaScript magic (e.g. content rendered on page load)... although that is probably not an issue in this case.
You also need to pass bs4 the content of the page, not the whole response object:
soup = BeautifulSoup(page.content)
And since you are already using bs4 to search for the table, why not use it to search directly for the link you are interested in? Something like:
sub_link = soup.body.findAll("a", text="Next")[0].get('href')

Posted on 2020-05-31 16:18:25
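The suggested selector can be checked offline against a small HTML snippet. Note the table markup below is only a guess at rediff's pagination structure, for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the pagination table (structure assumed).
html = """
<html><body>
<table class="pagination-container-company">
  <tr><td>
    <a href="//money.rediff.com/companies/groups/A">Prev</a>
    <a href="//money.rediff.com/companies/groups/A/2">Next</a>
  </td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all with text= matches the anchor's visible text,
# so this returns only the "Next" link, not "Prev".
next_href = soup.body.find_all("a", text="Next")[0].get('href')
print(next_href)  # //money.rediff.com/companies/groups/A/2
```

This avoids the regex-over-`.text` approach entirely: `.text` strips the tags, so `href="..."` is never present in the string the question's `re.search` runs over.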
import pandas as pd
import requests
from bs4 import BeautifulSoup
def convert(url):
    if not url.startswith('http://'):
        return 'http:' + url
    return url
company_name = []
company_link = []
company_link_edit=[]
sub_link = 'https://money.rediff.com/companies/groups/A'
while sub_link:
    page = requests.get(sub_link)
    soup = BeautifulSoup(page.content, 'html.parser')
    company_A_subpg1 = soup.find_all(class_='dataTable')
    for sub_tab in company_A_subpg1:
        temp = sub_tab.find('tbody')
        all_rows = temp.find_all('tr')
        for val in all_rows:
            a_tag = val.find('a', href=True)
            company_name.append(a_tag.text.strip())
            company_link_edit.append(convert(a_tag.get('href')))
    try:
        sub_link = soup.body.findAll("a", text="Next")[0].get('href')
    except IndexError:
        # No "Next" link on the last page, so stop paginating.
        break
    sub_link = convert(sub_link)
print(len(company_name), len(company_link_edit))
data_df = pd.DataFrame()
df = pd.DataFrame(
    {'Name': company_name,
     'Link': company_link_edit
    })
data_df = pd.concat([data_df, df], sort=False)
print(df.shape)

https://stackoverflow.com/questions/62098440
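As an aside, the `convert()` helper hard-codes `http:`. The standard library's `urllib.parse.urljoin` handles the same protocol-relative links while preserving the scheme of the base page; the URLs below are illustrative:

```python
from urllib.parse import urljoin

def convert(url):
    # Same behavior as the helper above: prefix protocol-relative links.
    if not url.startswith('http://'):
        return 'http:' + url
    return url

base = 'https://money.rediff.com/companies/groups/A'
rel = '//money.rediff.com/companies/groups/A/2'

print(convert(rel))        # http://money.rediff.com/companies/groups/A/2
print(urljoin(base, rel))  # https://money.rediff.com/companies/groups/A/2
```

`urljoin` keeps `https` (the scheme the site actually serves) and also resolves ordinary relative paths, which `convert()` would mangle.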