I want to get the href from the tag using the following code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import bs4
import requests
# url of website
my_url='https://www.metmuseum.org/art/collection/search#!?showOnly=openAccess&material=Bowls&era=8000-2000%20B.C.&offset=0&pageSize=0&perPage=20&sortBy=Relevance&sortOrder=asc&searchField=All'
# opening up connection by using uReq and store in variable
uClient=uReq(my_url)
page_html=uClient.read() # read website html and store in variable
uClient.close() # closing of connection
# parsing html page
page_soup=soup(page_html , "html.parser")
#info
con2=page_soup.find_all("a" , {"class" : "result-card__link js-advanced-form"})
for link in con2:
    href=link.get('href')
    print(href)

But I get this result: {searchResult.url}
Posted on 2019-10-28 07:48:43
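This is a dynamic page: the links are rendered client-side, so the static HTML you download still contains the template placeholder {searchResult.url} instead of a real href. The best way to get the results here is to access the data source directly through the API. I dump it into a dataframe, but you can do whatever you want with it; just call the relevant column (I'm not sure which url you want) to pull out the hrefs.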
import math

import pandas as pd
import requests

url = 'https://www.metmuseum.org/api/collection/collectionlisting'
per_page = 100

# Query parameters mirror the filters in the search page URL
payload = {
    'artist': '',
    'department': '',
    'era': '8000-2000 B.C.',
    'geolocation': '',
    'material': 'Bowls',
    'offset': '0',
    'pageSize': '0',
    'perPage': str(per_page),
    'searchField': 'All',
    'showOnly': 'openAccess',
    'sortBy': 'Relevance',
    'sortOrder': 'asc'}

# Fetch the first page to learn the total number of results
jsonData = requests.get(url, params=payload).json()
print('Acquired page 1...')
# pd.json_normalize needs pandas >= 1.0 (it replaces pandas.io.json.json_normalize)
frames = [pd.json_normalize(jsonData['results'])]

total_collections = jsonData['totalResults']
totalPages = math.ceil(total_collections / per_page)

# Fetch the remaining pages by advancing the offset
for page in range(1, totalPages):
    payload['offset'] = str(page * per_page)
    jsonData = requests.get(url, params=payload).json()
    print('Acquired page %s...' % (page + 1))
    frames.append(pd.json_normalize(jsonData['results']))

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat(frames, sort=True).reset_index(drop=True)

Output:
print (df.head(5).to_string())
accessionNumber artist culture date description galleryInformation image largeImage medium regularImage teaserText title url
0 36.1.117 ca. 3850–2960 B.C.\n Not on view https://images.metmuseum.org/CRDImages/eg/mobi... eg/web-large/36.1.117_EGDP010235.jpg Pottery eg/web-additional/36.1.117_EGDP010235.jpg <p>Date: ca. 3850–2960 B.C.\n<br/>Accession Nu... Bowl with flattened rim /art/collection/search/552185?&searchField=All...
1 1992.252.1 Japan None Accession Number: 1992.252.1 On view at The Met Fifth Avenue in <a href='ht... https://images.metmuseum.org/CRDImages/as/mobi... as/web-large/DP23088.jpg Earthenware with cord-marked and incised decor... as/web-additional/DP23088.jpg <p>Accession Number: 1992.252.1</p> “Flame-rimmed” deep bowl (kaen doki)\n\n /art/collection/search/44905?&searchField=All&...
2 33.4.41 ca. 3850–2960 B.C. Not on view https://images.metmuseum.org/CRDImages/eg/mobi... eg/web-large/33.4.41_EGDP011262.jpg Pottery eg/web-additional/33.4.41_EGDP011262.jpg <p>Date: ca. 3850–2960 B.C.<br/>Accession Numb... Deep bowl /art/collection/search/558199?&searchField=All...
3 12.181.38 ca. 3100–2649 B.C. Not on view https://images.metmuseum.org/CRDImages/eg/mobi... eg/web-large/12-181-38.jpg Travertine (Egyptian alabaster) eg/web-additional/12-181-38.jpg <p>Date: ca. 3100–2649 B.C.<br/>Accession Numb... Shallow bowl /art/collection/search/547548?&searchField=All...
4 99.4.55 ca. 3850–2960 B.C.\n Not on view https://images.metmuseum.org/CRDImages/eg/mobi... eg/web-large/99.4.55_EGDP010319.jpg Pottery eg/web-additional/99.4.55_EGDP010319.jpg <p>Date: ca. 3850–2960 B.C.\n<br/>Accession Nu... Shallow bowl /art/collection/search/552308?&searchField=All...

https://stackoverflow.com/questions/58581573
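
The values in the url column are site-relative paths, so to get usable hrefs you would probably want to join them onto the site root. Here is a minimal sketch of that step using only the standard library. The helper name `to_absolute`, the `BASE_URL` constant, and the sample paths are my own assumptions for illustration (the sample paths are shortened from the truncated output above, not real full values); with the DataFrame from the answer you would pass `df['url']` instead.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: site root taken from the URL in the question
BASE_URL = "https://www.metmuseum.org"


def to_absolute(relative_url, keep_query=True):
    """Join a site-relative path onto BASE_URL, optionally dropping the query string."""
    absolute = urljoin(BASE_URL, relative_url)
    if keep_query:
        return absolute
    parts = urlsplit(absolute)
    # Rebuild the URL with an empty query and fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical sample values; replace with df['url'] from the answer's code
sample_urls = [
    "/art/collection/search/552185?&searchField=All",
    "/art/collection/search/44905?&searchField=All",
]

links = [to_absolute(u, keep_query=False) for u in sample_urls]
for link in links:
    print(link)
```

With a DataFrame, the same helper could be applied as `df['url'].apply(to_absolute)`. Dropping the query string is optional; I did it here only because the search parameters do not seem necessary to identify an object page.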