My code is shown below. Why does brand end up holding External_links instead of the list of items I extracted?
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/Harry_Potter'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
headline = page_soup.findAll("span",{"class":"mw-headline"})
for item in headline:
    brand = item["id"]  # Outputs "External_links"
Posted on 2018-07-22 03:12:40
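Inside the for loop you iterate over every heading on the page and assign its id to the variable brand. Each assignment overwrites the previous one, so when the loop finishes, brand holds the id of the last heading ("External_links").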
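The same effect can be reproduced without any scraping. The sketch below is only an illustration: `ids` is a hypothetical hard-coded list standing in for the scraped id values, and `brands` is a name introduced here for the list version.

```python
# Hypothetical stand-in for the scraped heading ids (not fetched from the page).
ids = ["Plot", "Themes", "External_links"]

# Assigning inside the loop rebinds the name on every pass,
# so only the value from the final iteration survives.
for item in ids:
    brand = item

# Appending to a list keeps every value instead of replacing it.
brands = []
for item in ids:
    brands.append(item)

print(brand)   # External_links
print(brands)  # ['Plot', 'Themes', 'External_links']
```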
If you change the code to print the id of each heading, you will see that you are already getting the values you want.
>>> for item in headline:
...     print(item["id"])
...
Plot
Early_years
Voldemort_returns
Supplementary_works
Harry_Potter_and_the_Cursed_Child
In-universe_books
Pottermore_website
Structure_and_genre
Themes
Origins
Publishing_history
Translations
Completion_of_the_series
Cover_art
Achievements
Cultural_impact
Commercial_success
Awards,_honours,_and_recognition
Reception
Literary_criticism
Social_impact
Controversies
Adaptations
Films
Spin-off_prequels
Games
Audiobooks
Stage_production
Attractions
The_Wizarding_World_of_Harry_Potter
The_Making_of_Harry_Potter
References
Further_reading
External_links
Posted on 2018-07-22 11:53:49
Your brand variable needs to be a list. For example, the code could look like this:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from pprint import pprint
my_url = 'https://en.wikipedia.org/wiki/Harry_Potter'
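with uReq(my_url) as uClient:
    page_html = uClient.read()
# The page is HTML, so parse it with an HTML parser rather than "xml".
page_soup = soup(page_html, "html.parser")
brand = []  # Collect every heading id instead of overwriting a single variable
for item in page_soup.find_all('span', {'class': 'mw-headline'}):
    brand.append(item["id"])
pprint(brand)
Prints: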
['Plot',
'Early_years',
'Voldemort_returns',
'Supplementary_works',
'Harry_Potter_and_the_Cursed_Child',
'In-universe_books',
'Pottermore_website',
'Structure_and_genre',
'Themes',
'Origins',
'Publishing_history',
'Translations',
'Completion_of_the_series',
'Cover_art',
'Achievements',
'Cultural_impact',
'Commercial_success',
'Awards,_honours,_and_recognition',
'Reception',
'Literary_criticism',
'Social_impact',
'Controversies',
'Adaptations',
'Films',
'Spin-off_prequels',
'Games',
'Audiobooks',
'Stage_production',
'Attractions',
'The_Wizarding_World_of_Harry_Potter',
'The_Making_of_Harry_Potter',
'References',
'Further_reading',
 'External_links']
Posted on 2018-07-22 14:01:49
The same result can be achieved with a list comprehension:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url = 'https://en.wikipedia.org/wiki/Harry_Potter'
soup = BeautifulSoup(requests.get(url).text, "lxml")
items = [item.get('id') for item in soup.find_all('span', class_='mw-headline')]
pprint(items)
https://stackoverflow.com/questions/51461701