I've been asked to scrape PDFs from the site https://secc.gov.in/lgdStateList. There are 3 dropdown menus: one for State, one for District and one for Block. There are several states, each state has districts under it, and each district has blocks under it.
I tried to implement the code below. I can select a state, but something seems to go wrong when selecting a district.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

browser = webdriver.Chrome()
url = "https://secc.gov.in/lgdStateList"
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
for name_list in soup.find_all(class_='dropdown-row'):
    print(name_list.text)

driver = webdriver.Chrome()
driver.get('https://secc.gov.in/lgdStateList')
selectState = Select(driver.find_element_by_id("lgdState"))
for state in selectState.options:
    state.click()
    selectDistrict = Select(driver.find_element_by_id("lgdDistrict"))
    for district in selectDistrict.options:
        district.click()
        selectBlock = Select(driver.find_element_by_id("lgdBlock"))
        for block in selectBlock.options:
            block.click()

The error I get is:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lgdDistrict"]"}
(Session info: chrome=83.0.4103.106)

I need help crawling through the 3 menus. Any help/suggestions would be appreciated. Please let me know in the comments if anything needs clarifying.
Posted on 2020-06-21 23:21:17
This is where you can find the value of the different states. You can find the same within the District and Block dropdowns. Now, use those values in the payload to fetch the table you want the data from:
import urllib3
import requests
from bs4 import BeautifulSoup
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
link = "https://secc.gov.in/lgdGpList"
payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")
for items in soup.select("table#example tr"):
    data = [' '.join(item.text.split()) for item in items.select("th,td")]
    print(data)
```

Output produced by the script:
['Select State', 'Select District', 'Select Block']
['', 'Select District', 'Select Block']
['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)']
['BASANTPUR (93641)', 'BATURBARI (93642)', 'BELWA (93643)']
['BOCHI (93644)', 'CHANDRADEI (93645)', 'CHATAR (93646)']
['CHIKANI (93647)', 'DIYARI (93648)', 'GAINRHA (93649)']
['GAIYARI (93650)', 'HARIA (93651)', 'HAYATPUR (93652)']
['JAMUA (93653)', 'JHAMTA (93654)', 'KAMALDAHA (93655)']
['KISMAT KHAWASPUR (93656)', 'KUSIYAR GAWON (93657)', 'MADANPUR EAST (93658)']
['MADANPUR WEST (93659)', 'PAIKTOLA (93660)', 'POKHARIA (93661)']
['RAMPUR KODARKATTI (93662)', 'RAMPUR MOHANPUR EAST (93663)', 'RAMPUR MOHANPUR WEST (93664)']
['SAHASMAL (93665)', 'SHARANPUR (93666)', 'TARAUNA BHOJPUR (93667)']

You need to grab the number available within the parentheses next to each of the results above and then use it in the payload, sending another post request to download the pdf files. Make sure to run the script from inside a folder so that you can get all the files in there.
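Pulling the parenthesised number out of a label can be done with a small helper; a regex variant (my own naming, equivalent to the split-based extraction in the script) looks like this:

```python
import re

def extract_code(label):
    # Pull the numeric code out of a label such as "ARARIA BASTI (93638)".
    # Returns None when no parenthesised number is present (e.g. placeholders).
    match = re.search(r'\((\d+)\)', label)
    return match.group(1) if match else None

print(extract_code("ARARIA BASTI (93638)"))   # -> 93638
print(extract_code("Select Block"))           # -> None
```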
import urllib3
import requests
from bs4 import BeautifulSoup
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
link = "https://secc.gov.in/lgdGpList"
download_link = "https://secc.gov.in/downloadLgdwisePdfFile"
payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")
for item in soup.select("table#example td > a[onclick^='downloadLgdFile']"):
    gp_code = item.text.strip().split("(")[1].split(")")[0]
    payload['gpCode'] = gp_code
    with open(f'{gp_code}.pdf', 'wb') as f:
        f.write(requests.post(download_link, data=payload, verify=False).content)
```

https://stackoverflow.com/questions/62500126
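If you'd rather not hard-code stateCode/districtCode/blockCode, the codes can be scraped from the dropdown markup itself. A stdlib-only sketch; the exact `<option>` markup is an assumption, so check it against the real page source first:

```python
import re

def extract_option_codes(html):
    # Map option value -> visible text for a <select> fragment.
    # Assumes options look like <option value="10">BIHAR</option>;
    # verify against the actual page before relying on this.
    return {value: text.strip()
            for value, text in re.findall(
                r'<option\s+value="(\d+)"[^>]*>([^<]*)</option>', html)}

sample = ('<select id="lgdState"><option value="">Select State</option>'
          '<option value="10">BIHAR</option></select>')
print(extract_option_codes(sample))   # -> {'10': 'BIHAR'}
```

The placeholder option has an empty value, so the `\d+` in the pattern skips it automatically.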