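I want to scrape some data from a website that uses JavaServer Pages and is protected by a login.
The problem is that the login page is generated dynamically. At first I could not log in because I could not even load the login page. The login page URL looks like https://xxxx.xxxxxxx.edu.au/login/pages/login.jsp. Here is my Python code: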
import urllib.request as req
import bs4

def print_HTML(url):
    # Send a browser-like User-Agent header with the request.
    request = req.Request(url, headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"})
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    html = bs4.BeautifulSoup(data, "html.parser")
    print(html.prettify())

Here is the output:
<head>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="0" http-equiv="expires"/>
<noscript>
<meta content="1;url=https://my.xxxxxxx.edu.au/studentportal/faces/home" http-equiv="refresh"/>
</noscript>
<script type="text/javascript">
function delayedRedirect(){
window.location = "https://my.xxxxxxx.edu.au/studentportal/faces/home";
}
</script>
<title>
Login redirect page
</title>
</head>
<body onload="setTimeout('delayedRedirect()', 1000)">
<i>
Redirecting to... https://my.xxxxxxx.edu.au/studentportal/faces/home
</i>
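</body>
After that, I went back to the last page before the login, which looks like https://xxxxxx.xxx.xxxxxxx.edu.au, and printed its HTML. I found that the href leading to the login page is https://xxxxxx.xxx.xxxxxxx.edu.au/login/saml. However, when I try to print that page, I get: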
<html>
<head>
<base target="_self"/>
</head>
<body onload="document.myForm.submit()">
<noscript>
<p>
JavaScript is required. Enable JavaScript to use OAM Server.
</p>
</noscript>
<form action="https://auth.xxxxxxx.edu.au/login/pages/login.jsp" method="post" name="myForm">
<!------------ DO NOT REMOVE ------------->
<!----- loginform renderBrowserView ----->
<!-- Required for SmartView Integration -->
<input name="contextType" type="hidden" value="external"/>
<input name="username" type="hidden" value="string"/>
<input name="OverrideRetryLimit" type="hidden" value="6"/>
<input name="password" type="hidden" value="secure_string"/>
<input name="challenge_url" type="hidden" value="https%3A%2F%2Fauth.unimelb.edu.au%2Flogin%2Fpages%2Flogin.jsp"/>
<input name="request_id" type="hidden" value="1031689933436939677"/>
<input name="authn_try_count" type="hidden" value="0"/>
<input name="OAM_REQ" type="hidden" value="VERSION_4~Dx0y9HYwplTsrfWQuqCU5Y2hQlk96FnIkBSXmLxTfuyLy0XUtqGK20TF4Z7nTGFfHouR5m7KmcK96in%2f670EPaaukVhyOLld36hlyZe4ZtPW9Bvz%2bs%2fN%2fQXgcBw5z5ppJksT6HckJtxSI1TSWL5fPHKBjCQk0MuIzrxmH%2b%2fP4NnoeL73NCL4mCLoIu6NrPQ8q28kYR8Gi2Qh9i1mqOtr1QXl%2bXzeAXMS6ShA307odSH%2fT1GzsEcxTEEPKd7JLXUd8Z28iQM4t5PyVQJVHqiqTgyVxvFgiPlsrs%2bBb%2bhJ1tmCyvuPPsCc9cOsX7p1Jg0gHZkoRJjxrbYhXKVqJvAj9HhBve5zI6Hs73m6YyKyWgztO3gmlj5clBHMAzEY5EJ4MU8OojP6fxdd5cRL2GQPUQ6cGk9IV4HOSV2SPCaKdzkXGt5DwLXnMLsx3AJpiPEXptSns%2fDm%2fzPcnWbtD%2fZrFKgM%2b6hatFtlsFPk65N0fbNu1T5FMGVNioqIVBbkdcNyEHyoPmioCBXb9eB5KWXdGDudLApKy0nVdLjrYE14hRDwZstX8SkpqvKhjKB5JeiWCKuPvPe%2bWFg6ZcVftSj3UuaNaH%2f4Wst4suXGKq9t8di2e1kbJAV5pBamxkwVKrHJ9cz%2boJzqgJ5Cx6s1dxb%2brHBxTw6VJ%2f9otIlaplxNvKwilRUOhXqgoGVJxsVp5z9BDdnWt%2fzgjK8Rxq6qtQt1LfmM5pSdNB3Rn%2b6Q3S0kgofs7goOr%2bEqo1Fc3kTxn%2ffMjvASU%2fdYwFuVafahA4lkgplHT7986SdHt8V1A5dLLRSdX8PgwHMd4XlJHYEkw1Neeoog%2fG7Lq%2fysG%2bfDc5rCvjoj0gLZy%2fowUhgqYwaZvfNGLNkH7H802e0bP59Ms3IU605%2f9o7in%2bS1u3ZE3PnNabP1pu0somVqcRxz8hxOEkRbRLHZwYB%2fTNvAalywCAZ9sCwweH8tU0oFAuXwWdUDuviq8Hz%2bBWwhHEJkSfv%2b100lgRBlX6p%2b9HJYW4cqgcXU1oT%2f8qBywYHw1Ap6DmZb6L0S7MUNw%2ft8%2fg%2bO5NwGRbrOjlV0cQ4tCEU3ehZiEnXwuunjVOAfjjiyACjkfstnY7vSsFbcWEeBwtvZIW2RXFFV2qYPaS6iqZxlt0fWpV2VvL%2fb9BipKOgtJxFigvnsSa5a9THBrlBM%2byA1pNNI2dm3s18Fx68z0oIQhNDEVVx7Q7oOl5TBdUxYgU7uWrkqtKf%2flxvGrsKEmhdWModmOIiYKq2I4U6KcYmN2fogi7neh6t%2fZbg7%2bMQ%2fvQAeVOrKpJWB558DXm0qDW65msxQgmwhg6ct7D29iSOVDyLGpnrMAw5QU%2fB7jwx5OinbJ83UyGCJqTm0T9%2fm9fAq4ofjQ0p3YV62iokrCC0E0ZR7GBh6%2bFaaElOSdoL1nxdVJN2KNXTuwFgg8iK5%2fPVcoYgLsCRXGq0Dutwaf%2fp6UgjdTKHz5y0W4DO3ZTsPF4jhhWUJ%2fvtG3slDJHN1EOb78ACnrAi5S4q109xFPqj8s5U835yUdaHIMFXxMpT2pWutWtbC39p09y9LXuwUM4obMutVmA5EvYvSLqPnu3KAiMGDttfbvmkA9AjSDvV6mAwv8k9urj%2bo%2fSQkFxNt3aUD4ymERZ7ksyjQbm2ud%2f15gFvfNizTRE6JsamIWO4UICJUX6Pr7A%3d%3d"/>
<input name="locale" type="hidden" value="en_AU"/>
<input name="resource_url" type="hidden" value="%252Fuser%252Floginsso"/>
</form>
</body>
</html>
Is there a way to make my Python program reach the .jsp page? Thanks.
Answered 2020-04-08 21:14:51
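Open devtools in your browser, click the Network tab, go to https://auth.xxxxxxx.edu.au/login/pages/login.jsp, and log in. Inspect what the browser sends in the POST request. If it is plain authentication, just copy the whole POST request, headers included, and put all of it into your own request.
Or, even simpler, use "Copy as cURL" in devtools and convert the result to requests code (for example at https://curl.trillworks.com/).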
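The page printed in the question is a form that the browser submits automatically through its `onload` handler; urllib does not run JavaScript, so that step has to be reproduced by hand. Below is a minimal sketch of that step only, using the standard library. The names `FormParser`, `extract_form`, and `follow_auto_submit` are made up for illustration, and the sketch assumes the server accepts the hidden fields exactly as they appear in the HTML. The credential POST that follows is site-specific and should be copied from devtools as described above.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class FormParser(HTMLParser):
    """Collect the action URL and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def extract_form(html):
    """Return (action, fields) for the first form found in the HTML text."""
    parser = FormParser()
    parser.feed(html)
    return parser.action, parser.fields


def follow_auto_submit(start_url):
    """Fetch start_url, then POST its auto-submit form as a browser would.

    Returns the cookie-aware opener (for later requests) and the HTML of
    the page the form leads to. Hypothetical helper, not tested against
    the site in the question.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]

    with opener.open(start_url) as response:
        base_url = response.geturl()  # final URL after any redirects
        action, fields = extract_form(response.read().decode("utf-8"))

    # Passing data= makes urllib send a POST; urlencode applies form encoding.
    body = urllib.parse.urlencode(fields).encode("ascii")
    target = urllib.parse.urljoin(base_url, action)
    with opener.open(target, data=body) as response:
        return opener, response.read().decode("utf-8")
```

Calling `follow_auto_submit("https://xxxxxx.xxx.xxxxxxx.edu.au/login/saml")` would, under these assumptions, return the HTML of the login.jsp page. Reusing the returned opener keeps the session cookies for the login request that comes next.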
https://stackoverflow.com/questions/61101314