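I am very new to web scraping (I know almost nothing about HTML, and this is my first time using BeautifulSoup), and I am writing a program that will let me generate PDFs or EPUBs of novels published online. I am not worried about compatibility with a wide range of websites, since I am only making this for myself. I wrote code that, starting from a link to any chapter of a web novel, collects the links to all of its chapters into a list, but it takes a very long time: about one second per link. Given that some novels have one to two thousand chapters, that is roughly half an hour just to collect the links, and the program has not even fetched the body of each chapter and compiled it into a PDF yet. Is there any way to make this code faster?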
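import requests
from bs4 import BeautifulSoup


def list_chapters():
    given_chapter = 'https://www.box-novel.com/novel/cannon-fodder-counterattack-system/chapter-4-1/'
    # find_first_chapter is not shown in this snippet
    current_chapter = find_first_chapter(given_chapter)
    print("Starting chapter: ", current_chapter)
    link_list = []
    try:
        while True:
            link_list.append(current_chapter)
            r = requests.get(current_chapter)
            soup = BeautifulSoup(r.content, 'html.parser')
            # Follow the "next chapter" link; s is None once there is no nav-next div
            s = soup.find('div', class_='nav-next')
            for link in s.find_all('a'):
                current_chapter = link.get('href')
    except AttributeError:
        # Raised by s.find_all when s is None; drop the last (non-chapter) link
        link_list.pop(-1)
    print(len(link_list), "chapters detected.")

Please also let me know about other ways to improve my code. Note: I pop the last value in the link list because that is easier than detecting that, in the last chapter, the nav-next link points to the manga info page. Also, ignore the random trashy novel link I used; it was the shortest link I could find on the front page.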
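Posted on 2022-08-13 10:53:12
Your task is not trivial. First, the links to all chapters are loaded by an AJAX POST request made from the entry-point page. Once that is sorted out, you need a robust asynchronous solution, by which I mean one that could handle a list of a billion links and still run on a Raspberry Pi (so you need some notion of a queue). The following takes about 10 seconds and returns a dataframe containing the title and content of each of the novel's 90 chapters (you can sort by title if you wish):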
import asyncio
from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## run this if you're executing the code in a notebook
import nest_asyncio
nest_asyncio.apply()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

#### setup some sort of mock persistence ###
big_df_list = []

#### async scrape funcs ####
def all_chapters_urls():
    # One synchronous POST to the site's AJAX endpoint returns the chapter dropdown
    url_list = []
    payload = {
        'action': 'manga_get_reading_nav',
        'manga': '1987979',
        'chapter': 'chapter-29-7',
        'volume_id': '0',
        'type': 'content'
    }
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        r = client.post('https://www.box-novel.com/wp-admin/admin-ajax.php', data=payload)
        soup = BeautifulSoup(r.text, 'html.parser')
        links = soup.select_one('select.c-selectpicker.selectpicker_chapter.selectpicker.single-chapter-select').select('option')
        for l in links:
            url_list.append(l.get('data-redirect'))
    return url_list

async def get_chapters(url):
    # Fetch one chapter and store its (title, text) in the shared list
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = BeautifulSoup(r.text, 'html.parser')
            title = soup.select_one('h1#chapter-heading').get_text(strip=True)
            text_content = soup.select_one('div.text-left').get_text(strip=True)
            big_df_list.append((title, text_content))
        except Exception as e:
            print(url, e)

async def scrape_chapters():
    start_time = datetime.now()
    # Queue of not-yet-started coroutines, drained by a fixed pool of workers
    tasks = asyncio.Queue()
    for x in all_chapters_urls():
        tasks.put_nowait(get_chapters(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()

    # 20 workers means at most 20 requests in flight at any time
    await asyncio.gather(*[worker() for _ in range(20)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('chapters scraping took', duration)

asyncio.run(scrape_chapters())
df = pd.DataFrame(big_df_list, columns=['Chapter', 'Content'])
print(df)

This prints the following in the terminal:
chapters scraping took 0:00:10.991827
Chapter Content
0 Cannon Fodder Counterattack System - Chapter 30.1 The power of gossip was never been underestimated. Huang Dezheng’s reputation for kind and charismatic was far-reaching. His neighbours recognized him. The original impression of him was quite good, but he did not expect that he would be well-known not only in public but also in private. Especially messing about with your own students!Seeing his white and tender student being dragged by him, notice the way he couldn’t even walk properly. Hehe! What a scumbag!The gossipy neighbours recalled the scene they saw through their door’s peepholes and were still amazed. There was no way. At that time, the two of them were getting intimate, there was still energy to pay attention to whether the door was open, wasn’t there?Huang Dezheng did not notice this little detail when he left with Su Yibai in anger. The time he realized this, it was already several days later.The campus forum calmness of the past was swept away with an earthquake. The entire page layout was filled by posts with similar titles! Among them, the top one was the most eye-catching and popular!“During the 18th of August, School grass[1] Su and Teacher Huang’s cohabitation dog blood drama, here are the pictures and truth”Huang Dezheng, who was passing by his colleague’s computer, inadvertently caught a glimpse of this thick red line of words, and his heart jerked. He quietly held his breath as he returned to his office. His face paled as he entered into the forum he had previously scorned. With trembling hands, he opened the very hot post.“It is said that the landlord was shocked when he heard this. He was not familiar with the school, but the teacher Huang’s reputation in the school was very good. How could it be that he did not close the door and even did it with a student? What a scum?! But there are pictures of the truth, so it was not nonsense, the pictures are linked below.”“Fu*k! 
It turned out to be true!!!”“The soft and cute school grass together with the male god! Look at the hickey on the neck! Fu*k! It’s too intense! Teacher Huang bao dao wei lao[2]!!”“After examining the pictures, it truly hasn’t been photo-shopped… Fu*k! What a scumbag!!”“It should be true… School grass Su never returned to the dormitory and stayed outside, so it turned out…”“To help the landlord add fire, the photos were taken by a friend who went to the nightclub to play”” It turns out that Su Xuedi[3] is like this in private! Look at the half-covered chest, the creamy thighs! No wonder Teacher Huang This white flower has a half-covered chest and a chest, and the trough is still pink!! No wonder Huang teacher doesn’t love Jiangshan beauties!!”“Wow, there’s a reason the number of people who never go to class is so high. With these two pictures, it seems like our Su Xuedi’s eyes are not very good!”“…”Huang Dezheng looked at the increasingly unsightly text and pictures on the computer screen, his whole body was shaking in anger!Who was it?! Who did he offend for him to be framed so viciously?!He immediately left a message asking the moderator to delete the post, but it didn’t take long for the message that didn’t hide his identity to completely detonate the entire forum!Fu*k the person involved actually appeared!!!The forum was boiling with this additional drama and Huang Dezheng got so angry that his liver began to ache. Not only were the posts not deleted, but his message was even re-posted with screenshots!These students were really shameless![1]School grass: most handsome guy in school. For the opposite gender it would be school flower.[2] Bao dao wei loa: Old but still vigorous. I think that explains it.[3] Xuedi: junior or younger male school mate.(Visited 1 times, 1 visits today)
1 Cannon Fodder Counterattack System - Chapter 29.7 Qin Shiyue rushed back to the house without saying a word, he was tempted to blow up, but he was afraid of hurting the stupid rabbit, so he kept suppressing it.Ye Si Nian also did not say a word, and when he got home, he went into the bathroom without saying anything.The more he thought about the more frustrated he was! Qin Shiyue was tense like a trapped beast as he moved about in the study. The desk was already in chaos, and there were scattered documents on the floor.Just as his anger was reaching the apex, the study door was opened, and the stupid rabbit who had just taken a bath with a towel around his body leisurely walked in.His body was covered with a thin layer of tight and well-proportioned muscles. The skin was fair and smooth, the waist, thin but not weak. At first glance, it was full of explosive power.His eyes glided uncontrollably as he observed the man’s movement. Qin Shiyue was frozen in place, his heart almost stopped beating, and a thought flashed in his mind flashed that allowed him to recover his heartbeat whose speed soared to the limit.Ye Si Nian was getting closer and closer, and Qin Shiyue, who only had a theoretical experience, wanted to step forward into his (Ye Si Nian’s)arms, but Qin Shiyue’s brain was blank, and he didn’t know where to start…Intensely attracted to his lover who was stunned, he pressed his naked and exposed skin on the man’s thin shirt and gently rubbed on them.The man’s reaction was very interesting. Ye Si Nian pursed his lips and pushed the man slightly on his shoulder to make him sit down on the large chair.Smiling as Qin Shiyue raised his head to look up at him, Ye Si Nian’s index finger hooked up his chin and he bent to kiss the tense tightly-close thin lip.Effortlessly prying his lover’s lips open, Ye Si Nian invaded his soft tongue constantly wreaking havoc in Qin Shiyue’s mouth. 
He licked and played with Qin Shiyue’s sensitive mouth before his lover finally reacted.The breathing became more intense, his lover’s strength also increased, Ye Si Nian hummed and pulled away from Qin Shiyue’s mouth and gently licked his lower lip.“I want you, Qin Shiyue.”Looking at his lover’s suddenly large eyes, Ye Si Nian smiled smugly, kissing his earlobe and licking his ears he murmured slowly, “I want you… Qin Shiyue… I want you……”If one could hold back at this time, would he still be a man?!!Qin Shiyue slammed down Ye Si Nian’s thin waist, suppressing his desire. His voice was hoarse with craving, “Stupid rabbit, do you know that you are playing with fire?!”Ye Si Nian raised an eyebrow and replied to the question with action instead.(Visited 1 times, 1 visits today)
2 Cannon Fodder Counterattack System - Chapter 29.8 With his long leg stretched, Ye Si Nian sat on Qin Shiyue’s lap, lowering his head to nibble on his throat, he felt his slight trembling and repressed gasp. He flexibly untied his clothes and put his hands on the well-defined chest.No longer be a manShiyue made a beast-like roar and kissed Ye Si Nian’s fragile neck hard. The hands clinging behind him tore open Ye Si Nian’s towel.=======================The next afternoon Ye Si Nian sat up in bed sourly and examined the various traces all over his body. He was full of regrets.He really underestimated the enemy’s fighting power!The two personalities were frightening! They being virgins who were almost thirty years old was also dreadful! The combination of the two resulted in being tossed from yesterday afternoon to this morning was scary!When Qin Shiyue and Pei Yiyuan took turns in battle, who said that having a double personality was amazing? !!Complaining in his heart, Ye Si Nian saw the door being pushed open, and Pei Yiyuan came in with a gentle smile like a spring breeze.“Woken up? Are there any uncomfortable place in your body?” Pei Yiyuan went near the bed and knelt on one knee as he reached out and placed Ye Si Nian into his arms.“No.” Ye Si Nian gave a serious thought about it. He felt that the communication last night was really hearty and he enjoyed himself. It was normal for the muscles to be sore, and it was obvious that he was clean and dry now, so he decided to praise instead, “I felt very good last night!”“It will get better in the future!” The performance of the first time last night was affirmed. Pei YiYuan felt a little proud in his heart. He bowed to kiss Ye Si Nian’s lips. “Yes, Qin Shiyue wanted me to ask how you intend to deal with those two?”Speaking about the incident, the second personality was embarrassed to come out himself to ask. 
Ye Si Nian’s lips twitched and said: “I decided to sell the apartment.”“That’s it?” Pei Yiyuan raised his eyebrows, he also had no good feelings for the two people.“Don’t underestimate the power of gossip…” Ye Si Nian shook his head with a smile and said, “Otherwise, you just wait and see! Without me, they are well able to kill themselves!”“Then I’ll wait and see.” Pei Yiyuan’s arm wrapped around him as he lifted Ye Si Nian up to carry to the bathroom. He did not care and decided to change to a more important topic, “I just went out for a walk and bought your favourite. Porridge…”(Visited 1 times, 1 visits today)
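Because the workers finish in whatever order the responses arrive, the rows above are not in reading order (Chapter 30.1 is listed before 29.7 and 29.8). A plain alphabetical sort on the title would also misplace chapters such as 9.1 and 10.1, so a numeric sort key is one option. The sketch below is only an illustration: `chapter_sort_key` and `sort_chapters` are hypothetical helpers, and they assume every title contains a `Chapter <major>.<minor>` pattern like the ones printed above.

```python
import re

# Hypothetical helpers, not part of the answer's code. They assume titles look
# like "<novel name> - Chapter 29.7", as in the printed output.
CHAPTER_RE = re.compile(r"Chapter\s+(\d+)(?:\.(\d+))?")


def chapter_sort_key(title):
    """Return (major, minor) so chapters sort numerically; unmatched titles go last."""
    match = CHAPTER_RE.search(title)
    if match is None:
        return (float("inf"), 0)
    major = int(match.group(1))
    minor = int(match.group(2) or 0)  # a title like "Chapter 10" has no minor part
    return (major, minor)


def sort_chapters(rows):
    """Sort (title, content) tuples, such as the items of big_df_list, into reading order."""
    return sorted(rows, key=lambda row: chapter_sort_key(row[0]))
```

Calling `sort_chapters(big_df_list)` before building the DataFrame should put the rows in reading order, provided the titles really follow that pattern.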
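[...]
Posted on 2022-08-13 02:33:44
If a single request takes too long, we should fire off several requests at the same time!
How? Well, there are multiple options, but I will stick with the aiohttp library, which does what requests does, but asynchronously.
Here is an example of its use that I stole from another question: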
import asyncio
import aiohttp
import time

websites = """https://www.youtube.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn"""

async def get(url, session):
    try:
        async with session.get(url=url) as response:
            resp = await response.read()
            print("Successfully got url {} with resp of length {}.".format(url, len(resp)))
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))

async def main(urls):
    # One shared session; all requests are started at once and awaited together
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of len {} outputs.".format(len(ret)))

urls = websites.split("\n")
start = time.time()
asyncio.run(main(urls))
end = time.time()
print("Took {} seconds to pull {} websites.".format(end - start, len(urls)))

https://stackoverflow.com/questions/73341178
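With one to two thousand chapter URLs, launching every request at once with a bare `asyncio.gather` may overload the site or get the client throttled. One common way to cap the number of requests in flight is an `asyncio.Semaphore`. The sketch below is an illustration under stated assumptions rather than part of either answer: `gather_limited`, `fake_fetch` and `demo` are hypothetical names, the URLs are placeholders, and `fake_fetch` only sleeps instead of performing a real HTTP request, so a real fetch coroutine (for example one built on `session.get`) would have to be substituted.

```python
import asyncio


async def gather_limited(urls, fetch, limit=20):
    """Run fetch(url) for every URL with at most `limit` calls in flight.

    Results come back in the same order as `urls`, because asyncio.gather
    preserves the order of the awaitables it is given.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:  # waits here while `limit` fetches are running
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real request: sleeps briefly, returns a string."""
    await asyncio.sleep(0.01)
    return "body of " + url


def demo():
    # Placeholder URLs for illustration only.
    urls = ["https://example.com/chapter-{}".format(i) for i in range(1, 6)]
    return asyncio.run(gather_limited(urls, fake_fetch, limit=2))
```

Because the results keep the order of the input list, this variant would also avoid having to re-sort the chapters afterwards, at the cost of holding all results in memory until the last request finishes.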