How to parse HTML text containing single and double quotes

Stack Overflow user
Asked on 2021-11-01 21:58:52
Answers 1 · Views 54 · Followers 0 · Votes 0

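I'm trying to build a scraper with Selenium for a web novel I want to read, but when I parse the HTML and write it to a file, the single and double quotes turn into diamonds with question marks. I've searched but couldn't find anything. I think it has something to do with Unicode, but I don't know much about it. Anyway, here is my code: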

Code language: python
url = 'https://parahumans.wordpress.com/2011/06/11/1-1/'
driver.get(url)

chapter_name = driver.find_element_by_class_name('entry-title')
print(chapter_name.text)

text_div = driver.find_element_by_class_name('entry-content')
text = text_div.find_elements_by_tag_name('p')

with open(os.path.join(os.path.dirname(__file__), path), 'w') as file:
   for paragraph in text[3:]:
       file.write(paragraph.text + '\n')

The output in the .txt file is:

Code language: text
Since the start of the semester, I had been looking forward to the part of Mr. Gladly�s World 
Issues class where we�d start discussing capes.  Now that it had finally arrived, I couldn�t 
focus.  I fidgeted, my pen moving from hand to hand, tapping, or absently drawing some figure 
in the corner of the page to join the other doodles.  My eyes were restless too, darting from 
the clock above the door to Mr. Gladly and back to the clock.  I wasn�t picking up enough of 
his lesson to follow along.  Twenty minutes to twelve; five minutes left before class ended.

1 Answer

Stack Overflow user

Answered on 2021-11-01 22:53:29

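Here you go, my friend. This scrapes every chapter of the web serial and saves it all to a file called Worm.txt, which you can rename to whatever you like. I also included a progress bar using tqdm (a third-party package, like requests and bs4) so you can watch the progress. There are more than 300 chapters and each one takes roughly 1 second to scrape, so expect it to take at least 5 minutes, but that is still much faster than using Selenium.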

Code language: python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Fetch the table of contents and collect the chapter links.
toc = requests.get("https://parahumans.wordpress.com/table-of-contents/")
soup = BeautifulSoup(toc.text, "lxml")
text_div = soup.find(class_="entry-content")
links = text_div.find_all("a", href=True)[:-2]  # drop the last two links

# Write as UTF-8 so curly quotes are not mangled by the platform's default encoding.
with open("Worm.txt", "w", encoding="utf-8") as f:
    for link in tqdm(links):
        page = requests.get(link["href"])
        soup = BeautifulSoup(page.text, "lxml")
        text_div = soup.find(class_="entry-content")
        # Skip the first three paragraphs, as in the question's code.
        for paragraph in text_div.find_all("p")[3:]:
            f.write(paragraph.text + "\n")
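
As for the diamonds themselves: they look like U+FFFD, the Unicode replacement character. A likely cause, assuming the file was written on a system whose default text encoding is cp1252 (the question does not say), is that the page's curly apostrophes were saved as single cp1252 bytes and the file was then opened by a viewer expecting UTF-8. The sketch below reproduces that under those assumptions; the helper name `write_and_read_back` and the sample sentence are made up for illustration.

```python
import os
import tempfile


def write_and_read_back(text, write_encoding):
    """Write text with the given encoding, then decode the raw bytes as UTF-8,
    the way a UTF-8 text viewer would."""
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    with open(path, "w", encoding=write_encoding) as f:
        f.write(text)
    with open(path, "rb") as f:
        raw = f.read()
    # Invalid byte sequences become U+FFFD, which renders as a diamond with "?".
    return raw.decode("utf-8", errors="replace")


# \u2019 is the curly apostrophe used on the page instead of a plain '.
sample = "Mr. Gladly\u2019s World Issues class"

garbled = write_and_read_back(sample, "cp1252")  # assumed platform default
clean = write_and_read_back(sample, "utf-8")     # explicit UTF-8

print(garbled)
print(clean)
```

If that is what happened, passing `encoding='utf-8'` to the `open(...)` call in the question's Selenium code should be enough to keep the quotes intact.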
Votes 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/69803214
