Newbie here.
I found this code on geeksforgeeks.org and ran it in VS Code, and I can't work out why I get the result I do. Source: https://www.geeksforgeeks.org/python-program-crawl-web-page-get-frequent-words/
# Python3 program for a word frequency
# counter after crawling/scraping a web-page
import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter

'''Function defining the web-crawler/core
spider, which will fetch information from
a given website, and push the contents to
the second function clean_wordlist()'''
def start(url):
    # empty list to store the contents of
    # the website fetched from our web-crawler
    wordlist = []
    source_code = requests.get(url).text
    # BeautifulSoup object which will
    # ping the requested url for data
    soup = BeautifulSoup(source_code, 'html.parser')
    # Text in given web-page is stored under
    # the <div> tags with class <entry-content>
    for each_text in soup.findAll('div', {'class': 'entry-content'}):
        content = each_text.text
        # use split() to break the sentence into
        # words and convert them into lowercase
        words = content.lower().split()
        for each_word in words:
            wordlist.append(each_word)
        clean_wordlist(wordlist)

# Function removes any unwanted symbols
def clean_wordlist(wordlist):
    clean_list = []
    for word in wordlist:
        symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/., "
        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)

# Creates a dictionary containing each word's
# count and top_20 occurring words
def create_dictionary(clean_list):
    word_count = {}
    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    ''' To get the count of each word in
    the crawled page -->
    # operator.itemgetter() takes one
    # parameter either 0 (denotes keys)
    # or 1 (denotes corresponding values)
    for key, value in sorted(word_count.items(),
                             key = operator.itemgetter(1)):
        print ("% s : % s " % (key, value))
    <-- '''
    c = Counter(word_count)
    # returns the most occurring elements
    top = c.most_common(10)
    print(top)

# Driver code
if __name__ == '__main__':
    url = "https://www.geeksforgeeks.org/programming-language-choose/"
    # starts crawling and prints output
    start(url)

I tried running it both in the console and in Visual Studio Code, but I don't get the same results as the article. According to the article, I should get this output:

('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)
Posted on 2022-10-06 19:38:13
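Open the page in your browser, right-click, and choose Inspect. Then click anywhere in the page source panel that opens at the bottom (or on the right) and press Ctrl+F. A search field will appear: enter //div[@class='entry-content'] and you will see that there are no results. Evidently the page structure has changed since that tutorial was published. What you can do is change this line:

for each_text in soup.find_all('div', {'class': 'entry-content'}):

to this:

for each_text in soup.find_all('div', {'class': 'text'}):

You will then get some results, based on the contents of those elements.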
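The check for a missing class can also be made from Python instead of the browser's search field. The following is a minimal sketch, not part of the original answer: it uses only the standard library's html.parser so it runs without extra installs, and the names DivClassCounter, div_class_counts and sample_html are hypothetical, with the sample markup standing in for the downloaded page.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Counts the class names that appear on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        for name, value in attrs:
            if name == 'class' and value:
                # A class attribute may hold several space-separated names.
                self.counts.update(value.split())


def div_class_counts(html):
    """Return a Counter mapping each <div> class name to its frequency."""
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample markup; not the real geeksforgeeks page.
sample_html = """
<div class="header">Menu</div>
<div class="text">Some article text</div>
<div class="text extra">More article text</div>
"""

counts = div_class_counts(sample_html)
# Is the class the tutorial relies on present at all?
print('entry-content' in counts)
# Which classes could be used instead?
print(counts.most_common())
```

To inspect the real page, pass requests.get(url).text to div_class_counts in place of sample_html. If 'entry-content' is absent from the result, the loop in start() never runs, which would explain the missing output.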
https://stackoverflow.com/questions/73978654