How to parse an HTML table that is built from divs rather than a table element in Python

Stack Overflow user
Asked 2021-04-26 19:43:37
Answers: 1 · Views: 431 · Followers: 0 · Votes: 1

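I am trying to parse the table from this website. I started with the Username column and, with help from Stack Overflow, was able to get its contents with the following code: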

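from bs4 import BeautifulSoup

with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
    # Pass the raw text; str(file.readlines()) would hand the parser the repr of a list.
    soup = BeautifulSoup(file.read(), "html.parser")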

tiktok = []
for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
    tiktok.append(tag.text)

This gives me:

['addison rae',
 'Bella Poarch',
 'Zach King',
 'TikTok',
 'Spencer X',
 'Will Smith',
 'Loren Gray',
 'dixie',
 'Michael Le',
 'Jason Derulo',
 'Riyaz',
.
.
.

My end goal is to fill in the whole table with [Rank, Grade, Username, Uploads, Followers, Following, Likes].

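I have read several articles on parsing HTML tables in Python with BeautifulSoup and pandas, but none of them worked, because the page source does not define this data as a table; it is laid out with div elements instead. What are the alternatives for reading it as a table in Python?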


1 Answer

Stack Overflow user

Accepted answer

Posted 2021-04-26 19:58:02

You can use the following code to load the HTML from a file into a BeautifulSoup object and then parse the table into a DataFrame:

import pandas as pd
from bs4 import BeautifulSoup

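# Read the saved page and close the file afterwards.
with open("page.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

data = []
# Each table row is a div whose inline style contains one of these two
# strings (presumably the alternating row background colours).
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):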
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
    )


df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
    ],
)
print(df)
df.to_csv("data.csv", index=False)

Prints:

    Rank Grade           Username Uploads    Followers Following          Likes Interactions
0    1st   A++    charli d’amelio   1,755  113,600,000     1,210  9,200,000,000           --
1    2nd   A++        addison rae   1,411   79,900,000     2,454  5,100,000,000           --
2    3rd   A++       Bella Poarch     282   63,600,000       588  1,400,000,000           --
3    4th   A++          Zach King     277   58,800,000        41    723,400,000           --
4    5th   A++             TikTok     139   52,900,000       495    250,300,000           91
5    6th   A++          Spencer X   1,250   52,700,000     7,206  1,300,000,000           --
6    7th   A++         Will Smith      73   52,500,000        23    314,400,000           --
7    8th   A++         Loren Gray   2,805   52,100,000       221  2,800,000,000           --
8    9th   A++              dixie     120   51,200,000     1,267  2,900,000,000           --
9   10th   A++         Michael Le   1,158   47,400,000        93  1,300,000,000           --
10  11th    A+       Jason Derulo     675   44,900,000        12  1,000,000,000           --
11  12th    A+              Riyaz   2,056   44,100,000        43  2,100,000,000           --
12  13th    A+  Kimberly Loaiza ✨   1,150   41,000,000       123  2,200,000,000           --
13  14th    A+       Brent Rivera     955   37,800,000       272  1,200,000,000           --
14  15th    A+           cznburak   1,301   37,300,000         1    688,700,000           --
15  16th    A+           The Rock      42   36,200,000         1    200,300,000           --
16  17th    A+      James Charles     238   36,200,000       148    881,400,000           --
17  18th    A+          BabyAriel   2,365   35,300,000       326  1,900,000,000           --
18  19th    A+          JoJo Siwa   1,206   33,500,000       346  1,100,000,000           --
19  20th    A+              avani   5,347   33,300,000     5,003  2,400,000,000           --
20  21st    A+          GIL CROES     693   32,900,000       454    803,200,000           --
21  22nd    A+      Faisal shaikh     461   32,200,000        --  2,000,000,000           --
22  23rd    A+                BTS      39   32,000,000        --    557,100,000          255
23  24th    A+           LILHUDDY   4,187   30,500,000     8,652  1,600,000,000           --
24  25th    A+       Stokes Twins     548   30,100,000        21    781,000,000           --
25  26th    A+                Joe   1,487   29,800,000     8,402  1,200,000,000           --
26  27th    A+               ROD   1,792   29,500,000       536  1,700,000,000           --
27  28th    A+                 899   29,400,000       216  1,700,000,000           --
28  29th    A+       Kylie Jenner      69   29,400,000        14    318,800,000           --
29  30th    A+         Junya/じゅんや   2,823   29,000,000     1,934    533,800,000       12,200
30  31st    A+                 YZ     816   28,900,000       563    554,700,000           --
31  32nd    A+      Arishfa Khan   2,026   28,600,000        27  1,100,000,000           --
32  33rd    A+   Lucas and Marcus   1,248   28,500,000       158    806,500,000           --
33  34th    A+    jannat_zubair29   1,054   28,200,000         6    746,300,000           47
34  35th    A+     Nisha Guragain   1,751   28,000,000        33    756,300,000           --
35  36th    A+       Selena Gomez      40   27,800,000        17     82,300,000           --
36  37th    A+            Kris HC   1,049   27,800,000     1,405  1,200,000,000           --
37  38th    A+        flighthouse   4,200   27,600,000       488  2,300,000,000           --
38  39th    A+         wigofellas   1,251   27,500,000       812    707,200,000           --
39  40th    A+   Savannah LaBrant   1,860   27,300,000       155  1,400,000,000           --
40  41st    A+          noah beck   1,395   26,900,000     2,297  1,700,000,000           --
41  42nd    A+         Liza Koshy     155   26,700,000       104    321,900,000           --
42  43rd    A+   Kirya Kolesnikov   1,338   26,400,000        78    543,200,000           --
43  44th    A+        Awez Darbar   2,708   26,100,000       208  1,100,000,000           --
44  45th    A+       Carlos Feria   2,522   25,700,000       138  1,200,000,000           --
45  46th    A+       Kira Kosarin     837   25,700,000       401    447,000,000           --
46  47th    A+     Naim Darrechi   2,634   25,300,000       527  2,200,000,000           --
47  48th    A+      Josh Richards   1,899   24,900,000     9,847  1,600,000,000           --
48  49th    A+             Q Park     231   24,800,000         3    294,100,000           --
49  50th    A+       TikTok_India     186   24,500,000       191     40,100,000           --

It also saves the result to data.csv.
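To try the selector approach without the saved page, here is a minimal sketch on a small inline snippet. The markup in SAMPLE_HTML is an assumption that only mimics what the selectors above imply (row divs with a background colour in their inline style, and one direct child div per cell); the real page may be structured differently. The helper name parse_rows is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each "row" is a div with an inline background colour,
# and each "cell" is a direct child div of that row.
SAMPLE_HTML = """
<div id="wrapper">
  <div style="background: #fafafa;">
    <div>2nd</div><div>A++</div><div><a>addison rae</a></div>
  </div>
  <div style="background: #f8f8f8;">
    <div>3rd</div><div>A++</div><div><a>Bella Poarch</a></div>
  </div>
  <div style="background: #ffffff;"><div>not a data row</div></div>
</div>
"""


def parse_rows(html):
    """Return one list of cell texts per row div matched by the style selectors."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
        # recursive=False keeps only the direct children, i.e. the cells.
        cells = [c.get_text(strip=True) for c in row.find_all("div", recursive=False)]
        rows.append(cells)
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

The div with the #ffffff background is skipped because its style contains neither substring, which is how the selector separates data rows from the rest of the page.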

EDIT: To also get the username from the URL:

import pandas as pd
from bs4 import BeautifulSoup

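# Read the saved page and close the file afterwards.
with open("page.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

data = []
# Each table row is a div whose inline style contains one of these two
# strings (presumably the alternating row background colours).
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):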

    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
        # Last path segment of the row's first link, used as the URL username.
        + [div.a["href"].split("/")[-1]]
    )


df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
        "URL username",
    ],
)

print(df)
df.to_csv("data.csv", index=False)
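
Every column in df holds strings, since get_text returns text such as "52,900,000" or the "--" placeholder. If numeric columns are needed for sorting or arithmetic, one possible follow-up step is sketched below. It is not part of the original answer: the helper name to_number is hypothetical, and the small DataFrame is a stand-in built from two rows of the printed output.

```python
import pandas as pd

# Stand-in for the scraped DataFrame; values copied from the printed output.
df = pd.DataFrame(
    {
        "Username": ["TikTok", "Faisal shaikh"],
        "Uploads": ["139", "461"],
        "Followers": ["52,900,000", "32,200,000"],
        "Following": ["495", "--"],
        "Interactions": ["91", "--"],
    }
)


def to_number(series):
    """Drop thousands separators, then convert; non-numeric text such as '--' becomes NaN."""
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


numeric_cols = ["Uploads", "Followers", "Following", "Interactions"]
df[numeric_cols] = df[numeric_cols].apply(to_number)
print(df)
```

Using errors="coerce" trades strictness for convenience: any unexpected text silently becomes NaN, so it is worth checking the converted columns for missing values afterwards.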

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/67272879
