
Parsing a web page

Stack Overflow user
Asked on 2022-02-07 13:22:29
Answers: 1, Views: 53, Followers: 0, Votes: -1

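I am trying to parse a web page and store the results in a table. However, I am stuck because the parsed result contains unwanted text at the top, in the middle, and at the bottom of the CSV: http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV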

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"


response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")

print(mysoup)

This gives:

<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09,  52442757,    0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10,  42779117,    0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06,  14502700,    0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12,  71953143,    0
...
  3, Fri,27,Feb,2004, 14,18,19,31,37,04,05,  11880304,    0
  2, Fri,20,Feb,2004, 07,13,39,47,50,02,05,  10111500,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>

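I just want to extract the 1500 rows of data into pandas and get rid of the text that appears at the start, in the middle, and at the end. Any help is much appreciated. Thanks!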

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-02-07 16:37:51

This should give you the desired result:

# Take the text inside the <pre> tag and split it into lines
lines = mysoup.pre.text.split("\n")

# Create a DataFrame with one line per row
df = pd.DataFrame(lines)

# Split column 0 on the separator ","
df = df[0].str.split(',', expand=True)

# Strip surrounding whitespace from each cell
df = df.apply(lambda row: row.str.strip(), axis=1)

# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]

# Set the column names based on the first row
df.columns = df.iloc[0]

# Drop the first row since it is the same as our column names
df = df.drop(df.index[0])

# Keep only rows whose first column is a draw number; this also drops the
# header line that is repeated in the middle of the page
df = df[df.iloc[:, 0].str.isdigit()]
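
As an alternative sketch, the data rows can be picked out by pattern before any table is built: a draw line begins with a draw number followed by a comma, while the title, the header lines, and the footer text do not. The `SAMPLE` text below is a shortened, hand-written stand-in for the `<pre>` contents shown in the question, and `extract_draws`, `DATA_LINE`, and `COLUMNS` are illustrative names that are not part of the original answer.

```python
import re

# Shortened stand-in for the <pre> text in the question (an assumption,
# not the full page). The long middle line mimics how the <hr/><b>...</b>
# notice and the repeated header would run together as plain text.
SAMPLE = """ Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
All lotteries below have exceeded the 180 days expiry dateNo., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.
"""

# Column names copied from the header line of the page.
COLUMNS = ["No.", "Day", "DD", "MMM", "YYYY",
           "N1", "N2", "N3", "N4", "N5", "L1", "L2", "Jackpot", "Wins"]

# A data line: optional leading spaces, a draw number, then a comma.
DATA_LINE = re.compile(r"^\s*\d+\s*,")


def extract_draws(pre_text):
    """Return one dict per draw line, with whitespace stripped from each cell."""
    draws = []
    for line in pre_text.splitlines():
        if not DATA_LINE.match(line):
            continue  # skip title, headers, notice, footer, and blank lines
        cells = [cell.strip() for cell in line.split(",")]
        draws.append(dict(zip(COLUMNS, cells)))
    return draws


draws = extract_draws(SAMPLE)
```

With the real page, `pre_text` would be `mysoup.pre.text` from the question's code, and `pd.DataFrame(draws)` would build the table. All values are still strings at that point, so numeric columns such as `Jackpot` would need converting, for example with `pd.to_numeric`.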
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/71019288
