Q: Web page parsing

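Stack Overflow user

Asked 2022-02-07 13:22:29

Answers 1 · Views 53 · Followers 0 · Votes -1

I am trying to parse a web page and store the results in a table. However, I am stuck because the parsed result has unwanted text at the top, middle, and bottom of the CSV data. http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV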

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"


response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")

print(mysoup)

gives...

<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09,  52442757,    0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10,  42779117,    0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06,  14502700,    0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12,  71953143,    0
...
3, Fri,27,Feb,2004, 14,18,19,31,37,04,05,  11880304,    0
  2, Fri,20,Feb,2004, 07,13,39,47,50,02,05,  10111500,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>

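I just want to extract the 1500 rows of data into pandas and get rid of the text that appears at the beginning, middle, and end! Any help is much appreciated. Thanks!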

python

pandas

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-02-07 16:37:51

This should give you the expected result:

# Access the <pre> tag
mysoup = mysoup.pre.text.split("\n")

# Create a DataFrame object
df = pd.DataFrame(mysoup)

# Split column 0 based on the separator ","
df = df[0].str.split(',', expand=True)

# Strip each cell by using a lambda function
df = df.apply(lambda row: row.str.strip(), axis=1)

# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]

# Set the column names based on the first row
df.columns = df.iloc[0]

# Drop the first row since it is the same as our column names
df = df.drop(df.index[0])
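
One thing to watch for: the notice in the middle of the page sits on the same line as a repeated header, and that line contains commas, so it may survive the null check above as an extra row. A minimal sketch of a stricter filter follows, assuming the text of the <pre> tag looks like the output in the question once the tags are stripped. The names SAMPLE_PRE_TEXT, HEADER_PREFIX, DATA_LINE and extract_draws are made up for illustration, and the sample is abbreviated from the question so the snippet runs without a network call.

```python
import re

# Abbreviated stand-in for mysoup.pre.text (assumption: <hr/> and <b> tags are
# stripped, so the mid-page notice is glued to the repeated header line).
SAMPLE_PRE_TEXT = """ Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000,    1
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12,  96295864,    1
All lotteries below have exceeded the 180 days expiry dateNo., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07,  80768518,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
"""

HEADER_PREFIX = "No.,"                  # the first header line starts like this
DATA_LINE = re.compile(r"^\s*\d+\s*,")  # a data line starts with a draw number


def extract_draws(pre_text):
    """Return (header, rows), keeping only lines that start with a draw number."""
    header = None
    rows = []
    for line in pre_text.splitlines():
        if header is None and line.strip().startswith(HEADER_PREFIX):
            # Take the column names from the first clean header line only
            header = [cell.strip() for cell in line.split(",")]
        elif DATA_LINE.match(line):
            rows.append([cell.strip() for cell in line.split(",")])
    return header, rows


header, rows = extract_draws(SAMPLE_PRE_TEXT)
print(header)
print(len(rows), "draws kept")
```

If pandas is available, pd.DataFrame(rows, columns=header) should turn the result into a table. The cells are still strings at that point, so numeric columns such as Jackpot would need converting, for example with pd.to_numeric.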

Votes 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71019288
