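I am trying to parse a web page and store the results in a table. However, I can't get any further, because the parsed output has unwanted text at the top, in the middle and at the bottom of the CSV.
http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV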
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "http://lottery.merseyworld.com/cgi-bin/lottery?days=20&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"
response = requests.get(URL)
mywebpage = response.text
mysoup = BeautifulSoup(mywebpage, "html.parser")
print(mysoup)
This gives:
<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000, 1
1499, Tue, 1,Feb,2022, 01,19,36,38,49,06,09, 52442757, 0
1498, Fri,28,Jan,2022, 10,25,29,34,45,09,10, 42779117, 0
...
1451, Tue,17,Aug,2021, 12,31,41,42,47,04,06, 14502700, 0
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12, 96295864, 1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07, 80768518, 0
1448, Fri, 6,Aug,2021, 07,14,21,26,32,04,12, 71953143, 0
...
3, Fri,27,Feb,2004, 14,18,19,31,37,04,05, 11880304, 0
2, Fri,20,Feb,2004, 07,13,39,47,50,02,05, 10111500, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
This page shows all the draws that used any machine and any ball set in any year.
Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
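</body></head></html>
I just want to be able to extract the 1500 rows of data into pandas and get rid of the text that appears at the start, in the middle and at the end! Any help is greatly appreciated. Thanks!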
Posted on 2022-02-07 16:37:51
This should give you the expected result:
# Access the <pre> tag
mysoup = mysoup.pre.text.split("\n")
# Create a DataFrame object
df = pd.DataFrame(mysoup)
# Split column 0 based on the separator ","
df = df[0].str.split(',', expand=True)
# Strip each cell by using a lambda function
df = df.apply(lambda row: row.str.strip(), axis=1)
# Exclude all rows which do not have an entry in column 1
df = df[~df[1].isnull()]
# Set the column names based on the first row
df.columns = df.iloc[0]
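# Keep only rows whose first column is a draw number. This drops the first row
# (now used as the column names) and the repeated header that shares a line
# with the "180 days expiry" notice and therefore still contains commas
df = df[df.iloc[:, 0].str.isdigit()]
https://stackoverflow.com/questions/71019288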
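As an alternative, here is a minimal sketch that works on the raw page text without BeautifulSoup. It assumes the layout matches the output shown in the question: every draw line starts with a numeric draw number, and the header line starts with "No.,". The names `parse_draws` and `SAMPLE_PAGE` are made up for illustration, and the sample is a shortened copy of the question's output rather than a live download.

```python
# Shortened stand-in for response.text, copied from the output in the question
SAMPLE_PAGE = """<html>
<head>
<title> Euro Millions Winning Numbers</title>
<body>
<pre> Euro Millions Winning Numbers
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1500, Fri, 4,Feb,2022, 03,25,38,43,49,03,07, 109915000, 1
1450, Fri,13,Aug,2021, 06,12,44,47,49,08,12, 96295864, 1
<hr/><b>All lotteries below have exceeded the 180 days expiry date</b><hr/>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
1449, Tue,10,Aug,2021, 09,37,47,48,49,02,07, 80768518, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
This page shows all the draws that used any machine and any ball set in any year.
Data obtained from http://lottery.merseyworld.com/Euro/
</pre>
</body></head></html>"""


def parse_draws(page_text):
    """Return (column_names, rows) for the draw lines found in page_text."""
    lines = page_text.splitlines()

    # The first line that starts with "No.," is taken as the header
    header = next(line for line in lines if line.lstrip().startswith("No.,"))
    columns = [name.strip() for name in header.split(",")]

    rows = []
    for line in lines:
        cells = [cell.strip() for cell in line.split(",")]
        # A draw line has a numeric first field and one cell per column;
        # titles, HTML tags, the repeated header and footer text all fail this
        if cells[0].isdigit() and len(cells) == len(columns):
            rows.append(cells)
    return columns, rows


columns, rows = parse_draws(SAMPLE_PAGE)
print(columns)
print(len(rows), "draws parsed")
```

With the real page, `parse_draws(response.text)` can be followed by `pd.DataFrame(rows, columns=columns)` to get the table. All cells are still strings at that point, so numeric columns such as `Jackpot` would need converting, for example with `pd.to_numeric`.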