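I'm writing a program that finds song lyrics. It is almost finished, but I have a small problem with the bs4 data types: how do I extract the plain text from the lyric result at the end?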
import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search('songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content ,"lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)

The result here is:
[
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
\nWhat's that?
\nWhat's that?
\n
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
\nWhat's that?
\nWhat's that?
\n
...
Edit Lyrics\nEdit Wiki\nAdd Video\n
]
Posted on 2016-03-28 00:18:20
Add the following line of code:

lyric = ''.join([tag.text for tag in lyric])

after

lyric = Soup.findAll(attrs={"lyric-box"})

and you will get output like:
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...

Posted on 2016-03-27 22:52:01
First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line, which removes the leading and trailing whitespace from that line.
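As a rough sketch of that approach, assuming the result has already been turned into a single string (for example with str()). The sample value `raw` and the helper name `clean_lines` are made up for illustration, and dropping empty lines is an extra step not mentioned above. Note that this only deals with the brackets and whitespace; any HTML markup in the string would still be there.

```python
# Hypothetical sample, modelled on the printed output in the question.
raw = "[\n   Please could you stop the noise,  \n\nWhat's that?\n \n]"

def clean_lines(text):
    """Trim the surrounding brackets, strip each line, drop empty lines."""
    body = text[1:-1]  # remove the leading "[" and the trailing "]"
    stripped = (line.strip() for line in body.splitlines())
    # Assumption: blank lines are not wanted in the final text.
    return [line for line in stripped if line]

print("\n".join(clean_lines(raw)))
```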

Posted on 2016-03-28 22:28:41
For anyone who likes this idea: after a few small modifications, my code looks like this :)
# Note: "from urllib import unquote" and "from StringIO import StringIO"
# are Python 2 imports.
import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO

def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    # download the search results page into an in-memory buffer
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    # return the first matched item without the double quotes
    # at the beginning and at the end of the string
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    # download the song page into an in-memory buffer
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    # join the text of every matching tag into one plain string
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)

link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
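
To try the `tag.text` idea without hitting the site, here is a small sketch that runs on an inline HTML snippet. The markup in `SAMPLE_HTML` is an assumption (the real page structure may differ), `extract_lyrics` is a hypothetical helper name, and it uses the built-in "html.parser" so lxml is not required. It also removes nested div elements, on the assumption that this is where the trailing "Edit Lyrics" links live.

```python
import bs4

# Assumed markup, loosely modelled on the output shown in the question.
SAMPLE_HTML = """
<div class="holder lyric-box">
Please could you stop the noise,<br/>
I'm trying to get some rest<br/>
<div class="editbutton"><a href="#">Edit Lyrics</a></div>
</div>
"""

def extract_lyrics(html):
    """Return the lyric lines of the first lyric-box element as plain strings."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    box = soup.find(class_="lyric-box")
    if box is None:
        return []
    # Assumption: the edit links sit in nested <div> elements; remove them.
    for extra in box.find_all("div"):
        extra.decompose()
    # stripped_strings yields each text fragment with surrounding whitespace
    # trimmed and skips fragments that are empty after trimming.
    return list(box.stripped_strings)

print("\n".join(extract_lyrics(SAMPLE_HTML)))
```

Compared with joining `tag.text`, this returns a list of lines, so the caller can decide how to join them.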
https://stackoverflow.com/questions/36253506