So I'm currently trying to export a Twitter .html page, and I built this web scraper with BeautifulSoup. The OUTPUT.csv file is very messy at the moment, and my questions are the following (my current .py file is shown below):

What steps can I take to clean up the code? My output CSV contains the tweets, but they are really messy and comma-separated. Is there a way to separate them with new lines instead? Also, how can I extract just "Bank of America: Growth Is Back - Bank Of America Corporation" from the tweet (I surrounded it with asterisks) in my cleanup() function?
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en">*****Bank Of America: Growth Is Back – Bank of America Corporation***** (<strong>NYSE:BAC</strong>) <a class="twitter-timeline-link u-hidden" data-expanded-url="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/" dir="ltr" href="" rel="nofollow noopener" target="_blank" title="https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/"><span class="tco-ellipsis"></span><span class="invisible">https://</span><span class="js-display-url">good-stockinvest.com/2017/11/29/ban</span><span class="invisible">k-of-america-growth-is-back-bank-of-america-corporation-nysebac/</span><span class="tco-ellipsis"><span class="invisible"> </span>…</span></a>
</p>
</div>

Here is my code:
from bs4 import BeautifulSoup
import csv
new = csv.writer(open("OUTPUT", "w"))
new.writerow(["Tweets:"])
new.writerow([ ]) # allowing for a simple space
data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")
tweets = soup.find_all('div', class_="js-tweet-text-container")
def writetweets():
    for tweet in tweets:
        new.writerow(tweets)
        new.writerow([ ])
    print "writetweets - open OUTPUT.csv for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

Posted on 2017-12-06 08:21:05
A quick fix would be to use the split() function to get only the text between the asterisks. Is every tweet you get wrapped in asterisks, or just this particular one?

Another solution is to search the tags further, so that you end up with a "clean" string. That is, use find_all again within the "tweets" results.
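The split() idea can be sketched like this; the sample string below is made up to mimic the tweet from the question, not taken from a live run:

```python
# If the title is wrapped in runs of asterisks, splitting on "*" leaves
# the title as the first non-empty piece.
text = ("*****Bank Of America: Growth Is Back - Bank of America "
        "Corporation***** (NYSE:BAC)")

parts = [p for p in text.split("*") if p.strip()]
title = parts[0].strip()
print(title)  # Bank Of America: Growth Is Back - Bank of America Corporation
```

Note this only works if the asterisks are actually present in every tweet, which is why the question above matters.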
Posted on 2017-12-06 08:34:58
First, you have a couple of mistakes: you use for to iterate over each tweet, but then you write tweets (the whole list) instead of tweet.

Also, if you want one tweet per line rather than comma-separated values, you can switch from the csv module to writing the file directly:

file_output = open(file_name, 'w')
for tweet in tweets:
    file_output.write(tweet.getText() + "\n")
file_output.close()

That way it will be one tweet per line. It's up to you.
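If you would rather keep the csv module, the loop fix described above is enough on its own; a minimal sketch (using plain strings as stand-ins for the parsed divs) looks like this:

```python
import csv
import io

# Write the loop variable `tweet`, wrapped in a one-element list, so each
# tweet becomes its own CSV row instead of the whole list being dumped
# repeatedly. io.StringIO stands in for the output file here.
tweets = ["first tweet", "second tweet"]  # stand-ins for the parsed divs

buf = io.StringIO()
writer = csv.writer(buf)
for tweet in tweets:
    writer.writerow([tweet])  # one tweet per CSV row

print(buf.getvalue())
```

With one field per row there are no commas inside the output, so the "messy, comma-separated" symptom from the question disappears.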
Posted on 2017-12-06 08:47:49
Building on the previous answers, but helping with the cleanup:
from bs4 import BeautifulSoup
import csv
data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")
#tweets = soup.find_all('div', class_="js-tweet-text-container")
tweets = soup.find_all("div", {"class": "js-tweet-text-container"})
def writetweets():
    with open("OUTPUT.txt", "w") as new:
        new.write("Tweets:\r\n")
        for tweet in tweets:
            new.write(tweet.getText() + "\r\n")
    print "writetweets - open OUTPUT.txt for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

I get:
In [29]: tweet.getText()
Out[29]: '*****Bank Of America: Growth Is Back – Bank of America Corporation***** (NYSE:BAC) https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/ …'
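Putting the pieces together, the question's cleanup() could be filled in along these lines. This is a sketch using re.search rather than split(), and the sample string is modeled on the Out[29] value above rather than taken from a live run:

```python
import re

def cleanup(text):
    """Return the title between the runs of asterisks, or None if absent."""
    # \*+ matches one or more asterisks on each side, so it tolerates
    # both *title* and *****title***** style wrapping.
    match = re.search(r"\*+([^*]+)\*+", text)
    return match.group(1).strip() if match else None

sample = ("*****Bank Of America: Growth Is Back - Bank of America "
          "Corporation***** (NYSE:BAC) https://good-stockinvest.com/...")
print(cleanup(sample))
# Bank Of America: Growth Is Back - Bank of America Corporation
```

Returning None for tweets without asterisks keeps the function safe to call on every tweet, not just the one shown in the question.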
https://stackoverflow.com/questions/47664769