Question: Improving a function in R that fetches stock news data from Google

Stack Overflow user

Asked 2011-04-23 09:43:57

1 answer, 2.9K views, 0 followers, 6 votes

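I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways to improve it. For starters, my function returns an object in the GMT time zone rather than the user's current time zone, and it fails if passed a number greater than 299 (probably because Google only returns 300 stories per stock). This is partly in response to my own question on Stack Overflow, and it relies heavily on this blog post.

tl;dr: How can I improve this function?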

 getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);
    require(xts); require(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_york')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
    return(mydf) 

}

Tags: timezone, xts, quantmod, google-finance

1 Answer

Stack Overflow user

Accepted answer

Answered 2011-04-23 11:19:32

Here is a shorter (and probably more efficient) version of your getNews function:

  getNews2 <- function(symbol, number){

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);  

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    # (zone names can be case-sensitive, so spell it 'America/New_York')
    mydf$pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL

    return(mydf)    
}
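
On the time-zone point raised in the question: rather than hard-coding a zone, the timestamps can be shown in whatever zone the R session is running in. The following is only a sketch in base R, not part of getNews2. `toLocal` is a hypothetical helper name and the timestamp is made up for illustration.

```r
# %a and %b are matched against locale-specific day and month names,
# so pin LC_TIME to "C" (English names) before parsing.
# Note: this changes the locale for the rest of the session.
invisible(Sys.setlocale("LC_TIME", "C"))

# A made-up timestamp in the same layout as the feed's pubDate field.
raw <- "Fri, 22 Apr 2011 14:30:00 GMT"

# Parse as GMT; POSIXct stores an absolute instant in time.
utc <- as.POSIXct(strptime(raw, format = "%a, %d %b %Y %H:%M:%S", tz = "GMT"))

# Hypothetical helper: keep the instant, change only the display zone.
toLocal <- function(x, tz = Sys.timezone()) {
  if (is.null(tz) || is.na(tz)) tz <- ""  # "" means the session's current zone
  attr(x, "tzone") <- tz
  x
}

local_time <- toLocal(utc)

# The same instant rendered in an explicitly named zone:
format(utc, tz = "America/New_York", usetz = TRUE)
# "2011-04-22 10:30:00 EDT"
```

Because only the `tzone` attribute changes, `local_time` and `utc` still compare equal; the difference is purely in how they print.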

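Also, there may be a bug in your code: I tried it with symbol = 'WMT' and it returned an error. I believe getNews2 works fine for WMT as well. Check it out and let me know whether it works for you.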
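
As for the failure on requests above 299 stories mentioned in the question, one option is to cap the count before building the URL. This is a minimal sketch. It assumes the feed really does reject larger values, which comes from the question's observation and not from any documented limit. `buildNewsUrl` and `max_items` are hypothetical names, not part of either function above.

```r
# Hypothetical helper: build the feed URL, capping the requested count.
buildNewsUrl <- function(symbol, number, max_items = 299) {
  if (number > max_items) {
    warning("Requested ", number, " stories; capping at ", max_items)
    number <- max_items
  }
  url <- paste0("http://www.google.com/finance/company_news?q=", symbol,
                "&output=rss&start=1&num=", number)
  URLencode(url)
}

# An over-limit request is capped, with a warning, instead of failing later.
capped <- suppressWarnings(buildNewsUrl("WMT", 500))
capped
# "http://www.google.com/finance/company_news?q=WMT&output=rss&start=1&num=299"
```

Capping with a warning keeps the function usable for callers who simply ask for "as many as possible", while still telling them the request was reduced.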

PS. The description column still contains HTML code, but it should be easy to extract the text from it. I will post an update when I have time.
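
For the HTML left in the description column, here is a rough sketch of the text extraction using only base R regular expressions. `stripHtml` is a hypothetical helper, not part of getNews2, and the sample string is made up. Regex-based tag stripping is only an approximation; a real HTML parser would be more robust on messy markup.

```r
# Hypothetical helper: crude removal of HTML tags and a few common entities.
stripHtml <- function(x) {
  x <- gsub("<[^>]+>", " ", x)               # drop anything that looks like a tag
  x <- gsub("&nbsp;", " ", x, fixed = TRUE)  # non-breaking space entity
  x <- gsub("&amp;", "&", x, fixed = TRUE)   # ampersand entity
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

sample_html <- "<div><b>Wal-Mart</b> posts <a href='#'>results</a>&nbsp;today</div>"
stripHtml(sample_html)
# "Wal-Mart posts results today"
```

Since `gsub` is vectorised, the same helper could be applied to the whole column at once, for example `mydf$description <- stripHtml(as.character(mydf$description))`.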

6 votes

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud.

Original link: https://stackoverflow.com/questions/5761576
