Q: Data pivot transformation

Stack Overflow user

Asked on 2017-02-22 11:00:24

1 answer, 110 views, 0 followers, 1 vote

I know there are R packages for text mining, such as tm, but I could not get them to work for my task. I have a text file whose data looks like this:

 452924301037
    5May2014
       John
    7May2014
       Mark
       Sam
 452924302789
    6May2014
       Bill

I would like the data above in a data frame that looks like this:

UserID, Date, Names
452924301037,5May2014,John
452924301037,7May2014,Mark Sam
452924302789,6May2014,Bill

How can I do this in R?

Example 2:

Input text file:

452924301037
    5May2014
       John
           Cricket
           Football
    7May2014
       Mark
           Hockey
452924302789
     6May2014
       Bill
           Billiards

I want to build a data frame like the one below:

Game, Player, Date, UserID 
Cricket, John, 5May2014, 452924301037
Football, John, 5May2014, 452924301037
Hockey, Mark, 7May2014, 452924301037
Billiards, Bill, 6May2014, 452924302789


1 answer

Stack Overflow user

Accepted answer

Posted on 2017-02-22 11:38:01

A possible solution using data.table and zoo:

# read the text file
txt <- readLines('textlines.txt')

# load the needed packages
library(zoo)
library(data.table)

# convert the text to a data.table (an enhanced form of a data.frame);
# trimws() strips the indentation so the collapsed names come out clean
DT <- data.table(txt = trimws(txt))

# extract the info into new columns
# user IDs: 8 or more consecutive digits; dates: 3 non-digits followed by 4 digits
DT[grepl('\\d{8,}', txt), User_id := txt
   ][grepl('\\D{3}\\d{4}', txt), Date := txt
     ][, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')
       ][txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)]

This gives:

        User_id     Date    Names
1: 452924301037 5May2014     John
2: 452924301037 7May2014 Mark Sam
3: 452924302789 6May2014     Bill

To see what each step does, run the following code:

# extract the user IDs (8 or more consecutive digits)
DT[grepl('\\d{8,}', txt), User_id := txt][]
# extract the dates (3 non-digits followed by 4 digits, e.g. May2014)
DT[grepl('\\D{3}\\d{4}', txt), Date := txt][]
# fill the NA-values of 'User_id' and 'Date' with na.locf from the zoo package
DT[, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')][]
# filter out the rows where the 'txt'-column has either a 'User_id' or a 'Date'
# collapse the names into one string by 'User_id' and 'Date'
DT[txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)][]
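
For readers without data.table or zoo installed, the same fill-forward-then-collapse idea can be sketched in base R. This is not the answer's code, only an illustration under assumptions: the sample lines from the question are inlined instead of being read from a file, the stricter anchored patterns are my own, and the input is assumed to be well formed (every name is preceded by an ID line and a date line). The forward fill is done with cumsum() rather than na.locf.

```r
# Sample input from the question, inlined so the sketch is self-contained
txt <- trimws(c(" 452924301037", "    5May2014", "       John",
                "    7May2014", "       Mark", "       Sam",
                " 452924302789", "    6May2014", "       Bill"))

# Classify each line (patterns are assumptions based on the sample data)
is_id   <- grepl("^[0-9]{8,}$", txt)
is_date <- grepl("^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}$", txt)
keep    <- !is_id & !is_date          # remaining lines are names

# Forward fill: cumsum() counts the markers seen so far, which indexes
# the most recent ID or date for every name line
rows <- data.frame(UserID = txt[is_id][cumsum(is_id)[keep]],
                   Date   = txt[is_date][cumsum(is_date)[keep]],
                   Names  = txt[keep],
                   stringsAsFactors = FALSE)

# Collapse the names per (UserID, Date), keeping first-appearance order
key   <- paste(rows$UserID, rows$Date)
first <- !duplicated(key)
result <- data.frame(
  UserID = rows$UserID[first],
  Date   = rows$Date[first],
  Names  = unname(vapply(split(rows$Names, factor(key, levels = unique(key))),
                         paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE, row.names = NULL)

print(result)
```

The cumsum() trick works here because every name line comes after at least one ID and one date; for messier input, na.locf as used in the answer is the more forgiving choice.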

For the second example that was added, you can do this:

DT <- data.table(txt = trimws(txt))

# same extraction and fill steps as before; afterwards the first remaining
# line of each (User_id, Date) group is taken as the player name
DT[grepl('\\d{8,}', txt), User_id := txt
   ][grepl('\\D{3}\\d{4}', txt), Date := txt
     ][, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')
       ][txt != User_id & txt != Date
         ][, Name := txt[1], by = .(User_id, Date)
           ][Name != txt]

This gives:

         txt      User_id     Date Name
1:   Cricket 452924301037 5May2014 John
2:  Football 452924301037 5May2014 John
3:    Hockey 452924301037 7May2014 Mark
4: Billiards 452924302789 6May2014 Bill
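
The second example can be sketched in base R along the same lines. As before, this is an illustration rather than the answer's code: the sample lines are inlined, the anchored patterns are my own assumption, and, like the data.table version, it assumes exactly one player per (UserID, Date) group, listed before that player's games.

```r
# Sample input from the second example, inlined and trimmed
txt <- trimws(c("452924301037", "    5May2014", "       John",
                "           Cricket", "           Football",
                "    7May2014", "       Mark", "           Hockey",
                "452924302789", "     6May2014", "       Bill",
                "           Billiards"))

# Classify each line (patterns are assumptions based on the sample data)
is_id   <- grepl("^[0-9]{8,}$", txt)
is_date <- grepl("^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}$", txt)
keep    <- !is_id & !is_date          # player and game lines

# Forward-fill the ID and date onto every remaining line via cumsum()
rows <- data.frame(Line   = txt[keep],
                   UserID = txt[is_id][cumsum(is_id)[keep]],
                   Date   = txt[is_date][cumsum(is_date)[keep]],
                   stringsAsFactors = FALSE)

# The first line of each (UserID, Date) group is treated as the player
key <- paste(rows$UserID, rows$Date)
rows$Player <- ave(rows$Line, key, FUN = function(x) x[1])

# Everything that is not the player line is a game
games <- rows[rows$Line != rows$Player, ]
result <- data.frame(Game = games$Line, Player = games$Player,
                     Date = games$Date, UserID = games$UserID,
                     stringsAsFactors = FALSE)

print(result)
```

If a date could list several players, the first-line rule would no longer hold and the indentation depth of each line would have to be used to tell players from games.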

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/42389360
