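I have data stored as one long string per row. Basically, it is semicolon-delimited, and each column name is separated from its value by =. Here is what I am trying to do:
Current structure: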
Row1: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4”
Row2: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4”
Converted to ->
Column1|Column2|Column3|Column4
blah1|blah2|blah3|blah4
blah1|blah2|blah3|blah4
I believe the tidyr package in R is the way to go, but I haven't figured it out yet.
This is what I have so far with tidyr, but I still get an error:
# CREATE TEST DATA
mydata <- as.data.frame(c("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
names(mydata) <- "TEST"
# Create dummy vector
x <- vector(mode="numeric", length=0)
# Separate by ;
x <- separate(mydata, TEST, x, sep = ";" )
Any help is greatly appreciated.
Answered 2017-03-25 04:22:01
I will show how to do this step by step with dplyr pipes, printing the output after each step so you can see how the data structure evolves.
mydata <- as.data.frame(c("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
names(mydata) <- "TEST"
It looks like this:
> mydata
TEST
1 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
2 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
3 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
Here are the steps of the transformation:
library(dplyr)
library(tidyr)
1) Separate by variable
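Before applying `separate()`, it may help to see what splitting a single row on `;` produces. The following is a minimal base R sketch, not part of the original answer; `one_row` and `pieces` are illustrative names, and `trimws()` is used only to drop the space that follows each semicolon.

```r
# One row of the example data (illustrative copy)
one_row <- "Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"

# Split on the semicolon, then trim the leading space left after each ";"
pieces <- trimws(strsplit(one_row, ";", fixed = TRUE)[[1]])

print(pieces)
```

Each of the four pieces still has the form `name = value`, so splitting on the semicolon alone does not yet isolate the values.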
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";")
Output:
Column1 Column2 Column3 Column4
1 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
2 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
3 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
2) Add a row identifier
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata))
Output:
Column1 Column2 Column3 Column4 row
1 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 1
2 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 2
3 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 3
3) Reshape to long format
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row)
Output:
row key value
1 1 Column1 Column1 = blah1
2 2 Column1 Column1 = blah1
3 3 Column1 Column1 = blah1
4 1 Column2 Column2 = blah2
5 2 Column2 Column2 = blah2
6 3 Column2 Column2 = blah2
7 1 Column3 Column3 = blah3
8 2 Column3 Column3 = blah3
9 3 Column3 Column3 = blah3
10 1 Column4 Column4 = blah4
11 2 Column4 Column4 = blah4
12 3 Column4 Column4 = blah4
4) Then extract the values
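The regular expression used in this step, `.* = (.*)$`, can be tried on its own first. The sketch below swaps in base R's `sub()` in place of tidyr's `extract()` purely for illustration; `cells` and `values` are hypothetical names, and the second sample string keeps the leading space that splitting on `;` leaves behind.

```r
# Hypothetical sample strings, shaped like the cells produced by the split
cells <- c("Column1 = blah1", " Column2 = blah2")

# sub() with the same pattern: "\\1" keeps only the captured group,
# i.e. everything after the last " = " in each string
values <- sub(".* = (.*)$", "\\1", cells)

print(values)
```

Because the leading `.*` is greedy, a value that itself contains ` = ` would be cut at the last occurrence, which is worth keeping in mind for messier data.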
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$")
Output:
row key value
1 1 Column1 blah1
2 2 Column1 blah1
3 3 Column1 blah1
4 1 Column2 blah2
5 2 Column2 blah2
6 3 Column2 blah2
7 1 Column3 blah3
8 2 Column3 blah3
9 3 Column3 blah3
10 1 Column4 blah4
11 2 Column4 blah4
12 3 Column4 blah4
5) If needed, spread it back to wide format
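As an aside, tidyr 1.0.0 and later provide `pivot_wider()` as the successor to `spread()`. The following is a hedged sketch of how the whole transformation might look with it, assuming tidyr 1.0.0 or newer is installed. It is not the answer author's method: it reads the column names from the data through `separate_rows()` instead of hard-coding `paste0("Column", 1:4)`, and `result` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same test data as in the question
mydata <- data.frame(
  TEST = rep("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4", 3),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  mutate(row = row_number()) %>%                      # remember the source row
  separate_rows(TEST, sep = ";\\s*") %>%              # one "name = value" per line
  separate(TEST, into = c("key", "value"),
           sep = "\\s*=\\s*") %>%                     # split name from value
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)                                        # drop the helper column

print(result)
```

The row identifier plays the same role here as in the step-by-step version: it tells `pivot_wider()` which values belong to the same original row.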
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$") %>%
spread(key, value)
Output:
row Column1 Column2 Column3 Column4
1 1 blah1 blah2 blah3 blah4
2 2 blah1 blah2 blah3 blah4
3 3 blah1 blah2 blah3 blah4
6) If needed, drop the row identifier
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$") %>%
spread(key, value) %>%
select(-row)
Output:
Column1 Column2 Column3 Column4
1 blah1 blah2 blah3 blah4
2 blah1 blah2 blah3 blah4
3 blah1 blah2 blah3 blah4
Answered 2017-03-25 03:51:55
Here is an attempt in base R
#Example data provided
data <- data.frame(
string=c(
"Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4",
"Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
#Modulo function for odd and even numbers
odd <- function(x) x%%2 != 0
even <- function(x) x%%2 == 0
#split string based on condition and remove all extra whitespace
s <- gsub("[[:space:]]", "", unlist(strsplit(as.character(data$string), '= |;')))
#bind the values into a df, one row per input string, no factors
data <- data.frame(matrix(s[even(seq_along(s))], nrow=nrow(data), byrow=TRUE),
stringsAsFactors=FALSE)
#rename the columns, extracting the names from the odd positions of s
colnames(data) <- unique(s[odd(seq_along(s))])
data
https://stackoverflow.com/questions/43007206