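I have a file containing codes and their descriptions. The code is always a short string (3-6 characters) and is separated from the description that follows it by a space. The description is usually several words (which also contain spaces). Here is an example: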
LIISS License Issued
LIMOD License Modified
LIPASS License Assigned (Partial Assignment)
LIPND License Assigned (Partition/Disaggregation)
LIPPND License Issued from a Partial/P&D Assignment
LIPUR License Purged
LIREIN License Reinstated
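LIREN License Renewed
I would like to read this as a 2-column data frame, with the codes in the first column and the descriptions in the second. How can I do this in R?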
Posted on 2015-09-30 03:23:44
You can use stri_split_fixed() from the stringi package:
library(stringi)
as.data.frame(stri_split_fixed(readLines("x.txt"), " ", n = 2, simplify = TRUE))
# V1 V2
# 1 LIISS License Issued
# 2 LIMOD License Modified
# 3 LIPASS License Assigned (Partial Assignment)
# 4 LIPND License Assigned (Partition/Disaggregation)
# 5 LIPPND License Issued from a Partial/P&D Assignment
# 6 LIPUR License Purged
# 7 LIREIN License Reinstated
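# 8 LIREN License Renewed
Here, we read the file (represented by "x.txt") with readLines(). Then stri_split_fixed() says that we want to split on a space and want n = 2 columns in return, so it splits only on the first space. simplify = TRUE is used to return a matrix instead of a list.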
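If installing stringi is not an option, the same "split on the first space only" idea can be sketched in base R. This is a minimal sketch rather than part of the answer above: the inline `lines` vector stands in for readLines("x.txt"), the column names `code` and `description` are made up for illustration, and it assumes every line contains at least one space.

```r
# Sample lines standing in for readLines("x.txt") (same format as the question).
lines <- c(
  "LIISS License Issued",
  "LIPASS License Assigned (Partial Assignment)",
  "LIPPND License Issued from a Partial/P&D Assignment"
)

# Position of the first space in each line (fixed = TRUE: literal match, no regex).
first_space <- regexpr(" ", lines, fixed = TRUE)

# Text before the first space is the code; text after it is the description.
# The column names here are illustrative, not taken from the original answer.
codes <- data.frame(
  code        = substr(lines, 1, first_space - 1),
  description = substr(lines, first_space + 1, nchar(lines)),
  stringsAsFactors = FALSE
)
codes
```

Because only the first space is located, any later spaces stay inside the description, which is the same effect n = 2 has in stri_split_fixed().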
Data: x.txt
writeLines("LIISS License Issued
LIMOD License Modified
LIPASS License Assigned (Partial Assignment)
LIPND License Assigned (Partition/Disaggregation)
LIPPND License Issued from a Partial/P&D Assignment
LIPUR License Purged
LIREIN License Reinstated
LIREN License Renewed", "x.txt")
Posted on 2015-09-30 03:16:33
We can read the file with readLines and then use sub to create a two-column data.frame.
#read the lines with readLines
lines <- readLines('pavel.txt')
#match one or more spaces followed by zero or more characters
#replace with `''` to extract the non-space characters at the beginning.
str1 <- sub('\\s+.*', '', lines)
#match non space characters from the beginning (`^[^ ]+`) followed by space
#replace with `''` to extract the characters that follow after the space.
str2 <- sub('^[^ ]+\\s+', '', lines)
out <- data.frame(v1= str1, v2=str2, stringsAsFactors=FALSE)
head(out,3)
# v1 v2
#1 LIISS License Issued
#2 LIMOD License Modified
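#3 LIPASS License Assigned (Partial Assignment)
Another option is extract from library(tidyr), applied after reading the dataset as a single column. We use capture groups to pull out the characters we need for each column. Here, ([^ ]+) matches one or more non-space characters and captures them with the parentheses; this is followed by one or more spaces (which are dropped), and then a second capture group takes the remaining characters.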
library(tidyr)
extract(read.table('pavel.txt', sep=','), V1,
into= c('V1', 'V2'), '([^ ]+)\\s+(.*)')
# V1 V2
#1 LIISS License Issued
#2 LIMOD License Modified
#3 LIPASS License Assigned (Partial Assignment)
#4 LIPND License Assigned (Partition/Disaggregation)
#5 LIPPND License Issued from a Partial/P&D Assignment
#6 LIPUR License Purged
#7 LIREIN License Reinstated
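#8 LIREN License Renewed
Or we can replace the first space with a comma and then read the result with read.table using sep=','. This assumes the descriptions themselves contain no commas.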
read.table(text=sub(' ', ',', readLines('pavel.txt')), sep=',')
# V1 V2
#1 LIISS License Issued
#2 LIMOD License Modified
#3 LIPASS License Assigned (Partial Assignment)
#4 LIPND License Assigned (Partition/Disaggregation)
#5 LIPPND License Issued from a Partial/P&D Assignment
#6 LIPUR License Purged
#7 LIREIN License Reinstated
#8 LIREN License Renewed
If we are on Linux, awk can be used together with fread from data.table, or with read.csv/read.table.
library(data.table)
fread("awk '{sub(\" \", \",\", $0)}1' pavel.txt", header=FALSE)
# V1 V2
#1: LIISS License Issued
#2: LIMOD License Modified
#3: LIPASS License Assigned (Partial Assignment)
#4: LIPND License Assigned (Partition/Disaggregation)
#5: LIPPND License Issued from a Partial/P&D Assignment
#6: LIPUR License Purged
#7: LIREIN License Reinstated
#8: LIREN License Renewed
https://stackoverflow.com/questions/32857066