I have a large data frame with the following fields (sample data).
#dput(data) gives...
data <- structure(list(idNum = 1:11, personID = c(111L, 112L, 113L, 113L, 111L, 112L, 114L, 112L, 111L, 113L, 115L), Name = c("PETER PAN", "RUPERT BEAR", "LONG JOHN SILVER", "SILVER LONG JOHN", "PAN PETER", "BEAR RUPERT", "R BEAR", "RUPERT BEAR", "PETER PAN", "LONG J SILVER", "LJ SILVER "), DOB = c("1/01/2001", "2/01/2001", "3/01/2001", "3/01/2001", "1/01/2001", "2/01/2001", "10/01/2001", "2/01/2001", "1/01/2001", "1/01/2001", "5/01/2001"), date = c("12/01/2012", "12/01/2012", "14/01/2012", "12/01/2012", "14/01/2012", "11/01/2012", "10/01/2012", "16/01/2012", "10/01/2012", "16/01/2012", "10/01/2012" ), colour = c("RED", "BLUE", "RED", "GREEN", "YELLOW", "BLUE", "RED", "BLUE", "ORGANGE", "BLUE", "ORANGE"), firstName = c("PETER", "RUPERT", "LONG", "SILVER", "PAN", "BEAR", "R", "RUPERT", "PETER", "LONG", "LJ"), lastName = c("PAN", "BEAR", "SILVER", "JOHN", "PETER", "RUPERT", "BEAR", "BEAR", "PAN", "SILVER", "SILVER")), .Names = c("idNum", "personID", "Name", "DOB", "date", "colour", "firstName", "lastName" ), row.names = c(NA, -11L), class = "data.frame")
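firstName and lastName are not in the original data. The names in the original data set come from a free-form input system. It contains a large number of foreign names, so the data-entry staff could not reliably tell first names from last names. I derived the two columns with the following commands:
data$firstName <- sapply(strsplit(data$Name, split=" "), head, 1)
data$lastName <- sapply(strsplit(data$Name, split=" "), tail, 1)
What I need is a subset of the data frame that removes duplicates matching on personID, Name and DOB, so that the result contains only the most recent entry (by date) for each unique case.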
That is, I want to return rows 5, 7, 8, 10 and 11.
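I split out the first and last names because I imagined the solution would work by extracting cases, by date, where lastName == firstName. I then got stuck on the cases where lastName appears inside firstName while the other conditions are also satisfied.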
I don't know whether this approach can work, and now I'm lost.
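One way to compare names whose parts were entered in a different order is to build an order-insensitive key from the name parts. The following is only a sketch: `name_key` is a hypothetical helper, not something from the original post, and it assumes the name parts are separated by spaces.

```r
# Hypothetical helper (not from the original post): build a key that ignores
# the order of the name parts and any leading or trailing spaces.
name_key <- function(x) {
  parts <- strsplit(trimws(x), " +")   # split each name on runs of spaces
  vapply(parts, function(p) paste(sort(p), collapse = " "), character(1))
}

# "PETER PAN" and "PAN PETER" produce the same key.
keys <- name_key(c("PETER PAN", "PAN PETER", "LJ SILVER "))
keys[1] == keys[2]
```

A key like this only handles reordered parts. Abbreviated forms such as "R BEAR" versus "RUPERT BEAR" would still produce different keys, which is one reason matching on personID alone may be simpler.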
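Is there a relatively simple way to remove the duplicates that match on the columns personID, Name and DOB, keeping only the most recent unique instance?
Many thanks in advance.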
Posted on 2012-01-20 12:25:54
I used @Vincent's suggestion:
data[ !duplicated( data$personID, fromLast=TRUE ), ]
with the data sorted by date beforehand, using ddply from the plyr package:
library(plyr)
data <- ddply(.data=data, .variables= 'date')
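For reference, the same idea can be sketched in base R without plyr. This is an alternative sketch, not the code from the answer. It assumes the date column holds dd/mm/yyyy strings, as in the sample data, and parses them with `as.Date` so that the sort is chronological rather than alphabetical. The small data frame below reproduces only the columns the sketch needs.

```r
# Stand-in for the question's data (only the columns needed here).
data <- data.frame(
  idNum    = 1:11,
  personID = c(111L, 112L, 113L, 113L, 111L, 112L, 114L, 112L, 111L, 113L, 115L),
  date     = c("12/01/2012", "12/01/2012", "14/01/2012", "12/01/2012",
               "14/01/2012", "11/01/2012", "10/01/2012", "16/01/2012",
               "10/01/2012", "16/01/2012", "10/01/2012"),
  stringsAsFactors = FALSE
)

# Parse the dd/mm/yyyy strings (assumed format) into Date values.
parsed <- as.Date(data$date, format = "%d/%m/%Y")

# Sort oldest to newest, breaking ties by idNum.
sorted <- data[order(parsed, data$idNum), ]

# Keep the last (most recent) row for each personID.
latest <- sorted[!duplicated(sorted$personID, fromLast = TRUE), ]

# Restore the original row order for readability.
latest <- latest[order(latest$idNum), ]
latest$idNum
```

On the sample data this keeps rows 5, 7, 8, 10 and 11, the rows the question asks for. Note that it deduplicates on personID only, as the answer does, since Name and DOB are not entered consistently for the same person.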
https://stackoverflow.com/questions/8892566