Identifying booking extensions in a dataset (ddply?)

Stack Overflow user
Asked 2020-06-28 02:11:46

I have a dataset of reservations, some of which are extensions of an original reservation. I am trying to identify those extensions.

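I keep going back and forth between trying this with ddply and with a 'for (row in data)' style loop, but in neither case can I figure out how to make the combination work.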

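The three checks have to be combined and run against the other rows, i.e. row 1 against row 2 for check-in/check-out date, building code, guest email, and so on.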

for (i in datax) {
  #1:length(fdr.list)
datax %>% filter(datax$email == datax[i,]$email)

# filter <- datax[datax$email == datax[i]$email]

datax[i, ]$Extension <- ifelse(data[i, ]$StaysOrdered == 1, 0, #Initial filtering just to do less work
     ifelse(
        floor_date(datax[i, ]$checkOutDate)== floor_date(datax$checkInDate) &
        datax[i, ]$buildingcode == datax$buildingcode &
          ifelse(not(is.na(datax[i, ]$email)),ifelse(datax[i, ]$email == datax$email
        , 1, 0),0)))
}

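I don't have a reproducible example because I expect the logic in the middle to change and/or expand, so I am looking for a code foundation that I can extend rather than a direct solution to the problem.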

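This involves about 30,000 bookings, so ideally the code would not take too long to run. I don't know how this would work, but maybe filter first and then check by guest email?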


1 Answer

Stack Overflow user

Accepted answer

Posted 2020-06-29 14:22:59

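In the end I created some unique code strings and matched on them. The big difference in run time, which took it from an hour down to 5 minutes, came from putting the results into a pre-populated matrix instead of writing directly into the data.frame.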
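To make the matching idea concrete before the full code, here is a minimal sketch on toy data. The booking values are made up for illustration, the column names follow the code that comes next, and base R `match()` stands in for `fmatch()` so the snippet has no package dependencies.

```r
# Toy data (hypothetical values); column names follow the answer's code.
bookings <- data.frame(
  internaltitle = c("B1", "B1", "B2"),
  checkInDate   = as.Date(c("2020-01-01", "2020-01-05", "2020-01-05")),
  checkOutDate  = as.Date(c("2020-01-05", "2020-01-08", "2020-01-09")),
  email         = c("a@example.com", "a@example.com", "a@example.com"),
  stringsAsFactors = FALSE
)

# One key built from the check-out date and one from the check-in date.
out_code <- paste(bookings$internaltitle, bookings$checkOutDate, bookings$email, sep = "-")
in_code  <- paste(bookings$internaltitle, bookings$checkInDate, bookings$email, sep = "-")

# For each row: index of the first row whose check-out key equals this
# row's check-in key, or 0 when there is no such row.
origin_row <- match(in_code, out_code, nomatch = 0)
print(origin_row)
```

A non-zero value is the row number of the reservation being extended. Here only the second booking gets one, because it starts on the day the first booking ends, in the same building and with the same email.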

library(lubridate)  # floor_date()
library(fastmatch)  # fmatch()

# Create unique codes to match by three methods (email, lastname_cleaned, guest): one for check-out and one for check-in
data$Extension_MatchCode_Email_Out <- paste(data$internaltitle, "-", floor_date(data$checkOutDate,"day"), "-", data$email)
data$Extension_MatchCode_Email_In <- paste(data$internaltitle, "-", floor_date(data$checkInDate,"day"), "-", data$email)
data$Extension_MatchCode_Name_Out <- paste(data$internaltitle, "-", floor_date(data$checkOutDate,"day"), "-", data$lastname_cleaned)
data$Extension_MatchCode_Name_In <- paste(data$internaltitle, "-", floor_date(data$checkInDate,"day"), "-", data$lastname_cleaned)
data$Extension_MatchCode_Guest_Out <- paste(data$internaltitle, "-", floor_date(data$checkOutDate,"day"), "-", data$guest)
data$Extension_MatchCode_Guest_In <- paste(data$internaltitle, "-", floor_date(data$checkInDate,"day"), "-", data$guest)

# Pre-populate the result vectors. We don't want to write into the data frame directly, or it will be ~8x slower
matchval_email <- rep(NA, nrow(data))
matchval_lastname <- rep(NA, nrow(data))
matchval_guest <- rep(NA, nrow(data))

# For each row, check whether its 'In' code matches another row's 'Out' code. Every 100 iterations, print progress and a time stamp.
for (i in seq_len(nrow(data))) {
  if (i %% 100 == 0) print(paste(i, "-", Sys.time()))

#Match by customer email method
matchval_email[i] <- ifelse(is.na(data[i, ]$email), 0, #Check if email is blank, if so, skip
                              ifelse(data[i, ]$Stays_Email <2, 0, #Check if stays for this email is <2, if so, skip
                              fmatch(data[i, ]$Extension_MatchCode_Email_In, data$Extension_MatchCode_Email_Out, nomatch = 0))) # index of the first row whose 'Out' code equals this row's 'In' code (0 if none)
#Match by last name method
matchval_lastname[i] <- ifelse(is.na(data[i, ]$lastname_cleaned), 0,
                              ifelse(data[i, ]$Stays_LastName_Cleaned <2, 0,
                              fmatch(data[i, ]$Extension_MatchCode_Name_In, data$Extension_MatchCode_Name_Out, nomatch = 0)))
#Match by guest code method
matchval_guest[i] <- ifelse(is.na(data[i, ]$guest), 0,
                              ifelse(data[i, ]$Stays <2, 0,                               
                              fmatch(data[i, ]$Extension_MatchCode_Guest_In, data$Extension_MatchCode_Guest_Out, nomatch = 0)))

}


# Move the results into the data frame in one step
data$matchval_email <- matchval_email
data$matchval_lastname <- matchval_lastname
data$matchval_guest <- matchval_guest
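
Since the question asked for a base that can be extended, the same lookup can probably be written without the row loop, because `match()` (like `fmatch()`) is vectorized over its first argument. The following is only a sketch under assumptions: `find_origin_row()` and its `min_stays` argument are hypothetical names, the per-identifier stay count is computed inside the function rather than taken from the precomputed `Stays_*` columns, and the date columns are assumed to be `Date` or `POSIXct`.

```r
# Hypothetical helper (not part of the original answer). For each row it
# returns the index of the first row whose check-out key equals this row's
# check-in key, or 0 when the identifier is missing, the identifier has
# fewer than min_stays bookings, or no match exists.
find_origin_row <- function(df, id_col, min_stays = 2) {
  id <- df[[id_col]]

  # Day-level keys; format() works for both Date and POSIXct columns.
  out_code <- paste(df$internaltitle, format(df$checkOutDate, "%Y-%m-%d"), id, sep = "-")
  in_code  <- paste(df$internaltitle, format(df$checkInDate, "%Y-%m-%d"), id, sep = "-")

  # Bookings per identifier: map each row to the first row with the same
  # identifier, count rows per first occurrence, then map the counts back.
  first <- match(id, id)
  stays <- tabulate(first)[first]

  origin <- match(in_code, out_code, nomatch = 0L)
  origin[is.na(id) | stays < min_stays] <- 0L
  origin
}

# Toy data (hypothetical values) to show the call pattern.
bookings <- data.frame(
  internaltitle = c("B1", "B1", "B2", "B1"),
  checkInDate   = as.Date(c("2020-01-01", "2020-01-05", "2020-01-05", "2020-02-01")),
  checkOutDate  = as.Date(c("2020-01-05", "2020-01-08", "2020-01-09", "2020-02-03")),
  email         = c("a@example.com", "a@example.com", "b@example.com", NA),
  guest         = c("G1", "G1", "G2", "G3"),
  stringsAsFactors = FALSE
)

# One result column per matching method; adding a method is one more name.
for (col in c("email", "guest")) {
  bookings[[paste0("matchval_", col)]] <- find_origin_row(bookings, col)
}
print(bookings[, c("matchval_email", "matchval_guest")])
```

As in the loop version, a booking that checks in and out on the same day would match itself, so such rows may need to be filtered separately.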
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/62617486
