首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何使用R精确匹配整个数据集中的两列值

如何使用R精确匹配整个数据集中的两列值
EN

Stack Overflow用户
提问于 2020-08-05 03:27:55
回答 1查看 241关注 0票数 6

我在R中有下面提到的两个数据,并且我尝试过各种方法,但是还不能达到所需的输出。

DF:

代码语言:javascript
复制
ID     Date                 city        code    uid
I-1    2020-01-01 10:12:15  New York     123    K-1
I-1    2020-01-01 10:12:15  Utha         103    K-1
I-2    2020-01-02 10:12:15  Washington   122    K-1
I-3    2020-02-01 10:12:15  Tokyo        123    K-2
I-3    2020-02-01 10:12:15  Osaka        193    K-2
I-4    2020-02-02 10:12:15  London       144    K-3
I-5    2020-02-04 10:12:15  Dubai        101    K-4
I-6    2019-11-01 10:12:15  Dubai        101    K-4
I-7    2019-11-01 10:12:15  London       144    K-3
I-8    2018-12-13 10:12:15  Tokyo        143    K-5
I-9    2019-05-17 10:12:15  Dubai        101    K-4
I-19   2020-03-11 10:12:15  Dubai        150    K-7

迪普特:

代码语言:javascript
复制
structure(list(ID = c("I-1", "I-1", 
"I-2", "I-3", "I-3", "I-4", 
"I-5", "I-6", "I-7", "I-8", "I-9","I-19" 
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607", 
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952", 
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536", 
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York", 
"UTAH", "Washington", "Tokyo", 
"Osaka", "London", "Dubai", 
"Dubai", "London", "Tokyo", "Dubai", 
"Dubai"), CODE = c("221010", 
"411017", "638007", "583101", "560029", "643102", "363001", "452001", 
"560024", "509208"), UID = c("K-1", 
"K-1", "K-1", "K-2", "K-2", 
"K-3", "K-4", "K-4", "K-3", 
"K-5","K-4","K-7")), .Names = c("ID", "DATE", 
"CITY", "CODE", "UID"), row.names = c(NA, 
10L), class = "data.fram)

使用上述两个数据,我希望在2020年1月1日至2002年2月29日之间获取记录,并比较整个数据库中的ID,以检查城市和代码是否与其他ID相匹配,并进一步对其进行分类,以检查有多少具有相同的uid,有多少有不同的ID。

哪里,

database

  • Same_uid中城市和代码匹配与其他ID的组合-对匹配ID进行分类以确定有多少ID具有相似的uid

  • different_uid,对匹配ID进行分类以确定有多少ID没有相似的uid

  • uid_count --整个数据库

中该特定ID的相似uid计数

注意-我有超过1000万的数据记录。

所需输出

代码语言:javascript
复制
ID      Date                  city         code   uid   Match   Same_uid   different_uid  uid_count
I-1     2020-01-01 10:12:15   New York     123    K-1    No      0          0              2
I-2     2020-01-02 10:12:15   Washington   122    K-1    No      0          0              2
I-3     2020-02-01 10:12:15   Tokyo        123    K-2    No      0          0              1   
I-4     2020-02-02 10:12:15   London       144    K-3    Yes     1          0              2
I-5     2020-02-04 10:12:15   Dubai        101    K-4    Yes     2          0              3 
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-08-05 06:45:11

一种方法,

在数据集中加载

代码语言:javascript
复制
    library(tidyverse)
    library(lubridate)

   

   mydata <- tibble(
   ID = c("I-1","I-1",
          "I-2","I-3",
          "I-3","I-4",
          "I-5","I-6",
          "I-7","I-8",
          "I-9","I-19"),
   Date = c("2020-01-01", "2020-01-01",
            "2020-01-02", "2020-02-01",
            "2020-02-01", "2020-02-02",
            "2020-02-04", "2019-11-01", 
            "2019-11-01", "2018-12-13", 
            "2019-05-17", "2020-03-11"),
   city = c("New York", "Utha", 
            "Washington", "Tokyo", 
            "Osaka", "London", 
            "Dubai", "Dubai", 
            "London", "Tokyo", 
            "Dubai", "Dubai"),
   code = c("123", "103", "122", "123", "193, "144",
            "101", "101", "144", "143", "101", "150"),
   uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
           "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))

   mydata <- mydata %>% 
     mutate(Date = ymd(str_remove(Date, " .*")),
            code = as.character(code))

第1条

我使用来自dplyr的count来按城市计算代码。然后case_when根据请求进一步识别“是”或“否”。

代码语言:javascript
复制
# This counts city and code, and fullfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <-   "2020-03-29"
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%   
  mutate(Match = case_when(
   count_samecode > 1 ~ "Yes",
                    T ~ "No")) %>%
  head()
# # A tibble: 6 x 4
#  city     code  count_samecode Match
# <chr>    <chr>          <int> <chr>
# 1 Dubai    101                3 Yes  
# 2 Dubai    150                1 No   
# 3 London   144                2 Yes  
# 4 New York 123                1 No   
# 5 Osaka    193                1 No   
# 6 Tokyo    123                1 No  

第2条

我会对UID做同样的事情

代码语言:javascript
复制
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()

# # A tibble: 6 x 3
# city     uid   `UIDs_#_filtered`
# <chr>    <chr>             <int>
# 1 Dubai    K-4                   3
# 2 Dubai    K-7                   1
# 3 London   K-3                   2
# 4 New York K-1                   1
# 5 Osaka    K-2                   1
# 6 Tokyo    K-2                   1

第3条

我可以重复子句2的count,以找出这些城市中有多少城市有不同的UID,其中>1表示不同的UID。

代码语言:javascript
复制
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>% 
  count(city, name = "UIDs_#_different") %>% 
  head()
# # A tibble: 6 x 2
# city     `UIDs_#_different`
# <chr>                 <int>
# 1 Dubai                     2
# 2 London                    1
# 3 New York                  1
# 4 Osaka                     1
# 5 Tokyo                     2
# 6 Utha                      1

第4条

使用#2中的相同代码,我可以消除筛选器来查找整个数据集

代码语言:javascript
复制
mydata %>% 
  count(city, uid, name = "UIDs_#_all") %>% 
  head()

把所有的东西都放在一起

使用几个left_join,我们可以更接近所需的输出。编辑:现在将从第一个城市/代码组合中获取ID的第一个实例。

代码语言:javascript
复制
check_duplicates_filterview.f <- function( df, startdate, enddate ){
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string   "yyyy-mm-dd"
  cityfilter <- df %>% filter(Date >= startdate,
                              Date <= enddate) %>% distinct(city) %>% pull(1)
  df <- df %>% 
    filter(city %in% cityfilter) %>% 
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>%              # Finds count of code in entire DB
    count(city, code) 
  where.1 <- df %>% filter(Date >= startdate, 
                     Date <= enddate) %>% 
    distinct(city, code, .keep_all = T) %>%
    left_join(entire.db.countcodes)  %>% 
    rename("count_samecode" = n) %>% 
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      T ~ "No"))
  
  where.2 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    distinct(city, uid) %>% 
    count(city, name = "UIDs_#_distinct") 
  where.4 <- df %>% 
    filter(city %in% cityfilter) %>% 
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}


# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city     code  count_samecode Match uid   `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr>    <chr>          <int> <chr> <chr>             <int>        <int>             <int>
# 1 Dubai    101                2 Yes   K-4                   2            3                 1
# 2 London   144                1 No    K-3                   1            2                 1
# 3 New York 123                1 No    K-1                   1            1                 1
# 4 Tokyo    143                1 No    K-5                   1            1                 1
# 5 Utha     103                1 No    K-1                   1            1                 1
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/63257981

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档