文章/答案/技术大牛

发布

社区首页 >问答首页 >如何使用R精确匹配整个数据集中的两列值

问如何使用R精确匹配整个数据集中的两列值
EN

Stack Overflow用户

提问于 2020-08-05 03:27:55

回答 1查看 241关注 0票数 6

我在R中有下面提到的两个数据，并且我尝试过各种方法，但是还不能达到所需的输出。

DF：

ID     Date                 city        code    uid
I-1    2020-01-01 10:12:15  New York     123    K-1
I-1    2020-01-01 10:12:15  Utha         103    K-1
I-2    2020-01-02 10:12:15  Washington   122    K-1
I-3    2020-02-01 10:12:15  Tokyo        123    K-2
I-3    2020-02-01 10:12:15  Osaka        193    K-2
I-4    2020-02-02 10:12:15  London       144    K-3
I-5    2020-02-04 10:12:15  Dubai        101    K-4
I-6    2019-11-01 10:12:15  Dubai        101    K-4
I-7    2019-11-01 10:12:15  London       144    K-3
I-8    2018-12-13 10:12:15  Tokyo        143    K-5
I-9    2019-05-17 10:12:15  Dubai        101    K-4
I-19   2020-03-11 10:12:15  Dubai        150    K-7

迪普特：

structure(list(ID = c("I-1", "I-1", 
"I-2", "I-3", "I-3", "I-4", 
"I-5", "I-6", "I-7", "I-8", "I-9","I-19" 
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607", 
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952", 
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536", 
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York", 
"UTAH", "Washington", "Tokyo", 
"Osaka", "London", "Dubai", 
"Dubai", "London", "Tokyo", "Dubai", 
"Dubai"), CODE = c("221010", 
"411017", "638007", "583101", "560029", "643102", "363001", "452001", 
"560024", "509208"), UID = c("K-1", 
"K-1", "K-1", "K-2", "K-2", 
"K-3", "K-4", "K-4", "K-3", 
"K-5","K-4","K-7")), .Names = c("ID", "DATE", 
"CITY", "CODE", "UID"), row.names = c(NA, 
10L), class = "data.fram)

使用上述两个数据，我希望在2020年1月1日至2002年2月29日之间获取记录，并比较整个数据库中的ID，以检查城市和代码是否与其他ID相匹配，并进一步对其进行分类，以检查有多少具有相同的uid，有多少有不同的ID。

哪里,

database

Same_uid中城市和代码匹配与其他ID的组合-对匹配ID进行分类以确定有多少ID具有相似的uid

different_uid，对匹配ID进行分类以确定有多少ID没有相似的uid

uid_count --整个数据库

中该特定ID的相似uid计数

注意-我有超过1000万的数据记录。

所需输出

ID      Date                  city         code   uid   Match   Same_uid   different_uid  uid_count
I-1     2020-01-01 10:12:15   New York     123    K-1    No      0          0              2
I-2     2020-01-02 10:12:15   Washington   122    K-1    No      0          0              2
I-3     2020-02-01 10:12:15   Tokyo        123    K-2    No      0          0              1   
I-4     2020-02-02 10:12:15   London       144    K-3    Yes     1          0              2
I-5     2020-02-04 10:12:15   Dubai        101    K-4    Yes     2          0              3

dataframe

ggplot2

dplyr

tidyverse

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-08-05 06:45:11

一种方法，

在数据集中加载

    library(tidyverse)
    library(lubridate)

   

   mydata <- tibble(
   ID = c("I-1","I-1",
          "I-2","I-3",
          "I-3","I-4",
          "I-5","I-6",
          "I-7","I-8",
          "I-9","I-19"),
   Date = c("2020-01-01", "2020-01-01",
            "2020-01-02", "2020-02-01",
            "2020-02-01", "2020-02-02",
            "2020-02-04", "2019-11-01", 
            "2019-11-01", "2018-12-13", 
            "2019-05-17", "2020-03-11"),
   city = c("New York", "Utha", 
            "Washington", "Tokyo", 
            "Osaka", "London", 
            "Dubai", "Dubai", 
            "London", "Tokyo", 
            "Dubai", "Dubai"),
   code = c("123", "103", "122", "123", "193, "144",
            "101", "101", "144", "143", "101", "150"),
   uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
           "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))

   mydata <- mydata %>% 
     mutate(Date = ymd(str_remove(Date, " .*")),
            code = as.character(code))

第1条

我使用来自dplyr的count来按城市计算代码。然后case_when根据请求进一步识别“是”或“否”。

# This counts city and code, and fullfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <-   "2020-03-29"
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%   
  mutate(Match = case_when(
   count_samecode > 1 ~ "Yes",
                    T ~ "No")) %>%
  head()
# # A tibble: 6 x 4
#  city     code  count_samecode Match
# <chr>    <chr>          <int> <chr>
# 1 Dubai    101                3 Yes  
# 2 Dubai    150                1 No   
# 3 London   144                2 Yes  
# 4 New York 123                1 No   
# 5 Osaka    193                1 No   
# 6 Tokyo    123                1 No

第2条

我会对UID做同样的事情

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()

# # A tibble: 6 x 3
# city     uid   `UIDs_#_filtered`
# <chr>    <chr>             <int>
# 1 Dubai    K-4                   3
# 2 Dubai    K-7                   1
# 3 London   K-3                   2
# 4 New York K-1                   1
# 5 Osaka    K-2                   1
# 6 Tokyo    K-2                   1

第3条

我可以重复子句2的count，以找出这些城市中有多少城市有不同的UID，其中>1表示不同的UID。

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>% 
  count(city, name = "UIDs_#_different") %>% 
  head()
# # A tibble: 6 x 2
# city     `UIDs_#_different`
# <chr>                 <int>
# 1 Dubai                     2
# 2 London                    1
# 3 New York                  1
# 4 Osaka                     1
# 5 Tokyo                     2
# 6 Utha                      1

第4条

使用#2中的相同代码，我可以消除筛选器来查找整个数据集

mydata %>% 
  count(city, uid, name = "UIDs_#_all") %>% 
  head()

把所有的东西都放在一起

使用几个left_join，我们可以更接近所需的输出。编辑:现在将从第一个城市/代码组合中获取ID的第一个实例。

check_duplicates_filterview.f <- function( df, startdate, enddate ){
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string   "yyyy-mm-dd"
  cityfilter <- df %>% filter(Date >= startdate,
                              Date <= enddate) %>% distinct(city) %>% pull(1)
  df <- df %>% 
    filter(city %in% cityfilter) %>% 
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>%              # Finds count of code in entire DB
    count(city, code) 
  where.1 <- df %>% filter(Date >= startdate, 
                     Date <= enddate) %>% 
    distinct(city, code, .keep_all = T) %>%
    left_join(entire.db.countcodes)  %>% 
    rename("count_samecode" = n) %>% 
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      T ~ "No"))
  
  where.2 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    distinct(city, uid) %>% 
    count(city, name = "UIDs_#_distinct") 
  where.4 <- df %>% 
    filter(city %in% cityfilter) %>% 
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}


# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city     code  count_samecode Match uid   `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr>    <chr>          <int> <chr> <chr>             <int>        <int>             <int>
# 1 Dubai    101                2 Yes   K-4                   2            3                 1
# 2 London   144                1 No    K-3                   1            2                 1
# 3 New York 123                1 No    K-1                   1            1                 1
# 4 Tokyo    143                1 No    K-5                   1            1                 1
# 5 Utha     103                1 No    K-1                   1            1                 1

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/63257981

复制

相似问题

问如何使用R精确匹配整个数据集中的两列值
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用R精确匹配整个数据集中的两列值EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用R精确匹配整个数据集中的两列值
EN