Learning "R for Data Science" (7): Relational Data

The nycflights13 dataset
Keys
Mutating Joins
Filtering joins
Join problems
Set operations

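Data analysis tasks often involve more than one table.

Three families of verbs work with relational data:

mutating joins, which add new variables to one table from matching observations in another;
filtering joins, which filter the observations of one table based on whether they match an observation in another;
set operations, which treat observations as if they were elements of a set.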

library(tidyverse)
library(nycflights13)

The nycflights13 dataset

flights holds the information on each flight:

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

airlines lets you look up the full airline name from its abbreviated carrier code:

airlines

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

airports gives information about each airport, identified by the faa code:

airports

## # A tibble: 1,458 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044    -5 A     America/New_~
##  2 06A   Moton Field Municip~  32.5  -85.7   264    -6 A     America/Chic~
##  3 06C   Schaumburg Regional   42.0  -88.1   801    -6 A     America/Chic~
##  4 06N   Randall Airport       41.4  -74.4   523    -5 A     America/New_~
##  5 09J   Jekyll Island Airpo~  31.1  -81.4    11    -5 A     America/New_~
##  6 0A9   Elizabethton Munici~  36.4  -82.2  1593    -5 A     America/New_~
##  7 0G6   Williams County Air~  41.5  -84.5   730    -5 A     America/New_~
##  8 0G7   Finger Lakes Region~  42.9  -76.8   492    -5 A     America/New_~
##  9 0P2   Shoestring Aviation~  39.8  -76.6  1000    -5 U     America/New_~
## 10 0S9   Jefferson County In~  48.1 -123.    108    -8 A     America/Los_~
## # ... with 1,448 more rows

planes gives information about each plane, identified by its tailnum:

planes

## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## # ... with 3,312 more rows

weather gives the hourly weather at each NYC airport.

You only need to know which key connects each pair of tables:

flights connects to planes via a single variable, tailnum.
flights connects to airlines through the carrier variable.
flights connects to airports in two ways: via the origin and dest variables.
flights connects to weather via origin (the location), and year, month, day and hour (the time).
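As a rough illustration of these connections, the sketch below joins on a key that has different names in the two tables (dest and faa) and on a key made of several variables. The tables toy_flights, toy_airports and toy_weather are small made-up stand-ins, not the real nycflights13 data, and their values are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; all values are invented for illustration.
toy_flights <- tribble(
  ~flight, ~origin, ~dest, ~hour,
        1, "EWR",   "IAH",     5,
        2, "JFK",   "MIA",     5,
        3, "LGA",   "ATL",     6
)
toy_airports <- tribble(
  ~faa,  ~name,
  "IAH", "Airport A",
  "MIA", "Airport B",
  "ATL", "Airport C"
)
toy_weather <- tribble(
  ~origin, ~hour, ~temp,
  "EWR",       5,    39,
  "JFK",       5,    40,
  "LGA",       6,    41
)

# The key is called dest in toy_flights and faa in toy_airports.
with_dest_name <- toy_flights %>%
  left_join(toy_airports, by = c("dest" = "faa"))

# The key is the combination of origin (place) and hour (time).
with_weather <- toy_flights %>%
  left_join(toy_weather, by = c("origin", "hour"))
```

In the named form of by, the left-hand name refers to the first table and the right-hand name to the second; the result keeps the name from the first table.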

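Keys

A primary key is a variable, or set of variables, that uniquely identifies each observation in its own table.

To check that a candidate key really is unique, count() it and look for entries where n is greater than one. An empty result means the key is unique; any rows returned are duplicated key combinations: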

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)

## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <dbl> <dbl> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2
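The same check can be wrapped in a small helper. This is only a sketch: is_unique_key() is a hypothetical name, not a dplyr function, and the toy tables are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: TRUE when the given columns uniquely identify each row.
is_unique_key <- function(df, ...) {
  dupes <- df %>%
    count(...) %>%
    filter(n > 1)
  nrow(dupes) == 0
}

# Invented toy data.
toy_planes <- tribble(
  ~tailnum, ~seats,
  "N10156",     55,
  "N102UW",    182
)
toy_weather <- tribble(
  ~origin, ~hour, ~temp,
  "EWR",       1,    39,
  "EWR",       1,    41,
  "JFK",       1,    40
)

is_unique_key(toy_planes, tailnum)        # TRUE
is_unique_key(toy_weather, origin, hour)  # FALSE: (EWR, 1) appears twice
```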

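If a table has no explicit primary key, you can add a surrogate key with mutate() and row_number(). This makes it possible to track which observations differ before and after filtering.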
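A minimal sketch of a surrogate key, assuming a made-up table toy_flights in which no column is unique on its own; the column name id is an arbitrary choice:

```r
library(dplyr)
library(tibble)

# Invented toy data: the first two rows are identical, so no natural key exists.
toy_flights <- tribble(
  ~carrier, ~dest,
  "UA",     "IAH",
  "UA",     "IAH",
  "AA",     "MIA"
) %>%
  mutate(id = row_number())  # surrogate key: 1, 2, 3

# After filtering, the surrogate key shows exactly which rows were removed.
kept    <- toy_flights %>% filter(carrier == "UA")
dropped <- toy_flights %>% anti_join(kept, by = "id")
```

Here dropped contains the single row with id 3.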

A primary key and the corresponding foreign key in another table form a relation.

Mutating Joins

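A mutating join combines the variables of two tables. It first matches observations by their keys, then copies variables from one table to the other.

New variables are added on the far right by default, so first create a narrower table in which they are easy to see.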

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

For example, to add the full airline name to the table:

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows
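To make the "match by key, then copy" idea concrete, the sketch below reproduces a left join by hand with base R's match(). The tables toy_flights and toy_airlines are small stand-ins for the real ones.

```r
library(dplyr)
library(tibble)

# Small stand-in tables for illustration.
toy_flights <- tribble(
  ~flight, ~carrier,
        1, "UA",
        2, "AA",
        3, "UA"
)
toy_airlines <- tribble(
  ~carrier, ~name,
  "AA",     "American Airlines Inc.",
  "UA",     "United Air Lines Inc."
)

# The join: match on carrier, then copy name across.
joined <- toy_flights %>%
  left_join(toy_airlines, by = "carrier")

# The same result by hand: match() finds the row of toy_airlines for each
# flight, and indexing copies the corresponding name.
by_hand <- toy_flights %>%
  mutate(name = toy_airlines$name[match(carrier, toy_airlines$carrier)])

identical(joined$name, by_hand$name)  # TRUE
```

The hand-written version becomes awkward as soon as several variables have to be matched or copied, which is where the join functions pay off.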

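This is roughly equivalent to base R's merge(). Note that merge() performs an inner join by default (the row count is the same here because every carrier has a match), that it reorders rows and columns, and that this call keeps origin and dest:

merge(flights2, airlines, by = "carrier") %>% as_tibble()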

## # A tibble: 336,776 x 9
##    carrier  year month   day  hour origin dest  tailnum name             
##    <chr>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>            
##  1 9E       2013     2     5     8 JFK    RDU   N8698A  Endeavor Air Inc.
##  2 9E       2013     8    23    19 JFK    PIT   N926XJ  Endeavor Air Inc.
##  3 9E       2013     6     2     8 JFK    MSP   N925XJ  Endeavor Air Inc.
##  4 9E       2013    10    26    19 JFK    CVG   N928XJ  Endeavor Air Inc.
##  5 9E       2013     7     7    20 JFK    PHL   <NA>    Endeavor Air Inc.
##  6 9E       2013     2    18    15 JFK    DCA   N910XJ  Endeavor Air Inc.
##  7 9E       2013     1    19    18 JFK    MSP   N935XJ  Endeavor Air Inc.
##  8 9E       2013    12    19    19 JFK    PIT   N604LR  Endeavor Air Inc.
##  9 9E       2013     5    26    18 JFK    PIT   N906XJ  Endeavor Air Inc.
## 10 9E       2013     5    29     9 JFK    PHL   N601XJ  Endeavor Air Inc.
## # ... with 336,766 more rows

Create two sample tables:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

inner join

An inner join keeps only the observations whose keys appear in both tables.

x %>% 
  inner_join(y, by = "key")

## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

Unmatched rows are not included in the result.

outer join

A left join keeps all observations in x;
a right join keeps all observations in y;
a full join keeps all observations in x and y.
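A short sketch of the three outer joins on sample tables like the ones above (redefined here so the block runs on its own). Rows without a match are kept and filled with NA:

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

left  <- x %>% left_join(y, by = "key")   # keys 1, 2, 3; val_y is NA for key 3
right <- x %>% right_join(y, by = "key")  # keys 1, 2, 4; val_x is NA for key 4
full  <- x %>% full_join(y, by = "key")   # keys 1, 2, 3, 4
```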

duplicate keys

When only one table has duplicate keys, the join adds extra information along a one-to-many relationship:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")

## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x3    y2   
## 4     1 x4    y1

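When both tables have duplicate keys, the join returns every possible combination (the Cartesian product) of the matching rows, which usually signals a problem with the keys: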

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")

## # A tibble: 6 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     2 x2    y3   
## 4     2 x3    y2   
## 5     2 x3    y3   
## 6     3 x4    y4
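One way to spot this situation before joining is to look for keys that are duplicated in both tables. The sketch below is an assumption about how one might do that, not something from the book; it redefines the two tables so it runs on its own.

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)

# Keys that occur more than once in each table.
dup_x <- x %>% count(key) %>% filter(n > 1)
dup_y <- y %>% count(key) %>% filter(n > 1)

# Keys duplicated in both tables; each produces n.x * n.y rows in a join.
dup_both <- inner_join(dup_x, dup_y, by = "key") %>%
  mutate(rows_in_join = n.x * n.y)
```

Here dup_both has one row: key 2, which accounts for 2 * 2 = 4 of the six rows in the join result above.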

How dplyr's join functions correspond to base R's merge():

inner_join(x, y) = merge(x, y)
left_join(x, y) = merge(x, y, all.x = TRUE)
right_join(x, y) = merge(x, y, all.y = TRUE)
full_join(x, y) = merge(x, y, all.x = TRUE, all.y = TRUE)

The dplyr functions state their intent more clearly, are faster, and do not change the order of the rows.
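The difference in row order can be seen in a small sketch with made-up tables whose keys are deliberately out of order:

```r
library(dplyr)
library(tibble)

# Invented tables; x is deliberately not sorted by key.
x <- tribble(
  ~key, ~val_x,
     3, "x3",
     1, "x1",
     2, "x2"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)

via_dplyr <- left_join(x, y, by = "key")
via_merge <- merge(x, y, by = "key", all.x = TRUE)

via_dplyr$key  # 3 1 2: the order of x is preserved
via_merge$key  # 1 2 3: merge() sorts on the key by default
```

merge() also accepts sort = FALSE, but its documentation leaves the resulting row order unspecified.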

Filtering joins

Filtering joins affect the observations, not the variables.

semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.
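A minimal sketch of both filtering joins on two small sample tables. Neither result gains any column from y:

```r
library(dplyr)
library(tibble)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

matched   <- x %>% semi_join(y, by = "key")  # rows of x with keys 1 and 2
unmatched <- x %>% anti_join(y, by = "key")  # row of x with key 3
```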

For example, suppose you have found the ten most popular destinations:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest

## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

To select the flights that went to one of these ten destinations, you can build a filter yourself:

flights %>% 
  filter(dest %in% top_dest$dest)

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Or use semi_join(), which keeps the rows of x that have a match in y:

flights %>% 
  semi_join(top_dest)

## Joining, by = "dest"

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
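With a single key the two approaches give the same rows. With a key made of several variables they can differ, as the sketch below suggests; toy_flights and busy_days are made-up tables.

```r
library(dplyr)
library(tibble)

# Invented toy data.
toy_flights <- tribble(
  ~month, ~day, ~flight,
       1,    1, "A",
       1,    2, "B",
       2,    1, "C",
       2,    2, "D"
)
busy_days <- tribble(
  ~month, ~day,
       1,    1,
       2,    2
)

# Filtering each column separately matches month and day independently,
# so (1, 2) and (2, 1) slip through as well.
loose <- toy_flights %>%
  filter(month %in% busy_days$month, day %in% busy_days$day)

# semi_join() matches the (month, day) pairs as a unit.
strict <- toy_flights %>%
  semi_join(busy_days, by = c("month", "day"))

loose$flight   # "A" "B" "C" "D"
strict$flight  # "A" "D"
```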

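anti_join() is the inverse: it drops the rows of x that have a match in y. It is useful for diagnosing join mismatches.

The example below finds many flights whose tailnum has no record in planes.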

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows

Join problems

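Identify the primary key from your understanding of the data, not merely by finding a combination of variables that happens to be unique;
check that none of the primary key variables has missing values;
check that every foreign key matches a primary key in the other table, using anti_join().
If keys are missing or unmatched, think carefully about whether an inner join or an outer join is appropriate.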
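The checks above might look like this on two made-up tables, toy_planes (primary key tailnum) and toy_flights (foreign key tailnum):

```r
library(dplyr)
library(tibble)

# Invented toy data with one missing key and one unmatched foreign key.
toy_planes <- tribble(
  ~tailnum, ~seats,
  "N10156",     55,
  NA,          182,
  "N103US",    182
)
toy_flights <- tribble(
  ~flight, ~tailnum,
        1, "N10156",
        2, "N725MQ",
        3, "N103US"
)

# Check 1: rows whose primary key is missing.
missing_keys <- toy_planes %>% filter(is.na(tailnum))

# Check 2: foreign keys with no matching primary key.
orphans <- toy_flights %>% anti_join(toy_planes, by = "tailnum")

# The choice of join decides what happens to the unmatched flight.
inner <- toy_flights %>% inner_join(toy_planes, by = "tailnum")  # drops flight 2
left  <- toy_flights %>% left_join(toy_planes, by = "tailnum")   # keeps it, seats is NA
```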

Set operations

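Set operations require x and y to have the same variables, and they compare entire rows.

intersect(x, y): returns only the observations that are in both x and y.
union(x, y): returns the unique observations in x and y.
setdiff(x, y): returns the observations that are in x but not in y.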

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
df1

## # A tibble: 2 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     2     1

df2

## # A tibble: 2 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     1     2

intersect(df1, df2)

## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1

union(df1, df2)

## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     2     1
## 2     1     1
## 3     1     2

setdiff(df1, df2)

## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     2     1

This keeps the rows of df1 that do not appear in df2.

setdiff(df2, df1)

## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2

This keeps the rows of df2 that do not appear in df1.