涉及到多个表的数据分析任务。
关系型数据的三个动词:
- mutating joins,通过一个表中相匹配的观测向另一个表中添加变量;
- filtering joins,通过一个表中的观测筛选另一个表中的观测;
- set operations,将观测视为集合的元素。
library(tidyverse)
library(nycflights13)
nycflights13数据集
flights查询航班信息
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
airlines查询航空公司及其缩写
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
airports查询机场,用faa标识
airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_~
## 2 06A Moton Field Municip~ 32.5 -85.7 264 -6 A America/Chic~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_~
## 5 09J Jekyll Island Airpo~ 31.1 -81.4 11 -5 A America/New_~
## 6 0A9 Elizabethton Munici~ 36.4 -82.2 1593 -5 A America/New_~
## 7 0G6 Williams County Air~ 41.5 -84.5 730 -5 A America/New_~
## 8 0G7 Finger Lakes Region~ 42.9 -76.8 492 -5 A America/New_~
## 9 0P2 Shoestring Aviation~ 39.8 -76.6 1000 -5 U America/New_~
## 10 0S9 Jefferson County In~ 48.1 -123. 108 -8 A America/Los_~
## # ... with 1,448 more rows
planes查询飞机,用tailnum标识
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## # ... with 3,312 more rows
weather给出所有机场每小时的天气信息
只需要知道每一对表通过什么键连接即可
- flights connects to planes via a single variable, tailnum.
- flights connects to airlines through the carrier variable.
- flights connects to airports in two ways: via the origin and dest variables.
- flights connects to weather via origin (the location), and year, month, day and hour (the time).
Keys 主键
主键是唯一标识每一个观测的变量或变量组。
鉴定主键是否唯一标识的方法:
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 3 x 6
## year month day hour origin n
## <dbl> <dbl> <int> <int> <chr> <int>
## 1 2013 11 3 1 EWR 2
## 2 2013 11 3 1 JFK 2
## 3 2013 11 3 1 LGA 2
如果一个表中没有一个显式化的主键,可以使用代理主键,以用来追踪过滤前后的数据差异。使用mutate()。
一个主键和一个外键构成一对关系,relation。
Mutating Joins
将两个表中的变量合并。首先通过键来匹配观测,再将变量从一个表复制到另一个表。
增加的变量默认放在最右边。创建一个窄表。
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
例如,想将航空公司全名添加到表中
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
相当于
merge(flights2,airlines,by="carrier") %>% as.tibble()
## # A tibble: 336,776 x 9
## carrier year month day hour origin dest tailnum name
## <chr> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 9E 2013 2 5 8 JFK RDU N8698A Endeavor Air Inc.
## 2 9E 2013 8 23 19 JFK PIT N926XJ Endeavor Air Inc.
## 3 9E 2013 6 2 8 JFK MSP N925XJ Endeavor Air Inc.
## 4 9E 2013 10 26 19 JFK CVG N928XJ Endeavor Air Inc.
## 5 9E 2013 7 7 20 JFK PHL <NA> Endeavor Air Inc.
## 6 9E 2013 2 18 15 JFK DCA N910XJ Endeavor Air Inc.
## 7 9E 2013 1 19 18 JFK MSP N935XJ Endeavor Air Inc.
## 8 9E 2013 12 19 19 JFK PIT N604LR Endeavor Air Inc.
## 9 9E 2013 5 26 18 JFK PIT N906XJ Endeavor Air Inc.
## 10 9E 2013 5 29 9 JFK PHL N601XJ Endeavor Air Inc.
## # ... with 336,766 more rows
创建两个样例表
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
inner join
在两张表中找出共有观测
x %>%
inner_join(y, by = "key")
## # A tibble: 2 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
没有匹配的行不会包含在结果中。
outer join
- left join包含所有x的观测;
- right join包含所有y的观测;
- full join包含所有x和y的观测;
duplicate keys
当只有一个表中有重复键时:可以向一个一对多关系的表中添加额外信息
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x3 y2
## 4 1 x4 y1
当两个表中都有重复键时,会得到所有可能结果的笛卡尔积
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 2 x2 y3
## 4 2 x3 y2
## 5 2 x3 y3
## 6 3 x4 y4
dplyr中函数和merge函数的对应关系
- inner_join(x, y) = merge(x, y)
- left_join(x, y) = merge(x, y, all.x = TRUE)
- right_join(x, y) = merge(x, y, all.y = TRUE)
- full_join(x, y) = merge(x, y, all.x = TRUE, all.y = TRUE)
dplyr中的函数含义更明确,速度更快,并且行的顺序不会错乱。
Filtering joins
过滤型连接不会改变变量,而是改变观测。
- semi_join(x, y) keeps all observations in x that have a match in y.
- anti_join(x, y) drops all observations in x that have a match in y.
例如:找到了最受欢迎的十个机场
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
筛选去往这十个机场的航班
flights %>%
filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2 923
## 2 2013 1 1 554 600 -6 812
## 3 2013 1 1 554 558 -4 740
## 4 2013 1 1 555 600 -5 913
## 5 2013 1 1 557 600 -3 838
## 6 2013 1 1 558 600 -2 753
## 7 2013 1 1 558 600 -2 924
## 8 2013 1 1 558 600 -2 923
## 9 2013 1 1 559 559 0 702
## 10 2013 1 1 600 600 0 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
使用semi_join,保留在y中有记录的x
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2 923
## 2 2013 1 1 554 600 -6 812
## 3 2013 1 1 554 558 -4 740
## 4 2013 1 1 555 600 -5 913
## 5 2013 1 1 557 600 -3 838
## 6 2013 1 1 558 600 -2 753
## 7 2013 1 1 558 600 -2 924
## 8 2013 1 1 558 600 -2 923
## 9 2013 1 1 559 559 0 702
## 10 2013 1 1 600 600 0 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
anti_join是反向操作:去掉y中有记录的x。anti_join在检验被删掉记录时很有用。
下例找到了很多没有在planes里有记录的flights。
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # ... with 712 more rows
Join problems
- 根据对数据的理解来寻找主键,而不是根据唯一标识原则;
- 确认主键没有缺失值;
- 确认外键在其他表中有对应主键,使用anti_join()。
- 如果有缺失,需要仔细考虑是内连接还是外连接。
Set operations
集合操作需要x和y有相同变量。
- intersect(x,y)
- union(x,y)
- setdiff(x,y)
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
df1
## # A tibble: 2 x 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 2 1
df2
## # A tibble: 2 x 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 1 2
intersect(df1, df2)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1 1
union(df1, df2)
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 2 1
## 2 1 1
## 3 1 2
setdiff(df1, df2)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 2 1
保留df1中不一样的
setdiff(df2, df1)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1 2
保留df2中不一样的