学习《R for Data Science》(3)——EDA

问题
Variation
缺失值
Covariation
模式和模型
Project式工作流

系统地综合使用可视化和数据变形来探索数据。EDA(Exploratory Data Analysis)

EDA的循环步骤：

针对数据产生问题（假设）；
通过数据可视化、变形、建模来回答问题（验证假设）；
用所获得的知识改善问题或产生新问题；

EDA不是一个严格的标准过程，而是一种思维模式。EDA的目的是对数据的理解。

library(tidyverse)

## Warning: 程辑包'tidyverse'是用R版本3.5.1 来建造的

## Warning: 程辑包'tidyr'是用R版本3.5.1 来建造的

## Warning: 程辑包'readr'是用R版本3.5.1 来建造的

## Warning: 程辑包'forcats'是用R版本3.5.1 来建造的

library(hexbin)

## Warning: 程辑包'hexbin'是用R版本3.5.2 来建造的

library(modelr)

## Warning: 程辑包'modelr'是用R版本3.5.1 来建造的

问题

“There are no routine statistical questions, only questionable statistical routines.”
— Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
— John Tukey

使用问题来引导对数据的理解。EDA是个需要创造力的过程。通过问题来不断深化对数据的理解，越来越能问出好问题。

有两类经典问题：

变量的变化是怎样的？（variation）
变量之间的协变化是怎样的？(covariation)

Variation

可视化分布

怎样可视化变量的分布取决于变量是离散的还是连续的。

对于分类变量，通常使用柱状图(bar chart)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

或者直接输出计数

diamonds %>% 
  count(cut)

## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

table(diamonds$cut)

## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551

对于连续变量，使用直方图（histogram）

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

可以手动计算分箱

diamonds %>% 
  count(cut_width(carat, 0.5))

## # A tibble: 11 x 2
##    `cut_width(carat, 0.5)`     n
##    <fct>                   <int>
##  1 [-0.25,0.25]              785
##  2 (0.25,0.75]             29498
##  3 (0.75,1.25]             15977
##  4 (1.25,1.75]              5313
##  5 (1.75,2.25]              2002
##  6 (2.25,2.75]               322
##  7 (2.75,3.25]                32
##  8 (3.25,3.75]                 5
##  9 (3.75,4.25]                 4
## 10 (4.25,4.75]                 1
## 11 (4.75,5.25]                 1

使用直方图时，应使用多个分箱区间来放大和缩小区间及粒度。

smaller <- diamonds %>% 
  filter(carat < 3)
  
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

如果想绘制多个直方图（比如一组分类变量下的），则使用geom_freqpoly

ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

典型值

一些可能的问题：

Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

对于钻石问题来说：

Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

观察到了钻石有聚类现象，则对聚类提出问题：

How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?

老忠实泉的喷发时间

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)

为什么老忠实泉有两个集中的喷发时间，中间却很少。

提出一些问题引导探索，比如，一些变量的分布是否能通过其他变量来解释。即变量之间的相互关系。

异常值

非典型值有两种情况：

错误数据；
蕴含重要新知识的数据；

当数据量很大时，离群点不容易被发现。

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

因为直方图被压缩，所以意识到有个很大的值。

为了发现缺失值，可以zoom in到小值的范围

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

区别于xlim()和ylim()，这两个是将超出范围的样本扔掉

存在3个异常值，挑选出来

unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y)
unusual

## # A tibble: 9 x 4
##   price     x     y     z
##   <int> <dbl> <dbl> <dbl>
## 1  5139  0      0    0   
## 2  6381  0      0    0   
## 3 12800  0      0    0   
## 4 15686  0      0    0   
## 5 18034  0      0    0   
## 6  2130  0      0    0   
## 7  2130  0      0    0   
## 8  2075  5.15  31.8  5.12
## 9 12210  8.09  58.9  8.06

发现这些异常数据都是有问题的。

不推荐不加考虑直接扔掉。因为一个观测存在一个值是异常值，不代表其他值也是异常值。直接丢掉会很可惜。

对于异常值，如果不知道为什么产生，可以将其替换为缺失值。

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

缺失值

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()

## Warning: Removed 9 rows containing missing values (geom_point).

如果不想要提示，加上na.rm=T

为了探寻缺失值的模式，可以将缺失单列出一个变量

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

然而由于没有取消的航班远远多于取消的航班，所以看不出规律。

使用分面。

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)+
    facet_wrap(~cancelled,nrow = 2,scales = "free_y")

Covariation

多个变量之间的协变化关系

一个分类变量和一个连续变量

geom_freqpoly()不太容易比较，因为高度是由count给出的

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

因为每组中样本量相差很大，所以分布的差别不容易看出来。

ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))

为了消除量的差别，y坐标需要调整，改成density，是count的标准化结果

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

除了直方图外，还可以用箱线图看分布，箱体内（IQR）是25%~75%，离群点是超过1.5个IQR

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

当分类变量是无序的时，可以考虑重新排序

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

如果变量名较长，可以翻转xy坐标

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

两个分类变量

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

相当于列联表的可视化

table(diamonds$color,diamonds$cut)

##    
##     Fair Good Very Good Premium Ideal
##   D  163  662      1513    1603  2834
##   E  224  933      2400    2337  3903
##   F  312  909      2164    2331  3826
##   G  314  871      2299    2924  4884
##   H  303  702      1824    2360  3115
##   I  175  522      1204    1428  2093
##   J  119  307       678     808   896

可以用瓦块图

diamonds %>% 
  count(color, cut)

## # A tibble: 35 x 3
##    color cut           n
##    <ord> <ord>     <int>
##  1 D     Fair        163
##  2 D     Good        662
##  3 D     Very Good  1513
##  4 D     Premium    1603
##  5 D     Ideal      2834
##  6 E     Fair        224
##  7 E     Good        933
##  8 E     Very Good  2400
##  9 E     Premium    2337
## 10 E     Ideal      3903
## # ... with 25 more rows

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

两个连续变量

使用散点图

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

如果点很多的话，散点图就看不清楚，可以加上透明度

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

也可以使用二维分箱

smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

也可以选择在一个维度上装箱，则变成了分类和连续变量问题

使用group来分组

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

但cut_width()每一个箱体都是等宽的，所以不能反映该分组中样本数量，可以通过设置

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)),varwidth = T)

模式和模型

关于模式可以提出的问题：

Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

模式是很重要的，因为利于发现相关关系。

如果说variation是增加了不确定性，covariation则是减少了不确定性。

模型是一种提取模式的方法。

钻石问题中，首先移除carat对price的影响

mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))

当移除了carat对price的影响后，就可以看cut对price的影响了

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))

当移除了克拉因素后，切割越好价格越高。

Project式工作流

C+S+F10重启Rstudio
C+S+S运行当前脚本

对于路径来说:

最好使用斜杠而不是反斜杠;
不要使用绝对路径，因为别人的路径和你的不一样；

Project工作流

Create an RStudio project for each data analysis project.
Keep data files there; we’ll talk about loading them into R in data import.
Keep scripts there; edit them, run them in bits or as a whole.
Save your outputs (plots and cleaned data) there.
Only ever use relative paths, not absolute paths.