Makeover DataDrivenHypothesis Distributions

Install Packages

We are going to use a suite of packages. If those are not installed on your machine, please run the following code. {ggplot2} is part of the {tidyverse} package collection, tgther with other helpful packages for a data science workflow such as {dplyr}, {tibble}, {tidyr}, and {readr}.

#install.packages("ggplot2")
install.packages("tidverse")
install.packages("here")
install.packages("forcats")
install.packages("ggdist")
install.packages("ggtext")

Import the Data

We can import the data with the {readr} package. One could alternatively use the base function readRDS().

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.3.0 --

## v ggplot2 3.3.2.9000     v purrr   0.3.4     
## v tibble  3.0.3          v dplyr   1.0.1     
## v tidyr   1.1.1          v stringr 1.4.0     
## v readr   1.3.1          v forcats 0.5.0

## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

genes <- readr::read_rds(here::here("data", "data-ddh.Rds"))

Explore the Raw Data

Let’s have a look at the data set:

tibble::glimpse(genes)

## Rows: 8,447
## Columns: 6
## $ cell_line       <chr> "S117", "SW948", "MALME3M", "DETROIT562", "UHO1", "...
## $ lineage         <chr> "Soft Tissue", "Colorectal", "Skin", "Upper Aerodig...
## $ lineage_subtype <chr> "Thyroid Sarcoma", "Colorectal Adenocarcinoma", "Me...
## $ gene_symbol     <chr> "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5...
## $ dep_score       <dbl> -2.579297, -2.552614, -2.497284, -2.384249, -2.3606...
## $ med             <dbl> -1.620174, -1.620174, -1.620174, -1.620174, -1.6201...

Story: Compare differences in dependency scores (dep_score) per gene (gene_symbol) Goal: Indicate similarity by overlapping distributions

We sort the genes by their median dependency score by turning them into a factor:

genes <- genes %>% 
  dplyr::mutate(gene_symbol = forcats::fct_reorder(gene_symbol, med))

We can easily inspect the raw data by plotting all dependency scores, grouped per gene. Since we already know that it is quite a big data set, we choose geom_jitter() instead of geom_point() and add some transparency. We can also quickly add the median visually by using stat_summary():

theme_set(theme_minimal())

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol)) +
  geom_jitter(alpha = .1) +
  stat_summary(aes(x = dep_score),
               fun = "median",
               shape = 23,
               fill = "white",
               size = 1)

The Original Density Plot

A usual way of comparing distributions is a density plot:

ggplot(data = genes, 
       aes(x = dep_score, 
           fill = gene_symbol)) +
  geom_density(alpha = .5) +
  ## change color palette and name
  scale_fill_viridis_d(name = "Gene:", direction = -1) +
  ## change spacing y axis
  scale_y_continuous(expand = c(.01, .01)) +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL) +
  ## change legend properties and remove axis text
  theme(axis.text.y = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.title = element_text(face = "bold"),
        legend.position = c(.1, .6))

There are some problems with an increasing number of genes: * it becomes hard to see individual distributions when the number of overlapping genes is large and distributions are more similar * it becomes impossible to map the colors to each curve

The Ridgeline Plot

Ridgeline plots show distribution aligned to the same horizontal scale but with a vertical spacing, often presented with a slight overlap. The {ggridges} package is the tool of your choice with {ggplot2}:

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol,
           fill = gene_symbol,
           color = after_scale(colorspace::lighten(fill, .3)))) +
  ggridges::geom_density_ridges(alpha = .7) +
  ## change color palette and name
  scale_fill_viridis_d(guide = "none", direction = -1) +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL)

## Picking joint bandwidth of 0.0332

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol,
           ## color by dependency score
           fill = med, 
           color = med)) +
  ggridges::geom_density_ridges(alpha = .7) +
  ## change color palette and name
  scale_fill_viridis_c(guide = "none", direction = -1) +
  scale_color_viridis_c(guide = "none", direction = -1) +
  ## change spacing x axis
  scale_x_continuous(expand = c(.01, .01)) +
  ## change spacing y axis
  scale_y_discrete(expand = c(.01, .01)) +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL)

## Picking joint bandwidth of 0.0332

The Half-Eye Plot

The {ggdist} package provides several very useful geoms. I am especially a fan of the halfeye plot that combines density curves with slab intervals showing the median and data range:

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol,
           fill = med)) +
  ggdist::stat_halfeye(alpha = .7) +
  ## change color palette and name
  scale_fill_viridis_c(guide = "none", direction = -1) +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL)

To make it easier to map gene names to distirbutions, I add dotted liens to guide the reader (IMO nicer than gridlines but those would work as well).

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol,
           fill = gene_symbol)) +
  geom_linerange(aes(xmin = -Inf, xmax = med),
                 linetype = "dotted",
                 size = .2) +
  ggdist::stat_halfeye(alpha = .7) +
  ## change color palette and name
  scale_fill_viridis_d(guide = "none", direction = -1) +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL)

The coloring per gene is not providing much additional information. It actually might even highlight some genes and hide some others. Instead, we can also color the distributions by threshold values. A dependency score above 1 or below -1 is of special interest so we use this as a threshold:

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol)) +
  geom_linerange(aes(xmin = -Inf, xmax = med,
                     color = abs(med) > 1),
                 linetype = "dotted",
                 size = .2) +
  ggdist::stat_halfeye(
    aes(fill = stat(abs(x) > 1),
        point_fill = after_scale(colorspace::lighten(fill, .15))),
    .width = c(.025, .975),
    color = "black",
    shape = 21,
    stroke = .7
  ) +
  ## change color palette and name
  scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL)

By adding a shadding to the “area of no interest” we can increase the highlight effect:

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol)) +
  ## add a box to indicate thresholds
  geom_rect(xmin = -1, xmax = 1, 
            ymin = -Inf, ymax = Inf,
            fill = "grey92") +
  ## ad line to indicate zero scroes
  geom_vline(xintercept = 0, color = "white", linetype = "dashed") +
  geom_linerange(aes(xmin = -Inf, xmax = med,
                     color = abs(med) > 1),
                 linetype = "dotted",
                 size = .2) +
  ggdist::stat_halfeye(
    aes(fill = stat(abs(x) > 1),
        point_fill = after_scale(colorspace::lighten(fill, .15))),
    .width = c(.025, .975),
    color = "black",
    shape = 21,
    stroke = .7
  ) +
  ## change color palette and name
  scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL) +
  ## remove gridlines and add axis line
  theme(panel.grid = element_blank(),
        axis.line.x = element_line(color = "grey40"))

The Final Plot

We can now change the default typeface and add a colored title to explain what the color represents and a caption:

## change default font and gridlines
library(showtext)

## Loading required package: sysfonts

## Loading required package: showtextdb

font_add_google("Monda", "Monda")
theme_update(text = element_text(family = "Monda"))

ggplot(data = genes, 
       aes(x = dep_score, 
           y = gene_symbol)) +
  ## add a box to indicate thresholds
  geom_rect(xmin = -1, xmax = 1, 
            ymin = -Inf, ymax = Inf,
            fill = "grey92") +
  ## ad line to indicate zero scroes
  geom_vline(xintercept = 0, color = "white", linetype = "dashed") +
  geom_linerange(aes(xmin = -Inf, xmax = med,
                     color = abs(med) > 1),
                 linetype = "dotted",
                 size = .2) +
  ggdist::stat_halfeye(
    aes(fill = stat(abs(x) > 1),
        point_fill = after_scale(colorspace::lighten(fill, .15))),
    .width = c(.025, .975),
    color = "black",
    shape = 21,
    stroke = .7
  ) +
  ## change color palette and name
  scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
  ## add custom text labels
  labs(x = "Dependency score", y = NULL,
       title = "<b style='color:#0fb78e'>PSMA5</b> and <b style='color:#0fb78e'>MCM2</b> are interesting gene candidates with median dependency scores below -1",
       caption = "Experience your “Heureka” moment on DataDrivenHypothesis.org!") +
  ## remove gridlines, add axis line, adjust title, and add spacing
  theme(panel.grid = element_blank(),
        axis.line.x = element_line(color = "grey40"),
        plot.title = ggtext::element_markdown(margin = margin(b = 10)),
        plot.title.position = "plot",
        plot.caption = element_text(color = "#0fb78e", margin = margin(t = 20)),
        plot.margin = margin(rep(15, 4)))

Session Info

## [1] "2020-09-28 20:01:27 CEST"

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## system code page: 65001
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] showtext_0.8-1     showtextdb_3.0     sysfonts_0.8.1     forcats_0.5.0     
##  [5] stringr_1.4.0      dplyr_1.0.1        purrr_0.3.4        readr_1.3.1       
##  [9] tidyr_1.1.1        tibble_3.0.3       ggplot2_3.3.2.9000 tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5           lubridate_1.7.9      here_0.1            
##  [4] assertthat_0.2.1     rprojroot_1.3-2      digest_0.6.25       
##  [7] utf8_1.1.4           R6_2.4.1             cellranger_1.1.0    
## [10] plyr_1.8.6           ggridges_0.5.2       backports_1.1.7     
## [13] reprex_0.3.0         evaluate_0.14        httr_1.4.2          
## [16] pillar_1.4.6         rlang_0.4.7          curl_4.3            
## [19] readxl_1.3.1         rstudioapi_0.11      blob_1.2.1          
## [22] rmarkdown_2.3        labeling_0.3         gridtext_0.1.1      
## [25] munsell_0.5.0        broom_0.7.0          compiler_4.0.2      
## [28] modelr_0.1.8         xfun_0.16            pkgconfig_2.0.3     
## [31] htmltools_0.5.0      ggtext_0.1.0         tidyselect_1.1.0    
## [34] bookdown_0.20        fansi_0.4.1          viridisLite_0.3.0   
## [37] crayon_1.3.4         dbplyr_1.4.4         withr_2.2.0         
## [40] ggdist_2.2.0         grid_4.0.2           distributional_0.2.0
## [43] jsonlite_1.7.0       gtable_0.3.0         lifecycle_0.2.0     
## [46] DBI_1.1.0            magrittr_1.5         scales_1.1.1        
## [49] rmdformats_0.3.7     cli_2.0.2            stringi_1.4.6       
## [52] farver_2.0.3         fs_1.5.0             xml2_1.3.2          
## [55] ellipsis_0.3.1       generics_0.0.2       vctrs_0.3.2         
## [58] tools_4.0.2          glue_1.4.1           markdown_1.1        
## [61] hms_0.5.3            yaml_2.2.1           colorspace_1.4-1    
## [64] rvest_0.3.6          knitr_1.29           haven_2.3.1