Install Packages

We are going to use a suite of packages. If those are not installed on your machine, please run the following code. {ggplot2} is part of the {tidyverse} package collection, tgther with other helpful packages for a data science workflow such as {dplyr}, {tibble}, {tidyr}, and {readr}.


Import the Data

We can import the data with the {readr} package. One could alternatively use the base function readRDS().

## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2.9000     v purrr   0.3.4     
## v tibble  3.0.3          v dplyr   1.0.1     
## v tidyr   1.1.1          v stringr 1.4.0     
## v readr   1.3.1          v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Explore the Raw Data

Let’s have a look at the data set:

## Rows: 8,447
## Columns: 6
## $ cell_line       <chr> "S117", "SW948", "MALME3M", "DETROIT562", "UHO1", "...
## $ lineage         <chr> "Soft Tissue", "Colorectal", "Skin", "Upper Aerodig...
## $ lineage_subtype <chr> "Thyroid Sarcoma", "Colorectal Adenocarcinoma", "Me...
## $ gene_symbol     <chr> "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5...
## $ dep_score       <dbl> -2.579297, -2.552614, -2.497284, -2.384249, -2.3606...
## $ med             <dbl> -1.620174, -1.620174, -1.620174, -1.620174, -1.6201...

Story: Compare differences in dependency scores (dep_score) per gene (gene_symbol) Goal: Indicate similarity by overlapping distributions

We sort the genes by their median dependency score by turning them into a factor:

We can easily inspect the raw data by plotting all dependency scores, grouped per gene. Since we already know that it is quite a big data set, we choose geom_jitter() instead of geom_point() and add some transparency. We can also quickly add the median visually by using stat_summary():


The Half-Eye Plot

The {ggdist} package provides several very useful geoms. I am especially a fan of the halfeye plot that combines density curves with slab intervals showing the median and data range:

To make it easier to map gene names to distirbutions, I add dotted liens to guide the reader (IMO nicer than gridlines but those would work as well).

The coloring per gene is not providing much additional information. It actually might even highlight some genes and hide some others. Instead, we can also color the distributions by threshold values. A dependency score above 1 or below -1 is of special interest so we use this as a threshold:

By adding a shadding to the “area of no interest” we can increase the highlight effect:


The Final Plot

We can now change the default typeface and add a colored title to explain what the color represents and a caption:

## Loading required package: sysfonts
## Loading required package: showtextdb

Session Info

## [1] "2020-09-28 20:01:27 CEST"
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## system code page: 65001
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] showtext_0.8-1     showtextdb_3.0     sysfonts_0.8.1     forcats_0.5.0     
##  [5] stringr_1.4.0      dplyr_1.0.1        purrr_0.3.4        readr_1.3.1       
##  [9] tidyr_1.1.1        tibble_3.0.3       ggplot2_3.3.2.9000 tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5           lubridate_1.7.9      here_0.1            
##  [4] assertthat_0.2.1     rprojroot_1.3-2      digest_0.6.25       
##  [7] utf8_1.1.4           R6_2.4.1             cellranger_1.1.0    
## [10] plyr_1.8.6           ggridges_0.5.2       backports_1.1.7     
## [13] reprex_0.3.0         evaluate_0.14        httr_1.4.2          
## [16] pillar_1.4.6         rlang_0.4.7          curl_4.3            
## [19] readxl_1.3.1         rstudioapi_0.11      blob_1.2.1          
## [22] rmarkdown_2.3        labeling_0.3         gridtext_0.1.1      
## [25] munsell_0.5.0        broom_0.7.0          compiler_4.0.2      
## [28] modelr_0.1.8         xfun_0.16            pkgconfig_2.0.3     
## [31] htmltools_0.5.0      ggtext_0.1.0         tidyselect_1.1.0    
## [34] bookdown_0.20        fansi_0.4.1          viridisLite_0.3.0   
## [37] crayon_1.3.4         dbplyr_1.4.4         withr_2.2.0         
## [40] ggdist_2.2.0         grid_4.0.2           distributional_0.2.0
## [43] jsonlite_1.7.0       gtable_0.3.0         lifecycle_0.2.0     
## [46] DBI_1.1.0            magrittr_1.5         scales_1.1.1        
## [49] rmdformats_0.3.7     cli_2.0.2            stringi_1.4.6       
## [52] farver_2.0.3         fs_1.5.0             xml2_1.3.2          
## [55] ellipsis_0.3.1       generics_0.0.2       vctrs_0.3.2         
## [58] tools_4.0.2          glue_1.4.1           markdown_1.1        
## [61] hms_0.5.3            yaml_2.2.1           colorspace_1.4-1    
## [64] rvest_0.3.6          knitr_1.29           haven_2.3.1