Makeover DataDrivenHypothesis Distributions
Install Packages
We are going to use a suite of packages. If those are not installed on your machine, please run the following code. {ggplot2}
is part of the {tidyverse}
package collection, tgther with other helpful packages for a data science workflow such as {dplyr}
, {tibble}
, {tidyr}
, and {readr}
.
#install.packages("ggplot2")
install.packages("tidverse")
install.packages("here")
install.packages("forcats")
install.packages("ggdist")
install.packages("ggtext")
Import the Data
We can import the data with the {readr}
package. One could alternatively use the base function readRDS()
.
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2.9000 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Explore the Raw Data
Let’s have a look at the data set:
## Rows: 8,447
## Columns: 6
## $ cell_line <chr> "S117", "SW948", "MALME3M", "DETROIT562", "UHO1", "...
## $ lineage <chr> "Soft Tissue", "Colorectal", "Skin", "Upper Aerodig...
## $ lineage_subtype <chr> "Thyroid Sarcoma", "Colorectal Adenocarcinoma", "Me...
## $ gene_symbol <chr> "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5", "PSMA5...
## $ dep_score <dbl> -2.579297, -2.552614, -2.497284, -2.384249, -2.3606...
## $ med <dbl> -1.620174, -1.620174, -1.620174, -1.620174, -1.6201...
Story: Compare differences in dependency scores (dep_score
) per gene (gene_symbol
) Goal: Indicate similarity by overlapping distributions
We sort the genes by their median dependency score by turning them into a factor:
We can easily inspect the raw data by plotting all dependency scores, grouped per gene. Since we already know that it is quite a big data set, we choose geom_jitter()
instead of geom_point()
and add some transparency. We can also quickly add the median visually by using stat_summary()
:
theme_set(theme_minimal())
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol)) +
geom_jitter(alpha = .1) +
stat_summary(aes(x = dep_score),
fun = "median",
shape = 23,
fill = "white",
size = 1)
The Original Density Plot
A usual way of comparing distributions is a density plot:
ggplot(data = genes,
aes(x = dep_score,
fill = gene_symbol)) +
geom_density(alpha = .5) +
## change color palette and name
scale_fill_viridis_d(name = "Gene:", direction = -1) +
## change spacing y axis
scale_y_continuous(expand = c(.01, .01)) +
## add custom text labels
labs(x = "Dependency score", y = NULL) +
## change legend properties and remove axis text
theme(axis.text.y = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
legend.title = element_text(face = "bold"),
legend.position = c(.1, .6))
There are some problems with an increasing number of genes: * it becomes hard to see individual distributions when the number of overlapping genes is large and distributions are more similar * it becomes impossible to map the colors to each curve
The Ridgeline Plot
Ridgeline plots show distribution aligned to the same horizontal scale but with a vertical spacing, often presented with a slight overlap. The {ggridges}
package is the tool of your choice with {ggplot2}
:
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol,
fill = gene_symbol,
color = after_scale(colorspace::lighten(fill, .3)))) +
ggridges::geom_density_ridges(alpha = .7) +
## change color palette and name
scale_fill_viridis_d(guide = "none", direction = -1) +
## add custom text labels
labs(x = "Dependency score", y = NULL)
## Picking joint bandwidth of 0.0332
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol,
## color by dependency score
fill = med,
color = med)) +
ggridges::geom_density_ridges(alpha = .7) +
## change color palette and name
scale_fill_viridis_c(guide = "none", direction = -1) +
scale_color_viridis_c(guide = "none", direction = -1) +
## change spacing x axis
scale_x_continuous(expand = c(.01, .01)) +
## change spacing y axis
scale_y_discrete(expand = c(.01, .01)) +
## add custom text labels
labs(x = "Dependency score", y = NULL)
## Picking joint bandwidth of 0.0332
The Half-Eye Plot
The {ggdist}
package provides several very useful geoms. I am especially a fan of the halfeye plot that combines density curves with slab intervals showing the median and data range:
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol,
fill = med)) +
ggdist::stat_halfeye(alpha = .7) +
## change color palette and name
scale_fill_viridis_c(guide = "none", direction = -1) +
## add custom text labels
labs(x = "Dependency score", y = NULL)
To make it easier to map gene names to distirbutions, I add dotted liens to guide the reader (IMO nicer than gridlines but those would work as well).
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol,
fill = gene_symbol)) +
geom_linerange(aes(xmin = -Inf, xmax = med),
linetype = "dotted",
size = .2) +
ggdist::stat_halfeye(alpha = .7) +
## change color palette and name
scale_fill_viridis_d(guide = "none", direction = -1) +
## add custom text labels
labs(x = "Dependency score", y = NULL)
The coloring per gene is not providing much additional information. It actually might even highlight some genes and hide some others. Instead, we can also color the distributions by threshold values. A dependency score above 1 or below -1 is of special interest so we use this as a threshold:
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol)) +
geom_linerange(aes(xmin = -Inf, xmax = med,
color = abs(med) > 1),
linetype = "dotted",
size = .2) +
ggdist::stat_halfeye(
aes(fill = stat(abs(x) > 1),
point_fill = after_scale(colorspace::lighten(fill, .15))),
.width = c(.025, .975),
color = "black",
shape = 21,
stroke = .7
) +
## change color palette and name
scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
## add custom text labels
labs(x = "Dependency score", y = NULL)
By adding a shadding to the “area of no interest” we can increase the highlight effect:
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol)) +
## add a box to indicate thresholds
geom_rect(xmin = -1, xmax = 1,
ymin = -Inf, ymax = Inf,
fill = "grey92") +
## ad line to indicate zero scroes
geom_vline(xintercept = 0, color = "white", linetype = "dashed") +
geom_linerange(aes(xmin = -Inf, xmax = med,
color = abs(med) > 1),
linetype = "dotted",
size = .2) +
ggdist::stat_halfeye(
aes(fill = stat(abs(x) > 1),
point_fill = after_scale(colorspace::lighten(fill, .15))),
.width = c(.025, .975),
color = "black",
shape = 21,
stroke = .7
) +
## change color palette and name
scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
## add custom text labels
labs(x = "Dependency score", y = NULL) +
## remove gridlines and add axis line
theme(panel.grid = element_blank(),
axis.line.x = element_line(color = "grey40"))
The Final Plot
We can now change the default typeface and add a colored title to explain what the color represents and a caption:
## Loading required package: sysfonts
## Loading required package: showtextdb
font_add_google("Monda", "Monda")
theme_update(text = element_text(family = "Monda"))
ggplot(data = genes,
aes(x = dep_score,
y = gene_symbol)) +
## add a box to indicate thresholds
geom_rect(xmin = -1, xmax = 1,
ymin = -Inf, ymax = Inf,
fill = "grey92") +
## ad line to indicate zero scroes
geom_vline(xintercept = 0, color = "white", linetype = "dashed") +
geom_linerange(aes(xmin = -Inf, xmax = med,
color = abs(med) > 1),
linetype = "dotted",
size = .2) +
ggdist::stat_halfeye(
aes(fill = stat(abs(x) > 1),
point_fill = after_scale(colorspace::lighten(fill, .15))),
.width = c(.025, .975),
color = "black",
shape = 21,
stroke = .7
) +
## change color palette and name
scale_fill_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
scale_color_manual(values = c("#b3b3b3", "#0fb78e"), guide = "none") +
## add custom text labels
labs(x = "Dependency score", y = NULL,
title = "<b style='color:#0fb78e'>PSMA5</b> and <b style='color:#0fb78e'>MCM2</b> are interesting gene candidates with median dependency scores below -1",
caption = "Experience your “Heureka” moment on DataDrivenHypothesis.org!") +
## remove gridlines, add axis line, adjust title, and add spacing
theme(panel.grid = element_blank(),
axis.line.x = element_line(color = "grey40"),
plot.title = ggtext::element_markdown(margin = margin(b = 10)),
plot.title.position = "plot",
plot.caption = element_text(color = "#0fb78e", margin = margin(t = 20)),
plot.margin = margin(rep(15, 4)))
Session Info
## [1] "2020-09-28 20:01:27 CEST"
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
## system code page: 65001
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] showtext_0.8-1 showtextdb_3.0 sysfonts_0.8.1 forcats_0.5.0
## [5] stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1
## [9] tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2.9000 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 lubridate_1.7.9 here_0.1
## [4] assertthat_0.2.1 rprojroot_1.3-2 digest_0.6.25
## [7] utf8_1.1.4 R6_2.4.1 cellranger_1.1.0
## [10] plyr_1.8.6 ggridges_0.5.2 backports_1.1.7
## [13] reprex_0.3.0 evaluate_0.14 httr_1.4.2
## [16] pillar_1.4.6 rlang_0.4.7 curl_4.3
## [19] readxl_1.3.1 rstudioapi_0.11 blob_1.2.1
## [22] rmarkdown_2.3 labeling_0.3 gridtext_0.1.1
## [25] munsell_0.5.0 broom_0.7.0 compiler_4.0.2
## [28] modelr_0.1.8 xfun_0.16 pkgconfig_2.0.3
## [31] htmltools_0.5.0 ggtext_0.1.0 tidyselect_1.1.0
## [34] bookdown_0.20 fansi_0.4.1 viridisLite_0.3.0
## [37] crayon_1.3.4 dbplyr_1.4.4 withr_2.2.0
## [40] ggdist_2.2.0 grid_4.0.2 distributional_0.2.0
## [43] jsonlite_1.7.0 gtable_0.3.0 lifecycle_0.2.0
## [46] DBI_1.1.0 magrittr_1.5 scales_1.1.1
## [49] rmdformats_0.3.7 cli_2.0.2 stringi_1.4.6
## [52] farver_2.0.3 fs_1.5.0 xml2_1.3.2
## [55] ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.2
## [58] tools_4.0.2 glue_1.4.1 markdown_1.1
## [61] hms_0.5.3 yaml_2.2.1 colorspace_1.4-1
## [64] rvest_0.3.6 knitr_1.29 haven_2.3.1