Syllabus

Course title and instructor

Title: DSC 385 Data Exploration, Visualization, and Foundations of Unsupervised Learning

Instructor: Claus O. Wilke
GitHub: clauswilke

Purpose and contents of the class

In this class, students will learn how to visualize data sets and how to reason about and communicate with data visualizations. Students will also learn how to assess data quality and provenance, how to compile analyses and visualizations into reports, and how to make the reports reproducible. A substantial component of this class will be dedicated to learning how to program in R.

What you will learn:

Prerequisites

Students are expected to have basic knowledge of statistics. Prior experience with the programming language R is beneficial but not strictly required.

Textbook

This class draws heavily from materials presented in the following book:

We will also make use of the following books:

All of these books are freely available online, and you do not need to purchase a physical copy of any of them to succeed in this class.

Topics covered

Class | Topic | Coding concepts covered
1. | Introduction, reproducible workflows | RStudio setup online, R Markdown
2. | Aesthetic mappings | ggplot2 quickstart
3. | Telling a story |
4. | Visualizing amounts | geom_col(), geom_point(), position adjustments
5. | Coordinate systems and axes | coords and position scales
6. | Visualizing distributions 1 | stats, geom_density(), geom_histogram()
7. | Visualizing distributions 2 | violin plots, sina plots, ridgeline plots
8. | Color scales | color and fill scales
9. | Data wrangling 1 | mutate(), filter(), arrange()
10. | Data wrangling 2 | group_by(), summarize(), count()
11. | Visualizing proportions | bar charts, pie charts
12. | Getting to know your data 1: Data provenance |
13. | Getting to know your data 2: Data quality and relevance | handling missing data, is.na(), case_when()
14. | Getting things into the right order | fct_reorder(), fct_lump()
15. | Figure design | ggplot themes
16. | Color spaces, color vision deficiency | colorspace package
17. | Functions and functional programming | map(), nest(), purrr package
18. | Visualizing trends | geom_smooth()
19. | Working with models | lm, cor.test, broom package
20. | Visualizing uncertainty | frequency framing, error bars, ggdist package
21. | Dimension reduction 1 | PCA
22. | Dimension reduction 2 | kernel PCA, t-SNE, UMAP
23. | Clustering 1 | k-means clustering
24. | Clustering 2 | hierarchical clustering
25. | Data ethics |
26. | Visualizing geospatial data | geom_sf(), coord_sf()
27. | Redundant coding, text annotations | ggrepel package
28. | Interactive plots | ggiraph package
29. | Over-plotting | jittering, 2d histograms, contour plots
30. | Compound figures | patchwork package

Reuse

Text and figures are licensed under Creative Commons Attribution (CC BY 4.0). Any computer code (R, HTML, CSS, etc.) in slides and worksheets, including in slide and worksheet sources, is also licensed under the MIT license. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, which you can open by pressing the ‘p’ key.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.