Title: DSC 385 Data Exploration, Visualization, and Foundations of Unsupervised Learning
Instructor: Claus O. Wilke
GitHub: clauswilke
In this class, students will learn how to visualize data sets and how to reason about and communicate with data visualizations. Students will also learn how to assess data quality and provenance, how to compile analyses and visualizations into reports, and how to make the reports reproducible. A substantial component of this class will be dedicated to learning how to program in R.
What you will learn:
Students are expected to have basic knowledge of statistics. Prior experience with the programming language R is beneficial but not strictly required.
This class draws heavily from materials presented in the following book:
Additionally, we will make use of the following books:
Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen. ggplot2: Elegant Graphics for Data Analysis, 3rd ed. Springer, to appear.
Kieran Healy. Data Visualization: A Practical Introduction. Princeton University Press, 2018.
All of these books are freely available online, and you do not need to purchase a physical copy of any of them to succeed in this class.
Class | Topic | Coding concepts covered |
---|---|---|
1. | Introduction, reproducible workflows | RStudio setup online, R Markdown |
2. | Aesthetic mappings | ggplot2 quickstart |
3. | Telling a story | |
4. | Visualizing amounts | geom_col(), geom_point(), position adjustments |
5. | Coordinate systems and axes | coords and position scales |
6. | Visualizing distributions 1 | stats, geom_density(), geom_histogram() |
7. | Visualizing distributions 2 | violin plots, sina plots, ridgeline plots |
8. | Color scales | color and fill scales |
9. | Data wrangling 1 | mutate(), filter(), arrange() |
10. | Data wrangling 2 | group_by(), summarize(), count() |
11. | Visualizing proportions | bar charts, pie charts |
12. | Getting to know your data 1: Data provenance | |
13. | Getting to know your data 2: Data quality and relevance | handling missing data, is.na(), case_when() |
14. | Getting things into the right order | fct_reorder(), fct_lump() |
15. | Figure design | ggplot themes |
16. | Color spaces, color vision deficiency | colorspace package |
17. | Functions and functional programming | map(), nest(), purrr package |
18. | Visualizing trends | geom_smooth() |
19. | Working with models | lm, cor.test, broom package |
20. | Visualizing uncertainty | frequency framing, error bars, ggdist package |
21. | Dimension reduction 1 | PCA |
22. | Dimension reduction 2 | kernel PCA, t-SNE, UMAP |
23. | Clustering 1 | k-means clustering |
24. | Clustering 2 | hierarchical clustering |
25. | Data ethics | |
26. | Visualizing geospatial data | geom_sf(), coord_sf() |
27. | Redundant coding, text annotations | ggrepel package |
28. | Interactive plots | ggiraph package |
29. | Over-plotting | jittering, 2d histograms, contour plots |
30. | Compound figures | patchwork package |
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Any computer code (R, HTML, CSS, etc.) in slides and worksheets, including in slide and worksheet sources, is also licensed under the MIT license. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, which you can open by pressing the letter ‘p’.
If you see mistakes or want to suggest changes, please create an issue on the source repository.