SDS 348, Spring 2020
This is the home page for SDS 348, Computational Biology and Bioinformatics. All relevant course materials will be posted here.
Syllabus: SDS348_syllabus_spring2020.pdf
Revised syllabus due to COVID-19: SDS348_syllabus_spring2020_revised.pdf
Lectures
1. Jan 21, 2020 – Introduction, R Markdown
- Slides: class1.pdf
- R Markdown basics: https://rmarkdown.rstudio.com/authoring_basics.html
- Class compute servers:
- You can download R from here: https://cran.r-project.org/
- You can download RStudio from here: https://www.rstudio.com/products/rstudio/download/
- In-class worksheet:
2. Jan 23, 2020 – R review
- Slides: class2.pdf
- Biostats supplement on regression modeling: statistical_modeling.pdf
- General R tutorial (fairly long and detailed): http://www.cyclismo.org/tutorial/R/index.html
- In-class worksheet:
3. Jan 28, 2020 – Data visualization with ggplot2
- Slides: class3.pdf
- R for Data Science: https://r4ds.had.co.nz/
- ggplot2 reference manual: https://ggplot2.tidyverse.org/
- ggplot2 video tutorial: http://varianceexplained.org/RData/lessons/lesson2/segment1/
- In-class worksheet:
4. Jan 30, 2020 – Data visualization with ggplot2
- Slides: class4.pdf
- Visualization book: Fundamentals of Data Visualization
- Tidyverse style guide: style.tidyverse.org
- In-class worksheet:
5. Feb 4, 2020 – Working with tidy data
- Slides: class5.pdf
- dplyr chapter in R for Data Science: Chapter 5: Data transformation
- dplyr package on the tidyverse website: https://dplyr.tidyverse.org/
- Tidy data paper by Wickham: J. Stat. Soft. 59:10, 2014
- In-class worksheet:
6. Feb 6, 2020 – Working with tidy data
- Slides: class6.pdf
- In-class worksheet:
7. Feb 11, 2020 – Working with tidy data
- Slides: class7.pdf
- In-class worksheet:
8. Feb 13, 2020 – Rearranging data tables with tidyr
- Slides: class8.pdf
- tidyr vignette: https://tidyr.tidyverse.org/articles/tidy-data.html
- In-class worksheet:
9. Feb 18, 2020 – Principal Components Analysis (PCA)
- Slides: class9.pdf
- Intro to PCA: http://setosa.io/ev/principal-component-analysis/
- PCA tutorial with mathematical background: https://arxiv.org/pdf/1404.1100.pdf
- In-class worksheet:
10. Feb 20, 2020 – k-means clustering
- Slides: class10.pdf
- Interactive k-means demonstration: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
- Stack Overflow post on choosing the right number of clusters
- Medium article: The 5 clustering algorithms data scientists need to know.
- In-class worksheet:
11. Feb 25, 2020 – Binary prediction/logistic regression
- Slides: class11.pdf
- Wikipedia page on logistic regression: https://en.wikipedia.org/wiki/Logistic_regression
- In-class worksheet:
12. Feb 27, 2020 – Sensitivity/Specificity, ROC curves
- Slides: class12.pdf
- Wikipedia page on sensitivity and specificity: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
- Wikipedia page on ROC curves: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
- ROC animations: https://github.com/dariyasydykova/open_projects/tree/master/ROC_animation
- In-class worksheet:
13. Mar 3, 2020 – Training and test data sets, cross-validation
- Slides: class13.pdf
- Twitter thread: Information in ROC curves
- Wikipedia page on cross-validation: here
- In-class worksheet:
14. Mar 5, 2020 – Introduction to Python, basic data structures
- Slides: class14.pdf
- Alternative 1 to educcomp: Google Colaboratory
- Alternative 2 to educcomp: Anaconda
- Official Python3 tutorial: https://docs.python.org/3/tutorial/
- Chapter 3 of the official tutorial: An informal introduction
- In-class worksheet:
15. Mar 10, 2020 – Control flow in Python
- Slides: class15.pdf
- Chapters 4.1–4.5 of the official tutorial: More Control Flow Tools
- Chapter 5.5 of the official tutorial: Dictionaries
- In-class worksheet:
16. Mar 12, 2020 – Functions in Python
- Slides: class16.pdf
- Chapters 4.6 and 4.7 of the official tutorial: Defining functions
- In-class worksheet:
17. Mar 31, 2020 – More on Python data structures, classes
- Slides: class17.pdf
- Chapter 9 of the official tutorial: Classes
- In-class worksheet:
18. Apr 2, 2020 – Working with files
- Slides: class18.pdf
- Chapter 7.2 of the official tutorial: Reading and writing files
- In-class worksheet:
19. Apr 7, 2020 – Introduction to Biopython
- Slides: class19.pdf
- Biopython website: https://biopython.org
- Official Biopython tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html
- NCBI Entrez/PubMed website: https://www.ncbi.nlm.nih.gov/
- In-class worksheet:
20. Apr 9, 2020 – Working with gene features and genomes
- Slides: class20.pdf
- Biopython Tutorial on sequence features: SeqFeature objects
- Official feature documentation from the International Nucleotide Sequence Database Collaboration: Feature Key Reference
- In-class worksheet:
21. Apr 14, 2020 – Running queries on Entrez
- Slides: class21.pdf
- Biopython SeqIO documentation: SeqIO
- In-class worksheet:
22. Apr 16, 2020 – Regular expressions
- Slides: class22.pdf
- Python regular expression editor: https://pythex.org/
- Regular expression visualization: https://regexper.com/
- Official Python regular expression documentation: Regular Expression HOWTO
- Alternative regular expression tutorial: Python Regular Expressions
- In-class worksheet:
23. Apr. 21, 2020 – Using regular expressions to analyze data
- Slides: class23.pdf
- Python regular expression editor: https://pythex.org/
- Regular expression visualization: https://regexper.com/
- Regex crosswords: regexcrossword.com
- In-class worksheet:
24. Apr. 23, 2020 – Using regular expressions to analyze data
- Slides: class24.pdf
- Python regular expression editor: https://pythex.org/
- Regular expression visualization: https://regexper.com/
- In-class worksheet:
25. Apr. 28, 2020 – Aligning sequences
- Slides: class25.pdf
- Wikipedia page on the Needleman–Wunsch algorithm: Needleman–Wunsch algorithm
- Alignment software:
- Example alignments
- In-class worksheet:
26. Apr. 30, 2020 – Global and local alignments, BLAST
- Slides: class26.pdf
- Wikipedia page on the Smith-Waterman algorithm: Smith-Waterman algorithm
- NCBI BLAST search: https://blast.ncbi.nlm.nih.gov/Blast.cgi
- Wikipedia page on BLAST: BLAST
- Biopython BLAST documentation: Chapter 7: BLAST
- In-class worksheet:
27. May 5, 2020 – Multiple sequence alignments and phylogenetic trees
- Slides: class27.pdf
- Wikipedia page on multiple sequence alignments: Multiple sequence alignment
- Wikipedia page on phylogenetic trees: Phylogenetic trees
- In-class exercises:
28. May 7, 2020 – Plotting geospatial data
- Slides: class28.pdf
- Fundamentals of dataviz book chapter: Visualizing geospatial data
- R sf package: Simple Features for R vignette
- In-class worksheet:
Homeworks
All homeworks are due by noon (12:00 pm) on the listed due date. Homeworks must be submitted as PDF files on Canvas.
- Homework 1 (due Jan 27, 2020)
- Homework 2 (due Feb 3, 2020)
- Homework 3 (due Feb 10, 2020)
- Homework 4 (due Feb 17, 2020)
- Homework 5 (due Mar 2, 2020)
- Homework 6 (due Mar 9, 2020)
- Homework 7 (due Mar 30, 2020)
- Homework 8 (due Apr 13, 2020)
- Homework 9 (due Apr 20, 2020)
- Homework 10 (due Apr 27, 2020)
Labs
1. Jan 22, 2020
- Slides: lab1.pdf
- Guide to converting from HTML to PDF: html_to_pdf_guide.pdf
- Lab worksheet:
2. Jan 29, 2020
- Guide to all functions available in ggplot2: https://ggplot2.tidyverse.org/reference/
- Guide to interactive plots using ggplotly: https://plot.ly/ggplot2/user-guide/ (note: not necessary for HW, just for fun)
- Lab worksheet:
3. Feb 4, 2020
- Tidyverse style guide (syntax): https://style.tidyverse.org/syntax.html
- Lab worksheet:
4. Feb 11, 2020
- Animated visualizations of different join() functions:
- Lab worksheet:
5. Feb. 18, 2020
- Decent step-by-step walkthrough of principal component calculations: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
- Visualization of data input vs PCA output: http://setosa.io/ev/principal-component-analysis/
- Visualization of eigenvectors/eigenvalues if you really want to dig in: http://setosa.io/ev/eigenvectors-and-eigenvalues/
- Lab worksheet:
6. Feb. 25, 2020
- Lab worksheet:
7. Mar. 3, 2020
- Demo for training and using a machine learning classifier in R: Predicting Legendary Pokemon: A Machine Learning Demo
- Source GitHub repository for the above demo can be found here!
- Lab worksheet:
8. Mar. 11, 2020
- Python style guide: https://www.python.org/dev/peps/pep-0008/#code-lay-out
- Lab worksheet:
9. Apr. 1, 2020
- Intro to object-oriented programming (first half of this article): https://towardsdatascience.com/object-oriented-programming-for-data-scientists-build-your-ml-estimator-7da416751f64
- Lab worksheet:
10. Apr. 8, 2020
- Lab worksheet:
11. Apr. 15, 2020
- Medline Abbreviation Guide: https://www.nlm.nih.gov/bsd/mms/medlineelements.html
- Lab worksheet:
12. Apr. 22, 2020
- NCBI Database Search Field Descriptions: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Lab worksheet:
Projects
All projects are due by noon (12:00 pm) on the listed due date. Projects must be submitted on Canvas, both in PDF format and as source code (plus data where needed).
- Project 1 (due Fri., Feb 28, 2020):
- Project 1 Example
- Project 2 (due Mon., Apr 6, 2020):
- SDS 385 Assignment (due Apr 14, 2020, grad students only):
- Project 3 (due Thurs., May 7, 2020):