class: center, middle, title-slide

.title[
# Getting to know your data 2
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2023-02-21
]

---

## Recall from previous lecture: Data dictionary

**Data Dictionary**<br>
A "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format" ([Wikipedia](https://en.wikipedia.org/wiki/Data_dictionary))

---

## Recall from previous lecture: Data dictionary

### Dataset: Births in NC

.pull-left.small-font.width-45.move-up-1em[

**Details and Source**<br>
This dataset contains data on a sample of 1450 birth records from 2001 that statistician John Holcomb at Cleveland State University selected from the North Carolina State Center for Health and Environmental Statistics. The dataset has 1450 observations on 15 variables.

]

.pull-right.xtiny-font.move-up-4em.width-50[

Variable | Description
----------: | :-----------
`ID` | Patient ID code
`Plural` | `1`=single birth, `2`=twins, `3`=triplets
`Sex` | Sex of the baby: `1`=male `2`=female
`MomAge` | Mother's age (in years)
`Weeks` | Completed weeks of gestation
`Marital` | Marital status: `1`=married or `2`=not married
`RaceMom` | Mother's race: `1`=white, `2`=black, `3`=American Indian, `4`=Chinese, `5`=Japanese, `6`=Hawaiian, `7`=Filipino, or `8`=Other Asian or Pacific Islander
`HispMom` | Hispanic origin of mother: `C`=Cuban, `M`=Mexican, `N`=not Hispanic, `O`=Other Hispanic, `P`=Puerto Rico, `S`=Central/South America
`Gained` | Weight gained during pregnancy (in pounds)
`Smoke` | Smoker mom? `1`=yes or `0`=no
`BirthWeightOz` | Birth weight in ounces
`BirthWeightGm` | Birth weight in grams
`Low` | Indicator for low birth weight: `1`=2500 grams or less
`Premie` | Indicator for premature birth: `1`=36 weeks or sooner
`MomRace` | Mother's race: `black`, `hispanic`, `other`, or `white`

]

---

## A complete record of data provenance requires more info

--

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

1. Motivation

--

2. Composition

--

3. Collection Process

--

4. Preprocessing/cleaning/labeling

--

5. Uses

--

6. Distribution

--

7. Maintenance

---

## Data provenance: Motivation

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- For what purpose was the dataset created?

--

- Who created the dataset?

--

- Who funded the creation of the dataset?

---

## Data provenance: Composition

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- What do the instances that comprise the dataset represent?

--

- How many instances are there in total?

--

- Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

--

- Is any information missing from individual instances?

--

- ... See Gebru et al. for complete set of questions

---

## Data provenance: Collection Process

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- How was the data associated with each instance acquired?

--

- If the dataset is a sample from a larger set, what was the sampling strategy?

--

- Were any ethical review processes conducted?

--

- If the dataset relates to people, did the individuals in question consent to the collection and use of their data?

--

- If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?

--

- ... See Gebru et al. for complete set of questions

---

## Data provenance: Preprocessing/cleaning/labeling

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- Was any preprocessing/cleaning/labeling of the data done?

--

- Was the raw data saved in addition to the preprocessed/cleaned/labeled data?

--

- Is the software used to preprocess/clean/label the instances available?

---

## Data provenance: Uses

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- Has the dataset been used for any tasks already?

--

- Is there a repository that links to any or all papers or systems that use the dataset?

--

- What (other) tasks could the dataset be used for?

--

- Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

--

- Are there tasks for which the dataset should not be used?

---

## Data provenance: Distribution

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

--

- How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

--

- When will the dataset be distributed?

--

- Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

--

- Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

--

- Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

---

## Data provenance: Maintenance

.absolute-bottom-right.tiny-font[
Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
]

--

- Who is supporting/hosting/maintaining the dataset?

--

- How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

--

- Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?

--

- If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances?

--

- Will older versions of the dataset continue to be supported/hosted/maintained?

--

- If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

---

## Provenance at a glance: Data Nutritional Labels

--

<img src="know-your-data-2_files/datanutrition.png" width = 80%>

.absolute-bottom-right.tiny-font[
The Data Nutrition Project<br>https://datanutrition.org/
]

[//]: # "segment ends here"

---

class: center middle

## Some real-world examples

---

## Some real-world examples

Let's review three datasets on Kaggle:

- [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

- [Pokemon](https://www.kaggle.com/abcsds/pokemon)

- [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews)

[//]: # "segment ends here"

---

class: center middle

## Working with missing values in R

---

## R propagates missingness

```r
x <- c(1, 2, NA, 4)
sum(x)
```

```
[1] NA
```

--

```r
mean(x)
```

```
[1] NA
```

--

```r
x == 2
```

```
[1] FALSE  TRUE    NA FALSE
```

---

## Many functions allow explicit exclusion of `NA` values

```r
x <- c(1, 2, NA, 4)
sum(x, na.rm = TRUE)
```

```
[1] 7
```

--

```r
mean(x, na.rm = TRUE)
```

```
[1] 2.333333
```

--

But is this the right thing to do?

---

## There is no general right or wrong approach

```r
x <- c(2, 1, 1, 2, 1, 1, 1, 2, NA, 1, 2, 1, 1, 2, 1, 1, 1, 2)
mean(x, na.rm = TRUE)
```

```
[1] 1.352941
```

--

```r
x <- c(NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA)
mean(x, na.rm = TRUE)
```

```
[1] 1.5
```

--

R's default is conservative: If there's at least one `NA`, the result is `NA`

---

## We need to use `is.na()` to check for missing values

.small-font[
```r
c(1, 2, NA, 4) == NA # does not work
```

```
[1] NA NA NA NA
```

```r
is.na(c(1, 2, NA, 4)) # works
```

```
[1] FALSE FALSE  TRUE FALSE
```
]

---

## There are several missing values of different types

- `NA`: logical constant indicating missing value

- `NA_integer_`: missing integer

- `NA_real_`: missing real

- `NA_complex_`: missing complex value

- `NA_character_`: missing string

---

## Sometimes the type matters

```r
c(1, 2, NA)
```

```
[1]  1  2 NA
```

```r
c(1, 2, NA_real_)
```

```
[1]  1  2 NA
```

```r
c(1, 2, NA_character_)
```

```
[1] "1" "2" NA
```

---

## Sometimes the type matters

```r
typeof(c(1, 2, NA))
```

```
[1] "double"
```

```r
typeof(c(1, 2, NA_real_))
```

```
[1] "double"
```

```r
typeof(c(1, 2, NA_character_))
```

```
[1] "character"
```

---

## Replacing `NA` values

Remember from class on data wrangling:

.small-font[
```r
band_data <- full_join(band_members, band_instruments)
band_data
```

```
# A tibble: 4 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  <NA>
2 John  Beatles guitar
3 Paul  Beatles bass
4 Keith <NA>    guitar
```
]

---

## Replacing `NA` values

Replace `NA`s with empty strings in `plays` column:

.small-font[
```r
band_data %>%
  mutate(plays = replace_na(plays, ""))
```

```
# A tibble: 4 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  ""
2 John  Beatles "guitar"
3 Paul  Beatles "bass"
4 Keith <NA>    "guitar"
```
]

---

## Replacing `NA` values

Replace `NA`s with empty strings in all columns:

.small-font[
```r
band_data %>%
  mutate(across(everything(), ~replace_na(.x, "")))
```

```
# A tibble: 4 × 3
  name  band      plays
  <chr> <chr>     <chr>
1 Mick  "Stones"  ""
2 John  "Beatles" "guitar"
3 Paul  "Beatles" "bass"
4 Keith ""        "guitar"
```
]

---

## Replacing `NA` values

Replace empty strings with `NA` in `plays` column (requires **naniar** package):

.small-font[
```r
band_data %>%
  mutate(across(everything(), ~replace_na(.x, ""))) %>%
  replace_with_na_at("plays", ~.x == "")
```

```
# A tibble: 4 × 3
  name  band      plays
  <chr> <chr>     <chr>
1 Mick  "Stones"  <NA>
2 John  "Beatles" guitar
3 Paul  "Beatles" bass
4 Keith ""        guitar
```
]

---

## Replacing `NA` values

Replace empty strings with `NA` in all columns (requires **naniar** package):

.small-font[
```r
band_data %>%
  mutate(across(everything(), ~replace_na(.x, ""))) %>%
  replace_with_na_all(~.x == "")
```

```
# A tibble: 4 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 Mick  Stones  <NA>
2 John  Beatles guitar
3 Paul  Beatles bass
4 Keith <NA>    guitar
```
]

---

## Removing rows with `NA` values

Remove all rows with any `NA`s with `na.omit()`:

.small-font[
```r
band_data %>%
  na.omit()
```

```
# A tibble: 2 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 John  Beatles guitar
2 Paul  Beatles bass
```
]

---

## Removing rows with `NA` values

Remove all rows where specific columns contain `NA`s:

.small-font[
```r
band_data %>%
  filter(!is.na(plays))
```

```
# A tibble: 3 × 3
  name  band    plays
  <chr> <chr>   <chr>
1 John  Beatles guitar
2 Paul  Beatles bass
3 Keith <NA>    guitar
```
]

---

## Removing rows with `NA` values

Conversely:

.small-font[
```r
band_data %>%
  filter(is.na(plays))
```

```
# A tibble: 1 × 3
  name  band   plays
  <chr> <chr>  <chr>
1 Mick  Stones <NA>
```
]

---

## Visualizing `NA`s

.small-font.pull-left.width-40[
```r
ggplot(NCbirths) +
  aes(Weeks, Gained) +
  geom_point()
```

By default, missing points are not shown

]

.pull-right.xtiny-font.width-55[

```
Warning: Removed 40 rows containing missing values (`geom_point()`).
```

![](know-your-data-2_files/figure-html/NCbirths-plot-missing-out-1.svg)<!-- -->

]

---

## Visualizing `NA`s

.small-font.pull-left.width-40[
```r
library(naniar)

ggplot(NCbirths) +
  aes(Weeks, Gained) +
  geom_miss_point()
```

Can show them with the **naniar** package

]

.pull-right.xtiny-font.width-55[

![](know-your-data-2_files/figure-html/NCbirths-plot-missing2-out-1.svg)<!-- -->

]

[//]: # "segment ends here"

---

## Further reading

- Gebru et al., "Datasheets for Datasets", [arXiv:1803.09010](https://arxiv.org/abs/1803.09010)
- Data Nutrition Project, https://datanutrition.org/
- **dplyr** reference documentation: [across()](https://dplyr.tidyverse.org/reference/across.html)
- **naniar** documentation: [geom_miss_point()](http://naniar.njtierney.com/reference/geom_miss_point.html)
- **naniar** reference documentation: [replace_with_na_at()](http://naniar.njtierney.com/reference/replace_with_na_at.html)
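
---

## Appendix: Counting `NA`s before handling them

Whether to drop, replace, or keep missing values depends on how many there are and where they sit. A minimal base-R sketch, assuming a hand-built stand-in for the `band_data` table from the earlier slides (the names `na_counts` and `complete_fraction` are illustrative and not part of the lecture code):

.small-font[
```r
# Assumption: a hand-built copy of the band_data example, so that
# the snippet runs without dplyr or any other package
band_data <- data.frame(
  name  = c("Mick", "John", "Paul", "Keith"),
  band  = c("Stones", "Beatles", "Beatles", NA),
  plays = c(NA, "guitar", "bass", "guitar"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(band_data))
na_counts

# Fraction of rows without any missing value
complete_fraction <- mean(complete.cases(band_data))
complete_fraction
```
]

Here `band` and `plays` each contain one `NA`, and half of the rows are complete. If most of the missingness sits in a single column, column-specific handling may be preferable to a blanket `na.omit()`.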