Project 1

Enter your name and EID here

This is the dataset you will be working with:

library(tidyverse) # provides read_csv(), separate(), the dplyr verbs, and ggplot()

ufo_sightings <- 
  read_csv("https://wilkelab.org/classes/SDS348/data_sets/ufo_sightings_clean.csv") %>%
  separate(datetime, into = c("month", "day", "year"), sep = "/") %>%
  separate(year, into = c("year", "time"), sep = " ") %>%
  separate(date_posted, into = c("month_posted", "day_posted", "year_posted"), sep = "/") %>%
  select(-time, -month_posted, -day_posted) %>%
  mutate(year = as.numeric(year)) %>%
  filter(!is.na(country))
## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   duration_seconds = col_double(),
##   duration_hours_min = col_character(),
##   comments = col_character(),
##   date_posted = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

Part 1

Question: How have patterns of UFO sightings developed in different US states since 1940?

To answer this question, consider only the US states California (ca), Texas (tx), and New Mexico (nm), and discard all sightings before 1940. Find which of these states had the highest number of cumulative sightings since 1940. Then, using ggplot, make a plot that shows UFO sightings per year, plotted as a line graph.

Introduction: This dataset contains over 70,000 reports of UFO sightings over the last century. After the preprocessing above, the data has 13 variables, which contain information on the location of the UFO sighting (city, state, country, longitude, latitude), when the UFO sighting occurred (month, day, year), the length of the UFO sighting (duration_seconds, duration_hours_min), details about the UFO in question (shape, comments), and the year in which the report for the sighting was filed (year_posted). To answer the question above, we will need the columns state and year.

Approach: The dataset contains individual records of sightings. Therefore, we first filter the data to sightings in Texas, California, and New Mexico that occurred in 1940 or later, and then count the number of reports for each state. To make a time series plot, we apply the same state and year filters, but this time group by both state and year to obtain the number of UFO sightings per year in each state.

Analysis:

# calculate the total number of UFO sightings for each state
ufo_sightings %>%
  filter(state %in% c("tx", "ca", "nm") & year >= 1940) %>%
  group_by(state) %>%
  summarize(num_sightings = n())
## # A tibble: 3 x 2
##   state num_sightings
##   <chr>         <int>
## 1 ca             8911
## 2 nm              720
## 3 tx             3446
# colorblind-safe palette
color_palette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")

# line graph of sightings per year by state
ufo_sightings %>%
  filter(state %in% c("tx", "ca", "nm") & year >= 1940) %>% # keep 1940 itself, matching the totals above
  group_by(state, year) %>% # need to group by state AND year to maintain a data point for each year and state
  summarize(num_sightings = n()) %>% # instead of total sightings for each state, now we are calculating total sightings **per year** for each state
  ggplot(aes(x = year, y = num_sightings, color = state)) +
  geom_line(size = 1, alpha = 0.75) +
  scale_color_manual(values = color_palette)
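
One caveat for the line graph: group_by() followed by summarize() only returns rows for state and year combinations that have at least one report, so geom_line() draws a straight segment across any year without sightings. A minimal sketch of filling such gaps with explicit zeros using tidyr::complete(), shown on a small made-up tibble (toy_counts and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)
library(tidyr)

# hypothetical per-year counts with gaps: no row for "ca" in 1942 or "nm" in 1941
toy_counts <- tibble(
  state = c("ca", "ca", "nm", "nm"),
  year = c(1940, 1941, 1940, 1942),
  num_sightings = c(2L, 1L, 1L, 3L)
)

filled_counts <- toy_counts %>%
  ungroup() %>% # complete() should see all states at once, not one group at a time
  complete(
    state,
    year = full_seq(year, 1),        # every year between the minimum and maximum
    fill = list(num_sightings = 0L)  # missing combinations become zero sightings
  )

filled_counts
```

Whether this matters for the real data depends on how many state and year combinations are actually empty, which is not checked here.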

Discussion: For the three states considered, California has the highest total number of UFO sightings (n = 8911) since 1940, Texas has the second highest number (n = 3446), and New Mexico has the lowest number of UFO sightings (n = 720). One hypothesis for this observation is that the number of sightings may be proportional to state population; i.e., New Mexico has the lowest population, and accordingly the lowest number of sightings. After plotting a line graph of UFO sightings by year since 1940, we can see a sharp spike in reports around 1995. One hypothesis for this is the beginning of the internet revolution in the mid-1990s, i.e., widespread communication of observations led to a significantly higher accumulation of reports in an aggregated database.
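
The population hypothesis could be checked by normalizing the totals by state population. The following is only a sketch: the sighting totals are copied from the summary table above, while the population values are an assumption (approximate, rounded 2010 Census figures that are not part of the dataset and should be verified before use). The names state_totals, state_population, and per_capita are made up for this illustration.

```r
library(dplyr)

# sighting totals since 1940, copied from the summary table above
state_totals <- tibble(
  state = c("ca", "nm", "tx"),
  num_sightings = c(8911, 720, 3446)
)

# ASSUMPTION: approximate, rounded 2010 Census populations; not from the dataset
state_population <- tibble(
  state = c("ca", "nm", "tx"),
  population = c(37.3e6, 2.06e6, 25.1e6)
)

per_capita <- state_totals %>%
  left_join(state_population, by = "state") %>%
  mutate(sightings_per_100k = num_sightings / population * 1e5) %>%
  arrange(desc(sightings_per_100k))

per_capita
```

If sightings simply tracked population, the three per-capita rates would be roughly equal. Note that the totals accumulate over many decades while the population is a single-year snapshot, so this is a rough comparison at best.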

Part 2

Question: For the top six Texas cities with the most UFO reports, what are the top three shapes most UFOs are described to have?

Introduction: This dataset contains information on the city, state, country, year, time, duration, description, and observed shape of UFO sightings in the last century or so. To answer the question above, we will need the columns state, city, and shape. The columns state and city describe where in the United States the UFO sighting occurred, and the column shape describes the apparent shape of the UFO at the time of the sighting.

Approach: To evaluate which Texas cities have the most UFO sightings, we will need to filter the dataset for sightings occurring only in Texas, tally the total number of reports per city, and store the top 6 cities with the most reports in a new dataframe called top_tx. Then, we will do a second manipulation of ufo_sightings where we filter for the cities in the top_tx dataframe, group by city and shape, and summarize the number of occurrences of each shape in each city. To visualize the distribution of the top shape descriptors in each city, a bar plot is the best choice because it allows direct comparison of counts of categorical information.

Analysis:

# create dataframe containing top 6 Texas cities with most UFO sightings
top_tx <- ufo_sightings %>%
  filter(state == "tx") %>%
  group_by(city) %>%
  tally() %>% # count the number of UFO sightings in each city
  top_n(6) %>% # take the top 6 cities with the most UFO sightings
  arrange(desc(n)) # order from most to least sightings
## Selecting by n
top_tx
## # A tibble: 6 x 2
##   city            n
##   <chr>       <int>
## 1 houston       294
## 2 austin        212
## 3 san antonio   173
## 4 dallas        139
## 5 el paso        86
## 6 arlington      61
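
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which names the ordering column explicitly and therefore does not print the "Selecting by n" message. A minimal sketch, assuming dplyr >= 1.0.0 and using a hypothetical toy tibble (toy_sightings and top_cities are made-up names, and the rows are not from the dataset):

```r
library(dplyr)

# hypothetical sightings: three in Houston (tx), two in Austin, one in Dallas,
# plus one report from a same-named city in another state
toy_sightings <- tibble(
  state = c("tx", "tx", "tx", "tx", "tx", "tx", "ca"),
  city = c("houston", "houston", "houston", "austin", "austin", "dallas", "houston")
)

top_cities <- toy_sightings %>%
  filter(state == "tx") %>%
  count(city) %>%                        # same as group_by(city) %>% tally()
  slice_max(n, n = 2, with_ties = FALSE) # order by column n, keep exactly 2 rows

top_cities
```

Unlike top_n(), which keeps all tied rows, with_ties = FALSE guarantees exactly the requested number of rows; leaving with_ties at its default of TRUE keeps ties.
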
# create a dataframe containing the top 3 shapes observed in the top 6 Texas cities with the most UFO sightings
tx_ufo_shapes <- ufo_sightings %>%
  filter(city %in% top_tx$city) %>% # keep sightings in cities named in top_tx (matches on city name only, in any state)
  group_by(city, shape) %>%
  summarize(count = n()) %>% # count the number of descriptions for each shape
  top_n(3) %>% # take the top 3 shapes with the highest counts for each city
  arrange(city, desc(count)) # order from most to least common shape
## Selecting by count
tx_ufo_shapes
## # A tibble: 18 x 3
## # Groups:   city [6]
##    city        shape    count
##    <chr>       <chr>    <int>
##  1 arlington   light       36
##  2 arlington   triangle    19
##  3 arlington   sphere      14
##  4 austin      light       35
##  5 austin      triangle    30
##  6 austin      fireball    23
##  7 dallas      light       30
##  8 dallas      triangle    20
##  9 dallas      disk        18
## 10 el paso     light       18
## 11 el paso     circle      10
## 12 el paso     disk         9
## 13 houston     light       56
## 14 houston     triangle    29
## 15 houston     other       27
## 16 san antonio light       30
## 17 san antonio triangle    22
## 18 san antonio other       15
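
Filtering on city %in% top_tx$city matches by city name alone, so reports from same-named cities in other states are counted as well. This would explain why the three Arlington counts above sum to 69, which is more than the 61 Texas reports listed for Arlington. A sketch of a filter that also restricts on state, using hypothetical toy data (toy_sightings, top_city_names, name_only, and texas_only are made-up names; results for the real dataset are not shown here and may differ from the table above):

```r
library(dplyr)

# hypothetical reports: "arlington" appears in both Texas and Virginia
toy_sightings <- tibble(
  state = c("tx", "tx", "va", "va", "tx"),
  city = c("arlington", "arlington", "arlington", "arlington", "austin"),
  shape = c("light", "triangle", "light", "light", "light")
)
top_city_names <- c("arlington", "austin") # stands in for top_tx$city

# matching on city name only also picks up the Virginia reports
name_only <- toy_sightings %>%
  filter(city %in% top_city_names) %>%
  count(city, shape, name = "count")

# adding the state condition keeps Texas reports only
texas_only <- toy_sightings %>%
  filter(state == "tx", city %in% top_city_names) %>%
  count(city, shape, name = "count")
```
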
# get total for the top shapes in each city
tx_ufo_shapes %>% 
  group_by(shape) %>%
  summarize(total = sum(count)) %>%
  arrange(desc(total))
## # A tibble: 7 x 2
##   shape    total
##   <chr>    <int>
## 1 light      205
## 2 triangle   120
## 3 other       42
## 4 disk        27
## 5 fireball    23
## 6 sphere      14
## 7 circle      10
# colorblind-safe palette
color_palette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# visualize the most common kinds of UFO shapes witnessed in each Texas city 
tx_ufo_shapes %>%
  ggplot(aes(x = shape, y = count, fill = shape)) +
  geom_col() +
  facet_wrap(~city) +
  coord_flip() +
  scale_fill_manual(values = rev(color_palette))

Discussion: The cities in Texas with the most UFO sightings are Houston (n = 294), Austin (n = 212), San Antonio (n = 173), Dallas (n = 139), El Paso (n = 86), and Arlington (n = 61). Interestingly, these are six of the most populous cities in Texas, which again suggests a correlation between population and number of UFO sightings. We might hypothesize that UFOs appearing over densely populated areas are more likely to be seen and reported by one of the people in that area. The most common description of UFO shape is “light,” which is among the top three shapes in all six cities. The second most common description is “triangle” (seen in 5 cities), followed by “disk” and “other” (2 cities each).

We visualized the top three shapes for each respective city, and can easily spot that the description “fireball” is unique to Austin, “disk” unique to Dallas and El Paso, and the description “other” only reaches top spots in Houston and San Antonio. El Paso is the only city with the top shape descriptor “circle,” and Arlington is the only city with “sphere” as a top reported UFO shape. Regional preferences for describing round objects/shapes might explain the different distributions.