Project 2

Name, UTEID

This is the dataset you will be working with:

ufo_sightings <- 
  read_csv("https://wilkelab.org/classes/SDS348/data_sets/ufo_sightings_clean.csv") %>%
  separate(datetime, into = c("month", "day", "year"), sep = "/") %>%
  separate(year, into = c("year", "time"), sep = " ") %>%
  separate(date_posted, into = c("month_posted", "day_posted", "year_posted"), sep = "/") %>%
  select(-time, -month_posted, -day_posted) %>%
  mutate(year = as.numeric(year)) %>%
  filter(!is.na(country))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   duration_seconds = col_double(),
##   duration_hours_min = col_character(),
##   comments = col_character(),
##   date_posted = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

Part 1

Question: Which cities have reported the most UFO sightings after 1990, and how has the number of UFO sightings in these cities changed over time?

Introduction: We are working with the ufo_sightings dataset, which contains 70,662 reports of UFO sightings from 1910 to 2014 for five countries (US, Canada, Australia, Great Britain, and Germany). Each row of the dataset represents a single UFO sighting. After the preprocessing shown above, the dataset contains 13 columns that provide the time, location, and description of the sighting.

To determine how the number of UFO sightings has changed over the years in the cities with the highest number of reported sightings, we will be working with the following columns:

  1. city: the city in which the sighting was reported
  2. year: the year of the reported sighting

Approach: Our approach is to first determine which cities have the highest number of UFO sightings. Next, we will visualize the number of UFO sightings per year for the top six cities using a scatter plot with a linear regression line. A regression line helps show whether there is a trend between two continuous variables, here the year and the yearly sighting count. An alternative to a scatter plot with a regression line would be a line plot. However, since we do not know whether there is a relationship between UFO sightings and time, a line plot does not seem appropriate here.
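
As a rough illustration of what the regression line represents, the sketch below fits an ordinary least-squares model to a small set of invented yearly counts (the toy data frame and its values are assumptions, not part of ufo_sightings). To our understanding, geom_smooth(method = "lm") fits this kind of model separately for each facet and shades a 95% confidence band by default.

```r
# Toy yearly counts, invented purely for illustration
toy <- data.frame(
  year = 1991:2000,
  n    = c(2, 3, 3, 5, 4, 6, 8, 7, 9, 10)
)

# Ordinary least-squares fit of count on year
fit <- lm(n ~ year, data = toy)

# Estimated change in sightings per year; a positive value suggests an upward trend
slope <- coef(fit)[["year"]]

# Fitted values with a 95% confidence band for the mean response
band <- predict(fit, newdata = toy, interval = "confidence", level = 0.95)

slope
head(band, n = 3)
```

A wide band relative to the slope indicates that the yearly counts scatter strongly around the line, so the trend should be read with caution.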

To find the cities with the highest number of UFO sightings, we will apply the following functions:

  1. filter() to extract only the sightings after 1990
  2. count() to count the number of sightings per city
  3. arrange() and desc() to sort the table by descending count
  4. slice() to keep the top cities with the highest number of reported UFO sightings
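
The four steps above can be tried on a small toy table before running them on the full dataset. This is only a sketch: the city names and years below are invented, and we keep the top two cities instead of six.

```r
library(dplyr)

# Invented sightings, one row per report
toy <- data.frame(
  city = c("austin", "austin", "austin", "dallas", "dallas", "waco", "austin"),
  year = c(1985, 1995, 2001, 1999, 2010, 2005, 2012)
)

top_toy <- toy %>%
  filter(year > 1990) %>%   # drops the 1985 report
  count(city) %>%           # one row per city with its count in column n
  arrange(desc(n)) %>%      # largest count first
  slice(1:2)                # keep the two cities with the most reports

top_toy
```

Note that count(city) groups by city name only, so cities that share a name are counted together.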

To plot the number of UFO sightings over time, we will use the following functions:

  1. filter() to keep only the sightings after 1990 in the top six cities
  2. count() to count the number of observations per year and city
  3. mutate() to overwrite the city column with a reordered and relabeled factor
  4. fct_reorder() and fct_rev() to order the cities by their total number of sightings, largest first
  5. fct_recode() to capitalize the city names
  6. geom_point() to create a scatter plot of UFO sighting counts for each year
  7. geom_smooth() to add a regression line to the scatter plot
  8. facet_wrap() to create scatter plot facets for each city
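
Because facet_wrap() lays out panels in the order of the factor levels, the factor steps determine which city appears first. The sketch below shows the effect of fct_reorder(), fct_rev(), and fct_recode() on invented data (the cities and counts are assumptions for illustration only).

```r
library(forcats)

# Invented yearly counts for three cities
toy <- data.frame(
  city = c("waco", "waco", "austin", "austin", "dallas"),
  n    = c(1, 2, 10, 5, 4)
)

# Order levels by the summed count per city (ascending by default)
by_total <- fct_reorder(toy$city, toy$n, sum)

# Reverse so that the city with the largest total comes first
largest_first <- fct_rev(by_total)

# Relabel the levels with capitalized names; the level order is unchanged
relabeled <- fct_recode(
  largest_first,
  Austin = "austin",
  Dallas = "dallas",
  Waco   = "waco"
)

levels(by_total)
levels(relabeled)
```

With these toy totals (waco 3, dallas 4, austin 15), the first call orders the levels from smallest to largest total, and the reversed, relabeled factor starts with Austin.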

Analysis:

# extracting the top 6 cities with the highest number of UFO reports:
top_cities <- ufo_sightings %>%
  filter(year > 1990) %>%
  count(city) %>%
  arrange(desc(n)) %>%
  slice(1:6)

# let's look at the table:
top_cities
## # A tibble: 6 x 2
##   city            n
##   <chr>       <int>
## 1 seattle       503
## 2 phoenix       439
## 3 portland      360
## 4 las vegas     357
## 5 los angeles   324
## 6 san diego     315
# counting the number of UFO sightings in the top cities for each year:
summary <- ufo_sightings %>%
  filter(city %in% top_cities$city, year > 1990) %>%
  count(year, city) %>%
  mutate(city = fct_rev(fct_reorder(city, n, sum))) %>%
  mutate(
    # capitalize the city names (title case) for display in the facet labels
    city = fct_recode(
      city,
      Seattle = "seattle",
      Phoenix = "phoenix",
      Portland = "portland",
      `Las Vegas` = "las vegas",
      `Los Angeles` = "los angeles",
      `San Diego` = "san diego"
    )
  )

# looking at the first 3 rows of the summarized data:
head(summary, n = 3)
## # A tibble: 3 x 3
##    year city            n
##   <dbl> <fct>       <int>
## 1  1991 Las Vegas       3
## 2  1991 Los Angeles     1
## 3  1991 Phoenix         1
# plotting the number of UFO sightings across time:
ggplot(summary, aes(year, n)) + 
  geom_point(size = 1) + 
  geom_smooth(
    method = "lm",
    color = "salmon3",
    fill = "antiquewhite3",
    size = 0.9) + 
  facet_wrap(vars(city)) +
  scale_x_continuous(
    name = "Year",
    limits = c(1990, 2015),
    breaks = seq(from = 1990, to = 2015, by = 5),
    labels = seq(from = 1990, to = 2015, by = 5),
    expand = c(0.05, 0.05)) +
  scale_y_continuous(
    name = "Number of UFO Sightings",
    limits = c(0, 45),
    breaks = seq(from = 0, to = 45, by = 10),
    expand = c(0, 0)) +
  theme_bw(12) +
  theme(
    axis.text = element_text(color = "black", size = 10),
    panel.grid.minor = element_blank(),
    panel.spacing = unit(1, "lines"),
    strip.text.x = element_text(size = 12),
    aspect.ratio = 3/5
  )
## `geom_smooth()` using formula 'y ~ x'

Discussion: The six cities with the highest number of reported UFO sightings after 1990 are Seattle, Phoenix, Portland, Las Vegas, Los Angeles, and San Diego, with 503, 439, 360, 357, 324, and 315 sightings, respectively. All six of these cities are located in the western United States.

Looking at the scatter plots, we see that, overall, the number of UFO sightings increases with time in all six cities. Phoenix and Seattle appear to have the greatest variability in yearly sighting counts; there were almost as many reports from 1995 to 2005 as from 2005 to 2015. These two cities also have the widest confidence bands. Las Vegas, on the other hand, appears to have the least variability; we see a narrow confidence band and a more prominent linear increase in sightings across the years. We can speculate that improvements in communication technology in recent years have made it easier to submit reports to the National UFO Reporting Center.