*Claus O. Wilke, EID*

This is the dataset you will be working with:

```
library(tidyverse) # provides read_csv() and ggplot()

NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
NCbirths
```

```
## # A tibble: 1,409 x 10
##    Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
##  1      1     1     32    40     38     0         3147.     0      0       0
##  2      1     2     32    37     34     0         3289.     0      0       0
##  3      1     1     27    39     12     0         3912.     0      0       0
##  4      1     1     27    39     15     0         3856.     0      0       0
##  5      1     1     25    39     32     0         3430.     0      0       0
##  6      1     1     28    43     32     0         3317.     0      0       0
##  7      1     2     25    39     75     0         4054.     0      0       0
##  8      1     2     15    42     25     0         3204.     0      0       1
##  9      1     2     21    39     28     0         3402      0      0       0
## 10      1     2     27    40     37     0         3515.     0      0       1
## # … with 1,399 more rows
```

**Question:** Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?

To answer this question, we will plot the distribution of birth weight by smoking status, and we will also plot the number of mothers who are smokers and non-smokers, respectively.

**Introduction:** We are working with the `NCbirths` dataset, which contains 1409 birth records from North Carolina in 2001. In this dataset, each row corresponds to one birth, and there are ten columns providing information about the birth, the mother, and the baby. Information about the birth includes whether it is a single, twin, or triplet birth, the number of completed weeks of gestation, and whether the birth is premature. Information about the baby includes the sex, the weight at birth, and whether the birth weight should be considered low. Information about the mother includes her age, the weight gained during pregnancy, whether she is a smoker, and whether she is married.

To answer the question of Part 1, we will work with four variables: the baby’s birthweight (column `BirthWeightGm`), whether the baby was born prematurely (column `Premie`), whether it was a singleton, twin, or triplet birth (column `Plural`), and whether the mother is a smoker or not (column `Smoke`). The birthweight is provided as a numeric value, in grams. The premature birth status is encoded as 0/1, where 0 means regular and 1 means premature (36 weeks or sooner). The number of births is encoded as 1/2/3, representing singleton, twins, and triplets, respectively. The smoking status is encoded as 0/1, where 0 means the mother is not a smoker and 1 means she is a smoker.
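The encodings described above can be made explicit by recoding the numeric columns into labeled factors. The following is a minimal sketch, not part of the original analysis; the object name `NCbirths_labeled` and the label strings are choices made for this illustration.

```r
library(tidyverse)

# Same URL as used above to load the data
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")

# Recode the numeric codes into labeled factors and keep the four
# variables used in Part 1. `NCbirths_labeled` is a name chosen for
# this sketch only.
NCbirths_labeled <- NCbirths %>%
  mutate(
    Smoke = factor(Smoke, levels = c(0, 1), labels = c("non-smoker", "smoker")),
    Premie = factor(Premie, levels = c(0, 1), labels = c("regular birth", "premature birth")),
    Plural = factor(Plural, levels = c(1, 2, 3), labels = c("singleton", "twins", "triplets"))
  ) %>%
  select(BirthWeightGm, Premie, Plural, Smoke)

glimpse(NCbirths_labeled)
```

Working with labeled factors would remove the need for `factor()` calls and manual label vectors in plotting code, at the cost of keeping a second copy of the data.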

**Approach:** Our approach is to show the distributions of birthweights versus the mothers’ smoking status using violin plots (`geom_violin()`). We also separate out regular and premature births, because babies born prematurely have much lower birthweight and therefore must be considered separately. Violins make it easy to compare multiple distributions side-by-side.

One limitation of the violin plots is that they don’t show us how many observations fall into the different categories. Therefore, we will visualize the numbers of regular and premature births to smoking and non-smoking mothers with bar plots (`geom_bar()`). Jointly, these two plots will allow us to answer the question.
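The counts that a bar plot displays can also be inspected as a table. As a hedged sketch (the object name `birth_counts` is an assumption made for this example), `count()` from dplyr tallies the observations in each combination of smoking status and premature status:

```r
library(tidyverse)

# Same URL as used above to load the data
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")

# Number of births in each combination of smoking status (Smoke)
# and premature status (Premie); the counts are stored in column `n`
birth_counts <- NCbirths %>%
  count(Smoke, Premie)

birth_counts
```

A table like this complements the violins: it tells us how much data each violin is based on.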

**Analysis:** First we plot the birthweight distributions as violins.

```
ggplot(NCbirths, aes(factor(Smoke), BirthWeightGm, fill = factor(Premie))) +
  geom_violin() +
  scale_x_discrete(
    name = "Mother",
    labels = c("non-smoker", "smoker")
  ) +
  scale_y_continuous(
    name = "Birth weight (gm)"
  ) +
  scale_fill_manual(
    name = NULL,
    labels = c("regular birth", "premature birth"),
    values = c(`0` = "#56B4E9", `1` = "#E69F00")
  ) +
  theme_bw(12)
```

Then we plot the numbers of regular and premature births as bar plots. We facet by smoking status of the mother so we can clearly see how many observations there are in each subset of the data. We also separately account for singleton, twin, and triplet births, to see whether twin and triplet births may be driving some of the birth-weight patterns we saw in the first figure.

```
ggplot(NCbirths, aes(y = factor(Premie), fill = factor(Plural))) +
  geom_bar(
    position = position_stack(reverse = TRUE) # stack in reverse order
  ) +
  facet_wrap(
    vars(Smoke),
    ncol = 1,
    labeller = as_labeller(c(`0` = "non-smoker", `1` = "smoker"))
  ) +
  scale_y_discrete(
    name = NULL,
    limits = c("1", "0"), # manually reverse the axis ticks
    labels = c("premature birth", "regular birth")
  ) +
  scale_fill_viridis_d(
    name = NULL,
    labels = c("singleton", "twins", "triplets"),
    option = "E",
    begin = 0.3,
    end = 0.8,
    direction = -1
  ) +
  theme_bw(12)
```

**Discussion:** For regular births, smoking status of the mother appears to have a small effect on the average birth weight. We can see this by comparing the blue violins in the first plot, where we see that they are slightly shifted relative to each other but have otherwise comparable shape. However, a much bigger effect comes from whether the baby is born prematurely or not. Premature births have on average a much lower birthweight than regular births, and the variance is also bigger (the orange violins are taller than the blue violins). Interestingly, smoking status does not seem to affect the distribution of birthweights for premature births much. We can see this from the fact that the orange violins look approximately the same. We would have to run a multivariate statistical analysis to determine whether any of these observed patterns are statistically significant.
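The visual impressions described above could be checked numerically. The sketch below is one possible starting point and not the analysis performed in this report: it tabulates the mean and standard deviation of birthweight in each group and then fits a linear model with an interaction between smoking and premature status. The object names `weight_summary` and `fit`, and the choice of model, are assumptions made for this illustration.

```r
library(tidyverse)

# Same URL as used above to load the data
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")

# Mean and standard deviation of birthweight for each combination
# of smoking status and premature status
weight_summary <- NCbirths %>%
  group_by(Smoke, Premie) %>%
  summarize(
    n = n(),
    mean_weight = mean(BirthWeightGm, na.rm = TRUE),
    sd_weight = sd(BirthWeightGm, na.rm = TRUE),
    .groups = "drop"
  )

weight_summary

# One possible model (an assumption, not the report's method):
# birthweight explained by smoking, premature status, and their interaction
fit <- lm(BirthWeightGm ~ factor(Smoke) * factor(Premie), data = NCbirths)
summary(fit)
```

In such a model, the interaction term would indicate whether the effect of smoking differs between regular and premature births. A more complete analysis would probably also account for `Plural` and check the model assumptions.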

When we look at the breakdown of number of births by regular/premature, non-smoking/smoking, and singleton/twins/triplets, we see that by far the largest number of cases corresponds to singleton births to non-smoking mothers giving birth after 36 weeks. And, for both non-smoking and smoking mothers, premature births are relatively rare, less than 20% of the total. Similarly, twin and triplet births make up only a small percentage of the total dataset. Therefore, it makes sense to compare the birth weights for regular births only (blue violins in the first plot). Thus, the final answer to our question is: Smoking appears to consistently reduce the birth weight of babies by a small amount, on the order of 100-200 g.
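Both quantities mentioned in this paragraph, the share of premature births and the size of the birth-weight difference among regular births, can be computed directly. The following is a minimal sketch under the column encodings described in the introduction; the object names `premie_share` and `regular_diff` are chosen for this example, and the difference in group means is only one of several reasonable ways to quantify the shift.

```r
library(tidyverse)

# Same URL as used above to load the data
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")

# Share of premature births among non-smoking (0) and smoking (1) mothers
premie_share <- NCbirths %>%
  filter(!is.na(Smoke)) %>%
  group_by(Smoke) %>%
  summarize(
    n = n(),
    share_premature = mean(Premie == 1, na.rm = TRUE)
  )

premie_share

# Difference in mean birthweight (smoker minus non-smoker), in grams,
# for regular births only
regular_diff <- NCbirths %>%
  filter(Premie == 0, !is.na(Smoke)) %>%
  group_by(Smoke) %>%
  summarize(mean_weight = mean(BirthWeightGm, na.rm = TRUE)) %>%
  arrange(Smoke) %>%
  summarize(difference_g = diff(mean_weight))

regular_diff
```

A negative `difference_g` would mean that babies of smoking mothers are lighter on average than babies of non-smoking mothers among regular births.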

*Part 2 should be similar in length and complexity to Part 1.*