I will put both the simple linear model and the polynomial models into the data frame, map the residuals for both as in the text, and then facet on the models.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(modelr)
library(gapminder)
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
country_model_1 <- function(df) {
lm(lifeExp ~ year, data = df)
}
country_model_2 <- function(df) {
lm(lifeExp ~ poly(year - mean(year), degree = 2), data = df)
}
models_1 <- map(by_country$data, country_model_1)
models_2 <- map(by_country$data, country_model_2)
by_country <- by_country %>%
mutate(model_1 = map(data, country_model_1),
model_2 = map(data, country_model_2))
by_country <- by_country %>%
mutate(
resids_1 = map2(data, models_1, add_residuals),
resids_2 = map2(data, models_2, add_residuals)
)
residuals <- unnest(by_country, resids_1, resids_2) %>% select(country:resid, resid1)
#As revealed by this bit of code, the construction here is a bit hack-ish
residuals <- residuals %>% gather(resid, resid1, key = "model", value = "residual") %>%
mutate(model = ifelse(model == "resid", "Simple Linear Model", "Polynomial Model"))
residuals %>% ggplot(aes(x = year, y = residual)) + geom_line(aes(group = country), alpha = 1/4) +
geom_smooth(se = FALSE) + facet_wrap(~ model)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can also facet on both model and continent:
residuals %>% ggplot(aes(x = year, y = residual)) + geom_line(aes(group = country), alpha = 1/4) +
geom_smooth(se = FALSE) + facet_wrap(continent~ model, ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Now consider the quantitative comparison of the models, in terms of R-squared
#remove the residuals to simplify the data frame
#by_country <- by_country %>% select(-resids_1, -resids_2)
glance <- by_country %>% gather(model_1, model_2, key = "model_name", value = "model")
#We apply glance() from the broom package, which, it appears, needs to be invoked inside the map function.
glance <- glance %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE)
glance %>%
mutate(model_name = ifelse(model_name == "model_1", "Simple Linear Model", "Quadratic Polynomial Model")) %>%
ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5) + facet_wrap(~ model_name)
Finding the countries for which r.squared is less than 0.125 in one or the other of the models:
glance %>% filter(r.squared < 0.125)
All of these countries display a mix of economic development (e.g. felicitous use of money from diamonds in Botswana) and tragedy (e.g. e.g. AIDS in Southern Africa and Genocide in Rwanda). The failure of both of the models for Rwanda is quite striking, and likely reflects the fact that both the genocide and the strong economic development that followed it went against the trends of other countries.
library(ggbeeswarm)
#substitute "beeswarm" for "jitter" in the geom and take out the width parameter, which geom_beeswarm doesn't seem to recognize
glance %>%
mutate(model_name = ifelse(model_name == "model_1", "Simple Linear Model", "Quadratic Polynomial Model")) %>%
ggplot(aes(continent, r.squared)) +
geom_beeswarm() + facet_wrap(~ model_name)
Here is the plot from the text:
bad_fit <- filter(glance, r.squared < 0.25 & model_name == "model_1") #model_1 is the model represented in the plot in the text
gapminder %>%
semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
We recreate this plot without the semi_join.
#We apply glance() from the broom package, which, it appears, needs to be invoked inside the map function.
glance_2 <- by_country %>%
select(-model_2) %>% #remove the polynomial model that we used in the facet maps
mutate(glance = map(model_1, broom::glance)) %>%
unnest(glance) %>%
filter(r.squared < 0.125) %>%
unnest(data)
ggplot(glance_2) + geom_line(aes(x = year, y = lifeExp, color = country))
str_split()
and map()
come to mind.
map(c(1, 2), function(x) x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
quantile()
, return multiple values.range()
comes to mind
quantile()
return that missing piece? Why isn’t that helpful here?mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg))) %>%
unnest()
quantile()
returns both the quantile values and the names of those values (e.g. “0%” for the lowest value). The names are included in the lists introduced in the summarize function in the above, but then removed when the data frame us unnested (the final step in the above). Beyond this point, it seems I don’t understand the question. It might be useful in the unnest()
function to also put the names (the values of probs) in the unnested dataframe.
mtcars %>%
group_by(cyl) %>%
summarise_each(funs(list))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
In addition to the message that summarise_each()
has been deprecated in favor of summarise_all()
, etc., the code produces, for each of the values of the variable that has been grouped by (cyl
), a row of vectors that contain all of the values of the variables in the other columns of the data frame. It may be useful when applying iterative constructions such as map() over the elements of the columns of the original data frame.
It is useful for testing whether the elements in list columns have the same same length, as per the following.
df <- tibble(x = list(1:3, 1), y = list(1:4, 1))
map(df, lengths)
## $x
## [1] 3 1
##
## $y
## [1] 4 1
Because 1:3
and 1:4
do not have the same length, df
cannot be un-nested.
unnest(df)
## Error: All nested columns must have the same number of elements.
The most common types of vectors would be the atomic types integer, double, and character, followed, I guess, by logical.