Chapter 27 Area Data V

NOTE: The source files for this book are available with the companion package {isdas}. The source files are in R Markdown format and packaged as templates. These files allow you to execute the code within the notebook, so that you can work interactively with the notes.

27.1 Learning Objectives

In the previous chapter, you learned how to decompose Moran’s \(I\) coefficient into local versions of an autocorrelation statistic. You also learned about a concentration statistic, and saw how these local spatial statistics can be used for exploratory spatial data analysis, for example to search for “hot” and “cold” spots. In this practice, you will:

  1. Practice how to estimate regression models in R.
  2. Learn about autocorrelation as a model diagnostic.
  3. Learn about variable transformations.
  4. Use autocorrelation analysis to improve regression models.

27.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

27.3 Preliminaries

As usual, it is good practice to clear the workspace to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace of all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(plotly)
library(sf)
library(tidyverse)
library(units)
library(spdep)

Next, load an object of class sf (simple feature) with the census tracts of the Hamilton CMA and some selected population variables from the 2011 Census of Canada. This dataset will be used for the examples in this chapter:

data(Hamilton_CT)

27.4 Regression Analysis in R

Regression analysis is one of the most powerful techniques in the repertoire of data analysis. There are many different forms of regression, and they usually take the following form: \[ y_i = f(x_{ij}) + \epsilon_i \]

This is a model for a stochastic process. The outcome is \(y_i\), which could be the observed values of a variable \(y\) at locations \(i\). We will think of these locations as areas, but they could as well be points, nodes on a network, links on a network, etc. The model consists of two components: a systematic/deterministic part, that is \(f(x_{ij})\), which is a function of a collection of variables \(x_{i1}, x_{i2}, \cdots, x_{ij}, \cdots, x_{ik}\); and a random part, captured by the term \(\epsilon_i\).

In this chapter we will deal with one specific form of regression, namely linear regression. A linear regression model posits (as the name implies) linear relationships between an outcome, called a dependent variable, and one or more covariates, called independent variables. It is important to note that regression models capture statistical relationships, not causal relationships. Even so, causality is often implied by the choice of independent variables. In a way, regression analysis is a tool to infer process from pattern: it is a formula that aims to retrieve the elements of the process based on our observations of the outcome.

This is the form of a linear regression model: \[ y_i = f(x_{ij}) + \epsilon_i = \beta_0 + \sum_{j=1}^k{\beta_jx_{ij}} + \epsilon_i \] where \(y_i\) is the dependent variable and \(x_{ij}\) (\(j=1,...,k\)) are the independent variables. The coefficients \(\beta\) are not known, but can be estimated from the data. And \(\epsilon_i\) is the random term, which in regression analysis is often called a residual (or error), because it is the difference between the systematic term of the model and the value of \(y_i\): \[ \epsilon_i = y_i - \bigg(\beta_0 + \sum_{j=1}^k{\beta_jx_{ij}}\bigg) \]

Estimation of a linear regression model is the procedure used to obtain values for the coefficients. This typically involves defining a loss function that needs to be minimized. In the case of linear regression, a widely used estimation procedure is least squares. This procedure allows a modeler to find the coefficients that minimize the sum of squared residuals, which become the loss function for the procedure. In very simple terms, the protocol is as follows: \[ \text{Find the values of }\beta\text{ that minimize }\sum_{i=1}^n{\epsilon_i^2} \]
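In the simple case of a single independent variable (\(k=1\)), this minimization problem has a well-known closed-form solution: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sum_{i=1}^n{(x_i - \bar{x})^2}}\text{, }\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \] where \(\bar{x}\) and \(\bar{y}\) are the sample means of the independent and dependent variables. With several covariates the solution is analogous, but is more conveniently written in matrix notation.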

For this procedure to be valid, there are a few assumptions that need to be satisfied, including:

  1. The functional form of the model is correct.

  2. The independent variables are not collinear; this is often diagnosed by calculating the correlations among the independent variables, with values greater than 0.8 often being problematic.

  3. The residuals have a mean of zero: \[ E[\epsilon_i|X]=0 \]

  4. The residuals have constant variance: \[ Var[\epsilon_i|X] = \sigma^2 \text{ }\forall i \]

  5. The residuals are independent, that is, they are not correlated among them: \[ E[\epsilon_i\epsilon_j|X] = 0 \text{ }\forall i\neq j \]

The last three assumptions ensure that the residuals are random. Violation of these assumptions is often a consequence of a failure in the first two (i.e., the model was not properly specified, and/or the residuals are not exogenous).
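Assumption 2, for instance, can be screened before estimation by inspecting the correlations among the candidate covariates. A minimal sketch follows; here df, x1, and x2 stand for a hypothetical dataframe and independent variables, not objects used elsewhere in this chapter:

# Pairwise correlations among candidate independent variables; values
# above approximately 0.8 would flag potential collinearity
cor(df[, c("x1", "x2")])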

When all these assumptions are met, the coefficients are said to be BLUE (Best Linear Unbiased Estimators), a desirable property, because we wish to be able to quantify the relationships between the outcome and the covariates without bias.

The preceding was a brief refresher on linear regression. We now turn to the estimation of regression models in R. The basic command for multivariate linear regression in R is lm(), for “linear model”. This is the help file of this function:

# Remember that we can search the definition of a function by using a question mark in front of the function itself. 
?lm 

We will see now how to estimate a model using this function. The example we will use is of urban population density gradients. Population density gradients are representations of the variation of population density in cities. These gradients are of interest because they are related to land rent, urban form, and commuting patterns, among other things (see accompanying reading for more information).

Urban economic theory suggests that population density declines with distance from the central business district of a city, or its CBD. This leads to the following model, where the population density at location \(i\) is a function of the distance of \(i\) to the CBD. Since this is likely a stochastic process, we allow for some randomness by means of the residuals: \[ P_i = f(D_i) + \epsilon_i \]

To implement this model, we need to add distance to the CBD as a covariate in our dataframe. We will use Jackson Square, a central shopping mall in Hamilton, as the CBD of the city:

# Create a small data frame with the coordinates of Jackson Square; these coordinates,
# which are in lat-long are converted into a simple features table, with coordinate 
# reference system epsg:4326 (for lat-long); finally, we transform the coordinates to 
# the same coordinate reference system of our Hamilton census tracts, which we retrieve
# with the function `st_crs()`
xy_cbd <- data.frame(x = -79.8708,
                     y = 43.2584) %>%
  st_as_sf(coords = c("x", "y"),
           crs = 4326) %>%
  st_transform(st_crs(Hamilton_CT))
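A quick check (not strictly necessary, but a useful habit) confirms that the CBD point is now in the same coordinate reference system as the census tracts:

# This comparison should return TRUE if the transformation worked as intended
st_crs(xy_cbd) == st_crs(Hamilton_CT)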

To calculate the distance from the census tracts to the CBD, we retrieve the centroids of the census tracts:

# We retrieve the centroids of the census tracts in `Hamilton_CT` by means of `st_centroid()`
xy_ct <- st_centroid(Hamilton_CT)
## Warning in st_centroid.sf(Hamilton_CT): st_centroid assumes attributes are
## constant over geometries of x
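The warning is informational: st_centroid() copies the attributes of each polygon to its centroid and cautions that they are not necessarily representative of the point. Since we only need the point geometries to measure distances, an equivalent alternative (a small sketch) is to take the centroids of the geometry column alone:

# Taking the centroids of the geometry column only avoids the attribute warning
xy_ct <- st_centroid(st_geometry(Hamilton_CT))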

Given these coordinates, the function st_distance() can be used to calculate the distance between the centroids of the census tracts and Hamilton’s CBD. Call this dist.sl, i.e., the straight line distance to the CBD:

# Function `st_distance()` is used to calculate the distance between two sets
# of points. Here, we use it to calculate the distance from the centroids of 
# the census tracts to the coordinates of the CBD. We will call this variable
# `dist.sl`, for "straight line" to remind us what kind of distance this is. 
dist.sl <- st_distance(xy_ct,
                       xy_cbd)

Next, we add our new variable, the distance to the CBD, to the dataframe Hamilton_CT for analysis:

Hamilton_CT$dist.sl <- dist.sl
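It is worth noting that st_distance() returns distances with explicit measurement units (metres in this case, given the projected coordinate reference system of Hamilton_CT). A quick sanity check:

# Inspect the measurement units of the new variable...
units(Hamilton_CT$dist.sl)
# ...and the largest distance from a tract centroid to the CBD
max(Hamilton_CT$dist.sl)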

Regression analysis is implemented in R by means of the lm() function. The arguments of the function include an object of class “formula” and a dataframe. Other arguments include conditions for subsetting the data, using sampling weights, and so on.

A formula is written in the form y ~ x1 + x2, and more complex expressions are possible too, as we will see below. For the time being, the formula is simply POP_DENSITY ~ dist.sl:

# The function `lm()` implements regression analysis in `R`. Recall that 'dist.sl' is the distance from the CBD (Jackson Square)
model1 <- lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT)
summary(model1) 
## 
## Call:
## lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3841.2 -1338.3  -177.1   950.8 10009.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4405.13325  250.16396  17.609  < 2e-16 ***
## dist.sl       -0.17994    0.02419  -7.439 3.63e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1892 on 186 degrees of freedom
## Multiple R-squared:  0.2293, Adjusted R-squared:  0.2251 
## F-statistic: 55.33 on 1 and 186 DF,  p-value: 3.633e-12

The value of the function is an object of class lm that contains the results of the estimation, including the coefficients with their diagnostics, and the coefficient of multiple determination, among other items.
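Some of these items can be extracted directly from the model object or its summary, for example:

# Estimated coefficients
coef(model1)
# Fitted (predicted) population densities for the first few census tracts
head(model1$fitted.values)
# Coefficient of multiple determination
summary(model1)$r.squared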

Notice how the coefficient for distance is negative (and significant). This indicates that population density declines with increasing distance from the CBD: \[ P_i = f(D_i) + \epsilon_i = 4405.13325 - 0.17994D_i + \epsilon_i \]
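As a small illustration, we can use the estimated coefficients to calculate the expected population density at an arbitrary distance from the CBD, say 5 km (recall that the distances enter the model in metres):

# Expected population density at 5,000 m from Jackson Square according to model1
coef(model1)[1] + coef(model1)[2] * 5000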

27.5 Autocorrelation as a Model Diagnostic

We can quickly explore the fit of the model. Since our model contains only one independent variable, we can use a scatterplot to see how it relates to population density. The points in the scatterplot are the actual population density and the distance to CBD. We also use the function geom_abline() to add the regression line to the plot, in blue:

ggplot(data = Hamilton_CT, aes(x = dist.sl, 
                               y = POP_DENSITY)) + 
  geom_point() +
  geom_abline(slope = model1$coefficients[2], # Recall that `geom_abline()` draws a line with intercept and slope as defined. Here the line is drawn using the coefficients of the regression model we estimated above. 
              intercept = model1$coefficients[1], 
              color = "blue", size = 1) +
  geom_vline(xintercept = 0) + # We also add the y axis... 
  geom_hline(yintercept = 0) # ...and the x axis.

Clearly, a fair amount of noise remains after fitting this model (the scatter of the dots around the regression line). The regression line captures the general trend of the data, but it seems to underestimate most of the high population density areas closer to the CBD, and to overestimate many of the low density areas.

If the pattern of under- and over-estimation is random (i.e., the residuals are random), that would indicate that the model successfully retrieved all the systematic pattern. If the pattern is not random, there is a violation of the assumption of independence. To explore this issue, we will add the residuals of the model to the dataframe:

# Here we add the residuals from 'model1' to the dataframe, with the name `model1.e` 
Hamilton_CT$model1.e <- model1$residuals

Since we are interested in statistical maps, we will create a map of the residuals. In this map, we will use red to indicate negative residuals (values of the dependent variable that the model overestimates), and blue for positive residuals (values of the dependent variable that the model underestimates):

# Recall that 'plot_ly()' is a function used to create interactive plots
plot_ly(Hamilton_CT) %>% 
  # Recall that `add_sf()` is similar to `geom_sf()` and it draws a simple features object on a `plotly` plot. This example adds colors to represent positive (blue) and negative residuals (red).
  add_sf(type = "scatter",
         color = ~(model1.e > 0), 
         colors = c("red", 
                    "dodgerblue4"))