Spatially explicit models for distance sampling data

Aims

By the end of this practical, you should feel comfortable:

Looking at how changing k changes smooths
Plotting residuals vs. covariates
Fitting models to the residuals
Using gam.check and rqgam.check
Using obs_exp

Load data and packages

library(Distance)

## Loading required package: mrds

## This is mrds 2.2.3
## Built: R 4.0.2; ; 2020-08-01 10:33:56 UTC; unix

## 
## Attaching package: 'Distance'

## The following object is masked from 'package:mrds':
## 
##     create.bins

library(dsm)

## Loading required package: mgcv

## Loading required package: nlme

## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.

## Loading required package: numDeriv

## This is dsm 2.3.0
## Built: R 4.0.2; ; 2020-07-16 23:56:50 UTC; unix

library(ggplot2)
library(knitr)

Load the data and the fitted dsm objects from the previous exercises:

load("spermwhale.RData")
load("dsms.RData")
load("df-models.RData")

We’re going to check the DSMs that we fitted in the previous practical, then save those that we think are good!

Looking at how changing `k` changes smooths

First checking that k is big enough, we should really do this during model fitting, but we’ve separated this up for the practical exercises.

First look at the text output of gam.check, are the values of k' for your models close to the edf in the outputted table. Here’s a silly example where I’ve deliberately set k too small:

dsm_k_check_eg <- dsm(count ~ s(Depth, k=4),
                    df_hn, segs, obs,
                    family=tw())

## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!

gam.check(dsm_k_check_eg)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-1.66079e-07,1.171305e-07]
## (score 392.2823 & scale 5.216272).
## Hessian positive definite, eigenvalue range [0.8702242,302.9225].
## Model rank =  4 / 4 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k' edf k-index p-value
## s(Depth) 3.0 2.9    0.78    0.18

Note here again I’m using a “chunk” option to suppress the plots printed by gam.check

Generally if the EDF is close to the value of k you supplied, it is worth doubling k and refitting to see what happens. You can always switch back to the smaller k if there is little difference. The ?choose.k manual page can offer some guidance.

Continuing with that example, if we double k:

dsm_k_check_eg <- dsm(count ~ s(Depth, k=8),
                    df_hn, segs, obs,
                    family=tw())

## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!

gam.check(dsm_k_check_eg)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-5.675949e-08,4.752562e-08]
## (score 392.2128 & scale 5.216315).
## Hessian positive definite, eigenvalue range [1.313321,300.6316].
## Model rank =  8 / 8 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value
## s(Depth) 7.00 4.47    0.78     0.2

We get something much more reasonable. Doubling again

dsm_k_check_eg <- dsm(count ~ s(Depth, k=16),
                    df_hn, segs, obs,
                    family=tw())

## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!

gam.check(dsm_k_check_eg)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-4.245788e-07,3.205948e-07]
## (score 392.2614 & scale 5.213115).
## Hessian positive definite, eigenvalue range [1.440887,300.5744].
## Model rank =  16 / 16 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##             k'   edf k-index p-value
## s(Depth) 15.00  4.83    0.78    0.19

We see almost no improvement, so might refit the second model here with k=8.

Check to see if any of your models need this treatment.

Plotting residuals vs. covariates

Now following from the slides, let’s try plotting residuals vs. covariates. We can use these diagnostics (along with those in the following section) to assess how well the model has accounted for the structure in the data. This involves a little data manipulation and a little thought about how to chop up the covariate values.

I’ll show an example here using the not very good model we fitted above, so we can see what happens when things go wrong:

dsm_k_low <- dsm(count ~ s(Depth, k=4),
                    df_hn, segs, obs,
                    family=tw())

## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!

# get the data used to fit the model
resid_dat <- dsm_k_low$data

# get the residuals
resid_dat$resids <- residuals(dsm_k_low)

# now let's chop x into 20 chunks
resid_dat$x_chop <- cut(resid_dat$x,
                        seq(min(resid_dat$x), max(resid_dat$x), len=20))

# and plot the boxplot
boxplot(resids~x_chop, data=resid_dat, ylab="Deviance residuals", xlab="x")

This seems to show that something is going on in the far east of the data – there’s a wide range of residuals there. Note that for our data the most easterly points are also North (since our data is somewhat diagonal in shape, with respect to the compass). So let’s see if this is an issue with x alone (East-West) or y (North-Southness) is an issue too.

# now let's chop y into 20 chunks
resid_dat$y_chop <- cut(resid_dat$y,
                        seq(min(resid_dat$y), max(resid_dat$y), len=20))

# and plot the boxplot
boxplot(resids~y_chop, data=resid_dat, ylab="Deviance residuals", xlab="y")

We can also do this chopping with Depth (the only covariate in this model):

# now let's chop Depth into 20 chunks
resid_dat$Depth_chop <- cut(resid_dat$Depth,
                        seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20))

boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth")

Now we see something more pathological, there is pattern in these residuals.

Let’s make the same plot for a model with an spatial smooth in it:

dsm_better <- dsm(count ~ s(x,y) + s(Depth),
                    df_hn, segs, obs,
                    family=tw())

## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!

# get the data from the larger k model
resid_dat <- dsm_better$data

# get the residuals
resid_dat$resids <- residuals(dsm_better)

# now let's chop Depth into 20 chunks
resid_dat$Depth_chop <- cut(resid_dat$Depth,
                        seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20))

boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth")

This looks better (note the change in vertical axis scale).

Try this out with your models, using the covariates in the model and perhaps x and y too. You may want to change the number of chunks (via the len argument to cut()).

Fitting models to the residuals

We can also fit a model to the residuals to see if there is residual structure left in the model. The idea here is that we know that the model won’t fit perfectly everywhere, but is the difference between model and data (the residuals) systematic. Going back to our “bad” model above, we can see what happens when things go wrong:

# get the data used to fit the model
resid_dat <- dsm_k_low$data

# get the residuals
resid_dat$resids <- residuals(dsm_k_low)

# fit the model with the same predictors, but larger k
resid_model_depth <- gam(resids ~ s(Depth, k=40),
                         data=resid_dat,
                         family=gaussian(), method="REML")

Note that we used the gam function from mgcv (which is what happens inside dsm anyway), to fit our residual model. The important things to note are:

gam() in place of dsm().
The response is resids, the column we added above.
Theoretically we expect the deviance residuals to be distributed normally, so we set family=gaussian().
Need to set method="REML" which happens automatically in dsm(), and makes sure the best method is used for fitting the smooths.

We want to look at the EDFs associated with the terms in the model:

summary(resid_model_depth)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## resids ~ s(Depth, k = 40)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.54782    0.03492  -15.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F  p-value    
## s(Depth) 3.304   4.14 8.315 1.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0344   Deviance explained = 3.77%
## -REML =   1422  Scale est. = 1.1571    n = 949

We can see that there’s still a bit of pattern there, but not much (you can also try using plot() to investigate this).

Now trying that with the better model:

# get the data used to fit the model
resid_dat_better <- dsm_better$data

# get the residuals
resid_dat_better$resids <- residuals(dsm_better)

# fit the model with the same predictors, but larger k
resid_model_better <- gam(resids ~ s(Depth, bs="ts", k=40) + s(x,y, bs="ts", k=100),
                         data=resid_dat_better,
                         family=gaussian(), method="REML")

Now again looking at the EDFs:

summary(resid_model_better)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## resids ~ s(Depth, bs = "ts", k = 40) + s(x, y, bs = "ts", k = 100)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.45018    0.03071  -14.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F  p-value    
## s(Depth) 2.435     39 0.550 6.79e-06 ***
## s(x,y)   1.028     99 0.045   0.0255 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0296   Deviance explained = 3.32%
## -REML = 1302.3  Scale est. = 0.8952    n = 949

We see that there’s not much going on in this model. Again you can use plot() here, remembering to compare the vertical axis limits and look at the width of the confidence bands.

Try this out with your model(s). What do you think is missing?

Using `rqgam.check`

We can use rqgam.check() to look at the residual check plots for this model. Here we’re looking for pattern in the plots that would indicate that there is unmodelled structure. Ideally the plot should look like a messy blob of points.

rqgam.check(dsm_better)

## Loading required namespace: tweedie

Try this out with your models and see if there is any pattern in the randomised quantile residuals.

Using `obs_exp`

To compare the observed vs. expected counts from the model, we need to aggregate the data at some level. We can use obs_exp() to do this. For continuous covariates in the detection function or spatial model, we just need to specify the cut points. For example:

obs_exp(dsm_better, "Depth", c(0, 1000, 2000, 3000, 4000, 6000))

##          (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03] (4e+03,6e+03]
## Observed  4.000000      52.53333      139.1667      35.00000      8.000000
## Expected  5.163824      49.92310      126.8073      38.65673      8.134937

Try this out for your models.

Saving models

Now save the models that you’d like to use to predict with later (that have good check results!): I recommend saving a few models so you can compare the results later.

# add your models here
#save(dsm_nb_xy_ms,
#     file="dsms-checked.RData")

Spatially explicit models for distance sampling data

Practical 4: Model checking

David L Miller

Aims

Load data and packages

Looking at how changing `k` changes smooths

Plotting residuals vs. covariates

Fitting models to the residuals

Using `rqgam.check`

Using `obs_exp`

Saving models

Spatially explicit models for distance sampling data

Practical 4: Model checking

David L Miller

Aims

Load data and packages

Looking at how changing k changes smooths

Plotting residuals vs. covariates

Fitting models to the residuals

Using rqgam.check

Using obs_exp

Saving models

Looking at how changing `k` changes smooths

Using `rqgam.check`

Using `obs_exp`