Aims

By the end of this practical, you should feel comfortable:

Fitting a density surface model using dsm()
Understanding what the objects that go into a dsm() call
Understanding the role of the response in the formula= argument
Understanding the output of summary() when called on a dsm object
Increasing the k parameter of smooth terms to increase their flexibility

The example code below uses the df_hn detection function in the density surface models. You can substitute this for your own best model as you go, or copy and paste the code at the end and see what results you get using your model for the detection function.

In this practical, we’re just going to do basic comparisons looking at plots and summary output. We’ll get more into model comparison in later practicals.

Solutions are given between these horizontal lines with bold headers

The “solutions” presented here are not definitive, many modelling options are possible.

Here I’ll use a few different detection functions that I fitted in the first practical. To run the code in this file, you’ll first need to have run the first practical solution, to get the correct outputs.

Load the packages and data

library(Distance)

## Loading required package: mrds

## This is mrds 2.2.3
## Built: R 4.0.2; ; 2020-08-01 10:33:56 UTC; unix

## 
## Attaching package: 'Distance'

## The following object is masked from 'package:mrds':
## 
##     create.bins

library(dsm)

## Loading required package: mgcv

## Loading required package: nlme

## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.

## Loading required package: numDeriv

## This is dsm 2.3.1.9007
## Built: R 4.0.2; ; 2020-10-01 05:44:34 UTC; unix

library(ggplot2)
library(knitr)

Loading sperm whale data again and the RData files where we saved our results:

load("spermwhale.RData")
load("df-models.RData")

Pre-model fitting

Before we fit a model using dsm() we must first remove the observations from the spatial data that we excluded when we fitted the detection function – those observations at distances greater than the truncation.

obs <- obs[obs$distance <= df_hn$ddf$meta.data$width, ]

Here we’ve used the value of the truncation stored in the detection function object, but we could also use the numeric value (which we can also find by checking the detection function’s summary()).

(If we don’t do this, the DSM will still fit fine, we just get a warning.)

Fitting DSMs

Using the data that we’ve saved so far, we can build a call to the dsm() function and fit out first density surface model. Here we’re only going to look at models that include spatial smooths.

Let’s start with a very simple model – a bivariate smooth of x and y:

dsm_nb_xy <- dsm(count~s(x, y, k=25),
                 ddf.obj=df_hn, segment.data = segs, 
                 observation.data=obs,
                 family=nb())

Note again that we try to have informative model object names so that we can work out what the main features of the model were from its name alone.

We can look at a summary() of this model. Look through the summary output and try to pick out the important information based on what we’ve talked about in the lectures so far.

summary(dsm_nb_xy)

## 
## Family: Negative Binomial(0.093) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y, k = 25) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.4863     0.2207  -92.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df Chi.sq  p-value    
## s(x,y) 13.91  17.55  66.36 1.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.08   Deviance explained = 34.9%
## -REML = 394.96  Scale est. = 1         n = 949

Things to look for:

Was the bivariate smooth of space (s(x, y)) significant at the highest level?
What was the effective degrees of freedom (EDF) for this model? How does that compare with the k we provided?
Does the number of observations (n=) match the number of rows in segs?

Summary output

Answering the above questions:

From the summary output we can see that the bivariate term, s(x,y) is significant at the “0” level – that’s not too surprising given it’s the only term in the model though!
The EDF for the model was about 14. We set k=25 above, so we’re nowhere near using all of the degrees of freedom we gave the model. Note that we don’t need to reduce k, the penalty has done its job and reduced the wigglyness of the model accordingly.
We see from the summary that n = 949, we can check the number of rows in the segment data:

nrow(segs)

and confirm that’s correct. These might not match if you have NA (“not available”) values in your segment data, for example if you have covariates that don’t have values (SST due to cloud for example). If there are NAs where values are needed, the whole row will be dropped. It’s usually worth checking this to make sure you’re fitting to the data that you think you are.

Visualising output

As discussed in the lectures, the plot output is not terribly useful for bivariate smooths like these. We’ll use vis.gam() to visualise the smooth instead:

vis.gam(dsm_nb_xy, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x,y) (link scale)", asp=1)

Fitted surface (on link scale) for s(x,y)

Notes:

The plot is on the scale of the link function, the offset is not taken into account – the contour values do not represent abundance!
We set view=c("x","y") to display the smooths for x and y (we can choose any two variables in our model to display like this)
plot.type="contour" gives this “flat” plot, set plot.type="persp" for a “perspective” plot, in 3D.
The too.far=0.1 argument displays the values of the smooth not “too far” from the data (try changing this value to see what happens).
asp=1 ensures that the aspect ratio of the plot is 1, making the pixels square.
Read the ?vis.gam manual page for more information on the plotting options.

Setting basis complexity

We can set the basis complexity via the k argument to the s() term in the formula. For example the following re-fits the above model with a much smaller basis complexity than before:

dsm_nb_xy_smallk <- dsm(count~s(x, y, k=10),
                        ddf.obj=df_hn, segment.data = segs,
                        observation.data=obs,
                        family=nb())

Compare the output of vis.gam() for this model to the model with a larger basis complexity.

Univariate models

Instead of fitting a bivariate smooth of x and y using s(x, y), we could instead use the additive nature and fit the following model:

dsm_nb_x_y <- dsm(count~s(x)+ s(y),
                 ddf.obj=df_hn, segment.data = segs, 
                 observation.data=obs,
                 family=nb())

Compare this model with dsm_nb_xy using vis.gam() and summary().

Comparing to the univariate model

First taking a look at the summary:

summary(dsm_nb_x_y)

## 
## Family: Negative Binomial(0.085) 
## Link function: log 
## 
## Formula:
## count ~ s(x) + s(y) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.4301     0.2374  -86.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df Chi.sq  p-value    
## s(x) 5.171  6.302  22.59 0.001711 ** 
## s(y) 4.779  5.886  25.73 0.000444 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.052   Deviance explained = 31.3%
## -REML = 395.86  Scale est. = 1         n = 949

we can see that the s(x) and s(y) terms have EDFs lower than the (default) k=10 they were given. The combined EDFs are bigger than those for the model with s(x,y) – perhaps because these terms have a lot of work to do without being able to properly model the interaction. We can see this when plotting the s(x) + s(y) model alongside the s(x,y) model:

par(mfrow=c(1,2))
vis.gam(dsm_nb_xy, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x,y, k=25) (link scale)", asp=1)

vis.gam(dsm_nb_x_y, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x) + s(y) (link scale)", asp=1)

We can also look at univariate plots of s(x) and s(y) for dsm_nb_x_y:

plot(dsm_nb_x_y, pages=1)

the s(x) + s(y) model is additive, we don’t allow the interactions that we can have in s(x,y).

Generally, it’s best to start with s(x,y) when you’re using (projected) spatial coordinates.

Tweedie response distribution

So far, we’ve used nb() as the response – the negative binomial distribution. We can also try out the Tweedie distribution as a response by replacing nb() with tw().

Try this out and compare the results.

Refitting with Tweedie

dsm_tw_xy <- dsm(count~s(x, y, k=25),
                 ddf.obj=df_hn, segment.data = segs, 
                 observation.data=obs,
                 family=tw())
                 
dsm_tw_x_y <- dsm(count~s(x)+ s(y),
                 ddf.obj=df_hn, segment.data = segs, 
                 observation.data=obs,
                 family=tw())

Looking at the summary results for each:

summary(dsm_tw_xy)

## 
## Family: Tweedie(p=1.292) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y, k = 25) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.4637     0.2318  -88.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(x,y) 14.45  17.98 4.547  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0941   Deviance explained = 31.8%
## -REML = 396.95  Scale est. = 4.966     n = 949

summary(dsm_tw_x_y)

## 
## Family: Tweedie(p=1.306) 
## Link function: log 
## 
## Formula:
## count ~ s(x) + s(y) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.3954     0.2381  -85.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value    
## s(x) 4.943  6.057 3.224 0.00425 ** 
## s(y) 5.293  6.419 4.034 0.00033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0678   Deviance explained = 27.4%
## -REML = 399.84  Scale est. = 5.3157    n = 949

In the summary results alone we don’t see too much of a difference here. The Family: lines are different and there are some small differences between the EDFs/deviance explained etc. As we’ll see later when we look at model checking, the summary results are not the end of the story.

Estimated abundance as response

So far we’ve just used count as the response. That is, we adjusted the offset of the model to make it take into account the “effective area” of the segments (see lecture notes for a refresher).

Instead of using count we could use abundance.est, which will leave the segment areas as they are and calculate the Horvitz-Thompson estimates of the abundance per segment and use that as the response in the model. This is most useful when we have covariates in the detection function (though we can use it any time).

Try copying the code that fits the model dsm_nb_xy and make a model dsm_nb_xy_ae that replaces count for abundance.est in the model formula and uses the df_hr_ss_size detection function. Compare the results of summary and plot output between this and the count model.

Estimated abundance response models

Fitting the requested model, with abundance.est as the response, using the df_hr_ss_size model that I saved in the previous

dsm_nb_xy_ae <- dsm(abundance.est~s(x, y, k=25),
                    ddf.obj=df_hr_ss_size, segment.data = segs, 
                    observation.data=obs,
                    family=nb())

Once it’s fitted, we can look at the summary:

summary(dsm_nb_xy_ae)

## 
## Family: Negative Binomial(0.04) 
## Link function: log 
## 
## Formula:
## abundance.est ~ s(x, y, k = 25) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.1116     0.2092  -96.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df Chi.sq  p-value    
## s(x,y) 12.84   16.6  57.86 3.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0685   Deviance explained = 28.5%
## -REML = 514.66  Scale est. = 1         n = 949

Comparing these results to those from dsm_nb_xy, we don’t see too much of a difference here looking at the summary output alone. Again, we’ll see differences through the course as we have more tools available to investigate what’s going on.

Save models

It’ll be interesting to see how these models compare to the more complex models we’ll see later on. Let’s save the fitted models at this stage (add your own models to this list so you can use them later).

Saving models

# add your models here
save(dsm_nb_x_y, dsm_nb_xy, dsm_tw_x_y, dsm_tw_xy, dsm_nb_xy_ae,
     file="dsms-xy.RData")

Extra credit

If you have time, try the following:

Make the k value very big (~100 or so), what do you notice?

Making k really big

Be warned, this might take a while to fit!

dsm_nb_xy_bigk <- dsm(count~s(x, y, k=100),
                        ddf.obj=df_hn, segment.data = segs,
                        observation.data=obs,
                        family=nb())

Looking at the summary output:

summary(dsm_nb_xy_bigk)

## 
## Family: Negative Binomial(0.126) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y, k = 100) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.9640     0.2909  -72.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df Chi.sq  p-value    
## s(x,y) 29.51  40.44  91.37 9.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.132   Deviance explained =   49%
## -REML = 390.91  Scale est. = 1         n = 949

Now we see that the EDF has increased a bit! BUT not all the way up to the 100 degrees of freedom that were possible.

Plotting that along side the small and medium k values:

par(mfrow=c(1,3))
vis.gam(dsm_nb_xy_smallk, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x,y, k=10) (link scale)", asp=1)

vis.gam(dsm_nb_xy, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x,y, k=25) (link scale)", asp=1)

vis.gam(dsm_nb_xy_bigk, view=c("x","y"), plot.type="contour", 
        too.far=0.1, main="s(x,y, k=100) (link scale)", asp=1)

We can see a lot more detail in the plot now.

The important thing to think about here is that whenever we increase k we might expect some increase in the EDF. If we got k right then the change is small. In this case the change is fairly big, so we might want to use k=50 as a precaution. However, this is a bit of an artificial situation: we don’t have other environmental covariates in the model, so the s(x,y) term is doing a lot of work! In a model with other covariates we’d expect them to explain more of the variation.

Summarizing models

To summarize multiple models, the following small bit of code might help:

# function to extract various bits of information from
# a model object, you can adapt to be more informative
summarize_dsm <- function(model){

  summ <- summary(model)

  data.frame(response = model$family$family,
             terms    = paste(rownames(summ$s.table), collapse=", "),
             AIC      = AIC(model)
            )

}

# make a list of models, in this case just the count models
model_list <- list(dsm_nb_xy, dsm_nb_x_y, dsm_tw_xy, dsm_tw_x_y)

# run this over each model in the list and make a table
library(plyr)
summary_table <- ldply(model_list, summarize_dsm)
row.names(summary_table) <- c("dsm_nb_xy", "dsm_nb_x_y", "dsm_tw_xy", "dsm_tw_x_y")

We can then again use the kable function to make a table, in this case sorting by AIC (though note that we can’t compare between Tweedie and negative binomial models as the former is a continuous distribution and the latter isn’t):

summary_table <- summary_table[order(summary_table$AIC),]

Then the table:

kable(summary_table)

	response	terms	AIC
dsm_nb_xy	Negative Binomial(0.093)	s(x,y)	785.5534
dsm_nb_x_y	Negative Binomial(0.085)	s(x), s(y)	789.7555
dsm_tw_xy	Tweedie(p=1.292)	s(x,y)	1249.1030
dsm_tw_x_y	Tweedie(p=1.306)	s(x), s(y)	1252.2601

In each case the s(x,y) model is preferred.

We’ll discuss how to decide between models in more detail in the lectures!

Spatially explicit models for distance sampling data

Practical “solution” 2: Simple density surface models

David L Miller