Aims

By the end of this practical you should feel confident doing the following:

  • Exploring distance sampling data before fitting detection functions
  • Fitting detection functions with ds(), with and without covariates
  • Checking goodness of fit and comparing models
  • Estimating abundance from a fitted detection function


Solutions are given between these horizontal lines with bold headers

The “solutions” presented here are not definitive; many modelling options are possible. Our aim in this practical is to fit some detection functions to use later on and to get you familiar with the Distance package.


Preamble

First we need to load the requisite R libraries:

library(Distance)
## Loading required package: mrds
## This is mrds 2.2.3
## Built: R 4.0.2; ; 2020-08-01 10:33:56 UTC; unix
## 
## Attaching package: 'Distance'
## The following object is masked from 'package:mrds':
## 
##     create.bins
library(ggplot2)
library(knitr)

We can then load the data from the RData file (this needs to be in your R working directory):

load("spermwhale.RData")

Exploring the data

Before setting off fitting detection functions, let’s look at the relationship of various variables in the data.

The most important covariate in a distance sampling analysis is distance itself. We can plot a histogram of the distances to check that (1) we imported the data correctly and (2) it conforms to the usual shape for line transect data.

hist(distdata$distance, xlab="Distance (m)",
     main="Distance to sperm whale observations")
Distribution of observed perpendicular detection distances.

We can also see how many observations were recorded per Beaufort level (some background on Beaufort with photos). We can first use table() to see what values occur and how many there are:

table(distdata$Beaufort)
## 
##   0   1 1.8   2 2.4 2.5 2.7 2.8 2.9   3 3.1 3.5 3.6 3.7 3.9   4 4.5   5 
##   5   4   7  31   8  10   1   1   4  41   2   6   1   2   2   8   1   3

The survey protocol took an average of several observers’ assessments of the weather, hence the decimal values. We can use the breaks= argument to hist() to bin these values into unit-width intervals. We’ll plot the values from the segments at the same time, so we can get an idea of what the conditions were like over the whole survey, not just when we saw sperm whales:

par(mfrow=c(1,2))

hist(distdata$Beaufort, breaks=seq(0, 5, by=1),
     main="Observations per Beaufort level",
     xlab="Beaufort")

hist(segs$Beaufort, breaks=seq(0, 5, by=1),
     main="Segments per Beaufort level",
     xlab="Beaufort")

We can see that most of the segments had Beaufort >= 2 and very few had Beaufort of 0-1 (right plot). There are correspondingly few observations at the lowest Beaufort levels (left plot), simply because there was little survey effort in those conditions. With increasing Beaufort the number of observations falls off (left plot), even though there were plenty of segments at those higher Beaufort levels (right plot).

We can create a rounded version of the variable using the cut function in R for later:

# bin Beaufort into unit-width intervals: [0,1], (1,2], ..., (4,5]
distdata$BeaufortCut <- cut(distdata$Beaufort, seq(0, 5, by=1),
                            include.lowest=TRUE,
                            ordered_result=TRUE)
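
As a quick check, we can tabulate the new variable to make sure the binning did what we expect:

# how many observations fall in each Beaufort bin?
table(distdata$BeaufortCut)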

Given we are including data from two different surveys we can also check that the surveys have similar histograms of distance (we’ll check with the detection function below whether we need to include a covariate to control for this).

par(mfrow=c(1,2))

hist(subset(distdata, Survey=="en04395")$distance,
     main="", ylim=c(0, 30),
     xlab="Distance (m)")

hist(subset(distdata, Survey=="GU0403")$distance,
     main="", ylim=c(0, 30),
     xlab="Distance (m)")

Fitting detection functions

It’s now time to fit some detection function models. We’ll use the ds() function from the Distance package to fit the detection function. You can access the help file for the ds() function by typing ?ds – this will give you information about what the different arguments to the function are and what they do.

We can fit a very simple model with the following code:

df_hn <- ds(data=distdata, truncation=6000, key="hn", adjustment=NULL)
## Fitting half-normal key function
## Key only model: not constraining for monotonicity.
## AIC= 2252.06
## No survey area information supplied, only estimating detection function.

Let’s dissect the call and see what each argument means:

  • data=distdata: the data.frame of observations, which must (at least) contain a column named distance.
  • truncation=6000: the truncation distance \(w\); observations further away than 6000m are discarded before fitting.
  • key="hn": the key function to use, here half-normal ("hr" gives hazard-rate).
  • adjustment=NULL: do not fit any adjustment terms.

Another useful argument for this practical is formula=, which we use below to include covariates in the scale parameter.

We set truncation=6000 here, why? Truncation removes observations in the tail of the distribution. We do this because we mostly care about the detection function near zero distance, which is where it is largest and so has the most effect on \(\hat{p}\). We need to trade off getting the shape right against spending lots of parameters modelling what happens far from the observer, where it has little effect on the result. Here we retain about 96% of the data. Len Thomas suggests choosing the truncation distance \(w\) so that \(g(w)\approx 0.15\) (i.e., at the truncation distance the detection probability should be about 0.15). You can play around with this if you’re interested by changing truncation= and seeing what the differences in \(\hat{p}\) are (but push on to the next section if you’re happy!).
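
For instance, here is a minimal sketch (the object name df_hn_trunc5k is ours) refitting the simple half-normal with a tighter truncation; compare the “Average p” lines in the two summaries:

# refit with a 5km truncation and compare the average detectability
df_hn_trunc5k <- ds(data=distdata, truncation=5000, key="hn", adjustment=NULL)
summary(df_hn)
summary(df_hn_trunc5k)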


Trying to fit a few detection functions

First we can try using a hazard-rate detection function:

df_hr <- ds(data=distdata, truncation=6000, key="hr", adjustment=NULL)
## Fitting hazard-rate key function
## Key only model: not constraining for monotonicity.
## AIC= 2247.594
## No survey area information supplied, only estimating detection function.

We can also try using covariates in the detection function. Here we use sea state (as we discretized above) with both half-normal and hazard-rate key functions:

df_hn_beau <- ds(data=distdata, truncation=6000, key="hn",
                 adjustment=NULL, formula=~BeaufortCut)
## Fitting half-normal key function
## AIC= 2236.726
## No survey area information supplied, only estimating detection function.
df_hr_beau <- ds(data=distdata, truncation=6000, key="hr",
                 adjustment=NULL, formula=~BeaufortCut)
## Fitting hazard-rate key function
## AIC= 2240.354
## No survey area information supplied, only estimating detection function.

We can also include the observed group size:

df_hn_beau_size <- ds(data=distdata, truncation=6000, key="hn",
                      adjustment=NULL, formula=~BeaufortCut+size)
## Fitting half-normal key function
## AIC= 2238.411
## No survey area information supplied, only estimating detection function.
df_hr_beau_size <- ds(data=distdata, truncation=6000, key="hr",
                      adjustment=NULL, formula=~BeaufortCut+size)
## Fitting hazard-rate key function
## AIC= 2242.175
## No survey area information supplied, only estimating detection function.
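
Since we compared the distance histograms from the two surveys earlier, we could also check whether a Survey covariate is worth including. A sketch (the name df_hn_survey is ours; the AIC is printed as the model is fitted, so it can be compared with df_hn above):

# half-normal with survey as a covariate in the scale parameter
df_hn_survey <- ds(data=distdata, truncation=6000, key="hn",
                   adjustment=NULL, formula=~Survey)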

Again this is just a sampling of the possible models we could fit to the detectability data. Next we’ll inspect these models…


Summaries

We can look at the summary of the fitted detection function using the summary() function:

summary(df_hn)
## 
## Summary for distance analysis 
## Number of observations :  132 
## Distance range         :  0  -  6000 
## 
## Model : Half-normal key function 
## AIC   : 2252.06 
## 
## Detection function parameters
## Scale coefficient(s):  
##             estimate         se
## (Intercept) 7.900732 0.07884776
## 
##                        Estimate          SE         CV
## Average p             0.5490484  0.03662569 0.06670757
## N in covered region 240.4159539 21.32287580 0.08869160

Looking at summaries

I won’t go through all of the above here, but to see the effect of including covariates, let’s compare the summary(df_hn) output above with that for the hazard-rate model with Beaufort and group size:

summary(df_hr_beau_size)
## 
## Summary for distance analysis 
## Number of observations :  132 
## Distance range         :  0  -  6000 
## 
## Model : Hazard-rate key function 
## AIC   : 2242.175 
## 
## Detection function parameters
## Scale coefficient(s):  
##                     estimate        se
## (Intercept)       8.59365686 1.0118407
## BeaufortCut(1,2] -1.24495135 1.0240637
## BeaufortCut(2,3] -0.34995758 0.9789634
## BeaufortCut(3,4] -1.57939745 1.0583225
## BeaufortCut(4,5] -3.64460160 1.3577675
## size             -0.07723057 0.1353780
## 
## Shape coefficient(s):  
##              estimate        se
## (Intercept) 0.5318624 0.2700877
## 
##                      Estimate          SE        CV
## Average p             0.36359  0.09128592 0.2510683
## N in covered region 363.04629 95.41676384 0.2628226

Note the difference in the Average p between the two models. In the covariate case, this is averaged over the detection probabilities for the different covariate values. We can look at the distribution of the probabilities of detection for each observation using the p_dist_table function:

p_dist_table(df_hr_beau_size)
##          p count
##    0 - 0.1     4
##  0.1 - 0.2     0
##  0.2 - 0.3    21
##  0.3 - 0.4    38
##  0.4 - 0.5     0
##  0.5 - 0.6     1
##  0.6 - 0.7    26
##  0.7 - 0.8    38
##  0.8 - 0.9     4
##    0.9 - 1     0
## Range of probabilities:  0.045 - 0.84

We can see there’s a broad range of probabilities of detection estimated for this model that get averaged out in the above summary output. Always worth checking!
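
As a quick check of our own, the “Average p” reported for a covariate model is the number of observations divided by the estimated number in the covered region, i.e. \(n / \sum_i 1/p_i\) (consistent with the summary above, where 132/363.05 is approximately 0.364):

# per-observation detection probabilities under the covariate model
p_i <- predict(df_hr_beau_size)$fitted
# n / sum(1/p_i) should match the "Average p" in summary(df_hr_beau_size)
length(p_i) / sum(1/p_i)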


Goodness of fit

Goodness of fit quantile-quantile plot and test results can be accessed using the gof_ds() function:

gof_ds(df_hn)
Goodness of fit QQ plot of half-normal detection function.

## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.396179 p-value = 0.0739475

Comparing Q-Q plots

Let’s run gof_ds() on each of our models and look at the corresponding plots arranged in a grid:

# make a grid, 3 rows by 2 columns
par(mfrow=c(3,2))

gof_ds(df_hn, main="df_hn")
## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.396179 p-value = 0.0739475
gof_ds(df_hr, main="df_hr")
## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.0453519 p-value = 0.903832
gof_ds(df_hn_beau, main="df_hn_beau")
## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.170184 p-value = 0.333529
gof_ds(df_hr_beau, main="df_hr_beau")
## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.0451092 p-value = 0.905226
gof_ds(df_hn_beau_size, main="df_hn_beau_size")
## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.166067 p-value = 0.344127
gof_ds(df_hr_beau_size, main="df_hr_beau_size")

## 
## Goodness of fit results for ddf object
## 
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.0478406 p-value = 0.889263

We can see that overall the half-normal Q-Q plots are further from the line in each case than those of the corresponding hazard-rate models, which would be good grounds for preferring the hazard-rate here. I’ll not discard the half-normal models just yet, so I can show some other model properties below.

It’s hard to see what the difference is between the hazard-rate models at the moment, so we’ll investigate those more below.


Plotting

We can plot the models simply using the plot() function:

plot(df_hn)
Half-normal detection function.

The dots on the plot indicate the distances where observations are. We can remove them (particularly useful for a model without covariates) using the additional argument showpoints=FALSE (try this out!).


Plotting detection functions

We’ll start by plotting everything in a big grid again, dropping the points where there are no covariates in the model:

# make a grid, 3 rows by 2 columns
par(mfrow=c(3,2))

plot(df_hn, main="df_hn", showpoints=FALSE)
plot(df_hr, main="df_hr", showpoints=FALSE)
plot(df_hn_beau, main="df_hn_beau")
plot(df_hr_beau, main="df_hr_beau")
plot(df_hn_beau_size, main="df_hn_beau_size")
plot(df_hr_beau_size, main="df_hr_beau_size")

For the Beaufort case we can use the add_df_covar_line function to add lines to our plot, rather than using points:

# remember to remove the points
plot(df_hr_beau, main="df_hr_beau", showpoints=FALSE)

# make a data.frame with the BeaufortCut levels in it
bf_data <- data.frame(BeaufortCut=levels(df_hr_beau$ddf$data$BeaufortCut))

# draw the lines
add_df_covar_line(df_hr_beau, data=bf_data)
# draw a legend
legend(4300, 1, lty=1:6, legend=c("Average", bf_data$BeaufortCut))

The data= argument here takes a little data.frame with the levels of BeaufortCut in it, and a line is drawn for each row. These plots can get a little more tricky with more than one covariate in the detection function, since a value must be supplied for every covariate in the model.
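
For example, for df_hr_beau_size we would also need to fix size at some value. A sketch (the object name bf_size_data, the choice of the median group size, and the legend placement are ours):

# plot the average detection function without points
plot(df_hr_beau_size, main="df_hr_beau_size", showpoints=FALSE)

# one row per Beaufort level, group size fixed at its observed median
bf_size_data <- data.frame(BeaufortCut=levels(df_hr_beau_size$ddf$data$BeaufortCut),
                           size=median(distdata$size))

# draw a line per row, then label the lines
add_df_covar_line(df_hr_beau_size, data=bf_size_data)
legend(4300, 1, lty=1:6, legend=c("Average", bf_size_data$BeaufortCut))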


Now you try…

Now try fitting a few models and comparing them using AIC. Don’t try to fit all possible models, just try a selection (say, a hazard-rate, a model with adjustments and a couple with different covariates). You can also try out changing the truncation distance.

Here’s an example to work from, adding covariates into the model using the formula= argument:

df_hr_ss_size <- ds(distdata, truncation=6000, adjustment=NULL,
                    key="hr", formula=~Beaufort+size)
## Fitting hazard-rate key function
## AIC= 2249.327
## No survey area information supplied, only estimating detection function.
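
You could also try a model with adjustment terms, for example a half-normal key with cosine adjustments (this is the ds() default, with the adjustment order chosen by AIC). A sketch, where df_hn_cos is our own name:

# half-normal key with cosine adjustments selected by AIC
df_hn_cos <- ds(data=distdata, truncation=6000, key="hn", adjustment="cos")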

Once you have the hang of writing models and looking at the differences between them, you should move onto the next section.

Model selection

Looking at the models individually can be a bit unwieldy – it’s nicer to put that data into a table and sort the table by the relevant statistic. The function summarize_ds_models() in the Distance package performs this task for us.

The code below will make a results table with relevant statistics for model selection in it. The summarize_ds_models() function takes a series of fitted model objects as its arguments. We can do that with the two models that I fitted like so:

model_table <- summarize_ds_models(df_hn, df_hr_ss_size, output="plain")
kable(model_table, digits=3, row.names = FALSE, escape=FALSE,
      caption = "Comparison of half normal and hazard rate with sea state and group size.")
Comparison of half normal and hazard rate with sea state and group size.

Model           Key function   Formula            C-vM p-value   Average detectability   se(Average detectability)   Delta AIC
df_hr_ss_size   Hazard-rate    ~Beaufort + size          0.880                   0.355                       0.074       0.000
df_hn           Half-normal    ~1                        0.074                   0.549                       0.037       2.733

(You can add the models you fitted above into this list.)


Model selection

Let’s expand that table with the models I fitted above…

model_table <- summarize_ds_models(df_hn, df_hr, df_hn_beau, df_hr_beau, df_hn_beau_size, df_hr_beau_size, output="plain")
kable(model_table, digits=3, row.names = FALSE, escape=FALSE,
      caption = "Comparison of half normal and hazard rate with sea state and group size.")
Comparison of half normal and hazard rate with sea state and group size.

Model             Key function   Formula               C-vM p-value   Average detectability   se(Average detectability)   Delta AIC
df_hn_beau        Half-normal    ~BeaufortCut                 0.334                   0.463                       0.052       0.000
df_hn_beau_size   Half-normal    ~BeaufortCut + size          0.344                   0.462                       0.051       1.686
df_hr_beau        Hazard-rate    ~BeaufortCut                 0.905                   0.365                       0.091       3.628
df_hr_beau_size   Hazard-rate    ~BeaufortCut + size          0.889                   0.364                       0.091       5.449
df_hr             Hazard-rate    ~1                           0.904                   0.362                       0.077      10.868
df_hn             Half-normal    ~1                           0.074                   0.549                       0.037      15.334

Looking at the table we see that the df_hn_beau model does best in AIC terms, but looking back at the Q-Q plots above, its fit is not the best there. It’s best not to base model selection on a single measure (say, AIC), but to use multiple ways of assessing our models. In this case the Q-Q plots are telling us that the half-normal models are not as good (though we can see from the table that the Cramér-von Mises test is not significant at the 5% level for any of these models, hence the next comment).


A further note about model selection for the sperm whale data

Note that there is a considerable spike at small distances in our distance data. This may be down to observers guarding the trackline (spending too much time searching near zero distance). It’s therefore possible that the hazard-rate model is overfitting to this peak, so we’d like to investigate results from the half-normal model too and see what the effects are on the final spatial models.

Estimating abundance

Just for fun, let’s estimate abundance from these models using a Horvitz-Thompson-type estimator.

We know the Horvitz-Thompson estimator has the following form: \[ \hat{N} = \frac{A}{a} \sum_{i=1}^n \frac{s_i}{p_i} \] we can calculate each part of this equation in R:

  • \(A\) is the total area of the region we want to estimate abundance for. Here \(A = 5.285 \times 10^{11}\,\mathrm{m}^2\).
  • \(a\) is the total area we surveyed. We know the total transect length was \(L = 9{,}498{,}474\,\mathrm{m}\) and the truncation distance was \(w = 6000\,\mathrm{m}\); since \(a = 2wL\) we can calculate \(a\).
  • \(s_i\) are the group sizes, they are stored in df_hn$ddf$data$size.
  • \(p_i\) are the probabilities of detection, we can obtain them using predict(df_hn$ddf)$fitted.

We know that in general operations are vectorised in R, so calculating c(1, 2, 3)/c(4, 5, 6) gives c(1/4, 2/5, 3/6). We can therefore just divide the vector of \(s_i\) values by the vector of \(p_i\) values and use the sum() function to add up the results.

Try estimating abundance using the formula above for both df_hn and your favourite model from the previous section:


Calculating the Horvitz-Thompson estimate

Let’s try this out for our model df_hr_beau:

# total area of the region we want to estimate abundance for
A <- 5.285e+11

# total area we surveyed (2wL)
a <- 2 * 6000 * 9498474

# group sizes
group_sizes <- df_hr_beau$ddf$data$size
# need to get those that are within truncation!
group_sizes <- group_sizes[df_hr_beau$ddf$data$distance <= 6000]

# probabilities of detection
probs <- predict(df_hr_beau)$fitted

# now the H-T estimator

Nhat_hr_beau <- (A/a) * sum(group_sizes/probs)
Nhat_hr_beau
## [1] 2669.581

So this is our “flat” (non-spatial) estimate, which we might want to bear in mind as we go through the rest of the exercises.
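
For comparison, here is the same calculation for df_hn (a sketch following the same steps, with our own object names; since df_hn has no covariates, all the \(p_i\) are equal):

# group sizes within the truncation distance
group_sizes_hn <- df_hn$ddf$data$size[df_hn$ddf$data$distance <= 6000]

# probabilities of detection (all equal for a model without covariates)
probs_hn <- predict(df_hn$ddf)$fitted

# Horvitz-Thompson estimate for the half-normal model
Nhat_hn <- (A/a) * sum(group_sizes_hn/probs_hn)
Nhat_hn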


Note that in the solutions to this exercise (posted on the course website) I show how to use the function dht() to estimate abundance (and uncertainty) for a detection function analysis. This involves some more complex data manipulation steps, so we’ve left it out here in favour of getting to grips with the mathematics.


Using dht

We can do this analysis using the dht function in the mrds package too (see ?dht and the link at the end of this answer for more information). That way we also get an estimate of variance. We need to set up some tables to do this…

The total area of the study area is \(A = 5.285 \times 10^{11}\,\mathrm{m}^2\); we can use that to construct the region table, with one region in it:

region.table <- data.frame(Region.Label = "StudyArea",
                           Area         = 5.285e+11)

We need to put our segments back together into transects to assemble the sample table (you might need to re-download the spermwhale.RData file as I updated the data for this).

The code here is fiddly, so don’t worry too much about understanding it. This is only because I’m trying to get dsm-formatted data to work with dht! In practice this would be easier.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

transects <- segs %>%
   # include only the TransectID and Effort columns
   select(TransectID, Effort) %>%
   # by Transect...
   group_by(TransectID) %>%
   # add all the effort up
   summarise(Effort = sum(Effort)) %>%
   # Sample.Label is now the transect id
   rename(Sample.Label = TransectID) %>%
   # get only the unique rows
   distinct()
## `summarise()` ungrouping output (override with `.groups` argument)
# add the Region.Label, same for every row
transects$Region.Label <- "StudyArea"

We can look at this:

transects
## # A tibble: 69 x 3
##    Sample.Label     Effort Region.Label
##    <chr>             <dbl> <chr>       
##  1 en0439520040624 144045. StudyArea   
##  2 en0439520040625 167647. StudyArea   
##  3 en0439520040626  59997. StudyArea   
##  4 en0439520040627  33822. StudyArea   
##  5 en0439520040628 147415. StudyArea   
##  6 en0439520040629 101108. StudyArea   
##  7 en0439520040630  98119. StudyArea   
##  8 en0439520040701 147763. StudyArea   
##  9 en0439520040702 172935. StudyArea   
## 10 en0439520040703 156280. StudyArea   
## # … with 59 more rows

We need to build a new observation table with the new Sample.Labels, so first create that data:

# we can use the obs table for the DSM and merge that
# with the segment data
obs_dht <- merge(obs, segs[,c("Sample.Label", "TransectID")],
                 by="Sample.Label")
# but use the transect IDs as Sample.Label
obs_dht$Sample.Label <- obs_dht$TransectID
# then just extract the columns you need
obs_dht <- obs_dht[, c("object", "Sample.Label")]
# and add the Region.Label
obs_dht$Region.Label <- "StudyArea"

Now we can use dht to calculate the abundance for df_hr_beau, say (here I’m re-using the obs_dht table as the observation data):

Nhat_dht_hr_beau <- dht(model=df_hr_beau$ddf, 
                        region.table=region.table,
                        sample.table=transects,
                        obs.table=obs_dht)

We can then look at the abundance estimates. dht calculates this for individuals and at the group level (“clusters”):

Nhat_dht_hr_beau
## 
## Summary for clusters
## 
## Summary statistics:
##      Region      Area  CoveredArea  Effort   n  k           ER        se.ER
## 1 StudyArea 5.285e+11 113981689066 9498474 132 69 1.389697e-05 2.701497e-06
##       cv.ER
## 1 0.1943947
## 
## Abundance:
##   Label Estimate       se        cv      lcl      ucl       df
## 1 Total 1679.042 492.7757 0.2934862 952.4239 2960.009 191.3157
## 
## Density:
##   Label     Estimate           se        cv          lcl          ucl       df
## 1 Total 3.176996e-09 9.324043e-10 0.2934862 1.802127e-09 5.600774e-09 191.3157
## 
## Summary for individuals
## 
## Summary statistics:
##      Region      Area  CoveredArea  Effort     n           ER        se.ER
## 1 StudyArea 5.285e+11 113981689066 9498474 238.7 2.513035e-05 5.667492e-06
##       cv.ER mean.size   se.mean
## 1 0.2255238  1.808333 0.1020928
## 
## Abundance:
##   Label Estimate       se        cv      lcl      ucl      df
## 1 Total 2669.581 702.6931 0.2632223 1601.332 4450.457 162.982
## 
## Density:
##   Label    Estimate           se        cv          lcl          ucl      df
## 1 Total 5.05124e-09 1.329599e-09 0.2632223 3.029957e-09 8.420921e-09 162.982
## 
## Expected cluster size
##   Region Expected.S se.Expected.S cv.Expected.S
## 1  Total   1.589942      0.169584     0.1066605
## 2  Total   1.589942      0.169584     0.1066605

We can keep this abundance and CV in mind as we explore the spatial models through the course.

More examples of doing these kinds of analyses are available on the distance sampling exercises page.


Accounting for perception bias

It’s common, especially in marine surveys, for animals at zero distance to be missed by observers. There are several ways to deal with this issue. For now, we are just going to use a very simple constant correction factor to inflate the abundance estimate.

From Palka (2006), we think that detectability on the track line was such that \(g(0)=0.46\); we can apply that correction to our abundance estimate (in a very primitive way):


Perception bias correction

Nhat_hr_beau/0.46
## [1] 5803.436

This kind of correction works fine when it’s appropriate to adjust by a single number; in general we’d prefer to model the perception bias, for example using “mark-recapture distance sampling” techniques.

Save model objects

Save your top few models in an RData file, so we can load them up later on. We’ll also save the distance data we used to fit our models.


Saving model objects

save(df_hn, df_hr, df_hn_beau, df_hr_beau, df_hn_beau_size,
     df_hr_beau_size, df_hr_ss_size,
     distdata,
     file="df-models.RData")

You can check it worked by using the load() function to recover the models.
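
For example, in a fresh R session:

# reload the saved models and data, then list what came back
load("df-models.RData")
ls()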

References

Palka, D. L. (2006). Summer Abundance Estimates of Cetaceans in US North Atlantic Navy Operating Areas. Northeast Fisheries Science Center Reference Document 06-03.