Home | Installation | Control Streams | Bootstrap | Randomization Test | Visual Predictive Check | Autocovariate | Files | References
Last Updated: 3 April 2012
The bootstrap was invented by Bradley Efron, Professor of Statistics at Stanford, in the 1970s. Davison & Hinkley have published an excellent text: Davison, A. C. & Hinkley, D. V., Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No. 1), Cambridge University Press, 1997.
My interest in applying the bootstrap to NONMEM was stimulated by John Parke and led to a publication based on an early implementation of the nmbs procedure (Parke, Holford & Charles 1999). A subsequent publication used it to describe the time course of washout of anti-parkinsonian drug effect (Hauser & Holford 2002).
The purpose of the bootstrap is to learn about the statistical parameters of a distribution, e.g. its mean or standard error, when the true distribution is unknown and one has only a set of observations. The key idea is to use the set of observations as an empirical representation of the true distribution.
Suppose the true distribution is called F and the empirical distribution is Fhat. If we sample from Fhat we can use this sample to compute a statistic, Thetastar, which is an estimate of the true value of Theta, a parameter of F. If we sample many times from Fhat we can learn about the distribution of Thetastar and describe its properties, e.g. its standard error or its 95% confidence interval.
This kind of sampling is "sampling with replacement", i.e. it is possible to get one of the original observations more than once in the bootstrap sample.
#Data is the empirical distribution vector Fhat[] of length n
#Let NBOOT be the number of bootstrap samples
#Should be at least 100 for the se and perhaps 1000 or more for a 95% CI
for (i=1; i <= NBOOT; i++ ) {
   #Sample the elements of Fhat n times using a uniform random distribution
   for ( j=1; j <= n; j++ ) {
      isub=int(n*rand())+1 # randomly sample a subject
      BS[j] = Fhat[isub]
   }
   #Estimate the parameter from the bootstrap sample e.g. the average
   Thetastar[i] = average(BS)
}
#Describe the distribution of Thetastar e.g. its standard error
se = stdev(Thetastar)
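The pseudocode above can be sketched as runnable Python. This is an illustrative translation, not part of the awk implementation that follows; the function name `bootstrap_se` and its defaults are my own.

```python
import random

def bootstrap_se(fhat, nboot=100, seed=None):
    """Bootstrap standard error of the mean of the sample fhat.

    fhat is the empirical distribution (a list of observations);
    nboot is the number of bootstrap samples (>= 100 for the SE).
    """
    rng = random.Random(seed)
    n = len(fhat)
    thetastar = []
    for _ in range(nboot):
        # sample n observations from fhat WITH replacement
        bs = [fhat[rng.randrange(n)] for _ in range(n)]
        # the statistic here is the average of the bootstrap sample
        thetastar.append(sum(bs) / n)
    # the standard deviation of the bootstrap means is the SE
    mean = sum(thetastar) / nboot
    var = sum((t - mean) ** 2 for t in thetastar) / (nboot - 1)
    return var ** 0.5
```

For a sample of 50 observations uniformly spread over 1..10 this returns a value close to the analytic standard error of the mean, sd/sqrt(n), about 0.41.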
An implementation of the bootstrap using awk (bs.awk):
awk -f bs.awk -v NBOOT=100 -v SEED=981124 cep.dat
nboot 100 Mean 0.0714337 SE 0.0114872
#bs.awk
#Usage: awk -f bs.awk -v NBOOT= -v SEED= datafile
#If SEED is left blank, srand is not called and awk's RNG
#starts at the same place every run; supplying a value for
#SEED sets the starting place
BEGIN {
   if ( NBOOT == "" ) NBOOT=100
   if ( SEED != "" ) srand(SEED)
   for (i=1; i <= 10; i++ ) #warm up the RNG
      rand()
   MT="." #missing value
}
#main program
{
   # read the data
   n=getFhat()
   # compute the bootstrap
   bootstrap(n,NBOOT)
   exit
}
function getFhat()
{
   n=0
   do
   {
      n++
      Fhat[n]=$NF # record the current line (including the first) before reading the next
      ok=getline
   }
   while ( ok > 0 )
   return n
}
function bootstrap(n,nboot,   i) {
   for ( i=1; i <= nboot; i++ ) {
      Thstar[i]=theta(n,Fhat)
   }
   stats(nboot,Thstar)
}
function theta(n,Fhat,   j,fhat) {
   sum=0; nobs=0
   for ( j=1; j <= n; j++ ) {
      fhat=Fhat[int(n*rand())+1]
      if ( fhat != MT ) {
         # your favourite "statistic" gets calculated from the data
         sum=sum+fhat
         nobs++
      }
   }
   if ( nobs > 0 ) {
      # finish the calculation of the statistic
      return sum/nobs
   }
   else {
      return 0
   }
}
function stats(nboot,Thstar)
{
   sum=0
   for ( i=1; i <= nboot; i++ ) {
      sum=sum+Thstar[i]
   }
   # NB The "data" in this case is the mean of each BS sample
   # so the std deviation of this set of means is the BS
   # standard error!
   if ( nboot > 0 ) {
      mean=sum/nboot
      if ( nboot > 1 ) {
         var=0
         for (i=1; i<=nboot; i++) {
            resid=Thstar[i]-mean
            var=var+resid*resid
         }
         sd=sqrt(var/(nboot-1))
      } else {
         sd=MT
      }
   }
   else {
      mean=MT
      sd=MT
   }
   printf("nboot %5s\tMean %8s\tSE %8s\n",nboot,mean,sd)
}
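The awk script reports only the bootstrap standard error, but as noted above, with 1000 or more bootstrap samples one can also read a 95% confidence interval directly off the sorted bootstrap estimates (the percentile method). A hedged Python sketch, with the function name `bootstrap_ci` and its defaults my own:

```python
import random

def bootstrap_ci(fhat, nboot=1000, alpha=0.05, seed=None):
    """Percentile (1-alpha) confidence interval for the mean,
    from nboot bootstrap resamples of the observations fhat."""
    rng = random.Random(seed)
    n = len(fhat)
    # bootstrap means, sorted so percentiles can be read off directly
    thetastar = sorted(
        sum(fhat[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(nboot)
    )
    lo = thetastar[int(nboot * alpha / 2)]
    hi = thetastar[int(nboot * (1 - alpha / 2)) - 1]
    return lo, hi
```

With nboot=1000 and alpha=0.05 the interval is simply the 25th and 975th of the sorted bootstrap means, which is why far more resamples are needed for a CI than for a standard error.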