Introduction to Computer-Intensive Methods of Data Analysis in Biology / Edition 1

by Derek A. Roff
ISBN-10: 0521608651
ISBN-13: 9780521608657
Pub. Date: 05/25/2006
Publisher: Cambridge University Press

Paperback

$62.99

Overview

This 2006 guide to the contemporary toolbox of methods for data analysis will serve graduate students and researchers across the biological sciences. Modern computational tools, such as Maximum Likelihood, Monte Carlo and Bayesian methods, mean that data analysis no longer depends on elaborate assumptions designed to make analytical approaches tractable. These new 'computer-intensive' methods are currently not consistently available in statistical software packages and often require more detailed instructions. The purpose of this book therefore is to introduce some of the most common of these methods by providing a relatively simple description of the techniques. Examples of their application are provided throughout, using real data taken from a wide range of biological research. A series of software instructions for the statistical software package S-PLUS are provided along with problems and solutions for each chapter.

Product Details

ISBN-13: 9780521608657
Publisher: Cambridge University Press
Publication date: 05/25/2006
Edition description: 1ST
Pages: 378
Product dimensions: 6.69(w) x 9.61(h) x 0.79(d)

About the Author

Derek A. Roff is a Professor in the Department of Biology at the University of California, Riverside.

Read an Excerpt

Introduction to Computer-Intensive Methods of Data Analysis in Biology
Cambridge University Press
Excerpt



1

An introduction to computer-intensive methods




What are computer-intensive data methods?

For the purposes of this book, I define computer-intensive methods as those that involve an iterative process and hence cannot readily be done except on a computer. The first case I examine is maximum likelihood estimation, which forms the basis of most of the parametric statistics taught in elementary statistical courses, though the derivation of the methods via maximum likelihood is probably not often given. Least squares estimation, for example, can be justified by the principle of maximum likelihood. For simple cases, such as estimation of the mean, the variance, and linear regression, analytical solutions can be obtained; but in more complex cases, such as parameter estimation in nonlinear regression, although maximum likelihood can still be used to define the appropriate parameters, the solution can only be obtained by numerical methods. Most statistical computer packages now have the option to fit models by maximum likelihood, but they typically require one to supply the model (logistic regression is a notable exception).
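
A minimal R-style sketch of this idea, using an arbitrary illustrative data vector x, fits a normal mean and variance by numerically maximizing the log-likelihood and recovers the familiar analytic estimates:
# Numerical maximum likelihood for a normal sample (illustrative data)
  x <- c(4.1, 5.3, 6.0, 4.8, 5.5, 5.1)
# Negative log-likelihood; the standard deviation is fitted on the log scale
  negloglik <- function(p) -sum(log(dnorm(x, mean = p[1], sd = exp(p[2]))))
  Fit <- nlminb(c(0, 0), negloglik)         # Iterative numerical minimization
  Fit$par[1]                                # ML estimate of the mean (= sample mean)
  exp(Fit$par[2])^2                         # ML estimate of the variance (= sum of squares/n)
  mean(x); sum((x - mean(x))^2)/length(x)   # Analytic equivalents for comparison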

The other methods discussed in this book may have histories as long as that of maximum likelihood, but none has been so widely applied, mostly because, without the aid of computers, the methods are too time-consuming. Even with the aid of a fast computer, the implementation of a computer-intensive method can chew up hours, or even days, of computing time. It is, therefore, imperative that the appropriate technique be selected. Computer-intensive methods are not panaceas: the English adage “you can’t make a silk purse out of a sow's ear” applies equally well to statistical analysis. What computer-intensive methods allow one to do is to apply a statistical analysis in situations where the more “traditional” methods fail. It is important to remember that, in any investigation, great efforts should be put into making the experimental design amenable to traditional methods, as these have both well-understood statistical properties and are easily carried out, given the available statistical programs. There will, however, inevitably be circumstances in which the assumptions of these methods cannot be met. In the next section, I give several examples that illustrate the utility of computer-intensive methods discussed in this book. Table 1.1 provides an overview of the methods and comments on their limitations.

Why computer-intensive methods?

A common technique for examining the relationship between some response (dependent) variable and one or more predictor (independent) variables is linear and multiple regression. So long as the relationship is linear (and satisfies a few other criteria to which I shall return) this approach is appropriate. But suppose one is faced with the relationship shown in Figure 1.1, which is highly nonlinear and cannot be transformed into a linear form or fitted by a polynomial function. The fecundity function shown in Figure 1.1 is typical for many animal species and can be represented by the four-parameter (M, k, t0, b) model

F(x) = M(1 − e^(−k(x−t0))) e^(−bx)

Using the principle of maximum likelihood (Chapter 2), it can readily be shown that the “best” estimates of the four parameters are those that minimize the residual sums of squares. However, locating the appropriate set of parameter values cannot be done analytically but can be done numerically, for which most statistical packages supply a protocol (see caption to Figure 1.1 for S-PLUS coding).

In some cases, there may be no “simple” function that adequately describes the data. Even in the above case, the equation does not immediately “spring to mind” when viewing the observations. An alternative approach to curve fitting for such circumstances is the use of local smoothing functions, described in Chapter 6. The method adopted here is to do a piece-wise fit through the data, keeping the fitted curve continuous and relatively smooth. Two such fits are shown in Figure 1.2 for the Drosophila fecundity data. The loess fit is less rugged than the cubic spline fit and tends to de-emphasize the fecundity at the early ages. On the other hand, the cubic spline tends to “over-fit” across the middle and later ages. Nevertheless, in the absence of a suitable function, these approaches can prove very useful in describing the shape of a curve or surface. Further, it is possible to use these methods in hypothesis testing, which permits one to explore how complex a curve or a surface must be in order to adequately describe the data.

Table 1.1 An overview of the techniques discussed in this book

Method | Chapter | Parameter estimation? | Hypothesis testing? | Limitations
Maximum likelihood | 2 | Yes | Yes | Assumes a particular statistical model and, generally, large samples
Jackknife | 3 | Yes | Yes | The statistical properties cannot generally be derived from theory and the utility of the method should be checked by simulation for each unique use
Bootstrap | 4 | Yes | Possible^a | The statistical properties cannot generally be derived from theory and the utility of the method should be checked by simulation for each unique use. Very computer-intensive.
Randomization | 5 | Possible | Yes | Assumes difference in only a single parameter. Complex designs may not be amenable to "exact" randomization tests
Monte Carlo methods | 5 | Possible | Yes | Tests are usually specific to a particular problem. There may be considerable debate over the test construction.
Cross-validation | 6 | Yes | Yes | Generally restricted to regression problems. Primarily a means of distinguishing among models.
Local smoothing functions and generalized additive models | 6 | Yes | Yes | Does not produce easily interpretable function coefficients. Visual interpretation difficult with more than two predictor variables
Tree models | 6 | Yes | Yes | Can handle many predictor variables and complex interactions but assumes binary splits.
Bayesian methods | 7 | Yes | Yes | Assumes a prior probability distribution and is frequently specific to a particular problem

^a "Possible" = can be done but not ideal for this purpose.


Image not available in HTML version

Figure 1.1 Fecundity as a function of age in Drosophila melanogaster with a maximum likelihood fit of the equation F(x) = M(1 − e^(−k(x−t0))) e^(−bx). Data are from McMillan et al. (1970).

Age (x):  3     4     5    6    7     8     9     10    13    14    15    16    17    18
F:        32.1  51.8  66   58   60.5  57.2  49.1  49.3  51.4  45.7  44.4  35.1  35.2  33.6

S-PLUS coding for fit:
# Data contained in data file D
# Initialise parameter values
  Thetas <- c(M = 1, k = 1, t0 = 1, b = 0.04)
# Fit model
  Model <- nls(D[,2]~M*(1-exp(-k*(D[,1]-t0)))*exp(-b*D[,1]), start = Thetas)
# Print results
  summary(Model)
OUTPUT
Parameters:
    Value  Std. Error  t value
 M 82.9723000 7.52193000 11.03070
 k  0.9960840 0.36527300  2.72696
t0  2.4179600 0.22578200 10.70930
 b  0.0472321 0.00749811  6.29920

Image not available in HTML version

Figure 1.2 Fecundity as a function of age in Drosophila melanogaster with two local smoothing functions. Data given in Figure 1.1.

S-PLUS coding to produce fits:
# Data contained in file D. First plot observations
  plot(D[,1], D[,2])                                           # Plot points
  Loess.model <- loess(D[,2]~D[,1], span = 1, degree = 2)      # Fit loess model
# Calculate predicted curve for Loess model
  x.limits <- seq(min(D[,1]), max(D[,1]), length = 50)         # Set range of x
  P.Loess <- predict.loess(Loess.model, x.limits, se.fit = T)  # Prediction
  lines(x.limits, P.Loess$fit)                                 # Plot loess prediction
  Cubic.spline <- smooth.spline(D[,1], D[,2])                  # Fit cubic spline model
  lines(Cubic.spline)                                          # Plot cubic spline curve

An important parameter in evolutionary and ecological studies is the rate of increase of a population, denoted by the letter r. In an age-structured population, the value of r can be estimated from the Euler equation

Σx e^(−rx) lx mx = 1

where x is age, lx is the probability of survival to age x and mx is the number of female births at age x. Given vectors of survival and reproduction, the above equation can be solved numerically and hence r calculated. But having an estimate of a parameter is generally not very useful without also an estimate of the variation about the estimate, such as the 95% confidence interval. There are two computer-intensive solutions to this problem, the jackknife (Chapter 3) and the bootstrap (Chapter 4). The jackknife involves the sequential deletion of a single observation from the data set (a single animal in this case) giving n (= number of original observations) data sets of n−1 observations whereas the bootstrap consists of generating many data sets by random selection (with replacement) from the original data set. For each data set, the value of r is calculated; from this set of values, each technique is able to extract both an estimate of r and an estimate of the desired confidence interval.
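
A minimal R-style sketch of these ideas, assuming hypothetical illustrative survival and reproduction vectors and a hypothetical user-written function est.r() that recomputes r (by solving the Euler equation) from a matrix Data whose rows are individual life histories, might look as follows:
# Part 1: numerical solution of the Euler equation for given l(x) and m(x)
  Ages <- 1:5
  lx   <- c(1.0, 0.8, 0.6, 0.4, 0.2)                  # Survival to age x (illustrative)
  mx   <- c(0.0, 1.0, 2.0, 2.0, 1.0)                  # Female births at age x (illustrative)
  Euler <- function(r) sum(exp(-r*Ages)*lx*mx) - 1
  r.obs <- uniroot(Euler, c(-2, 2))$root              # Solve numerically for r
# Part 2: jackknife and bootstrap confidence intervals, using the hypothetical est.r()
  n      <- nrow(Data)
  r.obs  <- est.r(Data)                               # Estimate from the full data set
# Jackknife: delete one individual at a time
  r.jack <- numeric(n)
  for (i in 1:n) r.jack[i] <- est.r(Data[-i,])
  Pseudovalues <- n*r.obs - (n-1)*r.jack
  r.hat <- mean(Pseudovalues)                         # Jackknife estimate of r
  SE    <- sqrt(var(Pseudovalues)/n)                  # Jackknife standard error
  r.hat + c(-1, 1)*qt(0.975, n-1)*SE                  # Approximate 95% confidence interval
# Bootstrap: resample individuals with replacement
  Nboot  <- 1000
  r.boot <- numeric(Nboot)
  for (i in 1:Nboot) r.boot[i] <- est.r(Data[sample(1:n, n, replace = T),])
  quantile(r.boot, c(0.025, 0.975))                   # Percentile 95% confidence interval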

Perhaps one of the most important computer-intensive methods is that of hypothesis testing using randomization, discussed in Chapter 5. This method can replace standard tests, such as the χ2 contingency test, when the assumptions of the test are not met. The basic idea of randomization testing is to randomly assign the observations to the “treatment” groups and calculate the test statistic: this process is repeated many (typically thousands of) times, and the probability under the null hypothesis of “no difference” is estimated by the proportion of times the test statistic from the randomized data sets exceeds the test statistic from the observed data set. To illustrate the process, I shall relate an investigation into genetic variation among populations of shad, a commercially important fish species.

To investigate geographic variation among populations of shad, data on mitochondrial DNA variation were collected from 244 fish distributed over 14 rivers. This sample size represented, for the time, a very significant expenditure of effort. Ten mitochondrial haplotypes were identified, with 62% of fish being of a single type. The result was that almost all cells had fewer than 5 data points (of the 140 cells, 66% had expected values less than 1.0 and only 9% had expected values greater than 5). Following Cochran's rules for the χ2 test, it was necessary to combine cells. This meant combining the genotypes into two classes, the most common one and all others. The calculated χ2 for the combined data set was 22.96, which just exceeded the critical value (22.36) at the 5% level. The estimated χ2 for the uncombined data was 236.5, which was highly significant (P < 0.001) based on the χ2 distribution with 117 degrees of freedom. However, because of the very low frequencies within many cells, this result was suspect. Rather than combining cells and thus losing information, we (Roff and Bentzen 1989) used randomization (Chapter 5) to test whether the observed χ2 value was significantly larger than the expected value under the null hypothesis of homogeneity among the rivers. This analysis showed that the probability of obtaining a χ2 value as large as or larger than that observed for the ungrouped data was less than one in a thousand. Thus, rather than being merely marginally significant, the variation among rivers was highly significant.
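
A minimal sketch of such a randomization test, assuming hypothetical vectors River and Haplotype with one entry per fish, is:
# Randomization test for heterogeneity of haplotype frequencies among rivers
  Obs.chi  <- chisq.test(table(River, Haplotype))$statistic     # Observed chi-square
  Nrand    <- 1000
  Rand.chi <- numeric(Nrand)
  for (i in 1:Nrand) {
# Shuffle the haplotype labels among fish and recompute the statistic
    Rand.chi[i] <- chisq.test(table(River, sample(Haplotype)))$statistic
  }
  P <- sum(Rand.chi >= Obs.chi)/Nrand      # Proportion as large as or larger than observed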

Most of the methods described in this book follow the frequentist school in asking “What is the probability of observing the set of n data x1, x2, … , xn given the set of k parameters θ1, θ2, … , θk?” In Chapter 7 this position is reversed by the Bayesian perspective in which the question is asked “Given the set of n data x1, x2, … , xn, what is the probability of the set of k parameters θ1, θ2, … , θk?” This “reversal” of perspective is particularly important when management decisions are required. For example, suppose we wish to analyze the effect of a harvesting strategy on population growth: in this case the question we wish to ask is “Given some observed harvest, say x, what is the probability that the population rate of increase, say θ, is less than 1 (i.e., the population is declining)?” If this probability is high then it may be necessary to reduce the harvest rate. In Bayesian analysis, the primary focus is frequently on the probability statement about the parameter value. It can, however, also be used, as in the case of the James–Stein estimator, to improve on estimates. Bayesian analysis generally requires a computer-intensive approach to estimate the posterior distribution.
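
A minimal sketch of this kind of calculation, using a simple grid approximation with invented numbers and an assumed normal measurement model, is:
# Grid approximation to the posterior probability that the population is declining
  Theta <- seq(0.8, 1.2, length = 1000)            # Candidate rates of increase
  Prior <- rep(1/length(Theta), length(Theta))     # Flat prior
  x     <- c(0.98, 1.01, 0.95, 0.99)               # Hypothetical observed annual growth rates
  Like  <- sapply(Theta, function(m) prod(dnorm(x, mean = m, sd = 0.03)))
  Post  <- Prior*Like/sum(Prior*Like)              # Posterior probability on the grid
  sum(Post[Theta < 1])                             # Posterior P(theta < 1), i.e., declining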

Why S-PLUS?

There are now numerous computer packages available for the statistical analysis of data, making available an array of techniques hitherto not possible except in some very particular circumstances. Many packages have some computer-intensive methods available, but most lack flexibility and hence are limited in use. Of the common packages, SAS and S-PLUS possess the breadth of programming capabilities necessary to do the analyses described in this book. I chose S-PLUS for three reasons. First, the language is structurally similar to programming languages with which the reader may already be familiar (e.g., BASIC and FORTRAN), although it differs from these two in being object-oriented. In writing the coding, I have attempted to keep a structure that could be transported to another language: this has meant in some cases making more use of looping than might be necessary in S-PLUS. While this increases the run time, I believe that it makes the coding more readable, an advantage that outweighs the minor increase in computing time. The second reason for selecting S-PLUS is that there is a version in the public domain, known as R. To quote the web site (http://www.r-project.org/), “R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.” The programs written in this book will, with few exceptions, run under R. The user interface is, however, definitely better in S-PLUS than in R. My third reason for selecting S-PLUS is that students, at present, can obtain a free version for a limited period at http://elms03.e-academy.com/splus/.

Further reading

Although S-PLUS has a fairly steep learning curve, there are several excellent textbooks available, my recommendations being:

Spector, P. (1994). An Introduction to S and S-PLUS. Belmont, California: Duxbury Press.

Krause, A. and Olson, M. (2002). The Basics of S-PLUS. New York: Springer.

Crawley, M. J. (2002). Statistical Computing: An Introduction to Data Analysis using S-PLUS. Chichester, UK: John Wiley and Sons.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer.

An overview of the language with respect to the programs used in this book is presented in the appendices.


© Cambridge University Press

Table of Contents

1. An introduction to computer intensive methods; 2. Maximum likelihood; 3. The Jack-knife; 4. The Bootstrap; 5. Randomisation; 6. Regression methods; 7. Bayesian methods; References; Exercises; Appendix A: an overview of S-Plus methods used in this book; Appendix B: brief description of S-Plus subroutines used in this book; Appendix C: S-Plus codes cited in text.