BETASIMU: Community data simulation from beta response function

This document discusses the program BETASIMU. You can proceed directly to download the program.

Mike Austin and co-authors have proposed a method to fit beta response functions in gradient analysis of vegetation using GLIM software (Journal of Vegetation Science 5, 215-228; 1994). I have criticized this as a general tool in gradient analysis (Journal of Vegetation Science 8, 147-152; 1997: abstract), though it may work in very special conditions. Since the arguments may be difficult to judge based on reading only, I have written a special purpose program BETASIMU to simulate vegetation stands using beta function. My argument is that if you fit a beta function to the simulated data, you won't get good estimates of parameter values you used in simulation. Your results will be usually biased, so that you have both Type I and Type II errors:

Consequently, fitting beta functions is a very unreliable method of assessing skewness. My recommendation is to use instead Huisman-Olff-Fresco models.

General on BETASIMU

BETASIMU generates data from beta response function and writes the results in a GLIM file. The main emphasis in developing the program was simulation of noise where I used the alternatives available in GLIM and regarded as adequate for gradient analysis. Especially, the program simulates data which are overdispersed to Binomial or Poisson error, since community data sets are overdispersed as a rule. For more details, see the program.

Community data simulation with BETASIMU involves two stages:

The user interface is very basic. The only parameter you may give in the command line is the seed for the random number generator. Random number generator is initialized using a text phrase, which is internally transformed to two numeric seeds within the program. If the seed phrase is not given on the command line, the random number generator will be initialized from the clock. The program will ask for other needed parameters. The results will be written in a GLIM file.

Parameters common to all species

Error distribution

Binomial denominator (when adequate)

You must give an integer which is the number of trials in Binomial distribution.

Gradient endpoints

Number of observations

Same number of observations will be used for all species. The observations will be evenly spread between gradient endpoints.

Parameters asked separately for each species

Parameters of beta response function

Beta response function is defined by k * (x-a)^alpha * (b-x)^gamma. Here x is the gradient, and k, a, b, alpha, gamma are the parameters of the beta response function. These parameters are asked in three groups:

Overdispersion parameters

With Beta-Binomial and Negative Binomial distribution, the magnitude of overdispersion (parameters tau² or sigma²) is asked for each species.

How to parametrize simulations?

For realistic simulations, it is desirable to parametrize model so that it produces similar types of responses as observed with real data sets. However, it is not quite clear how to do this. There are problems both in estimating response shapes and defining noise distribution in observed community data sets.

Shape of response function

Austin et al. (J. Veg. Sci. 5, 215-228; 1994) explain how to fit beta response function to the data with the GLIM software. I have criticized this on the grounds that there are great problems that make beta response function useless in practical gradient analysis (J. Veg. Sci. 8, 147-152; 1997):

I further showed in my article with a real data set, that beta response function over-estimates the number of skewed responses compared to Huisman-OIff-Fresco models. It seems that it is impossible to obtain reliable estimates of parameters of beta response function from observed community data. Actually, this program was written mainly to study this: Can you find the parameter values you used in simulating the data when fitting the model in the simulated data.

Minchin (Vegetatio 71, 145-156 (1987); 83, 97-110 (1989)) describes what kind of responses you expect with real community data. Due to problems in fitting beta response functions, it may be necessary to use related reasoning in paramterization.

If you want to fit beta response functions, you can use the following GLIM macro which tests the signficance of skewness as well:

$com 29/5/97. Macros edited for GLIM 4.
$com Fitting beta response function to the data.
     Finding endpoints of the response (range) a bit (%c) outside
     of the extreme observations.
     Fitting Beta, and a symmetric variant where alpha=gamma.
     In GLIM the response is defined as:
     log(mu) = eta = log(h) + alpha*log(x-%a) + gamma*(%b-x),
     where a and b are the endpoints, and h, alpha, and gamma
     the linear parameters.
     You have to have the gradient copied to x, 
     - define a decent model (error, link),
     - and call the macro with the %yv you prefer
$cal tota=1$
$error bino tota$
$weight weig$
$COM cal x=ph$
! GLIM3.77 $inp %plc fprob$
! GLIM3.77 $cal %u=1$
! GLIM3.77 $arg fprob %f %u %n$
$cal %c=0.01$
$macro go $
$cal w=%yv>0$
! GLIM 3.77: use a, b instead of %a, %b below
$tab the x s with w into %a$ !
$tab the x l with w into %b$ !
$cal %a=%a-%c: %b=%b+%c$ !
$cal weig = (x>%a)*(x<%b)$ !
$cal z = %log(x-%a): z1 = %log(%b-x)$ !
$fit z+z1$cal %d=%dv: %n=%df$ !
$cal fv=%fv$ !
$cal zsym = z+z1$ !
$fit zsym$cal %f=(%dv-%d)/(%d/%n): %p=1.0-%fp(%f,1,%n)$
!COM GLIM3.77 $cal p = fprob$ !
$out 16$
$print *n %yv '  Endpoints ' %a %b$ !
$print 'Devs:' %d %dv', df ' %df ', Chi2 =' %f ', P =' %p $
$out %poc$

Distribution of noise variation

No community data are free of noise, and therefore it is completely useless to analyse error-free data. Unfortunately, it is fairly difficult to say what kind of noise variation we have in the data, or to estimate the parameters of the noise variation needed in simulation. It is generally regarded that realistic distributions for noise may have following properties:

Bernoulli noise

Bernoulli distribution is the same as Binomial model with total 1. So the expected values will be probabilities of occurrence, and simulated values occurrences or absences. No extraneous variation can be used with Bernoulli noise, but exptected probability alone is sufficient to give all the moments of the distribution. This is the only fairly non-problematic error model in simulation. The use of this model is strongly recommended. However, many data sets are non-binary (quantitative), and therefore users may be willing to try noise models for quantitative data. It should be remembered that this means running into trouble, though, like you see below.

Binomial and Poisson noise

These noise models describe only the sampling variability when repeating the sampling in the same site. In general, the noise in real data sets is much higher, and therefore the following extra-Binomial and extra-Poisson models should be used.

Both noise models are discrete. This means that expected values can be any real numbers, but expected values are always counts (integers). Smallest possible value is zero in both noise models. Binomial noise is bound with an upper limit as well, so that the observations cannot be larger than binomial denominator. Binomial noise is generated by repeating an independent observation with fixed probability of success (pi) certain number of times (m, binomial denominator).

Many community data sets are based on visually estimated abundances (cover etc.). No adequate distribution can be found for these, since the numbers are subjective estimates, and the binomial denominator is not defined for cover estimates. For real counts we can use Poisson type models. For observed proportions, such as point-frequency or frequency-in-subplots we can use Binomial type models. However, even in these cases we have to estimate the magnitude of dispersion with the following extra-Binomial or extra-Poisson models (which may be almost adequate for subjective abundance estimation as well, especially if the simulation results are binned to similar classes as used in estimation).

Beta-Binomial and Negative Binomial noise

These are some of the easily handled cases of extra-Binomial and extra-Poisson models. In extra-Binomial and extra-Poisson models we simply assume that there are two components of error. Thus the error is larger than assumed by pure Binomial or Poisson models. In GLIM over-dispersion is evident when the residual deviance of the fitted model is significantly larger (by Chi-squared test!) than residual degrees of freedom.

We may assume that the extraneous variance is caused by other, possibly unknown or unmeasured, environmental variables so that gradient we use is not alone sufficient to give the exact expected value. Another reason for extraneous variance may be that there are biological processes, such as clonal growth, or interactions between species which increase the variance through spatial aggregation. For instance, with NegBin model we may think that there are two components in the variance: the size of units/ modules/ clones (continuous extra-Poisson variance), and the number of these units/ modules /clones in the sample (discrete Poisson variance).

In general data fitting, it may be dangerous to assume that this extraneous variance has a certain, fixed shape (distribution), especially if that shape is based on mathematical convenience rather than biological reasoning. However, this is what we have to do in simulation. The traditional, mathematical convenient, two-component error models are:

Data dependent estimates of magnitude of extraneous variation (Beta, Gamma) can be estimated with the GLIM software. Models for Beta-Binomial model were proposed by Williams (Applied. Statistics 31, 144-148; 1982), for Negative-Binomial model by Breslow (Applied Statistics 33, 38-44; 1984). However, these are moment estimates (not strict maximum likelihood estimates), and they may not be reliable for large overdispersion. These authors published iterative GLIM macros given below (you must edit these macros to suit for your data set).

$c Williams: Extra-binomial variation in logistic linear models.
$c     Appl. Stat. 31: 144-148; 1982. - 
$c     Method II
$macro will
$fit . $extract %vl$cal wvq=%pw*%wt*%vl$
$cal %p=(%x2-%cu(%pw*(1-wvq)))/(%cu((%bd-1)*%pw*(1-wvq)))$
$cal w=1/(1+%p*(%bd-1))$
$print '          Chi-square  ' %x2 $
$subf macros
$c Breslow: Extra-Poisson variation in log-linear models. Appl. Statist.
$c 33, pp. 38-44, 1984 (Appendix):
$sub extr
$pri '  macros for extra-Poisson variation'
$pri ' '
$pri '  For maximum-likelihood analysis declare numerators as yvar,'
$pri '  log-denominators as offset, and initializes prior weights w to 1'
$pri '  Then fit model, call in2 once, recycle and call it2 repeatedly'
$pri ''
$mac in2 $ext %vl $cal wq=%fv*(1-%fv*%vl) $
$cal %s=(%x2-%df)/%cu(wq) $
$print '  Initial sigma2' $look %s $$endmac
$mac sig2 $cal %r=%s $
$cal %s=%cu((%yv-%fv)**2/(%fv*(%fv+1/%r)))/%df: %e=(%s-%r)/%s
$cal %e=%gt(%e*%e,.0000001) $$endmac
$mac it2 $cal %e=1 $while %e sig2 $cal %g=%f
$print '  Next sigma2' $look %s $cal w=1/(1+%s*%fv) $fit .
$print '  New chi-square, d.f.' %x2 %df $endmac
$c macros to use breslow.mac to estimate extra-Poisson variation
$print ''
$print '--- to use Breslow macros write:   use go'
$macro go
$cal w=1$dis m$fit x+x2$dis e$use in2$recycle$
$cal %f=1$while %f iter$use it2$cycle$dis e$
$macro iter
$use it2$cal %f=%sqr((%x2-%df)**2):%f=%sqr(%f**2): %f=%gt(%f, 0.01)$

The program BETASIMU accepts these over-dispersion parameters directly and uses them in simulating extraneous variation.


BETASIMU resembles much Community Pattern Simulation software by Peter Minchin. However, the scope of BETASIMU is much more limited since it produces only a few specific responses in a GLIM file. BETASIMU was intended for assessing the fitting of beta response functions in gradient analysis. On the other hand, COMPAS attempts at generating community data matrices for evaluation of analysis methods for community data, and produces a great number of 'species' with specified types of response shapes. Both use beta response function, but in BETASIMU the user must specify explicitly the response function used, whereas in COMPAS the user need only specify general characteristics of response functions.

The main emphasis in BETASIMU is the incorporation of realistic and statistically tractable error models. It is not clear to me, how noise is generated in COMPAS. It seems to me that COMPAS is targeted at generating data resembling subjectively estimated cover or abundance data, whereas BETASIMU targets at statistically tractable (and rare in vegetation ecology) count or count-based proportion data. The most notably difference may be at the way zero values are simulated. In BETASIMU these come naturally with discrete error distributions with clearly defined integer counts (and I don't know how COMPAS does this). BETASIMU introduces two-component error models to simulation. Obviously COMPAS tries to do the same by introducing 'qualitative' and 'quantitative' error components. This may be ecologically interesting, but it is not mathematically as tractable as the two-component models used by BETASIMU, and I don't know too well how this is done in COMPAS.


Updated 9/7/97 Jari Oksanen