This document discusses the program BETASIMU. You can proceed directly to download the program.

Mike Austin and co-authors have proposed a method to fit *beta*
response functions in gradient analysis of vegetation using GLIM
software (*Journal of Vegetation Science* **5,** 215-228;
1994). I have criticized this as a general tool in gradient analysis
(*Journal of Vegetation Science ***8, **147-152; 1997:
abstract), though it may work
in very special conditions. Since the arguments may be difficult
to judge based on reading only, I have written a special purpose
program BETASIMU to simulate vegetation stands using *beta*
function. My argument is that if you fit a *beta* function
to the simulated data, you won't get good estimates of parameter
values you used in simulation. Your results will be usually biased,
so that you have both Type I and Type II errors:

- Type I error means that you
**reject a correct**null hypothesis**:**You will find that fitted responses are far too often significantly skewed even when the true responses are symmetric. - Type II error means that you
**accept an incorrect**null hypothesis: You will find that when the true responses are skewed, you will under-estimate the degree of the skewness or even find the skewness insignificant (*i.e.,*to accept an incorrect null-hypothesis of symmetry).

Consequently, fitting *beta* functions is a very unreliable
method of assessing skewness. My recommendation is to use instead
Huisman-Olff-Fresco models.

BETASIMU generates data from *beta* response function and
writes the results in a GLIM file. The main emphasis in developing
the program was simulation of noise where I used the alternatives
available in GLIM and regarded as adequate for gradient analysis.
Especially, the program simulates data which are overdispersed
to Binomial or Poisson error, since community data sets are overdispersed
as a rule. For more details, see the program.

Community data simulation with BETASIMU involves two stages:

**Exptected abundance**at certain gradient position is found from*beta*response function.**Observations**are simulated for that expected value using random deviates with biologically relevant distribution.

The user interface is very basic. The only parameter you may give in the command line is the seed for the random number generator. Random number generator is initialized using a text phrase, which is internally transformed to two numeric seeds within the program. If the seed phrase is not given on the command line, the random number generator will be initialized from the clock. The program will ask for other needed parameters. The results will be written in a GLIM file.

- Bernoulli distribution, or Binomial with total 1. Produces
binary (0/1) observations. Expected values (
*pi*) give the probability of occurence (range 0..1). Noise variance will be*Var(y) = pi*(1-pi)*. - Binomial distribution, total or binomial denominator (
*m*, integer > 0) asked separately. Simulated values will be integer counts 0 ..*m*. Expected height must be given as a probability (*pi*) for single observation. Expected values will be*m*pi*, and noise variance*Var(y) = m*pi*(1-pi)*. - Beta-Binomial distribution with total
*m*. There are two components of error: first from Beta distribution, and then for these values from Binomial distribution. The model is more realistic than pure Binomial. Expected values will be*m*pi*, and noise variance*Var(y) = m*pi*(1-pi)*(1+(m-1)*tau²)*. The program asks for the overdispersion parameter*tau²*separetely for each simulated species. - Poisson distribution. The simulated observation are integer
counts
__>__0. Expected values will be*mu*and noise variance*Var(y) = mu*. - Negative Binomial distribution. The simulated observations
are integer counts. Expected values will be
*mu*and noise variance*Var(y) = mu + sigma² *mu²*. The program asks the overdispersion parameter*sigma²*separately for each species.

You must give an integer which is the number of trials in Binomial distribution.

Same number of observations will be used for all species. The observations will be evenly spread between gradient endpoints.

*Beta* response function is defined by *k * (x-a)^alpha
* (b-x)^gamma*. Here *x* is the gradient, and *k, a,
b, alpha, gamma *are the parameters of the *beta* response
function. These parameters are asked in three groups:

- Maximum expected height of the response. This is not a direct
parameter in beta response function, but program finds the parameter
*k*from other parameters so that the response has the desired height, since it is difficult to guess a nice value directly for*k*. For Bernoulli, Binomial and Beta-Binomial error distribution, expected height should be a probability or proportion between 0..1. For Poisson and NegBin distributions it should be real number > 0. Value 0 will terminate the simulation. - Location of endpoints
*a, b*. The expted values will be >0 between these endpoints, zero outside. So they jointly define the range of occurrence. - Shape parameters
*alpha, gamma*. These jointly define both the skewness of the response and the location of the optimum. After the endpoints*a, b*are fixed, skewness and location of the optimum cannot be defined independently (Oksanen,*J. Veg. Sci.***8,**147-152; 1997). In addition, they define the kurtosis of the response shape. Identical values of*alpha*and*gamma*define a symmetric response with the optimum in the middle of the range. Increasing difference in parameter values produce increasing skewness, and move the optimum towards the end with the lower value. Increasing values of parameters generate flatter response shapes.

With Beta-Binomial and Negative Binomial distribution, the magnitude
of overdispersion (parameters *tau² *or *sigma²*)
is asked for each species.

For realistic simulations, it is desirable to parametrize model so that it produces similar types of responses as observed with real data sets. However, it is not quite clear how to do this. There are problems both in estimating response shapes and defining noise distribution in observed community data sets.

Austin *et al.* (*J. Veg. Sci. ***5, **215-228; 1994)
explain how to fit *beta* response function to the data with
the GLIM software. I have criticized this on the grounds that
there are great problems that make beta response function useless
in practical gradient analysis (*J. Veg. Sci. ***8,**
147-152; 1997):

- Austin
*et al.*suggest that endpoints*a, b*are fixed to give the range of species occurrence, and other parameters (*k, alpha, gamma*) are estimated after that conditional to fixed*a, b*. If this is done, location of the optimum and skewness cannot be estimated independently. Further it appears that the method is prone to Type I errors: it produces significantly skewed responses even when the real response is symmetric. Further analyses indicate that the method is prone to Type II errors as well: it underestimates skewness when the real response is skewed. - For adequate estimation of skewness and location of the optimum,
the endpoints
*a, b*should not be fixed before the analysis, but they should be estimated from the data together with other parameters. In that case they no longer give the range, but the fitting will be probably more correct (*i.e.*fitted responses are more similar to data generating responses). However, this is tedious, since*a*and*b*are non-linear parameters and they must be estimated by trial and error in GLIM. Moreover, as observations outside*(a,b)*must be excluded from the analysis, changing parameter values will change the observations used in model fitting, and current estimates of endpoints can even exclude some non-zero observations. These together make the estimation of*a*and*b*either technically difficult or outright impossible.

I further showed in my article with a real data set, that beta response function over-estimates the number of skewed responses compared to Huisman-OIff-Fresco models. It seems that it is impossible to obtain reliable estimates of parameters of beta response function from observed community data. Actually, this program was written mainly to study this: Can you find the parameter values you used in simulating the data when fitting the model in the simulated data.

Minchin (*Vegetatio* **71, **145-156 (1987); * ***83,
**97-110 (1989)) describes what kind of responses you expect
with real community data. Due to problems in fitting beta response
functions, it may be necessary to use related reasoning in paramterization.

If you want to fit beta response functions, you can use the following GLIM macro which tests the signficance of skewness as well:

$com 29/5/97. Macros edited for GLIM 4. $com Fitting beta response function to the data. Finding endpoints of the response (range) a bit (%c) outside of the extreme observations. Fitting Beta, and a symmetric variant where alpha=gamma. In GLIM the response is defined as: log(mu) = eta = log(h) + alpha*log(x-%a) + gamma*(%b-x), where a and b are the endpoints, and h, alpha, and gamma the linear parameters. You have to have the gradient copied to x, - define a decent model (error, link), - and call the macro with the %yv you prefer $COM PRELIMINARIES (EDIT AS NEEDED) $cal tota=1$ $error bino tota$ $weight weig$ $COM cal x=ph$ ! GLIM3.77 $inp %plc fprob$ ! GLIM3.77 $cal %u=1$ ! GLIM3.77 $arg fprob %f %u %n$ $cal %c=0.01$ $COM MACRO G0 $macro go $ $cal w=%yv>0$ ! GLIM 3.77: use a, b instead of %a, %b below $tab the x s with w into %a$ ! $tab the x l with w into %b$ ! $cal %a=%a-%c: %b=%b+%c$ ! $cal weig = (x>%a)*(x<%b)$ ! $cal z = %log(x-%a): z1 = %log(%b-x)$ ! $fit z+z1$cal %d=%dv: %n=%df$ ! $cal fv=%fv$ ! $cal zsym = z+z1$ ! $fit zsym$cal %f=(%dv-%d)/(%d/%n): %p=1.0-%fp(%f,1,%n)$ !COM GLIM3.77 $cal p = fprob$ ! $out 16$ $print *n %yv ' Endpoints ' %a %b$ ! $print 'Devs:' %d %dv', df ' %df ', Chi2 =' %f ', P =' %p $ $out %poc$ $endmac $return

No community data are free of noise, and therefore it is completely useless to analyse error-free data. Unfortunately, it is fairly difficult to say what kind of noise variation we have in the data, or to estimate the parameters of the noise variation needed in simulation. It is generally regarded that realistic distributions for noise may have following properties:

- Normal or uniform noise is not realistic. First, they would often indicate negative simulated values which are impossible in observed community data. Moreover, the error variance is dependent on the magnitude of expected values, so that you may expect larger absolute error with high expected values than in near-zero expected values.
- Community data have zeros. Although expected values may (with some response function types) be infinitesimally above zero anywhere on the gradient, the observed values will have more abrupt ranges. This calls for discrete error distributions which have meaningful zeros.
- Poisson and Binomial models seem to satisfy the above conditions.
However, they are adequate only if we analyse the error in repeated
sampling
**within**the same site. In general, community data have larger variance than indicated by Poisson or Binomial error. Moreover, this variance may vary among species, and we should be able to estimate the magnitude of variance separately for each species. - All above points call for compound error models: A continuous error distribution with estimated magnitude of noise, followed by discrete error distribution. For Binomial sampling error, an appropriate extraneous error model is given by Beta distribution; jointly these give Beta-Binomial model. For Poisson sampling error, an appropriate extraneous error model is given by Gamma distribution; jointly these give Negative Binomial distribution.

Bernoulli distribution is the same as Binomial model with total 1. So the expected values will be probabilities of occurrence, and simulated values occurrences or absences. No extraneous variation can be used with Bernoulli noise, but exptected probability alone is sufficient to give all the moments of the distribution. This is the only fairly non-problematic error model in simulation. The use of this model is strongly recommended. However, many data sets are non-binary (quantitative), and therefore users may be willing to try noise models for quantitative data. It should be remembered that this means running into trouble, though, like you see below.

These noise models describe only the sampling variability when repeating the sampling in the same site. In general, the noise in real data sets is much higher, and therefore the following extra-Binomial and extra-Poisson models should be used.

Both noise models are discrete. This means that expected values
can be any real numbers, but expected values are always counts
(integers). Smallest possible value is zero in both noise models.
Binomial noise is bound with an upper limit as well, so that the
observations cannot be larger than binomial denominator. Binomial
noise is generated by repeating an independent observation with
fixed probability of success (*pi*) certain number of times
(*m*, binomial denominator).

Many community data sets are based on visually *estimated*
abundances (cover etc.). No adequate distribution can be found
for these, since the numbers are subjective estimates, and the
binomial denominator is not defined for cover estimates. For real
counts we can use Poisson type models. For observed proportions,
such as point-frequency or frequency-in-subplots we can use Binomial
type models. However, even in these cases we have to estimate
the magnitude of dispersion with the following extra-Binomial
or extra-Poisson models (which may be almost adequate for subjective
abundance estimation as well, especially if the simulation results
are binned to similar classes as used in estimation).

These are some of the easily handled cases of extra-Binomial and extra-Poisson models. In extra-Binomial and extra-Poisson models we simply assume that there are two components of error. Thus the error is larger than assumed by pure Binomial or Poisson models. In GLIM over-dispersion is evident when the residual deviance of the fitted model is significantly larger (by Chi-squared test!) than residual degrees of freedom.

We may assume that the extraneous variance is caused by other, possibly unknown or unmeasured, environmental variables so that gradient we use is not alone sufficient to give the exact expected value. Another reason for extraneous variance may be that there are biological processes, such as clonal growth, or interactions between species which increase the variance through spatial aggregation. For instance, with NegBin model we may think that there are two components in the variance: the size of units/ modules/ clones (continuous extra-Poisson variance), and the number of these units/ modules /clones in the sample (discrete Poisson variance).

In general data fitting, it may be dangerous to assume that this extraneous variance has a certain, fixed shape (distribution), especially if that shape is based on mathematical convenience rather than biological reasoning. However, this is what we have to do in simulation. The traditional, mathematical convenient, two-component error models are:

**Beta-Binomial**: Continous Beta distribution noise followed by discrete Binomial noise.**Negative Binomial**: Continous Gamma distribution noise followed by discrete Poisson noise.

Data dependent estimates of magnitude of extraneous variation
(Beta, Gamma) can be estimated with the GLIM software. Models
for Beta-Binomial model were proposed by Williams (*Applied.
Statistics ***31, **144-148; 1982), for Negative-Binomial
model by Breslow (*Applied Statistics *

- Macro for Beta-Binomial model: The overdispersion parameter is given by GLIM scalar %p.

$c Williams: Extra-binomial variation in logistic linear models. $c Appl. Stat. 31: 144-148; 1982. - $c Method II ! FIRST FIT A MODEL WITH WEIGHTS=1, RECYCLE, AND CALL MACRO REPEATEDLY ! UNTIL %X2 DOES NOT CHANGE AND APPR. EQUALS %DF ! INVOKE WITH COMMAND $USE WILL. $macro will $fit . $extract %vl$cal wvq=%pw*%wt*%vl$ $cal %p=(%x2-%cu(%pw*(1-wvq)))/(%cu((%bd-1)*%pw*(1-wvq)))$ $cal w=1/(1+%p*(%bd-1))$ $print ' Chi-square ' %x2 $ $endmac

- Macro for Negative-Binoial model: The overdispersion parameter is given by GLIM scalar %s.

$subf macros $c Breslow: Extra-Poisson variation in log-linear models. Appl. Statist. $c 33, pp. 38-44, 1984 (Appendix): $sub extr $pri ' macros for extra-Poisson variation' $pri ' ' $pri ' For maximum-likelihood analysis declare numerators as yvar,' $pri ' log-denominators as offset, and initializes prior weights w to 1' $pri ' Then fit model, call in2 once, recycle and call it2 repeatedly' $pri '' $mac in2 $ext %vl $cal wq=%fv*(1-%fv*%vl) $ $cal %s=(%x2-%df)/%cu(wq) $ $print ' Initial sigma2' $look %s $$endmac $mac sig2 $cal %r=%s $ $cal %s=%cu((%yv-%fv)**2/(%fv*(%fv+1/%r)))/%df: %e=(%s-%r)/%s $cal %e=%gt(%e*%e,.0000001) $$endmac $mac it2 $cal %e=1 $while %e sig2 $cal %g=%f $print ' Next sigma2' $look %s $cal w=1/(1+%s*%fv) $fit . $print ' New chi-square, d.f.' %x2 %df $endmac $c macros to use breslow.mac to estimate extra-Poisson variation $print '' $print '--- to use Breslow macros write: use go' $macro go $COM MODEL DEFINED HERE: EDIT AS NEEDED $cal w=1$dis m$fit x+x2$dis e$use in2$recycle$ $cal %f=1$while %f iter$use it2$cycle$dis e$ $endmac $macro iter $use it2$cal %f=%sqr((%x2-%df)**2):%f=%sqr(%f**2): %f=%gt(%f, 0.01)$ $endmac $return

The program BETASIMU accepts these over-dispersion parameters directly and uses them in simulating extraneous variation.

BETASIMU resembles much Community Pattern Simulation software
by Peter Minchin. However, the scope of BETASIMU is much more
limited since it produces only a few specific responses in a GLIM
file. BETASIMU was intended for assessing the fitting of beta
response functions in gradient analysis. On the other hand, COMPAS
attempts at generating community data matrices for evaluation
of analysis methods for community data, and produces a great number
of 'species' with specified types of response shapes. Both use
*beta* response function, but in BETASIMU the user must specify
explicitly the response function used, whereas in COMPAS the user
need only specify general characteristics of response functions.

The main emphasis in BETASIMU is the incorporation of realistic and statistically tractable error models. It is not clear to me, how noise is generated in COMPAS. It seems to me that COMPAS is targeted at generating data resembling subjectively estimated cover or abundance data, whereas BETASIMU targets at statistically tractable (and rare in vegetation ecology) count or count-based proportion data. The most notably difference may be at the way zero values are simulated. In BETASIMU these come naturally with discrete error distributions with clearly defined integer counts (and I don't know how COMPAS does this). BETASIMU introduces two-component error models to simulation. Obviously COMPAS tries to do the same by introducing 'qualitative' and 'quantitative' error components. This may be ecologically interesting, but it is not mathematically as tractable as the two-component models used by BETASIMU, and I don't know too well how this is done in COMPAS.

*Updated 9/7/97* *Jari Oksanen*