Using CEPSHUFL and SOLCOMP to analyse instability of ordinations

CEPSHUFL is a program that permutes the order of species and sites in CANOCO-formatted data matrices. The rearranged data matrices can be analysed with CANOCO 3 or other software with similar properties. SOLCOMP is a program that compares the ordination results for the data matrix in its original order against those for the matrix shuffled with CEPSHUFL, and reports stability (similarity) statistics for the two runs. The programs are intended to be used together. More details can be found in our article.

The easiest way to use CEPSHUFL and SOLCOMP is to run the program STABLTY, which performs an automatic analysis of basic cases of PCA, CA, DCA, RDA, CCA or DCCA with CANOCO 3. STABLTY automatically generates the necessary MS-DOS batch file and CANOCO instruction file, executes the comparison, and summarizes the output from SOLCOMP. This manual describes the component programs CEPSHUFL and SOLCOMP for users who want to modify their STABLTY runs or write their own routines. Others can go to the STABLTY document.

General notes on CEPSHUFL and SOLCOMP

Both programs are written in C and compiled with djgpp to run in the Windows MS-DOS prompt (see info on running djgpp programs).

All parameters can be given in the command line. You can see the available options by typing cepshufl -h or solcomp -h.

Command line options must be given in the form -l option, where -l is a letter that identifies the option and its argument is written after the letter (changed in version 1.5). The programs prompt only for the most essential information and otherwise use defaults. The defaults can be changed only from the command line.

CEPSHUFL v1.5: Permuting CANOCO formatted files

The command cepshufl -h (v1.4) gives the following help screen:

Program CEPSHUFL: Reads in CEP data, shuffles sites and species,
writes shuffled CEP file name.SPE, and shuffle index name.LST
(default names shuffle.spe, shuffle.lst)
Version 1.4 (Fri Sep 12 10:31:45 1997) - Jari Oksanen


Usage:  cepshulf <-lparameters>
-i<file.ext>    Input file name
-o<name>        Rootname for output files (default shuffle)
-r<Text>        Text to initialize random number generator
-e              Do not shuffle environmental variables/ species
-s              Output .spe file in strict CEP format
                (default relaxed Canoco format)
-d<number>      No. of decimals in .spe file (default 4)

The program should accept most types of data files accepted by CANOCO. The output data file will always be in condensed format (by default condensed CANOCO format; with switch -s, the original strict CEP format). The output data file will always have the extension .SPE. In addition, the program produces a shuffle list with the same name as the output data file, but with the extension .LST.

The random number generator can be initialized with a text phrase. Numbers can be used, but they will be converted internally into two seeds which have no obvious relation to the numbers given. The same seed will always produce the same permutation, so the shuffling can be repeated at any time. If no seed is given, the random number generator is initialized from the computer clock.
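The idea of a text-seeded, repeatable permutation can be sketched in Python as follows. This is only an illustration: the hashing step and the function name seeded_permutation are assumptions made for the example, and CEPSHUFL's own conversion of the text into two seeds is not reproduced here.

```python
import hashlib
import random


def seeded_permutation(n, phrase):
    """Return a reproducible permutation of range(n) derived from a text phrase."""
    # Assumption for this sketch: hash the phrase into one integer seed.
    # CEPSHUFL converts the text into two seeds by its own method.
    digest = hashlib.sha256(phrase.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    order = list(range(n))
    rng.shuffle(order)
    return order


# The same phrase gives the same order every time it is used.
first = seeded_permutation(20, "JustDoIt")
second = seeded_permutation(20, "JustDoIt")
print(first == second)
```

Because the generator is built from the phrase alone, a run can be repeated later simply by giving the same phrase again.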

CEPSHUFL randomly permutes both species and sites. With switch -e, the program permutes only sites. This option is mainly intended for environmental data sets when analysing constrained ordination methods (RDA, CCA, DCCA).

The program asks only for the name of the input data file. Defaults are used for all other options.

Changes in version 1.4

The first release version of CEPSHUFL was v1.3. The current v1.4a was made available on this server on 18/9/97. Changes from the previous releases are:

Changes in version 1.5

This version was made available on this server on 16/1/98. The changes are cosmetic: the program accepts blanks between a command line option and its argument, and prints less on the screen while reading in data.

SOLCOMP v1.5: Comparing ordinations of original and permuted data files

The program reads in CANOCO 3.12 solution files and a permutation file from CEPSHUFL, rearranges the permuted file into the original order, and compares the solutions. The comparison is based on root-mean-squared-error (rmse) statistics, which in essence are the square root of the average squared difference between two solutions. Two variants of rmse are produced:

  1. rmse based on Procrustes rotation to maximal similarity
  2. rmse ("SD") based on direct comparison without rotation (but with possible reflection) of axes

Both should give 0.0000 if two solutions are identical.
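As a rough illustration of the non-rotated variant, the following Python sketch computes an rmse for each axis, allowing each axis to be reflected, and combines the axes as the square root of the sum of squared per-axis values. That combination matches the example output shown later in this manual, but the function names and the test scores are hypothetical, and this is not the SOLCOMP source code. The Procrustes variant needs a rotation step that is not shown.

```python
import math


def axis_rmse(target, config):
    """RMSE for one axis, allowing the config axis to be reflected."""
    n = len(target)
    direct = sum((t - c) ** 2 for t, c in zip(target, config)) / n
    mirrored = sum((t + c) ** 2 for t, c in zip(target, config)) / n
    # Keep whichever orientation of the axis agrees better with the target.
    return math.sqrt(min(direct, mirrored))


def nonrotated_rmse(target, config):
    """Per-axis and overall RMSE for two score tables (rows = points, columns = axes)."""
    n_axes = len(target[0])
    by_axis = [
        axis_rmse([row[k] for row in target], [row[k] for row in config])
        for k in range(n_axes)
    ]
    overall = math.sqrt(sum(r ** 2 for r in by_axis))
    return by_axis, overall


# Hypothetical scores for three points on two axes. The second axis of the
# config solution is a mirror image of the target, so both rmse values are 0.
target = [[1.0, 0.5], [0.0, -0.5], [-1.0, 0.0]]
config = [[1.0, -0.5], [0.0, 0.5], [-1.0, 0.0]]
by_axis, overall = nonrotated_rmse(target, config)
print(by_axis, overall)
```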

The command solcomp -h (v1.3) gives the following help screen (slightly changed in v1.5):

Program SOLCOMP Version 1.3 (Thu Feb 6 11:01:56 1997)
- by Jari Oksanen and Peter Minchin (Procrustes rotation)

Compares CANOCO solution files using Procrustes rotation.
One file is target, other is shuffled (from CEPSHUFL)
Reads shuffling form SHUFFLE.LST and arranges shuffled file
back to the original order before analysis.

Usage: solcomp <-parameters>
-t<file>        target file name
-c<file>        config file name
-s<file>        shuffle file (default shuffle.lst)
-s!             no shuffle file, config file in order
-m<file>        append to monitor file (default screen)
-v              voluminous output to monitor file

The program will always ask for the names of

The permutation of the config file must be given in the shuffle file (default name shuffle.lst) produced by CEPSHUFL.
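Putting a shuffled solution back into the original order amounts to applying the inverse of the permutation. The Python sketch below assumes, purely for illustration, that the shuffle list gives the original position of each shuffled row. The actual layout of the .LST file is not described here, and the name restore_order is hypothetical.

```python
def restore_order(shuffled_rows, original_index):
    """Put rows back into their original order.

    original_index[i] is taken to be the original position of shuffled
    row i (an assumed convention, not the documented .LST layout).
    """
    restored = [None] * len(shuffled_rows)
    for row, position in zip(shuffled_rows, original_index):
        restored[position] = row
    return restored


# Hypothetical example with four site labels.
original = ["site1", "site2", "site3", "site4"]
original_index = [2, 0, 3, 1]
shuffled = [original[i] for i in original_index]
restored = restore_order(shuffled, original_index)
print(restored == original)
```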

The output is written to the screen unless a monitor file name is given with switch -m, in which case the results are appended to the monitor file. Summary statistics include the Procrustes and non-rotated rmse (the latter called "SD"), the Procrustes rotation matrix, and the largest individual differences in scores in 2 and 4 dimensions. If voluminous output is requested (with switch -v), the target file ordination scores and the rotated config file ordination scores are listed as well.

Example of the SOLCOMP output and its interpretation

==============================================================
SOLCOMP 1.4 (Wed Sep 17 08:41:54 1997) - J.Oksanen & P.Minchin
DUNE MEADOW SPECIES DATA (M. BATTERINK AND G. WIJFFELS, 1983)
Method DCCA, Scaling -1 --- 30 species, 20 sites
Random number seed: 3bd186b9
Eigenvalues: 0.3391 0.1149 0.0410 0.0093
Detrended: Solution centred before Prokrustes analysis
Date: Wed Sep 17 13:06:31 1997

**** Fit statistics for species ordination:
PROCRUSTES RMSE: 0.4051 SYME: 0.4056
0.0211 0.0427 0.0665 0.3967 == rmse by axes
Largest residuals in 2 dim 0.1098 (# 8), in 4 dim 1.7637 (# 23)
Rotation matrix:
-0.9998 0.0040 -0.0091 -0.0158
0.0044 0.9994 -0.0316 -0.0112
-0.0075 0.0305 0.9956 -0.0885
0.0165 -0.0141 -0.0880 -0.9959
NON-ROTATED SD: 0.4442
0.0001 0.0010 0.0001 0.4442 << sd by axes
0.0020 0.0131 0.0009 7.1930 %% Percent of target range

**** Fit statistics for sites ordination:
PROCRUSTES RMSE: 0.0874 SYME: 0.0871
0.0096 0.0085 0.0158 0.0850 == rmse by axes
Largest residuals in 2 dim 0.0298 (# 18), in 4 dim 0.2200 (# 2)
Rotation matrix:
-0.9990 0.0114 -0.0179 -0.0390
0.0108 0.9998 0.0013 0.0162
-0.0168 -0.0006 0.9994 -0.0299
0.0397 0.0158 -0.0292 -0.9987
NON-ROTATED SD: 0.0942
0.0000 0.0002 0.0000 0.0942 << sd by axes
0.0021 0.0084 0.0015 4.0343 %% Percent of target range

The program reads and reports header information from the shuffled file. Then it reports the same stability statistics for species and sites. The main entries are:

Changes in version 1.4

The first release version was 1.3. The current version 1.4 was made available on this server on 17/9/97. The major changes from v1.3 are:

Changes in version 1.5

Released 16/1/98. Changes from version 1.4 are cosmetic: the program now accepts blanks between a command line option and its argument.

Running the programs together

The easiest way to run these programs together is to use the program STABLTY, which automatically does everything described here. However, if you want to modify STABLTY files or run your own analysis, you may find some general information here.

In our article we compared two random permutations of data files, since there is no unique, correct order. For studying the stability of CANOCO in one particular data set, it may be sufficient to compare permuted data sets against one "original" data order. This won't give unbiased estimates of rmse, but it will give an idea of the stability of one particular ordination: the one produced from the "original" data set. In that case the commands can be simplified slightly.

For several data sets, it is best to write a batch file (e.g. compare.bat). For each comparison, use the following block of commands:

cepshufl -imyown.spe -rJustDoIt 
canoco < cano.con 
copy shuffle.* target.* 
cepshufl -itarget.spe -rNuke 
canoco < cano.con 
solcomp -ttarget.sol -cshuffle.sol -mmonitor.txt 

The file cano.con contains the commands read in by CANOCO. It may look like this:

     2 =  DO NOT CHANGE THIS LINE
     0 =  long dialogue?
 shuffle.spe                              = file with species data
 S                                        = file with covariables
 S                                        = file with environmental data (S to Skip)
 shuffle.out                              = print file
 shuffle.sol                              = solution file for CANOPLOT or other prog
  4  = analysis number
     0 = sample number to be omitted
   0 = transformation of species data
  2  2 = ordination output

The first call to CEPSHUFL reads in the data file myown.spe and produces the data file shuffle.spe, using seed JustDoIt for the random number generator. As instructed in cano.con, CANOCO reads in shuffle.spe and produces the solution file shuffle.sol. These are copied to the files target.spe and target.sol. In the next call to CEPSHUFL, with seed Nuke, target.spe is permuted to produce a new shuffle.spe, which is again ordinated by CANOCO to produce a new shuffle.sol. Finally, target.sol and shuffle.sol are compared, reading the permutation from shuffle.lst, and the results are appended to the file monitor.txt.
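When many comparisons are needed, the block of six commands can be generated rather than typed by hand. The Python sketch below only builds the text of such a batch file from the commands shown above. The function names and the second pair of seed phrases (Alpha, Beta) are hypothetical, and the result would still have to be saved as, for example, compare.bat and run in the MS-DOS prompt.

```python
def comparison_block(data_file, seed_a, seed_b, monitor="monitor.txt"):
    """Return the six commands for one comparison, as listed in this manual."""
    return [
        f"cepshufl -i{data_file} -r{seed_a}",
        "canoco < cano.con",
        "copy shuffle.* target.*",
        f"cepshufl -itarget.spe -r{seed_b}",
        "canoco < cano.con",
        f"solcomp -ttarget.sol -cshuffle.sol -m{monitor}",
    ]


def batch_text(data_file, seed_pairs):
    """Join one command block per pair of seed phrases into batch file text."""
    lines = []
    for seed_a, seed_b in seed_pairs:
        lines.extend(comparison_block(data_file, seed_a, seed_b))
    return "\n".join(lines) + "\n"


# Two comparisons; the second pair of seed phrases is made up for the example.
text = batch_text("myown.spe", [("JustDoIt", "Nuke"), ("Alpha", "Beta")])
print(text)
```

Since SOLCOMP appends to the monitor file, all comparisons generated this way collect their results in the same monitor.txt.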

It is possible to analyse the stability of Constrained Correspondence Analysis as well. Species, sites or environmental variables cannot be removed during the analysis, so the data files must be edited so that these manipulations are not needed. Specifically, if only some of the environmental variables are used as constraints, the other variables must be removed from the data set before running CEPSHUFL, either by editing the input format or by directly editing the file. The same random number seed will produce the same ordering of sites in the species and environmental data files, so CEPSHUFL must be called for both files with the same seed phrase.
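The reason for using the same seed phrase for both files can be illustrated with a small Python sketch: if two tables with the same number of rows are reordered by generators started from the same phrase, the rows stay paired. The data, the site labels and the function shuffle_rows are hypothetical, and Python's generator stands in for the one used by CEPSHUFL.

```python
import random


def shuffle_rows(rows, phrase):
    """Reorder rows with a generator seeded from a text phrase."""
    rng = random.Random(phrase)  # a string is accepted as a seed
    order = list(range(len(rows)))
    rng.shuffle(order)
    return [rows[i] for i in order]


# Hypothetical species and environmental tables for the same four sites.
species = [("site1", 3, 0), ("site2", 1, 4), ("site3", 0, 2), ("site4", 5, 1)]
environment = [("site1", 6.1), ("site2", 5.4), ("site3", 7.0), ("site4", 4.8)]

shuffled_species = shuffle_rows(species, "JustDoIt")
shuffled_environment = shuffle_rows(environment, "JustDoIt")

# The site labels still match row by row after shuffling.
print([row[0] for row in shuffled_species] == [row[0] for row in shuffled_environment])
```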


Updated 16/9/97 Jari Oksanen