BPP Documentation
BPP Explained
Overview
Bayesian Phylogenetics & Phylogeography (BPP) is a Bayesian Markov chain Monte Carlo (MCMC) program for analyzing DNA sequence alignments of multiple loci from multiple closely-related species under the multispecies coalescent (MSC) model (Yang 2002; Rannala and Yang 2003). See Xu 2016 and Rannala et al 2020 for reviews of the MSC model. The program requires 3 input files:
-
Sequence data file (see Sequence File).
-
Imap file specifying the population source of each sequence (see Imap file).
-
Control file specifying other variables needed to run the program (e.g., initial species tree, prior distributions, etc)(see Control File).
This manual will provide details for creating these input files (describing syntax and permissible values for variables). The input files should be in text format and edited using a text format editor (for example, a Unix editor such as emacs, or vi). A basic set of map and control files using sensible default priors based on the sequence data can also be created using the web tool Minimalist BPP available at
https://brannala.github.io/bpps/
These auto-produced files can then be used as a template for more editing.
Possible types of analyses
The program can be used to conduct four different types of analyses, each specified using two variables in the control file:
-
A00 (
speciesdelimitation = 0
,speciestree = 0
): estimation of species divergence times and population sizes under the MSC, IM and MSC-I models when the species tree is specified by the user (Rannala and Yang 2003; Flouri et al 2020) -
A01 (
speciesdelimitation = 0
,speciestree = 1
): inference of the species tree when sequences are assigned to species by the user (Rannala and Yang 2017) -
A10 (
speciesdelimitation = 1
,speciestree = 0
): species delimitation using a user-specified guide tree (Yang and Rannala 2010; Rannala and Yang 2013) -
A11 (
speciesdelimitation = 1
,speciestree = 1
): joint species delimitation and species tree inference using unguided species delimitation Yang and Rannala 2014
A detailed guide to the use of each of these 4 methods of analysis, as well as example control files, are provided in Methods of Analysis. The BPP tutorial also provides examples illustrating the four types of analyses (Yang 2015; Flouri et al 2020).
Model Parameters
The basic parameters in the MSC model include the species divergence times ( \(\tau\) ) and mutation scaled population sizes
\(\theta = 4N\mu\)
where \(N\) is the effective population size and \(\mu\) is the mutation rate per site per generation. Note that \(\theta\) specifies the average proportion of sites that have different bases when comparing two sequences sampled at random from the population. Both \(\tau\) and \(\theta\) are measured in units of expected number of mutations per site ( \(\tau\) can be converted to units of years if the substitution rate per site per year is known). For example, if \(\tau\) is an estimated divergence time in units of expected substitutions per site (from a BPP analysis) and the substitution rate per site per year is \(\mu\), then the estimated divergence time in years is, \(\tau'\), is
\(\tau' = \frac{\tau}{\mu}\)
and the diploid effective population size is
\(N = \frac{\theta}{4\mu}\)
For a species tree with \(s\) species, there are (\(s – 1\)) divergence times (\(\tau\)) and at most (\(2s – 1\)) population size parameters (\(\theta\)). Analysis A00 estimates those parameters when the species delimitation and species tree are fixed (specified by the user). Analyses A01, A10, and A11 estimate parameters and model probabilities of the different MSC models. The multispecies-coalescent-with-introgression (MSC-I) model, first implemented in BPP version 4.1 and described in The MSC-I Model, adds introgression nodes to the tree, each with a new parameter, the introgression probability (\(\varphi\)) see Flouri et al 2020. The multispecies-coalescent-with--migration (MSC-M) model, first implemented in BPP version 4.6 and described in The MSC-M Model, adds a matrix of instantaneous migration rates (M). For reviews of the MSC model, see Yang 2014 (Chapter 9), Xu 2016 and Rannala et al 2020.
Model Assumptions
The MSC model implemented in BPP makes two basic assumptions about the data: no recombination between sites of the same locus, and free recombination between loci. The original MSC model (Rannala and Yang 2003) also assumes complete genetic isolation between populations, but more recent implementations allow gene flow between species (either continuous migration or episodic introgression); the MSC-I model implemented in BPP allows episodic introgression Flouri et al 2020 and the IM model allows continuous migration (see Introgression and Migration Models and Isolation With Migration Models). The default model assumes a strict molecular clock (constant substitution rates among lineages) and relaxed-clocks (with substitution rates varying among lineages) are implemented through variable clock models (see Substitution Models).
Nuclear genomic data
Ideal data for analysis using the BPP program are loosely-linked short genomic segments. Such data are likely to satisfy the model assumptions because intra-locus recombination is rare for short segments (500 to 1000bp) and distantly spaced segments undergo frequent recombination and thus have nearly independent histories. Genealogical trees are assumed to be a product of neutral evolution (not influenced by natural selection). However, protein-coding gene sequences appear to be useable in BPP analyses since most proteins are performing similar functions in closely related species and the main effect of purifying selection on nonsynonymous mutations is a reduction of the neutral mutation rate (Shi and Yang 2018; Thawornwattana et al 2018). It is a good idea to separate the noncoding and coding regions of the genome into two datasets.
The program allows the option of either a constant rate (strict molecular clock) or variable rates of substitution (relaxed molecular clocks) among lineages. Different substitution models are available, with the most complex (parameter rich) being the General Time-Reversible (GTR) model and the simplest the Jukes-Cantor (JC69) mutation model. All the models correct for multiple hits at individual sites (see Substitution Models). The use of the JC69 model and the strict molecular clock model should be limited to closely related species with sequence divergence not much higher than 10%. Sequences from distantly related species should be analyzed using a GTR substitution model in combination with one of the relaxed clock models (see Among-species rate variation. Models allowing substitution rate variation among sites (see Among-site rate variation) and among loci (see Among-locus rate variation)are also implemented in BPP (see Models of Substitution Rate Variation).
Mitochondrial data
The MSC model accommodates genealogical heterogeneity across the genome.
Thus, different regions of the autosomal genome may have different gene
tree topologies and branch lengths (coalescent times). While the
mutational process may also vary along the genome, this heterogeneity is
believed to be much less important. Thus multiple genes from the
mitochondrial genome, which in most species does not undergo
recombination, should be treated as one 'locus' in the MSC-based
analysis. The mitochondrial genome often has a mutation rate that
differs from the nuclear genome, as well as a different effective
population size from that of autosomes. While the model of locus-rate
variation (option variable locusrate
) and the heredity scalar (option
variable heredity
) are designed to deal with this, it may be prudent
to analyze the loci from the nuclear genome separately from the single
mitochondrial locus.
Identical sequences and phylogenetic signal
The model implemented in BPP assumes that the sequences represent random samples from the different species. Sequences from the same species that are identical should all be used. It is incorrect to use only the unique haplotypes, which will lead to biased parameter estimates. Similarly, it is incorrect to filter loci based on bootstrap support values and use only those loci with a high "phylogenetic information" signal.
How to Cite BPP
If you publish results obtained using BPP you should cite the version number of the BPP release you used as well as the canonical citation, which as of release v4.3.8 is
Tomáš Flouri, Xiyun Jiao, Bruce Rannala, Ziheng Yang (2018) Species
Tree Inference with BPP Using Genomic Sequences and the Multispecies
Coalescent, Molecular Biology and Evolution 35: 2585--2593.
You can also cite the one of the BPP tutorials and the original papers describing the methods you used (see above). Be sure to state the priors that you used since they are necessary for reproducibility. If you conduct a joint analysis of species delimitation and species tree inference, your method description may look like the following (replace the specific values in bold with those you used):
"Joint Bayesian species delimitation and species tree estimation was conducted using the program BPP (Flouri et al., 2018; Yang, 2015). The method uses the multispecies coalescent model to compare different models of species delimitation (Yang and Rannala, 2010; Rannala and Yang, 2013) and species phylogeny (Yang and Rannala, 2014; Rannala and Yang, 2017) in a Bayesian framework, accounting for incomplete lineage sorting due to ancestral polymorphism and gene tree-species tree discordance. The population size parameters (\(\theta\)s) are assigned the inverse-gamma prior IG(3, 0.002), with mean 0.002/(3–1) = 0.001. The divergence time at the root of the species tree (\(\tau_0\)) is assigned the inverse-gamma prior IG(3, 0.004), with mean 0.002, while the other divergence time parameters are specified by the uniform Dirichlet distribution (Yang and Rannala, 2010, eq. 2). Each analysis is run at least twice to confirm consistency between runs.”
Features New To BPP 4.0
- Multiple threads
- Introgression models (MSC-I)
- Migration models (MSC-M)
- Local-clock and locus-rates models (clock and locusrate)
- Topological constraints (constraint and outgroup)
Installing and Running BPP
This manual applies to BPP versions 4.1 and later. The program is written in C and executables are available for Windows, Mac OSX and Linux. Alternatively, the BPP program can be compiled for Unix (Linux, BSD, etc), Mac OSX, or Windows. If you are using a Windows, Mac OSX or Linux operating system (and you are not a programmer who prefers to compile from source) you should download the latest release executables for your machine (go to Obtaining BPP). If you have not used a command line program before, read Using the Command Line then go to Obtaining BPP. If you want to compile the program yourself (you must do this if you are using a Unix variant other than Linux or Mac OSX) go to Compiling the BPP Program.
Obtaining BPP
Executables and source code for the most recent BPP release are always available at:
https://github.com/bpp/bpp/releases/latest
Download the distribution file containing an executable for your operating system and uncompress the file. Change the current directory to be the root of the subdirectory created by uncompressing the file. For example, on a Linux machine you could type the following to download and install bpp version 4.8.0 (the latest version may be different from this):
wget https://github.com/bpp/bpp/releases/download/v4.8.0/bpp-4.8.0-linux-x86_64.tar.gz
tar -xzf bpp-4.8.0-linux-x86_64.tar.gz
cd bpp-4.8.0-linux-x86_64
You are now ready to proceed to BPP Trial Run
Using the Command Line
BPP is a command-line program, so the preferred way of running it is to open a terminal application and execute the program from the terminal command line, rather than double-clicking on the executable file in your file explorer. If you have not used the command line before, please work through one of the following short tutorials first:
Windows:
http://abacus.gene.ucl.ac.uk/software/CommandLine.Windows.pdf
Mac OSX:
http://abacus.gene.ucl.ac.uk/software/CommandLine.MACosx.pdf
Compiling the BPP Program
Requirements
To compile the program you will need to have a C compiler and the Make program installed on your machine. To test whether this software is installed (on a Unix machine) you can type:
cc --version; make --version
which should produce output similar to the following if a compiler is installed:
cc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
If instead you see an error message such as:
Command 'cc' not found
then you (or your system) administrator will need to install the compiler software (using a package manager such as apt on Ubuntu, for example) before you can compile BPP.
Downloading and compiling the source code
If you have the necessary software for compiling you can download the C source code for the latest release from:
https://github.com/bpp/bpp/releases/latest
The link to the source file on github is named Source code
. You will
need to download and uncompress the file, then change to the source code
subdirectory before executing the commands outlined below. The program
needs to be compiled only once. For example, the following commands use
the gcc compiler to compile the program and move the generated
executable file (bpp) into the bin/ folder.
cd bpp
mkdir bin
cd src
make
mv bpp ../bin
If you use git you can instead clone the bpp repository and check out the master branch (which contains source code for the latest stable version of bpp) and then compile the program:
git clone https://github.com/bpp/bpp.git
cd bpp
git checkout master
cd src
make
You are now ready to proceed to BPP Trial Run
BPP Trial Run
Run the program from the command line (rather than double-clicking the executable) so that you will see any potential error messages. Change your current directory to the top level of the directory created by uncompressing the bpp distribution file. For example, in Linux the following commands will uncompress the distribution file and move you to the top level of the bpp directory:
tar -xvzf bpp-4.8.0-linux-x86_64.tar.gz
cd bpp-4.8.0-linux-x86_64
In the bpp/ subdirectory, run the program in Windows by typing the following command within the Terminal application:
bin\bpp --cfile examples\frogs\A00.bpp.ctl
In Linux or Mac OSX type the following commands within a terminal:
cd examples/frogs/
../../bin/bpp --cfile examples/frogs/A00.bpp.ctl
If the program executed successfully, you should see initial output similar to the following:
bpp v4.8.0_linux_x86_64, 31GB RAM, 12 cores
https://github.com/bpp/bpp
Auto-selected SIMD ISA: AVX2
Starting timer..
Using seed: -1
Parsing species tree... Done
Parsing phylip file... Done
Alternatively, you may see one or more error messages. This may be due to a spelling error, or you may not be in the correct subdirectory. If this does not appear to be the case and the problem persists you can ask for help on the BPP discussion group at:
https://groups.google.com/forum/#!forum/bpp-discussion-group
When posting on the forum, please specify the exact command you typed and the error message you received (preferably by copying and pasting this information from your terminal or attaching a screen shot).
BPP Command-line Options
Below is a list of command-line options for running BPP.
Command-line option | Meaning |
---|---|
With no options, bpp prints out the version number and computer hardware information | |
--help |
Display help information |
--cfile FILENAME |
Run the MCMC using the specified control file |
--version |
Display version information |
--quiet |
Output warnings and fatal errors to stderr only |
--cfile FILENAME |
Run MCMC for the specified control file |
--resume FILENAME |
Resume analysis from a specified checkpoint file |
--arch SIMD |
Force specific vector instruction set (default: auto) |
--msci-create FILENAME |
Construct the extended Newick notation for the MSC-I model using an MSC-I definition file |
--bfdriver FILENAME |
Create control files for marginal likelihood calculation |
--points INTEGER |
Number of G-L quadrature points (used with --bfdriver) |
--summary FILENAME |
Summarize results using specified control file (no MCMC) |
--theta_slide FLOAT |
Use ‘slide’ move for \(\theta\) with probability PROB and ‘gibbs’ move with 1 – PROB (default PROB = 0.2) |
--theta_mode INTEGER |
Change window size for sliding-window moves for \(\theta\): |
1: one step length for all \(\theta\) (default) | |
2: one step length for tip and one for inner nodes | |
3: one step length for each node | |
--no-pin |
Disables thread pinning |
Input File Formats
Two input files are required for every BPP analysis, the sequence file and the control file. If more than one species is being analyzed an Imap file is also required. A few additional input files are required only when specific options are specified in the control file. Here we describe the content and format specifications of the sequence, Imap and control files that will be needed for most BPP analyses.
Sequence File
The sequence alignments must be in the phylip/paml format, with one
alignment following the other, all in one file. An alignment is called a
locus. Every locus must have at least 2 sequences, different loci can
have different numbers of sequences, and some species may have no
sequences present at some loci. In the phylip/paml format the first line
of the entry for each locus contains two numbers specifying the number
of sequences and the number of sites, respectively, that are present at
the locus. A sequence name must precede each sequence entry and be
separated from it by either a line break or two or more spaces. Each
sequence name must contain an individual ID, which is unique among the
sequences listed for a given locus, preceded by a caret ^
symbol. For
example, the sequence name GI01234567^Specimen1
has the individual ID
Specimen1
. The label GI01234567
that precedes the ^
is optional, so
^Specimen1
is also a valid sequence name but Specimen1
is not. The
sequences may be in interleaved format.
DNA bases can be specified using either upper or lower case letters,
missing data can be specified using a ?
symbol, and IUPAC ambiguity
codes may be used. The interpretation of an ambiguity code (as either an
uncertain base or a diploid genotype) depends on whether the data are
specified as phased in the control file. By default (control file
variable cleandata = 0
, see Control File),
alignment gaps and ambiguity nucleotides are used in the likelihood
calculation, with gaps treated as question marks
(see Yang 2014 pp. 111-112).
If cleandata = 1
, all columns with gaps or ambiguity
characters are removed before analysis. An example of a sequence data
file for 2 loci and 4 sequences per locus is given below (you can also
examine the sequence files frogs.txt
and yu2001.txt
in the
examples
subdirectory).
4 20
dog^alfy1 attgtgccctctctctctca
dog^alfy2 attgtgccctctctctctca
cat^smoot attgtgccctctagctctca
human^ben attgtgccctctgactctcc
4 10
dog^alfy1 attgtgccct
dog^alfy2 attgtgccct
fox^sample1 attgtgccct
^ben attgtgaagt
Imap File
Each sequence name has an individual ID tag after the caret ^
symbol.
For example, the sequence name GI01234567^Specimen1
has the individual
ID Specimen1
. An Imap file maps the individual IDs (specimens) to
their species/populations. For example, the following Imap file maps
Specimen1
to species A, Specimen4
to species B, and so on.
Specimen1 A
Specimen2 A
Specimen3 A
Specimen4 B
Specimen5 C
Specimen6 C
See also the map files frogs.Imap.txt
and
Rokas2003-5species-Imap.txt
in the examples
subdirectory. The MSC
model implemented in BPP infers the relationships among
species/populations not individuals and so it uses only the population
ID for a sequence (such as A, B, ... in the example), and does not use
the individual ID. Also when BPP reads the sequence names,
it use the individual ID to retrieve the species ID for each sequence
but then ignores the sequence name. One motivation for the use of the
Imap file and this two-layer design is that one may wish to analyze the
same sequence data with different population/species assignments of
individual IDs. This can be achieved by editing the small Imap file
rather than editing the larger sequence data file.
An example Imap file showing the sequence ID in the Imap file
corresponding to that in the sequence file. Note that the caret ^
symbol is not allowed in the sequence ID of the Imap file but is
required in the sequence label in the sequence data
file.
Outgroups and Constraints File
The outgroups and constraints file is optional and its name is
specified in the control file by setting variable constraintfile
to
the name of the constraint file. Topological constraints on species
trees can only be specified for analyses A01 (species tree inference
with a fixed species delimitation) and A11 (joint species tree inference
and species delimitation). See Methods of Analysis
for a detailed description of the A01 and
A11 methods of analysis. Note that species trees are always rooted in
BPP, whether you use the clock or relaxed clock. The
control variable constraintfile
in the control file has the format
constraintfile = constraints.txt
Inside the constraint file, three keywords are allowed: define
,
constraint
, and outgroup
. The define
keyword is used to assign a
name or alias to a clade, constraint
defines a clade or subtree, and
outgroup
means that species not on the list (the ingroup species) form
a clade.
define g1 as (G,H);
# identical to constraint (D,E,F,(G,H));
constraint (D,E,F,g1);
outgroup A,B,C;
constraint (A,B);
Either the equal sign (=) or spaces can be used to separate the keyword
from the specified value. In the example above, there are eight species
in the dataset: A, B, C, D, E, F, G, and H. Line 1 defines an alias g1
for clade (G,H), which may be used in subsequent define
or
constraint
statements. Line 3 specifies A, B, C as the outgroups; this
is interpreted to mean that species not on the list, (that is, D, E, F,
G and H) form a clade. Since this information is already in line 2, line
3 has no effect in this case.
The modifier keyword NOT
(case-insensitive) can be used with define
to instead specify the complement of the species on the list. Suppose in
the data we have 10 species of ratites (flightless birds), and two
outgroup species, chicken and ostrich. You can specify the constraint as
follows:
define ratites as NOT (chicken, ostrich);
constraint = (chicken, (ostrich, ratites));
Note that BPP interprets outgroup
to mean that all
species not on the list form a clade. In theory the outgroup
keyword
is unnecessary because you can always achieve the same thing by using
define
(especially with the NOT
option) and constraint
. However
outgroup
may be more convenient or explicit in some situations. If
there are four species (A,B,C,D),
outgroup = B,C,D;
achieves nothing since all the 15 rooted species trees are allowed, whereas
outgroup = C,D;
means (A,B) form a clade so that 3 rooted species trees are allowed. Finally if multiple compatible constraints are specified, BPP merges them into one so that
constraint (E,F,G,H);
constraint (F,G,H);
constraint (G,H);
is equivalent to
constraint (E,(F,(G,H)));
If any two constraints are in conflict, BPP will abort with an error message. For example the following will cause an error.
constraint = (E,F,G,H);
constraint = (G,H);
constraint = (F,G);
Date File
The date file is optional and its name is
specified in the control file by setting variable datefile
to
the name of the date file.
When using tip-dating, each sample has an associated sample age.
The datefile
assigns an age to the sample using the individual ID tag.
The dates are in units before time present, so larger numbers are older.
Any desired time units (e.g. years, thousands of year, etc) can be used so long as the mutation rate prior is on the same time scale (e.g. expected substitutions per year, expected substitutions per thousand years).
For example, the following date file assigns specimen1
an age 500, specimen2
an age of 10000. The population name can be included before the caret ^
symbol or it can be excluded.
A^specimen1 500
specimen2 10000
B^specimen3 45000
B^specimen4 35000
Each line should have one individual followed by the sample age.
Each sequence in the dataset must be included in the date file.
Also see the date file mammoth/dates.txt
in the examples
subdirectory.
Control File
The control file specifies most aspects of a BPP analysis, including
the names of input and output files, choices of models, prior
distributions on parameters, etc. The format is line oriented, with each
line specifying the value of a particular variable. The default control
file name is bpp.ctl
. Lines beginning with *
or #
are comments and
are ignored by the program. Most often the order of the lines is
unimportant, but there are exceptions (as explained below). Each
non-commented line of the control file contains a variable name and an
assignment of one or more values to the variable. The general syntax of
a line in the control file is:
variable = value
where variable
is the name of a valid variable and value
is a list
of one or more permissible values for the variable. Most variable
assignments follow this format --- the exceptions are described later.
BPP has complicated dependencies among variables, both in terms of
whether the variable needs to be defined (or has a default value when
undefined) and what its permissible values are. Here we begin by
defining all the variable names, the syntax of their arguments and
permissible types (integers, floating point numbers, etc), and outline
the structure of dependencies among variables
Symbol/Rule | Definition |
---|---|
b | Boolean (0,1) |
+d | Positive integer |
f | floating point number |
s | String (or character) |
t | Tree in Newick format |
x* | List of one or more elements of type x |
y [z] | z is a set of variables whose form is determined by y |
(w,z) | w and z are two different permissible types |
y [(w,z)] | whether w or z is the permissible type depends on value of y |
Table 1. Symbols used to represent syntax and value types of variables in variable definitions.
Detailed definitions of all the control file variables are provided at the end of this section, ordered according to their indexes in Table 2 below. If you are impatient to get started using the program with worked examples, you might want to now skip ahead to Example control file which explains the control file variables in the context of a simple example dataset analysis. You can later consult this section regarding particular control file variables as the need arises.
Index | Variable | Values | Condition | Dependencies |
---|---|---|---|---|
1 | seed | (-d,+d) | True | None |
2 | usedata | b | True | None |
3 | jobname | s | True | None |
4 | seqfile | s | True | <-16 |
5 | finetune | b [f*] | True | ->7 |
6 | b b b b b | True | None | |
7 | burnin |
+d | True | <-5 |
8 | sampfreq | +d | True | None |
9 | nsample | +d | True | None |
10 | species&tree | +d s +d t | True | ->{11,15,24} |
11 | Imapfile |
s | 10[1]>1 | <-10 |
12 | speciesdelimitation | b [b f*] | True | ->14 |
13 | speciestree | b | True | ->13 |
14 | speciesmodelprior |
+d | 12[1]=1 OR 13[1]=1 | <-{12,13} |
15 | phase | b* | 10[1] | <-10 |
16 | nloci | +d | True | ->4 <-30 |
17 | model | s [s] | True | None |
18 | Qrates | b +f +f +f +f +f +f | 17=''7'' | <-17 |
19 | basefreqs | b +f +f +f +f | 17=''7'' | <-17 |
20 | alphaprior | f f +d | True | None |
21 | cleandata | b | True | None |
22 | thetaprior | s [(f*, f f s)] | True | None |
23 | tauprior | s f f | True | None |
24 | phiprior |
f f | 10[4] | <-10 |
25 | locusrate | +d [(f f f s, s, )] | True | <-26 ->26 |
26 | clock | +d [f f f s s] | True | <-25 ->25 |
27 | heredity | +d [(f f, s)] | True | None |
28 | checkpoint | +d [(+d)] | True | None |
29 | constraintfile | s | True | None |
30 | threads |
+d [ +d +d ] | True | ->16 |
31 | datefile | s | 25[3] | None |
Table 2. Complete list of BPP control file variables. The column Variable
contains the
control file variable name, the column Values
specifies (using symbols) the format of the values (symbols are defined in
Table 1). The column Condition
describes the conditions (if any) needed for the variable to be required. An entry
of True in this column means that the variable is always required/defined although possibly not explicitly (there may be a
default value). The column Dependencies
lists the variables that either influence the variable (<-) or are influenced by the variable
(->) with the variables denoted using their index from column Index
.
Table 2 lists all the variables that can be defined in the BPP control file. The first column lists an index used to identify the variable in other columns of the table and to determine the order in which the variables are defined in subsequent text. The second column lists the variable name. The variable names are highlighted to indicate the structure of their dependencies among variables as follows:
-
Bold = variable always needs to be defined, has no default value, and does not depend on other variables.
-
Italic = variable always needs to be defined, has a default value if no value is provided in control file, and does not depend on other variables.
-
Solid background = whether variable needs to defined (and possibly also its permissible values) depends on the value of another variable.
-
Regular = variable always needs to be defined and has no default value but its permissible values depend on the value of another variable.
Column 3 lists the permissible types of each variable and the syntax for specifying arguments in defining the variable. The variable types are defined in Table 1.
Column 4 lists the condition (state of another variable) that either determines whether the variable on that line exists (solid background variables) or determines its permissible values (italic or regular variables). x [y] refers to value of argument y of the variable with index x. Column 5 lists indexes of variables that either are influenced by the variable (the notation ->[c,d] indicates that the variable influences variables c and d) or variables that influence the variable (the notation <-[a] indicates that variable a influences the variable). If you are writing a control file from scratch, these dependencies necessitate that you specify the bold variables first, followed by the italic (only needed if changing defaults) and then the solid background and regular.
BPP control file variables
1 seed
seed = (-d,+d)
DESCRIPTION
Specifies the seed used for the random number generator.
VALUES
-d
, use the wall clock (in Windows) or white noise (in Linux) to
generate a seed (store seed in file SeedUsed
).
+d
, use +d as the seed.
DEFAULT
-1
COMMENTS
Using the same positive integer seed will produce identical results in
different runs, which is useful for debugging, and using different
positive integers in different runs produces different results. Using
\(-d\), for example, \(-1\), the computer's entropy generator is used as a
source for the seed, and different runs will produce different results.
When evaluating MCMC convergence by comparing results across different
runs be sure to use -d. The positive integer seed automatically
generated when using option -d is stored in a file named SeedUsed and
can be used explicitly to replicate a result. It is recommended that you
run each analysis at least twice using different seeds to confirm that
the results are stable across runs.
EXAMPLES
seed = -1
seed = 278
2 usedata
usedata = b
DESCRIPTION
Specifies whether data (likelihood+priors) are used in calculating
probabilities during MCMC or only priors.
VALUES
0
, use only the priors to calculate probabilities (likelihood
constant).
1
, use likelihood and prior probabilities.
DEFAULT
1
COMMENTS
The 0
option can be used for debugging, or examining priors. When
using option 0
a MCMC run produces samples from the prior for each
variable.
EXAMPLES
usedata = 0
usedata = 1
3 jobname
jobname = s
DESCRIPTION
Defines the job name, which serves as a prefix for all output files generated by
the analysis, including the main output file and the MCMC sample file.
VALUES
s
, a string specifying the common prefix for all output files. This string may
include a directory path.
EXAMPLES
jobname = /home/mickey/mouse
The above will create several files such as:
/home/mickey/mouse.txt # main output file
/home/mickey/mouse.mcmc.txt # MCMC sample file
/home/mickey/mouse.SeedUsed # seed information file
Additional files may be generated depending on the type of analysis.
4 seqfile
seqfile = s
DESCRIPTION
Sets the name of the file containing the sequence data to be the string
s
VALUES
s
, a string of characters specifying the directory path and name
of file that contains the sequence data
EXAMPLES
seqfile = /home/mickey/mouse_seq.txt
seqfile = sequences.txt
5 finetune
finetune = b: [f*]
DESCRIPTION
Determines whether step lengths in MCMC proposals are automatically
optimized or manually set to fixed values as well as setting either the
fixed step lengths (manual) or the starting step lengths (optimized).
There are up to 15 proposals with step lengths that can be adjusted. Not
all proposals are defined for any given choice of model and analysis,
however 15 values should always be specified when using manual as the
order is fixed, step lengths specified for unused proposals will be
ignored by the program. The sizes of proposal steps for each parameter
are provided (from left to right) in the same order as listed (from top
to bottom in table. The proposals are as follows:
Abbreviation | Parameter Proposal Description |
---|---|
Gage | Node age on gene tree |
Gspr | Subtree pruning and regrafting move |
thet | Theta |
tau | Tau |
mix | Mixing step (jointly changing mu, tau and gene tree node ages |
lrht | empty |
phi | Introgression probability |
pi | Stationary frequencies of substitution model |
qmat | Q matrix elements of substitution model |
alfa | Alpha parameter of gamma model for rate variation among sites |
mubr | empty |
nubr | empty |
mu_i | empty |
nu_i | empty |
brte | empty |
VALUES
0: Gage Gspr thet tau mix lrht phi pi qmat alfa mubr nubr mu_i nu_i brte
Use fixed step lengths in MCMC proposals as specified, where Gage is the
specified step length for proposal Gage, and so on.
1: Gage Gspr thet tau mix lrht phi pi qmat alfa mubr nubr mu_i nu_i brte
Automatically optimize step lengths in MCMC proposals using specified
initial step lengths.
DEFAULT
0
Manually fixes all step lengths to be \(0.1\).
1
Fixes all initial proposal step lengths to \(0.1\) prior to optimization.
DEPENDENCIES
If b
is set to 1 (automatic optimization) then burnin
has to
be \(>200\).
COMMENTS
The first value, before the colon, is a switch, with 0 meaning no
automatic adjustments by the program and 1 meaning automatic adjustments
by the program. Following the colon are the step lengths for the
proposals used in the program. and then the step lengths specified here
will be the initial step lengths, and the program will try to adjust
them using the information collected during the burnin step. This option
appears to work fine. Some notes about manually adjusting those finetune
step lengths are provided below in section 3.2.
EXAMPLES
finetune = 0: .01 .02 .03 .04 .05 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01
finetune = 1
6 print
print = b b b b b
DESCRIPTION
Specifies what information is saved to the output files during the
MCMC.
VALUES
There are 5 boolean (0,1) variables that specify whether particular
variables are saved to file (1) or not saved (0). From left to right:
variable 1 specifies MCMC samples; variable 2 specifies locus-specific
rate parameters (rate \(\mu_i\), variance parameter \(\nu_i\) and
species-tree branch rates for locus \(r_{ij}\)); variable 3 specifies
locus-specific heredity scalars if those are estimated from the data;
variable 4 specifies locus-specific gene trees; and variable 5 specifies
locus-specific substitution-rate parameters (qmat for the \(Q\) matrix,
freqs for base frequencies, and alpha for gamma rates for sites).
EXAMPLES
print = 0 1 1 1 0
print = 1 1 1 1 1
7 burnin
burnin = +d
DESCRIPTION
Specifies the number of burn-in iterations that are executed in the MCMC
run before sampling begins.
VALUES
+d
, a positive integer specifying the number of burn-in
iterations.
DEPENDENCIES
If finetune = 1
(automatic optimization) then burnin
has to
be \(>200\).
COMMENTS
The total number of MCMC iterations is burnin
+ nsample
\(\times\) sampfreq
.
EXAMPLES
burnin = 100
burnin = 10000
8 sampfreq
sampfreq = +d
DESCRIPTION
Specifies the interval at which samples from the MCMC are to be written
to the MCMC output file specified by jobname
.
VALUES
+d
, a positive integer specifying the interval between samples
from the MCMC.
COMMENTS
The total number of MCMC iterations is burnin
+ nsample
\(\times\) sampfreq
.
EXAMPLES
sampfreq = 2
sampfreq = 10
9 nsample
nsample = +d
DESCRIPTION
Specifies the number of samples from the MCMC that are to be written to
the MCMC output file specified by jobname
.
VALUES
+d
, a positive integer specifying the number of MCMC samples.
COMMENTS
The total number of MCMC iterations is burnin
+ nsample
\(\times\) sampfreq
.
EXAMPLES
nsample = 5000
nsample = 25499
10 species&tree
species&tree = +d s*
+d*
t
DESCRIPTION
Specifies the number of species, the species names, the maximum number
of sequences (at any locus) for each species, and the species tree in
Newick format (or species graph in extended Newick format in models with
introgression).
VALUES
+d
, a positive integer specifying the total number of species.
s*
, a list of +d
strings specifying the name of each
species. +d*
, a list of +d
positive integers specifying the
maximum number of sequences (at any locus) for each species. t
, a
tree (or graph) in Newick format (or Newick extended format) specifying
the relationships among species (and possibly introgression events).
DEPENDENCIES
The variable Imapfile
must be specified only when the number of
species +d
> 1. The length of the boolean array
specified for variable phase
is equal to the number of species
+d
. The variable phiprior
must be specified only if t
is an introgression graph.
COMMENTS
See the subsection Notation for trees and introgression graphs for
details on the format of t
. The role of t
in the MCMC
analysis depends on the value of the speciestree
variable. If
speciestree=1
the variable t
specifies the starting tree for
the MCMC analysis, otherwise if speciestree=0
it is the fixed tree
used for estimating parameters or the fixed guide tree for species
delimitation.
EXAMPLES
species&tree = 4 H C G O
1 1 1 1
(((H, C), G), O);
11 Imapfile
Imapfile = s
DESCRIPTION
Sets the path/name of the Imap file to be the string s
.
VALUES
s
, a string of characters specifying the directory path and name
of a file that contains the species/sequence map information.
DEPENDENCIES
The variable Imapfile
must be specified only when the number of
species defined by the species&tree
variable
+d
> 1.
COMMENTS
See Imap File for a complete description of the Imap file format.
EXAMPLES
Imapfile = /home/foo/bar_Imap.txt
Imapfile = Imap.txt
12 speciesdelimitation
speciesdelimitation = b [b f*]
DESCRIPTION
Specifies whether species delimitation is performed and if so specifies
the parameters of the species delimitation rjMCMC proposal algorithm.
VALUES
0
, specifies no species delimitation.
1 0 epsilon
, specifies species delimitation with rjMCMC algorithm
0, with parameter \(\epsilon\) as specified in equations 3 and 4 of
Yang and Rannala 2010. Reasonable values for \(\epsilon\) are 1, 2, 5, etc.
1 1 alpha m
, specifies rjMCMC algorithm 1, with parameters
\(\alpha\) and \(m\) as specified in equations 6 and 7 of Yang and Rannala 2010.
Reasonable values are \(\alpha = 1, 1.5, 2\), etc. and \(m = 0.5, 1, 2\).
DEFAULT
0
DEPENDENCIES
If speciesdelimitation = 1
then speciesmodelprior
must be defined.
EXAMPLES
speciesdelimitation = 1 0 2
speciesdelimitation = 1 1 2 1
13 speciestree
speciestree = b
DESCRIPTION
Specifies whether the species tree is estimated or fixed in the MCMC
analysis.
VALUES
0
, specifies that species tree is fixed.
1
, specifies that species tree is estimated.
DEFAULT
0
DEPENDENCIES
If speciestree = 1
then speciesmodelprior
must be defined.
COMMENTS
This option invokes the nearest-neighbor interchange (NNI) or
subtree-pruning-regrafting (SPR) algorithm to propose changes to the
species tree topology during the MCMC run.
EXAMPLES
speciestree = 1
speciestree = 0
14 speciesmodelprior
speciesmodelprior
DESCRIPTION
Specifies the prior distribution for rooted species trees and species
delimitations.
VALUES
0
, specifies equal probabilities for the labeled histories (rooted
trees with the internal nodes ordered by age).
1
, specifies equal probabilities for the rooted species trees.
2
, specifies equal probabilities for the numbers of species (\(1/s\)
each for \(1,2,\ldots,s\) species given \(s\) populations) and divides the
probability for any specific number of species among the compatible
models (of species delimitation and species phylogeny) in proportion to
the number of labeled histories.
3
, specifies equal probabilities for the numbers of species (\(1/s\)
each for \(1,2,\ldots,s\) species given \(s\) populations) and divides the
probability for any specific number of species among the compatible
models (of species delimitation and species phylogeny) uniformly.
DEPENDENCIES
The variable speciesmodelprior
must be defined if either (or both)
speciestree = 1
or speciesdelimitation = 1
. If
speciestree = 1
and speciesdelimitation = 0
then only
options speciesmodelprior = 0
or speciesmodelprior = 1
may
be used.
COMMENTS
The priors specified by options 2 and 3 are described in Yang and Rannala 2014.
The prior specified by option 3 may be suitable when there are many populations.
EXAMPLES
speciesmodelprior = 0
speciesmodelprior = 3
15 phase
phase = b*
DESCRIPTION
Specifies whether the sequence data for each species is phased.
VALUES
1
, in position \(i\) of the list indicates that data is unphased for
species \(i\) in the list of species in variable species&tree
,
otherwise 0
indicates it is phased.
DEFAULT
0
DEPENDENCIES
The number of boolean (0,1) variables in the list must equal the number
of species d+
specified in the species&tree
variable.
COMMENTS
If at position \(i\) the phase
variable is 1, each sequence from
species \(i\) is treated as an unphased diploid sequence, with
heterozygous sites represented using the ambiguity characters YRMKSW
.
The missing data tokens N-?
are allowed and are all treated as ?, but
other ambiguity characters (HBVD
) are not allowed.
For phasing, each sequence is expanded/resolved into two sequences, and
the program averages over possible phase resolutions according to the
approach of @Gronau2011. For example, a sequence R...Y
with two
heterozygous sites, R
(meaning A/G
) and Y
(meaning C/T
), has two
possible phase resolutions:
-
A...T
andG...C
-
A...C
andG...T
A sequence with no heterozygous sites will be phased/resolved into two
identical sequences. The MSC model effectively averages over different
phase resolutions of the heterozygous sites in performing the likelihood
calculation. The option may increase memory usage and CPU time
considerably if many sequences exist with many heterozygote sites at a
locus. If the phase
variable for a species is 0, all sequences
from that species are treated as resolved haplotype sequences, and
ambiguities are interpreted in the usual way. For example, an ambiguity
code Y
at a position is interpreted to mean that one sequence exists
with an uncertain nucleotide at that position that is either a T or a C.
The program does not allow some sequences from a species to be phased
and other sequences from that same species to be unphased.
EXAMPLES
phase = 0 0 0 0 1
phase = 1 1 1 1 1
16 nloci
nloci = +d
DESCRIPTION
Specifies the number of loci present in the sequence data file specified
by variable seqfile
that will be analysed.
VALUES
+d
, a positive integer specifying the number of loci to be
analyzed.
DEFAULT
If nloci
is not specified, all the loci present in the sequence
file specified by seqfile
will be analysed.
DEPENDENCIES
nloci
must be less than or equal to the number of loci available
in the sequence file specified by seqfile
.
COMMENTS
If nloci
is less than the number of loci present in the file
specified by seqfile
only the first nloci
present in that
file will be analysed.
EXAMPLES
nloci = 1
nloci = 2000
17 model
model = s [s]
DESCRIPTION
Specifies the DNA substitution model (or amino acid substitution model)
that will be used. For DNA sequences the simplest (default) model that
can be chosen is the Jukes-Cantor (JC69) model and the most complex is
the General Time-Reversible (GTR) model. A range of models of
intermediate complexity are also available. For amino acid sequences,
most of the commonly used amino acid substitution matrix models are
available. If a model other than Custom
is specified it will apply to
all loci. If Custom filename
is specified, an additional file
filename
must be provided that contains one or more lines with the
format:
loci_indices data_type model
where loci_indices
is an index or range of indexes for loci (indexed
according to their order of occurrence in the sequence data file),
data_type
is either DNA
or AA
, and model specifies the
substitution model variable for the indexed loci as described above. For
example, the entry
1-10 DNA JC69
11 AA DAYHOFF
specifies that loci 1 to 10 comprise DNA sequences that will have the
JC69
model applied to them, while locus 11 is an amino acid sequence
that will have the DAYHOFF
model applied to it.
VALUES
For DNA sequences: JC69
, K80
, F81
, HKY
,
T92
, TN93
, F84
, GTR
.
For amino acid sequences: DAYHOFF
, LG
, DCMUT
,
JTT
, MTREV
, WAG
, RTREV
, CPREV
, VT
,
BLOSUM62
, MTMAM
, MTART
, MTZOA
, PMB
,
HIVB
, HIVW
, JTTDCMUT
, FLU
, and STMTREV
.
Among-locus model variation: Custom filename
.
DEFAULT
JC69
DEPENDENCIES
If model = GTR
the variables Qrates
and basefreqs
can
be optionally defined which provide priors on the rate matrix and base
frequencies, respectively (default priors will be used otherwise).
EXAMPLES
model = GTR
model = CPREV
model = Custom models.txt
18 Qrates
Qrates = b +f +f +f +f +f +f
DESCRIPTION
Specifies whether the exchangeability parameters in the GTR model are
fixed or variable and either their fixed values or prior relative mean
values.
VALUES
b
, specifies either fixed (1) or variable (0) exchangeability
parameters in the GTR model.
+f +f +f +f +f +f
, specify the exchangeability parameters in the
order \(a, b, c, d, e, f\) for TC, TA, TG, CA, CG, and AG, as described in
Yang 2014.
DEFAULT
0 1 2 1 1 2 1
DEPENDENCIES
The variable Qrates
is defined only if model = GTR
.
COMMENTS
Two example cases are discussed here with either fixed, or variable
rates. The following specifies fixed rates:
Qrates = 1 2 1 1 1 1 2
This specifies fixed rates at \(a = 2, b = c = d = e = 1\) and \(f = 2\), so that the transition rate is twice as high as the transversion rate, and the model corresponds to K80 or HKY. Note that only the relative rates matter, because the rate matrix (\(Q\)) is scaled so that the average rate (over the base frequencies) is 1 and branch lengths are measured in the expected number of mutations/substitutions per site. Nevertheless the program expects six rate parameters.
Qrates = 0 10 5 5 5 5 10
This specifies variable rates with a prior on \(a, b, c, d, e\) that is a
Dirichlet distribution for every locus. The above specifies Dir(10, 5,
5, 5, 5, 10), with parameters
\(\alpha_a = 10, \alpha_b = \alpha_c = \alpha_d = \alpha_e = 10\), and
\(\alpha_f = 10\). The rates generated from the Dirichlet sum to 1, and
they are rescaled. Note that larger values for those parameters mean
less variance, while the mean for \(a\), say, is given by
\(\alpha_a/\alpha\) with
\(\alpha = \alpha_a + \alpha_b + \alpha_c + \alpha_d + \alpha_e +
\alpha_f\). This specifies an average transition/transversion rate ratio
of 2 (if base frequencies are all fixed at \(\frac{1}{4}\)).
EXAMPLES
Qrates = 1 2 1 1 1 1 2
Qrates = 0 1 1 1 1 1 1
19 basefreqs
basefreqs = b +f +f +f +f
DESCRIPTION
Specifies whether the base frequency parameters are fixed or variable
and either their fixed values or prior relative mean values.
VALUES
b
, specifies either fixed (1) or variable (0) base frequency
parameters in the GTR model.
+f +f +f +f
, specify the base frequency parameters in the order
TCAG.
DEFAULT
0 1 1 1 1
DEPENDENCIES
The variable basefreqs
is defined only if model = GTR
.
COMMENTS
The parameters \(\pi_T, \pi_C, \pi_A, \pi_G\) can be either fixed or
variable with a prior specified by the Dirichlet distribution with
\(\alpha\) parameters specifying the relative mean frequencies.
EXAMPLES
basefreqs = 1 0.15 0.35 0.15 0.35
basefreqs = 0 10 10 10 10
20 alphaprior
alphaprior = f f +d
DESCRIPTION
Specifies the prior probability density for the shape parameter
controlling the pattern of among-site rate variation.
VALUES
f f +d
, specify the \(\alpha_p\) and \(\beta_p\) parameters of the
prior on the Gamma shape parameter and the number of rate categories
ncatG, respectively.
DEFAULT
\(\alpha=\infty\) (no rate variation among sites)
COMMENTS
Among-site rate variation is assumed to follow a Gamma distribution with
a mean of 1 and shape parameter \(\alpha\). The variance of rates among
sites is \(1/\alpha\) and therefore a smaller \(\alpha\) specifies more rate
variation. A discrete distribution is used to efficiently approximate
the Gamma distribution and ncatG is the number of rate categories used
in the approximation -- more rate categories provide a better
approximation but incur greater computational expense. A value for ncatG
of 4 is recommended. The prior on \(\alpha\) is also a Gamma distribution
with parameters \(\alpha_p\) and \(\beta_p\). The variance and mean of the
prior on \(\alpha\) are:
\(\textrm{Mean}(\alpha) = \frac{\alpha_p}{\beta_p}\),
\(\textrm{Var}(\alpha) = \frac{\alpha_p}{\beta_p^2}\)
EXAMPLES
alphaprior = 1 1 4
21 cleandata
cleandata = b
DESCRIPTION
Specifies whether columns in the alignment which have gaps or ambiguity
characters will be removed prior to analysis, or will instead be
retained for use in the likelihood calculation.
VALUES
0
, specifies that columns in the alignment which have gaps or
ambiguity characters will be treated as missing data and used in the
likelihood calculation.
1
, specifies the removal of columns in the alignment which have
gaps or ambiguity characters.
DEFAULT
0
EXAMPLES
cleandata = 1
cleandata = 0
22 thetaprior
thetaprior = s [(f*, f f s)]
DESCRIPTION
Specifies the distribution model and parameters of the prior
distribution on the contemporary and ancestral population parameters
\(\theta\) and whether the MCMC integrates over the parameters or the
integration is instead performed analytically.
VALUES
s
, specifies the form of the prior distribution on theta and
should be either invgamma
, gamma
or beta
to specify either an
inverse gamma, gamma, or beta distribution, respectively.
[(f*, f f s)]
, specifies the parameters of the prior distribution.
The form for the parameters depends on the type of distribution
specified in the first argument. The permissible combinations of
distributions and parameters are:
invgamma a b
invgamma a b e
which specifies an inverse gamma prior with \(\alpha\) and \(\beta\) parameters a and b, respectively. The first entry specifies that \(\theta\)s are integrated over analytically and the second entry (with trailing e) specifies that \(\theta\)s are estimated.
gamma a b
which specifies a gamma prior with \(\alpha\) and \(\beta\) parameters a and b, respectively.
beta p q l u
which specifies a beta prior constrained to the interval \((l,u)\) with
parameters p and q, respectively. All the distributions above, except
the inverse Gamma distribution (without the e argument) integrate over
the \(\theta\) parameters numerically as part of the MCMC, estimating the
posterior distribution of \(\theta\) for each population.
COMMENTS
The inverse-gamma prior invgamma a b
, where a and b are the parameters
\(\alpha\) and \(\beta\), has mean and variance:
\(\textrm{Mean}(\theta) = \frac{\beta}{\alpha - 1}\),
\(\textrm{Var}(\theta) = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}\).
For example, if \(\alpha=3\) and \(\beta=0.002\) the mean is \(0.002/(3 – 1) = 0.001\) (one variable site per kb on average). Note that all \(\theta\) parameters in the MSC model (for both modern species and extinct ancestral species) are assigned the same specified prior distribution with identical parameters. The inverse-gamma is a conjugate prior for \(\theta\) (Hey and Nielsen, 2007), which means that both the prior and the posterior of \(\theta\) will be inverse-gamma. Use of the conjugate prior allows the \(\theta\) parameters to be integrated out analytically, and thus the dimension of the parameter space is reduced. This typically leads to improved mixing of the MCMC. However, with this option, the posterior of the \(\theta\) parameters will not be produced. To estimate the \(\theta\) parameters when using an inverse gamma prior, add the letter e (or E) on the line, as follows
thetaprior = 3 0.002 e
Whether \(\theta\)s are integrated out or estimated, other results (such
as the posterior probability of species trees or species-delimitation
models) should be identical if the same prior is used. The analytical
integration of \(\theta\) is only possible with an inverse gamma prior
distribution, if any other prior is used the \(\theta\)s are automatically
estimated and the E flag is not needed and should not be used. The gamma
prior gamma a b
, where a and b are the parameters \(\alpha\) and
\(\beta\), has mean and variance:
\(\textrm{Mean}(\theta) = \frac{\alpha}{\beta}\),
\(\textrm{Var}(\theta) = \frac{\alpha}{\beta^2}\).
For example, if \(\alpha=0.001\) and \(\beta=1\) the mean is
\(0.001/1 = 0.001\) (one variable site per kb on average). The beta prior
beta p q l u
has mean and variance:
\(\textrm{Mean}(\theta) = \frac{p u + q l}{p + q}\),
\(\textrm{Var}(\theta) = \frac{p q (u - l)^2}{(p + q)^2 (p + q + 1)}\).
For example, with beta 2 18 1e-6 0.1
the mean is:
\(\frac{2(0.1)+18(10^{-6})}{2+18} = 0.0100009\)
which is 10 variable sites per kb on average.
EXAMPLES
thetaprior = invgamma 3 0.002
thetaprior = invgamma 4 0.001 e
thetaprior = beta 2 18 1e-6 0.1
thetaprior = beta 2 18 1e-6 0.01
thetaprior = beta 1 10 0.0001 0.2
thetaprior = gamma 0.001 1
23 tauprior
tauprior = s f f
DESCRIPTION
Specifies the prior probability distribution of the divergence time
parameter for the root in the species tree.
VALUES\
s f f
, specifies either the gamma or inverse gamma prior
distribution and the \(\alpha\) and \(\beta\) parameters, respectively for
\(\tau_0\), the divergence time parameter for the root in the species
tree.
gamma a b
invgamma a b
specifies a gamma distribution, or an inverse gamma distribution,
respectively with a and b to be the parameters \(\alpha\) and \(\beta\),
respectively.
COMMENTS
If the specified prior on \(\tau_0\), the age of the root of the tree is
an inverse-gamma distribution IG(\(\alpha, \beta\)) with \(\alpha > 2\) the
mean and variance are:
\(\textrm{Mean}(\tau_0) = \frac{\beta}{\alpha - 1}\),
\(\textrm{Var}(\tau_0) = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}\).
If the prior is a gamma distribution the mean and variance are:
\(\textrm{Mean}(\tau_0) = \frac{\alpha}{\beta}\),
\(\textrm{Var}(\tau_0) = \frac{\alpha}{\beta^2}\).
The remaining species divergence times are generated from the uniform
Dirichlet distribution [@Yang2010 eq. 2]. As an example, suppose that
tauprior = 3 0.03
specifies the prior for \(\tau_0\), the divergence
time parameter for the root in the species tree. The mean of \(\tau_0\) is
then \(0.03/(3 – 1) = 0.015\) (which means 1.5% of sequence divergence
between the root of the species tree and the present time). If the
mutation rate is \(10^{–9}\) mutations/site/year, this distance will
translate to an absolute prior mean divergence time of 15 MY.
EXAMPLES
tauprior = invgamma 3 0.03
tauprior = gamma 0.01 1
24 phiprior
phiprior = f f
DESCRIPTION
Specifies the parameters of the prior distribution on \(\varphi\), the
introgression probability under the MSC-I model.
VALUES
f f
, specify the parameters \(\alpha\) and \(\beta\), respectively, of
the beta distribution prior \(\textrm{B}(\alpha,\beta)\) for \(\varphi\).
DEPENDENCIES
If the species&tree
variable t
specifies a tree with
introgression nodes then phiprior
must be defined.
COMMENTS
The introgression probability \(\varphi\) takes values on the interval
\((0,1)\) and the prior distribution for \(\varphi\) is assumed to be beta
\(\textrm{B}(\alpha, \beta)\) which has mean and variance:
\(\textrm{Mean}(\varphi) = \frac{\alpha}{\alpha+\beta}\),
\(\textrm{Var}(\varphi) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\).
For example, phiprior = 1 1
specifies the beta prior
\(\textrm{B}(\alpha,\beta)\) with \(\alpha = 1\) and \(\beta = 1\) for \(\varphi\), which has a mean of \(1/2\) and a variance of \(1/12\).
EXAMPLES
phiprior = 1 1
phiprior = 1.5 1.5
25 locusrate
locusrate = d+[(f f f s, s)]
DESCRIPTION
Specifies the model of substitution rate variation among loci, as well
as the parameters, and the prior on the parameters of the model.
VALUES
d+[(f f f s, s)]
, specifies the model and parameters. The variable
d+ takes one of 4 possible values with different numbers of additional
variables depending on d+. The permissible combinations are:
d+ | [args] | Description |
---|---|---|
0 | Loci have identical rates | |
1 | f f f s | Locus rates variable and estimated |
\(\alpha_{\bar{\mu}} \,\, \beta_{\bar{\mu}} \,\, \alpha_{\mu_i}\) prior | ||
2 | s | Locus rates variable and specified |
Filename of file containing locus rates | ||
3 | f f | Loci have identical rates. The rate is estimated using tip-dating |
\(\alpha\) \(\beta\) |
where \(\alpha_{\bar{\mu}}\) and \(\beta_{\bar{\mu}}\) are the parameters of
the Gamma distribution prior on the average rate among loci \(\bar{\mu}\)
and \(\alpha_{\mu_i}\) is the parameter of the prior on the rate at the
\(i\)th locus given \(\bar{\mu}\). The prior variable must be either iid
for a conditional iid (hierarchical) prior on \(\mu_i\) or dir
for a
Gamma-Dirichlet prior. The prior variable is optional, if it is not
specified iid
is used.
DEFAULT
0
DEPENDENCIES
If both the prior
option for the variable locusrate
and the prior
option for the variable clock
are set, they should take the same
value. If the locusrate = 3 f f
option is used, datefile
much be specified.
COMMENTS
The setting locusrate = 0
(default) means that all loci have the same
mutation rate. Setting locusrate = 1 f f f s
specifies a model with
rate variation among loci in which rates are estimated from the sequence
data. Including the first integer (1) which species the model there are
5 arguments, the last one (prior) being optional. Parameters
\(\alpha_{\bar{\mu}}\) and \(\beta_{\bar{\mu}}\) specify the shape and rate
parameters of the gamma distribution for the mean rate across loci
(\(\bar\mu\)), while \(\alpha_{\mu_i}\) and prior are used to specify the
locus rates (\(\mu_i\)) given the mean rate (\(\bar\mu\)). The option prior
can take two values dir (for Gamma-Dirichlet rates for loci), and iid
(for conditional i.i.d. or hierarchical prior).
locusrate = 2 LocusRateFileName
specifies the fixed-rates model of
locus-rate variation (Burgess and Yang 2008). This is the strategy used by
Yang 2002, with the relative rates estimated by the distance to an
outgroup species. The relative locus rates are listed in the file: there
should be as many numbers in the file, separately by spaces or line
returns, as the number of loci (nloci). The program re-scales those
rates so that the average across all loci is 1 and then use those
relative rates as fixed constants. Specifically the mean rate across
loci (\(\bar{\mu}\)) is assigned a gamma prior:
\(\bar{\mu} = \mathrm{G}(\alpha_{\bar{\mu}},\beta_{\bar{\mu}}),\)
with mean and variance:
\(\textrm{Mean}(\bar{\mu}) = \frac{\alpha_{\bar{\mu}}}{\beta_{\bar{\mu}}}\),
\(\textrm{Var}(\bar{\mu}) = \frac{\alpha_{\bar{\mu}}}{\beta_{\bar{\mu}}^2}\).
When there are no fossil calibrations in the species tree, the rates
should all be relative. In this case we suggest fixing
\(\alpha_{\bar{\mu}}=\beta_{\bar{\mu}}=0\) (with 0 causing the program to
fix both parameters at \(\infty\)), and the program will fix the mean rate
across loci at \(\bar\mu = 1\). Otherwise one can use equal and large
values such as \(\alpha_{\bar{\mu}}=100\) and \(\beta_{\bar{\mu}}=100\), so
that \(\bar\mu\) is nearly fixed at 1. Given the mean rate \(\bar\mu\), two
priors are available to specify the locus rates \(\mu_i\), with
\(i = 1, 2, \cdots, L\), where \(L\) is the number of loci (nloci
). If
prior = dir
(for Gamma-Dirichlet distribution of locus rates), the
total rate \(L\bar\mu\), given the mean rate (\(\bar\mu\)), is partitioned
into locus rates (\(\mu_i\)), using the concentration parameter
\(\alpha_{\mu_i}\) The model and notation follow Burgess and Yang 2008 [eq. 4] and
Dos Reis et al 2014 [eqs. 3-5]. If prior = iid
(for conditional-i.i.d. or
hierarchical prior of locus rates), the locus rates (\(\mu_i\)) are
i.i.d. given the mean rate (\(\bar\mu\)):
\(\mu_i \sim \mathrm{G}(\alpha_{\mu_i},\alpha_{\mu_i}/\bar{\mu})\).
This model is described in Zhu et al 2015 [eq. 8] and is also implemented in [mcmctree]{.smallcaps}. In both the Gamma-Dirichlet and the conditional i.i.d. models, parameter \(\alpha_{\mu_i}\) is inversely related to the extent of rate variation among loci, with a large \(\alpha_{\mu_i}\) meaning similar rates among loci. If all loci are noncoding, the rates are probably similar, so \(\alpha_{\mu_i} =\) 10 or 20 may be reasonable, while for coding loci or exons, \(\alpha_{\mu_i} =\) 2 or 1 may be appropriate. The \(\alpha_{\mu_i}\) parameter may affect the estimates of the population size parameter (\(\theta\)) for the root node on the species tree (Burgess and Yang 2008). The \(L\) locus rates (\(\mu_i\)) are parameters in the model. If \(\alpha_{\bar{\mu}} > 0\), the mean rate (\(\bar\mu\)) is a parameter as well.
The setting locusrate = 3 f f
means the all loci have the same mutation rate, and the rate is estimated using tip-dating.
The rate prior is \(\Gamma (\alpha, \beta)\).
The units of the prior should match the units in the datefile
.
For example, if the dates are specified in years, the rate should be in expected number of substitutions per year.
EXAMPLES
locusrate = 0
locusrate = 2 rates.txt
locusrate = 1 0 0 2
locusrate = 1 2 3 2 dir
locusrate = 1 1 1 1 iid
locusrate = 3 20 1000000
26 clock
clock = +d [f f f s s]
DESCRIPTION
Specifies whether a strict molecular clock (equal substitution rates
among lineages) is used or instead a specific variable clock model
(allowing variation in substitution rates among lineages). If a variable
clock model is specified, the priors on the parameters of the variable
clock model are also specified.
VALUES
+d
, specifies a strict clock (1), a variable clock with
independent rates among branches (2), or a variable clock with
autocorrelated rates between ancestral and descendent branches (3).
f f f s s
, are required when clock =
2 or 3 and specify
respectively the parameters \(\alpha_{\bar{\nu}}\), \(\beta_{\bar{\nu}}\)
and \(\alpha_{\nu_i}\), the prior distribution for the locus rate
(\(\mu_i\)) which is either dir
for Dirichlet or iid
for conditional
i.i.d (hierarchical) prior, and the distribution for the branch rates
(\(r_{ij}\)) given the locus rate (\(\mu_i\)) which is either G
for Gamma
or LN
for log-normal. Parameters \(\alpha_{\bar{\nu}}\) and
\(\beta_{\bar{\nu}}\) are the parameters of the gamma distribution for the
average of variances across loci (\(\bar\nu\)), while \(\alpha_{\nu_i}\) and
prior
are used to specify the variance (\(\nu_i\)) for locus \(i\) given
the average (\(\bar\nu\)). The specification of the distribution of
\(\nu_i\) given \(\bar\nu\) follows the same procedure as the specification
of the distribution of \(\mu_i\) given the mean rate \(\bar\mu\) when
modeling among-locus rate variation; see notes above about the
locusrate
variable.
DEFAULT
1
DEPENDENCIES
If both the prior
option for the variable locusrate
and the prior
option for the variable clock
are set, they should take the same
value.
COMMENTS
As an example, the specification (clock = 2 10.0 100.0 5.0 iid G
)
specifies a local-clock model in which the mutation rate drifts over
branches, independently among loci. A detailed explanation is as
follows. First \(\bar\nu\) is assigned a gamma distribution
\(G(10.0, 100.0)\), with mean 0.1. Given \(\bar\nu\), the conditional
i.i.d. prior means that
\(\nu_i \sim G(\alpha_{\nu_i}, \alpha_{\nu_i}/\bar\nu)\) with shape
parameter \(\alpha_{\nu_i} = 5.0\) and mean \(\bar\nu\), for
\(i = 1, 2, \cdots, L\). Note that \(\alpha_{\nu_i}\) is inversely related
to the variance of \(\nu_i\): use small values of \(\alpha_{\nu_i}\) (2, 1,
or 0.5) if you believe that \(\nu_i\) varies among loci (meaning that the
clock nearly holds at some loci but is seriously violated at others).
As another example, the specification
(clock = 2 10.0 100.0 5.0 dir LN
) means that \(\bar\nu\) is assigned a
gamma prior \(G(10.0, 100.0)\), and then the sum \(L\bar\nu\) is partitioned
into \(\nu_i\) (for \(i = 1, 2, \cdots, L\)) according to the Dirichlet
distribution (prior = dir
) with concentration parameter
\(\alpha_{\nu_i} = 5.0\). Again large \(\alpha_{\nu_i}\) means the same
extent of clock violation at different loci.
Given the locus rate \(\mu_i\) (specified using the locusrate
variable)
and the variance \(\nu_i\) (specified using the clock
variable) for
locus \(i\), the different lineages may have different rates at the locus,
and those rates are independent among loci. If distribution specified is
G
(for gamma), the rate for species-tree branch \(j\) at locus \(i\) has
the following gamma distribution
\(r_{ij} | \mu_i, \nu_i \sim \mathrm{G}(\mu_i^2/\nu_i, \mu_i/\nu_i)\).
This has mean \(\mu_i\) and variance \(\nu_i\). Alternatively if
distribution = LN
(for log-normal), the rate for species-tree branch
\(j\) at locus \(i\) has the following the log-normal distribution
\(r_{ij} | \mu_i, \nu_i \sim \mathrm{LN}(\mu_i, \nu_i)\),
where \(\mu_i\) is the mean of the LN distribution and \(\nu_i\) is the variance parameter of the lognormal. Note that the rate-drift model specifies rates for branches on the species tree (rather than on the gene tree) for each locus, and gene-tree branches residing in the same population or species have the same rate. For example, if all sequences at a locus are from the same species and all coalescent events occur in that species (before reaching an ancestral species), all branches on the gene tree will have the same rate even if the relaxed-clock model allows rates among species. In contrast, if a gene-tree branch passes several species or populations, the different segments of the branch will have different rates. The branch length on the gene tree is calculated by summing up the lengths of those segments (with the length of each segment being the product of the rate and the time duration for the segment).
The option clock = 3
is similar to clock 2 but specifies the
autocorrelated-rates model. clock = 3
with the log-normal distribution
(LN) specifies the geometric Brownian motion model of Rannala and Yang 2007. This
assigns a rate to each species-tree branch, that is, to the mid-point of
the branch. Given the rate at the species-tree root (\(\mu_i\) at locus
\(i\)), the rates for the two branches around the root are specified. Then
given the rate for each ancestral branch, the rates for its two daughter
branches are specified, by integrating over the rate at the internal
node that is ancestral to the daughter branches. See figure 1 and
equations 3-8 in Rannala and Yang 2007. The rates for all species-tree branches
are thus assigned through a pre-order tree traversal, starting from the
root moving to the tips, until all branches are visited. If clock = 3
is specified with the gamma distribution (option G
), the model works
as follows, using a similar pre-order tree traversal. First the two
branches at the species-tree root have the gamma distribution with mean
\(\mu_i\) and variance \(\nu_i\). Then given the rate for each ancestral
branch, the rates for its two daughter branches are specified as
independent gamma variables with the mean to be the rate of the parental
branch and with the variance to be \(\nu_i\). Clock 3 is currently
implemented for the MSC model only and is unavailable under the MSC-I
model.
Note that a larger variance \(\nu_i\) implies a more serious violation of
the molecular clock at locus \(i\). Also note that \(\nu_i\) will be similar
to \(\bar\nu\), especially if \(\alpha_{\nu_i}\) is large, and \(\bar\nu\) has
prior mean \(\alpha_{\bar{\nu}} / \beta_{\bar{\nu}}\). If
\(\alpha_{\bar{\nu}} = 10\) and \(\beta_{\bar{\nu}} = 100\), the prior mean
will be 0.1. Thus, for the log-normal model \(\nu = 0.5\) represents a
serious violation of the clock while \(\nu < 0.1\) represents only a
slight violation.
EXAMPLES
clock = 1
clock = 2 10.0 100.0 5.0 dir LN
clock = 3 10.0 50.0 3.0 dir G
27 heredity
heredity = +d [(f f, s, )]
DESCRIPTION
Specifies whether proportional differences exist between all \(\theta\)
values for different loci, and if so whether the proportionality
multipliers (called inheritance scalars) are estimated from the data, or
set to specified fixed values. If inheritance scalars are estimated, the
parameters of the prior also must be specified.
VALUES
+d
, specifies whether no differences in \(\theta\) exist among loci
(0), proportional differences exist and inheritance scalars are
estimated (1), or proportional differences exist and inheritance scalars
are fixed to specified values (2).
f f
, is only specified when +d
is 1 and specifies the \(\alpha\)
and \(\beta\) parameters, respectively, of the Gamma prior on the scalar
\(s_i\) for locus \(i\).
s
, is only specified when +d
is 2 and specifies the name of a
file containing \(L\) (number of loci) heredity scalars (which must be
positive numbers) separated by spaces or linebreaks.
DEFAULT
0
COMMENTS
heredity = 0
is the default and constrains \(\theta\) to be equal across
loci. heredity = 1
or 2
specifies two models that allow \(\theta\) to
vary among loci, which may be useful for analyses combining data from
autosomal loci, loci on sex chromosomes, and/or mitochondrial loci. Such
mixed data, may be expected to have proportional differences in
effective population sizes among loci so that inheritance scalars
Hey and Nielsen 2004 should be applied. For example, in diploid sexual organisms
with strict maternal inheritance of mtDNA we expect the effective
population size of a mitochondrial locus to be \(1/4\) that of a nuclear
locus. Other factors such as natural selection may also cause \(\theta\)
to deviate from the neutral expectation even among autosomal loci.
BPP implements two options for allowing such variations.
The first option (heredity = 1
) specifies that locus-specific
inheritance scalars \(s_i\) be estimated, using a Gamma prior with
parameters \(\alpha\) and \(\beta\) specified by the user. The prior mean
and variance of \(s_i\) are:
\(\textrm{Mean}(s_i) = \frac{\alpha}{\beta}\),
\(\textrm{Var}(s_i) = \frac{\alpha}{\beta^2}\).
For example, (heredity = 1 4 4)
specifies a Gamma prior
\(\textrm{G}(4, 4)\), with mean \(4/4 = 1\) and variance \(1/4\), for the
inheritance scalar at each locus. The MCMC then generates the posterior
distribution of \(s_i\) for each locus. The second option (heredity = 2
)
allows the user to specify the \(s_i\) in a file so they remain fixed
constants during the MCMC run.
EXAMPLES
heredity = 0
heredity = 1 4 4
heredity = 2 scalars.txt
28 checkpoint
checkpoint = +d [(+d)]
DESCRIPTION
Specifies whether one or more checkpoint files are created to allow the
BPP program to resume using the current state of the MCMC at the
iteration during which the checkpoint file was created.
VALUES
+d[(+d)]
, specifies the iteration at which the first checkpoint
file is created (first argument) and how frequently checkpoint files are
subsequently created (second optional argument).
DEFAULT
If the checkpoint value is undefined no checkpoint file is created.
COMMENTS
The checkpoint files are named JOBNAME.Z.chk
where JOBNAME
is the name
specified by the jobname
variable and Z
is the number of the
checkpoint file (the first checkpoint file is labeled 1, the second 2,
and so on). To resume execution of the program starting from a
checkpoint use the BPP –resume
switch followed by the checkpoint file
name. For example, bpp –resume outrun1.1.chk
.
EXAMPLES
checkpoint = 100000
checkpoint = 100000 10000
29 constraintfile
constraintfile = s
DESCRIPTION
Specifies the name of a file containing information for placing
topological constraints on inferred trees and specifying outgroups for
tree rooting.
VALUES
s
, a string specifying the name of the constraint file.
DEFAULT
If constraintfile
is undefined no constraints or outgroups are used in
the analysis.
EXAMPLES
constraintfile = myconstraints.txt
30 threads
threads = +d [+d +d]
DESCRIPTION
Specifies the number of CPU threads to be used and optionally the
specific threads to be used.
VALUES
+d [+d +d\]
, either 1 or 3 positive integers. A single integer
argument specifies the number of threads to use. If three integers are
given the first specifies the number of threads, the second specifies
the index of the first thread used and the third specifies the
increment. For example, 12 4 1
specifies that 12 threads should be
used, the first with index 4 and incrementing by one so that threads
4-15 are used.
DEFAULT
If threads
is undefined, the calculations for each locus are placed on
different threads when possible.
DEPENDENCIES
The number of specified threads cannot exceed the number of loci
specified.
EXAMPLES
threads = 4
threads = 8 4 1
31 datefile
datefile = s
DESCRIPTION
Sets the path/name of the date file to be the string s
VALUES
s
, a string specifying the name of the date file.
DEFAULT
If datefile
is undefined, no tip dates are used in
the analysis.
DEPENDENCIES
Model A00 must be used with tip-dating (speciesdelimitation = 0
, speciestree = 0
).
Checkpointing cannoted be used.
The tip-dating locusrate
option much be used, locusrate = 3 d d
.
A global clock must be used.
This is the default or can be set with clock = 0
.
COMMENTS
See Date File for a complete dscription of the date file format.
EXAMPLES
datefile = dates.txt
datefile = /home/foo/seqDates.txt
Example control file
To examine the structure of a typical BPP control file, we consider the
example file A00.bpp.ctl
contained in the BPP distribution
subdirectory examples/frogs
. The goal here is to provide an example of a
correctly formatted control file and briefly explain the syntax and
meaning of specified variables, as well as proposing some conventions
for writing optional comments in control files. An exhaustive
description of all control file variables, their effects, and
permissible values is provided in the definitions of
BPP Control File Variables.
In depth descriptions of specific model and prior choices can be found in
Substitution Models, Introgression and Migration Models, and Isolation with Migration Models.
Control files illustrating the 4 methods of analysis A00, A01, A10 and A11 are found in
Methods of Analysis. The content of the control file A00.bpp.ctl
is displayed below:
seed = -1
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
# fixed species tree and delimitation
speciesdelimitation = 0
speciestree = 0
species&tree = 4 K C L H
9 7 14 2
(((K, C), L), H);
# sequence data are unphased for all 4 populations
phase = 1 1 1 1
# 0: no data (prior); 1:seq like
usedata = 1
# number of data sets in seqfile
nloci = 5
# remove sites with ambiguity data (1:yes, 0:no)?
cleandata = 0
# invgamma(a, b) for theta
thetaprior = invgamma 3 0.004 E
# invgamma(a, b) for root tau & Dirichlet(a) for other tau's
tauprior = invgamma 3 0.002
* heredity = 1 4 4
* locusrate = 1 5
# finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
# MCMC samples, locusrate, heredityscalars, Genetrees
print = 1 0 0 0
burnin = 8000
sampfreq = 2
nsample = 100000
A comment about comments
Optional comments may be added to the control file. If either a '#' or
a '*' symbol occurs on a line, everything following the symbol up to
the end of the line will be ignored by the BPP program. This allows
optional comments to be added by the user either to explain choices (or
meanings) of variables, or to retain optional alternative variable
settings for future use. The control files presented in this
documentation use the different comment-initiating symbols for distinct
purposes:
-
'#' initializes a comment that lists and/or explains the possible settings of the variable found immediately below the comment.
-
'*' initializes a comment that retains deactivated variable settings (for future use). If an alternative to a defined variable it follows that variable.
It is good practice to make use of these conventions for comments in your own bpp control files.
Control file variables explained
Here we walk through the variables in the above control file example explaining the meaning of each and the valid values that can be specified. An exhaustive description of all control file variables and permissible values is given in Control File.
seed = -1
This line specifies the seed value used to initialize the random number
generator. Because BPP uses pseudorandom numbers to drive a stochastic
algorithm for inference the results will vary between analyses initiated
with different seeds, but will be identical if the same seed is used
(because pseudorandom numbers are generated using a deterministic
algorithm started with the value of seed). Setting seed = -1
specifies
that BPP use the clock on the computer to generate a seed. With this
setting, every run of the program will automatically be started with a
different seed. The seed that was used for a run is stored in a file
generated by the program called SeedUsed.
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
These lines specify the names of the files used to run the program and
summarize the output. The program will assume that all these files are
located in the current working directory since no path is specified. The
line seqfile = frogs.txt
specifies that the sequence data are in the
file named frogs.txt
. See Sequence File for a
description of the sequence data file format.
The line Imapfile = frogs.Imap.txt
specifies that the map file is
named frogs.Imap.txt
. See Imap File for
a description of the map file format. The line jobname = out
specifies
that the program output and mcmc samples are printed to files named
out.txt
and out.mcmc.txt
, respectively. The format of the output file
depends on the type of analysis being performed and is described in
Methods of Analysis.
# fixed species tree
speciesdelimitation = 0
speciestree = 0
The line speciesdelimitation = 0
specifies that the number of species
is fixed (not delimited) and the line speciestree = 0
specifies that
the species phylogeny (topology) is fixed (not estimated). This
combination specifies an analysis of type A00, which is parameter
estimation under a MSC using multiple-loci and multiple-species data on
a fixed species tree (Rannala and Yang 2003; Burgess and Yang 2008). Analysis A00 is
described in detail in A00: Demography and Divergence Time Estimation.
species&tree = 4 K C L H
9 7 14 2
(((K, C), L), H);
The line species&tree = 4 K C L H
specifies the number of
species/populations and their names. This example specifies 4 species in
the data, which are K, C, L and H. The next line 9 7 14 2
specifies
the maximum number of sequences at any locus for each species/population
(in the same order as the species listing on the previous line). This
example specifies a maximum of 9 sequences for K, 7 for C, 14 for L and
2 for H. The program interprets these numbers as follows. First, if the
number of sequences for a species is 2 or greater, \(\theta\) for that
species will be a parameter estimated by the program. Second, the sum of
the numbers of sequences over all species specifies the maximum number
of sequences at a locus. However, specifying 2 sequences for a species
when there are actually 10 sequences for that species at some loci in
the data file has no harmful effects. Note that if there are \(s\) species
in the species tree, the model will minimally contain the following
parameters: \(s – 1\) species divergence times (\(\tau\)s) and \(s – 1\)
ancestral species \(\theta\)s. A \(\theta\) parameter will also exist for
each contemporary species with at least two sequences present at some
locus. If the data are for a single species/population, the model will
contain one parameter only, \(\theta\) for that species.
The line (((K, C), L), H);
specifies the (fixed) species tree in
Newick format ending with a semicolon (;). This example specifies a tree
with no introgression (the MSC model). Specification of the MSC-I model
with introgression is different (see
Introgression and Migration Models for a description of the MSC-I model). The
Newick and extended Newick formats for specifying phylogenetic trees are
described in Introgression and Migration Models and a good overview is found at
https://en.wikipedia.org/wiki/Newick_format.
# sequence data are unphased for all 4 populations
phase = 1 1 1 1
The phase
variable is specified using a boolean (0/1) indicator
variable for each species/population, with 0 (the default) specifying
that sequences from that species are fully phased haplotypes and 1
specifying they are unphased diploid sequences. If this line is omitted,
all sequences are assumed to be fully phased (that is, value 0 for all
species). The program does not allow some sequences from a species to be
phased while other sequences from that same species are unphased. If the
variable is 1, each sequence from that species is treated as an unphased
diploid sequence, with heterozygous sites represented using ambiguity
characters YRMKSW
. The chararcters N
, -
and ?
are allowed in the
sequence file and are all treated as ?
, but other ambiguities (HBVD)
are not allowed. Each sequence is then expanded/resolved into two
sequences, and the program averages over all the possible phase
resolutions according to the approach of Gronau et al 2011. For example a
sequence with two heterozygous sites R and Y is resolved into two
possible haplotypes: (i) A...T and G...C, and (ii) A...C and G...T.
Note that this option forces a sequence with no heterozygote sites to be
phased/resolved into two identical sequences.
The MSC model averages over different phase resolutions of the heterozygote sites in the likelihood calculation. This option may increase the memory usage and CPU time considerably if there are many sequences with many heterozygote sites at one locus. The simulation program MCcoal has been modified to simulate diploid unphased data as well. This allows one to use simulations to examine the difference between the analysis of the full data and the unphased data. If the phase flag for a species is 0, all sequences from that species are treated as resolved haplotype sequences, and ambiguities are interpreted in the usual way (for example, Y means one nucleotide that is either T or C).
# 0: no data (prior); 1:seq like
usedata = 1
The line usedata = 1
specifies that the likelihood is used in the MCMC
run to generate the posterior distribution of parameters. Setting this
to 0 allows one to run the MCMC without sequence data to generate the
prior (mostly used for debugging).
# number of data sets in seqfile
nloci = 5
The line nloci = 5
specifies the number of loci (alignments) to be 5.
There may be more loci in the sequence data file than is specified here.
For example, if 200 loci exist in the sequence data file and you specify
nloci = 2
in the control file, BPP will read the first
two loci only.
# remove sites with ambiguity data (1:yes, 0:no)?
cleandata = 0
The line cleandata = 0
specifies that columns in the alignment which
have gaps or ambiguity characters are retained and used in the
likelihood calculation. Setting cleandata = 1
instead specifies that
such data are removed prior to analysis.
# invgamma(a, b) for theta
thetaprior = invgamma 3 0.004 E
The line thetaprior = 3 0.004 E
specifies the inverse-gamma prior
IG(\(\alpha, \beta\)) for the \(\theta\) parameters, with the mean to be
\(\beta/(\alpha-1)\). In the example, the mean is \(0.004/(3 – 1) = 0.002\)
(two substitutions per kb on average) and the option E specifies that
the program analytically integrates over \(\theta\)s. Note that all
\(\theta\) parameters in the MSC model (for both modern species and
extinct ancestral species) are assigned the inverse-gamma prior with the
same parameters.
Analytical integration of \(\theta\) and the inverse gamma prior
The inverse-gamma is a conjugate prior for \(\theta\) (Hey and Nielsen 2007)
which means that both the prior and the posterior of \(\theta\) will be
inverse-gamma. Use of the conjugate priors allows the \(\theta\)
parameters to be integrated out analytically, and thus the dimension of
the parameter space is reduced. This typically leads to improved mixing
of the MCMC. However, with this option, the posterior of the \(\theta\)
parameters will not be produced. To estimate the \(\theta\) parameters,
add the letter e (or E) on the line, as follows
thetaprior = invgamma 3 0.002 e
Whether \(\theta\)s are integrated out or estimated, other results (such as the posterior of the \(\theta\) parameters or the posterior probability of species trees or species-delimitation models) should be identical. A useful strategy may be to run the A01, A10, and A11 analyses without estimating the \(\theta\) parameters, and after the MAP model (the best species tree or the best species delimitation model) is identified, run the A00 analysis with the model fixed to estimate all parameters.
Some notes about the inverse-gamma distribution. Since
BPP3.4, both the \(\theta\) and \(\tau\) parameters are
assigned the inverse gamma priors rather than the gamma priors in
version 3.3 or earlier. One difference is that the gamma is light-tailed
while the inverse-gamma is heavy-tailed, so that the inverse-gamma may
be less influential than the gamma if your prior mean is much too small.
The inverse-gamma distribution IG(\(\alpha, \beta\)) has mean
\(m = \alpha/(\beta-1)\) if \(\alpha > 1\) and variance
\(s^2 = \beta^2/[(\alpha-1)^2(\alpha-2)]\) if \(\alpha > 2\), while the
coefficient of variation is \(s/m = \sqrt{1/(\alpha-2)}\). If little
information is available about the parameters, you can use \(\alpha = 3\)
for a diffuse prior and then adjust the \(\beta\) so that the mean looks
reasonable. For example, for the human species,
\(\theta_H \approx 0.0006\), which means that two random sequences from
the human population are different at 0.06% of sites, less than 1
difference per kb. A sensible diffuse prior is then
thetaprior = 3 0.002
, with mean \(0.002/(3 – 1) = 0.001\).
# invgamma(a, b) for root tau & Dirichlet(a) for other tau's
tauprior = invgamma 3 0.002
The line tauprior = 3 0.002
specifies the inverse-gamma prior
IG(\(\alpha, \beta\)) for \(\tau_0\), the divergence time parameter for the
root in the species tree. Other divergence times are generated from the
uniform Dirichlet distribution (Yang and Rannala 2010, eq.2). In the example, the
mean is \(0.002/(3 – 1) = 0.001\) (which means 0.1% average sequence
divergence between the root of the species tree and the present time).
If the mutation rate is \(10^{–9}\) mutations/site/year, this distance
will translate to a divergence time of 1 MY.
# finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
The line finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
specifies the
step lengths used in proposals for the parameters of the MCMC algorithm.
These step lengths can affect both mixing and convergence of the MCMC.
The first value, before the colon (1 in the example), is a boolean
switch, with 1 specifying automatic adjustments of step lengths by the
program and 0 specifying no automatic adjustments. Following the colon
are the step lengths for the proposals used in the program. If you
choose to let the program adjust the step lengths, burnin has to be
>200, and the step lengths specified here will then be the initial step
lengths, with the program adjusting/optimizing step lengths using the
information collected during the burnin step. Automatic adjustment works
well in most cases and is usually the best option. Some notes about
manually adjusting the step lengths are provided below in section ?.
# MCMC samples, locusrate, heredityscalars, Genetrees
print = 1 0 0 0
burnin = 8000
sampfreq = 2
nsample = 100000
The line print = 1 0 0 0
specifies the information that is printed out
during the MCMC run. It has several boolean variables (0 for off and 1
for on) that control the printouts during the MCMC. The first variable
specifies whether the MCMC samples are written into the main sample file
(out.mcmc.txt). The second flag is for locus-specific rate parameters (rate
\(\mu_i\), variance parameter \(\nu_i\) and species-tree branch rates for
locus \(r_{ij}\)). The third flag is for locus-specific heredity scalars
if those are estimated from the data. The fourth flag is for printing
locus-specific gene trees. The fifth flag is for printing locus-specific
substitution-rate parameters (qmat for the \(Q\) matrix, freqs for base
frequencies, and alpha for gamma rates for sites).
MCMC samples are taken after the burnin (8000 iterations), and in this
example, are taken every 2 iterations, with a total of 100,000 samples
taken. The total number of MCMC iterations is burnin
+ sampfreq
\(\times\) nsample
. The resulting file can be large. For analysis A00
(speciesdelimitation = 0, speciestree = 0
), this file can be analysed
using the R package or tracer. For other analyses (A01, A10, and A11),
the dimension of the sample is changing and cannot be analysed using
such traceplot programs.
Combining results from different runs
The variable setting print = -1
specifies that BPP will
not perform an MCMC analysis and will instead read an existing MCMC
sample file and summarize the results. Thus, with setting print = 1
,
the out.mcmc.txt
file receives the MCMC program output, but with
print = -1
, it instead provides the program input. When using this
option be careful not to overwrite files that you intended to keep. The
print = -1
option is useful for combining MCMC samples from multiple
runs to produce the posterior summary. Suppose you run the same analysis
3 times in different directories -- you can merge the out.mcmc.txt files
from the three runs into one file and run the program using the
print = -1
option to summarize the posterior for the combined sample.
Note that if you forced one or more of the jobs to terminate (using the
Unix kill command for example) the last line of each out.mcmc.txt file may
incomplete. If that is the case, you should delete the incomplete line.
The header lines of any files being concatenated should also be deleted.
Do not combine the MCMC samples from different types of analyses (i.e.,
using different data files and/or variable settings or priors).
Substitution Models
BPP includes several models that allow more realistic representations of the processes underlying the evolution of DNA and amino acid sequences. Fundamental to all molecular phylogeographic analyses is the model of the nucleotide or amino acid substitution process itself. DNA substitution models can accomodate different rates of substitution between nucleotides due to factors such as transition versus transversion bias and unequal stationary nucleotide frequencies. BPP allows the use of many different DNA base substitution and amino acid substitution models. For DNA sequences, the models range from a simple Jukes-Cantor model (computationally efficient) to a sophisticated General Time-Reversible (GTR) model (computationally expensive but more realistic). For closely related species the Jukes-Cantor model is often adequate, while the GTR often performs better for more distantly related species. The remaining DNA substitution models (K80, F81, HKY, T92, TN93, and F84) are all special cases (submodels) of the GTR model and are of intermediate complexity, usually having one or two free parameters (see Yang 2014). Many fixed substitution matrix models (no free parameters) for modeling substitutions among amino acid sequences are also available.
Another factor to consider is variation in overall (or average) substitution rates. The overall rate of substitution can vary in multiple ways and, based on empirical evidence, this type of variation is one of the most important factors to accommodate during phylogeographic inference. There are 3 distinct types of substitution rate variation that can be accommodated using the models implemented in BPP: rate variation among sites, rate variation among loci, and rate variation among species (or populations). Substitution rate variation among sites, or among loci, is a common pattern and is likely due to the pervasive effects of negative (purifying) selection. Variation of substitution rates among species (or populations) is more often observed among distantly related species and may reflect the effects of differential selection, evolution of DNA repair mechanisms, or other factors.
This Chapter briefly describes the models of DNA substitution, and of substitution rate variation, implemented in BPP, the parameters and priors associated with these models, and how to specify the different models in the BPP control file.
DNA Substitution Models
Jukes-Cantor model
One of the earliest and simplest models of DNA substitution, the
Jukes-Cantor model, assumes that all possible changes between the 4
nucleotides have equal rates and thus the stationary frequencies of
bases are \(1/4\) each. This model is the default in BPP and will be used
if no model is specified in the control file. The control file variable
model
specifies the substitution model and the explicit specification
of the JC69 model is:
model = JC69
The JC69 model appears exceedingly unrealistic. However, for closely related sequences (with few multiple substitutions) it often performs as well as more sophisticated models while incurring much less computational expense. There are no parameters associated with the JC69 model (other than the substitution rate) so no additional control file variables associated with the JC69 substitution model need to be specified.
GTR model
The General Time-Reversible (GTR) model implemented in BPP allows different rates of substitution between different nucleotide pairs, with the constraint that the process is "reversible" so that, for example, the rate of transition from A to T equals the rate of transition from T to A and so on. The GTR model accommodates factors such as transition-transversion bias and should be used when sequences are more divergent (distantly related species). The control file specification of the GTR model is:
model = GTR
Note that if the GTR model is used, two other variables may be specified
in the control file: Qrates
and basefreqs
. The Qrates
variable
specifies whether the exchangeability parameters of the GTR model are
fixed or variable as well as either their fixed values, or prior mean
values, respectively. The control file specification is:
Qrates = B a b c d e f
The variable B
is either 0 (variable exchangeability parameters) or 1
(fixed exchangeability parameters). The variables a b c d e f
are
positive numbers specifying the exchangeability parameters for TC, TA,
TG, CA, CG and AG substitutions respectively (see Yang 1994). Note that
TC is the rate of either T->C or C->T substitutions. Note that only
the relative rates matter because the rate matrix is scaled to have an
overall average substitution rate of 1. Thus, for the model of fixed
rates if x is any positive number a b c d e f and xa xb xc xd xe xf are
equivalent specifications. An example with fixed rates is as follows:
Qrates = 1 2 1 1 1 1 2
which specifies \(a=2,b=c=d=e=1,f=2\) so that the transition rate is twice as high as the transversion rate, corresponding to a K80 or HKY model. In most cases, users will want to use the GTR model with exchangeability parameters variable. Using this specification, the posterior distribution of the parameters will be estimated from the data. An example with variable rates is as follows:
Qrates = 0 10 5 5 5 5 10
The absolute magnitude of the numbers does matter in this case as it determines the variance of the prior probability distribution on exchangeability parameters. The variance is inversely proportional to the absolute magnitude of the variables. The prior on the free parameters \(a, b, c, d, e\) is a Dirichlet distribution with parameters
\(\alpha_a = 10, \alpha_b = \alpha_c = \alpha_d = \alpha_e = 5, \texttt{and} \, \alpha_f = 10\)
The rates generated from the Dirichlet sum to 1 and are rescaled. If we define
\(\alpha = \alpha_a + \alpha_b + \alpha_c + \alpha_d + \alpha_e + \alpha_f\)
then the mean for \(a\) is \(\alpha_a/\alpha\) and the variance for \(a\) is
\(\textrm{Var}(a) = \frac{\frac{\alpha_a}{\alpha}\left(1-\frac{\alpha_a}{\alpha}\right)}{\alpha + 1}\).
The means and variances of the other parameters are the same (substituting subscripts). Thus, in this example, the mean for \(a\) is \(\alpha_a/\alpha = 10/40 = 1/4\) and the variance is \((1/4)(3/4)/41 = 4.6 \times 10^{-3}\). The prior specified above has an average transition-transversion rate ratio of 2 (if base frequencies are all fixed at 1/4).
The other control file variable that must be specified when using the
GTR model is the basefreqs
variable which specifies whether the
stationary nucleotide (base) frequencies are fixed, or variable, and
either their fixed values, or relative prior means, respectively. The
control file specification is:
basefreqs = B T C A G
where B is either 0 (variable base frequencies) or 1 (fixed base frequencies) and T C A and G are numbers specifying the frequencies of bases T, C, A and G, respectively. An example with fixed base frequencies is as follows:
basefreqs = 1 0.1 0.2 0.2 0.5
In most cases, users will want to use the variable base frequencies setting which specifies that the posterior distribution of base frequencies is estimated from the data. An example with variable base frequencies is as follows:
0 10 10 10 10
Here the 4 variables T C A G specify the alpha parameters of a Dirichlet
prior distribution on the nucleotide frequencies. So, the prior mean
frequency of base A in this example is \(10/40 = 1/4\) and the variance is
\((1/4)(3/4)/41 = 4.6 \times 10^{-3}\). As noted above when discussing the
Dirichlet prior for exchangeability parameters the absolute magnitude of
the variables determines the variance of the Dirichlet prior. If a GTR
model is chosen and Qrates
and basefreqs
are not specified in the
control file default values of Qrates = 0 1 2 1 1 2 1
and
basefreqs = 0 1 1 1 1
will be used.
Other DNA substitution models
A range of models are available that fall between the JC69 model and the GTR model in terms of complexity and number of parameters. The additional models available include: K80, F81, HKY, T92, TN93, and F84. The priors on the parameters of these models are pre-specified by the program.
Amino acid substitution models
There are many amino acid substitution models available. These models completely specify the substitution matrix so there are no free parameters. Many of the models are described in Yang 2014. The available models are: DAYHOFF, LG, DCMUT, JTT, MTREV, WAG, RTREV, CPREV, VT, BLOSUM62, MTMAM, MTART, MTZOA, PMB, HIVB, HIV,W, JTTDCMUT, FLU, and STMTREV.
Custom among-locus model variation
If a single substitution model is specified, all loci are assigned that
model. The control file option model = Custom filename
can be used to
assign different models among loci, where the models are specified in
the file filename
. This file should contain one or more lines with the
format:
loci_indices data_type model
where loci_indices
is an index or range of indexes for loci (indexed
according to their order of occurrence in the sequence data file),
data_type
is either DNA
or AA
, and model specifies the
substitution model variable for the indexed loci as described above. For
example, the entry
1-10 DNA JC69
11 AA DAYHOFF
specifies that loci 1 to 10 comprise DNA sequences that will have the
JC69
model applied to them, while locus 11 is an amino acid sequence
that will have the DAYHOFF
model applied to it.
Models of Substitution Rate Variation
Among-site rate variation
Substitution rates vary among sites due to many factors, including
purifying selection in coding regions. BPP allows among-site rate
variation to be accomodated using the canonical Gamma distributed rate
variation model (see Yang 1994). In this model, the mean rate among
sites is constrained to be 1 so that the expected number of
substitutions on a branch is proportional to its length (in units of
expected substitutions). If the mean of the Gamma distribution is fixed,
the only free parameter is the shape parameter \(\alpha\). The variance of
rates among sites is \(1/\alpha\) and therefore a smaller \(\alpha\)
specifies more rate variation. Gamma distributed rate variation is
specified using the control file variable alphaprior
. If this variable
is not set the default is \(\alpha = \infty\) so that the variance is zero
(rates are constant among sites). The prior on \(\alpha\) is also a Gamma
distribution with parameters \(\alpha_p\) and \(\beta_p\). The control file
specification is as follows:
alphaprior = a b ncatG
where \(a = \alpha_p\) and \(b = \beta_p\) are the parameters of the Gamma
distributed prior on the parameter \(\alpha\) and ncatG
is the number of
rate categories used to approximate the Gamma distribution. A discrete
distribution is used to efficiently approximate the Gamma distribution
and ncatG is the number of rate categories used in the approximation --
more rate categories provide a better approximation but incur greater
computational expense. A value for ncatG of 4 is recommended. The
variance and mean of the prior on \(\alpha\) are:
\(\begin{aligned} \textrm{Mean}(\alpha) & = & \frac{\alpha_p}{\beta_p}, \nonumber \\ \textrm{Var}(\alpha) & = & \frac{\alpha_p}{\beta_p^2} \nonumber\end{aligned}\)
As an example, the setting alphaprior = 1 1 4
specifies a prior on the
shape parameter \(\alpha\) (with 4 rate categories) that has a mean and
variance of 1, thus the expected variance of \(\alpha\) is also 1. This is
a fairly diffuse (uninformative) prior.
Among-locus rate variation
Empirical evidence suggests that substitution rates can vary substantially among loci, probably reflecting the influences of selection. BPP incorporates a hierarchical prior to model rate variation among loci. The models implemented in BPP allow each locus to have a different substitution rate. For \(L\) loci (alignments), there are \(L\) substitution rate parameters, where \(\mu_i\) is the rate at the \(i\)th locus.
Prior distribution of the mean rate among loci
The mean rate across loci is
\(\bar{\mu} = \frac{1}{L} \sum_{i=1}^L \mu_i\).
The prior distribution for the mean rate across loci \(\bar{\mu}\) is a Gamma distribution with parameters \(\alpha_{\bar{\mu}}\) and \(\beta_{\bar{\mu}}\) and mean and variance,
\(\begin{aligned} \textrm{Mean}(\bar{\mu}) & = & \frac{\alpha_{\bar{\mu}}}{\beta_{\bar{\mu}}}, \nonumber \\ \textrm{Var}(\bar{\mu}) & = & \frac{\alpha_{\bar{\mu}}}{\beta_{\bar{\mu}}^2}. \nonumber\end{aligned}\)
If there are no fossil calibrations in the species tree, the rates should all be relative. In this case, to avoid overparameterization we suggest fixing the mean substitution rate across loci to be \(\bar\mu = 1\). This is done by setting \(\alpha_{\bar{\mu}} = \beta_{\bar{\mu}} = \infty\) (see discussion of control file settings below), or setting both these parameters to large values such as \(\alpha_{\bar{\mu}} = \beta_{\bar{\mu}} = 100\), in which case \(\bar{\mu}\) is nearly fixed.
Prior distribution for locus-specific rates given mean rate among loci
Conditioning on the mean rate \(\bar\mu\), two priors are available to
specify the probability distribution of locus-specific rates, \(\mu_i\).
The control file variable argument prior
(see discussion of control
file settings below) can take two values dir
(for Gamma-Dirichlet
rates for loci), and iid
(for conditional i.i.d. or hierarchical
prior). If prior = dir
(Gamma-Dirichlet distribution of locus rates),
the total rate \(L\bar\mu\), given the mean rate (\(\bar\mu\)), is
partitioned into locus rate variables , \(\mu_i\), that are constrained to
sum to \(L\bar\mu\) and have a Dirichlet distribution with concentration
parameter \(\alpha_\mu\). The model and notation follow
Burgess and Yang 2008 eq.4
and Dos Reis et al 2014 eqs.3-5.
If prior = iid
(Conditional-i.i.d. or hierarchical prior of locus rates), the locus
rates, \(\mu_i\), are independent and identically distributed i.i.d. given
the mean rate \(\bar\mu\), so the prior distribution of \(\mu_i\) is:
\(\mathtt{P}(\mu_i) = \textrm{Gamma}(\alpha_{\mu_i}, \alpha_{\mu_i/\bar{\mu}})\).
This model is described in Zhu et al 2015 eq.8 and implemented in [mcmctree]{.smallcaps}. In both the Gamma-Dirichlet and the Conditional i.i.d. models, the parameter \(\alpha_{\mu_i}\) is inversely related to the extent of rate variation among loci, with a large \(\alpha_{\mu_i}\) meaning similar rates among loci (less rate variation). If all loci are noncoding, the rates are probably similar, so \(\alpha_{\mu_I} =\) 10 or 20 may be reasonable, while for coding loci or exons, \(\alpha_{\mu_i} = 2\) or 1 may be appropriate. The \(\alpha_{\mu_i}\) parameter may affect the estimates of the population size parameter (\(\theta\)) for the root node on the species tree (Burgess and Yang 2008). The \(L\) locus rates (\(\mu_i\)) are parameters in the model. If \(\alpha_{\bar{\mu}} > 0\), the mean rate, \(\bar\mu\), is a parameter as well.
Control file settings for locusrate
The locusrate
variable takes from 1 and 5 arguments. In the case of 5
arguments the form is shown below, with the last argument (prior) being
optional.
locusrate = rate_variation a_mubar b_mubar a_mui prior
The rate_variation variable is either 0 (no rate variation, the
default), 1 (rate variation with rates inferred) or 2 (rate variation
with fixed user-specified rates). If rate_variation = 0 no other
variables are specified, if rate_variation = 2 then a second argument
specifies the name of a file containing \(L\) specified fixed rates, one
for each locus. If rate_variation = 1 the additonal variables a_mubar
and b_mubar
specify the shape and rate parameters
(\(\alpha_{\bar{\mu}}\) and \(\beta_{\bar{\mu}}\)) of the gamma distribution
that is the prior for the mean rate across loci (mubar
or \(\bar\mu\)),
and variable a_mui
(\(\alpha_{\mu_i}\)) and prior
are used to specify
parameter of the prior on the locus rates (\(\mu_i\)) given the mean rate
(\(\bar\mu\)) and the form of that prior (dir
or iid
). Some example
configuration file settings are:
# (0: No rate variation among loci. This is default)
locusrate = 0
# (1: estimate locus rates mui)
locusrate = 1 10 10 5.0 iid
# (1: estimate locus rates mui)
locusrate = 1 0 0 5.0 iid
# (2: locus rates from file)
locusrate = 2 LocusRateFileName
Note that locusrate = 2 LocusRateFileName
specifies the fixed-rates
model of locus-rate variation (Burgess and Yang 2008). This is the strategy used
by Yang 2002, with the relative rates estimated by the distance to an
outgroup species. The relative locus rates are listed in the file: there
should be as many numbers in the file, separately by spaces or line
returns, as the number of loci (nloci
). The program re-scales those
rates so that the average across all loci is 1 and then use those
relative rates as fixed constants. The following example nearly fixes
(first line), or fixes (second line) the mean rate \(\bar{\mu}\). This is
the recommended practice when no fossil calibrations are available:
# locusrate = 1 a_mubar b_mubar a_mui <prior>
# (1: estimate locus rates mui)
locusrate = 1 10 10 5 iid
# (1: estimate locus rates mui, mubar=1 fixed)
locusrate = 1 0 0 5 iid
Note that a_mubar = 0
and b_mubar = 0
(with 0 meaning \(\infty\))
cause the program to fix \(\bar{\mu}\), while equal and large values such
as a_mubar = 100
and b_mubar = 100
cause cause \(\bar\mu\) to be
nearly fixed at 1.
The locus-rate prior in IMa. The model of variable rates among loci implemented here has some differences from a similar model implemented in the IMa program Hey and Nielsen 2004. The biggest difference appears to be the parameterization. BPP defines mutation rate on a per-nucleotide basis, so the prior specifies that the expectation of the mutation rate per site is constant among loci. IMa defines mutation rate on a per-locus basis, so its prior specifies that the expectation of the mutation rate per locus is the same among loci. If locus one has 100 sites and locus two has 1000 sites, then IMa assumes that the per-site rate for locus one is 10 times for locus two, while BPP assumes the same per-site rate. Also IMa constrains the geometric mean of rates across loci to be one, while BPP constrains their arithmetic mean to be one.
Among-species rate variation
The strict molecular clock model of evolution assumes that rates of substitution for a particular gene, or genomic region, are the same in different species. Early applications of relative rates tests suggested that this assumption is often violated in empirical data, probably due to factors such as selection and in more distantly related species evolving DNA repair and proof-reading mechanisms.
One can allow rates of substitution to vary in an arbitrary way be inferring unrooted phylogenetic trees but this does not allow species divergence times to be estimated. Relaxed clock models allow the rate of substitution to vary among contemporary and ancestral species but with some constraints which allow species divergence times to be estimated. The BPP program implements two relaxed clock models, as well as a strict molecular clock model. The two relaxed clock models differ in terms of their assumptions about rates in ancestor and descendent species. The autocorrelated rates model assumes that descendents have rates that are correlated with the rate in the ancestor, while independent rates model assumes that the rates on each branch are iid from a common distribution. For the independents rates model, the variance of the distribution of rates among branches determines the degree of departure from a molecular clock - a low variance means that the rates are almost clocklike (vary little among branches).
Variation in the degree of departure from a molecular clock among loci
For a given set of species some loci may have rates that are highly clock-like while other loci may have much greater variation in substitution rates among species (branches of the species tree). To accomodate this, the variance of the substitution rate among species is allowed to vary among loci. The prior distribution of the average variance \(\bar{\nu}\) for all loci is
\(P(\bar{\nu}) = \textrm{Gamma}(\alpha_{\bar{\nu}},\beta_{\bar{\nu}})\),
so that the expected average variance is
\(\alpha_{\bar{\nu}}/\beta_{\bar{\nu}}\). If this value is larger loci
depart more from the molecular clock on average. Conditional on the
average variance parameter \(\bar{\nu}\) the prior for the variance
\(\nu_i\) at locus \(i\) can be either a Gamma-Dirichlet distribution or a
Conditional i.i.d distribution (controlfile variables dir
and iid
,
respectively). Both distributions have a parameter \(\alpha_{\nu_i}\) that
determines the variance of the variance in rates across loci. The
specification of the distribution of \(\nu_i\) given the mean \(\bar\nu\)
follows the same procedure as the specification of the distribution of
the locus-specific substitution rate \(\mu_i\) given the mean rate
\(\bar\mu\) in the previous section; see notes above about the locusrate
variable. Note that \(\alpha\) (control file variable a_vi
) is inversely
related to the variance in \(\nu_i\): use small values of \(\alpha\) (2, 1,
or 0.5) if you believe that \(\nu_i\) varies among loci (meaning that the
clock nearly holds at some loci but is seriously violated at others).
Relaxed clock with i.i.d rates among branches
The independent rates model of the relaxed molecular clock assumes that
for locus \(i\) each branch has a rate that is independent and identically
distributed (i.i.d). The distribution has mean \(\mu_i\) specified by the
locusrate
model and variance \(\nu_i\) specified by the prior
distribution for rates among loci discussed above. Two distributions are
available: a Gamma distribution or a log-normal distribution (control
file variables LN
and G
, respectively). Given the locus rate \(\mu_i\)
(specified using the locusrate
variable) and the variance \(\nu_i\)
(specified using the clock
variable) for locus \(i\), the different
lineages may have different rates at the locus, and those rates are
independent among loci. If distribution = G (for gamma), the rate for
species-tree branch \(j\) at locus \(i\) has the following gamma
distribution
\(r_{ij} | \mu_i, \nu_i \sim G(\mu_i^2/\nu_i, \mu_i/\nu_i)\).
This has mean \(\mu_i\) and variance \(\nu_i\).
Alternatively if distribution = LN
(for log-normal), the rate for
species-tree branch \(j\) at locus \(i\) has the following the log-normal
distribution \(\(r_{ij} | \mu_i, \nu_i \sim \mathrm{LN}(\mu_i, \nu_i),\)\)
where \(\mu_i\) is the mean of the LN distribution and \(\nu_i\) is the
variance parameter of the lognormal.
Relaxed clock with autocorrelated rates among branches
The autocorrelated-rates model (control file option clock = 3
) with
the log-normal distribution (LN) specifies the geometric Brownian motion
model of Rannala and Yang 2007.
This assigns a rate to each species-tree branch,
that is, to the mid-point of the branch. Given the rate at the
species-tree root (\(\mu_i\) at locus \(i\)), the rates for the two branches
around the root are specified. Then given the rate for each ancestral
branch, the rates for its two daughter branches are specified, by
integrating over the rate at the internal node that is ancestral to the
daughter branches. See figure 1 and equations 3-8 in Rannala and Yang 2007. The
rates for all species-tree branches are thus assigned through a
pre-order tree traversal, starting from the root moving to the tips,
until all branches are visited.
If the gamma distribution (G) is instead used the model works as
follows, using a similar pre-order tree traversal. First the two
branches at the species-tree root have the gamma distribution with mean
\(\mu_i\) (mu_i
) and variance \(\nu_i\) (nu_i
). Then given the rate for
each ancestral branch, the rates for its two daughter branches are
specified as independent gamma variables with the mean to be the rate of
the parental branch and with the variance to be \(\nu_i\). The
autocorrelated relaxed clock model is only implemented for the MSC model
and is not available when using the MSC-I model.
Rates vary among species tree branches\ The rate-drift model specifies rates for branches on the species tree (rather than on the gene tree) for each locus, and gene-tree branches residing in the same population or species have the same rate. For example, if all sequences at a locus are from the same species and all coalescent events occur in that species (before reaching an ancestral species), all branches on the gene tree will have the same rate even if the relaxed-clock model allows rates to vary among species. In contrast, if a gene-tree branch passes several species or populations, the different segments of the branch will have different rates. In that case, the branch length on the gene tree is the sum of the lengths of those segments (with the length of each segment being the product of the rate and the time duration for the segment).
Control file settings for clock
The clock
variable takes 6 arguments when using a relaxed clock as
follows:
clock = clock_type a_vbar b_vbar a_vi prior dist
the argument clock_type
takes on of three values: 1 (strict clock,
default), 2 (relaxed clock with i.i.d rates) and 3 (relaxed clock with
autocorrelated rates). In the case of the strict clock there are no
additional arguments. The next two variables a_vbar
and b_vbar
specify the parameters (\(\alpha_{\bar{\nu}}\) and \(\beta_{\bar{\nu}}\)) of
the Gamma prior specifying the average rate variance across loci (see
above). The variable a_vi
specifies the variance parameter of the
prior on the locus specific variance, \(\alpha_{\nu_i}\) (see above). The
variable prior
is either iid
or dir
and specifies the prior on
\(\nu_i\) given \(\bar{\nu}\) (see above). The variable dist
is either
LN
or G
specifying either a Gamma distribution of a log-normal
distribution for the branch rates conditional on the locus rate \(\mu_i\)
and among-branch variance \(\nu_i\).
Several example control file entries for the clock
variables are
provided below:
# (1: strict clock, default)
clock = 1
# (2: independent-rates)
clock = 2 10.0 100.0 5.0 iid G
# (3: correlated-rates)
clock = 3 10.0 100.0 5.0 iid G
The specification (clock = 2 10.0 100.0 5.0 iid G
) means the
following. First \(\bar\nu\) is assigned a gamma distribution
\(G(10.0, 100.0)\), with mean 0.1. Given \(\bar\nu\), the conditional
i.i.d. prior means that \(\nu_i \sim G(\alpha, \alpha/\bar\nu)\) with
shape parameter \(\alpha = 5.0\) (a_vi
) and mean \(\bar\nu\), for
\(i = 1, 2, \cdots, L\). As another example, the specification
(clock = 2 10.0 100.0 5.0 dir LN
) means that \(\bar\nu\) is assigned a
gamma prior \(G(10.0, 100.0)\), and then the sum \(L\bar\nu\) is partitioned
into \(\nu_i\) (for \(i = 1, 2, \cdots, L\)) according to the Dirichlet
distribution (prior = dir
) with concentration parameter \(\alpha = 5.0\)
(a_vi
). Again large \(\alpha\) means the same extent of clock violation
at different loci.
Choosing \(\alpha_{\bar{\nu}}\) and \(\beta_{\bar{\nu}}\)
A larger variance \(\nu_i\) represents a greater violation of the
molecular clock at locus \(i\). Note that \(\nu_i\) will be similar to
\(\bar\nu\), especially if \(\alpha_{\bar{\nu}}\) (a_vi
) is large, and
\(\bar\nu\) has prior mean \(\alpha_{\bar{\nu}}/\beta_{\bar{\nu}}\). If
(a_vbar = 10 b_vbar = 100
), the prior mean will be 0.1. For the
log-normal model \(\nu = 0.5\) suggests a serious violation of the clock
while \(\nu < 0.1\) represents only a slight violation.
Introgression and Migration Models
Another important factor influencing the results of species tree inference is introgression (gene flow) between species (or populations). BPP also implements a model of episodic introgression between species or populations Flouri et al 2020 as well as a model of ongoing gene flow (MSC-M). Since version 4.1, BPP implements the MSC-I model Flouri et al 2020 and since version 4.4 it implements the MSC-M model. Currently, introgression (MSC-I) and ongoing gene flow (MSC-M) analyses can only be performed using the A00 method of analysis (fixed species tree and delimitation, see Methods of Analysis). In other words, the user has to specify the number of introgression events (or bands of gene flow), their directions, and the populations involved on a fixed species tree. The program will then estimate the parameters of the MSC-I (or MSC-M) models using MCMC. We hope to implement MCMC moves that change the MSC-I (or MSC-M) models and species tree topology in the future.
The MSC-I Model
The MSC-I model involves three types of parameters: the species
divergence or hybridization times (\(\tau\)s), the population size
parameters (\(\theta\)s), and the introgression probabilities
(\(\varphi\)s). The control file must specify an extended Newick format
tree with one or more introgression events as well as a prior
distribution for the introgression probability \(\varphi\). The MSC-I model
is specified in the species&tree
block using the extended Newick
notation (Cardona et al 2008)
species&tree = 3 A B C
10 10 10
((A, (C)H[&phi=0.5,&tau-parent=yes])S, (H[&tau-parent=yes], B) T)R;
The rules for the extended Newick format are as follows. '(A, B)S' specifies two branches from S to A and from S to B, while '(A)H' specifies one branch from H to A. Every branch in the tree model is represented once. Each tip species occurs once. Internal nodes for speciation nodes may and may not be labeled, but hybridization nodes must be labeled. The extended Newick notation is not unique.
Each hybrid node has two parental species. Models A-C are distinguished
using a meta-data variable tau-parent
. The value yes means that the
parental population exists and has a separate age parameter \(\tau\), with
an associated \(\varphi\) parameter, while the value no means that such
parameters do not exist. For example, in model (A), \(\tau_S\), \(\tau_T\),
\(\theta_{Hl}\) (for H-left), and \(\theta_{Hr}\) (for H-right) are all
parameters; in model (B), \(\tau_S\) and \(\theta_{Hl}\) are not parameters
in the model while \(\tau_T\) and \(\theta_{Hr}\) are parameters; and
finally in model (C), none of those four parameters exists in the model.
Model (D) specifies a bidirectional introgression event between species
A and B.
Four different types of MSC-I models implemented in BPP
The models are specified using the extended Newick format, as follows:
(A): ((A, (C)H[&phi=0.5,&tau-parent=yes])S, (H[&tau-parent=yes], B) T)R;
(B): ((A, (C)H[&phi=0.5,&tau-parent=no])S, (H[&tau-parent=yes], B) T)R;
(C): ((A, (C)H[&phi=0.5,&tau-parent=no])S, (H[&tau-parent=no], B) T)R;
(D): ((A, (B) Y[&phi=0.3])X, (X[&phi=0.1])Y)R;
(D): ((A, (B)Y)X, (X)Y)R;
(D): ((A, Y)X, (B, X)Y) R;
(D): ((A, y)x, (B, x) y) r;
The control file variable phiprior
specifies the prior probability for
the \(\varphi\) parameter, which is a beta distribution with parameters
\(a\) and \(b\). The syntax is
phiprior = a b
where \(a\) and \(b\) are positive numbers. For example, phiprior = 1 1
specifies the beta prior beta(\(a, b\)) with \(a = 1\) and \(b = 1\) for
\(\varphi\), the introgression probability under the MSC-I model.
Autogeneration of extended newick graphs
Starting with BPP release 4.4.0, we have changed the definition of introgression probability (\(\varphi\)) so that it is assigned to the horizontal (introgression) branch. This note illustrates the msci-create
feature in bpp, which generates extended Newick notation for the MSC-I model from a data file which contains a binary species tree with introgression events specified using source and target branches on the tree.
The --msci-create
option has been available since BPP 4.2.1. The user prepares a input file (named msci.txt
in our example), which includes a binary species tree in Newick format and specifies the introgression events by identifying the source and target branches involved in the introgression. Run bpp using the following command:
$ bpp --msci-create msci.txt
This generates extended Newick notation for the MSC-I model, which can be copied into a bpp control file.
The input file msci.txt
uses four commands: tree
, define
, hybridization
, and bidirection
.
tree
defines the binary species tree, in Newick format.define
defines an ancestral species or internal node label, as the most recent common ancestor of the tip species. Alternatively you can label the internal nodes on the Newick tree.hybridization
defines a hybridization/introgression event by specifying the source and target branches (which represent the source and target populations involved).bidirection
defines a bidirectional introgression (BDI) event.
One may think of the binary species tree as describing the history of species divergences and add introgression events onto it as new (horizontal) branches. The introgression probability is assigned to the newly created introgression branch.
Examples using --msci-create
A series of examples follow. The introgression graph is displayed in a figure followed by the msci=create
encoding and the resulting extended Newick format.
# Model A, version 1
tree (A,(B,C));
define T as B,C
define R as A,B
hybridization R A, T C as S H tau=yes, yes phi=0.10
# Model A, version 2
tree (A,(B,C)T)R;
hybridization R A, T C as S H tau=yes, yes phi=0.10
# The generated Newick notation for model A is
# ((H[&phi=0.1,tau-parent=yes],A)S, (B,(C)H[&phi=0.9,tau-parent=yes])T)R;
# Model B1
tree (A,(B,C)T)R;
hybridization R A, T C as S H tau=no, yes phi=0.10
# The generated Newick notation for model B1 is
# ((H[&phi=0.1,tau-parent=no],A)S, (B,(C)H[&phi=0.9,tau-parent=yes])T)R;
# Model B2
tree ((A,C)S,B)R;
hybridization R B, S C as T H tau=no, yes phi=0.10
# The generated Newick notation for model B2 is
# ((A,(C)H[&phi=0.9,tau-parent=yes])S, (H[&phi=0.1,tau-parent=no],B)T)R;
# Model C
tree (A,(B,C)T)R;
hybridization R A, T C as S H tau=no, no phi=0.40
# The generated the Newick notation for model C is
# ((H[&phi=0.4,tau-parent=no],A)S, (B,(C)H[&phi=0.6,tau-parent=no])T)R;
# Model D
tree (A,B)R;
bidirection A R, B R as X Y phi=0.1,0.2
The above figure illustrates the specifications of models A, B, C, and D of Flouri et al 2020. In models A, B1, and C, the introgression event is from the source branch RA to the target branch TC, with a new introgression branch SH created. The introgression probability (\(\varphi = 0.1\)) is assigned to the newly added introgression branch (SH) while \(1 – \varphi\) is assigned to the other parent branch (TH). Models A, B1, and C are distinguished by using the keyword tau
(which means the same as tau-parent
in the Newick notation of the MSC-I model). Model A assumes that the two parental species S and T of the hybridization node H have distinct ages from H, with \(\tau_S > \tau_H\) and \(\tau_T > \tau_H\). Thus we have hybridization R A, T C as S H tau=yes, yes phi=0.10
: the first yes
means that parental node S on the source branch RA has a distinct tau from node H and that the second yes
means that parental node T on the target branch TC has a distinct tau from node H. Note that there are often multiple ways of specifying the same MSC-I model and here you can specify model A by starting with the binary tree ((A, C), B)
and adding branch TH as a introgression event. Model A might appear to be nonsensical biologically since only contemporary species can exchange migrants. Nevertheless, model A may be used to represent introgressions from a ghost species not sampled in the sequence dataset. Suppose at time \(\tau_S\) a speciation event generated two species SA and SH, and at \(\tau_H\), species SH contributed migrants into species THC, but species SH has since become extinct or is otherwise not sampled in the data. This scenario matches model A.
Model B1 assumes \(\tau_S = \tau_H\) and \(\tau_T > \tau_H\) so there is no new tau for parent S of the hybridization node H, and there is a new tau for parent T. Thus we have tau=no, yes
. Model B2 works similarly. In model C, none of parents S and T has a distinct age from H, so we have tau=no, no
. One interpretation is that species C is a hybrid species. In model D for the bidirectional introgression (BDI) model, there is no distinction of source and target branches. We specify two phi values, assigned to the introgression (horizontal) branches: \(\varphi_X = 0.1\) for node X (or into node X) and \(\varphi_Y = 0.2\) for node Y (see above figure).
The MSC-M Model
The multispecies-coalescent-with-migration model (or isolation-with-migration or IM model) is activated in BPP using the keyword migration. A binary species tree is specified using an extended Newick format that includes labels for internal nodes; these are subsequently used in the control file to specify the source and target populations involved in migration.
Control file specification of MSC-M
The basic model is specified in the control file as follows, using a species tree for four species (A, B, C, D) as an example:
((A, B)S, (C, D)T)R;
migprior = 2 200
migration = 2
A C
S C
Here S is the AB common ancestor, T is the CD common ancestor, while R is the ABCD ancestor. Not all internal nodes need to be labeled but those involved in migration have to be. The migration line specifies 2 migration connections: one connecting source population A to target population C, and another connecting source population S to target population C, with migration rates \(M_{AC}\) and \(M_{SC}\). Note that the migration rate
\(M_{AC} = m_{AC} N_C\)
is the expected number of immigrants in population C that are from population A per generation under the real-world view with time running forward, where \(m_{AC}\) is the proportion of immigrants in population C that are from population A. See Appendix A for definitions of migration rates used in different programs.
Prior on migration rates
The control file variable migprior = alpha beta
specifies a default gamma prior density with parameters alpha
and beta
for all migration rates. In the example above, both \(M_{AC}\) and \(M_{SC}\) are assigned the gamma prior \(G(2, 200)\), with prior mean \(2/200 = 0.01\).
A separate gamma prior can instead be defined for each migration rate (where the migration connection is specified using the source and target populations) and this specification takes precedence. For example if we specify
migprior = 2 200
migration = 2
A C 2 100
S C
the priors are \(M_{AC} ~ G(2, 100)\) with the prior mean \(0.02\) and \(M_{SC} ~ G(2, 200)\) from the default prior specified by migprior
.
Variable migration rates among loci
In this model, the migration rate \(M_i\) at each locus i varies according to a gamma distribution \(M_i \approx G(\alpha_M,\alpha_M/\overline{M})\), with shape parameter \(\alpha_M\) while the mean rate \(\overline{M}\) is assigned the gamma prior \(\overline{M} \approx G(\alpha,\beta)\). Here \(\alpha_M\) is a parameter that characterizes the variation of \(M_i\) among loci, with a small \(\alpha_M\) (e.g., 0.5 or 1) indicating highly variable rates among loci and a large \(\alpha_M\) indicating nearly constant migration rates among loci (when \(\alpha_M = \infty\) all loci have the same rate).
migprior = 2 200
migration = 2
A C 2 100 5
S C
In the above example, \(M_{SC}\) is applied to all loci with the prior \(G(2, 200)\), but \(M_{AC}\) varies among loci according to the shape parameter \(\alpha_M = 5\), and the mean rate for all loci \(\overline{M}_{AC}\) is assigned the gamma prior \(G(2, 100)\).
Priors and pseudo-priors
When the gene flow involves ancestral species, migration may become impossible because of changes to the species divergence times in the MCMC. In the example above, if \(\tau_S < \tau_T\) migration from S to C is possible during the time period \((\tau_S, \tau_T)\), but if \(\tau_S > \tau_T\) migration from S to C is impossible as the two populations were never contemporary. Thus the MCMC proposal to change species divergence times \(\tau_S\) or \(\tau_T\) may cause the migration rate parameter \(M_{SC}\) to disappear or reappear. When \(M_{SC}\) is absent from the model (that is, when \(\tau_S > \tau_T\)), the MCMC algorithm treats it as a pseudo-parameter (written as \(M^*_{SC}\)) and assigns it a pseudo-prior to facilitate the trans-dimensional move. The choice of pseudo-priors affects the mixing efficiency of the MCMC, but not the correctness of the algorithm. For good mixing, one should choose the pseudo-prior to be close to the posterior of the parameter \(M_{SC}\) (which can be generated by running short chains).
migprior = 2 200
migration = 2
A C 2 100
S C 2 200 100 250
In the example above, the prior \(M_{SC} \approx G(2, 200)\) is applied when \(\tau_S < \tau_T\), and the pseudo-prior \(M^*_{SC} \approx G(100, 250)\) with mean 0.4, is applied when \(\tau_S > \tau_T\). Note that in this example the pseudo-prior is far more concentrated than the prior (with shape parameter 100 versus 2). In order to summarize the posterior from the MCMC samples, only those samples taken when the migration rate is defined should be used (that is, only samples collected when \(\tau_S < \tau_T\) in our example). Samples collected when the migration rate does not exist in the model (that is, when \(\tau_S > \tau_T\) in our example) approximate the pseudo-prior.
In the case of four species, the following awk commands may be used to split the samples into two files (this assumes that tau_S and tau_T are in column 10 and 11)
awk 'BEGIN {cnt=0} NR>1{if ($10 > $11) {cnt=cnt+1;}}} END {print cnt}' out.mcmc.txt
# split sample file into ab.txt (samples where t_AB > t_CD) and cd.txt (t_CD > t_AB)
head -n 1 out.mcmc.txt > cd.txt; awk 'NR>1 {if ($10<$11) print $0}' out.mcmc.txt >> cd.txt
head -n 1 out.mcmc.txt > ab.txt; awk 'NR>1 {if ($10>$11) print $0}' out.mcmc.txt >> ab.txt
Use something like this if you understand the syntax. The running mean of the migration rate printed on the monitor during the MCMC run is the posterior mean after the filtering.
Note that this problem of the migration rate appearing and disappearing during the MCMC may exist when a migration event involves ancestral populations, and is not limited to the balanced species tree for four species. Later we should automate the summary of the MCMC sample.
Under the model of variable \(M\) among loci, all migration rates for all loci may appear and disappear if the migration involves ancestral species. The following five ways of specifying the model and priors and pseudo-priors are accepted (with \(M_{SC}\) as an example)
(a) S C
(b) S C a_M
(c) S C a b
(d) S C a b a_M
(e) S C a b pseudo_a pseudo_b
(f) S C a b a_M pseudo_a pseudo_b
In options (a) and (b), the default prior specified with migprior is assigned on \(M_{SC}\).
Combined Analyses of Organelle Genomes and Sex Chromosomes
When performing a combined analysis of data from autosomal, mitochondrial, chloroplast or sex chromosomes
one needs to account for the differences of effective population sizes between loci from these
different sources. This can be done using a heredity multiplier (referred to as an inheritance scalar by Hey and Nielsen 2004). The heredity
variable in the BPP control file allows for
such differences. Here are some examples:
heredity = 0 # (0: No variation)
heredity = 1 4 4 # (1: estimate, & a_gamma b_gamma)
heredity = 2 heredity.txt # (2: from file)
The specification heredity = 0
is the default and means that \(\theta\) is the same for
all loci. heredity = 1
or 2
specifies two models that allow \(\theta\)
to vary among loci, which may be useful for combined analysis of data
from autosomal, mitochondrial, X and Y loci. With such mixed data, the
effective population sizes are different among loci, so that a heredity
multiplier (Hey and Nielsen 2004) should be applied. Other
factors such as natural selection may also cause \(\theta\) to deviate
from the neutral expectation. BPP implements two options
for this. The first option (heredity = 1
) is to estimate the
multipliers from the data, using a gamma prior with parameters \(\alpha\)
and \(\beta\) specified by the user. In the example above, a gamma prior
\(G(4, 4)\), with mean \(4/4 = 1\), is specified for the multiplier for each
locus. The MCMC should then generate a posterior for the multiplier for
each locus. The second option (heredity = 2
) is for the user to
specify the multipliers in a file, and the multipliers will then be used
as fixed constants in the MCMC run. The file simply contains as many
numerical values as the number of loci, separated by spaces or line
breaks.
Genome | Heredity scalar |
---|---|
Autosome | 1 |
X chromosome | 0.75 |
Y chromosome | 0.25 |
Mitochondrial | 0.25 |
Table. Examples of Common Heredity Scalars
Note.--- The effect of the locus-specific mutation rates and the locus-specific heredity multipliers are different. A locus rate is used to multiply all \(\theta\)s and \(\tau\)s for the locus, while a heredity multiplier is used to multiply all \(\theta\) parameters for the locus but not the \(\tau\)s. Nevertheless, those parameters are quite likely to be strongly correlated, especially when the species tree is small.
Methods of Analysis
Here we describe the specifics of the four different types of analyses
that BPP can perform and illustrate them using several of the example
control files provided in the examples
subdirectory of the BPP
distribution. Briefly, BPP is capable of analyzing sequence data using 4
distinct phylogeographical models. The 4 models are determined by the
combinations of two possible settings for each of the two Boolean
control file variables speciesdelimitation
(0 = no species
delimitation, 1 = species delimitation) and speciestree
(0 = fixed
species tree, 1 = inferred species tree). Therefore, analysis A00 infers
the biogeographical parameters \(\theta\) and \(\tau\) on a fixed (user
specified) tree with a fixed (delimited) number of species/populations
as described in (Rannala and Yang 2003. Analysis A10 delimits the number of
species and infers biogeographical parameters using a fixed species tree
(guide tree) as described in Yang and Rannala 2010. Analysis A01 infers the
species tree and biogeographical parameters using a fixed number of
species (delimitation) as described in Rannala and Yang 2017. Analysis A11
jointly infers the species tree, species delimitation and
biogeographical parameters as described in Yang and Rannala 2014.
Under Model A00 it is also possible to include a model of either instantaneous
introgression or ongoing gene flow -- examples of such analyses are
given in Introgression and Migration Models. For each of these 4 major types of analyses
the control file specifications will differ as well as the form of
output printed to the screen and output files. Therefore, we will
include two subsections in our discussion of each analysis method: input
and output. Here we only consider the control file variables specific to
the methods of analysis under consideration. For a general discussion of
other control file variables see the example control file discussed in Control File.
Note that for most analyses the output printed to screen during the run
and that printed to the specified output file will be identical.
It is important to note that only analysis A00 applies within-model
inference. There is one specified model in that case (the MSC with a
fixed species tree) and the parameters are well defined, with the
objective being to estimate those parameters. In contrast, analyses A01,
A10, and A11 all apply trans-model inference. They move between
different models, and the main objective is to calculate the posterior
probabilities of the models. Each is an instance of the MSC model, but
the species delimitation (the number and nature of the species) and/or
the species phylogeny may differ between models. In analyses A01, A10,
and A11, the prior specified using control file variable thetaprior
applies to all \(\theta\) parameters in all models. Similarly, all MSC
models specifying two or more delimited species have a parameter
\(\tau_0\) (the divergence time of the root), and these parameters (for
different models) are assigned the same prior, specified by the
tauprior
control file variable.
A00: Demography and Divergence Time Estimation
Analysis A00(speciesdelimitation = 0, speciestree = 0)
, requires a
user-specified species tree topology, given by variable species&tree
in the control file, and infers the \(\theta\) and \(\tau\) parameters on
this fixed tree. The theory underlying estimation of species divergence
times and population sizes (\(\tau\)s and \(\theta\)s) under the MSC model
when the species phylogeny is given is described in (Rannala and Yang 2003. Note
that if there are \(s\) species in the species tree, the model will
minimally contain the following parameters: \(s – 1\) species divergence
times (\(\tau\)s) and \(s – 1\) ancestral species \(\theta\)s. A \(\theta\)
parameter will also exist for each contemporary species with at least
two sequences present at some locus. If the data are for a single
species/population, the model will contain one parameter only, \(\theta\)
for that species. If there is only one species, the MSC model becomes
the single-population Kingman coalescent. We will provide
examples of both a single population analysis (using the control file
yu2001.bpp.ctl
) and a multi-species analysis (using the control file
A00.bpp.ctl
).
How many \(\theta\) and \(\tau\) parameters exist?
The number of parameters that BPP includes in the MSC or
MSC-I/MSC-M models depends on the data configuration. The program always
includes \(\theta\) and \(\tau\) parameters for each internal node
(ancestral species) on the species tree, but includes a \(\theta\) for an
extant species (tip) if and only if that species has at least 2
sequences at some loci. If two or more sequences for a species are
specified in the control file but there are at most 0 or 1 sequence per
locus for that species in the sequence data file, \(\theta\) for that
species will not be identifiable or estimable. In that case, the
posterior for the parameter will be the prior. Nevertheless, other
results (such as the posterior distribution for other parameters or
posterior probabilities for species trees and species delimitations)
will still be correct. The same applies to other more complex cases of
missing data and parameter non-identifiability. As an example, suppose
the species tree and the numbers of sequences (for species A, B, C and D from left to right) for two kinds of loci are
as follows:
((A,B), (C,D))
locus configuration 1: 1 1 0 0
locus configuration 2: 2 0 0 1
In this case, \(\theta_A\), \(\theta_{AB}\), \(\theta_{ABCD}\), \(\tau_{AB}\), and \(\tau_{ABCD}\) are identifiable while \(\theta_B\), \(\theta_C\), \(\theta_D\), \(\theta_{CD}\) and \(\tau_{CD}\) are not. It is impossible to estimate \(\theta_D\) , \(\theta_{CD}\) and \(\tau_{CD}\) as no data are available from species C.
Input A00 (single population)
Our first example is a dataset of human sequence data generated by Yu et al 2001. The control file is found in the BPP distribution at location:
examples/yu2001/yu2001.bpp.ctl
The data comprise 61 phased sequences for a single locus that will be
analyzed to estimate the single parameter \(\theta\). There is no need to
specify an Imap file, or tag the sequence names in the sequence file
(yu2001.txt
); the sequence names are read and then ignored. Multiple
loci may be included in the sequence file. The contents of the control
file yu2001.bpp.ctl
are shown below:
seed = -1
seqfile = yu2001.txt
jobname = out
# fixed species delimitation and species tree
speciesdelimitation = 0
speciestree = 0
species&tree = 1 H
61 # max number of sequences
# 0: no data (prior); 1:seq like
usedata = 1
# number of data sets in seqfile
nloci = 1
# remove sites with ambiguity data (1:yes, 0:no)?
cleandata = 0
# gamma(a, b) for theta
thetaprior = gamma 2 2000
# auto (0 or 1): finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
finetune = 1: 2 0.00001 0.0001 0.0005 0.5 0.2 1.0
# MCMC samples, locusrate, heredityscalars, genetrees
print = 1 0 0 0
burnin = 4000
sampfreq = 2
nsample = 10000
The two lines:
speciesdelimitation = 0
speciestree = 0
specify that the species delimitation (number of species) is fixed as well as the species tree topology. In this case, there is no species tree and only a single species exists. There is no need for a species tree, so the block for specifying species names and species tree looks like this:
species&tree = 1 H
61
The line:
# gamma(a, b) for theta
thetaprior = gamma 2 2000
specifies a gamma distribution for the prior on \(\theta\) with parameters \(\alpha = 2\) and \(\beta = 2000\) which has mean \(0.001\) and variance \(5 \times 10^{-7}\).
Output A00 (single population)
When BPP is run using the above control file a summary of the progress of the MCMC analysis is printed to screen. Here, we briefly explain what this output means. The different analyses presented in this chapter all produce a similar form of output when the program is running, but the number of columns and their specific content varies somewhat. Omitting some mostly irrelevant output related to the adjustment of proposal moves, the output to screen when this control file is run appears as follows:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix | mthet1 log-PG log-L
-------------------------------------------------------------------
-15% 0.71 0.18 0.30 0.00 0.38 0.0004 467.11472 -12721.22305 0:01
(some output omitted here)
5% 0.71 0.29 0.33 0.00 0.24 0.0004 428.40742 -12721.09413 0:05
10% 0.71 0.29 0.31 0.00 0.25 0.0003 480.92848 -12720.75185 0:06
15% 0.71 0.29 0.30 0.00 0.24 0.0003 476.37432 -12720.65199 0:08
20% 0.71 0.29 0.31 0.00 0.23 0.0003 441.36960 -12720.92326 0:09
25% 0.71 0.29 0.32 0.00 0.23 0.0003 443.51740 -12720.85802 0:10
30% 0.71 0.29 0.31 0.00 0.23 0.0003 441.93616 -12720.86033 0:11
35% 0.71 0.29 0.31 0.00 0.23 0.0003 495.50716 -12720.84565 0:12
40% 0.71 0.29 0.31 0.00 0.23 0.0003 452.82049 -12720.87810 0:13
45% 0.71 0.29 0.31 0.00 0.23 0.0003 461.93346 -12720.94582 0:15
50% 0.71 0.29 0.32 0.00 0.23 0.0004 463.85955 -12721.01005 0:16
55% 0.71 0.29 0.32 0.00 0.23 0.0004 463.42744 -12720.98006 0:17
60% 0.71 0.29 0.32 0.00 0.23 0.0004 478.52797 -12721.03353 0:18
65% 0.71 0.29 0.32 0.00 0.23 0.0004 499.97705 -12720.96168 0:19
70% 0.71 0.29 0.32 0.00 0.23 0.0004 455.32489 -12720.93611 0:20
75% 0.71 0.29 0.32 0.00 0.23 0.0004 439.21544 -12720.99190 0:22
80% 0.71 0.29 0.32 0.00 0.23 0.0004 461.56954 -12721.00180 0:23
85% 0.71 0.29 0.32 0.00 0.23 0.0004 469.53443 -12720.99823 0:24
90% 0.71 0.29 0.32 0.00 0.23 0.0004 455.60303 -12720.99213 0:25
95% 0.71 0.29 0.32 0.00 0.23 0.0004 454.07141 -12721.00745 0:26
100% 0.71 0.29 0.32 0.00 0.23 0.0004 461.01044 -12720.97549 0:27
0:27 spent in MCMC
theta_1H lnL
mean 0.000352 -12720.995649
median 0.000337 -12720.726000
S.D 0.000116 2.936769
min 0.000089 -12735.535000
max 0.001293 -12712.774000
2.5% 0.000170 -12727.454000
97.5% 0.000617 -12716.086000
2.5%HPD 0.000145 -12726.920000
97.5%HPD 0.000574 -12715.718000
ESS* 757.838791 1180.434956
Eff* 0.075784 0.118043
Note that in these examples we are using a random seed from the computer clock so your results will differ slightly from those shown. The line:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix | mthet1 log-PG log-L
-------------------------------------------------------------------
is a header that explains the content of each column printed during the
run. The Prgs
(progress) column will appear in all analyses and
indicates the percentage of the MCMC iterations that have been
completed. If this number is negative it indicates that the program is
still running the burn-in iterations. For example, -15%
means that 15
percent of the burn-in iterations remain to be completed. The next 5
columns are the current acceptance proportions for different parameter
proposals in the MCMC. The proposals are defined as follows:
-
Gage
: proposal to change ages of nodes on gene trees -
Gspr
: proposal to change gene tree topology using SPR move -
thet
: proposal to change \(\theta\) parameter -
tau
: proposal to change \(\tau\) parameter -
mix
: mixing step proposal
In the output the acceptance proportions are stable. The fifth column
(the \(\tau\) proposal acceptance proportion) is zero; this is expected
because there are no species divergence times (\(\tau\)s) in the model
when only a single species exists. Column 7, labeled mthet1
gives the
current average (mean) value of theta in the MCMC. Columns 8 and 9 are
the log probability of the coalescent model (log-PG
) and the
log-likelihood of the sequence data on the gene tree (log-L
). Check
that the acceptance proportions are in a reasonable range (about 20% to
40% for continuous parameters such as \(\theta\)). If you need to
terminate the program before it finishes, you can do so using the key
combination Ctrl-C
(e.g., hold down the Control key and then press the
C key). The final screen output, printed after the MCMC is completed,
summarizes the posterior parameter estimates for \(\theta\) (theta_1H
)
in the second column. The posterior mean in the first row, for example,
is \(0.000352\) and the lower and upper limits of the highest posterior
density (HPD) set of values are in rows 8 and 9, respectively. In this
case, the HPD credibility interval for \(\theta\) is
\((0.000145,0.000574)\).
Input A00 (multiple populations)
Our second example is a dataset of frog sequence data generated by Zhou et al 2012. This same dataset will also be used to illustrate analyses A01, A10 and A11. The control file is found in the BPP distribution at location:
examples/frogs/A00.bpp.ctl
The data comprise 5 loci for 4 populations, with the number of sequences
at each locus varying between 21 and 30. We will be estimating \(\theta\)s
for each of the 4 contemporary and 3 ancestral populations as well as
the 3 species divergence times (\(\tau\)s). The contents of the control
file A00.bpp.ctl
are shown below:
seed = -1
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
# fixed number of species/populations
speciesdelimitation = 0
# fixed species tree
speciestree = 0
species&tree = 4 K C L H
9 7 14 2
(((K, C), L), H);
# unphased data for all 4 populations
phase = 1 1 1 1
# use sequence likelihood
usedata = 1
nloci = 5
# do not remove sites with ambiguity data
cleandata = 0
# gamma(a, b) for theta (estimate theta)
thetaprior = gamma 2 2000
# gamma(a, b) for root tau & Dirichlet(a) for other tau's
tauprior = gamma 2 1000
# finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0
# MCMCsamples, locusrate, heredityscalars, genetrees, substitutionparams
print = 1 0 0 0
burnin = 8000
sampfreq = 2
nsample = 100000
As in the previous analysis, the two lines:
speciesdelimitation = 0
speciestree = 0
specify that the species delimitation (number of species) is fixed as well as the species tree topology. In this case, a species tree is specified along with population labels. The block for specifying species names and the species tree looks like this:
species&tree = 4 K C L H
9 7 14 2
(((K, C), L), H);
The first line above lists the number of species (populations) which is 4 in this example. This is followed by the label for each species (these should match the labels in the map file):
K C L H
The next line specifies the maximum number of sequences per species in the same order as the species labels:
9 7 14 2
the actual number of sequences at any locus must be less than this value. The third line is a rooted phylogenetic tree topology specified in Newick format:
(((K, C), L), H);
The next line specifies that the loci for each of the 4 populations are unphased:
phase = 1 1 1 1
The specified phase variables should include a 0 or 1 (Boolean) entry
for each population, where the populations are assumed to be in the same
order as specified above (in the species&tree
block) and the Boolean
variable indicates that the sequence data for a population are either
unphased (1) or phased (0). Ambiguity characters are used to represent
genotypes for unphased sequences (see BPP control file variables).
In the example data the sequences
are unphased for all 4 populations. Currently, it is not possible to
have partially phased data (e.g., only some loci unphased in a
particular population). The program will infer the phase as part of the
analysis.
Unlike the earlier single population analysis, in which we estimated a single \(\theta\) parameter, in the multipopulation analysis we will be estimating multiple \(\theta\)s (one for each population) and divergence times (\(\tau\)s) between populations. Thus, we have prior distributions for both \(\theta\) and \(\tau_0\). The lines below specify a gamma distribution as the prior for \(\theta\) with parameters \(\alpha = 2\) and \(\beta = 2000\) (the prior mean and variance of \(\theta\) are \(0.001\) and \(0.0000005\)):
# gamma(a, b) for theta
thetaprior = gamma 2 2000
The lines:
# gamma(a, b) for root tau & Dirichlet(a) for other tau's
tauprior = gamma 1 1000
specify a gamma distribution as a prior for the root age and the remaining \(\tau\)s have a uniform Dirichlet distribution conditional on the root age (the prior mean and variance of the root age are \(0.001\) and \(0.000001\),respectively):
Output A00 (multiple populations)
When BPP is run using the above control file a summary of the progress of the MCMC is again printed to screen as follows:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix | mthet1 mthet2 mthet3 mtau1 mtau2 mtau3 log-PG log-L
------------------------------------------------------------------------------------------------------
-3% 0.65 0.10 0.43 0.04 0.12 0.0031 0.0095 0.0068 0.0021 0.0015 0.0014 1204.63468 -4458.38814
(some output omitted here)
5% 0.64 0.26 0.30 0.28 0.27 0.0035 0.0096 0.0067 0.0019 0.0018 0.0017 1225.51650 -4443.85539 1:09
10% 0.65 0.27 0.30 0.28 0.28 0.0035 0.0095 0.0068 0.0019 0.0018 0.0018 1202.62497 -4445.50020 1:47
15% 0.64 0.27 0.30 0.28 0.28 0.0034 0.0095 0.0068 0.0019 0.0018 0.0017 1211.95125 -4446.35694 2:25
20% 0.64 0.26 0.31 0.28 0.28 0.0034 0.0095 0.0067 0.0018 0.0018 0.0017 1211.09206 -4445.74292 3:05
25% 0.64 0.26 0.31 0.28 0.28 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1222.97603 -4446.12020 3:43
30% 0.64 0.26 0.31 0.28 0.28 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1216.60217 -4445.55424 4:21
35% 0.64 0.26 0.31 0.28 0.28 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1260.59722 -4445.22113 5:00
40% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1269.45020 -4445.20340 5:38
45% 0.65 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1243.07668 -4445.00207 6:16
50% 0.65 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1229.59858 -4445.04036 6:55
55% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1209.31273 -4445.43161 7:33
60% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1206.22162 -4445.53994 8:11
65% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1190.78837 -4445.21618 8:50
70% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1203.82684 -4445.04124 9:30
75% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1186.32176 -4445.00578 10:09
80% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1222.75718 -4444.94315 10:47
85% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1228.39106 -4444.68671 11:26
90% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1228.72385 -4444.54844 12:04
95% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1223.25921 -4444.25455 12:43
100% 0.64 0.26 0.31 0.28 0.27 0.0034 0.0096 0.0067 0.0018 0.0018 0.0017 1199.52731 -4444.17857 13:21
}
13:21 spent in MCMC
theta_1K theta_2C theta_3L theta_4H theta_5KCLH theta_6KCL theta_7KC tau_5KCLH tau_6KCL tau_7KC lnL
mean 0.003422 0.009588 0.006683 0.003031 0.003720 0.001577 0.001648 0.001823 0.001787 0.001710 -4444.16
median 0.003351 0.009492 0.006599 0.002932 0.003642 0.001412 0.001463 0.001809 0.001773 0.001700 -4443.69
S.D 0.000747 0.001461 0.001118 0.000819 0.000944 0.000899 0.000954 0.000234 0.000230 0.000238 11.615936
min 0.001309 0.004913 0.003258 0.000866 0.001131 0.000029 0.000039 0.001047 0.001043 0.000772 -4496.72
max 0.008147 0.017569 0.012778 0.009724 0.010355 0.008461 0.008987 0.003140 0.003113 0.002847 -4403.99
2.5% 0.002171 0.006999 0.004743 0.001729 0.002100 0.000325 0.000340 0.001404 0.001375 0.001271 -4468.01
97.5% 0.005084 0.012720 0.009104 0.004919 0.005780 0.003751 0.003991 0.002319 0.002280 0.002209 -4422.74
2.5%HPD 0.002071 0.006802 0.004595 0.001566 0.001949 0.000142 0.000164 0.001396 0.001342 0.001252 -4467.06
97.5%HPD 0.004928 0.012469 0.008903 0.004650 0.005573 0.003312 0.003512 0.002308 0.002237 0.002185 -4421.90
ESS* 2721.25 8954.56 5279.29 11762.38 1199.63 4543.27 3299.32 850.33 970.61 932.00 420.40
Eff* 0.027213 0.089546 0.052793 0.117624 0.011996 0.045433 0.032993 0.008503 0.009706 0.009320 0.004204
There are again 5 acceptance proportions for the 5 parameter proposals,
but now the current mean values of 6 parameters are printed. The current
mean values of \(\theta\) for nodes 1, 2 and 3 (mthet1 mthet2 mthet3
)
are printed -- which are \(\theta\)s of the contemporary species K, C and
L, and those of the 3 \(\tau\)s (mtau1 mtau2 mtau3
) which are the
divergence times for ancestral species KCLH, KCL and KC, respectively.
If there are many populations/species only the first few \(\tau\)s will be
printed to screen but all the \(\tau\)s will be available in the summary
created at the end of the run (see below). When the run finishes, a
final block is written summarizing the results. The first two rows give
the mean and median of the posterior distribution for each parameter.
This is followed by summaries of the statistical uncertainty: the
standard deviation (S.D.) of the posterior distribution; the minimum
(min) and maximum (max) values; the lower and upper bounds of the 95%
Credible Interval
(2.5% and 97.5%); the lower and upper bounds of the Highest Posterior
Density (HPD) Interval
(2.5%HPD and 97.5%HPD); the Effective Sample Size (ESS*); and the
Efficiency (Eff*). Larger values for ESS* and Eff* indicate better
mixing and more reliable estimates. The above information is also
printed (along with other technical details of the run) to the output
file specified by the jobname
variable in the control file (in our
example, jobname = out
and thus the output file is called out.txt
).
A tree file in NEXUS/Newick format is also created and placed in the file named
FigTree.tre which is formatted for viewing/printing using the
Figtree program
MCMC output file
A file will be produced (with a name specified by the control file
variable jobname
) that contains the mcmc samples from the run for all
the \(\theta\) and \(\tau\) parameters. When running under analysis method
A00 the contents of this file can be analyzed using a program such as
Tracer to visually examine the convergence and mixing of the MCMC
run. However, the other three analysis methods (A01, A10 and A11) are
trans-model analyses in which the number of parameters and or/the
meaning of the parameters changes as the chain runs so it is not correct
to examine the trace plot without conditioning on a particular model.
An example of 5 lines from the out.mcmc.txt
file produced by the above
example are given below:
Gen theta_1K theta_2C theta_3L theta_4H theta_5KCLH theta_6KCL theta_7KC tau_5KCLH tau_6KCL tau_7KC lnL
2 0.002848 0.010633 0.006000 0.002231 0.003697 0.001594 0.000704 0.001754 0.001717 0.001708 -4466.479
4 0.002848 0.011774 0.004789 0.002231 0.003697 0.002134 0.000798 0.001754 0.001717 0.001708 -4457.097
6 0.002848 0.010660 0.005826 0.002231 0.003697 0.001250 0.000799 0.001754 0.001717 0.001705 -4448.002
8 0.002848 0.012103 0.005826 0.002231 0.002615 0.001250 0.000803 0.001754 0.001717 0.001696 -4455.900
A01: Species Tree Estimation
Analysis A01 (speciedelimitation = 0, speciestree = 1)
assumes that the species delimitation is fixed
(e.g., species assignments are as specified in the Imap file) but that the species tree is unknown. The
program estimated the species tree topology in addition to the \(\theta\)s and \(\tau\)s.
Input A01
Our example uses the dataset of frog sequence data generated by Zhou et al 2012
that we considered previously. The contents of the control file A01.bpp.ctl
are shown below:
seed = -1
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
speciesdelimitation = 0 * fixed species tree
speciestree = 1 0.4 0.2 0.1 * speciestree pSlider ExpandRatio ShrinkRatio
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
species&tree = 4 K C L H
9 7 14 2
((K, C), (L, H));
phase = 1 1 1 1
usedata = 1 * 0: no data (prior); 1:seq like
nloci = 5 * number of data sets in seqfile
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
thetaprior = gamma 2 2000 # gamma(a, b) for theta (estimate theta)
tauprior = gamma 2 1000 # gamma(a, b) for root tau & Dirichlet(a) for other tau's
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0 # finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
print = 1 0 0 0 * MCMC samples, locusrate, heredityscalars, Genetrees
burnin = 8000
sampfreq = 2
nsample = 100000
The two lines:
speciesdelimitation = 0 * fixed species delimitation
speciestree = 1 0.4 0.2 0.1 * speciestree pSlider ExpandRatio ShrinkRatio
now specify that the species delimitation is fixed and the species tree topology is being estimated
speciestree = 1
. The 3 parameters following the speciestree
variable specify the probability that the
snakes and ladders (SNL) move is used, rather than the subtree-pruning-regrafting (SPR) move to modify the species tree
and the expand/shrink ratios for the SNL move; these proportions can be adjusted on the interval \((0,1)\)
and will affect mixing of the MCMC. For more details, see the description of the control file variable 14 speciestree.
If no parameters are specified after speciestree = 1
then the SPR move is used exclusively. See
Rannala and Yang 2017 for a description of the proposals available
to change species trees in the MCMC.
The line:
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
specifies the prior distribution on the species tree topologies. The specification speciesmodelprior = 1
uses
a prior that gives equal probability to all rooted trees apriori. See the description of the control file
variable 15 speciesmodelprior for more details. The available priors are described in
Yang and Rannala 2014.
Output A01
When BPP is run using the above control file a summary of the progress of the MCMC is again printed to screen as follows:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix Sspr Ssnl | mthet1 mtau1 log-PG log-L
---------------------------------------------------------------------------------------
(some output omitted here)
5% 0.64 0.26 0.31 0.28 0.27 0.0980 0.0000 0.0037 0.0019 1213.12189 -4448.22572 1:00
10% 0.64 0.26 0.30 0.28 0.27 0.0865 0.0000 0.0037 0.0019 1212.13271 -4447.30801 1:37
15% 0.64 0.26 0.30 0.27 0.27 0.0885 0.0000 0.0037 0.0019 1205.76722 -4446.44035 2:20
20% 0.64 0.26 0.30 0.27 0.27 0.0867 0.0000 0.0037 0.0019 1209.18617 -4445.20436 3:03
25% 0.64 0.26 0.31 0.27 0.27 0.0919 0.0000 0.0038 0.0019 1189.24742 -4444.02665 3:45
30% 0.65 0.26 0.31 0.27 0.27 0.0959 0.0000 0.0038 0.0018 1218.30083 -4443.50919 4:28
35% 0.65 0.26 0.31 0.27 0.27 0.0908 0.0000 0.0038 0.0019 1271.24840 -4443.30946 5:13
40% 0.65 0.26 0.31 0.27 0.27 0.0903 0.0000 0.0038 0.0018 1221.51424 -4442.93455 5:56
45% 0.65 0.26 0.31 0.28 0.27 0.0865 0.0000 0.0037 0.0018 1179.64791 -4442.58322 6:39
50% 0.65 0.26 0.31 0.28 0.27 0.0883 0.0000 0.0038 0.0018 1220.10587 -4442.33168 7:21
55% 0.65 0.26 0.31 0.28 0.27 0.0882 0.0000 0.0037 0.0018 1230.47093 -4442.53042 8:03
60% 0.65 0.26 0.31 0.28 0.27 0.0900 0.0000 0.0037 0.0019 1210.69277 -4443.13901 8:46
65% 0.65 0.26 0.31 0.28 0.27 0.0906 0.0000 0.0037 0.0019 1259.02426 -4443.42635 9:29
70% 0.65 0.26 0.31 0.28 0.27 0.0907 0.0000 0.0037 0.0019 1227.92007 -4443.74417 10:11
75% 0.65 0.26 0.31 0.28 0.27 0.0907 0.0000 0.0037 0.0019 1254.82843 -4443.96093 10:55
80% 0.65 0.26 0.31 0.28 0.27 0.0912 0.0000 0.0037 0.0019 1211.86344 -4443.93739 11:40
85% 0.65 0.26 0.31 0.28 0.27 0.0922 0.0000 0.0037 0.0019 1205.06522 -4443.99758 12:23
90% 0.65 0.26 0.31 0.28 0.27 0.0924 0.0000 0.0037 0.0019 1191.95318 -4444.06010 13:05
95% 0.65 0.26 0.31 0.28 0.27 0.0918 0.0000 0.0037 0.0019 1213.34484 -4444.05286 13:47
100% 0.65 0.26 0.31 0.28 0.27 0.0916 0.0000 0.0037 0.0019 1224.45872 -4443.84148 14:30
14:30 spent in MCMC
Species in order:
1. K
2. C
3. L
4. H
(A) Best trees in the sample (15 distinct trees in all)
33978 0.33978 0.33978 ((C, (H, L)), K);
17928 0.17928 0.51905 (C, ((H, L), K));
8961 0.08961 0.60866 ((C, K), (H, L));
8236 0.08236 0.69102 (C, ((H, K), L));
6896 0.06896 0.75998 (((C, H), L), K);
4548 0.04548 0.80546 (((C, L), H), K);
4066 0.04066 0.84612 (C, (H, (K, L)));
3037 0.03037 0.87649 ((C, (H, K)), L);
2775 0.02775 0.90424 (((C, K), H), L);
2245 0.02245 0.92669 (((C, H), K), L);
2131 0.02131 0.94800 ((C, L), (H, K));
1786 0.01786 0.96586 (((C, K), L), H);
1423 0.01423 0.98009 (((C, L), K), H);
1094 0.01094 0.99103 ((C, H), (K, L));
897 0.00897 1.00000 ((C, (K, L)), H);
(B) Best splits in the sample of trees (10 splits in all)
60867 0.608664 0011
45422 0.454215 0111
30230 0.302297 1011
13522 0.135219 1100
13404 0.134039 1001
10235 0.102349 0101
8102 0.081019 0110
8057 0.080569 1101
6057 0.060569 1010
4106 0.041060 1110
(C) Majority-rule consensus tree
(K, C, (L, H) #0.608664);
(D) Best tree (or trees from the mastertree file) with support values
((C, (H, L) #0.608664) #0.454215, K); [P = 0.339777]
The first part of this output are the current acceptance proportions and running means for different parameters as the MCMC was run. Columns 2 to 6 are the five acceptance proportions for the conventional MCMC moves, as discussed above. Columns 7 and 8 are the acceptance proportions for moves that change the species tree topology, both are much lower than the other parameter proposals, this is typical and it is often not possible to improve these acceptance proportions. Columns 9 and 10 are posterior means of \(\theta\) and \(\tau\) for the root population (-1 is printed in column 9 if \(\theta\)s are integrated out). The last two numbers are the log MSC gene-tree density (Rannala and Yang 2003) and the average log sequence likelihood (Felsenstein 1981).
Next, there are 4 sections: A, B, C and D summarizing the results. Section (A) lists the species trees in decreasing order of posterior probabilities. From this you can easily identify the 95% or 99% credibility set of species trees. Section (B) lists the splits (or bipartitions) and their posterior probabilities. The splits here take into account the location of the root, and may be different from the splits for unrooted trees. Section (C) prints the majority-rule consensus tree, with posterior probabilities for nodes.
MCMC output file
The MCMC sample of species trees is collected in the file out.mcmc.txt.
Below are five lines from that file. The numbers after :
are the branch
lengths (\(\tau\)s), while those after #
are the \(\theta\)s.
(K #0.002983: 0.001998, ((C #0.009671: 0.001684, H #0.003031: 0.001684) #0.001835: 0.000250, L #0.006266: 0.001934) #0.000560: 0.000065) #0.002770;
(K #0.003247: 0.002176, ((C #0.010528: 0.001871, H #0.004726: 0.001871) #0.002902: 0.000147, L #0.006615: 0.002017) #0.000950: 0.000158) #0.003016;
(K #0.003234: 0.002167, ((C #0.010486: 0.001843, H #0.003294: 0.001843) #0.002186: 0.000147, L #0.007822: 0.001989) #0.001087: 0.000178) #0.003003;
(K #0.002677: 0.001794, ((C #0.010973: 0.001516, H #0.001906: 0.001516) #0.001821: 0.000093, L #0.005827: 0.001609) #0.001261: 0.000185) #0.002486;
(K #0.002677: 0.001794, ((C #0.009869: 0.001516, H #0.003071: 0.001516) #0.001821: 0.000109, L #0.005788: 0.001625) #0.002495: 0.000169) #0.002486;
A10: Species Delimitation
Analysis A10 (speciedelimitation = 1, speciestree = 0)
assumes that the species delimitation is unknown
(e.g., population assignments as specified in the Imap file may not represent distinct species) but that the species guide tree is fixed (known). The
program calculates the posterior probabilities of different possible species delimitations as well as of \(\theta\)s and \(\tau\)s.
Input A10
Our example uses the dataset of frog sequence data generated by Zhou et al 2012
that we considered previously. The contents of the control file A10.bpp.ctl
are shown below:
seed = -1
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
speciesdelimitation = 1 1 2 1 * species delimitation rjMCMC algorithm1 finetune (a m)
speciestree = 0 * species tree NNI/SPR
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
species&tree = 4 K C L H
9 7 14 2
((K, C), (L, H));
phase = 1 1 1 1
usedata = 1 * 0: no data (prior); 1:seq like
nloci = 1 * number of data sets in seqfile
cleandata = 1 * remove sites with ambiguity data (1:yes, 0:no)?
thetaprior = gamma 2 2000 # gamma(a, b) for theta (estimate theta)
tauprior = gamma 2 1000 # gamma(a, b) for root tau & Dirichlet(a) for other tau's
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0 # finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
print = 1 0 0 0 * MCMC samples, locusrate, heredityscalars, Genetrees
burnin = 8000
sampfreq = 2
nsample = 100000
the lines:
speciesdelimitation = 1 1 2 1 * species delimitation rjMCMC algorithm1 finetune (a m)
speciestree = 0 * species tree NNI/SPR
specify that the species tree is fixed speciestree = 0
(a guide tree is used as specified in the control file)
and the rjMCMC algorithm 1, with \(\alpha = 2\) and \(m = 1\) in equations 6 and 7 of
Yang and Rannala 2010 is used for species
delimitation. Another possible specification for delimitation would be:
speciesdelimitation = 1 0 2
which would specify rjMCMC algorithm 0 with \(\epsilon = 2\) in equations 3 and 4 of Yang and Rannala 2010. The two algorithms in theory should produce identical results. The line:
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
specifies the prior distribution on topologies (see above) Prior 0 means
equal probabilities for labeled histories (which are rooted trees with
internal nodes ordered by their age). This is the prior used by
Yang and Rannala 2010.
Prior 1 means equal probabilities for rooted trees (now the default).
The prior with user specified probabilities
for nodes described by Rannala and Yang 2013 was removed from BPP after version 2.3.
We also reduced the number of loci to 1 (nloci = 1
) in this control file because otherwise species
delimitation model 5 (three species) has a posterior probability of 1.
Output A10
The species delimitation models that can be generated from the fixed guide tree will be
listed in the output, together with their prior probabilities calculated by
BPP. (As a check, if you use usedata = 0
, the MCMC
should be sampling from this prior distribution.) The species
delimitation model (for 4 populations as in this data set) is represented using three 0-1 flags for the three
interior nodes 5, 6, 7 in the guide tree, with 0 for 'collapsed' and
1 for 'resolved'. Note that the tips in the guide tree are numbered 1,
2, \(\cdots, s\) for \(s\) potential species, while the interior (ancestral)
nodes are numbered \(s + 1, s + 2, \cdots, 2s – 1\), with \(s + 1\) to be
the root of the guide tree. The numbering is through a tree-traversal
algorithm, fixed by the program. This same order is used to specify the
divergence time parameters (\(\tau\)s), so you can work out the order by
looking at the list of nodes in the screen output (look at the
"population by population table", "# species divergence times in the
order:", etc.).
When BPP is run using the above control file a summary of the progress of the MCMC is again printed to screen as follows:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix rj | np del mldp mthet1 mtau1 log-PG log-L
----------------------------------------------------------------------------------------------------
(some output omitted here)
5% 0.62 0.25 0.30 0.33 0.29 0.0061 10 111 P[5]=0.9726 0.0026 0.0009 217.01433 -972.53408 0:08
10% 0.63 0.25 0.30 0.34 0.29 0.0037 10 111 P[5]=0.9845 0.0026 0.0009 229.43024 -972.60186 0:12
15% 0.63 0.25 0.30 0.34 0.29 0.0033 10 111 P[5]=0.9886 0.0026 0.0009 224.68967 -972.41862 0:16
20% 0.63 0.25 0.30 0.34 0.29 0.0037 10 111 P[5]=0.9885 0.0026 0.0009 210.93188 -972.34981 0:21
25% 0.63 0.25 0.30 0.33 0.29 0.0038 10 111 P[5]=0.9892 0.0026 0.0009 224.49484 -972.24813 0:26
30% 0.63 0.25 0.30 0.33 0.28 0.0035 10 111 P[5]=0.9901 0.0026 0.0009 216.40848 -972.27563 0:30
35% 0.63 0.25 0.30 0.33 0.28 0.0033 10 111 P[5]=0.9907 0.0026 0.0009 224.00442 -972.50421 0:35
40% 0.63 0.25 0.30 0.33 0.28 0.0031 10 111 P[5]=0.9912 0.0026 0.0009 213.49838 -972.43257 0:40
45% 0.63 0.25 0.30 0.33 0.29 0.0034 10 111 P[5]=0.9905 0.0026 0.0009 209.30067 -972.37969 0:45
50% 0.63 0.25 0.30 0.33 0.29 0.0033 10 111 P[5]=0.9906 0.0026 0.0009 232.27193 -972.51538 0:50
55% 0.63 0.25 0.30 0.33 0.29 0.0035 10 111 P[5]=0.9900 0.0026 0.0009 203.35457 -972.50992 0:55
60% 0.63 0.25 0.30 0.33 0.29 0.0037 10 111 P[5]=0.9895 0.0026 0.0009 206.99225 -972.50202 1:00
65% 0.63 0.25 0.30 0.33 0.28 0.0038 10 111 P[5]=0.9898 0.0026 0.0009 217.40496 -972.48296 1:05
70% 0.63 0.25 0.30 0.33 0.29 0.0038 10 111 P[5]=0.9893 0.0026 0.0009 216.82582 -972.50246 1:10
75% 0.63 0.25 0.30 0.33 0.29 0.0037 10 111 P[5]=0.9896 0.0026 0.0009 204.99588 -972.56907 1:15
80% 0.63 0.25 0.30 0.33 0.29 0.0036 10 111 P[5]=0.9897 0.0026 0.0009 217.51889 -972.54354 1:21
85% 0.63 0.25 0.30 0.33 0.28 0.0037 10 111 P[5]=0.9893 0.0026 0.0009 212.80419 -972.51050 1:27
90% 0.63 0.25 0.30 0.33 0.29 0.0037 10 111 P[5]=0.9896 0.0026 0.0009 223.34976 -972.47616 1:33
95% 0.63 0.25 0.30 0.33 0.29 0.0036 10 111 P[5]=0.9893 0.0026 0.0009 215.63544 -972.48464 1:39
100% 0.63 0.25 0.30 0.33 0.28 0.0037 10 111 P[5]=0.9895 0.0026 0.0009 201.02451 -972.47685 1:45
1:45 spent in MCMC
Summarizing the species-delimitation sample in file out.mcmc.txt
Number of species-delimitation models = 5
model prior posterior
1 000 0.200000 0.000000
2 100 0.200000 0.000000
3 101 0.200000 0.004630
4 110 0.200000 0.005960
5 111 0.200000 0.989410
Order of ancestral nodes:
KCLH
KC
LH
Guide tree with posterior probability for presence of nodes:
((K, C)#0.995370, (L, H)#0.994040)#1.000000;;
The starting species delimitation model is generated by choosing at random one of the models defined by the guide tree or starting tree. After the chain has started, columns 2-6 give the acceptance proportions for the conventional MCMC moves discussed above while column 7 (rj) now shows the acceptance proportion for reversible jump proposals (collapsing or expanding species delimitations on the guide tree). In general the larger this proportion, the more efficient the rjMCMC algorithm is. However there is no optimal acceptance proportion for the rjMCMC move, and a value close to 0 may not necessarily mean a problem. If one model has posterior probability close to 1, the acceptance proportion should be near 0 as well. Thus both poor mixing of the rjMCMC algorithm and extreme posterior model probabilities can cause the acceptance proportion for the rjMCMC to be close to 0. It has been noted that if the rjMCMC algorithm is suffering from poor mixing, different starting species trees often lead to different results. Next there are three numbers related to the rjMCMC move. The rest of the line shows the posterior means for \(\theta\) and \(\tau_0\) of the root ancestral population (which exist in all species-tree or species delimitation models), the current log MSC density and average log sequence likelihood. The three numbers in columns 8-10 that are related to the rjMCMC move, for example,
10 111 P[5]=0.9895
represent the number of parameters in the current model (np), in this case 10, the species delimitation, in this case 111, and the current posterior probability of the most probable delimitation P[5]=0.9895. This means that delimitation model 5 (111) is the most favoured model, with the posterior at 0.9895.
After the MCMC is finished, the program will summarize the sample. This output is:
Summarizing the species-delimitation sample in file out.mcmc.txt
Number of species-delimitation models = 5
model prior posterior
1 000 0.200000 0.000000
2 100 0.200000 0.000000
3 101 0.200000 0.004630
4 110 0.200000 0.005960
5 111 0.200000 0.989410
Order of ancestral nodes:
KCLH
KC
LH
Guide tree with posterior probability for presence of nodes:
((K, C)#0.995370, (L, H)#0.994040)#1.000000;;
The five delimitation models are listed, together with their posterior and prior probabilities.
MCMC output file
The MCMC output was saved in the file out.mcmc.txt. As the number of parameters changes when the rjMCMC moves between models, the mcmc sample file may not be very useful, so you can often ignore it. Right now the header line is generated using the starting species tree and should be ignored. After the header line, each line of output has the following numbers, separated by TABs: iteration number, the number of parameters, the tree, the sampled parameter values, and lnL. Here are the first 6 lines of out.mcmc.txt:
Gen np tree theta_1K theta_2C theta_5KCLH theta_6KC theta_7LH tau_5KCLH tau_6KC lnL
2 10 111 0.001207 0.003438 0.003433 0.000722 0.004404 0.003324 0.000834 0.000736 0.000516 0.000528 -972.035
4 10 111 0.001207 0.003438 0.001733 0.000874 0.004570 0.003914 0.001036 0.000736 0.000596 0.000528 -971.105
6 10 111 0.001207 0.003438 0.003411 0.001107 0.003353 0.001307 0.001944 0.000736 0.000708 0.000386 -973.526
8 10 111 0.001698 0.004529 0.001902 0.000913 0.003032 0.000349 0.001474 0.000607 0.000584 0.000346 -972.899
10 10 111 0.001979 0.005280 0.002217 0.001065 0.004159 0.000407 0.001990 0.000707 0.000680 0.000348 -970.674
If you know the unix command grep
, you can retrieve the lines for the
same tree model to summarize the posterior for parameters in that model, for example,
grep "3 111" out.mcmc.txt > result.Tree111.txt
In theory this should give you the same posterior as if you run analysis A00 with the species tree fixed at tree 111. In practice it is simpler to edit the Imap file and the control file to run analysis A00.
Running rjMCMC Algorithms.
-
The rjMCMC algorithms for species delimitation allow the chain to move from one model to another but can have mixing problems. Check that similar results are produced from multiple runs using Algorithm0 and Algorithm1 regardless of the starting species tree.
The starting tree is chosen by the program at random and will vary among runs. Make sure that some of your runs are started with the one-species model, some from the fully resolved tree, and some from other trees in between. If you get consistent results among runs with different starting trees and using the two algorithms, you are unlikely to have a convergence or mixing problem.
If you have a computer with multicore, you can run those different combinations or replicates in different folders at the same time.
-
In medium-sized or large datasets with multiple loci, we have come across cases where the chain is stuck at the one-species model, or have difficulty moving into the one-species model. The problem is less common when there are only 1 or 2 loci. If you have a similar problem, you may start the analysis with 1 locus, then 2 loci, etc. to observe how the results change (you can do this by changing nloci in the control file as there is no need to change the sequence file).
-
The program used the burnin to adjust the step lengths for the proposals in the MCMC algorithm. If the rjMCMC stays in species model 0, which does not have any \(\tau\) parameter, no information is collected during the burnin about the proposals to change \(\tau\) and the automatically adjusted step length for \(\tau\) can be very poor.
-
You probably need to evaluate the impact of the priors on \(\theta\)s and \(\tau\). See the note about the gamma distribution later in this document.
A11: Joint Species Delimitation and Species Tree Estimation
Analysis A11 (speciedelimitation = 1, speciestree = 1)
assumes that the species delimitation is unknown
(e.g., population assignments as specified in the Imap file may not represent distinct species) and also that the species tree topology relating
the species of any delimitation must be estimated (see Yang and Rannala 2014). The program calculates the joint posterior probabilities of different possible species delimitations and species tree topologies as well as of \(\theta\)s and \(\tau\)s.
Input A11
Our example uses the dataset of frog sequence data generated by Zhou et al 2012
that we considered previously. The contents of the control file A11.bpp.ctl
are shown below:
seed = -1
seqfile = frogs.txt
Imapfile = frogs.Imap.txt
jobname = out
speciesdelimitation = 1 1 2 1 * species delimitation rjMCMC algorithm1 finetune (a m)
speciestree = 1 0.4 0.2 0.1 * speciestree pSlider ExpandRatio ShrinkRatio
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
species&tree = 4 K C L H
9 7 14 2
(((K, C), L), H);
phase = 1 1 1 1
usedata = 1 * 0: no data (prior); 1:seq like
nloci = 5 * number of data sets in seqfile
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
thetaprior = gamma 2 2000 # gamma(a, b) for theta (estimate theta)
tauprior = gamma 2 1000 # gamma(a, b) for root tau & Dirichlet(a) for other tau's
* heredity = 1 4 4
* locusrate = 1 5
finetune = 1: 5 0.001 0.001 0.001 0.3 0.33 1.0 # finetune for GBtj, GBspr, theta, tau,
mix, locusrate, seqerr
print = 1 0 0 0 * MCMC samples, locusrate, heredityscalars, Genetrees
burnin = 8000
sampfreq = 2
nsample = 100000
the lines:
speciesdelimitation = 1 1 2 1 * species delimitation rjMCMC algorithm1 finetune (a m)
speciestree = 1 0.4 0.2 0.1 * speciestree pSlider ExpandRatio ShrinkRatio
specify that the rjMCMC algorithm 1, with \(\alpha = 2\) and \(m = 1\) in equations 6 and 7 of
Yang and Rannala 2010 is used for species
delimitation and the species tree topology is being estimated
speciestree = 1
. As noted above, the 3 parameters following the speciestree
variable specify the probability that the
SNL move is used, rather than the subtree-pruning-regrafting (SPR) move to modify the species tree
and the expand/shrink ratios for the SNL move; these proportions can be adjusted on the interval \((0,1)\)
and will affect mixing of the MCMC. The line:
speciesmodelprior = 1 * 0: uniform LH; 1:uniform rooted trees; 2: uniformSLH; 3: uniformSRooted
specifies the prior distribution on delimitations and tree topologies.
For analysis A11, speciesmodelprior
can take the four values 0, 1, 2, 3, which
mean Priors 0, 1, 2, 3, respectively. As mentioned above, Prior 1 (which
is the default) assigns equal probabilities to the rooted species trees,
while Prior 0 means equal probabilities for the labeled histories
(rooted trees with the internal nodes ordered by age). Priors 2 and 3
assign equal probabilities for the numbers of species (\(1/s\) each for
\(1,2,\ldots,s\) species given \(s\) populations) and then divide up the
probability for any specific number of species among the compatible
models (of species delimitation and species phylogeny) either uniformly
(Prior 3) or in proportion to the labeled histories (Prior 2).
Priors 2 and 3 are discussed in Yang and Rannala 2014.
Prior 3 may be suitable when there are many populations.
Output A11
When BPP is run using the above control file a summary of the progress of the MCMC is again printed to screen as follows:
| Acceptance proportions |
Prgs | Gage Gspr thet tau mix Sspr Ssnl rj | sp np mldp mthet1 mtau1 log-PG log-L
----------------------------------------------------------------------------------------------------------------
(some output omitted here)
5% 0.65 0.26 0.30 0.28 0.30 0.0904 0.0000 0.0000 4 10 P[4]=1.0000 0.0041 0.0017 1239.61701 -4439.20352 0:59
10% 0.65 0.26 0.30 0.27 0.30 0.0964 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0017 1221.22458 -4437.63784 1:35
15% 0.65 0.26 0.30 0.26 0.30 0.0962 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0017 1238.23420 -4436.36392 2:13
20% 0.65 0.26 0.30 0.26 0.30 0.0998 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0016 1207.06023 -4435.09525 2:54
25% 0.65 0.26 0.30 0.26 0.30 0.1005 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1211.57636 -4434.31571 3:37
30% 0.65 0.26 0.30 0.26 0.30 0.1018 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1199.97256 -4434.32883 4:19
35% 0.65 0.26 0.30 0.26 0.30 0.1023 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1200.88102 -4433.94315 4:57
40% 0.65 0.26 0.30 0.26 0.30 0.1002 0.0000 0.0000 4 10 P[4]=1.0000 0.0044 0.0016 1191.04407 -4434.06936 5:38
45% 0.65 0.26 0.30 0.26 0.30 0.1045 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1193.71849 -4434.12002 6:21
50% 0.65 0.26 0.30 0.26 0.30 0.1036 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1247.11644 -4434.33624 7:05
55% 0.65 0.26 0.30 0.26 0.30 0.1017 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1218.84106 -4434.33556 7:43
60% 0.65 0.26 0.30 0.26 0.30 0.0994 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1233.73293 -4434.00290 8:22
65% 0.65 0.26 0.30 0.26 0.30 0.1005 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1235.85799 -4433.85179 9:06
70% 0.65 0.26 0.30 0.26 0.30 0.0995 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1218.90538 -4434.20067 9:50
75% 0.65 0.26 0.30 0.26 0.30 0.1002 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1214.40621 -4434.67870 10:32
80% 0.65 0.26 0.30 0.26 0.30 0.0999 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1215.56021 -4435.19959 11:13
85% 0.65 0.26 0.30 0.26 0.30 0.0980 0.0000 0.0000 4 10 P[4]=1.0000 0.0043 0.0016 1227.95558 -4435.47778 11:52
90% 0.65 0.26 0.30 0.26 0.30 0.0963 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0016 1225.62882 -4435.73626 12:35
95% 0.65 0.26 0.30 0.27 0.30 0.0958 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0017 1244.11759 -4435.87327 13:19
100% 0.65 0.26 0.30 0.26 0.30 0.0949 0.0000 0.0000 4 10 P[4]=1.0000 0.0042 0.0017 1214.44552 -4436.19194 14:03
14:03 spent in MCMC
(A) List of best models (count postP #species SpeciesTree)
24746 0.247460 0.247460 4 (C H K L) ((C, (H, L)), K);
20105 0.201050 0.448510 4 (C H K L) (C, ((H, L), K));
10918 0.109180 0.557690 4 (C H K L) (C, ((H, K), L));
9358 0.093580 0.651270 4 (C H K L) ((C, K), (H, L));
6874 0.068740 0.720010 4 (C H K L) ((C, (H, K)), L);
5122 0.051220 0.771230 4 (C H K L) (((C, H), L), K);
4616 0.046160 0.817390 4 (C H K L) (C, (H, (K, L)));
3746 0.037460 0.854850 4 (C H K L) (((C, K), H), L);
3681 0.036810 0.891660 4 (C H K L) (((C, H), K), L);
3080 0.030800 0.922460 4 (C H K L) (((C, L), H), K);
1741 0.017410 0.939870 4 (C H K L) ((C, L), (H, K));
1693 0.016930 0.956800 4 (C H K L) (((C, K), L), H);
1691 0.016910 0.973710 4 (C H K L) ((C, H), (K, L));
1474 0.014740 0.988450 4 (C H K L) ((C, (K, L)), H);
1155 0.011550 1.000000 4 (C H K L) (((C, L), K), H);
(B) 1 species delimitations & their posterior probabilities
100000 1.000000 4 (C H K L)
(C) 4 delimited species & their posterior probabilities
100000 1.000000 H
100000 1.000000 C
100000 1.000000 L
100000 1.000000 K
(D) Posterior probability for # of species
P[1] = 0.000000 prior[1] = 0.238095
P[2] = 0.000000 prior[2] = 0.238095
P[3] = 0.000000 prior[3] = 0.285714
P[4] = 1.000000 prior[4] = 0.238095
The arrangement of acceptance proportions, running means etc, are very similar to those of analysis
A10, except that there are two additional columns (with headers Sspr
and Ssnl
) that provide the
acceptance proportions for the SPR and SNL moves proposing changes to the species tree topology.
There are four sections of output that summarize the results.
Section (A) lists the best models in the decreasing order of the
posterior probabilities. Here a model is a full MSC model that species
both the species delimitation and species phylogeny. From this section,
one can easily construct the 95% or 99% credibility sets of models. This
section is further summarized to produce sections B, C, and D. Section
(B) gives the posterior probabilities for the top few delimitations.
Section (C) lists the delimited species and their posterior
probabilities, and section (D) lists the posterior probability for the
number of species, together with the prior probabilities calculated by
BPP.
MCMC output file
The sampled trees and delimitations from the MCMC are printed to the specified file out.mcmc.txt
. An example of 5 lines from this file is:
(C #0.010039: 0.001675, ((L #0.006075: 0.001466, H #0.003617: 0.001466) #0.002155: 0.000205, K #0.003031: 0.001671) #0.000955: 0.000004) #0.004713; 4
(C #0.009000: 0.001462, ((L #0.006482: 0.001228, H #0.001665: 0.001228) #0.001957: 0.000230, K #0.002632: 0.001458) #0.000511: 0.000004) #0.003204; 4
((C #0.009488: 0.001537, K #0.002525: 0.001537) #0.001082: 0.000004, (L #0.006926: 0.001384, H #0.002947: 0.001384) #0.001931: 0.000157) #0.003378; 4
((C #0.010464: 0.001806, K #0.002966: 0.001806) #0.001272: 0.000004, (L #0.008138: 0.001487, H #0.002791: 0.001487) #0.002480: 0.000323) #0.004458; 4
((C #0.009022: 0.001765, K #0.002899: 0.001765) #0.001243: 0.000004, (L #0.007955: 0.001449, H #0.003637: 0.001449) #0.002001: 0.000321) #0.004358; 4
Because the number of species delimited and the tree topology are both potentially changing during the MCMC this file is usually not useful for summarizing the results unless a custom script filtering the MCMC samples is used. This file is used by the program to produce the summaries printed to screen at the end of the run.
Adjusting step lengths for MCMC moves (finetune)
The specified proposal step lengths are the initial values when automatic adjustment is used and are the fixed step lengths when manual specification of step lengths is used. The automatic adjustment of the finetune variable (MCMC step lengths) appears to be reliable and is recommended. However, be sure to check that the resulting acceptance proportions are neither too small nor too large. Here we provide some guidance for manual adjustments of the step lengths. A good strategy is to use automatic adjustments initially to generate good step lengths and then copy the step lengths into the control file, while continuing to use automatic adjustment; that way the next time the control file is run the automatic adjustment procedure will start with near-optimal step lengths and less optimization will therefore be required.
Adjusting the step lengths manually (finetune = 0)
First note the following line found in the control file yu2001.bpp.ctl
.
Here finetune = 1 means that MCMC step lengths will be adjusted
automatically, and the specified values are used as initial values.
finetune = 1: 2 0.00002 0.0001 0.0005 0.5 0.2 1.0
There are seven finetune steplengths here. They are in a fixed order and always read by the program even if the concerned proposal is not used. The first five of them are \(\epsilon_1\), \(\epsilon_2\), \(\epsilon_3\), \(\epsilon_4\), and \(\epsilon_5\), described in (Rannala and Yang 2003). These are the step lengths used in the MCMC proposals that (1) change internal node ages in the gene tree, (2) prune and re-graft nodes in the gene tree (SPR), (3) update \(\theta\), (4) update \(\tau\)s using the rubber-band algorithm, and (5) implements the mixing step. The 6th and 7th are for the proposals that change the locus rates or heredity multipliers and that change the sequencing errors, respectively. If the model assumes the same rate for all loci and does not use heredity multipliers, the 6th proposal step is not used. If the model assumes no sequencing errors, the 7th step is not used. The acceptance proportions for the first five proposals are always printed out on the screen, but those for the 6th and 7th are printed out only if the concerned proposal is used in the model. In the example above, only the first five proposals are used in the algorithm and we will manually change the step lengths so that the acceptance proportions become close to 30%. First, edit the line, changing it to:
finetune = 0: 2 0.00002 0.0001 0.0005 0.5 0.2 1.0
The basic strategy for adjusting fine-tune parameters is if the acceptance proportion is too small (e.g., \(<0.10\)), decrease the corresponding finetune parameter. If the acceptance proportion is too large (e.g., \(>0.80\)), increase the finetune parameter. Run the program for a small number of iterations and look at the screen output to examine the acceptance proportions:
0% 0.71 0.19 0.34 0.00 0.41 0.0004 463.81031 -12720.98567 0:04
5% 0.72 0.19 0.32 0.00 0.41 0.0003 458.65988 -12720.53426 0:04
10% 0.72 0.19 0.34 0.00 0.42 0.0004 454.99289 -12720.66361 0:05
15% 0.72 0.19 0.34 0.00 0.41 0.0004 481.20520 -12720.78839 0:06
Here the first acceptance proportion, at 0.72, appears too large, which means that the corresponding finetune parameter (2 in the control file) is too small, similarly the second acceptance proportion, at 0.19, appears too small, which means that the corresponding finetune parameter is too large. Terminate the run (Ctrl-C) and edit the finetune values in the control file and then run the program again, checking the acceptance rates. Repeat this process a few times until every acceptance proportion is neither too small nor too large. In this example, changing the second finetune parameter (SPR) from \(0.00001\) to \(0.000003\) brings the acceptance proportion to about \(0.30\), namely:
10% 0.71 0.31 0.34 0.00 0.30 0.0004 447.59035 -12721.04474 0:06
15% 0.71 0.30 0.33 0.00 0.30 0.0004 487.91342 -12720.96697 0:07
20% 0.71 0.30 0.32 0.00 0.30 0.0003 430.67545 -12720.83600 0:08
However, increasing the first finetune parameter (even by an order of magnitude to 20) has little effect. Sometimes it will not be possible to reduce the acceptance proportion for a parameter, as in this case (even using automatic adjustment produces an acceptance proportion of about \(0.71\) in this case).
The MCMC proposals discussed above are used in all four analyses (A00, A01, A10, A11), so that the description here applies to all of them. Note that the finetune parameters affect the efficiency of the MCMC or how fast one can obtain reliable results. In theory they do not change the results if runs using different finetune parameters (with potentially different acceptance proportions) are all run for long enough to generate reliable results.
Simulating Gene Trees and Alignments Using BPP
The --simulate
option of BPP4 replaces the earlier MCCOAL program distributed with
BPP versions 3.4 and earlier. This option can be used to simulate gene trees and
sequence alignments at multiple loci under the MSC (Rannala and Yang 2003) and
MSC-I (Flouri et al 2020) models.
The simulation program allows a GTR+G DNA substitution model (and its special cases)
and allows among-loci heterogeneity in the process of sequence evolution. The different loci can have
different exchangeability parameters (\(a, b, c, d, e, f\)) and base
compositions in the GTR model, different overall evolutionary rates,
differing extents of among-site rate variation (reflected in the alpha
parameter for the gamma model), and different among-species rate drift processes
(violations of the clock). See @Shi2018 for examples of such simulations.
The simulation option also includes a MSC-M continuous migration model, with a
user-specified migration rate matrix between species/populations.
Running the simulation program
To run BPP using the simulation option, create a suitable control file (for example, named MCcoal.ctl) and type the following:
bpp --simulate MCcoal.ctl
Always examine the screen output carefully to confirm that the program has read and executed the control file correctly.
Simulation under the MSC model
The control file
Here we consider a simple example of a simulation under the MSC model with no migration or introgression. The example control file for simulating sequence alignments and gene trees has the following content:
seed = 12345
seqfile = MySeq.txt * comment out this line if you don't want seqs
treefile = MyTree.tre * comment out this line if you don't want trees
Imapfile = MyImap.txt
* concatfile = concatenatedfile.txt * concatenated alignment
modelparafile = modelparas.txt * comment out this line if you don't want seqs
species&tree = 4 A B C D
3 2 1 1
((A #0.01, B #0.01) :0.01 #.01, (C, D) :0.011 #0.01) : 0.012 #0.01;
phase = 1 1 0 0 * 0: do not phase (fully phased seqs), 1: diploid unphased seqs
loci&length = 100 1000 * number of loci & number of sites at each locus
* locusrate = 0 (default: same rate for all loci)
* locusrate = mu_bar a_mui prior
* clock = 1 (default: strict clock)
* clock = 2 v_bar a_vi prior dist (independent-rates)
* clock = 3 v_bar a_vi prior dist (correlated-rates)
model = 7 * model: 0:JC69, 7:REV (GTR)
Qrates = 0 10 5 5 5 5 10 * 1: fixed; 0: dirichlet, for TC TA TG CA CG AG
basefreqs = 0 10 10 10 10 * 0: random, Dirichlet(aT,aC,aA,aG), for base frequencies
alpha_siterate = 0 100 20 5 * G(a, b) for alpha for sites & K for discrete gamma
We now go through each control file variable and explain its purpose.
seed
Use seed = -1
to simulate different datasets each time the program is
run (obtaining a random number seed from the computer clock).
If you use a positive integer as a seed the data will be exactly the same
each time the program is run with that control file; this is usually not
what you want to do.
seqfile, treefile, Imapfile, concatfile, modelparafile
These variables specify the names of files to be generated. If you want to simulate gene trees and
not sequence files, you can comment out the line for the seqfile
.
The file specified by concatfile
will contain the sequence alignment concatenated across loci.
The sequence files can become very large. The file specified by modelparafile
contains the
parameter values for the GTR model, as well as the locus rate and the
rates for nodes on the species tree. There is one line of output per
locus.
phase
This variable is used in the same way as in the BPP inference (data
analysis) mode. Each species is assigned a switch/flag, with 0 meaning
fully phased (haplotype) sequences and 1 for unphased diploid sequences.
With the specifications in the example (sampling configuration 3 2 1 1
and phase = 1 1 0 0), the program will simulate 6, 4, 1, 1 sequences for
A, B, C, D, respectively, and then combine every two sequences from A,
B, C into one single diploid sequence with heterozygote sites
represented using ambiguity characters (YRMKSW). If you use the same
random number seed, the diploid data generated using the sampling
configuration (3 2 1 1) with phase = 1 1 0 0 should match the fully
resolved sequences generated using the sampling configuration (6 4 1 1)
with phase = 0 0 0 0. The program also prints out the fully resolved
sequences in a file with the suffix _full
(or MySeq_full.txt
in the
example).
loci&length
This variable specifies the number of loci to be simulated and the length
(number of sites) for each loci. In this example, 100 loci,
each of 1000 sites, will be simulated.
species&tree
This block specifies the number and names of species,
the species tree, and the parameters under the multispecies coalescent
model, including the species divergence times (\(\tau\)s) and population
size parameters (\(\theta = 4N\mu\)). For the example above, 100 loci,
each of 1000 sites, will be simulated, on the species tree ((A, B), (C, D)),
with 3, 2, 1, 1 sequences for A, B, C, D, so that there are 7
sequences at each locus. The divergence time parameters (\(\tau\)s) are
after :
in the tree, while the population size parameters (\(\theta\)s)
are after #
. Thus we have 0.01 for all \(\theta\)s, and
\(\tau_{ABCD} = 0.012\), \(\tau_{AB} = 0.01\), and \(\tau_{CD} = 0.011\). We
need \(\theta_A\) and \(\theta_B\) because 2 or more sequences are sampled
from species A and B. Parameters \(\theta_C\) and \(\theta_D\) are
unnecessary in this case as only one sequence is sampled from C and D:
if you specify them, they will be ignored by the program. Note that both
\(\theta\)s and \(\tau\)s are measured by the expected number of mutations
per site, so that \(\theta = 0.01\) means that two sequences sampled from
the population are 1% different. For reference, the estimate for the
human species is \(\theta_H = 0.0006\).
model
This variable can take two possible values: 0 for JC69 and 7 for GTR.
The next few lines are relevant for the GTR model.
Qrates
This variable specifies the exchangeability parameters in the GTR model
(\(a, b, c, d, e, f\) for TC, TA, TG, CA, CG, and AG, in Yang 1994).
There are two options, either to have those rates fixed for all loci or
to have them sampled at random. The first has the following format (note
that the first 0-1 number is a switch):
Qrates = 1 2 1 1 1 1 2 * 1: fixed; for TC TA TG CA CG AG
The input is one integer value (0 or 1) which acts as a switch followed by six real numbers. The line above fixes the rates at \(a = 2, b = c = d = e = 1\) and \(f = 2\), so that the transition rate is twice as high as the transversion rate, and the model corresponds to K80 or HKY. Note that only the relative rates matter, because the rate matrix (\(Q\)) is scaled so that the average rate (over the base frequencies) is 1 and branch lengths are measured in the expected number of mutations/substitutions per site. Nevertheless the program expects six rate parameters in the input here:
Qrates = 0 10 5 5 5 5 10 * 0: dirichlet, for TC TA TG CA CG AG
The second option, above, is to sample \(a, b, c, d, e\) from the Dirichlet distribution for every locus. The above specifies Dir(10, 5, 5, 5, 5, 10), with parameters \(\alpha_a = 10, \alpha_b = \alpha_c = \alpha_d = \alpha_e = 10\), and \(\alpha_f = 10\). The rates generated from the Dirichlet sum to 1, and they are rescaled. Note that larger values for those parameters mean less variance, while the mean for \(a\), say, is given by \(\alpha_a/\alpha\) with \(\alpha = \alpha_a + \alpha_b + \alpha_c + \alpha_d + \alpha_e + \alpha_f\). The specification here gives on average a transition/transversion rate ratio of 2 (if the base frequencies are fixed at \(\frac{1}{4}\) each), but it varies among loci.
basefreqs
The base frequency parameters
(\(\pi_T, \pi_C, \pi_A, \pi_G\)) can also be fixed or sampled from the
Dirichlet distribution, like the Qrates
. The two options are as
follows:
basefreqs = 1 0.15 0.35 0.15 0.35 * 1: fixed; Base frequencies are for TCAG.
basefreqs = 0 10 10 10 10 * 0: random, Dirichlet(aT, aC, aA, aG), for base frequencies
alpha_siterate
The gamma shape parameter (\(\alpha\)) for rate
variation among sites of the same locus can be fixed or sampled from a
gamma distribution. The possible options are as follows. When alpha is
fixed, the same value is used for all loci. Otherwise different
\(\alpha\)s are used for different loci. Note that given alpha, the rates
for sites at the same locus have a G((\(\alpha, \alpha\)) distribution
with mean 1 [@Yang1993; @Yang1994]. The fourth option below will allow
the program to sample \(\alpha\) from \(G(100, 20)\), with mean 5, for each
locus and then use \(G(\alpha, \alpha)\) to describe the among-site rate
variation for the locus:
alpha_siterate = 1 0 * 1: alpha fixed at 0(inf), one rate for all sites at the locus
alpha_siterate = 1 5.6 0 * 1: alpha fixed at 5.6, K = 0(inf) for continuous gamma
alpha_siterate = 1 5.6 5 * 1: alpha fixed at 5.6, K = 5 categories for discrete gamma
alpha_siterate = 0 100 20 5 * 0: alpha sampled from G(100, 20), with mean 5, K = 5 for discrete gamma
The variables locusrate
and clock
are used to specify
relaxed-clock models, with rate variation among branches and among loci.
These are the same models as implemented in the inference program.
Please see Models of Substitution Rate Variation for details.
locusrate
This specifies variable rates for different loci. The default is
0, meaning that all loci have the same rate. To specify variable rates
for loci, use the following format:
locusrate = 0 (default)
locusrate = mu_bar a_mui prior
locusrate = mu_bar a_mui
locusrate = 1.0 5.0 dir
Here mu_bar
is the mean rate across loci.
If prior = dir
(gamma-Dirichlet), the total rate (\(L\bar\mu\)) is
partitioned to generate the locus-rates (\(\mu_i\)) according to the
Dirichlet distribution with concentration parameter \(\alpha\) (a_mui
).
Large \(\alpha\) (10 or 100, say) means that the rates are similar among
loci, while small values (1 or 0.5, say) mean that rates are highly
variable among loci. If prior = iid
(conditional i.i.d.), the locus
rates (\(\mu_i\)) are i.i.d. given the mean rate (\(\bar\mu\)):
mu_i ~ Gamma(a_mui, a_mui/mubar)
clock
This variable is used to specify strict-clock or relaxed-clock models.
The default is clock = 1
for strict clock, while clock = 2
is the
independents-rates model, and clock = 3
is the correlated-rates model
(which is implemented for the MSC model with no introgression and is
unavailable for the MSC-I model).
clock = 1 (default: strict clock)
clock = 2 v_bar a_vi prior dist * independent-rates model
clock = 2 0.1 10 dir G * gamma-dirichlet for loci & gamma for branches
clock = 2 0.1 10 iid LN * iid for loci & log-normal for branches
clock = 3 v_bar a_vi prior dist * correlated-rates model
clock = 3 0.1 10 iid G * iid for loci & gamma for branches
clock = 3 0.1 10 iid LN * iid for loci & log-normal for branches
The specification (clock = 2 0.1 10 dir G
) means the following. First
the average variance parameter is = 0.1. Then the total \(L\bar\mu\) is
partitioned into \(\mu_i\) (for \(i = 1, 2, \cdots, L\)) according to the
Dirichlet distribution with concentration parameter \(\alpha = 10\)
(a_vi
). In contrast, the specification (clock = 2 0.1 10 iid LN
) means that
the variance parameter for locus \(\nu_i\) is generated as
\(\nu_i|\bar\nu \sim G(\alpha, \alpha/\bar\nu)\) with shape parameter
\(\alpha = 10\) (a_vi
) and mean \(\bar\nu = 0.1\).
In both prior models (prior=dir
and iid
), \(\alpha\) (a_vi
) is
inversely related to the variance in \(\mu_i\) among loci: use small
values of \(\alpha\) (2, 1, or 0.5) if the clock nearly holds at some loci
but is seriously violated at others, and large values (such as 10 or
100) if the clock is violated in the same extent at different loci.
clock = 2
Given the locus rate \(\mu_i\) (specified using the
locusrate
variable) and the variance parameter \(\nu_i\) (specified
using the clock
variable) for locus \(i\), the rates for species-tree
branches are specified as follows. If dist = G
(for gamma), the rate
for branch \(j\) at locus \(i\) has the following gamma distribution with
mean \(\mu_i\) and variance \(\nu_i\)
\(r_{ij} | \mu_i, \nu_i \sim G(\mu_i^2/\nu_i, \mu_i/\nu_i).\)
If dist = LN
(for log-normal), the rate has the log-normal
distribution
\(r_{ij} | \mu_i, \nu_i \sim LN(\mu_i, \nu_i).\)
clock = 3
The correlated-rates model is specified similarly, and for
the MSC model only. Given the locus rate \(\mu_i\) (specified using the
locusrate
variable) and the variance parameter \(\nu_i\) (specified
using the clock
variable) for locus \(i\), the rates for species-tree
branches are specified recursively, starting from the root towards the
tips.
If dist = G
(for gamma), the rate for branch \(j\) at locus \(i\) has the
following gamma distribution with shape parameter
\(\alpha = \mu_i^2/\nu_i\) and with the mean to be the rate for the
parental branch of \(j\) or \(r_{i,p(j)}\):
\(r_{ij} | \mu_i, \nu_i, r_{i,p(j)} \sim G(\frac{\mu_i^2}{\nu_i}, \frac{\mu_i^2}{\nu_i r_{i,p(j)}}).\)
If dist = LN
(for log-normal), the rate for branch \(j\) at locus \(i\)
has the log-normal distribution with the mean to be the parental rate
and variance parameter \(\nu_i\):
\(r_{ij} | \nu_i, r_{i,p(j)} \sim LN(r_{i,p(j)}, \nu_i).\)
After the rates for branches (populations) on the species tree are generated for the locus, they are used to calculate the gene-tree branch lengths. Gene-tree branches residing in the same species/populations have the same rate (the rate of that species at the locus), while those residing in different species may have different rates. A branch on a gene tree may have segments residing in different species/populations, and the length of the branch (measured by the expected number of mutations/substitutions per site) is calculated by adding up those segments.
Program output
Running the program on the above control file produces the following screen output, indicating that the simulation has been successful:
bpp --simulate MCcoal.ctl
Site rates: alpha sampled from Gamma(100.000000,20.000000), K = 5
bpp v4.7.0_linux_x86_64, 15GB RAM, 8 cores
https://github.com/bpp/bpp
Detected CPU features: mmx sse sse2 sse3 ssse3 sse4.1 sse4.2 popcnt avx avx2
Auto-selected SIMD ISA: AVX2
Substitution model: GTR with generated rates from Dirichlet
10.000000 5.000000 5.000000 5.000000 5.000000 10.000000
Base frequencies: generated from Dirichlet
a_T=10.000000 a_C=10.000000 a_A=10.000000 a_G=10.000000
4 species: A (3) B (2) C (1) D (1)
((A #0.010000: 0.010000, B #0.010000: 0.010000) #0.010000: 0.002000, (C: 0.011000, D: 0.011000) #0.010000: 0.001000) #0.010000;
Map of populations and ancestors (1 in map indicates ancestor):
Species 1 2 3 4 5 6 7
1 A 1 0 0 0 1 1 0 tau = 0.000000 theta = 0.010000
2 B 0 1 0 0 1 1 0 tau = 0.000000 theta = 0.010000
3 C 0 0 1 0 1 0 1 tau = 0.000000 theta = 0.000000
4 D 0 0 0 1 1 0 1 tau = 0.000000 theta = 0.000000
5 ABCD 0 0 0 0 1 0 0 tau = 0.012000 theta = 0.010000
6 AB 0 0 0 0 1 1 0 tau = 0.010000 theta = 0.010000
7 CD 0 0 0 0 1 0 1 tau = 0.011000 theta = 0.010000
Sequence data file -> MySeq.txt
Trees -> MyTree.tre
Model parameters for loci -> modelparas.txt
Tags to species mapping (Imap) -> MyImap.txt
Several output files are produced. The file MySeq.txt
contains the simulated sequence alignments in Phylip/BPP format:
7 1000
a1^A ACATGCCACT ACACCGGACA TTAGAGCTAC GCGAGCTCTT TGWACTCGTT ACAACCTCAC AATGTAATAG TGTAATTACT TAGCTTAAGT AMT
CTTGATC TTTTCTATAR TTACATCGAA CGAATATCTG TACCTGAGCT TTTCACGATC CCTYCCTTCA TCTAGCAATC GTAAACTWTG TAAATTTCCT ACAGTCA
TAY TRATAACCCA TATCATGTAA CCAGACCCCC CTWAATTTMT CGAGAATGCC CMTCTTCATT CAATCCCAAA CAGTCAAAGC GTCCACAAAT TAATTGCACA
TAACCATCGT AATTGGAGAA ATTCCACCAT TCAATAGTTT ACTTGATTMA AACTAGTGGA TCCCATGTAA GCATAATCCC CCTACYCCAT ATGTATCTTA GAGT
MTCGTT TGRGACTTTM CATGCCCCAG CTGTTTTTAA ATGCCAGTAA ACARCGGCGA CTACTATGTC TTACGACAGA ACGAGGTTTC CACTGAGAGT RCAAAGGA
TC ACTCGTCTAT AGTTCAAAAW TCATTGCACG ATGTTCTCGA TCTACCTAAA TCTTGCCTGC TTTTTCMCCA TGCAATACAA CACCGTTAAA CAATATCTTA A
TATCAATCG TACAGTCGAT GCAGTATAGC CGATACTCCT TAACCTAACT YTTCAATATG ACMTAATACT CAACTGTTGC GRTACCAATT ATCCMTGCTG TACTG
ATAAC TTTATTGAAT GCCAATMCTA TWATGATCTA TTTCAGAYTT AATTCACAAT GAATTCAATA CGAATTTTCC TATCGTCAGS ACATCAATTA GATTCACAT
G TKGRAAGATC TATTTTGCAT CGACAAATGC ACCACCGCTC CCTCGCCTCT AAATCGCTCA CAATCGAATT AGAAAATATA ACACWTCCAT TGATCAASCT AA
AAGCGCAC AGGGAGCGAT TGGTTCGAAA ATGAGTTACA ACGTAAAGTT CTCAGCATAA TGGRGGCTAT ACACCCAAAA
a2^A ACATGCCACT ACACCGGACA TTAGAGCTAC GCGAGCTCTT TGTACTCGTT ACAACCTCAC AATGTAATAG TGTAATTACT TAGCTTAAGT AAT
CTTGATC TTTTCTATAA TTACATCGAA CGAATATCTG TACCTGAGCT TTTCACGATC CCTCCCTTCA TCTAGCAATC GTAAACTTTG TAAATTTCCT ACAGTCA
TAT TAATAACCCA TATCATGTAA CCAGACCCCC CTTAATTTCT CGAGAATGCC CATCTTCATT CAATCCCAAA CAGTCAAAGC GTCCACAAAT TAATTGCACA
TAACCATCGT AATTGGAGAA ATTCCACCAT TCAATAGTTT ACTTGATTCA AACTAGTGGA TCCCATGTAA GCATAATCCC CCTACCCCAT ATGTATCTTA GAGT
CTCGTT TGGGACTTTC CATGCCCCAG CTGTTTTTAA ATGCCAGTAA ACAGCGGCGA CTACTATGTC TTACGACAGA ACGAGGTTTC CACTGAGAGT GCAAAGGA
TC ACTCGTCTAT AGTTCAAAAT TCATTGCACG ATGTTCTCGA TCTACCTAAA TCTTGCCTGC TTTTTCCCCA TGCAATACAA CACCGTTAAA CAATATCTTA A
TATCAATCG TACAGTCGAT GCAGTATAGC CGATACTCCT TAACCTAACT CTTCAATATG ACATAATACT CAACTGTTGC GATACCAATT ATCCATGCTG TACTG
ATAAC TTTATTGAAT GCCAATCCTA TTATGATCTA TTTCAGATYT AATTCACAAT GAATTCAATA CGAATTTTCC TATCGTCAGC ACATCAATTA GATTCACAT
G TGGGAAGATC TATTTTGCAT CGACAAATGC ACCACCGCTC CCTCGCCTCT AAATCGCTCA CAATCGAATT AGAAAATATA ACACTTCCAY TGATCAAGCT AA
AAGCGCAC AGGGAGCGAT TGGTTCGAAA ATGAGTTACA ACGTAAAGTT CTCAGCATAA TGGGGGCTAT ACACCCAAAA
...
The file MyTree.tre contains the simulated gene tree for each locus in Newick format:
((((((a1a^A:0.000217,(a2b^A:0.000182,a3a^A:0.000182):0.000035):0.000362,a2a^A:0.000579):0.000065,a3b^A:0.000644):0
.011379,c1^C:0.012023):0.001408,(a1b^A:0.010443,(((b1a^B:0.001294,b2b^B:0.001294):0.000858,b1b^B:0.002152):0.00144
4,b2a^B:0.003596):0.006847):0.002989):0.010039,d1^D:0.023471):0.000000; [TH=0.023471, TL=0.091502]
(((((a1a^A:0.000584,((a2b^A:0.000357,a3b^A:0.000357):0.000057,a3a^A:0.000414):0.000170):0.001213,a1b^A:0.001797):0
.000482,a2a^A:0.002279):0.008361,(b1a^B:0.001291,(b1b^B:0.000558,(b2a^B:0.000345,b2b^B:0.000345):0.000213):0.00073
2):0.009350):0.010901,(d1^D:0.014926,c1^C:0.014926):0.006615):0.000000; [TH=0.021541, TL=0.076273]
((((a1a^A:0.001025,((a1b^A:0.000262,a2a^A:0.000262):0.000003,a2b^A:0.000265):0.000760):0.003850,(a3a^A:0.003915,a3
b^A:0.003915):0.000961):0.005929,((b1a^B:0.000958,(b2a^B:0.000139,b2b^B:0.000139):0.000819):0.000315,b1b^B:0.00127
3):0.009531):0.005512,(d1^D:0.015947,c1^C:0.015947):0.000369):0.000000; [TH=0.016316, TL=0.072093]
...
The file MyImap.txt
contains the Imap data. These files can be used as input to inference programs such as
BPP along with a suitable control file that matches the simulated data
.
Simulation with migration under the MSC-M model
Here we consider a simple example of a simulation under the MSC-M model with continuous migration.
The control file
The example control file for simulating sequence alignments and gene trees has the following content:
seed = -1
treefile = mytree.tre
Imapfile = myimap.txt
seqfile = mydata.txt
*concatfile = concat.txt
modelparafile = modelparas.txt
#phase = 0 0 0
species&tree = 3
A B C
4 4 4
((A #0.015, B #0.025) S #0.015 :0.01, C #0.025) :0.02 #0.025;
loci&length = 2000 500
model = 0
* JC
migration = 2
S C 0.1
A C 0.2 5
There are two major differences from the previous (MSC) example. First, in the species tree the ancestral population to clade (A, B) is given a label S:
species&tree = 3
A B C
4 4 4
((A #0.015, B #0.025) S #0.015 :0.01, C #0.025) :0.02 #0.025;
A label is needed if migration is going to be specified for the ancestral population.
Second, there is a new migration
block:
migration = 2
S C 0.1
A C 0.2 5
Here, two migration connections are specified with migration = 2
. The first migration connection is from source population
S (ancestor of AB) to target population C and migration occurs at the rate \(M_{SC} = 0.1\) migrants per generation. The second
migration connection is from source population A to target population C. In this case,
the migration rate \(M_{AC}\) varies among loci with the mean to be 0.2 and the shape parameter to be 5.
The migration rate from populations \(i\) to \(j\) is defined as \(M_{ij} = N_j m_{ij}\) , where \(m_{ij}\) is the
proportion of individuals in population \(j\) that are migrants from population \(i\) in each generation. In the example above,
\(M_{SC} = 0.1\), which means that on average 0.1 individuals are immigrants from population S
to population C. We use forward-time for our definition of rates as the rates are then more biologically meaningful.
This control file format is used in BPP version 4.4 onward, and is different from the old MCcoal format, in which the labels and
order for populations are fixed by the program. The control file included in the release ( MCcoal.im-3s-saturated.ctl
) simulates data on a
species tree for three species, with eight migration rates.
Simulation with tip-dating under the MSC
Here we consider a simple example of a simulation under the MSC with tip-dating.
The control file
The example control file for simulating sequence alignments and gene trees has the following content:
seed = 1
seqfile = simulate.txt
treefile=simulate_trees.txt
Imapfile = simple.Imap.txt
datefile = dates.txt
seqDates = seqDates.txt
# fixed number of species/populations
*speciesdelimitation = 0
# fixed species tree
species&tree = 3 A B C
4 4 2
(A #0.001, B #0.001):.007 #0.001, (C #0.001, D #0.001):.004 #.001);
phase = 0 0 0
loci&length = 100 1000
clock = 1
locusrate =0
model = 0
The only difference in the control file when using tip-dating is the inclusion of the datefile
and seqDates
options. datefile
specifies the name/path to the file containing the sequence ages
The ages are in units of expected number of substitutions, which is not the same as the inference program.
The dates are assigned to populations.
The number of sample dates must match the number of samples from each population.
For example, in the below date file, there are 4 samples from population A with sample ages 0.00005 and 0.00006 substitutions per site.
A 0.00006
A 0.00005
A 0.00005
A 0.00006
B 0.00007
B 0
B 0
B 0
C 0
C 0
The seqDates
option specifies the name/path of the output file for the date file with individual names with sample dates.
The bpp
simulator automatically names the sequences based
on the population names. Since the sequences are not named in the input files for the simulator,
the simulator must generate a file to match the dates to the sequences. Below is the seqDates
file that corresponds to the datefile
shown above.
A^a1 0.000050
A^a2 0.000050
A^a3 0.000060
A^a4 0.000060
B^b1 0.000000
B^b2 0.000000
B^b3 0.000000
B^b4 0.000070
C^c1 0.000000
C^c2 0.000000
The seqDates
and datefile
options must be used together in the simulator. A strict clock model
(the default, which is equivalent to clock = 1
) is required for simulating with tip dates. The
simulator uses dates in units of expected substitutions, but the inference program uses dates in
calendar time such as years or thousand years. The dates in the seqDates
file can be converted to
a unit of years by dividing by the per year substition rate. This is required to use the simulator
to generate data to use in the inference program. For example, if we assume a substitution rate of
\(10^{−8}\) per year and use the same example as above, the datefile
to use for inference would be
A^a1 5000
A^a2 5000
A^a3 6000
A^a4 6000
B^b1 0
B^b2 0
B^b3 0
B^b4 7000
C^c1 0
C^c2 0
BPP does not generate a file with the dates in years or similar units. The user must prepare this file after running the simulator.
Advanced Features of BPP
Threading and Checkpointing
Threads
BPP4 can use Single-Instruction-Multiple-Data (SIMD) instruction sets (also called vector instructions) available on modern processors. It automatically detects and uses the best instruction set available so you don't have to do anything about this. However, you can manually specify and force an instruction set in the control file using the arch tag. For instance, arch=SSE forces BPP to use the SSE instruction set even if AVX (which has twice the register size of SSE and thus in theory yields twice the speedup) is present on the system. To completely disable vector instruction, one can use the arch=CPU option. The available values for the arch tag are 'CPU', 'SSE', 'AVX', 'AVX2' and 'NEON'.
Modern computer architecture using Non-Uniform momery access (NUMA).
BPP4 can use multiple cores/threads. The control variable threads has the following syntax.
Threads = 8 19 1
This means BPP will use 8 threads, starting from core/thread 19, with increment 1; in other words hardware threads 19-36 will be used. The software threads are pinned onto the hardware threads and they not allowed to migrate. We found that this helps with performance.
Modern clusters based on microprocessors use a common shared-memory configuration called NUMA (for non-uniform memory access). The cluster typically consists of two or four microprocessors interconnected on a local bus to a shared memory on a single motherboard.
You can use commands like lscpu
to see the machine architecture. For
example, on our server lscpu gives the following output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 4
NUMA node(s): 4
...
NUMA node0 CPU(s): 0-17,72-89
NUMA node1 CPU(s): 18-35,90-107
NUMA node2 CPU(s): 36-53,108-125
NUMA node3 CPU(s): 54-71,126-143
This means that hardware threads 1-18 are on CPU0, while threads 73-90 are the hyperthreads on CPU0. The specification Threads = 8 19 1 thus uses 8 threads on CPU1. Our recommendation is that you do some tests yourself using threads = 2 or 4 or 8, and use options that work well for your servers. Adjust the step lengths first. Then use burnin=0 and a small nsample so that the running time is about 1-2 minutes. Change threads and record running time.
Make sure all threads for the same BPP job are on the same CPU. Use lscpu to see the machine configuration and htop to examine the load on the hardware threads.
Checkpointing
This option instructs BPP to create a checkpoint file
(save the progress) after a specified number of MCMC iterations. The
MCMC loop in BPP consists of
N = burnin + sampfreq*nsample
iterations, numbered as \(1,2,\cdots,N\).
There are two formats, where X, Y are whole numbers:
checkpoint = X
checkpoint = X Y
In the first format, a single checkpoint is created after X MCMC iterations (including burnin).
In the second format, a checkpoint is created after X steps, and then additional checkpoints are created every Y steps. For example, if X = 10000 and Y = 10000, the first checkpoint is created at MCMC iteration 10000, and further checkpoints at iterations 20000, 30000, etc.
The checkpoint files are named JOBNAME.Z.chk
, where JOBNAME
is the label
specified for the jobname
option, and Z
is the number of the checkpoint
file, starting from 1 and incrementing with every new checkpoint file.
To resume from a checkpoint file, use the resume switch, i.e.
bpp --resume checkpoint-file.chk
Marginal Likelihood Calculations
The C program BFdriver generates control files and job subscription scripts for running MCMC (using BPP or MCMCtree) to calculate the marginal likelihood (or the Bayes factor), as described in Rannala and Yang 2017.
The program takes a control file you provide (such as bpp.ctl) and generates \(K = 16\) control files with different beta values, which are used to run bpp to sample from the different power posterior distributions. The program also generates job submission scripts and submit the jobs using qsub. All generated control files and output files are in the same current directory. The frogs dataset in the bpp release ((Yang 2015) is used as the example.
You need the following: a linux system with SUN grid engine managing job
submission (including commonds such as qsub
, qstat
, qdel
, etc.),
and a C compiler. If you don't have this job submission system, you can
use BFdriver to generate the control files and run the MCMC jobs from
the command line.
Compiling and running BFdriver
cc -o BFdriver -O3 BFdriver.c tools.c -lm
BFdriver <controlfilename> <npoints> <scriptname.sh>
BFdriver A00.ctl 16 tmp.sh
You may need to edit the following two lines inside BFdriver.c, and if
you do, remember to compile the program after editing. Here bpp is
assumed to be on your search path. You can use a full path for the
executable prorgam, such as /home/gooduser/bin/bpp3.3
. Also the second
line is for submitting the jobs using qsub
. Here the limits are set to
4G of RAM and 360 hours of running time. Check those values if necessary
(and recompile).
fprintf(fcommand, " echo \"bpp %s.b$I.ctl > log.b$I.txt\" > %s\n", ctlf, scriptf);
fprintf(fcommand, " qsub -S /bin/bash -l h_vmem=4G -l tmem=4G -l h_rt=360:0:0 -cwd %s\n", scriptf);
Running the program
Create a folder inside frogs/
, say bf1
:
mkdir frogs/bf1
cd frogs/bf1
Prepare a control file (A00.ctl, say) for the A00 analysis in the folder. Check that it works. This should have a fixed species tree, which is ((K, C), (L, H)). Species delimitation and species tree estimation should be turned off. The control file specifies the priors for theta, tau, and also specifies burnin, nsample, sampfreq etc. Run bpp to confirm that the control file works. Then run BDdriver as follows:
BFdriver A00.ctl 16 tree1.sh
Here A00.ctl is the control file we have prepared. \(K = 16\) is the
number of points in the Gauss-Legendre qradrature algorithm for
numerical integration. You can use 8 for testing, and 16 or 32 for real
calculation. tree1.sh
is the temporary script file for job submission
using qsub
. The BFdriver command does a few things. First it reads the
control file specified (A00.ctl
) and creates 16 control files with
names like A00.b01.ctl
, ..., A00.b16.ctl
. Each of those control
files has the same content as A00.ctl
except that one extra line is
inserted at the beginning, like the following
BayesFactorBeta = 0.122298 * w=0.124629.ctl
This specifies the beta value when the control file is used to run BPP.
Second BFdriver
creates a file named betaweights.txt
, which lists
the beta values and Gauss-Legendre weights. I have copied those values
into an excel file in frogs/BFdriver.frogs.xls
.
Third BFdriver
creates a file named commands, which has the bash shell
scripts for submitting the 16 jobs using qsub
. You use the following
to submit the jobs.
source commands
You can look at the content of tree1.sh
to see the script for the last
job:
more tree1.sh
which should have the content like the following:
bpp --cfile bpp.b16.ctl > log.b16.txt
You can use qstat
to check the status of the 16 jobs you have
submitted. When the jobs are running, they generate output files in the
current folder, such as mcmc.b01.txt
, out.b01.txt
, and log.b01.txt
(which logs the screen output). After all jobs are finished, you can use
grep to extract the line with BFbeta
from screen log (this command is
at the bottom of the file commands).
grep BFbeta log.b*.txt
Then copy the ElnfX values into the excel file, and estimate the
logarithm of the marginal likelihood by summing (weights * ElnfX / 2
)
over the 16 points. This gives the log marginal likelihood to be
\(-3185.93\).
Exercise and results
Duplicate the calculation for the alternative species tree (tree2) in
the folder /bf2
. Change the species tree topology to (((L, H), C), K),
and use tree2.sh as the temporary job script file. Everything else
should be the same as described above.
My runs gave the log marginal likelihood for tree 2 to be \(-3186.14\). The ratio of the posterior probabilities for the two trees is then estimated to be
\(\frac{P_1}{P_2} = \mathrm{e}^{-3185.93 - (-3186.14)} = \mathrm{e}^{0.21} = 1.24\).
With BPP4.0 and the inverse-gamma priors on \(\theta\) and \(\tau\)s, the MCMC runs (A01) gave the posterior probabilities for trees 1 and 2 as 0.167 and 0.137, with the ratio 1.22. The MCMC runs and the marginal likelihood calculations seem to agree with each other. (Note that with BPP3 and the gamma priors on \(\theta\) and \(\tau\)s, the posteriors are 0.16 and 0.13 with the ratio 1.2 (Yang 2015, fig.4).
Common errors and problems
Check the bpp control file by running bpp at the command line before
submitting the jobs. Make sure that the bpp program is on your path or
use a full path. You may have to edit the source file BFdriver.c
and
recompile.