ABC (Approximate Bayesian Computation) can be used to assign observed population genetic patterns to one of several possible models. My questions revolve around whether there was gene flow (introgression) between species, or whether speciation is better described by a strict isolation model without any gene flow after separation.
Recently, random forest machine learning has been added to the tools used for ABC:
It is fast and easy to use, and can deal with lots of “noise” summary statistics without problems. However, I noticed that it offers one additional advantage:
If the models are hard to distinguish for some parameter combinations, rejection-ABC will sometimes make the wrong choice with high confidence, and the probability that it makes this mistake depends on the prior range! That's terrible: I do not want my result to depend on the width of the prior, I want it to honestly reflect whether or not there was sufficient power to make a choice! And this is what abcrf random forest model choice is capable of!
Here is an example, doing model choice for “gene flow” vs. “strict isolation” on 4 pseudo-observed datasets simulated under a model of “strict isolation”, with rejection-ABC and then with abcrf:
Traditional ABC relies on the rejection step, so the posterior probabilities for competing models are simply the weighted frequencies with which they produce summary statistic patterns similar to the observed data (although various regression methods can be used to adjust these posteriors). However, if overall divergence is low in the observed data, this will of course attract a higher posterior probability for a model involving gene flow than for a model of strict isolation, and this posterior probability will increase with the width of the prior for split time. In other words, the gene flow model will receive higher posterior probability the wider the split-time prior:
Above, we see that rejection-ABC has no problem finding the truth if divergence time is intermediate. But for very low divergence time (pseudo-observed dataset po_tsplit_0.01), it tends to choose the “gene flow” model over the (true) isolation model, and more so the wider the split-time prior! Rejection-ABC also has trouble with the largest split time, but in this case it at least remains undecided rather than wrong.
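To see where this prior dependence can come from, here is a minimal Python sketch of rejection-ABC model choice on a toy problem. It is not the coalescent simulation or the summary statistics behind the results above: the single "divergence" summary statistic, the hypothetical simulate_divergence function, the tolerance, and the way gene flow shrinks divergence (split time multiplied by a uniform random factor) are all assumptions made purely for illustration. It also uses plain, unweighted acceptance frequencies and equal model priors.

```python
import random


def simulate_divergence(model, t_max, rng):
    """Toy simulator (an assumption, not a coalescent model).

    Split time is drawn from a Uniform(0, t_max) prior. Under strict
    isolation the summary statistic equals the split time; under gene
    flow it is reduced by a random factor between 0 and 1.
    """
    t_split = rng.uniform(0.0, t_max)
    if model == "isolation":
        return t_split
    return t_split * rng.random()


def rejection_model_posterior(observed, t_max, n_sims=100_000, tol=0.005, seed=1):
    """Posterior model probabilities from plain rejection-ABC.

    The same number of simulations is run per model (equal model priors).
    A simulation is accepted if its summary statistic lies within `tol`
    of the observed value; the posterior of a model is its share of all
    accepted simulations.
    """
    rng = random.Random(seed)
    accepted = {"isolation": 0, "gene_flow": 0}
    for model in accepted:
        for _ in range(n_sims):
            if abs(simulate_divergence(model, t_max, rng) - observed) <= tol:
                accepted[model] += 1
    total = sum(accepted.values())
    return {model: count / total for model, count in accepted.items()}


# Same low-divergence "observation", two different split-time prior widths.
narrow = rejection_model_posterior(observed=0.01, t_max=0.05)
wide = rejection_model_posterior(observed=0.01, t_max=5.0)
print("P(gene flow), narrow prior:", round(narrow["gene_flow"], 3))
print("P(gene flow), wide prior:  ", round(wide["gene_flow"], 3))
```

In this toy setup, the posterior probability of the gene flow model for a low-divergence observation rises as t_max grows, because with a wider split-time prior the isolation simulations become rare near zero divergence faster than the gene flow simulations do.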
Posterior probabilities from abcrf, in contrast, did not depend on the width of this prior: they always remained undecided (0.25 < p < 0.75) for the po_tsplit_0.01 dataset, rather than falsely assigning it to the “gene flow” model. abcrf also had no problem at all with the po_tsplit_2.0 dataset: