SVDQuartets Tutorial for Bioinformatics Bootcamp 2020

In this tutorial, we will explore species tree inference using SVDQuartets and related methods. The theory behind SVDQuartets was developed in collaboration with Julia Chifman. The initial methodology for estimating speciation times was developed in collaboration with Jing Peng. Dave Swofford has implemented both methods in PAUP*, and in doing so, has contributed immensely to their continued development in all aspects. Much of the material in this tutorial was developed by Dave in some form.

Throughout the tutorial, I will provide instructions for using both the GUI version of PAUP* in green font color and the command line version in orange font color. In my own work, I often use the GUI version to test out an analysis before running the full analysis using a command line version on a high performance computing cluster (though you'll see that for many data sets, SVDQ is so quick that this won't be needed).

All of the files referenced in this tutorial are available at https://phylosolutions.com/tutorials/svdq-qage/tutorial-data.zip. Please download these files now if you haven't already.


Example data: Gibbons

In today's tutorial, we'll explore a data set consisting of five species of gibbons (Carbone et al. (2014); Shi and Yang (2018)). We'll use a data set for coding regions, which consists of 11,323 genes for a total of 2,264,600 sites. The number of individuals sampled per species is as follows:

Note that each individual contributes two sequences to the data set. Thus, the overall data set consists of 17 sequences.

The data can be found in two files. The file CodingSequences.nex contains the actual DNA sequence data. The file gibbons.nex contains information about the taxa and sequences. We'll discuss these files as a group.

Please note that although the input format involves concatenating loci, SVDQuartets is NOT a concatenation method! In fact, SVDQuartets assumes that each site in the alignment has its own underlying gene tree, generated under the coalescent model from the species tree. We will see here that SVDQuartets also works well for multilocus when the amount of data is sufficient. But please do not confuse the data format with the underlying model!

The first step is to get the data into PAUP* and specify the outgroup. Below are separate instructions for the GUI and command line versions.

GUI version:
  1. Double-click on the file gibbons.nex. This will launch PAUP*.
  2. In the File menu, select "Execute gibbons.nex".
  3. In the Data menu, select "Define Outgroup ...". Use the dialog box to move
    sequence 13 (^hg19) to the outgroup.
Command line version:
  1. From your unix prompt, launch PAUP*. The command will be something like: ./paup4a168_osx
  2. Execute the file gibbons.nex by typing the following at the PAUP* prompt: exe gibbons.nex
    Note that you may need to include the full pathname to the file.
  3. Move the human sequence to the outgroup by typing: outgroup 13


Part 1: Estimating a species tree topology

We are now ready for our first run of SVDQuartets. In our first run, we will obtain an estimate of the species tree topology. The commands for both interfaces are below. You can find information about all of the options associated with SVDQuartets in PAUP* by typing help svdq; at either the command line or in the command line box at the bottom of the PAUP* GUI window.

GUI version:
  1. From the Analysis menu, select "SVDQuartets ...".
  2. At the bottom of the dialog box that appears, check the box that
    corresponds to the option "Assign tips to species ...".
  3. Click "OK".
Command line version:
  1. Run SVDQ using the command: svdq taxpartition = gibbonspecies;

Note that this analysis evaluates all possible quartets, and outputs the species tree inferred by SVDQuartets. Just above the species tree, some information regarding the analysis is given, such as the proportion of inferred quartets that are compatible with the species tree. This analysis should have run quickly.

Let's get some intuition about what SVDQ is doing. We'll repeat the analysis we just did, but this time we'll take advantage of the option to output quartet scores:

GUI version:
  1. From the Analysis menu, select "SVDQuartets ...".
  2. At the bottom of the dialog box that appears, check the box that
    corresponds to the option "Assign tips to species ...".
  3. On the right, click the box that corresponds to the option
    "Show quartet scores".
  4. Click "OK".
Command line version:
  1. svdq taxpartition=gibbonspecies showScores=yes;

What you see in the output are the scores for all three possible quartets for each of the 752 possible sets of four taxa for these data when one taxon is selected from each species. Recall from the lecture that the smaller the score, the more supported that quartet is, compared to the other two possibilities. Scan through the sets of the quartet scores, and notice how some sets of four taxa provide stronger evidence for a particular quartet relationship than others.

Next, we'll run a bootstrap analysis. The command for running the analysis with 100 bootstrap replicates and exhaustive quartet sampling is given below. You can change the random number seed to whatever you like. The analysis should run quickly, but if you are running on your own computer and it is slow, you may wish to reduce the number of bootstrap replicates and/or the number of quartets evaluated on each replicate for the sake of this exercise.

GUI version:
  1. From the Analysis menu, select "SVDQuartets ...".
  2. At the bottom of the dialog box that appears, check the box that
    corresponds to the option "Assign tips to species ...".
  3. On the right, remove the check from the box that corresponds to the option
    "Show quartet scores".
  4. Select the option to "Perform bootstrapping ...".
  5. Select the option to "Write trees from each bootstrap replicate to treefile"
    and pick an appropriate file name.
  6. Click "OK".
Command line version:
  1. svdq taxpartition=gibbonspecies showScores=no seed=1234568 bootstrap
    nreps=100 treeFile=mybootstraptrees.tre;

On the GUI versions of PAUP*, you can easily view the trees estimated by SVDQuartets, using the commands below. You can also save your tree to a file for graphing with other programs, e.g., R, FigTree, etc. Note that the SVDQuartets estimates an unrooted tree! You may wish to root the tree before plotting it (here, we'll use the outgroup to root the tree).

GUI version:
  1. To root the tree, select "Root Trees ..." from the Trees menu, and click "Root".
  2. To view the trees within PAUP*, select "Print/View SVDQuartets Trees ..." from the "Trees" menu.
    Note that both the SVDQuartets tree and the bootstrap tree can be viewed; you can switch
    between trees using the options in the top left of the window.
  3. To save the tree to a file, select "Save Trees to File ..." from the "Trees" menu.
Command line version:
  1. To root the tree, use the command: rootTrees rootMethod=outgroup;
  2. To save the tree to a file, use the command:
    savetrees file=MyTree.tre;


Part 2: Estimating speciation times

PAUP* can provide estimates of the times of speciation events along a given species tree topology. Note that this method is distinct from SVDQuartets -- it can be applied to any species tree of interest. In this portion of the tutorial, we will obtain estimates of the speciation times for the gibbon data we analyzed in Part 1. There is not yet a set of menu options for the estimation of speciation times, so we will provide command-line instructions for both interfaces in this portion of the tutorial. In the GUI version of PAUP*, these commands can be typed in the white box that runs along the bottom of the window.

The first step is to provide PAUP* with the species tree for which you would like to estimate the speciation times. Since we just estimated a tree with SVDQuartets, we can use that species tree. However, so that you will see how to analyze any tree you might be interested in, the command to specify the species tree is given below. Note that the data set must already have been loaded into PAUP*, as well as the taxon partition and a locus partition if you are using multilocus data.

The species tree can be specified with the following command: 
speciestree taxpartition=gibbonspecies (H._sapiens,(((H._pileatus,H._moloch),N._leucogenys),
(S._syndactylus,H._leuconedys)));
A basic command to estimate the speciation times is given below. To view the full set of options available in the qAge command, type: help qage; If you'd like to see a short summary of available options, you can type: qage ?;
set outwidth=140;
qage patProb=exactJC taxpartition=gibbonspecies loci=lociset bootstrap=multilocus treefile=MyTreeBL.tre;

Note that in the above command both the taxon partition and a locus partition are defined. The locus partition is used to carry out a bootstrap procedure that uses locus-level information and is invoked with "bootstrap=multilocus". The treefile command specifies the name of the file to which the tree with branch lengths should be saved. The "patProb=exactJC" tells PAUP* to use the formulas from Chifman and Kubatko (2015) to carry out calculations. Since these calculations assume the JC69 model, this choice will be most appropriate when this model is a reasonable fit to the data. Below we will discuss how to use PAUP* to estimate speciation times when this is not the case, but first let's look at the output from the command above.

You should see a picture of the tree with the nodes numbered, and below that a table that gives branch lengths and node ages, both with corresponding 95% confidence intervals (these confidence intervals are obtained by bootstrapping). The default is to return the estimates in coalescent units. If substitution units are preferred, this can be requested by adding "outUnits=substitutions" or "outUnits=both" to the above command. Above the tree and the estimated lengths, you will see other information about the estimation procedure. Of particular interest are the estimated theta (together with a 95% confidence interval; recall that qAge assumes a single theta for the entire tree) and the composite MAP score. This score is the negative of the log posterior density and can be compared across trees provided that the same prior distributions have been specified. Smaller values indicate a higher composite likelihood.

qAge also includes an option to carry out estimation under more complex substitution models. This option is invoked by changing "patProb=exactJC" to "patProb=expBL". Prior to doing this, however, you must define a model using the "lset" command and provide the name of that model to the qAge command. Below is sample code. In the first line, we specify a GTR model, with parameters in the model to be estimated. The second line asks PAUP* to estimate those parameters using our data and the current tree in memory. In the third line, we set the model with the estimated values of the parameters, and we name it "MyGTR". The final line uses qAge to carry out estimation under that model. Note that this analysis is more computationally intensive than the previous one that assumes the JC69 model. If you plan to try this analysis now, you may want to reduce the number of bootstrap replicates.

lset nst=6 rmatrix=estimate baseFreq=empirical;
lscores;
lset nst=6 rmatrix=(1.01 3.49 0.54 1.36 3.20 1.0) baseFreq=(0.2533 0.2467 0.2454 0.2547) modelName=MyGTR;
qage patProb=expBL taxpartition=gibbonspecies loci=lociset bootstrap=multilocus
treefile=test.tre modelName=MyGTR poolJC=no nreps=100;
Comparing trees:
If you look at the Shi and Yang (2018) paper (reference below), you will note that BPP estimates a different tree than that estimated by SVDQuartets and used for branch length estimation in this portion of the tutorial. We can evaluate the tree found by BPP, which differs in the placement of N. leucogenys, using the commands below.
speciestree taxpartition=gibbonspecies (H._sapiens,((H._pileatus,H._moloch),
(N._leucogenys,(S._syndactylus,H._leuconedys))));
qage patProb=exactJC taxpartition=gibbonspecies loci=lociset bootstrap=multilocus;

Note that branch length estimates are very similar, and in particular, the branch connecting N. leucogenys to its ancestral node is very short, indicating the possibility of extensive incomplete lineage sorting. Comparing composite MAP scores, we see that the score for the tree estimated by SVDQuartets is lower than that found by BPP once branch lengths are optimized for both, indicating more support for the tree found by SVDQuartets.



Part 3: Simulation in PAUP*

In this exercise, we will explore the performance of SVDQuartets using a simulation approach in PAUP*. We'll consider three factors: the number of loci, the length of each locus, and the scaling factor for the branch lengths. We'll consider a 6-taxon model tree, as given in the nexus file sim-6tax.nex. When you look at the file, you will notice that the character @ appears in the two places corresponding to the two values we'll change. You will need to edit your file as follows:

We will consider a varying number of loci (200, 500, 1000, 3000, 5000), and we will consider three values for the branch length scaling factor: 0.1, 1.0, and 10.0. You can run all of the different locus values in one run, as shown below. The file below is for 200 sites per locus (since I'm an "L") with the branch length scaling factor set to 1.0:
#NEXUS

begin taxa;
    dimensions ntax=6;
    taxlabels A B C D E F;
end;

begin trees;
    tree 1 = [&R] (A:5.0,((B:1.0,C:1.0):3.0,(D:3.5,(E:3.0,F:3.0):0.5):0.5):1.0);

end;

begin paup;
    showtrees/userbrlens;
end;

begin dnasim;
    simdata multilocus=y nloci=(200 500 1000 3000 5000) nsitesperlocus=200;
    truetree source=memory treenum=1  units=2Ngen
             scalebrlen=1.0
             mscoal=y Ne=100000 mu=1e-8
             showtruetree=brlens showgenetrees=n storetruetrees=n
             seed=1;
    lset model=jc nst=1 basefreq=eq;    [= Jukes-Cantor model]
    beginsim nreps=100 seed=0;
        svdq nthreads=2 taxpartition=none bootstrap=no;
        tally 'svdq';
    endsim;
end;

When you are done editing the file, you can run the analysis as follows:

GUI version:
  1. Select "Execute sim-6tax.nex" from the "File" menu.
Command line version:
  1. At the PAUP* prompt, type: exe sim-6tax.nex

You will each do three separate runs, one with each choice of branch length scaling factor (0.1, 1.0, 10.0). When each run finishes, enter its results in this Google spreadsheet: https://docs.google.com/spreadsheets/d/18KRKinji0x3SMw3rzgOiMGkEKZJ65crxL1d5-wckpsk/edit?usp=sharing . The graphs will dynamically update (hopefully!), so you can watch the progress made by the class while you wait for your runs to finish (you may want to copy your results somewhere and get the next run started before you enter the data into the spreadsheet). We will come together as a group to discuss the results if there is time.



References

Carbone, L., Harris, R. A., Gnerre, S., Veeramah, K. R., Lorente-Galdos, B., Huddleston, J., Meyer, T. J., Herrero, J., Roos, C., and B. Aken et al (2014). Gibbon genome and the fast karyotype evolution of small apes. Nature, 513, 195–201.

Chifman J., Kubatko L.S. 2014. Quartet inference from SNP data under the coalescent model. 30:3317–3324.

Chifman J., Kubatko L.S. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol. 374:35–47.

Peng, J., D. Swofford, and L. Kubatko. 2020. Estimation of speciation times under the multispecies coalescent, available on bioRxiv.

Shi, C.-M. and Yang, Z. (2018). Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Molecular Biology and Evolution, 35, 159–179.