Welcome to ngesh’ documentation!

PyPI CI Codacy Badge Documentation Status

ngesh is a Python library and command-line tool for simulating phylogenetic trees and related data (characters, states, branch length, etc.). It is intended for benchmarking phylogenetic methods, especially in historical linguistics and stemmatology. The generation of stochastic phylogenetic trees also goes by the name of “simulation methods for phylogenetic trees”, “synthetic data”, or just “phylogenetic tree simulation”.

ngesh

Manual

ngesh is a Python library and command-line tool for simulating phylogenetic trees and related data (characters, states, branch length, etc.). It is intended for benchmarking phylogenetic methods, especially in historical linguistics and stemmatology. The generation of stochastic phylogenetic trees also goes by the name “simulation methods for phylogenetic trees”, “synthetic data generation”, or just “phylogenetic tree simulation”.

ngesh

Among the highlights of the package, with ngesh:

  • any hashable element can be provided as a seed for the pseudo-random number generators, guaranteeing that the synthetic trees are reproducible (including across different systems)
  • trees can be generated according to user-specified parameters such as birth and death ratios (and the death ratio can be set to zero, resulting in a birth-only tree)
  • trees will have random topologies and, if desired, random branch-lengths
  • trees can be constrained in terms of number of extant leaves, evolution time (as related to the birth and death parameters), or both
  • non-extant leaves can be pruned from birth-death trees
  • speciation events default to two descendants, but the number of descendants can be randomly drawn from a user-defined Poisson process (allowing to model hard politomies)
  • character evolution can be simulated in relation to branch lengths, with user-specified ratios for mutation and for horizontal gene transfer, with different rates of change for each character
  • nodes can receive unique labels, either sequential ones (like “L01”, “L02”, and “L03”), random names easy to pronounce (like “Sume”, “Fekobir”, and “Tukok”), or random biological names approximating the binomial nomenclature standard (like “Sburas wioris”, “Zurbata ceglaces”, and “Spellis spusso”)
  • trees are normal ETE3 tree objects that can be exported in a variety of formats, such as Newick trees, ASCII representation, tabular textual listings, etc.

Installation

In any standard Python environment, ngesh can be installed with:

pip install ngesh

The pip installation will fetch the dependencies ete3 and numpy, if necessary. The built-in tree visualization tool from ete3 requires the PyQt5 library which is not installed by default, but which should be available in most systems. If necessary, it can be installed along with the package with:

pip install ngesh[gfx]

How to use

You can test your installation from the command line with the ngesh command, which will return a different random small birth-death tree in Newick format each time it is called:

$ ngesh
((Vovrera:0.149348,(Wigag:3.11592,(Pallo:2.68125,Zoei:1.85803)1:1.29704)1:0.204529)1:0.607805,(((Avi:0.347942,Uemi:0.0137646)1:1.41697,(((Kufo:0.817012,
(Gapurem:0.0203582,Hukub:0.0203582)1:0.796654)1:0.395727,Tablo:0.00846148)1:0.484705,(Kaza:0.140656,((Tozea:0.240634,Pebigmom:0.240634)1:1.13579,(Kata:0
.109977,((Fabom:0.04242,Upik:0.04242)1:0.549364,(Amue:0.182635,Lunida:0.182635)1:0.409149)1:0.366701)1:0.417941)1:0.162968)1:0.158051)1:1.47281)1:1.0326
,(Kunizob:0.650455,Madku:0.221172)1:1.22008)1:0.587783);


$ ngesh
((((Povi:0.325601,Udo:0.325601)1:0.0750448,Hiruta:0.400646)1:0.181454,(Voebi:0.0293506,Sodi:0.0293506)1:0.55275)1:0.258834,((Vandemif:0.0160558,(((Dubik
:0.0543122,Fuvu:0.0543122)1:0.36458,Hitfuv:0.418892)1:0.0388987,Pizuna:0.457791)1:0.0535386)1:0.179893,(Uo:0.67132,Zegna:0.163427)1:0.0199021)1:0.149711
);

The same command-line tool can use parameters provided in a textual configuration file. Here, we generate the Nexus data for a reproducible Yule tree (note the 123 seed) with a birth ratio of 0.666, at least 8 leaves with "human" labels, and 10 presence/absence characters:

$ cat ngesh_demo.conf
[Config]
labels=human
birth=0.666
death=0.0
output=nexus
min_leaves=8
num_chars=10

$ ngesh -c ngesh_demo.conf --seed 123
#NEXUS

begin data;
  dimensions ntax=16 nchar=38;
  format datatype=standard missing=? gap=-;
  matrix
Abel        10001001011000010000010010010000100000
Azogu       10001001011000010000010010010000100000
Bou         10001001100010100000010010010000000010
Dipu        10001001010001000010000110010000000001
Gezepsem    10001001100010100000010010010000000010
Gupote      10001001010010010000010010010000000100
Hefi        10100100010010010001000001010001000000
Lerzo       10001001010001000010000110010000000001
Magumel     10001001010010010000010010010000000010
Pao         01001010010100001000100010001000100000
Sanigo      10010100010010000100001000100010010000
Tuzizo      10001001100010100000010010010000000010
Wialum      10001001011000010000010010000100100000
Zudal       10001001010010010000010010010000100000
Zukar       10001001011000010000010010000100100000
Zusu        10010100010010000100001000100010001000
  ;
end;

All parameters provided in the configuration files can be overridden at the command-line.

A textual representation of the same tree (that is, of the random tree generated with the set of parameters and the same seed) can be obtained with the -o ascii flag:

$ ngesh -c ngesh_demo.conf --seed 123 -o ascii

         /-Zudal
        |
        |               /-Azogu
        |              |
        |            /-|      /-Wialum
        |           |  |   /-|
        |           |   \-|   \-Zukar
        |         /-|     |
        |        |  |      \-Abel
        |        |  |
      /-|        |  |   /-Dipu
     |  |        |   \-|
     |  |      /-|      \-Lerzo
     |  |     |  |
     |  |     |  |         /-Bou
     |  |     |  |      /-|
     |  |     |  |   /-|   \-Gezepsem
     |  |   /-|  |  |  |
   /-|  |  |  |   \-|   \-Tuzizo
  |  |  |  |  |     |
  |  |   \-|  |      \-Magumel
  |  |     |  |
  |  |     |   \-Pao
  |  |     |
--|  |      \-Gupote
  |  |
  |  |   /-Zusu
  |   \-|
  |      \-Sanigo
  |
   \-Hefi

The package is, however, designed to be used as a library. If you have PyQt5 installed, the following command will open the ETE Tree Viewer on the same random tree:

$ ngesh -c ngesh_demo.conf --seed 123 -o gfx
random tree

Likewise, the following code is useful for quick demonstration and will pop up the Viewer on a random tree each time it is called:

python3 -c "import ngesh ; ngesh.show_random_tree()"
random tree

The primary functions for generation are gen_tree() (doc), which returns a random tree topology, and add_characters() (doc), which simulates character evolution in a provided tree. As they are separate tasks, it is possible to just generate a random tree or to simulate character evolution in an user provided tree.

The code snippet below shows a basic tree generation, character evolution, and the output flow.

>>> import ngesh
>>> tree = ngesh.gen_tree(1.0, 0.5, max_time=3.0, labels="human")
>>> print(tree)

      /-Butobfa
   /-|
  |  |   /-Defomze
  |   \-|
  |      \-Gegme
--|
  |      /-Bo
  |   /-|
  |  |   \-Peoni
   \-|
     |   /-Riuzo
      \-|
         \-Hoale

>>> tree = ngesh.add_characters(tree, 10, 3.0, 1.0)
>>> print(ngesh.tree2nexus(tree))
#NEXUS

begin data;
  dimensions ntax=7 nchar=15;
  format datatype=standard missing=? gap=-;
  matrix
Hoale      100111101101110
Butobfa    101011101110101
Defomze    101011110110101
Riuzo      100111101101110
Peoni      110011101110110
Bo         110011101110110
Gegme      101011101110101
  ;
end;

Parameters for tree generation

The parameters for tree generation, as also given by the command ngesh -h, are:

  • birth: The tree birth rate (l)
  • death: The tree death rate (mu)
  • max_time: The stopping criterion for maximum evolution time
  • min_leaves: The stopping criterion for minimum number of leaves
  • labels: The model for textual generation of random labels (None, "enum" for a simple enumeration, "human" for randomly generated names, and "bio" for randomly generated specie names)
  • num_chars: The number of characters to be simulated
  • k_mut: The character mutation gamma k parameter
  • th_mut: The character mutation gamma th parameter
  • k_hgt: The character HGT gamma k parameter
  • th_hgt: The character HGT gamma th parameter
  • e: The character general mutation e parameter

How does ngesh work?

An event_rate is first computed from the sum of the birth and death rates. At each iteration, which takes place after a random expovariant time from the event_rate, the library selects one of the extant nodes for an “event”: either a birth or a death, drawn from the proportion of each rate. All other extant leaves have their distances updated with the event time.

The random labels follow the expected methods for random text generation from a set of patterns, taking care to generate names that should be easy to pronounce by most users.

For random character generation, it adds characters according to parameters of gamma distributions related to the length of each branch. The two possible events are mutation (assumed to be always to a new character, i.e., no parallel evolution) and horizontal gene transfer. No perturbation, such as the simulation of errors in sequencing/data collection, is performed during character generation. However, these can be simulated by the function for bad sampling simulation. Note that character generation only simulates states analogous to those of historical linguistics (cognate sets) and assumes character independence (that is, no block movement as common in stemmatology). While we might implement the latter in the future, there are currently no plans for simulating genetic data.

Bad sampling is simulated in an uniform distribution, i.e., all existing leaves have the same probability of being removed. Note that if a full simulation of tree topology and characters is performed, this task must be carried out after character evolution simulation, as otherwise characters would fit the sampled tree and not the original one. No method for data perturbation is available at the moment, but we have plans to implement them in the future.

Integrating with other software

Integration with other packages is facilitated by various export functions. For example, it is possible to generate random trees with characters for which we know all details on evolution and parameters, and generate Nexus files that can be fed to phylogenetic software such as MrBayes or BEAST2 to either check how they perform or how good is our generation in terms of real data.

Let’s simulate phylogenetic data for an analysis using BEAST2 through BEASTling. We start with a birth-death tree (lambda=0.9, mu=0.3), with at least 15 leaves, and 100 characters whose evolution is modelled with the default parameters and a string seed "uppsala" for reproducibility; the tree data is exported in "wordlist" format:

$ cat examples/example_ngesh.conf
[Config]
labels=human
birth=0.9
death=0.3
output=nexus
min_leaves=15
num_chars=100

$ ngesh -c examples/example_ngesh.conf --seed uppsala > examples/example.csv

$ head -n 20 examples/example.csv
Language_ID,Feature_ID,Value
Akup,feature_0,0
Buter,feature_0,0
Dufou,feature_0,0
Emot,feature_0,0
Kiu,feature_0,0
Kovala,feature_0,0
Lusei,feature_0,0
Oso,feature_0,0
Puota,feature_0,0
Relenin,feature_0,976
Sotok,feature_0,0
Tetosur,feature_0,0
Usimi,feature_0,976
Voe,feature_0,0
Vusodur,feature_0,0
Zeba,feature_0,0
Zufe,feature_0,0
Akup,feature_1,1
Buter,feature_1,1

We can now use a minimal BEASTling configuration and generate an XML model for BEAST2. Let’s assume we want to test how well our pipeline performs when assuming a Yule tree when the data actually includes extinct taxa. The results here presented are not expected to perfect, as we will use a short chain length to make it faster and a model which differs from the assumptions used for generation (besides the fact of the default parameters for horizontal gene transfer being too high for this simulation).

$ cat examples/example_beastling.conf
[admin]
basename=example

[MCMC]
chainlength=500000

[model example]
model=covarion
data=example.csv

$ beastling example_beastling.conf

$ beast example.xml

We can go ahead normally here: use BEAST2’s treeannotator (or similar software) to generate a summary tree, which we store in examples/summary.nex, and plot the results with figtree (or, again, similar software).

Let’s plot our summary tree and compare the results with the actual topology (which we can regenerate with the earlier seed).

summary tree
$ ngesh -c examples/example_ngesh.conf --seed uppsala --output newick > examples/example.nw
original tree

The results are not excellent given the limits we set for quick demonstration, but it still capture major information and sub-groupings (as clearer by the radial layout below) — manual data exploration show that at least some errors, including the group in the first split, are due to horizontal gene transfer. For an analysis of the inference performance, we would need to improve the parameters above and repeat the analysis on a range of random trees, including studying the log of character changes (including borrowings) involved in this random tree.

summary tree radial

We can compare trees with common methods of tree comparison, such as Robinson–Foulds metric. All packages and programming languages for this purpose should be able to read the trees exported in Newick or NEXUS format; however, as ngesh trees are actually ETE3 trees, we can do it directly from Python:

d = tree1.robinson_foulds(tree_2)

The files used and generated in this example can be found in the /examples directory.

What does “ngesh” mean?

Technically, “ngesh” is just an unique name, coming from one of the Sumerian words for “tree”, ĝeš. The name was chosen because the library was first planned as part of a larger system for simulating language evolution and benchmarking related tools, named Enki after the Sumerian god of (among many other things) language and “randomness”.

The intended pronunciation, as in the most accepted reconstructions, is /ŋeʃ/. But don’t stress over it, and feel free to call it /n̩.gɛʃ/, as most people have been doing.

Alternatives

There are many tools for simulating phylogenetic processes to obtain random phylogenetic trees. The most complete is probably the R package TreeSim by Tanja Stadler, which includes many flexible tree simulation functions. In R, one can also use the rtree() function from package ape and the birthdeath.tree() one from package geiger, as well as manually randomizing taxon placement in cladograms.

In Python, a snippet that works in a way similar to ngesh, and which served as initial inspiration, is provided by Marc-Rolland Noutahi on the blog post How to simulate a phylogenetic tree ? (part 1).

For simpler simulations, the .populate() method of the Tree class in ETE might be enough as well. Documentation on the method is available here. The toytree and dendropy packages also offer comparable functionality.

A number of on-line tools for simulating trees are available at the time of writing:

References

  • Bailey, Norman. T. J. (1964). The elements of stochastic processes with applications to the natural sciences. John Wiley & Sons.
  • Bouckaert, Remco; Vaughan, Timothy G.; Barido-Sottani, Joëlle; Duchêne, Sebastián; Fourment, Mathieu; Gavryushkina, Alexandra., et al. (2019). “BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis”. PLoS computational biology, 15(4), e1006650. DOI: 10.1371/journal.pcbi.1006650.
  • Foote, Mike; Hunter, John P.; Janis, Christine M.; and Sepkoski J. John Jr. (1999). “Evolutionary and preservational constraints on origins of biologic groups: Divergence times of eutherian mammals”. Science 283:1310–1314.
  • Harmon, Luke J. (2019). Phylogenetic Comparative Methods – learning from trees. Available at: https://lukejharmon.github.io/pcm/chapter10_birthdeath/. Access date: 2019-03-31.
  • Huerta-Cepas, Jaime; Serra, Francois; and Bork, Peer (2016). “ETE 3: Reconstruction, analysis and visualization of phylogenomic data.” Mol Biol Evol. DOI: 10.1093/molbev/msw046.
  • Maurits, Luke; Forkel, Robert; Kaiping, Gereon A.; Atkinson, Quentin D. (2017). “BEASTling: A software tool for linguistic phylogenetics using BEAST 2.” PLoS one 12(8), e0180908. DOI: 10.1371/journal.pone.0180908.
  • Noutahi, Marc-Rolland (2017). How to simulate a phylogenetic tree? (part 1). Available at: https://mrnoutahi.com/2017/12/05/How-to-simulate-a-tree/. Access date: 2019-03-31.
  • Robinson, D. R.; Foulds, L. R. (1981). “Comparison of phylogenetic trees”. Mathematical Biosciences 53 (1–2): 131–147. DOI: 10.1016/0025-5564(81)90043-2.
  • Stadler, Tanja (2011). “Simulating Trees with a Fixed Number of Extant Species”. Systematic Biology 60.5:676-684. DOI: 10.1093/sysbio/syr029.

The ngesh banner was designed by Tiago Tresoldi on basis of the vignette “Sherwood Forest” by J. Needham published in Needham, J. (1895) Studies of trees in pencil and in water colors. First series. London, Glasgow, Edinburgh: Blackie & Son. (under public domain and available on archive.org).

Community guidelines

While the author can be contacted directly for support, it is recommended that third parties use GitHub standard features, such as issues and pull requests, to contribute, report problems, or seek support.

Contributing guidelines, including a code of conduct, can be found in the CONTRIBUTING.md file.

Author and citation

The library is developed by Tiago Tresoldi (tiago.tresoldi@lingfil.uu.se). The library is developed in the context of the Cultural Evolution of Texts project, with funding from the Riksbankens Jubileumsfond (grant agreement ID: MXM19-1087:1).

During the first stages of development, the author received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. ERC Grant #715618, Computer-Assisted Language Comparison).

If you use ngesh, please cite it as:

Tresoldi, Tiago (2021). Ngesh, a tool for simulating random phylogenetic trees. Version 0.5. Uppsala: Uppsala universitet. Available at: https://github.com/tresoldi/ngesh

In BibTeX:

@misc{Tresoldi2021ngesh,
  author = {Tresoldi, Tiago},
  title = {Ngesh, a tool for simulating random phylogenetic trees. Version 0.5},
  howpublished = {\url{https://github.com/tresoldi/ngesh}},
  address = {Uppsala},
  publisher = {Uppsala universitet},
  year = {2021},
}

Beginners’ guide

The ngesh library is designed to generate simulated data in reproducible manner: the pseudo-random generator seeds are used not only in a way that guarantees reproducibility (that is, the same seed returning the same tree) but also in an incremental way that will return similar trees when different parameters are used to the same seed.

A pure random tree can be generated with a call to the ngesh.gen_tree() function. Either a minimum number of leaves (min_leaves) or maximum time (max_time) must be specified as a stopping criterion.

>>> import ngesh
>>> tree1 = ngesh.gen_tree(min_leaves=5)
>>> print(tree1)

      /-L1
   /-|
  |   \-L2
--|
  |   /-L3
   \-|
     |   /-L4
      \-|
         \-L5

>>> tree2 = ngesh.gen_tree(max_time=1.0)
>>> print(tree2)

      /-L1
   /-|
--|   \-L2
  |
   \-L3

The main parameters are the birth and death rates, which default to 1.0 and 0.5 respectively and are the first and second parameter. A Yule tree can be simulated by setting the death rate to 0.0 (i.e., ngesh.gen_tree(1.0, 0.0)).

For reproducibility, the main parameter is the seed, which takes any Python object that can be represented as a string. The equivalence of the generated tree can be verified by generating trees from the same seed with different labels:

>>> tree3 = ngesh.gen_tree(1.0, 0.33, min_leaves=5, labels="human", seed="ngesh")
>>> print(tree3)

      /-Fupsikmu
   /-|
  |   \-Enafu
--|
  |   /-Zuhu
   \-|
     |   /-Poizes
      \-|
         \-Buge

>>> tree4 = ngesh.gen_tree(1.0, 0.33, min_leaves=5, labels="bio", seed="ngesh")
>>> print(tree4)

      /-Ennapus satvo
   /-|
  |   \-Sbuges asecis
--|
  |   /-Spoizes rovues
   \-|
     |   /-Spupsicmus essis
      \-|
         \-Zusis spesus

The progression of the random generation can be verified by expanding the number of leaves in the tree with human labels above:

>>> tree5 = ngesh.gen_tree(1.0, 0.33, min_leaves=6, labels="human", seed="ngesh")
>>> print(tree5)

      /-Fupsikmu
   /-|
  |   \-Enafu
--|
  |   /-Zuhu
  |  |
   \-|      /-Poizes
     |   /-|
      \-|   \-Buge
        |
         \-Hesi

Note that a single new taxon was created, “Buge”, while the rest of tree is equivalent to one above.

All trees are normal ETE3 Tree objects, which means that all methods from ETE3, including those for visualization and export, can be used directly. For example, we can easily iterate over all the descendants to grab the node distances, that is, the age in relation to the root. Note that internal nodes are not named by default:

>>> for node in tree5.iter_descendants():
>>>     print([node.name, node.dist])

["", 0.8181260108242158]
["", 1.3503435886693707]
["Fupsikmu", 1.1736487266924596]
["Enafu", 1.1736487266924596]
["Zuhu", 0.6414311488473046]
["", 0.5001786833989259]
["", 0.07984912684272014]
["Hesi", 0.1412524654483787]
["Poizes", 0.06140333860565855]
["Buge", 0.06140333860565855]

Random characters, matching the topology, can be added with the ngesh.add_characters() function. A new tree is returned (characters are not added in-place) and it is necessary to specify, at least, the number of characters to be simulated along with the k and theta arguments for the gamma distribution related to mutation events. Note that the function also allows to simulate events equivalent to horizontal gene transfer.

>>> tree_char = ngesh.add_characters(tree5, 10, 2.0, 1.0)
>>> for node in tree_char.iter_descendants():
>>>     if node.name:
>>>         print("%10s - " % node.name, " ".join(["%02i" % v for v in node.chars]))

  Fupsikmu -  15 28 02 17 04 05 06 07 29 09 10 11 30 13 19
     Enafu -  15 16 02 17 04 24 06 25 26 09 27 11 12 13 19
      Zuhu -  00 01 02 31 20 05 06 21 32 09 22 23 33 13 14
      Hesi -  00 01 02 03 20 05 06 21 08 09 22 23 12 13 14
    Poizes -  00 01 02 03 20 05 06 21 08 09 22 23 12 13 14
      Buge -  00 01 02 03 20 05 06 21 08 09 22 23 12 13 14

The trees can be exported to different formats, as described in the modules documentation and perfomed in the tests.

ngesh

ngesh package

Submodules

ngesh.output module

Module with auxiliary function for output generation.

ngesh.output.tree2nexus(tree: ete3.coretype.tree.TreeNode) → str

Returns a string with the representation of a tree in NEXUS format.

Parameters:tree – The ete3 tree whose NEXUS representation will be returned.
Returns:A string with the full representation of the tree in NEXUS format.
ngesh.output.tree2wordlist(tree: ete3.coretype.tree.TreeNode) → str

Returns a string with the representation of a tree in wordlist format.

Parameters:tree – The ete3 tree whose CSV representation will be returned.
Returns:A string with the full representation of the tree in CSV format.

ngesh.random_tree module

Random Phylogenetic Tree Generator.

This script provides function to generate random phylogenetic trees in a Yule (birth only) or Birth-Death model, setting different generation parameters and limiting the tree in terms of number of leaves and/or evolution time.

ngesh.random_tree.add_characters(tree: ete3.coretype.tree.TreeNode, num_characters: int, k: float, th: float, mut_exp: float = 1.0, k_hgt: Optional[float] = None, th_hgt: Optional[float] = None, seed: Optional[Hashable] = None) → ete3.coretype.tree.TreeNode

Add random characters to the nodes of a tree.

Characters are added according to parameters of gamma distributions which are related to the length of each branch. The two possible events are mutation (assumed to be always to a new character, i.e., no parallel evolution) and horizontal gene transfer. No perturbation, such as the simulation of errors in sequencing/data collection, is performed by this function.

Parameters:
  • tree – The ete3 tree to which characters will be added. Any previous characters will be overridden.
  • num_characters – The number of characters for each taxa.
  • k – The k parameter for the gamma distribution related to mutation events.
  • th – The theta parameter for the gamma distribution related to mutation events.
  • k_hgt – The k parameter for the gamma distribution related to horizontal gene transfer events. Defaults to None (in case HGT should be modelled but the user is unsure about an appropriate value for k_hgt, it suggested to set it to 1.5 times k).
  • th_hgt – The theta parameter for the gamma distribution related to horizontal gene transfer events. Defaults to None (in case HGT should be modelled but the user is unsure about an appropriate value for th_hgt, it suggested to set it to the same value as th).
  • mut_exp – The exponent for correction of mutation probability of each character. Defaults to 1.0 (no correction).
  • seed – An optional seed for the random number generator. Defaults to None.
Returns:

The provided tree, with random characters added.

ngesh.random_tree.gen_tree(birth: float = 1.0, death: float = 0.5, method: str = 'standard', min_leaves: Optional[int] = None, max_time: Optional[float] = None, lam: float = 0.0, prune: bool = False, labels: Optional[str] = 'enum', seed: Optional[Hashable] = None) → ete3.coretype.tree.TreeNode

Return a random phylogenetic tree.

At least one stopping criterion must be informed, with the tree being returned when such criterion, or either criteria, is/are met.

This function wraps the internal __gen_tree() function which cannot guarantee that a valid tree will be generated given the user parameters and the random sampling. It will try as many times as necessary to provide a valid (and reproducible, given a seed) tree, within the limits of an internal parameter for maximum number of attempts.

Parameters:
  • birth – The birth rate (lambda) for the generated tree. Defaults to 1.0.
  • death – The death rate (mu) for the generated tree. Must be explicitly set to zero for Yule model (i.e., birth only). Defaults to 0.5.
  • method – The generation method to use. Available methods are “default” and “fast” (contributed by Nicola de Maio).
  • min_leaves – A stopping criterion with the minimum number of extant leaves. The generated tree will have at least the number of requested extant leaves (possibly more, as the last speciation event might produce more leaves than the minimum specified. Defaults to None.
  • max_time – A stopping criterion with the maximum allowed time for evolution. Defaults to None.
  • lam – The expectation of interval for sampling a Poisson distribution during speciation, with a minimum of two descendants. Should be used if more than two descendants are to be allowed. Defaults to zero, meaning that all speciation events will have two and only two descendents.
  • prune – A flag indicating whether any non-extant leaves should be pruned from the tree before it is returned.
  • labels – The model to be used for generating random labels, either “enum” (for enumerated labels), “human” (for random single names), “bio” (for random biological names” or None. Defaults to “enum”.
  • seed – An optional seed for the random number generator. Defaults to None.
Returns:

The tree randomly generated according to the parameters.

ngesh.random_tree.label_tree(tree: ete3.coretype.tree.TreeNode, model: str = 'enum', seed: Optional[Hashable] = None)

Labels the nodes of a tree according to a model.

Linguistic labels are unique names generated in a way intended to be readable.

Please note that the tree object is changed in place (no return).

Parameters:
  • tree – The tree whose nodes will be labeled in place.
  • model – A string indicating which model for label generation should be used. Possible values are “enum” (for enumerated labels), “human” (for random single names), and “bio” (for random biological names).
  • seed – An optional seed for the random number generator, only used in case of linguistic and biological labels. Defaults to None.
ngesh.random_tree.simulate_bad_sampling(tree: ete3.coretype.tree.TreeNode, cutoff: Optional[float] = None, seed: Optional[Hashable] = None)

Modify a tree in place simulating bad sampling.

Bad sampling is currently simulated in an uniform distribution, i.e., all existing leaves have the same probability of being removed. Note that if a full simulation of tree topology and characters is performed, this task must be carried out after the simulation of character evolution, as otherwise they would fit the sampled tree and not the original one.

As the bad sampling simulation is also based on random numbers, while unlikely it is possible that no actual simulation takes place.

Parameters:
  • tree – ETE3 Tree object for bad sampling simulation.
  • cutoff – The approximate percentage of extant leaves to remove from the tree before returning, simulating uniform bad sampling. As this is performed randomly, there is no guarantee that any leaf will actually be removed. Default to None (no bad sampling simulation).
  • seed – An optional seed for the random number generator. Defaults to None.

Module contents

ngesh __init__.py

ngesh.gen_tree(birth: float = 1.0, death: float = 0.5, method: str = 'standard', min_leaves: Optional[int] = None, max_time: Optional[float] = None, lam: float = 0.0, prune: bool = False, labels: Optional[str] = 'enum', seed: Optional[Hashable] = None) → ete3.coretype.tree.TreeNode

Return a random phylogenetic tree.

At least one stopping criterion must be informed, with the tree being returned when such criterion, or either criteria, is/are met.

This function wraps the internal __gen_tree() function which cannot guarantee that a valid tree will be generated given the user parameters and the random sampling. It will try as many times as necessary to provide a valid (and reproducible, given a seed) tree, within the limits of an internal parameter for maximum number of attempts.

Parameters:
  • birth – The birth rate (lambda) for the generated tree. Defaults to 1.0.
  • death – The death rate (mu) for the generated tree. Must be explicitly set to zero for Yule model (i.e., birth only). Defaults to 0.5.
  • method – The generation method to use. Available methods are “default” and “fast” (contributed by Nicola de Maio).
  • min_leaves – A stopping criterion with the minimum number of extant leaves. The generated tree will have at least the number of requested extant leaves (possibly more, as the last speciation event might produce more leaves than the minimum specified. Defaults to None.
  • max_time – A stopping criterion with the maximum allowed time for evolution. Defaults to None.
  • lam – The expectation of interval for sampling a Poisson distribution during speciation, with a minimum of two descendants. Should be used if more than two descendants are to be allowed. Defaults to zero, meaning that all speciation events will have two and only two descendents.
  • prune – A flag indicating whether any non-extant leaves should be pruned from the tree before it is returned.
  • labels – The model to be used for generating random labels, either “enum” (for enumerated labels), “human” (for random single names), “bio” (for random biological names” or None. Defaults to “enum”.
  • seed – An optional seed for the random number generator. Defaults to None.
Returns:

The tree randomly generated according to the parameters.

ngesh.add_characters(tree: ete3.coretype.tree.TreeNode, num_characters: int, k: float, th: float, mut_exp: float = 1.0, k_hgt: Optional[float] = None, th_hgt: Optional[float] = None, seed: Optional[Hashable] = None) → ete3.coretype.tree.TreeNode

Add random characters to the nodes of a tree.

Characters are added according to parameters of gamma distributions which are related to the length of each branch. The two possible events are mutation (assumed to be always to a new character, i.e., no parallel evolution) and horizontal gene transfer. No perturbation, such as the simulation of errors in sequencing/data collection, is performed by this function.

Parameters:
  • tree – The ete3 tree to which characters will be added. Any previous characters will be overridden.
  • num_characters – The number of characters for each taxa.
  • k – The k parameter for the gamma distribution related to mutation events.
  • th – The theta parameter for the gamma distribution related to mutation events.
  • k_hgt – The k parameter for the gamma distribution related to horizontal gene transfer events. Defaults to None (in case HGT should be modelled but the user is unsure about an appropriate value for k_hgt, it suggested to set it to 1.5 times k).
  • th_hgt – The theta parameter for the gamma distribution related to horizontal gene transfer events. Defaults to None (in case HGT should be modelled but the user is unsure about an appropriate value for th_hgt, it suggested to set it to the same value as th).
  • mut_exp – The exponent for correction of mutation probability of each character. Defaults to 1.0 (no correction).
  • seed – An optional seed for the random number generator. Defaults to None.
Returns:

The provided tree, with random characters added.

ngesh.simulate_bad_sampling(tree: ete3.coretype.tree.TreeNode, cutoff: Optional[float] = None, seed: Optional[Hashable] = None)

Modify a tree in place simulating bad sampling.

Bad sampling is currently simulated in an uniform distribution, i.e., all existing leaves have the same probability of being removed. Note that if a full simulation of tree topology and characters is performed, this task must be carried out after the simulation of character evolution, as otherwise they would fit the sampled tree and not the original one.

As the bad sampling simulation is also based on random numbers, while unlikely it is possible that no actual simulation takes place.

Parameters:
  • tree – ETE3 Tree object for bad sampling simulation.
  • cutoff – The approximate percentage of extant leaves to remove from the tree before returning, simulating uniform bad sampling. As this is performed randomly, there is no guarantee that any leaf will actually be removed. Default to None (no bad sampling simulation).
  • seed – An optional seed for the random number generator. Defaults to None.
ngesh.tree2nexus(tree: ete3.coretype.tree.TreeNode) → str

Returns a string with the representation of a tree in NEXUS format.

Parameters:tree – The ete3 tree whose NEXUS representation will be returned.
Returns:A string with the full representation of the tree in NEXUS format.
ngesh.tree2wordlist(tree: ete3.coretype.tree.TreeNode) → str

Returns a string with the representation of a tree in wordlist format.

Parameters:tree – The ete3 tree whose CSV representation will be returned.
Returns:A string with the full representation of the tree in CSV format.
ngesh.show_random_tree()

Shows a random tree using ETE3.

This function is intended for a quick demonstration of the library.

Indices and tables