| jupytext |
|
||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| kernelspec |
|
:tags: [remove-cell]
import msprime
def create_notebook_data():
ts = msprime.sim_ancestry(
samples=4,
sequence_length=1_000,
population_size=300,
random_seed=42,
)
ts = msprime.sim_mutations(ts, rate=1e-6, random_seed=321)
assert ts.num_mutations == 2
site_pos = ts.simplify([0, 1, 2]).sites_position
assert len(site_pos) == 1 and site_pos[0] == ts.sites_position[1]
ts.dump("data/simplification_basic.trees")
# create_notebook_data() # uncomment to recreate the tree seqs used in this notebook
(sec_simplification)=
The {meth}~TreeSequence.simplify method provides one of the most powerful ways to modify a
tskit {class}TreeSequence.
At a high level, simplification works as follows: it starts from a chosen set of focal nodes and then traces their ancestry back through the tree sequence. Any nodes, edges, and mutations (as well as individuals, populations, and sites) that are not needed to represent that ancestry are discarded, and the remaining information is compacted into a new, equivalent tree sequence. During this process, IDs of nodes and other objects may change. In particular, non-coalescent nodes are usually removed, unless you ask to keep them.
Simplification is commonly used:
- In forward simulations, to remove lineages that have gone extinct
- To create a smaller tree sequence focussed on a subset of samples
- To remove redundant nodes and other tskit objects (e.g. unreferenced populations)
Other less common uses, such as retaining all ancestral individuals, retaining unary
regions of coalescent nodes, and simplifying without touching the node table,
are described in the {ref}sec_advanced_simplification tutorial.
We start with a very small example for ease of visualisation. This is a tree sequence consisting of a single tree with 8 haploid genomes (4 diploid individuals) and 2 variable sites.
import tskit
ts = tskit.load("data/simplification_basic.trees")
plot_params = {"size": (500, 200), "time_scale": "log_time", "y_axis":True, "y_ticks": [0, 10, 100, 1000]}
ts.draw_svg(**plot_params)
Suppose we only want to retain the ancestry of sample nodes 0, 1, and 2. We can do this
by passing those IDs as the new samples to the {meth}~TreeSequence.simplify method:
focal_node_ids = [0, 1, 2]
ts_simp1 = ts.simplify(samples=focal_node_ids)
ts_simp1.draw_svg(**plot_params)
Restricting the sample nodes to [0, 1, 2] makes the ancestry much simpler. Note that one of the mutations is not relevant to the new samples, so it has been filtered out, causing the ID of the remaining mutation to change from 1 to 0. Similarly, many nodes have been filtered out, resulting in changed node IDs (e.g. the root node at time 1000 is the same in both trees, but its ID has changed from 14 to 4).
To keep node IDs the same, you can specify filter_nodes=False. Although this makes
the result easier to compare with the original, it is not generally recommended, as it
leaves redundant nodes cluttering up the tree sequence.
ts_simp2 = ts.simplify(focal_node_ids, filter_nodes=False, filter_sites=False)
# Node IDs should now remain unchanged
ts_simp2.draw_svg(**plot_params)
Note that the example above also used another filter_ argument, setting
filter_sites=False, so that the first site, which has no mutations after
simplification, is also retained (it is shown as a bare tick mark on the X axis,
around position 250). However, mutations above unused nodes are still deleted,
so mutation IDs are not guaranteed to stay the same.
To further reduce the size of the simplified tree sequence, simplification normally
removes nodes from the ancestry that no longer represent branch points (coalescences).
We can leave those in using keep_unary=True.
ts_simp3 = ts.simplify(focal_node_ids, filter_nodes=False, keep_unary=True)
ts_simp3.draw_svg(**plot_params)
:::{note}
As modifying a tree sequence can change the IDs of nodes, sites, and other objects, it
can be useful to use {ref}metadata <sec_tutorial_metadata>: information that stays
associated with tskit objects even when their IDs change. When simplifying, it is
also possible to keep track of node ID changes by using the map_nodes parameter,
see the {ref}advanced simplification <sec_advanced_simplification_map_nodes> tutorial.
:::
Here we examine the impact of simplification on the efficiency of tree sequence storage and processing. We'll start with a larger backward simulation that has a handful of admixed individuals:
demography = msprime.Demography()
demography.add_population(name="SMALL", initial_size=1000)
demography.add_population(name="BIG", initial_size=4000)
demography.add_population(name="ADMIX", initial_size=500)
demography.add_population(name="ANC", initial_size=1500)
demography.add_admixture(
time=100, derived="ADMIX", ancestral=["SMALL", "BIG"], proportions=[0.5, 0.5]
)
demography.add_population_split(time=1_000, derived=["SMALL", "BIG"], ancestral="ANC")
big_ts = msprime.sim_ancestry(
samples={"SMALL": 400, "BIG": 400, "ADMIX": 6, "ANC": 400},
demography=demography,
sequence_length=5e7,
recombination_rate=2e-8,
random_seed=2432,
)
big_ts = msprime.sim_mutations(big_ts, rate=1e-8, random_seed=6151)
print(
f"`big_ts` represents a simulation with admixture of {big_ts.num_samples} samples",
f"over {big_ts.sequence_length/1e6:g} Mb ({big_ts.num_trees} trees)",
)
Here, about a third of the sample nodes (those from the ANC population) exist
at times prior to the current generation, i.e. they are historical sample nodes.
In fact, in forward simulations most nodes will be of this sort, left over from
previously simulated generations. These are often unwanted, and one of the main
use cases for simplification is to reduce the ancestry to that of just the
contemporary genomes: i.e. removing any edges, nodes, and mutations that track
"extinct" lineages.
modern_sample_ids = big_ts.samples(time=0)
ts_modern = big_ts.simplify(modern_sample_ids)
print(f"Tree sequence simplified to {ts_modern.nbytes/big_ts.nbytes:.1%} of original size")
Simplifying has only produced a modest reduction in size, but you can imagine that
in a forward simulation, where the majority of genomes are historical, repeated
simplification can result in huge savings. In practice, simulators usually do this
regular simplification of the tables used to store the paths of genetic inheritance
automatically, so using
{meth}~TableCollection.simplify in this way is mainly of interest if you are
{ref}building your own forward simulator <sec_tskit_forward_simulations>.
Often you might want to focus on only one group of samples, for example the small
group of ADMIX individuals (population ID 2 in this simulation):
admix_pop_id = 2
assert big_ts.population(admix_pop_id).metadata["name"] == "ADMIX"
admix_sample_ids = big_ts.samples(population=admix_pop_id)
ts_admix, node_map = big_ts.simplify(admix_sample_ids, map_nodes=True)
print(f"Tree sequence simplified to {ts_admix.nbytes/big_ts.nbytes:.2%} of original size")
print()
print(f"Previous admixed sample IDs were {admix_sample_ids}")
print(f"Simplifying has changed these to {node_map[admix_sample_ids]}")
# Check that these are indeed the only sample IDs
assert set(node_map[admix_sample_ids]) == set(ts_admix.samples())
:::{note}
The map_nodes=True argument means that simplify() returns both a new
tree sequence and an array mapping each old node ID to its new ID, or to
tskit.NULL if that node is removed.
Here you can see that (unlike in previous examples) the sample node IDs
have changed: unless filter_nodes=False, the N node IDs provided as the samples
argument will be allocated new IDs from 0 to N - 1 in the returned tree sequence (so simplify can be used to reorder sample IDs, although
{meth}~TreeSequence.subset is a way to do this with fewer side effects).
:::
Edges take up the majority of the space in most tree sequences. In this case you can
see that although simplify has reduced the sample nodes to 12 genomes from
the 6 diploid ADMIX individuals (a reduction of 99.5%), the number of edges
has not been reduced by such a large amount.
That's because many of the ancestors of the SMALL and BIG populations are also shared
by ADMIX. It also shows why tree sequence structures are so effective for encoding
and analysing large datasets: storage and processing efficiency, in particular the
number of edges, is sub-linear in the number of samples.
print(
f"The simplified tree sequence has only {ts_admix.num_samples / big_ts.num_samples:.2%} of the samples,",
f"but retains {ts_admix.num_edges / big_ts.num_edges:.2%} of the edges."
)
If you want to analyse only the admixed individuals, using the simplified tree sequence
is much more efficient than running equivalent operations on the original big_ts:
%%timeit
# Speed test for decoding all the genetic variants of the admixed individuals
for v in ts_admix.variants():
pass
Identical results can be obtained using the full tree sequence and restricting calculations to the admix_sample_ids, but this approach is much slower:
%%timeit
# Equivalent processing of admixed individuals, using the full tree sequence
for v in big_ts.variants(samples=admix_sample_ids):
pass
The same efficiencies apply to calculating statistics on subsets of genomic samples.
As simplification has been highly optimised in tskit, if you perform repeated
processing of the same subset of genomes, it can be worth simplifying before
processing.
If we print out the original and admix-only (simplified) tree sequence, we can see that a number of other tables have also been reduced in size. For instance, simplification has reduced the number of individuals from 1206 to 6, and the number of sites to less than a sixth of the original.
print("Original tree sequence")
big_ts
print("Simplified tree sequence")
ts_admix
Note that the call to {meth}TreeSequence.simplify has been recorded in the
{ref}sec_provenance information. Like most tree sequence methods, you can pass
record_provenance=False if you want this to be omitted (which will save space, but not
lead to other efficiency gains).
On closer inspection, you might be surprised to see that there are still 4 populations in
the simplified tree sequence, although it contains only samples from the ADMIX population:
print(
"Sample nodes in `ts_admix` belong to the following populations",
ts_admix.tables.nodes.population[ts_admix.samples()],
)
ts_admix.tables.populations
The reason that the other populations (BIG, SMALL, and ANC) have been retained is
that the simulation has assigned populations to both sample and nonsample nodes. If we
{ref}edit <sec_tables_editing> the tree sequence tables such that ancestral
(non-sample) genomes are not associated with defined populations, then simplification will
remove all but the admixed population (and reassign the population IDs as
necessary).
An example of this is given in the code below, which performs a further round of simplification,
taking advantage of the fact that if a list of focal nodes is not given, simplify
uses the existing sample nodes.
import numpy as np
import tskit
tables = ts_admix.dump_tables()
samples = ts_admix.samples()
# Make an array of NULL population values for each node
nodes_population = np.full_like(tables.nodes.population, tskit.NULL)
# Set the sample node populations back to their expected population
nodes_population[samples] = ts_admix.nodes_population[samples]
tables.nodes.population = nodes_population
tables.simplify() # This is the tables version of simplify, often used in forward sims
ts_admix_only = tables.tree_sequence()
print(
"Sample nodes in `ts_admix_only` belong to the following populations",
ts_admix_only.tables.nodes.population[ts_admix_only.samples()],
)
ts_admix_only.tables.populations
Although reducing the number of populations saves space, it requires care.
For instance admix_pop_id can no longer be used to refer to the correct ID
in the ts_admix_only tree sequence.
Simplify is somewhat of a "Swiss Army knife" for tree sequences, and can be used in
several other ways. See the {ref}sec_advanced_simplification tutorial for more details.