diff --git a/advanced_simplification.md b/advanced_simplification.md index ac7ce54f..465ef635 100644 --- a/advanced_simplification.md +++ b/advanced_simplification.md @@ -20,7 +20,7 @@ kernelspec: % remove underscores in title when tutorial is complete or near-complete :::{todo} -This tutorial is only partly complete: and there are a number of sections containing TODO items. +This tutorial is only partly complete, and there are a number of sections containing TODO items. ::: This is a companion to the basic {ref}`sec_simplification` tutorial. @@ -41,13 +41,13 @@ import tskit_arg_visualizer as argviz arg = msprime.sim_ancestry( samples=10, - sequence_length=1e4, + sequence_length=1.8e4, recombination_rate=1e-8, population_size=1e4, record_full_arg=True, random_seed=123, ) -arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124) +arg = msprime.sim_mutations(arg, rate=2e-8, random_seed=123) ``` :::{note} @@ -86,7 +86,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha Often you might want a reverse map, mapping the new node IDs to the old ones. Here's a simple way to do this: -```{code-cell} ipython3 +```{code-cell} def invert_map(node_mapping): kept = node_mapping != tskit.NULL indexes = node_mapping[kept] # indexes are guaranteed 0..N-1 @@ -98,12 +98,74 @@ reverse_map = invert_map(node_map) print("New sample ID 0", "maps to old ID", int(reverse_map[0])) ``` -## 2) Keeping input roots +Here's how the IDs in the first tree have changed: -:::{todo} -The `keep_input_roots=True` argument is easy to illustrate, and useful for -forward sims / census approaches. -::: +```{code-cell} +simp.first().draw_svg( + size=(800, 300), + node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()} +) +``` + +## 2) Information above the local roots + +Usually there are no nodes or mutations in a tree sequence above each local root. +However, as simplification deletes topology, it can create new local roots, leading to this +expectation being broken. + +In some cases, you might want to retain nodes above the local roots, which is possible by +setting `keep_input_roots=True`. The most common reason for this is to allow +{ref}`recapitation ` of forward-time simulations. +See {ref}`this tutorial section ` +for details. + +### Removing mutations above the root + +You can also end up with mutations above the root, for instance when all the chosen samples +are monomorphic, sharing a single derived mutation. In long running forward-time +simulations with mutation, this many mutations like this can gather above each local root. +These can be removed by re-setting the `ancestral_state` of a site to the `derived_state` +of the mutation immediately above the root node. There is currently no method +provided to do this, but the following code should work +(see [this tskit issue](https://github.com/tskit-dev/tskit/issues/260)): + +```{code-cell} +def remove_root_mutations(ts): + tables = ts.dump_tables() + tables.sites.clear() + tables.mutations.clear() + for tree in ts.trees(): + for s in tree.sites(): + anc_state = s.ancestral_state + root_states = {u: anc_state for u in tree.roots if not tree.is_isolated(u)} + for m in s.mutations: + if m.node in root_states: + anc_state = m.derived_state + root_states[m.node] = anc_state + else: + tables.mutations.append(m.replace(parent=tskit.NULL)) + if all([anc == anc_state for anc in root_states.values()]): + if anc_state != s.ancestral_state: + print( + f"Changed ancestral state from {s.ancestral_state} to {anc_state} " + f"for site {s.id} at position {s.position}" + ) + tables.sites.append(s.replace(ancestral_state=anc_state)) + else: + raise ValueError( + f"Multiple roots with different inherited states exist for the site at position {s.position}" + ) + tables.compute_mutation_parents() + return tables.tree_sequence() + +simp_no_root_muts = remove_root_mutations(simp) +# Check this encodes the same genetic variation +for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()): + assert (var1.states() == var2.states()).all() +print( + f"Original simplified tree sequence had {simp.num_mutations} mutations, " + f"now has {simp_no_root_muts.num_mutations} mutations") +``` ## 3) Keeping ancestral individuals @@ -223,9 +285,9 @@ Identifying the _msprime_ recombination nodes that stay as pairs after simplific a little work: :::{todo} -Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local -roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE -(and possibly CA) nodes are kept when they should be discarded. +Currently the code below wrongly includes a few extra RE and CA nodes, because nodes +above local roots are retained when `keep_unary=True`, see +https://github.com/tskit-dev/tskit/issues/3450. ::: ```{code-cell} ipython3 @@ -241,7 +303,7 @@ keep_CA_nodes = reverse_map[arg_num_children(simp) > 1] ``` Now that we have defined which nodes to keep, we can use the same trick as before, -passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False` +passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`, then again with `keep_unary=True`: ```{code-cell} ipython3 @@ -251,11 +313,20 @@ tmp_arg = arg.simplify(keep, update_sample_flags=False) subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples ``` -Here's what it looks like in graph form: +Here's what it looks like in graph form, with recombination nodes in red and common ancestor non-coalescent nodes in blue: ```{code-cell} ipython3 d3arg = argviz.D3ARG.from_ts(ts=subset_arg) -d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples"); +d3arg.set_all_node_styles(size=100, stroke_width=2) +d3arg.set_node_styles({ + u: {"symbol": "d3.symbolSquare", "fill": "black"} for u in subset_arg.samples() +}) +d3arg.set_node_styles({i : {"fill": "red"} for i in node_map[keep_RE_nodes]}) +d3arg.set_node_styles({i : {"fill": "blue"} for i in node_map[keep_CA_nodes]}) +d3arg.draw( + edge_type="ortho", + height=800, + title=f"A full ARG, subset to {subset_arg.num_samples} samples"); ``` ## 5) reduce_to_site_topology diff --git a/completing_forward_sims.md b/completing_forward_sims.md index 495e4029..9f8d6fb2 100644 --- a/completing_forward_sims.md +++ b/completing_forward_sims.md @@ -9,20 +9,20 @@ kernelspec: name: python3 --- -(sec_completing_forwards_simulations)= +(sec_completing_forward_simulations)= -# Completing forwards simulations +# Recapitation: completing a forward simulation -The ``msprime`` simulator generates tree sequences using the backwards in -time coalescent model. But it is also possible to output tree sequences -from [forwards-time](https://doi.org/10.1371/journal.pcbi.1006581) +The ``msprime`` simulator generates tree sequences using the +backward-in-time coalescent model. But it is also possible to output tree sequences +from [forward-time](https://doi.org/10.1371/journal.pcbi.1006581) simulators such as [SLiM](https://messerlab.org/slim) and [fwdpy11](https://fwdpy11.readthedocs.io/) (see the {ref}`sec_tskit_forward_simulations` tutorial). There are many advantages to using forward-time simulators, but they are usually quite slow compared to similar coalescent simulations. In this section we show how to combine the best of both approaches by simulating -the recent past using a forwards-time simulator and then complete the +the recent past using a forward-time simulator and then complete the simulation of the ancient past using ``msprime``. (We sometimes refer to this "recapitation", as we can think of it as adding a "head" onto a tree sequence.) @@ -133,9 +133,10 @@ coalesced_ts = msprime.sim_ancestry( coalesced_ts.draw_svg() ``` -The trees have fully coalesced and we've successfully combined a forwards-time +The trees have fully coalesced and we've successfully combined a forward-time Wright-Fisher simulation with a coalescent simulation: hooray! +(sec_completing_forward_simulations_input_roots)= ## Why keep input roots (i.e., the initial generation)? @@ -164,7 +165,7 @@ the method presented here. ## Topology gotchas -The trees that we output from this combined forwards and backwards simulation +The trees that we output from this combined forward and backward simulation process have some slightly odd properties that are important to be aware of. In the example above, we can see that the old roots are still present in both trees, even through they have only one child and are clearly redundant. @@ -179,7 +180,7 @@ they may cause problems: 2. If you are computing the overall tree "height" by taking the time of the root node, you may overestimate the height because there is a unary edge above the "real" root (this would happen if one of the trees had already - coalesced in the forwards-time simulation). + coalesced in the forward-time simulation). For these reasons it may be better to remove this redundancy from your computed tree sequence which is easily done using the diff --git a/forward_sims.md b/forward_sims.md index a28c6deb..421b26a7 100644 --- a/forward_sims.md +++ b/forward_sims.md @@ -25,7 +25,7 @@ along with their genomes, storing inherited genomic regions as well as the full The code in this tutorial is broken into separate functions for clarity and to make it easier to modify for your own purposes; a simpler and substantially condensed forward-simulator is coded as a single function at the top of the -{ref}`sec_completing_forwards_simulations` tutorial. +{ref}`sec_completing_forward_simulations` tutorial. :::{note} If you are simply trying to obtain a tree sequence which is @@ -210,7 +210,7 @@ that the child inherits a mosaic of the two genomes present in each parent. The exact details of the mosaic will depend on the model of recombination you wish to implement. For instance, a simple model such as that in the -{ref}`sec_completing_forwards_simulations` tutorial might assume exactly one +{ref}`sec_completing_forward_simulations` tutorial might assume exactly one crossover per chromosome. A complex model might allow not just multiple crossovers with e.g. recombination "hotspots", but also non-crossover events such as {ref}`msprime:sec_ancestry_gene_conversion`. @@ -475,7 +475,7 @@ in which an alternative technique, such as backward-in-time coalescent simulatio is used to to fill in the "head" of the tree sequence. In other words, we can use a fast backward-time simulator such as `msprime` to simulate the prior genealogy of the oldest nodes in the simplified tree sequence. -Details are described in the {ref}`sec_completing_forwards_simulations` +Details are described in the {ref}`sec_completing_forward_simulations` tutorial. ## More complex forward-simulations diff --git a/more_forward_sims.md b/more_forward_sims.md index 8325e68f..17a52074 100644 --- a/more_forward_sims.md +++ b/more_forward_sims.md @@ -254,5 +254,5 @@ This will primarily deal with sites and mutations (and mutational metadata). We could also include details on selection, if that seems sensible. The section in that workbook on "Starting with a prior history" should be put in -the {ref}`sec_completing_forwards_simulations` tutorial. +the {ref}`sec_completing_forward_simulations` tutorial. ::: \ No newline at end of file diff --git a/simulation_overview.md b/simulation_overview.md index af1963d5..ac4079b3 100644 --- a/simulation_overview.md +++ b/simulation_overview.md @@ -44,29 +44,29 @@ Compare to expectations (e.g. for use as a null model). For instance, comparison to *neutral* simulations can be used to identify regions subject to selection. -There are two major forms of population genetic simulation: **forwards-time** -and **backwards-time**. In general, forwards-time simulation is detailed and more -realistic, while backwards-time simulation is fast and efficient. +There are two major forms of population genetic simulation: **forward-time** +and **backward-time**. In general, forward-time simulation is detailed and more +realistic, while backward-time simulation is fast and efficient. More specifically, apart from a {ref}`few exceptions `, -backwards-time simulations are primarily focused on neutral simulations, while +backward-time simulations are primarily focused on neutral simulations, while forward simulation is better suited to complex simulations, including those involving selection and continuous space. ## Advantages of tree sequences -Some forwards-time ([SLiM](http://messerlab.org/slim/), -[fwdpy](http://molpopgen.github.io/fwdpy/)) and backwards-time +Some forward-time ([SLiM](http://messerlab.org/slim/), +[fwdpy](http://molpopgen.github.io/fwdpy/)) and backward-time ([msprime](https://tskit.dev/msprime)) simulators have a built-in capacity to output tree sequences. This can have several benefits: 1. Neutral mutations, which often account for the majority of genetic variation, do not need to be tracked during the simulation, but can be added afterwards. See "{ref}`sec_tskit_no_mutations`". -2. Tree sequences can be used as an interchange format to combine backwards and - forwards simulations, allowing you to take advantage of the advantages of both - approaches. This is detailed in {ref}`sec_completing_forwards_simulations`. +2. Tree sequences can be used as an interchange format to combine backward and + forward simulations, allowing you to take advantage of the advantages of both + approaches. This is detailed in {ref}`sec_completing_forward_simulations`. ## Some tips on simulation diff --git a/terminology_and_concepts.md b/terminology_and_concepts.md index 360b587d..72c13cac 100644 --- a/terminology_and_concepts.md +++ b/terminology_and_concepts.md @@ -145,7 +145,7 @@ objects. In tree sequences, however, all nodes, both internal and terminal, are represented by an **integer ID**, unique over the entire tree sequence, and which exists at a specific point in time. A branch point in any of the trees is associated with an *internal node*, representing an ancestor in which a single DNA -sequence was duplicated (in forwards-time terminology) or in which multiple sequences +sequence was duplicated (in forward-time terminology) or in which multiple sequences coalesced (in backwards-time terminology).