Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 86 additions & 15 deletions advanced_simplification.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ kernelspec:
% remove underscores in title when tutorial is complete or near-complete

:::{todo}
This tutorial is only partly complete: and there are a number of sections containing TODO items.
This tutorial is only partly complete, and there are a number of sections containing TODO items.
:::

This is a companion to the basic {ref}`sec_simplification` tutorial.
Expand All @@ -41,13 +41,13 @@ import tskit_arg_visualizer as argviz

arg = msprime.sim_ancestry(
samples=10,
sequence_length=1e4,
sequence_length=1.8e4,
recombination_rate=1e-8,
population_size=1e4,
record_full_arg=True,
random_seed=123,
)
arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124)
arg = msprime.sim_mutations(arg, rate=2e-8, random_seed=123)
```

:::{note}
Expand Down Expand Up @@ -86,7 +86,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
a simple way to do this:

```{code-cell} ipython3
```{code-cell}
def invert_map(node_mapping):
kept = node_mapping != tskit.NULL
indexes = node_mapping[kept] # indexes are guaranteed 0..N-1
Expand All @@ -98,12 +98,74 @@ reverse_map = invert_map(node_map)
print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
```

## 2) Keeping input roots
Here's how the IDs in the first tree have changed:

:::{todo}
The `keep_input_roots=True` argument is easy to illustrate, and useful for
forward sims / census approaches.
:::
```{code-cell}
simp.first().draw_svg(
size=(800, 300),
node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
)
```

## 2) Information above the local roots

Usually there are no nodes or mutations in a tree sequence above each local root.
However, as simplification deletes topology, it can create new local roots, leading to this
expectation being broken.

In some cases, you might want to retain nodes above the local roots, which is possible by
setting `keep_input_roots=True`. The most common reason for this is to allow
{ref}`recapitation <sec_completing_forward_simulations>` of forward-time simulations.
See {ref}`this tutorial section <sec_completing_forward_simulations_input_roots>`
for details.

### Removing mutations above the root

You can also end up with mutations above the root, for instance when all the chosen samples
are monomorphic, sharing a single derived mutation. In long running forward-time
simulations with mutation, this many mutations like this can gather above each local root.
These can be removed by re-setting the `ancestral_state` of a site to the `derived_state`
of the mutation immediately above the root node. There is currently no method
provided to do this, but the following code should work
(see [this tskit issue](https://github.com/tskit-dev/tskit/issues/260)):

```{code-cell}
def remove_root_mutations(ts):
tables = ts.dump_tables()
tables.sites.clear()
tables.mutations.clear()
for tree in ts.trees():
for s in tree.sites():
anc_state = s.ancestral_state
root_states = {u: anc_state for u in tree.roots if not tree.is_isolated(u)}
for m in s.mutations:
if m.node in root_states:
anc_state = m.derived_state
root_states[m.node] = anc_state
else:
tables.mutations.append(m.replace(parent=tskit.NULL))
if all([anc == anc_state for anc in root_states.values()]):
if anc_state != s.ancestral_state:
print(
f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
f"for site {s.id} at position {s.position}"
)
tables.sites.append(s.replace(ancestral_state=anc_state))
else:
raise ValueError(
f"Multiple roots with different inherited states exist for the site at position {s.position}"
)
tables.compute_mutation_parents()
return tables.tree_sequence()

simp_no_root_muts = remove_root_mutations(simp)
# Check this encodes the same genetic variation
for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
assert (var1.states() == var2.states()).all()
print(
f"Original simplified tree sequence had {simp.num_mutations} mutations, "
f"now has {simp_no_root_muts.num_mutations} mutations")
```

## 3) Keeping ancestral individuals

Expand Down Expand Up @@ -223,9 +285,9 @@ Identifying the _msprime_ recombination nodes that stay as pairs after simplific
a little work:

:::{todo}
Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local
roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE
(and possibly CA) nodes are kept when they should be discarded.
Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
above local roots are retained when `keep_unary=True`, see
https://github.com/tskit-dev/tskit/issues/3450.
:::

```{code-cell} ipython3
Expand All @@ -241,7 +303,7 @@ keep_CA_nodes = reverse_map[arg_num_children(simp) > 1]
```

Now that we have defined which nodes to keep, we can use the same trick as before,
passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`
passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`,
then again with `keep_unary=True`:

```{code-cell} ipython3
Expand All @@ -251,11 +313,20 @@ tmp_arg = arg.simplify(keep, update_sample_flags=False)
subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples
```

Here's what it looks like in graph form:
Here's what it looks like in graph form, with recombination nodes in red and common ancestor non-coalescent nodes in blue:

```{code-cell} ipython3
d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
d3arg.set_all_node_styles(size=100, stroke_width=2)
d3arg.set_node_styles({
u: {"symbol": "d3.symbolSquare", "fill": "black"} for u in subset_arg.samples()
})
d3arg.set_node_styles({i : {"fill": "red"} for i in node_map[keep_RE_nodes]})
d3arg.set_node_styles({i : {"fill": "blue"} for i in node_map[keep_CA_nodes]})
d3arg.draw(
edge_type="ortho",
height=800,
title=f"A full ARG, subset to {subset_arg.num_samples} samples");
```

## 5) reduce_to_site_topology
Expand Down
19 changes: 10 additions & 9 deletions completing_forward_sims.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,20 +9,20 @@ kernelspec:
name: python3
---

(sec_completing_forwards_simulations)=
(sec_completing_forward_simulations)=

# Completing forwards simulations
# Recapitation: completing a forward simulation

The ``msprime`` simulator generates tree sequences using the backwards in
time coalescent model. But it is also possible to output tree sequences
from [forwards-time](https://doi.org/10.1371/journal.pcbi.1006581)
The ``msprime`` simulator generates tree sequences using the
backward-in-time coalescent model. But it is also possible to output tree sequences
from [forward-time](https://doi.org/10.1371/journal.pcbi.1006581)
simulators such as [SLiM](https://messerlab.org/slim)
and [fwdpy11](https://fwdpy11.readthedocs.io/) (see the
{ref}`sec_tskit_forward_simulations` tutorial).
There are many advantages to using forward-time simulators, but they
are usually quite slow compared to similar coalescent simulations. In this
section we show how to combine the best of both approaches by simulating
the recent past using a forwards-time simulator and then complete the
the recent past using a forward-time simulator and then complete the
simulation of the ancient past using ``msprime``. (We sometimes refer to this
"recapitation", as we can think of it as adding a "head" onto a tree sequence.)

Expand Down Expand Up @@ -133,9 +133,10 @@ coalesced_ts = msprime.sim_ancestry(
coalesced_ts.draw_svg()
```

The trees have fully coalesced and we've successfully combined a forwards-time
The trees have fully coalesced and we've successfully combined a forward-time
Wright-Fisher simulation with a coalescent simulation: hooray!

(sec_completing_forward_simulations_input_roots)=

## Why keep input roots (i.e., the initial generation)?

Expand Down Expand Up @@ -164,7 +165,7 @@ the method presented here.

## Topology gotchas

The trees that we output from this combined forwards and backwards simulation
The trees that we output from this combined forward and backward simulation
process have some slightly odd properties that are important to be aware of.
In the example above, we can see that the old roots are still present in both trees,
even through they have only one child and are clearly redundant.
Expand All @@ -179,7 +180,7 @@ they may cause problems:
2. If you are computing the overall tree "height" by taking the time of the
root node, you may overestimate the height because there is a unary edge
above the "real" root (this would happen if one of the trees had already
coalesced in the forwards-time simulation).
coalesced in the forward-time simulation).

For these reasons it may be better to remove this redundancy from your
computed tree sequence which is easily done using the
Expand Down
6 changes: 3 additions & 3 deletions forward_sims.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ along with their genomes, storing inherited genomic regions as well as the full
The code in this tutorial is broken into separate functions for clarity and
to make it easier to modify for your own purposes; a simpler and substantially
condensed forward-simulator is coded as a single function at the top of the
{ref}`sec_completing_forwards_simulations` tutorial.
{ref}`sec_completing_forward_simulations` tutorial.

:::{note}
If you are simply trying to obtain a tree sequence which is
Expand Down Expand Up @@ -210,7 +210,7 @@ that the child inherits a mosaic of the two genomes present in each parent.

The exact details of the mosaic will depend on the model of recombination you
wish to implement. For instance, a simple model such as that in the
{ref}`sec_completing_forwards_simulations` tutorial might assume exactly one
{ref}`sec_completing_forward_simulations` tutorial might assume exactly one
crossover per chromosome. A complex model might allow not just multiple
crossovers with e.g. recombination "hotspots", but also non-crossover events
such as {ref}`msprime:sec_ancestry_gene_conversion`.
Expand Down Expand Up @@ -475,7 +475,7 @@ in which an alternative technique, such as backward-in-time coalescent simulatio
is used to to fill in the "head" of the tree sequence. In other words,
we can use a fast backward-time simulator such as `msprime` to simulate the
prior genealogy of the oldest nodes in the simplified tree sequence.
Details are described in the {ref}`sec_completing_forwards_simulations`
Details are described in the {ref}`sec_completing_forward_simulations`
tutorial.

## More complex forward-simulations
Expand Down
2 changes: 1 addition & 1 deletion more_forward_sims.md
Original file line number Diff line number Diff line change
Expand Up @@ -254,5 +254,5 @@ This will primarily deal with sites and mutations (and mutational metadata).
We could also include details on selection, if that seems sensible.

The section in that workbook on "Starting with a prior history" should be put in
the {ref}`sec_completing_forwards_simulations` tutorial.
the {ref}`sec_completing_forward_simulations` tutorial.
:::
18 changes: 9 additions & 9 deletions simulation_overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,29 +44,29 @@ Compare to expectations
(e.g. for use as a null model). For instance, comparison to *neutral* simulations
can be used to identify regions subject to selection.

There are two major forms of population genetic simulation: **forwards-time**
and **backwards-time**. In general, forwards-time simulation is detailed and more
realistic, while backwards-time simulation is fast and efficient.
There are two major forms of population genetic simulation: **forward-time**
and **backward-time**. In general, forward-time simulation is detailed and more
realistic, while backward-time simulation is fast and efficient.

More specifically, apart from a
{ref}`few exceptions <msprime:sec_ancestry_models_selective_sweeps>`,
backwards-time simulations are primarily focused on neutral simulations, while
backward-time simulations are primarily focused on neutral simulations, while
forward simulation is better suited to complex simulations, including those involving
selection and continuous space.

## Advantages of tree sequences

Some forwards-time ([SLiM](http://messerlab.org/slim/),
[fwdpy](http://molpopgen.github.io/fwdpy/)) and backwards-time
Some forward-time ([SLiM](http://messerlab.org/slim/),
[fwdpy](http://molpopgen.github.io/fwdpy/)) and backward-time
([msprime](https://tskit.dev/msprime)) simulators have a built-in capacity to output
tree sequences. This can have several benefits:

1. Neutral mutations, which often account for the majority of genetic variation, do not
need to be tracked during the simulation, but can be added afterwards. See
"{ref}`sec_tskit_no_mutations`".
2. Tree sequences can be used as an interchange format to combine backwards and
forwards simulations, allowing you to take advantage of the advantages of both
approaches. This is detailed in {ref}`sec_completing_forwards_simulations`.
2. Tree sequences can be used as an interchange format to combine backward and
forward simulations, allowing you to take advantage of the advantages of both
approaches. This is detailed in {ref}`sec_completing_forward_simulations`.

## Some tips on simulation

Expand Down
2 changes: 1 addition & 1 deletion terminology_and_concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ objects. In tree sequences, however, all nodes, both internal and terminal,
are represented by an **integer ID**, unique over the entire tree sequence, and which exists
at a specific point in time. A branch point in any of the trees is associated with
an *internal node*, representing an ancestor in which a single DNA
sequence was duplicated (in forwards-time terminology) or in which multiple sequences
sequence was duplicated (in forward-time terminology) or in which multiple sequences
coalesced (in backwards-time terminology).


Expand Down
Loading