Simulation and Data Conversion¶

Simulating a metagenotype is equivalent to sampling from the StrainFacts model (or one of the multiple models that have been defined). For the default simulation model (a.k.a. simulation0), this means simulating each of:

  • genotypes of each strain
  • strain relative abundances in each sample
  • error rates and over-dispersion parameters
  • the actual, observed counts at each site in each sample

Hyperparameters controlling each of these are specified explicitly. Likewise, we specify all three dimensions of our model, $S$, $N$, and $G$, the number of strains, number of samples, and number of SNP positions, respectively.

We also set a random seed so that running this command again will give us an identical output.

In [1]:
!sfacts simulate \
    --num-strains=10 --num-samples=50 --num-positions=1000 \
    --hyperparameters pi_hyper=0.1 mu_hyper_mean=10.0 epsilon_hyper_mode=0.01 \
    --random-seed=0 \
    sim.world.nc

The simulate command writes the full description of our simulated "world" (including all of the unobserved variables) to a file in the NetCDF format.

We then use the sfacts Python library to load and visualize the the simulated values from this file.

In [2]:
import sfacts as sf

# Read the simulated world.
sim = sf.data.World.load('sim.world.nc')
sim_ss = sim.random_sample(position=min(400, len(sim.position)))

# Plot the metagenotype for each sample.
sf.plot.plot_metagenotype(
    sim_ss,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
    row_linkage_func=lambda w: w.metagenotype.linkage("position"),
)

# Plot relative abundances for each sample (the "community").
sf.plot.plot_community(
    sim_ss,
    row_linkage_func=lambda w: w.genotype.linkage("strain"),
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)

# Plot the (unobserved) genotypes of the 10 simulated strains.
sf.plot.plot_genotype(
    sim_ss,
    row_linkage_func=lambda w: w.genotype.linkage("strain"),
    col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
Out[2]:
<seaborn.matrix.ClusterGrid at 0x148dc2690>

The {row,col}_linkage_func parameter allows us to customize how the columns/rows of these heatmaps are ordered. Here we match the axes across all three plots.

Next, we use the CLI to select just the metagenotype (the observed data) from this simulated world, and dumping it to a TSV.

In [3]:
!sfacts dump sim.world.nc --metagenotype sim.mgen.tsv
In [4]:
!head sim.mgen.tsv | column -t
sample  position  allele  metagenotype
0       0         alt     0.0
0       0         ref     10.0
0       1         alt     10.0
0       1         ref     0.0
0       2         alt     10.0
0       2         ref     0.0
0       3         alt     10.0
0       3         ref     0.0
0       4         alt     10.0

When working with "real world" metagenotypes, they should be formatted to match this example TSV file.

This is a convenient format for importing data from other tools. We'll convert this TSV back to the native StrainFacts format.

In [5]:
!sfacts load --metagenotype sim.mgen.tsv sim.mgen.nc

This metagenotype file will be filtered and used for inference in the next example.