Simulating a metagenotype is equivalent to sampling from the StrainFacts model
(or one of the multiple models that have been defined).
For the default simulation model (a.k.a. simulation0
),
this means simulating each of:
Hyperparameters controlling each of these are specified explicitly. Likewise, we specify all three dimensions of our model, $S$, $N$, and $G$, the number of strains, number of samples, and number of SNP positions, respectively.
We also set a random seed so that running this command again will give us an identical output.
!sfacts simulate \
--num-strains=10 --num-samples=50 --num-positions=1000 \
--hyperparameters pi_hyper=0.1 mu_hyper_mean=10.0 epsilon_hyper_mode=0.01 \
--random-seed=0 \
sim.world.nc
The simulate
command writes the full description of our simulated "world"
(including all of the unobserved variables) to a file in the NetCDF format.
We then use the sfacts
Python library to load and visualize the
the simulated values from this file.
import sfacts as sf
# Read the simulated world.
sim = sf.data.World.load('sim.world.nc')
sim_ss = sim.random_sample(position=min(400, len(sim.position)))
# Plot the metagenotype for each sample.
sf.plot.plot_metagenotype(
sim_ss,
col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
row_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
# Plot relative abundances for each sample (the "community").
sf.plot.plot_community(
sim_ss,
row_linkage_func=lambda w: w.genotype.linkage("strain"),
col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)
# Plot the (unobserved) genotypes of the 10 simulated strains.
sf.plot.plot_genotype(
sim_ss,
row_linkage_func=lambda w: w.genotype.linkage("strain"),
col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
<seaborn.matrix.ClusterGrid at 0x148dc2690>
The {row,col}_linkage_func
parameter allows us to customize how
the columns/rows of these heatmaps are ordered.
Here we match the axes across all three plots.
Next, we use the CLI to select just the metagenotype (the observed data) from this simulated world, and dumping it to a TSV.
!sfacts dump sim.world.nc --metagenotype sim.mgen.tsv
!head sim.mgen.tsv | column -t
sample position allele metagenotype 0 0 alt 0.0 0 0 ref 10.0 0 1 alt 10.0 0 1 ref 0.0 0 2 alt 10.0 0 2 ref 0.0 0 3 alt 10.0 0 3 ref 0.0 0 4 alt 10.0
When working with "real world" metagenotypes, they should be formatted to match this example TSV file.
This is a convenient format for importing data from other tools. We'll convert this TSV back to the native StrainFacts format.
!sfacts load --metagenotype sim.mgen.tsv sim.mgen.nc
This metagenotype file will be filtered and used for inference in the next example.