Now the fun part: fitting the StrainFacts model to these data.
Let's take a look at the details of the default StrainFacts model.
!sfacts model_info default
Defined in: sfacts.model_zoo.model5 Summary: Metagenotype model intended for inference, designed for simplicity and speed. Just like model4 but with uniform sequencing error. Default Hyperparameters: {'gamma_hyper': 1e-10, 'rho_hyper': 0.2, 'rho_hyper2': 1.0, 'pi_hyper': 0.5, 'pi_hyper2': 0.9, 'mu_hyper_mean': 1.0, 'mu_hyper_scale': 1.0, 'm_hyper_concentration': 1.0, 'epsilon_hyper_mode': 0.01, 'epsilon_hyper_spread': 1.5}
We'll leave all of the hyperparameters set to their default values for this model. In addition, we explicitly fit 15 strains (5 more than the simulation actually had), and we set a random seed for reproducibility.
!sfacts fit \
--verbose \
--num-strains 15 \
--random-seed 0 \
sim.filt.ss-0.mgen.nc sim.filt.ss-0.fit.world.nc
2023-05-19 15:25:49,962 START: Fitting 15 strains with data shape Frozen({'sample': 50, 'position': 500, 'allele': 2}). 0%|| 0/1000000 [00:00<?, ?it/s]/Users/byronsmith/Projects/StrainFacts/sfacts/model_zoo/components.py:310: UserWarning: Using LogTriangle as an approximation for random sampling. This is probably a bad idea. warnings.warn( 1%|| 7399/1000000 [00:42<1:35:32, 173.15it/s, NLP=+1.11e+05, delta=+0.00e+00, lag100=+0.00e+00, lr=+7.63e-07] 2023-05-19 15:26:32,715 Converged: NLP=1.11292e+05 /Users/byronsmith/Projects/StrainFacts/sfacts/model_zoo/components.py:310: UserWarning: Using LogTriangle as an approximation for random sampling. This is probably a bad idea. warnings.warn( 2023-05-19 15:26:32,734 END: Fitting 15 strains with data shape Frozen({'sample': 50, 'position': 500, 'allele': 2}). (took 43 seconds) 2023-05-19 15:26:32,742 Average metagenotype error: 0.01823714681497938
A model of this size on this dataset should fit relatively quickly (<2 minutes on my computer).
When run on the command-line, several pieces of information are printed to the screen,
thanks to the --verbose
flag.
The result of this fit is a "world" with a point estimate for all of the parameters.
Let's load this into Python and plot the inferred genotypes and relative abundances.
import sfacts as sf
fit = sf.data.World.load('sim.filt.ss-0.fit.world.nc')
# Plot inferred relative abundances for each sample (the "community").
sf.plot.plot_community(
fit,
col_linkage_func=lambda w: w.metagenotype.linkage('sample'),
row_linkage_func=lambda w: w.genotype.linkage('strain'),
)
# Plot the inferred genotype of the 10 simulated strains.
sf.plot.plot_genotype(
fit,
row_linkage_func=lambda w: w.genotype.linkage("strain"),
col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
<seaborn.matrix.ClusterGrid at 0x149b81a10>
Finally, we'll dump the relative abundance and genotype inferences out to TSV files, which could then be used by downstream tools.
Note that the genotypes for each strain in each position are encoded as a float, where 0.0 means entirely reference and 1.0 means entirely alternative allele.
!sfacts dump sim.filt.ss-0.fit.world.nc \
--genotype sim.filt.ss-0.fit.geno.tsv \
--community sim.filt.ss-0.fit.comm.tsv
!head sim.filt.ss-0.fit.geno.tsv sim.filt.ss-0.fit.comm.tsv
==> sim.filt.ss-0.fit.geno.tsv <== strain position genotype 0 549 0.9594818 0 715 0.98840594 0 603 0.9872556 0 545 0.02570168 0 424 0.9771314 0 646 0.988557 0 438 0.011858768 0 891 0.02623163 0 963 0.9879804 ==> sim.filt.ss-0.fit.comm.tsv <== sample strain community 0 0 0.00020524139 0 1 0.0002019747 0 2 0.00012614406 0 3 0.99850655 0 4 2.3415639e-07 0 5 0.0002068418 0 6 0.00016549 0 7 0.00019602856 0 8 0.0001604072
In the next example, we'll see some more complicated ways to fit our data.