For real-world metagenotype data, positions and samples need to be filtered in order to select
!sfacts filter_mgen \
--verbose \
--min-minor-allele-freq 0.05 \
--min-horizontal-cvrg 0.15 \
--random-seed 0 \
sim.mgen.nc sim.filt.mgen.nc
usage: sfacts [-h] [--version] COMMAND ... sfacts: error: unrecognized arguments: --random-seed sim.filt.mgen.nc
!sfacts data_info sim.mgen.nc sim.filt.mgen.nc
sim.mgen.nc 50 1000 2 sim.filt.mgen.nc 50 999 2
For this simple simulation this filtering had nearly no effect.
Next we'll subset our data into two non-overlapping blocks. In real data we might want to do this for one of two reasons:
!sfacts sample_mgen --verbose \
--num-positions 500 \
--block-number 0 \
--random-seed 0 \
sim.filt.mgen.nc sim.filt.ss-0.mgen.nc
!sfacts sample_mgen --verbose \
--num-positions 500 \
--block-number 1 \
--random-seed 0 \
sim.filt.mgen.nc sim.filt.ss-1.mgen.nc
2023-05-19 15:19:16,949 Shuffling metagenotype positions using random seed 0. /Users/byronsmith/anaconda3/envs/StrainFacts-dev2/lib/python3.11/site-packages/xarray/core/computation.py:760: RuntimeWarning: divide by zero encountered in log2 result_data = func(*input_data) 2023-05-19 15:19:16,968 Extraction positions [0, 500). 2023-05-19 15:19:20,677 Shuffling metagenotype positions using random seed 0. /Users/byronsmith/anaconda3/envs/StrainFacts-dev2/lib/python3.11/site-packages/xarray/core/computation.py:760: RuntimeWarning: divide by zero encountered in log2 result_data = func(*input_data) 2023-05-19 15:19:20,695 Extraction positions [500, 999).
Next we'll fit our model to these metagenotypes.