Metagenotype Filtering¶

For real-world metagenotype data, positions and samples need to be filtered in order to select

  • sufficiently polymorphic positions (minimum minor allele frequency)
  • samples with sufficient coverage (minimum fraction of sites with counts)
In [1]:
!sfacts filter_mgen \
    --verbose \
    --min-minor-allele-freq 0.05 \
    --min-horizontal-cvrg 0.15 \
    --random-seed 0 \
    sim.mgen.nc sim.filt.mgen.nc
usage: sfacts [-h] [--version] COMMAND ...
sfacts: error: unrecognized arguments: --random-seed sim.filt.mgen.nc
In [2]:
!sfacts data_info sim.mgen.nc sim.filt.mgen.nc
sim.mgen.nc	50	1000		2
sim.filt.mgen.nc	50	999		2

For this simple simulation this filtering had nearly no effect.

Next we'll subset our data into two non-overlapping blocks. In real data we might want to do this for one of two reasons:

  • To reduce the computational burden of fitting the data
  • For cross-validation to evaluate the accuracy of our fit
In [3]:
!sfacts sample_mgen --verbose \
    --num-positions 500 \
    --block-number 0 \
    --random-seed 0 \
    sim.filt.mgen.nc sim.filt.ss-0.mgen.nc

!sfacts sample_mgen --verbose \
    --num-positions 500 \
    --block-number 1 \
    --random-seed 0 \
    sim.filt.mgen.nc sim.filt.ss-1.mgen.nc
2023-05-19 15:19:16,949 Shuffling metagenotype positions using random seed 0.
/Users/byronsmith/anaconda3/envs/StrainFacts-dev2/lib/python3.11/site-packages/xarray/core/computation.py:760: RuntimeWarning: divide by zero encountered in log2
  result_data = func(*input_data)
2023-05-19 15:19:16,968 Extraction positions [0, 500).
2023-05-19 15:19:20,677 Shuffling metagenotype positions using random seed 0.
/Users/byronsmith/anaconda3/envs/StrainFacts-dev2/lib/python3.11/site-packages/xarray/core/computation.py:760: RuntimeWarning: divide by zero encountered in log2
  result_data = func(*input_data)
2023-05-19 15:19:20,695 Extraction positions [500, 999).

Next we'll fit our model to these metagenotypes.