Schumer lab: Important simulation tools

From OpenWetWare
Jump to navigationJump to search

Coalescent simulations with macs

Introduction to macs

macs uses the same command line format at ms but is much faster due to how it tracks recombination.

A useful guide to command line in ms can be found here: File:Msdoc.pdf

For example, to simulate divergence between birchmanni and malinche (assuming the ancestral theta equals birchmanni present-day theta), I would use the following command:

/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -t 0.001 -r 0.0006 -ej 2 2 1

This simulates a 1 Mb sequence with theta of 0.001 per bp and rho of 0.0006 per bp. The time at which lineage 1 and 2 join is 2 (in units of 4 Ne) generations ago.

This outputs data in macs format, but to convert it to ms format you can pipe the results to the msformatter program:

/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -t 0.001 -r 0.0006 -ej 2 2 1 | /home/groups/schumer/shared_bin/msformatter

Note: to use msformatter on Sherlock you must first load boost

module load boost

Using macs to simulate sequences

macs output in certain formats can be convert nucleotide sequences using the program seq-gen. To convert the above command to nucleotide sequences using seq-gen, simply add the -T command to decode the output as phylogenetic trees. Then these trees can be used as input for seq-gen:


/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -T -t 0.001 -r 0.0006 -ej 2 2 1 | /home/groups/schumer/shared_bin/msformatter | grep '\[' > macs_treefile


count the number of trees generated by the simulation:

wc -l macs_treefile

then substitute this value for [PARTITIONS] in the below command line:

/home/groups/schumer/shared_bin/seq-gen -mHKY -l 1000000 -p [PARTITIONS] -s 0.001 -a 0.597 -f 0.305566 0.194762 0.194557 0.305115 <macs_treefile >macs_seqfile.phy

Note: the simulation here uses -s 0.001 because that is the theta simulated above, and also uses particular base composition and transition/transversion ratios. Make sure to adapt this to your simulations

seq-gen also has a useful option of simulating sequences given an "ancestral" sequence, which can be useful in many scenarios. See the seq-gen documentation here

Simulate reads from sequences

You can also simulate illumina reads from these [or any other sequences] once converted to fasta format:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/Phylip2Fasta.pl macs_seqfile.phy macs_seqfile.fa

For example, simulate 1000 paired-end 100 bp reads (see wgsim documentation here for options of SNP, indel, and error rates):

/home/groups/schumer/shared_bin/wgsim -N1000 -1100 -2100 -e0.01 -r0.0006 -R1 macs_seqfile.fa myread1.fq myread2.fq

Admixture simulations with admix'em

Admix'em is a very useful and flexible admixture simulator but is a lot slower than coalescent simulators. Admix'em can simulate arbitrary selection scenarios but is somewhat limited in its ability to simulate complex demographic scenarios. For complete documentation of Admix'em see here

Here I outline some common uses of admix'em in the lab and associated scripts.

Generate input marker and recombination files

There are two methods for generating the recombination and marker input files for admix'em:

1) Run admix'em interactively and follow the prompts to generate the desired number of markers and candidate sites for recombination events:

admixemp

Select Option 1 and follow the prompts


2) Use lab scripts to generate markers and recombination rates defined by your data:

2a) Generate markers based on a list of markers (e.g. from your AncestryHMM run):

perl convert_markerlist_to_admixem_markers.pl markers_list_one_per_line chromosome_lengths_file

The script expects that the markers file will be in the format chromosome:site and that there will be one marker per line. The chromosome lengths file should also have one entry per line with the lengths in order of the chromosomes in the marker input file

2b) Generate recombination file based on LDhelmet rates:

Rscript LDres_to_admixem_rec_map.R LDresmod_file out_file_name chr_length_bp_for_simulation chr_number_for_simulation

Run this script once per chromosome you'd like to simulate and combine the files, e.g:

cat group1_rec.txt group2_rec.txt group3_rec.txt > group1-3_recrates_sim.txt

res files for the X. birchmanni 10x map can be found on Sherlock in this directory and have the extension .post.txt_mod:

/home/groups/schumer/shared_bin/shared_resources/Xbirchmanni_LD_recombination_map_10xgenome_March2019/block_penalty_50_version


Example admix'em marker files, configuration files, and selection files can be found on Sherlock here:

/home/groups/schumer/shared_bin/admixem_example_files

Generate sites under selection

In many cases it will be useful to randomly place sites under selection instead of doing so manually in the text files.

This script generates selection configuration files for admix’em that simulate neutral randomly placed DMIs:

Rscript generate_neutral_DMI_random_unlinked.R number_DMI_pairs selection_coefficient dominance_coefficient

Run admix'em

To run admix'em fill out all sections of the configuration file relevant to your simulation (again see documentation) and then enter:


admixemp myconfig.cfg

Post process simulations

To convert admix'em output into a format similar to ancestry tsv files you can run the following script:

1) Convert the admix'em markers to a header for your ancestry tsv file:

perl convert_admixem_markers_to_header.pl admixem_markers.txt admixed_sim_header.txt

2) Generate ancestry tsv files for your simulation using this header and the admixem results files of interest. These can be find inside your simulation folder and will have the file extension: _markers.txt

perl generate_msg_data_header_locisim.pl GenX_markers.txt outfile_name admixed_sim_header.txt

This will generate a genotypes format file that can be used with other common scripts

3) If needed for downstream applications, you can convert this genotypes file to ancestry tsv files:

perl genotypes_to_ancestrytsv.pl genotypes_file

Writes output files called infile_ancestry_par1.tsv and infile_ancestry_par2.tsv

Complete examples of workflows

Complete examples of these workflows including configuration files, selection files, and wrapper scripts to run the simulations and processing can be found here:


/home/groups/schumer/shared_bin/admixem_example_files

The two complete examples include a neutral simulation of early generation hybrids:

/home/groups/schumer/shared_bin/admixem_example_files/F2_simulations


and a simulation of selection agains hybrid incompatibilities in F2 hybrids:

/home/groups/schumer/shared_bin/admixem_example_files/simulate_selection_against_DMI


Simulations of one-locus selection are fairly straightforward and more details can be found in the admixem wiki here

There is a document in box that has a cheat sheet for simulating two-locus selection: convert_DMIs_to_admixem_code.xlsx


There is also an example here of a collection of scripts and admix'em files for generating and tracking pedigree information:

/home/groups/schumer/shared_bin/admixem_example_files/Pedigree_scripts_config

To run these admix'em simulations, simply run the shell script:

perl simulated_ancestry_pedigree_neutral.pl admixsimul_complex.cfg admix_simulation_complex [NUMBER OF SIMULATIONS] focal_genes_indivs header_markers_group1-24 pedigree-sims


There is also an example of a collection of scripts and admix'em files for generating time series data with selection:

/home/groups/schumer/shared_bin/admixem_example_files/time_series_admixem_scripts

To run these admix'em simulations, simply run the shell script:


perl simulated_ancestry_timeseries.pl cfg_file out_folder_name num_iterations generations_file indivs_file results_file_tag markers_file_header selection_coefficient

example:

perl simulated_ancestry_timeseries.pl admixsimul_test.cfg admix_simulation_complex 2 gens_file indivs_file results_sim_gen80 header_bir10x_group1-3 0.05

Simulations with SELAM and Macs integrated into hybrid genome and read simulator

We've developed a simulator to simulate hybrid genomic data with or without selection. This pipeline has a number of uses:

  • Evaluate the accuracy of local ancestry inference under a range of demographic scenarios (in hybrids, parentals or both)
  • Evaluate the impact of different data collection and analysis choices - number of reads, length of reads, error rates, number of parental haplotypes collected for defining ancestry informative markers, among others
  • Generate hybrid genomes and reads with or with out selection for other applications (proof of principle detecting selection, inferring hybrid recombination maps, etc)

On Sherlock the simulator folder can be found here:

/home/groups/schumer/shared_bin/Simulate_hybrid_genomes


1) Copy the configuration file:

cp /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/hybrid_simulation_configuration.cfg hybrid_simulation_configuration_mix0.75_55gen.cfg


2) Edit the configuration file to input simulation-specific parameters using a text editor. See more information on the GitHub about each configurable parameter [here]


3) To simulate hybrid population demography, include a SELAM parameter file in the configuration file (Note: input a maximum of two parental species). Examples can be found in the Simulate_hybrid_genomes folder or in the SELAM documentation File:SELAM_manual.pdf


4) Set up a slurm script with dependencies and the simulation command:

#SBATCH --ntasks=1

#SBATCH --mem=32000

#SBATCH --cpus-per-task=1

#SBATCH -p schumer

#SBATCH --time=04:00:00

#SBATCH --job-name=sim-hybrid-genomes

#SBATCH --mail-user=youremail@stanford.edu


module load biology py-biopython

module load R

module load perl

module load gsl

module load boost

export PATH="/home/groups/schumer/shared_bin/lib:$PATH"

export PATH="/home/groups/schumer/shared_bin:$PATH"

export LD_LIBRARY_PATH="/home/groups/schumer/shared_bin/lib:$LD_LIBRARY_PATH"


perl simulate_admixed_genomes_v3.pl hybrid_simulation_configuration_mix0.75_55gen.cfg


5) Submit the job, e.g.:

sbatch simulate_job_submit_sherlock_v3.sh


Output files and next steps

The output files of the simulator will be the following:

parent 1 simulated genome:

macs_simulated_parent1.fa

parent 2 simulated genome:

macs_simulated_parent2.fa

read list file, e.g.:

simulated_hybrids_readlist_gen55_prop_par1_0.75

reads folder (including bed files with true ancestry), e.g.:

simulated_hybrids_reads_gen55_prop_par1_0.75

and simulated AIMs and counts:

simulated_AIMs_for_AncestryHMM

simulated_parental_counts_for_AncestryHMM


These files can be used as input to the AncestryHMM pipeline.


After the AncestryHMM pipeline finishes running, you can summarize accuracy relative to the true ancestry by running the following script:

perl /groups/schumer/shared_bin/Simulate_hybrid_genomes/post_hmm_accuracy_shell.pl ancestry-probs1-file ancestry-probs2-file simulation_folder_name path_to_simulator_install

This will generate a file called results_simulation_folder_name which summarizes accuracy by individual, tract, and ancestry type'

example:

perl /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/post_hmm_accuracy_shell.pl ancestry-probs-par1_allchrs.tsv ancestry-probs-par2_allchrs.tsv simulated_hybrids_reads_gen45_prop_par1_0.75 /home/groups/schumer/shared_bin/Simulate_hybrid_genomes

Simulating Admixture and Selection with SLiM

Installation

The SLiM binary is available in shared_bin but it depends on users having the following modules/python packages:

module load gsl

module load python/3.6.1

pip install tskit --user

pip install msprime --user

pip install pyslim --user


SLiM Simulations

SLiM is a very flexible program for simulating population genetics and evolution, including admixture with and without selection on hybrid genotypes. SLiM can be downloaded here, with full documentation found here. Using the TreeSequence tracking feature of SLiM is ideal for admixture simulations, as the output contains the exact ancestry information of all loci.

Using LDHelmet Recombination Maps with SLiM

Detailed recombination maps can be specified in SLiM using the function initializeRecombinationRate(), which takes as input 1) a vector of per-base, per-generation recombination rates and 2) a sorted vector of corresponding endpoints for those rates. For detailed examples of importing an outside recombination map, see Example 6.1.2 in the SLiM Manual. LDHelmet may report suspiciously high rates (>0.5), which can be normalized by taking of the average neighboring intervals. The output of LDHelmet is in population-scaled recombination rates; to transform this to per-base rates, take the empirical crossovers detected in 139 F2s:

/home/groups/schumer/shared_bin/shared_resources/Xbirchmanni_Xmalinche_F2_map_March2019/all_crossovers_true_F2_5Mb_windows

Calculate the total per-generation rate for the chromosome of interest, then scale the LDHelmet values such that their sum (weighted by the segment length) is this value.

Using Genome Annotations with SLiM

SLiM defines different genomic regions that vary in their potential mutation types (e.g. deleterious mutations allowed in exons but not noncoding regions). A file with three tab-separated columns defining 1) Chromosome 2) Coding Region Start and 3) Coding Region End can be found at:

/home/groups/schumer/shared_bin/shared_resources/xiphophorus_birchmanni_10x_12Sep2018_yDAA6_coding_nonoverlapping.tsv

Tracking Ancestry Information with SLiM

To use ancestry tracking in SLiM, include the following in the code block: initialize () { initializeTreeSeq(); }

For the generation to be treated as the founder generation, include the command: <generation #> late() { ... sim.treeSeqRememberIndividuals(sim.subpopulations.individuals); ... }

For the generation whose ancestry is to be calculated, include the command: <generation #> late() { ... sim.treeSeqOutput("/path/to/output.trees"); ... }

Simulations for ABC demographic inference with SLiM

For demographic inference of timing of admixture, proportion of admixture, admixed population size, and migration from each parent population we have a slim script.

Usage:

slim -d SEED=1 -d POPSIZE=2000 -d GEN=120 -d INIT_PROP=0.75 -d PAR1MIG=4.5e-06 -d PAR2MIG=0.002 /home/groups/schumer/shared_bin/Lab_shared_scripts/neutral_admixture_ABC_migration.slim

Parsing SLiM TreeSeq Output

The output of SLiM TreeSeq is in tskit format, described here, and parsing requires the pyslim tools described here. Use this script to convert gene trees to ancestry tract data for a subset of individuals:

/home/groups/schumer/shared_bin/Lab_shared_scripts/slim_genetree_to_indiv_ancestries.py

Usage is: /path/to/slim_genetree_to_indiv_ancestries.py input_file.trees number_of_samples output_file.tsv

The output has columns 1) Interval Start 2) Interval End 3) Individual ID and 4) Ancestry (0 = homozygous pop 1, 1 = heterozygous, 2 = homozygous pop 2).


Simulating Random DMIs in SLiM

Dozhansky-Muller Incompatibilities (DMIs) can be simulated in SLiM by the following general workflow:

  1. Set up random neutral (or adaptive) mutations appearing in the desired parts of the genome
  2. Use a mutation() callback to randomly select a subset of these new mutations to have a negative interaction with a preexisting mutation
  3. Use a fitness() callback to check whether each individual is carrying incompatible alleles, and if so adjust fitness downwards according to a provided fitness matrix

By tailoring the mutation() callback to particular populations and time periods, you can set DMIs to appear between particular branches of a tree. For an implementation of these ideas, see the Box/Schumer_lab_resources/Computational_simulation_resources/Simulation_repo/stochastic_introgressed_BDMI_config_v8.slim


Forcing F1 mating between Populations in SLiM

Creating F1 crosses in SLiM is surprisingly tricky, but Ben arrived at the following solution:

  1. Create a mutation type to be an "ancestry marker", and deterministically give this mutation to every individual in one of the two populations
  2. Set up a new population which is populated by 50% migration from each of the two populations to be crossed
  3. Set up a matechoice() callback in this population that checks how many alleles of the ancestry marker each individual has. Then set the probability of individuals with 0 alleles mating with other individuals with 0 alleles, and of individuals with 2 alleles mating with those with 2 alleles, to 0.0, and otherwise leaving mate choice probabilities unaffected.

For an implementation of these ideas, see the Box/Schumer_lab_resources/Computational_simulation_resources/Simulation_repo/stochastic_introgressed_BDMI_config_v8.slim