Software
Below is a list of software I recommend for different tasks.
Here is a nice short article by Ignacy Misztal on software in animal breeding: http://nce.ads.uga.edu/~ignacy/numpub/oldpapers/wc94.PDF
One of the main tasks for animal breeders is to estimate the (co)variance components in the animal (mixed) model being used. These are required to plan breeding programs and run evaluations (calculate EBVs/EPDs/PTAs). We are also interested in population parameters such as heritability because they help predict the accuracy of EBVs.
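To make the heritability point concrete, here is a tiny sketch using the standard textbook accuracy formulas (not tied to any of the software below); the numbers are made up.

```python
import numpy as np

# Standard textbook relationships: the accuracy of an EBV from a single own
# record is sqrt(h2), and from the mean of n progeny it is
# sqrt(n / (n + (4 - h2) / h2)). Values below are made up for illustration.
def acc_own_record(h2):
    return np.sqrt(h2)

def acc_progeny_test(h2, n):
    k = (4.0 - h2) / h2
    return np.sqrt(n / (n + k))

print(acc_own_record(0.25))        # 0.5
print(acc_progeny_test(0.25, 50))  # ~0.88; more progeny -> higher accuracy
```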
BLUPF90
BLUPF90 was developed by Ignacy Misztal at UGA many years ago and has since been developed by a small team including Ignacio Aguilar, Daniela Lourenco, Andres Legarra, Yutaka Masuda, and others.
Home: Link
Download: Link
Manual: Link
Ignacy Idea: Link
Ignacy PDF 1997: Link
REML Notes: Link
Genomic: Link
Computational Techniques Ignacy: Link
Legarra PDF Metafounders: Link
HELP: Link
Masuda
Masuda HTML Notes: Link
Masuda PDF Notes: Link
Masuda GitHub: Link
Masuda GitHub 2: Link
ASReml
ASReml was developed by Arthur Gilmour starting in 1996.
PAID
Note that this software comes with a user license.
Standalone: Link
R Version: Link
YouTube: Link
Manual 4.2: Link
Manual ASReml-4 v4: Link
Manual ASReml-R: Link
ASReml-R Download Guide: Link
ASReml Cookbook: Link
EchidnaMMS
Looks like Arthur started a new project to clone ASReml.
Home: Link
DMU
DMU was developed by Per Madsen and Just Jensen, both at Aarhus University in Denmark.
Home: Link
Download: Link
Paper: Link
User Guide: Link
GCTA
GCTA is a newer software package developed by the Yang lab. I’m not very familiar with it, but it appears to have been around since 2011. It can also be used for GWAS; see the GWAS section.
Home: Link
Genetic evaluations are easily one of the most important aspects of a breeding program. Evaluation software differs from other packages in its ability to handle very large datasets (e.g., the dairy industry has millions and millions of records in its national evaluation through CDCB), often using alternative solving methods that may not be theoretically pleasing for research projects. For instance, most of these programs do not compute direct inverses, so they cannot calculate PEV for individual accuracies/reliabilities (a toy sketch of this point follows below).
PAID
Note that almost all software with the ability to run large-scale evaluations requires a paid license.
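To illustrate the point above about iterative solvers and PEV, here is a toy sketch with a made-up 3x3 system standing in for the mixed model equations; none of the packages below work this way literally, it just shows why skipping the inverse means giving up PEV as a by-product.

```python
import numpy as np
from scipy.sparse.linalg import cg

# Made-up coefficient matrix and right-hand side, standing in for C s = rhs.
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
rhs = np.array([10.0, 6.0, 4.0])

# Research-style route: a direct inverse gives solutions AND prediction error
# variances (PEV = diagonal of the inverse, times the residual variance).
Cinv = np.linalg.inv(C)
sol_direct = Cinv @ rhs
pev = np.diag(Cinv)  # would be scaled by sigma_e^2 in a real analysis

# Large-scale evaluation route: an iterative solver (conjugate gradient here)
# returns the same solutions but never forms the inverse, so PEV is not a
# by-product and reliabilities must be approximated some other way.
sol_iter, info = cg(C, rhs)
print(np.allclose(sol_direct, sol_iter, atol=1e-4))
```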
MiXBLUP
MiXBLUP was developed at Wageningen by several researchers; currently Jan ten Napel and Jeremie Vandenplas maintain it (to my knowledge). It is built on top of Mix99 at its core. MiXBLUP tends to be more affordable than most other options (see below).
The main limitation is that it cannot be used for variance component estimation to my knowledge.
Home: Link
Download: Link
License: Link
Abstract: Link
BLUPF90
Mentioned above, but it is one of the main programs used in the USA (among other countries) by breeding companies to run evaluations. With a license fee, it can run iteration on data for large evaluations.
See all of the links for BLUPF90 in the Variance Components tab.
Home: Link
BOLT
Originally developed by Dorian Garrick, Daniel Garrick, and Bruce Golden; today Dorian’s son, Daniel Garrick, runs and maintains this software suite.
Home: Link
Mix99
Developed in Finland and the base for MiXBLUP (above).
Home: Link
Slides: Link
PEST
DEPRECATED
Paper: Link
Maintaining adequate levels of genetic variation in a population is critical to the long-term survival of that breed or line.
Few software packages are still around today to deal with this optimization problem.
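As a rough illustration of the optimization problem, here is a minimal sketch of the optimal-contribution idea with made-up EBVs and relationships; it is not how Matesel, AlphaMate, EVA, or optiSel are actually implemented.

```python
import numpy as np
from scipy.optimize import minimize

# Pick contributions c that maximize genetic merit c'EBV while capping the
# group coancestry c'Ac/2. All numbers are invented for illustration.
ebv = np.array([1.2, 0.8, 0.5, 0.3])          # candidate EBVs
A = np.array([[1.0, 0.5, 0.0, 0.0],           # numerator relationship matrix
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.25],
              [0.0, 0.0, 0.25, 1.0]])
max_coancestry = 0.30

res = minimize(
    lambda c: -c @ ebv,                        # maximize merit
    x0=np.full(4, 0.25),
    bounds=[(0.0, 1.0)] * 4,
    constraints=[
        {"type": "eq",   "fun": lambda c: c.sum() - 1.0},
        {"type": "ineq", "fun": lambda c: max_coancestry - 0.5 * c @ A @ c},
    ],
    method="SLSQP",
)
print(np.round(res.x, 3))  # optimal contribution per candidate
```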
Matesel
Matesel was developed in Australia by Brian Kinghorn and his son.
This software is likely the most utilized by industry today because of its capabilities.
Home: Link
AlphaMate
Home: Link
EVA
Per Berg worked with Brian Kinghorn at one point and later developed the EVA software.
Home: Link
Paper: Link
optiSel R Package
Home: Link
Website: Link
Paper: Link
One of the main problems with genomic selection when it began was that we needed quality control and other processing of the genotypes before utilizing them in the evaluations. Just some of the QC needed would include (a small sketch follows the list):
- Call rates for animals (rows) and SNPs (columns)
- Minor allele frequency minimums (often 0.01 or 0.05)
- Correlations between pedigree-based and genomic inbreeding values
- Correlations between the off-diagonal elements of the A and G matrices
- Removing parent-offspring conflicts
- Removing duplicates or twins (high off-diag correlations)
- Many more
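Here is a toy sketch of the first few checks in the list above, with made-up genotypes and thresholds; real pipelines (calc_grm, preGSf90, PLINK) do far more than this.

```python
import numpy as np

# Genotypes coded 0/1/2 with missing as -1; rows = animals, columns = SNPs.
geno = np.array([[0, 1, 2, -1, 1],
                 [1, 1, 2,  0, 1],
                 [2, 0, 1,  1, -1],
                 [1, 2, 2,  1, 0]])

missing = geno < 0
snp_call_rate    = 1.0 - missing.mean(axis=0)   # per SNP (column)
animal_call_rate = 1.0 - missing.mean(axis=1)   # per animal (row)

# Minor allele frequency per SNP, ignoring missing calls.
p = (np.ma.masked_array(geno, mask=missing).mean(axis=0) / 2.0).filled(0.0)
maf = np.minimum(p, 1.0 - p)

keep_snps    = (snp_call_rate >= 0.90) & (maf >= 0.05)
keep_animals = animal_call_rate >= 0.90
geno_qc = geno[np.ix_(keep_animals, keep_snps)]
print(geno_qc.shape)   # animals and SNPs surviving these two checks
```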
calc_grm
calc_grm was developed within the MiXBLUP suite to process SNP chip genotypes. This software will do QC and calculate the A, G, and H matrices.
Home: Link
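For the G matrix itself, here is a hedged sketch of the common VanRaden (method 1) construction with made-up genotypes; calc_grm and preGSf90 offer many more options and scalings than this.

```python
import numpy as np

# Genotypes coded 0/1/2; rows = animals, columns = SNPs (made-up data).
M = np.array([[0, 1, 2, 1],
              [1, 1, 2, 0],
              [2, 0, 1, 1],
              [1, 2, 2, 1]], dtype=float)

p = M.mean(axis=0) / 2.0            # allele frequency per SNP
Z = M - 2.0 * p                     # center by twice the frequency
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))
print(np.round(G, 3))
```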
preGSf90
Very similar to calc_grm but developed within the BLUPF90 suite. It also works with postGSf90 to run GWAS.
Docs: Link
PLINK
Very popular software to process SNP panels.
Home: Link
Genome wide association studies (GWAS) are a way for researchers to determine which SNPs may contribute more than others to the genetic variance of a trait. Often they are looking for SNPs that may explain a large percentage of the genetic variance (e.g. 10%). There are both frequentist and Bayesian methodologies and countless ways to summarize the results.
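As a bare-bones frequentist example, here is a single-SNP regression scan on simulated data; it ignores fixed effects and relationships entirely, which the programs below do not.

```python
import numpy as np
from scipy import stats

# Simulate 200 animals and 50 SNPs; SNP 10 is the only true signal.
rng = np.random.default_rng(1)
n, m = 200, 50
geno = rng.integers(0, 3, size=(n, m)).astype(float)
y = 0.8 * geno[:, 10] + rng.normal(size=n)

# One simple regression per SNP; collect the p-values.
pvals = np.array([stats.linregress(geno[:, j], y).pvalue for j in range(m)])
print(int(pvals.argmin()))   # should usually recover SNP 10
```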
JWAS
Hao Cheng started this software with Rohan Fernando and Dorian Garrick while at Iowa State University as a PhD student. He then moved to UC Davis as an assistant professor and continued developing it.
Home: Link
GitHub: Link
Gensel
DEPRECATED
JWAS was originally based on this software, developed at Iowa State University by Dorian Garrick and Rohan Fernando. I do not think it is still maintained; please use JWAS.
Not available. See JWAS.
postGSf90
Developed within the BLUPF90 suite of programs. postGSf90 uses a test similar to the so-called EMMAX method, but in a computationally efficient way: SNP effects are backsolved from a GBLUP run and divided by their standard errors.
Docs: Link
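Here is a rough sketch of the backsolving step as I understand it (in the Strandén and Garrick style), with made-up genotypes and GEBVs; the standard-error piece that the real program uses involves the prediction error (co)variances and is only noted in a comment.

```python
import numpy as np

# Made-up genotypes (0/1/2) and pretend GBLUP animal solutions (GEBVs).
M = np.array([[0, 1, 2, 1, 0],
              [1, 1, 2, 0, 1],
              [2, 0, 1, 1, 2],
              [1, 2, 2, 1, 0]], dtype=float)
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p
k = 2.0 * np.sum(p * (1.0 - p))
G = Z @ Z.T / k + np.eye(4) * 0.01   # small ridge so G is invertible

gebv = np.array([0.9, 0.4, -0.2, 0.1])

# Backsolve SNP effects from the animal-level GEBVs: a = (1/k) Z' G^-1 gebv.
snp_eff = Z.T @ np.linalg.solve(G, gebv) / k
print(np.round(snp_eff, 3))
# The GWAS statistic would then be snp_eff / SE(snp_eff), where the SEs come
# from the prediction error (co)variances of the GEBVs; that part is omitted.
```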
BGLR
BGLR was developed by Gustavo de los Campos and Paulino Perez-Rodriguez at MSU. BGLR can do many of the Bayesian regressions for GWAS and genomic prediction.
GitHub: Github Link
2014 Paper: Paper Link
2022 Paper: Paper Link
GCTA
GCTA is a newer software package developed by the Yang lab. I’m not very familiar with it, but it appears to have been around since 2011. Also see the variance components section, as it can do both.
Home: Link
Imputation is the process of predicting missing genotype calls for SNPs, often from a lower density (e.g. a 10k SNP chip) to a higher density (e.g. a 60k SNP chip). However, it can also be used simply to impute missing values in a genotype matrix. Most programs first estimate the haplotypes in the population and then extrapolate from the haplotypes observed on the smaller chip to the larger chip.
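As a toy illustration of the haplotype idea (nothing like the HMM machinery Beagle or FImpute actually use), here is a naive sketch that fills in a low-density animal from a small reference haplotype library; all data are made up.

```python
import numpy as np
from itertools import combinations_with_replacement

# Four reference haplotypes over 8 "high-density" SNPs, and one target
# genotyped only at the shared low-density positions.
ref_haps = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
                     [1, 1, 0, 0, 1, 1, 0, 0],
                     [0, 0, 1, 1, 0, 0, 1, 1],
                     [1, 0, 0, 1, 0, 1, 1, 0]])
ld_idx = np.array([0, 3, 5])       # SNPs on the low-density chip
target_ld = np.array([1, 1, 1])    # observed 0/1/2 genotypes at those SNPs

# Pick the pair of reference haplotypes that best explains the observed
# low-density genotype, then fill in the rest from that pair.
best_pair, best_err = None, np.inf
for i, j in combinations_with_replacement(range(len(ref_haps)), 2):
    pred = ref_haps[i, ld_idx] + ref_haps[j, ld_idx]
    err = np.abs(pred - target_ld).sum()
    if err < best_err:
        best_pair, best_err = (i, j), err

imputed = ref_haps[best_pair[0]] + ref_haps[best_pair[1]]
print(imputed)   # high-density genotype filled in from the matching pair
```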
Beagle
Home: Link
AlphaImpute
Originally developed under John Hickey.
GitHub: Link
AlphaPeel
Originally developed under John Hickey.
GitHub: Link
FImpute
PAID
NOTE: FImpute is paid for commercial use.
Home: Link
Breed composition is the process of computing what percentage of each genetic line makes up each individual. Purebreds would ideally be 100% one breed, but this is often not the case due to pedigree mistakes over the years in breeding programs. These mistakes are unavoidable; most companies will admit to 2-5% pedigree errors in swine, and the rate can be much higher in other species.
These estimates are also good to fit in the crossbred models used in CCPS, I believe.
My personal experience showed that the last method, regression on the allele frequencies, works very well if the lines are well known and you have good allele frequency estimates. Admixture allows you to fix the lines, but I didn’t see any advantage to that.
Structure
Home: Link
Admixture
Home: Link
Allele Freq Method
This method is very simple: you do a normal or constrained regression of the genotypes (divided by 2) on the allele frequencies of each line and see how the regression fits each line. The coefficients tell you what percentage each line contributes (a rough sketch follows the links below).
Normal Regression:
Kuehn Paper: Link
Constrained Regression:
Funkhouser Paper: Link
Scott’s Github: Link
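Here is a rough sketch of the allele-frequency method with made-up line frequencies; I use non-negative least squares as a simple stand-in for the constrained regression, which is not necessarily how the papers above implement it.

```python
import numpy as np
from scipy.optimize import nnls

# Made-up allele frequencies for three lines at five SNPs.
line_freq = np.array([[0.90, 0.10, 0.20],      # rows = SNPs, cols = lines
                      [0.10, 0.85, 0.30],
                      [0.50, 0.20, 0.95],
                      [0.80, 0.15, 0.10],
                      [0.20, 0.90, 0.40]])
geno = np.array([2, 1, 1, 2, 1]) / 2.0         # one animal, 0/1/2 divided by 2

coef, _ = nnls(line_freq, geno)                # non-negative least squares
composition = coef / coef.sum()                # rescale to sum to 1
print(np.round(composition, 2))                # estimated proportion per line
```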
In the past, many questions were answered with quantitative genetics and animal breeding theory using deterministic equations. This was very useful and powerful; however, there are many assumptions and limitations to this work that may or may not represent a real-world breeding program. For that we need simulation to mimic real breeding programs in terms of structure, selection, matings, and evaluations.
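As a tiny contrast between the two approaches, here is a sketch comparing the deterministic breeder's equation (R = h²S) with a stochastic version of the same single generation of truncation selection; all numbers are made up.

```python
import numpy as np

# One generation of truncation selection on phenotype.
rng = np.random.default_rng(7)
n, h2 = 5000, 0.3
top = 0.20                                     # select the best 20%

a = rng.normal(0.0, np.sqrt(h2), n)            # true breeding values
p = a + rng.normal(0.0, np.sqrt(1.0 - h2), n)  # phenotypes (variance = 1)

sel = p >= np.quantile(p, 1.0 - top)
S = p[sel].mean() - p.mean()                   # selection differential
print("deterministic response:", h2 * S)       # breeder's equation
print("stochastic response:   ", a[sel].mean() - a.mean())
```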
AlphaSimR
AlphaSimR is from the Alpha suite of tools developed by John Hickey’s group at Roslin. It had a head start on MoBPS. Chris Gaynor is still developing this software, now at Bayer I believe.
GitHub: Link
Paper: Link
MoBPS
MoBPS was written in R, starting in Henner’s lab; Torsten Pook did a lot of the work.
It is still very much in development, now at Wageningen, where Dr. Pook continues to make a lot of improvements.
GitHub: GitHub Link
Paper: Paper Link
QMSim
QMSim was developed by Mehdi Sargolzaei as an affiliate at Guelph.
NOTE: This software is still very good, but it is fixed, meaning you cannot really change the breeding program much beyond what is already programmed. Most people have now turned to AlphaSimR or MoBPS.
Home: Link
ADAM
Paper: Link
Selection index, also known as economic selection index, is the process we use to combine the EBVs for many traits into a single index. We need to understand the accuracy of such an index as well as the weights for the selection criteria, after calculating the economic values for each trait.
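Here is a textbook-style sketch of the index calculation with made-up (co)variances and economic values, just to show the pieces these programs work with; it is not taken from SelAction.

```python
import numpy as np

# Two traits measured on the candidate, both in the breeding objective.
P = np.array([[1.00, 0.30],      # phenotypic (co)variances of index sources
              [0.30, 0.80]])
G = np.array([[0.30, 0.10],      # cov(index sources, objective traits)
              [0.10, 0.25]])
C = np.array([[0.30, 0.10],      # genetic (co)variances of objective traits
              [0.10, 0.25]])
v = np.array([20.0, 35.0])       # economic values per unit of each trait

b = np.linalg.solve(P, G @ v)                  # index weights: b = P^-1 G v
sigma_I = np.sqrt(b @ P @ b)                   # sd of the index
sigma_H = np.sqrt(v @ C @ v)                   # sd of the aggregate genotype
print("index weights:", np.round(b, 2))
print("accuracy r_HI:", round(sigma_I / sigma_H, 3))
```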
SelAction
SelAction was developed at Wageningen by Rutten, Bijma, Woolliams, and van Arendonk. The goal of the software is to help design breeding programs and calculate things such as the accuracy of your index and other parameters related to selection indexes.
Paper: Link
SelAction 2
Jack Dekkers and some postdocs have worked on this software. It is still in development, but hopefully done before too long.
Conference Abstract: Link
Here is a set of miscellaneous programs.
OpenMendel
Julia implementation of statistical genetics analysis.
Home: Home Link
Here are my thoughts on ABG software in general, not specific to any one software or software suite:
- Most animal breeding software is bush league because almost none of us are trained in computer science; most of us are self-taught (e.g. I have never taken a programming class). The one exception I know of is Ignacy Misztal, whom Dan Gianola convinced to come to Illinois (I believe) to program the threshold models Dan worked on in the 1980s.
- Many projects are free to use but not maintained, as they may have been a small part of someone’s PhD research or something. The authors then have little or no incentive to maintain them or show others how to use them. I’m guilty of this; I mostly share on GitHub to allow others to look at my code and nothing more.
- You are not paying; therefore, they have no responsibility (or feel none) to explain how to use their software, except to commercial entities who are paying for it.
- Some are just lazy af…
- Often we start programming with little to no planning, unlike software companies, and then build on top of it, leading to 90% spaghetti code, instead of planning for the future features we will need. Genomics was an exception: when most software was started, there was no way to predict that this technology would arrive or how it would be implemented.
- Many simply don’t know how to write documentation correctly and haven’t studied it. You can find different types of documentation online if you search (e.g. a function reference vs a ‘cookbook’ style); there are many articles on this out there to read up on.
- Out of habit, almost everyone I see write code doesn’t even write comments for themselves later, and there is no way to tell what the code does by looking at it (just go to GitHub and start looking...). This is especially bad as you get into lower-level languages. It makes it impossible to contribute in an open-source way, since no one knows what anything does without intricate knowledge of what you are doing. Higher-level languages are much easier to follow but can still be tricky without comments.
- Many projects were started long ago, when Windows dominated, versus today when macOS and Linux are extremely popular (especially in academia). Windows is still common in companies, and IT loves it for some reason (because we all need printers in 2024, I guess).
- Testing and stress-testing software is not done. The only exception would be CRAN, which forces users to make sure their software runs on multiple systems; this is a pain, but it keeps the CRAN network very robust.
- Most of those developing software have a very strong conflict of interest (COI): they don’t want to actually teach others how the algorithms are implemented because this is their internal competitive edge (I know firsthand stories of this happening). So many times it’s difficult or impossible to know how to speed up routines to get them to process at industry speeds.