Statistical methods to improve understanding of the genetic basis of complex diseases

Robust statistical methods that utilise the vast amounts of genetic data now available are required to resolve the genetic aetiology of complex human diseases, including immune-mediated diseases. The first essential step in this process is the use of genome-wide association studies (GWAS) to identify regions of the genome that determine susceptibility to a given complex disease. The identified regions can then be fine-mapped with the aim of deducing the specific sequence variants that are causal for the disease of interest.

Functional genomic data is now routinely generated from high-throughput experiments. This data can reveal clues relating to disease biology, for example elucidating the functional genomic annotations that are enriched for disease-associated variants. In this thesis I describe a novel methodology based on the conditional false discovery rate (cFDR) that leverages functional genomic data with genetic association data to increase statistical power for GWAS discovery whilst controlling the FDR. I demonstrate the practical potential of my method through applications to asthma and type 1 diabetes (T1D) and validate my results using the larger, independent, UK Biobank data resource.
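As a rough illustration of the cFDR idea, and not the method developed in the thesis, the sketch below implements a simple empirical estimator of the kind commonly used in the cFDR literature: for thresholds p and q, cFDR(p | q) is approximated by p multiplied by the number of variants whose auxiliary value is at most q, divided by the number of variants whose principal p-value is at most p and whose auxiliary value is at most q. The function name, the toy data, and the assumption that the auxiliary covariate behaves like a p-value (smaller means more relevant) are all hypothetical.

```python
from typing import Sequence


def empirical_cfdr(p: Sequence[float], q: Sequence[float],
                   p_cut: float, q_cut: float) -> float:
    """Empirical estimate of cFDR(p_cut | q_cut).

    p: principal-trait GWAS p-values, one per variant.
    q: auxiliary values for the same variants (assumed here to be
       p-value-like, so that smaller values are more relevant).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Number of variants passing the auxiliary threshold.
    n_q = sum(1 for qi in q if qi <= q_cut)
    # Number of variants passing both thresholds.
    n_pq = sum(1 for pi, qi in zip(p, q) if pi <= p_cut and qi <= q_cut)
    if n_pq == 0:
        # No variants in the region: return the most conservative value.
        return 1.0
    return min(1.0, p_cut * n_q / n_pq)


# Toy data (hypothetical): four variants.
p_values = [0.001, 0.2, 0.03, 0.5]
aux_values = [0.01, 0.02, 0.9, 0.6]
print(empirical_cfdr(p_values, aux_values, p_cut=0.03, q_cut=0.05))
```

A small estimate suggests that a variant with a modest p-value may still be a plausible discovery once the auxiliary information is taken into account.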

Fine-mapping is used to derive credible sets of putative causal variants in associated regions from GWAS. I show that these sets are generally over-conservative because fine-mapping data sets are not randomly sampled, but are instead sampled from a subset of those with the largest effect sizes. I develop a method to derive credible sets that contain fewer variants whilst still containing the true causal variant with high probability. I use my method to improve the resolution of fine-mapping studies for T1D and ankylosing spondylitis. This enables a more efficient allocation of resources in the expensive functional follow-up studies that are used to elucidate the true causal variants from the prioritised sets of variants.
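For orientation, the conventional way of building a credible set can be sketched as follows: variants are sorted by their posterior probability of causality and added to the set until the cumulative probability reaches the target coverage. This is the standard construction only, not the corrected method developed in the thesis; the function name and the toy probabilities are hypothetical.

```python
from typing import List, Sequence


def credible_set(pips: Sequence[float], coverage: float = 0.95) -> List[int]:
    """Return indices of the smallest set of variants whose summed
    posterior probabilities reach `coverage` (greedy, largest first)."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Indices ordered from the most to the least probable variant.
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen: List[int] = []
    total = 0.0
    for i in order:
        chosen.append(i)
        total += pips[i]
        if total >= coverage:
            break
    return chosen


# Toy posterior probabilities (hypothetical) for five variants in one region.
posteriors = [0.05, 0.60, 0.25, 0.08, 0.02]
print(credible_set(posteriors, coverage=0.90))
```

If the posterior probabilities sum to less than the requested coverage, this sketch simply returns all variants.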

Whilst GWAS investigate genome-wide patterns of association, it is likely that studying a specific biological factor using a variety of data sources will give a more detailed perspective on disease pathogenesis. Taking a more holistic approach, I utilise a variety of genetic and functional genomic data in a range of statistical genetics techniques to try to decipher the role of the Ikaros family of transcription factors in T1D pathogenesis. I find that T1D-associated variants are enriched in Ikaros binding sites in immune-relevant cell types, but that there is no evidence of epistatic effects between causal variants residing in the Ikaros gene region and variants residing in genome-wide binding sites of Ikaros, suggesting that these sets of variants are not acting synergistically to influence T1D risk.

Together, in this thesis I develop and examine a range of statistical methods to aid understanding of the genetic basis of complex human diseases, with application specifically to immune-mediated diseases.

Colorado School of Public Health

MS in Biostatistics

This program emphasizes the applied and theoretical nature of biostatistics. In addition to courses on theory, statistical computing, consulting, analysis of clinical trials, and longitudinal and survival data, you'll be exposed to a wide variety of research areas including statistical genetics and genomics, causal inference, infectious disease, and cancer research. During the program, you’ll get involved in research with a faculty mentor as part of your thesis or research paper. You'll also have the opportunity to specialize in one of two minor areas within the MS—Statistical Genomics and Data Science Analytics. 

Quick facts, careers, and skills

When you graduate with an MS in Biostatistics, you’ll be ready for a career designing and analyzing clinical trials and public health studies.

Quick facts

  • Program location: CU Anschutz
  • Est. time to complete: 2 years
  • Credit hours: 36
  • Option: Minor in Statistical Genomics/Genetics or Data Science Analytics

Sample careers

  • Biostatistician
  • Data analyst
  • Data scientist
  • & more

Skills you'll gain

  • Data visualization
  • Collaborative research
  • Statistical programming
  • & more

This program will prepare you for in-depth study and research in statistics as it applies to healthcare and biological settings. You'll get a balance between theory, methods, and hands-on practical and research experience. Our required courses include applied and theoretical statistics, statistical computing, consulting, and advanced statistical modeling. Plus, you can choose elective coursework ranging from analysis of clinical trials to survival analysis to statistical ‘omics. You'll also complete a Master's research paper or thesis.

In addition, we offer two minor areas of specialization within the MS—Statistical Genomics and Data Science Analytics. We recommend planning out the minor in your first year to ensure timely graduation and availability of electives. Check out the Department of Biostatistics & Informatics FAQ page for more information about this program.

Examples of thesis titles and research conducted by past students can be found at the bottom of the department's students page.

Concentration core courses

Required public health courses

Thesis/research paper/project

Total credits: 36

Sample schedule

The following sample schedule is designed to help you plan your courses. The number of credits in a given semester, the order in which the required courses are taken, and the courses you take to meet the concentration requirements may vary. Please note that you cannot use the same course to fulfill more than one requirement. The MS in Biostatistics degree is designed to be completed in two academic years.

Year 1 sample schedule

Year 2 sample schedule

Competencies

Minor in Statistical Genomics & Genetics

We offer two minors (small, optional groupings of courses) that are designed to provide specialization within the MS degree on a certain topic. The Statistical Genomics & Genetics minor offers an official designation in a topic that has become very popular within the field of biostatistics and is a strength of our program, and it will help you with employment and other opportunities.

Eligibility

Requirements

  • Take an additional 3 credits of electives; 6 of the elective credits must be from a list of courses related to statistical genetics/genomics; at most 4 credits may be from an outside department.
  • Write a thesis or publishable paper with a focus on Statistical Genomics & Genetics.

The MS in Biostatistics degree is designed to be completed in two academic years. Although the minor requires three additional credits for the specialization, the degree can still be completed within two years assuming students take courses over the summer. See an example timeline below (individual cases and offerings every year will vary).

Accepted electives

1  At most 4 credits can be taken from an outside program

2  Only one programming course may be taken

Please confirm with the program directors about your schedule and minor electives.

Minor in Data Science Analytics

In response to a changing landscape of biomedical research that relies more and more on the generation, analysis, and interpretation of large data sets, we offer a minor in Data Science Analytics. Students pursuing this minor within their MS in Biostatistics will have an official designation that will help with employment and other opportunities upon graduation.

  • Take at least 9 credits of electives from a list of courses related to Data Science Analytics (see below). There are additional elective credits required in the minor compared to the original MS degree, so that there is opportunity to specialize in this area.
  • Write a thesis or publishable paper with a focus on Data Science Analytics.

The MS in Biostatistics degree is designed to be completed in two academic years. Although the minor requires additional credits for the specialization, the degree can still be completed within two years assuming students take courses over the summer.  See an example timeline below (individual cases and offerings every year will vary).

1   At least 3 BIOS credits should be taken

2  Only one lower-level (2000/3000) CSCI course can count towards the electives; such courses are used to meet prerequisites of the 5000 CSCI series if needed

3  Because of overlap, only MATH 6388 or BIOS 6645 can be counted as an elective

Your minor degree plan is subject to approval by the program director(s). Please check with them before enrolling in minor electives.

Support from the department

This program is designed to be completed in two years by full-time students. Many MS students are supported through research or teaching assistantship positions. These positions are competitive and offer opportunities for training and experience in research, teaching, and practice in biostatistics. During the past several years, more than half of MS students were supported in some way upon entry to the program, and all students were supported in research positions by the end of their first summer. Through these positions, students gain real-world research and collaborative experience on a very large and active health sciences campus. This practical experience is a strength of our program.

Colorado School of Public Health

CU Anschutz

Fitzsimons Building

13001 East 17th Place

Mail Stop B119

Aurora, CO 80045

Statistical Genetics (not admitting students)

The Graduate Certificate in Statistical Genetics has paused admissions as of December 2023.

The Graduate Certificate in Statistical Genetics provides opportunities for concentrated education in statistical genetics to graduate students from a variety of disciplines. While the certificate is aimed primarily at matriculated PhD and MS students at UW, non-matriculated students may also apply. The classes are taught by faculty in Statistics, Biostatistics, and Genome Sciences.

Degree(s)/Certificate(s) offered

  • Graduate Certificate in Statistical Genetics

Program director/interdisciplinary group chair

Primary staff contact

  • Mee-Ling Hon, Academic Counselor, Department of Statistics

Interdisciplinary Faculty Group Membership

The following are the core/voting Graduate Faculty members of the interdisciplinary group. For a complete list of faculty active in the program, see the program website.

  • Brian Browning, Associate Professor, Department of Medicine
  • Sharon Browning, Research Associate Professor, Department of Biostatistics
  • Daniel Eisenberg, Assistant Professor, Department of Anthropology
  • Joseph Felsenstein, Professor, Departments of Biology, Genome Sciences, and Molecular Biotechnology
  • Philip Green, Professor, Departments of Genome Sciences and Molecular Biotechnology
  • Gail Jarvik, Professor, Departments of Medicine, Genome Sciences, and Molecular Biotechnology
  • Benjamin Kerr, Professor, Department of Biology
  • Kathleen Kerr, Associate Professor, Department of Biostatistics
  • Adam Leaché, Assistant Professor, Department of Biology
  • Harmit Malik, Affiliate Associate Professor, Departments of Genome Sciences and Molecular Biotechnology
  • Frederick Matsen, Affiliate Associate Professor, Department of Statistics
  • Barbara McKnight, Professor, Department of Biostatistics
  • Daniel Promislow, Professor, Departments of Biology and Pathology
  • Kenneth Rice, Associate Professor, Department of Biostatistics
  • Ali Shojaie, Assistant Professor, Department of Biostatistics
  • Noah Simon, Assistant Professor, Department of Biostatistics
  • Elizabeth Thompson, Professor, Department of Statistics
  • Jon Wakefield, Professor, Departments of Statistics and Biostatistics
  • Bruce Weir, Professor, Department of Biostatistics
  • Daniela Witten, Associate Professor, Departments of Biostatistics and Statistics

DPhil in Statistics

About the course

In the DPhil in Statistics, you will investigate a particular project in depth and write a thesis which makes a significant contribution to the field. You will acquire a wide range of research and transferable skills, as well as in-depth knowledge, understanding and expertise in your chosen field of research. You will become part of a vibrant community of researchers.

The Department of Statistics in the University of Oxford is a world leader in research in probability, bioinformatics, mathematical genetics and statistical methodology, including computational statistics, machine learning and data science. Oxford’s Mathematical Sciences submission came first in the UK on all criteria in the 2021 Research Excellence Framework (REF). In 2016 the department moved to a newly refurbished building in the centre of Oxford.

Much of the department’s research is either explicitly interdisciplinary or draws its motivation from application areas, ranging from genetics, immunoinformatics, bioinformatics and cheminformatics, to finance and the social sciences. 

You will be expected to acquire transferable skills as part of your training, and to undertake broadening training outside your specialist area. Part of that broadening training is obtained through APTS, the Academy for PhD Training in Statistics; this is a joint venture with a group of leading university statistics departments which runs four weeks of appropriate courses a year. You will give a research presentation or prepare a research poster each year in the department. There may also be opportunities to undertake industrial internships as appropriate.

You are expected to teach approximately 12 contact hours per year in undergraduate and graduate courses in the department. This is mentored teaching that begins with simple marking and progresses to the point where individual students lead whole classes of 10 to 12 undergraduate students. You will be encouraged to participate in social events and to take part in public engagement. The department also offers career development events.

Supervision

The allocation of graduate supervision for this course is the responsibility of the Department of Statistics and it is not always possible to accommodate the preferences of incoming graduate students to work with a particular member of staff. Under exceptional circumstances, a supervisor may be found outside the Department of Statistics.

You will be assigned a named supervisor or supervisors, who will have overall responsibility for the direction of your work on behalf of the department. You will have the opportunity to interact with fellow students and other members of your research groups, and more widely across the department. Typically, as a research student, you should expect to have meetings with your supervisor or a member of the supervisory team with a frequency of at least once every two weeks averaged across the year. The regularity of these meetings may be subject to variations according to the time of the year, and the stage that you are at in your research programme.

Initially, you will be admitted as a Probationer Research Student (PRS).

There are formal assessments of progress on the research project with the Transfer of Status from PRS to DPhil status at around 12 to 15 months and Confirmation of Status at around 30 to 36 months. These assessments involve the submission of written work and oral examination by two assessors (other than your supervisor). Over the course of the DPhil you will be expected to undertake a total of 100 hours of broadening training outside your specialist area.

The final thesis is normally submitted for examination during the fourth year and is followed by the viva examination.

Graduate destinations

After research degrees, the majority of the department’s graduates move into research and academic careers. Others work, for example, in data analytics, in tech and biotech companies and in the financial sector.

Changes to this course and your supervision

The University will seek to deliver this course in accordance with the description set out in this course page. However, there may be situations in which it is desirable or necessary for the University to make changes in course provision, either before or after registration. The safety of students, staff and visitors is paramount and major changes to delivery or services may have to be made in circumstances of a pandemic, epidemic or local health emergency. In addition, in certain circumstances, for example due to visa difficulties or because the health needs of students cannot be met, it may be necessary to make adjustments to course requirements for international study.

Where possible your academic supervisor will not change for the duration of your course. However, it may be necessary to assign a new academic supervisor during the course of study or before registration for reasons which might include illness, sabbatical leave, parental leave or change in employment.

For further information please see our page on changes to courses and the provisions of the student contract regarding changes to courses.

Entry requirements for entry in 2024-25

Proven and potential academic excellence.

The requirements described below are specific to this course and apply only in the year of entry that is shown. You can use our interactive tool to help you evaluate whether your application is likely to be competitive.

Please be aware that any studentships that are linked to this course may have different or additional requirements and you should read any studentship information carefully before applying. 

Degree-level qualifications

As a minimum, applicants should hold or be predicted to achieve the following UK qualifications or their equivalent:

  • a first-class or strong upper second-class undergraduate degree with honours in an appropriate subject. You will need a strong background in mathematics and/or statistics.

However, entrance is very competitive and most successful applicants have a first-class degree or the equivalent.

A previous master's degree (either an integrated master's degree or standalone) is preferred but is not required.

For applicants with a degree from the USA, the minimum GPA sought is 3.6 out of 4.0.

If your degree is not from the UK or another country specified above, visit our International Qualifications page for guidance on the qualifications and grades that would usually be considered to meet the University’s minimum entry requirements.

GRE General Test scores

No Graduate Record Examination (GRE) or GMAT scores are sought.

Other qualifications, evidence of excellence and relevant experience

Publications are not expected but can be included with the application.

English language proficiency

This course requires proficiency in English at the University's standard level. If your first language is not English, you may need to provide evidence that you meet this requirement. The minimum scores required to meet the University's standard level are detailed in the table below.

*Previously known as the Cambridge Certificate of Advanced English or Cambridge English: Advanced (CAE)

†Previously known as the Cambridge Certificate of Proficiency in English or Cambridge English: Proficiency (CPE)

Your test must have been taken no more than two years before the start date of your course. Our Application Guide provides further information about the English language test requirement.

Declaring extenuating circumstances

If your ability to meet the entry requirements has been affected by the COVID-19 pandemic (eg you were awarded an unclassified/ungraded degree) or any other exceptional personal circumstance (eg other illness or bereavement), please refer to the guidance on extenuating circumstances in the Application Guide for information about how to declare this so that your application can be considered appropriately.

You will need to register three referees who can give an informed view of your academic ability and suitability for the course. The How to apply section of this page provides details of the types of reference that are required in support of your application for this course and how these will be assessed.

Supporting documents

You will be required to supply supporting documents with your application. The How to apply section of this page provides details of the supporting documents that are required as part of your application for this course and how these will be assessed.

Performance at interview

Interviews are normally held as part of the admissions process for applicants who, on the basis of the written application, best meet the selection criteria. Interviews may be held in person, by telephone, or by video link such as Microsoft Teams or Zoom, normally with at least two interviewers.

The interviews last about 30 minutes and include questions about motivation as well as questions from the proposed research area.

How your application is assessed

Your application will be assessed purely on your proven and potential academic excellence and other entry requirements described under that heading.

References and supporting documents submitted as part of your application, and your performance at interview (if interviews are held), will be considered as part of the assessment process. Whether or not you have secured funding will not be taken into consideration when your application is assessed.

An overview of the shortlisting and selection process is provided below. Our 'After you apply' pages provide more information about how applications are assessed.

Shortlisting and selection

Students are considered for shortlisting and selected for admission without regard to age, disability, gender reassignment, marital or civil partnership status, pregnancy and maternity, race (including colour, nationality and ethnic or national origins), religion or belief (including lack of belief), sex, sexual orientation, as well as other relevant circumstances including parental or caring responsibilities or social background. However, please note the following:

  • socio-economic information may be taken into account in the selection of applicants and award of scholarships for courses that are part of the University’s pilot selection procedure and for scholarships aimed at under-represented groups;
  • country of ordinary residence may be taken into account in the awarding of certain scholarships; and
  • protected characteristics may be taken into account during shortlisting for interview or the award of scholarships where the University has approved a positive action case under the Equality Act 2010.

Processing your data for shortlisting and selection

Information about processing special category data for the purposes of positive action and using your data to assess your eligibility for funding can be found in our Postgraduate Applicant Privacy Policy.

Admissions panels and assessors

All recommendations to admit a student involve the judgement of at least two members of the academic staff with relevant experience and expertise, and must also be approved by the Director of Graduate Studies or Admissions Committee (or equivalent within the department).

Admissions panels or committees will always include at least one member of academic staff who has undertaken appropriate training.

Other factors governing whether places can be offered

The following factors will also govern whether candidates can be offered places:

  • the ability of the University to provide the appropriate supervision for your studies, as outlined under the 'Supervision' heading in the About section of this page;
  • the ability of the University to provide appropriate support for your studies (eg through the provision of facilities, resources, teaching and/or research opportunities); and
  • minimum and maximum limits to the numbers of students who may be admitted to the University's taught and research programmes.

Offer conditions for successful applications

If you receive an offer of a place at Oxford, your offer will outline any conditions that you need to satisfy and any actions you need to take, together with any associated deadlines. These may include academic conditions, such as achieving a specific final grade in your current degree course. These conditions will usually depend on your individual academic circumstances and may vary between applicants. Our 'After you apply' pages provide more information about offers and conditions.

In addition to any academic conditions which are set, you will also be required to meet the following requirements:

Financial Declaration

If you are offered a place, you will be required to complete a Financial Declaration in order to meet your financial condition of admission.

Disclosure of criminal convictions

In accordance with the University’s obligations towards students and staff, we will ask you to declare any relevant, unspent criminal convictions before you can take up a place at Oxford.

The Department of Statistics is based in St Giles, near the centre of Oxford. The building has spaces for study and collaborative learning, including a large interaction and social area, the Library and an Open Research Zone.

You will normally be provided with a computer and desk space in a shared office in this building.

You will have access to the Department of Statistics’ computing facilities and support, the department’s library (in addition to the nearby Radcliffe Science Library and other university libraries, and the centrally-provided electronic resources) and other facilities appropriate to your research topic. The provision of other resources specific to your research project should be agreed with your supervisor as a part of the planning stages of the agreed project.

The department runs seminar series in statistics and probability. There is also a graduate lecture series, involving snapshots of the research interests of the department. Several journal clubs run each term, reading and discussing new research papers as they emerge.

Graduate training is an important part of the department's research mission. As well as the graduate lectures previously mentioned, formal lecture courses are also available, for example from the MSc in Statistical Science, from the fourth-year undergraduate courses in mathematics and statistics, and from the Centres for Doctoral Training. The MPLS Graduate School offers an extensive range of courses for graduate research students throughout the academic year, including academic subjects and skills; research skills and techniques; ethics and intellectual property; transferable, professional and personal effectiveness skills; and communication, interpersonal and teaching skills.

Departmental seminars and colloquia bring research students together with academic and other research staff to hear about ongoing research, and provide an opportunity for networking and socialising. Various social events are held throughout the year, such as board game evenings, choir practice, a Summer party and a Winter party.

The University's Department of Statistics is a world leader in research in probability, bioinformatics, mathematical genetics and statistical methodology, including computational statistics, machine learning and data science. 

You will be actively involved in a vibrant academic community by means of seminars, lectures, journal clubs, and social events. Research students are offered training in modern probability, stochastic processes, statistical methodology, computational methods and transferable skills, in addition to specialised topics relevant to specific application areas.

The department is located on St Giles, in a building providing excellent teaching facilities and creating a highly visible centre for statistics in Oxford. Oxford’s Mathematical Sciences submission came first in the UK on all criteria in the 2021 Research Excellence Framework (REF).

The University expects to be able to offer over 1,000 full or partial graduate scholarships across the collegiate University in 2024-25. You will be automatically considered for the majority of Oxford scholarships if you fulfil the eligibility criteria and submit your graduate application by the relevant December or January deadline. Most scholarships are awarded on the basis of academic merit and/or potential.

For further details about searching for funding as a graduate student visit our dedicated Funding pages, which contain information about how to apply for Oxford scholarships requiring an additional application, details of external funding, loan schemes and other funding sources.

Please ensure that you visit individual college websites for details of any college-specific funding opportunities using the links provided on our college pages or below:

Please note that not all the colleges listed above may accept students on this course. For details of those which do, please refer to the College preference section of this page.

Annual fees for entry in 2024-25

Further details about fee status eligibility can be found on the fee status webpage.

Information about course fees

Course fees are payable each year, for the duration of your fee liability (your fee liability is the length of time for which you are required to pay course fees). For courses lasting longer than one year, please be aware that fees will usually increase annually. For details, please see our guidance on changes to fees and charges.

Course fees cover your teaching as well as other academic services and facilities provided to support your studies. Unless specified in the additional information section below, course fees do not cover your accommodation, residential costs or other living costs. They also don’t cover any additional costs and charges that are outlined in the additional information below.

Continuation charges

Following the period of fee liability, you may also be required to pay a University continuation charge and a college continuation charge. The University and college continuation charges are shown on the Continuation charges page.

Where can I find further information about fees?

The Fees and Funding section of this website provides further information about course fees, including information about fee status and eligibility and your length of fee liability.

Additional information

There are no compulsory elements of this course that entail additional costs beyond fees (or, after fee liability ends, continuation charges) and living costs. However, please note that, depending on your choice of research topic and the research required to complete it, you may incur additional expenses, such as travel expenses, research expenses, and field trips. You will need to meet these additional costs, although you may be able to apply for small grants from your department and/or college to help you cover some of these expenses.

Living costs

In addition to your course fees, you will need to ensure that you have adequate funds to support your living costs for the duration of your course.

For the 2024-25 academic year, the range of likely living costs for full-time study is between c. £1,345 and £1,955 for each month spent in Oxford. Full information, including a breakdown of likely living costs in Oxford for items such as food, accommodation and study costs, is available on our living costs page. The current economic climate and high national rate of inflation make it very hard to estimate potential changes to the cost of living over the next few years. When planning your finances for any future years of study in Oxford beyond 2024-25, it is suggested that you allow for potential increases in living expenses of around 5% each year – although this rate may vary depending on the national economic situation. UK inflationary increases will be kept under review and this page updated.

Students enrolled on this course will belong to both a department/faculty and a college. Please note that ‘college’ and ‘colleges’ refers to all 43 of the University’s colleges, including those designated as societies and permanent private halls (PPHs). 

If you apply for a place on this course you will have the option to express a preference for one of the colleges listed below, or you can ask us to find a college for you. Before deciding, we suggest that you read our brief introduction to the college system at Oxford and our advice about expressing a college preference. For some courses, the department may have provided some additional advice below to help you decide.

The following colleges accept students on the DPhil in Statistics:

  • Balliol College
  • Brasenose College
  • Christ Church
  • Corpus Christi College
  • Exeter College
  • Green Templeton College
  • Hertford College
  • Jesus College
  • Keble College
  • Kellogg College
  • Lady Margaret Hall
  • Linacre College
  • Lincoln College
  • Magdalen College
  • Mansfield College
  • Merton College
  • New College
  • Nuffield College
  • Oriel College
  • The Queen's College
  • Reuben College
  • St Anne's College
  • St Catherine's College
  • St Cross College
  • St Edmund Hall
  • St Hilda's College
  • St Hugh's College
  • St Peter's College
  • Somerville College
  • University College
  • Wadham College
  • Wolfson College
  • Worcester College
  • Wycliffe Hall

Before you apply

Our guide to getting started provides general advice on how to prepare for and start your application. You can use our interactive tool to help you evaluate whether your application is likely to be competitive.

If it's important for you to have your application considered under a particular deadline – eg under a December or January deadline in order to be considered for Oxford scholarships – we recommend that you aim to complete and submit your application at least two weeks in advance. Check the deadlines on this page and the information about deadlines in our Application Guide.

Application fee waivers

An application fee of £75 is payable per course application. Application fee waivers are available for the following applicants who meet the eligibility criteria:

  • applicants from low-income countries;
  • refugees and displaced persons; 
  • UK applicants from low-income backgrounds; and 
  • applicants who applied for our Graduate Access Programmes in the past two years and met the eligibility criteria.

You are encouraged to check whether you're eligible for an application fee waiver before you apply.

Readmission for current Oxford graduate taught students

If you're currently studying for an Oxford graduate taught course and apply to this course with no break in your studies, you may be eligible to apply to this course as a readmission applicant. The application fee will be waived for an eligible application of this type. Check whether you're eligible to apply for readmission.

Application fee waivers for eligible associated courses

If you apply to this course and up to two eligible associated courses from our predefined list during the same cycle, you can request an application fee waiver so that you only need to pay one application fee.

The list of eligible associated courses may be updated as new courses are opened. Please check the list regularly, especially if you are applying to a course that has recently opened to accept applications.

Do I need to contact anyone before I apply?

You are advised to look at the research interests of the department's academic staff at an early stage and make contact with a potential supervisor via email to clarify your proposed research area.

Completing your application

You should refer to the information below when completing the application form, paying attention to the specific requirements for the supporting documents.

For this course, the application form will include questions that collect information that would usually be included in a CV/résumé. You should not upload a separate document. If a separate CV/résumé is uploaded, it will be removed from your application.

If any document does not meet the specification, including the stipulated word count, your application may be considered incomplete and not assessed by the academic department.

Proposed field and title of research project

Under the 'Field and title of research project' please enter your proposed field or area of research if this is known. If the department has advertised a specific research project that you would like to be considered for, please enter the project title here instead.

You should not use this field to type out a full research proposal. You will be able to upload your research supporting materials separately if they are required (as described below).

Proposed supervisor

If known, under 'Proposed supervisor name' enter the name of the academic(s) who you would like to supervise your research. Otherwise, leave this field blank.

If possible, you should suggest one or two potential supervisors, listing them in order of preference or indicating equal preference.

Referees: Three overall, academic preferred

Whilst you must register three referees, the department may start the assessment of your application if two of the three references are submitted by the course deadline and your application is otherwise complete. Please note that you may still be required to ensure your third referee supplies a reference for consideration.

Academic references are strongly encouraged, though a professional reference is acceptable in the exceptional case that the referee is able to offer comparable information on your background and suitability for the course to an academic referee.

Your references will support intellectual ability, academic achievement, motivation and commitment.

Official transcript(s)

Your transcripts should give detailed information of the individual grades received in your university-level qualifications to date. You should only upload official documents issued by your institution and any transcript not in English should be accompanied by a certified translation.

More information about the transcript requirement is available in the Application Guide.

Research proposal: A maximum of 1,000 words

Your research proposal should be written in English and should specify the area in which your research interests lie and why you have chosen this area. If you have a particular project in mind, you should describe this and why you are keen to work on this.

If you do not have a detailed project in mind at this stage, you should describe your research interests instead. In this case, the description can be very brief but should include your reasons for applying.

The proposal should aim to be helpful to the department in the selection process and can include a suggestion for potential supervisor(s) and/or research group. The overall page count does not need to include any bibliography.

If possible, please ensure that the word count is clearly displayed on the document.

This will be assessed for:

  • your reasons for applying
  • evidence of motivation for and understanding of the proposed area of study.

Your statement should focus on specific research areas rather than personal achievements and aspirations.

Start or continue your application

You can start or return to an application using the relevant link below. As you complete the form, please refer to the requirements above and consult our Application Guide for advice. You'll find the answers to most common queries in our FAQs.


ADMISSION STATUS

Open - applications are still being accepted

Up to a week's notice of closure will be provided on this page - no other notification will be given

12:00 midday UK time on:

  • Friday 5 January 2024: latest deadline for most Oxford scholarships
  • Friday 1 March 2024: applications may remain open after this deadline if places are still available (see below)
  • A later deadline shown under 'Admission status': if places are still available, applications may be accepted after 1 March. The 'Admission status' (above) will provide notice of any later deadline.

*Three-year average (applications for entry in 2021-22 to 2023-24)

Further information and enquiries

This course is offered by the Department of Statistics

  • Course page on the department's website
  • Academic and research staff
  • Departmental research
  • Mathematical, Physical and Life Sciences
  • Residence requirements for full-time courses
  • Postgraduate applicant privacy policy

Course-related enquiries

Advice about contacting the department can be found in the How to apply section of this page

✉ [email protected] ☎ +44 (0)1865 272870

Application-process enquiries

See the application guide

Mathematics, genetics and evolution

  • Published: 06 February 2013
  • Volume 1, pages 9–31 (2013)


  • Warren J. Ewens 1  


The importance of mathematics and statistics in genetics is well known. Perhaps less well known is the importance of these subjects in evolution. The main problem that Darwin saw in his theory of evolution by natural selection was solved by some simple mathematics. It is also not a coincidence that the re-writing of the Darwinian theory in Mendelian terms was carried largely by mathematical methods. In this article I discuss these historical matters and then consider more recent work showing how mathematical and statistical methods have been central to current genetical and evolutionary research.



Author information

Authors and affiliations.

Department of Biology and Statistics, The University of Pennsylvania, Philadelphia, PA, 19104, USA

Warren J. Ewens


Corresponding author

Correspondence to Warren J. Ewens.


About this article

Ewens, W.J. Mathematics, genetics and evolution. Quant Biol 1, 9–31 (2013). https://doi.org/10.1007/s40484-013-0003-5


Received: 08 October 2012

Revised: 22 October 2012

Accepted: 06 November 2012

Published: 06 February 2013

Issue Date: March 2013

DOI: https://doi.org/10.1007/s40484-013-0003-5


  • Natural Selection
  • Random Mating
  • Additive Genetic Variance
  • Allelic Type
  • Allele Model

DPhil in Statistics

The Department of Statistics admits doctoral students each year to a programme of instruction and research leading to the Doctor of Philosophy (DPhil) in Statistics degree. A doctorate normally requires between three and four years of full-time study. 

In the DPhil in Statistics, you will investigate a particular project in depth and write a thesis which makes a significant contribution to the field. It can be in any of the subject areas for which supervision is available.

The Department of Statistics in the University of Oxford is a world leader in research in probability, bioinformatics, mathematical genetics and statistical methodology, including computational statistics and machine learning. Much of the department’s research is either explicitly interdisciplinary or draws its motivation from application areas, ranging from biology and physics to the social sciences.

You will be assigned a named supervisor or supervisors, who will have overall responsibility for the direction of your work on behalf of the department. You will have the opportunity to interact with fellow students and other members of your research groups, and more widely across the department. Typically, as a research student, you should expect to have meetings with your supervisor or a member of the supervisory team with a frequency of at least once every two weeks averaged across the year. The regularity of these meetings may be subject to variations according to the time of the year, and the stage that you are at in your research programme.

There are formal assessments of progress on the research project at around 12 to 15 months and at around 30 to 36 months. These assessments involve the submission of written work and oral examination.

The final thesis is normally submitted for examination during the fourth year and is followed by the viva examination.

You will be expected to acquire transferable skills as part of your training, and to undertake a total of 100 hours of broadening training outside your specialist area. Part of that broadening training is obtained through APTS, the Academy for PhD Training in Statistics; this is a joint venture with a group of leading university statistics departments which runs four weeks of appropriate courses a year. You will give a research presentation or prepare a research poster each year in the department.

Our research students are actively involved in a lively academic community by means of seminars, lectures, journal clubs, working groups and social events. They receive training in modern probability, stochastic processes, statistical methodology, computational methods and transferable skills, in addition to specialised topics relevant to specific application areas. In particular, a broad structured programme of training in modern statistical methodology is available via courses in the Academy for PhD Training in Statistics (APTS), of which the Department is a founding member.

Information about application deadlines, entry requirements and funding is available from the DPhil in Statistics prospectus page on the University of Oxford website.

Research Areas

Supervisors for DPhil projects are listed below, with links to the research group page.

Computational Statistics and Machine Learning

Supervisor: Professor François Caron
Bayesian Statistics, Statistical Machine Learning, Statistical Network Analysis, Bayesian Nonparametrics.

Supervisor: Professor Mihai Cucuringu
Development and mathematical & statistical analysis of algorithms that extract information from massive noisy data sets. Computationally-hard inverse problems on large graphs with applications in machine learning. Spectral and semidefinite programming algorithms with application to ranking, clustering, group synchronization, phase unwrapping. Network analysis: community and core-periphery structure, network time series. Statistical analysis of financial data, statistical arbitrage, limit order books, risk models.

Supervisor: Professor George Deligiannidis
Computational statistics, in particular theory and methodology for Monte Carlo methods, especially MCMC and SMC for high-dimensional targets; limit theorems and convergence rates for Markov chains and stochastic processes in general; random walks.

Supervisor: Professor Arnaud Doucet
Possible research areas: Bayesian Computation, Monte Carlo methods, Statistical Machine Learning.

Supervisor: Professor Patrick Rebeschini
Investigation of fundamental principles in high-dimensional probability, statistics and optimisation to design computationally efficient and statistically optimal algorithms for machine learning.

Supervisor: Professor Yee Whye Teh
Machine learning. Probabilistic modelling, learning and inference.

Econometrics and Population Statistics

Supervisor: Professor Mihai Cucuringu
Development and mathematical & statistical analysis of algorithms that extract information from high-dimensional noisy data sets, network time series, and certain computationally-hard inverse problems on large graphs. Particular areas of focus include statistical arbitrage, machine-learning for asset pricing, lead-lag detection, market microstructure, limit order books, synthetic data generation, as well as nonlinear dimensionality reduction techniques for high-dimensional time series data.

Supervisor: Professor Frank Windmeijer
Causal Inference, Instrumental Variables Estimation (instrument selection using machine learning, weak instrument robust inference, bootstrap), Mendelian Randomisation.

Probability

Supervisor: Professor Julien Berestycki
Branching processes, branching random walks, coalescence, fragmentation, population genetics, reaction-diffusion equations, front propagation, random trees.

Supervisor: Professor Alison Etheridge
Stochastic analysis, especially problems related to stochastic modelling in population genetics.

Supervisor: Professor Christina Goldschmidt
Research area: random discrete structures (eg trees and graphs) and their scaling limits.

Supervisor: Professor James Martin
Probability theory, with strong links to statistical physics and theoretical computer science. Particular interests include: random graphs; interacting particle systems; models of random growth and percolation; models of coagulation and fragmentation; queueing networks.

Supervisor: Professor Gesine Reinert
Investigation of networks such as protein-protein interaction networks and social networks in a statistically rigorous fashion. Often this will require some approximation, and approximations in statistics are another of my research interests. There is an excellent method to derive distances between the distributions of random quantities, namely Stein’s method, and I am interested in Stein’s method also from a theoretical viewpoint. The general area of my research falls under the category Applied Probability and many of the problems and examples I study are from the area of Computational Biology.

Supervisor: Professor David Steinsaltz
Random dynamical systems, particularly with applications to population ecology. Evolutionary and biodemographic models of ageing.

Supervisor: Professor Matthias Winkel
Probability and stochastic processes, in particular problems involving branching processes, Lévy processes, fragmentation processes, random tree structures.

Protein Informatics

Supervisor: Professor Charlotte Deane
Developing novel methodologies to understand and predict protein evolution, interaction, structure and function.

Supervisor: Professor Garrett Morris
Developing novel therapeutics and improving our understanding of living systems at the molecular level, in particular methods development in computer-aided drug discovery. Harnessing the increasing amounts of experimental data, and the development of novel algorithms in chemoinformatics and bioinformatics, machine learning, network pharmacology, and structural biology, to help solve real-world drug discovery problems.

Oxford Protein Informatics Group

Statistical Genetics and Epidemiology

Supervisor: Professor Christl Donnelly
Epidemiology of infectious disease; Real-time analysis of outbreaks; Biostatistics; Disease ecology; Applied statistics.

Supervisor: Professor Jotun Hein
Algorithms in Bioinformatics, Computational Biology, Stochastic Models of Genealogies and Sequence Evolution, Mathematical Models of the Origin of Life, Stochastic Models of Network Evolution, Genome Analysis.

Supervisor: Professor Simon Myers
Statistical and population genomics (fine-scale population structure and migrations, recombination, natural selection on complex traits, association testing, demographic history), statistical approaches for single-cell data (RNA-seq, ATAC-seq), genetic determinants of speciation and fertility in mammals.

Supervisor: Professor Pier Palamara
Computational methods for population genetics (natural selection, demographic history); statistical genetics (complex trait heritability, association); scalable methods for large genomic data sets.

Statistical Theory and Methodology

Supervisor: Professor Robin Evans
Graphical models; Causal inference; Marginal modelling; Combining causal information from different experimental settings; Confounding and selection bias; High-dimensional model selection, and low dimensional model selection in the presence of high-dimensional confounders.

Supervisor: Professor Geoff Nicholls
Applied Bayesian Statistics and Statistical Methods, focusing on building and fitting models for complex stochastic systems. Computational Statistics, in particular Monte Carlo Algorithms. Current projects: Multiple imputation and model misspecification; Monte Carlo filtering and inference for partial orders from rank data; Spatial Statistics and the location of texts; Phylogenetic inference for cultural traits.

Computational Biology and Bioinformatics

Queries about the DPhil in Statistics or MSc by Research in Statistics should be sent to [email protected]


Master of Science Thesis Course Electives

Elective list one: methodological emphasis

Elective list two: biology or public health emphasis

Email the following to the Biostatistics Graduate Director:

  • Name of the UW course you wish to take
  • Reason for your request, and
  • Course syllabus (You may need to contact the course instructor to obtain the syllabus.)  

Note: A course syllabus is required for your request to be considered.

Purdue University Graduate School

Computational Methods for Population Genetics

III: Small: Novel Statistical Data Analysis Approaches for Mining Human Genetics Datasets

Directorate for Computer & Information Science & Engineering

III: Small: Fast and Efficient Algorithms for Matrix Decompositions and Applications to Human Genetics

BIGDATA: F: DKA: Collaborative Research: Randomized Numerical Linear Algebra (RandNLA) for Multi-linear and Non-linear Data

Degree Type

  • Doctor of Philosophy
  • Computer Science

Campus location

  • West Lafayette


  • Statistical and quantitative genetics
  • Genetics not elsewhere classified
  • Anthropological genetics
  • Behavioural genetics
  • Bioinformatics and computational biology not elsewhere classified
  • Applications in life sciences
  • Knowledge representation and reasoning
  • Pattern recognition
  • Data mining and knowledge discovery

CC BY 4.0

  • Open access
  • Published: 06 May 2023

Genomics in animal breeding from the perspectives of matrices and molecules

  • Martin Johnsson   ORCID: orcid.org/0000-0003-1262-4585 1  

Hereditas volume 160, Article number: 20 (2023)


A Correction to this article was published on 18 May 2023

This article has been updated

This paper describes genomics from two perspectives that are in use in animal breeding and genetics: a statistical perspective concentrating on models for estimating breeding values, and a sequence perspective concentrating on the function of DNA molecules.

This paper reviews the development of genomics in animal breeding and speculates on its future from these two perspectives. From the statistical perspective, genomic data are large sets of markers of ancestry; animal breeding makes use of them while remaining agnostic about their function. From the sequence perspective, genomic data are a source of causative variants; what animal breeding needs is to identify and make use of them.

The statistical perspective, in the form of genomic selection, is the more applicable in contemporary breeding. Animal genomics researchers working from the sequence perspective are still working towards the isolation of causative variants, equipped with new technologies but continuing a decades-long line of research.

Genomics, in the sense of genetic analyses using markers spaced out along the whole genome, has become a mainstream part of animal breeding. In March 2021, the dairy cattle evaluation in the US run by the Council on Dairy Cattle Breeding had accumulated five million genotyped animals [1]. These data are gathered for the purpose of genomic selection, that is, evaluation of animals based on genome-wide DNA testing, which was implemented in the US in 2007 (reviewed by [2]). Genomic selection builds on the practice of genetic evaluation, which estimates a breeding value (a prediction of the trait values of the offspring that an animal will have) from measurements on the animal itself and its relatives. Genomic selection adds molecular information in the form of genome-wide DNA markers to the evaluation.

Animal breeding before genomics was already immensely effective in changing the traits of farm animals. Take for example broiler chicken breeding. Zuidhof et al. [3] compared commercial broilers from 2005 (Ross 308 from Aviagen) with populations where breeding stopped in 1957 or 1978, kept in the same environment and fed the same feed. At eight weeks of age, the average body mass was 0.9 kg for the population with genetics from 1957, 1.8 kg for the population with genetics from 1978, and 4.2 kg for the population with genetics from 2005. The first SNP chip for chickens was developed in 2005 [4], and Aviagen started using genomic selection in 2012 [5]; thus, this difference is due to breeding that occurred before genomics. Genomics, however, made selection even more effective, either by increasing accuracy of selection or reducing generation interval, depending on the species. Potentially, it can also tell us about the molecular nature of the variants under selection and lead to new biotechnology applications for livestock.

The term “genomics” is derived from “genome”, which was coined by Hans Winkler in 1920 [ 6 ] and refers to one haploid set of chromosomes [ 7 ], or, with some degree of slippage in meaning, the complete DNA of a species. According to Thomas Roderick [ 8 ], the extension to “genomics” was conceived in 1986, as the founders of the journal Genomics were trying to find a name for it. From the start, they regarded genomics as the name of a new field: “an activity, a new way to think about biology”.

There are (at least) two ways to think of genomics in animal breeding: two perspectives on genomics that will, throughout this paper, be called the statistical and the sequence perspectives:

We may think of the genome as a big table of numbers, where each row is an individual and each column a genetic variant, and the numbers are ancestry indicators. These matrices lend themselves to statistical calculations such as estimation of genomic breeding values. This is the view from the statistical perspective.

Alternatively, we may think of the genome as a long string of A, C, G and T. Such strings lend themselves to molecular biology operations like predicting the amino acid substitution from a base pair substitution, or identifying patterns of interest. This is the view from the sequence perspective.

The perspectives roughly map to two concepts of a so-called gene [ 9 ]: The statistical perspective relates to the instrumental gene, a calculating device used by classical geneticists to understand inheritance patterns. The instrumental gene is a particle of inheritance, observed indirectly through crosses and comparisons of traits between relatives. For example, the textbook of classical genetics by Sturtevant and Beadle [ 10 ] is full of crossing schemes of fruit flies that allow modes of inheritance to be investigated. In the introduction, the authors describe their view of genetics as a science. They call it “a mathematically formulated subject that is logically complete and self-contained”, without the necessity of a physical or chemical account of how inheritance works. On the other hand, the sequence perspective aligns more closely with the nominal gene concept, where a gene is a DNA sequence that has a name and (potentially) a function. As an example, we can look at a genome browser such as Ensembl [ 11 ], which shows a genome as a series of tracks, with colourful boxes denoting genes, regulatory DNA sequences, and other associated information.

To be clear, I am not suggesting that individual geneticists are so limited in their thinking as to use only one of these perspectives. Any one researcher probably has these and several other mental models of the genome for different tasks. In practice, geneticists seem to routinely switch between different perspectives and conceptions of central terms like “genome”, “gene” and “locus”, without much friction. Certainly, ambiguity may lead to “complexity and confusion” [ 12 ], but I would argue that the imprecision is also sometimes productive, as it avoids unnecessary debates about which of these concepts are “right”, when the real answer is that all of them are working models and all are useful in different contexts.

The two perspectives lead to different views about the importance of identifying sequence variants that cause trait differences between individuals (“causative variants”, for short). From the statistical perspective, genomic data are large sets of markers of ancestry; we can make use of them while remaining agnostic about their function. From the sequence perspective, genomic data are a source of causative variants; we need to identify and make use of them. To realise the future potential of the sequence perspective, geneticists need to identify causative variants, while the statistical perspective has been successful, precisely by ignoring causative variants. The power of markers [ 13 ] is what Sturtevant & Beadle described: The point is to make use of statistical regularities without getting bogged down in mechanistic detail. Conversely, the potential of the molecular perspective is in understanding mechanisms and learning to manipulate them in ways that would not be possible by traditional selection and crossing. Mostly, this potential of the sequence perspective has not been realised, but the search for molecular knowledge has made possible tools that underpin applications of the statistical perspective, especially genomic selection.

Tools of the statistical perspective

Genomic selection is the crowning achievement of the statistical perspective on genomics in animal breeding, building on a long line of research on mapping phenotypes to genotypes. Genetic mapping (the family of methods used for localising variants that affect traits, roughly at first) goes back to the early history of classical genetics. Once geneticists had discovered that genes were arranged linearly on chromosomes, they could build maps of where causative variants underlying visible phenotypes were located relative to each other, the first map being published by Sturtevant [ 14 ]. This map building activity, based on crossing and detecting recombinant individuals, is called linkage mapping. The extension to complex traits with many causative variants of small effects is traditionally called “quantitative trait locus mapping” [ 15 ]. The extension to large population samples of more distantly related individuals is called “genome-wide association” [ 16 ], and has become the dominant form of genetic mapping. Arguably, genetic mapping can be viewed both from the statistical and sequence perspectives. On one hand, these methods involve statistical genetic methods that are very similar to those used in genomic prediction, and involve representing genomic data statistically. On the other hand, the end goal is usually to identify causative variants.

Out of genetic mapping of traits relevant to breeding comes marker-assisted selection, an earlier paradigm for incorporating molecular information in breeding. In a way, marker-assisted selection is the most intuitive way to imagine molecular breeding: Imagine that we have identified some genetic variants that either cause a trait of interest, or are strongly associated with it. Then, we can genotype our selection candidates for the variant of interest, and incorporate those genotypes into selection decisions. For example, if we know about a strongly deleterious variant, we can exclude candidates that carry it. The proposition of a genetic test is especially attractive when the trait is otherwise hard to phenotype. This was precisely the situation with several large-effect deleterious alleles in pigs and cattle, where marker-assisted selection was successfully implemented against the problematic alleles: malignant hyperthermia and the RN gene in pigs (reviewed by [ 13 , 17 ]) and BLAD in cattle [ 18 ]. DNA tests for such large-effect damaging variants are now routinely included in many genomic breeding programs (e.g., [ 19 , 20 ]).

At some point during the late 1990s to early 2000s, animal breeding researchers shifted their thinking from marker-assisted selection to genomic selection, from thinking about mapping causative variants to treating the whole genome together. Arguably, the key paper, and the most cited, is the one by Meuwissen, Hayes and Goddard [ 21 ]. It presents the full case for genomic selection, including simulations and a few alternative estimation methods (leading to the so-called Bayesian alphabet family of methods). However, genomic selection did not appear fully formed at once. Other genomic selection precursor papers from the era include:

The 1990 paper by Lande & Thompson [ 22 ] that contains the key idea of covering the genome with markers and selecting on a total score based on all the markers.

The 1997 paper by Nejati-Javaremi, Smith & Gibson [ 23 ], the key idea of which is to create a relationship matrix based on variants that affect a trait, creating estimated breeding values based on what they call “total allelic relationship”.

The 1998 paper by Haley & Visscher [ 24 ] which uses the term “genomic selection” and clearly expresses the concept, including the interpretation of genetic markers as realised relatedness.

Exactly when and by whom (in conversation or in parallel) the shift happened is a topic of its own. It seems to have been a gradual process. Still, Meuwissen, Hayes and Goddard (2001) is a landmark in that it provided a full recipe for genomic selection, and ran the proof of concept in silico. Genomic selection worked well enough in theory that it provided the inspiration for creating the tools and the practical initiatives to make it a reality.

We can think of genomic prediction as refining the estimate of how closely related animals are to each other by observing how much DNA the animals share, as opposed to the average relatedness that can be predicted from a pedigree. Alternatively, we can think of it as simultaneously estimating the contribution of every part of the genome (that is, every marker we genotype), and adding them up to a genomic estimate for that animal (see [ 25 ] for a review of the statistical approaches used in animal breeding). Either way, the key insight in genomic selection is that one can accurately predict breeding values in the absence of information about the function of particular variants by combining all markers in one statistical model. As Lowe & Bruce point out [ 13 ], this black-boxing of genetic mechanisms is characteristic of the quantitative genetics tradition, here expressed by one of the pioneering applied quantitative geneticists, Lush [ 26 ]:

It is rarely possible to identify the pertinent genes in a Mendelian way or to map the chromosomal position of any of them. Fortunately this inability to identify and describe the genes individually is almost no handicap to the breeder of economic plants or animals. What he would actually do if he knew the details about all the genes which affect a quantitative character in that population differs little from what he will do if he merely knows how heritable it is and whether much of the hereditary variance comes from dominance or overdominance, and from epistatic interactions between the genes.

Lowe & Bruce argue that this attitude is key to the success of genomic selection: this strategy is the outcome of an alignment, but not a full integration of quantitative and molecular genetics, which allowed quantitative genetics to make use of molecular methods to generate ever denser marker maps, while sticking with the tradition of abstraction [ 13 ].
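The two readings of genomic prediction described above, as realised relatedness and as a sum of marker contributions, can be illustrated with a minimal sketch. It assumes 0/1/2 allele-count genotypes and uses one common way of building a genomic relationship matrix (centring by allele frequency and scaling by twice the sum of p(1 - p), a construction usually attributed to VanRaden); neither the coding nor this particular construction is specified in the text. The marker effects are hypothetical numbers standing in for estimates from a fitted model, and all function names are illustrative.

```python
def centre_genotypes(matrix):
    """Subtract twice the allele frequency from each 0/1/2 genotype."""
    n = len(matrix)
    freqs = [sum(col) / (2 * n) for col in zip(*matrix)]
    centred = [[g - 2 * p for g, p in zip(row, freqs)] for row in matrix]
    return centred, freqs

def genomic_relationship_matrix(matrix):
    """Realised relatedness between all pairs of animals from marker data.

    Assumes at least one marker is polymorphic, so the scale is non-zero.
    """
    centred, freqs = centre_genotypes(matrix)
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / scale for row_j in centred]
        for row_i in centred
    ]

def genomic_values(matrix, marker_effects):
    """Add up marker contributions into one genomic estimate per animal."""
    centred, _ = centre_genotypes(matrix)
    return [sum(z * e for z, e in zip(row, marker_effects)) for row in centred]

genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 1, 1],
    [1, 1, 1],
]
# Hypothetical marker effect estimates; in practice these come from a model fit.
effects = [0.5, 0.0, -0.2]

G = genomic_relationship_matrix(genotypes)
values = genomic_values(genotypes, effects)
```

Neither function uses any information about what the markers do, which is the point of the black-boxing described above.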

The effects of genomics have been dramatic. Genomic prediction allows selection to proceed more quickly, or more accurately, depending on the biology of the species and the design of the breeding program. In cattle, increased selection accuracy for young bulls without daughter records allows shorter generation times [ 2 , 27 , 28 ], and genotyping of heifers much improves selection accuracy of cows relative to pedigree-based evaluation [ 29 ]. In pigs, genomics has increased accuracy of selection in several traits by 50% [ 17 ]. In poultry, accuracy has also increased; a review of genomic selection in poultry gives accuracy increases ranging from 20% to over 50% in layers and broilers [ 5 ].

There are further statistical genetics tools, agnostic of marker function, that can be enriched by genomics. Optimal contributions selection (reviewed by [ 30 ]) is a family of methods to balance the genetic improvement and inbreeding or loss of diversity of a population. These methods work by finding less related individuals to pair that still give a high expected genetic gain in the offspring. As in genomic selection, pedigree relatedness can be substituted with genomic relatedness. Since genomic selection in practice tends to accelerate inbreeding, there may be greater need for optimal contributions selection in genomic breeding. In principle, genomic selection can differentiate between individuals that are identically related in terms of pedigree, and thus lead to less correlation between families and a lower inbreeding rate, all else equal [ 31 ]. In practice, all else is not equal, because genomics leads to redesigns of breeding programs, which may in itself increase or decrease the inbreeding rate. In breeding programs where genomic selection helped reduce generation time, a low inbreeding rate per generation may translate to accelerating inbreeding per year. There are examples of both accelerated [ 32 ] and reduced inbreeding rates after genomic selection [ 33 ].

Furthermore, population genetic methods can find the similarity between populations and individuals, and classify individuals based on breed composition or geographic origin, or assign offspring to parents. For example, DNA testing to confirm pedigree in cattle started with blood groups, moved on to genetic markers, and now uses the genome-wide SNP chips that are used for genomic selection [ 34 ]. Genomics allows plentiful markers distributed throughout the genome, and so methods can be more precise in pinpointing ancestry [ 35 ], and can reconstruct pedigree information that is missing [ 36 ].
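As an illustration of how dense SNP data can confirm a recorded pedigree, the sketch below counts opposing homozygotes: loci where a candidate parent and an offspring are homozygous for different alleles and therefore cannot share an allele. This is one simple approach, not necessarily the one used in the cited work; the 0/1/2 coding, the 1% tolerance for genotyping error, the function names and the toy genotypes are all assumptions made for the example.

```python
def opposing_homozygotes(parent, offspring):
    """Count loci where parent and offspring are opposite homozygotes (0 vs 2).

    Missing calls (None) are skipped.
    Returns (number of conflicts, number of loci compared).
    """
    conflicts = 0
    compared = 0
    for p, o in zip(parent, offspring):
        if p is None or o is None:
            continue
        compared += 1
        if {p, o} == {0, 2}:
            conflicts += 1
    return conflicts, compared

def confirm_parent(parent, offspring, max_conflict_rate=0.01):
    """Accept the parent if conflicts stay within the assumed error tolerance."""
    conflicts, compared = opposing_homozygotes(parent, offspring)
    if compared == 0:
        return False  # no shared genotype calls, nothing to decide on
    return conflicts / compared <= max_conflict_rate

calf = [0, 1, 2, 2, 0, 1, None, 2]
recorded_sire = [0, 1, 1, 2, 1, 2, 0, 2]
other_bull = [2, 1, 0, 2, 2, 0, 0, 1]

sire_ok = confirm_parent(recorded_sire, calf)
other_ok = confirm_parent(other_bull, calf)
```

With thousands of markers on a chip, even a modest share of conflicting loci separates true from false parents far more clearly than this eight-marker toy example suggests.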

Tools of the sequence perspective

From the sequence perspective, the development of genomics in animal breeding can be seen as an ongoing effort to build the tools for causative variant identification. In the process, it also gave rise to the enabling technology for genomic selection. This development includes reference genomes for farm animals, dense marker panels and affordable methods to type them (SNP chips, reduced representation sequencing), genome annotation and maps that localise causative variants in the genome (linkage mapping and genome-wide association).

The chicken genome sequence was published in 2004 [ 37 ], cattle in 2009 [ 38 ], and pig in 2012 [ 39 ]. The choice of any one publication and year as a milestone in a genome sequencing project is somewhat arbitrary, because the sequences reported in these papers were neither the first nor the last drafts. Genome assembly is an iterative process that combines different kinds of data, computational models, and human judgement to represent a genome. For a historical account of the diverse data and ways of reasoning used in the pig genome project, see Lowe [ 40 ]. Lowe points out that a genome project was not just about sequencing in the narrow sense of putting DNA base pairs in order, but “thick” sequencing, which also includes the creation of tools, annotation with additional data, and dissemination to a research community that makes reference genomes useful. Consequently, the development of farm animal reference sequences is still ongoing, with the pig, cattle and chicken genomes being updated [ 41 , 42 ] and followed by sheep, goat, ducks, turkeys and many others. There are now multiple high-quality genome assemblies, e.g. in cattle [ 43 , 44 ]. Inevitably, more are coming, as genome assembly becomes more affordable and streamlined.

The next layer atop the reference genome is annotation, here understood as any information that has a genomic coordinate, localising it in the genome. As Szymanski et al. [ 45 ] point out in a study of the yeast genome, one of the functions of a reference genome as a digital model of the genome is to allow researchers to organise and connect different sources of data. Researchers can put their data on the same coordinate system and create a coherent picture. In the yeast community, that coherence-building used to be achieved by sharing strains and standard protocols, before the reference genome. For logistical reasons, germplasm sharing is harder in farm animal genetics. But now, genome annotation is available in genome browsers such as the NCBI Genome Data Viewer and Ensembl, which contain comparative information [ 46 ], the location of genes, and non-genic elements of importance such as open chromatin (as it is becoming available). Projects like Functional Annotation of Animal Genomes [ 47 ] are producing detailed maps of gene-regulatory regions in farm animal genomes, with the express purpose that researchers are going to be able to integrate their openly available data into their projects. Such functional genomic data might be useful for annotating genetic variants as a part of fine-mapping and nominating potential causative variants, in genomic prediction with sequence data, and in molecular biology studies of gene-regulatory networks.

The key technology enabling genomics in farm animals, however, is affordable high-throughput genotyping, in the form of SNP chip technology that allows the testing of thousands of single nucleotide polymorphisms (SNPs) at the same time. SNP chips are, generally, surfaces with known pieces of DNA attached to them. The array captures fragments of DNA close to the markers we want to type, and a DNA polymerase enzyme that incorporates labelled nucleotides gives a fluorescence signal, where the relative signal intensity of the alleles will tell us the genotype [ 48 ]. A clustering algorithm will help turn the intensity values into genotypes, the numeric coding needed for all the statistical genomic methods.
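The step from intensities to genotypes can be sketched for a single marker as follows. This is a deliberately simplified illustration, not the algorithm of any actual genotyping platform: it assumes each sample yields a pair of positive signal intensities for two alleles (called A and B here), reduces them to the B-allele share of the total signal, and runs a one-dimensional k-means with three clusters started at the expected ratios for the three genotypes. All names and numbers are hypothetical.

```python
def call_genotypes(intensities, n_iterations=20):
    """Cluster (allele A, allele B) signal pairs into genotypes 0, 1 or 2.

    Returns the number of B alleles called for each sample.
    """
    ratios = [b / (a + b) for a, b in intensities]
    centres = [0.0, 0.5, 1.0]  # starting points for AA, AB and BB
    calls = []
    for _ in range(n_iterations):
        # Assign each sample to the nearest cluster centre.
        calls = [min(range(3), key=lambda k: abs(r - centres[k])) for r in ratios]
        # Move each centre to the mean of its members (unchanged if empty).
        for k in range(3):
            members = [r for r, c in zip(ratios, calls) if c == k]
            if members:
                centres[k] = sum(members) / len(members)
    return calls

# Hypothetical fluorescence intensities for six samples at one marker.
samples = [(1000, 50), (980, 70), (520, 480), (490, 510), (60, 940), (40, 1010)]
genotype_calls = call_genotypes(samples)
```

Repeating this per marker gives one column of the genotype table per SNP on the chip.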

Looking at the original three farm animal genome papers, they all mentioned genetic improvement of livestock, but in oblique terms. It is as if they either did not know precisely how a reference genome would improve breeding in these animals, or that the way forward now that the reference genome was in place was too obvious even to mention:

The chicken genome sequence promotes both the development of more refined polymorphic maps (see the accompanying paper [ 49 ] ) and the framework for discovering the functional polymorphisms underlying interesting quantitative traits, thus fully exploiting the genetic potential of the chicken. [ 37 ]

The cattle genome and associated resources will facilitate the identification of novel functions and regulatory systems of general importance in mammals and may provide an enabling tool for genetic improvement within the beef and dairy industries. [ 38 ]

The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model. [ 39 ]

However, when the first SNP chips were being published, the design of the SNP chips was explicitly motivated by the ability to perform genomic selection, in addition to the ability to improve genetic mapping:

The aim of this study was to develop and characterize a high-density, genome-wide SNP assay for cattle with the power to detect genomic segments harboring inter-individual DNA sequence variation affecting phenotypic traits and for application to GWS, in which an animal’s genetic merit is estimated solely from its multilocus genotype. [ 50 ]

The most efficient way to genotype large numbers of SNPs is to design a high-density assay that includes tens of thousands of SNPs distributed throughout the genome. These SNP “chips” are a valuable resource for genetic studies in livestock species, such as genomic selection, detection of [quantitative trait loci] or diversity studies. [ 51 ]

In livestock species like the chicken, high throughput single nucleotide polymorphism (SNP) genotyping assays are increasingly being used for whole genome association studies and as a tool in breeding (referred to as genomic selection). [ 52 ]

These genomic tools — reference genomes, genome annotation, large-scale genotyping — build towards detecting causative variants that affect traits by allowing bigger and more marker-dense genome-wide association studies for localising causative variants, and the ability to look under the loci detected to find the underlying genes and important sequence elements, such as gene-regulatory sequences. It is striking to read the attitudes in commentaries on genomics in animal breeding from the early days of genomics. Here is Bulfield [ 53 ] in 2000 describing the isolation of causative variants:

Farm animal genomics is developing in four phases. (1) Constructing maps of highly informative markers and genes. (2) Using these maps to scan broadly across genomes of resource populations, segregating for commercially important traits, to locate quantitative trait loci (QTL) into 20–40 cM chromosomal segments. (3) Identifying the trait gene(s) themselves, within these regions. (4) Bridging the ‘phenotype gap’ between the gene(s) and the ultimate trait.

What implications would this have for animal breeding? Bulfield continues:

In animal breeding, a combination of genome analysis and cell culture-based transgenesis would permit a more controlled approach to animal breeding, especially for currently intractable traits such as fertility and disease resistance. In addition, cloning from adult cells (as with Dolly) would permit the replication of (for example) a proven high-yielding and productive dairy cow.

On the same theme, Goddard [ 54 ] wrote in 2003:

I believe animal breeding in the post-genomic era will be dramatically different to what it is today. There will be a massive research effort to discover the function of genes including the effect of DNA polymorphisms on phenotype. Breeding programmes will utilize a large number of DNA-based tests for specific genes combined with new reproductive techniques and transgenes to increase the rate of genetic improvement and to produce for, or allocate animals to, the product line to which they are best suited. However, this stage will not be reached for some years by which time many of the early investors will have given up, disappointed with the early benefits.

In retrospect, Bulfield was clearly too optimistic; Goddard’s more tempered optimism might still be right depending on how long a time counts as “some years”. Also, the technologies listed by Bulfield [ 53 ] (linkage maps of 20 to 40 cM resolution, microsatellite and amplified fragment length markers, back-crosses and expressed sequence tag libraries) sound antique to students of animal breeding educated today. The low number of markers (e.g., 40 cM resolution would mean about 150 markers to cover the cattle genome) made sense for genetic mapping based on linkage within families, which was the state of the art at the time. The tools of the sequence perspective have moved far in 20 years, but the underlying problems of causative variant identification remain the same.

That is, despite the increasing development of molecular tools, statistical methods, and increasing dataset sizes, there are few known causative variants for economically important traits (see tables in [ 55 ]). None of them have yet led to transgenic animals that are used in farming. Why have we not found the causative variants? There are at least three problems:

It turns out that most traits of interest are massively polygenic. That is, they are affected by thousands of genetic variants, most of them with individually small effects. This has been a staple assumption of quantitative genetics since the early 20th century, and was further cemented by the failure of linkage mapping to explain large chunks of inheritance; now there are methods (based on genomic selection models) to estimate polygenicity from data. The estimated number of causative variants for complex traits in humans is in the range of tens of thousands [ 56 , 57 ].

Quantitative traits may have complex genetic architectures in other ways than polygenicity; they may be affected by rare variants whose effects are hard to estimate, and variants that act in non-additive ways (dominance or epistasis). This is less important for selection, as the response to selection depends on the additive genetic variance, and even non-additive effects at the variant level can result in substantial additive genetic variance [ 58 , 59 ]. However, when we go on to identify causative variants, it may matter, for example, if the apparently additive outcome depends on pairwise interactions between variants that are located close together.

Even when an association has been isolated (and there are thousands of them [ 60 ]), fine-mapping an association signal down to the causative variant or even gene is hard, because there are many variants, and they correlate (geneticists call this correlation, abstrusely, “linkage disequilibrium”), and interpreting them and testing their effects are hard work.
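The correlation between variants mentioned in the third problem can be quantified with a short sketch. It computes the squared Pearson correlation between the 0/1/2 genotype codes of two variants, which is a commonly used approximation to linkage disequilibrium when haplotype phase is unknown; the coding, the function name and the toy genotypes are assumptions made for illustration.

```python
def ld_r_squared(marker_a, marker_b):
    """Squared correlation between the genotype codes of two variants."""
    n = len(marker_a)
    mean_a = sum(marker_a) / n
    mean_b = sum(marker_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(marker_a, marker_b))
    var_a = sum((a - mean_a) ** 2 for a in marker_a)
    var_b = sum((b - mean_b) ** 2 for b in marker_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a variant without variation")
    return cov ** 2 / (var_a * var_b)

# Hypothetical genotypes of six animals at three variants.
causal = [0, 1, 2, 1, 0, 2]
neighbour = [0, 1, 2, 1, 0, 2]  # always inherited together with `causal`
unlinked = [1, 0, 1, 2, 1, 1]

r2_neighbour = ld_r_squared(causal, neighbour)
r2_unlinked = ld_r_squared(causal, unlinked)
```

A neighbouring variant in complete linkage disequilibrium with the causative one shows exactly the same association with the trait, so association data alone cannot tell the two apart.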

The Goddard [ 54 ] quote is particularly apt, because while the post-genomic future he envisaged, based on the sequence perspective, has not happened, at about the same time as that paper was published, he was involved in developing genomic selection, the statistical genomics future that happened instead.

Statistical futures

What is the future of genomic breeding? From the statistical perspective, the immediate future seems to hold even more genomic selection — on more data, with new traits, spread to new species and breeding programs, and possibly enhanced with functional genomic data.

As data accumulate on more and more animals, larger datasets cause computational difficulties. Methods such as APY (the “algorithm of proven and young”), which splits a genomic selection dataset into a “core” group of animals and a “peripheral” group of animals and performs the most intense computations only on the core subset, allow one to use large numbers of genotyped animals and still be able to compute estimated breeding values in reasonable times [ 61 ]. There is a whole strand of genomics research in animal breeding that works on improving the way genomic selection models are used in practice, how to fit the models efficiently, how to re-fit them when new data arrives, and how to estimate their accuracy (see review by [ 62 ]).

Another ongoing strand of research is extending genomic selection to more complicated genetic scenarios like crossbred animals or generalisation between different populations. Standard genomic selection models work best for prediction within a single population. Thus, if crossbred animals are used for breeding, as is common for example in beef cattle, one would like to have genomic estimated breeding values for them. Even when the crossbred animals might not be used in breeding themselves, such as in pig or poultry breeding, there are traits that can only be measured on crossbred individuals and that information needs to be propagated back to the purebred nucleus animals. Similarly, small breeds might struggle to gather enough data, and the ability to borrow information from larger breeds is attractive.

However, genetic distance between animals quickly reduces the accuracy of genomic selection, complicating across-breed and multi-breed genomic prediction (see review by [ 63 ]). First, comparing distantly related breeds, the marker-trait associations in each breed could be very different, both because the breeds might carry different causative alleles and because the correlations (linkage disequilibrium) between causal variants and markers might be different. Second, non-additive genetic effects, which to a first approximation can be discounted as a nuisance factor within a population, can make a substantial difference as genetic differences accumulate. To accurately predict the outcome, a full model would have to consider both dominance and the genotypes at multiple interacting loci. However, without identifying the interactions and non-linearities, the correlation between marker effect estimates can be shown to decline with genetic differentiation [ 64 ].

Another avenue of development is to find a place for machine learning methods in genomics of animal breeding. Machine learning methods have been used in functional genomics to predict variant effects (reviewed by [ 65 ]), and in animal breeding applications for developing new phenotypes [ 66 , 67 ], but so far have not been widely used in genomic selection. This is not for lack of trying; early work included attempts at using kernel methods [ 68 , 69 ], tree regression [ 70 ] and neural networks [ 71 ], and later efforts have been made with deep learning [ 72 , 73 ]. However, unless we count linear mixed models as a machine learning application, these have not made much impact on applied genomic selection. Probably, this is because non-additive effects have hitherto not played a big role in selection, and these methods only outperform linear mixed models when predicting non-additive effects. This may change if genomic selection is extended to systems where non-additive effects are more important, and one has to design matings to produce offspring that deviate from the parent average in the right direction [ 74 ], or for applications where predicting individual phenotype rather than breeding value is the goal.

Finally, there is a strand of research that aims to improve genomic selection by adding more genomic information. For biological reasons, some variants are expected to contribute more: variants close to known associations from genome-wide association studies, variants predicted by bioinformatic means to be functional, variants associated with gene expression variation, variants located in open chromatin in a relevant tissue, and so on. Various statistical extensions to the genomic selection models allow groups of variants to be treated separately [ 75 , 76 ] and given different emphasis depending on their predicted function. Such methods would be important for performing genomic selection with whole-genome sequence data, which includes millions rather than tens of thousands of variants. It seems clear that there is potential. A series of studies using gene expression quantitative trait locus data in combination with chromatin and evolutionary conservation suggests that one might be able to prioritise variants that are more likely to explain quantitative trait variation [ 77 , 78 ]. However, empirical results on whole-genome sequence data in genomic prediction [ 79 , 80 , 81 , 82 ] are inconsistent between methods, populations and traits about whether adding genomic information brings any benefit, or even degrades accuracy. Even in simulations where the causative variants are known [ 83 ], the increase in accuracy from including true causative variants is not great, unless the true effect sizes of the variants are known. Therefore, the potential gain from enhancing genomic selection is probably much smaller than the improvement that came from moving from traditional evaluation to genomic selection.

The statistical perspective also holds the opposite possibility for a turn away from the genome. Instead of pursuing more genomic data to possibly improve genomic prediction, one could invest in improving measurement technology or modelling to improve the measurement of traits. Because the task, from the statistical perspective, is not to understand the genome but to get a good enough estimate of ancestry, it might be that the best choice is to settle for a relatively crude genotyping strategy (like a medium density SNP chip) and instead focus on gathering more records on high-value but hard-to-measure traits [ 84 ].

Sequence futures

As we saw above, around the turn of the century there was optimism about identifying causative variants and exploiting them in animal breeding, which turned out to be mostly premature. Marker-assisted selection was successfully used on large-effect variants such as genetic defects, but less successful for quantitative traits. There are thousands of quantitative trait loci and genome-wide association hits published for economically relevant quantitative traits in farm animals, but only a handful that have been fine-mapped down to a causative variant [ 85 ]. However, molecular genetic techniques have moved rapidly over the last 20 years, not just adding new assays for gene-regulatory activity, but scaling them to the whole genome. With these new tools at hand, researchers are again optimistic that causative variants can be identified and exploited.

Several papers outline a vision of a future for the sequence perspective in animal [ 86 , 87 ] and plant breeding [ 88 ], using genome editing methods such as CRISPR/Cas9 to supplement classical breeding with causative variants of known function. They call this future, causative-variant-enabled breeding “Livestock 2.0” and “Breeding 4.0”. Besides the version number conflict, the visions have a similar overall shape: the future of breeding lies in identifying genetic causative variants through large genomic datasets, and then introducing them into breeding individuals through gene editing. Clark et al. [ 86 ] also describe identifying functional variants and editing them as “a route to application” for functional genomic data in farm animals.

The first application along this route of gene editing would be the ongoing attempts at editing monogenic high-value traits, such as hornlessness caused by polled alleles in cattle [ 89 ], or resistance to porcine reproductive and respiratory syndrome virus in pigs, conferred by edits to the CD163 gene [ 90 ]. In the case of pigs, the causative variant does not occur naturally, and was designed based on molecular knowledge about the virus’ mode of infection. The hornless variant (“polled”) was identified by genome-wide association [ 91 ]. Conceptually, these proposed applications are somewhat different from the applications that have been proposed for transgenic animals before. Transgenic farm animals, such as the defunct “Enviropig” project [ 92 ] or the AquaAdvantage salmon [ 93 ], would have DNA introduced from different species, and can be thought of as examples of a genetic engineering approach. These modern proposals typically use less dramatic changes: alleles that exist in nature, or that could relatively easily arise by natural mutation (e.g., partial deletion of a gene in the CD163 example, or producing a duplication similar to a naturally occurring duplication in the polled case).

Gene editing is like marker-assisted selection in the sense that the variants to be edited need to have large enough effects to be worthwhile, and editing must be more effective than conventional alternatives. Both resistance to porcine reproductive and respiratory syndrome and polledness are potentially traits of great value and connected to animal welfare. Outbreaks of porcine reproductive and respiratory syndrome have devastating consequences for pig health and farm profitability, and simulations suggest that gene editing in combination with partially protective vaccines could eliminate the disease [ 94 ]. Hornless cows are highly desired by farmers, and dehorning is a welfare issue. As for conventional alternative strategies, natural knockouts of the CD163 gene in pigs appear to be exceedingly rare [ 95 ]. Polled alleles, however, occur in many breeds, including dairy breeds conceived as targets of editing, and marker-assisted selection is already in use in breeding programs to promote them, as polled status can be predicted from SNP chips used for genomic selection. Simulation studies suggest that an editing-based strategy for promoting polled can have better consequences in terms of genetic gain and inbreeding than marker-assisted selection [ 96 , 97 , 98 ], but it remains to be seen whether the technological hurdles, regulations, acceptability and ethical issues will be resolved in time for polled gene editing to be successful.

However, going beyond monogenic traits to complex traits, the lack of routes to application other than gene editing becomes a problem. If editing and marker-assisted selection are the only applications for knowledge of causative variants, and neither is likely to work well for complex traits, this limits the applied potential of the sequence perspective. Molecular insights about traits in farm animals are scientifically interesting, but currently have little other applied value. This is often not very clear from reading genomic studies, which often promise improvements to animal breeding without spelling out how they will come about. Allow me a personal and somewhat embarrassing example: In the introduction to my PhD thesis, which was defended in 2015, I wrote about the quantitative trait loci that I had identified, and speculated about what would be needed for them to be used in actual breeding. This discussion was completely misguided. It raised true concerns, such as whether the association would replicate in a different population, whether the underlying variants behind shared associations in different populations are the same, and so on, but it missed the mark, because I was not aware that marker-assisted selection for quantitative traits was essentially dead at this point. The quantitative trait locus paradigm that I was operating within was dead and buried in animal breeding, and the first commercial genomic selection of poultry was already happening [ 5 ].

Most traits of economic relevance to animal breeding are affected by many variants of small effects. This polygenicity means that, in order to know which sequences to edit and what to replace them with, one needs to solve the fine-mapping problem: to find ways to reliably identify causative variants, even if they are of moderate effect size. The situation is more challenging than with marker-assisted selection, where it may be enough to detect a variant in close linkage disequilibrium with the genuine causative variant. It is still an open question when and how we will get detailed enough knowledge of the genomic basis of complex traits to do this. It would require a workflow to identify causative variants reliably enough to edit them, in a very short time compared to current methods, where thorough characterization of a causative variant takes years.

Furthermore, pleiotropy and non-additive effects might affect predictability of the outcomes of editing. Because the size of the genome and its repertoire of genes is limited, genes and pathways are recycled in a context-dependent manner for many biological functions. This suggests that many genetic variants will affect multiple traits, likely mediated by gene-regulatory relationships. This postulate of “universal pleiotropy” goes back to early quantitative genetics [ 99 ] and forms part of the more recent “omnigenic model” of complex traits [ 100 ]. This suggests that any use of gene editing needs to be vigilant against side-effects and consider the whole breeding goal in a balanced way, as argued by [ 101 ]. In the presence of non-additive effects, the statistical effect of an allele substitution depends on the frequency of the interaction partners. This means that the net effect of a gene edit might change as the population changes, as argued by [ 101 , 102 ]. However, one might argue that we already take genomic selection decisions, and thus shift the allele frequency of regions associated with large marker effects, on the basis of estimates that average over potential interactions and are liable to change over time.
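The dependence of a statistical allele substitution effect on the frequency of an interaction partner can be illustrated with a toy calculation. The sketch below is not taken from the cited papers; it assumes a hypothetical two-locus model with an additive effect `a1` at the edited locus and an additive-by-additive interaction `aa` with a second locus, Hardy-Weinberg proportions, and independence between the loci. All function names and parameter values are illustrative assumptions.

```python
from itertools import product


def genotype_freqs(p: float) -> dict:
    """Hardy-Weinberg frequencies of allele counts 0, 1 and 2 at allele frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}


def genotypic_value(x1: int, x2: int, a1: float, aa: float) -> float:
    """Toy model: additive effect at locus 1 plus an additive-by-additive interaction."""
    return a1 * x1 + aa * x1 * x2


def substitution_effect(a1: float, aa: float, p1: float, p2: float) -> float:
    """Regression of genotypic value on allele count at locus 1.

    Assumes the two loci are independent and 0 < p1 < 1, so that the
    allele count at locus 1 has non-zero variance.
    """
    f1, f2 = genotype_freqs(p1), genotype_freqs(p2)
    # One entry per two-locus genotype: (count at locus 1, value, frequency).
    cells = [
        (x1, genotypic_value(x1, x2, a1, aa), f1[x1] * f2[x2])
        for x1, x2 in product(f1, f2)
    ]
    mean_x = sum(w * x for x, _, w in cells)
    mean_g = sum(w * g for _, g, w in cells)
    cov = sum(w * (x - mean_x) * (g - mean_g) for x, g, w in cells)
    var = sum(w * (x - mean_x) ** 2 for x, _, w in cells)
    return cov / var


if __name__ == "__main__":
    # Hypothetical effects: the interaction opposes the additive effect.
    for p2 in (0.1, 0.5, 0.9):
        print(p2, round(substitution_effect(1.0, -1.0, 0.3, p2), 6))
```

In this toy model the regression slope equals `a1 + 2 * aa * p2`, so with the hypothetical values above the effect of the same allele moves from about 0.8 to about -0.8 as the partner allele goes from rare to common.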

The next problem to overcome is how to introduce many edits into a breeding program. The challenge has two parts. First, multiplex gene editing is technically challenging on its own, given that the success rate of a biallelic homology-directed repair editing event with CRISPR/Cas9 is low. Even if it could be increased to double digits, the success rate for multilocus edits would scale poorly. Second, integrating gene editing into animal breeding programs would involve performing gene editing at the scale of many animals. Jenko et al. [ 103 ] suggested a strategy of promotion of alleles by gene editing, where the chosen sires of a breeding program would be edited to be homozygous for causative variants that they did not already carry. They assumed that causative variants were known and that sires could be selected before they were edited. This would require new reproductive technology integrated with genomic selection. Such in vitro breeding strategies have been proposed several times [ 24 , 104 , 105 ] as extensions of the already advanced reproductive technologies used in particular in cattle breeding. For example, if embryo transfer is already in use to breed sires for a cattle breeding program, it might be possible in the future to use it to introduce gene editing machinery into the embryo, then biopsy a small amount of DNA to both verify the integrity of the edits and perform genomic selection. It remains to be seen, if this strategy becomes technologically feasible, what numbers of edited embryos and what levels of failure of editing would be acceptable. The failure rate of gene editing technologies is currently high, and that may lead to high costs and loss of selection response [ 96 ].
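The poor scaling of multilocus editing can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each locus is edited independently with the same per-locus success probability; the function names and the example rates are hypothetical and are not estimates from the cited studies.

```python
def multiplex_success(per_locus_rate: float, n_loci: int) -> float:
    """Probability that all n_loci edits succeed in one embryo.

    Assumes independent edits with an equal per-locus success rate.
    """
    if not 0.0 <= per_locus_rate <= 1.0:
        raise ValueError("per_locus_rate must be between 0 and 1")
    if n_loci < 0:
        raise ValueError("n_loci must be non-negative")
    return per_locus_rate ** n_loci


def embryos_needed(per_locus_rate: float, n_loci: int) -> float:
    """Expected number of embryos attempted per fully edited embryo."""
    p_all = multiplex_success(per_locus_rate, n_loci)
    return float("inf") if p_all == 0.0 else 1.0 / p_all


if __name__ == "__main__":
    for rate in (0.1, 0.3):  # hypothetical per-locus success rates
        for n in (1, 2, 5, 10):
            print(rate, n, multiplex_success(rate, n), embryos_needed(rate, n))
```

Under these assumptions, a 10% per-locus rate gives about a 1% chance that two edits both succeed and about one in 100,000 for five edits, which is why even a double-digit single-locus rate does not translate into practical multilocus editing.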

Johnsson et al. proposed removal of deleterious alleles [ 106 ], reasoning that damaging variants might be easier to identify from sequence data than causative variants for quantitative traits, and that recessive deleterious alleles may be common in farm animal populations due to ineffective natural selection and the large impact of genetic drift. While that assumption may be true, there is currently no workflow for large-scale identification of deleterious variants in place, and when such variants are detected, marker-assisted selection is more attractive than gene editing.

In summary, the sequence perspective faces challenges, not just within genomics (the fine-mapping problem) but also within reproductive technology and breeding program design (the problem of multiplex editing). Gene editing of very large-effect variants is somewhat akin to marker-assisted selection, where there are reliable workflows for causative variant identification, and individual effects may be dramatic enough to justify editing. However, gene editing of causative variants for complex traits appears too fraught with problems to be possible within the foreseeable future. Perhaps finding a promising route to application for the sequence perspective will require a shift in the thinking of the field that we are not yet seeing, similar to the shift from marker-assisted to genomic selection.

Conclusions

In conclusion, there are (at least) two ways to think of genomics in animal breeding, both of which are helpful in understanding how genomic technologies have changed and may continue to change animal breeding. Currently, tools derived from the statistical perspective are doing the heavy lifting in breeding practice, in the form of genomic selection. With the advent of new technologies, the sequence perspective could make an impact in the future, if it can overcome the twin problems of how to identify causative variants for complex traits and how to introduce them into animals, both at scale.

Data Availability

Not applicable.

Change history

18 May 2023

A Correction to this paper has been published: https://doi.org/10.1186/s41065-023-00287-8

Carillo J, Tokuhisa K. The U.S. has recorded 5 million genotypes. Hoard’s Dairyman [Internet]. 2021 [cited 2023 Jan 4]; Available from: https://hoards.com/article-29836-the-us-has-recorded-5-million-genotypes.html .

Wiggans GR, Cole JB, Hubbard SM, Sonstegard TS. Genomic selection in dairy cattle: the USDA experience. Annu Rev Anim Biosci. 2017;5:309–27.

Zuidhof M, Schneider B, Carney V, Korver D, Robinson F. Growth, efficiency, and yield of commercial broilers from 1957, 1978, and 2005. Poult Sci. 2014;93:2970–82.

Muir W, Wong G, Zhang Y, Wang J, Groenen M, Crooijmans R, et al. Review of the initial validation and characterization of a 3K chicken SNP array. World’s Poult Sci J. 2008;64:219–26.

Wolc A, Kranis A, Arango J, Settar P, Fulton J, O’Sullivan N, et al. Implementation of genomic selection in the poultry industry. Anim Front. 2016;6:23–31.

Yadav SP. The wholeness in suffix -omics, -omes, and the word om. J Biomol Tech. 2007;18:277.

Winkler H. Verbreitung und ursache der parthenogenesis im pflanzen-und tierreiche. 1920;165.

Kuska B. Beer, Bethesda, and biology: how “genomics” came into being. 1998.

Griffiths PE, Stotz K. Genes in the postgenomic era. Theor Med Bioeth. 2006;27:499.

Sturtevant AH, Beadle GW. An introduction to genetics. 1939.

Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, et al. Ensembl 2018. Nucleic Acids Res. 2017;46:D754–61.

Portin P, Wilkins A. The evolving definition of the term “Gene”. Genetics. 2017;205:1353–64.

Lowe JW, Bruce A. Genetics without genes? The centrality of genetic markers in livestock genetics and genomics. Hist Philos Life Sci. 2019;41:50.

Sturtevant AH. The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. J Exp Zool. 1913;14:43–59.

Soller M, Brody T, Genizi A. On the power of experimental designs for the detection of linkage between marker loci and quantitative loci in crosses between inbred lines. Theor Appl Genet. 1976;47:35–9.

Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–7.

Knol EF, Nielsen B, Knap PW. Genomic selection in commercial pig breeding. Anim Front. 2016;6:15–22.

Dekkers JCM. Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J Anim Sci. 2004;82:E313–28.

CDCB. Haplotypes & Genetic Conditions [Internet]. CDCB. [cited 2023 Jan 4]. Available from: https://uscdcb.com/haplotypes/ .

Genetic traits | NAV - Nordic Cattle Genetic Evaluation [Internet]. 2019 [cited 2023 Jan 4]. Available from: https://nordicebv.info/ntm-and-breeding-values/genetic-traits/ .

Meuwissen THE, Hayes B, Goddard M. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–29.

Lande R, Thompson R. Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics. 1990;124:743–56.

Nejati-Javaremi A, Smith C, Gibson J. Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci. 1997;75:1738–45.

Haley C, Visscher P. Strategies to utilize marker-quantitative trait loci associations. J Dairy Sci. 1998;81:85–97.

Gianola D, Rosa GJ. One hundred years of statistical developments in animal breeding. Annu Rev Anim Biosci. 2015;3:19–56.

Lush JL. Heritability of quantitative characters in farm animals. Hereditas. 1949;35:356–75.

García-Ruiz A, Cole JB, VanRaden PM, Wiggans GR, Ruiz-López FJ, Van Tassell CP. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proc Natl Acad Sci USA. 2016;113:E3995–4004.

Hayes BJ, Bowman PJ, Chamberlain A, Goddard M. Invited review: genomic selection in dairy cattle: Progress and challenges. J Dairy Sci. 2009;92:433–43.

Bengtsson C, Stålhammar H, Strandberg E, Eriksson S, Fikse WF. Association of genomically enhanced and parent average breeding values with cow performance in nordic dairy cattle. J Dairy Sci. 2020;103:6383–91.

Woolliams J, Berg P, Dagnachew B, Meuwissen T. Genetic contributions and their optimization. J Anim Breed Genet. 2015;132:89–99.

Daetwyler H, Villanueva B, Bijma P, Woolliams JA. Inbreeding in genome-wide selection. J Anim Breed Genet. 2007;124:369–76.

Makanjuola BO, Miglior F, Abdalla EA, Maltecca C, Schenkel FS, Baes CF. Effect of genomic selection on rate of inbreeding and coancestry and effective population size of Holstein and Jersey cattle populations. J Dairy Sci. 2020;103:5183–99.

Lozada-Soto EA, Maltecca C, Lu D, Miller S, Cole JB, Tiezzi F. Trends in genetic diversity and the effect of inbreeding in american Angus cattle under genomic selection. Genet Selection Evol. 2021;53:50.

VanRaden P, Cooper T, Wiggans G, O’Connell J, Bacheller L. Confirmation and discovery of maternal grandsires and great-grandsires in dairy cattle. J Dairy Sci. 2013;96:1874–9.

McFarlane SE, Hunter DC, Senn HV, Smith SL, Holland R, Huisman J et al. Increased genetic marker density reveals high levels of admixture between red deer and introduced Japanese sika in Kintyre, Scotland. Evolutionary Applications [Internet]. 2019 [cited 2019 Dec 29];n/a. Available from: https://doi.org/10.1111/eva.12880 .

Huisman J. Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond. Mol Ecol Resour. 2017;17:1009–24.

Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716.

Elsik CG, Tellam RL, Worley KC. The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009;324:522–8.

Groenen MAM, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y, Rothschild MF, et al. Analyses of pig genomes provide insight into porcine demography and evolution. Nature. 2012;491:393–8.

Lowe JW. Sequencing through thick and thin: Historiographical and philosophical implications. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences. 2018;72:10–27.

Warr A, Affara N, Aken B, Beiki H, Bickhart DM, Billis K et al. An improved pig reference genome sequence to enable pig genetics and genomics research. bioRxiv. 2019;668921.

Warren WC, Hillier LW, Tomlinson C, Minx P, Kremitzki M, Graves T et al. A new chicken genome assembly provides insight into avian genome structure. G3: Genes, Genomes, Genetics. 2017;7:109–17.

Low WY, Tearle R, Liu C, Koren S, Rhie A, Bickhart DM et al. Haplotype-resolved cattle genomes provide insights into structural variation and adaptation. bioRxiv. 2019;720797.

Rice ES, Koren S, Rhie A, Heaton MP, Kalbfleisch TS, Hardy T et al. Chromosome-length haplotigs for yak and cattle from trio binning assembly of an F1 hybrid. BioRxiv. 2019;737171.

Szymanski E, Vermeulen N, Wong M. Yeast: one cell, one reference sequence, many genomes? New Genetics and Society. Volume 38. Routledge; 2019. pp. 430–50.

Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M et al. Ensembl comparative genomics resources. Database. 2016;2016.

Giuffra E, Tuggle CK, FAANG Consortium. Functional annotation of animal genomes (FAANG): current achievements and roadmap. Annu Rev Anim Biosci. 2019;7:65–88.

Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS. A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet. 2005;37:549.

International Chicken Polymorphism Map Consortium. A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature. 2004;432:717.

Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, et al. Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE. 2009;4:e5350.

Ramos AM, Crooijmans RP, Affara NA, Amaral AJ, Archibald AL, Beever JE, et al. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS ONE. 2009;4:e6524.

Groenen MA, Megens H-J, Zare Y, Warren WC, Hillier LW, Crooijmans RP, et al. The development and characterization of a 60K SNP chip for chicken. BMC Genomics. 2011;12:274.

Bulfield G. Farm animal biotechnology. Trends Biotechnol. 2000;18:10–3.

Goddard ME. Animal breeding in the (post-) genomic era. Animal Science. Volume 76. Cambridge University Press; 2003. pp. 353–65.

Georges M, Charlier C, Hayes B. Harnessing genomic information for livestock improvement. Nat Rev Genet. 2019;20:135–56.

O’Connor LJ, Schoech AP, Hormozdiari F, Gazal S, Patterson N, Price AL. Extreme polygenicity of complex traits is explained by negative selection. Am J Hum Genet. 2019;105:456–76.

Zeng J, De Vlaming R, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat Genet. 2018;50:746.

Hill WG, Goddard ME, Visscher PM. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 2008;4:e1000008.

Mäki-Tanila A, Hill WG. Influence of Gene Interaction on Complex Trait Variation with Multilocus Models. Genetics. 2014;198:355–67.

Hu Z-L, Park CA, Reecy JM. Building a livestock genetic and genomic information knowledgebase through integrative developments of animal QTLdb and CorrDB. Nucleic Acids Res. 2018;47:D701–10.

Misztal I. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics. 2016;202:401–9.

Misztal I, Lourenco D, Legarra A. Current status of genomic evaluation. J Anim Sci. 2020;98:kaa101.

Misztal I, Steyn Y, Lourenco DAL. Genomic evaluation with multibreed and crossbred data. JDS Commun. 2022;3:156–9.

Legarra A, Garcia-Baccino CA, Wientjes YCJ, Vitezica ZG. The correlation of substitution effects across populations and generations in the presence of nonadditive functional gene action. Genetics. 2021;219:iyab138.

Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51:12–8.

Brand W, Wells AT, Smith SL, Denholm SJ, Wall E, Coffey MP. Predicting pregnancy status from mid-infrared spectroscopy in dairy cow milk using deep learning. J Dairy Sci. 2021;104:4980–90.

Robson JF, Denholm SJ, Coffey M. Automated Processing and phenotype extraction of Ovine Medical images using a combined generative Adversarial Network and Computer Vision Pipeline. Sensors. Volume 21. Multidisciplinary Digital Publishing Institute; 2021. p. 7268.

Gianola D, Fernando RL, Stella A. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006;173:1761–76.

Gianola D, van Kaam JBCHM. Reproducing Kernel Hilbert Spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008;178:2289–303.

Ogutu JO, Piepho H-P, Schulz-Streeck T. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proceedings. 2011;5:S11.

Okut H, Wu X-L, Rosa GJ, Bauck S, Woodward BW, Schnabel RD, et al. Predicting expected progeny difference for marbling score in Angus cattle using artificial neural networks and bayesian regression models. Genet Selection Evol. 2013;45:34.

Abdollahi-Arpanahi R, Gianola D, Peñagaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol. 2020;52:12.

Pook T, Freudenthal J, Korte A, Simianer H. Using Local Convolutional Neural Networks for Genomic Prediction. Frontiers in Genetics [Internet]. 2020 [cited 2023 Jan 4];11. Available from: https://doi.org/10.3389/fgene.2020.561497 .

Sun C, VanRaden P, O’Connell J, Weigel K, Gianola D. Mating programs including genomic relationships and dominance effects. J Dairy Sci. 2013;96:8014–23.

MacLeod IM, Bowman PJ, Vander Jagt CJ, Haile-Mariam M, Kemper KE, Chamberlain AJ, et al. Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits. BMC Genomics. 2016;17:144.

Mouresan EF, Selle M, Rönnegård L. Genomic prediction including SNP-specific variance predictors. G3: Genes, Genomes, Genetics. 2019;9:3333–43.

Xiang R, Fang L, Liu S, Liu GE, Tenesa A, Gao Y et al. Genetic score omics regression and multi-trait meta-analysis detect widespread cis-regulatory effects shaping bovine complex traits [Internet]. bioRxiv; 2022 [cited 2023 Jan 9]. p. 2022.07.13.499886. Available from: https://www.biorxiv.org/content/10.1101/2022.07.13.499886v1 .

Xiang R, van den Berg I, MacLeod IM, Hayes BJ, Prowse-Wilkins CP, Wang M, et al. Quantifying the contribution of sequence variants with regulatory and evolutionary significance to 34 bovine complex traits. Proc Natl Acad Sci USA. 2019;116:19398–408.

Ros-Freixedes R, Johnsson M, Whalen A, Chen C-Y, Valente BD, Herring WO, et al. Genomic prediction with whole-genome sequence data in intensely selected pig lines. Genet Selection Evol. 2022;54:65.

van Binsbergen R, Calus MPL, Bink MCAM, van Eeuwijk FA, Schrooten C, Veerkamp RF. Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genet Selection Evol. 2015;47:71.

van den Berg I, Bowman PJ, MacLeod IM, Hayes BJ, Wang T, Bolormaa S, et al. Multi-breed genomic prediction using Bayes R with sequence data and dropping variants with a small effect. Genet Sel Evol. 2017;49:70.

VanRaden PM, Tooker ME, O’Connell JR, Cole JB, Bickhart DM. Selecting sequence variants to improve genomic predictions for dairy cattle. Genet Selection Evol. 2017;49:1–12.

Fragomeni BO, Lourenco DA, Masuda Y, Legarra A, Misztal I. Incorporation of causative quantitative trait nucleotides in single-step GBLUP. Genet Selection Evol. 2017;49:59.

Coffey M. Dairy cows: in the age of the genotype, #phenotypeisking. Anim Front. 2020;10:19–22.

Johnsson M, Jungnickel MK. Evidence for and localization of proposed causative variants in cattle and pig genomes. Genet Sel Evol. 2021;53:67.

Clark EL, Archibald AL, Daetwyler HD, Groenen MA, Harrison PW, Houston RD, et al. From FAANG to fork: application of highly annotated genomes to improve farmed animal production. Genome Biol. 2020;21:285.

Tait-Burkard C, Doeschl-Wilson A, McGrew MJ, Archibald AL, Sang HM, Houston RD, et al. Livestock 2.0–genome editing for fitter, healthier, and more productive farmed animals. Genome Biol. 2018;19:204.

Wallace JG, Rodgers-Melnick E, Buckler ES. On the road to breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics. Annu Rev Genet. 2018;52:421–44.

Young AE, Mansour TA, McNabb BR, Owen JR, Trott JF, Brown CT et al. Genomic and phenotypic analyses of six offspring of a genome-edited hornless bull. Nat Biotechnol. 2019;1–8.

Burkard C, Opriessnig T, Mileham AJ, Stadejek T, Ait-Ali T, Lillico SG et al. Pigs Lacking the Scavenger Receptor Cysteine-Rich Domain 5 of CD163 Are Resistant to Porcine Reproductive and Respiratory Syndrome Virus 1 Infection. Gallagher T, editor. J Virol. 2018;92:e00415-18.

Medugorac I, Seichter D, Graf A, Russ I, Blum H, Göpel KH, et al. Bovine polledness–an autosomal dominant trait with allelic heterogeneity. PLoS ONE. 2012;7:e39477.

Forsberg CW, Phillips JP, Golovan SP, Fan MZ, Meidinger RG, Ajakaiye A, et al. The Enviropig physiology, performance, and contribution to nutrient management advances in a regulated environment: the leading edge of change in the pork industry. J Anim Sci. 2003;81:E68–77.

Hew CL, Fletcher GL. Transgenic salmonid fish expressing exogenous salmonid growth hormone [Internet]. 1996 [cited 2023 Jan 5]. Available from: https://patents.google.com/patent/US5545808A/en .

Petersen GEL, Buntjer JB, Hely FS, Byrne TJ, Doeschl-Wilson A. Modeling suggests gene editing combined with vaccination could eliminate a persistent disease in livestock. Proc Natl Acad Sci USA. 2022;119:e2107224119.

Johnsson M, Ros-Freixedes R, Gorjanc G, Campbell MA, Naswa S, Kelly K, et al. Sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs. Genet Selection Evol. 2018;50:69.

Bastiaansen JWM, Bovenhuis H, Groenen MAM, Megens H-J, Mulder HA. The impact of genome editing on the introduction of monogenic traits in livestock. Genet Selection Evol. 2018;50:18.

Cole JB. Management of Mendelian Traits in Breeding Programs by Gene Editing: A Simulation Study. bioRxiv. 2017;116459.

Mueller M, Cole J, Sonstegard T, Van Eenennaam A. Comparison of gene editing versus conventional breeding to introgress the POLLED allele into the US dairy cattle population. J Dairy Sci. 2019;102:4215–26.

Stearns FW. One hundred Years of Pleiotropy: a retrospective. Genetics. 2010;186:767–73.

Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–86.

Eriksson S, Jonas E, Rydhmer L, Röcklinsberg H. Invited review: breeding and ethical perspectives on genetically modified and genome edited cattle. J Dairy Sci. 2018;101:1–17.

Simianer H. Of cows and cars. J Anim Breed Genet. 2018;135:249–50.

Jenko J, Gorjanc G, Cleveland MA, Varshney RK, Whitelaw CBA, Woolliams JA, et al. Potential of promotion of alleles by genome editing to improve quantitative traits in livestock breeding programs. Genet Selection Evol. 2015;47:55.

Georges M, Massey JM. Velogenetics, or the synergistic use of marker assisted selection and germ-line manipulation. Theriogenology. 1991;35:151–9.

Goszczynski DE, Cheng H, Demyda-Peyrás S, Medrano JF, Wu J, Ross PJ. In vitro breeding: application of embryonic stem cells to animal production. Biol Reprod. 2019;100:885–95.

Johnsson M, Gaynor RC, Jenko J, Gorjanc G, de Koning D-J, Hickey JM. Removal of alleles by genome editing (RAGE) against deleterious load. Genet Selection Evol. 2019;51:14.

The author acknowledges the financial support from Formas, a Swedish Research Council for Sustainable Development (Dnr 2020-01637).

Open access funding provided by Swedish University of Agricultural Sciences.

Author information

Authors and affiliations.

Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, Uppsala, 75007, Sweden

Martin Johnsson

Contributions

MJ wrote the paper.

Corresponding author

Correspondence to Martin Johnsson .

Ethics declarations

Competing interests.

The author declares that he has no competing interests.

Ethics approval and consent to participate

Consent for publication, additional information, publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is based on a presentation at “Approaches to genetics for livestock research” at IASH, University of Edinburgh, May 2019.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Johnsson, M. Genomics in animal breeding from the perspectives of matrices and molecules. Hereditas 160, 20 (2023). https://doi.org/10.1186/s41065-023-00285-w

Received: 12 January 2023

Accepted: 03 May 2023

Published: 06 May 2023

DOI: https://doi.org/10.1186/s41065-023-00285-w

  • Animal breeding
  • Genomic selection
  • Gene editing

ISSN: 1601-5223

  • Open access
  • Published: 17 April 2024

Refining the impact of genetic evidence on clinical success

  • Eric Vallabh Minikel (ORCID: orcid.org/0000-0003-2206-1608),
  • Jeffery L. Painter (ORCID: orcid.org/0000-0001-9651-9904),
  • Coco Chengliang Dong &
  • Matthew R. Nelson (ORCID: orcid.org/0000-0001-5089-5867)

Nature (2024)

  • Drug development
  • Genetic predisposition to disease
  • Genome-wide association studies
  • Target validation

The cost of drug discovery and development is driven primarily by failure 1 , with only about 10% of clinical programmes eventually receiving approval 2 , 3 , 4 . We previously estimated that human genetic evidence doubles the success rate from clinical development to approval 5 . In this study we leverage the growth in genetic evidence over the past decade to better understand the characteristics that distinguish clinical success and failure. We estimate the probability of success for drug mechanisms with genetic support is 2.6 times greater than those without. This relative success varies among therapy areas and development phases, and improves with increasing confidence in the causal gene, but is largely unaffected by genetic effect size, minor allele frequency or year of discovery. These results indicate we are far from reaching peak genetic insights to aid the discovery of targets for more effective drugs.

Human genetics is one of the only forms of scientific evidence that can demonstrate the causal role of genes in human disease. It provides a crucial tool for identifying and prioritizing potential drug targets, providing insights into the expected effect (or lack thereof 6 ) of pharmacological engagement, dose–response relationships 7 , 8 , 9 , 10 and safety risks 6 , 11 , 12 , 13 . Nonetheless, many questions remain about the application of human genetics in drug discovery. Genome-wide association studies (GWASs) of common, complex traits, including many diseases, generally identify variants of small effect. This contributed to early scepticism of the value of GWASs 14 . Anecdotally, such variants can point to highly successful drug targets 7 , 8 , 9 , and yet, genetic support from GWASs is somewhat less predictive of drug target advancement than support from Mendelian diseases 5 , 15 .

In this paper we investigate several open questions regarding the use of genetic evidence for prioritizing drug discovery. We explore the characteristics of genetic associations that are more likely to differentiate successful from unsuccessful drug mechanisms, exploring how they differ across therapy areas and among discovery and development phases. We also investigate how close we may be to saturating the insights we can gain from genetic studies for drug discovery and how much of the genetically supported drug discovery space remains clinically unexplored.

To characterize the drug development pipeline, we filtered Citeline Pharmaprojects for monotherapy programmes added since 2000 annotated with a highest phase reached and assigned both a human gene target (usually the gene encoding the drug target protein) and an indication defined in Medical Subject Headings (MeSH) ontology. This resulted in 29,476 target–indication (T–I) pairs for analysis (Extended Data Fig. 1a ). Multiple sources of human genetic associations totalled 81,939 unique gene–trait (G–T) pairs, with traits also mapped to MeSH terms. Intersection of these datasets yielded an overlap of 2,166 T–I and G–T pairs (7.3%) for which the indication and the trait MeSH terms had a similarity ≥0.8; we defined these T–I pairs as possessing genetic support (Extended Data Figs. 1b and 2a and  Methods ). The probability of having genetic support, or P(G), was higher for launched T–I pairs than those in historical or active clinical development (Fig. 1a ). In each phase, P(G) was higher than previously reported 5 , 15 , owing, as expected 15 , 16 , more to new G–T discoveries than to changes in drug pipeline composition (Extended Data Fig. 3a–f ). For ensuing analyses, we considered both historical and active programmes. We defined success at each phase as a T–I pair transitioning to the next development phase (for example, from phase I to II), and we also considered overall success—advancing from phase I to a launched drug. We defined relative success (RS) as the ratio of the probability of success, P(S), with genetic support to the probability of success without genetic support ( Methods ). We tested the sensitivity of RS to various characteristics of genetic evidence. RS was sensitive to the indication–trait similarity threshold (Extended Data Fig. 2a ), which we set to 0.8 for all analyses herein. RS was >2 for all sources of human genetic evidence examined (Fig. 1b ). 
RS was highest for Online Mendelian Inheritance in Man (OMIM) (RS = 3.7), in agreement with previous reports 5 , 15 ; this was not the result of a higher success rate for orphan drug programmes (Extended Data Fig. 2b ), a designation commonly acquired for rare diseases. Rather, it may owe partly to the difference in confidence in causal gene assignment between Mendelian conditions and GWASs, supported by the observation that the RS for Open Targets Genetics (OTG) associations was sensitive to the confidence in variant-to-gene mapping as reflected in the minimum share of locus-to-gene (L2G) score (Fig. 1c ). The differences common and rare disease programmes face in regulatory and reimbursement environments 4 and differing proportions of drug modalities 9 probably contribute as well. OMIM and GWAS support were synergistic with one another (Supplementary Fig. 2b ). Somatic evidence from IntOGen had an RS of 2.3 in oncology (Extended Data Fig. 2c ), similar to GWASs, but analyses below are limited to germline genetic evidence unless otherwise noted.

Figure 1

a , Proportion of T–I pairs with genetic support, P(G), as a function of highest phase reached. n at right: denominator, number of T–I pairs per phase; numerator, number that are genetically supported. b , Sensitivity of phase I–launch RS to source of human genetic association. GWAS Catalog, Neale UKBB and FinnGen are subsets of OTG. n at right: denominator, number of T–I pairs with genetic support from each source; numerator, number of those launched. Note that RS is calculated from a 2 × 2 contingency table ( Methods ). Total n  = 13,022 T–I pairs. c , Sensitivity of RS to L2G share threshold among OTG associations. Minimum L2G share threshold is varied from 0.1 to 1.0 in increments of 0.05 (labels); RS ( y axis) is plotted against the number of clinical (phase I+) programmes with genetic support from OTG ( x axis). d , Sensitivity of RS for OTG GWAS-supported T–I pairs to binned variables: (1) year that T–I pair first acquired human genetic support from GWASs, excluding replications and excluding T–I pairs otherwise supported by OMIM; (2) number of genes exhibiting genetic association to the same trait; (3) quartile of effect size (beta) for quantitative traits; (4) quartile of effect size (odds ratio, OR) for case/control traits standardized to be >1 (that is, 1/OR if <1); (5) order of magnitude of minor allele frequency bins. n at right as in b . Total n  = 13,022 T–I pairs. e , Count of indications ever developed in Pharmaprojects ( y axis) by the number of genes associated with traits similar to those indications ( x axis). Throughout, error bars or shaded areas represent 95% CIs (Wilson for P(G) and Katz for RS) whereas centres represent point estimates. See Supplementary Fig. 1 for the same analyses restricted to drugs with a single known target.

As sample sizes grow ever larger, with a corresponding increase in the number of unique G–T associations, some expect (ref. 17) that GWAS findings will become less useful for drug target selection. We explored this in several ways. We investigated the year that genetic support for a T–I pair was first discovered, under the expectation that more common and larger effects are discovered earlier. Although there was a slightly higher RS for discoveries from 2007–2010, largely driven by early lipid and cardiovascular-related associations, the effect of year was overall non-significant (P = 0.46; Fig. 1d). Results were similar when replicate associations or OMIM discoveries were included (Extended Data Fig. 2d–f). We next divided up GWAS-supported drug programmes by the number of unique genes associated with each trait. RS nominally increased with the number of associated genes, by 0.048 per gene (P = 0.024; Fig. 1d). The reason is probably not that successful genetically supported programmes inspire other programmes, because most genetic support was discovered retrospectively (Extended Data Fig. 2g); the few examples of drug programmes prospectively motivated by genetic evidence were primarily for Mendelian diseases 9. There were no statistically significant associations with estimated effect sizes (P = 0.90 and 0.57, for quantitative and binary traits, respectively; Fig. 1d and Extended Data Fig. 2h) or minor allele frequency (P = 0.26; Fig. 1d). That ever larger GWASs can continue to uncover support for successful targets is also illustrated by two recent large GWASs in type 2 diabetes (T2D) 18, 19 (Extended Data Fig. 4).

Previously 5 , we observed significant heterogeneity among therapy areas in the fraction of approved drug mechanisms with genetic support, but did not investigate the impact on probability of success 5 . Here, our estimates of RS from phase I to launch showed significant heterogeneity ( P  < 1.0 × 10 −15 ), with nearly all therapy areas having estimates greater than 1; 11 of 17 were >2, and haematology, metabolic, respiratory and endocrine >3 (Fig. 2a–e ). In most therapy areas, the impact of genetic evidence was most pronounced in phases II and III and least impactful in phase I, corresponding to capacity to demonstrate clinical efficacy in later development phases. Accordingly, therapy areas differed in P(G) and in whether P(G) increased throughout clinical development or only at launch (Extended Data Fig. 5 ); data source and other properties of genetic evidence including year of discovery and effect size also differed (Extended Data Fig. 6 ). We also found that genetic evidence differentiated likelihood to progress from preclinical to clinical development for metabolic diseases (RS = 1.38; 95% confidence interval (95% CI), 1.25 to 1.54), which may reflect preclinical models that are more predictive of clinical outcomes. P(G) by therapy area was correlated with P(S) ( ρ  = 0.59, P  = 0.013) and with RS ( ρ  = 0.72, P  = 0.0011; Extended Data Fig. 7 ), which led us to explore how the sheer quantity of genetic evidence available within therapy areas (Fig. 2f and Extended Data Fig. 8a ) may influence this. We found that therapy areas with more possible gene–indication (G–I) pairs supported by genetic evidence had significantly higher RS ( ρ  = 0.71, P  = 0.0010; Fig. 2g ), although respiratory and endocrine were notable outliers with high RS despite fewer associations.

Figure 2

a – e , RS by therapy area and phase transitions: preclinical to phase I ( a ), phase I to II ( b ), phase II to III ( c ), phase III to launch ( d ) and phase I to launch ( e ). n at right: denominator, T–I pairs with genetic support; numerator, number of those that succeeded in the phase transition indicated at the top of the panel. For ‘all’, total n  = 22,638 preclinical, 13,022 reaching at least phase I, 7,223 reaching at least phase II and 2,184 reaching at least phase III. Total n for each therapy area is provided in Supplementary Table 27 . f , Cumulative number of possible genetically supported G–I pairs in each therapy ( y axis) as genetic discoveries have accrued over time ( x axis). g , RS ( y axis) by number of possible supported G–I pairs ( x axis) across therapy areas, with dots coloured as in panels a – e and sized according to number of genetically supported T–I pairs in at least phase I. h , Number of launched indications versus similarity of those indications, by approved drug target. i , Proportion of launched T–I pairs with genetic support, P(G), binned by quintile of the number of launched indications per target (top panel) or by mean similarity among launched indications (bottom panel). Targets with exactly 1 launched indication (6.2% of launched T–I pairs) are considered to have mean similarity of 1.0. n at right: denominator, total number of launched T–I pairs in each bin; numerator, number of those with genetic support. j , RS ( y axis) versus mean similarity among launched indications per target ( x axis) by therapy area. k , RS ( y axis) versus mean count of launched indications per target ( x axis). Throughout, error bars or shaded areas represent 95% CIs (Wilson for P(G) and Katz for RS) whereas centres represent point estimates. See Supplementary Fig. 2 for the same analyses restricted to drugs with a single known target.

We hypothesized that genetic support might be most pronounced for drug mechanisms with disease-modifying effects, as opposed to those that manage symptoms, and that the proportions of such drugs differ by therapy area 20 , 21 . We were unable to find data with these descriptions available for a sufficient number of drug mechanisms to analyse, but we reasoned that targets of disease-modifying drugs are more likely to be specific to a disease, whereas targets of symptom-managing drugs are more likely to be applied across many indications. We therefore examined the number and diversity of all-time launched indications per target. Launched T–I pairs are heavily skewed towards a few targets (Fig. 2h ). Of 450 launched targets, the 42 with ≥10 launched indications comprise 713 (39%) of 1,806 launched T–I pairs (Fig. 2h ). Many of these are used across diverse indications for management of symptoms such as inflammatory and immune responses ( NR3C1 , IFNAR2 ), pain ( PTGS2 , OPRM1 ), mood ( SLC6A4 ) or parasympathetic response ( CHRM3 ). The count of launched indications was inversely correlated with the mean similarity of those indications ( ρ  = −0.72, P  = 4.4 × 10 −84 ; Fig. 2h ). Among T–I pairs, the probability of having genetic support increased as the number of launched indications decreased ( P  = 6.3 × 10 −7 ) and as the similarity of a target’s launched indications increased ( P  = 1.8 × 10 −5 ; Fig. 2i ). We observed a corresponding impact on RS, increasing in therapy areas for which the similarity among launched indications increased, and decreasing with increasing indications per target ( ρ  = 0.74, P  = 0.0010, and ρ  = −0.62, P  = 0.0080, respectively; Fig. 2j,k ).

Only 4.8% (284 of 5,968) of T–I pairs active in phases I–III possess human germline genetic support (Fig. 1a ), similar to T–I pairs no longer in development (4.2%, 560 of 13,355), a difference that was not statistically significant ( P  = 0.080). We estimated ( Methods ) that only 1.1% of all genetically supported G–I relationships have been explored clinically (Fig. 3a ), or 2.1% when restricting to the most similar indication. Given that the vast majority of proteins are classically ‘undruggable’, we explored the proportion of genetically supported G–I pairs that had been developed to at least phase I, as a function of therapy area across several classes of tractability and relevant protein families 22 (Fig. 3a ). Within therapy areas, oncology kinases with germline evidence were the most saturated: 109 of 250 (44%) of all genetically supported G–I pairs had reached at least phase I; GPCRs for psychiatric indications were also notable (14 of 53, 26%). Grouping by target rather than G–I pair, 3.6% of genetically supported targets have been pursued for any genetically supported indication (Extended Data Fig. 8 ). Of possible genetically supported G–I pairs, most (68%) arose from OTG associations, mostly in the past 5 years (Fig. 2f ). Such low use is partly due to recent emergence of most genetic evidence (Extended Data Figs. 2f,g and 7a ), as drug programmes prospectively supported by human genetics have had a mean lag time from genetic association of 13 years to first trial 21 and 21 years to approval 9 . Because some types of targets may be more readily tractable by antagonists than agonists, we also grouped by target and examined human genetic evidence by direction of effect for tumour suppressors versus oncogenes (Fig. 3b ), identifying a few substrata for which a majority of genetically supported targets had been pursued to at least phase I for at least one genetically supported indication. 
Oncogene kinases received the most attention, with 19 of 25 (76%) reaching phase I.
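The proportions of active and historical T–I pairs with genetic support quoted above, together with the Wilson 95% intervals that the figure legends report for P(G), can be reproduced from the counts in the text. The following is a minimal sketch only: the function name and the normal quantile z = 1.96 are assumptions of this example, not taken from the authors' code.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Counts quoted in the text: genetically supported / all T-I pairs.
for label, supported, total in [("active", 284, 5968), ("historical", 560, 13355)]:
    lower, upper = wilson_interval(supported, total)
    print(f"{label}: P(G) = {supported / total:.1%} "
          f"(95% CI {lower:.1%} to {upper:.1%})")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves better for proportions near zero, which is the regime of P(G).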

Figure 3

a , Heatmap of proportion of genetically supported T–I pairs that have been developed to at least phase I, by therapy area ( y axis) and gene list ( x axis). b , As panel a , but for genetic support from IntOGen rather than germline sources and grouped by the direction of effect of the gene according to IntOGen ( y axis), and also grouped by target rather than T–I pair. Thus, the denominator for each cell is the number of targets with at least one genetically supported indication, and each target counts towards the numerator if at least one genetically supported indication has reached phase I. c , Of targets that have reached phase I for any indication, and have at least one genetically supported indication, the mean count ( x axis) of genetically supported (left) and unsupported (right) indications pursued, binned by the number of possible genetically supported indications ( y axis). The centre is the mean and bars are Wilson 95% CIs. n  = 1,147 targets. d , Proportion of D–I pairs with genetic support, P(G) ( x axis), as a function of each D–I pair’s phase reached (inner y -axis grouping) and the drug’s highest phase reached for any indication (outer y -axis grouping). The centre is the exact proportion and bars are Wilson 95% CIs. The n is indicated at the right, for which the denominator is the total number of D–I pairs in each bin, and the numerator is the number of those that are genetically supported. See Supplementary Fig. 3 for the same analyses restricted to drugs with a single known target. Ab, antibody; SM, small molecule.

To focus on demonstrably druggable proteins, we further restricted the analysis to targets with both (1) any programme reaching phase I, and (2) ≥1 genetically supported indications. Of 1,147 qualifying targets, only 373 (33%) had been pursued for one or more supported indications (Fig. 3c), and most (307, 27%) of these targets were pursued for indications both with and without genetic support. Overall, an overwhelming majority of development effort has been for unsupported indications, at a 17:1 ratio. Within this subset of targets, we asked whether genetic support was predictive of which indications would advance the furthest. Grouping active and historical programmes by drug–indication (D–I) pair, we found that the odds of advancing to a later stage in the pipeline are 82% higher for indications with genetic support (P = 8.6 × 10⁻⁷³; Fig. 3d).

Although there has been anecdotal support—such as the HMGCR example—to argue that genetic effect size may not matter in prioritizing drug targets, here we provide systematic evidence that small effect size, recent year of discovery, increasing number of genes identified or higher associated allele frequency do not diminish the value of GWAS evidence to differentiate clinical success rates. One reason for this is probably because genetic effect size on a phenotype rarely accounts for the magnitude of genetic effect on gene expression, protein function or some other molecular intermediate. In some circumstances, genetic effect sizes can yield insights into anticipated drug effects. This is best illustrated for cardiovascular disease therapies, for which genetic effects on cholesterol and disease risk and treatment outcomes are correlated 23 . A limitation is that, other than Genebass, we did not include whole exome or whole genome sequencing association studies, which may be more likely to pinpoint causal variants. Moreover, all of our analyses are naive to direction of genetic effect (gain versus loss of gene function) as this is unknown or unannotated in most datasets used here.

Our results argue for continuing investment to expand GWAS-like evidence, particularly for many complex diseases with treatment options that fail to modify disease. Although genetic evidence has value across most therapy areas, its benefit is more pronounced in some areas than others. Furthermore, it is possible that the therapy areas for which genetic evidence had a lower impact have seen more focus on symptom management. If so, we would predict that for drugs aimed at disease modification, human genetics should ultimately prove highly valuable across therapy areas.

The focus of this work has been on the RS of drug programmes with and without genetic evidence, limited to drug mechanisms that have entered clinical development. This metric does not address the probability that a gene associated with a disease, if targeted, will yield a successful drug. At the early stage of target selection, is evidence of a large loss-of-function effect in one gene usually a better choice than a small non-coding single nucleotide polymorphism (SNP) effect on the same phenotype in another? We explored this question for T2D studies referenced above. When these GWASs quadrupled the number of T2D-associated genes from 217 to 862, new genetic support was identified for 7 of 95 mechanisms in clinical development whereas the number supported increased from 5 to 7 of 12 launched drug mechanisms. Thus, RS has remained high in light of new GWAS data. One can also, however, consider the proportion of genetic associations that are successful drug targets. Of the 7 targets of launched drugs with genetic evidence, 4 had Mendelian evidence (in addition to pre-2020 GWAS evidence), out of a total of 19 Mendelian genes related to T2D (21%). One launched T2D target had only GWAS (and no Mendelian) evidence among 217 GWAS-associated genes before 2020 (0.46%), whereas 2 launched targets were among 645 new GWAS associations since 2020 (0.31%). At least in this example, the ‘yield’ of genetic evidence for successful drug mechanisms was greatest for genes with Mendelian effects, but similar between earlier and later GWASs. Clearly, just because genetic associations differentiate clinical stage drug targets from launched ones, does not mean that a large fraction of associations will be fruitful. Moreover, genetically supported targets may be more likely to require upregulation, to be druggable only by more challenging modalities 4 , 9 or to enjoy narrower use across indications. 
More work is required to better understand the challenges of target identification and prioritization given the genetic evidence precondition.
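The 'yield' percentages in the T2D example above follow directly from the counts given in the text. A short sketch of the arithmetic, with illustrative variable names that are not from the paper:

```python
# Counts quoted in the text for the type 2 diabetes (T2D) example.
# Each entry is (launched targets, genetically associated genes).
yield_counts = {
    "Mendelian genes": (4, 19),
    "GWAS genes before 2020": (1, 217),
    "GWAS genes since 2020": (2, 862 - 217),  # 645 new associations
}

def yield_percent(launched, total):
    """Percentage of associated genes that are targets of launched drugs."""
    return 100.0 * launched / total

for label, (launched, total) in yield_counts.items():
    print(f"{label}: {launched}/{total} = {yield_percent(launched, total):.2f}%")
```

The yield for Mendelian genes is roughly two orders of magnitude higher than for either GWAS group, while the two GWAS groups are of similar size, which is the comparison the paragraph draws.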

The utility of human genetic evidence in drug discovery has had firm theoretical and empirical footing for several years 5 , 7 , 15 . If the benefit of this evidence were cancelled out by competitive crowding 24 , then currently active clinical phases should have higher rates of genetic support than their corresponding historical phases, and might look similar to, or even higher than, launched pairs. Instead, we find that active programmes possess genetic support only slightly more often than historical programmes and remain less enriched for genetic support than launched drugs. Meanwhile, only a tiny fraction of classically druggable genetically supported G–I pairs have been pursued even among targets with clinical development reported. Human genetics thus represents a growing opportunity for novel target selection and improving indication selection for existing drugs and drug candidates. Increasing emphasis on drug mechanisms with supporting genetic evidence is expected to increase success rates and lower the cost of drug discovery and development.

Definition of metrics

Except where otherwise noted, we define genetic support of a drug mechanism (that is, a T–I pair) as a genetic association mapped to the corresponding target gene for a trait that is ≥0.8 similar to the indication (see MeSH term similarity below). We defined P(G) as the proportion of drug mechanisms satisfying the above definition of genetic support. P(S) is the proportion of programmes in one phase that advance to a subsequent phase (for instance, phase I to phase II). Overall P(S) from phase I to launched is the product of P(S) at each individual phase. RS is the ratio of P(S) for programmes with genetic support to P(S) for programmes lacking genetic support, which is equivalent to a relative risk or risk ratio. Thus, if $N$ denotes the total number of programmes that have reached the reference phase, $X$ denotes the number of those that advance to a later phase of interest, and the subscripts $G$ and $!G$ indicate the presence or absence of genetic support, then $P(G) = N_G/(N_G + N_{!G})$; $P(S) = (X_G + X_{!G})/(N_G + N_{!G})$; $\mathrm{RS} = (X_G/N_G)/(X_{!G}/N_{!G})$. RS from phase I to launched is the product of RS at each individual phase. The count of ‘programmes’ for X and N is T–I pairs throughout, except for Fig. 3d, which uses D–I pairs to specifically interrogate P(G) for which the same drug has been developed for different indications. For clarity, we note that whereas other recent studies 22, 25 have examined the fold enrichment and overlap between genes with human genetic support and genes encoding a drug target, without regard to similarity, herein all of our analyses are conditioned on the similarity between the drug’s indication and the genetically associated trait.
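The definitions above translate directly into a few lines of code. The sketch below computes P(G), P(S) and RS from the four counts of a 2 × 2 table, and adds a confidence interval for RS using the standard Katz log method for a risk ratio (the figure legends state that Katz intervals were used, but the exact implementation here is an assumption). Function names and the example counts are hypothetical.

```python
import math

def p_genetic_support(n_g, n_ng):
    """P(G): share of programmes at the reference phase with genetic support."""
    return n_g / (n_g + n_ng)

def p_success(x_g, n_g, x_ng, n_ng):
    """P(S): share of all programmes that advance to the later phase."""
    return (x_g + x_ng) / (n_g + n_ng)

def relative_success(x_g, n_g, x_ng, n_ng, z=1.96):
    """Return (RS, lower, upper) with a Katz log-method interval.

    x_g, n_g   : advanced / total programmes with genetic support
    x_ng, n_ng : advanced / total programmes without genetic support
    """
    rs = (x_g / n_g) / (x_ng / n_ng)
    se_log = math.sqrt(1 / x_g - 1 / n_g + 1 / x_ng - 1 / n_ng)
    lower = math.exp(math.log(rs) - z * se_log)
    upper = math.exp(math.log(rs) + z * se_log)
    return rs, lower, upper

# Hypothetical counts, chosen only to exercise the functions.
rs, lower, upper = relative_success(x_g=30, n_g=100, x_ng=100, n_ng=1000)
print(f"P(G) = {p_genetic_support(100, 1000):.3f}")
print(f"P(S) = {p_success(30, 100, 100, 1000):.3f}")
print(f"RS = {rs:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because RS is a ratio of proportions, the interval is built on the log scale and is therefore asymmetric around the point estimate.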

Drug development pipeline

Citeline Pharmaprojects 26 is a curated database of drug development programmes including preclinical, all clinical phases and launched (approved and marketed) drugs. It was queried via API (22 December 2022) to obtain information on drugs, targets, indications, phases reached and current development status. T–I pair was the unit of analysis throughout, except where otherwise indicated in the text (D–I pairs were examined in Fig. 3d). Current development status was defined as ‘active’ if the T–I pair had at least one drug still in active development, and ‘historical’ if development of all drugs for the T–I pair had ceased. Targets were defined as genes; as most drugs do not directly target DNA, this usually refers to the gene encoding the protein target that is bound or modulated by the drug. We removed combination therapies, diagnostic indications and programmes with no human target or no indication assigned. For most analyses, only programmes added to the database since 2000 were included, whereas for the count and similarity of launched indications per target, we used all launches for all time. Indications were considered to possess ‘genetic insight’ (meaning that the human genetics of this trait or similar traits have been successfully studied) if they had ≥0.8 similarity to (1) an OMIM or IntOGen disease, or (2) a GWAS trait with at least 3 independently associated loci, on the basis of lead SNP positions rounded to the nearest 1 megabase. For calculating RS, we used the number of T–I pairs with genetic insight as the denominator. The rationale for this choice is to focus on indications for which there exists the opportunity for human genetic evidence, consistent with the filter applied previously 5. However, we observe that our findings are not especially sensitive to the presence of this filter, with RS decreasing by just 0.17 when the filter is removed (Extended Data Fig. 3g,h).
Note that the criteria for determining genetic insight are distinct from, and much looser than, the criteria for mapping GWAS hits to genes (see L2G scores under OTG below). Many drugs had more than one target assigned, in which case all targets were retained for T–I pair analyses. As a sensitivity test, running our analyses restricted to only drugs with exactly one target assigned yielded very similar results ( Supplementary Figures ).
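The GWAS arm of the 'genetic insight' criterion described above (at least 3 independently associated loci, with lead SNP positions rounded to the nearest megabase) can be sketched as follows. The input format, function names and example positions are hypothetical; the paper does not specify how ties at exactly half a megabase were rounded.

```python
def count_independent_loci(lead_snps):
    """Count distinct loci after binning lead SNPs to the nearest megabase.

    lead_snps: iterable of (chromosome, base_pair_position) tuples.
    Note: Python's round() sends exact .5 values to the nearest even
    integer; this tie rule is an assumption of the sketch.
    """
    bins = {(chrom, round(pos / 1_000_000)) for chrom, pos in lead_snps}
    return len(bins)

def has_gwas_insight(lead_snps, min_loci=3):
    """True if the trait has at least `min_loci` distinct megabase bins."""
    return count_independent_loci(lead_snps) >= min_loci

# Hypothetical lead SNPs for one trait; the first two share a bin.
example = [("1", 1_200_000), ("1", 1_350_000), ("2", 45_700_000), ("7", 9_900_000)]
print(count_independent_loci(example), has_gwas_insight(example))
```

Binning by rounded position is a deliberately loose notion of independence, consistent with the text's remark that these criteria are much looser than those used to map GWAS hits to genes.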

OMIM is a curated database of Mendelian gene–disease associations. The OMIM Gene Map (downloaded 21 September 2023) contained 8,671 unique gene–phenotype links. We restricted to entries with phenotype mapping code 3 (‘the molecular basis for the disorder is known; a mutation has been found in the gene’), removed phenotypes with no MIM number or no gene symbol assigned, and removed duplicate combinations of gene MIM and phenotype MIM. We used regular expression matching to further filter out phenotypes containing the terms ‘somatic’, ‘susceptibility’ or ‘response’ (drug response associations) and those flagged as questionable (‘?’), or representing non-disease phenotypes (‘[’). A set of OMIM phenotypes are flagged as denoting susceptibility rather than causation (‘{’); this category includes low-penetrance or high allele frequency association assertions that we wished to exclude, but also germline heterozygous loss-of-function mutations in tumour suppressor genes, for which the underlying mechanism of disease initiation is loss of heterozygosity, which we wished to include. We therefore also filtered out phenotypes containing ‘{’ except for those that did contain the terms ‘cancer’, ‘neoplasm’, ‘tumor’ or ‘malignant’ and did not contain the term ‘somatic’. Remaining entries present in OMIM as of 2021 were further evaluated for validity by two curators, and gene–disease combinations for which a disease association was deemed not to have been established were excluded from all analyses. All of the above filters left 5,670 unique G–T links. MeSH terms for OMIM phenotypes were then mapped using the EFO OWL database using an approach previously described 27 , with further mappings from Orphanet, full text matches to the full MeSH vocabulary and, finally, manual curation, for a cumulative mapping rate of 93% (5,297 of 5,670). Because sometimes distinct phenotype MIM numbers mapped to the same MeSH term, this yielded 4,510 unique gene–MeSH links.

OTG is a database of GWAS hits from published studies and biobanks. OTG version 8 (12 October 2022) variant-to-disease, L2G, variant index and study index data were downloaded from EBI. Traits with multiple EFO IDs were excluded as these generally represent conditional, epistasis or other complex phenotypes that would lack mappings in the MeSH vocabulary. Of the top 100 traits with the greatest number of genes mapped, we excluded 76 as having no clear disease relevance (for example, ‘red cell distribution width’) or no obvious marginal value (for example, we excluded ‘trunk predicted mass’ because ‘body mass index’ was already included). Remaining traits were mapped to MeSH using the EFO OWL database, full text queries to the MeSH API, mappings already manually curated in PICCOLO (see below) or new manual curation. In total, 25,124 of 49,599 unique traits (51%) were successfully mapped to a MeSH ID. We included associations with P < 5 × 10⁻⁸. OTG L2G scores used for gene mapping are based on a machine learning model trained on gold standard causal genes 28; inputs to that model include distance, functional annotations, expression quantitative trait loci (eQTLs) and chromatin interactions. Note that we do not use Mendelian randomization 29 to map causal genes, and even gene mappings with high L2G scores are necessarily imperfect. OTG provides an L2G score for the triplet of each study or trait with each hit and each possible causal gene. We defined L2G share as the proportion of the total L2G score assigned to each gene among all potentially causal genes for that trait–hit combination. In sensitivity analyses we considered L2G share thresholds from 10% to 100% (Fig. 1c and Extended Data Fig. 3a), but main analyses used only genes with ≥50% L2G share (which are also the top-ranked genes for their respective associations).
OTG links were parsed to determine the source of each OTG data point: the EBI GWAS catalog 30 ( n  = 136,503 hits with L2G share ≥0.5), Neale UK Biobank ( http://www.nealelab.is/uk-biobank ; n  = 19,139), FinnGen R6 (ref.  31 ) ( n  = 2,338) or SAIGE ( n  = 1,229).
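The L2G share definition above can be illustrated with a short sketch. The row layout, function names and example scores are assumptions made for illustration; they are not the OTG schema.

```python
from collections import defaultdict


def l2g_share(rows):
    """Compute L2G share per (study, hit, gene).

    rows: iterable of (study_id, hit_id, gene, l2g_score) tuples
    (a hypothetical layout, not the OTG file format).
    """
    rows = list(rows)
    totals = defaultdict(float)
    for study, hit, _gene, score in rows:
        totals[(study, hit)] += score
    shares = {}
    for study, hit, gene, score in rows:
        total = totals[(study, hit)]
        shares[(study, hit, gene)] = score / total if total > 0 else 0.0
    return shares


def genes_passing(shares, threshold=0.5):
    """Keep study-hit-gene triplets whose share meets the threshold."""
    return {key for key, share in shares.items() if share >= threshold}


# Hypothetical scores for one trait-hit combination.
example = [
    ("study1", "hit1", "GENE_A", 0.6),
    ("study1", "hit1", "GENE_B", 0.2),
    ("study1", "hit1", "GENE_C", 0.2),
]
example_shares = l2g_share(example)
example_kept = genes_passing(example_shares)
```

With a threshold of 0.5 at most two genes per hit can pass, and two only in the case of an exact tie, which is consistent with the statement that retained genes are also the top-ranked ones.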

PICCOLO 32 is a database of GWAS hits with gene mapping based on tests for colocalization without full summary statistics by using Probabilistic Identification of Causal SNPs (PICS) and a reference dataset of SNP linkage disequilibrium values. As described 32, gene mapping uses quantitative trait locus (QTL) data from GTEx (n = 7,162) and a variety of other published sources (n = 6,552). We included hits with GWAS P < 5 × 10⁻⁸, eQTL P < 1 × 10⁻⁵ and posterior probability H4 ≥ 0.9, as these thresholds were determined empirically 32 to strongly predict colocalization results.

Genebass 33 is a database of genetic associations based on exome sequencing. Genebass data from 394,841 UK Biobank participants (the ‘500K’ release) were queried using Hail (19 October 2023). We used hits from four models: pLoF (predicted loss-of-function) or missense|LC (missense and low confidence LoF), each with sequencing kernel association test (SKAT) or burden tests, filtering for P < 1 × 10⁻⁵. Because the traits in Genebass are from UK Biobank, which is included in OTG, we used the OTG MeSH mappings established above.

IntOGen is a database of enrichments of somatic genetic mutations within cancer types. We used the driver genes and cohort information tables (31 May 2023). IntOGen assigns each gene a mechanism in each tumour type; occasionally, a gene will be classified as a tumour suppressor in one type and an oncogene in another. We grouped by gene and assigned each gene its modal classification across cancers. MeSH mappings were curated manually.

MeSH term similarity

MeSH terms in either Pharmaprojects or the genetic associations datasets that were Supplementary Concept Records (IDs beginning in ‘C’) were mapped to their respective preferred main headings (IDs beginning in ‘D’). A matrix of all possible combinations of drug indication MeSH IDs and genetic association MeSH IDs was constructed. MeSH term Lin and Resnik similarities were computed for each pair as described 34 , 35 . Similarities of −1, indicating infinite distance between two concepts, were assigned as 0. The two scores were regressed against each other across all term pairs, and the Resnik scores were adjusted by a multiplier such that both scores had a range from 0 to 1 and their regression had a slope of 1. The two scores were then averaged to obtain a combined similarity score. Similarity scores were successfully calculated for 1,006 of 1,013 (99.3%) unique MeSH terms for Pharmaprojects indications, corresponding to 99.67% of Pharmaprojects T–I pairs, and for 2,260 of 2,262 (99.9%) unique MeSH terms for genetic associations, corresponding to >99.9% of associations.
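One possible reading of the score-combination step is sketched below. The text does not specify the regression form, so the through-origin least-squares fit, the capping of the rescaled Resnik score at 1 and the function name are all assumptions rather than the authors' procedure.

```python
def combine_similarity(lin, resnik):
    """Combine Lin and Resnik scores into one similarity per term pair.

    Assumptions (not stated in the source): the multiplier is the slope
    of a no-intercept least-squares fit of Lin on Resnik, and rescaled
    Resnik scores are capped at 1.
    """
    # Scores of -1 (infinite distance) are assigned 0, as in the text.
    lin = [0.0 if x == -1 else x for x in lin]
    resnik = [0.0 if x == -1 else x for x in resnik]

    # Through-origin slope: after multiplying Resnik by this value,
    # a no-intercept regression of Lin on rescaled Resnik has slope 1.
    sxy = sum(l * r for l, r in zip(lin, resnik))
    sxx = sum(r * r for r in resnik)
    multiplier = sxy / sxx if sxx > 0 else 0.0

    scaled = [min(multiplier * r, 1.0) for r in resnik]
    # Average the two scores to obtain the combined similarity.
    return [(l + s) / 2 for l, s in zip(lin, scaled)]


# Hypothetical scores for three term pairs.
combined = combine_similarity([-1, 0.5, 1.0], [-1, 2.0, 4.0])
```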

Therapeutic areas

MeSH terms for Pharmaprojects indications were mapped onto 16 top-level headings under the Diseases [C] and Psychiatry and Psychology [F] branches of the MeSH tree ( https://meshb.nlm.nih.gov/treeView ), plus an ‘other’. The signs/symptoms area corresponds to C23 Pathological Conditions, Signs and Symptoms and contains entries such as inflammation and pain. Many MeSH terms map to >1 tree positions; these multiples were retained and counted towards each therapy area, except for the following conditions: for terms mapped to oncology, we deleted their mappings to all other areas; and ‘other’ was used only for terms that mapped to no other areas.
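The two exceptions to multiple-area counting can be written as a small rule. The area labels and function name here are hypothetical placeholders.

```python
def assign_therapy_areas(mapped_areas):
    """Apply the oncology and 'other' rules to a term's mapped areas.

    mapped_areas: iterable of area labels derived from the term's
    MeSH tree positions (labels are hypothetical placeholders).
    """
    areas = set(mapped_areas)
    # Terms mapped to oncology lose their mappings to all other areas.
    if "oncology" in areas:
        return {"oncology"}
    # 'other' is used only for terms that map to no other area.
    if not areas:
        return {"other"}
    # Otherwise multiple areas are retained, and the term counts
    # towards each of them.
    return areas
```

For example, a term mapped to both a neurology and an oncology heading would count only towards oncology, whereas a term mapped to neurology and endocrine headings would count towards both.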

Analysis of T2D GWASs

We included 19 genes from OMIM linked to Mendelian forms of diabetes or syndromes with diabetic features. For Vujkovic et al. 18 , we considered as novel any genes with a novel nearest gene, novel coding variant or a novel lead SNP colocalized with an eQTL with H4 ≥ 0.9. Non-novel nearest genes, coding variants and colocalized lead SNPs were considered established variants. For Suzuki et al. 19 , we used the available L2G scores that OTG had assigned for the same lead SNPs in previously reported GWASs for other phenotypes, yielding mapped genes with L2G share >0.5 for 27% of loci. Genes were considered novel if absent from the Vujkovic analysis. Together, these approaches identified 217 established GWAS genes and 645 novel ones (469 from Vujkovic and 176 from Suzuki). We identified 347 unique drug targets in Pharmaprojects reported with a T2D or diabetes mellitus indication, including 25 approved. We reviewed the list of approved drugs and eliminated those for which there were questions around the relevance of the drug or target to T2D ( AKR1B1 , AR , DRD1 , HMGCR , IGF1R , LPL , SLC5A1 ). Because Pharmaprojects ordinarily specifies the receptor as target for protein or peptide replacement therapies, we also remapped the minority of programmes for which the ligand, rather than receptor, had been listed as target (changing INS to INSR , GCG to GCGR ). To assess the proportion of programmes with genetic support, we first grouped by drug and selected just one target, preferring the target with the earliest genetic support (OMIM, then established GWASs, then novel GWASs, then none). Next we grouped by target and selected its highest phase reached. Finally, we grouped by highest phase reached and counted the number of unique targets.
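The three grouping steps at the end of this paragraph can be sketched as follows. The data layout, the phase labels, the alphabetical tie-break between equally supported targets and the gene names are assumptions for illustration.

```python
# Preference order for genetic support, as listed in the text.
SUPPORT_RANK = {"OMIM": 0, "established GWAS": 1, "novel GWAS": 2, "none": 3}
# Phase ordering (labels are assumed).
PHASE_RANK = {"preclinical": 0, "phase I": 1, "phase II": 2,
              "phase III": 3, "launched": 4}


def targets_by_highest_phase(programmes, support):
    """Count unique targets, and supported targets, per highest phase.

    programmes: list of (drug, target, phase) tuples (hypothetical layout).
    support: dict mapping target -> support category; missing means 'none'.
    Returns {phase: (n_targets, n_supported)}.
    """
    # Step 1: group by drug and select one target, preferring the
    # best-ranked support (ties broken alphabetically, an assumption).
    by_drug = {}
    for drug, target, phase in programmes:
        by_drug.setdefault(drug, []).append((target, phase))
    chosen = []
    for entries in by_drug.values():
        best = min({t for t, _ in entries},
                   key=lambda t: (SUPPORT_RANK[support.get(t, "none")], t))
        phase = max((p for t, p in entries if t == best), key=PHASE_RANK.get)
        chosen.append((best, phase))

    # Step 2: group by target and take its highest phase reached.
    highest = {}
    for target, phase in chosen:
        if target not in highest or PHASE_RANK[phase] > PHASE_RANK[highest[target]]:
            highest[target] = phase

    # Step 3: group by highest phase and count unique targets.
    counts = {}
    for target, phase in highest.items():
        total, supported = counts.get(phase, (0, 0))
        is_supported = support.get(target, "none") != "none"
        counts[phase] = (total + 1, supported + int(is_supported))
    return counts


# Hypothetical programmes and support annotations.
example_programmes = [
    ("drugA", "GENE_A", "launched"),
    ("drugA", "GENE_X", "launched"),
    ("drugB", "GENE_X", "phase II"),
    ("drugC", "GENE_Y", "phase II"),
]
example_counts = targets_by_highest_phase(example_programmes, {"GENE_A": "OMIM"})
```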

Universe of possible genetically supported G–I pairs

In all of our analyses, targets are defined as human gene symbols, but we use the term G–I pair to refer to possible genes that one might attempt to target with a drug, and T–I pair to refer to genes that are the targets of actual drug candidates in development. To enumerate the space of possible G–I pairs, we multiplied the n = 769 Pharmaprojects indications considered here by the ‘universe’ of n = 19,338 protein-coding genes, yielding a space of n = 14,870,922 possible G–I pairs. Of these, n = 101,954 (0.69%) qualify as having genetic support per our criteria. A total of 16,808 T–I pairs have reached at least phase I in an active or historical programme, of which 1,155 (6.9%) are genetically supported. This represents an enrichment compared with random chance (OR = 11.0, P < 1.0 × 10⁻¹⁵, Fisher’s exact test), but in absolute terms, only 1.1% of genetically supported G–I pairs have been pursued. A genetically supported G–I pair may be less likely to attract drug development interest if the indication already has many other potential targets, and/or if the indication is but the second-most similar to the gene’s associated trait. Removing associations with many GWAS hits and restricting to the single most similar indication left a space of 34,190 possible genetically supported G–I pairs, 719 (2.1%) of which had been pursued. This small percentage might yet be perceived to reflect competitive saturation, if the vast majority of indications are undevelopable and/or the vast majority of targets are undruggable. We therefore asked what proportion of genetically supported G–I pairs had been developed to at least phase I, as a function of therapy area cross-tabulated against Open Targets predicted tractability status or membership in canonically ‘druggable’ protein families, using families from ref. 22 as well as UniProt pkinfam for kinases 36. We also grouped at the level of gene, rather than G–I pair (Extended Data Fig. 8).
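The counts and percentages quoted in this paragraph can be checked with simple arithmetic. The variable names are illustrative, and taking 1,155 of 101,954 as the basis of the 1.1% figure is this sketch's reading of the text.

```python
n_indications = 769
n_genes = 19_338

# Size of the space of possible G-I pairs.
n_pairs = n_indications * n_genes

# Share of the space with genetic support.
n_supported = 101_954
pct_supported = 100 * n_supported / n_pairs

# Share of T-I pairs reaching at least phase I that are supported.
n_ti = 16_808
n_ti_supported = 1_155
pct_ti_supported = 100 * n_ti_supported / n_ti

# Share of genetically supported G-I pairs that have been pursued.
pct_pursued = 100 * n_ti_supported / n_supported

# After removing many-hit associations and keeping only the single
# most similar indication.
pct_pursued_restricted = 100 * 719 / 34_190

print(n_pairs, round(pct_supported, 2), round(pct_ti_supported, 1),
      round(pct_pursued, 1), round(pct_pursued_restricted, 1))
```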

Druggability and protein families

Antibody and small molecule druggability status was taken from Open Targets 37 . For antibody tractability, Clinical Precedence, Predicted Tractable–High Confidence and Predicted Tractable–Medium to Low Confidence were included. For small molecules, Clinical Precedence, Discovery Precedence and Predicted Tractable were included. Protein families were from sources described previously 22 , plus the pkinfam kinase list from UniProt 36 . To make these lists non-overlapping, genes that were both kinases and also enzymes, ion channels or nuclear receptors were considered to be kinases only.

Analyses were conducted in R 4.2.0. For binomial proportions P(G) and P(S), error bars are Wilson 95% CIs, except for P(S) for phase I–launch, for which the Wald method is used to compute the confidence intervals on the product of the individual probabilities of success at each phase. RS uses Katz 95% CIs, with the phase I–launch RS based on the number of programs entering phase I and succeeding in phase III. Effects of continuous variables on probability of launch were assessed using logistic regression. Differences in RS between therapy areas were tested using the Cochran–Mantel–Haenszel chi-squared test (cmh.test from the R lawstat package, v.3.4). Pipeline progression of D–I pairs conditioned on the highest phase reached by a drug was modelled using an ordinal logit model (polr with Hess = TRUE from the R MASS package, v.7.3-56). Correlations across therapy areas were tested by weighted Pearson’s correlation (wtd.cor from the R weights package, v.1.0.4); to control for the amount of data available in each therapy area, the number of genetically supported T–I pairs having reached at least phase I was used as the weight. Enrichments of T–I pairs in the utilization analysis were tested using Fisher’s exact test. All statistical tests were two-sided.
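For readers unfamiliar with the two interval types, the standard Wilson score interval for a proportion and the Katz log interval for a ratio of proportions can be sketched as below. The study's analyses were run in R; this Python version is only an illustration, treats RS as a ratio of two success proportions, and uses hypothetical counts and function names.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical value of the standard normal distribution.
Z95 = NormalDist().inv_cdf(0.975)


def wilson_ci(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def katz_ci(a, n1, b, n2, z=Z95):
    """Ratio of proportions (a/n1)/(b/n2) with Katz log interval."""
    ratio = (a / n1) / (b / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts: 20 of 100 supported pairs launched versus
# 10 of 100 unsupported pairs.
wilson_lo, wilson_hi = wilson_ci(50, 100)
rs, rs_lo, rs_hi = katz_ci(20, 100, 10, 100)
```

The Katz interval is symmetric on the log scale, which is why intervals for RS look asymmetric around the point estimate when plotted on a linear axis.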

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

An analytical dataset is provided at GitHub at https://github.com/ericminikel/genetic_support/ (ref. 38 ) and is sufficient to reproduce all figures and statistics herein. This repository is permanently archived at Zenodo at https://doi.org/10.5281/zenodo.10783210 (ref. 39 ). Source data are provided with this paper.

Code availability

Source code is provided at GitHub at https://github.com/ericminikel/genetic_support/ (ref. 38 ) and is sufficient to reproduce all figures and statistics herein. This code is permanently archived at the Zenodo repository at https://doi.org/10.5281/zenodo.10783210 (ref. 39 ).

DiMasi, J. A., Grabowski, H. G. & Hansen, R. W. Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 47 , 20–33 (2016).

Hay, M., Thomas, D. W., Craighead, J. L., Economides, C. & Rosenthal, J. Clinical development success rates for investigational drugs. Nat. Biotechnol. 32 , 40–51 (2014).

Wong, C. H., Siah, K. W. & Lo, A. W. Estimation of clinical trial success rates and related parameters. Biostatistics 20 , 273–286 (2019).

Thomas D. et al. Clinical Development Success Rates and Contributing Factors 2011–2020 (Biotechnology Innovation Organization, 2021); https://go.bio.org/rs/490-EHZ-999/images/ClinicalDevelopmentSuccessRates2011_2020.pdf

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47 , 856–860 (2015).

Diogo, D. et al. Phenome-wide association studies across large population cohorts support drug target validation. Nat. Commun. 9 , 4285 (2018).

Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12 , 581–594 (2013).

Musunuru, K. & Kathiresan, S. Genetics of common, complex coronary artery disease. Cell 177 , 132–145 (2019).

Trajanoska, K. et al. From target discovery to clinical drug development with human genetics. Nature 620 , 737–745 (2023).

Burgess, S. et al. Using genetic association data to guide drug discovery and development: review of methods and applications. Am. J. Hum. Genet. 110 , 195–214 (2023).

Carss, K. J. et al. Using human genetics to improve safety assessment of therapeutics. Nat. Rev. Drug Discov. 22 , 145–162 (2023).

Nguyen, P. A., Born, D. A., Deaton, A. M., Nioi, P. & Ward, L. D. Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects. Nat. Commun. 10 , 1579 (2019).

Minikel, E. V., Nelson, M. R. Human genetic evidence enriched for side effects of approved drugs. Preprint at medRxiv https://doi.org/10.1101/2023.12.12.23299869 (2023).

Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90 , 7–24 (2012).

King, E. A., Davis, J. W. & Degner, J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet. 15 , e1008489 (2019).

Hingorani, A. D. et al. Improving the odds of drug development success through human genomics: modelling study. Sci. Rep. 9 , 18911 (2019).

Reay, W. R. & Cairns, M. J. Advancing the use of genome-wide association studies for drug repurposing. Nat. Rev. Genet. 22 , 658–671 (2021).

Vujkovic M. et al. Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nat. Genet. 52 , 680–691 (2020).

Suzuki K. et al. Genetic drivers of heterogeneity in type 2 diabetes pathophysiology. Nature 627 , 347–357 (2024).

Lommatzsch, M. et al. Disease-modifying anti-asthmatic drugs. Lancet 399 , 1664–1668 (2022).

Mortberg, M. A., Vallabh, S. M. & Minikel, E. V. Disease stages and therapeutic hypotheses in two decades of neurodegenerative disease clinical trials. Sci. Rep. 12 , 17708 (2022).

Minikel, E. V. et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581 , 459–464 (2020).

Ference, B. A. et al. Low-density lipoproteins cause atherosclerotic cardiovascular disease. 1. Evidence from genetic, epidemiologic, and clinical studies. A consensus statement from the European Atherosclerosis Society Consensus Panel. Eur. Heart J. 38 , 2459–2472 (2017).

Scannell, J. W. et al. Predictive validity in drug discovery: what it is, why it matters and how to improve it. Nat. Rev. Drug Discov. 21 , 915–931 (2022).

Sun, B. B. et al. Genetic associations of protein-coding variants in human disease. Nature 603 , 95–102 (2022).

Pharmaprojects (Citeline, accessed 30 August 2023); https://web.archive.org/web/20230830135309/https://www.citeline.com/en/products-services/clinical/pharmaprojects

Painter, J. L. Toward automating an inference model on unstructured terminologies: OXMIS case study. Adv. Exp. Med. Biol. 680 , 645–651 (2010).

Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53 , 1527–1533 (2021).

Zheng, J. et al. Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nat. Genet. 52 , 1122–1131 (2020).

Sollis, E. et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 51 , D977–D985 (2023).

Kurki, M. I. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613 , 508–518 (2023).

Guo C. et al. Identification of putative effector genes across the GWAS Catalog using molecular quantitative trait loci from 68 tissues and cell types. Preprint at bioRxiv https://doi.org/10.1101/808444 (2019).

Karczewski, K. J. et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genomics. 2 , 100168 (2022).

Lin D. An information-theoretic definition of similarity. In Proc. 15th International Conference on Machine Learning (ICML) (ed. Shavlik, J. W.) 296–304 (Morgan Kaufmann Publishers Inc., 1998).

Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. 11 , 95–130 (1999).

The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45 , D158–D169 (2017).

Ochoa, D. et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 51 , D1353–D1359 (2023).

Minikel, E. et al. GitHub https://github.com/ericminikel/genetic_support/ (2024).

Minikel, E. et al. Refining the impact of genetic evidence on clinical success. Zenodo https://doi.org/10.5281/zenodo.10783210 (2024).

Acknowledgements

This study was funded by Deerfield.

Author information

Jeffery L. Painter

Present address: GlaxoSmithKline, Research Triangle Park, NC, USA

Authors and Affiliations

Stanley Center for Psychiatric Research, Broad Institute, Cambridge, MA, USA

Eric Vallabh Minikel

JiveCast, Raleigh, NC, USA

Deerfield Management Company LP, New York, NY, USA

Coco Chengliang Dong & Matthew R. Nelson

Genscience LLC, New York, NY, USA

Matthew R. Nelson

Contributions

M.R.N. and E.V.M. conceived and designed the study. E.V.M., J.L.P., C.C.D. and M.R.N. performed analyses. M.R.N. supervised the research. M.R.N. and E.V.M. drafted the manuscript. E.V.M., J.L.P., C.C.D. and M.R.N. reviewed and approved the final manuscript.

Corresponding author

Correspondence to Matthew R. Nelson .

Ethics declarations

Competing interests.

M.R.N. is an employee of Deerfield and Genscience. C.C.D. is an employee of Deerfield. E.V.M. and J.L.P. are consultants to Deerfield. Unrelated to the current work, E.V.M. acknowledges speaking fees from Eli Lilly, consulting fees from Alnylam and research support from Ionis, Gate, Sangamo and Eli Lilly.

Peer review

Peer review information.

Nature thanks Joanna Howson, Heiko Runz and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Data processing schematic.

A ) Dataset size, filters, and join process for Pharmaprojects and human genetic evidence. Note that a drug can be assigned multiple targets, and can be approved for multiple indications. The entire analysis described herein has also been run restricted to only those drugs with exactly one target annotated (Figs. S1 – S11 ). B ) Illustration of the definition of genetic support. A table of drug development programs with one row per target-indication pair (left) is joined to a table of human genetic associations based on the identity of the gene encoding the drug target and the similarity between the drug indication MeSH term and the genetically associated trait MeSH term being ≥ 0.8. Drug program rows with a joined row in the genetic associations table are considered to have genetic support.
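The join described in panel B can be sketched as follows. The table layouts, the symmetric similarity lookup, the treatment of identical terms as similarity 1 and the placeholder identifiers are assumptions for illustration.

```python
def genetic_support(programmes, associations, similarity, threshold=0.8):
    """Flag target-indication pairs that have genetic support.

    programmes: list of (target_gene, indication_mesh) pairs.
    associations: list of (gene, trait_mesh) pairs.
    similarity: dict mapping (mesh_a, mesh_b) -> combined similarity.
    All identifiers used with this sketch are hypothetical placeholders.
    """
    traits_by_gene = {}
    for gene, trait in associations:
        traits_by_gene.setdefault(gene, set()).add(trait)

    def sim(a, b):
        # Identical terms are treated as similarity 1 (assumption);
        # the lookup is treated as symmetric (assumption).
        if a == b:
            return 1.0
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    return {
        (target, indication): any(
            sim(indication, trait) >= threshold
            for trait in traits_by_gene.get(target, ())
        )
        for target, indication in programmes
    }


# Hypothetical example: one supported and two unsupported programmes.
example_support = genetic_support(
    programmes=[("GENE_A", "MESH_1"), ("GENE_A", "MESH_3"), ("GENE_B", "MESH_1")],
    associations=[("GENE_A", "MESH_2")],
    similarity={("MESH_1", "MESH_2"): 0.85, ("MESH_3", "MESH_2"): 0.4},
)
```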

Extended Data Fig. 2 Further analysis of influence of characteristics of genetic associations on relative success.

A ) Sensitivity of RS to the similarity threshold between the MeSH ID for the genetically associated trait and the MeSH ID for the clinically developed indication. The threshold is varied by units of 0.05 (labels) and the results are plotted as RS (y axis) versus number of genetically supported T-I pairs (x axis). B ) Breakdown of OTG and OMIM RS values by whether any drug for each T-I pair has had orphan status assigned. The N of genetically supported T-I pairs (denominator) and, of those, launched T-I pairs (numerator) is shown at right. Values for the full 2×2 contingency table including the non-supported pairs, used to calculate RS, are provided in Table S12 . Total N = 13,022 T-I pairs, of which 3,149 are orphan. The center is the RS point estimate and error bars are Katz 95% confidence intervals. C ) RS for somatic genetic evidence from IntOGen versus germline genetic evidence, for oncology and non-oncology indications. Note that the approved/supported proportions displayed for the top two rows are identical because all IntOGen genetic support is for oncology indications, yet the RS is different because the number of non-supported approved and non-supported clinical stage programs is different. In other words, in the “All indications” row, there is a Simpson’s paradox that diminishes the apparent RS of IntOGen — IntOGen support improves success rate (see 2 nd row) but also selects for oncology, an area with low baseline success rate (as shown in Extended Data Fig. 6a ). N is displayed at right as in (B), with full contingency tables in Table S13 . Total N = 13,022 T-I pairs, of which 6,842 non-oncology, 6,180 oncology, 1,287 targeting IntOGen oncogenes, 284 targeting tumor suppressors, and 176 targeting IntOGen genes of unknown mechanism. The center is the RS point estimate and error bars are Katz 95% confidence intervals. D ) As for top panel of Fig. 1d , but without removing replications or OMIM-supported T-I pairs. 
N is displayed as in (B), with full contingency tables in Table S14 . Total N = 13,022 T-I pairs. The center is the RS point estimate and error bars are Katz 95% confidence intervals. E ) As for top panel of Fig. 1d , removing replications but not removing OMIM-supported T-I pairs. N is displayed as in (B), with full contingency tables in Table S15 . Total N = 13,022 T-I pairs. The center is the RS point estimate and error bars are Katz 95% confidence intervals. F ) Proportion of T-I pairs supported by a GWAS Catalog association that are launched (versus phase I-III) as a function of the year of first genetic association. G ) Launched T-I pairs genetically supported by OTG GWAS, shown by year of launch (y axis) and year of first genetic association (x axis). Gene symbols are labeled for first approvals of targets with at least 5 years between association and launch. Of 104 OTG-supported launched T-I pairs (Fig. 1d ), year of drug launch was available for N = 38 shown here, of which 18 (47%) acquired genetic support only in or after the year of launch. The true proportion of launched T-I whose GWAS support is retrospective may be larger if the T-I with a missing launch year are more often older drug approvals less well annotated in Pharmaprojects. H ) Lack of impact of GWAS Catalog lead SNP odds ratio (OR) on RS when using the same OR breaks as used by King et al. 15 . N is displayed as in (B), with full contingency tables in Table S18 . Total N = 13,022 T-I pairs. The center is the RS point estimate and error bars are Katz 95% confidence intervals. See Fig. S4 for the same analyses restricted to drugs with a single known target.

Extended Data Fig. 3 Sensitivity to changes in genetic data and drug pipeline over the past decade and to the ‘genetic insight’ filter.

“2013” here indicates the data freezes from Nelson et al. 5 (that study’s supplementary dataset 2 for genetics and supplementary dataset 3 for drug pipeline); “2023” indicates the data freezes in the present study. All datasets were processed using the current MeSH similarity matrix, and because “genetic insight” changes over time (more traits have been studied genetically now than in 2013), all panels are unfiltered for genetic insight (hence numbers in panel D differ from those in Fig. 1a ). Every panel shows the proportion of combined (both historical and active) target-indication pairs with genetic support, or P(G), by development phase. A ) 2013 drug pipeline and 2013 genetics. B ) 2013 drug pipeline and 2023 genetics. C ) 2023 drug pipeline and 2013 genetics. D ) 2023 drug pipeline and 2023 genetics. E ) 2023 drug pipeline with only OTG GWAS hits through 2013 and no other sources of genetic evidence. F ) 2023 drug pipeline with only OTG GWAS hits for all years, no other sources of genetic evidence. We note that the increase in P(G) over the past decade 5 is almost entirely attributable to new genetic evidence (e.g. contrast B vs. A, D vs. C, F vs. E) rather than changes in the drug pipeline (e.g. compare A vs. C, B vs. D). In contrast, the increase in RS is due mostly to changes in the drug pipeline (compare C, D, E, F vs. A, B), in line with theoretical expectations outlined by Hingorani et al. 16 and consistent with the findings of King et al. 15 We note that both the contrasts in this figure, and the fact that genetic support is so often retrospective (Extended Data Fig. 2g ) suggest that P(G) will continue to rise in coming years. For 2013 drug pipeline, N = 8,624 T-I pairs (1,605 preclinical, 1,772 phase I, 2,779 phase II, 636 phase III, and 1,832 launched); for 2023 drug pipeline, N = 29,464 T-I pairs (N = 12,653 preclinical, 4,946 phase I, 8,268 phase II, 1,781 phase III, and 1,816 launched). 
Details including numerator and denominator for P(G) and full contingency tables for RS are provided in Tables S19–S20. In A-F, the center is exact proportion and error bars are Wilson binomial 95% confidence intervals. Because all panels here are unfiltered for genetic insight, we also show the difference in RS across G ) sources of genetic evidence and H ) therapy areas when this filter is removed. In general, removing this filter decreases RS by 0.17; this varies only slightly between sources and areas. The largest impact is seen in Infection, where removing the filter drops the RS from 2.73 to 2.03. The relatively minor impact of removing the genetic insight filter is consistent with the findings of King et al. 15, who varied the minimum number of genetic associations required for an indication to be included, and found that risk ratio for progression (i.e. RS) was slightly diminished when the threshold was reduced. See Fig. S5 for the same analyses restricted to drugs with a single known target.

Extended Data Fig. 4 Proportion of type 2 diabetes drug targets with human genetic support by highest phase reached.

A) OMIM, B) established (2019 and earlier) GWAS genes, C) novel (new in Vujkovic 2020 or Suzuki 2023) GWAS genes, or D) any of the above. See  Methods for details on GWAS dataset processing. N is indicated at right of each panel, with denominator being the number of T2D targets at each stage and the numerator being the number of those that are genetically supported. Total N = 284 targets. The center is the exact proportion and error bars are Wilson binomial 95% confidence intervals.

Extended Data Fig. 5 P(G) by phase versus therapy area.

Each panel represents one therapy area, and shows the proportion of target-indication pairs in that area with genetic support, or P(G), by development phase. The genetically supported and total number of T-I pairs at each phase in each therapy area are provided in Table S33 . Total number of T-I pairs in any area: N = 10,839 preclinical, N = 4,421 phase I, N = 7,383 phase II, N = 1,551 phase III, N = 1,519 launched. The center is the exact proportion and error bars are Wilson binomial 95% confidence intervals. See Fig. S6 for the same analyses restricted to drugs with a single known target.

Extended Data Fig. 6 Confounding between therapy areas and properties of supporting genetic evidence.

In panels A-E, each point represents one GWAS Catalog-supported T-I pair in phase I through launched, and boxes represent medians and interquartile ranges (25th, 50th and 75th percentiles). Each panel A-E represents the cross-tabulation of therapy areas versus the properties examined in Fig. 1d. Kruskal-Wallis tests treat each variable as continuous, while chi-squared tests are applied to the discrete bins used in Fig. 1d. A ) Year of discovery, Kruskal-Wallis P = 1.1e-11, chi-squared P = 2.9e-16, N = 686 target-indication-area (T-I-A) triplets; B ) gene count, Kruskal-Wallis P = 6.2e-35, chi-squared P = 7.1e-47, N = 770 T-I-A triplets; C ) absolute beta, Kruskal-Wallis P = 1.2e-5, chi-squared P = 1.7e-7, N = 461 T-I-A triplets; D ) absolute odds ratio, Kruskal-Wallis P = 2.5e-5, chi-squared P = 4.3e-6, N = 305 T-I-A triplets; E ) minor allele frequency, Kruskal-Wallis P = 5.7e-4, chi-squared P = 4.3e-3, N = 584 T-I-A triplets; F ) Barplot of therapy areas of genetically supported T-I by source of GWAS data within OTG, chi-squared P = 2.4e-7. See Fig. S7 for the same analyses restricted to drugs with a single known target.

Extended Data Fig. 7 Further analyses of differences in relative success among therapy areas.

A ) Probability of success, P(S), by therapy area, with Wilson 95% confidence intervals. The N shown at right indicates the number of launched T-I pairs (numerator) and number of T-I pairs reaching at least phase I (denominator). The center is the exact proportion and error bars are Wilson binomial 95% confidence intervals. B ) Probability of genetic support, P(G), by therapy area, with Wilson 95% confidence intervals. The N shown at right indicates the number of genetically supported T-I pairs reaching at least phase I (numerator) and total number of T-I pairs reaching at least phase I (denominator). The center is the exact proportion and error bars are Wilson binomial 95% confidence intervals. C ) P(S) vs. P(G), D ) RS vs. P(S), and E ) RS vs. P(G) across therapy areas, with centers indicating point estimates and crosshairs representing 95% confidence intervals on both dimensions: Katz for RS and Wilson for P(G) and P(S). For A-E, total N = 13,022 unique T-I pairs, but because some indications belong to > 1 therapy area, N = 16,900 target-indication-area (T-I-A) triples. For exact N and full contingency tables, see Table S28. F ) Re-analysis of RS (x axis) broken down by therapy area using data from supplementary table 6 of Nelson et al. 5. G ) Confusion matrix showing the categorization of unique drug indications into therapy areas in Nelson et al. 5 versus current. Note that the current categorization is based on each indication’s position in the MeSH ontological tree and one indication can appear in > 1 area; see Methods for details. Marginals along the top edge are the number of drug indications in each current therapy area that were absent from the 2015 dataset. Marginals along the right edge are the number of drug indications in each 2015 therapy area that are absent from the current dataset. See Fig. S8 for the same analyses restricted to drugs with a single known target.

Extended Data Fig. 8 Level of utilization of genetic support among targets.

As for Fig. 3 , but grouped by target instead of T-I pair. Thus, the denominator for each cell is the number of targets with at least one genetically supported indication, and each target counts towards the numerator if at least one genetically supported indication has reached phase I. See Fig. S9 for the same analyses restricted to drugs with a single known target.

Supplementary information

Supplementary Figures

Supplementary Figs. 1–9, corresponding to the three main and six extended data figures restricted to drugs with one target only.

Reporting Summary

Peer Review File

Supplementary Data

Supplementary Tables 1–50, including information on all target-indication pairs, source data for all graphs and additional analyses.

Source data

Source data are provided for Figs. 1–3 and Extended Data Figs. 2–8.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Minikel, E.V., Painter, J.L., Dong, C.C. et al. Refining the impact of genetic evidence on clinical success. Nature (2024). https://doi.org/10.1038/s41586-024-07316-0


Received: 05 July 2023

Accepted: 14 March 2024

Published: 17 April 2024

DOI: https://doi.org/10.1038/s41586-024-07316-0



Division of Biology & Biomedical Sciences


Molecular Genetics and Genomics

Geneticists seek to understand how genes are inherited, modified, expressed and regulated, as well as the genetic basis of human disease. The field of genetics and genomics has been astonishingly successful in deciphering the genetic code and providing us with a clear picture of the nature of the gene, but much remains to be learned about fundamental genetic mechanisms and how specific gene mutations lead to disease. How is it that only the appropriate genes are turned on in a particular cell type? How do cells replicate their genes with remarkable speed and fidelity? By what processes do genes become altered to drive evolution or cause disease? Uncovering answers to such fundamental questions makes genetics and genomics an exciting and important field of biology.

Laboratories in the Molecular Genetics and Genomics (MGG) program leverage human genetics, model organism genetics, and genomic and computational approaches to address key outstanding questions in all areas of biomedical research with a focus on human disease. Integrating wet and dry bench approaches, students in the MGG program advance our understanding of the genetic, cellular, and molecular basis of how cells, tissues, and organs develop and function and how alterations in these processes lead to disease. MGG laboratories at WashU have been at the forefront of human molecular genetics and the Human Genome Project. Students interested in studying fundamental genetic mechanisms, as well as those who desire to apply this knowledge to human biology, will find scores of laboratories within the program in which to pursue their doctoral research.

Students in the Molecular Genetics & Genomics (MGG) program will typically take five (5) courses during their first year, although students with master's degrees have the option of taking fewer classes. Students will participate in three lab rotations during the fall and spring semesters of Year 1 to identify a thesis lab. Students are expected to complete the following coursework during their graduate education:

DBBS required courses

  • Graduate Research Fundamentals
  • Ethics and Research Science (typically taken in Year 2)

Program required courses

  • Advanced Genetics
  • Genomics (with programming lab)
  • Two (2) semesters of Genetics Journal Club

Two (2) advanced electives

  • Fundamentals of Biostatistics (recommended for students with little or no programming experience)
  • Human Genetic Analysis
  • Computational Statistical Genetics
  • Fundamentals in Molecular Cell Biology
  • Nucleic Acids and Protein Biosynthesis
  • Molecular, Cell and Organ Systems
  • Computational Molecular Biology

In addition to the listings above, students may also take the following additional courses as appropriate:

  • Developmental Biology
  • Macromolecular Interactions
  • Molecular Microbiology and Pathogenesis
  • Immunobiology I
  • Immunobiology II

Qualifying exam

In Year 2, students must pass a Qualifying Exam (QE). Following a successful QE defense, students will identify and finalize their committee and complete their thesis proposal by December 31 of Year 3.

Thesis committee, proposal, and defense

Toward the end of Year 1, students will select a thesis advisor and begin their research. Students next assemble a research advisory committee (thesis committee) of faculty and complete a written and oral thesis proposal to their committee by December 31 of Year 3. After their thesis proposal, students will focus on their thesis research. Students may participate in advanced courses, workshops, journal clubs, and seminars relevant to their research and to their career development. Students typically defend in Years 4, 5, or 6 of graduate school.


MGG graduates pursue a variety of careers. Most program graduates go into academia, but many find paths in industry, government, and other fields, like science communication, law, and business and entrepreneurship.

Graduate Program Administrator: Shonda Dukes

Faculty Co-Directors: Jim Skeath, PhD; John Edwards, PhD



Hereditas, v.160; 2023 (PMC10163706)

Genomics in animal breeding from the perspectives of matrices and molecules

Martin Johnsson

Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, Uppsala, 75007 Sweden

Associated Data

Not applicable.

This paper describes genomics from two perspectives that are in use in animal breeding and genetics: a statistical perspective concentrating on models for estimating breeding values, and a sequence perspective concentrating on the function of DNA molecules.

This paper reviews the development of genomics in animal breeding and speculates on its future from these two perspectives. From the statistical perspective, genomic data are large sets of markers of ancestry; animal breeding makes use of them while remaining agnostic about their function. From the sequence perspective, genomic data are a source of causative variants; what animal breeding needs is to identify and make use of them.

The statistical perspective, in the form of genomic selection, is the more applicable in contemporary breeding. Animal genomics researchers working from the sequence perspective are still working towards the isolation of causative variants, equipped with new technologies but continuing a decades-long line of research.

Genomics, in the sense of genetic analyses using markers spaced out along the whole genome, has become a mainstream part of animal breeding. In March 2021, the dairy cattle evaluation in the US run by the Council on Dairy Cattle Breeding had accumulated five million genotyped animals [ 1 ]. These data are gathered for the purpose of genomic selection, that is, evaluation of animals based on genome-wide DNA testing, which was implemented in the US in 2007 (reviewed by [ 2 ]). Genomic selection builds on the practice of genetic evaluation by estimating a breeding value (a prediction of the trait values of the offspring that an animal will have) based on measurements on the animal itself and its relatives. Genomic selection adds molecular information in the form of genome-wide DNA markers to the evaluation.

Animal breeding before genomics was already immensely effective in changing the traits of farm animals. Take for example broiler chicken breeding. Zuidhof et al. [ 3 ] compared commercial broilers from 2005 (Ross 308 from Aviagen) with populations where breeding stopped in 1957 or 1978, kept in the same environment and fed the same feed. At eight weeks of age, the average body mass was 0.9 kg for the population with genetics from 1957, 1.8 kg for the population with genetics from 1978, and 4.2 kg for the population with genetics from 2005. The first SNP chip for chickens was developed in 2005 [ 4 ], and Aviagen started using genomic selection in 2012 [ 5 ]; thus, this difference is due to breeding that occurred before genomics. Genomics, however, made selection even more effective, either by increasing accuracy of selection or reducing generation interval, depending on the species. Potentially, it can also tell us about the molecular nature of the variants under selection and lead to new biotechnology applications for livestock.

The term “genomics” is derived from “genome”, which was coined by Hans Winkler in 1920 [ 6 ] and refers to one haploid set of chromosomes [ 7 ], or, with some degree of slippage in meaning, the complete DNA of a species. According to Thomas Roderick [ 8 ], the extension to “genomics” was conceived in 1986, as founders of the journal Genomics were trying to find a name for it. From the start, they regarded genomics as the name of a new field: “an activity, a new way to think about biology”.

There are (at least) two ways to think of genomics in animal breeding: two perspectives on genomics that will, throughout this paper, be called the statistical and the sequence perspectives:

  • We may think of the genome as a big table of numbers, where each row is an individual and each column a genetic variant, and the numbers are ancestry indicators. These matrices lend themselves to statistical calculations such as estimation of genomic breeding values. This is the view from the statistical perspective.
  • Alternatively, we may think of the genome as a long string of A, C, G and T. They lend themselves to molecular biology operations like predicting the amino acid substitution from a base pair substitution, or identifying patterns of interest. This is the view from the sequence perspective.
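The two representations in the list above can be made concrete with a small sketch. It is illustrative only: the genotype values are made up, the numbers are assumed to be allele counts (0, 1 or 2 copies), and the function names `allele_frequencies` and `amino_acid_change` are hypothetical helpers, not part of any genomics library. The codon table holds only the three codons used here, taken from the standard genetic code.

```python
# Statistical perspective: the genome as a table of numbers.
# Rows are individuals, columns are variants; entries are assumed to be
# allele counts (0, 1 or 2 copies of one allele). Values are made up.
genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 1, 1],
]


def allele_frequencies(matrix):
    """Frequency of the counted allele at each variant (column)."""
    n_individuals = len(matrix)
    return [sum(column) / (2 * n_individuals) for column in zip(*matrix)]


# Sequence perspective: the genome as a string of A, C, G and T.
# Only the codons needed for this example are listed (standard genetic code).
CODON_TABLE = {"GAG": "Glu", "GTG": "Val", "GAA": "Glu"}


def amino_acid_change(codon, position, new_base):
    """Return (old, new) amino acids after substituting one base in a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[codon], CODON_TABLE[mutated]


frequencies = allele_frequencies(genotypes)
missense = amino_acid_change("GAG", 1, "T")    # changes the amino acid
synonymous = amino_acid_change("GAG", 2, "A")  # leaves it unchanged
```

The first function only needs the numbers and never asks what the variants do; the second only makes sense once the bases and their coding context are known, which is the contrast between the two perspectives.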

The perspectives roughly map to two concepts of a so-called gene [ 9 ]: The statistical perspective relates to the instrumental gene, a calculating device used by classical geneticists to understand inheritance patterns. The instrumental gene is a particle of inheritance, observed indirectly through crosses and comparisons of traits between relatives. For example, the textbook of classical genetics by Sturtevant and Beadle [ 10 ] is full of crossing schemes of fruit flies that allow modes of inheritance to be investigated. In the introduction, the authors describe their view of genetics as a science. They call it “a mathematically formulated subject that is logically complete and self-contained”, without the necessity of a physical or chemical account of how inheritance works. On the other hand, the sequence perspective aligns more closely with the nominal gene concept, where a gene is a DNA sequence that has a name and (potentially) a function. As an example, we can look at a genome browser such as Ensembl [ 11 ], which shows a genome as a series of tracks, with colourful boxes denoting genes, regulatory DNA sequences, and other associated information.

To be clear, I am not suggesting that individual geneticists are so limited in their thinking as to use only one of these perspectives. Any one researcher probably has these and several other mental models of the genome for different tasks. In practice, geneticists seem to routinely switch between different perspectives and conceptions of central terms like “genome”, “gene” and “locus”, without much friction. Certainly, ambiguity may lead to “complexity and confusion” [ 12 ], but I would argue that the imprecision is also sometimes productive, as it avoids unnecessary debates about which of these concepts are “right”, when the real answer is that all of them are working models and all are useful in different contexts.

The two perspectives lead to different views about the importance of identifying sequence variants that cause trait differences between individuals (“causative variants”, for short). From the statistical perspective, genomic data are large sets of markers of ancestry; we can make use of them while remaining agnostic about their function. From the sequence perspective, genomic data are a source of causative variants; we need to identify and make use of them. To realise the future potential of the sequence perspective, geneticists need to identify causative variants, while the statistical perspective has been successful precisely by ignoring causative variants. The power of markers [ 13 ] is what Sturtevant & Beadle described: the point is to make use of statistical regularities without getting bogged down in mechanistic detail. Conversely, the potential of the sequence perspective is in understanding mechanisms and learning to manipulate them in ways that would not be possible by traditional selection and crossing. Mostly, this potential has not been realised, but the search for molecular knowledge has made possible tools that underpin applications of the statistical perspective, especially genomic selection.

Tools of the statistical perspective

Genomic selection is the crowning achievement of the statistical perspective on genomics in animal breeding, building on a long line of research on mapping phenotypes to genotypes. Genetic mapping, the family of methods used for localising variants that affect traits (roughly at first), goes back to the early history of classical genetics. Once geneticists had discovered that genes were arranged linearly on chromosomes, they could build maps of where causative variants underlying visible phenotypes were located relative to each other, the first map being published by Sturtevant [ 14 ]. This map building activity, based on crossing and detecting recombinant individuals, is called linkage mapping. The extension to complex traits with many causative variants of small effects is traditionally called “quantitative trait locus mapping” [ 15 ]. The extension to large population samples of more distantly related individuals is called “genome-wide association” [ 16 ], and has become the dominant form of genetic mapping. Arguably, genetic mapping can be viewed from both the statistical and the sequence perspective. On one hand, it relies on statistical genetic methods that are very similar to those used in genomic prediction, and involves representing genomic data statistically. On the other hand, the end goal is usually to identify causative variants.

Out of genetic mapping of traits relevant to breeding comes marker-assisted selection, an earlier paradigm for incorporating molecular information in breeding. In a way, marker-assisted selection is the most intuitive way to imagine molecular breeding: Imagine that we have identified some genetic variants that either cause a trait of interest, or are strongly associated with it. Then, we can genotype our selection candidates for the variant of interest, and incorporate those genotypes into selection decisions. For example, if we know about a strongly deleterious variant, we can exclude candidates that carry it. The proposition of a genetic test is especially attractive when the trait is otherwise hard to phenotype. This was precisely the situation with several large-effect deleterious alleles in pigs and cattle, where marker-assisted selection was successfully implemented against the problematic alleles: malignant hyperthermia and the RN gene in pigs (reviewed by [ 13 , 17 ]) and BLAD in cattle [ 18 ]. DNA tests for such large-effect damaging variants are now routinely included in many genomic breeding programs (e.g., [ 19 , 20 ]).

At some point during the late 1990s to early 2000s, animal breeding researchers shifted their thinking from marker-assisted selection to genomic selection, from thinking about mapping causative variants to treating the whole genome together. Arguably, the key paper, and the most cited, is the one by Meuwissen, Hayes and Goddard [ 21 ]. It presents the full case for genomic selection, including simulations and a few alternative estimation methods (leading to the so-called Bayesian alphabet family of methods). However, genomic selection did not appear fully formed at once. Other genomic selection precursor papers from the era include:

  • The 1990 paper by Lande & Thompson [ 22 ] that contains the key idea of covering the genome with markers and selecting on a total score based on all the markers.
  • The 1997 paper by Nejati-Javaremi, Smith & Gibson [ 23 ], the key idea of which is to create a relationship matrix based on variants that affect a trait, creating estimated breeding values based on what they call “total allelic relationship”.
  • The 1998 paper by Haley & Visscher [ 24 ] which uses the term “genomic selection” and clearly expresses the concept, including the interpretation of genetic markers as realised relatedness.

Exactly when and by whom (in conversation or in parallel) the shift happened is a topic of its own. It seems to have been a gradual process. Still, Meuwissen, Hayes and Goddard (2001) is a landmark in that it provided a full recipe for genomic selection, and ran the proof of concept in silico. Genomic selection worked well enough in theory that it provided the inspiration for creating the tools and the practical initiatives to make it a reality.

We can think of genomic prediction as refining the estimate of how closely related animals are to each other by observing how much DNA the animals share, as opposed to the average relatedness that can be predicted from a pedigree. Alternatively, we can think of it as simultaneously estimating the contribution of every part of the genome (that is, every marker we genotype), and adding them up to a genomic estimate for that animal (see [ 25 ] for a review of the statistical approaches used in animal breeding). Either way, the key insight in genomic selection is that one can accurately predict breeding values in the absence of information about the function of particular variants by combining all markers in one statistical model. As Lowe & Bruce point out [ 13 ], this black-boxing of genetic mechanisms is characteristic of the quantitative genetics tradition, here expressed by one of the pioneering applied quantitative geneticists, Lush [ 26 ]:

It is rarely possible to identify the pertinent genes in a Mendelian way or to map the chromosomal position of any of them. Fortunately this inability to identify and describe the genes individually is almost no handicap to the breeder of economic plants or animals. What he would actually do if he knew the details about all the genes which affect a quantitative character in that population differs little from what he will do if he merely knows how heritable it is and whether much of the hereditary variance comes from dominance or overdominance, and from epistatic interactions between the genes.

Lowe & Bruce argue that this attitude is key to the success of genomic selection: this strategy is the outcome of an alignment, but not a full integration of quantitative and molecular genetics, which allowed quantitative genetics to make use of molecular methods to generate ever denser marker maps, while sticking with the tradition of abstraction [ 13 ].
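The two ways of thinking about genomic prediction described above (refined relatedness versus summed marker contributions) can be shown to give the same predictions in a simple case. The sketch below is a toy illustration on simulated data, not the evaluation model of any breeding program. It assumes allele-count genotypes, a centred phenotype with no fixed effects, a ridge-regression model for marker effects, and a genomic relationship matrix built by centring and scaling the genotypes; the shrinkage parameter `lam` is a hypothetical value, whereas in practice it would follow from estimated variance components.

```python
import numpy as np

rng = np.random.default_rng(1)
n_animals, n_markers = 6, 20

# Simulated allele counts (0, 1 or 2) and phenotypes; purely illustrative.
M = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
y = rng.normal(size=n_animals)
y = y - y.mean()                     # centre the phenotype (no fixed effects)

p = M.mean(axis=0) / 2               # allele frequency per marker
Z = M - 2 * p                        # centred genotypes
k = 2 * np.sum(p * (1 - p))          # scaling constant
G = Z @ Z.T / k                      # genomic relationship matrix

lam = 10.0                           # hypothetical shrinkage parameter

# View 1: estimate an effect for every marker, then add them up per animal.
marker_effects = np.linalg.solve(Z.T @ Z + lam * np.eye(n_markers), Z.T @ y)
gebv_from_markers = Z @ marker_effects

# View 2: predict from how related the animals are according to their DNA.
gebv_from_relationships = G @ np.linalg.solve(G + (lam / k) * np.eye(n_animals), y)
```

Under these assumptions the two vectors of genomic estimated breeding values agree to numerical precision. Neither calculation uses any information about what the markers do.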

The effects of genomics have been dramatic. Genomic prediction allows selection to proceed more quickly, or more accurately, depending on the biology of the species and the design of the breeding program. In cattle, increased selection accuracy for young bulls without daughter records allows shorter generation times [ 2 , 27 , 28 ], and genotyping of heifers greatly improves selection accuracy of cows relative to pedigree-based evaluation [ 29 ]. In pigs, genomics has increased accuracy of selection in several traits by 50% [ 17 ]. In poultry, accuracy has also increased; a review of genomic selection in poultry gives accuracy increases ranging from 20% to over 50% in layers and broilers [ 5 ].

There are further statistical genetics tools, agnostic of marker function, that can be enriched by genomics. Optimal contributions selection (reviewed by [ 30 ]) is a family of methods to balance genetic improvement against inbreeding or loss of diversity in a population. These methods work by finding less related individuals to pair that still give a high expected genetic gain in the offspring. As in genomic selection, pedigree relatedness can be substituted with genomic relatedness. Since genomic selection in practice tends to accelerate inbreeding, there may be greater need for optimal contributions selection in genomic breeding. In principle, genomic selection can differentiate between individuals that are identically related in terms of pedigree, and thus lead to less correlation between families and a lower inbreeding rate, all else equal [ 31 ]. In practice, all else is not equal, because genomics leads to redesigns of breeding programs, which may in themselves increase or decrease the inbreeding rate. In breeding programs where genomic selection helped reduce generation time, a low inbreeding rate per generation may still translate to accelerated inbreeding per year. There are examples of both accelerated [ 32 ] and reduced inbreeding rates after genomic selection [ 33 ].
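The balance that optimal contributions selection strikes can be sketched as a small quadratic optimisation. This is a deliberately simplified illustration and not the formulation of the methods reviewed in [ 30 ]: it assumes one pool of candidates, maximises expected gain minus a penalty on average relatedness, and only constrains the contributions to sum to one (real implementations also keep contributions non-negative and handle the sexes separately). The function name `optimal_contributions`, the `penalty` parameter and all numbers are hypothetical.

```python
import numpy as np


def optimal_contributions(breeding_values, relationship, penalty):
    """Contributions c maximising c'g - (penalty / 2) * c'Ac with sum(c) = 1.

    Simplification: contributions are not forced to be non-negative.
    """
    g = np.asarray(breeding_values, dtype=float)
    A = np.asarray(relationship, dtype=float)
    ones = np.ones_like(g)
    Ainv_g = np.linalg.solve(A, g)
    Ainv_1 = np.linalg.solve(A, ones)
    # Lagrange multiplier that makes the contributions sum to one.
    mu = (ones @ Ainv_g - penalty) / (ones @ Ainv_1)
    return (Ainv_g - mu * Ainv_1) / penalty


# Two closely related candidates with high breeding values and one
# unrelated candidate with a lower breeding value (made-up numbers).
breeding_values = [1.0, 1.0, 0.5]
relationship = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
contributions = optimal_contributions(breeding_values, relationship, penalty=2.0)
```

In this toy case the unrelated candidate receives a substantial share despite its lower breeding value, because using it lowers the average relatedness of the selected group. The relationship matrix can come from a pedigree or from genomic data, as the paragraph above notes.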

Furthermore, population genetic methods can find the similarity between populations and individuals, classify individuals based on breed composition or geographic origin, and assign offspring to parents. For example, DNA testing to confirm pedigree in cattle started with blood groups, moved on to genetic markers, and now uses the genome-wide SNP chips that are used for genomic selection [ 34 ]. Genomics provides plentiful markers distributed throughout the genome, so methods can be more precise in pinpointing ancestry [ 35 ] and can reconstruct pedigree information that is missing [ 36 ].
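As an illustration of how marker data can confirm or assign parentage, the sketch below counts "opposing homozygotes": markers where a candidate parent and the offspring carry opposite homozygous genotypes, which is incompatible with Mendelian inheritance barring genotyping error. This is one simple approach chosen for illustration, not necessarily the method used in the cited work; the genotypes, animal names and function names are made up.

```python
def opposing_homozygotes(parent, offspring):
    """Count markers where parent and offspring are opposite homozygotes (0 vs 2)."""
    return sum(1 for a, b in zip(parent, offspring) if {a, b} == {0, 2})


def assign_parent(offspring, candidates):
    """Return the name of the candidate with the fewest Mendelian conflicts."""
    return min(
        candidates,
        key=lambda name: opposing_homozygotes(candidates[name], offspring),
    )


# Allele counts (0, 1 or 2) at six markers; values are illustrative.
offspring = [0, 1, 2, 1, 0, 2]
candidates = {
    "sire_A": [0, 1, 2, 2, 1, 1],
    "sire_B": [2, 0, 0, 1, 2, 2],
}

conflicts = {name: opposing_homozygotes(g, offspring) for name, g in candidates.items()}
best_match = assign_parent(offspring, candidates)
```

With thousands of markers from a SNP chip, a true parent shows almost no conflicts while an unrelated animal shows many, which is why dense genome-wide markers make such checks precise.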

Tools of the sequence perspective

From the sequence perspective, the development of genomics in animal breeding can be seen as an ongoing effort to build the tools for causative variant identification. In the process, it also gave rise to the enabling technology for genomic selection. This development includes reference genomes for farm animals, dense marker panels and affordable methods to type them (SNP chips, reduced representation sequencing), genome annotation and maps that localise causative variants in the genome (linkage mapping and genome-wide association).

The chicken genome sequence was published in 2004 [ 37 ], cattle in 2009 [ 38 ], and pig in 2012 [ 39 ]. The choice of any one publication and year as a milestone in a genome sequencing project is somewhat arbitrary, because the sequences reported in these papers were neither the first nor the last drafts. Genome assembly is an iterative process that combines different kinds of data, computational models, and human judgement to represent a genome. For a historical account of the diverse data and ways of reasoning used in the pig genome project, see Lowe [ 40 ]. Lowe points out that a genome project was not just about sequencing in the narrow sense of putting DNA base pairs in order, but “thick” sequencing, which also includes the creation of tools, annotation with additional data, and dissemination to a research community that makes reference genomes useful. Consequently, the development of farm animal reference sequences is still ongoing, with the pig, cattle and chicken genomes being updated [ 41 , 42 ] and followed by sheep, goat, ducks, turkeys and many others. There are now multiple high-quality genome assemblies, e.g. in cattle [ 43 , 44 ]. Inevitably, more are coming, as genome assembly becomes more affordable and streamlined.

The next layer atop the reference genome is annotation, here understood as any information that has a genomic coordinate, localising it in the genome. As Szymanski et al. [ 45 ] point out in a study of the yeast genome, one of the functions of a reference genome as a digital model of the genome is to allow researchers to organise and connect different sources of data. Researchers can put their data on the same coordinate system and create a coherent picture. In the yeast community, that coherence-building used to be achieved by sharing strains and standard protocols, before the reference genome. For logistical reasons, germplasm sharing is harder in farm animal genetics. But now, genome annotation is available in genome browsers such as the NCBI Genome Data Viewer and Ensembl, which contain comparative information [ 46 ], the location of genes, and non-genic elements of importance such as open chromatin (as it is becoming available). Projects like Functional Annotation of Animal Genomes [ 47 ] are producing detailed maps of gene-regulatory regions in farm animal genomes, with the express purpose that researchers will be able to integrate the openly available data into their own projects. Such functional genomic data might be useful for annotating genetic variants as a part of fine-mapping and nominating potential causative variants, in genomic prediction with sequence data, and in molecular biology studies of gene-regulatory networks.

The key technology enabling genomics in farm animals, however, is affordable high-throughput genotyping, in the form of SNP chip technology that allows the testing of thousands of single nucleotide polymorphisms (SNPs) at the same time. SNP chips are, generally, surfaces with known pieces of DNA on them. The array captures fragments of DNA close to the markers we want to type, and a DNA polymerase enzyme that incorporates labelled nucleotides gives a fluorescence signal, where the relative signal intensity of the alleles will tell us the genotype [ 48 ]. A clustering algorithm then turns the intensity values into genotypes, the numeric coding needed for all the statistical genomic methods.
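To make the step from signal intensities to numeric genotypes concrete, here is a heavily simplified sketch. It assumes each sample yields one intensity per allele at a marker, summarises them as the fraction of signal from the B allele, and groups the fractions with a one-dimensional k-means using three clusters. Production genotype callers are considerably more elaborate; the function name `call_genotypes` and the intensity values are invented for illustration.

```python
def call_genotypes(intensities, n_iter=10):
    """Cluster B-allele signal fractions into three groups; return 0/1/2 calls.

    `intensities` is a list of (signal_A, signal_B) pairs for one marker.
    """
    fractions = [b / (a + b) for a, b in intensities]
    centres = [0.0, 0.5, 1.0]  # starting guesses for AA, AB and BB
    calls = []
    for _ in range(n_iter):
        # Assign each sample to the nearest cluster centre.
        calls = [
            min(range(3), key=lambda k: abs(f - centres[k])) for f in fractions
        ]
        # Move each centre to the mean of its assigned samples.
        for k in range(3):
            members = [f for f, c in zip(fractions, calls) if c == k]
            if members:
                centres[k] = sum(members) / len(members)
    return calls


# Made-up intensities for five animals at one marker.
intensities = [(980, 40), (510, 495), (30, 1010), (1005, 60), (470, 530)]
genotype_calls = call_genotypes(intensities)
```

The output is the count of B alleles per animal, which is the kind of 0/1/2 coding that fills one column of a genotype matrix.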

Looking at the original three farm animal genome papers, they all mentioned genetic improvement of livestock, but in oblique terms. It is as if either they did not know precisely how a reference genome would improve breeding in these animals, or the way forward, now that the reference genome was in place, was too obvious even to mention:

  • The chicken genome sequence promotes both the development of more refined polymorphic maps (see the accompanying paper [ 49 ] ) and the framework for discovering the functional polymorphisms underlying interesting quantitative traits, thus fully exploiting the genetic potential of the chicken. [ 37 ]
  • The cattle genome and associated resources will facilitate the identification of novel functions and regulatory systems of general importance in mammals and may provide an enabling tool for genetic improvement within the beef and dairy industries. [ 38 ]
  • The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model. [ 39 ]

However, when the first SNP chips were published, their design was explicitly motivated by the ability to perform genomic selection, in addition to the ability to improve genetic mapping:

  • The aim of this study was to develop and characterize a high-density, genome-wide SNP assay for cattle with the power to detect genomic segments harboring inter-individual DNA sequence variation affecting phenotypic traits and for application to GWS, in which an animal’s genetic merit is estimated solely from its multilocus genotype. [ 50 ]
  • The most efficient way to genotype large numbers of SNPs is to design a high-density assay that includes tens of thousands of SNPs distributed throughout the genome. These SNP “chips” are a valuable resource for genetic studies in livestock species, such as genomic selection, detection of [quantitative trait loci] or diversity studies. [ 51 ]
  • In livestock species like the chicken, high throughput single nucleotide polymorphism (SNP) genotyping assays are increasingly being used for whole genome association studies and as a tool in breeding (referred to as genomic selection). [ 52 ]

These genomic tools — reference genomes, genome annotation, large-scale genotyping — build towards detecting causative variants that affect traits by allowing bigger and more marker-dense genome-wide association studies for localising causative variants, and the ability to look under the loci detected to find the underlying genes and important sequence elements, such as gene-regulatory sequences. It is striking to read the attitudes in commentaries on genomics in animal breeding from the early days of genomics. Here is Bulfield [ 53 ] in 2000 describing the isolation of causative variants:

Farm animal genomics is developing in four phases. (1) Constructing maps of highly informative markers and genes. (2) Using these maps to scan broadly across genomes of resource populations, segregating for commercially important traits, to locate quantitative trait loci (QTL) into 20–40 cM chromosomal segments. (3) Identifying the trait gene(s) themselves, within these regions. (4) Bridging the ‘phenotype gap’ between the gene(s) and the ultimate trait.

What implications would this have for animal breeding? Bulfield continues:

In animal breeding, a combination of genome analysis and cell culture-based transgenesis would permit a more controlled approach to animal breeding, especially for currently intractable traits such as fertility and disease resistance. In addition, cloning from adult cells (as with Dolly) would permit the replication of (for example) a proven high-yielding and productive dairy cow.

On the same theme, Goddard [ 54 ] wrote in 2003:

I believe animal breeding in the post-genomic era will be dramatically different to what it is today. There will be a massive research effort to discover the function of genes including the effect of DNA polymorphisms on phenotype. Breeding programmes will utilize a large number of DNA-based tests for specific genes combined with new reproductive techniques and transgenes to increase the rate of genetic improvement and to produce for, or allocate animals to, the product line to which they are best suited. However, this stage will not be reached for some years by which time many of the early investors will have given up, disappointed with the early benefits.

In retrospect, Bulfield was clearly too optimistic; Goddard’s more tempered optimism might still be right, depending on how long a time counts as “some years”. Also, the technologies listed by Bulfield [ 53 ] (linkage maps of 20 to 40 cM resolution, microsatellite and amplified fragment length markers, back-crosses and expressed sequence tag libraries) sound antique to students of animal breeding educated today. The low number of markers (e.g., 40 cM resolution would mean about 150 markers to cover the cattle genome) made sense for genetic mapping based on linkage within families, which was the state of the art at the time. The tools of the sequence perspective have come a long way in 20 years, but the underlying problems of causative variant identification remain the same.

That is, despite the increasing development of molecular tools, statistical methods, and increasing dataset sizes, there are few known causative variants for economically important traits (see tables in [ 55 ]). None of them have yet led to transgenic animals that are used in farming. Why have we not found the causative variants? There are at least three problems:

  • It turns out that most traits of interest are massively polygenic. That is, they are affected by thousands of genetic variants, most with individually small effects. This has been a staple assumption of quantitative genetics since the early 20th century, and was further cemented by the failure of linkage mapping to explain large chunks of inheritance. There are now methods (based on genomic selection models) to estimate polygenicity from data. The estimated number of causative variants for complex traits in humans is in the range of tens of thousands [ 56 , 57 ].
  • Quantitative traits may have complex genetic architectures in other ways than polygenicity; they may be affected by rare variants whose effects are hard to estimate, and variants that act in non-additive ways (dominance or epistasis). This is less important for selection, as the response to selection depends on the additive genetic variance, and even non-additive effects at the variant level can result in substantial additive genetic variance [ 58 , 59 ]. However, when we go on to identify causative variants, it may matter, for example, if the apparently additive outcome depends on pairwise interactions between variants that are located close together.
  • Even when an association has been isolated (and there are thousands of them [ 60 ]), fine-mapping an association signal down to the causative variant or even gene is hard, because there are many variants, and they correlate (geneticists call this correlation, abstrusely, “linkage disequilibrium”), and interpreting them and testing their effects are hard work.

The Goddard [ 54 ] quote is particularly apt, because while the post-genomic future he envisaged, based on the sequence perspective, has not happened, at about the same time as that paper was published, he was involved in developing genomic selection, the statistical genomics future that happened instead.

Statistical futures

What is the future of genomic breeding? From the statistical perspective, the immediate future seems to hold even more genomic selection — on more data, with new traits, spread to new species and breeding programs, and possibly enhanced with functional genomic data.

As data accumulate on more and more animals, larger datasets cause computational difficulties. Methods such as APY (the “algorithm of proven and young”), which splits a genomic selection dataset into a “core” group of animals and a “peripheral” group of animals and performs the most intense computations only on the core subset, allow one to use large numbers of genotyped animals and still be able to compute estimated breeding values in reasonable times [ 61 ]. There is a whole strand of genomics research in animal breeding that works on improving the way genomic selection models are used in practice, how to fit the models efficiently, how to re-fit them when new data arrives, and how to estimate their accuracy (see review by [ 62 ]).
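The core and peripheral split can be made concrete with a small numerical sketch. It assumes the commonly described form of the approximation, in which peripheral (non-core) animals are treated as conditionally independent given the core animals, so that only the core block of the genomic relationship matrix needs a dense inversion. The function name `apy_inverse` and the toy data are illustrative and are not taken from [ 61 ].

```python
import numpy as np

def apy_inverse(G, core):
    """Approximate inverse of a relationship matrix G from a core subset.

    Assumption: non-core animals are conditionally independent given the
    core animals, so their conditional covariance is taken to be diagonal.
    """
    n = G.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    # The only dense inversion is on the core-by-core block.
    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])
    Gcn = G[np.ix_(core, noncore)]
    P = Gcc_inv @ Gcn  # regression of non-core animals on core animals

    # Diagonal of the conditional variance of each non-core animal.
    m = np.diag(G)[noncore] - np.einsum("ij,ij->j", Gcn, P)

    Ginv = np.zeros((n, n))
    Ginv[np.ix_(core, core)] = Gcc_inv + (P / m) @ P.T
    Ginv[np.ix_(core, noncore)] = -P / m
    Ginv[np.ix_(noncore, core)] = (-P / m).T
    Ginv[noncore, noncore] = 1.0 / m  # paired indices fill the diagonal
    return Ginv

# Toy relationship matrix from simulated marker scores (illustrative only).
rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 50))
G = Z @ Z.T / 50 + 0.01 * np.eye(6)
G_inv_approx = apy_inverse(G, core=[0, 1, 2])
```

With a single non-core animal the diagonal assumption is trivially true and the result equals the ordinary inverse; the approximation only matters once there are several non-core animals, which is exactly the situation where it saves computation.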

Another ongoing strand of research is extending genomic selection to more complicated genetic scenarios like crossbred animals or generalisation between different populations. Standard genomic selection models work best for prediction within a single population. Thus, if crossbred animals are used for breeding, as is common for example in beef cattle, one would like to have genomic estimated breeding values for them. Even when the crossbred animals might not be used in breeding themselves, such as in pig or poultry breeding, there are traits that can only be measured on crossbred individuals and that information needs to be propagated back to the purebred nucleus animals. Similarly, small breeds might struggle to gather enough data, and the ability to borrow information from larger breeds is attractive.

However, genetic distance between animals quickly reduces the accuracy of genomic selection, complicating across-breed and multi-breed genomic prediction (see review by [ 63 ]). First, comparing distantly related breeds, the marker-trait associations in each breed could be very different, both because the breeds might carry different causative alleles and because the correlations (linkage disequilibrium) between causal variants and markers might be different. Second, non-additive genetic effects, which to a first approximation can be discounted as a nuisance factor within a population, can make a substantial difference as genetic differences accumulate. To accurately predict the outcome, a full model would have to consider both dominance and the genotypes at multiple interacting loci. However, without identifying the interactions and non-linearities, the correlation between marker effect estimates can be shown to decline with genetic differentiation [ 64 ].

Another avenue of development is to find a place for machine learning methods in genomics of animal breeding. Machine learning methods have been used in functional genomics to predict variant effects (reviewed by [ 65 ]), and in animal breeding applications for developing new phenotypes [ 66 , 67 ], but so far have not been widely used in genomic selection. This is not for lack of trying; early work included attempts at using kernel methods [ 68 , 69 ], tree regression [ 70 ] and neural networks [ 71 ], and later efforts have been made with deep learning [ 72 , 73 ]. However, unless we count linear mixed models as a machine learning application, these have not made much impact on applied genomic selection. Probably, this is because non-additive effects have hitherto not played a big role in selection, and these methods only outperform linear mixed models when predicting non-additive effects. This may change if genomic selection is extended to systems where non-additive effects are more important, and one has to design matings to produce offspring that deviate from the parent average in the right direction [ 74 ], or for applications where predicting individual phenotype rather than breeding value is the goal.

Finally, there is a strand of research that aims to improve genomic selection by adding more genomic information. For biological reasons, some variants are expected to contribute more: variants close to known associations from genome-wide association studies, variants predicted by bioinformatic means to be functional, variants associated with gene expression variation, variants located in open chromatin in a relevant tissue, and so on. Various statistical extensions to the genomic selection models allow groups of variants to be treated separately [ 75 , 76 ] and given different emphasis depending on their predicted function. Such methods would be important for performing genomic selection with whole-genome sequence data, which includes millions rather than tens of thousands of variants. It seems clear that there is potential. A series of studies using gene expression quantitative trait locus data in combination with chromatin and evolutionary conservation suggest that one might be able to prioritise variants that are more likely to explain quantitative trait variation [ 77 , 78 ]. However, empirical results on whole-genome sequence data in genomic prediction [ 79 – 82 ] are inconsistent between methods, populations and traits about whether adding genomic information brings any benefit, or even degrades accuracy. Even in simulations where the causative variants are known [ 83 ], the increase in accuracy from including true causative variants is not great, unless the true effect sizes of the variants are known. Therefore, the potential gain from enhancing genomic selection with more genomic information is probably much smaller than the improvement that came from moving from traditional evaluation to genomic selection.
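One simple way to picture “treating groups of variants separately” is to build one genomic relationship matrix per annotation class, so that a downstream model could give each class its own variance component. The sketch below is only an illustration of that idea under stated assumptions: it uses a VanRaden-style scaling with allele frequencies estimated from the sample, and the function names, class labels and simulated genotypes are hypothetical. It is not the specific method of [ 75 , 76 ].

```python
import numpy as np

def grm(genotypes):
    """VanRaden-style relationship matrix from 0/1/2 allele counts.

    Returns the matrix and its scaling constant 2 * sum(p * (1 - p)).
    """
    p = genotypes.mean(axis=0) / 2.0   # sample allele frequencies
    Z = genotypes - 2.0 * p            # centre each marker
    scale = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / scale, scale

def grm_by_class(genotypes, classes):
    """One relationship matrix per annotation class of variants."""
    return {
        label: grm(genotypes[:, classes == label])
        for label in np.unique(classes)
    }

# Simulated data: 10 animals, 40 markers, two hypothetical annotation classes.
rng = np.random.default_rng(0)
geno = rng.binomial(2, 0.3, size=(10, 40)).astype(float)
classes = np.array(["coding"] * 10 + ["other"] * 30)
parts = grm_by_class(geno, classes)
```

Weighting the class-specific matrices by their scaling constants and summing recovers the ordinary all-marker matrix, so the partition changes nothing until the classes are given different emphasis in the model.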

The statistical perspective also holds the opposite possibility: a turn away from the genome. Instead of pursuing more genomic data to possibly improve genomic prediction, one could invest in improving measurement technology or modelling to improve the measurement of traits. Because the task, from the statistical perspective, is not to understand the genome but to get a good enough estimate of ancestry, it might be that the best choice is to settle for a relatively crude genotyping strategy (like a medium density SNP chip) and instead focus on gathering more records on high-value but hard-to-measure traits [ 84 ].

Sequence futures

As we saw above, around the turn of the century there was optimism about identifying causative variants and exploiting them in animal breeding, which turned out to be mostly premature. Marker-assisted selection was successfully used on large-effect variants such as genetic defects, but less successful for quantitative traits. There are thousands of quantitative trait loci and genome-wide association hits published for economically relevant quantitative traits in farm animals, but only a handful that have been fine-mapped down to a causative variant [ 85 ]. However, molecular genetic techniques have moved rapidly over the last 20 years, not just adding new assays for gene-regulatory activity, but scaling them to the whole genome. With these new tools at hand, researchers are again optimistic that causative variants can be identified and exploited.

Several papers outline a vision of a future for the sequence perspective in animal [ 86 , 87 ] and plant breeding [ 88 ], using genome editing methods such as CRISPR/Cas9 to supplement classical breeding with causative variants of known function. They call future, causative-variant enabled breeding “Livestock 2.0” and “Breeding 4.0”. Besides the version number conflict, the visions have a similar overall shape: the future of breeding lies in identifying genetic causative variants through large genomic datasets, and then introducing them into breeding individuals through gene editing. Clark et al. [ 86 ] also describe identifying functional variants and editing them as “a route to application” for functional genomic data in farm animals.

The first application along this route of gene editing would be the ongoing attempts at editing of monogenic high-value traits, such as hornlessness caused by polled alleles in cattle [ 89 ], or porcine reproductive and respiratory syndrome virus resistance in pigs conveyed by edits to the CD163 gene [ 90 ]. In the case of pigs, the causative variant does not occur naturally, and was designed based on molecular knowledge about the virus’ mode of infection. The hornless variant (“polled”) was identified by genome-wide association [ 91 ]. Conceptually, these proposed applications are somewhat different from the applications that have been proposed for transgenic animals before. Transgenic farm animals, such as the defunct “Enviropig” project [ 92 ] or the AquAdvantage salmon [ 93 ], would have DNA introduced from different species, and can be thought of as examples of a genetic engineering approach. These modern proposals typically use less dramatic changes, alleles that exist in nature, or could relatively easily happen by natural mutation (e.g., partial deletion of a gene in the CD163 example, or producing a duplication similar to a naturally occurring duplication in the polled case).

Gene editing is like marker-assisted selection in the sense that the variants to be edited need to have large enough effects to be worthwhile, and editing must be more effective than conventional alternatives. Both resistance to porcine reproductive and respiratory syndrome and polledness are potentially traits of great value and connected to animal welfare. Outbreaks of porcine reproductive and respiratory syndrome have devastating consequences for pig health and farm profitability, and simulations suggest that gene editing in combination with partially protective vaccines could eliminate the disease [ 94 ]. Hornless cows are highly desired by farmers, and dehorning is a welfare issue. As for conventional alternative strategies, natural knockouts of the CD163 gene in pigs appear to be exceedingly rare [ 95 ]. Polled alleles, however, occur in many breeds, including dairy breeds conceived as targets of editing, and marker-assisted selection is already in use in breeding programs to promote them, as polled status can be predicted from SNP chips used for genomic selection. Simulation studies suggest that an editing-based strategy for promoting polled can have better consequences in terms of genetic gain and inbreeding than marker-assisted selection [ 96 – 98 ], but it remains to be seen whether the technological hurdles, regulations, acceptability and ethical issues will be resolved in time for polled gene editing to be successful.

However, going beyond monogenic traits to complex traits, the lack of routes to application other than gene editing becomes a problem. If editing or marker-assisted selection are the only applications for knowledge of causative variants, and neither is likely to work well for complex traits, this limits the applied potential of the sequence perspective. Molecular insights about traits in farm animals are scientifically interesting, but currently have little other applied value. This is often not very clear from reading genomic studies, which often promise improvements to animal breeding without spelling out how they will come about. Allow me a personal and somewhat embarrassing example: In the introduction to my PhD thesis, which was defended in 2015, I wrote about the quantitative trait loci that I had identified, and speculated about what would be needed for them to be used in actual breeding. This discussion was completely misguided. It raised true concerns, such as whether the association would replicate in a different population, whether the variants underlying shared associations in different populations are the same, and so on, but it missed the mark, because I was not aware that marker-assisted selection for quantitative traits was essentially dead at this point. The quantitative trait locus paradigm that I was operating within was dead and buried in animal breeding, and the first commercial genomic selection of poultry was already happening [ 5 ].

Most traits of economic relevance to animal breeding are affected by many variants of small effects. This polygenicity means that in order to know what sequences to edit and what to put instead one needs to solve the fine-mapping problem, to find ways to reliably identify causative variants, even if they are of moderate effect size. The situation is more challenging than with marker-assisted selection, where it may be enough to detect a variant in close linkage disequilibrium with the genuine causative variant. It is still an open question when and how we will get detailed enough knowledge of the genomic basis of complex traits to do this. It would require a workflow to identify causative variants reliably enough to edit them, in a very short time compared to current methods where thorough characterization of a causative variant takes years.

Furthermore, pleiotropy and non-additive effects might affect predictability of the outcomes of editing. Because the size of the genome and its repertoire of genes is limited, genes and pathways are recycled in a context-dependent manner for many biological functions. This suggests that many genetic variants will affect multiple traits, likely mediated by gene-regulatory relationships. This postulate of “universal pleiotropy” goes back to early quantitative genetics [ 99 ] and forms part of the more recent “omnigenic model” of complex traits [ 100 ]. This suggests that any use of gene editing needs to be vigilant against side-effects and consider the whole breeding goal in a balanced way, as argued by [ 101 ]. In the presence of non-additive effects, the statistical effect of an allele substitution depends on the frequency of the interaction partners. This means that the net effect of a gene edit might change as the population changes, as argued by [ 101 , 102 ]. However, one might argue that we already take genomic selection decisions, and thus shift the allele frequency of regions associated with large marker effects, on the basis of estimates that average over potential interactions and are liable to change over time.
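The frequency dependence described above can be illustrated with a toy calculation. The sketch assumes a hypothetical two-locus model with an additive effect `a` at the edited locus A and an additive-by-additive interaction `i` with a partner locus B, Hardy-Weinberg proportions at both loci, and linkage equilibrium between them. The statistical effect of an allele substitution at A is then computed as the regression of genotypic value on allele count at A. The model, parameter values and function names are assumptions made for the example and are not taken from [ 101 , 102 ].

```python
import numpy as np

def hwe_frequencies(p):
    """Frequencies of genotypes carrying 0, 1 and 2 copies of an allele."""
    q = 1.0 - p
    return np.array([q * q, 2.0 * p * q, p * p])

def substitution_effect(a, i, p_a, p_b):
    """Regression of genotypic value on allele count at locus A.

    Genotypic value: a * x_A + i * x_A * x_B, with x the allele counts.
    Assumes 0 < p_a < 1 so that locus A is segregating.
    """
    counts = np.arange(3.0)
    # Joint genotype frequencies under linkage equilibrium.
    weights = np.outer(hwe_frequencies(p_a), hwe_frequencies(p_b))
    x_a, x_b = np.meshgrid(counts, counts, indexing="ij")
    value = a * x_a + i * x_a * x_b

    mean_x = np.sum(weights * x_a)
    mean_v = np.sum(weights * value)
    cov = np.sum(weights * (x_a - mean_x) * (value - mean_v))
    var = np.sum(weights * (x_a - mean_x) ** 2)
    return cov / var

# Same edit, same biology, different frequency of the interaction partner.
effect_partner_rare = substitution_effect(a=1.0, i=0.5, p_a=0.3, p_b=0.1)
effect_partner_common = substitution_effect(a=1.0, i=0.5, p_a=0.3, p_b=0.9)
```

In this model the regression works out to `a + 2 * i * p_b`, so the apparent effect of the same edit grows from 1.1 to 1.9 as the partner allele goes from rare to common, without any change in the underlying biology.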

The next problem to overcome is how to introduce many edits into a breeding program. The challenge has two parts: First, multiplex gene editing is technically challenging on its own, given that the success rate of a biallelic homology-directed repair editing event with CRISPR/Cas9 is low. Even if it could be increased to double digits, the success rate for multilocus edits would scale poorly. Second, integrating gene editing into animal breeding programs would involve performing gene editing at the scale of many animals. Jenko et al. [ 103 ] suggested a strategy of promotion of alleles by gene editing, where the chosen sires of a breeding program would be edited to be homozygous for causative variants that they did not already carry. They assumed that causative variants were known and that sires could be selected before they were edited. This would require new reproductive technology integrated with genomic selection. Such in vitro breeding strategies have been proposed several times [ 24 , 104 , 105 ] as extensions of the already advanced reproductive technologies used in particular in cattle breeding. For example, if embryo transfer is already in use to breed sires for a cattle breeding program, it might be possible in the future to use it to introduce gene editing machinery into the embryo, then biopsy a small amount of DNA to both verify the integrity of the edits and perform genomic selection. It remains to be seen, if this strategy becomes technologically feasible, what numbers of edited embryos and what levels of failure of editing would be acceptable. The failure rates of gene editing technologies are currently high, and that may lead to high costs and loss of selection response [ 96 ].
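The poor scaling of multilocus edits follows from simple arithmetic. As a rough sketch, assume that every locus is edited independently and with the same success probability; both are simplifying assumptions, and the 10% rate below is only an illustrative “double digit” figure, not an estimate from the literature.

```python
def prob_all_edits_succeed(per_edit_success, n_loci):
    """Probability that every one of n_loci independent edits succeeds."""
    return per_edit_success ** n_loci

def expected_embryos_per_success(per_edit_success, n_loci):
    """Expected number of embryos treated per fully edited embryo."""
    return 1.0 / prob_all_edits_succeed(per_edit_success, n_loci)

# How the burden grows with the number of loci at a 10% per-edit rate.
for n_loci in (1, 2, 5, 10):
    p_all = prob_all_edits_succeed(0.10, n_loci)
    embryos = expected_embryos_per_success(0.10, n_loci)
    print(f"{n_loci:2d} loci: P(all succeed) = {p_all:.2e}, "
          f"embryos per success = {embryos:.0f}")
```

Under these assumptions the expected number of embryos grows tenfold with every additional locus, which is why even a substantial improvement in the per-edit rate leaves edits at many loci out of reach.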

Johnsson et al. proposed removal of deleterious alleles [ 106 ], reasoning that damaging variants might be easier to identify from sequence data than causative variants for quantitative traits, and that recessive deleterious alleles may be common in farm animal populations due to ineffective natural selection and the large impact of genetic drift. While that assumption may be true, there is currently no workflow for large-scale identification of deleterious variants in place, and when such variants are detected, marker-assisted selection is more attractive than gene editing.

In summary, the sequence perspective faces challenges, not just within genomics (the fine-mapping problem) but also within reproductive technology and breeding program design (the problem of multiplex editing). Gene editing of very large-effect variants is somewhat akin to marker-assisted selection, where there are reliable workflows for causative variant identification, and individual effects may be dramatic enough to justify editing. However, gene editing of causative variants for complex traits appears too fraught with problems to be possible within the foreseeable future. Perhaps finding a promising route to application for the sequence perspective will require a shift in the thinking of the field that we are not yet seeing, similar to the shift from marker-assisted to genomic selection.

Conclusions

In conclusion, there are (at least) two ways to think of genomics in animal breeding that are helpful in understanding how genomic technologies have changed and may continue to change animal breeding. Currently, tools derived from the statistical perspective are doing the heavy lifting in breeding practice, in the form of genomic selection. With the advent of new technologies, the sequence perspective could make an impact in the future, if it can overcome the twin problems of how to identify causative variants for complex traits and how to introduce them into animals, both at scale.

Authors’ contributions

MJ wrote the paper.

The author acknowledges the financial support from Formas, a Swedish Research Council for Sustainable Development (Dnr 2020-01637).

Open access funding provided by Swedish University of Agricultural Sciences.

Data Availability

Declarations.

The author declares that he has no competing interests.

This paper is based on a presentation at “Approaches to genetics for livestock research” at IASH, University of Edinburgh, May 2019.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

A Correction to this paper has been published: 10.1186/s41065-023-00287-8
