
Data availability statements


Guidance for authors and editors

An article’s data availability statement lets a reader know where and how to access data that support the results and analysis. It may include links to publicly accessible datasets that were analysed or generated during the study, descriptions of what data are available and/or information on how to access data that are not publicly available.

The data availability statement is a valuable link between a paper’s results and the supporting evidence. Springer Nature’s data policy is based on transparency, requiring these statements in original research articles across our journals.

The guidance below offers advice on how to create a data availability statement, along with examples from different research areas.


What should a data availability statement include?

Your data availability statement should describe how the data supporting the results reported in your paper can be accessed. 

  • If your data are in a repository, include hyperlinks and persistent identifiers (e.g. DOI or accession number) for the data where available.  
  • If your data cannot be shared openly, for example to protect study participant privacy, then this should be explained. 
  • Include both original data generated in your research and any secondary data reuse that supports your results and analyses.

Read our detailed guidance on how to write an excellent data availability statement.

Citing data sources

You should cite any publicly available data on which the conclusions of the paper rely. This includes novel data shared alongside the publication and any secondary data sources. 

Data citations should include a persistent identifier (such as a DOI), appear in the reference list with the minimum information recommended by DataCite (Dataset Creator, Dataset Title, Publisher [repository], Publication Year, Identifier [e.g. DOI, Handle or ARK]), and follow journal style.

See our further guidance on citing datasets.
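As an illustration, a reference-list entry assembled from the DataCite minimum elements listed above could follow the template below. The bracketed items are placeholders to be replaced with your dataset's details, and the punctuation should be adapted to the journal's reference style:

  • [Dataset Creator(s)]. [Dataset Title]. [Publisher/repository], [Publication Year]. [Identifier, e.g. https://doi.org/10.xxxx/xxxxx]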

Statement examples by research area 

Life sciences and clinical medicine (data publicly available in a repository):

  • PRO-Seq data were deposited into the Gene Expression Omnibus database under accession number GSE85337 and are available at the following URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85337 . Example from: https://www.nature.com/articles/s41559-017-0447-5
  • The experimental data and the simulation results that support the findings of this study are available in Figshare with the identifier https://doi.org/10.6084/m9.figshare.13322975 . Example from: https://www.nature.com/articles/s41557-020-00629-3
  • The anonymised data collected are available as open data via the University of Bristol online data repository: https://doi.org/10.5523/bris.1dnhjcw6w4m1n29u2yuxjywde1 . Example from: https://doi.org/10.1186/s12889-020-08633-5  

Data available with the paper or supplementary information:

  • All data supporting the findings of this study are available within the paper and its Supplementary Information. Microsatellite primer sequences are provided in Supplementary Table 2, along with original reference describing the microsatellites used in this study. Example from: https://doi.org/10.1038/s41559-017-0148
  • All data on the measured ecosystem variables indicating ecosystem functions that support the findings of this study are included within this paper and its Supplementary Information files. Example from: https://doi.org/10.1038/s41559-017-0391-4

Data cannot be shared openly but are available on request from authors:

  • The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request. Data are located in controlled access data storage at Karolinska Institutet. Example from: https://doi.org/10.1186/s12910-022-00758-z
  • The data that support the findings of this study are available from the authors but restrictions apply to the availability of these data, which were used under license from the Natural History Museum (London) for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission from the Centre for Human Evolution Studies at the Natural History Museum. Example from: https://www.nature.com/articles/s41559-018-0528-0

Chemistry and chemical biology

  • Crystallographic data for the structures reported in this article have been deposited at the Cambridge Crystallographic Data Centre, under deposition numbers CCDC 2125007  (1), 2125008 (3), 2125009 (4), 2125010 (5), 2166315 (6) and 2125011 (7). Copies of the data can be obtained free of charge via https://www.ccdc.cam.ac.uk/structures/ . All other relevant data generated and analysed during this study, which include experimental, spectroscopic, crystallographic and computational data, are included in this article and its supplementary information. Source data  are provided with this paper. Example from: https://doi.org/10.1038/s41557-022-01081-1
  • The raw transient absorption data (including the anisotropy measurements), Raman and ultraviolet–visible spectra, and computational data that support the findings of this study are available in the Edinburgh DataShare repository with the identifier https://doi.org/10.7488/ds/2751 . Example from: https://www.nature.com/articles/s41557-020-0431-6
  • The authors declare that the data supporting the findings of this study are available within the paper and its Supplementary Information files . Should any raw data files be needed in another format they are available from the corresponding author upon reasonable request. Source data are provided with this paper. Example from: https://www.nature.com/articles/s41557-022-01049-1

Physical sciences

  • The dataset on global land precipitation source and evapotranspiration sink is available at https://doi.org/10.1594/PANGAEA.908705 . The MODIS LAI C6 product is available at https://doi.org/10.5067/MODIS/MOD15A2H.006 . GPCP v.2.3 precipitation data are available at https://psl.noaa.gov/data/gridded/data.gpcp.html . GLEAM v.3.3a evapotranspiration data are available at https://www.gleam.eu/ . Air temperature and wind speed from ERA5 are available at https://cds.climate.copernicus.eu . Surface radiation (CERES_SYN1deg_Ed4.1) data are available at https://ceres.larc.nasa.gov/ . SST from NOAA Optimum Interpolation v.2 is available at https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.html . Snow-cover product is available at https://nsidc.org/data/NSIDC-0046/versions/4 . Elevation data are available at  https://www.ngdc.noaa.gov/mgg/global/global.html .
  • The authors declare that the data supporting the findings of this study are available within the paper, its supplementary information files, and the National Tibetan Plateau Data Center ( https://doi.org/10.11888/Cryos.tpdc.272747 ).
  • Data sets generated during the current study are available from the corresponding author on reasonable request. The natural gas production data are available from Drilling Info but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Example from: https://www.nature.com/articles/s41559-017-0329-x

Humanities and social science

  • The datasets generated by the survey research during and/or analyzed during the current study are available in the Dataverse repository, https://doi.org/10.7910/DVN/205YXZ . Example from: https://doi.org/10.1057/s41599-020-00552-5
  • The Greek Hippocratic texts used in this study are available to the public under a Creative Commons license at A Digital Corpus for Graeco-Arabic Studies: https://www.graeco-arabic-studies.org/texts.html . Example from: https://doi.org/10.1057/s41599-020-0511-7

Data available in a repository with restricted access:

  • The Heinz et al. data are available from the Inter-University Consortium of Political and Social Research at https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/6040 . Our custom data are available in the Open Science Framework repository at https://osf.io/5xayz/ . Example from: https://www.nature.com/articles/s41562-019-0761-9
  • The raw CoVIDA and HBS data are protected and are not available due to data privacy laws. The processed data sets are available at OPENICPSR under accession code 14212129 ( https://www.openicpsr.org/openicpsr/project/142121 ). Example from: https://www.nature.com/articles/s41467-021-25038-z
  • The data that support the findings of this study are available from Norwegian Social Research (NOVA), but restrictions apply to the availability of these data, which were used under licence for the current study and so are not publicly available. The data are, however, available from the authors upon reasonable request and with the permission of Norwegian Social Research (NOVA). Example from: https://www.nature.com/articles/s41562-021-01255-w

Data shared with manuscript or Supplementary Information:

  • The author confirms that all data generated or analysed during this study are included in this published article. Example from: https://doi.org/10.1057/s41599-020-0527-z

Data sharing is not applicable:

  • We do not analyse or generate any datasets, because our work proceeds within a theoretical and mathematical approach. Example from: https://doi.org/10.1057/s41599-020-0517-1

Cranfield University

Research data management


How to write a data availability statement (DAS)


What is a DAS?

Data availability statements, also known as data access statements, are included in publications to describe where the data associated with the paper are available, and under what conditions the data can be accessed. They are required by many funders and scientific journals, as well as by the UKRI Common Principles on Data Policy.

Compliance with funder policy

UKRI requires all research articles to include a ‘Data Access Statement’, even where there are no data associated with the article or the data are inaccessible. In situations where no new data have been created, such as a review, the statement “No new data were created or analysed in this study. Data sharing is not applicable to this article.” should be included in the article as a DAS.

Read the full UKRI policy to ensure you know what your responsibilities are.

Cranfield University’s Open Access Policy also holds researchers responsible for ‘Ensuring published results always include a statement on how and on what terms supporting data may be accessed.’

What should I include in a DAS?

Examples of data access statements are provided below, but your statement should typically include:

  • where the data can be accessed (preferably a data repository, like CORD )
  • a persistent identifier, such as a Digital Object Identifier (DOI) or accession number, or a link to a permanent record for the dataset
  • details of any restrictions on accessing the data and a justifiable explanation (e.g., for ethical, legal, or commercial reasons)

* A simple direction to contact the author may not be considered acceptable by some funders and publishers. The EPSRC have made this explicit. Consider setting up a shared email address for your research group or use an existing departmental address.

** Under some circumstances (e.g., participants did not agree for their data to be shared) it may be appropriate to explain that the data are not available at all. In this case, you must give clear and justified reasons.

*** Use in situations such as a review, where no new data have been created.

Additional examples of data access/availability statements can be found on these publisher web pages: Springer Nature, Wiley and Taylor & Francis.

Examples of poor and insufficient DAS

Some funders and journals do not accept a direction to contact the author/authors:

  • Data availability statement: “Data available on request / reasonable request”

A statement is too vague without the provision of a persistent identifier or accession number:

  • Data availability statement: “Data are available in a public, open access repository.”

A data access statement placed only in the methods section:

  • Methods: “All data and codes are available under the following OSF repository: https://osf.io/9999/ ”

An unclear statement of whether there are any data:

  • Availability of data and materials: “Not applicable”

Where should I put the DAS in my paper?

Some journals provide a “data availability” or “data access” section. If no such section exists, you can place your statement in the acknowledgements section.

Link Your Datasets to Your Article

Once your article is published, you should update your repository project with the DOI for your article, which will be emailed to you upon article publication. Linking your supporting data to your publication enables your data and paper to be reciprocally connected, ensuring you receive credit for your work. This can be done in CORD by using the ‘Resource Title’ and ‘Resource DOI’ fields.
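If you prefer to script this step, the sketch below shows one way it might be done, assuming CORD exposes the standard Figshare v2 API and that the ‘Resource Title’ and ‘Resource DOI’ fields correspond to the resource_title and resource_doi metadata fields; the token, record ID and DOI shown are placeholders, and the web interface described above remains the simplest route.

```python
# Hypothetical sketch only: updating a CORD (Figshare-hosted) dataset record with
# the DOI and title of the published article. Endpoint and field names are
# assumptions based on the public Figshare v2 API; check the CORD documentation.
import requests

FIGSHARE_API = "https://api.figshare.com/v2"
TOKEN = "YOUR_PERSONAL_TOKEN"   # personal token created in your account settings
ARTICLE_ID = 1234567            # placeholder: the ID of your CORD dataset record

payload = {
    "resource_title": "Title of the published journal article",
    "resource_doi": "10.xxxx/your-article-doi",  # the DOI emailed to you on publication
}

response = requests.put(
    f"{FIGSHARE_API}/account/articles/{ARTICLE_ID}",
    json=payload,
    headers={"Authorization": f"token {TOKEN}"},
    timeout=30,
)
response.raise_for_status()  # a successful response means the link has been recorded
```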



Write a data availability statement for a paper

Your data availability statement should describe how the data supporting the results reported in your paper can be accessed. This includes both primary data (data you generated as part of your study) and secondary data (data from other sources that you reused in your study). If the data are in a repository, include hyperlinks and persistent identifiers for the data where available. If your data cannot be shared openly for any reason, for example to protect study participant privacy, then this should be explained in the data availability statement. If accessing or reusing your data is subject to any conditions of use or restrictions, these should also be described in the data availability statement.

Go to our data availability statement resource page for more information on writing data availability statements, and examples for different research areas. For more information, please email [email protected].



The Data Availability Statement


Surely, you have already come across many metaphors for knowledge: a building permanently under construction, a very big tree with its roots deep down in the earth, finding nourishment in the legacy of the ancient masters, or even a never-ending road, with countless crossroads and a horizon that sometimes looks close enough to reach, but is never actually quite reachable. With our research, we challenge ourselves with goals that are always set further and further away.

One thing is certain: a step forward only counts as progress when the previous step was taken thoughtfully and confidently, putting us in a solid position to continue moving forward. In other words, research is an activity that naturally thrives on previous findings and research. Allowing your readers to access research data is considered good practice in science, and a mechanism to encourage transparency in scientific progress. It is for this reason that most journals today require a Data Availability Statement, where authors openly provide the necessary information for others to reproduce the work stated or reported in an article. Research reviews that are not based on original data, however, do not need such a statement.

What is a Data Availability Statement (DAS)?

A Data Availability Statement (also called a Data Access Statement) tells the reader whether the data behind a research project can be accessed and, if so, where and how. Ideally, authors should include hyperlinks to public databases to make it easier for readers to find the data. If you are currently in the process of submitting an article to a journal, you should check its guidelines and policy, but normally a DAS is included in a very visible place in the manuscript, right before the reference section. For further guidance, most journals offer templates for different kinds of DAS formats, covering different means of accessing the data (or explaining why they cannot be accessed).

When specific journal instructions regarding how to formulate a DAS are absent or unavailable, there are some examples below you might find handy. However, you should consider tailoring or combining them to fit your needs (replace the information inside brackets):

Data available to be shared

  • The raw data required to reproduce the above findings are available to download from [INSERT PERMANENT WEB LINK(s)]. The processed data required to reproduce the above findings are available to download from [INSERT PERMANENT WEB LINK(s)].

Data not available to be shared

  • The raw/processed data required to reproduce the above findings cannot be shared at this time due to legal/ ethical reasons.
  • The raw/processed data required to reproduce the above findings cannot be shared at this time due to technical/ time limitations.
  • The raw/processed data required to reproduce the above findings cannot be shared at this time as the data also forms part of an ongoing study.

As previously mentioned, the most common practice for sharing datasets is to provide a permanent web link; this way, near-universal access is guaranteed. There are no restrictions concerning the choice of database; authors can use whatever database they want, but depending on which journal they are submitting their papers to, they may be invited to upload their dataset to a specific one. This is a way to encourage authors to include a Data Availability Statement during submission for publication. Generally, though, you can also link your dataset directly to your article.

Why should I write a Data Availability Statement?

Data Availability Statements are normally only one or two sentences, so there is really no reason why one shouldn’t be included in the manuscript. Furthermore, most journals and funders require them for submission purposes.

DASs are important because they support:

Data validation

Impactful findings emerge from solid data. Validating data is an often overlooked step, but it is key to achieving accurate results. By making your data accessible, you give readers the opportunity to check the quality and legitimacy of your paper’s underlying data for themselves. It also makes it easier for researchers to find and access a larger body of scientific material that could be important for developing their own work.

Reuse of information

Scientific progress ultimately rests on used and reused pieces of information; they are the bricks of that ever-growing building mentioned in the introduction. Data Availability Statements allow research data to be easily found and properly examined and validated by scientists who might need it for further research. Together with database indexing, they are a very effective tool for broad access to scientific information networks, making it easier for researchers to develop their projects and further knowledge.

Probably the best news for authors concerning Data Access Statements is that they definitely help improve the chances of being (properly) cited by other researchers. As you may already be aware, the number of citations is an important value taken into account when calculating a researcher’s relevance metric (the H-index). The higher one’s H-index, the more visibility and recognition they have.

It cannot be overemphasized how important Data Availability Statements are for transparency in science. They help ensure that researchers are properly recognized for their work and help science grow as a discipline in which sustainable progress is a key pillar.


Data Availability Statement with Elsevier

In summary, with a Data Availability Statement an author can provide information about the data presented in an article and give a reason if the data are not available to access. View this example article, where the DAS appears under the “research data” section in the article outline.

Benefits for authors and readers include

  • Increases transparency
  • Allows compliance with data policies
  • Encourages good scientific practice and builds trust

How does it work?

Check the Guide for Authors of the journal of your choice to see your options for sharing research data. The majority of Elsevier journals will have integrated the option to create a Data Availability Statement directly in the submission flow. You will be guided through the steps to complete the DAS during the manuscript submission process.



Data Availability

Introduction

PLOS journals require authors to make all data necessary to replicate their study’s findings publicly available without restriction at the time of publication. When specific legal or ethical restrictions prohibit public sharing of a data set, authors must indicate how others may obtain access to the data.

When submitting a manuscript, authors must provide a Data Availability Statement describing compliance with PLOS' data policy. If the article is accepted for publication, the Data Availability Statement will be published as part of the article.

Acceptable data sharing methods are listed below, accompanied by guidance for authors as to what must be included in their Data Availability Statement and how to follow best practices in research reporting.

PLOS believes that sharing data fosters scientific progress. Data availability allows and facilitates:

  • Validation, replication, reanalysis, new analysis, reinterpretation or inclusion into meta-analyses;
  • Reproducibility of research;
  • Efforts to ensure data are archived, increasing the value of the investment made in funding scientific research;
  • Reduction of the burden on authors in preserving and finding old data, and managing data access requests;
  • Citation and linking of research data and their associated articles, enhancing visibility and ensuring recognition for authors, data producers and curators.

Publication is conditional on compliance with this policy. If restrictions on access to data come to light after publication, we reserve the right to post a Correction, an Editorial Expression of Concern, contact the authors' institutions and funders, or, in extreme cases, retract the publication.

Minimal Data Set Definition

Authors must share the “minimal data set” for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods. Additionally, PLOS requires that authors comply with field-specific standards for preparation, recording, and deposition of data when applicable.

For example, authors should submit the following data:

  • The values behind the means, standard deviations and other measures reported;
  • The values used to build graphs;
  • The points extracted from images for analysis.

Authors do not need to submit their entire data set if only a portion of the data was used in the reported study. Also, authors do not need to submit the raw data collected during an investigation if the standard in the field is to share data that have been processed.

PLOS does not permit references to “data not shown.” Authors should deposit relevant data in a public data repository or provide the data in the manuscript.

We require authors to provide sample image data in support of all reported results (e.g. for immunohistochemistry images, fMRI images, etc.), either with the submission files or in a public repository.

For manuscripts submitted to  PLOS Biology,   PLOS ONE, PLOS Climate, PLOS Water, PLOS Global Public Health or PLOS Mental Health  on or after July 1, 2019, authors must provide original, uncropped and minimally adjusted images supporting all blot and gel results reported in the article’s figures and Supporting Information files. Whilst it is not necessary to provide original images at time of initial submission, we will require these files during the peer review process or before a submission can be accepted for publication. 

When reviewing concerns arising after publication in relation to images shown, we may request available underlying data for any image files depicted in the article, as needed to resolve the concern(s).

Acceptable Data Sharing Methods

Deposition within data repository (strongly recommended)

All data and related metadata underlying reported findings should be deposited in appropriate public data repositories, unless already provided as part of a submitted article. Repositories may be either subject-specific repositories that accept specific types of structured data, or cross-disciplinary generalist repositories that accept multiple data types.

If field-specific standards for data deposition exist, PLOS requires authors to comply with these standards. Authors should select repositories appropriate to their field of study (for example, ArrayExpress or GEO for microarray data; GenBank, EMBL, or DDBJ for gene sequences).

The Data Availability Statement must list the name of the repository or repositories as well as digital object identifiers (DOIs), accession numbers or codes, or other persistent identifiers for all relevant data.

Data citation

PLOS encourages authors to cite any publicly available research data in their reference list. References to data sets (data citations) must include a persistent identifier (such as a DOI). Citations of data sets, when they appear in the reference list, should include the minimum information recommended by DataCite and follow journal style.

Example : Andrikou C, Thiel D, Ruiz-Santiesteban JA, Hejnol A. Active mode of excretion across digestive tissues predates the origin of excretory organs. 2019. Dryad Digital Repository. https://doi.org/10.5061/dryad.bq068jr .

PLOS supports the data citation roadmap for scientific publishers developed by the Publishers Early Adopters Expert Group as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11 and the NIH BioCADDIE program.

Data in Supporting Information files

Although authors are encouraged to directly deposit data in  appropriate repositories , data can be included in  Supporting Information  files. When including data in Supporting Information files, authors should submit data in file formats that are standard in their field and allow wide dissemination. If there are currently no standards in the field, authors should maximize the accessibility and reusability of the data by selecting a file format from which data can be efficiently extracted (for example, spreadsheets are preferable to PDFs or images when providing tabulated data).

Upon publication, PLOS uploads all Supporting Information files associated with an article to the figshare repository to increase compliance with the  FAIR principles  (Findable, Accessible, Interoperable, Reusable).

Supporting Information files are published exactly as provided and are not copyedited. Each file should be less than 20 MB.

Data Management Plans

Some funding agencies have policies on the preparation and sharing of Data Management Plans (DMPs), and authors who receive funding from some agencies may be required to prepare DMPs as a condition of grants.

PLOS encourages authors to prepare DMPs before conducting their research and encourages authors to make those plans available to editors, reviewers and readers who wish to assess them.

The following resources may also be consulted for guidance on DMPs:

  • Funders and institutions
  • Digital Curation Centre
  • Data Stewardship Wizard

Acceptable Data Access Restrictions

PLOS recognizes that, in some instances, authors may not be able to make their underlying data set publicly available for legal or ethical reasons. This data policy does not overrule local regulations, legislation or ethical frameworks. Where these frameworks prevent or limit data release, authors must make these limitations clear in the Data Availability Statement at the time of submission. Acceptable restrictions on public data sharing are detailed below.

Please note it is not acceptable for an author to be the sole named individual responsible for ensuring data access.

Third-party data

For studies involving third-party data, we encourage authors to share any data specific to their analyses that they can legally distribute. PLOS recognizes, however, that authors may be using third-party data they do not have the rights to share. When third-party data cannot be publicly shared, authors must provide all information necessary for interested researchers to apply to gain access to the data, including:

  • A description of the data set and the third-party source
  • If applicable, verification of permission to use the data set
  • All necessary contact information others would need to apply to gain access to the data

Authors should properly cite and acknowledge the data source in the manuscript. Please note, if data have been obtained from a third-party source, we require that other researchers would be able to access the data set in the same manner as the authors.

Human research participant data and other sensitive data

For studies involving human research participant data or other sensitive data, we encourage authors to share de-identified or anonymized data. However, when data cannot be publicly shared, we allow authors to make their data sets available upon request. In these cases, the Data Availability Statement should:

  • Explain the restrictions in detail (e.g., data contain potentially identifying or sensitive patient information)
  • Provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent

General guidelines for human research participant data

Prior to sharing human research participant data, authors should consult with an ethics committee to ensure data are shared in accordance with participant consent and all applicable local laws.

Data sharing should never compromise participant privacy. It is therefore not appropriate to publicly share personally identifiable data on human research participants. The following are examples of data that should not be shared: 

  • Name, initials, physical address
  • Internet protocol (IP) address
  • Specific dates (birth dates, death dates, examination dates, etc.)
  • Contact information such as phone number or email address
  • Location data

Data that are not directly identifying may also be inappropriate to share, as in combination they can become identifying. For example, data collected from a small group of participants, vulnerable populations, or private groups should not be shared if they involve indirect identifiers (such as sex, ethnicity, location, etc.) that may risk the identification of study participants. 

Steps necessary to protect privacy may include de-identifying data, adding noise, or blocking portions of the database. Where this is not possible, data sharing could be restricted by license agreements directed specifically at privacy concerns. Additional guidance on preparing human research participant data for publication, including information on how to properly de-identify these data, can be found here:

  • Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers

 The following resources may also be consulted for guidance on sharing human research participant data:

  • Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk
  • European Medicines Agency: Publication and access to clinical-trial data
  • US National Institutes of Health: Protecting the Rights and Privacy of Human Subjects
  • Canadian Institutes of Health Research Best Practices for Protecting Privacy in Health Research
  • UK Data Archive: Anonymisation Overview
  • Australian National Data Service: Ethics, Consent and Data Sharing

Guidelines for qualitative data

For studies analyzing data collected as part of qualitative research, authors should make excerpts of the transcripts relevant to the study available in an appropriate data repository, within the paper, or upon request if they cannot be shared publicly. If even sharing excerpts would violate the agreement to which the participants consented, authors should explain this restriction and what data they are able to share in their Data Availability Statement.

See the Qualitative Data Repository for more information about managing and depositing qualitative data.

Other sensitive data

Some data that do not describe human research participants may also be sensitive and inappropriate to share. For studies analyzing other types of sensitive data, authors should share data as appropriate after consulting established field guidelines and all applicable local laws. Examples of sensitive data that may be subject to restrictions include, but are not limited to, data from field studies in protected areas, locations of sensitive archaeological sites, and locations of endangered or threatened species.

Additional help

Please contact the journal office ( [email protected] ) if:

  • You have concerns about the ethics or legality of sharing your data
  • Your institution does not have an established point of contact to field external requests for access to sensitive data
  • You feel unable to share data for reasons not specified above

Unacceptable Data Access Restrictions

PLOS journals will not consider manuscripts for which the following factors influence authors’ ability to share data:

  • Authors will not share data because of personal interests, such as patents or potential future publications.
  • The conclusions depend solely on the analysis of proprietary data. We consider proprietary data to be data owned by individuals, organizations, funders, institutions, commercial interests, or other parties that the data owners will not share. If proprietary data are used and cannot be accessed by others in the same manner by which the authors obtained them, the manuscript must include an analysis of publicly available data that validates the study’s conclusions so that others can reproduce the analysis and build on the study’s findings.

General questions

Why do we not allow an author to be the only point of contact for fielding requests for access to restricted data?

When possible, we recommend authors deposit restricted data to a repository that allows for controlled data access. If this is not possible, directing data requests to a non-author institutional point of contact, such as a data access or ethics committee, helps guarantee long term stability and availability of data. Providing interested researchers with a durable point of contact ensures data will be accessible even if an author changes email addresses, institutions, or becomes unavailable to answer requests.

When was the current data policy implemented?

The data policy was implemented on March 3, 2014. Any paper submitted before that date will not have a Data Availability Statement. For all manuscripts submitted or published before this date, data must be made available upon reasonable request.


What if my article does not contain any data?

All articles must include a Data Availability Statement but some submissions, such as Registered Report Protocols and Lab or Study Protocol articles, may not contain data. For manuscripts that do not report data, authors must state in their Data Availability Statement that their article does not report data and the data availability policy is not applicable to their article.

Depositing data

What if I cannot provide accession numbers or DOIs for my data set at submission?

Authors may submit their manuscript and include placeholder language in their Data Availability Statement indicating that accession numbers and/or DOIs will be made available after acceptance. The journal office will contact authors prior to publication to ask for this information and will hold the paper until it is received.

Providing private data access to reviewers and editors during the peer review process is acceptable. Many repositories permit private access for review purposes, and have policies for public release at publication.

Is PLOS integrated with any repositories?

PLOS partners with repositories to support data sharing and compliance with the PLOS data policy. Our submission system is integrated with partner repositories to ensure that the article and its underlying data are paired, published together and linked. Current partners include Dryad and FlowRepository.

Partner repositories may have a data submission fee. PLOS is not able to cover this fee and authors are under no obligation to use any specific repository. PLOS does not gain financially from our association with any integrated partners.

Additionally, PLOS uploads all Supporting Information files associated with an article to the figshare repository to increase compliance with the FAIR principles (Findable, Accessible, Interoperable, Reusable).

How do I deposit data with a data repository integration partner?

When authors deposit data in the integrated repository, they receive a provisional data set DOI along with a private reviewer URL link. Upon submission to PLOS, authors should include the data set DOI in the Data Availability Statement. They should also provide the reviewer URL, which will permit restricted access to the data during peer review. If a manuscript is editorially accepted by a PLOS journal, the publication of the article and public release of the data set will be automatically coordinated.

I cannot afford the cost of depositing a large amount of data. What should I do?

PLOS encourages authors to investigate all options and to contact their institutions if they have difficulty providing access to the data underlying the research. There are several repositories recommended by PLOS that specialize in handling large data sets.

What are acceptable licenses for my data deposition?

If authors use repositories with stated licensing policies, the policies should not be more restrictive than the Creative Commons Attribution (CC BY) license .

PLOS Data Advisory Board

PLOS has formed an external board of advisors across many fields of research published in PLOS journals. This board will work with us to develop community standards for data sharing across various fields, provide input and advice on especially complex data-sharing situations submitted to the journals, define data-sharing compliance, and proactively work to refine our policy. If you have any questions or feedback, we welcome you to write to us at [email protected] .

Imperial College London


How to write a data access statement

What is a data access statement?

Data access statements, also known as data availability statements, are included in publications to describe where the data associated with the paper are available, and under what conditions the data can be accessed. They are required by many funders and scientific journals, as well as by the UKRI Common Principles on Data Policy.

What should I include in a data access statement?

Examples of data access statements are provided below, but your statement should typically include:

  • where the data can be accessed (preferably a data repository)
  • a persistent identifier, such as a Digital Object Identifier (DOI) or accession number, or a link to a permanent record for the dataset
  • details of any restrictions on accessing the data and a justifiable explanation (e.g. for ethical, legal or commercial reasons)

*A simple direction to contact the author may not be considered acceptable by some funders and publishers. The EPSRC have made this explicit. Consider setting up a shared email address for your research group or use an existing departmental address.

** Under some circumstances (e.g. participants did not agree for their data to be shared) it may be appropriate to explain that the data are not available at all. In this case, you must give clear and justified reasons.

Additional examples of data access/availability statements can be found on these publisher web pages: Springer Nature, Wiley and Taylor & Francis.

Where should I put the data availability statement in my paper?

Some journals provide a “data access” or “data availability” section. If no such section exists, you can place your statement in the acknowledgements section.


Writing “Data Availability Statement” in Your Publications

More and more journals require you to write a Data Availability Statement (DAS) when you submit a manuscript. What is DAS? How do you write one?

Where do you usually find the data that relate to a published paper? Sometimes, authors include data as supplementary materials in images or PDFs, or as data files for download. In recent years, more authors have put relevant datasets in data repositories and provided links to the datasets. However, when the data are not explicitly made available with the published paper, readers have no clue about their accessibility.

A Data Availability Statement, or DAS, serves to add transparency, so that data can be validated, reused, and properly cited. A DAS is simple to write: it is one or two sentences that tell readers where to find the data associated with your paper. You can place it near the end of your manuscript, for example before the “References”.

What DAS Describes

A DAS simply describes the availability of the data underlying your paper. Here, “data” means the dataset that supports the results reported and that would be needed to interpret, replicate and build upon the findings in the published paper. A DAS can be very short; it should tell your readers:

  • whether the data is available
  • where to find it
  • under what conditions it can be accessed

Here are three examples of how DAS appears in an article:

Source: Calzolari et al. Vestibular agnosia in traumatic brain injury and its link to imbalance. Brain , 144 , 1 (January 2021), Pages 128–143, https://doi.org/10.1093/brain/awaa386

Source: Gomez-Gonzalez, C., Nesseler, C. & Dietl, H.M. Mapping discrimination in Europe through a field experiment in amateur sport. Humanit Soc Sci Commun 8, 95 (2021). https://doi.org/10.1057/s41599-021-00773-2

Source: Danylo et al. A map of the extent and year of detection of oil palm plantations in Indonesia, Malaysia and Thailand. Sci Data 8, 96 (2021). https://doi.org/10.1038/s41597-021-00867-1

DAS Does Not Require Data Archiving or Sharing

When a journal requires you to include a DAS in your manuscript, it does not impose data sharing in open repositories. You may choose to share your data upon request, or you may not be able to share data for various reasons. The DAS only requires you to describe the situation explicitly with a simple and clear statement.

Scope of “Data” Involved

It is the authors’ judgement how much data, and which data, qualify as underlying data for a particular publication. It may be helpful to start with the “minimum” dataset that can support the findings. Sharing your whole set of raw data is often neither necessary nor useful to your readers. Think about which dataset is necessary and sufficient for your readers to interpret or validate your findings.

Journal Policies and Templates

Different journals have different policies regarding the DAS; always check the policies of the journals you are submitting to. Some journals only encourage authors to include a DAS, while others make it a requirement. They may also have different formatting requirements, such as where the DAS should appear. Some publishers provide guidance and templates to help authors write a DAS, for example:

  • SpringerNature
  • American Institute of Physics (AIP)
  • Taylor & Francis

Even when the journal you use does not have such a requirement, it is good research practice to write a DAS for each published paper. Use these templates to get started.

— By Gabi Wong , Library


Creating a Data Availability Statement

A guide to crafting a statement on how to find and access the data used in your paper.

What Is a Data Availability Statement?

A data availability statement (DAS) details where the data used in a published paper can be found and how it can be accessed. These statements are required by many publishers, including FASEB, our member societies, Cambridge, Springer Nature, Wiley, and Taylor & Francis. Some grant funders also require DASs.

Where Can I Find an Article’s DAS?

In a published article, the DAS is usually printed alongside the author affiliation, disclosures, or funding information. That is, it is either at the front or very end of an article. For example, PubMed Central displays the data availability statement in a drop-down at the top of its articles.


What Should I Write in my DAS?

The table below outlines some templates. Each journal has different requirements and templates, so you should check journal author guidelines prior to submission.

This table is condensed from a webpage created by the publisher Taylor & Francis. Other useful options are available on their author services pages. Other publishers have their own guidance on data availability statements, for example Wiley and Springer Nature. Publisher templates are not suitable for all situations; researchers should modify the templates to suit their paper.

Tip: Simply writing “data available upon reasonable request” is generally considered insufficient for a data availability statement. Studies have demonstrated a lack of author compliance when data is actually requested.


More Resources

  • CHORUS: Publisher Data Availability Policies Index
  • Memorial Sloan Kettering Library Data Policy Finder

Taylor & Francis. “Writing a Data Availability Statement.” Author Services. Accessed November 12, 2023. https://authorservices.taylorandfrancis.com/data-sharing/share-your-data/data-availability-statements/ .


A descriptive analysis of the data availability statements accompanying medRxiv preprints and a comparison with their published counterparts

Luke A. McGuinness

1 Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom

2 MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, United Kingdom

Athena L. Sheppard

3 Department of Health Sciences, University of Leicester, Leicester, United Kingdom

Associated Data

All materials (data, code and supporting information) are available from GitHub (https://github.com/mcguinlu/data-availability-impact), archived at time of submission on Zenodo (DOI: 10.5281/zenodo.3968301).

Objectives

To determine whether medRxiv data availability statements describe open or closed data—that is, whether the data used in the study is openly available without restriction—and to examine if this changes on publication based on journal data-sharing policy. Additionally, to examine whether data availability statements are sufficient to capture code availability declarations.

Design

Observational study, following a pre-registered protocol, of preprints posted on the medRxiv repository between 25th June 2019 and 1st May 2020 and their published counterparts.

Main outcome measures

Distribution of preprinted data availability statements across nine categories, determined by a prespecified classification system. Change in the percentage of data availability statements describing open data between the preprinted and published versions of the same record, stratified by journal sharing policy. Number of code availability declarations reported in the full-text preprint which were not captured in the corresponding data availability statement.

Results

3938 medRxiv preprints with an applicable data availability statement were included in our sample, of which 911 (23.1%) were categorized as describing open data. 379 (9.6%) preprints were subsequently published, and of these published articles, only 155 contained an applicable data availability statement. Similar to the preprint stage, a minority (59 (38.1%)) of these published data availability statements described open data. Of the 151 records eligible for the comparison between preprinted and published stages, 57 (37.7%) were published in journals which mandated open data sharing. Data availability statements more frequently described open data on publication when the journal mandated data sharing (open at preprint: 33.3%, open at publication: 61.4%) compared to when the journal did not mandate data sharing (open at preprint: 20.2%, open at publication: 22.3%).

Conclusions

Requiring that authors submit a data availability statement is a good first step, but is insufficient to ensure data availability. Strict editorial policies that mandate data sharing (where appropriate) as a condition of publication appear to be effective in making research data available. We would strongly encourage all journal editors to examine whether their data availability policies are sufficiently stringent and consistently enforced.

1 Introduction

The sharing of data generated by a study is becoming an increasingly important aspect of scientific research [ 1 , 2 ]. Without access to the data, it is harder for other researchers to examine, verify and build on the results of that study [ 3 ]. As a result, many journals now mandate data availability statements. These are dedicated sections of research articles, which are intended to provide readers with important information about whether the data described by the study are available and if so, where they can be obtained [ 4 ].

While requiring data availability statements is an admirable first step for journals to take, and as such is viewed favorably by journal evaluation rubrics such as the Transparency and Openness Promotion [TOP] Guidelines [ 5 ], a lack of review of the contents of these statements often leads to issues. Many authors claim that their data can be made “available on request”, despite previous work establishing that these statements are demonstrably untrue in the majority of cases—that when data is requested, it is not actually made available [ 6 – 8 ]. Additionally, previous work found that the availability of data “available on request” declines with article age, indicating that this approach is not a valid long term option for data sharing [ 9 ]. This suggests that requiring data availability statements without a corresponding editorial or peer review of their contents, in line with a strictly enforced data-sharing policy, does not achieve the intended aim of making research data more openly available. However, few journals actually mandate data sharing as a condition of publication. Of a sample of 318 biomedical journals, only ~20% had a data-sharing policy that mandated data sharing [ 10 ].

Several previous studies have examined the data availability statements of published articles [ 4 , 11 – 13 ], but to date, none have examined the statements accompanying preprinted manuscripts, including those hosted on medRxiv, the preprint repository for manuscripts in the medical, clinical, and related health sciences [ 14 ]. Given that preprints, particularly those on medRxiv, have impacted the academic discourse around the recent (and ongoing) COVID-19 pandemic to a similar, if not greater, extent than published manuscripts [ 15 ], assessing whether these studies make their underlying data available without restriction (i.e. “open”), and adequately describe how to access it in their data availability statements, is worthwhile. In addition, by comparing the preprint and published versions of the data availability statements for the same paper, the potential impact of different journal data-sharing policies on data availability can be examined. This study aimed to explore the distribution of data availability statements’ description of the underlying data across a number of categories of “openness” and to assess the change between preprint and journal-published data availability statements, stratified by journal data-sharing policy. We also intended to examine whether authors planning to make the data available upon publication actually do so, and whether data availability statements are sufficient to capture code availability declarations.

2.1 Protocol and ethics

A protocol for this analysis was registered in advance and followed at all stages of the study [ 16 ]. Any deviations from the protocol are described. Ethical approval was not required for this study.

2.2 Data extraction

The data availability statements of preprints posted on the medRxiv preprint repository between 25th June 2019 (the date of first publication of a preprint on medRxiv) and 1st May 2020 were extracted using the medrxivr and rvest R packages [ 17 , 18 ]. Completing a data availability statement is required as part of the medRxiv submission process, and so a statement was available for all eligible preprints. Information on the journal in which preprints were subsequently published was extracted using the published DOI provided by medRxiv and rcrossref [ 19 ]. Several other R packages were used for data cleaning and analysis [ 20 – 33 ].
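The extraction itself was performed in R with medrxivr and rvest, as described above. Purely as an illustration of the date-window filtering step, the minimal Python sketch below assumes a hypothetical local CSV snapshot of medRxiv metadata; the file name and column names are placeholders, not the actual medrxivr output format.

```python
import pandas as pd

# Hypothetical snapshot of medRxiv metadata; file and column names are placeholders.
preprints = pd.read_csv("medrxiv_snapshot.csv", parse_dates=["date_posted"])

# Restrict to the study window: first medRxiv preprint (25 June 2019) to 1 May 2020.
window = preprints[
    (preprints["date_posted"] >= "2019-06-25")
    & (preprints["date_posted"] <= "2020-05-01")
]

# Keep the fields needed for categorization and journal matching.
statements = window[["doi", "published_doi", "data_availability_statement"]]
statements.to_csv("statements_for_coding.csv", index=False)
```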

To extract the data availability statements for published articles and the journals’ data-sharing policies, we browsed to the article or journal website and manually copied the relevant material (where available) into an Excel file. The extracted data are available for inspection (see Material availability section).

2.3 Categorization

A pre-specified classification system was developed to categorize each data availability statement as describing either open or closed data, with additional ordered sub-categories indicating the degree of openness (see Table 1 ). The system was based on the “Findability” and “Accessibility” elements of the FAIR framework [ 34 ], the categories used by previous efforts to categorize published data availability statements [ 4 , 11 ], our own experience of medRxiv data availability statements, and discussion with colleagues. Illustrative examples of each category were taken from preprints included in our sample [ 35 – 43 ].


The data availability statement for each preprinted record was categorized by two independent researchers, using the groups presented in Table 1 , while the statements for published articles were categorized using all groups barring Categories 3 and 4 (“Available in the future”). Records for which the data availability statement was categorized as “Not applicable” (Category 0 from Table 1 ) at either the preprint or published stage were excluded from further analyses. Researchers were provided only with the data availability statement and, as a result, were blind to the associated preprint metadata (e.g. title, authors, corresponding author institution) in case this could affect their assessments. Any disagreements were resolved through discussion.

Due to our large sample, if authors claimed that all data were available in the manuscript or its supplementary files ( S1 File ), or that their study did not make use of any data, we took them at their word. Where a data availability statement met multiple categories or contained multiple data sources with varying levels of openness, we took a conservative approach and categorized it on the basis of the most restrictive aspect (see S1 File for some illustrative examples). We plotted the distribution of preprint and published data availability statements across the nine categories presented in Table 1 .
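As a rough sketch of the “most restrictive aspect” rule, the snippet below assumes the ordered categories of Table 1 run from closed (low numbers) to open (high numbers); the numeric codes and labels shown are paraphrased assumptions rather than the exact wording of Table 1.

```python
# Assumed ordering: lower category number = more restrictive (Table 1 is ordered).
CATEGORY_LABELS = {
    1: "No data available",
    2: "Available on request",
    5: "Access-controlled repository",
    8: "Openly available in a repository",
}

def categorize(candidate_categories):
    """Return the most restrictive (lowest-numbered) applicable category."""
    return min(candidate_categories)

# A statement pointing to an open repository for one dataset but offering a
# second dataset only "on request" would be coded as Category 2 overall.
overall = categorize([8, 2])
print(overall, CATEGORY_LABELS[overall])
```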

Similarly, the extracted data-sharing policies were classified by two independent reviewers according to whether the journal mandated data sharing (1) or not (0). Where a journal had no obvious data-sharing policy, it was classified as not mandating data sharing.

2.4 Changes between preprinted and published statements

To assess if data availability statements change between preprint and published articles, we examined whether a discrepancy existed between the categories assigned to the preprinted and published statements, and the direction of the discrepancy (“more closed” or “more open”). Records were deemed to become “more open” if their data availability statement was categorized as “closed” at the preprint stage and “open” at the published stage. Conversely, records described as “more closed” were those moving from “open” at preprint to “closed” on publication.
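A minimal sketch of how this comparison could be tabulated, assuming a pandas data frame with one row per matched record and hypothetical column names for the binary open/closed coding at each stage and the journal policy:

```python
import pandas as pd

# Hypothetical columns: open_preprint / open_published are 1 ("open") or 0 ("closed");
# policy_mandates is 1 if the journal mandates data sharing, 0 otherwise.
records = pd.read_csv("matched_records.csv")

def direction(row):
    if row["open_preprint"] == 0 and row["open_published"] == 1:
        return "more open"
    if row["open_preprint"] == 1 and row["open_published"] == 0:
        return "more closed"
    return "no change"

records["change"] = records.apply(direction, axis=1)

# Cross-tabulate the direction of change by journal data-sharing policy.
print(pd.crosstab(records["policy_mandates"], records["change"]))
```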

We declare a minor deviation from our protocol for this analysis [ 16 ]. Rather than investigating the data-sharing policy only for journals with the largest change in openness as intended, which involved setting an arbitrary cut-off when defining “largest change”, we systematically extracted and categorized the data-sharing policies for all journals in which preprints had subsequently been published using two categories (1: “requiring/mandating data sharing” and, 2: “not requiring/mandating data sharing”), and compared the change in openness between these two categories. Note that Category 2 includes journals that encourage data sharing, but do not make it a condition of publication.

To assess claims that data will be provided on publication, the data availability statements accompanying the published articles for all records in Category 3 (“Data available on publication (link provided)”) or Category 4 (“Data available on publication (no link provided)”) from Table 1 were assessed, and any difference between the two categories examined.

2.5 Code availability

Finally, to assess whether data availability statements also capture the availability of programming code, such as STATA do files or R scripts, the data availability statement and full text PDF for a random sample of 400 preprinted records were assessed for code availability (1: “code availability described” and 2: “code availability not described”).

The data availability statements accompanying 4101 preprints registered between 25th June 2019 and 1st May 2020 were extracted from the medRxiv preprint repository on the 26th May 2020 and were coded by two independent researchers according to the categories in Table 1 . During this process, agreement between the raters was high (Cohen’s Kappa = 0.98; “almost perfect agreement”) [ 44 ].
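Agreement between two raters assigning the same set of categories can be quantified directly from their paired codes; a minimal sketch using scikit-learn, with hypothetical file and column names, is shown below.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical file: one row per preprint, one column per rater holding the
# assigned category (0-8).
codes = pd.read_csv("double_coded_statements.csv")

kappa = cohen_kappa_score(codes["rater_1"], codes["rater_2"])
# Values above 0.81 are conventionally read as "almost perfect" agreement.
print(f"Cohen's kappa: {kappa:.2f}")
```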

Of the 4101 preprints, 163 (4.0%) in Category 0 (“Not applicable”) were excluded following coding, leaving 3938 remaining records. Of these, 911 (23.1%) had made their data open as per the criteria in Table 1 . The distribution of data availability statements across the categories can be seen in Fig 1 . A total of 379 (9.6%) preprints had been subsequently published, and of these, only 159 (42.0%) had data availability statements that we could categorize. 4 (2.5%) records in Category 0 (“Not applicable”) were excluded, and of the 155 remaining, 59 (38.1%) had made their data open as per our criteria.


For the comparison of preprinted data availability statements with their published counterparts, we excluded records that were not published, that did not have a published data availability statement or that were labeled as “Not applicable” at either the preprint or published stage, leaving 151 records (3.7% of the total sample of 4101 records).

Data availability statements more frequently described open data on publication compared to the preprinted record when the journal mandated data sharing ( Table 2 ). Moreover, the data availability statements for 8 articles published in journals that did not mandate open data sharing became less open on publication. The change in openness for preprints grouped by category and stratified by journal policy is shown in S1 Table in S1 File , while the change for each individual journal included in our analysis is shown in S2 Table in S1 File .

Interestingly, 22 records published in a journal mandating open data sharing did not have an open data availability statement. The majority of these records described data that were available from a central access-controlled repository (Category 5 or 6), while in others, legal restrictions were cited as the reason for the lack of data sharing. However, in some cases, the data were either insufficiently described or only available on request (S3 Table in S1 File ), indicating that journal policies which mandate data sharing may not always be consistently applied, allowing some records to slip through the gaps.

161 (4.1%) preprints stated that data would be available on publication, but only 10 of these had subsequently been published ( Table 3 ). The number describing open data on publication did not appear to vary based on whether the preprinted data availability statement included a link to an embargoed repository, though the sample size is small.

Of the 400 records for which code availability was assessed, 75 mentioned code availability in the preprinted full-text manuscript. However, only 22 (29.3%) of these also described code availability in the corresponding data availability statement (S4 Table in S1 File ).

4 Discussion

4.1 Principal findings and comparison with other studies

We have reviewed 4101 preprinted and 159 published data availability statements, coding them as “open” or “closed” according to a predefined classification system. During this labor-intensive process, we appreciated statements that reflected the authors’ enthusiasm for data sharing (“YES”) [ 45 ], their bluntness (“Data is not available on request.”) [ 46 ], and their efforts to endear themselves to the reader (“I promise all data referred to in the manuscript are available.”) [ 47 ]. Of the preprinted statements, almost three-quarters were categorized as “closed”, with the largest individual category being “available on request”. In light of the substantial impact that studies published as preprints on medRxiv have had on real-time decision making during the current COVID-19 pandemic [ 15 ], it is concerning that data for these preprints is so infrequently readily available for inspection.

A minority of the published records we examined contained a data availability statement (n = 159 (42.0%)). The absence of an availability statement at publication results in a loss of useful information. For at least one published article, we identified relevant information in the preprinted statement that did not appear anywhere in the published article, because the published version did not contain a data availability statement [ 48 , 49 ].

We provide initial descriptive evidence that strict data-sharing policies, which mandate that data be made openly available (where appropriate) as a condition of publication, appear to succeed in making research data more open than those that do not. Our findings, though based on a relatively small number of observations, agree with other studies on the effect of journal policies on author behavior. Recent work has shown that “requiring” a data availability statement was effective in ensuring that this element was completed [ 4 ], while “encouraging” authors to follow a reporting checklist (the ARRIVE checklist) had no effect on compliance [ 50 , 51 ].

Finally, we also provide evidence that data availability statements alone are insufficient to capture code availability declarations. Even when researchers wish to share their code, as evidenced by a description of code availability in the main paper, they frequently do not include this information in the data availability statement. Code sharing has been advocated strongly elsewhere [ 52 – 54 ], as it provides an insight into the analytic decisions made by the research team, and there are few, if any, circumstances in which it is not possible to share the analytic code underpinning an analysis. Similar to data availability statements, a dedicated code availability statement which is critically assessed against a clear code-sharing policy as part of the editorial and peer review processes will help researchers to appraise published results.

4.2 Strengths and limitations

A particular strength of this analysis is that the design allows us to compare what is essentially the same paper (same design, findings and authorship team) under two different data-sharing policies, and assess the change in the openness of the statement between them. To our knowledge, this is the first study to use this approach to examine the potential impact of journal editorial policies. This approach also allows us to address the issue of self-selection. When looking at published articles alone, it is not possible to tell whether authors always intended to make their data available and chose a given journal due to its reputation for data sharing. In addition, we have examined all available preprints within our study period and all corresponding published articles, rather than taking a sub-sample. Finally, categorization of the statements was carried out by two independent researchers using predefined categories, reducing the risk of misclassification.

However, our analysis is subject to a number of potential limitations. The primary one is that manuscripts (at both the preprint and published stages) may have included links to the data, or other information that uniquely identifies the dataset in a data portal, within the text (for example, in the Methods section). While this might be the case, if readers are expected to piece together the relevant information from different locations in the manuscript, it throws into question what having a dedicated data availability statement adds. A second limitation is that we do not assess the veracity of any data availability statements, which may introduce some misclassification bias into our categorization. For example, we do not check whether all relevant data can actually be found in the manuscript/ S1 File (Category 7) or the linked repository (Category 8), meaning our results provide a conservative estimate of the scale of the issue, as previous work has suggested that this is unlikely to be the case [ 12 ]. A further consideration is that for Categories 1 (“No data available”) and 2 (“Available on request”), there will be situations where making research data available is not feasible, for example, due to cost or concerns about patient re-identifiability [ 55 , 56 ]. This situation is perfectly reasonable, as long as statements are explicit in justifying the lack of open data.

4.3 Implications for policy

Data availability statements are an important tool in the fight to make studies more reproducible. However, without critical review of these statements in line with strict data-sharing policies, authors default to not sharing their data or making it “available on request”. Based on our analysis, there is a greater change towards describing open data between preprinted and published data availability statements in journals that mandate data sharing as a condition of publication. This would suggest that data sharing could be immediately improved by journals becoming more stringent in their data availability policies. Similarly, introduction of a related code availability section (or composite “material” availability section) will aid in reproducibility by capturing whether analytic code is available in a standardized manuscript section.

It would be unfair to expect all editors and reviewers to be able to effectively review the code and data provided with a submission. As proposed elsewhere [ 57 ], a possible solution is to assign an editor or reviewer whose sole responsibility in the review process is to examine the data and code provided. They would also be responsible for judging, when data and code are absent, whether the argument presented by the authors for not sharing these materials is valid.

However, while this study focuses primarily on the role of journals, some responsibility for enacting change rests with the research community at large. If researchers routinely shared their data, strict journal data-sharing policies would not be needed. As such, we would encourage authors to consider sharing the data underlying future publications, regardless of whether the journal actually mandates it.

5 Conclusion

Requiring that authors submit a data availability statement is a good first step, but is insufficient to ensure data availability, as our work shows that authors most commonly use them to state that data is only available on request. However, strict editorial policies that mandate data sharing (where appropriate) as a condition of publication appear to be effective in making research data available. In addition to the introduction of a dedicated code availability statement, a move towards mandated data sharing will help to ensure that future research is readily reproducible. We would strongly encourage all journal editors to examine whether their data availability policies are sufficiently stringent and consistently enforced.

Supporting information

Acknowledgments

We must acknowledge the input of several people, without whom the quality of this work would have been diminished: Matthew Grainger and Neal Haddaway for their insightful comments on the subject of data availability statements; Phil Gooch and Sarah Nevitt for their skill in identifying missing published papers based on the vaguest of descriptions; Antica Culina, Julian Higgins and Alfredo Sánchez-Tójar for their comments on the preprinted version of this article; and Ciara Gardiner, for proof-reading this manuscript.

Funding Statement

LAM is supported by a National Institute for Health Research (NIHR; https://www.nihr.ac.uk/ ) Doctoral Research Fellowship (DRF-2018-11-ST2-048). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed in this article are those of the authors and do not necessarily represent those of the NHS, the NIHR, the MRC, or the Department of Health and Social Care.

Data Availability


Decision Letter 0

PONE-D-20-29718

Dear Dr. McGuinness,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers suggested minor revisions. Please review them carefully.

Please submit your revised manuscript by Mar 20 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see:  http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Rafael Sarkis-Onofre

Academic Editor

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information .


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The study presents the results of original research.

2. Results reported have not been published elsewhere.

Yes, the authors state that the results were not published elsewhere.

3. Experiments, statistics, and other analyses are performed to a high technical standard and are described in sufficient detail.

Only a descriptive analysis was performed, but it is described in detail.

4. Conclusions are presented in an appropriate fashion and are supported by the data.

5. The article is presented in an intelligible fashion and is written in standard English.

6. The research meets all applicable standards for the ethics of experimentation and research integrity.

7. The article adheres to appropriate reporting guidelines and community standards for data availability.

General comments

The study idea is original and reinforces the movement of science toward greater transparency in research.

The description of the methods should be improved to facilitate understanding.

The presentation and description of the results, including the tables, could be improved for better understanding.

I would like to suggest replacing the terms “more open” and “more closed” with more suitable terms.

The results in the abstract must be revised. Some values differ from those shown in the results section. Also, there is a sentence that is not correct. Please revise.

Reviewer #2: This is a very interesting and well done study, and the rare paper for which I have almost no additional suggestions to make! The study is sound and contributes to the literature on data sharing and the effect of data availability requirements. My only suggestion would be that it might be interesting to discuss the number of papers that are NOT open despite being published in a journal that requires open data (22 out of 55 according to Table 2). Were these journals that just encouraged rather than required open data, or were papers published despite not following the policy? I would be very interested to know how almost half the papers in this category ended up not having open data.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: No

Reviewer #2:  Yes:  Lisa Federer

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool,  https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Author response to Decision Letter 0

26 Feb 2021

We thank the reviewers for their useful feedback, which has definitely improved the quality of our manuscript. Please see the "Response to Reviewers" document for a detailed response to each point raised.

Submitted filename: Response to Reviewers.docx

Decision Letter 1

22 Mar 2021

PONE-D-20-29718R1


ACADEMIC EDITOR:

Thank you for revising the manuscript. I have only two minor comments:

It is not clear how the data extraction related to the journals' data-sharing policies was performed. Please clarify this.

The conclusion should be aligned with the objectives and results. Please revise accordingly.

Please submit your revised manuscript by May 06 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Academic editor:

Author response to Decision Letter 1

13 Apr 2021

A detailed response to the editorial comments raised is contained in the "Response to Reviewers" document.

Submitted filename: Response_Round2.docx

Decision Letter 2

16 Apr 2021

PONE-D-20-29718R2

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/ , click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org .

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org .

Additional Editor Comments (optional):

All of my concerns were addressed.

Acceptance letter

Dear Dr. McGuinness:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org .

If we can help with anything else, please email us at plosone@plos.org .

Thank you for submitting your work to PLOS ONE and supporting open access.

PLOS ONE Editorial Office Staff

on behalf of

Dr. Rafael Sarkis-Onofre


  • Open access
  • Published: 26 March 2024

Predicting and improving complex beer flavor through machine learning

  • Michiel Schreurs   ORCID: orcid.org/0000-0002-9449-5619 1 , 2 , 3   na1 ,
  • Supinya Piampongsant 1 , 2 , 3   na1 ,
  • Miguel Roncoroni   ORCID: orcid.org/0000-0001-7461-1427 1 , 2 , 3   na1 ,
  • Lloyd Cool   ORCID: orcid.org/0000-0001-9936-3124 1 , 2 , 3 , 4 ,
  • Beatriz Herrera-Malaver   ORCID: orcid.org/0000-0002-5096-9974 1 , 2 , 3 ,
  • Christophe Vanderaa   ORCID: orcid.org/0000-0001-7443-5427 4 ,
  • Florian A. Theßeling 1 , 2 , 3 ,
  • Łukasz Kreft   ORCID: orcid.org/0000-0001-7620-4657 5 ,
  • Alexander Botzki   ORCID: orcid.org/0000-0001-6691-4233 5 ,
  • Philippe Malcorps 6 ,
  • Luk Daenen 6 ,
  • Tom Wenseleers   ORCID: orcid.org/0000-0002-1434-861X 4 &
  • Kevin J. Verstrepen   ORCID: orcid.org/0000-0002-3077-6219 1 , 2 , 3  

Nature Communications, volume 15, Article number: 2368 (2024)

  • Chemical engineering
  • Gas chromatography
  • Machine learning
  • Metabolomics
  • Taste receptors

The perception and appreciation of food flavor depends on many interacting chemical compounds and external factors, and therefore proves challenging to understand and predict. Here, we combine extensive chemical and sensory analyses of 250 different beers to train machine learning models that allow predicting flavor and consumer appreciation. For each beer, we measure over 200 chemical properties, perform quantitative descriptive sensory analysis with a trained tasting panel and map data from over 180,000 consumer reviews to train 10 different machine learning models. The best-performing algorithm, Gradient Boosting, yields models that significantly outperform predictions based on conventional statistics and accurately predict complex food features and consumer appreciation from chemical profiles. Model dissection allows identifying specific and unexpected compounds as drivers of beer flavor and appreciation. Adding these compounds results in variants of commercial alcoholic and non-alcoholic beers with improved consumer appreciation. Together, our study reveals how big data and machine learning uncover complex links between food chemistry, flavor and consumer perception, and lays the foundation to develop novel, tailored foods with superior flavors.


Introduction

Predicting and understanding food perception and appreciation is one of the major challenges in food science. Accurate modeling of food flavor and appreciation could yield important opportunities for both producers and consumers, including quality control, product fingerprinting, counterfeit detection, spoilage detection, and the development of new products and product combinations (food pairing) 1 , 2 , 3 , 4 , 5 , 6 . Accurate models for flavor and consumer appreciation would contribute greatly to our scientific understanding of how humans perceive and appreciate flavor. Moreover, accurate predictive models would also facilitate and standardize existing food assessment methods and could supplement or replace assessments by trained and consumer tasting panels, which are variable, expensive and time-consuming 7 , 8 , 9 . Lastly, apart from providing objective, quantitative, accurate and contextual information that can help producers, models can also guide consumers in understanding their personal preferences 10 .

Despite the myriad of applications, predicting food flavor and appreciation from its chemical properties remains a largely elusive goal in sensory science, especially for complex food and beverages 11 , 12 . A key obstacle is the immense number of flavor-active chemicals underlying food flavor. Flavor compounds can vary widely in chemical structure and concentration, making them technically challenging and labor-intensive to quantify, even in the face of innovations in metabolomics, such as non-targeted metabolic fingerprinting 13 , 14 . Moreover, sensory analysis is perhaps even more complicated. Flavor perception is highly complex, resulting from hundreds of different molecules interacting at the physiochemical and sensorial level. Sensory perception is often non-linear, characterized by complex and concentration-dependent synergistic and antagonistic effects 15 , 16 , 17 , 18 , 19 , 20 , 21 that are further convoluted by the genetics, environment, culture and psychology of consumers 22 , 23 , 24 . Perceived flavor is therefore difficult to measure, with problems of sensitivity, accuracy, and reproducibility that can only be resolved by gathering sufficiently large datasets 25 . Trained tasting panels are considered the prime source of quality sensory data, but require meticulous training, are low throughput and high cost. Public databases containing consumer reviews of food products could provide a valuable alternative, especially for studying appreciation scores, which do not require formal training 25 . Public databases offer the advantage of amassing large amounts of data, increasing the statistical power to identify potential drivers of appreciation. However, public datasets suffer from biases, including a bias in the volunteers that contribute to the database, as well as confounding factors such as price, cult status and psychological conformity towards previous ratings of the product.

Classical multivariate statistics and machine learning methods have been used to predict flavor of specific compounds by, for example, linking structural properties of a compound to its potential biological activities or linking concentrations of specific compounds to sensory profiles 1 , 26 . Importantly, most previous studies focused on predicting organoleptic properties of single compounds (often based on their chemical structure) 27 , 28 , 29 , 30 , 31 , 32 , 33 , thus ignoring the fact that these compounds are present in a complex matrix in food or beverages and excluding complex interactions between compounds. Moreover, the classical statistics commonly used in sensory science 34 , 35 , 36 , 37 , 38 , 39 require a large sample size and sufficient variance amongst predictors to create accurate models. They are not fit for studying an extensive set of hundreds of interacting flavor compounds, since they are sensitive to outliers, have a high tendency to overfit and are less suited for non-linear and discontinuous relationships 40 .

In this study, we combine extensive chemical analyses and sensory data of a set of different commercial beers with machine learning approaches to develop models that predict taste, smell, mouthfeel and appreciation from compound concentrations. Beer is particularly suited to model the relationship between chemistry, flavor and appreciation. First, beer is a complex product, consisting of thousands of flavor compounds that partake in complex sensory interactions 41 , 42 , 43 . This chemical diversity arises from the raw materials (malt, yeast, hops, water and spices) and biochemical conversions during the brewing process (kilning, mashing, boiling, fermentation, maturation and aging) 44 , 45 . Second, the advent of the internet saw beer consumers embrace online review platforms, such as RateBeer (ZX Ventures, Anheuser-Busch InBev SA/NV) and BeerAdvocate (Next Glass, inc.). In this way, the beer community provides massive data sets of beer flavor and appreciation scores, creating extraordinarily large sensory databases to complement the analyses of our professional sensory panel. Specifically, we characterize over 200 chemical properties of 250 commercial beers, spread across 22 beer styles, and link these to the descriptive sensory profiling data of a 16-person in-house trained tasting panel and data acquired from over 180,000 public consumer reviews. These unique and extensive datasets enable us to train a suite of machine learning models to predict flavor and appreciation from a beer’s chemical profile. Dissection of the best-performing models allows us to pinpoint specific compounds as potential drivers of beer flavor and appreciation. Follow-up experiments confirm the importance of these compounds and ultimately allow us to significantly improve the flavor and appreciation of selected commercial beers. Together, our study represents a significant step towards understanding complex flavors and reinforces the value of machine learning to develop and refine complex foods. In this way, it represents a stepping stone for further computer-aided food engineering applications 46 .

To generate a comprehensive dataset on beer flavor, we selected 250 commercial Belgian beers across 22 different beer styles (Supplementary Fig.  S1 ). Beers with ≤ 4.2% alcohol by volume (ABV) were classified as non-alcoholic and low-alcoholic. Blonds and Tripels constitute a significant portion of the dataset (12.4% and 11.2%, respectively) reflecting their presence on the Belgian beer market and the heterogeneity of beers within these styles. By contrast, lager beers are less diverse and dominated by a handful of brands. Rare styles such as Brut or Faro make up only a small fraction of the dataset (2% and 1%, respectively) because fewer of these beers are produced and because they are dominated by distinct characteristics in terms of flavor and chemical composition.

Extensive analysis identifies relationships between chemical compounds in beer

For each beer, we measured 226 different chemical properties, including common brewing parameters such as alcohol content, iso-alpha acids, pH, sugar concentration 47 , and over 200 flavor compounds (Methods, Supplementary Table  S1 ). A large portion (37.2%) are terpenoids arising from hopping, responsible for herbal and fruity flavors 16 , 48 . A second major category consists of yeast metabolites, such as esters and alcohols, that result in fruity and solvent notes 48 , 49 , 50 . Other measured compounds are primarily derived from malt, or from other microbes such as non- Saccharomyces yeasts and bacteria (‘wild flora’). Compounds that arise from spices or staling are labeled under ‘Others’. Five attributes (caloric value, total acids, total esters, hop aroma and sulfur compounds) are calculated from multiple individually measured compounds.

As a first step in identifying relationships between chemical properties, we determined correlations between the concentrations of the compounds (Fig.  1 , upper panel, Supplementary Data  1 and 2 , and Supplementary Fig.  S2 . For the sake of clarity, only a subset of the measured compounds is shown in Fig.  1 ). Compounds of the same origin typically show a positive correlation, while absence of correlation hints at parameters varying independently. For example, the hop aroma compounds citronellol, and alpha-terpineol show moderate correlations with each other (Spearman’s rho=0.39 and 0.57), but not with the bittering hop component iso-alpha acids (Spearman’s rho=0.16 and −0.07). This illustrates how brewers can independently modify hop aroma and bitterness by selecting hop varieties and dosage time. If hops are added early in the boiling phase, chemical conversions increase bitterness while aromas evaporate, conversely, late addition of hops preserves aroma but limits bitterness 51 . Similarly, hop-derived iso-alpha acids show a strong anti-correlation with lactic acid and acetic acid, likely reflecting growth inhibition of lactic acid and acetic acid bacteria, or the consequent use of fewer hops in sour beer styles, such as West Flanders ales and Fruit beers, that rely on these bacteria for their distinct flavors 52 . Finally, yeast-derived esters (ethyl acetate, ethyl decanoate, ethyl hexanoate, ethyl octanoate) and alcohols (ethanol, isoamyl alcohol, isobutanol, and glycerol), correlate with Spearman coefficients above 0.5, suggesting that these secondary metabolites are correlated with the yeast genetic background and/or fermentation parameters and may be difficult to influence individually, although the choice of yeast strain may offer some control 53 .

Figure 1. Spearman rank correlations are shown. Descriptors are grouped according to their origin (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)), and sensory aspect (aroma, taste, palate, and overall appreciation). Please note that for the chemical compounds, for the sake of clarity, only a subset of the total number of measured compounds is shown, with an emphasis on the key compounds for each source. For more details, see the main text and Methods section. Chemical data can be found in Supplementary Data  1 , correlations between all chemical compounds are depicted in Supplementary Fig.  S2 and correlation values can be found in Supplementary Data  2 . See Supplementary Data  4 for sensory panel assessments and Supplementary Data  5 for correlation values between all sensory descriptors.
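A minimal sketch of the correlation analysis described above, assuming the compound concentrations are held in a pandas data frame with one row per beer (the file and column names below are hypothetical placeholders):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per beer, one column per measured compound.
chem = pd.read_csv("beer_chemistry.csv", index_col="beer_id")

# Pairwise Spearman rank correlations between all compounds.
corr_matrix = chem.corr(method="spearman")

# Or a single pair of compounds, with its p-value.
rho, p = spearmanr(chem["citronellol"], chem["alpha_terpineol"])
print(f"rho = {rho:.2f}, p = {p:.3g}")
```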

Interestingly, different beer styles show distinct patterns for some flavor compounds (Supplementary Fig.  S3 ). These observations agree with expectations for key beer styles, and serve as a control for our measurements. For instance, Stouts generally show high values for color (darker), while hoppy beers contain elevated levels of iso-alpha acids, compounds associated with bitter hop taste. Acetic and lactic acid are not prevalent in most beers, with notable exceptions such as Kriek, Lambic, Faro, West Flanders ales and Flanders Old Brown, which use acid-producing bacteria ( Lactobacillus and Pediococcus ) or unconventional yeast ( Brettanomyces ) 54 , 55 . Glycerol, ethanol and esters show similar distributions across all beer styles, reflecting their common origin as products of yeast metabolism during fermentation 45 , 53 . Finally, low/no-alcohol beers contain low concentrations of glycerol and esters. This is in line with the production process for most of the low/no-alcohol beers in our dataset, which are produced through limiting fermentation or by stripping away alcohol via evaporation or dialysis, with both methods having the unintended side-effect of reducing the amount of flavor compounds in the final beer 56 , 57 .

Besides expected associations, our data also reveals less trivial associations between beer styles and specific parameters. For example, geraniol and citronellol, two monoterpenoids responsible for citrus, floral and rose flavors and characteristic of Citra hops, are found in relatively high amounts in Christmas, Saison, and Brett/co-fermented beers, where they may originate from terpenoid-rich spices such as coriander seeds instead of hops 58 .

Tasting panel assessments reveal sensorial relationships in beer

To assess the sensory profile of each beer, a trained tasting panel evaluated each of the 250 beers for 50 sensory attributes, including different hop, malt and yeast flavors, off-flavors and spices. Panelists used a tasting sheet (Supplementary Data  3 ) to score the different attributes. Panel consistency was evaluated by repeating 12 samples across different sessions and performing ANOVA. In 95% of cases no significant difference was found across sessions ( p  > 0.05), indicating good panel consistency (Supplementary Table  S2 ).
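Panel consistency checks of this kind can be run as a one-way ANOVA per attribute across the repeated sessions; a minimal sketch with SciPy, assuming a hypothetical long-format table of the repeated tastings, is shown below.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical long-format table of repeated samples:
# columns beer_id, session, attribute, score (one row per panelist score).
repeats = pd.read_csv("repeated_samples.csv")

def session_effect(df, beer, attribute):
    """One-way ANOVA of panel scores across sessions for one beer and one attribute."""
    subset = df[(df["beer_id"] == beer) & (df["attribute"] == attribute)]
    groups = [g["score"].values for _, g in subset.groupby("session")]
    return f_oneway(*groups)

stat, p = session_effect(repeats, beer="sample_01", attribute="bitter")
print(f"F = {stat:.2f}, p = {p:.3f}")  # p > 0.05 suggests consistent scoring across sessions
```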

Aroma and taste perception reported by the trained panel are often linked (Fig.  1 , bottom left panel and Supplementary Data  4 and 5 ), with high correlations between hops aroma and taste (Spearman’s rho=0.83). Bitter taste was found to correlate with hop aroma and taste in general (Spearman’s rho=0.80 and 0.69), and particularly with “grassy” noble hops (Spearman’s rho=0.75). Barnyard flavor, most often associated with sour beers, is identified together with stale hops (Spearman’s rho=0.97) that are used in these beers. Lactic and acetic acid, which often co-occur, are correlated (Spearman’s rho=0.66). Interestingly, sweetness and bitterness are anti-correlated (Spearman’s rho = −0.48), confirming the hypothesis that they mask each other 59 , 60 . Beer body is highly correlated with alcohol (Spearman’s rho = 0.79), and overall appreciation is found to correlate with multiple aspects that describe beer mouthfeel (alcohol, carbonation; Spearman’s rho= 0.32, 0.39), as well as with hop and ester aroma intensity (Spearman’s rho=0.39 and 0.35).

Similar to the chemical analyses, sensorial analyses confirmed typical features of specific beer styles (Supplementary Fig.  S4 ). For example, sour beers (Faro, Flanders Old Brown, Fruit beer, Kriek, Lambic, West Flanders ale) were rated acidic, with flavors of both acetic and lactic acid. Hoppy beers were found to be bitter and showed hop-associated aromas like citrus and tropical fruit. Malt taste is most detected among scotch, stout/porters, and strong ales, while low/no-alcohol beers, which often have a reputation for being ‘worty’ (reminiscent of unfermented, sweet malt extract) appear in the middle. Unsurprisingly, hop aromas are most strongly detected among hoppy beers. Like its chemical counterpart (Supplementary Fig.  S3 ), acidity shows a right-skewed distribution, with the most acidic beers being Krieks, Lambics, and West Flanders ales.

Tasting panel assessments of specific flavors correlate with chemical composition

We find that the concentrations of several chemical compounds strongly correlate with specific aroma or taste, as evaluated by the tasting panel (Fig.  2 , Supplementary Fig.  S5 , Supplementary Data  6 ). In some cases, these correlations confirm expectations and serve as a useful control for data quality. For example, iso-alpha acids, the bittering compounds in hops, strongly correlate with bitterness (Spearman’s rho=0.68), while ethanol and glycerol correlate with tasters’ perceptions of alcohol and body, the mouthfeel sensation of fullness (Spearman’s rho=0.82/0.62 and 0.72/0.57 respectively) and darker color from roasted malts is a good indication of malt perception (Spearman’s rho=0.54).

Figure 2. Heatmap colors indicate Spearman’s Rho. Axes are organized according to sensory categories (aroma, taste, mouthfeel, overall), chemical categories and chemical sources in beer (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)). See Supplementary Data  6 for all correlation values.

Interestingly, for some relationships between chemical compounds and perceived flavor, correlations are weaker than expected. For example, the rose-smelling phenethyl acetate only weakly correlates with floral aroma. This hints at more complex relationships and interactions between compounds and suggests a need for a more complex model than simple correlations. Lastly, we uncovered unexpected correlations. For instance, the esters ethyl decanoate and ethyl octanoate appear to correlate slightly with hop perception and bitterness, possibly due to their fruity flavor. Iron is anti-correlated with hop aromas and bitterness, most likely because it is also anti-correlated with iso-alpha acids. This could be a sign of metal chelation of hop acids 61 , given that our analyses measure unbound hop acids and total iron content, or could result from the higher iron content in dark and Fruit beers, which typically have less hoppy and bitter flavors 62 .

Public consumer reviews complement expert panel data

To complement and expand the sensory data of our trained tasting panel, we collected 180,000 reviews of our 250 beers from the online consumer review platform RateBeer. This provided numerical scores for beer appearance, aroma, taste, palate, overall quality as well as the average overall score.

Public datasets are known to suffer from biases, such as price, cult status and psychological conformity towards previous ratings of a product. For example, prices correlate with appreciation scores for these online consumer reviews (rho=0.49, Supplementary Fig.  S6 ), but not for our trained tasting panel (rho=0.19). This suggests that prices affect consumer appreciation, which has been reported in wine 63 , while blind tastings are unaffected. Moreover, we observe that some beer styles, like lagers and non-alcoholic beers, generally receive lower scores, reflecting that online reviewers are mostly beer aficionados with a preference for specialty beers over lager beers. In general, we find a modest correlation between our trained panel’s overall appreciation score and the online consumer appreciation scores (Fig.  3 , rho=0.29). Apart from the aforementioned biases in the online datasets, serving temperature, sample freshness and surroundings, which are all tightly controlled during the tasting panel sessions, can vary tremendously across online consumers and can further contribute to differences (in appreciation, among other aspects) between the two categories of tasters. Importantly, in contrast to the overall appreciation scores, for many sensory aspects the results from the professional panel correlated well with results obtained from RateBeer reviews. Correlations were highest for features that are relatively easy to recognize even for untrained tasters, like bitterness, sweetness, alcohol and malt aroma (Fig.  3 and below).

Figure 3. RateBeer text mining results can be found in Supplementary Data  7 . Rho values shown are Spearman correlation values, with asterisks indicating significant correlations ( p  < 0.05, two-sided). All p values were smaller than 0.001, except for Esters aroma (0.0553), Esters taste (0.3275), Esters aroma—banana (0.0019), Coriander (0.0508) and Diacetyl (0.0134).

Besides collecting consumer appreciation from these online reviews, we developed automated text analysis tools to gather additional data from review texts (Supplementary Data  7 ). Processing review texts on the RateBeer database yielded comparable results to the scores given by the trained panel for many common sensory aspects, including acidity, bitterness, sweetness, alcohol, malt, and hop tastes (Fig.  3 ). This is in line with what would be expected, since these attributes require less training for accurate assessment and are less influenced by environmental factors such as temperature, serving glass and odors in the environment. Consumer reviews also correlate well with our trained panel for 4-vinyl guaiacol, a compound associated with a very characteristic aroma. By contrast, correlations for more specific aromas like ester, coriander or diacetyl are underrepresented in the online reviews, underscoring the importance of using a trained tasting panel and standardized tasting sheets with explicit factors to be scored for evaluating specific aspects of a beer. Taken together, our results suggest that public reviews are trustworthy for some, but not all, flavor features and can complement or substitute taste panel data for these sensory aspects.
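The authors’ text-mining pipeline is not reproduced here, but a heavily simplified sketch of the general idea (counting attribute-related keywords per review and averaging per beer) might look like the following; the keyword lists, file name and column names are illustrative assumptions rather than the study’s actual vocabulary or data format.

```python
import pandas as pd

# Illustrative keyword lists; the real pipeline is considerably more sophisticated.
KEYWORDS = {
    "bitter": ["bitter", "bitterness"],
    "sweet": ["sweet", "sweetness", "sugary"],
    "malt": ["malt", "malty", "caramel"],
}

reviews = pd.read_csv("ratebeer_reviews.csv")  # hypothetical columns: beer_id, text

def keyword_score(text, words):
    """Count occurrences of any keyword in a lower-cased review text."""
    text = text.lower()
    return sum(text.count(w) for w in words)

for attribute, words in KEYWORDS.items():
    reviews[attribute] = reviews["text"].apply(lambda t: keyword_score(str(t), words))

# Average keyword frequency per beer, to be compared with trained-panel scores.
per_beer = reviews.groupby("beer_id")[list(KEYWORDS)].mean()
print(per_beer.head())
```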

Models can predict beer sensory profiles from chemical data

The rich datasets of chemical analyses, tasting panel assessments and public reviews gathered in the first part of this study provided us with a unique opportunity to develop predictive models that link chemical data to sensorial features. Given the complexity of beer flavor, basic statistical tools such as correlations or linear regression may not always be the most suitable for making accurate predictions. Instead, we applied different machine learning models that can model both simple linear and complex interactive relationships. Specifically, we constructed a set of regression models to predict (a) trained panel scores for beer flavor and quality and (b) public reviews’ appreciation scores from beer chemical profiles. We trained and tested 10 different models (Methods), 3 linear regression-based models (simple linear regression with first-order interactions (LR), lasso regression with first-order interactions (Lasso), partial least squares regressor (PLSR)), 5 decision tree models (AdaBoost regressor (ABR), extra trees (ET), gradient boosting regressor (GBR), random forest (RF) and XGBoost regressor (XGBR)), 1 support vector regression (SVR), and 1 artificial neural network (ANN) model.
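As a sketch of how such a model suite might be assembled in Python with scikit-learn and XGBoost (the authors’ exact implementation and hyperparameters are not specified here; the settings below are placeholders):

```python
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

# One entry per model family named in the text. For LR and Lasso, first-order
# interaction terms could be added with sklearn.preprocessing.PolynomialFeatures
# (interaction_only=True); that step is omitted here for brevity.
MODELS = {
    "LR": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "PLSR": PLSRegression(n_components=10),
    "ABR": AdaBoostRegressor(),
    "ET": ExtraTreesRegressor(),
    "GBR": GradientBoostingRegressor(),
    "RF": RandomForestRegressor(),
    "XGBR": XGBRegressor(),
    "SVR": SVR(),
    "ANN": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),
}
```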

To compare the performance of our machine learning models, the dataset was randomly split into a training and a test set, stratified by beer style. After a model was trained on the training set, its performance was evaluated by its ability to predict the test set, using the coefficient of determination (R²) of the multi-output models (see Methods). Additionally, individual-attribute models were ranked per descriptor and the average rank was calculated, as proposed by Korneva et al. 64 . Importantly, both ways of evaluating the models’ performance agreed in general. Performance of the different models varied (Table  1 ). It should be noted that all models perform better at predicting RateBeer results than results from our trained tasting panel. One reason could be that sensory data is inherently variable, and this variability is averaged out with the large number of public reviews from RateBeer. Additionally, all tree-based models perform better at predicting taste than aroma. Linear models (LR) performed particularly poorly, with negative R² values, due to severe overfitting (training set R² = 1). Overfitting is a common issue in linear models with many parameters and limited samples, especially when interaction terms further amplify the number of parameters. L1 regularization (Lasso) successfully overcomes this overfitting, out-competing multiple tree-based models on the RateBeer dataset. Similarly, the dimensionality reduction of PLSR avoids overfitting and improves performance to some extent. Still, tree-based models (ABR, ET, GBR, RF and XGBR) show the best performance, out-competing the linear models (LR, Lasso, PLSR) commonly used in sensory science 65 .
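
A minimal sketch of the average-rank comparison, assuming a (hypothetical) table of per-descriptor test-set R² values for each model:

```python
import pandas as pd

# Hypothetical per-descriptor test-set R2 values (rows = sensory descriptors,
# columns = models); in the study these come from the individual-attribute models.
r2_per_descriptor = pd.DataFrame({
    "GBR":   {"bitterness": 0.62, "sweetness": 0.55, "appreciation": 0.67},
    "RF":    {"bitterness": 0.60, "sweetness": 0.52, "appreciation": 0.63},
    "Lasso": {"bitterness": 0.48, "sweetness": 0.41, "appreciation": 0.45},
})

# Rank models per descriptor (1 = best R2) and average the ranks across
# descriptors, following the multi-target evaluation of Korneva et al.
ranks = r2_per_descriptor.rank(axis=1, ascending=False)
average_rank = ranks.mean(axis=0).sort_values()
print(average_rank)
```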

GBR models showed the best overall performance in predicting sensory responses from chemical information, with R² values up to 0.75 depending on the predicted sensory feature (Supplementary Table  S4 ). The GBR models predict consumer appreciation (RateBeer) better than our trained panel’s appreciation (R² value of 0.67 compared to R² value of 0.09) (Supplementary Table  S3 and Supplementary Table  S4 ). ANN models showed intermediate performance, likely because neural networks typically perform best with larger datasets 66 . The SVR shows intermediate performance, mostly due to the weak predictions of specific attributes that lower the overall performance (Supplementary Table  S4 ).

Model dissection identifies specific, unexpected compounds as drivers of consumer appreciation

Next, we leveraged our models to infer important contributors to sensory perception and consumer appreciation. Consumer preference is a crucial sensory aspect, because a product that shows low consumer appreciation scores often does not succeed commercially 25 . Additionally, the requirement for a large number of representative evaluators makes consumer trials one of the more costly and time-consuming aspects of product development. Hence, a model for predicting chemical drivers of overall appreciation would be a welcome addition to the available toolbox for food development and optimization.

Since GBR models on our RateBeer dataset showed the best overall performance, we focused on these models. Specifically, we used two approaches to identify important contributors. First, rankings of the most important predictors for each sensorial trait in the GBR models were obtained based on impurity-based feature importance (mean decrease in impurity). High-ranked parameters were hypothesized to be either the true causal chemical properties underlying the trait, to correlate with the actual causal properties, or to take part in sensory interactions affecting the trait 67 (Fig.  4A ). In a second approach, we used SHAP 68 to determine which parameters contributed most to the model for making predictions of consumer appreciation (Fig.  4B ). SHAP calculates parameter contributions to model predictions on a per-sample basis, which can be aggregated into an importance score.

Figure 4. A The impurity-based feature importance (mean decrease in impurity, MDI) calculated from the Gradient Boosting Regression (GBR) model predicting RateBeer appreciation scores. The top 15 highest-ranked chemical properties are shown. B SHAP summary plot for the top 15 parameters contributing to our GBR model. Each point on the graph represents a sample from our dataset. The color represents the concentration of that parameter, with bluer colors representing low values and redder colors representing higher values. Greater absolute values on the horizontal axis indicate a higher impact of the parameter on the prediction of the model. C Spearman correlations between the 15 most important chemical properties and consumer overall appreciation. Numbers indicate the Spearman rho correlation coefficient and the rank of this correlation compared to all other correlations. The top 15 important compounds were determined using SHAP (panel B).

Both approaches identified ethyl acetate as the most predictive parameter for beer appreciation (Fig.  4 ). Ethyl acetate is the most abundant ester in beer with a typical ‘fruity’, ‘solvent’ and ‘alcoholic’ flavor, but is often considered less important than other esters like isoamyl acetate. The second most important parameter identified by SHAP is ethanol, the most abundant beer compound after water. Apart from directly contributing to beer flavor and mouthfeel, ethanol drastically influences the physical properties of beer, dictating how easily volatile compounds escape the beer matrix to contribute to beer aroma 69 . Importantly, it should also be noted that the importance of ethanol for appreciation is likely inflated by the very low appreciation scores of non-alcoholic beers (Supplementary Fig.  S4 ). Despite not often being considered a driver of beer appreciation, protein level also ranks highly in both approaches, possibly due to its effect on mouthfeel and body 70 . Lactic acid, which contributes to the tart taste of sour beers, is the fourth most important parameter identified by SHAP, possibly due to the generally high appreciation of sour beers in our dataset.

Interestingly, some of the most important predictive parameters for our model are not well-established as beer flavors or are even commonly regarded as being negative for beer quality. For example, our models identify methanethiol and ethyl phenyl acetate, an ester commonly linked to beer staling 71 , as key factors contributing to beer appreciation. Although there is no doubt that high concentrations of these compounds are considered unpleasant, the effects of modest concentrations are not yet well characterized 72 , 73 .

To compare our approach to conventional statistics, we evaluated how well the 15 most important SHAP-derived parameters correlate with consumer appreciation (Fig.  4C ). Interestingly, only 6 of the properties derived by SHAP rank amongst the top 15 most correlated parameters. For some chemical compounds, the correlations are so low that they would likely have been considered unimportant. For example, lactic acid, the fourth most important parameter, shows a bimodal distribution for appreciation, with sour beers forming a separate cluster that is missed entirely by the Spearman correlation. Additionally, the correlation plots reveal outliers, emphasizing the need for robust analysis tools. Together, this highlights the need for alternative models, like the Gradient Boosting model, that better grasp the complexity of (beer) flavor.

Finally, to observe the relationships between these chemical properties and their predicted targets, partial dependence plots were constructed for the six most important predictors of consumer appreciation 74 , 75 , 76 (Supplementary Fig.  S7 ). One-way partial dependence plots show how a change in concentration affects the predicted appreciation. These plots reveal an important limitation of our models: appreciation predictions remain constant at ever-increasing concentrations. This implies that once a threshold concentration is reached, further increasing the concentration does not affect appreciation. This is false, as it is well-documented that certain compounds become unpleasant at high concentrations, including ethyl acetate (‘nail polish’) 77 and methanethiol (‘sulfury’ and ‘rotten cabbage’) 78 . The inability of our models to grasp that flavor compounds have optimal levels, above which they become negative, is a consequence of working with commercial beer brands where (off-)flavors are rarely too high to negatively impact the product. The two-way partial dependence plots show how changing the concentration of two compounds influences predicted appreciation, visualizing their interactions (Supplementary Fig.  S7 ). In our case, the top 5 parameters are dominated by additive or synergistic interactions, with high concentrations for both compounds resulting in the highest predicted appreciation.

To assess the robustness of our best-performing models and model predictions, we performed 100 iterations of the GBR, RF and ET models. In general, all iterations of the models yielded similar performance (Supplementary Fig.  S8 ). Moreover, the main predictors (including the top predictors ethanol and ethyl acetate) remained virtually the same, especially for GBR and RF. For the iterations of the ET model, we did observe more variation in the top predictors, which is likely a consequence of the model’s inherent random architecture in combination with co-correlations between certain predictors. However, even in this case, several of the top predictors (ethanol and ethyl acetate) remain unchanged, although their rank in importance changes (Supplementary Fig.  S8 ).
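
One way to run such a robustness check is sketched below, assuming a chemical feature DataFrame X and a single sensory target y (variable names are illustrative); the model is refit on repeated random re-splits of the data and the stability of the top predictors is tracked. The exact iteration scheme used in the study may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

top_per_iteration = []
for seed in range(100):
    # new random train/test split and model fit in each iteration
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    ranking = pd.Series(model.feature_importances_, index=X.columns)
    top_per_iteration.append(ranking.nlargest(5).index.tolist())

# frequency with which each chemical parameter appears among the top five predictors
stability = pd.Series([f for top in top_per_iteration for f in top]).value_counts()
print(stability.head(10))
```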

Next, we investigated whether combining the RateBeer and trained panel data into one consolidated dataset would lead to stronger models, under the hypothesis that such a model would suffer less from bias in the individual datasets. A GBR model was trained to predict appreciation on the combined dataset. This model underperformed compared to the RateBeer-only model, both without and with a dataset identifier (R² = 0.67 for the RateBeer model versus 0.26 and 0.42, respectively). For the latter, the dataset identifier is the most important feature (Supplementary Fig.  S9 ), while most of the feature importances remain unchanged, with ethyl acetate and ethanol ranking highest, as in the original model trained only on RateBeer data. It seems that the large variation in the panel dataset introduces noise, weakening the models’ performance and reliability. In addition, it seems reasonable to assume that the two datasets are fundamentally different, as the panel dataset was obtained from blind tastings by a trained professional panel.

Lastly, we evaluated whether beer style identifiers would further enhance the model’s performance. A GBR model was trained with parameters that explicitly encoded the styles of the samples. This did not improve model performance (R² = 0.66 with style information vs. R² = 0.67 without). The most important chemical features are consistent with the model trained without style information (e.g. ethanol and ethyl acetate), and, with the exception of the most preferred (strong ale) and least preferred (low/no-alcohol) styles, none of the styles were among the most important features (Supplementary Fig.  S9 , Supplementary Tables  S5 and S6 ). This is likely due to a combination of style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original models, as well as the low number of samples belonging to some styles, which makes it difficult for the model to learn style-specific patterns. Moreover, beer styles are not rigorously defined, with some styles overlapping in features and some beers being misattributed to a specific style, all of which leads to more noise in models that use style parameters.

Model validation

To test if our predictive models give insight into beer appreciation, we set up experiments aimed at improving existing commercial beers. We specifically selected overall appreciation as the trait to be examined because of its complexity and commercial relevance. Beer flavor comprises a complex bouquet rather than single aromas and tastes 53 . Hence, adding a single compound to the extent that a difference is noticeable may lead to an unbalanced, artificial flavor. Therefore, we evaluated the effect of combinations of compounds. Because Blond beers represent the largest style group in our dataset, we selected a beer from this style as the starting material for these experiments (Beer 64 in Supplementary Data  1 ).

In the first set of experiments, we adjusted the concentrations of compounds that made up the most important predictors of overall appreciation (ethyl acetate, ethanol, lactic acid, ethyl phenyl acetate) together with correlated compounds (ethyl hexanoate, isoamyl acetate, glycerol), bringing them up to 95th-percentile ethanol-normalized concentrations (Methods) within the Blond group (‘Spiked’ concentration in Fig.  5A ). Compared to controls, the spiked beers were found to have significantly improved overall appreciation among trained panelists, with panelists noting increased intensity of ester flavors, sweetness, alcohol, and body fullness (Fig.  5B ). To disentangle the contribution of ethanol to these results, a second experiment was performed without the addition of ethanol. This resulted in a similar outcome, including increased perception of alcohol and overall appreciation.

Figure 5. Adding the top chemical compounds, identified as the best predictors of appreciation by our model, into poorly appreciated beers results in increased appreciation from our trained panel. Results of sensory tests between base beers and those spiked with compounds identified as the best predictors by the model. A Blond and Non/Low-alcohol (0.0% ABV) base beers were brought up to 95th-percentile ethanol-normalized concentrations within each style. B For each sensory attribute, tasters indicated the more intense sample and selected the sample they preferred. The numbers above the bars correspond to the p values that indicate significant changes in perceived flavor (two-sided binomial test: alpha 0.05, n = 20 or 13).

In a last experiment, we tested whether using the model’s predictions can boost the appreciation of a non-alcoholic beer (beer 223 in Supplementary Data  1 ). Again, the addition of a mixture of predicted compounds (omitting ethanol, in this case) resulted in a significant increase in appreciation, body, ester flavor and sweetness.

Predicting flavor and consumer appreciation from chemical composition is one of the ultimate goals of sensory science. A reliable, systematic and unbiased way to link chemical profiles to flavor and food appreciation would be a significant asset to the food and beverage industry. Such tools would substantially aid in quality control and recipe development, offer an efficient and cost-effective alternative to pilot studies and consumer trials, and would ultimately allow food manufacturers to produce superior, tailor-made products that better meet the demands of specific consumer groups.

A limited number of studies have previously tried, with varying degrees of success, to predict beer flavor and beer popularity based on (a limited set of) chemical compounds and flavors 79 , 80 . Current sensitive, high-throughput technologies allow measuring an unprecedented number of chemical compounds and properties in a large set of samples, yielding a dataset that can train models that help close the gap between chemistry and flavor, even for a complex natural product like beer. To our knowledge, no previous research gathered data at this scale (250 samples, 226 chemical parameters, 50 sensory attributes and 5 consumer scores) to disentangle and validate the chemical aspects driving beer preference using various machine-learning techniques. We find that modern machine learning models outperform conventional statistical tools, such as correlations and linear models, and can successfully predict flavor appreciation from chemical composition. This could be attributed to the natural incorporation of interactions and non-linear or discontinuous effects in machine learning models, which are not easily grasped by linear model architectures. While linear models and partial least squares regression represent the most widespread statistical approaches in sensory science, in part because they allow interpretation 65 , 81 , 82 , modern machine learning methods allow for building better predictive models while preserving the possibility to dissect and exploit the underlying patterns. Of the 10 different models we trained, tree-based models, such as our best-performing GBR, showed the best overall performance in predicting sensory responses from chemical information, outcompeting artificial neural networks. This agrees with previous reports for models trained on tabular data 83 . Our results are in line with the findings of Colantonio et al., who also identified the gradient boosting architecture as performing best at predicting appreciation and flavor (of tomatoes and blueberries, in their specific study) 26 . Importantly, besides our larger experimental scale, we were able to directly confirm our models’ predictions in vivo.

Our study confirms that flavor compound concentration does not always correlate with perception, suggesting complex interactions that are often missed by more conventional statistics and simple models. Specifically, we find that tree-based algorithms may perform best in developing models that link complex food chemistry with aroma. Furthermore, we show that massive datasets of untrained consumer reviews provide a valuable source of data that can complement or even replace trained tasting panels, especially for appreciation and basic flavors, such as sweetness and bitterness. This holds despite biases that are known to occur in such datasets, such as price or conformity bias. Moreover, GBR models predict taste better than aroma. This is likely because taste (e.g. bitterness) often directly relates to the corresponding chemical measurements (e.g. iso-alpha acids), whereas such a link is less clear for aromas, which often result from the interplay between multiple volatile compounds. We also find that our models are best at predicting acidity and alcohol, likely because there is a direct relation between the measured chemical compounds (acids and ethanol) and the corresponding perceived sensorial attributes (acidity and alcohol), and because even untrained consumers are generally able to recognize these flavors and aromas.

The predictions of our final models, trained on review data, hold even for blind tastings with small groups of trained tasters, as demonstrated by our ability to validate specific compounds as drivers of beer flavor and appreciation. Since adding a single compound to the extent of a noticeable difference may result in an unbalanced flavor profile, we specifically tested our identified key drivers as a combination of compounds. While this approach does not allow us to validate if a particular single compound would affect flavor and/or appreciation, our experiments do show that this combination of compounds increases consumer appreciation.

It is important to stress that, while it represents an important step forward, our approach still has several major limitations. A key weakness of the GBR model architecture is that, amongst co-correlating variables, the one with the largest main effect is consistently preferred for model building. As a result, co-correlating variables often have artificially low importance scores, both for impurity- and SHAP-based methods, as we observed in the comparison to the more randomized Extra Trees models. This implies that chemicals identified as key drivers of a specific sensory feature by GBR might not be the true causative compounds, but rather co-correlate with the actual causative chemical. For example, the high importance of ethyl acetate could be (partially) attributed to the total ester content, ethanol or ethyl hexanoate (rho=0.77, rho=0.72 and rho=0.68), while ethyl phenyl acetate could hide the importance of prenyl isobutyrate and ethyl benzoate (rho=0.77 and rho=0.76). Expanding our GBR model to include beer style as a parameter did not yield additional power or insight. This is likely due to style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original model, as well as the smaller sample size per style, limiting the power to uncover style-specific patterns. This can be partly attributed to the curse of dimensionality, where the high number of parameters results in the models mainly incorporating single-parameter effects rather than complex interactions such as style-dependent effects 67 . A larger number of samples may overcome some of these limitations and offer more insight into style-specific effects. On the other hand, beer style is not a rigid scientific classification, and beers within one style often differ substantially, which further complicates the analysis of style as a model factor.

Our study is limited to beers from Belgian breweries. Although these beers cover a large portion of the beer styles available globally, some beer styles and consumer patterns may be missing, while other features might be overrepresented. For example, many Belgian ales exhibit yeast-driven flavor profiles, which is reflected in the chemical drivers of appreciation discovered by this study. In future work, expanding the scope to include diverse markets and beer styles could lead to the identification of even more drivers of appreciation and better models for special niche products that were not present in our beer set.

In addition to the inherent limitations of GBR models, there are also some limitations associated with studying food aroma. Even though our chemical analyses measured most of the known aroma compounds, the total number of flavor compounds in complex foods like beer is still larger than the subset we were able to measure in this study. For example, hop-derived thiols, which influence flavor at very low concentrations, are notoriously difficult to measure in a high-throughput experiment. Moreover, consumer perception remains subjective and prone to biases that are difficult to avoid. It is also important to stress that the models are still immature and that more extensive datasets will be crucial for developing more complete models in the future. Besides more samples and parameters, our dataset does not include any demographic information about the tasters. Including such data could lead to better models that grasp external factors like age and culture. Another limitation is that our set of beers consists of high-quality end-products and lacks beers that are unfit for sale, which limits the current model in accurately predicting products that are appreciated very badly. Finally, while the models could readily be applied in quality control, their use in sensory science and product development is restrained by their inability to discern causal relationships. Given that the models cannot distinguish compounds that genuinely drive consumer perception from those that merely correlate, validation experiments are essential to identify true causative compounds.

Despite the inherent limitations, dissection of our models enabled us to pinpoint specific molecules as potential drivers of beer aroma and consumer appreciation, including compounds that were unexpected and would not have been identified using standard approaches. Important drivers of beer appreciation uncovered by our models include protein levels, ethyl acetate, ethyl phenyl acetate and lactic acid. Currently, many brewers already use lactic acid to acidify their brewing water and ensure optimal pH for enzymatic activity during the mashing process. Our results suggest that adding lactic acid can also improve beer appreciation, although its individual effect remains to be tested. Interestingly, ethanol appears to be unnecessary to improve beer appreciation, both for blond beer and alcohol-free beer. Given the growing consumer interest in alcohol-free beer, with a predicted annual market growth of >7% 84 , it is relevant for brewers to know what compounds can further increase consumer appreciation of these beers. Hence, our model may readily provide avenues to further improve the flavor and consumer appreciation of both alcoholic and non-alcoholic beers, which is generally considered one of the key challenges for future beer production.

Whereas we see a direct implementation of our results for the development of superior alcohol-free beverages and other food products, our study can also serve as a stepping stone for the development of novel alcohol-containing beverages. We want to echo the growing body of scientific evidence for the negative effects of alcohol consumption, both on the individual level by the mutagenic, teratogenic and carcinogenic effects of ethanol 85 , 86 , as well as the burden on society caused by alcohol abuse and addiction. We encourage the use of our results for the production of healthier, tastier products, including novel and improved beverages with lower alcohol contents. Furthermore, we strongly discourage the use of these technologies to improve the appreciation or addictive properties of harmful substances.

The present work demonstrates that despite some important remaining hurdles, combining the latest developments in chemical analyses, sensory analysis and modern machine learning methods offers exciting avenues for food chemistry and engineering. Soon, these tools may provide solutions in quality control and recipe development, as well as new approaches to sensory science and flavor research.

Beer selection

250 commercial Belgian beers were selected to cover the broad diversity of beer styles and corresponding diversity in chemical composition and aroma. See Supplementary Fig.  S1 .

Chemical dataset

Sample preparation.

Beers within their expiration date were purchased from commercial retailers. Samples were prepared in biological duplicates at room temperature, unless explicitly stated otherwise. Bottle pressure was measured with a manual pressure device (Steinfurth Mess-Systeme GmbH) and used to calculate CO 2 concentration. The beer was poured through two filter papers (Macherey-Nagel, 500713032 MN 713 ¼) to remove carbon dioxide and prevent spontaneous foaming. Samples were then prepared for measurements by targeted Headspace-Gas Chromatography-Flame Ionization Detector/Flame Photometric Detector (HS-GC-FID/FPD), Headspace-Solid Phase Microextraction-Gas Chromatography-Mass Spectrometry (HS-SPME-GC-MS), colorimetric analysis, enzymatic analysis, Near-Infrared (NIR) analysis, as described in the sections below. The mean values of biological duplicates are reported for each compound.

HS-GC-FID/FPD

HS-GC-FID/FPD (Shimadzu GC 2010 Plus) was used to measure higher alcohols, acetaldehyde, esters, 4-vinyl guaiacol, and sulfur compounds. Each measurement comprised 5 ml of sample pipetted into a 20 ml glass vial containing 1.75 g NaCl (VWR, 27810.295). 100 µl of 2-heptanol (Sigma-Aldrich, H3003) (internal standard) solution in ethanol (Fisher Chemical, E/0650DF/C17) was added for a final concentration of 2.44 mg/L. Samples were flushed with nitrogen for 10 s, sealed with a silicone septum, stored at −80 °C and analyzed in batches of 20.

The GC was equipped with a DB-WAXetr column (length, 30 m; internal diameter, 0.32 mm; layer thickness, 0.50 µm; Agilent Technologies, Santa Clara, CA, USA) to the FID and an HP-5 column (length, 30 m; internal diameter, 0.25 mm; layer thickness, 0.25 µm; Agilent Technologies, Santa Clara, CA, USA) to the FPD. N 2 was used as the carrier gas. Samples were incubated for 20 min at 70 °C in the headspace autosampler (flow rate, 35 cm/s; injection volume, 1000 µL; injection mode, split; Combi PAL autosampler, CTC Analytics, Switzerland). The injector, FID and FPD temperatures were kept at 250 °C. The GC oven temperature was first held at 50 °C for 5 min, then allowed to rise to 80 °C at a rate of 5 °C/min, followed by a second ramp of 4 °C/min until 200 °C (held for 3 min) and a final ramp of 4 °C/min until 230 °C (held for 1 min). Results were analyzed with the GCSolution software version 2.4 (Shimadzu, Kyoto, Japan). The GC was calibrated with a 5% EtOH solution (VWR International) containing the volatiles under study (Supplementary Table  S7 ).

HS-SPME-GC-MS

HS-SPME-GC-MS (Shimadzu GCMS-QP-2010 Ultra) was used to measure additional volatile compounds, mainly comprising terpenoids and esters. Samples were analyzed by HS-SPME using a triphase DVB/Carboxen/PDMS 50/30 μm SPME fiber (Supelco Co., Bellefonte, PA, USA) followed by gas chromatography (Thermo Fisher Scientific Trace 1300 series, USA) coupled to a mass spectrometer (Thermo Fisher Scientific ISQ series MS) equipped with a TriPlus RSH autosampler. 5 ml of degassed beer sample was placed in 20 ml vials containing 1.75 g NaCl (VWR, 27810.295). 5 µl internal standard mix was added, containing 2-heptanol (1 g/L) (Sigma-Aldrich, H3003), 4-fluorobenzaldehyde (1 g/L) (Sigma-Aldrich, 128376), 2,3-hexanedione (1 g/L) (Sigma-Aldrich, 144169) and guaiacol (1 g/L) (Sigma-Aldrich, W253200) in ethanol (Fisher Chemical, E/0650DF/C17). Each sample was incubated at 60 °C in the autosampler oven with constant agitation. After 5 min equilibration, the SPME fiber was exposed to the sample headspace for 30 min. The compounds trapped on the fiber were thermally desorbed in the injection port of the chromatograph by heating the fiber for 15 min at 270 °C.

The GC-MS was equipped with a low-polarity RXi-5Sil MS column (length, 20 m; internal diameter, 0.18 mm; layer thickness, 0.18 µm; Restek, Bellefonte, PA, USA). Injection was performed in splitless mode at 320 °C, with a split flow of 9 ml/min, a purge flow of 5 ml/min and an open valve time of 3 min. To obtain a pulsed injection, a programmed gas flow was used whereby the helium gas flow was set at 2.7 mL/min for 0.1 min, followed by a decrease in flow of 20 ml/min to the normal 0.9 mL/min. The temperature was first held at 30 °C for 3 min, then allowed to rise to 80 °C at a rate of 7 °C/min, followed by a second ramp of 2 °C/min until 125 °C and a final ramp of 8 °C/min to a final temperature of 270 °C.

Mass acquisition range was 33 to 550 amu at a scan rate of 5 scans/s. Electron impact ionization energy was 70 eV. The interface and ion source were kept at 275 °C and 250 °C, respectively. A mix of linear n-alkanes (from C7 to C40, Supelco Co.) was injected into the GC-MS under identical conditions to serve as external retention index markers. Identification and quantification of the compounds were performed using an in-house developed R script as described in Goelen et al. and Reher et al. 87 , 88 (for package information, see Supplementary Table  S8 ). Briefly, chromatograms were analyzed using AMDIS (v2.71) 89 to separate overlapping peaks and obtain pure compound spectra. The NIST MS Search software (v2.0 g) in combination with the NIST2017, FFNSC3 and Adams4 libraries were used to manually identify the empirical spectra, taking into account the expected retention time. After background subtraction and correcting for retention time shifts between samples run on different days based on alkane ladders, compound elution profiles were extracted and integrated using a file with 284 target compounds of interest, which were either recovered in our identified AMDIS list of spectra or were known to occur in beer. Compound elution profiles were estimated for every peak in every chromatogram over a time-restricted window using weighted non-negative least square analysis after which peak areas were integrated 87 , 88 . Batch effect correction was performed by normalizing against the most stable internal standard compound, 4-fluorobenzaldehyde. Out of all 284 target compounds that were analyzed, 167 were visually judged to have reliable elution profiles and were used for final analysis.
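
The batch-effect correction step described above can be illustrated with a minimal sketch, assuming a table of integrated peak areas per sample (the file name, column label and the per-sample division used here are illustrative):

```python
import pandas as pd

# Samples x compounds table of integrated peak areas; file and column names
# are illustrative.
areas = pd.read_csv("gcms_peak_areas.csv", index_col=0)

# Normalize each sample's peak areas against the internal standard
# 4-fluorobenzaldehyde to correct for batch effects.
normalized = areas.div(areas["4-fluorobenzaldehyde"], axis=0)
```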

Discrete photometric and enzymatic analysis

Discrete photometric and enzymatic analysis (Thermo Scientific TM Gallery TM Plus Beermaster Discrete Analyzer) was used to measure acetic acid, ammonia, beta-glucan, iso-alpha acids, color, sugars, glycerol, iron, pH, protein, and sulfite. 2 ml of sample volume was used for the analyses. Information regarding the reagents and standard solutions used for analyses and calibrations is included in Supplementary Table  S7 and Supplementary Table  S9 .

NIR analyses

NIR analysis (Anton Paar Alcolyzer Beer ME System) was used to measure ethanol. Measurements comprised 50 ml of sample, and a 10% EtOH solution was used for calibration.

Correlation calculations

Pairwise Spearman Rank correlations were calculated between all chemical properties.
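
A minimal sketch of this calculation, assuming the chemical dataset is available as a samples × compounds table (the file name is illustrative):

```python
import pandas as pd

# Samples x chemical properties table; the file name is illustrative.
chemistry = pd.read_csv("chemical_dataset.csv", index_col=0)

# Pairwise Spearman rank correlation matrix between all chemical properties.
spearman_matrix = chemistry.corr(method="spearman")
```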

Sensory dataset

Trained panel.

Our trained tasting panel consisted of volunteers who gave prior verbal informed consent. All compounds used for the validation experiment were of food-grade quality. The tasting sessions were approved by the Social and Societal Ethics Committee of the KU Leuven (G-2022-5677-R2(MAR)). All online reviewers agreed to the Terms and Conditions of the RateBeer website.

Sensory analysis was performed according to the American Society of Brewing Chemists (ASBC) Sensory Analysis Methods 90 . 30 volunteers were screened through a series of triangle tests. The sixteen most sensitive and consistent tasters were retained as taste panel members. The resulting panel was diverse in age [22–42, mean: 29], sex [56% male] and nationality [7 different countries]. The panel developed a consensus vocabulary to describe beer aroma, taste and mouthfeel. Panelists were trained to identify and score 50 different attributes, using a 7-point scale to rate attributes’ intensity. The scoring sheet is included as Supplementary Data  3 . Sensory assessments took place between 10 a.m. and 12 p.m. The beers were served in black-colored glasses. Per session, between 5 and 12 beers of the same style were tasted at 12 °C to 16 °C. Two reference beers were added to each set and indicated as ‘Reference 1 & 2’, allowing panel members to calibrate their ratings. Not all panelists were present at every tasting. Scores were scaled by standard deviation and mean-centered per taster. Values are represented as z-scores and clustered by Euclidean distance. Pairwise Spearman correlations were calculated between taste and aroma sensory attributes. Panel consistency was evaluated by repeating samples on different sessions and performing ANOVA to identify differences, using the ‘stats’ package (v4.2.2) in R (for package information, see Supplementary Table  S8 ).
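
The panel analysis above was performed in R; the sketch below shows the same per-taster standardization in Python, with illustrative file and column names:

```python
import pandas as pd

# Long-format panel data with one row per (taster, beer, attribute) score;
# the file and column names are illustrative.
panel = pd.read_csv("panel_scores.csv")

# Mean-center and scale each taster's scores so that ratings become z-scores.
panel["z_score"] = panel.groupby("taster")["score"].transform(
    lambda s: (s - s.mean()) / s.std())

# Average over panelists to obtain one value per beer and attribute.
beer_profiles = panel.groupby(["beer", "attribute"])["z_score"].mean().unstack()
```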

Online reviews from a public database

The ‘scrapy’ package in Python (v3.6) was used to collect 232,288 online reviews (mean=922, min=6, max=5343) from RateBeer, an online beer review database (for package information, see Supplementary Table  S8 ). Each review entry comprised 5 numerical scores (appearance, aroma, taste, palate and overall quality) and an optional review text. The total number of reviews per reviewer was collected separately. Numerical scores were scaled and centered per rater, and mean scores were calculated per beer.

For the review texts, the language was estimated using the packages ‘langdetect’ and ‘langid’ in Python. Reviews that were classified as English by both packages were kept. Reviewers with fewer than 100 entries overall were discarded. 181,025 reviews from >6000 reviewers from >40 countries remained. Text processing was done using the ‘nltk’ package in Python. Texts were corrected for slang and misspellings; proper nouns and rare words that are relevant to the beer context were specified and kept as-is (‘Chimay’, ‘Lambic’, etc.). A dictionary of semantically similar sensorial terms, for example ‘floral’ and ‘flower’, was created, and such terms were collapsed into a single term. Words were stemmed and lemmatized to avoid identifying words such as ‘acid’ and ‘acidity’ as separate terms. Numbers and punctuation were removed.
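
A minimal sketch of the two-package language filter, assuming `reviews` is a list of scraped review dictionaries (the variable and field names are illustrative):

```python
from langdetect import detect
import langid

def is_english(text: str) -> bool:
    # Keep a review only if both langdetect and langid classify it as English.
    try:
        first_vote = detect(text) == "en"
    except Exception:  # langdetect raises on empty or undecidable text
        return False
    return first_vote and langid.classify(text)[0] == "en"

english_reviews = [r for r in reviews if r.get("text") and is_english(r["text"])]
```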

Sentences from up to 50 randomly chosen reviews per beer were manually categorized according to the aspect of beer they describe (appearance, aroma, taste, palate, overall quality—not to be confused with the 5 numerical scores described above) or flagged as irrelevant if they contained no useful information. If a beer contained fewer than 50 reviews, all reviews were manually classified. This labeled data set was used to train a model that classified the rest of the sentences for all beers 91 . Sentences describing taste and aroma were extracted, and term frequency–inverse document frequency (TFIDF) was implemented to calculate enrichment scores for sensorial words per beer.
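
A minimal sketch of the TF-IDF step, assuming `taste_aroma_texts` is a dict mapping each beer to its concatenated taste/aroma sentences (the variable name is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

beers = list(taste_aroma_texts)
corpus = [taste_aroma_texts[b] for b in beers]

# TF-IDF weighting of terms in the taste/aroma sentences of each beer.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Beers x terms table of enrichment scores for sensorial words.
enrichment = pd.DataFrame(tfidf.toarray(), index=beers,
                          columns=vectorizer.get_feature_names_out())
```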

The sex of the tasting subject was not considered when building our sensory database. Instead, results from different panelists were averaged, both for our trained panel (56% male, 44% female) and the RateBeer reviews (70% male, 30% female for RateBeer as a whole).

Beer price collection and processing

Beer prices were collected from the following stores: Colruyt, Delhaize, Total Wine, BeerHawk, The Belgian Beer Shop, The Belgian Shop, and Beer of Belgium. Where applicable, prices were converted to Euros and normalized per liter. Spearman correlations were calculated between these prices and mean overall appreciation scores from RateBeer and the taste panel, respectively.

Pairwise Spearman Rank correlations were calculated between all sensory properties.

Machine learning models

Predictive modeling of sensory profiles from chemical data.

Regression models were constructed to predict (a) trained panel scores for beer flavors and quality from beer chemical profiles and (b) public reviews’ appreciation scores from beer chemical profiles. Z-scores were used to represent sensory attributes in both data sets. Chemical properties with log-normal distributions (Shapiro-Wilk test, p < 0.05) were log-transformed. Missing chemical measurements (0.1% of all data) were replaced with mean values per attribute. Observations from 250 beers were randomly separated into a training set (70%, 175 beers) and a test set (30%, 75 beers), stratified per beer style. Chemical measurements (p = 231) were normalized based on the training set average and standard deviation. In total, three linear regression-based models (linear regression with first-order interaction terms (LR), lasso regression with first-order interaction terms (Lasso) and partial least squares regression (PLSR)), five decision tree models (AdaBoost regressor (ABR), Extra Trees (ET), Gradient Boosting regressor (GBR), Random Forest (RF) and XGBoost regressor (XGBR)), one support vector machine model (SVR) and one artificial neural network model (ANN) were trained. The models were implemented using the ‘scikit-learn’ package (v1.2.2) and the ‘xgboost’ package (v1.7.3) in Python (v3.9.16). Models were trained, and hyperparameters optimized, using five-fold cross-validated grid search with the coefficient of determination (R²) as the evaluation metric. The ANN (scikit-learn’s MLPRegressor) was optimized using Bayesian Tree-Structured Parzen Estimator optimization with the ‘Optuna’ Python package (v3.2.0). Individual models were trained per attribute, and a multi-output model was trained on all attributes simultaneously.
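
A minimal sketch of this pipeline for a single target (e.g. RateBeer overall appreciation), with illustrative variable names and a reduced hyperparameter grid:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# X: beers x chemical parameters (DataFrame), y: one z-scored sensory target,
# styles: beer style label per sample -- all assumed to be prepared beforehand.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=styles)

# Normalize chemical measurements using the training-set mean and SD only.
scaler = StandardScaler().fit(X_train)
X_train_s = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

# Five-fold cross-validated grid search with R2 as the selection metric
# (the grid shown here is a reduced, illustrative one).
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [200, 500],
                "max_depth": [3, 5],
                "learning_rate": [0.05, 0.1]},
    scoring="r2", cv=5)
grid.fit(X_train_s, y_train)

print("test-set R2:", r2_score(y_test, grid.predict(X_test_s)))
```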

Model dissection

GBR was found to outperform the other methods, resulting in models with the highest average R² values in both the trained panel and public review data sets. Impurity-based rankings of the most important predictors for each predicted sensorial trait were obtained using the ‘scikit-learn’ package. To observe the relationships between these chemical properties and their predicted targets, partial dependence plots (PDP) were constructed for the six most important predictors of consumer appreciation 74 , 75 .
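
A minimal sketch of both steps, assuming `gbr` is a fitted GradientBoostingRegressor and `X_train` a pandas DataFrame of chemical parameters (variable names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Impurity-based (MDI) importance ranking of the chemical parameters.
importance = pd.Series(gbr.feature_importances_, index=X_train.columns)
top6 = importance.sort_values(ascending=False).head(6).index.tolist()

# One-way partial dependence for the six most important predictors.
PartialDependenceDisplay.from_estimator(gbr, X_train, features=top6)

# Two-way partial dependence for the top pair, visualizing their interaction.
PartialDependenceDisplay.from_estimator(gbr, X_train, features=[(top6[0], top6[1])])
plt.show()
```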

The ‘SHAP’ package in Python (v0.41.0) was implemented to provide an alternative ranking of predictor importance and to visualize the predictors’ effects as a function of their concentration 68 .
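
A minimal sketch of the SHAP analysis under the same assumptions (a fitted `gbr` model and a feature DataFrame `X_train`):

```python
import shap

# TreeExplainer computes per-sample SHAP values for the fitted tree ensemble.
explainer = shap.TreeExplainer(gbr)
shap_values = explainer.shap_values(X_train)

# Beeswarm-style summary of the 15 most influential parameters (cf. Fig. 4B):
# each dot is a sample, colored by the parameter's value.
shap.summary_plot(shap_values, X_train, max_display=15)
```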

Validation of causal chemical properties

To validate the effects of the most important model features on predicted sensory attributes, beers were spiked with the chemical compounds identified by the models and descriptive sensory analyses were carried out according to the American Society of Brewing Chemists (ASBC) protocol 90 .

Compound spiking was done 30 min before tasting. Compounds were spiked into fresh beer bottles, which were immediately resealed and inverted three times. Fresh bottles of beer were opened for the same duration, resealed, and inverted three times, to serve as controls. Pairs of spiked samples and controls were served simultaneously, chilled and in dark glasses as outlined in the Trained panel section above. Tasters were instructed to select the glass with the higher flavor intensity for each attribute (directional difference test 92 ) and to select the glass they preferred.

The final concentration after spiking was equal to the within-style average, after normalizing by ethanol concentration. This was done to ensure balanced flavor profiles in the final spiked beer. The same methods were applied to improve a non-alcoholic beer. Compounds were the following: ethyl acetate (Merck KGaA, W241415), ethyl hexanoate (Merck KGaA, W243906), isoamyl acetate (Merck KGaA, W205508), phenethyl acetate (Merck KGaA, W285706), ethanol (96%, Colruyt), glycerol (Merck KGaA, W252506), lactic acid (Merck KGaA, 261106).

Significant differences in preference or perceived intensity were determined by performing the two-sided binomial test on each attribute.
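
A minimal sketch of this test for a single attribute, with illustrative counts rather than the study’s data:

```python
from scipy.stats import binomtest

n_tasters = 20        # panelists in the session
n_chose_spiked = 16   # illustrative count of tasters preferring the spiked sample

# Two-sided binomial test against chance (p = 0.5), applied per attribute.
result = binomtest(n_chose_spiked, n=n_tasters, p=0.5, alternative="two-sided")
print(result.pvalue)  # significant at alpha = 0.05 if below 0.05
```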

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The data that support the findings of this work are available in the Supplementary Data files and have been deposited to Zenodo under accession code 10653704 93 . The RateBeer scores data are under restricted access; they are not publicly available as they are the property of RateBeer (ZX Ventures, USA). Access can be obtained from the authors upon reasonable request and with permission of RateBeer (ZX Ventures, USA). Source data are provided with this paper.

Code availability

The code for training the machine learning models, analyzing the models, and generating the figures has been deposited to Zenodo under accession code 10653704 93 .

Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355 , 391–394 (2017).

Plutowska, B. & Wardencki, W. Application of gas chromatography–olfactometry (GC–O) in analysis and quality assessment of alcoholic beverages – A review. Food Chem. 107 , 449–463 (2008).

Legin, A., Rudnitskaya, A., Seleznev, B. & Vlasov, Y. Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie. Anal. Chim. Acta 534 , 129–135 (2005).

Loutfi, A., Coradeschi, S., Mani, G. K., Shankar, P. & Rayappan, J. B. B. Electronic noses for food quality: A review. J. Food Eng. 144 , 103–111 (2015).

Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P. & Barabási, A.-L. Flavor network and the principles of food pairing. Sci. Rep. 1 , 196 (2011).

Bartoshuk, L. M. & Klee, H. J. Better fruits and vegetables through sensory analysis. Curr. Biol. 23 , R374–R378 (2013).

Piggott, J. R. Design questions in sensory and consumer science. Food Qual. Prefer. 3293 , 217–220 (1995).

Kermit, M. & Lengard, V. Assessing the performance of a sensory panel-panellist monitoring and tracking. J. Chemom. 19 , 154–161 (2005).

Cook, D. J., Hollowood, T. A., Linforth, R. S. T. & Taylor, A. J. Correlating instrumental measurements of texture and flavour release with human perception. Int. J. Food Sci. Technol. 40 , 631–641 (2005).

Chinchanachokchai, S., Thontirawong, P. & Chinchanachokchai, P. A tale of two recommender systems: The moderating role of consumer expertise on artificial intelligence based product recommendations. J. Retail. Consum. Serv. 61 , 1–12 (2021).

Ross, C. F. Sensory science at the human-machine interface. Trends Food Sci. Technol. 20 , 63–72 (2009).

Chambers, E. IV & Koppel, K. Associations of volatile compounds with sensory aroma and flavor: The complex nature of flavor. Molecules 18 , 4887–4905 (2013).

Pinu, F. R. Metabolomics—The new frontier in food safety and quality research. Food Res. Int. 72 , 80–81 (2015).

Danezis, G. P., Tsagkaris, A. S., Brusic, V. & Georgiou, C. A. Food authentication: state of the art and prospects. Curr. Opin. Food Sci. 10 , 22–31 (2016).

Shepherd, G. M. Smell images and the flavour system in the human brain. Nature 444 , 316–321 (2006).

Meilgaard, M. C. Prediction of flavor differences between beers from their chemical composition. J. Agric. Food Chem. 30 , 1009–1017 (1982).

Xu, L. et al. Widespread receptor-driven modulation in peripheral olfactory coding. Science 368 , eaaz5390 (2020).

Kupferschmidt, K. Following the flavor. Science 340 , 808–809 (2013).

Billesbølle, C. B. et al. Structural basis of odorant recognition by a human odorant receptor. Nature 615 , 742–749 (2023).

Smith, B. Perspective: Complexities of flavour. Nature 486 , S6–S6 (2012).

Pfister, P. et al. Odorant receptor inhibition is fundamental to odor encoding. Curr. Biol. 30 , 2574–2587 (2020).

Moskowitz, H. W., Kumaraiah, V., Sharma, K. N., Jacobs, H. L. & Sharma, S. D. Cross-cultural differences in simple taste preferences. Science 190 , 1217–1218 (1975).

Eriksson, N. et al. A genetic variant near olfactory receptor genes influences cilantro preference. Flavour 1 , 22 (2012).

Ferdenzi, C. et al. Variability of affective responses to odors: Culture, gender, and olfactory knowledge. Chem. Senses 38 , 175–186 (2013).

Lawless, H. T. & Heymann, H. Sensory evaluation of food: Principles and practices. (Springer, New York, NY). https://doi.org/10.1007/978-1-4419-6488-5 (2010).

Colantonio, V. et al. Metabolomic selection for enhanced fruit flavor. Proc. Natl. Acad. Sci. 119 , e2115865119 (2022).

Fritz, F., Preissner, R. & Banerjee, P. VirtualTaste: a web server for the prediction of organoleptic properties of chemical compounds. Nucleic Acids Res 49 , W679–W684 (2021).

Tuwani, R., Wadhwa, S. & Bagler, G. BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules. Sci. Rep. 9 , 1–13 (2019).

Dagan-Wiener, A. et al. Bitter or not? BitterPredict, a tool for predicting taste from chemical structure. Sci. Rep. 7 , 1–13 (2017).

Pallante, L. et al. Toward a general and interpretable umami taste predictor using a multi-objective machine learning approach. Sci. Rep. 12 , 1–11 (2022).

Malavolta, M. et al. A survey on computational taste predictors. Eur. Food Res. Technol. 248 , 2215–2235 (2022).

Lee, B. K. et al. A principal odor map unifies diverse tasks in olfactory perception. Science 381 , 999–1006 (2023).

Mayhew, E. J. et al. Transport features predict if a molecule is odorous. Proc. Natl. Acad. Sci. 119 , e2116576119 (2022).

Niu, Y. et al. Sensory evaluation of the synergism among ester odorants in light aroma-type liquor by odor threshold, aroma intensity and flash GC electronic nose. Food Res. Int. 113 , 102–114 (2018).

Yu, P., Low, M. Y. & Zhou, W. Design of experiments and regression modelling in food flavour and sensory analysis: A review. Trends Food Sci. Technol. 71 , 202–215 (2018).

Oladokun, O. et al. The impact of hop bitter acid and polyphenol profiles on the perceived bitterness of beer. Food Chem. 205 , 212–220 (2016).

Linforth, R., Cabannes, M., Hewson, L., Yang, N. & Taylor, A. Effect of fat content on flavor delivery during consumption: An in vivo model. J. Agric. Food Chem. 58 , 6905–6911 (2010).

Guo, S., Na Jom, K. & Ge, Y. Influence of roasting condition on flavor profile of sunflower seeds: A flavoromics approach. Sci. Rep. 9 , 11295 (2019).

Ren, Q. et al. The changes of microbial community and flavor compound in the fermentation process of Chinese rice wine using Fagopyrum tataricum grain as feedstock. Sci. Rep. 9 , 3365 (2019).

Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning. (Springer, New York, NY). https://doi.org/10.1007/978-0-387-21606-5 (2001).

Dietz, C., Cook, D., Huismann, M., Wilson, C. & Ford, R. The multisensory perception of hop essential oil: a review. J. Inst. Brew. 126 , 320–342 (2020).

Roncoroni, M. & Verstrepen, K. J. Belgian Beer: Tested and Tasted. (Lannoo, 2018).

Meilgaard, M. C. Flavor chemistry of beer: Part II: Flavor and threshold of 239 aroma volatiles. Master Brew. Assoc. Am. Tech. Q. 12 (1975).

Bokulich, N. A. & Bamforth, C. W. The microbiology of malting and brewing. Microbiol. Mol. Biol. Rev. MMBR 77 , 157–172 (2013).

Dzialo, M. C., Park, R., Steensels, J., Lievens, B. & Verstrepen, K. J. Physiology, ecology and industrial applications of aroma formation in yeast. FEMS Microbiol. Rev. 41 , S95–S128 (2017).

Datta, A. et al. Computer-aided food engineering. Nat. Food 3 , 894–904 (2022).

American Society of Brewing Chemists. Beer Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A.).

Olaniran, A. O., Hiralal, L., Mokoena, M. P. & Pillay, B. Flavour-active volatile compounds in beer: production, regulation and control. J. Inst. Brew. 123 , 13–23 (2017).

Verstrepen, K. J. et al. Flavor-active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Meilgaard, M. C. Flavour chemistry of beer. Part I: Flavour interaction between principal volatiles. Master Brew. Assoc. Am. Tech. Q. 12 , 107–117 (1975).

Briggs, D. E., Boulton, C. A., Brookes, P. A. & Stevens, R. Brewing 227–254. (Woodhead Publishing). https://doi.org/10.1533/9781855739062.227 (2004).

Bossaert, S., Crauwels, S., De Rouck, G. & Lievens, B. The power of sour - A review: Old traditions, new opportunities. BrewingScience 72 , 78–88 (2019).

Verstrepen, K. J. et al. Flavor active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Snauwaert, I. et al. Microbial diversity and metabolite composition of Belgian red-brown acidic ales. Int. J. Food Microbiol. 221 , 1–11 (2016).

Spitaels, F. et al. The microbial diversity of traditional spontaneously fermented lambic beer. PLoS ONE 9 , e95384 (2014).

Blanco, C. A., Andrés-Iglesias, C. & Montero, O. Low-alcohol Beers: Flavor Compounds, Defects, and Improvement Strategies. Crit. Rev. Food Sci. Nutr. 56 , 1379–1388 (2016).

Jackowski, M. & Trusek, A. Non-alcoholic beer production – an overview. Pol. J. Chem. Technol. 20 , 32–38 (2018).

Takoi, K. et al. The contribution of geraniol metabolism to the citrus flavour of beer: Synergy of geraniol and β-citronellol under coexistence with excess linalool. J. Inst. Brew. 116 , 251–260 (2010).

Kroeze, J. H. & Bartoshuk, L. M. Bitterness suppression as revealed by split-tongue taste stimulation in humans. Physiol. Behav. 35 , 779–783 (1985).

Mennella, J. A. et al. “A spoonful of sugar helps the medicine go down”: Bitter masking by sucrose among children and adults. Chem. Senses 40 , 17–25 (2015).

Wietstock, P., Kunz, T., Perreira, F. & Methner, F.-J. Metal chelation behavior of hop acids in buffered model systems. BrewingScience 69 , 56–63 (2016).

Sancho, D., Blanco, C. A., Caballero, I. & Pascual, A. Free iron in pale, dark and alcohol-free commercial lager beers. J. Sci. Food Agric. 91 , 1142–1147 (2011).

Rodrigues, H. & Parr, W. V. Contribution of cross-cultural studies to understanding wine appreciation: A review. Food Res. Int. 115 , 251–258 (2019).

Korneva, E. & Blockeel, H. Towards better evaluation of multi-target regression models. in ECML PKDD 2020 Workshops (eds. Koprinska, I. et al.) 353–362 (Springer International Publishing, Cham, 2020). https://doi.org/10.1007/978-3-030-65965-3_23 .

Ares, G. Mathematical and Statistical Methods in Food Science and Technology. (Wiley, 2013).

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at http://arxiv.org/abs/2207.08815 (2022).

Gries, S. T. Statistics for Linguistics with R: A Practical Introduction. (De Gruyter Mouton, 2021). https://doi.org/10.1515/9783110718256 .

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 , 56–67 (2020).

Ickes, C. M. & Cadwallader, K. R. Effects of ethanol on flavor perception in alcoholic beverages. Chemosens. Percept. 10 , 119–134 (2017).

Kato, M. et al. Influence of high molecular weight polypeptides on the mouthfeel of commercial beer. J. Inst. Brew. 127 , 27–40 (2021).

Wauters, R. et al. Novel Saccharomyces cerevisiae variants slow down the accumulation of staling aldehydes and improve beer shelf-life. Food Chem. 398 , 1–11 (2023).

Li, H., Jia, S. & Zhang, W. Rapid determination of low-level sulfur compounds in beer by headspace gas chromatography with a pulsed flame photometric detector. J. Am. Soc. Brew. Chem. 66 , 188–191 (2008).

Dercksen, A., Laurens, J., Torline, P., Axcell, B. C. & Rohwer, E. Quantitative analysis of volatile sulfur compounds in beer using a membrane extraction interface. J. Am. Soc. Brew. Chem. 54 , 228–233 (1996).

Molnar, C. Interpretable Machine Learning: A Guide for Making Black-Box Models Interpretable. (2020).

Zhao, Q. & Hastie, T. Causal interpretations of black-box models. J. Bus. Econ. Stat. Publ. Am. Stat. Assoc. 39 , 272–281 (2019).

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer, 2019).

Labrado, D. et al. Identification by NMR of key compounds present in beer distillates and residual phases after dealcoholization by vacuum distillation. J. Sci. Food Agric. 100 , 3971–3978 (2020).

Lusk, L. T., Kay, S. B., Porubcan, A. & Ryder, D. S. Key olfactory cues for beer oxidation. J. Am. Soc. Brew. Chem. 70 , 257–261 (2012).

Gonzalez Viejo, C., Torrico, D. D., Dunshea, F. R. & Fuentes, S. Development of artificial neural network models to assess beer acceptability based on sensory properties using a robotic pourer: A comparative model approach to achieve an artificial intelligence system. Beverages 5 , 33 (2019).

Gonzalez Viejo, C., Fuentes, S., Torrico, D. D., Godbole, A. & Dunshea, F. R. Chemical characterization of aromas in beer and their effect on consumers liking. Food Chem. 293 , 479–485 (2019).

Gilbert, J. L. et al. Identifying breeding priorities for blueberry flavor using biochemical, sensory, and genotype by environment analyses. PLOS ONE 10 , 1–21 (2015).

Goulet, C. et al. Role of an esterase in flavor volatile variation within the tomato clade. Proc. Natl. Acad. Sci. 109 , 19009–19014 (2012).

Borisov, V. et al. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 1–21 https://doi.org/10.1109/TNNLS.2022.3229161 (2022).

Statista. Statista Consumer Market Outlook: Beer - Worldwide.

Seitz, H. K. & Stickel, F. Molecular mechanisms of alcohol-mediated carcinogenesis. Nat. Rev. Cancer 7 , 599–612 (2007).

Voordeckers, K. et al. Ethanol exposure increases mutation rate through error-prone polymerases. Nat. Commun. 11 , 3664 (2020).

Goelen, T. et al. Bacterial phylogeny predicts volatile organic compound composition and olfactory response of an aphid parasitoid. Oikos 129 , 1415–1428 (2020).

Reher, T. et al. Evaluation of hop (Humulus lupulus) as a repellent for the management of Drosophila suzukii. Crop Prot. 124 , 104839 (2019).

Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 10 , 770–781 (1999).

American Society of Brewing Chemists. Sensory Analysis Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A., 1992).

McAuley, J., Leskovec, J. & Jurafsky, D. Learning Attitudes and Attributes from Multi-Aspect Reviews. Preprint at https://doi.org/10.48550/arXiv.1210.3926 (2012).

Meilgaard, M. C., Civille, G. V. & Carr, B. T. Sensory Evaluation Techniques. (CRC Press, Boca Raton). https://doi.org/10.1201/b16452 (2014).

Schreurs, M. et al. Data from: Predicting and improving complex beer flavor through machine learning. Zenodo https://doi.org/10.5281/zenodo.10653704 (2024).


Acknowledgements

We thank all lab members for their discussions and thank all tasting panel members for their contributions. Special thanks go out to Dr. Karin Voordeckers for her tremendous help in proofreading and improving the manuscript. M.S. was supported by a Baillet-Latour fellowship; L.C. acknowledges financial support from KU Leuven (C16/17/006); F.A.T. was supported by a PhD fellowship from FWO (1S08821N). Research in the lab of K.J.V. is supported by KU Leuven, FWO, VIB, VLAIO and the Brewing Science Serves Health Fund. Research in the lab of T.W. is supported by FWO (G.0A51.15) and KU Leuven (C16/17/006).

Author information

These authors contributed equally: Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni.

Authors and Affiliations

VIB—KU Leuven Center for Microbiology, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni, Lloyd Cool, Beatriz Herrera-Malaver, Florian A. Theßeling & Kevin J. Verstrepen

CMPG Laboratory of Genetics and Genomics, KU Leuven, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Leuven Institute for Beer Research (LIBR), Gaston Geenslaan 1, B-3001, Leuven, Belgium

Laboratory of Socioecology and Social Evolution, KU Leuven, Naamsestraat 59, B-3000, Leuven, Belgium

Lloyd Cool, Christophe Vanderaa & Tom Wenseleers

VIB Bioinformatics Core, VIB, Rijvisschestraat 120, B-9052, Ghent, Belgium

Łukasz Kreft & Alexander Botzki

AB InBev SA/NV, Brouwerijplein 1, B-3000, Leuven, Belgium

Philippe Malcorps & Luk Daenen


Contributions

S.P., M.S. and K.J.V. conceived the experiments. S.P., M.S. and K.J.V. designed the experiments. S.P., M.S., M.R., B.H. and F.A.T. performed the experiments. S.P., M.S., L.C., C.V., L.K., A.B., P.M., L.D., T.W. and K.J.V. contributed analysis ideas. S.P., M.S., L.C., C.V., T.W. and K.J.V. analyzed the data. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Kevin J. Verstrepen.

Ethics declarations

Competing interests

K.J.V. is affiliated with bar.on. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Florian Bauer, Andrew John Macintosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Peer Review File
  • Description of Additional Supplementary Files
  • Supplementary Data 1–7
  • Reporting Summary
  • Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Schreurs, M., Piampongsant, S., Roncoroni, M. et al. Predicting and improving complex beer flavor through machine learning. Nat Commun 15, 2368 (2024). https://doi.org/10.1038/s41467-024-46346-0


Received: 30 October 2023

Accepted: 21 February 2024

Published: 26 March 2024

DOI: https://doi.org/10.1038/s41467-024-46346-0




How to write a data availability statement in a research paper

IMAGES

  1. The Data Availability Statement

  2. Data Availability Statements Tips

  3. Writing “Data Availability Statement” in Your Publications

  4. Data Availability Statements Tips

  5. Data Availability Statements in The Journal of Organic Chemistry

  6. Flow diagram of Data Availability Statement coding.

VIDEO

  1. Approaches to writing a research proposal

  2. Data Availability Statement

  3. Writing a thesis statement: Research Paper

  4. Problem Statement in Research Proposal

  5. How to Find Research Papers & Organize References

COMMENTS

  1. Writing a data availability statement

    A data availability statement (also sometimes called a 'data access statement') tells the reader where the research data associated with a paper are available, and under what conditions the data can be accessed. It also includes links (where applicable) to the data set.

  2. Data Availability Statements

    The data availability statement is a valuable link between a paper's results and the supporting evidence. Springer Nature's data policy is based on transparency, requiring these statements in original research articles across our journals. The guidance below offers advice on how to create a data availability statement, along with examples ...

  3. How to write a data availability statement (DAS)

    Examples range from a clear data availability statement ("Data are available in a public, open access repository."), to a data access statement given as part of the methods section ("All data and codes are available under the following OSF repository: https://osf.io/9999/ "), to an unclear statement about whether there is any data ("Availability of data and materials: Not applicable").

  4. Write a data availability statement for a paper : Springer Support

    Your data availability statement should describe how the data supporting the results reported in your paper can be accessed. This includes both primary data (data you generated as part of your study) and secondary data (data from other sources that you reused in your study). If the data are in a repository, include hyperlinks and persistent ... (A minimal code sketch showing how such a persistent identifier can be resolved appears after this list.)

  5. PDF Data availability statements some examples of good practice

    It is generally not acceptable to include a statement just indicating that data and/or code will be made available on request by contacting the author. Many publishers and journals provide comprehensive guidance on how to structure data availability statements - including, for example, Springer Nature and PLOS. The following examples of data ...

  6. The Data Availability Statement

    A Data Availability Statement (also called Data Access Statement) tells the reader if the data behind a research project can be accessed and, if so, where and how. Ideally, authors should include hyperlinks to public databases to make it easier for the readers to find them. If you are currently in the process of submitting an article to a ...

  7. Writing a data availability statement for your paper

    The statement should describe all of the data which underpin your article, and how and where they can be accessed. If your data are in a repository, you should include the name of the repository, hyperlinks and persistent identifiers for the datasets as part of your data availability statement. You can also add information about any ...

  8. Writing a Data Availability Statement: Expert Guidelines ...

    Based on the types of data reported in your study, there are primarily four sections that should be included in your Data Availability Statement. Source data: this covers data that the authors did not collect themselves but used for analysis in the reported study; in other words, data obtained from a third party.

  9. PDF How to Write a Data Availability Statement

    Using subheadings for different data types: this Research Article has source and underlying data available from different locations, so the authors have split out their Data Availability Statement using subheadings to make this clear. Third-party data: this Research Article uses third-party source data owned by the Demographic and Health Surveys

  10. Data availability statements

    Data Availability Statements help to promote transparency and reproducibility in research, and to increase the visibility of valuable evidence produced or gathered during the course of research. As part of our commitment to supporting open research, some of our journals now require all manuscripts to include a Data Availability Statement in ...

  11. Data Availability

    Data availability allows and facilitates:

  • Validation, replication, reanalysis, new analysis, reinterpretation or inclusion into meta-analyses;
  • Reproducibility of research;
  • Efforts to ensure data are archived, increasing the value of the investment made in funding scientific research;

  12. Write a data availability statement for a paper : BMC Support

    Your data availability statement should describe how the data supporting the results reported in your paper can be accessed. This includes both primary data (data you generated as part of your study) and secondary data (data from other sources that you reused in your study). If the data are in a repository, include hyperlinks and persistent ...

  13. How to write a data access statement

    Data access statements, also known as data availability statements, are included in publications to describe where the data associated with the paper is available, and under what conditions the data can be accessed. They are required by many funders and scientific journals as well as the UKRI Common Principles on Data Policy.

  14. Writing "Data Availability Statement" in Your Publications

    A Data Availability Statement, or DAS, serves to add transparency, so that data can be validated, reused, and properly cited. A DAS is simple to write: it is one or two sentences that tell readers where to find the data associated with your paper. You can place it near the end of your manuscript, for example just before the "References".

  15. PDF nature research Data availability statements and data citations policy

    Data availability statements and data citations policy: frequently asked questions (FAQs). It may be appropriate not to name an individual responsible for providing data access but the research ...

  16. Creating a Data Availability Statement

    A data availability statement (DAS) details where the data used in a published paper can be found and how it can be accessed. These statements are required by many publishers, including FASEB, our member societies, Cambridge, Springer Nature, Wiley, and Taylor & Francis. Some grant funders also require DASs.

  17. Writing a Data Availability Statement: Expert Guidelines ...

    drafting data availability statements. If instructions regarding how to write an effective Data Availability Statement are not specified by journals, here are some templates that you might find handy. Furthermore, you can tailor these to your requirements (a small template-filling sketch appears after this list). Illustration of Data Availability Statement Types: 1.

  18. (PDF) Data Availability Principles and Practice

    The Data Availability Statement need not be long, and the statement does not count toward the word-count limit. If data or the source code used to produce and analyze the data are for some reason

  19. Availability Statement Examples

    Availability Statement Examples. Authors are expected to provide an Availability Statement in their article immediately following the Acknowledgments section that details where data, software, and other research objects are available, and how they can be accessed and reused (listing specific restrictions, if any).

  20. Reporting standards and availability of data, materials, code and

    More information about writing data availability statements and data citation is available through the Springer Nature Research ... fully integrated with the paper, for all original research articles.

  21. A descriptive analysis of the data availability statements accompanying

    Conclusion. Requiring that authors submit a data availability statement is a good first step, but is insufficient to ensure data availability. Strict editorial policies that mandate data sharing (where appropriate) as a condition of publication appear to be effective in making research data available.

  22. Making a statement about data availability

    Open Science. Making a statement about data availability. All Hindawi journals now require a data availability statement to be provided at the point of submission for all new research articles and clinical studies. This is a part of Hindawi's drive for greater data sharing, and a significant step towards a comprehensive Open Data policy.

  23. How Important Are Data Availability Statements (DAS)?

    It is a short statement that tells readers how and where they can access the original data of any published research. The original data includes raw and processed data. The statement should also include a link (if any) and reference details (such as accession numbers) to the data.

  24. Predicting and improving complex beer flavor through machine ...

    Interestingly, different beer styles show distinct patterns for some flavor compounds (Supplementary Fig. S3). These observations agree with expectations for key beer styles, and serve as a control ...
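
Several of the entries above (for example item 4) stress including hyperlinks and persistent identifiers such as DOIs when data sit in a repository. The value of a persistent identifier is that readers, and software, can resolve it to the dataset's landing page and metadata. The sketch below is a minimal illustration of that idea only; it assumes the public DataCite REST API and uses a made-up placeholder DOI, so field names and endpoint behaviour should be checked against the current DataCite documentation before relying on them.

```python
# Minimal sketch: look up a dataset DOI listed in a data availability statement.
# Assumes the public DataCite REST API; the DOI used below is a placeholder,
# not a real dataset identifier.
import requests


def resolve_dataset_doi(doi: str) -> dict:
    """Return basic metadata and the landing-page URL for a DataCite-registered DOI."""
    response = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    response.raise_for_status()
    attributes = response.json()["data"]["attributes"]
    return {
        "title": attributes["titles"][0]["title"],
        "repository": attributes.get("publisher"),   # usually the data repository
        "year": attributes.get("publicationYear"),
        "landing_page": attributes.get("url"),        # where the data can be accessed
    }


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(resolve_dataset_doi("10.5281/zenodo.1234567"))
```

A reader who finds only a bare URL in a statement has nothing to resolve if the link rots; a DOI or accession number keeps the dataset findable even if the repository changes its page structure.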
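Item 17 above mentions reusable templates covering the common cases (data in a repository, restricted data available on request, no new data generated). The sketch below is a hypothetical illustration of that idea: the template wording and the helper function are our own, not taken from item 17 or from any journal's required phrasing, so the text should always be adapted to the target journal's guidance.

```python
# Hypothetical data availability statement templates for three common cases.
# The wording is illustrative only; follow the target journal's own guidance.
DAS_TEMPLATES = {
    "repository": (
        "The datasets generated and/or analysed during the current study are "
        "available in the {repository} repository, {identifier}."
    ),
    "on_request": (
        "The datasets generated and/or analysed during the current study are "
        "not publicly available due to {restriction} but are available from "
        "the corresponding author on reasonable request."
    ),
    "no_new_data": (
        "No datasets were generated or analysed during the current study."
    ),
}


def build_das(case: str, **details: str) -> str:
    """Return a draft data availability statement for the given case."""
    return DAS_TEMPLATES[case].format(**details)


# Example usage with made-up details:
print(build_das("repository",
                repository="Zenodo",
                identifier="https://doi.org/10.5281/zenodo.1234567"))
print(build_das("on_request", restriction="participant privacy restrictions"))
```

Keeping the statements as fill-in templates makes it harder to forget the pieces reviewers look for: what data exist, where they are, the persistent identifier, and the reason for any access restriction.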