applied linguistics research journal web of science

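Rank | Type | SJR | Quartile | H index | Total Docs. | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites/Doc. (2years) | Ref./Doc. | %Female
1 | journal | 4.006 | Q1 | 56 | 97 | 232 | 5911 | 5002 | 232 | 15.63 | 60.94 | 29.27
2 | journal | 3.260 | Q1 | 115 | 447 | 490 | 54420 | 9187 | 489 | 18.42 | 121.74 | 28.68
3 | journal | 2.943 | Q1 | 124 | 74 | 150 | 4805 | 1500 | 150 | 8.93 | 64.93 | 57.94
4 | journal | 2.658 | Q1 | 162 | 13 | 109 | 573 | 841 | 108 | 7.39 | 44.08 | 52.63
5 | journal | 2.606 | Q1 | 105 | 61 | 101 | 2688 | 644 | 81 | 5.00 | 44.07 | 52.53
6 | journal | 2.370 | Q1 | 75 | 115 | 262 | 7711 | 2445 | 259 | 8.57 | 67.05 | 47.30
7 | journal | 2.259 | Q1 | 112 | 52 | 155 | 3173 | 680 | 125 | 3.93 | 61.02 | 49.44
8 | journal | 2.258 | Q1 | 72 | 16 | 62 | 1055 | 294 | 62 | 4.00 | 65.94 | 62.86
9 | journal | 2.124 | Q1 | 115 | 82 | 166 | 5522 | 836 | 153 | 4.52 | 67.34 | 52.29
10 | journal | 2.075 | Q1 | 104 | 146 | 461 | 9456 | 3132 | 458 | 6.12 | 64.77 | 43.27
11 | journal | 2.042 | Q1 | 168 | 142 | 309 | 8968 | 2352 | 306 | 8.77 | 63.15 | 14.64
12 | journal | 2.034 | Q1 | 106 | 22 | 67 | 1120 | 416 | 66 | 5.21 | 50.91 | 54.39
13 | journal | 1.942 | Q1 | 172 | 38 | 171 | 2877 | 809 | 168 | 3.32 | 75.71 | 50.98
14 | journal | 1.908 | Q1 | 132 | 96 | 127 | 4236 | 669 | 122 | 5.03 | 44.13 | 61.15
15 | journal | 1.888 | Q1 | 122 | 112 | 178 | 5739 | 923 | 171 | 4.44 | 51.24 | 47.51
16 | journal | 1.854 | Q1 | 125 | 54 | 144 | 2650 | 695 | 142 | 4.23 | 49.07 | 55.67
17 | journal | 1.805 | Q1 | 48 | 5 | 57 | 313 | 110 | 57 | 1.93 | 62.60 | 42.86
18 | journal | 1.786 | Q1 | 50 | 75 | 130 | 3938 | 681 | 115 | 4.81 | 52.51 | 45.29
19 | journal | 1.752 | Q1 | 83 | 59 | 85 | 2708 | 419 | 79 | 4.75 | 45.90 | 44.26
20 | journal | 1.738 | Q1 | 81 | 192 | 356 | 12992 | 1724 | 347 | 4.25 | 67.67 | 49.48
21 | journal | 1.679 | Q1 | 98 | 1 | 98 | 83 | 552 | 98 | 6.02 | 83.00 | 100.00
22 | journal | 1.667 | Q1 | 44 | 33 | 91 | 1654 | 416 | 85 | 4.41 | 50.12 | 62.30
23 | journal | 1.626 | Q1 | 80 | 87 | 135 | 5481 | 434 | 129 | 3.04 | 63.00 | 30.19
24 | journal | 1.608 | Q1 | 14 | 29 | 79 | 2530 | 334 | 78 | 3.68 | 87.24 | 53.64
25 | journal | 1.607 | Q1 | 17 | 15 | 45 | 822 | 163 | 43 | 1.67 | 54.80 | 57.14
26 | journal | 1.593 | Q1 | 25 | 34 | 54 | 2074 | 251 | 54 | 4.76 | 61.00 | 64.18
27 | journal | 1.590 | Q1 | 226 | 278 | 909 | 19470 | 3434 | 896 | 3.34 | 70.04 | 44.40
28 | journal | 1.589 | Q1 | 75 | 62 | 185 | 3174 | 833 | 184 | 3.71 | 51.19 | 49.55
29 | journal | 1.568 | Q1 | 79 | 57 | 120 | 3258 | 562 | 120 | 4.72 | 57.16 | 50.00
30 | journal | 1.550 | Q1 | 42 | 5 | 37 | 319 | 112 | 33 | 2.20 | 63.80 | 56.25
31 | journal | 1.537 | Q1 | 8 | 48 | 28 | 2452 | 245 | 27 | 8.75 | 51.08 | 50.00
32 | journal | 1.523 | Q1 | 72 | 52 | 167 | 700 | 553 | 149 | 3.02 | 13.46 | 50.00
33 | journal | 1.493 | Q1 | 67 | 32 | 65 | 1323 | 391 | 57 | 6.24 | 41.34 | 55.07
34 | journal | 1.455 | Q1 | 28 | 32 | 84 | 1646 | 333 | 72 | 3.42 | 51.44 | 39.71
35 | journal | 1.425 | Q1 | 82 | 82 | 279 | 5830 | 1071 | 229 | 2.41 | 71.10 | 64.02
36 | journal | 1.419 | Q1 | 134 | 33 | 83 | 2784 | 277 | 82 | 3.18 | 84.36 | 37.80
37 | book series | 1.386 | Q1 | 62 | 11 | 37 | 544 | 167 | 36 | 2.93 | 49.45 | 58.82
38 | journal | 1.380 | Q1 | 74 | 36 | 100 | 1578 | 461 | 97 | 3.26 | 43.83 | 54.55
39 | journal | 1.370 | Q1 | 10 | 24 | 38 | 1409 | 154 | 37 | 4.07 | 58.71 | 40.00
40 | journal | 1.367 | Q1 | 115 | 24 | 82 | 2265 | 791 | 81 | 9.52 | 94.38 | 36.47
41 | journal | 1.341 | Q1 | 68 | 95 | 426 | 4274 | 1779 | 419 | 3.52 | 44.99 | 63.16
42 | journal | 1.333 | Q1 | 47 | 96 | 186 | 2888 | 786 | 175 | 4.22 | 30.08 | 34.72
43 | journal | 1.322 | Q1 | 32 | 29 | 74 | 3237 | 318 | 71 | 3.96 | 111.62 | 46.15
44 | journal | 1.315 | Q1 | 72 | 83 | 119 | 5223 | 368 | 98 | 2.94 | 62.93 | 62.43
45 | journal | 1.274 | Q1 | 50 | 140 | 135 | 7130 | 474 | 131 | 3.04 | 50.93 | 56.20
46 | journal | 1.261 | Q1 | 84 | 43 | 77 | 2655 | 320 | 72 | 4.10 | 61.74 | 55.81
47 | journal | 1.246 | Q1 | 63 | 39 | 111 | 1926 | 426 | 109 | 4.25 | 49.38 | 53.75
48 | journal | 1.245 | Q1 | 31 | 71 | 96 | 3320 | 453 | 94 | 4.59 | 46.76 | 46.76
49 | journal | 1.224 | Q1 | 63 | 44 | 91 | 4408 | 153 | 91 | 1.52 | 100.18 | 44.16
50 | journal | 1.224 | Q1 | 9 | 44 | 23 | 2985 | 74 | 23 | 3.21 | 67.84 | 46.28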

Scimago Lab, Copyright 2007-2024. Data Source: Scopus®

Ali H. Al-Hoorie

د. علي الحوري

There is increasing discontent with the predominant business model publishers use: The author submits the paper for free, the reviewer works for free, the editor typically works for free, but then the publisher acquires the copyright and keeps the work behind expensive paywalls. This business model is outdated. With the technology now available, most of what publishers were once needed for no longer justifies this arrangement. The current publishing system is a rip-off and is bad for science.

So how can you support open access? One way is to be involved with open access journals, whether by submitting your work to them, reviewing for them, or joining their editorial boards. Another way is to read research published in these journals and cite it in your research (if it is genuinely relevant). Each citation to research a journal published in the previous two years raises that journal's impact factor.
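To make the point about citations concrete: the two-year journal impact factor for a given year is the number of citations received that year by items the journal published in the previous two years, divided by the number of citable items it published in those two years. Below is a minimal sketch of that arithmetic. The function name and all counts are hypothetical and not taken from any real journal.

```python
def two_year_impact_factor(citations: int, citable_items: int) -> float:
    """Two-year impact factor.

    citations: citations received this year by items the journal
        published in the previous two years.
    citable_items: citable items the journal published in those two years.
    """
    if citable_items <= 0:
        raise ValueError("the journal needs at least one citable item")
    return citations / citable_items


# Hypothetical journal: 40 citable items in the previous two years,
# cited 30 times this year.
before = two_year_impact_factor(30, 40)

# The same journal after one additional citation to a paper
# from that two-year window.
after = two_year_impact_factor(31, 40)

print(f"{before:.3f} -> {after:.3f}")  # prints "0.750 -> 0.775"
```

For a small journal the denominator is small, so a single relevant citation moves the figure noticeably, which is why citing genuinely relevant work in these journals helps them.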


List of Open Access Journals

The journals below are diamond (also called platinum) open access, which means that neither the author nor the reader pays any fees. Generally, as a quality assurance measure, the journals listed below are indexed in Web of Science (previously known as ISI), Scopus, and/or Directory of Open Access Journals. Journals indexed in Web of Science (esp. SSCI for applied linguistics; see below for explanation) are perceived to be of higher status and to have higher standards (i.e., harder to publish in). Web of Science has an additional category called Emerging Sources Citation Index, which lists journals provisionally included in Web of Science with no impact factor yet (possibly easier to publish in).

These journals are related to applied linguistics broadly defined (literature & translation journals are not listed). So you need to make sure that your paper fits the scope of the journal you plan to submit your paper to. These journals also publish in English, though they might additionally publish in other languages, so don't be surprised when you open the website of a journal and find it in a foreign language. Most of these journals have a button to switch to English. 

Finally, the information in this list can change, particularly the indexing. If you spot a problem with the information or would like to suggest adding a journal, please let me know. And as I mentioned above, the primary quality assurance measure I used to compile this list is indexing in Web of Science, Scopus, and/or Directory of Open Access Journals. (Don't think I have first-hand experience with every single one of these journals!) So, if you suspect that a journal is engaging in questionable practices, please let me know so I can remove it. Having said this, the risk that a diamond access journal is predatory is low (they aren't charging you any money to start with). The bigger risk is that their peer review may not be up to par, allowing shaky research to be published. This is where we as a community should step in and lend a hand by agreeing to review for these journals and by offering constructive feedback to manuscript submitters.


SSCI = Social Sciences Citation Index. Perceived to be the most prestigious. A service by WoS. 

ESCI = Emerging Sources Citation Index. Conditionally accepted, no impact factor yet.  A service by WoS.

Scopus = A service by Elsevier. Offers a different set of impact factors (e.g., SJR & CiteScore). 

Across the Disciplines  

Alsic. Apprentissage des langues et systèmes d’information et de communication  

Alsinatuna: Journal of Arabic Linguistics and Education  

Altre Modernita Scopus, ESCI


Asian Journal of English Language Teaching  

Asp. La revue du GERAS

Atlantis Scopus, ESCI

Aula Abierta Scopus, ESCI

Australasian Journal of Educational Technology Scopus, SSCI



BC TEAL Journal

Bellaterra Journal of Teaching & Learning Language & Literature Scopus 

Beyond Words

Boletin de Literatura Oral Scopus, ESCI

CALL-EJ Scopus 

Caracteres Scopus, ESCI

Chi'e: Journal of Japanese Learning and Teaching

Circulo de Linguistica Aplicada a la Comunicacion Scopus, SSCI

CLIL Journal of Innovation and Research in Plurilingual and Pluricultural Education

Communiquer Scopus, ESCI

Computational Linguistics Scopus, SSCI

Comunicar Scopus, SSCI


Culture, Language and Representation Scopus, ESCI

Current Issues in Philology and Pedagogical Linguistics

DELTA: Documentação de estudos em lingüística teórica e aplicada Scopus

Dialectologia Scopus, ESCI

Dialogic Pedagogy Scopus, ESCI

Dialogues: An Interdisciplinary Journal of English Language Teaching and Research

Dinamika Ilmu

Dutch Journal of Applied Linguistics Scopus, ESCI

Edukasi: Jurnal Pendidikan dan Pengajaran

E-JournALL: EuroAmerican Journal of Applied Linguistics and Languages

Electronic Journal of Foreign Language Teaching (e-FLT) Scopus

ELT Forum: Journal of English Language Teaching

ELT Worldwide: Journal of English Language Teaching

English Australia Journal Scopus

English Community Journal

English Language Teaching Educational Journal

ESP Today Scopus, ESCI

Estudios de Asia y África Scopus, ESCI

Estudios de Lingüística Inglesa Aplicada Scopus, ESCI

EuroAmerican Journal of Applied Linguistics and Languages  


Forma y Funcion Scopus, ESCI

GIST Education and Learning Research Journal

Global Business Languages

HOW Journal

IAFOR Journal of Education Scopus

Iberica  Scopus, SSCI

IC Revista Cientifica de Informacion y Comunicacion Scopus, ESCI

Indonesian EFL Journal: Journal of ELT, Linguistics, and Literature

Indonesian JELT

International Journal of Arabic Teaching and Learning

International Journal of Emerging Issues in Early Childhood Education

International Journal of Educational Technology in Higher Education Scopus, SSCI

International Journal of English Studies Scopus, ESCI

International Journal of English for Academic Purposes: Research and Practice  

International Journal of Foreign Language Teaching and Research

International Journal of Language Teaching and Education

International Journal of Social Sciences & Educational Studies

International Journal of TESOL Studies

International Journal on Language Research and Education Studies

Investigaciones Sobre Lectura

Iranian Journal of Language Teaching Research Scopus, ESCI 

Italian at School

Journal of English Language and Education

Jurnal Ilmiah Bahasa dan Sastra

Journal for Research on Adult Education

Journal for the Psychology of Language Learning

Journal of Child Language Acquisition and Development  

Journal of Classics Teaching

Journal of Communication Pedagogy

Journal of Education for Multilingualism

Journal of Education, Language, and Ideology

Journal of ELT Research

Journal of English Educators Society

Journal of English Language Teaching in Indonesia

Journal of Foreign Language Education and Technology

Journal of Human Services: Training, Research, & Practice

Journal of Language and Education Scopus, ESCI

Journal of Language and Law Scopus, ESCI

Journal of Language and Literature Education

Journal of Language Contact Scopus, ESCI

Journal of Language Education and Research

Journal of Linguistics Research

Journal of Media Literacy Education

Journal of Mother Tongue Education

Journal of Research in Applied Linguistics Scopus, ESCI

Journal of Research in Language & Translation  

Journal of Saudi Electronic University on Digitalization  

Journal of Special Education and Rehabilitation

Journal of Teaching and Learning ESCI

Journal of the European Second Language Association

Journal of Umm Al-Qura University for Language Sciences and Literature

Journal of Writing Research Scopus, ESCI

Jurnal Bahasa Lingua Scientia

Jurnal Pendidikan Bahasa dan Sastra Indonesia Metalingua

Jurnal Pendidikan dan Kebudayaan


Konin Language Studies

Laboratory Phonology Scopus, SSCI

Language and Linguistics Scopus, SSCI

Language & Literacy

Language Development Research

Language Documentation and Conservation Scopus, ESCI

Language Learning & Technology  Scopus, SSCI

Language Teaching and Educational Research

Language Value  Scopus 

Latin American Journal of Content & Language Integrated Learning

Lenguas Modernas Scopus

Lexikos Scopus, SSCI

Lisanuna: Jurnal Ilmu Bahasa Arab dan Pembelajarannya

LITERA, the International Journal of Linguistics, Literature, and their Teaching

Literator Scopus, ESCI

LLT Journal: A Journal on Language and Language Teaching

Metathesis: Journal of English Language, Literature, and Teaching

MEXTESOL Journal Scopus

Mimbar Pendidikan: Jurnal Indonesia untuk Kajian Pendidikan

Muróbbî: Jurnal Ilmu Pendidikan

New Zealand Studies in Applied Linguistics


Onomazein Scopus, SSCI

Papers in Language Testing and Assessment ESCI

Parole: Journal of Linguistics and Education

PASAA Scopus

Pixel-Bit, Revista de Medios y Educacion Scopus, ESCI

Polish Journal of English Studies  

Polyglot: Jurnal Ilmiah

Porta Linguarum Scopus, SSCI

Problems of Onomastics Scopus, ESCI

Procesamiento del Lenguaje Natural Scopus, ESCI

Professional Discourse & Communication

Profile: Issues in Teachers' Professional Development Scopus 

Psiholingvistika Scopus

Psychology of Language and Communication Scopus 

Quaderns de Filologia: Estudis Literaris Scopus, ESCI

RATE Issues

Reading & Writing Scopus, ESCI

Reading in a Foreign Language ESCI

Recherches en didactique des langues et des cultures

Recherche et pratiques pédagogiques en langues de spécialité. Cahiers de l'APLIUT

Res Rhetorica Scopus, ESCI

Research in Corpus Linguistics

Research Papers in Language Teaching and Learning

Researching and Teaching Languages for Specific Purposes Scopus, ESCI

Revista Argentina de Ciencias del Comportamiento Scopus, ESCI

Revista de Lingüística y Lenguas Aplicadas Scopus, ESCI

Revista EntreLínguas

Revista Espaço

Revista Nebrija de Linguistica Aplicada a la Enseñanza de Lenguas

Revista Signos Scopus, SSCI

RLA, Revista de linguistica teorica y aplicada Scopus, SSCI

Saudi Journal of Language Studies  


Scientific Journal of King Faisal University: Humanities and Management Sciences

Scripta Manent

SED - Journal of Art Education

Sign Systems Studies Scopus; Arts & Humanities Citation Index (a service by WoS)

Sintagma Scopus, SSCI

SiSAL Journal Scopus 

Special Education and Rehabilitation Scopus, SSCI

Stellenbosch Papers in Linguistics Plus Scopus 

Studies in Applied Linguistics & TESOL

Studies in Language Assessment ESCI

Studies in Pragmatics and Discourse Analysis

Studies in Second Language Learning and Teaching Scopus, SSCI

Sustainable Multilingualism

Tabuk University Journal of Human and Social Sciences

Teaching American Literature: A Journal of Theory and Practice

Teaching and Learning in Communication Sciences & Disorders Scopus, ESCI

TESL Canada Journal ESCI

TESL-EJ Scopus

TESOLANZ Journal  

Texto Livre: Linguagem e Tecnologia Scopus, ESCI

The Canadian Journal of Applied Linguistics ESCI

The Journal of Asia TEFL Scopus, ESCI

The Journal of Teaching English with Technology Scopus 

The Journal of the National Council of Less Commonly Taught Languages

The Literacy Trek

Theory and Practice of Second Language Acquisition Scopus

Tomsk State University Journal of Philology Scopus, ESCI

Training, Language and Culture

Vocabulary Learning and Instruction: A Journal of Vocabulary Research

Westcliff International Journal of Applied Research

Word and Text - A Journal of Literary Studies and Linguistics Scopus, ESCI

Zeitschrift für Sprachwissenschaft Scopus, SSCI

In 2017, I published a paper with Joe Vitta listing applied linguistics journals (only 44 in total). Our inclusion criterion in that paper was indexing in both Scopus and SSCI (not open access status as in the list above). That paper received thousands of reads and a great deal of attention, which suggested to me that there is a need for a list of journals for the field. Some trivia: That paper also listed the impact factors of the journals. At that time, the mean JCR impact factor (WoS) for applied linguistics journals was 1.554 with a standard deviation of 0.76, while the mean SJR impact factor (Scopus) was 0.738 with a standard deviation of 0.65. Check out the full paper here.

As the indexing status can change over time, it is wise to double-check it before you decide to submit: 

For WoS, copy and paste the journal title here.

For Scopus, copy and paste the journal title here.

And if you do spot something that needs updating, please drop me a line.


Open Access Book Publishers

Unfortunately, there aren't many diamond open access book publishers (those that charge neither the author nor the reader). If you come across any, please suggest them to me. Here is what is currently available:

Applied Linguistics Press

Language Science Press

Open Book Publishers  

University of Michigan Library: Research Guides

Linguistics Resources

Indexes of Scholarly Articles on Linguistics

  • Linguistics & Language Behavior Abstracts: Covers linguistics and related disciplines such as language acquisition, bilingualism, and artificial intelligence. Time-span: 1973 - . Limited to authorized UM users.
  • Modern Language Association International Bibliography (MLAIB): Index to international critical journal articles, books and dissertations on literature, languages, linguistics and folklore. Time-span: 1925 - . Limited to authorized UM users. Creator: Modern Language Association (MLA). Search Hints: 'Phrase' searches will find records with the words not necessarily next to each other. For example, a search on 'sunny day' might find 'the day was sunny'.
  • PsycInfo (APA): Premier resource for surveying the literature of psychology and adjunct fields. Covers 1887-present. Produced by the APA.
  • Bibliography of Linguistic Literature (BLLDB): Index to international articles, books, conferences, and dissertations in the fields of applied and theoretical linguistics. Approximately 10,000 citations added annually. Time-span: 1971-present.
  • ProQuest Literature & Language: Combined search of 15 indexes that cover literature and linguistics.
  • Francis (International Humanities and Social Sciences): Provides indexing and abstracts of books and articles from over 4,300 European-language journals in the humanities and social sciences, especially religion, history of art, literature, philosophy, and economics. Time-span: 1972 - . http://www.lib.umich.edu/database/link/33015
  • Anthropology Plus: A comprehensive, international index to the literature of anthropology and archaeology, Anthropology Plus combines two major indexes: Harvard University's Anthropological Literature database and the United Kingdom's Anthropological Index. The database covers all journal articles, reports, commentaries, obituaries, etc. at least two pages long in over 2,500 journals and monographic series. Covers the fields of social, cultural, physical, biological, and linguistic anthropology, ethnology, archa
  • Language in Australia and New Zealand: A bibliography and research database for languages spoken in Australia and New Zealand. Covers indigenous, European, Asian and contact languages. Includes dictionaries, grammars and glossaries as well as journal articles. Materials covered are from 1788-present.
  • Bibliographie der deutschen Sprach- und Literaturwissenschaft (BDSL): Online version (1985-) of the printed bibliography (1957-). BDSL covers all aspects of secondary literature in the areas of German language, linguistics, and literary studies. Contains 499,000 bibliographic records through 2020. Limited to UM users.

Additional Resources

  • Web of Science: Combines three citation indexes (Arts & Humanities, Science, and Social Sciences), which permit searching for articles that cite a known author or work, as well as searching by subject, author, journal, and author address. Covers over 12,000 journals, as well as scholarly books and conference proceedings.
  • Communication & Mass Media Complete: Covers over 500 journals in the fields of mass media and communications.
  • ProQuest: Combined search of 79 ProQuest databases including: ABI/INFORM Dateline, ABI/INFORM Global, ABI/INFORM Trade & Industry, Alt-Press Watch, Ethnic NewsWatch, GenderWatch, OxResearch, ProQuest Dissertations and Theses, ProQuest Research Library, and the Snapshots International Series.
  • Google Scholar @ U-M: Searches for scholarly documents on the web, with the added feature of "Availability at UM" links.

Historical Journal Indexes and Collections

  • JSTOR: Full-text access to scholarly journals in the arts, humanities, sciences and social sciences. Most journals have the first volume onward, but do not include current issues.
  • Periodical Index Online: Index to the contents of more than 2,000 humanities and social science journals, from their first issues through 1995/96. Includes many 19th and early-to-mid 20th century titles not indexed electronically anywhere else.
  • Periodical Archive Online: An archive of journals published in the arts, humanities and social sciences selected from Periodicals Index Online.
Research Trends in Applied Linguistics from 2005 to 2016: A Bibliometric Analysis and Its Implications

Lei Lei, Dilin Liu, Research Trends in Applied Linguistics from 2005 to 2016: A Bibliometric Analysis and Its Implications, Applied Linguistics, Volume 40, Issue 3, June 2019, Pages 540–561, https://doi.org/10.1093/applin/amy003

Using data of articles from 42 Social Science Citation Index (SSCI)-indexed journals of applied linguistics, this study renders a bibliometric analysis of the 2005–16 research trends in the field. The analysis focuses on, among other issues, the most frequently discussed topics, the most highly cited publications, and the changes that have occurred in the research trends. The results show that while most of the frequently discussed topics have remained popular over the 12 years, some (especially sociocultural/functional/identity issues) have experienced a significant increase of interest, but some others (particularly certain phonological/grammatical/generative linguistic topics) have witnessed a substantial decrease of interest. There has also been an increased use of new theories including those coming from distant disciplines. Furthermore, while the number of publications from traditional publication powerhouses, such as the USA, has shown a slow, albeit steady decline proportionally, those from some other countries, such as China, have exhibited a substantial steady rise. The latter countries’ increasing publication rates appear to have contributed to the increased discussion of issues specific to their context. Implications of the findings are also discussed.




  • Online ISSN 1477-450X
  • Print ISSN 0142-6001
  • Copyright © 2024 Oxford University Press

A bibliometric study of the applied linguistics research output of Saudi institutions in the Web of Science for the decade 2011-2020

The Electronic Library

ISSN : 0264-0473

Article publication date: 29 October 2021

Issue publication date: 29 November 2021

This paper aims to analyse the research contributions of the Kingdom of Saudi Arabia in the field of applied linguistics (AL) indexed in the Web of Science core collection for the period between 2011 and 2020.


The author searched key terms in the Social Science Citation Index and Science Citation Index Expanded categories that publish documents in AL. The author compiled the data, classified the documents according to their research focus and examined several metrics, including keyword analysis, citation analysis, overseas collaboration and productivity by author, institution and source, using VOSviewer and Excel.

The results show that publications from Saudi Arabia increased roughly threefold in 2016–2020 compared with earlier years. Unexpectedly, highly cited researchers, sources and institutions for the social science and arts and humanities disciplines were higher than the scientific disciplines that investigated linguistic issues such as neurology, audiology and computer science. The area of language teaching and learning was the most researched area in which the highly cited author, journals and keywords analysis metrics occurred within its scope. The highly cited articles were those that collaborated with the world contributing authors and acted as corresponding authors.


The study contributes to the body of literature of AL which shares other categories that investigated language as a central issue. The study provides a fine-grained picture about the research productivity of AL in scientific and social science categories in Saudi Arabia.

  • Bibliometric analysis
  • Bibliometrics
  • Applied linguistics
  • Saudi Arabia
  • Web of Science
  • Co-citation analysis
  • Co-occurrence analysis
  • Collaboration


The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author thanks the Deanship of Scientific Research at Najran University for funding this study through a grant research code (NU/-/SEHRC/10/965).

Mohsen, M.A. (2021), "A bibliometric study of the applied linguistics research output of Saudi institutions in the Web of Science for the decade 2011-2020", The Electronic Library , Vol. 39 No. 6, pp. 865-884. https://doi.org/10.1108/EL-06-2021-0121

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited

Open Linguistics

  • Type: Journal
  • Language: English
  • Publisher: De Gruyter Open Access
  • First published: September 1, 2014
  • Publication Frequency: 1 Issue per Year
  • Audience: Researchers in the field of linguistics

  • Open access
  • Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

  • Sebastian Farquhar (ORCID: orcid.org/0000-0002-9185-6415),
  • Jannik Kossen,
  • Lorenz Kuhn &
  • Yarin Gal (ORCID: orcid.org/0000-0002-2733-2078)

Nature volume 630, pages 625–630 (2024)


  • Computer science
  • Information technology

Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2, because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRASG12 ‘C’ (correct) and other times KRASG12 ‘D’ (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4. The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21. Although we agree that metaphor must be used carefully with LLMs 22, the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1.
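The contrast between token-level and meaning-level entropy can be written compactly. The notation below is chosen here for illustration and is an assumption, not taken from the text above: x is the input question, s a sampled answer sequence, c a cluster of sequences that share one meaning, and C the set of such clusters.

```latex
\[
p(c \mid x) = \sum_{s \in c} p(s \mid x),
\qquad
\mathrm{SE}(x) = -\sum_{c \in C} p(c \mid x)\,\log p(c \mid x).
\]
```

Under this reading, many differently worded answers with the same meaning all add their probability to a single cluster, so they no longer inflate the entropy.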

Fig. 1

a, Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b, Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.
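The ‘Paris’ example from panel a can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' implementation: the meaning labels in `meaning_ids` are assigned by hand here (an assumption), whereas the paper derives them from entailment-based clustering, and each sample is simply given equal weight.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


answers = ["Paris", "It's Paris", "France's capital Paris"]
# Hypothetical hand-assigned meaning labels: all three answers mean the same thing.
meaning_ids = [0, 0, 0]

naive = entropy(answers)         # every distinct string counts as its own outcome
semantic = entropy(meaning_ids)  # all samples fall into one meaning cluster

print(f"naive entropy:    {naive:.3f}")
print(f"semantic entropy: {semantic:.3f}")
```

The string-level entropy is log 3 because the three wordings differ, while the meaning-level entropy is zero because only one meaning is present.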

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .
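The clustering step described above can be sketched as a greedy loop: each new answer joins the first cluster whose representative it entails and is entailed by, otherwise it starts a new cluster. The sketch below is illustrative only. The `entails` argument is a hypothetical callable standing in for an NLI model or LLM judgement, and `toy_entails` is a crude word-overlap stand-in invented for this example, not a real entailment tool.

```python
from typing import Callable, List, Set

# Hypothetical stopword list used only by the toy entailment check below.
STOPWORDS = {"it", "it's", "is", "the"}


def content_words(text: str) -> Set[str]:
    """Lower-cased words of the text with punctuation and stopwords removed."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model: true if the premise contains every content word of the hypothesis."""
    return content_words(hypothesis) <= content_words(premise)


def cluster_by_bidirectional_entailment(
    answers: List[str],
    entails: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group answers so that members of a cluster entail the cluster's first answer in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched in both directions: start a new one.
            clusters.append([answer])
    return clusters


clusters = cluster_by_bidirectional_entailment(
    ["Paris", "It's Paris", "Rome", "It is Rome"], toy_entails
)
print(clusters)
```

Comparing against a single representative per cluster keeps the number of entailment calls low, at the cost of relying on entailment being roughly transitive within a cluster.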

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32), general knowledge (SQuAD 1.1; ref. 33), life sciences (BioASQ 34) and open-domain natural questions (NQ-Open 35) derived from actual queries to Google Search 36. In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38, Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40. In the Supplementary Information, we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1). At the time of writing, GPT-4 (ref. 1) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.
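One way to read the discrete approximation is as replacing the model's probabilities with empirical frequencies of the meaning clusters. The symbols below are assumptions introduced for illustration: N is the number of sampled generations, n_c the number of those samples that fall in meaning cluster c, and C the set of clusters.

```latex
\[
\hat{p}(c \mid x) = \frac{n_c}{N},
\qquad
\widehat{\mathrm{SE}}_{\mathrm{discrete}}(x) = -\sum_{c \in C} \hat{p}(c \mid x)\,\log \hat{p}(c \mid x).
\]
```

Only the sampled text and its cluster assignments are needed, which is why this variant can be applied to a model that returns no token probabilities.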

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P(True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.
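The two metrics can be illustrated on a toy set of five questions. This is a sketch under stated assumptions, not the paper's evaluation code: AUROC is computed by direct pairwise comparison, and the area under the rejection accuracy curve is approximated by averaging the accuracy over every possible number of answered questions, whereas the thresholds actually used by the authors are not given in this text. The scores and labels are invented for the example.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], is_incorrect: Sequence[bool]) -> float:
    """Probability that a random incorrect answer has a higher uncertainty score than a random correct one."""
    pos = [s for s, bad in zip(scores, is_incorrect) if bad]
    neg = [s for s, bad in zip(scores, is_incorrect) if not bad]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rejection_accuracy_curve(
    scores: Sequence[float], is_incorrect: Sequence[bool]
) -> List[float]:
    """Entry k-1 is the accuracy when only the k least uncertain questions are answered."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    curve = []
    correct_so_far = 0
    for k, idx in enumerate(order, start=1):
        correct_so_far += 0 if is_incorrect[idx] else 1
        curve.append(correct_so_far / k)
    return curve


def aurac(scores: Sequence[float], is_incorrect: Sequence[bool]) -> float:
    """Mean of the rejection accuracy curve (an assumed, uniform choice of thresholds)."""
    curve = rejection_accuracy_curve(scores, is_incorrect)
    return sum(curve) / len(curve)


# Invented uncertainty scores and correctness labels for five questions.
scores = [0.1, 0.2, 0.9, 0.8, 0.3]
is_incorrect = [False, False, True, True, False]

print(auroc(scores, is_incorrect))
print(rejection_accuracy_curve(scores, is_incorrect))
print(aurac(scores, is_incorrect))
```

In this toy case the score separates correct from incorrect answers perfectly, so refusing the two most uncertain questions raises accuracy from 0.6 to 1.0.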

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Fig. 2

Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y-axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y-axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information.

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.
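The difference between the two estimators can be sketched from per-token log-probabilities. The code below is an illustrative reading of the description above, not the authors' implementation: the naive estimate is taken to be minus the mean length-normalized log-probability of the samples, and the semantic estimate sums sequence probabilities within each meaning cluster and renormalizes across clusters (the renormalization is an assumption). The token log-probabilities and cluster labels are invented.

```python
import math
from typing import Dict, List, Sequence


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Mean token log-probability, so that long answers are not penalized for length."""
    return sum(token_logprobs) / len(token_logprobs)


def naive_entropy(seq_logprobs: Sequence[float]) -> float:
    """Monte Carlo entropy estimate over sampled sequences, ignoring meaning."""
    return -sum(seq_logprobs) / len(seq_logprobs)


def semantic_entropy(seq_logprobs: Sequence[float], cluster_ids: Sequence[int]) -> float:
    """Entropy over meaning clusters, pooling the probability of sequences that share a cluster."""
    mass: Dict[int, float] = {}
    for logprob, cluster in zip(seq_logprobs, cluster_ids):
        mass[cluster] = mass.get(cluster, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())


# Invented per-token log-probabilities for three sampled answers.
samples: List[List[float]] = [[-0.2, -0.4], [-0.1, -0.3, -0.5], [-1.0]]
seq_logprobs = [length_normalized_logprob(tokens) for tokens in samples]

naive = naive_entropy(seq_logprobs)
same_meaning = semantic_entropy(seq_logprobs, [0, 0, 0])  # three phrasings, one meaning
two_meanings = semantic_entropy(seq_logprobs, [0, 0, 1])  # third answer means something else

print(naive, same_meaning, two_meanings)
```

When all three samples share one meaning, the pooled estimate is zero even though the naive estimate stays positive, which is the effect the paragraph above describes.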

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution, which mirrors the common real-world case in which there is a distribution shift between training and deployment 42. The plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P(True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P(True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P(True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2. Further results for short-phrase generations are described in Supplementary Figs. 7–10.

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .
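The per-claim procedure described above can be sketched as follows. Everything model-related is a hypothetical stand-in: `make_questions`, `sample_answer` and `assign_clusters` are placeholder callables for the question-generating LLM, the answering LLM and the entailment-based clustering, and the exact-match clustering and stub answers exist only to make the example runnable. The default of three sampled answers follows the text.

```python
import itertools
import math
from collections import Counter
from typing import Callable, Dict, List


def discrete_semantic_entropy(cluster_ids: List[int]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


def claim_uncertainty(
    claim: str,
    make_questions: Callable[[str], List[str]],
    sample_answer: Callable[[str], str],
    assign_clusters: Callable[[List[str]], List[int]],
    n_samples: int = 3,
) -> float:
    """Average semantic entropy over the questions that the claim might have answered."""
    questions = make_questions(claim)
    if not questions:
        raise ValueError("at least one question is needed per claim")
    entropies = []
    for question in questions:
        # New answers plus the original claim, as described in the text.
        answers = [sample_answer(question) for _ in range(n_samples)] + [claim]
        entropies.append(discrete_semantic_entropy(assign_clusters(answers)))
    return sum(entropies) / len(entropies)


def exact_match_clusters(answers: List[str]) -> List[int]:
    """Toy clustering: answers share a cluster only if they match ignoring case."""
    ids: Dict[str, int] = {}
    return [ids.setdefault(a.lower(), len(ids)) for a in answers]


def stub_questions(claim: str) -> List[str]:
    """Hypothetical question generator returning two fixed questions."""
    return ["When was she born?", "In which year was she born?"]


claim = "She was born in 1950"
stable = claim_uncertainty(
    claim, stub_questions, lambda q: "she was born in 1950", exact_match_clusters
)
counter = itertools.count()
unstable = claim_uncertainty(
    claim,
    stub_questions,
    lambda q: f"She was born in {1900 + next(counter)}",
    exact_match_clusters,
)
print(stable, unstable)
```

A claim whose regenerated answers keep agreeing with it scores zero, while a claim whose answers change on every sample scores high and would be flagged as a likely confabulation.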

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than those of either a simple ‘self-check’ baseline (which just asks the LLM whether the factoid is likely to be true) or a variant of P(True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected, at which point P(True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

Fig. 3

The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y-axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P(True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25. The predictive entropy (PE) for an input sentence x is the conditional entropy (H) of the output random variable Y with realization y given x,

\(PE({\boldsymbol{x}})=H(Y| {\boldsymbol{x}})=-{\sum }_{y}P(y| {\boldsymbol{x}})\log P(y| {\boldsymbol{x}}).\) (1)

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.
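As a small illustration of this contrast, the sketch below computes the entropy of two toy output distributions. The function name `predictive_entropy` and the example probabilities are our own assumptions for illustration and are not taken from the released code.

```python
import math

def predictive_entropy(probs):
    """Entropy (in nats) of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical output distributions over four possible outputs.
concentrated = [0.97, 0.01, 0.01, 0.01]  # one output dominates
spread = [0.25, 0.25, 0.25, 0.25]        # all outputs equally likely

print(predictive_entropy(concentrated))  # low entropy
print(predictive_entropy(spread))        # high entropy, equal to log(4)
```

The uniform case attains the maximum possible entropy for four outcomes, which is why it serves as a convenient reference point.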

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s, conditioned on the context, x, is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}| {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\), where \({s}_{i}\) is the i-th output token and \({{\bf{s}}}_{ < i}\) denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i}^{N}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
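The effect of length normalization can be seen in a minimal sketch. The helper names and the per-token probabilities below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def sequence_log_prob(token_log_probs):
    """Joint log-probability: the sum of the per-token conditional log-probabilities."""
    return sum(token_log_probs)

def length_normalized_log_prob(token_log_probs):
    """Arithmetic mean of the per-token conditional log-probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return sum(token_log_probs) / len(token_log_probs)

# Hypothetical answers in which every token has conditional probability 0.5.
short_answer = [math.log(0.5)] * 3
long_answer = [math.log(0.5)] * 12

# The joint log-probability falls linearly with the number of tokens ...
print(sequence_log_prob(short_answer), sequence_log_prob(long_answer))
# ... whereas the length-normalized value is the same for both answers.
print(length_normalized_log_prob(short_answer), length_normalized_log_prob(long_answer))
```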

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways for phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps:

Generation: sample output sequences of tokens from the predictive distribution of an LLM given a context x.

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing probabilities of sequences that share a meaning following equation ( 2 ) and compute their entropy.

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, \(\{{{\bf{s}}}^{(1)},\ldots ,{{\bf{s}}}^{(M)}\}\), and record their token probabilities, \(\{P({{\bf{s}}}^{(1)}| {\boldsymbol{x}}),\ldots ,P({{\bf{s}}}^{(M)}| {\boldsymbol{x}})\}\). We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling (P = 0.9) (ref. 49) and top-K sampling (K = 50) (ref. 50). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens.)
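To make the role of the sampling parameters concrete, the following sketch applies temperature scaling, top-K filtering and nucleus filtering to a toy next-token distribution. It is an illustration only: the function name, the toy logits and the order in which the two filters are applied (top-K first, then nucleus on the renormalized distribution) are our assumptions, and generation libraries may differ in such details.

```python
import math

def filtered_token_distribution(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-K and nucleus filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-K: keep the K most likely tokens and renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    top = [(i, probs[i] / mass) for i in ranked]

    # Nucleus: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in top:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {index: p / mass for index, p in kept}

toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical next-token logits
diverse = filtered_token_distribution(toy_logits, temperature=1.0)
greedy_like = filtered_token_distribution(toy_logits, temperature=0.1)
print(diverse)      # several candidate tokens survive at temperature 1
print(greedy_like)  # at temperature 0.1 the most likely token dominates
```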

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’ which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine and one of the tokens can be a ‘padding’ token which occurs with certainty for each token after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts many considerations can be set aside. Care should be taken comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been studied extensively in natural language inference (NLI) 55. We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56, which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
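The clustering procedure described in this paragraph can be sketched in a few lines. This is not the algorithm of Extended Data Fig. 1 reproduced verbatim: `cluster_by_meaning` is our own naming, and `toy_entails` is a hypothetical lookup-table stand-in for a real entailment model.

```python
def cluster_by_meaning(answers, entails):
    """Greedy clustering by bidirectional entailment.

    `entails(premise, hypothesis)` is any predicate returning True when the
    premise entails the hypothesis (in practice, an NLI model or a prompted LLM).
    """
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]  # comparing with the first member suffices
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters

# Hypothetical stand-in for an entailment model: a table of hand-assigned meanings.
TOY_MEANINGS = {
    "Paris": "paris",
    "It's Paris": "paris",
    "Rome": "rome",
    "The capital of France is Paris": "paris",
}

def toy_entails(premise, hypothesis):
    return TOY_MEANINGS[premise] == TOY_MEANINGS[hypothesis]

answers = ["Paris", "It's Paris", "Rome", "The capital of France is Paris"]
clusters = cluster_by_meaning(answers, toy_entails)
print(clusters)
```

Comparing each new sequence only with the first member of each cluster keeps the number of entailment calls low, at the cost of relying on the transitivity assumption stated above.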

Computing the semantic entropy

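Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express the same meaning, as

\(P(c| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}{\prod }_{i}P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}}).\) (2)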

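Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C, a sub-σ-algebra of the standard event-space S. We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution,

\(SE({\boldsymbol{x}})=-{\sum }_{c}P(c| {\boldsymbol{x}})\log P(c| {\boldsymbol{x}})\) (3)

\(=-{\sum }_{c}\left(\left({\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})\right)\log \left[{\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})\right]\right).\) (4)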

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c. Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation (3) using a Rao–Blackwellized Monte Carlo integration over the semantic equivalence classes C,

\(SE({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}| {\boldsymbol{x}})\log P({C}_{i}| {\boldsymbol{x}}),\) (5)

where \(P({C}_{i}| {\boldsymbol{x}})=\frac{P({c}_{i}| {\boldsymbol{x}})}{{\sum }_{c}P(c| {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, \({\sum }_{i}P({C}_{i}| {\boldsymbol{x}})=1\). Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation (5) is the estimator giving our main method that we refer to as semantic entropy throughout the text.

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates \(P({C}_{i}| {\boldsymbol{x}})\) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate \(P({C}_{i}| {\boldsymbol{x}})\) as \({\sum }_{j=1}^{M}\frac{{I}_{{c}_{j}={C}_{i}}}{M}\), the proportion of all the sampled answers which belong to that cluster, where \({c}_{j}\) denotes the cluster of the j-th sampled answer. Effectively, this just assumes that each output that was actually generated was equally probable, estimating the underlying distribution as the categorical empirical distribution. In the limit of large M the estimator converges to equation (5) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
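Both estimators can be sketched compactly once each sampled answer has been assigned a cluster. The code below is a minimal illustration under stated assumptions: the function names are ours, cluster probabilities are obtained by summing the (possibly length-normalized) probabilities of the sampled sequences in each cluster and then normalizing, and whether identical samples should be deduplicated first is a design choice this sketch does not settle.

```python
import math

def semantic_entropy(cluster_ids, sequence_log_probs):
    """Entropy over normalized cluster probabilities built from sequence probabilities."""
    mass = {}
    for cluster, log_prob in zip(cluster_ids, sequence_log_probs):
        mass[cluster] = mass.get(cluster, 0.0) + math.exp(log_prob)
    total = sum(mass.values())
    probs = [m / total for m in mass.values()]  # normalize so the probabilities sum to 1
    return -sum(p * math.log(p) for p in probs if p > 0)

def discrete_semantic_entropy(cluster_ids):
    """Entropy over the empirical cluster frequencies, ignoring token probabilities."""
    counts = {}
    for cluster in cluster_ids:
        counts[cluster] = counts.get(cluster, 0) + 1
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical example: four sampled answers, three of which share a meaning.
cluster_ids = [0, 0, 0, 1]
log_probs = [math.log(0.4), math.log(0.3), math.log(0.2), math.log(0.1)]

print(semantic_entropy(cluster_ids, log_probs))  # cluster probabilities 0.9 and 0.1
print(discrete_semantic_entropy(cluster_ids))    # cluster frequencies 0.75 and 0.25
```

When every sampled sequence has the same probability, the two functions return the same value, which matches the interpretation of the discrete variant given above.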

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

We use the datasets BioASQ 34, SQuAD 33, TriviaQA 32, SVAMP 37 and NQ-Open 35. BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name. The specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1, which excludes the unanswerable questions added in v.2.0; those questions are deliberately constructed to induce mistakes, so in practice they do not cause confabulations to occur. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training; for example, semantic entropy does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy rate is too high for these datasets for the latest generations of models to meaningfully study confabulations.

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct. This requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units and train a logistic regression classifier to make the same prediction. By contrast to their method, this is much simpler because it does not require fine-tuning the entire language model, as well as being more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions is different from that on which the classifier is trained.

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and then is prompted with the list of all answers generated followed by the highest probability answer and a question whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen, filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered as supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size and we note this in the  Supplementary Material .

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question} Here are two possible answers: Possible Answer 1: {text1} Possible Answer 2: {text2} Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.
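As a sketch of how such a prompt might be used in practice, the code below fills in the template and maps a free-text model response to one of the three labels. The line breaks in the template, the function names and the fallback to ‘neutral’ for unparseable responses are our assumptions; the call to the LLM itself is deliberately left out.

```python
ENTAILMENT_TEMPLATE = (
    "We are evaluating answers to the question {question}\n"
    "Here are two possible answers:\n"
    "Possible Answer 1: {text1}\n"
    "Possible Answer 2: {text2}\n"
    "Does Possible Answer 1 semantically entail Possible Answer 2? "
    "Respond with entailment, contradiction, or neutral."
)

def build_entailment_prompt(question, text1, text2):
    """Fill the prompt template with a question and two candidate answers."""
    return ENTAILMENT_TEMPLATE.format(question=question, text1=text1, text2=text2)

def parse_entailment_label(response):
    """Map a free-text response to 'entailment', 'contradiction' or 'neutral'."""
    lowered = response.lower()
    for label in ("entailment", "contradiction", "neutral"):
        if label in lowered:
            return label
    # Assumption: treat responses that name no label as neutral.
    return "neutral"

prompt = build_entailment_prompt(
    "What is the capital of France?", "Paris", "The capital of France is Paris"
)
print(prompt)
print(parse_entailment_label("Entailment."))
```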

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence. Question: {question} Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC, rejection accuracy and AURAC. Each of these is grounded in an automated factuality estimation measurement relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the area under the receiver operating characteristic curve (AUROC), which measures the reliability of a classifier accounting for both the true-positive rate (recall) and the false-positive rate. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.
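The probabilistic interpretation can be turned directly into a small computation. The sketch below counts, over all pairs of one correct and one incorrect answer, how often the correct answer received the higher confidence, counting ties as one half. The function name and the toy scores are hypothetical; in our setting a confidence score could be, for example, the negative of an uncertainty estimate.

```python
def auroc(confidences, is_correct):
    """Probability that a random correct answer outranks a random incorrect one."""
    positives = [c for c, ok in zip(confidences, is_correct) if ok]
    negatives = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical confidence scores and correctness labels.
print(auroc([0.9, 0.8, 0.4, 0.2], [True, True, False, False]))  # perfectly separated
print(auroc([0.9, 0.8, 0.3, 0.6], [True, False, True, False]))  # partially mixed
```

This quadratic-time pairwise count is meant for clarity; rank-based implementations give the same value more efficiently on large datasets.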

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic we compute the AURAC, the total area enclosed by the accuracies at all cut-off percentages X%. This should increase towards 1 as a given uncertainty method becomes more accurate and better at detecting likely-inaccurate responses, but it is more sensitive to the overall accuracy of the model than the AUROC metric.
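A minimal sketch of the rejection accuracy and its summary is given below. The function names are ours, and approximating the area by averaging the accuracies over an evenly spaced grid of retained fractions is an assumption made for illustration; the released code may use a different grid or integration rule.

```python
def rejection_accuracy(uncertainties, is_correct, keep_fraction):
    """Accuracy on the most-confident `keep_fraction` of inputs (lowest uncertainty first)."""
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    n_keep = max(1, round(keep_fraction * len(order)))
    kept = order[:n_keep]
    return sum(bool(is_correct[i]) for i in kept) / n_keep

def aurac(uncertainties, is_correct, n_points=10):
    """Approximate area under the rejection accuracy curve on an evenly spaced grid."""
    fractions = [(k + 1) / n_points for k in range(n_points)]
    accuracies = [rejection_accuracy(uncertainties, is_correct, f) for f in fractions]
    return sum(accuracies) / len(accuracies)

# Hypothetical example: the three most uncertain answers are the incorrect ones.
uncertainties = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
is_correct = [True] * 7 + [False] * 3

print(rejection_accuracy(uncertainties, is_correct, 1.0))  # accuracy with nothing rejected
print(rejection_accuracy(uncertainties, is_correct, 0.7))  # accuracy after rejecting 30%
print(aurac(uncertainties, is_correct))
```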

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.
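The scoring rule can be sketched as follows. The normalization steps (lower-casing, removing punctuation and English articles, collapsing whitespace) and the token-level F1 follow the commonly used SQuAD evaluation approach as we understand it; the function names and the handling of several reference answers by taking the maximum are our assumptions.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lower-case, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Token-level F1 between a normalized prediction and a normalized reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def is_accurate(prediction, references, threshold=0.5):
    """An answer counts as correct if its best F1 over the references exceeds the threshold."""
    return max(token_f1(prediction, ref) for ref in references) > threshold

print(token_f1("The Paris.", "paris"))
print(is_accurate("Paris France", ["Paris"]))
print(is_accurate("Lyon", ["Paris"]))
```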

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question}
The expected answer is: {reference answer}
The proposed answer is: {predicted answer}
Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset, FactualBio, available alongside this paper. FactualBio is a collection of biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage, generated by GPT-4 (v.0613). To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64. For each biography, we used GPT-4 to generate a list of the factual claims it contains, giving 150 factual claims in total (the total number is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect. Out of 150 claims, 45 were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps:

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far} You see the sentence: {proposition} Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far} The next sentence should be the answer to the following question: {question} Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions. We did not commonly observe such refusals in our experiments with LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’ then we treat the semantic uncertainty as maximal.

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.
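The per-claim score can be sketched as below, combining the discrete entropy per question, the refusal rule and the final average. Several details are assumptions made for illustration: the function names, the case-insensitive matching of refusal strings, the use of log(n) for n items as the ‘maximal’ value, and `toy_cluster_ids`, which stands in for entailment-based clustering by grouping identical strings.

```python
import math

REFUSAL_MARKERS = ("not available", "not provided", "unknown", "unclear")

def discrete_entropy(cluster_ids):
    """Entropy of the empirical distribution over cluster labels."""
    n = len(cluster_ids)
    counts = {}
    for cluster in cluster_ids:
        counts[cluster] = counts.get(cluster, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def question_entropy(claim, answers, cluster_fn):
    """Semantic entropy over regenerated answers plus the original claim."""
    items = list(answers) + [claim]
    refusals = sum(any(m in a.lower() for m in REFUSAL_MARKERS) for a in answers)
    if refusals > len(answers) / 2:
        # Assumption: 'maximal' means every item falls in its own cluster.
        return math.log(len(items))
    return discrete_entropy(cluster_fn(items))

def claim_score(claim, answers_per_question, cluster_fn):
    """Average the per-question entropies to score one factual claim."""
    entropies = [question_entropy(claim, a, cluster_fn) for a in answers_per_question]
    return sum(entropies) / len(entropies)

def toy_cluster_ids(texts):
    """Hypothetical clustering: identical strings (ignoring case) share a cluster."""
    ids = {}
    return [ids.setdefault(t.strip().lower(), len(ids)) for t in texts]

answers_per_question = [
    ["1912", "1912", "1912"],             # consistent answers
    ["1912", "1915", "1920"],             # scattered answers
    ["unknown", "not available", "1912"], # mostly refusals
]
score = claim_score("1912", answers_per_question, toy_cluster_ids)
print(score)
```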

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model as we find empirically that DeBERTa predictions result in higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up for which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other—a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.
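The difference between the strict check and this modified check can be sketched as follows. The function names are ours, and `toy_nli` is a hypothetical lookup table standing in for an entailment model that returns one of ‘entailment’, ‘neutral’ or ‘contradiction’.

```python
def strict_equivalent(a, b, nli):
    """True bidirectional entailment: each text must entail the other."""
    return nli(a, b) == "entailment" and nli(b, a) == "entailment"

def non_defeating_equivalent(a, b, nli):
    """At least one direction entails and neither direction contradicts."""
    forward, backward = nli(a, b), nli(b, a)
    has_entailment = "entailment" in (forward, backward)
    has_contradiction = "contradiction" in (forward, backward)
    return has_entailment and not has_contradiction

def toy_nli(premise, hypothesis):
    """Hypothetical NLI predictions for a few hand-picked pairs."""
    table = {
        ("born in 1912", "born in 1912 in Vienna"): "neutral",
        ("born in 1912 in Vienna", "born in 1912"): "entailment",
        ("born in 1912", "born in 1915"): "contradiction",
        ("born in 1915", "born in 1912"): "contradiction",
    }
    return table.get((premise, hypothesis), "neutral")

# One direction entails and neither contradicts: only the relaxed check accepts the pair.
print(strict_equivalent("born in 1912", "born in 1912 in Vienna", toy_nli))
print(non_defeating_equivalent("born in 1912", "born in 1912 in Vienna", toy_nli))
# Contradicting answers are rejected by both checks.
print(non_defeating_equivalent("born in 1912", "born in 1915", toy_nli))
```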

We implement two baselines. First, we implement a variant of the P (True) method, which is adapted to the new setting. For each factoid, we generate a question with answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question} Here are some brainstormed ideas: {list of regenerated answers} Possible answer: {original answer} Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which was ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
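This estimate amounts to a simple sample mean. In the sketch below, `ask_model` is a hypothetical zero-argument callable that returns one sampled text response to the prompt; the matching of responses that start with ‘yes’ is also our assumption.

```python
def estimate_p_yes(ask_model, n_samples=10):
    """Monte Carlo estimate of the probability that the model answers 'yes'."""
    responses = [ask_model() for _ in range(n_samples)]
    n_yes = sum(r.strip().lower().startswith("yes") for r in responses)
    return n_yes / n_samples

# Hypothetical stand-in for ten sampled model responses.
canned = iter(["Yes", "yes", "No", "Yes.", "yes", "no", "Yes", "yes", "No", "Yes"])
p_yes = estimate_p_yes(lambda: next(canned), n_samples=10)
print(p_yes)
```

With only ten samples the estimate can take just eleven distinct values, so scores from this baseline are coarse compared with a directly read token probability.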

As a second, simpler, baseline we check if the model thinks the answer is true. We simply ask:

Following this text: {text so far} You see this statement: {proposition} Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

It is interesting that this method ought to perform very well if we think that the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” 24 ) but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access it. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng. 146, 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307, e230163 (2023).

Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7, 225–241 (1998).

Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9, 962–977 (2021).

Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10, 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition 83–88 (IEEE, Catalogue no PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67, 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4, 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27, 986–1005 (1956).

Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit.

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10, 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics, 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015).

Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7, 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023).

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31, 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11, 78–88 (1968).

Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23, 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38, 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. Deberta: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Assoc. Comp. Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennington, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011).

He, R., Ravula, A., Kanagal, B. & Ainslie, J. Realformer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zhong, C. et al.) 929–943 (Assoc. Comp. Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd conference on Neural Information Processing (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J. jlko/semantic_uncertainty: Initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).


Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal


Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar.

Ethics declarations

Competing interests

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Algorithm outline for bidirectional entailment clustering.

Given a set of outputs in response to a context, the bidirectional entailment algorithm returns a set of sets of outputs that have been classified as sharing a meaning.
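The clustering idea in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: comparing each output only against the first member of each existing cluster is a design choice made here, and `toy_entails` is a hypothetical stand-in (normalised string equality) for an entailment judgement that would normally come from an NLI model or an LLM and would take the context into account.

```python
from typing import Callable, List


def cluster_by_bidirectional_entailment(
    outputs: List[str],
    entails: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group outputs so that members of a cluster entail each other.

    entails(a, b) should return True when a entails b. Each output is
    compared with the first member of every existing cluster (an
    assumption of this sketch) and joins the first cluster for which
    entailment holds in both directions.
    """
    clusters: List[List[str]] = []
    for text in outputs:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, text) and entails(text, representative):
                cluster.append(text)
                break
        else:
            # No existing cluster shares this meaning: start a new one.
            clusters.append([text])
    return clusters


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical entailment check: equality after trivial normalisation."""
    return a.strip(" .").lower() == b.strip(" .").lower()
```

For example, `cluster_by_bidirectional_entailment(["Paris", "paris.", "Rome"], toy_entails)` groups the first two outputs together and places "Rome" in its own cluster. In the worst case, when every output has a distinct meaning, the loop makes a number of entailment checks that is quadratic in the number of outputs.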

Supplementary information

Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes a worked example for semantic entropy calculation, discussion of limitations and computational cost of entailment clustering, ablation of entailment prediction and clustering methods, discussion of automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0

Received: 17 July 2023

Accepted: 12 April 2024

Published: 19 June 2024

Issue Date: 20 June 2024

DOI: https://doi.org/10.1038/s41586-024-07421-0


June 26, 2024

Assessing the place of citizen science in modern research

by Samuel Jarman, SciencePOD

In recent years, numerous fields of research have seen an explosion in the volume and complexity of their scientific data. To keep pace with these changes, EU-funded research projects are increasingly crowdsourcing their data through citizen science projects, which allow the public to engage directly with their research.

Through a detailed analysis published in The European Physical Journal Plus, Stephen Serjeant and colleagues at The Open University present new recommendations for how citizen science should be deployed to ensure the best possible outcome for research. The team's insights could help researchers to better understand the potential impacts of this new way of doing science.

Traditionally, most major EU-funded research projects have included efforts to communicate their work to the public. However, there has long been concern that these efforts don't provide any opportunities for the science-interested public to engage directly with research or contribute to it.

As research projects became larger and more complex, this picture started to change radically. Through citizen science projects, researchers are now crowdsourcing their data to public volunteers interested in their work, who are still far better suited for many classification tasks than machine learning algorithms. Today, the approach is applied across fields as diverse as genomics, social sciences, and astronomical imaging.

In their study, Serjeant's team summarize the use of citizen science in several projects funded by the EU's Horizon program, which collectively engaged hundreds of thousands of volunteers. Their analysis shows that these programs had a wide-ranging, diverse and deep scientific impact.

Altogether, the researchers present valuable recommendations for how citizen science should be deployed in future projects in physical science. They also clarify that if public engagement or outreach is the primary goal of a project, citizen science may not always be the best approach: instead, they suggest that other, more targeted approaches could be more effective.

Provided by SciencePOD



  1. (PDF) Applied Linguistics Research Journal

  2. Applied Linguistics Journal

  3. International Journal of Applied Linguistics

  4. (PDF) The Handbook of Applied Linguistics

  5. Journal Reviewer

  6. (PDF) Illuminating insights into subjectivity: Q as a methodology in


  1. Jali Research: E3 2019 Languages

  2. the difference between linguistics and applied linguistics

  3. Linguistics at A Glance

  4. Dr Magali Paquot 30 October 2023

  5. Applied Linguistics Research 2015

  6. Dr Isobelle Clarke 25 Oct 2023


  1. Journal of Research in Applied Linguistics

    Journal of Research in Applied Linguistics is now indexed in Web of Science Core Collection. 2018-05-11; Journal of Research in Applied Linguistics is now indexed in Scopus. 2017-09-25; Getting indexed in J-Gate and Linguistics Abstracts Online indexing databases 2017-06-14; Important Note for contributors to RALs 2016-10-13

  2. Applied Linguistics

    International Association for Applied Linguistics (AILA) AILA (originally founded in 1964 in France) is an international federation of national and regional associations of Applied Linguistics. Find out more. Publishes research into language with relevance to real-world problems. Connections are made between fields, theories, research methods ...

  3. Journal Rankings on Linguistics and Language

    International Scientific Journal & Country Ranking. SCImago Institutions Rankings SCImago Media Rankings SCImago Iber SCImago Research Centers Ranking SCImago Graphica Ediciones Profesionales de la Información

  4. Web of Science Master Journal List

    Browse, search, and explore journals indexed in the Web of Science. The Master Journal List is an invaluable tool to help you to find the right journal for your needs across multiple indices hosted on the Web of Science platform. Spanning all disciplines and regions, Web of Science Core Collection is at the heart of the Web of Science platform. Curated with care by an expert team of in-house ...

  5. Research Methods in Applied Linguistics

    Research Methods in Applied Linguistics is the first and only journal devoted exclusively to research methods in applied linguistics, a discipline that explores real-world language-related issues and phenomena. Core areas of applied linguistics include bilingualism and multilingualism, …. View full aims & scope. $1500. Article publishing charge.

  6. International Journal of Applied Linguistics

    The International Journal of Applied Linguistics explores how the knowledge of linguistics is connected to the practical reality of language. This leading linguistics journal is interested in how the particular and the general are inter-related and encourage research which is international in the sense that it shows explicitly how local issues of language use or learning exemplify global concerns.

  7. Research Trends in Applied Linguistics (2017-2021): A ...

    Applied linguistics can be broadly defined as a discipline that studies "language with relevance to real-world issues", according to the stated aims of its flagship journal, Applied Linguistics (2022). The recent decades have witnessed its fast growth in terms of the number of papers published every year, the topics examined, and the emergence of new theories, approaches, methodologies and ...

  8. About

    Aims. Applied Linguistics welcomes submissions about language-related problems and solutions in real-world contexts. The journal aims to publish papers that appeal to its broad readership and make connections between scholarly discourses, theories, and research methods from a range of areas of study. Applied linguistics research typically has ...

  9. PDF Applied Linguistics Research: Current Issues, Methods, and ...

    ESOL Quarterly) that have aimed to address methodological issues. In 2016, Applied Linguistics published a special issue on "Innovation in research methods in applied linguistics," which explored how innovation had the potential to influence methods of applied linguistics research (e.g., digital learning environmen.

  10. Guide for authors

    Submission categories Research Methods in Applied Linguistics accepts submissions in the following categories:. Research Articles; Review Articles; Brief Reports; Methods Tutorials; Research Articles. These are manuscripts reporting empirical studies that compare, validate, or refine existing methods, or studies that introduce new methods and exemplify how they can be applied to answer ...

  11. Volume 43 Issue 1

    Applied Linguistics, Volume 43, Issue 1, February 2022, Pages 115-146, ... English-Medium Instruction in International Bio-Science Engineering Programs in Vietnam: Incentivization, Support, and Discretion in a Context of Academic Consolidation ... It furthers the University's objective of excellence in research, scholarship, and education by ...

  12. Applied Linguistics Review

    Objective Applied Linguistics Review (ALR) is an international, peer-reviewed journal that bridges the gap between linguistics and applied areas such as education, psychology and human development, sociology and politics. It serves as a testing ground for the articulation of original ideas and approaches in the study of real-world issues in which language plays a crucial role. ALR brings ...

  13. Research Methods in Applied Linguistics

    Applied linguistics journal editor perspectives: Research ethics and academic publishing. Rita Elaine Silver, Evangeline Lin, Baoqi Sun. Article 100069 ... Implementing the map task in applied linguistics research: What, how, and why. Juan Berríos, Angela Swain, Melinda Fricke. Article 100081 View PDF.

  14. Applied Linguistics Open Access Journals

    Generally, as a quality assurance measure, the journals listed below are indexed in Web of Science (previously known as ISI), Scopus, and/or Directory of Open Access Journals. Journals indexed in Web of Science (esp. SSCI for applied linguistics; see below for explanation) are perceived to be of higher status and to have higher standards (i.e ...

  15. Journal of Research in Applied Linguistics

    Journal of Research in Applied Linguistics is now indexed in Web of Science Core Collection. 2018-05-11; Journal of Research in Applied Linguistics is now indexed in Scopus. 2017-09-25; Getting indexed in J-Gate and Linguistics Abstracts Online indexing databases 2017-06-14; Important Note for contributors to RALs 2016-10-13

  16. Journal of Research in Applied Linguistics

    Journal of Research in Applied Linguistics is now indexed in Web of Science Core Collection. 2018-05-11; Journal of Research in Applied Linguistics is now indexed in Scopus. 2017-09-25; Getting indexed in J-Gate and Linguistics Abstracts Online indexing databases 2017-06-14; Important Note for contributors to RALs 2016-10-13

  17. Research Guides: Linguistics Resources: Journal Databases

    The database covers all journal articles, reports, commentaries, obituaries, etc. at least two pages long in over 2,500 journals and monographic series. Covers the fields of social, cultural, physical, biological, and linguistic anthropology, ethnology, archa. Language in Australia and New Zealand. A bibliography and research database for ...

  18. Research Trends in Applied Linguistics from 2005 to 2016: A

    Using data of articles from 42 Social Science Citation Index (SSCI)-indexed journals of applied linguistics, this study renders a bibliometric analysis of the 2005-16 research trends in the field. The analysis focuses on, among other issues, the most frequently discussed topics, the most highly cited publications, and the changes that have ...

  19. A bibliometric study of the applied linguistics research output of

    This paper aims to analyse the research contributions of the Kingdom of Saudi Arabia in the field of applied linguistics (AL) indexed in the Web of Science core collection for the period between 2011 and 2020. The author searched key terms in the Social Science Citation Index and Science Citation Index Expanded categories that publish documents ...

  20. Applied linguistics journal editor perspectives: Research ethics and

    (See also Haven et al., 2022; Marsden, 2019; Moher et al., 2020 on links between open science and research integrity and ethics.) In a review of applied linguistics journals, Silver & Lin (in press) examined open science practices, including practices related to research ethics and integrity, based on website information. They found that though ...

  21. Open Linguistics

    Open Linguistics is an academic peer-reviewed journal covering all areas of linguistics. The objective of this journal is to foster free exchange of ideas and provide an appropriate platform for presenting, discussing and disseminating new concepts, current trends, theoretical developments and research findings related to a broad spectrum of topics: descriptive linguistics, theoretical ...

  22. Web of Science Master Journal List

    The Master Journal List is an invaluable tool to help you to find the right journal for your needs across multiple indices hosted on the Web of Science platform. Spanning all disciplines and regions, Web of Science Core Collection is at the heart of the Web of Science platform. Curated with care by an expert team of in-house editors, Web of Science Core Collection includes only journals that ...

  23. Detecting hallucinations in large language models using ...

    Detecting confabulations in QA and math. Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to ...

  24. Application of theories in Library and Information Science research in

    Adbekhoda M, Ashrafi-Rizi H, Ranjbaran F (2018) Theoretical issues in medical library and information sciences' articles published in scopus and web of science databases: a scoping review. Journal of Education and Health Promotion 12(244): 1-6.

  25. Assessing the place of citizen science in modern research

    Assessing the place of citizen science in modern research. In recent years, numerous fields of research have seen an explosion in the volume and complexity of their scientific data. To keep pace ...