Publications

Publishing our work allows us to share ideas and work collaboratively to advance the field of computer science.


  • Algorithms and Optimization 323
  • Applied science 186
  • Climate and Sustainability 10
  • Cloud AI 46
  • Language 235
  • Perception 291

Research Area

  • Algorithms and Theory 1317
  • Data Management 166
  • Data Mining and Modeling 353
  • Distributed Systems and Parallel Computing 340
  • Economics and Electronic Commerce 340
  • Education Innovation 68
  • General Science 329
  • Hardware and Architecture 145
  • Health & Bioscience 360
  • Human-Computer Interaction and Visualization 803
  • Information Retrieval and the Web 414
  • Machine Intelligence 3782
  • Machine Perception 1449
  • Machine Translation 145
  • Mobile Systems 107
  • Natural Language Processing 1068
  • Networking 312
  • Quantum Computing 124
  • Responsible AI 221
  • Robotics 198
  • Security, Privacy and Abuse Prevention 491
  • Software Engineering 200
  • Software Systems 448
  • Speech Processing 542

Meet the groups behind our innovation

Our teams advance the state of the art through research, systems engineering, and collaboration across Google.


2020 Best Paper Awards

The NSR 2020 Best Paper Awards recognize the best papers published in recent years in National Science Review.

The Editorial Board has selected the papers below as winners of the 2020 Best Paper Awards, highlighting their significant contributions to research as measured by achievement and potential impact.

BMAL1 knockout macaque monkeys display reduced sleep and psychiatric disorders

Circadian disruption is a risk factor for metabolic, psychiatric and age-related disorders, and non-human primate models could help to develop therapeutic treatments. Here, we report the generation of BMAL1 knockout cynomolgus monkeys, a model for circadian-related disorders, by CRISPR/Cas9 editing of monkey embryos. These monkeys showed higher nocturnal locomotion and reduced sleep, which was further exacerbated by a constant light regimen.

Programmable time-domain digital-coding metasurface for non-linear harmonic manipulation and new wireless communication systems

Optical non-linear phenomena are typically observed in natural materials interacting with light at high intensities, and they benefit a diverse range of applications from communication to sensing. However, controlling harmonic conversion with high efficiency and flexibility remains a major issue in modern optical and radio-frequency systems. Here, we introduce a dynamic time-domain digital-coding metasurface that enables efficient manipulation of spectral harmonic distribution.

Linking atmospheric pollution to cryospheric change in the Third Pole region: current progress and future prospects

The Tibetan Plateau and its surroundings are known as the Third Pole (TP). This region is noted for its high rates of glacier melt and the associated hydrological shifts that affect water supplies in Asia. Atmospheric pollutants contribute to climatic and cryospheric changes through their effects on solar radiation and the albedos of snow and ice surfaces; moreover, the behavior and fate of pollutants within the cryosphere, and their environmental impacts, are topics of increasing concern.

An overview of multi-task learning

As a promising area in machine learning, multi-task learning (MTL) aims to improve the performance of multiple related learning tasks by leveraging useful information among them. In this paper, we give an overview of MTL by first giving a definition of MTL. Then several different settings of MTL are introduced, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning.

Recent advances in the precise control of isolated single-site catalysts by chemical methods

The search for high-performance catalysts is an enduring topic in the chemical sciences. Recently, we have witnessed many breakthroughs in the synthesis of single-atom catalysts (SACs) and their applications in catalytic systems. They have shown excellent activity, selectivity, stability and atom-utilization efficiency, and can serve as an efficient bridge between homogeneous and heterogeneous catalysis.

Controlling flexibility of metal–organic frameworks

Framework flexibility is one of the most important characteristics of metal–organic frameworks (MOFs), which is not only interesting, but also useful for a variety of applications. Designing, tailoring or controlling framework flexibility of MOFs is much more difficult than for static structural features such as the framework topology and pore size/shape.


Our most popular papers of 2020

Which Reading research publications got the most attention across the globe in 2020? We’ve scoured Altmetric data to bring you the top ten most talked about Reading-authored papers of the past year.

The 2020 Lancet Countdown report on health and climate change: responding to converging crises

Climate change could lead to global healthcare systems being overwhelmed in the same way they have been during the current COVID-19 pandemic, the authors of the latest world-leading report on climate and health have warned.

The 2020 Lancet Countdown report, produced by experts from around the world including Professor Elizabeth Robinson from the University of Reading, urges urgent action on climate change to reduce the increasing threats to health, lives and livelihoods in all countries.

Right now, people around the world face increasing extremes of heat, food and water insecurity, and changing patterns of infectious diseases. Responding to climate change today will bring about cleaner skies, healthier diets, and safer places to live, as well as reduce the risk factors of future infectious diseases.

The authors call for a coordinated immediate response to climate change alongside the pandemic, in order to avoid future health crises, improve public health, create a sustainable economy and protect the environment.

Origins of the sarsen megaliths at Stonehenge

A mystery speculated on for centuries has been solved as new research revealed the origin of Stonehenge’s giant sarsen stones to be West Woods, on the edge of Wiltshire’s Marlborough Downs.

Research jointly carried out by the University of Reading, published in the journal Science Advances, has pinpointed the source of the megaliths to an area around 15 miles north of the stone circle site.

The breakthrough came when a core – drilled from Stonehenge’s ‘Stone 58’ during repair work in the 1950s – was returned to English Heritage from Florida last year at the request of one of those involved at the time, Mr Robert Phillips. This presented a unique opportunity to analyse the interior of one of Stonehenge’s great sarsens.

The impact of social isolation and loneliness on children and young people during COVID-19

The move into lockdown as part of the control measures to avoid the spread of COVID-19 necessitated widespread social isolation. A review published in the Journal of the American Academy of Child & Adolescent Psychiatry, to which researchers from the Charlie Waller Institute contributed, aimed to establish how loneliness and lockdown affect the mental health of children and adolescents.

The review covered articles published between 1946 and March 2020, drawing on 63 studies with 51,576 participants. It found a clear association between loneliness and mental health problems in children and adolescents, findings which were consistent across studies of children, adolescents, and young adults. There may also be gender differences, with some research indicating that loneliness was more strongly associated with elevated depression in girls and with elevated social anxiety in boys.

This is of particular relevance in the COVID-19 context, as politicians in different countries consider how long schools should remain closed and how social distancing should be implemented within schools.

International scientists formulate a roadmap for insect conservation and recovery

Professor Simon Potts contributed to a review in Nature Ecology and Evolution of the evidence that declining insect abundance, diversity and biomass across the globe is a very real and serious threat that must be urgently addressed.

The review proposes a global roadmap for insect conservation and recovery alongside the immediate implementation of measures to slow or stop insect declines. These actions include conservation, particularly of threatened species, and mitigation of alien species, along with a reduction in pollution and pesticide use and an increase in land-use diversity in agriculture. Medium-term actions include further research and making the best use of currently held data and collections. Longer-term actions include building partnerships and a global monitoring programme.

Most importantly, the review says that there is already enough information on some key causes of insect decline to formulate solutions, and that a learning-by-doing approach ensures that these conservation strategies are robust to newly emerging pressures and threats.

Sea-ice-free Arctic during the last interglacial period supports fast future loss

Researchers from the Meteorology Department published new research in Nature Climate Change about Arctic sea-ice loss and how we can predict it better in the future. While knowledge of past Arctic temperatures is robust, thanks to the available observations, the interpretation of Arctic sea-ice changes during the last interglacial period (approximately 130,000–116,000 years ago) has been much less certain.

To address this question, researchers used the latest modelling techniques to compare conclusions from previous models and to create a fully coupled atmosphere–land–ocean–ice climate model for the last interglacial period. The new model created a much-improved representation of Arctic summers during the warmer interglacial climate compared with previous model simulations.

Simulating an ice-free summer Arctic offers a solution to the longstanding puzzle of what drove Arctic temperatures to rise during the last interglacial period, and provides independent support for predictions of ice-free summer conditions by 2035.

Is mere exposure enough? The effects of bilingual environments on infant cognitive development

Recent research in Royal Society Open Science looked at the idea of the ‘bilingual advantage’ among young children. The most common explanation is that constantly managing two languages during language learning strengthens learning ability. However, this theory cannot explain why this bilingual advantage has also been found in infants before they can speak.

To understand why this should be, researchers carried out four separate eye-tracking tasks in seven- to nine-month-old infants who were being raised in either bilingual or monolingual homes. The results suggest that infants raised in bilingual homes are faster at disengaging attention in order to shift attention to a new stimulus and also switch attention more frequently between two visual stimuli.

This raises the possibility that infants adapt to bilingual environments partly by disengaging attention faster and switching attention more frequently. This supports the researchers’ proposal that bilingual infants adapt by placing more weight on novel information in order to collect more samples from their more varied environments.

Ensemble CME modelling constrained by heliospheric imager observations

Improving the prediction of space weather is the subject of a new study in AGU Advances. Forecasting the arrival of coronal mass ejections (large eruptions of magnetised plasma from the Sun) at the Earth is an important service, since these eruptions are the main cause of severe space weather and can disrupt technology such as satellites, communications networks and power grids.

Improving these forecasts is an important area of space weather research, particularly measuring and improving predictions of when coronal mass ejections might arrive at the Earth. Special imagers can take pictures of the Sun’s heliosphere, and the study shows that these images can be used to reduce uncertainty and improve the accuracy of predictions.

NASA and ESA are currently planning the next spacecraft missions that will observe the Sun and space for space weather forecasting. This proof-of-concept study provides some evidence that it would be useful to include a heliospheric imager on future space missions.

The Global Methane Budget 2000–2017

The Global Carbon Project, an international consortium of multidisciplinary scientists, including some from our University, has published a global methane budget in Earth System Science Data. With emissions of methane, the second most important greenhouse gas after carbon dioxide, continuing to increase, the budget is important for assessing realistic ways to mitigate climate change.

Two major challenges in estimating the budget are the variety of geographically overlapping methane sources and the destruction of methane in the atmosphere. The budget was built by synthesising a large collection of new and published methods and results, including atmospheric observations and inversions, process-based models for land-surface emissions and atmospheric chemistry, and inventories of emissions.

The Global Carbon Project aims to update this methane budget regularly (approximately every two to three years). Each update will produce a more recent ten-year methane budget, highlight changes in emissions and trends, and incorporate newly available data and model improvements.

Biomarker-estimated flavan-3-ol intake is associated with lower blood pressure in cross-sectional analysis in EPIC Norfolk

People who consume flavanol-rich foods and drinks, such as tea, apples and berries, could benefit from lower blood pressure, according to the first study using objective measures of the diets of thousands of UK residents, published in Scientific Reports.

In contrast to most other studies investigating links between nutrition and health, the researchers did not rely on study participants reporting their diet, but instead measured flavanol intake objectively using nutritional biomarkers – indicators of dietary intake, metabolism or nutritional status that are present in our blood.

The difference in blood pressure between participants with the lowest 10% of flavanol intake and those with the highest 10% is comparable to meaningful changes in blood pressure observed in those following a Mediterranean diet or the medium-sodium intake in the DASH-Sodium trial. Notably, the effect was more pronounced in participants with hypertension.
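
To make the decile comparison above concrete, here is a minimal Python sketch, using synthetic data, of how one might group a cohort by biomarker-estimated flavanol intake and compare mean systolic blood pressure between the lowest and highest 10%. The variable names, effect size and data are invented for illustration; this is not the study’s code or data.

```python
import numpy as np
import pandas as pd

# Synthetic cohort; values are illustrative only, not EPIC Norfolk data.
rng = np.random.default_rng(0)
n = 10_000
intake = rng.gamma(shape=2.0, scale=250.0, size=n)      # hypothetical biomarker units
sbp = 135 - 0.004 * intake + rng.normal(0, 12, size=n)  # assume a small inverse association

cohort = pd.DataFrame({"flavanol_intake": intake, "sbp": sbp})
cohort["decile"] = pd.qcut(cohort["flavanol_intake"], 10, labels=False)

low = cohort.loc[cohort["decile"] == 0, "sbp"].mean()   # lowest 10% of intake
high = cohort.loc[cohort["decile"] == 9, "sbp"].mean()  # highest 10% of intake
print(f"Mean SBP difference, lowest vs highest decile: {low - high:.1f} mmHg")
```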

Precipitation modification by ionization

Nuclear bomb tests during the Cold War may have changed rainfall patterns thousands of miles from the detonation sites, new research has revealed.

Scientists at the University have researched how the electric charge released by radiation from the test detonations, carried out predominantly by the USA and Soviet Union in the 1950s and 1960s, affected rainclouds at the time.

The study, published in Physical Review Letters, used historical records from 1962 to 1964 from a research station in Scotland. Scientists compared days with high and low radioactively generated charge, finding that clouds were visibly thicker, and there was 24% more rain on average, on the days with more radioactivity.


The Altmetric Attention Score is a weighted count of all of the online attention Altmetric has found for an individual research output. This includes mentions in public policy documents and references in Wikipedia, the mainstream news, social networks, blogs and more. For this blog we’ve featured the top ten highest-scoring publications with a Reading-affiliated author published between 1 January and 15 December 2020.
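
Altmetric’s exact weighting scheme is proprietary, so the sketch below only illustrates the general idea of a weighted count of mentions across sources. The source names and weights are invented for demonstration and do not reflect Altmetric’s real coefficients.

```python
# Illustrative weighted attention count in the spirit of the Altmetric
# Attention Score; the weights below are invented, not Altmetric's own.
ASSUMED_WEIGHTS = {
    "news": 8.0,
    "blog": 5.0,
    "policy_document": 3.0,
    "wikipedia": 3.0,
    "twitter": 0.25,
}

def attention_score(mentions: dict) -> float:
    """Weighted count of mentions, defaulting unknown sources to weight 1."""
    return sum(ASSUMED_WEIGHTS.get(source, 1.0) * count
               for source, count in mentions.items())

print(attention_score({"news": 40, "blog": 6, "twitter": 800, "wikipedia": 2}))
```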


The 10 Most Significant Education Studies of 2020

We reviewed hundreds of educational studies in 2020 and then highlighted 10 of the most significant—covering topics from virtual learning to the reading wars and the decline of standardized tests.

In March 2020, the year suddenly became a whirlwind. With a pandemic disrupting life across the entire globe, teachers scrambled to transform their physical classrooms into virtual—or even hybrid—ones, and researchers slowly began to collect insights into what works, and what doesn’t, in online learning environments around the world.

Meanwhile, neuroscientists made a convincing case for keeping handwriting in schools, and after the closure of several coal-fired power plants in Chicago, researchers reported a drop in pediatric emergency room visits and fewer absences in schools, reminding us that questions of educational equity do not begin and end at the schoolhouse door.

1. To Teach Vocabulary, Let Kids Be Thespians

When students are learning a new language, ask them to act out vocabulary words. It’s fun to unleash a child’s inner thespian, of course, but a 2020 study concluded that it also nearly doubles their ability to remember the words months later.

Researchers asked 8-year-old students to listen to words in another language and then use their hands and bodies to mimic the words—spreading their arms and pretending to fly, for example, when learning the German word Flugzeug, which means “airplane.” After two months, these young actors were a remarkable 73 percent more likely to remember the new words than students who had listened without accompanying gestures. Researchers discovered similar, if slightly less dramatic, results when students looked at pictures while listening to the corresponding vocabulary.

It’s a simple reminder that if you want students to remember something, encourage them to learn it in a variety of ways—by drawing it, acting it out, or pairing it with relevant images, for example.

2. Neuroscientists Defend the Value of Teaching Handwriting—Again

For most kids, typing just doesn’t cut it. In 2012, brain scans of preliterate children revealed crucial reading circuitry flickering to life when kids hand-printed letters and then tried to read them. The effect largely disappeared when the letters were typed or traced.

More recently, in 2020, a team of researchers studied older children—seventh graders—while they handwrote, drew, and typed words, and concluded that handwriting and drawing produced telltale neural tracings indicative of deeper learning.

“Whenever self-generated movements are included as a learning strategy, more of the brain gets stimulated,” the researchers explain, before echoing the 2012 study: “It also appears that the movements related to keyboard typing do not activate these networks the same way that drawing and handwriting do.”

It would be a mistake to replace typing with handwriting, though. All kids need to develop digital skills, and there’s evidence that technology helps children with dyslexia to overcome obstacles like note taking or illegible handwriting, ultimately freeing them to “use their time for all the things in which they are gifted,” says the Yale Center for Dyslexia and Creativity.

3. The ACT Test Just Got a Negative Score (Face Palm)

A 2020 study found that ACT test scores, which are often a key factor in college admissions, showed a weak—or even negative—relationship when it came to predicting how successful students would be in college. “There is little evidence that students will have more college success if they work to improve their ACT score,” the researchers explain, and students with very high ACT scores—but indifferent high school grades—often flamed out in college, overmatched by the rigors of a university’s academic schedule.

Just last year, the SAT—cousin to the ACT—had a similarly dubious public showing. In a major 2019 study of nearly 50,000 students led by researcher Brian Galla, and including Angela Duckworth, researchers found that high school grades were stronger predictors of four-year-college graduation than SAT scores.

The reason? Four-year high school grades, the researchers asserted, are a better indicator of crucial skills like perseverance, time management, and the ability to avoid distractions. It’s most likely those skills, in the end, that keep kids in college.

4. A Rubric Reduces Racial Grading Bias

A simple step might help undercut the pernicious effect of grading bias, a new study found: Articulate your standards clearly before you begin grading, and refer to the standards regularly during the assessment process.

In 2020, more than 1,500 teachers were recruited and asked to grade a writing sample from a fictional second-grade student. All of the sample stories were identical—but in one set, the student mentions a family member named Dashawn, while the other set references a sibling named Connor.

Teachers were 13 percent more likely to give the Connor papers a passing grade, revealing the invisible advantages that many students unknowingly benefit from. When grading criteria are vague, implicit stereotypes can insidiously “fill in the blanks,” explains the study’s author. But when teachers have an explicit set of criteria to evaluate the writing—asking whether the student “provides a well-elaborated recount of an event,” for example—the difference in grades is nearly eliminated.

5. What Do Coal-Fired Power Plants Have to Do With Learning? Plenty

When three coal-fired plants closed in the Chicago area, student absences in nearby schools dropped by 7 percent, a change largely driven by fewer emergency room visits for asthma-related problems. The stunning finding, published in a 2020 study from Duke and Penn State, underscores the role that often-overlooked environmental factors—like air quality, neighborhood crime, and noise pollution—have in keeping our children healthy and ready to learn.

At scale, the opportunity cost is staggering: About 2.3 million children in the United States still attend a public elementary or middle school located within 10 kilometers of a coal-fired plant.

The study builds on a growing body of research that reminds us that questions of educational equity do not begin and end at the schoolhouse door. What we call an achievement gap is often an equity gap, one that “takes root in the earliest years of children’s lives,” according to a 2017 study. We won’t have equal opportunity in our schools, the researchers admonish, until we are diligent about confronting inequality in our cities, our neighborhoods—and ultimately our own backyards.

6. Students Who Generate Good Questions Are Better Learners

Some of the most popular study strategies—highlighting passages, rereading notes, and underlining key sentences—are also among the least effective. A 2020 study highlighted a powerful alternative: Get students to generate questions about their learning, and gradually press them to ask more probing questions.

In the study, students who studied a topic and then generated their own questions scored an average of 14 percentage points higher on a test than students who used passive strategies like studying their notes and rereading classroom material. Creating questions, the researchers found, not only encouraged students to think more deeply about the topic but also strengthened their ability to remember what they were studying.

There are many engaging ways to have students create highly productive questions: When creating a test, you can ask students to submit their own questions, or you can use the Jeopardy! game as a platform for student-created questions.

7. Did a 2020 Study Just End the ‘Reading Wars’?

One of the most widely used reading programs was dealt a severe blow when a panel of reading experts concluded that it “would be unlikely to lead to literacy success for all of America’s public schoolchildren.”

In the 2020 study, the experts found that the controversial program—called “Units of Study” and developed over the course of four decades by Lucy Calkins at the Teachers College Reading and Writing Project—failed to explicitly and systematically teach young readers how to decode and encode written words, and was thus “in direct opposition to an enormous body of settled research.”

The study sounded the death knell for practices that de-emphasize phonics in favor of having children use multiple sources of information—like story events or illustrations—to predict the meaning of unfamiliar words, an approach often associated with “balanced literacy.” In an internal memo obtained by the news organization APM Reports, Calkins seemed to concede the point, writing that “aspects of balanced literacy need some ‘rebalancing.’”

8. A Secret to High-Performing Virtual Classrooms

In 2020, a team at Georgia State University compiled a report on virtual learning best practices. While evidence in the field is “sparse” and “inconsistent,” the report noted that logistical issues like accessing materials—and not content-specific problems like failures of comprehension—were often among the most significant obstacles to online learning. It wasn’t that students didn’t understand photosynthesis in a virtual setting, in other words—it was that they didn’t find (or simply didn’t access) the lesson on photosynthesis at all.

That basic insight echoed a 2019 study that highlighted the crucial need to organize virtual classrooms even more intentionally than physical ones. Remote teachers should use a single, dedicated hub for important documents like assignments; simplify communications and reminders by using one channel like email or text; and reduce visual clutter like hard-to-read fonts and unnecessary decorations throughout their virtual spaces.

Because the tools are new to everyone, regular feedback on topics like accessibility and ease of use is crucial. Teachers should post simple surveys asking questions like “Have you encountered any technical issues?” and “Can you easily locate your assignments?” to ensure that students experience a smooth-running virtual learning space.

9. Love to Learn Languages? Surprisingly, Coding May Be Right for You

Learning how to code more closely resembles learning a language such as Chinese or Spanish than learning math, a 2020 study found—upending the conventional wisdom about what makes a good programmer.

In the study, young adults with no programming experience were asked to learn Python, a popular programming language; they then took a series of tests assessing their problem-solving, math, and language skills. The researchers discovered that mathematical skill accounted for only 2 percent of a person’s ability to learn how to code, while language skills were almost nine times more predictive, accounting for 17 percent of learning ability.
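
As a rough illustration of how such variance shares can be estimated, the sketch below fits linear regressions with and without each predictor on synthetic data and compares the change in R². The data and effect sizes are invented; this is not the study’s analysis code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in which language aptitude explains more variance in a
# 'learning rate' outcome than numeracy does (effect sizes invented).
rng = np.random.default_rng(1)
n = 500
language = rng.normal(size=n)
numeracy = rng.normal(size=n)
learning_rate = 0.45 * language + 0.15 * numeracy + rng.normal(0, 1.0, size=n)

def r2(X):
    """In-sample R-squared of a linear fit of learning_rate on X."""
    return LinearRegression().fit(X, learning_rate).score(X, learning_rate)

both = np.column_stack([language, numeracy])
print(f"R2 added by language: {r2(both) - r2(numeracy.reshape(-1, 1)):.3f}")
print(f"R2 added by numeracy: {r2(both) - r2(language.reshape(-1, 1)):.3f}")
```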

That’s an important insight because all too often, programming classes require that students pass advanced math courses—a hurdle that needlessly excludes students with untapped promise, the researchers claim.

10. Researchers Cast Doubt on Reading Tasks Like ‘Finding the Main Idea’

“Content is comprehension,” declared a 2020 Fordham Institute study, sounding a note of defiance as it staked out a position in the ongoing debate over the teaching of intrinsic reading skills versus the teaching of content knowledge.

While elementary students spend an enormous amount of time working on skills like “finding the main idea” and “summarizing”—tasks born of the belief that reading is a discrete and trainable ability that transfers seamlessly across content areas—these young readers aren’t experiencing “the additional reading gains that well-intentioned educators hoped for,” the study concluded.

So what works? The researchers looked at data from more than 18,000 K–5 students, focusing on the time spent in subject areas like math, social studies, and ELA, and found that “social studies is the only subject with a clear, positive, and statistically significant effect on reading improvement.” In effect, exposing kids to rich content in civics, history, and law appeared to teach reading more effectively than our current methods of teaching reading. Perhaps defiance is no longer needed: Fordham’s conclusions are rapidly becoming conventional wisdom—and they extend beyond the limited claim of reading social studies texts. According to Natalie Wexler, the author of the well-received 2019 book The Knowledge Gap, content knowledge and reading are intertwined. “Students with more [background] knowledge have a better chance of understanding whatever text they encounter. They’re able to retrieve more information about the topic from long-term memory, leaving more space in working memory for comprehension,” she recently told Edutopia.

  • Systematic review
  • Open access
  • Published: 19 February 2024

‘It depends’: what 86 systematic reviews tell us about what strategies to use to support the use of research in clinical practice

  • Annette Boaz (ORCID: orcid.org/0000-0003-0557-1294)
  • Juan Baeza
  • Alec Fraser (ORCID: orcid.org/0000-0003-1121-1551)
  • Erik Persson

Implementation Science, volume 19, article number 15 (2024)


The gap between research findings and clinical practice is well documented and a range of strategies have been developed to support the implementation of research into clinical practice. The objective of this study was to update and extend two previous reviews of systematic reviews of strategies designed to implement research evidence into clinical practice.

We developed a comprehensive systematic literature search strategy based on the terms used in the previous reviews to identify studies that looked explicitly at interventions designed to turn research evidence into practice. The search was performed in June 2022 in four electronic databases: Medline, Embase, Cochrane and Epistemonikos. We searched from January 2010 up to June 2022 and applied no language restrictions. Two independent reviewers appraised the quality of included studies using a quality assessment checklist. To reduce the risk of bias, papers were excluded following discussion between all members of the team. Data were synthesised using descriptive and narrative techniques to identify themes and patterns linked to intervention strategies, targeted behaviours, study settings and study outcomes.

We identified 32 reviews conducted between 2010 and 2022. The reviews are mainly of multi-faceted interventions (n = 20), although there are reviews focusing on single strategies (ICT, educational, reminders, local opinion leaders, audit and feedback, social media and toolkits). The majority of reviews report strategies achieving small impacts (normally on processes of care). There is much less evidence that these strategies have shifted patient outcomes. Furthermore, a lot of nuance lies behind these headline findings, and this is increasingly commented upon in the reviews themselves.

Combined with the two previous reviews, 86 systematic reviews of strategies to increase the implementation of research into clinical practice have been identified. We need to shift the emphasis away from isolating individual and multi-faceted interventions to better understanding and building more situated, relational and organisational capability to support the use of research in clinical practice. This will involve drawing on a wider range of research perspectives (including social science) in primary studies and diversifying the types of synthesis undertaken to include approaches such as realist synthesis which facilitate exploration of the context in which strategies are employed.


Contribution to the literature

Considerable time and money are invested in implementing and evaluating strategies to increase the implementation of research into clinical practice.

The growing body of evidence is not providing the anticipated clear lessons to support improved implementation.

Instead, what is needed is better understanding and building of more situated, relational and organisational capability to support the use of research in clinical practice.

This would involve a more central role in implementation science for a wider range of perspectives, especially from the social, economic, political and behavioural sciences and for greater use of different types of synthesis, such as realist synthesis.

Introduction

The gap between research findings and clinical practice is well documented and a range of interventions has been developed to increase the implementation of research into clinical practice [ 1 , 2 ]. In recent years researchers have worked to improve the consistency in the ways in which these interventions (often called strategies) are described to support their evaluation. One notable development has been the emergence of Implementation Science as a field focusing explicitly on “the scientific study of methods to promote the systematic uptake of research findings and other evidence-based practices into routine practice” ([ 3 ] p. 1). The work of implementation science focuses on closing, or at least narrowing, the gap between research and practice. One contribution has been to map existing interventions, identifying 73 discrete strategies to support research implementation [ 4 ] which have been grouped into 9 clusters [ 5 ]. The authors note that they have not considered the evidence of effectiveness of the individual strategies and that a next step is to understand better which strategies perform best in which combinations and for what purposes [ 4 ]. Other authors have noted that there is also scope to learn more from other related fields of study such as policy implementation [ 6 ] and to draw on methods designed to support the evaluation of complex interventions [ 7 ].

The increase in activity designed to support the implementation of research into practice and improvements in reporting provided the impetus for an update of a review of systematic reviews of the effectiveness of interventions designed to support the use of research in clinical practice [ 8 ], which was itself an update of the review conducted by Grimshaw and colleagues in 2001. The 2001 review [ 9 ] identified 41 reviews considering a range of strategies, from educational interventions, audit and feedback and computerised decision support to financial incentives and combined interventions. The authors concluded that all the interventions had the potential to promote the uptake of evidence in practice, although no one intervention seemed to be more effective than the others in all settings. They concluded that combined interventions were more likely to be effective than single interventions. The 2011 review identified a further 13 systematic reviews containing 313 discrete primary studies. Consistent with the previous review, four main strategy types were identified: audit and feedback; computerised decision support; opinion leaders; and multi-faceted interventions (MFIs). Nine of the reviews reported on MFIs. The review highlighted the small effects of single interventions such as audit and feedback, computerised decision support and opinion leaders. MFIs claimed an improvement in effectiveness over single interventions, although effect sizes remained small to moderate, and this improvement in effectiveness relating to MFIs has been questioned in a subsequent review [ 10 ]. In updating the review, we anticipated a larger pool of reviews and an opportunity to consolidate learning from more recent systematic reviews of interventions.

This review updates and extends our previous review of systematic reviews of interventions designed to implement research evidence into clinical practice. To identify potentially relevant peer-reviewed research papers, we developed a comprehensive systematic literature search strategy based on the terms used in the Grimshaw et al. [ 9 ] and Boaz, Baeza and Fraser [ 8 ] overview articles. To ensure optimal retrieval, our search strategy was refined with support from an expert university librarian, considering the ongoing improvements in the development of search filters for systematic reviews since our first review [ 11 ]. We also wanted to include technology-related terms (e.g. apps, algorithms, machine learning, artificial intelligence) to find studies that explored interventions based on the use of technological innovations as mechanistic tools for increasing the uptake of evidence into practice (see Additional file 1 : Appendix A for full search strategy).

The search was performed in June 2022 in the following electronic databases: Medline, Embase, Cochrane and Epistemonikos. We searched for articles published since the 2011 review. We searched from January 2010 up to June 2022 and applied no language restrictions. Reference lists of relevant papers were also examined.

We uploaded the results using EPPI-Reviewer, a web-based tool that facilitated semi-automation of the screening process and removal of duplicate studies. We made particular use of a priority screening function to reduce screening workload and avoid ‘data deluge’ [ 12 ]. Through machine learning, one reviewer screened a smaller number of records ( n  = 1200) to train the software to predict whether a given record was more likely to be relevant or irrelevant, thus pulling the relevant studies towards the beginning of the screening process. This automation did not replace manual work but helped the reviewer to identify eligible studies more quickly. During the selection process, we included studies that looked explicitly at interventions designed to turn research evidence into practice. Studies were included if they met the following pre-determined inclusion criteria:

  • The study was a systematic review
  • Search terms were included
  • Focused on the implementation of research evidence into practice
  • The methodological quality of the included studies was assessed as part of the review

Study populations included healthcare providers and patients. The EPOC taxonomy [ 13 ] was used to categorise the strategies. The EPOC taxonomy has four domains: delivery arrangements, financial arrangements, governance arrangements and implementation strategies. The implementation strategies domain includes 20 strategies targeted at healthcare workers. Numerous EPOC strategies were assessed in the review, including educational strategies, local opinion leaders, reminders, ICT-focused approaches and audit and feedback. Some strategies that did not fit easily within the EPOC categories were also included. These were social media strategies and toolkits, and multi-faceted interventions (MFIs) (see Table 2). Some systematic reviews included comparisons of different interventions while other reviews compared one type of intervention against a control group. Outcomes related to improvements in health care processes or patient well-being. Numerous individual study types (RCT, CCT, BA, ITS) were included within the systematic reviews.

We excluded papers that:

  • Focused on changing patient rather than provider behaviour
  • Had no demonstrable outcomes
  • Made unclear or no reference to research evidence

The last of these criteria was sometimes difficult to judge, and there was considerable discussion amongst the research team as to whether the link between research evidence and practice was sufficiently explicit in the interventions analysed. As we discussed in the previous review [ 8 ] in the field of healthcare, the principle of evidence-based practice is widely acknowledged and tools to change behaviour such as guidelines are often seen to be an implicit codification of evidence, despite the fact that this is not always the case.

Reviewers employed a two-stage process to select papers for inclusion. First, all titles and abstracts were screened by one reviewer to determine whether the study met the inclusion criteria. Two papers [ 14 , 15 ] were identified that fell just before the 2010 cut-off. As they were not identified in the searches for the first review [ 8 ], they were included and progressed to assessment. Each paper was rated as include, exclude or maybe. The full texts of 111 relevant papers were assessed independently by at least two authors. To reduce the risk of bias, papers were excluded following discussion between all members of the team. 32 papers met the inclusion criteria and proceeded to data extraction. The study selection procedure is documented in a PRISMA literature flow diagram (see Fig. 1). We were able to include French, Spanish and Portuguese papers in the selection, reflecting the language skills in the study team, but none of the papers identified met the inclusion criteria. Other non-English language papers were excluded.
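
The priority screening described earlier, in which an initial batch of manually screened records trains a classifier so that likely-relevant records surface first, is essentially active learning over text. Below is a minimal Python sketch of that general pattern using scikit-learn; it is an assumed reconstruction for illustration, not EPPI-Reviewer’s actual implementation, and the record titles and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Train on an initial batch of manually screened records, then rank the
# unscreened records so likely-relevant ones are screened first.
screened_titles = [
    "audit and feedback to improve guideline adherence",
    "opinion leaders and evidence-based nursing practice",
    "hospital parking survey results",
    "cafeteria menu preferences of staff",
]
screened_labels = [1, 1, 0, 0]  # 1 = relevant, 0 = irrelevant

unscreened_titles = [
    "reminders for clinicians to follow stroke guidelines",
    "annual report of the facilities department",
]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(screened_titles)
model = LogisticRegression().fit(X_train, screened_labels)

relevance = model.predict_proba(vectoriser.transform(unscreened_titles))[:, 1]
for title in np.array(unscreened_titles)[np.argsort(-relevance)]:
    print(title)  # screen the most-likely-relevant records first
```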

Figure 1: PRISMA flow diagram. Source: authors.

One reviewer extracted data on strategy type, number of included studies, locale, target population, effectiveness and scope of impact from the included studies. Two reviewers then independently read each paper and noted key findings and broad themes of interest, which were then discussed amongst the wider authorial team. Two independent reviewers appraised the quality of included studies using a quality assessment checklist based on Oxman and Guyatt [ 16 ] and Francke et al. [ 17 ]. Each study was assigned a quality score ranging from 1 (extensive flaws) to 7 (minimal flaws) (see Additional file 2 : Appendix B). All disagreements were resolved through discussion. Studies were not excluded from this updated overview based on methodological quality, as we aimed to reflect the full extent of current research into this topic.

The extracted data were synthesised using descriptive and narrative techniques to identify themes and patterns in the data linked to intervention strategies, targeted behaviours, study settings and study outcomes.

Thirty-two studies were included in the systematic review. Table 1 provides a detailed overview of the included systematic reviews, comprising reference, strategy type, quality score, number of included studies, locale, target population, effectiveness and scope of impact (see Table 1 at the end of the manuscript). Overall, the quality of the studies was high. Twenty-three studies scored 7, six studies scored 6, one study scored 5, one study scored 4 and one study scored 3. The primary focus of the review was on reviews of effectiveness studies, but a small number of reviews did include data from a wider range of methods, including qualitative studies, which added to the analysis in the papers [ 18 , 19 , 20 , 21 ]. The majority of reviews report strategies achieving small impacts (normally on processes of care). There is much less evidence that these strategies have shifted patient outcomes. In this section, we discuss the different EPOC-defined implementation strategies in turn. Interestingly, we found only two ‘new’ approaches in this review that did not fit into the existing EPOC approaches: a review focused on the use of social media and a review considering toolkits. In addition to single interventions, we also discuss multi-faceted interventions. These were the most common intervention approach overall. A summary is provided in Table 2.
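
As a minimal sketch of the categorisation step described above, the snippet below groups reviews by strategy type, distinguishing EPOC-listed implementation strategies from the non-EPOC categories used in this overview. The review labels are placeholders, not the actual included studies.

```python
from collections import defaultdict

# EPOC-listed implementation strategies encountered in this overview,
# plus the categories that fell outside the EPOC taxonomy.
EPOC_STRATEGIES = {"educational", "local opinion leaders", "reminders",
                   "ICT-focused", "audit and feedback"}
NON_EPOC_STRATEGIES = {"social media", "toolkits", "multi-faceted"}

# Placeholder review records: (citation, strategy type).
reviews = [("Review A", "educational"),
           ("Review B", "multi-faceted"),
           ("Review C", "audit and feedback"),
           ("Review D", "toolkits")]

grouped = defaultdict(list)
for citation, strategy in reviews:
    if strategy in EPOC_STRATEGIES:
        grouped[("EPOC", strategy)].append(citation)
    elif strategy in NON_EPOC_STRATEGIES:
        grouped[("non-EPOC", strategy)].append(citation)
    else:
        grouped[("uncategorised", strategy)].append(citation)

for (domain, strategy), cites in sorted(grouped.items()):
    print(f"{domain}: {strategy} -> {len(cites)} review(s)")
```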

Educational strategies

The overview identified three systematic reviews focusing on educational strategies. Grudniewicz et al. [ 22 ] explored the effectiveness of printed educational materials on primary care physician knowledge, behaviour and patient outcomes and concluded they were not effective in any of these aspects. Koota, Kääriäinen and Melender [ 23 ] focused on educational interventions promoting evidence-based practice among emergency room/accident and emergency nurses and found that interventions involving face-to-face contact led to significant or highly significant effects on patient benefits and emergency nurses’ knowledge, skills and behaviour. Interventions using written self-directed learning materials also led to significant improvements in nurses’ knowledge of evidence-based practice. Although the quality of the studies was high, the review primarily included small studies with low response rates, and many of them relied on self-assessed outcomes; consequently, the strength of the evidence for these outcomes is modest. Wu et al. [ 20 ] questioned if educational interventions aimed at nurses to support the implementation of evidence-based practice improve patient outcomes. Although based on evaluation projects and qualitative data, their results also suggest that positive changes on patient outcomes can be made following the implementation of specific evidence-based approaches (or projects). The differing positive outcomes for educational strategies aimed at nurses might indicate that the target audience is important.

Local opinion leaders

Flodgren et al. [ 24 ] was the only systematic review focusing solely on opinion leaders. The review found that local opinion leaders alone, or in combination with other interventions, can be effective in promoting evidence‐based practice, but this varies both within and between studies, and the effect on patient outcomes is uncertain. The review found that, overall, any intervention involving opinion leaders probably improves healthcare professionals’ compliance with evidence-based practice, but this varies within and across studies. However, how opinion leaders had an impact could not be determined because insufficient detail was provided, illustrating that reporting specific details in published studies is important if effective methods of increasing evidence-based practice are to be diffused across a system. The usefulness of this review is limited because it cannot provide evidence of what makes an effective opinion leader, whether teams of opinion leaders or a single opinion leader are most effective, or which methods used by opinion leaders are most effective.

Reminders

Pantoja et al. [ 26 ] was the only systematic review included in the overview focusing solely on manually generated reminders delivered on paper. The review explored how these affected professional practice and patient outcomes. It concluded that manually generated reminders delivered on paper, as a single intervention, probably lead to small to moderate increases in adherence to clinical recommendations and could be used as a single quality improvement intervention. However, the authors indicated that this intervention would make little or no difference to patient outcomes. The authors state that such a low-tech intervention may be useful in low- and middle-income countries where paper records are more likely to be the norm.

ICT-focused approaches

The three ICT-focused reviews [ 14 , 27 , 28 ] showed mixed results. Jamal, McKenzie and Clark [ 14 ] explored the impact of health information technology on the quality of medical and health care, examining the impact of electronic health records, computerised provider order entry and decision support systems. This showed a positive improvement in adherence to evidence-based guidelines but not in patient outcomes. The number of studies included in the review was low, so a conclusive recommendation could not be reached on this basis. Similarly, Brown et al. [ 28 ] found that technology-enabled knowledge translation interventions may improve the knowledge of health professionals, but all eight studies raised concerns of bias. The De Angelis et al. [ 27 ] review was more promising, reporting that ICT can be a good way of disseminating clinical practice guidelines, but concluding that it is unclear which type of ICT method is the most effective.

Audit and feedback

Sykes, McAnuff and Kolehmainen [ 29 ] examined whether audit and feedback were effective in dementia care and concluded that it remains unclear which ingredients of audit and feedback are successful as the reviewed papers illustrated large variations in the effectiveness of interventions using audit and feedback.

Non-EPOC listed strategies: social media, toolkits

There were two new (non-EPOC listed) intervention types identified in this review compared to the 2011 review — fewer than anticipated. We categorised a third, ‘care bundles’ [ 36 ], as a multi-faceted intervention due to its description in practice, and a fourth, ‘Technology Enhanced Knowledge Transfer’ [ 28 ], as an ICT-focused approach. The first new strategy was identified in Bhatt et al.’s [ 30 ] systematic review of the use of social media for the dissemination of clinical practice guidelines. They reported that the use of social media resulted in a significant improvement in knowledge of and compliance with evidence-based guidelines compared with more traditional methods. They noted that a wide selection of different healthcare professionals and patients engaged with this type of social media, and its global reach may be significant for low- and middle-income countries. This review was also noteworthy for developing a simple stepwise method for using social media for the dissemination of clinical practice guidelines. However, it is debatable whether social media can be classified as an intervention or just a different way of delivering an intervention; for example, the review discussed involving opinion leaders and patient advocates through social media. This was also a small review that included only five studies, so further research in this new area is needed. Yamada et al. [ 31 ] draw on 39 studies to explore the application of toolkits, 18 of which had toolkits embedded within larger KT interventions, and 21 of which evaluated toolkits as standalone interventions. The individual component strategies of the toolkits were highly variable, though the authors suggest that they align most closely with educational strategies. The authors conclude that toolkits, as either standalone strategies or as part of MFIs, hold some promise for facilitating evidence use in practice, but caution that the quality of many of the included primary studies is considered weak, limiting these findings.

Multi-faceted interventions

The majority of the systematic reviews (n = 20) reported on more than one intervention type. Some of these systematic reviews focus exclusively on multi-faceted interventions, whilst others compare different single or combined interventions aimed at achieving similar outcomes in particular settings. While these two approaches are often described in a similar way, they are actually quite distinct from each other, as the former report how multiple strategies may be strategically combined in pursuit of an agreed goal, whilst the latter report how different strategies may be incidentally used in sometimes contrasting settings in pursuit of similar goals. Ariyo et al. [ 35 ] helpfully summarise five key elements often found in effective MFI strategies in low- and middle-income countries (LMICs), which may also be transferable to high-income countries (HICs). First, effective MFIs encourage a multi-disciplinary approach, acknowledging the roles played by different professional groups in collectively incorporating evidence-informed practice. Second, they utilise leadership, drawing on a wide set of clinical and non-clinical actors including managers and even government officials. Third, multiple types of educational practices are utilised, including input from patients as stakeholders in some cases. Fourth, protocols, checklists and bundles are used, most effectively when local ownership is encouraged. Finally, most MFIs included an emphasis on monitoring and evaluation [ 35 ]. In contrast, other studies offer little information about the nature of the different MFI components of included studies, which makes it difficult to extrapolate much learning from them in relation to why or how MFIs might affect practice (e.g. [ 28 , 38 ]). Ultimately, context matters, which some review authors argue makes it difficult to say with real certainty whether single or MFI strategies are superior (e.g. [ 21 , 27 ]). Taking all the systematic reviews together, we may conclude that MFIs appear to be more likely to generate positive results than single interventions (e.g. [ 34 , 45 ]), though other reviews should make us cautious (e.g. [ 32 , 43 ]).

While multi-faceted interventions still seem to be more effective than single-strategy interventions, there were important distinctions between how the results of reviews of MFIs are interpreted in this review as compared to the previous reviews [ 8 , 9 ], reflecting greater nuance and debate in the literature. This was particularly noticeable where the effectiveness of MFIs was compared to single strategies, reflecting developments widely discussed in previous studies [ 10 ]. We found that most systematic reviews are bounded by their clinical, professional, spatial, system, or setting criteria and often seek to draw out implications for the implementation of evidence in their areas of specific interest (such as nursing or acute care). Frequently this means combining all relevant studies to explore the respective foci of each systematic review. Therefore, most reviews we categorised as MFIs actually include highly variable numbers and combinations of intervention strategies and highly heterogeneous original study designs. This makes statistical analyses of the type used by Squires et al. [ 10 ] on the three reviews in their paper not possible. Further, it also makes extrapolating findings and commenting on broad themes complex and difficult. This may suggest that future research should shift its focus from merely examining ‘what works’ to ‘what works where and what works for whom’ — perhaps pointing to the value of realist approaches to these complex review topics [ 48 , 49 ] and other more theory-informed approaches [ 50 ].

Some reviews have a relatively small number of studies (i.e. fewer than 10) and the authors are often understandably reluctant to engage with wider debates about the implications of their findings. Other larger studies do engage in deeper discussions about internal comparisons of findings across included studies and also contextualise these in wider debates. Some of the most informative studies (e.g. [ 35 , 40 ]) move beyond EPOC categories and contextualise MFIs within wider systems thinking and implementation theory. This distinction between MFIs and single interventions can actually be very useful as it offers lessons about the contexts in which individual interventions might have bounded effectiveness (i.e. educational interventions for individual change). Taken as a whole, this may also then help in terms of how and when to conjoin single interventions into effective MFIs.

In the two previous reviews, a consistent finding was that MFIs were more effective than single interventions [8, 9]. However, like Squires et al. [10], this overview is more equivocal on this important issue. There are four points which may help account for the differences in findings in this regard. Firstly, the diversity of the systematic reviews in terms of clinical topic or setting is an important factor. Secondly, there is heterogeneity of the studies within the included systematic reviews themselves. Thirdly, there is a lack of consistency with regard to the definition of MFIs and the strategies included within them. Finally, there are epistemological differences across the papers and the reviews. This means that the results that are presented depend on the methods used to measure, report, and synthesise them. For instance, some reviews highlight that education strategies can be useful to improve provider understanding — but without wider organisational or system-level change, they may struggle to deliver sustained transformation [19, 44].

It is also worth highlighting the importance of the theory of change underlying the different interventions. Where authors of the systematic reviews draw on theory, there is space to discuss and explain findings. We note a distinction between theoretical and atheoretical systematic review discussion sections. Atheoretical reviews tend to present acontextual findings (for instance, one study found very positive results for one intervention, and this gets highlighted in the abstract), whilst theoretically informed reviews attempt to contextualise and explain patterns within the included studies. Theory-informed systematic reviews seem more likely to offer profound and useful insights (see [19, 35, 40, 43, 45]). We find that the most insightful systematic reviews of MFIs engage in theoretical generalisation — they attempt to go beyond the data of individual studies and discuss the wider implications of the findings of the studies within their reviews, drawing on implementation theory. At the same time, they highlight the active role of context and the wider relational and system-wide issues linked to implementation. It is these types of investigations that can help providers further develop evidence-based practice.

This overview has identified a small but insightful set of papers that interrogate and help theorise why, how, for whom, and in which circumstances it might be the case that MFIs are superior (see [19, 35, 40] once more). At the level of this overview — and in most of the systematic reviews included — MFIs struggle with the question of attribution, that is, identifying which components of a multi-faceted intervention drive any observed effect. In addition, there are other important elements that are often unmeasured or unreported (e.g. costs of the intervention — see [40]). Finally, the stronger systematic reviews [19, 35, 40, 43, 45] engage with systems issues, human agency and context [18] in a way that was not evident in the systematic reviews identified in the previous reviews [8, 9]. The earlier reviews lacked any theory of change that might explain why MFIs might be more effective than single interventions — whereas some systematic reviews now offer this, which enables them to conclude that sometimes single interventions can still be more effective.

As Nilsen et al. ([6] p. 7) note, ‘Study findings concerning the effectiveness of various approaches are continuously synthesized and assembled in systematic reviews’. We may have gone as far as we can in understanding the implementation of evidence through systematic reviews of single and multi-faceted interventions; the next step would be to conduct more research exploring the complex and situated nature of evidence used in clinical practice and by particular professional groups. This would further build on the nuanced discussion and conclusion sections in a subset of the papers we reviewed. It might also support the field to move away from isolating individual implementation strategies [6] and towards exploring the complex processes of implementation, which involve a range of actors with differing capacities and skills [51] working in diverse organisational cultures across multiple system levels, a complexity that taxonomies of implementation strategies do not fully capture. There is plenty of work to build on, particularly in the social sciences, which currently sits at the margins of debates about evidence implementation (see, for example, Normalisation Process Theory [52]).

There are several changes that we have identified in this overview of systematic reviews in comparison to the review we published in 2011 [8]. A consistent and welcome finding is that the overall quality of the systematic reviews themselves appears to have improved between the two reviews, although this is not reflected upon in the papers. This is evident in better, clearer reporting of the mechanics of the reviews, alongside greater attention to, and deeper description of, how potential biases in included papers are discussed. Additionally, there is an increased, but still limited, inclusion of original studies conducted in low- and middle-income countries as opposed to just high-income countries. Importantly, we found that many of these systematic reviews are attuned to, and comment upon, the contextual distinctions of pursuing evidence-informed interventions in health care settings in different economic settings. Furthermore, systematic reviews included in this updated article cover a wider set of clinical specialities (both within and beyond hospital settings) and focus on a wider set of healthcare professions, discussing the similarities, differences and inter-professional challenges faced therein, compared to the earlier reviews. This wider range of studies highlights that a particular intervention or group of interventions may work well for one professional group but be ineffective for another. This diversity of study settings allows us to consider the important role context (in its many forms) plays in implementing evidence into practice. Examining the complex and varied context of health care will help us address what Nilsen et al. ([6] p. 1) described as, ‘society’s health problems [that] require research-based knowledge acted on by healthcare practitioners together with implementation of political measures from governmental agencies’. This will help us shift implementation science to move ‘beyond a success or failure perspective towards improved analysis of variables that could explain the impact of the implementation process’ ([6] p. 2).

This review brings together 32 papers considering individual and multi-faceted interventions designed to support the use of evidence in clinical practice. The majority of reviews report strategies achieving small impacts (normally on processes of care). There is much less evidence that these strategies have shifted patient outcomes. Combined with the two previous reviews, 86 systematic reviews of strategies to increase the implementation of research into clinical practice have been conducted. As a whole, this substantial body of knowledge struggles to tell us more about the use of individual and multi-faceted interventions than: ‘it depends’. To really move forwards in addressing the gap between research evidence and practice, we may need to shift the emphasis away from isolating individual and multi-faceted interventions and towards better understanding and building more situated, relational and organisational capability to support the use of research in clinical practice. This will involve drawing on a wider range of perspectives, especially from the social, economic, political and behavioural sciences, in primary studies, and diversifying the types of synthesis undertaken to include approaches such as realist synthesis, which facilitate exploration of the context in which strategies are employed. Harvey et al. [53] suggest that when context is likely to be critical to implementation success there is a range of primary research approaches (participatory research, realist evaluation, developmental evaluation, ethnography, quality/rapid-cycle improvement) that are likely to be appropriate and insightful. While these approaches often form part of implementation studies in the form of process evaluations, they are usually relatively small scale in relation to implementation research as a whole. As a result, the findings often do not make it into the subsequent systematic reviews. This review provides further evidence that we need to bring qualitative approaches in from the periphery to play a central role in many implementation studies and subsequent evidence syntheses. It would be helpful for systematic reviews, at the very least, to include more detail about the interventions and their implementation in terms of how and why they worked.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

BA: Before and after study

CCT: Controlled clinical trial

EPOC: Effective Practice and Organisation of Care

HIC: High-income countries

ICT: Information and Communications Technology

ITS: Interrupted time series

KT: Knowledge translation

LMIC: Low- and middle-income countries

RCT: Randomised controlled trial

References

Grol R, Grimshaw J. From best evidence to best practice: effective implementation of change in patients’ care. Lancet. 2003;362:1225–30. https://doi.org/10.1016/S0140-6736(03)14546-1.

Green LA, Seifert CM. Translation of research into practice: why we can’t “just do it.” J Am Board Fam Pract. 2005;18:541–5. https://doi.org/10.3122/jabfm.18.6.541 .

Eccles MP, Mittman BS. Welcome to Implementation Science. Implement Sci. 2006;1:1–3. https://doi.org/10.1186/1748-5908-1-1 .

Powell BJ, Waltz TJ, Chinman MJ, Damschroder LJ, Smith JL, Matthieu MM, et al. A refined compilation of implementation strategies: results from the Expert Recommendations for Implementing Change (ERIC) project. Implement Sci. 2015;10:2–14. https://doi.org/10.1186/s13012-015-0209-1 .

Waltz TJ, Powell BJ, Matthieu MM, Damschroder LJ, et al. Use of concept mapping to characterize relationships among implementation strategies and assess their feasibility and importance: results from the Expert Recommendations for Implementing Change (ERIC) study. Implement Sci. 2015;10:1–8. https://doi.org/10.1186/s13012-015-0295-0 .

Nilsen P, Ståhl C, Roback K, et al. Never the twain shall meet? - a comparison of implementation science and policy implementation research. Implement Sci. 2013;8:2–12. https://doi.org/10.1186/1748-5908-8-63.

Rycroft-Malone J, Seers K, Eldh AC, et al. A realist process evaluation within the Facilitating Implementation of Research Evidence (FIRE) cluster randomised controlled international trial: an exemplar. Implement Sci. 2018;13:1–15. https://doi.org/10.1186/s13012-018-0811-0.

Boaz A, Baeza J, Fraser A, European Implementation Score Collaborative Group (EIS). Effective implementation of research into practice: an overview of systematic reviews of the health literature. BMC Res Notes. 2011;4:212. https://doi.org/10.1186/1756-0500-4-212 .

Grimshaw JM, Shirran L, Thomas R, Mowatt G, Fraser C, Bero L, et al. Changing provider behavior – an overview of systematic reviews of interventions. Med Care. 2001;39(8 Suppl 2):II2–45.

Squires JE, Sullivan K, Eccles MP, et al. Are multifaceted interventions more effective than single-component interventions in changing health-care professionals’ behaviours? An overview of systematic reviews. Implement Sci. 2014;9:1–22. https://doi.org/10.1186/s13012-014-0152-6 .

Salvador-Oliván JA, Marco-Cuenca G, Arquero-Avilés R. Development of an efficient search filter to retrieve systematic reviews from PubMed. J Med Libr Assoc. 2021;109:561–74. https://doi.org/10.5195/jmla.2021.1223 .

Thomas JM. Diffusion of innovation in systematic review methodology: why is study selection not yet assisted by automation? OA Evid Based Med. 2013;1:1–6.

Effective Practice and Organisation of Care (EPOC). The EPOC taxonomy of health systems interventions. EPOC Resources for review authors. Oslo: Norwegian Knowledge Centre for the Health Services; 2016. epoc.cochrane.org/epoc-taxonomy. Accessed 9 Oct 2023.

Jamal A, McKenzie K, Clark M. The impact of health information technology on the quality of medical and health care: a systematic review. Health Inf Manag. 2009;38:26–37. https://doi.org/10.1177/183335830903800305 .

Menon A, Korner-Bitensky N, Kastner M, et al. Strategies for rehabilitation professionals to move evidence-based knowledge into practice: a systematic review. J Rehabil Med. 2009;41:1024–32. https://doi.org/10.2340/16501977-0451 .

Oxman AD, Guyatt GH. Validation of an index of the quality of review articles. J Clin Epidemiol. 1991;44:1271–8. https://doi.org/10.1016/0895-4356(91)90160-b .

Francke AL, Smit MC, de Veer AJ, et al. Factors influencing the implementation of clinical guidelines for health care professionals: a systematic meta-review. BMC Med Inform Decis Mak. 2008;8:1–11. https://doi.org/10.1186/1472-6947-8-38 .

Jones CA, Roop SC, Pohar SL, et al. Translating knowledge in rehabilitation: systematic review. Phys Ther. 2015;95:663–77. https://doi.org/10.2522/ptj.20130512 .

Scott D, Albrecht L, O’Leary K, Ball GDC, et al. Systematic review of knowledge translation strategies in the allied health professions. Implement Sci. 2012;7:1–17. https://doi.org/10.1186/1748-5908-7-70 .

Wu Y, Brettle A, Zhou C, Ou J, et al. Do educational interventions aimed at nurses to support the implementation of evidence-based practice improve patient outcomes? A systematic review. Nurse Educ Today. 2018;70:109–14. https://doi.org/10.1016/j.nedt.2018.08.026 .

Yost J, Ganann R, Thompson D, Aloweni F, et al. The effectiveness of knowledge translation interventions for promoting evidence-informed decision-making among nurses in tertiary care: a systematic review and meta-analysis. Implement Sci. 2015;10:1–15. https://doi.org/10.1186/s13012-015-0286-1 .

Grudniewicz A, Kealy R, Rodseth RN, Hamid J, et al. What is the effectiveness of printed educational materials on primary care physician knowledge, behaviour, and patient outcomes: a systematic review and meta-analyses. Implement Sci. 2015;10:2–12. https://doi.org/10.1186/s13012-015-0347-5 .

Koota E, Kääriäinen M, Melender HL. Educational interventions promoting evidence-based practice among emergency nurses: a systematic review. Int Emerg Nurs. 2018;41:51–8. https://doi.org/10.1016/j.ienj.2018.06.004 .

Flodgren G, O’Brien MA, Parmelli E, et al. Local opinion leaders: effects on professional practice and healthcare outcomes. Cochrane Database Syst Rev. 2019. https://doi.org/10.1002/14651858.CD000125.pub5 .

Arditi C, Rège-Walther M, Durieux P, et al. Computer-generated reminders delivered on paper to healthcare professionals: effects on professional practice and healthcare outcomes. Cochrane Database Syst Rev. 2017. https://doi.org/10.1002/14651858.CD001175.pub4 .

Pantoja T, Grimshaw JM, Colomer N, et al. Manually-generated reminders delivered on paper: effects on professional practice and patient outcomes. Cochrane Database Syst Rev. 2019. https://doi.org/10.1002/14651858.CD001174.pub4 .

De Angelis G, Davies B, King J, McEwan J, et al. Information and communication technologies for the dissemination of clinical practice guidelines to health professionals: a systematic review. JMIR Med Educ. 2016;2:e16. https://doi.org/10.2196/mededu.6288 .

Brown A, Barnes C, Byaruhanga J, McLaughlin M, et al. Effectiveness of technology-enabled knowledge translation strategies in improving the use of research in public health: systematic review. J Med Internet Res. 2020;22:e17274. https://doi.org/10.2196/17274 .

Sykes MJ, McAnuff J, Kolehmainen N. When is audit and feedback effective in dementia care? A systematic review. Int J Nurs Stud. 2018;79:27–35. https://doi.org/10.1016/j.ijnurstu.2017.10.013 .

Bhatt NR, Czarniecki SW, Borgmann H, et al. A systematic review of the use of social media for dissemination of clinical practice guidelines. Eur Urol Focus. 2021;7:1195–204. https://doi.org/10.1016/j.euf.2020.10.008 .

Yamada J, Shorkey A, Barwick M, Widger K, et al. The effectiveness of toolkits as knowledge translation strategies for integrating evidence into clinical care: a systematic review. BMJ Open. 2015;5:e006808. https://doi.org/10.1136/bmjopen-2014-006808 .

Afari-Asiedu S, Abdulai MA, Tostmann A, et al. Interventions to improve dispensing of antibiotics at the community level in low and middle income countries: a systematic review. J Glob Antimicrob Resist. 2022;29:259–74. https://doi.org/10.1016/j.jgar.2022.03.009 .

Boonacker CW, Hoes AW, Dikhoff MJ, Schilder AG, et al. Interventions in health care professionals to improve treatment in children with upper respiratory tract infections. Int J Pediatr Otorhinolaryngol. 2010;74:1113–21. https://doi.org/10.1016/j.ijporl.2010.07.008 .

Al Zoubi FM, Menon A, Mayo NE, et al. The effectiveness of interventions designed to increase the uptake of clinical practice guidelines and best practices among musculoskeletal professionals: a systematic review. BMC Health Serv Res. 2018;18:2–11. https://doi.org/10.1186/s12913-018-3253-0 .

Ariyo P, Zayed B, Riese V, Anton B, et al. Implementation strategies to reduce surgical site infections: a systematic review. Infect Control Hosp Epidemiol. 2019;3:287–300. https://doi.org/10.1017/ice.2018.355 .

Borgert MJ, Goossens A, Dongelmans DA. What are effective strategies for the implementation of care bundles on ICUs: a systematic review. Implement Sci. 2015;10:1–11. https://doi.org/10.1186/s13012-015-0306-1 .

Cahill LS, Carey LM, Lannin NA, et al. Implementation interventions to promote the uptake of evidence-based practices in stroke rehabilitation. Cochrane Database Syst Rev. 2020. https://doi.org/10.1002/14651858.CD012575.pub2 .

Pedersen ER, Rubenstein L, Kandrack R, Danz M, et al. Elusive search for effective provider interventions: a systematic review of provider interventions to increase adherence to evidence-based treatment for depression. Implement Sci. 2018;13:1–30. https://doi.org/10.1186/s13012-018-0788-8 .

Jenkins HJ, Hancock MJ, French SD, Maher CG, et al. Effectiveness of interventions designed to reduce the use of imaging for low-back pain: a systematic review. CMAJ. 2015;187:401–8. https://doi.org/10.1503/cmaj.141183 .

Bennett S, Laver K, MacAndrew M, Beattie E, et al. Implementation of evidence-based, non-pharmacological interventions addressing behavior and psychological symptoms of dementia: a systematic review focused on implementation strategies. Int Psychogeriatr. 2021;33:947–75. https://doi.org/10.1017/S1041610220001702 .

Noonan VK, Wolfe DL, Thorogood NP, et al. Knowledge translation and implementation in spinal cord injury: a systematic review. Spinal Cord. 2014;52:578–87. https://doi.org/10.1038/sc.2014.62 .

Albrecht L, Archibald M, Snelgrove-Clarke E, et al. Systematic review of knowledge translation strategies to promote research uptake in child health settings. J Pediatr Nurs. 2016;31:235–54. https://doi.org/10.1016/j.pedn.2015.12.002 .

Campbell A, Louie-Poon S, Slater L, et al. Knowledge translation strategies used by healthcare professionals in child health settings: an updated systematic review. J Pediatr Nurs. 2019;47:114–20. https://doi.org/10.1016/j.pedn.2019.04.026 .

Bird ML, Miller T, Connell LA, et al. Moving stroke rehabilitation evidence into practice: a systematic review of randomized controlled trials. Clin Rehabil. 2019;33:1586–95. https://doi.org/10.1177/0269215519847253 .

Goorts K, Dizon J, Milanese S. The effectiveness of implementation strategies for promoting evidence informed interventions in allied healthcare: a systematic review. BMC Health Serv Res. 2021;21:1–11. https://doi.org/10.1186/s12913-021-06190-0 .

Zadro JR, O’Keeffe M, Allison JL, Lembke KA, et al. Effectiveness of implementation strategies to improve adherence of physical therapist treatment choices to clinical practice guidelines for musculoskeletal conditions: systematic review. Phys Ther. 2020;100:1516–41. https://doi.org/10.1093/ptj/pzaa101 .

Van der Veer SN, Jager KJ, Nache AM, et al. Translating knowledge on best practice into improving quality of RRT care: a systematic review of implementation strategies. Kidney Int. 2011;80:1021–34. https://doi.org/10.1038/ki.2011.222 .

Pawson R, Greenhalgh T, Harvey G, et al. Realist review – a new method of systematic review designed for complex policy interventions. J Health Serv Res Policy. 2005;10(Suppl 1):21–34. https://doi.org/10.1258/1355819054308530.

Rycroft-Malone J, McCormack B, Hutchinson AM, et al. Realist synthesis: illustrating the method for implementation research. Implement Sci. 2012;7:1–10. https://doi.org/10.1186/1748-5908-7-33.

Johnson MJ, May CR. Promoting professional behaviour change in healthcare: what interventions work, and why? A theory-led overview of systematic reviews. BMJ Open. 2015;5:e008592. https://doi.org/10.1136/bmjopen-2015-008592 .

Metz A, Jensen T, Farley A, Boaz A, et al. Is implementation research out of step with implementation practice? Pathways to effective implementation support over the last decade. Implement Res Pract. 2022;3:1–11. https://doi.org/10.1177/26334895221105585 .

May CR, Finch TL, Cornford J, Exley C, et al. Integrating telecare for chronic disease management in the community: What needs to be done? BMC Health Serv Res. 2011;11:1–11. https://doi.org/10.1186/1472-6963-11-131 .

Harvey G, Rycroft-Malone J, Seers K, Wilson P, et al. Connecting the science and practice of implementation – applying the lens of context to inform study design in implementation research. Front Health Serv. 2023;3:1–15. https://doi.org/10.3389/frhs.2023.1162762 .

Acknowledgements

The authors would like to thank Professor Kathryn Oliver for her support in planning the review, Professor Steve Hanney for reading and commenting on the final manuscript, and the staff at the LSHTM library for their support in planning and conducting the literature search.

Funding

This study was supported by LSHTM’s Research England QR strategic priorities funding allocation and the National Institute for Health and Care Research (NIHR) Applied Research Collaboration South London (NIHR ARC South London) at King’s College Hospital NHS Foundation Trust. Grant number NIHR200152. The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care or Research England.

Author information

Authors and Affiliations

Health and Social Care Workforce Research Unit, The Policy Institute, King’s College London, Virginia Woolf Building, 22 Kingsway, London, WC2B 6LE, UK

Annette Boaz

King’s Business School, King’s College London, 30 Aldwych, London, WC2B 4BG, UK

Juan Baeza & Alec Fraser

Federal University of Santa Catarina (UFSC), Campus Universitário Reitor João Davi Ferreira Lima, Florianópolis, SC, 88.040-900, Brazil

Erik Persson

Contributions

AB led the conceptual development and structure of the manuscript. EP conducted the searches and data extraction. All authors contributed to screening and quality appraisal. EP and AF wrote the first draft of the methods section. AB, JB and AF performed result synthesis and contributed to the analyses. AB wrote the first draft of the manuscript and incorporated feedback and revisions from all other authors. All authors revised and approved the final manuscript.

Corresponding author

Correspondence to Annette Boaz.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Appendix A.

Additional file 2: Appendix B.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Boaz, A., Baeza, J., Fraser, A. et al. ‘It depends’: what 86 systematic reviews tell us about what strategies to use to support the use of research in clinical practice. Implementation Sci 19, 15 (2024). https://doi.org/10.1186/s13012-024-01337-z

Received: 01 November 2023

Accepted: 05 January 2024

Published: 19 February 2024

DOI: https://doi.org/10.1186/s13012-024-01337-z

Keywords

  • Implementation
  • Interventions
  • Clinical practice
  • Research evidence
  • Multi-faceted

TJ Honors TWO Best Research Papers for 2020

For the first time since the award’s inception in 2006, two papers have been selected as TAPPI Journal Best Research Paper for 2020. “The six papers that were nominated this year covered a wide range of the entire pulp, paper, and supplier industries, making the selection extremely challenging,” said Peter W. Hart, Ph.D., TAPPI Journal editor-in-chief. “On behalf of the TJ Editorial Board, I’d like to congratulate both winners and the other nominated papers. They did wonderful work last year.”

Each year, the TJ Editorial Board honors the best of the journal’s content by nominating and voting for the TAPPI Journal Best Research Paper, which is ultimately selected based on scientific merit, innovation, creativity, and clarity. The board currently comprises 18 experts from academia and industry.

One of the winning papers, “Creasing severity and reverse-side cracking,” was co-authored by Joel C. Panek, research scientist, WestRock; Swan D. Smith, R&D researcher, WestRock; and Douglas W. Coffin, professor, Miami University. The paper physics research appeared in TJ’s April 2020 issue.

“It was great to see a paper in the running from North America on paper physics, because it’s exciting that this important area of research is making a resurgence in the region,” said Hart.

In addition to receiving Best Research Paper honors, Panek will also be awarded the Honghi Tran TAPPI Journal Best Research Paper Prize. The US$2,000 cash prize is endowed by Professor Emeritus Honghi Tran, Ph.D., of the University of Toronto and author or co-author of more than 80 papers published in TJ.

Tran’s body of work also includes co-authoring the second winning paper for 2020, “Modeling of the energy of a smelt-water explosion in the recovery boiler dissolving tank,” which appeared in the August 2020 issue.

The first author of the winning recovery boiler paper was Eric Jin, a former student of Tran’s who is now a performance engineering specialist with The Babcock & Wilcox Company. In addition to Tran, the paper’s co-authors included Malcom Mackenzie and Steve Osborne, technology manager and advisory engineer, respectively, for The Babcock & Wilcox Company.

“Under the tutelage of Tran, teams continue to add high-quality content in the liquor cycle, kiln, and recausticizing arenas,” said Hart. “This makes the TJ Editorial Board’s selection even more difficult.”

Tran established the prize in 2019 to encourage and reward the publication of high-quality research in TJ. Since his former student, Jin, was co-winner of the 2020 Best Research Paper Award, the prize was awarded to Panek as first author of the other paper.

“The TJ Editorial Board would like to thank Dr. Tran for establishing the TAPPI Journal Best Research Paper Prize,” said Hart. “Our hope is that it infuses other groups with a healthy dose of competitive spirit, resulting in additional high-quality research papers from across the industry and world to be considered in future Best Paper decisions.”

GPT-3 & Beyond: 10 NLP Research Papers You Should Read

November 17, 2020 by Mariya Yao

NLP research advances in 2020 are still dominated by large pre-trained language models, and specifically transformers. There were many interesting updates introduced this year that have made transformer architecture more efficient and applicable to long documents.

Another hot topic relates to the evaluation of NLP models in different applications. We still lack evaluation approaches that clearly show where a model fails and how to fix it.

Also, with the growing capabilities of language models such as GPT-3, conversational AI is enjoying a new wave of interest. Chatbots are improving, with several impressive bots like Meena and Blender introduced this year by top technology companies.

To help you stay up to date with the latest NLP research breakthroughs, we’ve curated and summarized the key research papers in natural language processing from 2020. The papers cover the leading language models, updates to the transformer architecture, novel evaluation approaches, and major advances in conversational AI.

If you’d like to skip around, here are the papers we featured:

  • WinoGrande: An Adversarial Winograd Schema Challenge at Scale
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • Reformer: The Efficient Transformer
  • Longformer: The Long-Document Transformer
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  • Language Models are Few-Shot Learners
  • Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
  • Towards a Human-like Open-Domain Chatbot
  • Recipes for Building an Open-Domain Chatbot

Best NLP Research Papers 2020

1. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Original Abstract

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. 

To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. 

Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.

Our Summary 

The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande , a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm. The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
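
To make the bias-reduction step more concrete, here is a minimal Python sketch of AfLite-style adversarial filtering. It is a simplification of the published algorithm (which filters iteratively in batches), and everything here, from the scikit-learn probe to the threshold value, is an illustrative assumption rather than the authors’ code.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aflite_filter(embeddings, labels, n_rounds=10, train_frac=0.5, tau=0.75):
        """Drop instances that cheap linear probes solve too easily, i.e.
        instances answerable from embedding artifacts rather than reasoning."""
        n = len(labels)
        correct, counts = np.zeros(n), np.zeros(n)
        rng = np.random.default_rng(0)
        for _ in range(n_rounds):
            idx = rng.permutation(n)
            split = int(train_frac * n)
            train, held = idx[:split], idx[split:]
            probe = LogisticRegression(max_iter=1000)
            probe.fit(embeddings[train], labels[train])
            correct[held] += probe.predict(embeddings[held]) == labels[held]
            counts[held] += 1
        predictability = np.divide(correct, counts, out=np.zeros(n), where=counts > 0)
        return predictability < tau  # keep only hard-to-predict instances

The instances retained by such a filter are, by construction, the ones a linear classifier over the embeddings cannot reliably solve, which is the intuition behind moving from human-detectable to machine-detectable bias.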

What’s the core idea of this paper?

  • The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense reasoning.
  • Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers.
  • Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected questions, 53K were deemed valid.
  • The novel AfLite algorithm generalizes human-detectable biases based on word occurrences to machine-detectable biases based on embedding occurrences.
  • After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples. 

What’s the key achievement?

  • On the debiased WinoGrande test set, Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%), RoBERTa achieves 79.1% accuracy, whereas human performance reaches 94% accuracy.
  • Fine-tuning RoBERTa with WinoGrande establishes new state-of-the-art results on five related benchmarks: 90.1% on WSC; 93.1% on DPR; 90.6% on COPA; 85.6% on KnowRef; and 97.1% on Winogender.

What does the AI community think?

  • The paper received the Outstanding Paper Award at AAAI 2020, one of the key conferences in artificial intelligence.

What are future research areas?

  • Exploring new algorithmic approaches for systematic bias reduction.
  • Debiasing other NLP benchmarks.

Where can you get implementation code?

  • The dataset can be downloaded from the WinoGrande project page .
  • The implementation code is available on GitHub .
  • And here is the WinoGrande leaderboard .

2. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Original Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

Our Summary

The Google research team suggests a unified approach to transfer learning in NLP with the goal of setting a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
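
To illustrate the text-to-text framing, here is a short sketch assuming the Hugging Face transformers library (not the authors’ original codebase); the task prefixes follow the convention described in the paper.

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Every task is cast as text in, text out; the prefix tells T5 what to do.
    prompts = [
        "translate English to German: The house is wonderful.",
        "summarize: state authorities dispatched emergency crews tuesday ...",
        "cola sentence: The course is jumping well.",  # grammatical acceptability
    ]
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        output = model.generate(input_ids, max_new_tokens=40)
        print(tokenizer.decode(output[0], skip_special_tokens=True))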

What’s the core idea of this paper?

  • Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques.
  • Casting every problem as text-to-text: the model understands which task should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”).
  • Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4).
  • Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5), on the C4 dataset.

What’s the key achievement?

  • Setting a new state of the art on a number of benchmarks, including:
  • the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks;
  • the Exact Match score of 90.06 on the SQuAD dataset;
  • the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
  • the ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.

What are future research areas?

  • Researching the methods to achieve stronger performance with cheaper models.
  • Exploring more efficient knowledge extraction techniques.
  • Further investigating the language-agnostic models.

What are possible business applications?

  • Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.

Where can you get implementation code?

  • The pretrained models together with the dataset and code are released on GitHub.

3. Reformer: The Efficient Transformer, by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Original Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

Our Summary

The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest (1) using reversible layers to allow storing the activations only once instead of for each layer, and (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
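
A minimal PyTorch sketch of the reversible-residual idea follows (the LSH attention part is omitted). The toy Linear sublayers stand in for the attention and feed-forward modules; the point is that inputs can be recomputed exactly from outputs, so activations need not be stored per layer.

    import torch

    class ReversibleBlock(torch.nn.Module):
        """RevNet-style block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
        def __init__(self, f, g):
            super().__init__()
            self.f, self.g = f, g  # stand-ins for attention / feed-forward

        def forward(self, x1, x2):
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
            return y1, y2

        def inverse(self, y1, y2):
            x2 = y2 - self.g(y1)  # recompute inputs instead of storing them
            x1 = y1 - self.f(x2)
            return x1, x2

    block = ReversibleBlock(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16))
    x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
    r1, r2 = block.inverse(*block(x1, x2))
    print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))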

What’s the core idea of this paper?

  • Large Transformer models are prohibitively costly to train because:
  • the activations of every layer need to be stored for back-propagation;
  • the intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations;
  • the complexity of attention on a sequence of length L is O(L²).
  • The Reformer addresses these problems by:
  • using reversible layers to store only a single copy of activations;
  • splitting activations inside the feed-forward layers and processing them in chunks;
  • approximating attention computation based on locality-sensitive hashing.

What’s the key achievement?

  • Matching the performance of the full Transformer while running much faster and with much better memory efficiency on long sequences by:
  • switching to locality-sensitive hashing attention;
  • using reversible layers.
  • For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani et al.’s (2017) BLEU score of 27.3.

What does the AI community think?

  • The paper was selected for oral presentation at ICLR 2020, the leading conference in deep learning.

What are possible business applications?

  • An efficient Transformer that handles long sequences could benefit applications such as:
  • text generation;
  • visual content generation;
  • music generation;
  • time-series forecasting.

Where can you get implementation code?

  • The official code implementation from Google is publicly available on GitHub.
  • The PyTorch implementation of Reformer is also available on GitHub.

4. Longformer: The Long-Document Transformer, by Iz Beltagy, Matthew E. Peters, Arman Cohan

Original Abstract

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Our Summary

Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
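
The attention pattern itself is easy to picture as a boolean mask. The NumPy sketch below combines a local window with a few global positions; it only builds the mask (the authors’ actual banded computation relies on a custom CUDA kernel), and the parameter names are illustrative.

    import numpy as np

    def longformer_mask(seq_len, window, global_idx=()):
        """True = attention allowed; local windows plus a few global tokens."""
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for i in range(seq_len):
            lo, hi = max(0, i - window // 2), min(seq_len, i + window // 2 + 1)
            mask[i, lo:hi] = True  # each token attends to a local window
        for g in global_idx:
            mask[g, :] = True      # global tokens attend everywhere
            mask[:, g] = True      # and everything attends to them
        return mask

    m = longformer_mask(seq_len=4096, window=512, global_idx=(0,))
    print(m.sum(), "allowed pairs vs", 4096 * 4096, "for full attention")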

What’s the core idea of this paper?

  • The computational requirements of self-attention grow quadratically with sequence length, making long documents hard to process on current hardware.
  • The Longformer’s attention mechanism combines:
  • a windowed local-context self-attention to build contextual representations;
  • an end task motivated global attention to encode inductive bias about the task and build full sequence representation.
  • This pattern allows memory usage to scale linearly, and not quadratically, with the sequence length.
  • Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in the existing deep learning libraries like PyTorch and Tensorflow, the authors also introduce a custom CUDA kernel for implementing these attention operations.

What’s the key achievement?

  • Achieving state-of-the-art results on character-level language modeling:
  • BPC of 1.10 on text8;
  • BPC of 1.00 on enwik8.
  • After pre-training and fine-tuning, consistently outperforming RoBERTa on long-document tasks:
  • accuracy of 75.0 vs. 72.4 on WikiHop;
  • F1 score of 75.2 vs. 74.2 on TriviaQA;
  • joint F1 score of 64.4 vs. 63.5 on HotpotQA;
  • average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
  • accuracy of 95.7 vs. 95.3 on the IMDB classification task;
  • F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
  • The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).

What are future research areas?

  • Exploring other attention patterns that are more efficient due to dynamic adaptation to the input.
  • Applying Longformer to other relevant long document tasks such as summarization.

What are possible business applications?

  • Processing long documents directly could benefit tasks such as:
  • document classification;
  • question answering;
  • coreference resolution;
  • summarization;
  • semantic search.

Where can you get implementation code?

  • The code implementation of Longformer is open-sourced on GitHub.

5. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Original Abstract

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Our Summary

The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%). As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement. As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
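
The sketch below shows, in PyTorch, how replaced-token-detection inputs and discriminator labels can be constructed. The mask token id, the vocabulary size and the dummy generator are illustrative stand-ins (the real generator is a small masked language model), so treat this as a sketch of the idea rather than the authors’ implementation.

    import torch

    MASK_ID, VOCAB = 103, 30522  # illustrative BERT-style constants

    def rtd_batch(tokens, generator, mask_prob=0.15):
        """Corrupt a batch ELECTRA-style and label every position."""
        mask = torch.rand(tokens.shape) < mask_prob
        logits = generator(tokens.masked_fill(mask, MASK_ID))  # (B, T, VOCAB)
        samples = torch.distributions.Categorical(logits=logits).sample()
        corrupted = torch.where(mask, samples, tokens)
        labels = (corrupted != tokens).long()  # 1 = replaced, 0 = original
        return corrupted, labels

    dummy_generator = lambda x: torch.randn(*x.shape, VOCAB)  # stand-in network
    tokens = torch.randint(0, VOCAB, (2, 12))
    corrupted, labels = rtd_batch(tokens, dummy_generator)
    # The discriminator is trained against `labels` at *all* positions,
    # which is why ELECTRA learns from every input token.
    print(labels)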

What’s the core idea of this paper?

  • Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning.
  • In the proposed replaced token detection task:
  • some tokens are replaced by samples from a small generator network;
  • a model is pre-trained as a discriminator to distinguish between original and replaced tokens.
  • This pre-training task:
  • enables the model to learn from all input tokens instead of the small masked-out subset;
  • is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood.

What’s the key achievement?

  • Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning:
  • ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8;
  • an ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute;
  • ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.

What does the AI community think?

  • The paper was selected for presentation at ICLR 2020, the leading conference in deep learning.

What are possible business applications?

  • Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.

Where can you get implementation code?

  • The original TensorFlow implementation and pre-trained weights are released on GitHub.

6. Language Models are Few-Shot Learners, by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Original Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Our Summary

The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. They test their solution by training a 175B-parameter autoregressive language model, called GPT-3, and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
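
Because tasks are specified purely via text, a few-shot ‘training set’ is nothing more than a prompt. The helper below assembles one; the Q:/A: layout is an illustrative convention, not the exact prompt format used in the paper.

    def few_shot_prompt(task_description, demonstrations, query):
        """Build a GPT-3-style prompt: task description, K worked examples,
        then the query; the model is conditioned on this text with no
        gradient updates."""
        lines = [task_description, ""]
        for source, target in demonstrations:
            lines += [f"Q: {source}", f"A: {target}", ""]
        lines += [f"Q: {query}", "A:"]
        return "\n".join(lines)

    print(few_shot_prompt(
        "Translate English to French.",
        [("cheese", "fromage"), ("house", "maison")],  # K = 2 demonstrations
        "sea otter",
    ))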

What’s the core idea of this paper?

  • The GPT-3 model uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
  • However, in contrast to GPT-2, it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, as in the Sparse Transformer.
  • The model is evaluated in three settings, with tasks and demonstrations specified purely via text:
  • Few-shot learning, when the model is given a few demonstrations of the task (typically, 10 to 100) at inference time but with no weight updates allowed.
  • One-shot learning, when only one demonstration is allowed, together with a natural language description of the task.
  • Zero-shot learning, when no demonstrations are allowed and the model has access only to a natural language description of the task.

What’s the key achievement?

  • Showing that, without any fine-tuning, a sufficiently large language model achieves promising, and sometimes state-of-the-art, results:
  • on the CoQA benchmark, 81.5 F1 in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting, compared to the 90.7 F1 score achieved by fine-tuned SOTA;
  • on the TriviaQA benchmark, 64.3% accuracy in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, surpassing the state of the art (68%) by 3.2%;
  • on the LAMBADA dataset, 76.2% accuracy in the zero-shot setting, 72.5% in the one-shot setting, and 86.4% in the few-shot setting, surpassing the state of the art (68%) by 18%.
  • The news articles generated by the 175B-parameter GPT-3 model are hard to distinguish from real ones, according to human evaluations (with accuracy barely above the chance level at ~52%).

What does the AI community think?

  • “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.” – Sam Altman, CEO and co-founder of OpenAI.
  • “I’m shocked how hard it is to generate text about Muslims from GPT-3 that has nothing to do with violence… or being killed…” – Abubakar Abid, CEO and founder of Gradio.
  • “No. GPT-3 fundamentally does not understand the world that it talks about. Increasing corpus further will allow it to generate a more credible pastiche but not fix its fundamental lack of comprehension of the world. Demos of GPT-4 will still require human cherry picking.” – Gary Marcus, CEO and founder of Robust.ai.
  • “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.” – Geoffrey Hinton, Turing Award winner.

What are future research areas?

  • Improving pre-training sample efficiency.
  • Exploring how few-shot learning works.
  • Distillation of large models down to a manageable size for real-world applications.

What are possible business applications?

  • The model with 175B parameters is hard to apply to real business problems due to its impractical resource requirements, but if the researchers manage to distill this model down to a workable size, it could be applied to a wide range of language tasks, including question answering, dialog agents, and ad copy generation.

Where can you get implementation code?

  • The code itself is not available, but some dataset statistics together with unconditional, unfiltered 2048-token samples from GPT-3 are released on GitHub.

7. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Original Abstract

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

The authors point out the shortcomings of existing approaches to evaluating the performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new methodology for testing NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.

CheckList

  • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
  • The alternative approaches are usually designed for the evaluation of specific behaviors on individual tasks and thus lack comprehensiveness.
  • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
  • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests under certain perturbations (a minimal invariance-test sketch follows this list).
  • Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
  • The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily.
  • Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
  • User studies demonstrated that CheckList helps users to identify and test for capabilities not previously considered, results in more thorough and comprehensive testing of previously considered capabilities, and helps to discover many more actionable bugs.
  • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
  • CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
  • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
  • The code for testing NLP models with CheckList is available on GitHub.
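
To make the invariance (INV) test type concrete, here is a minimal sketch in plain Python, rather than the authors' released library: fill a template with different names and check that a sentiment classifier's prediction does not change. The `predict_sentiment` function is a hypothetical toy stand-in for whatever model is under test.

```python
# A CheckList-style invariance (INV) test: label-preserving perturbations
# (here, swapping person names) should leave the prediction unchanged.
def predict_sentiment(text):
    # Hypothetical toy classifier, standing in for a real model under test.
    return "negative" if "terrible" in text else "positive"

def invariance_test(template, names):
    predictions = {name: predict_sentiment(template.format(name=name))
                   for name in names}
    reference = predictions[names[0]]
    # Any name whose prediction disagrees with the reference is a failure.
    return {name: pred for name, pred in predictions.items() if pred != reference}

failures = invariance_test("{name} delivered a terrible performance.",
                           ["Anna", "Omar", "Wei"])
print("Failures:", failures if failures else "none - test passed")
```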

8. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when only the few best systems are compared; in some instances, the correlations were even negative. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the same correlation coefficient can reflect different patterns of errors (type I vs. type II), and determining what magnitude of difference in the metric score corresponds to a true improvement in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and that other metrics, such as chrF, YiSi-1, and ESIM, should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.

Tangled up in BLEU

  • Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
  • For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
  • The identified problem with Pearson’s correlation is due to the small sample size and is not specific to comparing strong MT systems.
  • Outlier systems, whose quality is much higher or lower than that of the rest of the systems, have a disproportionate effect on the computed correlation and should be removed (a synthetic illustration follows this list).
  • The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
  • Small BLEU differences of 1–2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
  • The authors recommend giving preference to evaluation metrics such as chrF, YiSi-1, and ESIM over BLEU and TER.
  • They also recommend moving away from using small changes in evaluation metrics as the sole basis for drawing important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another.
  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing. 
  • The implementation code, data, and additional analysis will be released on GitHub.
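
The outlier effect is easy to reproduce. The sketch below uses hypothetical scores, not the paper's data: five closely matched systems plus one very weak outlier. A single outlier is enough to push the Pearson correlation between metric and human scores close to 1, while the correlation over the competitive systems alone is weak and can even flip sign.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical system-level scores: five competitive systems, one outlier.
human = np.array([0.10, 0.12, 0.11, 0.13, 0.12, -0.90])   # human judgments
metric = np.array([25.0, 24.0, 26.0, 25.5, 24.5, 10.0])   # e.g., BLEU

r_all, _ = pearsonr(human, metric)            # dominated by the outlier
r_top, _ = pearsonr(human[:-1], metric[:-1])  # competitive systems only
print(f"Pearson r with outlier:    {r_all:.2f}")   # close to 1
print(f"Pearson r without outlier: {r_top:.2f}")   # weak, here slightly negative
```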

9. Towards a Human-like Open-Domain Chatbot, by Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated. 

In contrast to most modern conversational agents, which are highly specialized, the Google research team introduces Meena, a chatbot that can chat about virtually anything. It’s built on a large neural network with 2.6B parameters trained on 341 GB of text. The researchers also propose a new human evaluation metric for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which captures important attributes of human conversation. They demonstrate that this metric correlates highly with perplexity, an automatic metric that is readily available. Thus, the Meena chatbot, which is trained to minimize perplexity, can conduct conversations that are more sensible and specific than those of other chatbots. In particular, the experiments demonstrate that Meena outperforms existing state-of-the-art chatbots by a large margin in terms of the SSA score (79% vs. 56%) and is closing the gap with human performance (86%).

Meena chatbot

  • Despite recent progress, open-domain chatbots still have significant weaknesses: their responses often do not make sense or are too vague or generic.
  • Meena is built on a seq2seq model with Evolved Transformer (ET) that includes 1 ET encoder block and 13 ET decoder blocks.
  • The model is trained on multi-turn conversations with the input sequence including all turns of the context (up to 7) and the output sequence being the response.
  • The proposed SSA metric scores each chatbot response on two attributes: making sense and being specific (a minimal sketch of the computation follows this list).
  • The research team discovered that the SSA metric correlates strongly with perplexity (R² = 0.93), a readily available automatic metric that Meena is trained to minimize: the lower the perplexity, the higher the SSA score.
  • Proposing a simple human-evaluation metric for open-domain chatbots.
  • The best end-to-end trained Meena model outperforms existing state-of-the-art open-domain chatbots by a large margin, achieving an SSA score of 72% (vs. 56%).
  • Furthermore, the full version of Meena, with a filtering mechanism and tuned decoding, further advances the SSA score to 79%, which is not far from the 86% SSA achieved by the average human.
  • “Google’s “Meena” chatbot was trained on a full TPUv3 pod (2048 TPU cores) for 30 full days – that’s more than $1,400,000 of compute time to train this chatbot model.” – Elliot Turner, CEO and founder of Hyperia.
  • “So I was browsing the results for the new Google chatbot Meena, and they look pretty OK (if boring sometimes). However, every once in a while it enters ‘scary sociopath mode,’ which is, shall we say, sub-optimal” – Graham Neubig, Associate professor at Carnegie Mellon University.
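
As a minimal sketch of how an SSA-style score can be computed from crowdworker labels (hypothetical data, not the authors' evaluation code): each response receives two binary labels, whether it makes sense in context and whether it is specific to that context, and SSA is the average of the two per-label rates.

```python
# Hypothetical crowdworker labels for four chatbot responses; a response
# judged not sensible is also counted as not specific.
labels = [
    {"sensible": 1, "specific": 1},
    {"sensible": 1, "specific": 0},
    {"sensible": 0, "specific": 0},
    {"sensible": 1, "specific": 1},
]

sensibleness = sum(lab["sensible"] for lab in labels) / len(labels)
specificity = sum(lab["specific"] for lab in labels) / len(labels)
ssa = (sensibleness + specificity) / 2  # Sensibleness and Specificity Average
print(f"sensibleness={sensibleness:.0%}, specificity={specificity:.0%}, SSA={ssa:.0%}")
```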

Meena chatbot

  • Lowering the perplexity through improvements in algorithms, architectures, data, and compute.
  • Considering other aspects of conversations beyond sensibleness and specificity, such as personality and factuality.
  • Tackling safety and bias in the models.
  • Potential applications of such a chatbot include further humanizing computer interactions, improving foreign language practice, and making interactive movie and videogame characters relatable.
  • Considering the challenges related to safety and bias in the models, the authors haven’t released the Meena model yet. However, they are still evaluating the risks and benefits and may decide to release it in the coming months.

10. Recipes for Building an Open-Domain Chatbot, by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models. 

The Facebook AI Research team shows that with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. Thus, to build their state-of-the-art conversational agent, called BlenderBot, they leveraged a model with 9.4B parameters, trained it on a novel task called Blended Skill Talk, and deployed beam search with carefully selected hyperparameters as a generation strategy. Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.

BlenderBot

  • Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
  • Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
  • Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, it was demonstrated that the length of the agent’s utterances is very important for chatbot performance (i.e., responses that are too short are often considered dull, while responses that are too long make the chatbot appear to waffle and not listen). A minimal decoding sketch appears at the end of this section.
  • In pairwise human evaluations, BlenderBot was preferred over Meena 75% of the time in terms of engagingness and 65% of the time in terms of humanness.
  • In an A/B comparison between human-to-human and human-to-BlenderBot conversations, the latter were preferred 49% of the time as more engaging.
  • The model’s main limitations are a lack of in-depth knowledge if sufficiently interrogated, a tendency to use simpler language, and a tendency to repeat oft-used phrases.
  • Further exploring unlikelihood training and retrieve-and-refine mechanisms as potential avenues for fixing these issues.
  • Facebook AI open-sourced BlenderBot by releasing code to fine-tune the conversational agent, the model weights, and code to evaluate it.
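
For readers who want to try the decoding strategy, here is a minimal sketch using the Hugging Face transformers checkpoint facebook/blenderbot-400M-distill, a distilled public release related to (but not identical to) the models in the paper; the original recipe lives in ParlAI.

```python
# Beam-search decoding with a minimum response length, the kind of tuned
# generation strategy the paper found important: too-short replies read
# as dull, too-long ones as waffling.
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"  # distilled public checkpoint
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Hello, how are you today?", return_tensors="pt")
reply_ids = model.generate(**inputs, num_beams=10, min_length=20, max_length=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```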

If you like these research summaries, you might also be interested in the following articles:

  • 2020’s Top AI & Machine Learning Research Papers
  • Novel Computer Vision Research Papers From 2020
  • AAAI 2021: Top Research Papers With Business Applications
  • ICLR 2021: Key Research Papers


About Mariya Yao

Mariya is the co-author of Applied AI: A Handbook For Business Leaders and former CTO at Metamaven. She "translates" arcane technical concepts into actionable business advice for executives and designs lovable products people actually want to use. Follow her on Twitter at @thinkmariya to raise your AI IQ.


14 February 2024

Largest post-pandemic survey finds trust in scientists is high

By Carissa Wong


Image: protestors during the 2018 March for Science in Washington DC. The global survey indicated that people have moderately high levels of trust in scientists overall. Credit: Michael Candelori/Pacific Press Via Zuma Wire/Shutterstock

People around the world have high levels of trust in scientists, and most want researchers to get more involved in policymaking, finds a global survey with more than 70,000 participants. But trust levels are influenced by political orientation and differ among nations, according to the study, which was described in a preprint posted online last month [1].


“The overall message is positive,” says James Liu, a psychologist at Massey University in Auckland, New Zealand. “Even in the wake of the COVID-19 pandemic, which could have been highly polarizing for people’s trust in scientists, trust levels are fairly high across a range of demographics.”

“The researchers use a more robust measure of trust compared to previous studies that focus on just one or two dimensions,” says Nan Li, who studies how the public engages with science at the University of Wisconsin–Madison. “I really admired the authors’ ambitions of doing this type of study, which includes researchers from all over the world.”

The scope makes the survey one of the largest studies on trust in scientists to be conducted since the onset of the pandemic.

Worldwide attitudes

Social scientist Viktoria Cologna at Leibniz University Hannover, Germany, and her colleagues surveyed 71,417 people in 67 countries. In most places, the researchers recruited participants online through marketing companies, with the exception of the Democratic Republic of the Congo, where they used in-person surveys. Respondents were asked to indicate how much they agreed with a dozen statements about the integrity, competency, benevolence and openness of scientists, on a scale of 1 to 5. A higher score indicated higher trust.


Across all participants, the average trust score was moderately high, at 3.62. On a global scale, participants perceived scientists as having high competence, moderate integrity and benevolent intentions. The overall rating of openness to feedback was lower: 23% of participants think that scientists pay only somewhat or very little attention to other views. Three-quarters of people agreed that scientific methods are the best way to find out whether something is true.

Participants from Egypt had the most trust in scientists, followed by India and Nigeria; in Albania, Kazakhstan and Bolivia, people had the least trust. Participants in countries including the United States, United Kingdom, Australia and China had above-average levels of trust in scientists, whereas those in Germany, Hong Kong and Japan had below-average trust levels.

Trust and politics

The study also explored the links between participants’ trust in scientists and their political leanings. At the global level, a ‘left-leaning’ political orientation was linked to higher trust. The team saw this association at the country level in Canada, the United States, the United Kingdom, Norway and China. But of the 67 countries surveyed, in 41 — including New Zealand, Argentina and Mexico — the team found no significant association between political orientation and trust. And in some countries, including Georgia, Egypt, the Philippines, Nigeria and Greece, left-leaning views were linked to lower trust.

“These contrasting findings may be explained by the fact that in some countries right-leaning parties may have cultivated reservations against scientists among their supporters, while in other countries left-leaning parties may have done so,” the researchers say in the preprint. For example, New Democracy, Greece’s right-wing ruling party, has since 2020 consistently cooperated with researchers in implementing a public-health agenda, which could explain why in that country a right-leaning political orientation is linked to higher trust in scientists.

“It’s about the leadership of political parties and how they treat scientists,” says Liu. The concept of a right- or left-wing political orientation can also differ among people in different countries, making it hard to interpret the findings.

Chart (‘Engaging with policy’): bar charts showing results of a survey asking how researchers should communicate their work. Source: Ref. 1

More than half of the respondents think that researchers should be more involved in policymaking and should work closely with politicians to integrate scientific results into policymaking (see ‘Engaging with policy’). “These results are intuitive — if people trust scientists, they will want them to be involved,” says Liu.

“But entering the public policy arena as a scientist can end up being a kind of blood sport,” he says. “We see that with, say, climate scientists being disregarded and doubted by some politicians.”

Liu thinks that there needs to be more training for scientists who want to enter policymaking, and that many researchers need to improve their communication skills, “so we’re ready for that rough and tumble arena of public policy”. The study found that 80% of people think researchers should communicate about science with the general public.

Although the study provides a general snapshot of trust in researchers, people’s trust levels will also vary depending on scientists’ fields, says Li.

The team plans to make the global data set openly accessible online, to help other researchers study the topic.

Nature 626, 704 (2024)

doi: https://doi.org/10.1038/d41586-024-00420-1

1. Cologna, V. et al. Preprint at OSF Preprints https://doi.org/10.31219/osf.io/6ay7s (2024).

