
GenEd Writes: Writing Across the Disciplines at Harvard College

Comparative Analysis

What It Is and Why It's Useful

Comparative analysis asks writers to make an argument about the relationship between two or more texts. Beyond that, there's a lot of variation, but three overarching kinds of comparative analysis stand out:

  • Coordinate (A ↔ B): In this kind of analysis, two (or more) texts are being read against each other in terms of a shared element, e.g., a memoir and a novel, both by Jesmyn Ward; two sets of data for the same experiment; a few op-ed responses to the same event; two YA books written in Chicago in the 2000s; a film adaptation of a play; etc.
  • Subordinate (A → B) or (B → A): In this kind of analysis, a theoretical text is used (as a "lens") to explain a case study or work of art (e.g., how Anthony Jack's The Privileged Poor can help explain divergent experiences among students at elite four-year private colleges who come from similar socio-economic backgrounds), or a work of art or case study is used (as a "test" of a theory's usefulness or limitations) (e.g., using coverage of recent incidents of gun violence or gun legislation in the U.S. to confirm or question the currency of Carol Anderson's The Second).
  • Hybrid [A → (B ↔ C)] or [(B ↔ C) → A], i.e., using coordinate and subordinate analysis together. For example, using Jack to compare or contrast the experiences of students at elite four-year institutions with students at state universities and/or community colleges; or looking at gun culture in other countries and/or other timeframes to contextualize or generalize Anderson's main points about the role of the Second Amendment in U.S. history.

"In the wild," these three kinds of comparative analysis represent increasingly complex—and scholarly—modes of comparison. Students can of course compare two poems in terms of imagery or two data sets in terms of methods, but in each case the analysis will eventually be richer if the students have had a chance to encounter other people's ideas about how imagery or methods work. At that point, we're getting into a hybrid kind of reading (or even into research essays), especially if we start introducing different approaches to imagery or methods that are themselves being compared along with a couple (or few) poems or data sets.

Why It's Useful

In the context of a particular course, each kind of comparative analysis has its place and can be a useful step up from single-source analysis. Intellectually, comparative analysis helps overcome the "n of 1" problem that can face single-source analysis. That is, a writer drawing broad conclusions about the influence of the Iranian New Wave based on one film is relying entirely—and almost certainly too much—on that film to support those findings. In the context of even just one more film, though, the analysis is suddenly more likely to arrive at one of the best features of any comparative approach: both films will be more richly experienced than they would have been in isolation, and the themes or questions in terms of which they're being explored (here the general question of the influence of the Iranian New Wave) will arrive at conclusions that are less at risk of oversimplification.

For scholars working in comparative fields or through comparative approaches, these features of comparative analysis animate their work. To borrow from a stock example in Western epistemology, our concept of "green" isn't based on a single encounter with something we intuit or are told is "green." Not at all. Our concept of "green" is derived from a complex set of experiences of what others say is green, or what's labeled green, or what seems to be something that's neither blue nor yellow but kind of both, etc. Comparative analysis essays offer us the chance to engage with that process—even if only enough to help us see where a more in-depth exploration with a higher and/or more diverse "n" might lead—and in that sense, from the standpoint of the subject matter students are exploring through writing as well as the complexity of the genre of writing they're using to explore it, comparative analysis forms a bridge of sorts between single-source analysis and research essays.

Typical learning objectives for single-source essays: formulate analytical questions and an arguable thesis, establish stakes of an argument, summarize sources accurately, choose evidence effectively, analyze evidence effectively, define key terms, organize argument logically, acknowledge and respond to counterargument, cite sources properly, and present ideas in clear prose.

Common types of comparative analysis essays and related types: two works in the same genre, two works from the same period (but in different places or in different cultures), a work adapted into a different genre or medium, two theories treating the same topic; a theory and a case study or other object, etc.

How to Teach It: Framing + Practice

Framing multi-source writing assignments (comparative analysis, research essays, multi-modal projects) is likely to overlap a great deal with "Why It's Useful" (see above), because the range of reasons why we might use these kinds of writing in academic or non-academic settings is itself the reason why they so often appear later in courses. In many courses, they're the best vehicles for exploring the complex questions that arise once we've been introduced to the course's main themes, core content, leading protagonists, and central debates.

For comparative analysis in particular, it's helpful to frame the assignment's process and how it will help students successfully navigate the challenges and pitfalls presented by the genre. Ideally, this will mean students have time to identify what each text seems to be doing, take note of apparent points of connection between different texts, and start to imagine how those points of connection (or the absence thereof)

  • complicates or upends their own expectations or assumptions about the texts
  • complicates or refutes the expectations or assumptions about the texts presented by a scholar
  • confirms and/or nuances expectations and assumptions they themselves hold or scholars have presented
  • presents entirely unforeseen ways of understanding the texts

—and all with implications for the texts themselves or for the axes along which the comparative analysis took place. If students know that this is where their ideas will be heading, they'll be ready to develop those ideas and engage with the challenges that comparative analysis presents in terms of structure (see "Tips" and "Common Pitfalls" below for more on these elements of framing).

Like single-source analyses, comparative essays have several moving parts, and giving students practice here means adapting the sample sequence laid out on the "Formative Writing Assignments" page. Three areas that have already been mentioned above are worth noting:

  • Gathering evidence: Depending on what your assignment is asking students to compare (or in terms of what), students will benefit greatly from structured opportunities to create inventories or data sets of the motifs, examples, trajectories, etc., shared (or not shared) by the texts they'll be comparing. See the sample exercises below for a basic example of what this might look like.
  • Why it matters: Moving beyond "x is like y but also different," or even "x is more like y than we might think at first," is what moves an essay from being "compare/contrast" to being a comparative analysis. It's also a move that can be hard to make and that will often evolve over the course of an assignment. A great way to get feedback from students about where they are on this front? Ask them to start considering early on why their argument "matters" to different kinds of imagined audiences (while they're just gathering evidence), and again as they develop their thesis, and again as they're drafting their essays. (Cover letters, for example, are a great place to ask writers to imagine how a reader might be affected by reading their argument.)
  • Structure: Having two texts on stage at the same time can suddenly feel a lot more complicated for any writer who's used to having just one at a time. Giving students a sense of what the most common patterns (AAA / BBB, ABABAB, etc.) are likely to be can help them imagine, even if provisionally, how their argument might unfold over a series of pages. See "Tips" and "Common Pitfalls" below for more information on this front.

Sample Exercises and Links to Other Resources

Tips
  • Try to keep students from thinking of a proposed thesis as a commitment. Instead, help them see it as more of a hypothesis that has emerged out of readings and discussion and analytical questions and that they'll now test through an experiment, namely, writing their essay. When students see writing as part of the process of inquiry—rather than just the result—and when that process is committed to acknowledging and adapting itself to evidence, it makes writing assignments more scientific, more ethical, and more authentic. 
  • Have students create an inventory of touch points between the two texts early in the process.
  • Ask students to make the case—early on and at points throughout the process—for the significance of the claim they're making about the relationship between the texts they're comparing.
Common Pitfalls

  • For coordinate kinds of comparative analysis, a common pitfall is tied to thesis and evidence. Basically, it's a thesis that tells the reader that there are "similarities and differences" between two texts, without telling the reader why it matters that these two texts have or don't have these particular features in common. This kind of thesis is stuck at the level of description or positivism, and it's not uncommon when a writer is grappling with the complexity that can in fact accompany the "taking inventory" stage of comparative analysis. The solution is to make the "taking inventory" stage part of the process of the assignment. When this stage comes before students have formulated a thesis, that formulation is then able to emerge out of a comparative data set, rather than the data set emerging in terms of their thesis (which can lead to confirmation bias, or frequency illusion, or—just for the sake of streamlining the process of gathering evidence—cherry picking).
  • For subordinate kinds of comparative analysis, a common pitfall is tied to how much weight is given to each source. Having students apply a theory (in a "lens" essay) or weigh the pros and cons of a theory against case studies (in a "test a theory" essay) can be a great way to help them explore the assumptions, implications, and real-world usefulness of theoretical approaches. The pitfall of these approaches is that they can quickly lead to the same biases we saw above. Making sure that students know they should engage with counterevidence and counterargument, and that "lens" / "test a theory" approaches often balance each other out in any real-world application of theory, is a good way to get out in front of this pitfall.
  • For any kind of comparative analysis, a common pitfall is structure. Every comparative analysis asks writers to move back and forth between texts, and that can pose a number of challenges, including: what pattern the back and forth should follow and how to use transitions and other signposting to make sure readers can follow the overarching argument as the back and forth is taking place. Here's some advice from an experienced writing instructor to students about how to think about these considerations:

a quick note on STRUCTURE

     Most of us have encountered the question of whether to adopt what we might term the “A→A→A→B→B→B” structure or the “A→B→A→B→A→B” structure. Do we make all of our points about text A before moving on to text B? Or do we go back and forth between A and B as the essay proceeds? As always, the answers to our questions about structure depend on our goals in the essay as a whole. In a “similarities in spite of differences” essay, for instance, readers will need to encounter the differences between A and B before we offer them the similarities (A_d → B_d → A_s → B_s, where d marks differences and s marks similarities). If, rather than subordinating differences to similarities, you are subordinating text A to text B (using A as a point of comparison that reveals B’s originality, say), you may be well served by the “A→A→A→B→B→B” structure.

     Ultimately, you need to ask yourself how many “A→B” moves you have in you.  Is each one identical?  If so, you may wish to make the transition from A to B only once (“A→A→A→B→B→B”), because if each “A→B” move is identical, the “A→B→A→B→A→B” structure will appear to involve nothing more than directionless oscillation and repetition.  If each is increasingly complex, however—if each AB pair yields a new and progressively more complex idea about your subject—you may be well served by the “A→B→A→B→A→B” structure, because in this case it will be visible to readers as a progressively developing argument.

Advice on Timing

As discussed under "Advice on Timing" on the single-source analysis page, that timeline roughly follows the "Sample Sequence of Formative Assignments for a 'Typical' Essay" outlined under "Formative Writing Assignments," and it spans about 5–6 steps or 2–4 weeks.

Comparative analysis assignments have a lot of the same DNA as single-source essays, but they potentially bring more reading into play and ask students to engage in more complicated acts of analysis and synthesis during the drafting stages. With that in mind, closer to 4 weeks is probably a good baseline for many comparative analysis assignments. For sections that meet once per week, the timeline will probably need to expand a little past 4 weeks, or some of the steps will need to be combined or done asynchronously.

What It Can Build Up To

Comparative analyses can build up to other kinds of writing in a number of ways. For example:

  • They can build toward other kinds of comparative analysis, e.g., students can be asked to choose an additional source to complicate their conclusions from a previous analysis, or they can be asked to revisit an analysis using a different axis of comparison, such as race instead of class. (These approaches are akin to moving from a coordinate or subordinate analysis to more of a hybrid approach.)
  • They can scaffold up to research essays, which in many instances are an extension of a "hybrid comparative analysis."
  • Like single-source analysis, in a course where students will take a "deep dive" into a source or topic for their capstone, they can allow students to "try on" a theoretical approach or genre or time period to see if it's indeed something they want to research more fully.


How to Write a Comparative Literature Review


Comparative Literature is a rich, interdisciplinary field, providing students with the opportunity to study literature across national borders, languages, cultures, time periods, genres, and social fields, and to engage with disciplines as various as history, translation studies, philosophy, psychology, critical theory, religious studies, sociology, politics, and the creative arts. With such an intoxicating mixture of cultural forms and academic fields to study, it is all too easy to lose focus and neglect the fundamentals of good essay writing.

What makes a review critical?

Wallace and Wray (2009) consider that critical literature reviews are personal, and they define a critical literature review as:

  • “A reviewer’s constructively critical account, developing an argument designed to convince a particular audience about what the published – and possibly unpublished – literature (theory, research, practice or policy) indicates is and is not known about one or more questions that the reviewer has framed.”
We need to consider carefully some very important points made in the above quotation.

Developing an argument

  • You decide on the focus of your review of the selected texts, engage critically with them, and interpret and synthesise your findings.
  • Most importantly of all, you must communicate your findings clearly and logically. This is best done if you have an argument that a reader can follow.
  • How novel is the work that you are reviewing?
  • “What is known and unknown” – you need to know the range of research and issues in your chosen topic area.

REQUIREMENTS

You will need:

  • A criterion for selecting some texts for inclusion, rejecting others, and homing in on a few of the most relevant texts for in-depth analysis.
  • A plan for reading selectively, so you save time by not reading the whole of each text in detail.
  • Answers to the five critical questions for each of the selected texts.
  • A synthesis of your findings into a logically structured account framed around a convincing, objective argument (with support for your arguments where possible).
  • A commitment NOT to write unsupported opinions such as “I agree with the authors that…” – you are not an authority (published author) in the area of the topic.



How to Write a Literature Review | Guide, Examples, & Templates

Published on January 2, 2023 by Shona McCombes. Revised on September 11, 2023.

What is a literature review? A literature review is a survey of scholarly sources on a specific topic. It provides an overview of current knowledge, allowing you to identify relevant theories, methods, and gaps in the existing research that you can later apply to your paper, thesis, or dissertation topic .

There are five key steps to writing a literature review:

  • Search for relevant literature
  • Evaluate sources
  • Identify themes, debates, and gaps
  • Outline the structure
  • Write your literature review

A good literature review doesn’t just summarize sources—it analyzes, synthesizes , and critically evaluates to give a clear picture of the state of knowledge on the subject.


When you write a thesis, dissertation, or research paper, you will likely have to conduct a literature review to situate your research within existing knowledge. The literature review gives you a chance to:

  • Demonstrate your familiarity with the topic and its scholarly context
  • Develop a theoretical framework and methodology for your research
  • Position your work in relation to other researchers and theorists
  • Show how your research addresses a gap or contributes to a debate
  • Evaluate the current state of research and demonstrate your knowledge of the scholarly debates around your topic.

Writing literature reviews is a particularly important skill if you want to apply for graduate school or pursue a career in research. We’ve written a step-by-step guide that you can follow below.


Writing literature reviews can be quite challenging! A good starting point could be to look at some examples, depending on what kind of literature review you’d like to write.

  • Example literature review #1: “Why Do People Migrate? A Review of the Theoretical Literature” (theoretical literature review about the development of economic migration theory from the 1950s to today)
  • Example literature review #2: “Literature review as a research methodology: An overview and guidelines” (methodological literature review about interdisciplinary knowledge acquisition and production)
  • Example literature review #3: “The Use of Technology in English Language Learning: A Literature Review” (thematic literature review about the effects of technology on language acquisition)
  • Example literature review #4: “Learners’ Listening Comprehension Difficulties in English Language Learning: A Literature Review” (chronological literature review about how the concept of listening skills has changed over time)

You can also check out our templates with literature review examples and sample outlines.

Before you begin searching for literature, you need a clearly defined topic.

If you are writing the literature review section of a dissertation or research paper, you will search for literature related to your research problem and questions.

Make a list of keywords

Start by creating a list of keywords related to your research question. Include each of the key concepts or variables you’re interested in, and list any synonyms and related terms. You can add to this list as you discover new keywords in the process of your literature search.

For example, for a research question about the effect of social media on body image among young people, your keywords might include:

  • Social media, Facebook, Instagram, Twitter, Snapchat, TikTok
  • Body image, self-perception, self-esteem, mental health
  • Generation Z, teenagers, adolescents, youth

Search for relevant sources

Use your keywords to begin searching for sources. Some useful databases to search for journals and articles include:

  • Your university’s library catalogue
  • Google Scholar
  • Project Muse (humanities and social sciences)
  • Medline (life sciences and biomedicine)
  • EconLit (economics)
  • Inspec (physics, engineering and computer science)

You can also use Boolean operators to help narrow down your search.
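For example, a hypothetical search string built from the keyword list above might look like this (the exact syntax varies by database):

    ("social media" OR Instagram OR TikTok OR Snapchat)
    AND ("body image" OR "self-esteem" OR self-perception)
    AND (adolescen* OR teenager* OR "Generation Z")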

Make sure to read the abstract to find out whether an article is relevant to your question. When you find a useful book or article, you can check the bibliography to find other relevant sources.

You likely won’t be able to read absolutely everything that has been written on your topic, so it will be necessary to evaluate which sources are most relevant to your research question.

For each publication, ask yourself:

  • What question or problem is the author addressing?
  • What are the key concepts and how are they defined?
  • What are the key theories, models, and methods?
  • Does the research use established frameworks or take an innovative approach?
  • What are the results and conclusions of the study?
  • How does the publication relate to other literature in the field? Does it confirm, add to, or challenge established knowledge?
  • What are the strengths and weaknesses of the research?

Make sure the sources you use are credible, and make sure you read any landmark studies and major theories in your field of research.

You can use our template to summarize and evaluate sources you’re thinking about using.

Take notes and cite your sources

As you read, you should also begin the writing process. Take notes that you can later incorporate into the text of your literature review.

It is important to keep track of your sources with citations to avoid plagiarism. It can be helpful to make an annotated bibliography, where you compile full citation information and write a paragraph of summary and analysis for each source. This helps you remember what you read and saves time later in the process.


To begin organizing your literature review’s argument and structure, be sure you understand the connections and relationships between the sources you’ve read. Based on your reading and notes, you can look for:

  • Trends and patterns (in theory, method or results): do certain approaches become more or less popular over time?
  • Themes: what questions or concepts recur across the literature?
  • Debates, conflicts and contradictions: where do sources disagree?
  • Pivotal publications: are there any influential theories or studies that changed the direction of the field?
  • Gaps: what is missing from the literature? Are there weaknesses that need to be addressed?

This step will help you work out the structure of your literature review and (if applicable) show how your own research will contribute to existing knowledge.

For example, in reviewing the literature on social media and body image, you might note that:

  • Most research has focused on young women.
  • There is an increasing interest in the visual aspects of social media.
  • But there is still a lack of robust research on highly visual platforms like Instagram and Snapchat—this is a gap that you could address in your own research.

There are various approaches to organizing the body of a literature review. Depending on the length of your literature review, you can combine several of these strategies (for example, your overall structure might be thematic, but each theme is discussed chronologically).

Chronological

The simplest approach is to trace the development of the topic over time. However, if you choose this strategy, be careful to avoid simply listing and summarizing sources in order.

Try to analyze patterns, turning points and key debates that have shaped the direction of the field. Give your interpretation of how and why certain developments occurred.

Thematic

If you have found some recurring central themes, you can organize your literature review into subsections that address different aspects of the topic.

For example, if you are reviewing literature about inequalities in migrant health outcomes, key themes might include healthcare policy, language barriers, cultural attitudes, legal status, and economic access.

Methodological

If you draw your sources from different disciplines or fields that use a variety of research methods , you might want to compare the results and conclusions that emerge from different approaches. For example:

  • Look at what results have emerged in qualitative versus quantitative research
  • Discuss how the topic has been approached by empirical versus theoretical scholarship
  • Divide the literature into sociological, historical, and cultural sources

Theoretical

A literature review is often the foundation for a theoretical framework. You can use it to discuss various theories, models, and definitions of key concepts.

You might argue for the relevance of a specific theoretical approach, or combine various theoretical concepts to create a framework for your research.

Like any other academic text, your literature review should have an introduction, a main body, and a conclusion. What you include in each depends on the objective of your literature review.

The introduction should clearly establish the focus and purpose of the literature review.

Depending on the length of your literature review, you might want to divide the body into subsections. You can use a subheading for each theme, time period, or methodological approach.

As you write, you can follow these tips:

  • Summarize and synthesize: give an overview of the main points of each source and combine them into a coherent whole
  • Analyze and interpret: don’t just paraphrase other researchers — add your own interpretations where possible, discussing the significance of findings in relation to the literature as a whole
  • Critically evaluate: mention the strengths and weaknesses of your sources
  • Write in well-structured paragraphs: use transition words and topic sentences to draw connections, comparisons and contrasts

In the conclusion, you should summarize the key findings you have taken from the literature and emphasize their significance.

When you’ve finished writing and revising your literature review, don’t forget to proofread thoroughly before submitting.


A literature review is a survey of scholarly sources (such as books, journal articles, and theses) related to a specific topic or research question.

It is often written as part of a thesis, dissertation, or research paper, in order to situate your work in relation to existing knowledge.

There are several reasons to conduct a literature review at the beginning of a research project:

  • To familiarize yourself with the current state of knowledge on your topic
  • To ensure that you’re not just repeating what others have already done
  • To identify gaps in knowledge and unresolved problems that your research can address
  • To develop your theoretical framework and methodology
  • To provide an overview of the key findings and debates on the topic

Writing the literature review shows your reader how your work relates to existing research and what new insights it will contribute.

The literature review usually comes near the beginning of your thesis or dissertation. After the introduction, it grounds your research in a scholarly field and leads directly to your theoretical framework or methodology.

A literature review is a survey of credible sources on a topic, often used in dissertations, theses, and research papers. Literature reviews give an overview of knowledge on a subject, helping you identify relevant theories and methods, as well as gaps in existing research. Literature reviews are set up similarly to other academic texts, with an introduction, a main body, and a conclusion.

An annotated bibliography is a list of source references that has a short description (called an annotation) for each of the sources. It is often assigned as part of the research process for a paper.


Comparative Analysis

Flipped Learning Module

Each Flipped Learning Module (FLM) is a set of short videos and online activities that can be used (in whole or in part) to free up class time from content delivery for greater student interaction. At the end of the module, students are asked to fill out a brief survey, in which we adopt the minute paper strategy. In this approach, students are asked to submit their responses to two brief questions regarding their knowledge of the module.

In this FLM, students are asked to complete a fill-in-the-blank outline which accompanies all three videos, covering the topics of comparative analysis, grounds for comparison, and the structure of comparative analysis. The completed outline will enhance the students’ note-taking skills and will serve as a summary of the FLM that they may refer to in the future.

Key terms: comparative analysis goals, analytical discussion, interpretation, evaluation, grounds for comparison, organization, point-by-point & block-by-block structure

Module Overview

  • Introduction to Comparative Analysis: What is comparative analysis? Reading for and entering the conversation. The goal of comparative analysis.
  • Grounds for Comparison: The 5 grounds for comparison. Identifying the significance.
  • Organization: Point-by-point. Block-by-block.

Worksheet: Comparative Analysis Outline

  • Comparative analysis is different from a traditional compare/contrast essay in the following way: __________________________________________________________________________
  • The goal of comparative analysis is to: __________________________________________________________________________
  • When you put two articles in conversation with one another in order to shed light on a topic, continue a discussion, or potentially resolve a problem, you are: __________________________________________________________________________
  • (Question 1):__________________________________________________
  • (Question 2):__________________________________________________
  • (Question 3):__________________________________________________
  • (Question 4):__________________________________________________
  • Once you have identified the conversation your authors are participating in, the next step is to: __________________________________________________________________________
  • (Method 1):__________________________________________________
  • (Method 2):__________________________________________________
  • (Method 3):__________________________________________________
  • (Method 4):__________________________________________________
  • (Method 5):__________________________________________________
  • Establishing the grounds for comparison is so important because: __________________________________________________________________________


Video 1: Introduction to Comparative Analysis

Comparative Analysis Online Activity 1

  • What is each author’s claim?
  • What claims are these authors responding to?
  • What larger conversation are these authors participating in?
  • How might you enter that conversation?

Video 2: Grounds for Comparison

Comparative Analysis Online Activity 2

  • Describe the relationship between the two texts (the grounds for comparison).
  • What aspects of this relationship could you analyze?
  • What new ideas could your analysis present by exploring the relationship you identified within the context of the conversation you are entering?
  • Why do you think it is important to engage in this conversation? Who needs to understand this issue? Why does it matter?

Video 3: Organization

Comparative Analysis Online Activity 3

  • What are the main points you want to make?
  • Looking at your main claims, can you more easily divide them into different ideas (point-by-point) or into different voices (block-by-block)?
  • Write out a topic sentence for each of your main points following the method of organization you chose.

Comparative Analysis Survey

  • What was the one most important thing you learned from this module?
  • Do you have any unanswered questions for me?

Comparative Relationships Between Texts Chart


Comparative Analysis In-Class Activity

Goal: To practice writing and evaluating comparative analysis claims.

Materials: One laptop per group; your notes/annotations on reading material; Google doc link sent by your instructor.




Endnote: This module follows the philosophy of the Writing Program and refers to the work of Kevin Barents, Holly Schaff, and Lesley Yoder for specific guidelines, categories, and best practices.

A comparative review may, e.g., require you to examine two schools of thought, two issues, or the positions taken by two persons. You may create a hierarchy of issues and sub-issues to compare and contrast, as suggested by the following general plan.

This model lists three options for structuring the body of the review (e.g., treating texts A and B each in turn, or together point by point). In all cases, you are expected to deal with the similarities (compare) and then with the differences (contrast), within an introduction, body, and conclusion.
Literature Review Example 3 offers an excellent example of a comparative review [Language and gender]. This was written by Alastair Pennycook for his undergraduate students as a model of (among other things) how to structure a review of the literature, and it illustrates the structure described above.


A Step-by-Step Guide to Writing a Comparative Analysis


Writing a comparative review in a research paper is not as difficult as many people think. With a few tips, it is possible to write an outstanding comparative review by following the steps detailed in this article.

In the literary, academic, and journalistic worlds, analysis allows writers to set out ideas and arguments within a context, making it important material for professional discussion.

Within this genre, we find the comparative analysis. Some authors define the comparative essay as a text that sets out two opposing positions or tests two theses. Through this comparison, the author intends to make the reader reflect on a specific topic. It consists of giving a written opinion about two positions, which are compared with each other in order to reach a conclusion. Do you know how to write a comparative essay? In this article, we will explain step by step how to do it.

So, let’s look at the guidelines that you must follow to achieve a good comparative analysis.

How to Write a Good Comparative Analysis

The Structure

In general, the approach is developed in the first paragraph or at the beginning of the work. Its objective is to propose the author’s position regarding a specific subject, and it generally specifies the objective to be achieved. You must be clear about what topic you are going to deal with, what you want to explain, what perspectives you will use in your comparative analysis, and who you are writing for.

As it is a comparative text, it begins with a general observation that can serve as context for both approaches, and then establishes the arguments in each of the two cases. Do not forget to compare both objects of study according to each argument or idea you develop.

Let readers find or define their own position on the essay and choose one of the two alternatives themselves.

At this point, there are two possible approaches: one deductive and the other inductive. The deductive method raises the issue first, and your analysis of the variables then guides the reader toward drawing their own conclusions or fixing a position on the issue. The inductive method starts with the arguments, developing each of the variables until reaching the topic’s approach or problem. Both ways of approaching the subject are viable; choose the one that is easiest for you to work with.

At the end of this section, your audience should:

  • First of all, have a clear understanding of what topics you will cover in your essay, what you want to explain, and from what positions or perspectives you will do it. The essay begins with a general observation that establishes the similarity between the two subjects and then moves its focus to the concrete.
  • The reader should understand which points will be examined and which points will not be examined in the comparison. At the end of the introduction, state your preference, or describe the significance of the two subjects.
  • Your readers should be able to describe the ideas you are going to treat. Make a detailed exposition of the characteristics, history, consequences, and development that you consider appropriate. Your comparative analysis should expose the characteristics of the second position as fully as those of the first.

Development of the Body

Generally, in the body of the essay, the author presents all the arguments that support the thesis, giving the work a reflective core that justifies the author’s initial statement. Depending on the length of the work, which can range from two to 15 pages, each paragraph or titled subsection corresponds to the development of one argument.

After speaking on the subject, the author must close the essay: conclude, show the findings of the work, and/or present the conclusions reached. You must write a final closing paragraph, as a conclusion, in which you set up a confrontation between the two positions. Try to create some tension between them so that the reader gets involved. The conclusion should give a brief and general summary of the most important similarities and differences. It should end with a personal statement, an opinion, and the “what then?” – what is important about the two things being compared.

Readers should be left feeling that all the different threads of the essay have been pulled together coherently, that they have learned something, and that this is clearly the end – they should not look around for missing pages. Finally, your assessment must explain which position you stand behind and why you prefer it to the other.

Examples of how to write a comparative analysis

Point-by-point structure, where each paragraph compares both players on one idea:

Paragraph 1: Messi’s preferred position / Ronaldo’s preferred position.

Paragraph 2: Messi’s play style / Ronaldo’s play style.

Paragraph 3: Messi aerial game / Ronaldo aerial game.

Block-by-block structure, where each paragraph treats one player at a time:

Paragraph 1: Messi’s teamwork.

Paragraph 2: Ronaldo’s teamwork.

Paragraph 3: Messi’s set-piece play.

Paragraph 4: Ronaldo’s set-piece play.

Paragraph 5: Messi’s achievements.

Paragraph 6: Ronaldo’s achievements.

A Few Important Rules for Comparative Analysis

Even if the exercise sounds simple, there are a few rules that should be followed to help your audience make the best possible decision.

1. Clearly state your position

The first question is: why are you doing a comparative analysis? To highlight your view or ideas over another, or simply to compare two (or more) solutions that do not belong to you? It is imperative that you clearly state your position to your reader; your credibility depends on it.

Be honest and state, for example:

  • The idea you are trying to espouse
  • The framework you are using
  • The reason why you are doing this comparison, the objective

In addition to the above, you must be consistent with the exposition of your ideas.

2. Stay objective

Even if you include your personal ideology in your comparison, stay as objective as possible. Your readers will not appreciate it if you point out all the disadvantages of one idea while displaying only the advantages of the other; your comparison will turn into advertising. You have to raise the weak points and strong points on both sides.

These analyses are always subjective, so you have to clarify which position convinces you the most.

3. Think about your audience’s expectations

The research paper is intended for your readers, which means that you must take their expectations into account when writing your review. Put aside your desire to sell your preferred idea, and take your readers’ perspective:

  • What information are they interested in?
  • What are their criteria?
  • What do they want to know?
  • What do they want from the product or service?

Again, it is about being objective in all your statements.


4. Organize information

For your readers to want to read your comparative analysis, it is important to structure your comments. The idea is to make it easy for your readers to navigate your paper and to help them find the information that interests them quickly.

5. End with a conclusion

You’ve tried to be as objective as possible throughout your comparison, and now is the time to let go, as we have mentioned several times in this post. In your conclusion, you can address your readers directly and give your opinion. With a few tips, you can also encourage them to lean toward one idea or the other.

Note: If time is not an issue, the best way to review the essay is to leave it for one day. Go for a walk, eat something, have fun, and forget about it. Then it’s time to go back to the text, find problems, and fix them. This must be done separately; that is, first find all the problems you can without correcting them. Although the idea of doing both at the same time is tempting, it is smarter to do them separately. It is effective and fast.

Tips on Comparative analysis

Be concise and accurate in your analysis and discussion of the topic.

Sometimes authors believe that the more elaborate the language and the more extensive the writing, the better the writer or essayist. On the contrary, a good essay rests on the exact analysis of a topic, through which the reader can advance dynamically and understand the author’s position.

Use only the arguments necessary for the explanation of the topic; do not say too much. You run the risk of becoming redundant or repetitive, which makes the text heavy both to read and to understand.

Write in Short Sentences

Just as we recommend that you avoid redundancy in your texts, we also encourage you to write in short sentences. They give dynamism to the text, communication is direct, and the reader advances through the text and understands much more.

Include Reflections in Your Text

Supporting your approach with reflections or quotations from other authors gives your essay more weight. Above all, use those arguments that justify or give strength to your position regarding one thesis or the other.

Text Revision

Since comparative analysis can tend to be a subjective kind of work, you must let it “sit” for a day or a few hours and then read it again. This exercise will allow you to make corrections and modify those aspects that are not clear enough – in a few words, to improve it. Once you have done this exercise, you can submit it.


Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

Michael Li, Jianping Sun & Xianming Tan (ORCID: orcid.org/0000-0002-5478-2269)

Systematic Reviews, volume 13, article number 219 (2024). Open access. Published: 21 August 2024.

Abstract

Background: This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.

Methods: We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as the latest tools, including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 Pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.

Results: Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.

Conclusions: While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and be integrated into hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.


Introduction

A systematic review is a rigorous form of research that collates and synthesizes all existing evidence on a specific research question [1]. It stands as a cornerstone not just in medical research but across diverse academic disciplines. Unlike traditional literature reviews, systematic reviews follow a comprehensive and standardized process, such as the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline [2], designed to minimize bias and ensure reproducibility. Therefore, these reviews are recognized as one of the zenith levels of evidence in evidence-based research [3] and play a pivotal role in shaping clinical guidelines and healthcare policies and in informing medical decisions [4].

Commencing with a well-articulated research question, a typical systematic review launches an exhaustive search strategy that sweeps through databases, such as PubMed and Embase, supplemented by additional sources such as clinical trial registries and pertinent article reference lists, with the aim of capturing all relevant studies and mitigating bias. Predetermined inclusion and exclusion criteria guide the subsequent screening and selection of studies, encompassing factors like study design, patient demographics, and intervention types. Reviewers, working independently, appraise each study’s eligibility, reconciling disagreements through discussion or third-party review. Data extraction and synthesis follow, either through meta-analysis or narrative synthesis, depending on the heterogeneity of the selected studies.

The practice of conducting systematic reviews has gained substantial popularity with considerable demand within the academic community. A notable reference [5] to this trend found that approximately 75 trials and potentially 11 systematic reviews are disseminated daily, based on data from around 2010. Moreover, a query for “systematic review” in the Google Scholar database yields approximately 17,000 entries that have been published within the year 2023 alone, which translates into about 51 systematic reviews per day, as observed on November 26, 2023. This expanding volume of literature underscores the critical role that systematic reviews play in the consolidation of research findings across various fields of study.

Despite their pivotal role, executing systematic reviews remains a formidable task, largely because of abstract screening, a key phase that can be overwhelmingly time-consuming given the volume of literature involved. For example, in Polanin et al. [6], the authors reported that the research staff screened 29,846 abstracts independently (14,923 unique citations were double screened) over the course of 189 days. In addition, variability in reporting standards, use of jargon, and varied study designs can further complicate the abstract screening process [7]. Moreover, the repetitive nature of the task, combined with cognitive fatigue, can lead to human errors [8, 9]. Recent advancements in machine learning (ML) and deep learning offer possible solutions to these challenges. However, traditional ML models, while promising, require domain-specific training, a time-consuming process that often demands manual labeling of datasets [10].

Advancements in natural language processing (NLP) and artificial intelligence (AI) are opening doors to address challenges in systematic reviews. Large language models (LLMs) like ChatGPT [11], PaLM [12], Llama [13], and Claude [14] are capturing the research community’s attention. Their collective potential, especially their capability to operate without exhaustive domain-specific training, makes them prime candidates for revolutionizing the systematic review process.

While each of the aforementioned AI tools brings its unique capabilities to the table, the fundamental question remains: How do they stack up, both individually and collectively, in the realm of abstract screening, against the human expert-based process? In pursuit of answers, this research seeks to thoroughly investigate the potential of ChatGPT, Google PaLM, Llama, and Claude in automating the crucial abstract screening phase integral to systematic reviews. Our goal is to rigorously compare the performance of these advanced AI-driven methods with existing machine learning (ML)-based approaches. In doing so, we aim to develop AI strategies that masterfully balance efficiency and accuracy with minimal human intervention, ultimately transforming systematic review practice across disciplines.

The use of NLP for abstract screening is established [10, 15]. However, the application of LLMs specifically for this task is a nascent field [16, 17]. This emerging area offers significant potential to improve efficiency and accuracy. Our study aims to fill this gap by providing a comprehensive analysis of LLM capabilities in abstract screening, laying the groundwork for future research and application. This is particularly relevant considering the rapid evolution of this technology, highlighting its potential to streamline systematic reviews now and in the future.

The remainder of this paper is structured to provide a comprehensive exploration of our topic. We begin with an in-depth examination of existing methods for abstract screening, including both manual and NLP-based approaches, laying the groundwork for understanding the current state of the field. We then introduce the use of large language model (LLM) tools for abstract screening, detailing our experimental design to meticulously evaluate their performance in this context. Subsequent sections present our empirical findings and results, shedding light on the capabilities and limitations of the AI tools in question. Finally, we engage in a thoughtful discussion, reflecting on the implications of our findings and considering the future trajectory of abstract screening in systematic reviews.

Existing approaches to abstract screening in systematic reviews

In the vast realm of systematic reviews, the critical task of abstract screening serves as a foundational step in curating the highest quality of evidence [ 2 ]. However, this process often presents significant challenges, as it involves sifting through large volumes of literature to identify studies that align with predefined criteria. Over time, various methodologies, ranging from manual evaluations to sophisticated AI-driven techniques, have been proposed to address the complexities of this task. In this section, we describe the existing approaches, their operational mechanisms, and their associated advantages and disadvantages.

Manual approach

Historically, the process of abstract screening was firmly rooted in manual evaluations. In this conventional approach, individual reviewers would scrutinize each abstract against predefined criteria [ 1 ]. The meticulous nature of this method required that multiple experts independently evaluate the literature to ensure both reliability and reduced biases [ 8 ]. While the depth of human expertise brought about nuanced understanding, the manual nature of this method made it both time-consuming and, at times, prone to human error [ 6 , 9 ].

NLP-based approach

As technology evolved, the field witnessed the incorporation of natural language processing (NLP) to automate abstract screening [ 10 ]. In this framework, abstract text undergoes preprocessing and vectorization. Supervised machine learning models, notably the support vector machine (SVM) and the random forest (RF), are then trained on this vectorized data to classify literature based on specific criteria [ 15 ]. The strength of this approach lies in its potential for efficiency. However, its efficacy and accuracy hinge heavily on the availability of a well-curated, labeled training set.
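To make this concrete, here is a minimal sketch of such a pipeline using scikit-learn; the example texts, labels, and the choice of a linear SVM are our own illustrations, not the configuration of any specific study cited above.

```python
# Sketch of the NLP-based pipeline: abstracts are vectorized with TF-IDF and a
# supervised classifier (a linear SVM here) is trained on human-labeled
# examples, then applied to unscreened abstracts. Texts and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "randomized trial of an antidepressant in an animal model of depression",
    "narrative review of imaging methods with no primary data",
]
train_labels = [1, 0]  # 1 = include, 0 = exclude (human-curated)

pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
pipeline.fit(train_texts, train_labels)

decisions = pipeline.predict(["an unscreened abstract about depressive behavior"])
```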

Zero-shot classification

A more recent and innovative approach is zero-shot classification, notably highlighted by Xian et al. [ 18 ]. Eschewing the need for an extensive labeled training dataset, zero-shot classification offers the allure of categorizing abstracts without prior domain-specific training. By calculating a probability score for each abstract, researchers obtain a dynamic measure of its alignment with predefined categories. While it requires no model training, the tradeoff is a loss of sensitivity and the potential omission of pertinent studies [ 19 ]. In this study, for a given set of abstracts, we first obtain embeddings (using OpenAI’s text embedding API, the babbage model) for the abstracts and for a pre-specified description of an ideal study to include; we use the inclusion/exclusion criteria (see the “Examples of abstract screening by using LLMs,” “Automated workflow for streamlining abstract screening via ChatGPT and other tools,” and “ Study design ” sections) as that description. We then compute the cosine similarity score between the embedding of each abstract and the embedding of the description, and classify the top 10% of abstracts (in terms of similarity) as positive (i.e., to be included).
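The following sketch illustrates this ranking step; `get_embedding` is a hypothetical wrapper around an embedding API call (the babbage text-embedding model in the setup above) and is left abstract here.

```python
# Sketch of the zero-shot ranking: embed each abstract and the eligibility
# criteria, score abstracts by cosine similarity to the criteria, and flag the
# top 10% as positives. get_embedding must be supplied by the user.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_screen(abstracts: list[str], criteria: str, get_embedding,
                     top_fraction: float = 0.10) -> np.ndarray:
    criteria_emb = get_embedding(criteria)
    scores = np.array([cosine(get_embedding(t), criteria_emb) for t in abstracts])
    cutoff = np.quantile(scores, 1.0 - top_fraction)
    return scores >= cutoff  # True = provisionally "include"
```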

Hybrid approach

To reconcile the strengths and limitations of these models, a hybrid methodology has emerged [ 18 ]. It marries the rapid categorization capabilities of zero-shot classification with the precision of traditional machine learning models such as SVM and RF. Here, zero-shot classification provides an initial curation of all abstracts in the training set. Then, the abstracts with a zero-shot classification score higher than a pre-specified threshold undergo manual review to rectify the zero-shot labels. The rectified labels then serve as a foundation for training traditional models, which are employed to screen the broader dataset (e.g., the testing set). The hybrid approach balances the speed of zero-shot classification with the precision of traditional ML and potentially offers enhanced accuracy at reduced human effort. However, it involves multiple methodologies and still relies on well-curated, labeled training data (in this case, a subset of the whole training set). In this study, for a given set of abstracts, we rectify the “positive” abstracts identified by zero-shot (i.e., the top 10% of abstracts) and then fit an SVM classification model to the text embeddings of those abstracts with the rectified labels as the outcome. The learned model is then used to predict the class of the remaining abstracts. We added the R code for the zero-shot and hybrid approaches to the GitHub repository https://github.com/mikeli380/LLMAbstractScreening .
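A minimal sketch of the hybrid step follows, assuming embeddings and a human-rectified subset are already in hand; variable names are illustrative rather than taken from the released R code.

```python
# Sketch of the hybrid step: the top-ranked ("positive") abstracts from
# zero-shot are relabeled by a human, an SVM is fit to their embeddings with
# the rectified labels, and the fitted model screens the remaining abstracts.
import numpy as np
from sklearn.svm import SVC

def hybrid_screen(embeddings: np.ndarray, zero_shot_mask: np.ndarray,
                  rectified_labels: np.ndarray, test_embeddings: np.ndarray):
    """embeddings: (n, d); zero_shot_mask: boolean mask of zero-shot positives;
    rectified_labels: human-corrected 0/1 labels for the masked subset."""
    clf = SVC(kernel="linear")
    clf.fit(embeddings[zero_shot_mask], rectified_labels)
    return clf.predict(test_embeddings)  # 0/1 decisions for unseen abstracts
```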

Active learning approach

Active learning [ 20 , 21 , 22 ] is an innovative approach to machine learning that optimizes the training process by allowing the model to selectively query a human annotator for labels on the most informative data points. This method is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain. Active learning models can start with minimal datasets, often requiring only one relevant and one irrelevant abstract, making them well suited for tasks such as abstract screening, where acquiring large labeled datasets can be prohibitively expensive.

The main advantage of active learning is its efficiency. By focusing on the most informative samples, it reduces the amount of data that needs to be labeled while still effectively training the model. This can significantly reduce the time and resources required for the annotation process. However, the effectiveness of active learning depends heavily on the initial selection of samples and the criteria used to determine the informativeness of subsequent data points. If not well calibrated, the model may request labels for data that do not significantly improve its performance, resulting in an inefficient use of resources. In addition, the iterative nature of querying and updating the model based on new labels can introduce complexity into the training process, potentially requiring more sophisticated infrastructure and oversight than traditional supervised learning methods.
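To illustrate the mechanics, a minimal uncertainty-sampling loop, one common active-learning strategy, is sketched below; `ask_human` is a hypothetical oracle standing in for a reviewer, and the logistic regression learner is an arbitrary choice for the sketch.

```python
# Minimal uncertainty-sampling loop: start from one relevant and one
# irrelevant abstract, then repeatedly query a human for the abstract the
# current model is least certain about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X, seed_idx, seed_labels, ask_human, budget=50):
    labeled = dict(zip(seed_idx, seed_labels))  # index -> 0/1 label
    model = None
    for _ in range(budget):
        idx = np.fromiter(labeled, dtype=int)
        model = LogisticRegression().fit(X[idx], [labeled[i] for i in idx])
        proba = model.predict_proba(X)[:, 1]  # P(include) for every abstract
        uncertainty = np.abs(proba - 0.5)
        uncertainty[idx] = np.inf              # never re-query labeled items
        query = int(np.argmin(uncertainty))    # most uncertain abstract
        labeled[query] = ask_human(query)      # one new human label per round
    return model, labeled
```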

While this study did not test active learning approaches for abstract screening, readers interested in exploring this methodology further are referred to [ 23 ] for detailed information on the application of active learning in abstract screening contexts.

Large language models

Amidst the evolving methodologies described in the “ Existing approaches to abstract screening in systematic reviews ” section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced criteria and offer detailed assessments and classifications. Their prowess signals transformative potential for abstract screening. In this section, we first present two examples to illustrate the potential of LLMs in the context of abstract screening, and then proceed with an in-depth discussion of the study’s workflow and design, which aims to critically investigate the performance of LLMs in this domain.

Examples of abstract screening by using LLMs

In this subsection, we demonstrate the process of LLM-based abstract screening using ChatGPT with two specific abstracts: (1) Millard et al. [ 24 ] and (2) Zawia et al. [ 25 ]. Using other LLM tools such as Llama, Google PaLM, or Claude for abstract screening is similar and is omitted for brevity.

In constructing the prompts, we adopted a standardized approach to mimic a typical interaction between a senior researcher and a research assistant. Each prompt was designed to include three critical elements: a brief statement of the topic under review, the exact eligibility criteria as specified in the study guidelines, and the text of the abstract to be reviewed. This approach was designed to simulate a realistic and practical scenario that reflects common academic practices. In addition, to ensure consistency and objectivity in the responses, we required that the results of the LLM be in a binary “yes” or “no” format, directly indicating whether each abstract met the inclusion criteria. This format was chosen to eliminate any ambiguity in the model’s decisions and to facilitate straightforward analysis of its accuracy.
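A schematic of this prompt construction is given below; the wording is paraphrased for illustration, and the exact prompts used for each database are reproduced in the corresponding Results subsections.

```python
# Schematic prompt builder with the three elements described above: topic,
# verbatim eligibility criteria, and the abstract, plus an instruction to give
# a one-word yes/no verdict. Wording is paraphrased, not the exact study prompt.
def build_prompt(topic: str, criteria: str, abstract: str) -> str:
    return (
        f"I would like you to help me with conducting a systematic review "
        f"on {topic}. I will provide the title and abstract for one journal "
        f"article and would like you to screen the paper for inclusion. "
        f"Here are the eligibility criteria: {criteria} "
        f"Answer with only one word, yes or no, indicating whether the "
        f"abstract meets the inclusion criteria. Here is the abstract: {abstract}"
    )
```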

Abstract (1)

ChatGPT prompt: “I would like you to help me with conducting a systematic review on animal depression. I will provide the title and abstract for one journal article and would like you to screen the paper for inclusion. Here are the inclusion criteria: Does the study provide primary data of an animal model of depression or depressive-like phenotype with an appropriate control group? Here is the abstract:

Wistar-Kyoto (WKY) rats show endogenous depressive behavior that can be reversed by antidepressants. Given that WKYs exhibit decreased sensitivity to some antidepressants and treatment-resistant depressed patients often show hypothalamic-pituitary-thyroid (HPT) dysregulation, we examined the behavioral and HPT hormonal responses of WKYs to altered thyroid status. “Euthyroid” WKYs had elevated basal plasma TSH and T(3) levels as compared to Wistars. Hypothyroidism increased TSH levels more in WKYs than in Wistars and increased response latency in the open field test (OFT) of WKYs only. Administration of T(4) and T(3) suppressed plasma TSH equally in both strains. Wistars responded to increased T(3) levels with decreased response latency and increased activity in the OFT, but increased immobility in the forced swim test. In contrast, WKYs responded only to the high T(3) levels with decreased response latency in the OFT. These results suggest the existence of a decreased central nervous system sensitivity to thyroid hormones in WKYs that could be related to their depressive behavior.”

ChatGPT v4 response (accessed on 10/15/2023):

Given the abstract and the inclusion criteria, this study appears to meet the inclusion criteria for your systematic review on animal models of depression.

Consideration for inclusion:

 1. Animal model and depressive-like phenotype:

  • The study utilizes Wistar-Kyoto (WKY) rats which are mentioned to show endogenous depressive behavior.

  • This study investigates behaviors and hormonal responses associated with depression in this particular animal model, with a focus on the role of thyroid hormones.

 2. Primary data and appropriate control group:

  • The study provides primary data on behavioral and hormonal responses of WKY rats, comparing them to Wistars (likely serving as the control group), under altered thyroid status, allowing for comparisons between the depressive-like phenotype and controls.

  • The utilization of both “Euthyroid” WKYs and Wistars as controls helps in effectively investigating the behavioral changes and responses in WKY rats due to alterations in thyroid status.

 3. Behavioral test indicating depression:

  • Depressive behavior and responses are assessed using standardized tests like the open field test (OFT) and the forced swim test, which are conventional methods to investigate depressive-like behaviors in animal models.

  • The mention of antidepressant’s impact on WKY rats indicates a focus on depressive-like states.

 4. Relevance to human condition:

  • Insights gained from this study on decreased central nervous system sensitivity to thyroid hormones and its possible relation to depressive behavior in WKYs can have implications in understanding treatment-resistant depression in humans, making it valuable for the review.

Conclusion:

Given the provided abstract, this study does seem to fit the inclusion criteria for a systematic review on animal depression, as it provides primary data on an animal model demonstrating depressive-like behavior, includes an appropriate control group, and employs standardized tests to study the depressive-like states and their relation to thyroid hormones. The detailed exploration of depressive behaviors and hormonal imbalances in relation to altered thyroid status can offer valuable insights for understanding and treating depression.

Abstract (2)

ChatGPT prompt: identical to the prompt for abstract (1), but with the following abstract:

“The induction of the activity of cerebral ornithine decarboxylase (EC 4.1.1.17) and mRNA by electrical stimulation exhibits regional differences. The effects of the enzyme inhibitor difluoromethylornithine on these regional variations was examined. Administration of this inhibitor resulted in pronounced depression of both basal and induced activity of ornithine decarboxylase in the hippocampus. Basal activity of the enzyme in the neocortex and the cerebellum appeared to be resistant to difluoromethylornithine but the induced enzyme activity was sensitive to the effects of this inhibitor. Susceptibility to difluoromethylornithine may be directly correlated with a slower turnover rate for ornithine decarboxylase. These results suggest that ornithine decarboxylase in the hippocampus may possess a longer half-life than its counterparts in other regions of the brain. Pretreatment with difluoromethylornithine had no effect on the induced ornithine decarboxylase mRNA in the neocortex. Thus, elevated activity of ornithine decarboxylase enzyme, due to electrical stimulation, appears to not have any effect on either the transcription or the decay rate of the induced ornithine decarboxylase mRNA. These findings support the concept of region-specific regulation of cerebral ornithine decarboxylase.”

ChatGPT v4 response:

Given the provided abstract and the inclusion criteria, this study does not appear to meet the criteria for inclusion in a systematic review on animal models of depression or depressive-like phenotypes with an appropriate control group.

Consideration for inclusion:

 1. Depression or depressive-like phenotype:

  • The abstract does not provide any information or data related to depression or depressive-like behavior or phenotypes in animals. It is mainly focused on the activity of cerebral ornithine decarboxylase and its regional differences in response to electrical stimulation and inhibitor administration.

 2. Animal model:

  • There is no mention of the use of any specific animal model or any behavioral assays to assess depressive-like symptoms or states in the provided abstract.

 3. Behavioral assessment and primary data:

  • The abstract does not contain any details about behavioral assessments related to depression or depressive-like states. It primarily provides data on the effects of enzyme inhibitor difluoromethylornithine on cerebral ornithine decarboxylase activity and mRNA, but does not relate these findings to depression or depressive-like behaviors.

 4. Relevance to depression:

  • While the study explores biochemical aspects in brain regions like the hippocampus, which is implicated in mood disorders like depression, the connection between the findings and depressive symptoms or states is not established in the provided abstract.

Conclusion:

This article does not seem to fit the inclusion criteria for a systematic review on animal models of depression or depressive-like phenotypes as it lacks primary data on depressive-like behaviors or phenotypes in animals, and there is no mention of the use of animal models representative of depression or an appropriate control group for such a comparison.

From the two examples presented above, it is evident that ChatGPT adeptly comprehends the posed queries and efficiently extracts the inclusion and exclusion criteria. It not only makes clear decisions regarding study inclusion but also articulates its rationale by cross-referencing each criterion.

It is worth noting that the same ChatGPT prompt might elicit slightly different responses from different users due to variability in response generation, but the conclusion regarding the eligibility of an abstract is typically consistent. In addition, enhancing the specificity of the criteria within the prompts can further refine the screening process. To illustrate, consider the following enhanced prompts:

Enhanced ChatGPT prompt for abstract (1): “I seek assistance with a systematic review focused on animal depression. I’ll provide the abstract of a particular journal article and would appreciate an assessment for its inclusion based on the following criteria:

The study employs an animal model reflecting depression or a depressive-like phenotype.

It incorporates an appropriate control group.

Primary data is presented within the study.

and here are the exclusion criteria:

The study bypasses the use of an animal model.

The utilized animal model doesn’t characterize depression or a depressive-like phenotype.

Absence of a control group within the study.

The study is primarily a review, devoid of primary data.

Here is the abstract: …”.

ChatGPT v4 response:

Let’s assess the provided abstract based on the inclusion and exclusion criteria:

Inclusion criteria:

 1. Animal model reflecting depression: The study uses Wistar-Kyoto (WKY) rats, which are described as showing “endogenous depressive behavior.” Thus, this criterion is met.

 2. Appropriate control group: The study compares WKY rats to Wistars (another strain), which serves as a comparison or control group. This criterion is met.

 3. Primary data: The abstract mentions specific results such as the elevated basal plasma TSH and T(3) levels in “Euthyroid” WKYs compared to Wistars and the different responses in the open field test (OFT). This suggests the presence of primary data. Thus, this criterion is met.

Exclusion criteria:

 1. Bypasses the use of an animal model: The study uses WKY rats as an animal model. Thus, this criterion is not met, which is good for inclusion.

 2. Animal model doesn’t characterize depression: The WKY rats show “endogenous depressive behavior.” Thus, this criterion is not met, which again supports inclusion.

 3. Absence of a control group: The study does use a control group (Wistars). Thus, this criterion is not met, supporting inclusion.

 4. Study is primarily a review: The abstract presents experimental data and does not indicate that the study is a review. Thus, this criterion is not met, which is favorable for inclusion.

Conclusion:

Given that the abstract meets all the inclusion criteria and none of the exclusion criteria, it should be considered in your systematic review on animal depression.

With these revised prompts applied to the previously discussed abstracts, the conclusions remain consistent. Nonetheless, we contend that this enhanced structure not only provides a more rigorous framework but also aligns more closely with widely accepted practices in systematic review criteria definition.

Automated workflow for streamlining abstract screening via ChatGPT and other tools

For systematic reviews, particularly when dealing with voluminous data, efficient workflows are paramount. The ChatGPT API (application programming interface) offers a dynamic solution, enabling the automation of abstract screening on a large scale and circumventing the labor-intensive process of manually inputting abstracts into a chatbot interface. In this subsection, we present an automated workflow for streamlining abstract screening via ChatGPT. Note that although this workflow uses ChatGPT as the platform, analogous workflows apply to other AI platforms such as PaLM, Llama, and Claude.

Automated workflow:

Data collection: The preliminary step entails accumulating a list of titles and abstracts. By utilizing carefully crafted keywords, we retrieve these from PubMed and other pertinent databases. This comprehensive approach ensures the potential inclusion of all relevant studies for a subsequent detailed screening. It is worth noting that while this list is expansive, most of these studies may not find their way into the final meta-analysis post-screening.

Automation through Python: We have devised a Python script aimed at harnessing the capabilities of ChatGPT for evaluating the amassed abstracts (a condensed sketch of such a script appears after this workflow description).

This script interacts with the ChatGPT API (specifically, the GPT-4 version) and, when furnished with tailored prompts, extracts structured responses from ChatGPT.

Typically, the AI’s response commences with a succinct summary, delves into explanations aligned with each criterion, and concludes with a decisive judgment, as exemplified by the examples in the “ Examples of abstract screening by using LLMs ” section.

This automated process efficiently saves ChatGPT’s verdicts on each abstract for ensuing analyses. For instance, it extracts the final decisions regarding the inclusion or exclusion of studies and determines the stance on each pre-specified criterion for every abstract, as exemplified by the last example in the “ Examples of abstract screening by using LLMs ” section.

Additionally, to ascertain the efficiency and cost-effectiveness of this methodology, the script also monitors the time, token usage, and the financial implications of querying the OpenAI API.

In essence, we envision this procedure as delegating the meticulous task of poring over scientific summaries to an AI assistant. This virtual entity meticulously sifts through each summary, determining its alignment with stipulated criteria.
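The following condensed sketch conveys the shape of such a script. It is written against the 2023-era openai Python package (openai < 1.0, whose ChatCompletion interface has since been superseded); the model name, the simple yes/no parsing rule, and the cost rate are illustrative assumptions, not the study’s exact code.

```python
# Condensed sketch of a screening script: send a prompt, record the verdict,
# token usage, cost, and elapsed time for downstream analysis.
import time
import openai

openai.api_key = "YOUR_API_KEY"
USD_PER_1K_TOKENS = 0.03  # placeholder rate; actual pricing varies by model

def screen_abstract(prompt: str):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # favor reproducible verdicts
    )
    text = response["choices"][0]["message"]["content"]
    tokens = response["usage"]["total_tokens"]
    verdict = "yes" if text.strip().lower().startswith("yes") else "no"
    elapsed = time.time() - start
    cost = tokens / 1000 * USD_PER_1K_TOKENS
    return verdict, text, tokens, cost, elapsed
```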

Tuning parameters in the ChatGPT API: The effectiveness of the ChatGPT API is not only dependent on the input data; it is also significantly influenced by adjustable parameters that can refine the responses. Parameters such as temperature, top k, and top p critically affect model performance by modulating the randomness and focus of the output. While fine-tuning these parameters can improve results, it requires significant technical expertise and resources. Defaults, which are rigorously tested by developers, strike a balance between output quality and ease of use, making LLMs more accessible to a wider range of users without the need for complex parameter optimization. While customization holds promise for custom applications, the default settings provide an efficient and practical solution that facilitates wider adoption of LLM technologies.

Given the complexity of fine-tuning these parameters to optimize performance specifically for abstract screening, our study primarily used the recommended default settings for these parameters as provided by the respective platforms (detailed in Table 1 ). This approach was chosen to maintain the feasibility of our experiments and to ensure that our findings are applicable to typical deployment scenarios.

The automated workflow described applies to other LLM tools or different versions of the same tools. Throughout our study, we have tested several popular LLM tools available to us. The specific LLM models and their versions used at the time of our first submission are detailed in Table  1 , along with the tuning parameters.

The field of large language models (LLMs) has evolved rapidly since we started this study in 2023. New models are frequently released, and existing versions are constantly updated. To account for these advances, we have expanded our analysis to include results from the latest versions of previously studied models, as well as a few entirely new ones. As a best practice, we set the temperature parameter to 0 for all latest models. Table 1 now includes these latest models and their versions.

A brief explanation of these parameters is as follows:

Temperature: The temperature controls the randomness of the output, on a scale from 0 to 2. A value of 0 is essentially deterministic, values above 1 are increasingly random, and the maximum of 2 gives the most creative and variable outputs.

Max length: The max length is the maximum number of tokens the model may generate as a response. A single word generally corresponds to one to a few tokens.

Stop sequences: This parameter controls which tokens or phrases will stop the LLM from generating more text.

Top p: When generating text, the model samples from the smallest set of most likely tokens whose cumulative probability reaches p (the “nucleus”). Lower top p values mean sampling from a smaller, more top-weighted nucleus.

Top k: When generating text, the model samples from the k most likely tokens. Lower values of k restrict the choice to fewer, higher-probability tokens; k = 1 always selects the single most likely token.

Frequency penalty: This parameter penalizes tokens in proportion to how often they have already appeared in the input and output text. A frequency penalty of 0 means token frequency has no impact, and tokens are generated based on their probability alone.

Presence penalty: This parameter penalizes tokens that have already appeared at least once, regardless of how often. A presence penalty of 0 means prior appearance has no impact, and tokens are generated based on their probability alone.
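As an illustration, the snippet below shows how several of these parameters map onto a chat completion call; the parameter names follow the OpenAI API circa 2023 (which does not expose top k, though PaLM, Llama, and Claude interfaces expose an analogous knob), and the values shown are defaults of the kind discussed above rather than tuned settings.

```python
# Illustrative parameter settings for a screening call.
params = dict(
    temperature=0,        # deterministic, reproducible output
    max_tokens=512,       # "max length" of the generated response
    top_p=1.0,            # sample from the full nucleus
    frequency_penalty=0,  # no penalty on frequently repeated tokens
    presence_penalty=0,   # no penalty on previously seen tokens
)
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages, **params)
```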

Study design

In our pursuit to assess ChatGPT’s proficiency in abstract screening, we selected certain benchmark databases that have existing performance data from other methodologies. This selection aids in a comparative analysis of performance.

In selecting the systematic reviews for our study, we used a systematic approach guided by specific criteria to ensure relevance and reliability. These studies were selected from the publicly available SYNERGY [ 23 ] dataset, which contains 26 systematic reviews from different disciplines. Key selection criteria included:

Clarity and conciseness of eligibility criteria: The selected studies had well-defined and explicit eligibility criteria. This clarity is essential for accurate replication of the study selection process, which is critical for assessing the performance of LLM tools in an analogous real-world application.

Completeness and cleanliness of data: We ensured that the selected reviews had complete datasets, with all necessary information on included and excluded studies clearly documented, minimizing the risk of ambiguities affecting our analysis.

In addition, to comply with the AMSTAR-2 [ 26 ] guidelines, in particular point 5, we reviewed the methodologies of these reviews to confirm the selection of studies was performed in duplicate and disagreements were resolved by consensus. While our analysis assumes that these systematic reviews adhere to high standards, we recognize the inherent limitations of using pre-existing datasets as a proxy for gold standards in the discussion section.

We picked the following 3 databases from the publicly available SYNERGY dataset [ 23 ]:

Bannach-Brown 2016 [ 27 ]—topic: use of animal models to study depressive behavior

Human-curated (gold standard) results: 1258 excluded abstracts and 230 included abstracts.

We randomly selected 100 excluded abstracts and 100 included abstracts for screening by LLM tools.

Meijboom 2021 [ 28 ]—topic: retransitioning of etanercept in patients with a rheumatic disease

Human-curated (gold standard) results: 599 excluded abstracts and 32 included abstracts.

We randomly selected 100 excluded abstracts and used all 32 included abstracts for screening by LLM tools.

Menon 2022 [ 29 ]—topic: methodological rigor of systematic reviews in environmental health

Human-curated (gold standard) results: 896 excluded abstracts and 73 included abstracts.

We randomly selected 100 excluded abstracts and used all 73 included abstracts for screening by LLM tools.

For each chosen database, abstracts were categorized as either “cases” (those included per the gold standard) or “controls” (those excluded per the gold standard). From each category, we randomly selected 100 abstracts (all abstracts were used if a category contained fewer than 100). These abstracts underwent evaluation by ChatGPT (v4.0) as per our established workflow. Subsequently, ChatGPT’s decisions were compared against the gold standard to determine sensitivity, specificity, and overall accuracy. The same abstracts were also processed using the other LLM tools listed in Table  1 to record their respective verdicts.

Statistical analysis

To quantify the efficacy of ChatGPT and other AI tools for each database, we calculated the following metrics: (1) sensitivity, (2) specificity, and (3) overall accuracy, where sensitivity is defined as the number of true positives divided by the sum of true positives and false negatives, specificity as the number of true negatives divided by the sum of true negatives and false positives, and accuracy as the sum of true positives and true negatives divided by the total number of abstracts. For each metric, the associated 95% confidence interval was also determined. Although it is very common in the field to report the F1 score, recall rate, and precision, we believe it is more appropriate to report sensitivity and specificity given this study design. In addition, the F1 score, recall rate, and precision can be derived from sensitivity and specificity.
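A small sketch of these computations is given below. The text does not specify the interval method, so the normal-approximation (Wald) interval in this sketch is an assumption; the counts are placeholders.

```python
# Sensitivity, specificity, and accuracy with 95% confidence intervals.
import math

def metric_with_ci(successes: int, total: int, z: float = 1.96):
    p = successes / total
    hw = z * math.sqrt(p * (1 - p) / total)  # half-width of the 95% CI
    return p, (p - hw, p + hw)

tp, fn, tn, fp = 93, 7, 96, 4  # illustrative counts: 100 cases, 100 controls
sensitivity, sens_ci = metric_with_ci(tp, tp + fn)
specificity, spec_ci = metric_with_ci(tn, tn + fp)
accuracy, acc_ci = metric_with_ci(tp + tn, tp + fn + tn + fp)
```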

Furthermore, to explore the potential of a unified decision-making process, we combined the decisions from all AI tools using a voting mechanism. The majority decision across the tools was taken as the final verdict for each abstract. For this consolidated approach, we again computed sensitivity, specificity, overall accuracy, and the associated 95% CIs for each database. We also explore the use of latent class analysis (LCA), a model-based clustering approach, to derive consolidated decisions. More details on this LCA approach are provided in the “ Beyond majority voting ” section.
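The majority-voting consolidation can be expressed compactly as below; the tie-breaking rule (ties counted as “include”) is our assumption for illustration, since ties cannot arise with an odd number of tools.

```python
# Majority-voting consolidation of per-tool verdicts.
import numpy as np

def majority_vote(verdicts: np.ndarray) -> np.ndarray:
    """verdicts: (n_tools, n_abstracts) array of 0/1 decisions per tool."""
    votes = verdicts.sum(axis=0)
    return (2 * votes >= verdicts.shape[0]).astype(int)  # 1 = include

# three tools, four abstracts:
# majority_vote(np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]]))
# -> array([1, 1, 1, 0])
```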

For a given database, 100 cases and 100 controls yield a two-sided 95% confidence interval with a half-width equal to 0.048 when the underlying sensitivity (specificity) is approximately 95%.

All statistical analyses were conducted using the R statistical software (version 4.3.1). All tests were two-sided with an alpha level set at 0.05 unless otherwise mentioned.

To improve the transparency and reproducibility of studies using AI tools, we have included the TRIPOD + AI checklist [ 30 ] in our report. This checklist has been adapted to reflect the specifics of our research, which focuses on the evaluation of large language models for abstract screening rather than diagnostic or prognostic modeling. The completed checklist is presented in Table S 1 , to provide readers with a comprehensive overview of our study’s adherence to established reporting standards.

Results

We present the results for each of the 3 databases. For each, we first present the prompts used when calling LLM tools to screen an abstract, then the performance data (accuracy, sensitivity, and specificity for each method or LLM tool), followed by a summary of the performance and a comparison of the different methods against ChatGPT v4.0.

Results on the Bannach-Brown 2016 database (see Table  2 )

The prompts we used for screening abstracts in this database are as follows:

Conduct a systematic review on animal depression. I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion. After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria: Inclusion Criteria: 1. Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above). 2. Animals of all ages, sexes and species, where depression-like phenotype intended to mimic the human condition have been induced. Including animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer). 3. All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behavior or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency. 4. Studies measuring behavioral, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes and where genomic, proteomic or metabolomic outcomes are measured in addition to behavioral, anatomical, electrophysiological, histological or neurochemical outcomes. Exclusion Criteria: 1. Review article, editorials, case reports, letters or comments, conference or seminar abstracts, studies providing primary data but not appropriate control group. 2. Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behavior. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies). 3. Studies that investigate treatments or interventions, but no depressive behavior or model of depression is induced (e.g. toxicity and side-effect studies). 4. Where metabolic outcome measures are the primary outcome measure of a study. Where genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures in a study, they will be excluded.

Here is the abstract:

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.945) and specificity (0.960), and satisfactory sensitivity (0.930). The combined decision using majority voting improved sensitivity (0.970) considerably but lowered specificity (0.870). Comparatively, the zero-shot method was less effective across these metrics, while the hybrid method attained superior accuracy (0.961) and specificity (0.982), albeit with reduced sensitivity (0.843). We acknowledge that our findings regarding the zero-shot and hybrid methods differ from those reported in [ 18 ] for the same set of abstracts. This discrepancy could be attributed to the use of distinct sets of embeddings in our analysis compared to the previous study.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions, but did not surpass the performance of ChatGPT (v4.0). Claude 3 performed well, but still did not surpass the performance of ChatGPT (v4.0).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value = 0.002) than Google PaLM 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value = 0.008) and better than Llama-2 ( p value < 0.001) in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was no statistically significant difference ( p value = 0.134) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was significantly worse ( p value = 0.008) than ChatGPTv4.0 in terms of specificity.

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We assessed the performance (sensitivity and specificity) of ChatGPT v4.0 against both the zero-shot and hybrid approaches using the McNemar test. Specifically, we aligned the screening results from 100 cases and 100 controls as per the ChatGPT v4.0 method and similarly for the zero-shot and hybrid methods, testing for inconsistencies between these approaches as previously done. Our analysis revealed that ChatGPT v4.0 significantly outperformed the zero-shot method in sensitivity ( p value < 0.001) but showed comparable effectiveness in specificity ( p value = 0.37). Additionally, ChatGPT v4.0 demonstrated marginally superior sensitivity compared to the hybrid method ( p value = 0.07), while its performance in specificity was similar ( p value = 1.00).
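For readers who wish to reproduce this style of paired comparison, a sketch using the statsmodels implementation of the McNemar test follows; the 2 × 2 counts are placeholders, not the study’s data.

```python
# Paired comparison of two screening methods on the same abstracts.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# rows: method A correct / incorrect; columns: method B correct / incorrect
table = np.array([[90, 5],
                  [1, 4]])
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(result.pvalue)
```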

Results on the Meijboom 2021 database (see Table  3 )

The prompts we used for screening abstracts in this database are as follows:

Conduct a systematic review on transitioning patients from an originator to a corresponding biosimilar.

I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria:

Articles were included if they met the following criteria:

Study involved transitioning from a TNFα inhibitor (including etanercept, infliximab, and adalimumab) originator to a biosimilar

The number of patients who retransitioned was reported or could be calculated

The article was an original research article published in a peer-reviewed journal

The article included baseline characteristics of the patients who transitioned

The article was written in English

The full-text version of the article could be obtained.

Transitioning was defined as patients in whom the biosimilar was introduced after the originator, without treatment with other drugs in between. Retransitioning was defined as restarting the originator directly after discontinuing a biosimilar, without treatment with other drugs in between. In summary, transitioning was defined as switching from the originator to a biosimilar; retransitioning was defined as switching from the originator to a biosimilar and back to the originator. Both transitioning and retransitioning involved changes with the same active biological substance.

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.840), though its specificity (0.860) was not the highest and its sensitivity (0.812) was only moderate. Compared to ChatGPT v4.0, the combined decision using majority voting did not improve overall accuracy (0.720), but improved sensitivity (1.000) at the expense of specificity (0.630).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that, in terms of sensitivity, ChatGPTv4.0 performed significantly better ( p value < 0.001) than Google PaLM 2, but significantly worse than ChatGPTv3.5 ( p value = 0.001) and Llama 2; in terms of specificity, ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value < 0.001) and Llama 2 ( p value < 0.001), but worse than Google PaLM 2 ( p value = 0.002). Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly worse ( p value > 0.50) than ChatGPTv4.0 in terms of specificity.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0), and Gemini-1.0-pro (vs. PaLM 2) did not improve performance either. However, Llama 3 (vs. Llama 2) improved over its older version and surpassed the performance of ChatGPT (v4.0). Claude 3 also slightly surpassed the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We evaluated the performance of ChatGPT v4.0, focusing on sensitivity and specificity, in comparison with the zero-shot and hybrid approaches, employing the McNemar test as described above. In this analysis, we aligned the screening results from 32 cases and 100 controls for the tests. Our findings indicated that ChatGPT v4.0 significantly surpassed the zero-shot method in sensitivity ( p value = 0.0002) and exhibited marginally improved specificity ( p value = 0.099). Furthermore, ChatGPT v4.0 showed notably higher sensitivity than the hybrid method ( p value < 0.001), although its specificity was comparatively lower.

Results on the Menon 2022 database (see Table  4 )

The prompts we used for screening abstracts in this database are as follows:

“Conduct a systematic review on the methodological rigour of systematic reviews in environmental health.

I provide the title and abstract for one journal article.

Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation.

Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

Here are the eligibility criteria:

To be eligible for inclusion in the SR sample, documents had to fulfill the following criteria:

Identify explicitly as a “systematic review” in their title

Assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome. Environmental exposures can include air and water pollutants, radiation, noise, occupational hazards, lifestyle factors (like diet or physical activity) and lifestyle choices influenced by family and peers (like substance use), social and economic factors (like stress from work or living conditions).

Include studies in people or mammalian models

Be available in HTML format

Abstract X”

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.913) and sensitivity (0.932), but not the highest specificity (0.900). Compared to ChatGPT v4.0, the combined decision using majority voting did not improve overall accuracy (0.884) or sensitivity (0.808), but improved specificity (0.940).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value < 0.001), Google PaLM 2, and Llama 2 ( p value = 0.02) in terms of sensitivity, and worse than ChatGPTv3.5 and Google PaLM 2 in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly different ( p value = 0.134) from ChatGPTv4.0 in terms of specificity.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). However, both Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions and surpassed the performance of ChatGPT (v4.0). Claude 3 also performed well but did not surpass the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We aligned the screening results from 73 cases and 100 controls based on the ChatGPT v4.0 method, and similarly for the zero-shot and hybrid methods, to test for inconsistencies between these approaches, using the McNemar test as done in previous assessments. Our analysis showed that ChatGPT v4.0 significantly outperformed the zero-shot method in both sensitivity ( p value < 0.001) and specificity ( p value = 0.016). In comparison with the hybrid method, ChatGPT v4.0 also demonstrated superior sensitivity ( p value < 0.001) and better specificity ( p value = 0.04).

Monetary cost and time cost

To use the ChatGPT API or other LLM tools, the owners of these platforms charge a predetermined rate for access to the corresponding APIs. These fees are calculated in USD per thousand tokens, where tokens are the basic units used by these LLM platforms to quantify text length. In this context, a token can represent a word, a punctuation mark, or a character. The financial cost of screening 200 abstracts was approximately $6 for ChatGPT v4.0, $0.2 for ChatGPT v3.5, $10 for Llama 2 (using Replicate), while Google PaLM 2 offered its services for free to invited developers. Thus, the cumulative cost of evaluating 200 abstracts across all platforms was approximately $16.2. The cumulative cost of evaluating 200 abstracts across all latest models ($3 for GPT-4-Turbo, $0.05 for GPT-3.5-Turbo, free for Gemini-1.0-pro, $0.05 for Llama-3, $4 for Claude) was less, approximately $7.1. In terms of time efficiency, processing 200 abstracts with each of these LLM tools took approximately 10–20 min using a single thread. However, it is imperative to recognize that abstract screening lends itself well to parallelization. Consequently, one could significantly speed up the process by setting up multiple threads to simultaneously screen different subsets of abstracts, thereby reducing the overall time required for completion. This parallel approach not only increases efficiency, but also ensures that large amounts of data can be processed in a timely manner, making LLM tools even more attractive for large-scale abstract screening tasks. In summary, the monetary and time costs of using LLM tools for abstract screening are negligible compared to manual labeling.
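The per-abstract figures implied by these totals are simple arithmetic, sketched below using only the numbers reported above (rates as of the study period).

```python
# Per-abstract costs implied by the reported totals (200 abstracts each).
reported_totals_usd = {"ChatGPT v4.0": 6.00, "ChatGPT v3.5": 0.20,
                       "Llama 2 (Replicate)": 10.00}
for tool, usd in reported_totals_usd.items():
    print(f"{tool}: ${usd / 200:.4f} per abstract")
```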

Beyond majority voting

We have expanded our analysis to include a variety of approaches for synthesizing decisions across different LLM tools. Our methodology is inspired by the concept of combining multiple diagnostic tests in the absence of a gold standard, akin to situations where human expert consensus is not available. Several publications discuss such scenarios [ 31 , 32 ]; drawing on them, we chose to use latent class analysis (LCA) models.

Latent class analysis (LCA) is a statistical method used to identify subgroups within a population, which are not directly observed (hence “latent”) [ 33 , 34 , 35 ]. It is particularly useful when the research interest lies in categorizing individuals into mutually exclusive groups based on their responses to multiple observed variables. In the context of abstract screening, LCA can offer a sophisticated means of integrating decisions from different LLM tools without relying on a gold standard, typically provided by human expert consensus. This approach assumes that the unobserved subgroups (or “latent classes”) explain the dependence between the observed decisions made by each of the LLM tools.

Utilizing the LCA model, we treat the decisions from all LLM tools as dichotomous variables, corresponding to the adherence to each inclusion or exclusion criterion, as well as the overall decision. For instance, within the Bannach-Brown 2016 database (BB2016), there are eight criteria in total with four criteria each for inclusion and exclusion and one overall decision for inclusion, resulting in a total of nine binary items per LLM tool. In our analysis, we incorporated decisions from GPT v4.0, v3.5, and Llama 2. Decisions from Google PaLM 2 were excluded due to a high frequency (10% or more) of incomplete responses. Consequently, for the Bannach-Brown 2016 database, we worked with 27 binary items. For other databases such as Meijboom 2021 (Mj2021) and Menon 2022 (Me2022), the binary items totaled 21 and 15, respectively. It is important to note that LCA models were fitted to the binary data of each database independently.

The LCA model fitting process enables us to calculate the posterior probabilities of each abstract belonging to specific latent classes or subgroups. Abstracts are then categorized based on these probabilities, with assignment to the class for which an abstract has the highest posterior membership probability. The determination of the number of latent classes is a critical step in the LCA model fitting, which requires a priori specification. In our evaluation, we explored models with class numbers ranging from 2 to 6 and utilized the Bayesian information criterion (BIC) to identify the most “optimal” LCA model for our datasets.
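For illustration, a compact EM implementation of a binary-item latent class model, with BIC for selecting the number of classes, is sketched below in Python; our statistical analyses used R, so this is a simplified stand-in rather than the actual fitting code.

```python
# EM for a latent class model with binary items: alternate between posterior
# membership probabilities (E-step) and updates of the class weights pi and
# per-class item probabilities theta (M-step); BIC is used for model selection.
import numpy as np

def fit_lca(X: np.ndarray, n_classes: int, n_iter: int = 200, seed: int = 0):
    """X: (N, J) binary array. Returns pi, theta, posterior memberships, BIC."""
    rng = np.random.default_rng(seed)
    N, J = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)
    theta = rng.uniform(0.25, 0.75, size=(n_classes, J))
    for _ in range(n_iter):
        # E-step: posterior membership probabilities, computed in log space
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T  # (N, C)
        a = np.log(pi) + log_lik
        a -= a.max(axis=1, keepdims=True)
        post = np.exp(a)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class weights and per-class item probabilities
        pi = post.mean(axis=0)
        theta = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    # observed-data log-likelihood and BIC for model selection
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    a = np.log(pi) + log_lik
    m = a.max(axis=1, keepdims=True)
    ll = (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()
    bic = -2 * ll + ((n_classes - 1) + n_classes * J) * np.log(N)
    return pi, theta, post, bic

# model selection over 2-6 classes, as done for Tables 5-6:
# best_c = min(range(2, 7), key=lambda c: fit_lca(X, c)[3])
```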

Table 5 shows that after applying the Bayesian information criterion (BIC) to determine the most appropriate model for our data, we identified a 3-class model as the best fit for the binary decisions derived from the BB2016 database. Similarly, a 4-class model was optimal for the Mj2021 database, while a 3-class model was again best for the Me2022 database. The confusion matrices generated by the selected LCA models for each database provided a clear juxtaposition between the LLM-assigned classes and the actual labels of the abstracts (see Table  6 ).

The performance metrics derived from these models are noteworthy. For the BB2016 database, if we categorize abstracts in class 2 as “included” and assign those in classes 1 and 3 to the “excluded” category, the LCA model achieved a sensitivity of 93% and a specificity of 96%, indicating a high degree of accuracy in classifying relevant and irrelevant abstracts. In the case of the Mj2021 database, if we view class 1 as the “included” group and classes 2 and 3 as the “excluded” group, the model achieved a perfect sensitivity of 100%, meaning that it correctly identified all relevant abstracts, although the specificity was lower at 79%, indicating a higher rate of false positives. Similarly, for the Me2022 database, considering class 1 to be the “excluded” category and classes 2 and 3 to be the “included” group, the model showed a sensitivity of 94.5% and a specificity of 83%, a good balance between identifying relevant abstracts and avoiding false inclusions.

These results highlight the robustness of the latent class analysis approach to the abstract screening task, providing an effective method for classifying abstracts when a gold standard is not available. The varying levels of sensitivity and specificity across databases underscore the need to tailor the LCA model to the specific characteristics of each dataset as well as further future research.

We have incorporated results from the latest LLM models to increase the robustness and relevance of our findings. However, to maintain consistency and comparability with our original analyses, we have not incorporated these new results into the previously established majority voting or latent class analysis (LCA) approaches. Instead, we have chosen to make all raw data, including results from these newer models, freely available in our GitHub repository. This approach allows interested researchers and practitioners to conduct further investigations or apply alternative methods of analysis. By providing access to this additional data, we aim to promote transparency and enable the broader community to engage with and potentially build upon our work.

Discussion

This study began with a rigorous exploration of the capabilities of large language models (LLMs) in abstract screening. We used automation scripts developed in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, Google PaLM 2, and Meta Llama 2, as well as the latest versions of these tools. Our central goal was to evaluate the efficiency and accuracy of these tools across three different databases of abstracts, leading us to a nuanced understanding of their potential in this context.

Large language models (LLMs), particularly ChatGPT, have garnered global attention since their inception. Employing LLMs for abstract screening in systematic reviews is an innovative concept [ 32 , 33 ] and remains underexplored. This study presents the first comprehensive evaluation of LLMs applied to systematic review processes. The findings are encouraging, suggesting that LLMs could revolutionize abstract screening. Specifically, ChatGPT v4.0 exhibited stellar performance across three test scenarios, achieving an accuracy of at least 85%. Furthermore, it attained sensitivity and specificity rates ranging from 80% to an impressive 95%. These exceptional outcomes highlight the substantial promise of LLMs in abstract screening, offering an efficient and capable alternative to the conventional, laborious approaches that typically necessitate extensive human annotation.

However, it is important to acknowledge that we are still in the early stages of integrating LLM tools into the abstract screening field, and they are not without their imperfections. For example, even the best-performing LLM tool, ChatGPT v4.0, occasionally excluded a substantial share (20%) of relevant studies. These tools are not a universal solution to all the challenges associated with abstract screening, and they are not ready to completely replace human expertise in this area. Instead, they should be embraced as invaluable assistants in the abstract screening process.

In discussing the limitations of our study, it is important to recognize the constraints associated with our dataset selection and model configuration. We used only three databases from the SYNERGY set, limiting the generalizability of our findings across disciplines and datasets. In addition, the reliance on human-curated labels as a gold standard, while necessary, introduces potential biases due to the retrospective nature of our analysis. These labels may contain errors, and the methodology used in the original reviews to resolve discrepancies may affect the validity of our conclusions.

The parameters chosen for our LLMs—temperature, top k, top p, and prompts—were set to defaults or based on natural conversations to balance output quality with ease of use. While this approach minimizes the need for technical expertise, it may not be optimal for all screening scenarios. In addition, the reliance of LLMs on abstracts alone, as opposed to full-text evaluations, presents a fundamental challenge; critical data influencing inclusion or exclusion may not be present in the abstracts, potentially compromising screening accuracy.

In addition, the rapid development of LLMs and their “black box” nature pose challenges to the transparency that is essential in scientific settings. The environmental impact of using these computationally intensive models is also significant [ 36 ], requiring sustainability considerations. Future research should focus on refining these tools to increase transparency and efficiency, and ensure their responsible development and use in systematic reviews.

Our research suggests that LLM tools are ready to take on a role in abstract screening and are poised to have an immediate and positive impact on the process. Their integration into abstract screening can manifest itself in a few ways. They can serve as autonomous AI reviewers, adding an extra layer of scrutiny and ensuring thoroughness. Our findings suggest that a collective decision, such as one derived from majority voting, can sometimes improve sensitivity, underscoring the potential of LLM tools as a reliable safeguard against oversight, ensuring both comprehensiveness and accuracy.

In addition, LLM tools can facilitate a synergistic partnership with human experts. They are adept at identifying “high-risk” abstracts where different LLM tools have reached different judgments, flagging them for further human evaluation, and promoting a careful and error-free screening process while minimizing human effort.

Another exciting prospect is the integration of LLM tools with hybrid and active learning approaches. In this scenario, LLM tools could autonomously annotate abstracts in the training set, minimizing the need for human labeling. Subsequently, these labeled abstracts could be used to train custom NLP models, paving the way for a streamlined and efficient abstract screening process and for significant time and resource savings. Further research is needed to understand how the “uncertainty” in LLM-based decisions, when human-curated labels are not used, affects the performance of the hybrid approach. We also note from the test examples that the performance of hybrid approaches varies widely from example to example and depends on the text embedding tools used. Extensive research is needed to perfect the hybrid approaches.

Another direction for future research is to explore how to tune different versions of LLM tools and how best to derive collective decisions from them. By varying the tuning parameters, one could create multiple variants of an LLM tool; if these variants are similarly accurate but their decisions are not highly correlated, better collective decisions should follow, as is commonly observed in ensemble learning [37]. However, this could be costly and would require considerable exploration.
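
A quick way to check that precondition is to measure how correlated two variants' decisions are. The sketch below is our illustration with made-up decisions, not an experiment from the study:

    # Minimal sketch: Pearson correlation between two variants' binary
    # include/exclude decisions. Similar accuracy plus low correlation
    # suggests a pair worth ensembling.
    import numpy as np

    def decision_correlation(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(np.corrcoef(a, b)[0, 1])

    variant_a = [1, 0, 1, 1, 0, 1]  # hypothetical decisions
    variant_b = [1, 1, 0, 1, 0, 1]
    print(decision_correlation(variant_a, variant_b))  # 0.25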

Most importantly, it would be particularly valuable to test these LLM tools prospectively, alone or integrated with approaches such as zero-shot classification and active learning, in real-world systematic review and meta-analysis projects. This would provide deeper insight into their practical utility and effectiveness and a more comprehensive understanding of their impact in live research environments.

In summary, while LLM tools may not yet be able to fully replace human experts in abstract screening, their capacity to transform the screening process is undeniable, and that transformation is imminent. With continued advances in the technology and ongoing refinement, these tools will play a critical role in the future of abstract screening, ushering in a new era of efficiency and effectiveness.

Availability of data and materials

The datasets analyzed during the current study are available in the SYNERGY repository (https://github.com/asreview/synergy-dataset). We have created a GitHub repository (https://github.com/mikeli380/LLMAbstractScreening) to host our code (R/Python), abstracts, and intermediate results so that interested readers can replicate our findings.

References

1. Higgins JPT, et al. Cochrane handbook for systematic reviews of interventions. 2019.

2. Moher D, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7). https://doi.org/10.1371/journal.pmed.1000097.

3. Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ. 1996;312(7023):71. https://doi.org/10.1136/bmj.312.7023.71.

4. Lavis JN, Posada FB, Haines PA, Osei E. Use of research to inform public policymaking. Lancet. 2004;364(9445). https://doi.org/10.1016/S0140-6736(04)17317-0.

5. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9). https://doi.org/10.1371/journal.pmed.1000326.

6. Polanin JR, Pigott TD, Espelage DL, Grotpeter JK. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10(3). https://doi.org/10.1002/jrsm.1354.

7. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358(3). https://doi.org/10.1056/nejmsa065779.

8. Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002;21(11). https://doi.org/10.1002/sim.1190.

9. Wang Z, Nayfeh T, Tetzlaff J, O'Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1). https://doi.org/10.1371/journal.pone.0227742.

10. O'Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1). https://doi.org/10.1186/2046-4053-4-5.

11. OpenAI. Introducing ChatGPT Plus. OpenAI Blog; 2023. https://openai.com/blog/chatgpt-plus. Accessed 8 Nov 2023.

12. Google DeepMind. Introducing PaLM 2. Google Blog; 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/. Accessed 8 Nov 2023.

13. Meta AI Research. Meta and Microsoft introduce the next generation of Llama. Facebook AI Blog; 2023. https://ai.meta.com/blog/llama-2/. Accessed 8 Nov 2023.

14. Claude.ai. Claude AI API reference. 2024. https://docs.anthropic.com/en/api/getting-started. Accessed 4 Jun 2024.

15. Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. 2012. https://doi.org/10.1145/2110363.2110464.

16. Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11(7). https://doi.org/10.3390/systems11070351.

17. Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models 'the answer' to bringing us closer to systematic review automation? Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02243-z.

18. Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2019;41(9). https://doi.org/10.1109/TPAMI.2018.2857768.

19. Moreno-Garcia CF, Jayne C, Elyan E, Aceves-Martins M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis Anal J. 2023;6. https://doi.org/10.1016/j.dajour.2023.100162.

20. Ferdinands G, et al. Performance of active learning models for screening prioritization in systematic reviews: a simulation study into the Average Time to Discover relevant records. Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02257-7.

21. Boetje J, van de Schoot R. The SAFE procedure: a practical stopping heuristic for active learning-based screening in systematic reviews and meta-analyses. Syst Rev. 2024;13(1). https://doi.org/10.1186/s13643-024-02502-7.

22. Yang Y, Loog M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018;83. https://doi.org/10.1016/j.patcog.2018.06.004.

23. De Bruin RJ, Ma Y, Ferdinands G, Teijema J, van de Schoot R. SYNERGY: open machine learning dataset on study selection in systematic reviews. DataverseNL; 2023. https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HE6NAQ.

24. Millard SJ, Weston-Green K, Newell KA. The Wistar-Kyoto rat model of endogenous depression: a tool for exploring treatment resistance with an urgent need to focus on sex differences. Prog Neuropsychopharmacol Biol Psychiatry. 2020;101. https://doi.org/10.1016/j.pnpbp.2020.109908.

25. Zawia NH, Mattia CJ, Bondy SC. Differential effects of difluoromethylornithine on basal and induced activity of cerebral ornithine decarboxylase and mRNA. Neuropharmacology. 1991;30(4). https://doi.org/10.1016/0028-3908(91)90058-J.

26. Shea BJ, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358. https://doi.org/10.1136/bmj.j4008.

27. Bannach-Brown A, Liao J, Wegener G, Macleod M. Understanding in vivo modelling of depression in non-human animals: a systematic review protocol. Evid Based Preclin Med. 2016;3(2). https://doi.org/10.1002/ebm2.24.

28. Meijboom RW, Gardarsdottir H, Becker ML, ten Wolde S, Egberts TCG, Giezen TJ. Incidence of and reasons and determinants associated with retransitioning from biosimilar etanercept to originator etanercept. BioDrugs. 2021;35(6). https://doi.org/10.1007/s40259-021-00501-x.

29. Menon JML, Struijs F, Whaley P. The methodological rigour of systematic reviews in environmental health. Crit Rev Toxicol. 2022;52(3). https://doi.org/10.1080/10408444.2022.2082917.

30. Collins GS, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7). https://doi.org/10.1136/bmjopen-2020-048008.

31. Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3). https://doi.org/10.1093/oxfordjournals.aje.a117428.

32. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1). https://doi.org/10.2307/2530508.

33. Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1). https://doi.org/10.1111/j.0006-341X.2001.00158.x.

34. Collins LM, Lanza ST. Latent class and latent transition analysis: with applications in the social, behavioral, and health sciences. 2010.

35. McCutcheon AL. Latent class analysis. Newbury Park: Sage; 1987.

36. Rillig MC, Ågerstrand M, Bi M, Gould KA, Sauerland U. Risks and benefits of large language models for the environment. Environ Sci Technol. 2023;57(9). https://doi.org/10.1021/acs.est.3c01106.

37. Sagi O, Rokach L. Ensemble learning: a survey. Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(4). https://doi.org/10.1002/widm.1249.


Acknowledgements

The authors thank the reviewers and the Editor for their insightful comments and suggestions, which have greatly improved the quality of this work.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Michael Li & Xianming Tan

Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA

Jianping Sun

Department of Biostatistics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Xianming Tan


Contributions

XT conceived the paper. ML and XT devised the study. ML, JS, and XT analyzed the datasets. ML, JS, and XT wrote the manuscript. All authors edited and approved the final manuscript.

Corresponding author

Correspondence to Xianming Tan.

Ethics declarations

Ethics approval and consent to participate

Competing interests

The authors declare that they have no competing interests.


Supplementary Information

Supplementary material 1: Table S1. TRIPOD+AI checklist.


About this article

Cite this article

Li, M., Sun, J. & Tan, X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev 13, 219 (2024). https://doi.org/10.1186/s13643-024-02609-x


Received: 20 December 2023

Accepted: 11 July 2024

Published: 21 August 2024

DOI: https://doi.org/10.1186/s13643-024-02609-x


Keywords

  • Abstract screening
  • Automated text analysis
  • Google PaLM



Reference guides for the safe use of drugs during pregnancy: a systematic review and comparative analysis

Published: 26 August 2024

Xue-Feng Jiao, Panda Qiu, Zheyun Song, Xue Peng, Hailong Li, Linan Zeng & Lingli Zhang


Objective: To comprehensively evaluate and compare all the available reference guides for the safe use of drugs during pregnancy, with the goal of determining how scientific and reliable these guides are.

Methods: We searched PubMed, EMbase, CNKI, Wanfang Database, and VIP database to comprehensively identify the available reference guides. We then selected 103 drugs based on the relevant literature and compared each drug's recommendations across the different guides.

Results: A total of 14 available reference guides were identified. None of them assessed the risk of bias of the original studies or the quality of the current evidence. Seven guides used an expert consensus method to formulate pregnancy recommendations, while the remaining guides did not report how their recommendations were formed. Moreover, 77.7% of the selected drugs had inconsistent recommendations across guides, and the human and animal studies cited for the same drug differed from guide to guide.

Conclusions: Our results indicate that the current reference guides for the safe use of drugs during pregnancy fall short in scientific rigor and reliability, and that there are considerable discrepancies in their recommendations concerning drug use during pregnancy. The reasons for these discrepancies include: ① the literature searches in most guides were not comprehensive; ② none of the guides assessed the risk of bias of original studies or the quality of current evidence; and ③ the method used to formulate recommendations was markedly subjective and lacked scientific rigor.


Data availability

Data are provided within the manuscript or supplementary information files.



Funding

This study was supported by the Natural Science Foundation of Sichuan Province (grant number 2022NSFSC0644). The funders had no role in the development of this article (i.e., in the study design; the collection, analysis, and interpretation of data; report writing; or the decision to submit the paper for publication).

Author information

Xue-Feng Jiao and Panda Qiu contributed equally to this work and share first authorship.

Authors and Affiliations

Department of Pharmacy, West China Second University Hospital, Sichuan University, Renmin Nan Road 20, Chengdu, 610041, China

Xue-Feng Jiao, Panda Qiu, Hailong Li, Linan Zeng & Lingli Zhang

Evidence-Based Pharmacy Center, West China Second University Hospital, Sichuan University, Chengdu, China

National Medical Products Administration (NMPA) Key Laboratory for Technical Research On Drug Products In Vitro and In Vivo Correlation, Chengdu, China

Key Laboratory of Birth Defects and Related Diseases of Women and Children, Ministry of Education, Sichuan University, Chengdu, China

Xue-Feng Jiao, Panda Qiu, Xue Peng, Hailong Li, Linan Zeng & Lingli Zhang

West China School of Pharmacy, Sichuan University, Chengdu, China

Xinxiang Medical University, Xinxiang, China

Zheyun Song

Department of Obstetrics and Gynecology, West China Second University Hospital, Sichuan University, Chengdu, China

Chinese Evidence-Based Medicine Center, West China Hospital, Sichuan University, Chengdu, China

Lingli Zhang


Contributions

LL.Z. and XF.J. conceptualized the research question. XF.J. and P.Q. drafted and wrote the review. XF.J., P.Q., Z.S., and H.L. formulated the retrieval strategies and carried out data acquisition, data analysis, and quality assessment. XF.J., P.Q., Z.S., H.L., and L.Z. prepared the tables and figures. LL.Z. and X.P. critically revised the manuscript. All authors contributed to the research and approved the final manuscript.

Corresponding author

Correspondence to Lingli Zhang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.


Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 60 KB)


About this article

Jiao, XF., Qiu, P., Song, Z. et al. Reference guides for the safe use of drugs during pregnancy: a systematic review and comparative analysis. Eur J Clin Pharmacol (2024). https://doi.org/10.1007/s00228-024-03736-z


Received: 26 May 2024

Accepted: 01 August 2024

Published: 26 August 2024

DOI: https://doi.org/10.1007/s00228-024-03736-z


Keywords

  • Reference guides
  • The safe use of drugs
  • Recommendations
  • Comparative analysis

COMMENTS


  1. (PDF) A Short Introduction to Comparative Research

    comparative historical analysis in history, and psychological analysis (Smelser, 1973). Comparative research or analysis is a broad term that includes both quantitative and qualitative comparison.

  2. How to Write a Comparative Literature Review

    The review of Comparative Literature is a rich, interdisciplinary area, providing students with the opportunity to study literature across national borders, languages, cultures, time periods, genres and social fields and to engage with disciplines as various as history, translation studies, philosophy, psychology, critical theory, religious ...

  3. PDF How to Write a Comparative Analysis

    There are two basic ways to organize the body of your paper. In text-by-text, you discuss all of A, then all of B. In point-by-point, you alternate points about A with comparable points about B. If you think that B extends A, you'll probably use a text-by-text scheme; if you see A and B engaged in debate, a point-by-point scheme will draw ...

  4. PDF How to Write a Comparative Analysis

    Determine the focus of your piece. Determine if you will focus on the similarities, the differences, or both. Be sure you treat each individual the same; each person deserves the same amount of focus-meaning, do not place most of the emphasis on you or the other person. Find a balance.

  5. Methodological Approaches to Literature Review

    A literature review is defined as "a critical analysis of a segment of a published body of knowledge through summary, classification, and comparison of prior research studies, reviews of literature, and theoretical articles." (The Writing Center University of Wisconsin-Madison 2022) A literature review is an integrated analysis, not just a summary of scholarly work on a specific topic.

  6. How to Write a Literature Review

    Examples of literature reviews. Step 1 - Search for relevant literature. Step 2 - Evaluate and select sources. Step 3 - Identify themes, debates, and gaps. Step 4 - Outline your literature review's structure. Step 5 - Write your literature review.

  7. Comparative Analysis

    Work in groups of 3-4, practice writing claims for comparative analysis based on possible conversations your selected authors could have. Your team will write the claims on a Google doc shared by your instructor. The whole class will then review and evaluate each claim for effectiveness and significance. To practice writing and evaluating ...

  8. Comparative Analysis

    Definition. The goal of comparative analysis is to search for similarity and variance among units of analysis. Comparative research commonly involves the description and explanation of similarities and differences of conditions or outcomes among large-scale social units, usually regions, nations, societies, and cultures.

  9. Comparative Literature Review Essays

    A comparative review may, e.g., require you to examine two schools of thought, two issues, or the positions taken by two persons. You may create a hierarchy of issues and sub-issues to compare and contrast, as suggested by the following general plan. This model lists 3 options for structuring the body of the review. In all cases, you are expected to deal with the similarities (compare) and ...

  10. Comparative Analysis

    Comparative analysis is a multidisciplinary method, which spans a wide cross-section of disciplines (Azarian, 2011). It is the process of comparing multiple units of study for the purpose of scientific discovery and for informing policy decisions (Rogers, 2014). Even though there has been a renewed interest in comparative analysis as a research method over the last decade in fields such as ...

  11. A Framework for Using Qualitative Comparative Analysis for the Review

    of the Literature, Qualitative Comparative Analysis, Qualitative Data Analysis, Qualitative Comparative Analysis-Based Research Synthesis . The literature review has become the most common way of acquiring knowledge and oftentimes sets the direction for a study. Traditionally, two branches of a literature review have

  12. Comparative Research Methods

    Large-N Comparative Analysis. Comparative analysis is about control (Sartori, 1994). The influence of potentially significant variables is either controlled for by employing a most similar or a most different system design or, if we are dealing with larger number of cases, by way of statistical control.

  13. Approaches to Qualitative Comparative Analysis and good practices: A

    We limited the search to articles written in English that are listed in the "Social Sciences Citation Index" since 1956 using the search term "Qualitative Comparative Analysis" OR "Boolean" OR "QCA" in the title, abstract or WoS keywords Plus® AND the category "Public Administration", excluding literature reviews, research ...

  14. Comparative Literature Studies

    Comparative Literature Studies publishes comparative critical essays that range across the rich traditions of Africa, Asia, Europe, and North and South America, and that examine the literary relations between East and West, North and South. Articles may also explore movements, themes, forms, the history of ideas, relations between authors, the foundations of criticism and theory, and issues of ...

  15. A Step-by-Step Guide to Writing a Comparative Analysis

    Send Us Your Topic. 4. Organize information. For your readers to want to read your comparative analysis, it is important to structure your comments. The idea is to make it easy for your readers to navigate your paper and get them to find the information that interests them quickly. 5.

  16. (PDF) Four Varieties of Comparative Analysis

    The article starts by examining the definition of comparative analysis and what distinguishes it from analysis in general. It then identifies four varieties of comparative analysis according to (a ...

  17. Using qualitative comparative analysis (QCA) in systematic reviews of

    Systematic reviews that address policy and practice questions frequently need to not only assess the efficacy of a given intervention but to identify which intervention, out of a range of possibilities, might be relevant in the particular situation [].Network meta-analysis has been receiving widespread interest as it enables a reviewer indirectly to compare the relative effectiveness of ...

  18. (PDF) Comparative Literature: An Overview

    departments or Comparative Literature programs. Comparative literature is an interdisciplinary field whose practitioners study literature across national borders, across time periods, across ...

  19. DCC Curation Lifecycle Model 2.0: Literature Review and Comparative

    The literature covered in this report was gathered and reviewed as background for the DCC Curation Lifecycle Model 2.0 project. Led by Sayeed Choudhury at Johns Hopkins University and funded by the Alfred P. Sloan Foundation, the project provided a foundation for updating the original DCC model, originally published in 2008. A revised model is needed to accommodate the dramatic changes in the ...

  20. Using qualitative comparative analysis in a systematic review of a

    The BCTs abstracted during the completed review were determined and defined a priori by the review team and derived from a previous meta-analysis of medication adherence interventions and a published taxonomy of BCTs [24, 25].One study reviewer captured a study's use of each BCT as "yes" or "no" or "unclear" based on information available in the published intervention description ...

  21. Fuzzy‐set qualitative comparative analysis in consumer research: A

    As a result, the use of innovative and asymmetrical methods such as fuzzy-set qualitative comparative analysis (fsQCA), which draws from the complexity theory could help better predict and explain consumer decisions in the age of disruptions. ... Through a comprehensive method-based systematic literature review of 129 research articles ...

  22. Evaluating the effectiveness of large language models in abstract

    Objective This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows. Methods We developed automation scripts in Python to interact with the APIs of several LLM tools, including ...

  23. (PDF) Literature Review

    LITERATURE REVIEW - COMPARATIVE ANALYSIS OF PARLIAMENTARY . IMPACT IN THE LEGISLATIVE PROCESS. This is a standalone section of the literature review for my PhD, the focus of which is on the.

  24. Comparative Effectiveness and Safety of Seven Qi‐Tonifying Chinese

    We conducted this network meta-analysis to compare the efficacy and safety of 7 commonly used qi-tonifying injections in patients with AECOPD. Methods. Literature search was conducted through electronic databases, including PubMed, the Cochrane Library, EMBASE, CINAHL, AMED, CBM, CNKI, Wanfang database, and VIP database.

  25. Reference guides for the safe use of drugs during pregnancy ...

    Drug use during pregnancy is exceedingly common. Our previous systematic review indicated that 60-90% of pregnant women took drugs during pregnancy, with 2-4 drugs on average and up to 8 drugs maximum [].However, a number of these drugs have not been adequately studied in human pregnancy and may have undetermined risks for birth defects or adverse outcomes for the fetus [].

  26. Comparative economic analysis of urban sanitation interventions in low

    Related literature studies are collated on the comparative economic analysis of urban sanitation interventions. Gaps in existing practices were ... This review assesses evaluative criteria for least-cost economic analyses of citywide inclusive sanitation (CWIS) interventions, focusing on their alignment with the practical needs for decision ...