A Dataset Exploration Case Study with Know Your Data

Data underlies much of machine learning (ML) research and development, helping to structure what a machine learning algorithm learns and how models are evaluated and benchmarked. However, data collection and labeling can be complicated by unconscious biases, data access limitations and privacy concerns, among other challenges. As a result, machine learning datasets can reflect unfair social biases along dimensions of race, gender, age, and more.

Methods of examining datasets that can surface information about how different social groups are represented within them are a key component of ensuring that the development of ML models and datasets is aligned with our AI Principles. Such methods can inform the responsible use of ML datasets and point toward potential mitigations of unfair outcomes. For example, prior research has demonstrated that some object recognition datasets are biased toward images sourced from North America and Western Europe, prompting Google’s Crowdsource effort to balance out image representations in other parts of the world.

Today, we demonstrate some of the functionality of a dataset exploration tool, Know Your Data (KYD), recently introduced at Google I/O, using the COCO Captions dataset as a case study. Using this tool, we find a range of gender and age biases in COCO Captions — biases that can be traced to both dataset collection and annotation practices. KYD is a dataset analysis tool that complements the growing suite of responsible AI tools being developed across Google and the broader research community. Currently, KYD only supports analysis of a small set of image datasets, but we’re working hard to make the tool accessible beyond this set.

Introducing Know Your Data

Know Your Data helps ML research, product and compliance teams understand datasets, with the goal of improving data quality, and thus helping to mitigate fairness and bias issues. KYD offers a range of features that allow users to explore and examine machine learning datasets: users can filter, group, and study correlations based on annotations already present in a given dataset. KYD also presents automatically computed labels from Google’s Cloud Vision API, providing users with a simple way to explore their data based on signals that weren’t originally present in the dataset.

A KYD Case Study

As a case study, we explore some of these features using the COCO Captions dataset, an image dataset that contains five human-generated captions for each of over 300k images. Given the rich annotations provided by free-form text, we focus our analysis on signals already present within the dataset.

Exploring Gender Bias

Previous research has demonstrated undesirable gender biases within computer vision datasets, including pornographic imagery of women and image label correlations that align with harmful gender stereotypes. We use KYD to explore gender biases within COCO Captions by examining gendered correlations within the image captions. We find a gender bias in the depiction of different activities across the images in the dataset, as well as biases relating to how people of different genders are described by annotators.

The first part of our analysis aimed to surface gender biases with respect to different activities depicted in the dataset. We examined images captioned with words describing different activities and analyzed their relation to gendered caption words, such as “man” or “woman”. Building upon recent work that leverages the PMI metric to measure associations learned by a model, the KYD relations tab makes it easy to examine associations between different signals in a dataset. This tab visualizes the extent to which two signals in the dataset co-occur more (or less) than would be expected by chance. Each cell indicates either a positive (blue color) or negative (orange color) correlation between two specific signal values along with the strength of that correlation.
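To make the relations computation concrete, here is a minimal sketch of how PMI between two groups of caption words could be estimated from raw captions. This illustrates the metric itself, not KYD's implementation; the toy captions and word lists are invented for the example:

```python
import math
from collections import Counter

def pmi_table(captions, words_a, words_b):
    """PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), estimated from how
    often two words appear in the same caption versus what chance predicts."""
    n = len(captions)
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for caption in captions:
        tokens = set(caption.lower().split())
        present_a = [w for w in words_a if w in tokens]
        present_b = [w for w in words_b if w in tokens]
        for a in present_a:
            count_a[a] += 1
        for b in present_b:
            count_b[b] += 1
        for a in present_a:
            for b in present_b:
                count_ab[(a, b)] += 1
    table = {}
    for (a, b), c_ab in count_ab.items():
        p_ab = c_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

captions = [
    "a woman cooking dinner in a kitchen",
    "a woman shopping at a market",
    "a man surfing a large wave",
    "a man skateboarding down a rail",
]
print(pmi_table(captions, ["man", "woman"], ["cooking", "surfing"]))
```

A PMI above zero means the pair co-occurs more often than chance would predict; below zero, less often. The KYD relations tab visualizes exactly this kind of signal, using the normalized variant nPMI.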

KYD also allows users to filter rows of a relations table based on substring matching. Using this functionality, we initially probed for caption words containing “-ing”, as a simple way to filter by verbs. We immediately saw strong gendered correlations:
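The “-ing” probe can be approximated in a few lines. Note this is only a rough verb filter (it also catches nouns like “building”), mirroring the substring matching described above rather than KYD's own filtering code:

```python
def ing_words(captions):
    """Collect caption words ending in '-ing' as a rough verb filter."""
    words = set()
    for caption in captions:
        for token in caption.lower().split():
            # require a stem of at least two characters before '-ing'
            if token.endswith("ing") and len(token) > 4:
                words.add(token)
    return sorted(words)

print(ing_words(["A man surfing a wave", "Two women shopping for fruit"]))
# -> ['shopping', 'surfing']
```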

Digging further into these correlations, we found that several activities stereotypically associated with women, such as “shopping” and “cooking”, co-occur with images captioned with “women” or “woman” at a higher rate than with images captioned with “men” or “man”. In contrast, captions describing many physically intensive activities, such as “skateboarding”, “surfing”, and “snowboarding”, co-occur with images captioned with “man” or “men” at higher rates.

While individual image captions may not use stereotypical or derogatory language, such as with the example below, if certain gender groups are over (or under) represented within a particular activity across the whole dataset, models developed from the dataset risk learning stereotypical associations. KYD makes it easy to surface, quantify, and make plans to mitigate this risk.

In addition to examining biases with respect to the social groups depicted with different activities, we also explored biases in how annotators described the appearance of people they perceived as male or female. Inspired by media scholars who have examined the “male gaze” embedded in other forms of visual media, we examined the frequency with which individuals perceived as women in COCO are described using adjectives that position them as an object of desire. KYD allowed us to easily examine co-occurrences between words associated with binary gender (e.g. "female/girl/woman" vs. "male/man/boy") and words associated with evaluating physical attractiveness. Importantly, these are captions written by human annotators, who are making subjective assessments about the gender of people in the image and choosing a descriptor for attractiveness. We see that the words "attractive", "beautiful", "pretty", and "sexy" are overrepresented in describing people perceived as women as compared to those perceived as men, confirming what prior work has said about how gender is viewed in visual media.

KYD also allows us to manually inspect images for each relation by clicking on the relation in question. For example, we can see images whose captions include female terms (e.g. “woman”) and the word “beautiful”.

Exploring Age Bias

Adults older than 65 have been shown to be underrepresented in datasets relative to their presence in the general population — a first step toward improving age representation is to allow developers to assess it in their datasets. By looking at caption words describing different activities and analyzing their relation to caption words describing age, KYD helped us to assess the range of example captions depicting older adults. Having example captions of adults in a range of environments and activities is important for a variety of tasks, such as image captioning or pedestrian detection.

The first trend that KYD made clear is how rarely annotators described people as older adults in captions detailing different activities. The relations tab also shows a trend wherein “elderly”, “old”, and “older” tend not to occur with verbs that describe a variety of physical activities that might be important for a system to be able to detect. It is important to note that, relative to “young”, “old” is more often used to describe things other than people, such as belongings or clothing, so these relations also capture some uses that don’t describe people.

The underrepresentation of captions containing references to older adults that we examined here could be rooted in a relative lack of images depicting older adults, as well as in a tendency for annotators to omit older age-related terms when describing people in images. While manual inspection of the intersection of “old” and “running” shows a negative relation, we notice that it shows no older people and a number of locomotives. KYD makes it easy to quantitatively and qualitatively inspect relations to identify dataset strengths and areas for improvement.

Understanding the contents of ML datasets is a critical first step to developing suitable strategies to mitigate the downstream impact of unfair dataset bias. The above analysis points towards several potential mitigations. For example, correlations between certain activities and social groups, which can lead trained models to reproduce social stereotypes, can be potentially mitigated by “dataset balancing” — increasing the representation of under-represented group/activity combinations. However, mitigations focused exclusively on dataset balancing are not sufficient, as our analysis of how different genders are described by annotators demonstrated. We found that annotators’ subjective judgements of people portrayed in images were reflected within the final dataset, suggesting that a deeper look at methods of image annotation is needed. One solution for data practitioners who are developing image captioning datasets is to consider integrating guidelines that have been developed for writing image descriptions that are sensitive to race, gender, and other identity categories.

The above case studies highlight only some of the KYD features. For example, Cloud Vision API signals are also integrated into KYD and can be used to infer signals that annotators haven't labeled directly. We encourage the broader ML community to perform their own KYD case studies and share their findings.

KYD complements other dataset analysis tools being developed across the ML community, including Google's growing Responsible AI toolkit. We look forward to ML practitioners using KYD to better understand their datasets and mitigate potential bias and fairness concerns. If you have feedback on KYD, please write to [email protected].


The analysis and write-up in this post were conducted with equal contribution by Emily Denton, Mark Díaz, and Alex Hanna. We thank Marie Pellat, Ludovic Peran, Daniel Smilkov, Nikhil Thorat and Tsung-Yi for their contributions to and reviews of this post. We also thank the researchers and teams that have developed the signals and metrics used in KYD and particularly the team that has helped us implement nPMI.

10 Real World Data Science Case Studies Projects with Example

Top 10 Data Science Case Studies Projects with Examples and Solutions in Python to inspire your data science learning in 2023.


Data science has been a trending buzzword in recent times. With wide applications in various sectors like healthcare, education, retail, transportation, media, and banking, data science applications are at the core of pretty much every industry out there. The possibilities are endless: analyzing fraud in the finance sector or personalizing recommendations for eCommerce businesses. We have developed ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative personalized products tailored to specific customers.



Table of Contents

Data Science Case Studies in Retail
Data Science Case Study Examples in the Entertainment Industry
Data Analytics Case Study Examples in the Travel Industry
Case Studies for Data Analytics in Social Media
Real World Data Science Projects in Healthcare
Data Analytics Case Studies in Oil and Gas
What is a Case Study in Data Science?
How do you Prepare a Data Science Case Study?
10 Most Interesting Data Science Case Studies with Examples


So, without much ado, let's get started with data science business case studies!

1) Walmart

With humble beginnings as a simple discount retailer, Walmart today operates 10,500 stores and clubs in 24 countries along with eCommerce websites, employing around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart's total revenue was $559 billion, a growth of $35 billion driven by the expansion of the eCommerce sector. Walmart is a data-driven company that works on the principle of 'Everyday low cost' for its consumers. To achieve this goal, it depends heavily on its data science and analytics department, also known as Walmart Labs, for research and development. Walmart is home to the world's largest private cloud, which can manage 2.5 petabytes of data every hour! To analyze this humongous amount of data, Walmart created 'Data Café,' a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team heavily invests in building and managing technologies like cloud, data, DevOps, infrastructure, and security.


Walmart is experiencing massive digital growth as the world's largest retailer. Walmart has been leveraging big data and advances in data science to build solutions that enhance, optimize, and customize the shopping experience and serve its customers better. At Walmart Labs, data scientists focus on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science at Walmart:

i) Personalized Customer Shopping Experience

Walmart analyzes customer preferences and shopping patterns to optimize the stocking and displaying of merchandise in its stores. Analysis of big data also helps it understand new item sales, decide which products to discontinue, and assess the performance of brands.

ii) Order Sourcing and On-Time Delivery Promise

Millions of customers view items on Walmart.com, and Walmart provides each customer a real-time estimated delivery date for the items purchased. Walmart runs a backend algorithm that estimates this based on the distance between the customer and the fulfillment center, inventory levels, and shipping methods available. The supply chain management system determines the optimum fulfillment center based on distance and inventory levels for every order. It also has to decide on the shipping method to minimize transportation costs while meeting the promised delivery date.
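As a sketch, the sourcing decision described above reduces to a feasibility check (stock and transit time) followed by cost minimization. The field names, center names, and numbers below are hypothetical, not Walmart's actual system:

```python
def choose_fulfillment(centers, promised_days):
    """Pick the cheapest shipping option that has stock and can still
    meet the promised delivery date; returns None if nothing qualifies.
    Each center: {"name", "has_stock", "transit_days", "cost"}."""
    feasible = [c for c in centers
                if c["has_stock"] and c["transit_days"] <= promised_days]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["cost"])

centers = [
    {"name": "dallas", "has_stock": True, "transit_days": 2, "cost": 7.50},
    {"name": "reno", "has_stock": True, "transit_days": 4, "cost": 4.25},
    {"name": "miami", "has_stock": False, "transit_days": 1, "cost": 9.00},
]
print(choose_fulfillment(centers, promised_days=3)["name"])  # dallas
```

With a looser promise (say, five days), the cheaper but slower center wins instead, which is exactly the cost/date trade-off the paragraph describes.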


iii) Packing Optimization 

Box recommendation is a daily task in the shipping of items in retail and eCommerce businesses. Whenever items from one or more orders placed by the same customer are picked from the shelf and are ready for packing, Walmart's recommender system determines the best-sized box to hold all of the ordered items with the least wasted in-box space, within a fixed amount of time. This is the Bin Packing Problem, a classic NP-hard problem familiar to data scientists.
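A classic heuristic for this family of problems is first-fit decreasing, sketched below with a single box size. Walmart's production recommender is certainly more sophisticated, but this conveys why good packings are tractable in practice even though the exact problem is NP-hard:

```python
def first_fit_decreasing(item_volumes, box_volume):
    """Pack items into as few boxes as possible: sort items largest-first,
    then place each into the first open box with room, opening a new box
    only when none fits. Returns the number of boxes used."""
    boxes = []  # remaining capacity of each open box
    for item in sorted(item_volumes, reverse=True):
        for i, remaining in enumerate(boxes):
            if item <= remaining:
                boxes[i] -= item
                break
        else:
            boxes.append(box_volume - item)
    return len(boxes)

print(first_fit_decreasing([4, 8, 1, 4, 2, 1], box_volume=10))  # 2
```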

Here is a link to a sales prediction data science case study to help you understand the applications of data science in the real world. The Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you must build a model to project the sales for each department in each store. This data science case study aims to create a predictive model for the sales of each product. You can also try your hand at the Inventory Demand Forecasting Data Science Project to develop a machine learning model that forecasts inventory demand accurately based on historical sales data.
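As a baseline for this kind of weekly sales forecasting task (not the project's actual model), a seasonal naive forecast simply repeats the sales from the same week one season earlier; any learned model should beat it:

```python
def seasonal_naive_forecast(weekly_sales, season_length=52, horizon=4):
    """Forecast the next `horizon` weeks by copying the sales observed
    exactly one season (default 52 weeks) earlier."""
    if len(weekly_sales) < season_length:
        raise ValueError("need at least one full season of history")
    return [weekly_sales[-season_length + h] for h in range(horizon)]

history = list(range(1, 105))  # two years of synthetic weekly sales
print(seasonal_naive_forecast(history, season_length=52, horizon=3))
```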


2) Amazon

Amazon is an American multinational technology company headquartered in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing, digital streaming, and artificial intelligence. It hosts an estimated 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data, Amazon stays ahead in understanding its customers. Here are a few data analytics case study examples at Amazon:

i) Recommendation Systems

Data science models help Amazon understand customers' needs and recommend products before the customer even searches for them; this model uses collaborative filtering. Amazon uses purchase data from 152 million customers to help users decide which products to buy. The company generates 35% of its annual sales through its recommendation-based systems (RBS).
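A minimal sketch of item-to-item collaborative filtering, the style of approach Amazon popularized, scores item pairs by how often the same customers buy both. The purchase data below is invented for illustration:

```python
import math

def item_cosine(purchases, item_x, item_y):
    """Cosine similarity between two items over binary purchase vectors:
    |buyers of both| / sqrt(|buyers of x| * |buyers of y|)."""
    buyers_x = {u for u, items in purchases.items() if item_x in items}
    buyers_y = {u for u, items in purchases.items() if item_y in items}
    if not buyers_x or not buyers_y:
        return 0.0
    return len(buyers_x & buyers_y) / math.sqrt(len(buyers_x) * len(buyers_y))

purchases = {
    "u1": {"book", "lamp"},
    "u2": {"book", "lamp", "desk"},
    "u3": {"desk"},
}
print(item_cosine(purchases, "book", "lamp"))  # 1.0
```

Recommending the items most similar to what a customer already bought is then a simple lookup over these scores.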

Here is a Recommender System Project to help you build a recommendation system using collaborative filtering. 

ii) Retail Price Optimization

Amazon optimizes product prices using a predictive model that determines the best price, one at which users will not be put off from buying. The model weighs the customer's likelihood of purchasing the product at a given price and how that price will affect the customer's future buying patterns. The price of a product is determined according to your activity on the website, competitors' pricing, product availability, item preferences, order history, expected profit margin, and other factors.
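A toy version of this idea picks the candidate price that maximizes expected profit per visitor, (price − cost) × P(purchase at that price). The linear demand curve below is a hypothetical stand-in for the learned purchase-probability model:

```python
def best_price(candidate_prices, unit_cost, purchase_prob):
    """Return the candidate price maximizing expected profit per visitor:
    (price - unit_cost) * purchase_prob(price)."""
    return max(candidate_prices,
               key=lambda p: (p - unit_cost) * purchase_prob(p))

# Hypothetical demand curve: purchase probability falls as price rises.
prob = lambda p: max(0.0, 1.0 - p / 50.0)
print(best_price([20, 25, 30, 35, 40], unit_cost=10, purchase_prob=prob))  # 30
```

Raising the price past the optimum loses more sales than the extra margin recovers; lowering it gives away margin on sales that would have happened anyway.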

Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.

iii) Fraud Detection

Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order. It uses machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has helped the company restrict clients with an excessive number of product returns.
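To make the idea concrete, here is a deliberately simplified rule-based fraud score. A production system like the one described learns such weights from labeled historical orders; the signals, thresholds, and weights below are hand-picked for illustration only:

```python
def fraud_score(order):
    """Toy fraud score in [0, 1] combining hand-picked risk signals."""
    score = 0.0
    if order["amount"] > 1000:                                  # unusually large order
        score += 0.4
    if order["shipping_country"] != order["billing_country"]:   # address mismatch
        score += 0.3
    if order["returns_last_year"] > 5:                          # heavy returner
        score += 0.3
    return score

order = {"amount": 1500, "shipping_country": "US",
         "billing_country": "CA", "returns_last_year": 7}
print(fraud_score(order) >= 0.5)  # flag for manual review -> True
```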

You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.


Let us explore data analytics case study examples in the entertainment industry.


3) Netflix

Netflix started as a DVD rental service in 1997 and has since expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with streaming supported on thousands of smart devices, around 3 billion hours of Netflix are watched every month. The secret to this massive growth and popularity is Netflix's advanced use of data analytics and recommendation systems to provide personalized and relevant content recommendations to its users. Netflix collects data on over 100 billion events every day. Here are a few examples of data analysis case studies applied at Netflix:

i) Personalized Recommendation System

Netflix uses over 1300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. The data Netflix collects from its users includes viewing time, platform searches for keywords, and metadata related to content abandonment, such as pause time, rewinds, and rewatches. Using this data, Netflix can predict what a viewer is likely to watch and give a personalized watchlist to a user. Some of the algorithms used by the Netflix recommendation system are the Personalized Video Ranker, the Trending Now ranker, and the Continue Watching ranker.

ii) Content Development using Data Analytics

Netflix uses data science to analyze the behavior and patterns of its users to recognize themes and categories that the masses prefer to watch. This data is used to produce shows like The Umbrella Academy, Orange Is the New Black, and The Queen's Gambit. These shows might seem like huge risks, but they are significantly grounded in data analytics, which assured Netflix that they would succeed with its audience. Data analytics is helping Netflix come up with content that its viewers want to watch even before they know they want to watch it.

iii) Marketing Analytics for Campaigns

Netflix uses data analytics to find the right time to launch shows and ad campaigns to have maximum impact on the target audience. Marketing analytics helps come up with different trailers and thumbnails for different groups of viewers. For example, the House of Cards Season 5 trailer with a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.

Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.


4) Spotify

In a world where purchasing music is a thing of the past and streaming music is the current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, and Amazon Music. The success of Spotify has depended mainly on data analytics. By analyzing massive volumes of listener data, Spotify provides real-time and personalized services to its listeners. Most of Spotify's revenue comes from paid premium subscriptions. Here are some examples of how Spotify uses data analytics to provide enhanced services to its listeners:

i) Personalization of Content using Recommendation Systems

Spotify uses BaRT, or Bayesian Additive Regression Trees, to generate music recommendations for its listeners in real time. BaRT ignores any song a user listens to for less than 30 seconds. The model is retrained every day to provide updated recommendations. A new patent granted to Spotify covers an AI application that identifies a user's musical tastes based on audio signals, gender, age, and accent to make better music recommendations.
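The 30-second rule is easy to picture as a preprocessing filter over play logs before any model training. This sketch uses a made-up log layout and is not Spotify's pipeline:

```python
def implicit_feedback(plays, min_seconds=30):
    """Keep only (user, track) plays long enough to count as a positive
    signal, dropping tracks skipped before `min_seconds`."""
    return [(user, track) for user, track, secs in plays if secs >= min_seconds]

plays = [("u1", "song_a", 12), ("u1", "song_b", 180), ("u2", "song_a", 45)]
print(implicit_feedback(plays))
```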

Spotify creates daily playlists for its listeners, based on the taste profiles called 'Daily Mixes,' which have songs the user has added to their playlists or created by the artists that the user has included in their playlists. It also includes new artists and songs that the user might be unfamiliar with but might improve the playlist. Similar to it is the weekly 'Release Radar' playlists that have newly released artists' songs that the listener follows or has liked before.

ii) Targeted Marketing through Customer Segmentation

Beyond enhancing personalized song recommendations, Spotify uses this massive dataset for targeted ad campaigns and personalized service recommendations for its users. Spotify uses ML models to analyze listener behavior and group listeners based on music preferences, age, gender, ethnicity, etc. These insights help them create ad campaigns for a specific target audience. One of their well-known ad campaigns was the meme-inspired ads for potential target customers, which was a huge success globally.

iii) CNNs for Classification of Songs and Audio Tracks

Spotify builds audio models to evaluate songs and tracks, which helps it develop better playlists and recommendations for its users. These models allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users who like similar tracks (collaborative filtering). Spotify also uses NLP (natural language processing) to scan articles and blogs and analyze the words used to describe songs and artists. These analytical insights help group and identify similar artists and songs, and can be leveraged to build playlists.

Here is a Music Recommender System Project for you to start learning. We have listed another music recommendations dataset for you to use in your projects: Dataset1. You can use this dataset of Spotify metadata to classify songs based on artist, mood, and liveliness. Plot histograms and heatmaps to get a better understanding of the dataset. Use algorithms like logistic regression and SVM, along with principal component analysis, to generate valuable insights from the dataset.


Below you will find case studies for data analytics in the travel and tourism industry.

5) Airbnb

Airbnb was born in 2007 in San Francisco and has since grown to 4 million hosts and 5.6 million listings worldwide, which have welcomed more than 1 billion guest arrivals in almost every country across the globe. Airbnb is active in every country on the planet except for Iran, Sudan, Syria, and North Korea, covering around 97.95% of the world. Treating data as the voice of its customers, Airbnb uses its large volume of customer reviews and host inputs to understand trends across communities, rate user experiences, and make informed decisions that build a better business model. The data scientists at Airbnb are developing exciting new solutions to boost the business and find the best matches between its customers and hosts. Airbnb's data servers serve approximately 10 million requests a day and process around one million search queries, and the company offers personalized services by creating a strong match between guests and hosts for a supreme customer experience.

i) Recommendation Systems and Search Ranking Algorithms

Airbnb helps people find 'local experiences' in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a 'listing quality score' to find homes based on the proximity to the searched location and uses previous guest reviews. Airbnb uses deep neural networks to build models that take the guest's earlier stays into account and area information to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.

ii) Natural Language Processing for Review Analysis

Airbnb characterizes data as the voice of its customers. Customer and host reviews give a direct insight into the experience, and star ratings alone cannot capture it quantitatively. Hence, Airbnb uses natural language processing to understand reviews and the sentiments behind them. The NLP models are developed using convolutional neural networks.
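For intuition, a tiny lexicon-based scorer illustrates what a review sentiment signal looks like. The approach described above uses learned convolutional models instead, and the word lists below are invented for the example:

```python
POSITIVE = {"great", "clean", "friendly", "wonderful"}
NEGATIVE = {"dirty", "noisy", "rude", "broken"}

def review_sentiment(text):
    """Lexicon-based sentiment score in [-1, 1]: the balance of positive
    versus negative words found in the review."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(review_sentiment("great host and a clean quiet room"))  # 1.0
```

Unlike a star rating, this kind of score can be computed per sentence or per topic, which is why text models add information beyond the rating alone.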

Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.

iii) Smart Pricing using Predictive Analytics

The Airbnb host community uses the service as a supplementary income. Vacation homes and guest houses rented to customers raise local community earnings, as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times as much money as a hotel guest, a significant positive impact on the local neighborhood. Airbnb uses predictive analytics to predict listing prices and help hosts set a competitive and optimal price. The overall profitability of an Airbnb host depends on factors like the time invested by the host and responsiveness to changing demand across seasons. The factors that impact real-time smart pricing are the location of the listing, proximity to transport options, season, and amenities available in the neighborhood of the listing.

Here is a Price Prediction Project to help you understand the concept of predictive analytics, which is widely used in case studies for data analytics.

6) Uber

Uber is the biggest global taxi service provider. As of December 2018, Uber had 91 million monthly active consumers and 3.8 million drivers, completing 14 million trips each day. Uber uses data analytics and big data-driven technologies to optimize its business processes and provide enhanced customer service. The data science team at Uber has been constantly exploring futuristic technologies to provide better service. Machine learning and data analytics help Uber make data-driven decisions that enable benefits like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the real world data science projects used by Uber:

i) Dynamic Pricing for Price Surges and Demand Forecasting

Uber prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign up with the company and meet passenger demand. When prices increase, both the driver and the passenger are informed about the surge. Uber uses a patented predictive model for price surging called 'Geosurge,' based on the demand for the ride and the location.
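A toy surge rule makes the mechanics concrete: prices rise with the ratio of open requests to available drivers, capped so fares stay bounded. The parameters are invented, and the real Geosurge model is far richer, folding in location and forecast demand:

```python
def surge_multiplier(open_requests, available_drivers,
                     base=1.0, sensitivity=0.5, cap=3.0):
    """Toy surge rule: multiplier grows with the request/driver ratio,
    never dropping below `base` nor exceeding `cap`."""
    if available_drivers == 0:
        return cap
    ratio = open_requests / available_drivers
    return min(cap, max(base, base + sensitivity * (ratio - 1.0)))

print(surge_multiplier(open_requests=30, available_drivers=10))  # 2.0
```

A higher multiplier both rations scarce rides and signals nearby drivers to come online, which is the incentive the paragraph describes.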

ii) One-Click Chat

Uber has developed a machine learning and natural language processing solution called one-click chat, or OCC, for coordination between drivers and riders. This feature anticipates responses to commonly asked questions, making it easy for drivers to respond to customer messages with the click of just one button. One-click chat is built on Uber's machine learning platform, Michelangelo, to perform NLP on rider chat messages and generate appropriate responses.

iii) Customer Retention

Failure to meet customer demand for cabs could lead users to opt for other services. Uber uses machine learning models to bridge this demand-supply gap. By using prediction models to forecast demand in any location, Uber retains its customers. Uber also uses a tier-based reward system, which segments customers into different levels based on usage; the higher the level a user achieves, the better the perks. Uber also provides personalized destination suggestions based on a user's history and frequently traveled destinations.

You can take a look at this Python Chatbot Project and build a simple chatbot application to better understand the techniques used for natural language processing. You can also practice building a demand forecasting model with this project using time series analysis, or look at this project, which uses time series forecasting and clustering on a dataset containing geospatial data to forecast customer demand for Ola rides.


7) LinkedIn 

LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the real world data science projects at LinkedIn:

i) LinkedIn Recruiter: Search Algorithms and Recommendation Systems

LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product works on search and recommendation engines. LinkedIn Recruiter handles complex queries and filters on a constantly growing, large dataset, and the results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to gradient boosted decision trees to capture non-linear correlations in the dataset. In addition to these models, LinkedIn Recruiter also uses a generalized linear mixed model to improve the results of prediction problems and give personalized results.
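The original linear-model stage can be pictured as a weighted score-and-sort over candidate features. The feature names and weights here are hypothetical, and as noted above the production system replaced this with gradient boosted trees to capture non-linear interactions:

```python
def rank_candidates(candidates, weights):
    """Score each candidate as a weighted sum of its features and
    return candidates sorted best-first."""
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)

weights = {"skill_match": 2.0, "title_match": 1.5, "connection_strength": 0.5}
candidates = [
    {"id": "a", "features": {"skill_match": 0.9, "title_match": 0.2,
                             "connection_strength": 0.1}},
    {"id": "b", "features": {"skill_match": 0.4, "title_match": 0.9,
                             "connection_strength": 0.8}},
]
print([c["id"] for c in rank_candidates(candidates, weights)])  # ['b', 'a']
```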

ii) Recommendation Systems Personalized for News Feed

The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.

iii) CNNs to Detect Inappropriate Content

Providing a professional space where people can trust and express themselves in a safe community has been a critical goal at LinkedIn. LinkedIn has invested heavily in building solutions to detect fake accounts and abusive behavior on its platform. Any form of spam, harassment, or inappropriate content is immediately flagged and taken down; these range from profanity to advertisements for illegal services. LinkedIn uses a convolutional neural network (CNN)-based machine learning model. This classifier trains on a dataset of accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts containing "blocklisted" phrases or words, plus a small portion of manually reviewed accounts reported by the user community.

Here is a Text Classification Project to help you understand the NLP basics of text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system, or use this dataset to build a classifier with logistic regression, Naive Bayes, or neural networks to classify toxic comments.
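To give a flavor of the text-classification approach described above (LinkedIn's production model is CNN-based; this is a much simpler stand-in), here is a toy Naive Bayes classifier over made-up "appropriate" and "inappropriate" phrases:

```python
# Toy Naive Bayes text classifier in the spirit of the "appropriate" vs
# "inappropriate" account labels described above. The training phrases
# and labels are invented purely for illustration.
from collections import Counter
import math

def train_nb(docs):
    """docs: list of (text, label). Returns per-label word counts and priors."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of documents
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def predict_nb(text, word_counts, label_counts):
    """Pick the label maximizing log prior + smoothed log likelihood."""
    vocab = {w for c in word_counts.values() for w in c}
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.lower().split():
            # add-one (Laplace) smoothing so unseen words don't zero out
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("buy cheap pills now", "inappropriate"),
    ("free money click here", "inappropriate"),
    ("looking for data science roles", "appropriate"),
    ("sharing my conference talk slides", "appropriate"),
]
model = train_nb(docs)
label = predict_nb("cheap pills here", *model)
```

A production classifier would use learned embeddings and far more data, but the labeled-examples-to-classifier pipeline is the same shape.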


8) Pfizer

Pfizer is a multinational pharmaceutical company headquartered in New York, USA. It is one of the largest pharmaceutical companies globally, known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2020 when its COVID-19 vaccine was the first to receive emergency use authorization from the FDA. In early November 2021, the CDC recommended the Pfizer vaccine for children aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few data analytics case studies by Pfizer:

i) Identifying Patients for Clinical Trials

Artificial intelligence and machine learning are used to streamline and optimize clinical trials and increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials, such as those with distinct symptoms. These techniques can also help examine interactions with potential trial members' specific biomarkers and predict drug interactions and side effects, which helps avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across its 44,000-candidate COVID-19 clinical trial.

ii) Supply Chain and Manufacturing

Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing production steps, which in turn helps supply drugs customized to small pools of patients with specific genetic profiles. Pfizer uses machine learning to predict the maintenance cost of the equipment it uses; predictive maintenance using AI is the next big step for pharmaceutical companies looking to reduce costs.

iii) Drug Development

Computer simulations of proteins, tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016, Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have recently been used for bioactivity and synthesis prediction for drugs and vaccines, in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery, as it factors in everything from new applications of medications to possible toxic reactions, which can save millions in drug trials.

You can create a machine learning model to predict molecular activity to help design medicine using this dataset. You may build a CNN or a deep neural network for this case study project.
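To make the molecular-activity idea concrete, here is a hedged sketch: a one-nearest-neighbour predictor over invented binary fingerprints using Tanimoto similarity. Real pipelines would use learned models and RDKit-style fingerprints; everything below is illustrative:

```python
# Sketch: nearest-neighbour activity prediction over binary molecular
# fingerprints using Tanimoto similarity. Fingerprints and activity
# labels are invented for illustration only.

def tanimoto(a, b):
    """Tanimoto similarity between two same-length 0/1 fingerprints."""
    both = sum(x & y for x, y in zip(a, b))    # bits set in both
    either = sum(x | y for x, y in zip(a, b))  # bits set in either
    return both / either if either else 0.0

def predict_activity(query, training):
    """training: list of (fingerprint, active_flag). Returns flag of the
    most similar training compound."""
    return max(training, key=lambda pair: tanimoto(query, pair[0]))[1]

training = [
    ([1, 1, 0, 0, 1], True),   # known active compound
    ([0, 0, 1, 1, 0], False),  # known inactive compound
]
pred = predict_activity([1, 1, 0, 1, 1], training)
```

Similarity search over fingerprints like this is a classic cheminformatics baseline before reaching for a CNN or deep neural network.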


9) Shell Data Analyst Case Study Project

Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future. Shell is going through a significant transition, aiming to become a clean energy company by 2050 as the world needs more and cleaner energy solutions, and this requires substantial changes in the way energy is used. Digital technologies, including AI and machine learning, play an essential role in this transformation, enabling efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI across the organization will help Shell achieve this goal and stay competitive in the market. Here are a few data analytics case studies in the petrochemical industry:

i) Precision Drilling

Shell is involved in the entire oil and gas supply chain, from extracting hydrocarbons to refining the fuel and retailing it to customers. Recently, Shell has applied reinforcement learning to control the drilling equipment used in extraction. Reinforcement learning works on a reward system based on the outcome of the AI model's actions. The algorithm is designed to guide the drills as they move through the subsurface, based on historical data from drilling records, including information such as drill bit sizes, temperatures, pressures, and knowledge of seismic activity. This model helps the human operator understand the environment better, leading to better and faster results with minor damage to the machinery used.
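The reward-based loop described above can be illustrated with a toy example. This is not Shell's system: it is tabular Q-learning on an invented one-dimensional "drill" that must move from position 0 to a target position, with made-up rewards and hyperparameters:

```python
# Hedged sketch of tabular Q-learning on a toy 1-D task: an agent at
# position 0 must reach position 3. States, rewards, and hyperparameters
# are illustrative, not Shell's production system.
import random

ACTIONS = [-1, +1]          # move left / move right
GOAL, N_STATES = 3, 4

def step(state, action):
    """Apply an action; reward 1.0 only on reaching the goal state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2
for _ in range(200):                       # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon else \
            max(ACTIONS, key=lambda act: q[(s, act)])
        nxt, r = step(s, a)
        best_next = max(q[(nxt, b)] for b in ACTIONS)
        # standard Q-learning update
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = nxt

# Greedy policy learned for each non-goal state.
policy = [max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right from every state, which is the shortest path to the goal; the same reward-driven update rule scales (with function approximation) to control problems like drill guidance.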

ii) Efficient Charging Terminals

Due to climate change, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict the demand for terminals to provide an efficient supply. Multiple vehicles charging from a single terminal may create a considerable grid load, and demand predictions can help make this process more efficient.

iii) Monitoring Service and Charging Stations

Another Shell initiative, trialed in Thailand and Singapore, is the use of computer vision cameras to watch out for potentially hazardous activities, such as lighting cigarettes in the vicinity of the pumps while refueling. The model is built to process the content of the captured images and label and classify it. The algorithm can then alert the staff and hence reduce the risk of fires. The model can be further trained to detect rash driving or theft in the future.

Here is a project to help you understand multiclass image classification. You can use the Hourly Energy Consumption Dataset to build an energy consumption prediction model, applying time series forecasting with XGBoost.

10) Zomato Case Study on Data Analytics

Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, and online payments for dining. Zomato partners with restaurants to provide tools to acquire more customers while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh restaurant partners and around 1 lakh delivery partners, and has closed over ten crore delivery orders to date. Zomato uses ML and AI to boost its business growth, leveraging the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few examples of data analytics projects developed by the data scientists at Zomato:

i) Personalized Recommendation System for Homepage

Zomato uses data analytics to create personalized homepages for its users, providing order personalization such as recommendations for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato.

You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.
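A minimal sketch of such a recommender, assuming a tiny invented user-restaurant ratings matrix and simple user-based collaborative filtering (the user and restaurant names are made up):

```python
# Toy user-based collaborative filtering for restaurant recommendations.
# The ratings matrix is invented; real systems add rating normalization,
# implicit feedback, and much larger neighbourhoods.
import math

ratings = {  # user -> {restaurant: rating on a 1-5 scale}
    "asha":  {"dosa_hub": 5, "pizza_den": 1, "taco_bay": 4},
    "bilal": {"dosa_hub": 4, "pizza_den": 2},
    "chen":  {"pizza_den": 5, "taco_bay": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    num = sum(u[r] * v[r] for r in common)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(user):
    """Score unseen restaurants by similarity-weighted neighbour ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for rest, rating in their.items():
            if rest not in ratings[user]:
                scores[rest] = scores.get(rest, 0.0) + sim * rating
    return max(scores, key=scores.get) if scores else None

top_pick = recommend("bilal")
```

Here "bilal" has never rated `taco_bay`, so the system recommends it based on the ratings of similar users.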

ii) Analyzing Customer Sentiment

Zomato uses natural language processing and machine learning to understand customer sentiment from social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiment of brand mentions on social networking sites like Twitter, Instagram, LinkedIn, and Facebook. These analytics give the company insights that help build the brand and understand the target audience.

iii) Predicting Food Preparation Time (FPT)

Food preparation time is an essential variable in the estimated delivery time of an order placed through Zomato. The food preparation time depends on numerous factors, like the number of dishes ordered, time of day, footfall in the restaurant, day of the week, etc. Accurate prediction of the food preparation time enables a better estimated delivery time, making delivery partners less likely to breach it. Zomato uses a bidirectional LSTM-based deep learning model that considers all these features and predicts the food preparation time for each order in real time.
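Zomato's bidirectional LSTM is beyond a short sketch, but the idea of learning preparation time from order features can be illustrated with a single-feature least-squares baseline; the dish counts and minutes below are invented:

```python
# Hedged baseline for food-preparation-time prediction: ordinary least
# squares fitting prep time (minutes) against number of dishes ordered.
# All numbers are illustrative; the production model is a Bi-LSTM over
# many more features.

def fit_line(xs, ys):
    """Closed-form OLS slope and intercept for y ~ slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

dishes = [1, 2, 3, 4, 5]
prep_minutes = [12, 17, 22, 27, 32]     # perfectly linear toy data
slope, intercept = fit_line(dishes, prep_minutes)
predicted = slope * 6 + intercept       # expected prep time for 6 dishes
```

Such a baseline gives a sanity-check number that a deep model must clearly beat before it earns its complexity.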

Data scientists are companies' secret weapons for analyzing customer sentiment and behavior and leveraging it to drive conversion, loyalty, and profits. These 10 data science case studies, with examples and solutions, show how various organizations use data science technologies to succeed and stay at the top of their field. To summarize, data science has not only accelerated the performance of companies but has also made it possible to manage and sustain that performance with ease.

FAQs on Data Analysis Case Studies

What is a case study in data science?

A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

How do you create a data science case study?

To create a data science case study, identify a relevant problem, define objectives, and gather suitable data. Clean and preprocess the data, perform exploratory data analysis, and apply appropriate algorithms for analysis. Summarize findings, visualize results, and provide actionable recommendations, showcasing the problem-solving potential of data science techniques.


About the Author


ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, offering over 270+ reusable project templates in data science and big data with step-by-step walkthroughs.



© 2024 Iconiq Inc.


Top 12 Data Science Case Studies: Across Various Industries



Data science has become popular in the last few years due to its successful application in making business decisions. Data scientists have been using data science techniques to solve challenging real-world issues in healthcare, agriculture, manufacturing, automotive, and many more fields. For this purpose, a data enthusiast needs to stay updated with the latest technological advancements in AI. An excellent way to achieve this is by reading industry data science case studies. I recommend checking out the Data Science With Python course syllabus to start your data science journey.

In this discussion, I will present some case studies that contain detailed and systematic data analysis of people, objects, or entities, focusing on multiple factors present in the dataset. Aspiring and practising data scientists can motivate themselves to learn more about the sector, an alternative way of thinking, or methods to improve their organization based on comparable experiences. Almost every industry uses data science in some way; you can learn more about data science fundamentals in this data science course content. From my standpoint, insurance data scientists may use it to spot fraudulent conduct in claims, automotive data scientists may use it to improve self-driving cars, and e-commerce data scientists can use it to add more personalization for their consumers: the possibilities are unlimited and unexplored.

Let's look at the top data science case studies in this article so you can understand how businesses from many sectors have benefitted from data science to boost productivity, revenues, and more. Read on to explore more, or use the following links to go straight to the case study of your choice.


Examples of Data Science Case Studies

  • Hospitality: Airbnb focuses on growth by analyzing customer voice using data science. Qantas uses predictive analytics to mitigate losses
  • Healthcare: Novo Nordisk is driving innovation with NLP. AstraZeneca harnesses data for innovation in medicine
  • Covid 19: Johnson and Johnson uses data science to fight the Pandemic
  • E-commerce: Amazon uses data science to personalize shopping experiences and improve customer satisfaction
  • Supply chain management: UPS optimizes supply chain with big data analytics
  • Meteorology: IMD leveraged data science to achieve a record 1.2m evacuation before cyclone ''Fani''
  • Entertainment Industry: Netflix uses data science to personalize the content and improve recommendations. Spotify uses big data to deliver a rich user experience for online music streaming
  • Banking and Finance: HDFC utilizes Big Data Analytics to increase income and enhance the banking experience

Top 8 Data Science Case Studies [For Various Industries]

1. Data Science in Hospitality Industry

In the hospitality sector, data analytics assists hotels in better pricing strategies, customer analysis, brand marketing , tracking market trends, and many more.

Airbnb focuses on growth by analyzing customer voice using data science. A famous example in this sector is the unicorn ''Airbnb'', a startup that focused on data science early to grow and adapt to the market faster. The company witnessed 43,000 percent hypergrowth in as little as five years using data science. It applied data science techniques to process data, translate it to better understand the voice of the customer, and use the insights for decision making, then scaled the approach to cover all aspects of the organization. Airbnb uses statistics to analyze and aggregate individual experiences to establish trends throughout the community. These trends, analyzed using data science techniques, inform its business choices while helping it grow further.

Travel industry and data science

Predictive analytics benefits many parameters in the travel industry. These companies can use recommendation engines with data science to achieve higher personalization and improved user interactions. They can cross-sell by recommending relevant products to drive sales and increase revenue. Data science is also employed in analyzing social media posts for sentiment analysis, bringing invaluable travel-related insights. Knowing whether these views are positive, negative, or neutral can help agencies understand user demographics, the experiences their target audiences expect, and so on. These insights are essential for developing aggressive pricing strategies to draw customers and for offering better customization of travel packages and allied services. Travel agencies like Expedia and Booking.com use predictive analytics for personalized recommendations, product development, and effective marketing of their products. Not just travel agencies but airlines also benefit from the same approach. Airlines frequently face losses due to flight cancellations, disruptions, and delays. Data science helps them identify patterns and predict possible bottlenecks, thereby effectively mitigating losses and improving the overall customer traveling experience.

How Qantas uses predictive analytics to mitigate losses  

Qantas, one of Australia's largest airlines, leverages data science to reduce losses caused by flight delays, disruptions, and cancellations. It also uses data science to provide a better traveling experience for its customers by reducing the number and length of delays caused by heavy air traffic, weather conditions, or operational difficulties. Back in 2016, when heavy storms badly struck Australia's east coast, only 15 out of 436 Qantas flights were cancelled thanks to its predictive analytics-based system, against its competitor Virgin Australia, which saw 70 of 320 flights cancelled.

2. Data Science in Healthcare

The healthcare sector is benefiting immensely from advancements in AI. Data science, especially in medical imaging, has been helping healthcare professionals come up with better diagnoses and effective treatments for patients. Similarly, several advanced healthcare analytics tools have been developed to generate clinical insights for improving patient care. These tools also assist in defining personalized medications for patients, reducing operating costs for clinics and hospitals. Apart from medical imaging and computer vision, Natural Language Processing (NLP) is frequently used in the healthcare domain to study published textual research data.

A. Pharmaceutical

Driving innovation with NLP: Novo Nordisk. Novo Nordisk uses the Linguamatics NLP platform for text mining across internal and external data sources, including scientific abstracts, patents, grants, news, tech transfer offices from universities worldwide, and more. These NLP queries run across sources for the key therapeutic areas of interest to the Novo Nordisk R&D community. Several NLP algorithms have been developed for topics such as safety, efficacy, randomized controlled trials, patient populations, dosing, and devices. Novo Nordisk employs a data pipeline to extend the tools' success to real-world data, and uses interactive dashboards and cloud services to visualize the standardized, structured information from these queries, exploring commercial effectiveness, market situations, potential, and gaps in product documentation. Through data science, they are able to automate the process of generating insights, save time, and provide better insights for evidence-based decision making.

B. Biotech

How AstraZeneca harnesses data for innovation in medicine. AstraZeneca is a globally known biotech company that leverages AI technology and data to discover and deliver new, effective medicines faster. Within their R&D teams, they are using AI to decode big data to better understand diseases like cancer, respiratory disease, and heart, kidney, and metabolic diseases so they can be treated effectively. Using data science, they can identify new targets for innovative medications. In 2021, they selected their first two AI-generated drug targets, collaborating with BenevolentAI, in Chronic Kidney Disease and Idiopathic Pulmonary Fibrosis.

Data science is also helping AstraZeneca design better clinical trials, achieve personalized medication strategies, and innovate the process of developing new medicines. Their Center for Genomics Research aims to analyze around two million genomes by 2026 using data science and AI. Apart from this, for imaging purposes they are training AI systems to check images for disease and biomarkers to develop effective medicines. This approach helps them analyze samples accurately and more effortlessly, and it can cut analysis time by around 30%.

AstraZeneca also utilizes AI and machine learning to optimize the process at different stages and minimize the overall time for the clinical trials by analyzing the clinical trial data. Summing up, they use data science to design smarter clinical trials, develop innovative medicines, improve drug development and patient care strategies, and many more.

C. Wearable Technology  

Wearable technology is a multi-billion-dollar industry. With an increasing awareness about fitness and nutrition, more individuals now prefer using fitness wearables to track their routines and lifestyle choices.  

Fitness wearables are convenient to use, assist users in tracking their health, and encourage them to lead a healthier lifestyle. The medical devices in this domain are beneficial since they help monitor the patient's condition and communicate in an emergency situation. The regularly used fitness trackers and smartwatches from renowned companies like Garmin, Apple, FitBit, etc., continuously collect physiological data of the individuals wearing them. These wearable providers offer user-friendly dashboards to their customers for analyzing and tracking progress in their fitness journey.

3. Covid 19 and Data Science

In the past two years of the Pandemic, the power of data science has been more evident than ever. Pharmaceutical companies across the globe were able to synthesize Covid 19 vaccines quickly by analyzing data to understand the trends and patterns of the outbreak. Data science made it possible to track the virus in real time, predict patterns, devise effective strategies to fight the Pandemic, and much more.

How Johnson and Johnson uses data science to fight the Pandemic   

The  data science team  at  Johnson and Johnson  leverages real-time data to track the spread of the virus. They built a global surveillance dashboard (granulated to county level) that helps them track the Pandemic's progress, predict potential hotspots of the virus, and narrow down the likely place where they should test its investigational COVID-19 vaccine candidate. The team works with in-country experts to determine whether official numbers are accurate and find the most valid information about case numbers, hospitalizations, mortality and testing rates, social compliance, and local policies to populate this dashboard. The team also studies the data to build models that help the company identify groups of individuals at risk of getting affected by the virus and explore effective treatments to improve patient outcomes.

4. Data Science in E-commerce  

In the  e-commerce sector , big data analytics can assist in customer analysis, reduce operational costs, forecast trends for better sales, provide personalized shopping experiences to customers, and many more.  

Amazon uses data science to personalize shopping experiences and improve customer satisfaction. Amazon is a globally leading e-commerce platform that offers a wide range of online shopping services. As a result, Amazon generates a massive amount of data that can be leveraged to understand consumer behavior and generate insights on competitors' strategies. Amazon uses this data to provide product and service recommendations to its users, persuading consumers into buying and generating additional sales. This approach works well for Amazon, with roughly 35% of its yearly revenue attributed to this technique. Additionally, Amazon collects consumer data for faster order tracking and better deliveries.

Similarly, Amazon's virtual assistant, Alexa, can converse in different languages and uses speakers and a camera to interact with users. Amazon utilizes users' audio commands to improve Alexa and deliver a better user experience.

5. Data Science in Supply Chain Management

Predictive analytics and big data are driving innovation in the supply chain domain. They offer greater visibility into company operations, reduce costs and overheads, enable demand forecasting, predictive maintenance, product pricing, route optimization, and fleet management, minimize supply chain interruptions, and drive better performance.

Optimizing supply chain with big data analytics: UPS

UPS is a renowned package delivery and supply chain management company. With thousands of packages delivered every day, a UPS driver makes about 100 deliveries each business day on average. On-time, safe package delivery is crucial to UPS's success, so UPS built an optimized navigation tool, ''ORION'' (On-Road Integrated Optimization and Navigation), which uses highly advanced big data processing algorithms. The tool provides UPS drivers with route optimization with respect to fuel, distance, and time. UPS utilizes supply chain data analysis in all aspects of its shipping process: data about packages and deliveries are captured through radars and sensors, and deliveries and routes are optimized using big data systems. Overall, this approach has helped UPS save 1.6 million gallons of gasoline in transportation every year, significantly reducing delivery costs.
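ORION's algorithms are proprietary, but the flavor of route optimization can be shown with a toy greedy heuristic over invented 2-D stop coordinates: from the current position, always drive to the closest unvisited stop.

```python
# Toy nearest-neighbour routing heuristic over invented delivery stops.
# This is an illustration of route ordering, not ORION's actual method.
import math

def nearest_neighbour_route(depot, stops):
    """Greedy route: from the depot, always visit the closest unvisited stop."""
    route, here, remaining = [], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(here, p))
        route.append(nxt)
        remaining.remove(nxt)
        here = nxt
    return route

depot = (0, 0)
stops = [(5, 5), (1, 0), (2, 1)]
route = nearest_neighbour_route(depot, stops)
```

Greedy heuristics like this are fast but suboptimal; production route optimizers combine far more constraints (time windows, fuel, left-turn avoidance) with much stronger search methods.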

6. Data Science in Meteorology

Weather prediction is an interesting  application of data science . Businesses like aviation, agriculture and farming, construction, consumer goods, sporting events, and many more are dependent on climatic conditions. The success of these businesses is closely tied to the weather, as decisions are made after considering the weather predictions from the meteorological department.   

Besides, weather forecasts are extremely helpful for individuals to manage their allergic conditions. One crucial application of weather forecasting is natural disaster prediction and risk management.  

Weather forecasting begins with collecting a large amount of data on current environmental conditions (wind speed, temperature, humidity, cloud cover at a specific location and time) using sensors on IoT (Internet of Things) devices and satellite imagery. This gathered data is then analyzed using an understanding of atmospheric processes, and machine learning models are built to make predictions about upcoming weather conditions, such as rainfall or snow. Although data science cannot help avoid natural calamities like floods, hurricanes, or forest fires, tracking these natural phenomena well ahead of their arrival is beneficial. Such predictions give governments sufficient time to take the necessary steps and measures to ensure the safety of the population.
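As a toy illustration of the modeling step, here is a k-nearest-neighbours classifier predicting rain from invented (humidity, pressure) readings; real forecasting uses far richer, physics-informed features and models:

```python
# Toy k-nearest-neighbours classifier predicting rain from (humidity %,
# pressure hPa) readings. The observations are invented for illustration.
import math

def knn_predict(query, observations, k=3):
    """observations: list of (features, label). Majority vote of k nearest."""
    nearest = sorted(observations, key=lambda o: math.dist(query, o[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

obs = [
    ((90, 1005), "rain"), ((85, 1002), "rain"), ((88, 1000), "rain"),
    ((40, 1020), "dry"),  ((35, 1022), "dry"),  ((45, 1018), "dry"),
]
weather_call = knn_predict((87, 1003), obs, k=3)
```

The pattern-matching idea is exactly the one described above: a new reading is classified by the past conditions it most resembles.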

IMD leveraged data science to achieve a record 1.2m evacuation before cyclone ''Fani''   

Data scientists in meteorology rely on satellite images to make short-term forecasts, decide whether a forecast is correct, and validate models. Machine learning is also used for pattern matching in this case: it can forecast future weather conditions if it recognizes a past pattern. With dependable equipment, sensor data helps produce local forecasts from actual weather models. IMD used satellite pictures to study the low-pressure zones forming off the Odisha coast (India). In April 2019, thirteen days before cyclone ''Fani'' reached the area, IMD (India Meteorological Department) warned that a massive storm was underway, and the authorities began preparing safety measures.

It was one of the most powerful cyclones to strike India in the last 20 years, and a record 1.2 million people were evacuated in less than 48 hours, thanks to the power of data science.

7. Data Science in the Entertainment Industry

Due to the Pandemic, demand for OTT (Over-the-top) media platforms has grown significantly. People prefer watching movies and web series or listening to the music of their choice at leisure in the convenience of their homes. This sudden growth in demand has given rise to stiff competition. Every platform now uses data analytics in different capacities to provide better-personalized recommendations to its subscribers and improve user experience.   

How Netflix uses data science to personalize the content and improve recommendations  

Netflix is an extremely popular internet television platform with streamable content offered in several languages, catering to varied audiences. In 2006, Netflix set out to increase the accuracy of its existing ''Cinematch'' recommendation platform by 10% and offered a $1 million prize to the winning team. The approach was successful: at the end of the competition, a solution developed by the BellKor team increased prediction accuracy by 10.06%. Over 200 work hours and an ensemble of 107 algorithms produced this result, and the winning algorithms are now part of the Netflix recommendation system.

Netflix also employs Ranking Algorithms to generate personalized recommendations of movies and TV Shows appealing to its users.   

Spotify uses big data to deliver a rich user experience for online music streaming  

Personalized online music streaming is another area where data science is used. Spotify is a well-known on-demand music service launched in 2008 that has effectively leveraged big data to create personalized experiences for each user. It is a huge platform with more than 24 million subscribers and a database of nearly 20 million songs, and it uses this big data to offer a rich experience to its users. Spotify uses this data and various algorithms to train machine learning models that provide personalized content. Spotify's "Discover Weekly" feature generates a personalized playlist of fresh, unheard songs matching the user's taste every week, and with the Spotify "Wrapped" feature, users get an overview of their favorite or most frequently played songs of the year each December. Spotify also leverages the data to run targeted ads to grow its business. Thus, Spotify utilizes user data, together with some external data, to deliver a high-quality user experience.

8. Data Science in Banking and Finance

Data science is extremely valuable in the banking and finance industry. It powers several high-priority functions: credit risk modeling (estimating the likelihood that a loan is repaid), fraud detection (spotting malicious activity or irregularities in transaction patterns using machine learning), identifying customer lifetime value (predicting bank performance based on existing and potential customers), and customer segmentation (profiling customers by behavior and characteristics to personalize offers and services). Finally, data science is also used in real-time predictive analytics (computational techniques to predict future events).
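As a minimal, hedged illustration of the fraud-detection idea, here is a z-score rule that flags transactions far above a customer's typical spend; the amounts and the 3-sigma threshold are illustrative, and real systems use supervised machine learning over many features:

```python
# Toy anomaly flagging for the fraud-detection use case: flag any
# transaction more than `threshold` standard deviations above a
# customer's mean spend. Amounts and threshold are illustrative only.
import math

def flag_outliers(amounts, threshold=3.0):
    """Return transactions whose z-score exceeds the threshold."""
    n = len(amounts)
    mean = sum(amounts) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in amounts) / n)
    if std == 0:
        return []
    return [a for a in amounts if (a - mean) / std > threshold]

# Hypothetical spend history with one suspiciously large transaction.
history = [40, 55, 38, 60, 45, 52, 48, 43, 57, 50, 41, 46, 900]
suspicious = flag_outliers(history)
```

Simple statistical rules like this are a common first line of defence before richer transaction-pattern models take over.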

How HDFC utilizes Big Data Analytics to increase revenues and enhance the banking experience    

One of the major private banks in India, HDFC Bank, was an early adopter of AI. It started with big data analytics in 2004, intending to grow its revenue and to understand its customers and markets better than its competitors. Back then, the bank was a trendsetter, setting up an enterprise data warehouse to track the differentiation to be given to customers based on their relationship value with HDFC Bank. Data science and analytics have been crucial in helping HDFC Bank segment its customers and offer customized personal or commercial banking services. Its analytics engine and SaaS tools have helped the bank cross-sell relevant offers to its customers. Beyond routine fraud prevention, analytics keeps track of customer credit histories and underpins the speedy loan approvals the bank offers.  

9. Data Science in Urban Planning and Smart Cities  

Data science can help make the dream of smart cities come true! Everything from traffic flow to energy usage can be optimized using data science techniques. Data fetched from multiple sources can be used to understand trends and to plan urban living in an organized manner.  

A notable data science case study is traffic management in the city of Pune. The city controls and adjusts its traffic signals dynamically by tracking traffic flow: real-time data is fetched from cameras and sensors installed at the signals, and signal timing is managed based on this information. This proactive approach keeps congestion in check and traffic flowing smoothly. A similar case study comes from Bhubaneswar, where the municipality runs platforms through which residents can offer suggestions and actively participate in decision-making. The government reviews these inputs before making decisions, framing rules, or providing the facilities its residents actually need.  
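Dynamic signal control of the kind described above can be approximated, at its simplest, by reallocating a fixed signal cycle in proportion to observed vehicle counts. The sketch below illustrates that idea with invented counts; real adaptive systems like Pune's use far richer optimization:

```python
def green_times(counts, cycle=120, min_green=10):
    # Split one fixed signal cycle (in seconds) across approaches in
    # proportion to observed vehicle counts, guaranteeing every
    # approach a minimum green time.
    total = sum(counts.values())
    spare = cycle - min_green * len(counts)
    return {a: round(min_green + spare * c / total) for a, c in counts.items()}

# Invented sensor counts for one intersection.
print(green_times({"north": 90, "south": 30, "east": 45, "west": 15}))
```

The busiest approach (north) receives the longest green phase, while the lightly used west approach still gets its guaranteed minimum.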

10. Data Science in Agricultural Yield Prediction   

Have you ever wondered how helpful it would be to predict your agricultural yield in advance? That is exactly what data science is helping farmers with. They can estimate how much a given area will produce based on different environmental factors and soil types, and use this information to make informed decisions about their yield that benefit both the buyers and themselves in multiple ways.  


Farmers across the globe use various data science techniques to understand multiple aspects of their farms and crops. A famous example of data science in the agricultural industry is the work done by Farmers Edge, a Canadian company that captures real-time imagery of farms around the world and combines it with related data. Farmers use this data to make decisions that improve their yield and their produce. Similarly, farmers in countries like Ireland use satellite-based information to move beyond traditional methods and multiply their yield strategically.  
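Yield prediction is, at its core, regression: relate environmental inputs to output per hectare. The sketch below fits an ordinary-least-squares line to invented rainfall/yield pairs; production systems like those described above use many more features and far more data:

```python
# Invented plots: (rainfall_mm, yield_tonnes_per_hectare).
plots = [(400, 2.0), (500, 2.6), (600, 3.1), (700, 3.5), (800, 4.1)]

def fit_ols(points):
    # Closed-form ordinary least squares for y = a * x + b.
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_ols(plots)
print(a * 650 + b)  # predicted yield at 650 mm of rainfall
```

A farmer could plug a seasonal rainfall forecast into the fitted line to get a rough yield estimate before planting.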

11. Data Science in the Transportation Industry   

Transportation keeps the world moving. People and goods commute from one place to another for various purposes, and it is fair to say that the world would come to a standstill without efficient transportation. That is why it is crucial to keep the transportation industry running as smoothly as possible, and data science helps a great deal here. Technological progress has brought a range of supporting devices, such as traffic sensors, monitoring display systems, and mobility management systems.  

Many cities have already adopted multi-modal transportation systems. They use GPS trackers, geo-location and CCTV cameras to monitor and manage their transportation networks. Uber is the perfect case study for understanding the use of data science in the transportation industry. It optimizes its ride-sharing feature and tracks delivery routes through data analysis, and this data science approach has enabled it to serve more than 100 million users, making transportation easy and convenient. Moreover, Uber uses the data it fetches from users daily to offer cost-effective and quickly available rides.  
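Route optimization of the kind ride-sharing platforms rely on reduces, in its simplest form, to shortest-path search over a road network weighted by travel time. Below is a standard Dijkstra implementation on an invented toy network (not Uber's actual routing engine):

```python
import heapq

def shortest_route(graph, src, dst):
    # Dijkstra's algorithm over edge weights (travel time in minutes).
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(pq, (nd, nbr))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

# Invented road network; weights are minutes between intersections.
roads = {
    "pickup": {"A": 4, "B": 2},
    "B": {"A": 1, "C": 7},
    "A": {"C": 3},
    "C": {"dropoff": 2},
}
print(shortest_route(roads, "pickup", "dropoff"))  # → (['pickup', 'B', 'A', 'C', 'dropoff'], 8)
```

Note how the cheapest route detours through B even though a direct edge to A exists — exactly the kind of decision a routing engine makes continuously from live traffic data.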

12. Data Science in the Environmental Industry    

Increasing pollution, global warming, climate change and other adverse environmental impacts have forced the world to pay attention to the environmental industry. Multiple initiatives are underway across the globe to preserve the environment and make the world a better place. Though industry recognition and these efforts are still in their early stages, the impact is significant and the growth is fast.  

A prominent use of data science in the environmental industry comes from NASA and other research organizations worldwide. NASA collects data on current climate conditions and uses it to inform remedial policies that can make a difference. Data science also helps researchers predict natural disasters well ahead of time, so that the potential damage can be avoided or at least considerably reduced. A similar case study comes from the World Wildlife Fund, which uses data science to track deforestation data and reduce the illegal cutting of trees, thereby helping to preserve the environment.  

Where to Find Full Data Science Case Studies?  

Data science is a fast-evolving domain with many practical applications and a huge open community. Hence, the best way to keep up with the latest trends is to read case studies and technical articles. Companies often share success stories of how data science helped them achieve their goals, both to showcase their capabilities and to benefit the greater good. Such case studies are available online on the respective company websites and on dedicated technology forums like Towards Data Science and Medium.  

Additionally, practical examples can be found in recently published research papers and data science textbooks.  

What Are the Skills Required for Data Scientists?  

Data scientists play a central role in the data science process, as they work on the data end to end. Working on a data science case study requires several skills: a good grasp of data science fundamentals, deep knowledge of statistics, excellent programming skills in Python or R, experience with data manipulation and data analysis, the ability to create compelling data visualizations, and a good knowledge of big data, machine learning, and deep learning concepts for model building and deployment. Apart from these technical skills, data scientists also need to be good storytellers with an analytical mind and strong communication skills.    



These were some interesting data science case studies across different industries. There are many more domains where data science has exciting applications, such as education, where data can be used to monitor student and instructor performance and to develop innovative curricula that are in sync with industry expectations.   

Almost all companies looking to leverage the power of big data begin with a SWOT analysis to narrow down the problems they intend to solve with data science. They then assess their competitors in order to develop relevant data science tools and strategies that address these challenges. This approach allows them to differentiate themselves from their competitors and offer something unique to their customers.  

With data science, companies have become smarter and more data-driven, bringing about tremendous growth; data science has also made many of these organizations more sustainable. The utility of data science across sectors is clearly visible, yet a lot is left to be explored and more is yet to come. Data science will continue to boost the performance of organizations in this age of big data.  

Frequently Asked Questions (FAQs)

A case study in data science requires a systematic and organized approach to solving the problem. Generally, four main steps are needed to tackle every data science case study: 

  • Define the problem statement and a strategy to solve it  
  • Gather and pre-process the data, making relevant assumptions  
  • Select tools and appropriate algorithms to build machine learning / deep learning models 
  • Make predictions, accept the solution based on evaluation metrics, and improve the model if necessary 
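The four steps above can be sketched end to end on a toy problem. The code below defines a problem, inlines and scales some invented measurements, fits a nearest-centroid classifier, and evaluates it — a deliberately tiny stand-in for a real case study workflow:

```python
# 1) Problem: label invented flower measurements as "small" or "large".
raw = [((1.0, 2.0), "small"), ((1.2, 1.8), "small"),
       ((3.0, 4.2), "large"), ((2.8, 4.0), "large")]

# 2) Gather & pre-process: min-max scale each feature to [0, 1].
xs = [x for x, _ in raw]
lo = [min(v[i] for v in xs) for i in range(2)]
hi = [max(v[i] for v in xs) for i in range(2)]

def scale(v):
    return tuple((v[i] - lo[i]) / (hi[i] - lo[i]) for i in range(2))

# 3) Model: nearest-centroid classifier on the scaled data.
centroids = {}
for label in ["small", "large"]:
    pts = [scale(x) for x, y in raw if y == label]
    centroids[label] = tuple(sum(p[i] for p in pts) / len(pts) for i in range(2))

def predict(point):
    p = scale(point)
    return min(centroids, key=lambda c: sum((p[i] - centroids[c][i]) ** 2 for i in range(2)))

# 4) Evaluate: accuracy on the (toy-sized) training set; improve if needed.
accuracy = sum(predict(x) == y for x, y in raw) / len(raw)
print(predict((1.1, 1.9)), accuracy)
```

On a real case study each step expands enormously — held-out evaluation replaces training-set accuracy, for instance — but the skeleton stays the same.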

Getting data for a case study starts with a reasonable understanding of the problem, which clarifies what we expect the dataset to include. Finding relevant data for a case study still requires some effort. Although it is possible to collect relevant data using traditional techniques like surveys and questionnaires, good-quality datasets are also available online on platforms like Kaggle, the UCI Machine Learning Repository, Azure Open Datasets, government open data portals, Google Public Datasets, data.world, and so on.  

Data science projects involve multiple steps to process data and extract valuable insights. A typical project includes defining the problem statement, gathering the relevant data, data pre-processing, data exploration and analysis, algorithm selection, model building, model prediction, model optimization, and communicating the results through dashboards and reports.  


Devashree Madhugiri

Devashree holds an M.Eng degree in Information Technology from Germany and a background in Data Science. She likes working with statistics and discovering hidden insights in varied datasets to create stunning dashboards. She enjoys sharing her knowledge in AI by writing technical articles on various technological platforms. She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.


Case studies & examples

Articles, use cases, and proof points describing projects undertaken by data managers and data practitioners across the federal government

Agencies Mobilize to Improve Emergency Response in Puerto Rico through Better Data

Federal agencies' response efforts to Hurricanes Irma and Maria in Puerto Rico were hampered by imperfect address data for the island. In the aftermath, emergency responders gathered to enhance the utility of Puerto Rico address data and share best practices for using the information currently available.

Federal Data Strategy

BUILDER: A Science-Based Approach to Infrastructure Management

The Department of Energy’s National Nuclear Security Administration (NNSA) adopted a data-driven, risk-informed strategy to better assess risks, prioritize investments, and cost effectively modernize its aging nuclear infrastructure. NNSA’s new strategy, and lessons learned during its implementation, will help inform other federal data practitioners’ efforts to maintain facility-level information while enabling accurate and timely enterprise-wide infrastructure analysis.

Department of Energy

data management , data analysis , process redesign , Federal Data Strategy

Business case for open data

Six reasons why making your agency's data open and accessible is a good business decision.

CDO Council Federal HR Dashboarding Report - 2021

The CDO Council worked with the US Department of Agriculture, the Department of the Treasury, the United States Agency for International Development, and the Department of Transportation to develop a Diversity Profile Dashboard and to explore the value of shared HR decision support across agencies. The pilot was a success and identified the potential impact of a standardized suite of HR dashboards, in addition to demonstrating the value of collaborative analytics between agencies.

Federal Chief Data Officer's Council

data practices , data sharing , data access

CDOC Data Inventory Report

The Chief Data Officers Council Data Inventory Working Group developed this paper to highlight the value proposition for data inventories and describe challenges agencies may face when implementing and managing comprehensive data inventories. It identifies opportunities agencies can take to overcome some of these challenges and includes a set of recommendations directed at Agencies, OMB, and the CDO Council (CDOC).

data practices , metadata , data inventory

DSWG Recommendations and Findings

The Chief Data Officer Council (CDOC) established a Data Sharing Working Group (DSWG) to help the council understand the varied data-sharing needs and challenges of all agencies across the Federal Government. The DSWG reviewed data-sharing across federal agencies and developed a set of recommendations for improving the methods to access and share data within and between agencies. This report presents the findings of the DSWG’s review and provides recommendations to the CDOC Executive Committee.

data practices , data agreements , data sharing , data access

Data Skills Training Program Implementation Toolkit

The Data Skills Training Program Implementation Toolkit is designed to provide both small and large agencies with information to develop their own data skills training programs. The information provided will serve as a roadmap to the design, implementation, and administration of federal data skills training programs as agencies address their Federal Data Strategy’s Agency Action 4 gap-closing strategy training component.

data sharing , Federal Data Strategy

Data Standdown: Interrupting process to fix information

Although not a true pause in operations, ONR’s data standdown made data quality and data consolidation the top priority for the entire organization. It aimed to establish an automated and repeatable solution to enable a more holistic view of ONR investments and activities, and to increase transparency and effectiveness throughout its mission support functions. In addition, it demonstrated that getting top-level buy-in from management to prioritize data can truly advance a more data-driven culture.

Office of Naval Research

data governance , data cleaning , process redesign , Federal Data Strategy

Data.gov Metadata Management Services Product-Preliminary Plan

Status summary and preliminary business plan for a potential metadata management product under development by the Data.gov Program Management Office

data management , Federal Data Strategy , metadata , open data

PDF (7 pages)

Department of Transportation Case Study: Enterprise Data Inventory

In response to the Open Government Directive, DOT developed a strategic action plan to inventory and release high-value information through the Data.gov portal. The Department sustained efforts in building its data inventory, responding to the President’s memorandum on regulatory compliance with a comprehensive plan that was recognized as a model for other agencies to follow.

Department of Transportation

data inventory , open data

Department of Transportation Model Data Inventory Approach

This document from the Department of Transportation provides a model plan for conducting data inventory efforts required under OMB Memorandum M-13-13.

data inventory

PDF (5 pages)

FEMA Case Study: Disaster Assistance Program Coordination

In 2008, the Disaster Assistance Improvement Program (DAIP), an E-Government initiative led by FEMA with support from 16 U.S. Government partners, launched DisasterAssistance.gov to simplify the process for disaster survivors to identify and apply for disaster assistance. DAIP utilized existing partner technologies and implemented a services oriented architecture (SOA) that integrated the content management system and rules engine supporting Department of Labor’s Benefits.gov applications with FEMA’s Individual Assistance Center application. The FEMA SOA serves as the backbone for data sharing interfaces with three of DAIP’s federal partners and transfers application data to reduce duplicate data entry by disaster survivors.

Federal Emergency Management Agency

data sharing

Federal CDO Data Skills Training Program Case Studies

This series was developed by the Chief Data Officer Council’s Data Skills & Workforce Development Working Group to provide support to agencies in implementing the Federal Data Strategy’s Agency Action 4 gap-closing strategy training component in FY21.

FederalRegister.gov API Case Study

This case study describes the tenets behind an API that provides access to all data found on FederalRegister.gov, including all Federal Register documents from 1994 to the present.

National Archives and Records Administration

PDF (3 pages)

Fuels Knowledge Graph Project

The Fuels Knowledge Graph Project (FKGP), funded through the Federal Chief Data Officers (CDO) Council, explored the use of knowledge graphs to achieve more consistent and reliable fuel management performance measures. The team hypothesized that better performance measures and an interoperable semantic framework could enhance the ability to understand wildfires and, ultimately, improve outcomes. To develop a more systematic and robust characterization of program outcomes, the FKGP team compiled, reviewed, and analyzed multiple agency glossaries and data sources. The team examined the relationships between them, while documenting the data management necessary for a successful fuels management program.

metadata , data sharing , data access

Government Data Hubs

A list of Federal agency open data hubs, including USDA, HHS, NASA, and many others.

Helping Baltimore Volunteers Find Where to Help

Bloomberg Government analysts put together a prototype through the Census Bureau’s Opportunity Project to better assess where volunteers should direct litter-clearing efforts. Using Census Bureau and Forest Service information, the team brought a data-driven approach to their work. Their experience reveals how individuals with data expertise can identify a real-world problem that data can help solve, navigate across agencies to find and obtain the most useful data, and work within resource constraints to provide a tool to help address the problem.

Census Bureau

geospatial , data sharing , Federal Data Strategy

How USDA Linked Federal and Commercial Data to Shed Light on the Nutritional Value of Retail Food Sales

Purchase-to-Plate Crosswalk (PPC) links the more than 359,000 food products in a commercial company database to several thousand foods in a series of USDA nutrition databases. By linking existing data resources, USDA was able to enrich and expand the analysis capabilities of both datasets. Since there were no common identifiers between the two data structures, the team used probabilistic and semantic methods to reduce the manual effort required to link the data.

Department of Agriculture

data sharing , process redesign , Federal Data Strategy

How to Blend Your Data: BEA and BLS Harness Big Data to Gain New Insights about Foreign Direct Investment in the U.S.

A recent collaboration between the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS) helps shed light on the segment of the American workforce employed by foreign multinational companies. This case study shows the opportunities of cross-agency data collaboration, as well as some of the challenges of using big data and administrative data in the federal government.

Bureau of Economic Analysis / Bureau of Labor Statistics

data sharing , workforce development , process redesign , Federal Data Strategy

Implementing Federal-Wide Comment Analysis Tools

The CDO Council Comment Analysis pilot has shown that recent advances in Natural Language Processing (NLP) can effectively aid the regulatory comment analysis process. The proof-of-concept is a standardized toolset intended to support agencies and staff in reviewing and responding to the millions of public comments received each year across government.

Improving Data Access and Data Management: Artificial Intelligence-Generated Metadata Tags at NASA

NASA’s data scientists and research content managers recently built an automated tagging system using machine learning and natural language processing. This system serves as an example of how other agencies can use their own unstructured data to improve information accessibility and promote data reuse.

National Aeronautics and Space Administration

metadata , data management , data sharing , process redesign , Federal Data Strategy
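Automated metadata tagging of the kind NASA describes typically scores candidate terms by how distinctive they are for a document. As a rough, hypothetical illustration (not NASA's implementation), the snippet below ranks TF-IDF-weighted terms as tags for invented catalog snippets:

```python
import math
import re
from collections import Counter

# Invented dataset descriptions standing in for catalog records.
docs = {
    "d1": "solar wind data from the heliophysics mission archive",
    "d2": "climate model output and atmospheric data products",
    "d3": "lunar surface imagery from the mission archive",
}

def tokenize(text):
    # Lowercase word tokens; very short words are dropped as stopword-ish.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 3]

# Document frequency of each term, for IDF weighting.
df = Counter()
for text in docs.values():
    df.update(set(tokenize(text)))
N = len(docs)

def auto_tags(doc_id, k=2):
    # Score terms by TF-IDF and keep the top k as metadata tags.
    tf = Counter(tokenize(docs[doc_id]))
    scored = {t: c * math.log(N / df[t]) for t, c in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

print(auto_tags("d2"))  # distinctive terms outrank common ones like "data"
```

Production systems use trained models rather than raw TF-IDF, but the goal is the same: surface the terms that best distinguish a record so it becomes findable.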

Investing in Learning with the Data Stewardship Tactical Working Group at DHS

The Department of Homeland Security (DHS) experience forming the Data Stewardship Tactical Working Group (DSTWG) provides meaningful insights for those who want to address data-related challenges collaboratively and successfully in their own agencies.

Department of Homeland Security

data governance , data management , Federal Data Strategy

Leveraging AI for Business Process Automation at NIH

The National Institute of General Medical Sciences (NIGMS), one of the twenty-seven institutes and centers at the NIH, recently deployed Natural Language Processing (NLP) and Machine Learning (ML) to automate the process by which it receives and internally refers grant applications. This new approach ensures efficient and consistent grant application referral, and liberates Program Managers from the labor-intensive and monotonous referral process.

National Institutes of Health

standards , data cleaning , process redesign , AI

FDS Proof Point

National Broadband Map: A Case Study on Open Innovation for National Policy

The National Broadband Map is a tool that provides consumers nationwide with reliable information on broadband internet connections. This case study describes how crowdsourcing, open-source software, and public engagement informed the development of a tool that promotes government transparency.

Federal Communications Commission

National Renewable Energy Laboratory API Case Study

This case study describes the launch of the National Renewable Energy Laboratory (NREL) Developer Network in October 2011. The main goal was to build an overarching platform to make it easier for the public to use NREL APIs and for NREL to produce APIs.

National Renewable Energy Laboratory

Open Energy Data at DOE

This case study details the development of the renewable energy applications built on the Open Energy Information (OpenEI) platform, sponsored by the Department of Energy (DOE) and implemented by the National Renewable Energy Laboratory (NREL).

open data , data sharing , Federal Data Strategy

Pairing Government Data with Private-Sector Ingenuity to Take on Unwanted Calls

The Federal Trade Commission (FTC) releases data from millions of consumer complaints about unwanted calls to help fuel a myriad of private-sector solutions to tackle the problem. The FTC’s work serves as an example of how agencies can work with the private sector to encourage the innovative use of government data toward solutions that benefit the public.

Federal Trade Commission

data cleaning , Federal Data Strategy , open data , data sharing

Profile in Data Sharing - National Electronic Interstate Compact Enterprise

The Federal CDO Council’s Data Sharing Working Group highlights successful data sharing activities to recognize mature data sharing practices as well as to incentivize and inspire others to take part in similar collaborations. This Profile in Data Sharing focuses on how the federal government and states support children who are being placed for adoption or foster care across state lines. The National Electronic Interstate Compact Enterprise (NEICE) greatly reduces the work and time required for states to exchange the paperwork and information needed to process the placements. Additionally, NEICE allows child welfare workers to communicate and provide timely updates to courts, relevant private service providers, and families.

Profile in Data Sharing - National Health Service Corps Loan Repayment Programs

The Federal CDO Council’s Data Sharing Working Group highlights successful data sharing activities to recognize mature data sharing practices as well as to incentivize and inspire others to take part in similar collaborations. This Profile in Data Sharing focuses on how the Health Resources and Services Administration collaborates with the Department of Education to make it easier to apply to serve medically underserved communities - reducing applicant burden and improving processing efficiency.

Profile in Data Sharing - Roadside Inspection Data

The Federal CDO Council’s Data Sharing Working Group highlights successful data sharing activities to recognize mature data sharing practices as well as to incentivize and inspire others to take part in similar collaborations. This Profile in Data Sharing focuses on how the Department of Transportation collaborates with Customs and Border Protection and state partners to prescreen commercial motor vehicles entering the US and to focus inspections on unsafe carriers and drivers.

Profiles in Data Sharing - U.S. Citizenship and Immigration Service

The Federal CDO Council’s Data Sharing Working Group highlights successful data sharing activities to recognize mature data sharing practices as well as to incentivize and inspire others to take part in similar collaborations. This Profile in Data Sharing focuses on how the U.S. Citizenship and Immigration Service (USCIS) collaborated with the Centers for Disease Control to notify state, local, tribal, and territorial public health authorities so they can connect with individuals in their communities about their potential exposure.

SBA’s Approach to Identifying Data, Using a Learning Agenda, and Leveraging Partnerships to Build its Evidence Base

Through its Enterprise Learning Agenda, Small Business Administration’s (SBA) staff identify essential research questions, a plan to answer them, and how data held outside the agency can help provide further insights. Other agencies can learn from the innovative ways SBA identifies data to answer agency strategic questions and adopt those aspects that work for their own needs.

Small Business Administration

process redesign , Federal Data Strategy

Supercharging Data through Validation as a Service

USDA's Food and Nutrition Service restructured its approach to data validation at the state level using an open-source, API-based validation service managed at the federal level.

data cleaning , data validation , API , data sharing , process redesign , Federal Data Strategy

The Census Bureau Uses Its Own Data to Increase Response Rates, Helps Communities and Other Stakeholders Do the Same

The Census Bureau team produced a new interactive mapping tool in early 2018 called the Response Outreach Area Mapper (ROAM), an application that resulted in wider use of authoritative Census Bureau data, not only to improve the Census Bureau’s own operational efficiency, but also for use by tribal, state, and local governments, national and local partners, and other community groups. Other agency data practitioners can learn from the Census Bureau team’s experience communicating technical needs to non-technical executives, building analysis tools with widely-used software, and integrating efforts with stakeholders and users.

open data , data sharing , data management , data analysis , Federal Data Strategy

The Mapping Medicare Disparities Tool

The Centers for Medicare & Medicaid Services’ Office of Minority Health (CMS OMH) Mapping Medicare Disparities Tool harnessed the power of millions of data records while protecting the privacy of individuals, creating an easy-to-use tool to better understand health disparities.

Centers for Medicare & Medicaid Services

geospatial , Federal Data Strategy , open data

The Veterans Legacy Memorial

The Veterans Legacy Memorial (VLM) is a digital platform to help families, survivors, and fellow veterans to take a leading role in honoring their beloved veteran. Built on millions of existing National Cemetery Administration (NCA) records in a 25-year-old database, VLM is a powerful example of an agency harnessing the potential of a legacy system to provide a modernized service that better serves the public.

Veterans Administration

data sharing , data visualization , Federal Data Strategy

Transitioning to a Data Driven Culture at CMS

This case study describes how CMS announced the creation of the Office of Information Products and Data Analytics (OIPDA) to take the lead in making data use and dissemination a core function of the agency.

data management , data sharing , data analysis , data analytics

PDF (10 pages)

U.S. Department of Labor Case Study: Software Development Kits

The U.S. Department of Labor sought to go beyond merely making data available to developers and take ease of use of the data to the next level by giving developers tools that would make using DOL’s data easier. DOL created software development kits (SDKs), which are downloadable code packages that developers can drop into their apps, making access to DOL’s data easy for even the most novice developer. These SDKs have even been published as open source projects with the aim of speeding up their conversion to SDKs that will eventually support all federal APIs.

Department of Labor

open data , API

U.S. Geological Survey and U.S. Census Bureau collaborate on national roads and boundaries data

It is a well-kept secret that the U.S. Geological Survey and the U.S. Census Bureau were the original two federal agencies to build the first national digital database of roads and boundaries in the United States. The agencies joined forces to develop homegrown computer software and state-of-the-art technologies to convert existing USGS topographic maps of the nation to the points, lines, and polygons that fueled early GIS. Today, the USGS and Census Bureau share a longstanding goal of leveraging authoritative roads and boundary datasets.

U.S. Geological Survey and U.S. Census Bureau

data management , data sharing , data standards , data validation , data visualization , Federal Data Strategy , geospatial , open data , quality

USA.gov Uses Human-Centered Design to Roll Out AI Chatbot

To improve customer service and give better answers to users of the USA.gov website, the Technology Transformation and Services team at General Services Administration (GSA) created a chatbot using artificial intelligence (AI) and automation.

General Services Administration

AI , Federal Data Strategy



Case Study – Methods, Examples and Guide


Case Study Research

A case study is a research method that involves an in-depth examination and analysis of a particular phenomenon or case, such as an individual, organization, community, event, or situation.

It is a qualitative research approach that aims to provide a detailed and comprehensive understanding of the case being studied. Case studies typically involve multiple sources of data, including interviews, observations, documents, and artifacts, which are analyzed using various techniques, such as content analysis, thematic analysis, and grounded theory. The findings of a case study are often used to develop theories, inform policy or practice, or generate new research questions.

Types of Case Study

Common types of case study include the following:

Single-Case Study

A single-case study is an in-depth analysis of a single case. This type of case study is useful when the researcher wants to understand a specific phenomenon in detail.

For Example , A researcher might conduct a single-case study on a particular individual to understand their experiences with a particular health condition or a specific organization to explore their management practices. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of a single-case study are often used to generate new research questions, develop theories, or inform policy or practice.

Multiple-Case Study

A multiple-case study involves the analysis of several cases that are similar in nature. This type of case study is useful when the researcher wants to identify similarities and differences between the cases.

For Example, a researcher might conduct a multiple-case study on several companies to explore the factors that contribute to their success or failure. The researcher collects data from each case, compares and contrasts the findings, and uses various techniques to analyze the data, such as comparative analysis or pattern-matching. The findings of a multiple-case study can be used to develop theories, inform policy or practice, or generate new research questions.

Exploratory Case Study

An exploratory case study is used to explore a new or understudied phenomenon. This type of case study is useful when the researcher wants to generate hypotheses or theories about the phenomenon.

For Example, a researcher might conduct an exploratory case study on a new technology to understand its potential impact on society. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as grounded theory or content analysis. The findings of an exploratory case study can be used to generate new research questions, develop theories, or inform policy or practice.

Descriptive Case Study

A descriptive case study is used to describe a particular phenomenon in detail. This type of case study is useful when the researcher wants to provide a comprehensive account of the phenomenon.

For Example, a researcher might conduct a descriptive case study on a particular community to understand its social and economic characteristics. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of a descriptive case study can be used to inform policy or practice or generate new research questions.

Instrumental Case Study

An instrumental case study is used to understand a particular phenomenon that is instrumental in achieving a particular goal. This type of case study is useful when the researcher wants to understand the role of the phenomenon in achieving the goal.

For Example, a researcher might conduct an instrumental case study on a particular policy to understand its impact on achieving a particular goal, such as reducing poverty. The researcher collects data from multiple sources, such as interviews, observations, and documents, and uses various techniques to analyze the data, such as content analysis or thematic analysis. The findings of an instrumental case study can be used to inform policy or practice or generate new research questions.

Case Study Data Collection Methods

Here are some common data collection methods for case studies:

Interviews involve asking questions of individuals who have knowledge or experience relevant to the case study. Interviews can be structured (where the same questions are asked of all participants) or unstructured (where the interviewer follows up on responses with further questions). Interviews can be conducted in person, over the phone, or through video conferencing.


Observations involve watching and recording the behavior and activities of individuals or groups relevant to the case study. Observations can be participant (where the researcher actively participates in the activities) or non-participant (where the researcher observes from a distance). Observations can be recorded using notes, audio or video recordings, or photographs.

Documents can be used as a source of information for case studies. Documents can include reports, memos, emails, letters, and other written materials related to the case study. Documents can be collected from the case study participants or from public sources.

Surveys involve asking a set of questions to a sample of individuals relevant to the case study. Surveys can be administered in person, over the phone, through mail or email, or online. Surveys can be used to gather information on attitudes, opinions, or behaviors related to the case study.

Artifacts are physical objects relevant to the case study. Artifacts can include tools, equipment, products, or other objects that provide insights into the case study phenomenon.

How to conduct Case Study Research

Conducting a case study research involves several steps that need to be followed to ensure the quality and rigor of the study. Here are the steps to conduct case study research:

  • Define the research questions: The first step in conducting a case study research is to define the research questions. The research questions should be specific, measurable, and relevant to the case study phenomenon under investigation.
  • Select the case: The next step is to select the case or cases to be studied. The case should be relevant to the research questions and should provide rich and diverse data that can be used to answer the research questions.
  • Collect data: Data can be collected using various methods, such as interviews, observations, documents, surveys, and artifacts. The data collection method should be selected based on the research questions and the nature of the case study phenomenon.
  • Analyze the data: The data collected from the case study should be analyzed using various techniques, such as content analysis, thematic analysis, or grounded theory. The analysis should be guided by the research questions and should aim to provide insights and conclusions relevant to the research questions.
  • Draw conclusions: The conclusions drawn from the case study should be based on the data analysis and should be relevant to the research questions. The conclusions should be supported by evidence and should be clearly stated.
  • Validate the findings: The findings of the case study should be validated by reviewing the data and the analysis with participants or other experts in the field. This helps to ensure the validity and reliability of the findings.
  • Write the report: The final step is to write the report of the case study research. The report should provide a clear description of the case study phenomenon, the research questions, the data collection methods, the data analysis, the findings, and the conclusions. The report should be written in a clear and concise manner and should follow the guidelines for academic writing.

Examples of Case Study

Here are some examples of case study research:

  • The Hawthorne Studies : Conducted between 1924 and 1932, the Hawthorne Studies were a series of case studies conducted by Elton Mayo and his colleagues to examine the impact of work environment on employee productivity. The studies were conducted at the Hawthorne Works plant of the Western Electric Company in Chicago and included interviews, observations, and experiments.
  • The Stanford Prison Experiment: Conducted in 1971, the Stanford Prison Experiment was a case study conducted by Philip Zimbardo to examine the psychological effects of power and authority. The study involved simulating a prison environment and assigning participants to the role of guards or prisoners. The study was controversial due to the ethical issues it raised.
  • The Challenger Disaster: The Challenger Disaster was a case study conducted to examine the causes of the Space Shuttle Challenger explosion in 1986. The study included interviews, observations, and analysis of data to identify the technical, organizational, and cultural factors that contributed to the disaster.
  • The Enron Scandal: The Enron Scandal was a case study conducted to examine the causes of the Enron Corporation’s bankruptcy in 2001. The study included interviews, analysis of financial data, and review of documents to identify the accounting practices, corporate culture, and ethical issues that led to the company’s downfall.
  • The Fukushima Nuclear Disaster : The Fukushima Nuclear Disaster was a case study conducted to examine the causes of the nuclear accident that occurred at the Fukushima Daiichi Nuclear Power Plant in Japan in 2011. The study included interviews, analysis of data, and review of documents to identify the technical, organizational, and cultural factors that contributed to the disaster.

Application of Case Study

Case studies have a wide range of applications across various fields and industries. Here are some examples:

Business and Management

Case studies are widely used in business and management to examine real-life situations and develop problem-solving skills. Case studies can help students and professionals to develop a deep understanding of business concepts, theories, and best practices.

Healthcare

Case studies are used in healthcare to examine patient care, treatment options, and outcomes. Case studies can help healthcare professionals to develop critical thinking skills, diagnose complex medical conditions, and develop effective treatment plans.

Education

Case studies are used in education to examine teaching and learning practices. Case studies can help educators to develop effective teaching strategies, evaluate student progress, and identify areas for improvement.

Social Sciences

Case studies are widely used in social sciences to examine human behavior, social phenomena, and cultural practices. Case studies can help researchers to develop theories, test hypotheses, and gain insights into complex social issues.

Law and Ethics

Case studies are used in law and ethics to examine legal and ethical dilemmas. Case studies can help lawyers, policymakers, and ethical professionals to develop critical thinking skills, analyze complex cases, and make informed decisions.

Purpose of Case Study

The purpose of a case study is to provide a detailed analysis of a specific phenomenon, issue, or problem in its real-life context. A case study is a qualitative research method that involves the in-depth exploration and analysis of a particular case, which can be an individual, group, organization, event, or community.

The primary purpose of a case study is to generate a comprehensive and nuanced understanding of the case, including its history, context, and dynamics. Case studies can help researchers to identify and examine the underlying factors, processes, and mechanisms that contribute to the case and its outcomes. This can help to develop a more accurate and detailed understanding of the case, which can inform future research, practice, or policy.

Case studies can also serve other purposes, including:

  • Illustrating a theory or concept: Case studies can be used to illustrate and explain theoretical concepts and frameworks, providing concrete examples of how they can be applied in real-life situations.
  • Developing hypotheses: Case studies can help to generate hypotheses about the causal relationships between different factors and outcomes, which can be tested through further research.
  • Providing insight into complex issues: Case studies can provide insights into complex and multifaceted issues, which may be difficult to understand through other research methods.
  • Informing practice or policy: Case studies can be used to inform practice or policy by identifying best practices, lessons learned, or areas for improvement.

Advantages of Case Study Research

There are several advantages of case study research, including:

  • In-depth exploration: Case study research allows for a detailed exploration and analysis of a specific phenomenon, issue, or problem in its real-life context. This can provide a comprehensive understanding of the case and its dynamics, which may not be possible through other research methods.
  • Rich data: Case study research can generate rich and detailed data, including qualitative data such as interviews, observations, and documents. This can provide a nuanced understanding of the case and its complexity.
  • Holistic perspective: Case study research allows for a holistic perspective of the case, taking into account the various factors, processes, and mechanisms that contribute to the case and its outcomes. This can help to develop a more accurate and comprehensive understanding of the case.
  • Theory development: Case study research can help to develop and refine theories and concepts by providing empirical evidence and concrete examples of how they can be applied in real-life situations.
  • Practical application: Case study research can inform practice or policy by identifying best practices, lessons learned, or areas for improvement.
  • Contextualization: Case study research takes into account the specific context in which the case is situated, which can help to understand how the case is influenced by the social, cultural, and historical factors of its environment.

Limitations of Case Study Research

There are several limitations of case study research, including:

  • Limited generalizability : Case studies are typically focused on a single case or a small number of cases, which limits the generalizability of the findings. The unique characteristics of the case may not be applicable to other contexts or populations, which may limit the external validity of the research.
  • Biased sampling: Case studies may rely on purposive or convenience sampling, which can introduce bias into the sample selection process. This may limit the representativeness of the sample and the generalizability of the findings.
  • Subjectivity: Case studies rely on the interpretation of the researcher, which can introduce subjectivity into the analysis. The researcher’s own biases, assumptions, and perspectives may influence the findings, which may limit the objectivity of the research.
  • Limited control: Case studies are typically conducted in naturalistic settings, which limits the control that the researcher has over the environment and the variables being studied. This may limit the ability to establish causal relationships between variables.
  • Time-consuming: Case studies can be time-consuming to conduct, as they typically involve a detailed exploration and analysis of a specific case. This may limit the feasibility of conducting multiple case studies or conducting case studies in a timely manner.
  • Resource-intensive: Case studies may require significant resources, including time, funding, and expertise. This may limit the ability of researchers to conduct case studies in resource-constrained settings.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Statistics Case Study and Dataset Resources

The philosophies of transparency and open access are becoming more widespread, more popular, and—with the ever-increasing expansion of the Internet—more attainable. Governments and institutions around the world are working to make more and more of their accumulated data available online for free. The datasets below are just a small sample of what is available. If you have a particular interest, do not hesitate to search for datasets on that topic. The table below provides a quick visual representation of what each resource offers, while the annotated links below the table provide further information on each website. Links to additional data sets are also provided.

Annotated Links and Further Data Sources

The links below follow a general-to-specific trajectory and have been marked with a maple leaf where content is Canada-specific. At the top of the list are datasets that have been created with post-secondary statistics students in mind.

Statistical Society of Canada Case Studies

Approximately two case studies per year have been featured at the Statistical Society of Canada Annual Meetings. This website includes all case studies since 1996. Case studies vary widely in subject matter, from the cod fishery in Newfoundland, to the gender gap in earnings among young people, to the effect of genetic variation on the relationship between diet and cardiovascular disease risk. The data is contextualized, provided for download in multiple formats, and includes questions to consider as well as references for each data set. The case studies for the current year can be found by clicking on the “Meetings” tab in the navigation sidebar, or by searching for “case study” in the search bar.

Journal of Statistics Education

This international journal, published and accessible online for free, includes at least two data sets with each volume. All volumes back to 1993 are archived and available online. Each data set includes the context, methodology, questions asked, analysis, and relevant references. The data are included in the journal’s data archive, linked to both on the webpage sidebar and at the end of each data set.

UK Data Service, Economic and Social Data Service Teaching Datasets

The Economic and Social Data Service (run by the government of the United Kingdom) has an online catalogue of over 5,000 datasets, with over 35 sampler survey datasets tailor-made to be easier for students to use. Study methods and data can be downloaded free of charge. These datasets use UK studies from the University of Essex and the University of Manchester. The datasets are provided for Nesstar rather than SPSS, but can also be downloaded in plain-text format.

The Rice Virtual Lab in Statistics, Case Studies

The Rice Virtual Lab in Statistics is an initiative by the National Science Foundation in the United States created to provide free online statistics help and practice. The online case studies are fantastic not only because they provide context, datasets, and downloadable raw data where appropriate, but they also allow the user to search by type of statistical analysis required for the case study, allowing you to focus on t-tests, histograms, regression, ANOVA, or whatever you need the most practice with. There are a limited number of case studies on this site.

The United Nations (UN) Statistics Division of the Department of Economic and Social Affairs has pooled major UN databases from the various divisions as accumulated over the past sixty or more years in order to allow users to access information from multiple UN sources simultaneously. This database of datasets includes over 60 million data points. The datasets can be searched, filtered, have columns changed, and downloaded for ease of use.

Open Data, Government of Canada

Open Data is an initiative by the Government of Canada to provide free, easily navigable access to data collected by the Canadian Government in areas such as health, environment, agriculture, and natural resources. You can browse the datasets by subject, file format, or department, or use an advanced search to filter using all of the above as well as keywords. The site also includes links to Provincial and Municipal-level open data sites available across Canada (accessible in the “Links” section of the left-hand sidebar).

The University of Toronto Library has prepared this excellent and exhaustive list of sources for Canadian statistics, organized by topic. Some have restricted access; you may or may not be able to access these through your university library, depending on which online databases your institution subscribes to. The restricted links are all clearly labelled in red. This resource also has an international section, accessible through the toolbar at the top left of the page.

CANSIM

CANSIM is Statistics Canada’s key socioeconomic database, providing fast and easy access to a large range of the latest statistics available in Canada. The data is sorted both by category and by the survey in which the data was collected. The site not only allows you to access tables of data, but lets you customize your own table of data based on what information you would like CANSIM to display. You can add or remove content, change the way in which the information is summarized, and download your personalized data table.

National Climate Data and Information Archive

The National Climate Data and Information Archive provides historical climate data for major cities across Canada, both online and available for download, as collected by the Government of Canada Weather Office. The data can be displayed hourly for each day, or daily for each month. Other weather statistics, including engineering climate datasets, can be found at http://climate.weather.gc.ca/prods_servs/engineering_e.html .

GeoGratis

GeoGratis is a portal provided by Natural Resources Canada which provides a single point of access to a broad collection of geospatial data, topographic and geoscience maps, images, and scientific publications that cover all of Canada at no cost and with no restrictions. Most of this data is in GIS format. You can use the Government of Canada’s GeoConnections website’s advanced search function to filter out only information that includes datasets available for download. Not all of the data that comes up on GeoConnections is available online for free, which is why we have linked to GeoGratis in this guide.

CARL Statistics

This website allows users to download datasets collected by the Canadian Association of Research Libraries (CARL) on collection size, emerging services, and salaries, by year, in Excel format.

Online Sources of International Statistics Guide, University of Maryland

This online resource, provided by the University of Maryland’s Libraries website, has an impressive list of links to datasets organized by Country and Region, as well as by category (Economic, Environmental, Political, Social, and Population). Some of the datasets are only available through subscriptions to sites such as Proquest. Check with your institution’s library to see if you can access these resources.

Organization for Economic Co-Operation and Development (OECD) Better Life Index

The OECD’s mission is to promote policies that will improve the economic and social well-being of people around the world. Governments work together, using the OECD as a forum to share experiences and seek solutions to common problems. In service to this mission, the OECD created the Better Life Index, which uses United Nations statistics as well as national statistics to represent all 34 member countries of the OECD in a relational survey of life satisfaction. The index is interactive, allowing you to set your own levels of importance, and the website organizes the data to represent how each country does according to your rankings. The raw index data is also available for download on the website (see the link on the left-hand sidebar).

Human Development Index

The HDI, run by the United Nations Development Programme, combines indicators of life expectancy, educational attainment, and income into a composite index, providing a single statistic to serve as a frame of reference for both social and economic development. Under the “Getting and Using Data” tab in the left-hand sidebar, the HDI website provides downloads of the raw data sorted in various ways (including an option to build your own data table), as well as the statistical tables underlying the HDI report. In the “Tools and Rankings” section (also in the left-hand sidebar) you can also see various visualizations of the data and tools for readjusting the HDI.

The World Bank DataBank

The World Bank is an international financial institution that provides loans to developing countries towards the goal of worldwide reduction of poverty. DataBank is an analysis and visualization tool that allows you to generate charts, tables, and maps based on the data available in several databases. You can also access the raw data by country, topic, or by source on their Data page.
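For programmatic access, the World Bank also exposes its indicator data through a public REST API at `api.worldbank.org/v2`. The sketch below shows the general URL pattern and how a v2 JSON response (a two-element array of paging metadata plus observations) can be parsed; the sample payload is illustrative made-up data, and the exact parameters should be checked against the current API documentation.

```python
import json

def indicator_url(country: str, indicator: str, per_page: int = 100) -> str:
    """Build a World Bank v2 API request URL for one indicator series."""
    return (f"https://api.worldbank.org/v2/country/{country}"
            f"/indicator/{indicator}?format=json&per_page={per_page}")

def parse_observations(payload: str) -> dict:
    """Map year -> value from a v2 response, skipping missing observations."""
    _meta, rows = json.loads(payload)  # [paging metadata, observation list]
    return {row["date"]: row["value"] for row in rows if row["value"] is not None}

# Illustrative payload shaped like a v2 response (the values are made up).
sample = '''[{"page": 1, "pages": 1},
 [{"date": "2020", "value": 38010001},
  {"date": "2019", "value": null}]]'''

url = indicator_url("CA", "SP.POP.TOTL")  # total population, Canada
series = parse_observations(sample)
```

In a real session you would fetch `url` over HTTP and pass the response body to `parse_observations`.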

Commission for Environmental Cooperation (CEC): North American Environmental Atlas

The CEC is a collaborative effort between Canada, the United States, and Mexico to address environmental issues of continental concern. The North American Environmental Atlas (first link above) is an interactive mapping tool to research, analyze, and manage environmental issues across the continent. You can also download the individual map files and data sets that comprise the interactive atlas on the CEC website. Most of the map layers are available in several mapping files, but also provide links to the source datasets that they use, which are largely available for download.

Population Reference Bureau DataFinder

The Population Reference Bureau informs people about population, health, and the environment, and empowers them to use that information to advance the well-being of current and future generations. It is based in the United States but has international data. The DataFinder website combines US Census Bureau data with international data from national surveys. It allows users to search and create custom tables comparing countries and variables of their choice.

Mathematics-in-Industry Case Studies Journal

This international online journal (run by the Fields Institute for Research in Mathematical Sciences, Toronto) is dedicated to stimulating innovative mathematics through the modelling and analysis of problems across the physical, biological, and social sciences. The journal focuses on the process of modelling industry-related problems rather than providing case study data sets for students to explore on their own, but its worked examples show the myriad ways in which statistics and modelling are applied across a variety of industries.

UCLA Department of Statistics Case Studies

The University of California Los Angeles offers HTML-based case studies for student perusal. Many of these include small datasets, a problem, and a worked solution. They are short and easy to use, but not formatted to allow students to try their hand before seeing the answer. This website has not been updated since 2001.

National Center for Case Study Teaching in Science

This website, maintained by the National Center for Case Study Teaching in Science at the University at Buffalo, is a collection of over 450 peer-reviewed cases at the high school, undergraduate, and graduate school levels. The cases can be filtered by subject, and several are listed under “statistics.” In order to access the answer keys, you must be an instructor affiliated with an educational institution. If you would like the answer to a particular case study, you can ask your professor to register for the answer key, provided they will not be marking your case study themselves.

The DHS Program

The Demographic and Health Surveys Program collects and makes available ready-to-use data for over 90 countries from over 300 surveys. The website is very comprehensive and contains detailed information pertaining to the different survey data available for each of the participating countries, a guide to the DHS statistics and recode manual, as well as tips on working with the different data sets. Although registration is required for access to the data, registration is free.


Case Study Library

Bring practical statistical problem solving to your course.

A wide selection of real-world scenarios with practical multistep solution paths, complete with objectives, data, illustrations, insights, and exercises. Exercise solutions are available to qualified instructors only.



* Cases marked with an asterisk require JMP Pro.

About the Authors

  • Dr. Marlene Smith, University of Colorado Denver
  • Jim Lamar, Saint-Gobain NorPro
  • Mia Stephens
  • Dr. DeWayne Derryberry, Idaho State University
  • Eric Stephens, Nashville General Hospital
  • Dr. Shirley Shmerling, University of Massachusetts
  • Dr. Volker Kraft, JMP
  • Dr. Markus Schafheutle
  • Dr. M Ajoy Kumar, Siddaganga Institute of Technology
  • Sam Gardner
  • Dr. Jennifer Verdolin, University of Arizona
  • Kevin Potcner
  • Dr. Jane Oppenlander, Clarkson University
  • Dr. Mary Ann Shifflet, University of South Indiana
  • Muralidhara Anandamurthy
  • Dr. Jim Grayson, Augusta University
  • Dr. Robert Carver, Brandeis University
  • Dr. Frank Deruyck, University College Ghent
  • Dr. Simon Stelzig, Lohmann GmbH & Co. KG
  • Andreas Trautmann, Lonza Group AG
  • Claire Baril
  • Chandramouli Ramnarayanan
  • Ross Metusalem
  • Benjamin Ingham, The University of Manchester

Case Study Solutions request

To request solutions to the exercises within the Case Studies, please complete this form and indicate in the space provided which case(s) you would like. Solutions are provided to qualified instructors only, and all requests, including academic standing, will be verified before solutions are sent.

Medical Malpractice

Explore claim payment amounts for medical malpractice lawsuits and identify factors that appear to influence the amount of the payment using descriptive statistics and data visualizations.

Key words: Summary statistics, frequency distribution, histogram, box plot, bar chart, Pareto plot, and pie chart

  • Download the case study (PDF)
  • Download the data set
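The techniques this case names (summary statistics and a frequency distribution) can be sketched in plain Python. The claim amounts below are made-up placeholders, not the case study's actual data set:

```python
import statistics
from collections import Counter

# Hypothetical malpractice claim payments (thousands of dollars);
# stand-in values, not the case study's real data.
claims = [12, 45, 30, 250, 18, 60, 75, 30, 120, 45]

# Summary statistics: the center and spread of the payment amounts.
summary = {
    "n": len(claims),
    "mean": statistics.mean(claims),
    "median": statistics.median(claims),
    "stdev": statistics.stdev(claims),
}

# Frequency distribution: bin each payment into 50k-wide intervals --
# the tabulation underlying a histogram or Pareto plot.
bins = Counter(50 * (amount // 50) for amount in claims)
```

A plotting library would then turn `bins` into the histogram or bar chart the case calls for.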

Baggage Complaints

Analyze and compare baggage complaints for three different airlines using descriptive statistics and time series plots. Explore differences between the airlines, whether complaints are getting better or worse over time, and if there are other factors, such as destinations, seasonal effects or the volume of travelers that might affect baggage performance.

Key words: Time series plots, summary statistics

Defect Sampling

Explore the effectiveness of different sampling plans in detecting changes in the occurrence of manufacturing defects.

Key words: Tabulation, histogram, summary statistics, and time series plots

Film on the Rocks

Use survey results from a summer movie series to answer questions regarding customer satisfaction, demographic profiles of patrons, and the use of media outlets in advertising.

Key words: Bar charts, frequency distribution, summary statistics, mosaic plot, contingency table (cross-tabulation), and chi-squared test

Improving Patient Satisfaction

Analyze patient complaint data at a medical clinic to identify the issues resulting in customer dissatisfaction and determine potential causes of decreased patient volume. 

Key words: Frequency distribution, summary statistics, Pareto plot, tabulation, scatterplot, run chart, correlation

  • Download the data set 1
  • Download the data set 2

Price Quotes

Evaluate the price quoting processes of two different sales associates to determine whether they are inconsistent with each other, and decide whether a new, more consistent pricing process should be developed.

Key words: Histograms, summary statistics, confidence interval for the mean, one sample t-Test

Treatment Facility

Determine what effect a reengineering effort had on the incidence of behavioral problems and turnover at a treatment facility for teenagers.

Key words: Summary statistics, time series plots, normal quantile plots, two sample t-Test, unequal variance test, Welch's test

Use data from a survey of students to perform exploratory data analysis and to evaluate the performance of different approaches to a statistical analysis.

Key words: Histograms, normal quantile plots, log transformations, confidence intervals, inverse transformation

Fish Story: Not Too Many Fishes in the Sea

Use the DASL Fish Prices data to investigate whether there is evidence that overfishing occurred from 1970 to 1980.

Key words: Histograms, normal quantile plots, log transformations, inverse transformation, paired t-test, Wilcoxon signed rank test

Subliminal Messages

Determine whether subliminal messages were effective in increasing math test scores, and if so, by how much.

Key words: Histograms, summary statistics, box plots, t-Test and pooled t-Test, normal quantile plot, Wilcoxon Rank Sums test, Cohen's d

Priority Assessment

Determine whether a software development project prioritization system was effective in speeding the time to completion for high priority jobs.

Key words: Summary statistics, histograms, normal quantile plot, ANOVA, pairwise comparison, unequal variance test, and Welch's test

Determine if a backgammon program has been upgraded by comparing the performance of a player against the computer across different time periods.

Key words: Histograms, confidence intervals, stacking data, one-way ANOVA, unequal variances test, one-sample t-Test, ANOVA table and calculations, F Distribution, F ratios

Per Capita Income

Use data from the World Factbook to explore wealth disparities between different regions of the world and identify those with the highest and lowest wealth.

Key words: Geographic mapping, histograms, log transformation, ANOVA, Welch's ANOVA, Kruskal-Wallis

  • Download the data set 3

Kerrich: Is a Coin Fair?

Use the outcomes of 10,000 coin flips, along with descriptive statistics, confidence intervals, and hypothesis tests, to determine whether the coin is fair.

Key words: Bar charts, confidence intervals for proportions, hypothesis testing for proportions, likelihood ratio, simulating random data, scatterplot, fitting a regression line
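The proportion test and confidence interval this case calls for can be sketched in Python with only the standard library. The sketch uses the normal approximation rather than an exact binomial test, and takes as input Kerrich's widely reported count of 5,067 heads in 10,000 flips:

```python
import math

def proportion_test(successes, n, p0=0.5):
    """Normal-approximation z test and 95% Wald CI for a proportion."""
    p_hat = successes / n
    se_null = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
    z = (p_hat - p0) / se_null
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error for the CI
    ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
    return p_hat, z, p_value, ci

# Kerrich's widely reported result: 5,067 heads in 10,000 flips
p_hat, z, p_value, ci = proportion_test(5067, 10_000)
# The CI contains 0.5 and the p-value exceeds 0.05: no evidence of bias
```

With these counts the interval straddles 0.5, which is why Kerrich's experiment is the classic illustration of a fair coin.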

Lister and Germ Theory

Use results from an 1860s sterilization study to determine whether there is evidence that sterilization reduces deaths when amputations are performed.

Key words: Mosaic plots, contingency tables, Pearson and likelihood ratio tests, Fisher's exact test, two-sample proportions test, one- and two-sided tests, confidence interval for the difference, relative risk

Salk Vaccine

Using data from a 1950s study, determine whether the polio vaccine was effective in a cohort study, and, if it was, quantify the degree of effectiveness.

Key words: Bar charts, two-sample proportions test, relative risk, two-sided Pearson and likelihood ratio tests, Fisher's exact test, and the Gamma measure of association

Smoking and Lung Cancer

Use the results of a retrospective study to determine if there is a positive association between smoking and lung cancer, and estimate the risk of lung cancer for smokers relative to non-smokers.

Key words: Mosaic plots, two-by-two contingency tables, odds ratios and confidence intervals, conditional probability, hypothesis tests for proportions (likelihood ratio, Pearson's, Fisher's Exact, two sample tests for proportions)
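The odds ratio and its confidence interval, central to this retrospective design, can be sketched as follows. The 2x2 counts below are illustrative only, not the study's actual data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI (Woolf's log method) for a 2x2 table:
                 cancer   no cancer
    smokers         a         b
    non-smokers     c         d
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)    # SE of log(odds ratio)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, (lower, upper)

# Illustrative counts only, not the study's actual data
odds_ratio, (lower, upper) = odds_ratio_ci(a=80, b=20, c=40, d=60)
# An interval entirely above 1 indicates a positive association
```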

Mendel's Laws of Inheritance

Use the data sets provided to explore Mendel’s Laws of Inheritance for dominant and recessive traits.

Key words: Bar charts, frequency distributions, goodness-of-fit tests, mosaic plot, hypothesis tests for proportions
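The goodness-of-fit test named above can be sketched directly. The counts below are illustrative Mendel-style dominant/recessive totals tested against the 3:1 ratio his law predicts:

```python
import math

def chi_square_gof(observed, expected_ratio):
    """Pearson goodness-of-fit statistic against an expected ratio."""
    n = sum(observed)
    total = sum(expected_ratio)
    expected = [n * r / total for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative dominant:recessive counts tested against Mendel's 3:1 ratio
chi2 = chi_square_gof([705, 224], [3, 1])
# For 1 degree of freedom the p-value is the chi-square upper tail:
p_value = math.erfc(math.sqrt(chi2 / 2))
# A large p-value means the counts are consistent with the 3:1 law
```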


Predict year-end contributions in an employee fund-raising drive.

Key words: Summary statistics, time series plots, simple linear regression, predicted values, prediction intervals

Direct Mail

Evaluate different regression models to determine whether sales at a small retail shop are influenced by a direct mail campaign, and use the resulting models to predict sales based upon the amount of marketing.

Key words: Time series plots, simple linear regression, lagged variables, predicted values, prediction intervals

Cost Leadership

Assess the effectiveness of a cost leadership strategy in increasing market share, and assess the potential for additional gains in market share under the current strategy.

Key words: Simple linear regression, spline fitting, transformations, predicted values, prediction intervals

Archosaur:  The Relationship Between Body Size and Brain Size

Analyze data on the brain and body weight of different dinosaur species to determine if a proposed statistical model performs well at describing the relationship and use the model to predict brain weight based on body weight.

Key words: Histogram and summary statistics, fitting a regression line, log transformations, residual plots, interpreting regression output and parameter estimates, inverse transformations
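The log-transform, fit, and inverse-transform workflow described above can be sketched in a few lines. The body and brain weights here are made-up illustrative values, not the case-study data:

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Made-up body (kg) and brain (g) weights, not the case-study data
body = [10, 50, 200, 1000, 5000]
brain = [15, 40, 90, 250, 700]

# Fit on log10 scales, where a power-law relationship becomes linear
slope, intercept = fit_line([math.log10(x) for x in body],
                            [math.log10(y) for y in brain])

def predict_brain(body_weight):
    # Inverse transformation: back from the log10 scale
    return 10 ** (intercept + slope * math.log10(body_weight))
```

The slope on the log scale is the allometric exponent; a value below 1 means brain weight grows more slowly than body weight.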

Cell Phone Service

Determine whether wind speed and barometric pressure are related to phone call performance (percentage of dropped or failed calls) and use the resulting model to predict the percentage of bad calls based upon the weather conditions.

Key words: Histograms, summary statistics, simple linear regression, multiple regression, scatterplot, 3D-scatterplot

Housing Prices

After determining which factors relate to the selling prices of homes located in and around a ski resort, develop a model to predict housing prices.

Key words: Scatterplot matrix, correlations, multiple regression, stepwise regression, multicollinearity, model building, model diagnostics

Bank Revenues

A bank wants to understand how customer banking habits contribute to revenues and profitability. Build a model that allows the bank to predict profitability for a given customer. The resulting model will be used to forecast bank revenues and guide the bank in future marketing campaigns.

Key words: Log transformation, stepwise regression, regression assumptions, residuals, Cook’s D, model coefficients, singularity, prediction profiler, inverse transformations

Determine whether certain conditions make it more likely that a customer order will be won or lost.

Key words: Bar charts, frequency distribution, mosaic plots, contingency table, chi-squared test, logistic regression, predicted values, confusion matrix

Titanic Passengers

Use the passenger data related to the sinking of the RMS Titanic to explore some questions of interest about survival rates. For example, were there some key characteristics of the survivors? Were some passenger groups more likely to survive than others? Can we accurately predict survival?

Key words: Logistic regression, log odds and logit, odds, odds ratios, prediction profiler
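How the log odds, odds, and odds-ratio quantities named above relate can be sketched with hypothetical coefficients (these are not fitted values from the Titanic data):

```python
import math

def probability_from_logit(logit):
    """Convert log odds to a probability."""
    return 1 / (1 + math.exp(-logit))

# Hypothetical coefficients, not fitted values:
# logit(survived) = b0 + b1*is_female + b2*is_first_class
b0, b1, b2 = -1.0, 2.5, 1.0

p_female_first = probability_from_logit(b0 + b1 + b2)  # female, first class
p_male_third = probability_from_logit(b0)              # male, third class

# Holding class fixed, the odds of survival multiply by exp(b1) for women
odds_ratio_sex = math.exp(b1)
```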

Credit Card Marketing

A bank would like to understand the demographics and other characteristics associated with whether a customer accepts a credit card offer. Build a classification model that will provide insight into why some bank customers accept credit card offers.

Key words: Classification trees, training & validation, confusion matrix, misclassification, leaf report, ROC curves, lift curves

Call Center Improvement: Visual Six Sigma

The scenario relates to the handling of customer queries via an IT call center. The call center's performance is well below best in class. Identify potential process changes to allow the call center to achieve best-in-class performance.

Key words: Interactive data visualization, graphs, distribution, tabulate, recursive partitioning, process capability, control chart, multiple regression, prediction profiler

Customer Churn

Analyze the factors related to customer churn of a mobile phone service provider. The company would like to build a model to predict which customers are most likely to move their service to a competitor. This knowledge will be used to identify customers for targeted interventions, with the ultimate goal of reducing churn.

Key words: Neural networks, activation functions, model validation, confusion matrix, lift, prediction profiler, variable importance

Boston Housing

Build a variety of prediction models (multiple regression, partition tree, and a neural network) to determine the one that performs the best at predicting house prices based upon various characteristics of the house and its location.

Key words: Stepwise regression, regression trees, neural networks, model validation, model comparison

Durability of Mobile Phone Screen - Part 1

Evaluate the durability of mobile phone screens in a drop test. Determine if a desired level of durability is achieved for each of two types of screens and compare performance.

Key words: Confidence Intervals, Hypothesis Tests for One and Two Population Proportions, Chi-square, Relative Risk
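The two-population comparison this case calls for can be sketched with a pooled two-sample z test; the drop-test counts below are illustrative:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sample z test for equal proportions, using the pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

# Illustrative counts: screens surviving the drop test, out of screens tested
z, p_value = two_proportion_z(x1=85, n1=100, x2=70, n2=100)
# A small p-value suggests the two screen types differ in durability
```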

Durability of Mobile Phone Screen - Part 2

Evaluate the durability of mobile phone screens in a drop test at various drop heights. Determine if a desired level of durability is achieved for each of three types of screens and compare performance.

Key words: Contingency analysis, comparing proportions via difference, relative risk and odds ratio

Durability of Mobile Phone Screen - Part 3

Evaluate the durability of mobile phone screens in a drop test across various heights by building individual simple logistic regression models. Use the models to estimate the probability of a screen being damaged across any drop height.

Key words: Single variable logistic regression, inverse prediction

Durability of Mobile Phone Screen - Part 4

Evaluate the durability of mobile phone screens in a drop test across various heights by building a single multiple logistic regression model. Use the model to estimate the probability of a screen being damaged across any drop height.

Key words: Multivariate logistic regression, inverse prediction, odds ratio

Online Mortgage Application

Evaluate the potential improvement to the UI design of an online mortgage application process by examining the usability ratings from a sample of 50 customers and comparing their performance using the new design against a large collection of historic data on customers' performance with the current design.

Key words: Distribution, normality, normal quantile plot, Shapiro Wilk and Anderson Darling tests, t-Test

Performance of Food Manufacturing Process - Part 1

Evaluate the performance to specifications of a food manufacturing process using graphical analyses and numerical summarizations of the data.

Key words: Distribution, summary statistics, time series plots

Performance of Food Manufacturing Process - Part 2

Evaluate the performance to specifications of a food manufacturing process using confidence intervals and hypothesis testing.

Key words: Distribution, normality, normal quantile plot, Shapiro Wilk and Anderson Darling tests, test of mean and test of standard deviation

Detergent Cleaning Effectiveness

Analyze the results of an experiment to determine if there is statistical evidence demonstrating an improvement in a new laundry detergent formulation. Explore and describe the effect that multiple factors have on a response, as well as identify conditions with the most and least impact.

Key words: Analysis of variance (ANOVA), t-Test, pairwise comparison, model diagnostics, model performance
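The ANOVA named above can be sketched by computing the F ratio directly from between- and within-group sums of squares; the three groups of scores are illustrative:

```python
def one_way_anova_f(groups):
    """F ratio for a one-way ANOVA: between- vs. within-group mean squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Illustrative cleanliness scores for three detergent formulations
f, df_between, df_within = one_way_anova_f([
    [10, 12, 11, 13],   # current formulation
    [14, 15, 13, 16],   # new formulation A
    [11, 10, 12, 11],   # new formulation B
])
```

A large F ratio relative to its F distribution indicates that at least one formulation mean differs from the others.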

Manufacturing Systems Variation

Study the use of a nested variability chart to understand and analyze the different components of variance. Also explore ways to minimize variability by applying various rules of operation related to variance.

Key words: Variability gauge, nested design, component analysis of variance

Text Exploration of Patents

This study requires the use of unstructured data analysis to understand and analyze the text related to patents filed by different companies.

Key words: Word cloud, data visualization, term selection

US Stock Indices

Understand the basic concepts of time series data analysis and explore practical ways to understand the risks and rates of return related to financial index data.

Key words: Differencing, log transformation, stationarity, Augmented Dickey Fuller (ADF) test

Pricing Musical Instrument

Study the application of regression and concepts related to choice modeling (also called conjoint analysis) to understand and analyze the importance of product attributes, and of their levels, in influencing customer preferences.

Key words: Part Worth, regression, prediction profiler

Pricing Spectacles

Design and analyze discrete choice experiments (also called conjoint analysis) to discover which product or service attributes are preferred by potential customers.

Key words: Discrete choice design, regression, utility and probability profiler, willingness to pay

Modeling Gold Prices

Learn univariate time series modeling using US gold prices. Build AR, MA, ARMA, and ARIMA models to analyze the characteristics of the time series data and produce forecasts.

Key words: Stationarity, AR, MA, ARMA, ARIMA, model comparison and diagnostics

Explore statistical evidence of an association between the size of a saguaro cactus and the number of flowers it produces.

Key words: Kendall's Tau, correlation, normality, regression

Manufacturing Excellence at Pharma Company - Part 1

Use control charts to understand process stability and analyze the patterns of process variation.

Key words: Statistical Process Control, Control Chart, Process Capability

Manufacturing Excellence at Pharma Company - Part 2

Use Measurement Systems Analysis (MSA) to assess the precision, consistency and bias of a measurement system.

Key words: Measurement Systems Analysis (MSA), Analysis of Variance (ANOVA)

Manufacturing Excellence at Pharma Company - Part 3

Use Design of Experiments (DOE) to advance knowledge about the process.

Key words: Definitive Screening Design, Custom Design, Design Comparison, Prediction, Simulation and Optimization

Polymerization at Lohmann - Part 1

Application of statistical methods to understand the process and enhance its performance through Design of Experiments and regression techniques.

Key words: Custom Design, Stepwise Regression, Prediction Profiler

Polymerization at Lohmann - Part 2

Use Functional Data Analysis to understand the intrinsic structure of the data.

Key words: Functional Data Analysis (FDA), B Splines, Functional PCA, Generalized Regression

Optimization of Microbial Cultivation Process

Use Design of Experiments (DOE) to optimize the microbial cultivation process.

Key words: Custom Design, Design Evaluation, Predictive Modeling

Cluster Analysis in the Public Sector

Use PCA and Clustering techniques to segment the demographic data.

Key words: Clustering, Principal Component Analysis, Exploratory Data Analysis

Forecasting Copper Prices

Learn various exponential smoothing techniques, build forecasting models with them, and compare the results.

Key words: Time series forecasting, Exponential Smoothing
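Simple exponential smoothing, the most basic of the techniques named above, can be sketched as follows; the price series is illustrative, not real market data:

```python
def exp_smooth(series, alpha):
    """Simple exponential smoothing; returns the fitted levels.
    The flat forecast for the next period is the last level."""
    level = series[0]
    levels = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level  # weighted update
        levels.append(level)
    return levels

# Illustrative monthly prices, not real market data
prices = [8800, 8900, 8750, 9000, 9100, 9050]
levels = exp_smooth(prices, alpha=0.3)
forecast = levels[-1]
```

Larger values of alpha track recent observations more closely; smaller values smooth more aggressively.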

Increasing Bioavailability of a Drug using SMEDDS

Use Mixture/formulation design to optimize multiple responses related to bioavailability of a drug.

Key words: Custom Design, Mixture/Formulation Design, Optimization

Where Have All the Butterflies Gone?

Apply time series forecasting and a generalized linear mixed model (GLMM) to evaluate how butterfly populations are being impacted by climate and land-use change.

Key words: Time series forecasting, Generalized linear mixed model

Exploratory Factor Analysis of Trust in Online Sellers

Apply exploratory factor analysis to uncover latent factor structure in an online shopping questionnaire.

Key words: Exploratory Factor Analysis (EFA), Bartlett’s Test, KMO Test

Modeling Online Shopping Perceptions

Apply measurement and structural models to survey responses from online shoppers to build and evaluate competing models.

Key words: Confirmatory Factor Analysis (CFA), Structural Equation Modeling (SEM), Measurement and Structural Regression Models, Model Comparison

Functional Data Analysis for HPLC Optimization

Apply functional data analysis and functional design of experiments (FDOE) for the optimization of an analytical method to allow for the accurate quantification of two biological components.

Key words: Functional Data Analysis, Functional PCA, Functional DOE

Nonlinear Regression Modeling for Cell Growth Optimization

Apply nonlinear models to understand the impact of factors on cell growth.

Key words: Nonlinear Modeling, Logistic 3P, Curve DOE


Using Software in Qualitative Research

A step-by-step guide, student resources, case study sample data.

Case Studies

Throughout Using Software in Qualitative Research, three case-study examples illustrate analytic tasks, their execution in CAQDAS packages, and the potentials of different products. Chapter 2 summarizes the data sets, lists the research questions and outlines suggested processes for analysis.

The case-study examples are drawn from real research projects and/or reflect contemporary sociological issues. We use them to illustrate common analytic tasks encountered in a range of methodologies and to enable discussion of efficient and robust analytic strategies. Our intention is not to promote any particular method of analysis, or to suggest that there is necessarily an ‘ideal’ way of using a particular software program. Rather, we offer ideas for analysis in relation to different data types and methodological and practical contexts.

From this website you can download sample data that we refer to in the book, and follow through the chapter exercises using your chosen software package. Alternatively, you can choose to experiment with alternative ways of working. Your purpose in working with our sample data may be to familiarize yourself with and compare particular software packages. Or you may use them as a means of experimenting with different tools in order to inform the development of your own software-supported analytic strategy. Whatever your purpose, be creative and experimental. The processes we present are not the only ways of proceeding with these types of data. You will get more out of software and data analysis if you try out different tools and processes and reflect on how they suit the needs of your data and your preferred ways of working.

Please click on the links below to begin:

  • Young People’s Perspectives (Case A)
  • Financial Downturn (Case B)
  • Coca Cola (Case C)


What Is a Case Study? | Definition, Examples & Methods

Published on May 8, 2019 by Shona McCombes. Revised on November 20, 2023.

A case study is a detailed study of a specific subject, such as a person, group, place, event, organization, or phenomenon. Case studies are commonly used in social, educational, clinical, and business research.

A case study research design usually involves qualitative methods, but quantitative methods are sometimes also used. Case studies are good for describing, comparing, evaluating and understanding different aspects of a research problem.

Table of contents

  • When to do a case study
  • Step 1: Select a case
  • Step 2: Build a theoretical framework
  • Step 3: Collect your data
  • Step 4: Describe and analyze the case
  • Other interesting articles

A case study is an appropriate research design when you want to gain concrete, contextual, in-depth knowledge about a specific real-world subject. It allows you to explore the key characteristics, meanings, and implications of the case.

Case studies are often a good choice in a thesis or dissertation. They keep your project focused and manageable when you don’t have the time or resources to do large-scale research.

You might use just one complex case study where you explore a single subject in depth, or conduct multiple case studies to compare and illuminate different aspects of your research problem.


Once you have developed your problem statement and research questions, you should be ready to choose the specific case that you want to focus on. A good case study should have the potential to:

  • Provide new or unexpected insights into the subject
  • Challenge or complicate existing assumptions and theories
  • Propose practical courses of action to resolve a problem
  • Open up new directions for future research

Tip: If your research is more practical in nature and aims to simultaneously investigate an issue as you solve it, consider conducting action research instead.

Unlike quantitative or experimental research, a strong case study does not require a random or representative sample. In fact, case studies often deliberately focus on unusual, neglected, or outlying cases which may shed new light on the research problem.

Example of an outlying case study: In the 1960s, the town of Roseto, Pennsylvania was discovered to have extremely low rates of heart disease compared to the US average. It became an important case study for understanding previously neglected causes of heart disease.

However, you can also choose a more common or representative case to exemplify a particular category, experience or phenomenon.

Example of a representative case study: In the 1920s, two sociologists used Muncie, Indiana as a case study of a typical American city that supposedly exemplified the changing culture of the US at the time.

While case studies focus more on concrete details than general theories, they should usually have some connection with theory in the field. This way the case study is not just an isolated description, but is integrated into existing knowledge about the topic. It might aim to:

  • Exemplify a theory by showing how it explains the case under investigation
  • Expand on a theory by uncovering new concepts and ideas that need to be incorporated
  • Challenge a theory by exploring an outlier case that doesn’t fit with established assumptions

To ensure that your analysis of the case has a solid academic grounding, you should conduct a literature review of sources related to the topic and develop a theoretical framework. This means identifying key concepts and theories to guide your analysis and interpretation.

There are many different research methods you can use to collect data on your subject. Case studies tend to focus on qualitative data using methods such as interviews, observations, and analysis of primary and secondary sources (e.g., newspaper articles, photographs, official records). Sometimes a case study will also collect quantitative data.

Example of a mixed methods case study: For a case study of a wind farm development in a rural area, you could collect quantitative data on employment rates and business revenue, collect qualitative data on local people’s perceptions and experiences, and analyze local and national media coverage of the development.

The aim is to gain as thorough an understanding as possible of the case and its context.

In writing up the case study, you need to bring together all the relevant aspects to give as complete a picture as possible of the subject.

How you report your findings depends on the type of research you are doing. Some case studies are structured like a standard scientific paper or thesis, with separate sections or chapters for the methods, results and discussion.

Others are written in a more narrative style, aiming to explore the case from various angles and analyze its meanings and implications (for example, by using textual analysis or discourse analysis).

In all cases, though, make sure to give contextual details about the case, connect it back to the literature and theory, and discuss how it fits into wider patterns or debates.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias



Statistics Made Easy

What are Cases in Statistics? (Definition & Examples)

In statistics, cases simply refer to the individuals in a dataset.

In most datasets, we have cases (the individuals) and variables (the attributes for the individuals).

For example, the following dataset contains 10 cases and 3 variables that we measure for each case:

[Image: a dataset of 10 basketball players with points, assists, and rebounds]

Notice that each case has multiple variables or “attributes.”

For example, each player has a value for points, assists, and rebounds.
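The cases-and-variables structure can be made concrete with a small Python sketch; the player rows below are made up for illustration:

```python
# A dataset as a list of cases, each carrying the same variables.
# Player names and numbers are made up for illustration.
players = [
    {"player": "A", "points": 22, "assists": 5, "rebounds": 7},
    {"player": "B", "points": 15, "assists": 9, "rebounds": 4},
    {"player": "C", "points": 30, "assists": 3, "rebounds": 10},
]

n_cases = len(players)  # one case (row) per player
variables = [k for k in players[0] if k != "player"]  # measured attributes
```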

Note that cases are also sometimes called experimental units. These terms are used interchangeably.

Check out the following examples to gain an even better understanding of cases.

Example 1: Education

The following dataset contains 10 cases and 2 variables:

[Image: a dataset of 10 students with time studied and exam score]

The cases are the individual students and the variables are time studied and exam score.

Example 2: Business

The following dataset contains 6 cases and 3 variables:

[Image: a dataset of 6 stores with total sales, total customers, and total refunds]

The cases are the individual stores and the variables are total sales, total customers, and total refunds.

Example 3: Biology

The following dataset contains 15 cases and 3 variables:

[Image: a dataset of 15 plants with height, width, and age]

The cases are the individual plants and the variables are height, width, and age.

Additional Resources

The following tutorials provide additional information on other terms commonly used in statistics:

  • What is an Observation in Statistics?
  • What is an Influential Observation in Statistics?
  • What is a Covariate in Statistics?
  • What is a Parameter of Interest in Statistics?

Published by Zach

How to write a case study — examples, templates, and tools

It’s a marketer’s job to communicate the effectiveness of a product or service to potential and current customers to convince them to buy and keep business moving. One of the best methods for doing this is to share success stories that are relatable to prospects and customers based on their pain points, experiences, and overall needs.

That’s where case studies come in. Case studies are an essential part of a content marketing plan. These in-depth stories of customer experiences are some of the most effective at demonstrating the value of a product or service. Yet many marketers don’t use them, whether because of their regimented formats or the process of customer involvement and approval.

A case study is a powerful tool for showcasing your hard work and the success your customer achieved. But writing a great case study can be difficult if you’ve never done it before or if it’s been a while. This guide will show you how to write an effective case study and provide real-world examples and templates that will keep readers engaged and support your business.

In this article, you’ll learn:

  • What is a case study?
  • How to write a case study
  • Case study templates
  • Case study examples
  • Case study tools

A case study is the detailed story of a customer’s experience with a product or service that demonstrates their success and often includes measurable outcomes. Case studies are used in a range of fields and for various reasons, from business to academic research. They’re especially impactful in marketing as brands work to convince and convert consumers with relatable, real-world stories of actual customer experiences.

The best case studies tell the story of a customer’s success, including the steps they took, the results they achieved, and the support they received from a brand along the way. To write a great case study, you need to:

  • Celebrate the customer and make them — not a product or service — the star of the story.
  • Craft the story with specific audiences or target segments in mind so that the story of one customer will be viewed as relatable and actionable for another customer.
  • Write copy that is easy to read and engaging so that readers will gain the insights and messages intended.
  • Follow a standardized format that includes all of the essentials a potential customer would find interesting and useful.
  • Support all of the claims for success made in the story with data in the forms of hard numbers and customer statements.

Case studies are a type of review but more in depth, aiming to show — rather than just tell — the positive experiences that customers have with a brand. Notably, 89% of consumers read reviews before deciding to buy, and 79% view case study content as part of their purchasing process. When it comes to B2B sales, 52% of buyers rank case studies as an important part of their evaluation process.

Telling a brand story through the experience of a tried-and-true customer matters. The story is relatable to potential new customers as they imagine themselves in the shoes of the company or individual featured in the case study. Showcasing previous customers can help new ones see themselves engaging with your brand in the ways that are most meaningful to them.

Besides sharing the perspective of another customer, case studies stand out from other content marketing forms because they are based on evidence. Whether pulling from client testimonials or data-driven results, case studies tend to have more impact on new business because the story contains information that is both objective (data) and subjective (customer experience) — and the brand doesn’t sound too self-promotional.

Case studies are unique in that there’s a fairly standardized format for telling a customer’s story. But that doesn’t mean there isn’t room for creativity. It’s all about making sure that teams are clear on the goals for the case study — along with strategies for supporting content and channels — and understanding how the story fits within the framework of the company’s overall marketing goals.

Here are the basic steps to writing a good case study.

1. Identify your goal

Start by defining exactly who your case study will be designed to help. Case studies are about specific instances where a company works with a customer to achieve a goal. Identify which customers are likely to have these goals, as well as other needs the story should cover to appeal to them.

The answer is often found in one of the buyer personas constructed as part of your larger marketing strategy. This can include anything from new leads generated by the marketing team to long-term customers being pursued for cross-sell opportunities. In all of these cases, demonstrating value through a relatable customer success story can be part of the path to conversion.

2. Choose your client or subject

Who you highlight matters. Case studies tie brands together that might otherwise not cross paths. A writer will want to ensure that the highlighted customer aligns with their own company’s brand identity and offerings. Look for a customer with positive name recognition who has had great success with a product or service and is willing to be an advocate.

The client should also match up with the identified target audience. Whichever company or individual is selected should be a reflection of other potential customers who can see themselves in similar circumstances, having the same problems and possible solutions.

Some of the most compelling case studies feature customers who:

  • Switch from one product or service to another while naming competitors that missed the mark.
  • Experience measurable results that are relatable to others in a specific industry.
  • Represent well-known brands and recognizable names that are likely to compel action.
  • Advocate for a product or service as a champion and are well-versed in its advantages.

Whichever customer is selected, marketers must ensure they have the permission of the company involved before getting started. Some brands have strict review and approval procedures for any official marketing or promotional materials that include their name. Acquiring those approvals in advance will prevent miscommunication or wasted effort if there is an issue with their legal or compliance teams.

3. Conduct research and compile data

Substantiating the claims made in a case study — either by the marketing team or customers themselves — adds validity to the story. To do this, include data and feedback from the client that defines what success looks like. This can be anything from demonstrating return on investment (ROI) to a specific metric the customer was striving to improve. Case studies should prove how an outcome was achieved and show tangible results that indicate to the customer that your solution is the right one.

This step could also include customer interviews. Make sure that the people being interviewed are key stakeholders in the purchase decision or deployment and use of the product or service that is being highlighted. Content writers should work off a set list of questions prepared in advance. It can be helpful to share these with the interviewees beforehand so they have time to consider and craft their responses. One of the best interview tactics to keep in mind is to ask questions where yes and no are not natural answers. This way, your subject will provide more open-ended responses that produce more meaningful content.

4. Choose the right format

There are a number of different ways to format a case study. Depending on what you hope to achieve, one style will be better than another. However, there are some common elements to include, such as:

  • An engaging headline
  • A subject and customer introduction
  • The unique challenge or challenges the customer faced
  • The solution the customer used to solve the problem
  • The results achieved
  • Data and statistics to back up claims of success
  • A strong call to action (CTA) to engage with the vendor

It’s also important to note that while case studies are traditionally written as stories, they don’t have to be in a written format. Some companies choose to get more creative with their case studies and produce multimedia content, depending on their audience and objectives. Case study formats can include traditional print stories, interactive web or social content, data-heavy infographics, professionally shot videos, podcasts, and more.

5. Write your case study

We’ll go into more detail later about how exactly to write a case study, including templates and examples. Generally speaking, though, there are a few things to keep in mind when writing your case study.

  • Be clear and concise. Readers want to get to the point of the story quickly and easily, and they’ll be looking to see themselves reflected in the story right from the start.
  • Provide a big picture. Always make sure to explain who the client is, their goals, and how they achieved success in a short introduction to engage the reader.
  • Construct a clear narrative. Stick to the story from the perspective of the customer and what they needed to solve instead of just listing product features or benefits.
  • Leverage graphics. Incorporating infographics, charts, and sidebars can be a more engaging and eye-catching way to share key statistics and data in readable ways.
  • Offer the right amount of detail. Most case studies are one or two pages with clear sections that a reader can skim to find the information most important to them.
  • Include data to support claims. Show real results — both facts and figures and customer quotes — to demonstrate credibility and prove the solution works.

6. Promote your story

Marketers have a number of options for distribution of a freshly minted case study. Many brands choose to publish case studies on their website and post them on social media. This can help support SEO and organic content strategies while also boosting company credibility and trust as visitors see that other businesses have used the product or service.

Marketers are always looking for quality content they can use for lead generation. Consider offering a case study as gated content behind a form on a landing page or as an offer in an email message. One great way to do this is to summarize the content and tease the full story available for download after the user takes an action.

Sales teams can also leverage case studies, so be sure they are aware that the assets exist once they’re published. Especially when it comes to larger B2B sales, companies often ask for examples of similar customer challenges that have been solved.

Now that you’ve learned a bit about case studies and what they should include, you may be wondering how to start creating great customer story content. Here are a couple of templates you can use to structure your case study.

Template 1 — Challenge-solution-result format

  • Start with an engaging title. This should be fewer than 70 characters long for SEO best practices. One of the best ways to approach the title is to include the customer’s name and a hint at the challenge they overcame in the end.
  • Create an introduction. Lead with an explanation as to who the customer is, the need they had, and the opportunity they found with a specific product or solution. Writers can also suggest the success the customer experienced with the solution they chose.
  • Present the challenge. This should be several paragraphs long and explain the problem the customer faced and the issues they were trying to solve. Details should tie into the company’s products and services naturally. This section needs to be the most relatable to the reader so they can picture themselves in a similar situation.
  • Share the solution. Explain which product or service offered was the ideal fit for the customer and why. Feel free to delve into their experience setting up, purchasing, and onboarding the solution.
  • Explain the results. Demonstrate the impact of the solution they chose by backing up their positive experience with data. Fill in with customer quotes and tangible, measurable results that show the effect of their choice.
  • Ask for action. Include a CTA at the end of the case study that invites readers to reach out for more information, try a demo, or learn more — to nurture them further in the marketing pipeline. What you ask of the reader should tie directly into the goals that were established for the case study in the first place.

Template 2 — Data-driven format

  • Start with an engaging title. Be sure to include a statistic or data point in the first 70 characters. Again, it’s best to include the customer’s name as part of the title.
  • Create an overview. Share the customer’s background and a short version of the challenge they faced. Present the reason a particular product or service was chosen, and feel free to include quotes from the customer about their selection process.
  • Present data point 1. Isolate the first metric that the customer used to define success and explain how the product or solution helped to achieve this goal. Provide data points and quotes to substantiate the claim that success was achieved.
  • Present data point 2. Isolate the second metric that the customer used to define success and explain what the product or solution did to achieve this goal. Provide data points and quotes to substantiate the claim that success was achieved.
  • Present data point 3. Isolate the final metric that the customer used to define success and explain what the product or solution did to achieve this goal. Provide data points and quotes to substantiate the claim that success was achieved.
  • Summarize the results. Reiterate the fact that the customer was able to achieve success thanks to a specific product or service. Include quotes and statements that reflect customer satisfaction and suggest they plan to continue using the solution.
  • Ask for action. Include a CTA at the end of the case study that asks readers to reach out for more information, try a demo, or learn more — to further nurture them in the marketing pipeline. Again, remember that this is where marketers can look to convert their content into action with the customer.

While templates are helpful, seeing a case study in action can also be a great way to learn. Here are some examples of how Adobe customers have experienced success.

Juniper Networks

One example is the Adobe and Juniper Networks case study, which puts the reader in the customer’s shoes. The beginning of the story quickly orients the reader so that they know exactly who the article is about and what they were trying to achieve. Solutions are outlined in a way that shows Adobe Experience Manager is the best choice and a natural fit for the customer. Along the way, quotes from the client are incorporated to help add validity to the statements. The results in the case study are conveyed with clear evidence of scale and volume using tangible data.

Lenovo

The story of Lenovo’s journey with Adobe is one that spans years of planning, implementation, and rollout. The Lenovo case study does a great job of consolidating all of this into a relatable journey that other enterprise organizations can see themselves taking, despite the project size. This case study also features descriptive headers and compelling visual elements that engage the reader and strengthen the content.

Tata Consulting

When it comes to using data to show customer results, this case study does an excellent job of conveying details and numbers in an easy-to-digest manner. Bullet points at the start break up the content while also helping the reader understand exactly what the case study will be about. Tata Consulting used Adobe to deliver elevated, engaging content experiences for a large telecommunications client of its own — an objective that’s relatable for a lot of companies.

Case studies are a vital tool for any marketing team as they enable you to demonstrate the value of your company’s products and services to others. They help marketers do their job and add credibility to a brand trying to promote its solutions by using the experiences and stories of real customers.

When you’re ready to get started with a case study:

  • Think about a few goals you’d like to accomplish with your content.
  • Make a list of successful clients that would be strong candidates for a case study.
  • Reach out to the client to get their approval and conduct an interview.
  • Gather the data to present an engaging and effective customer story.

Adobe can help

There are several Adobe products that can help you craft compelling case studies. Adobe Experience Platform helps you collect data and deliver great customer experiences across every channel. Once you’ve created your case studies, Experience Platform will help you deliver the right information to the right customer at the right time for maximum impact.

To learn more, watch the Adobe Experience Platform story.

Keep in mind that the best case studies are backed by data. That’s where Adobe Real-Time Customer Data Platform and Adobe Analytics come into play. With Real-Time CDP, you can gather the data you need to build a great case study and target specific customers to deliver the content to the right audience at the perfect moment.

Watch the Real-Time CDP overview video to learn more.

Finally, Adobe Analytics turns real-time data into real-time insights. It helps your business collect and synthesize data from multiple platforms to make more informed decisions and create the best case study possible.

Request a demo to learn more about Adobe Analytics.





  • Open access
  • Published: 11 March 2024

Anomaly detection in IoT-based healthcare: machine learning for enhanced security

  • Maryam Mahsal Khan &
  • Mohammed Alkhathami

Scientific Reports volume 14, Article number: 5872 (2024)


  • Computer science
  • Information technology

Internet of Things (IoT) integration in healthcare improves patient care while also making healthcare delivery systems more effective and economical. To fully realize the advantages of IoT in healthcare, it is imperative to overcome issues with data security, interoperability, and ethical considerations. IoT sensors periodically measure the health-related data of patients and share it with a server for further evaluation. At the server, different machine learning algorithms are applied which help in early diagnosis of diseases and issue alerts in case vital signs are out of the normal range. Various cyber attacks can be launched on IoT devices, compromising the security and privacy of applications such as healthcare. In this paper, we utilize the publicly available Canadian Institute for Cybersecurity (CIC) IoT dataset to model machine learning techniques for efficient detection of anomalous network traffic. The dataset consists of 33 types of IoT attacks which are divided into 7 main categories. In the current study, the dataset is pre-processed, and a balanced representation of classes is used to generate non-biased supervised (Random Forest, Adaptive Boosting, Logistic Regression, Perceptron, Deep Neural Network) machine learning models. These models are analyzed further by eliminating highly correlated features, reducing dimensionality, minimizing overfitting, and speeding up training times. Random Forest was found to perform optimally across binary and multiclass classification of IoT attacks with an approximate accuracy of 99.55% under both the reduced and full feature spaces. This improvement was complemented by a reduction in computational response time, which is essential for real-time attack detection and response.



The Internet of Things (IoT) is a major technology that is the basis of several upcoming applications in the areas of health care, smart manufacturing, and transportation systems. IoT relies on the use of various sensors to gather information about humans, devices, and the surrounding environment. This information is passed to the cloud server regularly, and as a result, application administrators can make various decisions to improve the efficiency of applications. Similarly, AI techniques can be utilized to automatically control the applications based on the collected data 1 .

Healthcare is one major application of IoT where patients are provided with wearable devices to collect data related to body vitals. Examples of such data include body measurements such as oxygen level, blood pressure, sugar level, and heart rate. Without using IoT, these vital measurements cannot be recorded continuously and sent to the cloud for processing. Thus, IoT-enabled health care is an important use case with a huge impact on human lives.

Since IoT-enabled health care involves the recording and sharing of critical data that is linked to human safety, it is vital to design efficient techniques to make sure that the data recording and sharing are reliable and secure. Healthcare systems can be subject to several security attacks that can lead to a loss of confidence in received data. In several cases, wrong decisions can be made based on malicious data, thus leading to the collapse of IoT-enabled healthcare applications.

There are several types of security attacks on healthcare systems. One is the Denial of Service (DoS) attack, in which malicious users aim to prevent the wearable device from sharing its data with the cloud. This can be achieved by sending incorrect data at high frequency towards the wearable, thus blocking its access to the wireless medium. Similarly, spoofing is another common cyber attack in which malicious users hide their identity to gain access to the critical health-related data of patients. Another example of a cyber attack is a brute force attack, which tries to crack the password of a user's wearable device and gain access to the sensor's data. In addition, there are many other attacks, such as data integrity attacks and eavesdropping, that can reduce the reliability of IoT healthcare applications.

This paper focuses on developing anomaly detection techniques for IoT attacks using the publicly available dataset. Following are the major contributions of the paper.

The authors in 2 have applied Machine Learning (ML) algorithms to an imbalanced dataset, producing models with high accuracy but low precision scores. The research motivation is to balance the dataset and train ML algorithms accordingly.

To evaluate supervised machine learning algorithms across binary (2-class) and multiclass (8- and 34-class) representations on the balanced dataset.

To evaluate the computational response time of machine learning models via feature reduction.

To determine which features are essential for the generalization of machine learning models.

The paper is organized as follows. The "Literature review" section describes recent work done in the area of IoT security and anomaly detection, briefly describes the ML algorithms used in the study, and explains how they are evaluated. The problem of an imbalanced dataset and the strategy to resolve it through oversampling techniques is also included in this section. The "Methodology" section describes the system model and the IoT attack dataset used, including the methodology and anomaly detection framework of the current study. The results and discussion are presented in the "Results and discussion" section. Finally, conclusions are described in the "Conclusion" section.

Literature review

In this section, we present an overview of different intrusion and cyber-attack detection techniques in an IoT network and provide a brief description of different datasets that are used to analyze these attacks. The section also provides information on the Machine learning (ML) algorithm used in the study along with the standard performance metrics used for the evaluation of the ML models. Finally, the section describes the problem with ML models trained on imbalanced datasets and strategies to overcome them.

Review of different intrusion detection techniques

Table  1 lists different intrusion detection techniques focused on IoT networks. In 3 , authors utilize Deep Neural Network (DNN) and Bi-directional Long Short-Term Memory (Bi-LSTM) techniques to identify the abnormalities in the data. A key feature of the proposed technique is the use of the Incremental Principal Component Analysis (IPCA) technique for reducing the features in the dataset. The proposed technique also uses dynamic quantization for efficient data analysis. The work achieves improved accuracy of intrusion detection and reduced complexity of the model.

The work in 4 is focused on efficient cyber attack detection. The main idea of the proposal is to use federated learning for improved privacy and distributed model development. The proposed technique uses a Deep Neural Network (DNN) for attack detection. The work also contributed towards reducing the features and balancing the data. Results show that the proposed technique improves the accuracy of attack detection as well as the privacy of the system.

In 5 , another intrusion detection for IoT networks is proposed. The focus of the work is on two key factors, one is removing the redundancy in dataset features, and the second is mitigating the imbalance in the dataset. By using these two factors, the proposed technique improves the F1 score of intrusion detection.

The work in 6 proposes a cyber-attack detection mechanism. The class imbalance problem is handled by the proposed technique. Authors apply DNN on the balanced dataset to perform training and testing. A bagging classifier mechanism is used to improve the performance of the system. The proposed technique achieves improved accuracy and precision.

The work in 7 develops an adaptive recommendation system to improve the efficiency of intrusion detection. The main feature of the proposed technique is a self-improving mechanism that autonomously learns the intrusion knowledge. A pseudo-label-based voting system is also used in the proposed technique, thus resulting in improved intrusion detection performance.

The work in 8 develops an explainable AI-based intrusion detection system. Authors utilize the DNN technique in conjunction with explainable AI mechanisms such as RuleFit and Shapley Additive Explanation. Results show that the developed model is simple and easier to understand while providing improved efficiency.

Cyber attack and intrusion detection data sets in IoT

There are various publicly available data sets related to cyber attacks and intrusion detection in IoT as shown in Table  2 . In 9 , the CIC IDS 2017 attack data set is provided by the Canadian Institute of Cyber Security. Five days of network traffic data were collected using the CICFlowMeter software. The data included normal traffic as well as different types of attacks such as Denial of Service (DoS), Distributed Denial of Service (DDoS), Brute Force, Cross-Site Scripting (XSS), Structured Query Language (SQL) injection, Infiltration, Port Scan, and Botnet.

The N-BaIoT data set in 10 was collected by the University of California, Irvine. Nine Linux-based IoT machines were used to generate traffic. Two IoT Botnets were used, one was BASHLITE and the other was Mirai. The generated security attacks included Acknowledgement (ACK), Scan, Synchronize (SYN), and User Datagram Protocol (UDP) flooding.

In 2 , the CICIoT data set was provided by the Canadian Institute of Cyber Security. 105 IoT machines were used to generate diverse security attacks, comprising 33 attack types divided into 7 major categories.

The NSL-KDD data set 11 was provided by Tavallaee et al. The data set is an improved version of the KDD data set that removes duplicate entries. The attacks included in the data set are DoS, User to Root, Remote to Local, and Probing.

In 12 , the UNSW_NB-15 data set was provided by the University of New South Wales. A synthetic attack environment was created including normal traffic and synthetic abnormal traffic. Several attacks were generated including Fuzzers, Analysis, Backdoors, etc.

Another data set named BoT-IoT was generated by the University of New South Wales 13 . This data set was based on a realistic environment of traffic containing both normal as well as Botnet traffic. The attack traffic included DoS, DDoS, Operating System (OS), Service scan, keylogging, and data exfiltration.

Motivation to use CICIoT 2023 dataset

The authors of 2 introduced the CICIoT2023 dataset, which is composed of thirty-three different attacks (categorized into seven classes) executed against 105 IoT devices, with well-documented processes. The study provides a more comprehensive and varied set of attack types than others reported in the literature. Moreover, the main motivation for using the CICIoT2023 dataset is that it was released recently and only one publication so far uses it: in 3 , only two attacks (Mirai, DDoS) were studied. No article exists on the use of various intelligent machine learning models to identify all types of malicious anomalous IoT attacks, namely DDoS, DoS, Recon, web-based, brute force, spoofing, and Mirai. The present study hence contributes in this direction.

Machine learning algorithms

There exist numerous supervised, unsupervised, and reinforcement-based machine learning algorithms. This research study investigates only the application of supervised ML algorithms to IoT attack detection. The performance of five ML algorithms is tested in the present work, and a brief description of each is provided below.

Random forest (RF): RF is an ensemble learning technique that combines multiple decision trees. For a classification task, the RF's output is the statistical mode of the trees' predictions, while for a regression task it is the average of the predictions made by each tree. Applications of RFs are numerous and include image analysis, finance, and healthcare. They are well known for their usefulness, usability, and capacity to handle high-dimensional data.
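As a rough sketch (not from the paper), the two combination rules can be illustrated in plain Python; `forest_predict` and the toy labels below are hypothetical:

```python
from collections import Counter

def forest_predict(tree_predictions, task="classification"):
    """Combine per-tree outputs: statistical mode for classification, mean for regression."""
    if task == "classification":
        # Mode: the most common class label across the trees
        return Counter(tree_predictions).most_common(1)[0][0]
    # Regression: average of the individual tree outputs
    return sum(tree_predictions) / len(tree_predictions)

# Three hypothetical trees vote on one network-traffic sample
print(forest_predict(["attack", "benign", "attack"]))      # attack
print(forest_predict([0.2, 0.4, 0.6], task="regression"))  # averages to ~0.4
```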

Logistic regression (LR): LR estimates the probability that an event will occur and is used for classification, predicting a discrete label from previous observations of a data set. LR operates on the logistic sigmoid function, which accepts any real input and outputs a value between zero and one.
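The sigmoid mapping can be sketched as follows; the `predict` helper and the 0.5 threshold are illustrative assumptions, not part of the study:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real input to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    """Turn the probability into a discrete class label."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))    # 0.5
print(predict(2.0))  # 1
```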

Perceptron (PER): As a linear classifier, the PER performs best when the classes are linearly separable. It uses the perceptron learning rule to update its weights, adjusting them in response to misclassifications. Although simple and efficient, the perceptron may not converge on datasets that are not linearly separable; in such circumstances, more sophisticated algorithms, like support vector machines or neural networks, should be used.
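A minimal illustration of the perceptron learning rule; `perceptron_train` and the one-dimensional toy data are a hypothetical sketch, not the paper's implementation:

```python
def perceptron_train(X, y, lr=1.0, epochs=10):
    """Perceptron learning rule: adjust weights only on misclassifications."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = target - pred  # 0 when correct, +1 or -1 when wrong
            if err:
                w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
                b += lr * err
    return w, b

# Linearly separable toy data: class 1 iff the feature exceeds 0.5 (made up)
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
w, b = perceptron_train(X, y)
preds = [1 if w[0] * xi[0] + b > 0 else 0 for xi in X]
print(preds)  # [0, 0, 1, 1]
```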

Deep neural network (DNN): An artificial neural network with several layers between the input and output layers is called a Deep Neural Network (DNN). Deep learning models are a subclass of neural networks distinguished by their capacity to acquire intricate hierarchical data representations. A deep neural network's layers are made up of linked nodes, or neurons, and are generally divided into three categories: input layer, hidden layers, and output layer. Key characteristics of a DNN include the use of non-linear activation functions, deep architectures, and the backpropagation algorithm for training the network's weights to locate an optimal solution.
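A toy forward pass illustrating the layered structure and non-linear activations; the 2-4-1 architecture and all weight values are made up for demonstration and are not the network used in the study:

```python
import math

def relu(v):
    """Non-linear activation applied element-wise in the hidden layer."""
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    """Fully connected layer: y = W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x):
    # Hypothetical 2-4-1 network: input -> hidden (ReLU) -> output (sigmoid)
    W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.7, 0.2]]
    b1 = [0.0, 0.1, -0.1, 0.2]
    W2 = [[0.3, -0.5, 0.2, 0.6]]
    b2 = [0.05]
    h = relu(dense(x, W1, b1))
    z = dense(h, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-z))  # probability of the positive class

p = forward([0.9, 0.1])
print(0.0 < p < 1.0)  # True
```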

Adaptive boosting (AB): AB creates a powerful classifier by combining several weak classifiers. Training instances are given weights by the algorithm, which then iteratively updates them. A weighted sum of the individual weak classifiers yields the final prediction.
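One boosting round can be sketched as follows; `adaboost_round` and the four-sample example are hypothetical, showing only the weight-update idea:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost iteration: up-weight misclassified samples.

    weights: current sample weights; correct: True/False per sample.
    Returns (new_weights, alpha), where alpha is the weak classifier's
    vote weight in the final weighted sum.
    """
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # classifier weight
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

# Four samples with equal weights; the weak learner gets the last one wrong
weights, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
print(weights)  # the misclassified sample's weight grows for the next round
```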

Machine learning performance metrics

In machine learning classification problems, several performance metrics are commonly used to evaluate the performance of a model. These metrics include accuracy, precision, recall, and F1-score, each of which measures different aspects of classification performance.

Accuracy: Accuracy measures the overall correctness of a classification model. It determines the proportion of accurately predicted instances to all of the dataset's instances and is computed using Eq. ( 1 ):

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1) $$

where

TP (True Positives) is the number of correctly predicted positive instances.

TN (True Negatives) is the number of correctly predicted negative instances.

FP (False Positives) is the number of instances that were actually negative but were incorrectly predicted as positive.

FN (False Negatives) is the number of instances that were positive but were incorrectly predicted as negative.

Precision: Precision measures the accuracy of positive predictions made by the model. It calculates the ratio of true positives to the total number of positive predictions, expressed in Eq. ( 2 ):

$$ \text{Precision} = \frac{TP}{TP + FP} \quad (2) $$

Recall: Recall measures the ability of the model to correctly identify positive instances. It calculates the ratio of true positives to the total number of actual positive instances, expressed in Eq. ( 3 ):

$$ \text{Recall} = \frac{TP}{TP + FN} \quad (3) $$

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when dealing with imbalanced datasets, expressed in Eq. ( 4 ):

$$ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4) $$
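The four metrics can be computed directly from confusion-matrix counts; the counts below are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts from a binary attack detector
acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.85 0.889 0.8 0.842
```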

Imbalanced datasets

An imbalanced dataset has a distribution of classes (categories or labels) that is severely skewed, meaning that one class has significantly more samples or instances than the other(s). Such datasets occur frequently in machine learning. In binary classification problems, one class is the majority class and the other is the minority class, while in multiclass classification, class imbalance can arise when one or more classes have disproportionately fewer samples than the others. In applications where the minority class is of great importance, such as fraud detection, medical diagnosis, and rare event prediction, addressing class imbalance is essential for reliable predictions. Two major concerns in using ML on an imbalanced dataset include 14 , 15 :

Biased model training: Machine learning algorithms are often biased in favor of the dominant class when one class significantly outweighs the others. Because the model's goal is frequently to minimize the overall error, it may prioritize correctly predicting the majority class while ignoring the minority class. The model may also struggle to make precise predictions for the minority class on unseen data because it has not seen enough examples from that group, resulting in poor generalization of the problem.

Misleading evaluation metrics: In imbalanced datasets, standard accuracy becomes a misleading statistic: even a model that predicts the majority class in every instance can still be highly accurate. Meanwhile, the sensitivity (true positive rate) of such a model for the minority class is very low, meaning the model can miss a significant number of minority-class cases and produce a large number of false negatives.
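A quick sketch, with made-up numbers, of why accuracy misleads on skewed data:

```python
# Hypothetical imbalanced test set: 990 benign (0), 10 attack (1)
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a trivial model that always predicts "benign"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = tp / sum(y_true)  # sensitivity for the minority (attack) class

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- yet every attack is missed
```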

Several strategies can be used to reduce the problems caused by class imbalance. These include resampling techniques such as oversampling the minority class and undersampling the majority class 16 ; synthetic data generation techniques such as SMOTE 17 and Adaptive Synthetic Sampling (ADASYN) 1 ; and cluster-based techniques 18 , to name a few. The authors in 2 applied machine learning (ML) algorithms to an imbalanced dataset, producing models with high accuracy but low precision scores. The motivation of this research is to balance the dataset and then apply ML algorithms to generate generalized models with marked improvements in the evaluation metrics.

Synthetic minority over-sampling technique: balanced dataset generation

Synthetic Minority Over-sampling Technique (SMOTE) is a well-known pre-processing approach in machine learning and data preparation that deals with class imbalance in classification problems. Class imbalance occurs when one class in a binary or multiclass classification problem has significantly fewer samples than the other(s), resulting in a model biased toward the dominant class. To address this problem, Chawla et al. developed the SMOTE algorithm in 2002 17 . It balances the class distribution by creating artificial examples of the minority class, which improves the learning algorithm’s performance and lowers the likelihood of a biased model. The interpolation is expressed mathematically in Eq. ( 5 ):

\(x_{new} = x + \lambda \times (neighbor - x)\)

where x is the original minority class instance, neighbor is one of the k nearest neighbors of x within the minority class, and \(\lambda \) is a random value between 0 and 1 controlling the amount of interpolation.

The SMOTE method has multiple variants, each adjusted to handle different facets of the class imbalance problem. These include Borderline-SMOTE, which applies SMOTE to instances near the decision boundary 19 ; ADASYN, which generates samples based on the local density of the minority class 1 ; SMOTE with Edited Nearest Neighbours (ENN), which removes noisy samples using ENN 20 , 21 ; SMOTE-Tomek Links, which combines SMOTE with the Tomek Links undersampling technique to remove noisy samples 22 ; and SMOTE-Boost, which combines SMOTE with the AdaBoost ensemble method to oversample the minority class in each AdaBoost iteration 23 . Different versions of the SMOTE algorithm thus provide different strategies for increasing minority class samples and reducing noisy data. In the current research study, the conventional SMOTE algorithm is used as a starting point to observe the change in performance metrics after applying it to the CICIoT dataset.
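As an illustration, the interpolation at the core of conventional SMOTE can be sketched in plain NumPy (a simplified sketch of the interpolation described above, not a library implementation; production code would typically use imbalanced-learn's SMOTE):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    # X_min: minority-class instances, shape (n_samples, n_features)
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_min))
    x = X_min[i]
    # k nearest neighbors of x within the minority class (excluding x itself)
    d = np.linalg.norm(X_min - x, axis=1)
    neighbor_idx = np.argsort(d)[1:k + 1]
    neighbor = X_min[rng.choice(neighbor_idx)]
    lam = rng.random()                  # random lambda in [0, 1)
    return x + lam * (neighbor - x)     # interpolated synthetic sample

# toy minority class: four points on the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_new = smote_sample(X_min, k=3, rng=0)
```

Because the synthetic point lies on the line segment between a real minority sample and one of its minority-class neighbors, it always falls inside the region spanned by the minority class.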


CICIoT2023 dataset

In the current research study, we use the publicly available IoT attack dataset CICIoT2023 2 . The dataset was created to encourage the development of security analytics applications for use in real IoT operations. The authors executed 33 different attacks in an IoT topology of 105 devices. These attacks are classified into seven categories, namely DDoS, DoS, Recon, Web-based, brute force, spoofing, and Mirai. The dataset consists of 169 files in two file formats, PCAP and CSV; the CSV files are processed from the PCAPs and contain 46 attributes characterizing the different types of attacks. The number of recorded samples per category is not uniform, with Web-Based and Brute-Force having far lower representation, a classic sign of an imbalanced dataset. Figure  1 displays the research study’s workflow. The dataset is pre-processed and balanced to ensure credibility in the evaluation of the machine learning models. The data features are further reduced to improve the predictive performance and training times of the ML models across both binary and multiclass representations of the dataset, as explained below. The algorithm of the methodology is shown in Algorithm 1.

Figure 1: Methodology of the research work applied on the CICIoT2023 dataset.

Algorithm 1: Performance of ML algorithms on balanced representation of the CICIoT2023 dataset.

Dataset preprocessing

Data cleaning is a crucial step in the ML pipeline; it includes handling missing or noisy data and dealing with outliers or duplicates. The dataset consists of 33 classes of IoT attacks plus benign traffic (34 classes in total), described by forty-six numerical features. Features with no variation across the thirty-four classes are removed from the dataset; hence, of the 46 features, 40 are processed further. These features are then scaled using the standard scaler method, a common requirement for many machine learning algorithms.

Feature scaling is particularly important for algorithms that use distance-based metrics, as differences in scale can disproportionately inflate the influence of certain features on the model. This pre-processing step helps improve the performance and convergence of ML algorithms. There are two common methods of scaling the features in a dataset: (1) normalization and (2) standardization. Normalization scales the features into a fixed range, e.g. [0, 1], while standardization scales features to a mean of zero and a standard deviation of one. Many ML algorithms, including linear regression and neural networks, converge faster in the standardized feature space. In the current study, the forty features obtained after cleaning are standardized using the standard scaler method.
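Standardization as described above can be sketched as follows (a minimal NumPy version, equivalent in effect to scikit-learn's StandardScaler):

```python
import numpy as np

def standardize(X):
    # scale each feature (column) to zero mean and unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0      # guard against constant features
    return (X - mu) / sigma

# toy example: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xs = standardize(X)
```

In practice the scaler's mean and standard deviation should be fitted on the training split only and then reused to transform the test split, to avoid information leakage.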

Data balancing

This is an important block of the methodology: the dataset is balanced using either random undersampling or oversampling via the conventional SMOTE algorithm, described in the " Synthetic minority over‑sampling technique: balanced dataset generation " section. The process of dataset generation for binary and multiclass classification is explained below.

2-Class representation : In this scenario, the thirty-three malicious classes are labeled as a single category, ‘Attack’. Approximately 50% of the data, capturing the different types of malicious traffic, is randomly extracted from each of the 169 CSV files to create a balanced dataset. No SMOTE is used in this scenario. The total number of samples per class in the integrated dataset was 8450.

8-Class representation : Data samples from all 34 subcategories are used in the construction of the 8-class dataset. Random undersampling of the majority classes and SMOTE-based upsampling of the minority classes are applied to produce a uniform representation of the dataset samples. The total number of samples per class in the integrated dataset was 33,800.

34-Class representation : Among the 34 classes in the CICIoT dataset, two classes, BruteForce and Web-based, have comparatively few representative samples. Random undersampling of the majority classes and SMOTE-based upsampling of the minority classes are applied to produce a uniform representation of the dataset samples. The total number of samples per class in the integrated dataset was 84,500.

The IoT topology deployed to produce the CICIoT2023 dataset comprises 105 IoT devices, and 33 different types of IoT attacks were modeled. The number of rows captured per attack is not uniform; e.g., the attack type DDoS-ICMP Flood contains 7,200,504 data rows, representing a majority class, whereas WebBased-Uploading Attack is a minority class with 1252 data rows. Applying ML algorithms directly to such an imbalanced dataset would impact the generalization and performance of the models; e.g., the authors in 2 produced models with high accuracy but low precision scores. Hence, the main motivation and contribution of this research is to balance the dataset and generate ML models that are unbiased, with non-misleading evaluation metrics.

Feature reduction

For feature engineering, model selection, and general data analysis in machine learning, the Pearson correlation coefficient (PCC) is significant because it offers a clear indicator of the relationship between variables. PCC facilitates the creation of more accurate predictive and descriptive models by informing decisions about which variables to include in models and how they interact. In many applications, eliminating highly correlated features has reduced model complexity without compromising predictive performance. The Pearson correlation coefficient r between two variables X and Y with n data points is given in Eq. ( 6 ):

\(r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\)

where \(X_i\) and \(Y_i\) are the individual data points for variables X and Y respectively and \(\bar{X}\) and \(\bar{Y}\) are the means of variables X and Y respectively.

As mentioned, the pre-processed dataset consists of forty features. The PCC of the forty features is calculated, and Fig.  2 a shows the absolute correlation coefficient heat map, where darker shades indicate highly correlated features. In the current study, a feature with an absolute PCC value of 0.9 or higher is regarded as highly correlated and is eliminated from the feature collection. Hence, a total of thirty-one features are retained in the reduced feature space. Figure  2 b displays a heat map of the reduced feature set and the related PCC values.
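The correlation-based elimination can be sketched as follows (a greedy NumPy sketch using the 0.9 absolute-correlation threshold; the exact elimination order used in the study is an assumption):

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    # absolute Pearson correlation between every pair of features (columns)
    r = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # keep feature j only if it is not highly correlated
        # with any feature already kept
        if all(r[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# feature 1 is (almost) a linear copy of feature 0; feature 2 is independent
X = np.column_stack([a,
                     2.0 * a + 0.001 * rng.normal(size=100),
                     rng.normal(size=100)])
X_red, kept = drop_correlated(X, threshold=0.9)  # drops the near-duplicate
```

On this toy data, the near-duplicate column is removed and the two genuinely distinct features survive.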

Figure 2: Heat map plot of ( a ) Pearson correlation coefficients of forty features in the CICIoT2023 dataset; ( b ) thirty-one Pearson correlation coefficients after removing highly correlated features, with the absolute correlation threshold set to 0.9. Darker shades represent a high absolute correlation coefficient.

Model generation and evaluation

Any binary or multiclass classification problem can be modeled through supervised machine learning algorithms. Five popular and powerful supervised ML algorithms (Random Forest ( RF ), Adaptive Boosting ( AB ), Logistic Regression ( LR ), Perceptron ( PER ), and Deep Neural Network ( DNN )) are studied on the balanced dataset with both the full and the reduced feature sets. The datasets are split into 80% training and 20% testing, following the research study 2 for a fair comparison. Standard performance metrics for evaluating supervised algorithms, discussed in the " Machine learning performance metrics " section, are computed and reported in Table  3 for the 2-Class, 8-Class, and 34-Class tasks respectively.
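The 80/20 split used for evaluation can be sketched as follows (a minimal shuffle-and-split sketch in NumPy; whether the original study stratified the split is not stated here):

```python
import numpy as np

def train_test_split_80_20(X, y, rng=None):
    # shuffle the sample indices and hold out the last 20% for testing
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]

# toy data: 10 samples, 2 features, alternating binary labels
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0, 1] * 5)
X_tr, X_te, y_tr, y_te = train_test_split_80_20(X, y, rng=0)
```

Because the classes are balanced before splitting, a plain random split already yields roughly equal class proportions in the training and test sets.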

Results and discussion

Table  3 shows the performance of the ML algorithms on the balanced dataset across the three classification scenarios: 2-Class, 8-Class, and 34-Class. The models are evaluated on accuracy, precision, recall, and F1-score, as explained in the " Machine learning performance metrics " section. Overall, RF performs better than the other ML models across the different scenarios. In the 2-Class task, all of the ML models achieve an accuracy of \(\ge 98\%\) , which decreases with increasing complexity of the problem, i.e., in the 8-Class and 34-Class label identification. There is a slight improvement in accuracy for the ML models trained in the reduced feature space, e.g., 0.06% for the RF and DNN models. With balanced dataset representation across the three classification tasks, precision, recall, and F1-score improve over the values reported in the literature 2 .

To visualize the performance of the RF models across the different class categories, confusion matrices are examined. In Fig.  3 , for the binary classification problem with 1690 test samples per category (benign or attack), benign predictions are more accurate than attack predictions in both scenarios. This might be attributed to the fact that the 33 variations of attack are labeled as one category. The F1-score of the RF model improves slightly in the reduced feature space, from 99.49 to 99.55%.

Figure 3: Confusion matrices of RF models on the binary classification task (attack versus benign) using the CICIoT2023 dataset across ( a ) all features and ( b ) reduced features, respectively.

Figure  4 shows the confusion matrices of the eight-class problem, where 33,800 samples per category were tested by the RF model under both scenarios. Two attack categories in particular, Recon and Spoofing, were poorly recognized (with an F1-score of 90%) by the RF models despite being trained on real samples. The SMOTE-based synthetic samples generated for BruteForce and Web were found to be in good agreement with the original training samples. Further analysis is required to understand the characteristics of the Spoofing and Recon attacks.

Figure 4: Confusion matrices of trained RF models on the 8-class multiclass classification task using the CICIoT2023 dataset across ( a ) all features and ( b ) reduced features, respectively.

In the 34-class problem, 16,900 samples per category were tested on the trained model. Confusion matrices for the RF models under both scenarios (all features and reduced features) are shown in Fig.  5 . Thirty-one of the classes produced an F1-score greater than 85%, while three classes, DNS-Spoofing , Recon-PortScan and Recon-OSScan , had F1-scores of 83%, 82% and 79% respectively. These subclasses belong to the Recon and Spoofing IoT attack categories, which were also harder to classify than other labels in the 8-Class task.

Figure 5: Confusion matrices of trained RF models on the 34-class multiclass classification task using the CICIoT2023 dataset across ( a ) all features and ( b ) reduced features, respectively.

An additional tool for understanding important characteristics of the dataset is the feature importance graph produced by the RF models. The feature importance graphs for the three classification tasks are displayed in Fig.  6 , where (a) shows the RF models when all features are used and (b) shows the RF models when the reduced feature set is used. The top features identified in the binary classification task under both scenarios were \(urg_{count}\) , the number of packets with the urg flag set, and AVG , the average packet length. For both of the multiclass tasks, IAT , which measures the time difference between the current and the previous packet, was the top feature. The statistical measurements, e.g. Header Length, Min, Max and Average, covering the right side of the feature graph in Fig.  6 , were chosen more frequently than the other features.

Figure 6: Feature significance graphs extracted from the RF models across ( a ) all features and ( b ) reduced features of the CICIoT2023 dataset for the 2-Class, 8-Class and 34-Class classification tasks.

Figure  7 a, c and e displays the training time in seconds, and Fig.  7 b, d and f the testing time in seconds, of the ML algorithms on the full and reduced feature sets for the 2-Class (Fig.  7 a and b), 8-Class (Fig.  7 c and d) and 34-Class (Fig.  7 e and f) classification tasks respectively. As the feature set is reduced, the training time of all the models decreases. For the DNN model in the 2-Class task (Fig.  7 a and b), training time across all features was approximately 8.6 s, while in the reduced space it was 6.6 s. Similarly, in almost all cases the response time of the models decreases with the reduced feature set. For the RF model in the 8-Class task (Fig.  7 d), testing time across all features was approximately 13.08 s, while in the reduced feature space it was 6.64 s. All these steps were carried out in a development environment with an Intel Core i7 7820HQ processor, 32 GB DDR4 RAM, and the Windows 10 operating system.

Figure 7: Time taken, in seconds, to train and test the supervised ML algorithms with and without feature reduction, for the ( a ,  b ) 2-Class, ( c ,  d ) 8-Class, and ( e ,  f ) 34-Class classification tasks, respectively.

The CICIoT2023 dataset was released recently, so little literature exists that uses it. The best models from the current study are compared with the best models produced by the authors in 2 in Table  4 , with the optimum-performing model metrics highlighted in bold. The models from the current study outperform those previously reported. The original dataset was imbalanced, so the models generated from it had low recall values; balancing the data samples across the different classification tasks markedly improves recall.

The use of Medical Internet of Things (IoT) devices in healthcare settings has made automation and monitoring possible, e.g. enhanced patient care and remote patient monitoring. However, it has also introduced a host of security vulnerabilities and risks, including identity theft, unauthorized alteration of medical records, and even life-threatening situations. Furthermore, securing each device entry point in real time is becoming more challenging due to the growing usage of networked devices.

Machine learning has the potential to detect and respond to attacks in real time by identifying anomalies in the data captured by IoT devices. The current study explored the potential of supervised machine learning algorithms for identifying anomalous behavior on a recently published dataset, CICIoT2023. The dataset consists of 33 different types of IoT attacks represented by 46 features, with a varying number of data samples; i.e., it is imbalanced, with a non-uniform sample distribution. The study explored improving machine learning models by balancing the data distribution using the SMOTE algorithm. Classification models for three representations of ‘IoT Attack’, two-class, eight-class, and thirty-four-class, were investigated. Random Forest excelled in all three classification problems and performed better than what has been reported so far in the literature. Eliminating strongly correlated features slightly improved model performance while also reducing computational response time, enabling real-time detection.

The feature importance graphs identified \(urg_{count}\) (the number of urg flags in a packet) and AVG (the average packet length) as important features in the 2-Class task, and IAT (the time difference between packet arrivals) as important for discriminating attack categories in the multiclass problems. Moreover, certain IoT attacks, e.g. Spoofing and Recon, require further analysis and feature expansion to better discriminate these classes and their corresponding sub-classes.

Data availability

Details of the data are available in the paper.

He, H., Bai, Y., Garcia, E. A. & Li, S. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) 1322–1328, https://doi.org/10.1109/IJCNN.2008.4633969 (2008).

Neto, E. C. P. et al. Ciciot2023: A real-time dataset and benchmark for large-scale attacks in IoT environment. Sensors 23 . https://doi.org/10.3390/s23135941 (2023).

Wang, Z. et al. A lightweight intrusion detection method for IoT based on deep learning and dynamic quantization. PeerJ Comput. Sci. 9 , e1569 (2023).


Abbas, S. et al. A novel federated edge learning approach for detecting cyberattacks in IoT infrastructures. IEEE Access 11 , 112189–112198. https://doi.org/10.1109/ACCESS.2023.3318866 (2023).


Narayan, K. et al. Iids: Design of intelligent intrusion detection system for internet-of-things applications. arXiv:2308.00943 (2023).

Thakkar, A. & Lohiya, R. Attack classification of imbalanced intrusion data for IoT network using ensemble-learning-based deep neural network. IEEE Internet Things J. 10 , 11888–11895. https://doi.org/10.1109/JIOT.2023.3244810 (2023).

Wu, J., Wang, Y., Dai, H., Xu, C. & Kent, K. B. Adaptive bi-recommendation and self-improving network for heterogeneous domain adaptation-assisted IoT intrusion detection. IEEE Internet Things J. 10 , 13205–13220. https://doi.org/10.1109/JIOT.2023.3262458 (2023).

El Houda, Z. A., Brik, B. & Senouci, S.-M. A novel IoT-based explainable deep learning framework for intrusion detection systems. IEEE Internet Things Mag. 5 , 20–23. https://doi.org/10.1109/IOTM.005.2200028 (2022).

Sharafaldin, I., Lashkari, A. H. & Ghorbani, A. A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 1 , 108–116 (2018).


Meidan, Y. et al. N-baiot-network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Comput. 17 , 12–22. https://doi.org/10.1109/MPRV.2018.03367731 (2018).

Tavallaee, M., Bagheri, E., Lu, W. & Ghorbani, A. A. A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications 1–6, https://doi.org/10.1109/CISDA.2009.5356528 (2009).

Moustafa, N. & Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS) 1–6, https://doi.org/10.1109/MilCIS.2015.7348942 (2015).

Koroniotis, N., Moustafa, N., Sitnikova, E. & Turnbull, B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-IoT dataset. Futur. Gener. Comput. Syst. 100 , 779–796 (2019).

Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 5 , 221–232. https://doi.org/10.1007/s13748-016-0094-0 (2016).

Batista, G., Prati, R. & Monard, M.-C. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6 , 20–29. https://doi.org/10.1145/1007730.1007735 (2004).

Devi, D., Biswas, S. & Purkayastha, B. A review on solution to class imbalance problem: Undersampling approaches (2021).

Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. Smote: Synthetic minority over-sampling technique. J. Artif. Int. Res. 16 , 321–357 (2002).

Yen, S.-J. & Lee, Y.-S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36 , 5718–5727. https://doi.org/10.1016/j.eswa.2008.06.108 (2009).

Han, H., Wang, W.-Y. & Mao, B.-H. Borderline-smote: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing (eds Huang, D.-S. et al. ) 878–887 (Springer, Berlin, 2005).


Batista, G. E. A. P. A., Prati, R. C. & Monard, M. C. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. Newsl. 6 , 20–29. https://doi.org/10.1145/1007730.1007735 (2004).

Wilson, D. L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC–2 , 408–421. https://doi.org/10.1109/TSMC.1972.4309137 (1972).


Swana, E. F., Doorsamy, W. & Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 22 , 3246. https://doi.org/10.3390/s22093246 (2022).


Lv, M., Ren, Y. & Chen, Y. Research on imbalanced data : based on SMOTE-AdaBoost algorithm. In 2019 3rd International Conference on Electronic Information Technology and Computer Engineering (EITCE) 1165–1170, https://doi.org/10.1109/EITCE47263.2019.9094859 (2019).


The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research through the project number IFP-IMSIU-2023046. The authors also appreciate the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) for supporting and supervising this project.

Author information

Authors and affiliations.

Department of Computer Science, CECOS University of IT and Emerging Sciences, Peshawar, 25000, Pakistan

Maryam Mahsal Khan

Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, 11432, Saudi Arabia

Mohammed Alkhathami



M.M.K. and M.A. developed the paper idea, M.M.K. and M.A. conducted simulations, M.M.K. and M.A. wrote the manuscript.

Corresponding author

Correspondence to Mohammed Alkhathami .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Khan, M.M., Alkhathami, M. Anomaly detection in IoT-based healthcare: machine learning for enhanced security. Sci Rep 14 , 5872 (2024). https://doi.org/10.1038/s41598-024-56126-x


Received : 03 December 2023

Accepted : 29 February 2024

Published : 11 March 2024

DOI : https://doi.org/10.1038/s41598-024-56126-x


  • Anomaly detection
  • Machine learning
  • Deep learning
  • Pearson correlation coefficient
  • Imbalanced dataset





Qualitative case study data analysis: an example from practice


  • 1 School of Nursing and Midwifery, National University of Ireland, Galway, Republic of Ireland.
  • PMID: 25976531
  • DOI: 10.7748/nr.22.5.8.e1307

Aim: To illustrate an approach to data analysis in qualitative case study methodology.

Background: There is often little detail in case study research about how data were analysed. However, it is important that comprehensive analysis procedures are used because there are often large sets of data from multiple sources of evidence. Furthermore, the ability to describe in detail how the analysis was conducted ensures rigour in reporting qualitative research.

Data sources: The research example used is a multiple case study that explored the role of the clinical skills laboratory in preparing students for the real world of practice. Data analysis was conducted using a framework guided by the four stages of analysis outlined by Morse ( 1994 ): comprehending, synthesising, theorising and recontextualising. The specific strategies for analysis in these stages centred on the work of Miles and Huberman ( 1994 ), which has been successfully used in case study research. The data were managed using NVivo software.

Review methods: Literature examining qualitative data analysis was reviewed and strategies illustrated by the case study example provided.

Discussion: Each stage of the analysis framework is described with illustration from the research example for the purpose of highlighting the benefits of a systematic approach to handling large data sets from multiple sources.

Conclusion: By providing an example of how each stage of the analysis was conducted, it is hoped that researchers will be able to consider the benefits of such an approach to their own case study analysis.

Implications for research/practice: This paper illustrates specific strategies that can be employed when conducting data analysis in case study research and other qualitative research designs.

Keywords: Case study data analysis; case study research methodology; clinical skills research; qualitative case study methodology; qualitative data analysis; qualitative research.

  • Case-Control Studies*
  • Data Interpretation, Statistical*
  • Nursing Research / methods*
  • Qualitative Research*
  • Research Design

A generative AI reset: Rewiring to turn potential into value in 2024

It’s time for a generative AI (gen AI) reset. The initial enthusiasm and flurry of activity in 2023 are giving way to second thoughts and recalibrations as companies realize that capturing gen AI’s enormous potential value is harder than expected .

With 2024 shaping up to be the year for gen AI to prove its value, companies should keep in mind the hard lessons learned with digital and AI transformations: competitive advantage comes from building organizational and technological capabilities to broadly innovate, deploy, and improve solutions at scale—in effect, rewiring the business  for distributed digital and AI innovation.

About QuantumBlack, AI by McKinsey

QuantumBlack, McKinsey’s AI arm, helps companies transform using the power of technology, technical expertise, and industry experts. With thousands of practitioners at QuantumBlack (data engineers, data scientists, product managers, designers, and software engineers) and McKinsey (industry and domain experts), we are working to solve the world’s most important AI challenges. QuantumBlack Labs is our center of technology development and client innovation, which has been driving cutting-edge advancements and developments in AI through locations across the globe.

Companies looking to score early wins with gen AI should move quickly. But those hoping that gen AI offers a shortcut past the tough—and necessary—organizational surgery are likely to meet with disappointing results. Launching pilots is (relatively) easy; getting pilots to scale and create meaningful value is hard, because doing so requires a broad set of changes to the way work actually gets done.

Let’s briefly look at what this has meant for one Pacific region telecommunications company. The company hired a chief data and AI officer with a mandate to “enable the organization to create value with data and AI.” The chief data and AI officer worked with the business to develop the strategic vision and implement the road map for the use cases. After a scan of domains (that is, customer journeys or functions) and use case opportunities across the enterprise, leadership prioritized the home-servicing/maintenance domain to pilot and then scale as part of a larger sequencing of initiatives. They targeted, in particular, the development of a gen AI tool to help dispatchers and service operators better predict the types of calls and parts needed when servicing homes.

Leadership put in place cross-functional product teams with shared objectives and incentives to build the gen AI tool. As part of an effort to upskill the entire enterprise to better work with data and gen AI tools, they also set up a data and AI academy, which the dispatchers and service operators enrolled in as part of their training. To provide the technology and data underpinnings for gen AI, the chief data and AI officer also selected a large language model (LLM) and cloud provider that could meet the needs of the domain as well as serve other parts of the enterprise. The chief data and AI officer also oversaw the implementation of a data architecture so that the clean and reliable data (including service histories and inventory databases) needed to build the gen AI tool could be delivered quickly and responsibly.


Our book Rewired: The McKinsey Guide to Outcompeting in the Age of Digital and AI (Wiley, June 2023) provides a detailed manual on the six capabilities needed to deliver the kind of broad change that harnesses digital and AI technology. In this article, we will explore how to extend each of those capabilities to implement a successful gen AI program at scale. While recognizing that these are still early days and that there is much more to learn, our experience has shown that breaking open the gen AI opportunity requires companies to rewire how they work in the following ways.

Figure out where gen AI copilots can give you a real competitive advantage

The broad excitement around gen AI and its relative ease of use has led to a burst of experimentation across organizations. Most of these initiatives, however, won’t generate a competitive advantage. One bank, for example, bought tens of thousands of GitHub Copilot licenses, but since it didn’t have a clear sense of how to work with the technology, progress was slow. Another unfocused effort we often see is when companies move to incorporate gen AI into their customer service capabilities. Customer service is a commodity capability, not part of the core business, for most companies. While gen AI might help with productivity in such cases, it won’t create a competitive advantage.

To create competitive advantage, companies should first understand the difference between being a “taker” (a user of available tools, often via APIs and subscription services), a “shaper” (an integrator of available models with proprietary data), and a “maker” (a builder of LLMs). For now, the maker approach is too expensive for most companies, so the sweet spot for businesses is implementing a taker model for productivity improvements while building shaper applications for competitive advantage.

Much of gen AI’s near-term value is closely tied to its ability to help people do their current jobs better. In this way, gen AI tools act as copilots that work side by side with an employee, creating an initial block of code that a developer can adapt, for example, or drafting a requisition order for a new part that a maintenance worker in the field can review and submit (see sidebar “Copilot examples across three generative AI archetypes”). This means companies should be focusing on where copilot technology can have the biggest impact on their priority programs.

Copilot examples across three generative AI archetypes

  • “Taker” copilots help real estate customers sift through property options and find the most promising one, write code for a developer, and summarize investor transcripts.
  • “Shaper” copilots provide recommendations to sales reps for upselling customers by connecting generative AI tools to customer relationship management systems, financial systems, and customer behavior histories; create virtual assistants to personalize treatments for patients; and recommend solutions for maintenance workers based on historical data.
  • “Maker” copilots are foundation models that lab scientists at pharmaceutical companies can use to find and test new and better drugs more quickly.

Some industrial companies, for example, have identified maintenance as a critical domain for their business. Reviewing maintenance reports and spending time with workers on the front lines can help determine where a gen AI copilot could make a big difference, such as in identifying issues with equipment failures quickly and early on. A gen AI copilot can also help identify root causes of truck breakdowns and recommend resolutions much more quickly than usual, as well as act as an ongoing source for best practices or standard operating procedures.

The challenge with copilots is figuring out how to generate revenue from increased productivity. In the case of customer service centers, for example, companies can stop recruiting new agents and use attrition to potentially achieve real financial gains. Defining the plans for how to generate revenue from the increased productivity up front, therefore, is crucial to capturing the value.

Upskill the talent you have but be clear about the gen-AI-specific skills you need

By now, most companies have a decent understanding of the technical gen AI skills they need, such as model fine-tuning, vector database administration, prompt engineering, and context engineering. In many cases, these are skills that you can train your existing workforce to develop. Those with existing AI and machine learning (ML) capabilities have a strong head start. Data engineers, for example, can learn multimodal processing and vector database management, MLOps (ML operations) engineers can extend their skills to LLMOps (LLM operations), and data scientists can develop prompt engineering, bias detection, and fine-tuning skills.

A sample of new generative AI skills needed

The following are examples of new skills needed for the successful deployment of generative AI tools:

  • data scientist:
  • prompt engineering
  • in-context learning
  • bias detection
  • pattern identification
  • reinforcement learning from human feedback
  • hyperparameter/large language model fine-tuning; transfer learning
  • data engineer:
  • data wrangling and data warehousing
  • data pipeline construction
  • multimodal processing
  • vector database management

The learning process can take two to three months to get to a decent level of competence because of the complexities in learning what various LLMs can and can’t do and how best to use them. The coders need to gain experience building software, testing, and validating answers, for example. It took one financial-services company three months to train its best data scientists to a high level of competence. While courses and documentation are available—many LLM providers have boot camps for developers—we have found that the most effective way to build capabilities at scale is through apprenticeship, training people to then train others, and building communities of practitioners. Rotating experts through teams to train others, scheduling regular sessions for people to share learnings, and hosting biweekly documentation review sessions are practices that have proven successful in building communities of practitioners (see sidebar “A sample of new generative AI skills needed”).

It’s important to bear in mind that successful gen AI skills are about more than coding proficiency. Our experience in developing our own gen AI platform, Lilli , showed us that the best gen AI technical talent has design skills to uncover where to focus solutions, contextual understanding to ensure the most relevant and high-quality answers are generated, collaboration skills to work well with knowledge experts (to test and validate answers and develop an appropriate curation approach), strong forensic skills to figure out causes of breakdowns (is the issue the data, the interpretation of the user’s intent, the quality of metadata on embeddings, or something else?), and anticipation skills to conceive of and plan for possible outcomes and to put the right kind of tracking into their code. A pure coder who doesn’t intrinsically have these skills may not be as useful a team member.

While current upskilling is largely based on a “learn on the job” approach, we see a rapid market emerging for people who have learned these skills over the past year, and that skill growth is moving quickly. GitHub reported that developers were working on gen AI projects “in big numbers,” and that 65,000 public gen AI projects were created on its platform in 2023, a jump of almost 250 percent over the previous year. If your company is just starting its gen AI journey, you could consider hiring two or three senior engineers who have built a gen AI shaper product for their companies. This could greatly accelerate your efforts.

Form a centralized team to establish standards that enable responsible scaling

To ensure that all parts of the business can scale gen AI capabilities, centralizing competencies is a natural first move. The critical focus for this central team will be to develop and put in place protocols and standards to support scale, ensuring that teams can access models while also minimizing risk and containing costs. The team’s work could include, for example, procuring models and prescribing ways to access them, developing standards for data readiness, setting up approved prompt libraries, and allocating resources.

While developing Lilli, our team had its mind on scale, creating an open plug-in architecture and setting standards for how APIs should function and be built. They developed standardized tooling and infrastructure where teams could securely experiment and access a GPT LLM, a gateway with preapproved APIs that teams could access, and a self-serve developer portal. Our goal is that this approach, over time, can help shift “Lilli as a product” (which a handful of teams use to build specific solutions) toward “Lilli as a platform” (which teams across the enterprise can access to build other products).

For teams developing gen AI solutions, squad composition will be similar to AI teams but with data engineers and data scientists with gen AI experience and more contributors from risk management, compliance, and legal functions. The general idea of staffing squads with resources that are federated from the different expertise areas will not change, but the skill composition of a gen-AI-intensive squad will.

Set up the technology architecture to scale

Building a gen AI model is often relatively straightforward, but making it fully operational at scale is a different matter entirely. We’ve seen engineers build a basic chatbot in a week, but releasing a stable, accurate, and compliant version that scales can take four months. That’s why, our experience shows, the actual model costs may be less than 10 to 15 percent of the total costs of the solution.

Building for scale doesn’t mean building a new technology architecture. But it does mean focusing on a few core decisions that simplify and speed up processes without breaking the bank. Three such decisions stand out:

  • Focus on reusing your technology. Reusing code can increase the development speed of gen AI use cases by 30 to 50 percent. One good approach is simply creating a source for approved tools, code, and components. A financial-services company, for example, created a library of production-grade tools, approved by both the security and legal teams, and made it available for teams to use. More important is taking the time to identify and build the capabilities that are common across the highest-priority use cases. The same financial-services company, for example, identified three components that could be reused for more than 100 identified use cases. By building those first, it was able to generate a significant portion of the code base for all the identified use cases, essentially giving every application a big head start.
  • Focus the architecture on enabling efficient connections between gen AI models and internal systems. For gen AI models to work effectively in the shaper archetype, they need access to a business’s data and applications. Advances in integration and orchestration frameworks have significantly reduced the effort required to make those connections. But laying out what those integrations are and how to enable them is critical to ensure these models work efficiently and to avoid the complexity that creates technical debt  (the “tax” a company pays in terms of time and resources needed to redress existing technology issues). Chief information officers and chief technology officers can define reference architectures and integration standards for their organizations. Key elements should include a model hub, which contains trained and approved models that can be provisioned on demand; standard APIs that act as bridges connecting gen AI models to applications or data; and context management and caching, which speed up processing by providing models with relevant information from enterprise data sources.
  • Build up your testing and quality assurance capabilities. Our own experience building Lilli taught us to prioritize testing over development. Our team invested in not only developing testing protocols for each stage of development but also aligning the entire team so that, for example, it was clear who specifically needed to sign off on each stage of the process. This slowed down initial development but sped up the overall delivery pace and quality by cutting back on errors and the time needed to fix mistakes.

Ensure data quality and focus on unstructured data to fuel your models

The ability of a business to generate and scale value from gen AI models will depend on how well it takes advantage of its own data. As with technology, targeted upgrades to existing data architecture  are needed to maximize the future strategic benefits of gen AI:

  • Be targeted in ramping up your data quality and data augmentation efforts. While data quality has always been an important issue, the scale and scope of data that gen AI models can use—especially unstructured data—has made this issue much more consequential. For this reason, it’s critical to get the data foundations right, from clarifying decision rights to defining clear data processes to establishing taxonomies so models can access the data they need. The companies that do this well tie their data quality and augmentation efforts to the specific AI/gen AI application and use case—you don’t need this data foundation to extend to every corner of the enterprise. This could mean, for example, developing a new data repository for all equipment specifications and reported issues to better support maintenance copilot applications.
  • Understand what value is locked into your unstructured data. Most organizations have traditionally focused their data efforts on structured data (values that can be organized in tables, such as prices and features). But the real value from LLMs comes from their ability to work with unstructured data (for example, PowerPoint slides, videos, and text). Companies can map out which unstructured data sources are most valuable and establish metadata tagging standards so models can process the data and teams can find what they need (tagging is particularly important to help companies remove data from models as well, if necessary). Be creative in thinking about data opportunities. Some companies, for example, are interviewing senior employees as they retire and feeding that captured institutional knowledge into an LLM to help improve their copilot performance.
  • Optimize to lower costs at scale. There is often as much as a tenfold difference between what companies pay for data and what they could be paying if they optimized their data infrastructure and underlying costs. This issue often stems from companies scaling their proofs of concept without optimizing their data approach. Two costs generally stand out. One is storage costs arising from companies uploading terabytes of data into the cloud and wanting that data available 24/7. In practice, companies rarely need more than 10 percent of their data to have that level of availability, and accessing the rest over a 24- or 48-hour period is a much cheaper option. The other costs relate to computation with models that require on-call access to thousands of processors to run. This is especially the case when companies are building their own models (the maker archetype) but also when they are using pretrained models and running them with their own data and use cases (the shaper archetype). Companies could take a close look at how they can optimize computation costs on cloud platforms—for instance, putting some models in a queue to run when processors aren’t being used (such as when Americans go to bed and consumption of computing services like Netflix decreases) is a much cheaper option.

Build trust and reusability to drive adoption and scale

Because many people have concerns about gen AI, the bar on explaining how these tools work is much higher than for most solutions. People who use the tools want to know how they work, not just what they do. So it’s important to invest extra time and money to build trust by ensuring model accuracy and making it easy to check answers.

One insurance company, for example, created a gen AI tool to help manage claims. As part of the tool, it listed all the guardrails that had been put in place, and for each answer provided a link to the sentence or page of the relevant policy documents. The company also used an LLM to generate many variations of the same question to ensure answer consistency. These steps, among others, were critical to helping end users build trust in the tool.

Part of the training for maintenance teams using a gen AI tool should be to help them understand the limitations of models and how best to get the right answers. That includes teaching workers strategies to get to the best answer as fast as possible by starting with broad questions then narrowing them down. This provides the model with more context, and it also helps remove any bias of the people who might think they know the answer already. Having model interfaces that look and feel the same as existing tools also helps users feel less pressured to learn something new each time a new application is introduced.

Getting to scale means that businesses will need to stop building one-off solutions that are hard to use for other similar use cases. One global energy and materials company, for example, has established ease of reuse as a key requirement for all gen AI models, and has found in early iterations that 50 to 60 percent of its components can be reused. This means setting standards for developing gen AI assets (for example, prompts and context) that can be easily reused for other cases.

While many of the risk issues relating to gen AI are evolutions of discussions that were already brewing—for instance, data privacy, security, bias risk, job displacement, and intellectual property protection—gen AI has greatly expanded that risk landscape. Just 21 percent of companies reporting AI adoption say they have established policies governing employees’ use of gen AI technologies.

Similarly, a set of tests for AI/gen AI solutions should be established to demonstrate that data privacy, debiasing, and intellectual property protection are respected. Some organizations, in fact, are proposing to release models accompanied with documentation that details their performance characteristics. Documenting your decisions and rationales can be particularly helpful in conversations with regulators.

In some ways, this article is premature—so much is changing that we’ll likely have a profoundly different understanding of gen AI and its capabilities in a year’s time. But the core truths of finding value and driving change will still apply. How well companies have learned those lessons may largely determine how successful they’ll be in capturing that value.

Eric Lamarre

The authors wish to thank Michael Chui, Juan Couto, Ben Ellencweig, Josh Gartner, Bryce Hall, Holger Harreis, Phil Hudelson, Suzana Iacob, Sid Kamath, Neerav Kingsland, Kitti Lakner, Robert Levin, Matej Macak, Lapo Mori, Alex Peluffo, Aldo Rosales, Erik Roth, Abdul Wahab Shaikh, and Stephen Xu for their contributions to this article.

This article was edited by Barr Seitz, an editorial director in the New York office.


  • Open access
  • Published: 18 March 2024

Generalizability of machine learning in predicting antimicrobial resistance in E. coli: a multi-country case study in Africa

  • Mike Nsubuga 1,3,5,6,
  • Ronald Galiwango 1,3,
  • Daudi Jjingo 2,3 &
  • Gerald Mboowa 1,3,4

BMC Genomics, volume 25, Article number: 287 (2024)


Antimicrobial resistance (AMR) remains a significant global health threat, particularly impacting low- and middle-income countries (LMICs). These regions often grapple with limited healthcare resources and access to advanced diagnostic tools. Consequently, there is a pressing need for innovative approaches that can enhance AMR surveillance and management. Machine learning (ML), though underutilized in these settings, presents a promising avenue. This study leverages ML models trained on whole-genome sequencing data from England, where such data is more readily available, to predict AMR in E. coli, targeting key antibiotics such as ciprofloxacin, ampicillin, and cefotaxime. A crucial part of our work involved validating these models on an independent dataset from Africa, specifically from Uganda, Nigeria, and Tanzania, to ascertain their applicability and effectiveness in LMICs.

Model performance varied across antibiotics. The Support Vector Machine excelled in predicting ciprofloxacin resistance (87% accuracy, F1 Score: 0.57), Light Gradient Boosting Machine for cefotaxime (92% accuracy, F1 Score: 0.42), and Gradient Boosting for ampicillin (58% accuracy, F1 Score: 0.66). In validation with data from Africa, Logistic Regression showed high accuracy for ampicillin (94%, F1 Score: 0.97), while Random Forest and Light Gradient Boosting Machine were effective for ciprofloxacin (50% accuracy, F1 Score: 0.56) and cefotaxime (45% accuracy, F1 Score: 0.54), respectively. Key mutations associated with AMR were identified for these antibiotics.

As the threat of AMR continues to rise, the successful application of these models, particularly on genomic datasets from LMICs, signals a promising avenue for improving AMR prediction to support large AMR surveillance programs. This work thus not only expands our current understanding of the genetic underpinnings of AMR but also provides a robust methodological framework that can guide future research and applications in the fight against AMR.

Antimicrobial resistance (AMR) is a pressing global health challenge that threatens human and animal well-being [ 1 ]. Recognized as a priority by the World Health Organization (WHO) and the United Nations General Assembly [ 2 ], AMR’s unchecked proliferation could lead to catastrophic consequences, with Africa alone projected to account for millions of annual deaths by 2050 [ 3 ]. In 2019, reports showed that AMR all-age death rates were highest in some low- and middle-income countries (LMICs), making AMR not only a major health problem globally but a particularly serious problem for some of the poorest countries in the world [ 4 ].

The WHO launched the Global Antimicrobial Resistance and Use Surveillance System (GLASS) to enhance the AMR evidence base for priority pathogens, including Escherichia coli, Klebsiella pneumoniae, Acinetobacter baumannii, Staphylococcus aureus, Streptococcus pneumoniae, Salmonella spp., and others. While capacities for antimicrobial susceptibility testing (AST) exist across Africa, they are unevenly distributed and often limited in scope, particularly in LMICs. The COVID-19 pandemic has, however, catalyzed the broader adoption of Next-Generation Sequencing (NGS) platforms in Africa, now increasingly available to support a range of disease surveillance programs, including AMR. This technological advance offers a valuable complement to traditional AST methods, although the distribution and accessibility of NGS capabilities remain variable across the continent.

In Uganda, available data indicates concerning levels of drug resistance among E. coli strains (45.62%), with substantial resistance to key antibiotics [ 5 ]. Similarly, in Tanzania and Nigeria, studies have highlighted the growing challenge of AMR, reflecting patterns of resistance that may differ from other regions, thereby necessitating localized surveillance and tailored predictive models [ 6 ]. These countries exemplify the diverse AMR landscape across Africa and underscore the need for enhanced detection methods and strengthened diagnostic programs [ 7 , 8 , 9 ].

Overall, the increasing availability of whole-genome sequence (WGS) data in dedicated databases, exemplified by tools like CARD and Resfinder, has facilitated the identification of antibiotic resistance determinants [ 10 , 11 ]. Existing approaches for detecting AMR from microbial whole-genome sequence data, such as rule-based models relying on identifying causal genes in databases, have high accuracy for some common pathogens but are limited in detecting resistance caused by unknown mechanisms in other major pathogenic strains. Machine learning techniques, including random forest, support vector machines, and neural networks have shown great promise in predicting antimicrobial resistance [ 12 ]. These methods excel in capturing complex patterns within large datasets and can directly learn valuable features from genomic sequence data without relying on assumptions about the underlying mechanisms of AMR. Previous studies using machine learning have demonstrated success in predicting AMR and pathogen invasiveness from genomic sequences [ 13 , 14 , 15 , 16 , 17 , 18 ]. Despite this potential, the application of machine learning for AMR prediction has not been widely explored in LMICs, often due to data scarcity and the underrepresentation of AMR genetic determinants within reference databases [ 19 ].

To bridge this gap, we adopted a cross-continental approach, training machine learning models on data from England and validating them on datasets from Uganda, Tanzania and Nigeria. This strategy aimed to evaluate the efficacy of machine learning in predicting AMR for E. coli and assess the models’ generalizability across diverse African settings and datasets. By leveraging microbial genomic data and advanced machine learning techniques, this study endeavored to enhance the accuracy and efficiency of AMR prediction, thus contributing significantly to the global battle against AMR. This comprehensive analysis provides crucial insights into the practical implementation and scalability of AMR prediction strategies, especially in LMICs where genomic data is limited and the burden of AMR is disproportionately high.

Study design

This was a cross-sectional study utilizing previously collected data to explore associations between predictors and outcomes.

Sample size

In this study, two datasets of E. coli strains, referred to as the Africa dataset and the England dataset, were used.

Data description

The study focused on three antibiotics: ciprofloxacin (CIP), ampicillin (AMP), and cefotaxime (CTX). Each represents a different class of antibiotics (fluoroquinolones, penicillins, and cephalosporins, respectively). They are broad-spectrum antibiotics with activity against numerous Gram-positive and Gram-negative bacteria, including E. coli. These drugs were selected based on their increasing prevalence of resistance as reported in the GLASS report [ 5 ]. In addition, data on resistance to these drugs was available in the study datasets, making them an ideal choice for the study. The study utilized data from one of the largest complete E. coli datasets already available online from the National Center for Biotechnology Information, eliminating the need for additional data collection efforts. We categorised the data into two primary datasets:

England dataset

Comprising 1,496 samples for ciprofloxacin; 1,428 for cefotaxime; and 1,396 for ampicillin. The dataset was collected from England and consisted of WGS of 1,509 E. coli strains and corresponding phenotypic information [ 20 ]. This data was collected by the British Society for Antimicrobial Chemotherapy and the Cambridge University Hospitals NHS Foundation Trust as part of a longitudinal survey of E. coli to contextualize the ST131 lineage within a broader E. coli population. The data was made publicly available by the Wellcome Trust Sanger Institute (Accession: PRJEB4681).

Africa dataset

Comprising data from Uganda, Tanzania, and Nigeria. The first Africa dataset consisted of samples collected from pastoralist communities of Western Uganda to study phylogenomic changes, virulence, and resistance genes. It contained WGS data for a total of 42 E. coli strains [ 21 ], isolated from stool samples from both humans (n = 30) and cattle (n = 12) collected between January 2018 and March 2019. WGS was carried out at the facilities of the Kenya Medical Research Institute-Wellcome Trust, Kilifi. The data is made publicly available by the author in a repository (DOI https://doi.org/10.17605/OSF.IO/KPHRD ). The second Africa dataset consisted of 73 samples collected from Uganda (n = 40) and Tanzania (n = 33) in a study unravelling virulence determinants in extended-spectrum beta-lactamase-producing E. coli from East Africa using WGS [ 22 ]. The third dataset consisted of 68 samples collected from Nigeria as part of a study of WGS data from E. coli isolates from South-West Nigeria hospitals [ 23 ] (Table 1).

The samples that had not been screened for AST were removed from the dataset.

Variant calling of whole-genome sequencing data

The raw WGS paired-end reads were first quality checked and filtered by fastp 0.23.4 using its default parameters: adapter detection and trimming, sliding-window quality filtering with a threshold of Q20, end trimming of low-quality bases, and removal of reads shorter than 15 bp post-trimming [ 24 ]. The filtered reads were aligned to the E. coli K-12 substr. MG1655 U00096.3 complete genome using the Burrows-Wheeler Aligner mem algorithm (0.7.17-r1188) with a default seed length of 19, bandwidth of 100, and off-diagonal X-dropoff of 100 [ 25 ]. BCFtools 1.18 was used to call variants with a minimum depth of coverage of 10x and an allelic frequency of 0.9 [ 26 ]. SAMtools 1.18 was used to sort the aligned reads, and BCFtools 1.18 was used to filter the raw variants applying default filtering thresholds, including a minimum read depth of 2 and SNP quality of 20 [ 27 ]. The entire bioinformatics workflow was executed on the Open Science Grid High Throughput Computing infrastructure [ 28 , 29 ].

SNPs pre-processing and encoding

We employed a previously established methodology for constructing the SNP matrix from the VCF files. First, the reference alleles, variant alleles, and their positions were extracted from the VCF files and merged with the isolates based on the position of the reference alleles. A SNP matrix was built in which rows represented samples and columns represented variant alleles [ 15 ]. The SNPs were converted from characters to numbers through label encoding, where A, C, G, T in the SNP matrix were converted to 1, 2, 3, 4 (Fig.  1 ). It is acknowledged that certain machine learning models could misconstrue these as ordinal values; however, based on previous studies demonstrating minimal performance differences between label, one-hot, and Frequency Chaos Game Representation encoding methodologies [ 15 ], label encoding was selected for its computational efficiency in handling large genomic datasets. Missing values encoded as N were converted to 0. Gene positions with more than 90% null values were removed, and the remaining positions were selected for machine learning. The antibiotic phenotypes were encoded as binary values: ‘S’ for susceptible was mapped to 0, and ‘R’ for resistant to 1.
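The encoding steps above can be sketched as follows. This is a minimal illustration with a toy SNP matrix and made-up sample values, not the study's actual pipeline; the variable names and the 90% threshold are taken from the description above.

```python
import numpy as np

# Toy SNP matrix: rows = samples, columns = variant positions.
# "N" marks a missing call, as in the VCF-derived matrix described above.
snps = np.array([
    ["A", "C", "N", "T"],
    ["A", "G", "N", "T"],
    ["N", "C", "N", "G"],
])

# Label encoding: A, C, G, T -> 1, 2, 3, 4; missing values (N) -> 0.
encoding = {"N": 0, "A": 1, "C": 2, "G": 3, "T": 4}
encoded = np.vectorize(encoding.get)(snps)

# Drop positions where more than 90% of samples are missing (encoded as 0).
missing_frac = (encoded == 0).mean(axis=0)
kept = encoded[:, missing_frac <= 0.9]

# Phenotypes: 'S' (susceptible) -> 0, 'R' (resistant) -> 1.
phenotypes = np.array(["S", "R", "R"])
y = (phenotypes == "R").astype(int)
```

In this toy example the third position is missing in every sample, so it is dropped, leaving a 3x3 matrix for modeling.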

Figure 1

Illustration of the preprocessing and encoding process of the SNPs. Created with Biorender.com

Machine learning

We trained eight machine learning algorithms, each selected for its strengths in predictive modeling. Models were trained separately for each antibiotic, one antibiotic at a time, to ensure the specificity of the predictions. Logistic Regression (LR) provided a baseline for binary classification, while Random Forest (RF) and Gradient Boosting (GB) were chosen for their effectiveness with high-dimensional data and intricate feature relationships. Support Vector Machines (SVM) were implemented with a sigmoid kernel, with hyperparameters tuned to C = 9.795846277645586 and gamma = 'auto'. Feed-Forward Neural Networks (FFNNs), built with Keras 2.12.0, consisted of an input layer with 64 neurons, a hidden layer with 32 neurons, and an output layer with one neuron, trained with binary cross-entropy loss and the Adam optimizer for 20 epochs with a batch size of 32; hyperparameter tuning was used to refine this configuration. XGBoost (XGB, xgboost 1.7.6), LightGBM (LGB, lightgbm 4.1.0), and CatBoost (CB, catboost 1.2.2) were implemented with default parameters, leveraging their efficiency on large-scale data.
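A condensed sketch of this per-antibiotic model zoo is shown below. Only the scikit-learn members are instantiated here; the gradient-boosting libraries and the Keras FFNN are indicated in comments, since their exact construction follows the library versions named above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# SVM hyperparameters follow the tuned values quoted in the text.
models = {
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(),
    "GB":  GradientBoostingClassifier(),
    "SVM": SVC(kernel="sigmoid", C=9.795846277645586, gamma="auto",
               probability=True),  # probability=True enables ROC-AUC scoring
}

# Also trained (omitted here for brevity):
#   XGBClassifier(), LGBMClassifier(), CatBoostClassifier()  -- default params
#   Keras FFNN: Dense(64) -> Dense(32) -> Dense(1, 'sigmoid'),
#               binary cross-entropy, Adam, 20 epochs, batch size 32
```

Each entry in `models` would be fitted once per antibiotic on the encoded SNP matrix.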

All models were implemented using Scikit-learn version 1.3.2, except for FFNNs which were implemented in Keras. Hyperparameter tuning was conducted for SVM and FFNNs using scikit-learn’s RandomizedSearchCV, which helped identify the most effective configurations for these models. The training was performed on both originally imbalanced and balanced datasets. For balancing, a simple random down-sampling approach was employed to reduce the majority class, enabling us to assess the impact of class distribution on model performance.
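The random down-sampling used for balancing can be sketched as a small helper; the toy data (8 susceptible, 2 resistant samples) is illustrative.

```python
import numpy as np

def downsample(X, y, seed=0):
    """Randomly down-sample the majority class to the minority class size.
    Minimal sketch of the balancing strategy described in the text."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    minority, majority = sorted((idx0, idx1), key=len)
    keep = np.concatenate([minority,
                           rng.choice(majority, size=len(minority),
                                      replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)       # 8 susceptible, 2 resistant
X_bal, y_bal = downsample(X, y)       # 2 samples of each class remain
```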

This comprehensive approach, involving diverse algorithms and hyperparameter tuning, allowed for an exhaustive evaluation of predictive models in the detection of AMR, under varied dataset conditions.

Statistical evaluation

The machine learning models were optimized using five repetitions of 5-fold stratified cross-validation. For the final evaluation on the data from Africa, performance was analyzed on both the raw public dataset and a balanced set produced by down-sampling. The models were evaluated using the receiver operating characteristic (ROC) curve and the area under the curve (AUC); precision, recall, F1 score, and accuracy were also calculated for all models. To determine the statistical significance of differences in AUC scores between models, we employed Tukey's Honestly Significant Difference (HSD) test [30], which compares all pairs of models in a family without inflating the Type I error rate that multiple comparisons would otherwise induce. The significance threshold was set at α = 0.05: differences with p-values below this threshold were considered statistically significant. The pairwise comparisons were conducted using statsmodels 0.14.0.
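The pairwise comparison step can be sketched as follows. The study used statsmodels 0.14.0; for a self-contained example we use `scipy.stats.tukey_hsd`, which implements the same test, and the per-fold AUC arrays are synthetic stand-ins for the 5×5-fold results.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic per-fold AUC scores (25 folds per model) standing in for the
# repeated cross-validation results.
rng = np.random.default_rng(0)
auc = {
    "FFNN": rng.normal(0.83, 0.02, 25),
    "SVM":  rng.normal(0.80, 0.02, 25),
    "RF":   rng.normal(0.70, 0.02, 25),
}

# Tukey's HSD compares all model pairs while controlling family-wise error.
res = tukey_hsd(*auc.values())

# res.pvalue[i, j] < 0.05 means models i and j differ significantly.
ffnn_vs_rf_significant = res.pvalue[0, 2] < 0.05
```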

Identification of genes

To identify the top 10 most important features for each model, the method for calculating feature importance varied by model. For tree-based models (Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost) we used the feature_importances_ attribute, which quantifies the contribution of each feature to the model's predictions. For Logistic Regression, feature importance was derived from the absolute values of the coefficients. The SVM model used the coefficients' absolute values for linear kernels and SelectKBest with the chi-squared (chi2) score for non-linear kernels. For the Keras neural network, we averaged the absolute values of the first-layer weights as a measure of each feature's relevance. The gene annotations corresponding to the identified SNPs were then extracted from the reference genome, and the functional roles of these genes were examined for their potential contribution to antibiotic resistance mechanisms in E. coli (Fig. 2).
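The ranking step for two of the model families can be sketched on synthetic data; the `positions` array stands in for the SNP genome positions that label the matrix columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded SNP matrix and phenotypes.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
positions = np.arange(X.shape[1])  # stand-ins for SNP genome positions

rf = RandomForestClassifier(random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Tree ensembles expose feature_importances_; linear models are ranked
# by the absolute value of their coefficients.
top10_rf = positions[np.argsort(rf.feature_importances_)[::-1][:10]]
top10_lr = positions[np.argsort(np.abs(lr.coef_[0]))[::-1][:10]]
```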

Figure 2

Flowchart showing how genes were identified

Performance of machine learning methods in predicting AMR

We assessed the performance of eight machine learning algorithms (LR, RF, SVM, GB, XGB, LGB, CatBoost, and FFNN) in predicting antibiotic resistance in E. coli. Multiple metrics, including accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve, were used for evaluation (Table 2). The models were optimized using 5-fold stratified cross-validation, and confidence intervals were recorded (Supplementary Material 1). Tukey's Honestly Significant Difference (HSD) test was employed for pairwise comparisons of AUC scores.

For CIP, we evaluated the models' effectiveness in light of the class imbalance. We applied a random down-sampling strategy but did not observe significant improvements. The FFNN emerged as the top performer with the highest mean AUC score (0.83), while SVM achieved the highest accuracy (0.87). HSD tests revealed significant performance differences between several pairs of models; in particular, RF differed significantly from all other models (p < 0.001).

For AMP, SVM achieved the highest mean AUC score (0.72). GB had the highest F1 score and precision, while CatBoost and SVM had the highest recall scores.

For CTX, FFNN stood out with the highest mean AUC score (0.72), while SVM recorded the highest accuracy (0.92). The Random Forest model excelled in precision, and Logistic Regression had the highest F1 score (0.42) (Fig. 3).

Figure 3

Performance of different machine learning methods for predicting AMR on England microbial sequence data

Evaluation of the machine learning models on the Africa data

We assessed the generalizability of our machine learning models on an external dataset from Uganda, Nigeria and Tanzania, consisting of up to 170 samples with a severe class imbalance. Performance metrics for each model on this dataset are shown in Table 3.

In the external validation on the African dataset, the class imbalance presented varied challenges across antibiotics. For CIP, the Logistic Regression model exhibited an accuracy of 0.55 and a precision of 0.59, but a recall of only 0.16. The RF model achieved an accuracy of 0.50 and an AUC-ROC of 0.53. SVM displayed an accuracy of 0.50, while GB showed an accuracy of 0.52 and a recall of 0.32. XGB had an accuracy of 0.57, and both LGB and CatBoost had accuracies just above 0.55, with CatBoost also attaining an AUC-ROC of 0.58. The FFNN model did not identify any true positives.

For AMP, LR achieved an accuracy of 0.94 and near-perfect recall. RF had a precision of 0.93 but a much lower accuracy of 0.38. SVM's performance was close to that of LR, with high accuracy and recall but a slightly lower AUC-ROC of 0.57. GB, XGB, LGB, and CatBoost demonstrated solid accuracy and precision, albeit with varying AUC-ROC scores. The FFNN model's accuracy was 0.05.

Regarding CTX, LR recorded an AUC-ROC of 0.39, RF exhibited high precision but low recall, and SVM had a precision of 1.00. GB had the highest AUC-ROC score (0.57) but an accuracy of only 0.22. XGB and LGB showed higher accuracy and recall, with LGB achieving the highest recall (0.38). The FFNN model again identified no true positives (Fig. 4).

Figure 4

Performance of different machine learning methods for predicting AMR on Africa microbial sequence data

Marker genes associated with antibiotic resistance

A crucial part of machine learning in the genomic field is to interpret the model’s results. In our case, the analysis of feature importance and interactions provided insights into which genetic mutations are most influential in predicting antibiotic resistance. For each model, we identified the top 10 features (SNP positions) with the highest importance scores, which reflect their contribution to the accuracy of the model’s predictions.

For instance, in the Logistic Regression model for CIP, the mutation at position '3589009' had the highest importance score, followed by '4040529' and '1473047'. These positions potentially have a substantial impact on antibiotic resistance, as mutations in these regions may render the bacteria resistant to specific antibiotics. The exact biological mechanism can be complex, involving changes in a gene's protein product that make an antibiotic ineffective (Table 4).

The models used different ways to calculate these importance scores, which is why they differ between models. Still, positions that are consistently high across different models can be a strong indicator of their significance in conferring antibiotic resistance.

Gene annotation

The identification of SNPs associated with antibiotic resistance can shed light on the underlying genetic mechanisms that contribute to drug resistance in E. coli (Table 5). By analyzing the top SNPs from each predictive model, we identified key marker genes that potentially play a role in antibiotic resistance. For CIP, SNPs were identified in rlmL, yehB, rrfA, yciQ, and ygjK. For AMP, the implicated genes include rcsD, yjfI, tdcE, ugpB, ugpQ, and ggt. For CTX, SNPs were found in ydbA, mltB, lomR, mppA, recD, and glyS. The SNPs identified in these genes underscore the complex and multifactorial nature of antibiotic resistance in E. coli: a variety of biological processes, including membrane transport, rRNA methylation, DNA repair, and cell wall synthesis, are potentially implicated in the development of resistance. Further experimental validation of these marker genes is warranted to confirm their roles in antibiotic resistance.

This study explored the generalizability of machine learning models in predicting AMR in E. coli, using datasets from England and multiple African countries. While the models showed promise on the England dataset, applying them to the highly imbalanced African dataset revealed significant challenges. Validation on the African dataset, which had a higher incidence of resistant strains than the England training data, highlighted both the challenges and the potential of such tools: discrepancies in class distribution affected performance measures such as recall and precision, yet the real-world applicability of these models was supported when they successfully predicted resistance across varied datasets.

In the England dataset, models like SVM (Accuracy: 0.87, AUC-ROC: 0.86) and Logistic Regression (AUC-ROC: 0.77) demonstrated effectiveness. However, the transition to the African dataset, characterized by significant class imbalance, presented a stark contrast. For example, the Random Forest model experienced a decline in accuracy from 0.75 for CIP in the England dataset to 0.50 in the African dataset.

The performance of the models on the African dataset, particularly in terms of recall, highlights potential overfitting to the England dataset and the need for more generalizable models. The disparity in class distribution between the datasets—where the England dataset had a higher proportion of susceptible strains and the African dataset had a higher proportion of resistant strains—presented both challenges and opportunities.

A notable observation in this study is the impressive performance of the models for predicting ampicillin (AMP) resistance in the African dataset, despite their moderate performance on the England dataset. For AMP, models demonstrated substantial accuracy and recall in the African dataset (e.g., Logistic Regression: Accuracy 0.94, Recall 0.99, F1 0.97), highlighting their effectiveness in identifying true resistance cases. This success may be attributed to the distinct resistance mechanisms of AMP, which were perhaps better captured in the training data, leading to more accurate predictions in the validation dataset, or the data representation of the AMP training dataset which might have contained patterns that were more representative of the resistance seen in the African dataset.

Moreover, the process of down-sampling the England dataset for training, while fostering a balanced environment, did not uniformly enhance model performance. While down-sampled models showed a slight improvement for AMP, indicating that down-sampling might enhance the model’s sensitivity to specific resistance patterns associated with AMP, this effect was not as pronounced for CIP and CTX.

The identification of SNPs associated with antibiotic resistance can illuminate the genetic mechanisms driving drug resistance in E. coli. By analyzing the top SNPs from each predictive model, we identified key marker genes potentially involved in antibiotic resistance. For CIP, SNPs were identified in the following genes: ugpC, rlmL, yciQ, ygjK, yehB, rrfA, ytfB, and yjjW. These genes encode various bacterial functions. For instance, ugpC is part of the glycerol-3-phosphate (G3P) transport system implicated in phospholipid biosynthesis, and RlmL is an enzyme involved in the methylation of ribosomal RNA (rRNA). mdtC is a component of multidrug efflux pump systems that can contribute to antibiotic resistance by actively pumping antibiotics out of bacterial cells. It is important to note that while machine learning can highlight these genes as candidates, experimental validation is essential to confirm their roles in antibiotic resistance.

Implications and applications

While our research concentrated primarily on three specific antibiotics, the methodology we’ve developed is versatile and readily adaptable for investigating other antibiotics and can be extended to resistance-associated SNPs in a variety of pathogens beyond just bacteria. This flexibility allows for a broader scope of study, opening the door for a comprehensive understanding of AMR mechanisms. In addition, the applicability of our approach extends beyond the realm of infectious diseases, holding promise for other branches of biomedical research, such as predicting resistance to cancer treatments by enabling precise targeted therapy.


Limitations

While this study has provided valuable insights into predicting genotypic resistance to ciprofloxacin, ampicillin, and cefotaxime in E. coli strains, it is important to acknowledge several limitations when interpreting the results. First, there is an inherent limitation in focusing exclusively on SNPs as the single genomic factor. Antimicrobial resistance is a complex phenomenon influenced by various genomic drivers, including resistance genes, insertion sequences, plasmids and AMR gene cassettes, which collectively contribute to the intricate landscape of resistance mechanisms. Our study, by concentrating on SNPs, represents a deliberate simplification to ensure depth and clarity in the ML analysis, driven by data quality and the need for clinically interpretable models. However, we recognise that the exclusive emphasis on SNPs may not capture the entirety of the multifaceted interplay within resistance determinants.

Furthermore, the validation of the models on the African data presented some challenges. The availability of whole-genome sequence data from Africa was limited, resulting in a relatively small dataset for model evaluation. Additionally, the African dataset exhibited a high class imbalance, with certain resistance classes significantly underrepresented; this imbalance can introduce bias and affect the performance metrics of the models. Given the uniqueness of our study, traditional benchmarking might not capture these nuanced challenges, and future studies should explore alternative methodologies for a comprehensive evaluation of predictive models in diverse contexts.

Moreover, it is important to highlight that the performance of the models in this study is specific to the context of the datasets used, which may not fully represent the diversity and complexity of AMR patterns observed in other regions or populations. Therefore, caution should be exercised when generalizing the findings to different settings. Despite these limitations, this study provides a valuable foundation for future research and highlights areas for improvement and expansion. Incorporating additional variables, addressing the class imbalance, and expanding the dataset to include a more diverse range of sequences would enhance the robustness and applicability of the models. Overall, while the findings of this study contribute to our understanding of genotypic resistance prediction, it is important to recognize these limitations and consider them in the broader context of AMR research.

In conclusion, our study highlights the complex interplay between data composition, model training approaches, and predictive accuracy in the context of AMR. The impressive performance of models for AMP in the African dataset despite their moderate performance in the England dataset underscores the potential of machine learning in AMR prediction, given appropriate training and validation strategies.

The findings from this study serve as a crucial reminder of the complexities involved in applying machine learning models to predict AMR across diverse settings. It emphasizes the importance of developing robust, adaptable, and generalizable machine learning tools, capable of handling varied data landscapes and resistance mechanisms. Future research should focus on integrating larger and more diverse datasets while exploring innovative methods to maintain a balance between dataset size and class distribution, thus advancing the development of machine learning tools in the global fight against antimicrobial resistance.

As the threat of antimicrobial resistance continues to rise, the successful application of these models, particularly on the African dataset, signals a promising avenue for improving AMR detection and treatment strategies. This work thus not only expands our current understanding of the genetic underpinnings of antibiotic resistance but also provides a robust methodological framework that can guide future research and applications in the fight against antimicrobial resistance.

Code Availability

The source code for data preparation and model training is available on GitHub: https://github.com/KingMike100/mlamr .

Data availability

The data supporting the findings of this study are publicly available from EBI at https://www.ebi.ac.uk/ena/browser/view/PRJEB4681 and from https://osf.io/kphrd/ .


Abbreviations

AMR: Antimicrobial Resistance
AST: Antimicrobial Susceptibility Testing
CNN: Convolutional Neural Network
CRyPTIC: Comprehensive Resistance Prediction for Tuberculosis: an International Consortium
DNA: Deoxyribonucleic Acid
GLASS: World Health Organization Global Antimicrobial Resistance and Use Surveillance System
LR: Logistic Regression
ML: Machine Learning
NCBI: National Centre for Biotechnology Information
PCR: Polymerase Chain Reaction
qPCR: Real-time Polymerase Chain Reaction
REC: Research Ethics Committee
RF: Random Forest
RNA: Ribonucleic Acid
SVM: Support Vector Machine
UNCST: Uganda National Council for Science and Technology
WGS: Whole Genome Sequencing
WHO: World Health Organization

Laxminarayan R, Duse A, Wattal C, et al. Antibiotic resistance-the need for global solutions. Lancet Infect Dis. 2013;13:1057–98.


United Nations High Commissioner for Refugees. Transforming our world: the 2030 Agenda for Sustainable Development. Refworld, https://www.refworld.org/docid/57b6e3e44.html (accessed 27 September 2023).

Review on Antimicrobial Resistance. Tackling drug-resistant infections globally: final report and recommendations. 2016. https://amr-review.org/sites/default/files/160518_Final%20paper_with%20cover.pdf (accessed 27 September 2023).

Murray CJL, Ikuta KS, Sharara F, et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet. 2022;399:629–55.


Nabadda S, Kakooza F, Kiggundu R, et al. Implementation of the World Health Organization Global Antimicrobial Resistance Surveillance System in Uganda, 2015–2020: mixed-methods study using National Surveillance Data. JMIR Public Health Surveill. 2021;7:e29954.


Ingle DJ, Levine MM, Kotloff KL, et al. Dynamics of antimicrobial resistance in intestinal Escherichia coli from children in community settings in South Asia and sub-saharan Africa. Nat Microbiol. 2018;3:1063–73.


NIHR Global Health Research Unit on Genomic Surveillance of AMR. Whole-genome sequencing as part of national and international surveillance programmes for antimicrobial resistance: a roadmap. BMJ Glob Health. 2020;5:e002244.

Katyali D, Kawau G, Blomberg B, et al. Antibiotic use at a tertiary hospital in Tanzania: findings from a point prevalence survey. Antimicrob Resist Infect Control. 2023;12:112.

Achi CR, Ayobami O, Mark G, et al. Operationalising One Health in Nigeria: reflections from a High-Level Expert Panel discussion commemorating the 2020 World Antibiotics Awareness Week. Front Public Health. 2021;9:673504.

Zankari E, Hasman H, Cosentino S, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012;67:2640–4.

McArthur AG, Waglechner N, Nizam F, et al. The comprehensive antibiotic resistance database. Antimicrob Agents Chemother. 2013;57:3348–57.

Her H-L, Wu Y-W. A pan-genome-based machine learning approach for predicting antimicrobial resistance activities of the Escherichia coli strains. Bioinformatics. 2018;34:i89–95.

Wheeler NE, Gardner PP, Barquist L. Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica. PLoS Genet. 2018;14:e1007333.

Rahman SF, Olm MR, Morowitz MJ, et al. Machine learning leveraging genomes from metagenomes identifies influential antibiotic resistance genes in the infant gut microbiome. mSystems. 2018;3:e00123-17. https://doi.org/10.1128/mSystems.00123-17 .

Ren Y, Chakraborty T, Doijad S, et al. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics. 2022;38:325–34.


Pesesky MW, Hussain T, Wallace MA, et al. Evaluation of machine learning and rules-based approaches for predicting antimicrobial resistance profiles in Gram-negative bacilli from whole genome sequence data. Front Microbiol. 2016;7:1887. https://doi.org/10.3389/fmicb.2016.01887 .

Nguyen M, Brettin T, Long SW, et al. Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Sci Rep. 2018;8:421. https://doi.org/10.1038/s41598-017-18972-w .

Antonopoulos DA, Assaf R, Aziz RK, et al. PATRIC as a unique resource for studying antimicrobial resistance. Brief Bioinform. 2019;20:1094–102.

Onywera H, Ondoa P, Nfii F, et al. Boosting pathogen genomics and bioinformatics workforce in Africa. Lancet Infect Dis. 2024;24:e106–12.

Kallonen T, Brodrick HJ, Harris SR, et al. Systematic longitudinal survey of invasive Escherichia coli in England demonstrates a stable population structure only transiently disturbed by the emergence of ST131. Genome Res. 2017;27:1437–49.

Stanley IJ, Kajumbula H, Bazira J, et al. Multidrug resistance among Escherichia coli and Klebsiella pneumoniae carried in the gut of out-patients from pastoralist communities of Kasese district, Uganda. PLoS ONE. 2018;13:e0200093.

Sserwadda I, Kidenya BR, Kanyerezi S, et al. Unraveling virulence determinants in extended-spectrum beta-lactamase-producing Escherichia coli from East Africa using whole-genome sequencing. BMC Infect Dis. 2023;23:587.

Afolayan AO, Aboderin AO, Oaikhena AO, et al. An ST131 clade and a phylogroup a clade bearing an O101-like O-antigen cluster predominate among bloodstream Escherichia coli isolates from South-West Nigeria hospitals. Microb Genomics. 2022;8:000863.

Chen S, Zhou Y, Chen Y, et al. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90.

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.

Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10:giab008.

Li H, Handsaker B, Wysoker A, et al. The sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.

Pordes R, Petravick D, Kramer B, et al. The open science grid. J Phys Conf Ser. 2007;78:012057.


Sfiligoi I, Bradley DC, Holzman B et al. The Pilot Way to Grid Resources Using glideinWMS. In: 2009 WRI World Congress on Computer Science and Information Engineering , pp. 428–432.

5-6-32-656.pdf, https://www.mathsjournal.com/pdf/2021/vol6issue1/PartA/5-6-32-656.pdf (accessed 27 September 2023).



Acknowledgements

The first author MN was funded by the East African Network for Bioinformatics Training (EANBIT) under the Fogarty International Center at the U.S. National Institutes of Health (NIH), award number U2RTW010677, as a Masters scholar. The authors would also like to acknowledge the Open Science Grid (OSG) consortium, which provided computational resources for this study; the OSG is supported by National Science Foundation award numbers 2030508 and 1836650. G.M. acknowledges the EDCTP2 career development grant supporting the Pathogen detection in HIV-infected children and adolescents with non-malarial febrile illnesses using the metagenomic next-generation sequencing (PHICAMS) approach in Uganda, a project which is part of the EDCTP2 programme from the European Union (grant number TMA2020CDF-3159). He is also supported in part by the US National Institutes of Health, Fogarty International Center (grant number 5U2RTW012116-01). The views and opinions expressed herein do not necessarily state or reflect those of the funders.

Not applicable.

Author information

Authors and Affiliations

Department of Immunology and Molecular Biology, School of Biomedical Sciences, College of Health Sciences, Makerere University, P.O Box 7072, Kampala, Uganda

Mike Nsubuga, Ronald Galiwango & Gerald Mboowa

Department of Computer Science, College of Computing and Information Sciences, Makerere University, P.O Box 7062, Kampala, Uganda

Daudi Jjingo

The African Center of Excellence in Bioinformatics and Data-Intensive Sciences, Infectious Diseases Institute, College of Health Sciences, Makerere University, P.O Box 22418, Kampala, Uganda

Mike Nsubuga, Ronald Galiwango, Daudi Jjingo & Gerald Mboowa

Africa Centres for Disease Control and Prevention, African Union Commission, P.O Box 3243, Roosevelt Street, Addis Ababa, W21 K19, Ethiopia

Gerald Mboowa

Faculty of Health Sciences, University of Bristol, Bristol, BS40 5DU, UK

Mike Nsubuga

Jean Golding Institute, University of Bristol, Bristol, BS8 1UH, UK



MN: Data analysis, Data Interpretation, Writing-Original draft; GM: Conception, Writing-Final draft; DJ: Conception, Writing-Final draft; RG: Data Interpretation, Writing-Final draft. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gerald Mboowa .

Ethics declarations

Ethics approval, consent for publication, competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Nsubuga, M., Galiwango, R., Jjingo, D. et al. Generalizability of machine learning in predicting antimicrobial resistance in E. coli : a multi-country case study in Africa. BMC Genomics 25 , 287 (2024). https://doi.org/10.1186/s12864-024-10214-4

Download citation

Received : 28 September 2023

Accepted : 11 March 2024

Published : 18 March 2024

DOI : https://doi.org/10.1186/s12864-024-10214-4


  • Antimicrobial resistance
  • Whole-genome sequencing

BMC Genomics

ISSN: 1471-2164




Cite report

IEA (2024), Global Methane Tracker 2024 , IEA, Paris https://www.iea.org/reports/global-methane-tracker-2024, Licence: CC BY 4.0

Share this report

  • Share on Twitter Twitter
  • Share on Facebook Facebook
  • Share on LinkedIn LinkedIn
  • Share on Email Email
  • Share on Print Print

Methane emissions from the energy sector remained near a record high in 2023

We estimate that the production and use of fossil fuels resulted in close to 120 million tonnes (Mt) of methane emissions in 2023, while a further 10 Mt came from bioenergy – largely stemming from the traditional use of biomass. Emissions have remained around this level since 2019, when they reached a record high. Since fossil fuel supply has continued to expand since then, this indicates that the average methane intensity of production globally has declined marginally during this period.

The latest IEA Global Methane Tracker is based on the most recently available data on methane emissions from the energy sector and incorporates new scientific studies, measurement campaigns, and information collected from satellites.

Analysis of this data reveals both signs of progress and some worrying trends. On one hand, more governments and fossil fuel companies have committed to take action on methane. Global efforts to report emissions estimates consistently and transparently are strengthening, and studies suggest emissions are falling in some regions. However, overall emissions remain far too high to meet the world’s climate goals. Large methane emissions events detected by satellites also rose by more than 50% in 2023 compared with 2022, with more than 5 Mt of methane emissions detected from major fossil fuel leaks around the world – including a major well blowout in Kazakhstan that went on for more than 200 days. 

Methane emissions from energy, 2000-2023

Close to 70% of methane emissions from fossil fuels come from the top 10 emitting countries.

Of the nearly 120 Mt of emissions we estimate were tied to fossil fuels in 2023, around 80 Mt came from countries that are among the top 10 emitters of methane globally. The United States is the largest emitter of methane from oil and gas operations, closely followed by the Russian Federation (hereafter “Russia”). The People’s Republic of China (hereafter “China”) is by far the highest emitter in the coal sector. The amount of methane lost in fossil fuel operations globally in 2023 was 170 billion cubic metres, more than Qatar’s natural gas production.
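As a quick sanity check on the volume figure above, the ~120 Mt mass estimate can be converted to cubic metres. The density used below (~0.716 kg/m3 at standard conditions) is an assumption, since the report does not state its conversion basis, but the result lands close to the ~170 bcm cited:

```python
# Rough sanity check: convert ~120 Mt of methane lost in fossil fuel
# operations into billion cubic metres (bcm).

METHANE_DENSITY_KG_PER_M3 = 0.716  # assumed density at 0 degC and 1 atm

def methane_mass_to_bcm(mass_mt: float) -> float:
    """Convert a methane mass in million tonnes (Mt) to billion cubic metres."""
    mass_kg = mass_mt * 1e9            # 1 Mt = 1e9 kg
    volume_m3 = mass_kg / METHANE_DENSITY_KG_PER_M3
    return volume_m3 / 1e9             # 1 bcm = 1e9 m3

print(round(methane_mass_to_bcm(120)))  # about 168 bcm, consistent with ~170 bcm
```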

The methane emissions intensity of oil and gas production varies widely: the best-performing countries have an emissions intensity more than 100 times lower than the worst. Norway and the Netherlands have the lowest emissions intensities, and countries in the Middle East, such as Saudi Arabia and the United Arab Emirates, also have relatively low emissions intensities, while Turkmenistan and Venezuela have the highest. High emissions intensities are not inevitable; they can be addressed cost-effectively through a combination of high operational standards, policy action and technology deployment. On all these fronts, best practices are well established.

Methane emissions from oil and gas production and methane intensity for selected producers, 2023

Cutting methane emissions from fossil fuels by 75% by 2030 is vital to limit warming to 1.5 °C.

The energy sector accounts for more than one third of total methane emissions attributable to human activity, and cutting emissions from fossil fuel operations has the most potential for major reductions in the near term. We estimate that around 80 Mt of annual methane emissions from fossil fuels can be avoided through the deployment of known and existing technologies, often at low – or even negative – cost.

In our Net Zero Emissions by 2050 (NZE) Scenario – which sees the global energy sector achieving net zero emissions by mid-century, limiting the temperature rise to 1.5 °C – methane emissions from fossil fuel operations fall by around 75% by 2030. By that year, all fossil fuel producers have an emissions intensity similar to the world’s best operators today. Targeted measures to reduce methane emissions are necessary even as fossil fuel use begins to decline; cutting fossil fuel demand alone is not enough to achieve the deep and sustained reductions needed.

Methane abatement potential to 2030

Main sources of methane emissions

Full implementation of COP28 and other pledges would cut fossil fuel methane emissions by 50%.

The COP28 climate summit in Dubai produced a host of new pledges to accelerate action on methane. Importantly, the outcome of the first Global Stocktake called for countries to substantially reduce methane emissions by 2030. Additionally, more than 50 oil and gas companies launched the Oil and Gas Decarbonization Charter (OGDC) to speed up emissions reductions within the industry, new countries joined the Global Methane Pledge, and new finance was mobilised to support the reduction of methane and greenhouse gases (GHGs) other than carbon dioxide (CO2).

Substantial new policies and regulations on methane were also established or announced in 2023, including by the United States, Canada and the European Union, while China published an action plan dedicated to methane emission control. A series of supportive initiatives has been launched to accompany these efforts, such as the Methane Alert and Response System and the Oil and Gas Climate Initiative's Satellite Monitoring Campaign.

Taken together, we estimate that if all methane policies and pledges made by countries and companies to date are implemented and achieved in full and on time, methane emissions from fossil fuels would decline by around 50% by 2030. However, in most cases, these pledges are not yet backed up by detailed plans, policies and regulations. The detailed methane policies and regulations that currently exist would cut emissions from fossil fuel operations by around 20% from 2023 levels by 2030. The upcoming round of updated Nationally Determined Contributions (NDCs) under the Paris Agreement, which will see countries set climate goals through 2035, presents a major opportunity for governments to set bolder targets on energy-related methane and lay out plans to achieve them.

Reductions in methane emissions from fossil fuel operations from existing policies and pledges, 2020-2030

Around 40% of today’s methane emissions from fossil fuels could be avoided at no net cost.

Methane abatement in the fossil fuel industry is one of the most pragmatic and lowest cost options to reduce greenhouse gas emissions. The technologies and measures to prevent emissions are well known and have already been deployed successfully around the world. Around 40% of the 120 Mt of methane emissions from fossil fuels could be avoided at no net cost, based on average energy prices in 2023. This is because the required outlays for abatement measures are less than the market value of the additional methane gas captured and sold or used. The share is higher for oil and natural gas (50%) than for coal (15%).
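The break-even logic above can be made concrete in a small sketch: a measure pays for itself when its cost per tonne of methane avoided is below the market value of the gas it captures. The energy content and gas price used here are assumptions for illustration, not figures from the report.

```python
# Illustrative check of the "no net cost" condition for a methane abatement
# measure. Both constants below are assumed values, not from the report.

MJ_PER_TONNE_CH4 = 50_000   # assumed ~50 MJ/kg lower heating value of methane
MJ_PER_MMBTU = 1055.06      # unit conversion: 1 MMBtu in megajoules

def captured_gas_value_usd_per_tonne(gas_price_usd_per_mmbtu: float) -> float:
    """Market value of one tonne of captured methane sold as natural gas."""
    return MJ_PER_TONNE_CH4 / MJ_PER_MMBTU * gas_price_usd_per_mmbtu

def pays_for_itself(abatement_cost_usd_per_tonne: float,
                    gas_price_usd_per_mmbtu: float) -> bool:
    """True when the abatement outlay is below the value of the gas captured."""
    return abatement_cost_usd_per_tonne <= captured_gas_value_usd_per_tonne(
        gas_price_usd_per_mmbtu)

# At an assumed gas price of USD 5/MMBtu, a tonne of captured methane is worth
# roughly USD 237, so a measure costing USD 150/tonne is cost-negative.
print(pays_for_itself(150, 5.0))
```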

There are many possible reasons why companies are not deploying these measures even though they pay for themselves. For example, the return on investment for methane abatement projects may be longer than for other investment opportunities. There may also be a lack of awareness regarding the scale of methane emissions and the cost-effectiveness of abatement. Sometimes infrastructure or institutional arrangements are inadequate, making it difficult for companies to receive the income from avoided emissions.

Regardless of the value of captured gas, we estimate that it would be cost-effective to deploy nearly all fossil fuel methane abatement measures if emissions are priced at about USD 20/tonne CO2-equivalent. Tapping into this potential will require new regulatory frameworks, financing mechanisms and improved emissions tracking.
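For reference, a CO2-equivalent price translates to a price per tonne of methane through the global warming potential (GWP). Using a 100-year GWP of roughly 30 for fossil methane (an assumed value, in line with IPCC AR6), the USD 20/tonne CO2-eq threshold corresponds to about USD 600 per tonne of methane:

```python
# Translate a CO2-equivalent carbon price into a price per tonne of methane.
# The GWP value is an assumption (roughly the IPCC AR6 100-year value for
# fossil methane).

GWP100_METHANE = 30  # assumed 100-year global warming potential of methane

def co2eq_price_to_ch4_price(usd_per_tonne_co2eq: float) -> float:
    """Price per tonne of CH4 implied by a price per tonne of CO2-equivalent."""
    return usd_per_tonne_co2eq * GWP100_METHANE

print(co2eq_price_to_ch4_price(20))  # 600.0 USD per tonne of methane
```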

Marginal abatement cost curve for methane from coal, 2023

Marginal abatement cost curve for methane from oil and natural gas operations, 2023

Delivering the 75% cut in methane emissions requires USD 170 billion in spending to 2030.

We estimate that around USD 170 billion in spending is needed to deliver the methane abatement measures deployed by the fossil fuel industry in the NZE Scenario. This includes around USD 100 billion of spending in the oil and gas sector and USD 70 billion in the coal industry. Through 2030, roughly USD 135 billion goes towards capital expenditures, while USD 35 billion is for operational expenditures.
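The spending estimate above is broken down in two ways (by sector and by type of spending), and the two breakdowns should tally. A trivial consistency check, using the figures quoted in the paragraph:

```python
# Consistency check on the two breakdowns of the estimated USD 170 billion in
# methane abatement spending to 2030 in the NZE Scenario (USD billion).
by_sector = {"oil_and_gas": 100, "coal": 70}
by_type = {"capex": 135, "opex": 35}

assert sum(by_sector.values()) == sum(by_type.values()) == 170
print(sum(by_sector.values()))  # 170
```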

Fossil fuel companies should carry the primary responsibility for financing these abatement measures, given that the amount of spending needed represents less than 5% of the income the industry generated in 2023. Nonetheless, we estimate that about USD 45 billion of spending in low- and middle-income countries requires particular attention, as sources of finance are likely to be more limited. To date, we estimate that external sources of finance targeted at reducing methane in the fossil fuel industry total less than USD 1 billion, although this should catalyse a far greater level of spending.

Spending for methane abatement in coal operations in the Net Zero Scenario, 2024-2030

Spending for methane abatement in oil and gas operations in the Net Zero Scenario, 2024-2030

New tools to track emissions will bring a step change in transparency.

Better and more transparent data based on measurements of methane emissions is becoming increasingly accessible and will support more effective mitigation. In 2023, Kayrros , an analytics firm, released a tool based on satellite imagery that quantifies large methane emissions on a daily basis and provides country-level oil and gas methane intensities. GHGSat , another technology company, increased its constellation of satellites in orbit to 12 and started to offer targeted monitoring of offshore methane emissions, while the United Nations Environment Programme (UNEP) Methane Alert and Response System (MARS) ramped up usage of satellites to detect major methane emission events and alert government authorities and involved operators.

Despite this progress, little or no measurement-based data is used to report emissions in most parts of the world – which is an issue since measured emissions tend to be higher than reported emissions. For example, if companies that report emissions to UNEP’s Oil & Gas Methane Partnership 2.0 were to be fully representative of the industry globally, this would imply that global oil and gas methane emissions in 2023 were around 5 Mt, 95% lower than our estimate. Total oil and gas emissions levels reported by countries to the UN Framework Convention on Climate Change are close to 40 Mt, about 50% lower than our 2023 estimate. There are many possible reasons for these major discrepancies, but they will only be resolved through more systematic and transparent use of measured data.

Regardless, all assessments make clear that methane emissions from fossil fuel operations are a major issue and that renewed action – by governments, companies and financial actors – is essential.

Methane emissions from global oil and gas supply



Data from: Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris

Cite this dataset.

Reinula, Iris et al. (2024). Data from: Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris [Dataset]. Dryad. https://doi.org/10.5061/dryad.ns1rn8q1f

This dataset contains genetic and landscape data for 32 Primula veris populations on Muhu island, Estonia. The study populations lie within two 2x2 km study landscapes. Genetic samples were collected in 2014, and landscape data were extracted from maps dated 2015. The data are divided into node-based and link-based data: node-based data contain genetic diversity data for the P. veris populations, while link-based data contain genetic differentiation between population pairs and landscape data in buffers surrounding a straight line between population pairs.

This Primula_veris_landscape_genetics_Readme.txt file was generated on 2024-02-28 by Iris Reinula


Title of Dataset: Data from: Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris

Author Information

A. Principal Investigator Contact Information
Name: Iris Reinula
Institution: University of Tartu, Institute of Ecology and Earth Sciences
Address: Liivi 2, 50409 Tartu, Estonia
Email: [email protected]

B. Associate or Co-investigator Contact Information
Name: Tsipe Aavik
Institution: University of Tartu, Institute of Ecology and Earth Sciences
Address: Liivi 2, 50409 Tartu, Estonia
Email: [email protected]

C. Alternate Contact Information
Name: Sabrina Träger
Institution: Martin-Luther-University Halle-Wittenberg, Institute of Biology/Geobotany and Botanical Garden
Address: Große Steinstr. 79/80, 06108 Halle (Saale), Germany
Email: [email protected]

Date of data collection (single date, range, approximate date): genetic data: 2014; map data originally created: 2015; map data modified for this dataset: 2022-2024

Geographic location of data collection: Muhu island, Estonia

Information about funding sources that supported the collection of the data: Financial support was obtained from the Estonian Research Council (PRG1751, MOBJD427, PUT589 and PRG874), the European Regional Development Fund (Centre of Excellence EcolChange), the European Commission LIFE+ Nature program (LIFE13NAT/EE/000082), and ERA-NET program Biodiversa+ through the Estonian Ministry of Environment (Biodiversa2021-943).


Licenses/restrictions placed on the data: -

Links to publications that cite or use the data: Reinula, I., Träger, S., Järvine, H-T., Kuningas, V-M., Kaldra, M., Aavik, T. (2024). Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris. Biological Conservation, xx.

Links to other publicly accessible locations of the data:

Links/relationships to ancillary data sets: Sequence data used to generate this data will be made available at the European Nucleotide Archive (ENA)

Was data derived from another source? yes
A. If yes, list source(s): Estonian Basic Map (1:10000; Estonian Land Board, 2015); aerial photos of the study areas (Estonian Land Board, 2015)

Recommended citation for this dataset: Reinula, I., Träger, S., Järvine, H-T., Kuningas, V-M., Kaldra, M., Aavik, T. (2024). Data from: Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris, Estonia. Dryad, Dataset.


File List:

  • Primula_veris_gen_div.csv - Genetic diversity data for Primula veris populations
  • Primula_veris_gen_diff_landscape_variables.csv - Genetic differentiation data for Primula veris and landscape data between the populations
  • Koguva_raster_landscape.tiff - Landscape data in raster format for one study landscape (Koguva)
  • Lepiku_raster_landscape.tiff - Landscape data in raster format for one study landscape (Lepiku)

Relationship between files, if important:

Additional related data collected that was not included in the current data package: n/a

Are there multiple versions of the dataset? no


1. Description of methods used for collection/generation of data:

To generate the genetic information, leaves of Primula veris were collected from study populations in the two study landscapes (Koguva, Lepiku) and DNA was extracted from the leaves. The extracted DNA was prepared into sequencing libraries using the ddRAD method (Peterson, Weber, Kay, Fisher, & Hoekstra, 2012) and sequenced.

We obtained landscape data (grasslands, shrubs, forest, agricultural land, quarry) from the Estonian Basic Map (1:10000) and modified and categorised it based on aerial photos. Map data are from the Estonian Land Board.

See Reinula et al. 2024 for more info.


Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. (2012). Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE, 7(5), e37135. https://doi.org/10.1371/journal.pone.0037135

Reinula, I., Träger, S., Järvine, H-T., Kuningas, V-M., Kaldra, M., Aavik, T. (2024).Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris. Biological Conservation, xx.

2. Methods for processing the data:

Genetic data were filtered bioinformatically (see Träger et al. 2021). Population-based genetic diversity indices (unbiased expected and observed heterozygosity, uHe and Ho, respectively) were calculated using GENALEX version 6.503 (Peakall & Smouse, 2005, 2012), and mean nucleotide diversity (π) was calculated using vcftools v0.1.12b (Danecek et al., 2011) within a window of 125 bp over all loci for each population. Inbreeding coefficients (FIS) and genetic differentiation (FST) were calculated using the package genepop (Rousset, 2008) in R version 3.4.2 (R Core Team, 2017).

Pairwise mean assignment probability (MAP) was calculated with the package assignPOP (Chen et al., 2018) using assignment tests, for which we filtered out loci with low variance (threshold at 0.95) and used Monte-Carlo cross-validation, with all loci (100%) used as training data. The classification method for prediction was linear discriminant analysis. The resulting pairwise probabilities (membership accuracies across all individuals) were directional (e.g. 1 to 2, 2 to 1); we added each pair of values together and divided by two, resulting in one value per population pair (MAP; following van Strien et al., 2014).
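The MAP symmetrization step described above (directional assignment probabilities averaged into a single value per population pair) can be sketched as follows; the probability values are invented for illustration:

```python
# Minimal sketch of averaging directional assignment probabilities into one
# MAP value per population pair. Input probabilities are invented.

def symmetrize_map(directional):
    """Average p(i assigned to j) and p(j assigned to i) for each pair."""
    out = {}
    for (i, j), p in directional.items():
        out[frozenset((i, j))] = (p + directional[(j, i)]) / 2
    return out

probs = {(1, 2): 0.75, (2, 1): 0.25, (1, 3): 0.5, (3, 1): 0.5}
map_values = symmetrize_map(probs)
print(map_values[frozenset((1, 2))])  # 0.5
```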

Study populations were sampled at the scale of two 2x2 km study landscapes (Koguva, Lepiku), and a 250 m buffer around each 2x2 km landscape was added, resulting in two 2.5x2.5 km squares. We calculated the proportional amount of landscape elements surrounding the straight line between population pairs in a buffer with a width of 100 m. We only calculated this for population pairs within the same landscape. We transformed the landscape data from vector data to 10x10 m raster data for resistance surface analysis.
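A rough illustration (not the authors' code) of the buffer computation described above: the share of each landscape class within a 100 m wide buffer around the straight line between two populations, on a 10x10 m classified raster. The coordinates, class codes and toy raster below are invented for the example.

```python
# Sketch: proportional landscape composition within a buffer around the
# straight line between two populations, on a classified raster.
import numpy as np

def buffer_class_proportions(raster, cell_size, p1, p2, buffer_width):
    """Share of each raster class within buffer_width/2 of the segment p1-p2."""
    ny, nx = raster.shape
    # Cell-centre coordinates of the raster grid
    xs = (np.arange(nx) + 0.5) * cell_size
    ys = (np.arange(ny) + 0.5) * cell_size
    gx, gy = np.meshgrid(xs, ys)
    # Distance from every cell centre to the segment p1-p2
    d = np.array(p2, dtype=float) - np.array(p1, dtype=float)
    t = ((gx - p1[0]) * d[0] + (gy - p1[1]) * d[1]) / (d @ d)
    t = np.clip(t, 0.0, 1.0)
    dist = np.hypot(gx - (p1[0] + t * d[0]), gy - (p1[1] + t * d[1]))
    mask = dist <= buffer_width / 2
    classes, counts = np.unique(raster[mask], return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

rng = np.random.default_rng(0)
toy = rng.integers(1, 7, size=(250, 250))   # toy 2.5x2.5 km landscape, classes 1-6
props = buffer_class_proportions(toy, 10, (200, 200), (2100, 1800), 100)
print(round(sum(props.values()), 6))        # proportions sum to 1.0
```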

Chen, K.-Y., Marschall, E. A., Sovic, M. G., Fries, A. C., Gibbs, H. L., & Ludsin, S. A. (2018). assignPOP: An r package for population assignment using genetic, non-genetic, or integrated data in a machine-learning framework. Methods in Ecology and Evolution, 9(2), 439–446. https://doi.org/10.1111/2041-210X.12897

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., & Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. https://doi.org/10.1093/bioinformatics/btr330

Peakall, R., & Smouse, P. E. (2005). genalex 6: Genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes, 6(1), 288–295. https://doi.org/10.1111/j.1471-8286.2005.01155.x

Peakall, R., & Smouse, P. E. (2012). GenAlEx 6.5: Genetic analysis in Excel. Population genetic software for teaching and research—an update. Bioinformatics, 28(19), 2537–2539. https://doi.org/10.1093/bioinformatics/bts460

Rousset, F. (2008). genepop’007: A complete re-implementation of the genepop software for Windows and Linux. Molecular Ecology Resources, 8(1), 103–106. https://doi.org/10.1111/j.1471-8286.2007.01931.x

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/ .

Reinula, I., Träger, S., Järvine, H-T., Kuningas, V-M., Kaldra, M., Aavik, T. (2024). Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris. Biological Conservation, xx.

  • Träger, S., Rellstab, C., Reinula, I., Zemp, N., Helm, A., Holderegger, R., Aavik, T. (2021). Genetic diversity at putatively adaptive but not neutral loci in Primula veris responds to recent habitat change in semi-natural grasslands bioRxiv 2021.05.12.442254; doi: https://doi.org/10.1101/2021.05.12.442254

van Strien, M. J., Keller, D., Holderegger, R., Ghazoul, J., Kienast, F., & Bolliger, J. (2014). Landscape genetics as a tool for conservation planning: Predicting the effects of landscape change on gene flow. Ecological Applications, 24(2), 327–339. https://doi.org/10.1890/13-0442.1

1. Instrument- or software-specific information needed to interpret the data: n/a

2. Standards and calibration information, if appropriate: n/a

3. Environmental/experimental conditions: Experimental conditions do not apply. Environmental conditions: samples for genetic analysis were collected during summer in mostly dry and sunny weather.

4. Describe any quality-assurance procedures performed on the data:

5.People involved with sample collection, processing, analysis and/or submission: Iris Reinula, Sabrina Träger, Hanna-Triinu Järvine, Vete-Mari Kuningas, Marianne Kaldra, Tsipe Aavik, Marge Thetloff, Liis Kasari-Toussaint

DATA-SPECIFIC INFORMATION FOR: Primula_veris_gen_div.csv

Number of variables: 9

Number of cases/rows: 32

Delimiter: semicolon tab

Decimal operator: full stop (.)

Variable List:

  • Population_ID: identification number of the population
  • Region: the study landscape, either Koguva or Lepiku
  • Samples: number of samples in each population
  • Ho: genetic diversity index observed heterozygosity
  • He: genetic diversity index unbiased expected heterozygosity
  • FIS: genetic diversity index inbreeding coefficient
  • π: genetic diversity index nucleotide diversity

Missing data codes: n/a

Specialized formats or other abbreviations used: n/a

DATA-SPECIFIC INFORMATION FOR: Primula_veris_gen_diff_landscape_variables.csv

Number of variables: 10

Number of cases/rows: 244

Delimiter: tab

Variable List:

  • Population_ID_1: identification number of the first population in a pair
  • Population_ID_2: identification number of the second population in a pair
  • Geographical_distance_m: geographical distance in a straight line between two populations (m)
  • MAP: pairwise mean assignment probability, a genetic distance index
  • FST: pairwise genetic differentiation index
  • Grassland_proportion: proportional amount of grassland within the buffer zone (d = 100 m) surrounding the straight corridor between two populations
  • Shrubs_proportion: proportional amount of shrubs within the buffer zone (d = 100 m)
  • Agricultural_alnd_proportion: proportional amount of agricultural land within the buffer zone (d = 100 m)
  • Forest_proportion: proportional amount of forest within the buffer zone (d = 100 m)
  • Quarry_proportion: proportional amount of quarry within the buffer zone (d = 100 m)

Missing data codes: NA

Specialized formats or other abbreviations used: n/a

DATA-SPECIFIC INFORMATION FOR: Koguva_raster_landscape.tiff

Dimensions: 250 x 250

Cell values: 1 = arable land; 2 = quarry; 3 = forest; 4 = semi-natural grassland; 5 = shrubs; 6 = other landscape elements

DATA-SPECIFIC INFORMATION FOR: Lepiku_raster_landscape.tiff

Cell values: 1 = arable land; 2 = other landscape elements; 3 = forest; 4 = semi-natural grassland; 5 = shrubs


Estonian Research Council, Award: PRG1751

Estonian Research Council, Award: MOBJD427

Estonian Research Council, Award: PUT589

Estonian Research Council, Award: PRG874

European Regional Development Fund, Award: Centre of Excellence EcolChange

Estonian Ministry of Climate, Award: Biodiversa2021-943, Biodiversa+

Estonian Ministry of Education and Research, Award: Center of Excellence AgroCropFuture, TK200



  21. What Is a Case Study?

    Revised on November 20, 2023. A case study is a detailed study of a specific subject, such as a person, group, place, event, organization, or phenomenon. Case studies are commonly used in social, educational, clinical, and business research. A case study research design usually involves qualitative methods, but quantitative methods are ...

  22. What are Cases in Statistics? (Definition & Examples)

    For example, the following dataset contains 10 cases and 3 variables that we measure for each case: Notice that each case has multiple variables or "attributes." For example, each player has a value for points, assists, and rebounds. Note that cases are also sometimes called experimental units. These terms are used interchangeably.

  23. How to write a case study

    1. Identify your goal. Start by defining exactly who your case study will be designed to help. Case studies are about specific instances where a company works with a customer to achieve a goal. Identify which customers are likely to have these goals, as well as other needs the story should cover to appeal to them.

  24. Anomaly detection in IoT-based healthcare: machine learning for

    The N-BaIoT data set in 10 was collected by the University of California, Irvine. Nine Linux-based IoT machines were used to generate traffic. Two IoT Botnets were used, one was BASHLITE and the ...

  25. Qualitative case study data analysis: an example from practice

    Furthermore, the ability to describe in detail how the analysis was conducted ensures rigour in reporting qualitative research. Data sources: The research example used is a multiple case study that explored the role of the clinical skills laboratory in preparing students for the real world of practice. Data analysis was conducted using a ...

  26. A generative AI reset: Rewiring to turn potential into value in 2024

    It's time for a generative AI (gen AI) reset. The initial enthusiasm and flurry of activity in 2023 is giving way to second thoughts and recalibrations as companies realize that capturing gen AI's enormous potential value is harder than expected.. With 2024 shaping up to be the year for gen AI to prove its value, companies should keep in mind the hard lessons learned with digital and AI ...

  27. Design of an basis-projected layer for sparse datasets in deep learning

    Deep learning (DL) models encompass millions or even billions of parameters and learn complex patterns from big data. However, not all data are initially stored in a suitable formation to effectively train a DL model, e.g., gas chromatography-mass spectrometry (GC-MS) spectra and DNA sequence. These datasets commonly contain many zero values, and the sparse data formation causes difficulties ...

  28. Generalizability of machine learning in predicting antimicrobial

    This was a cross-sectional study utilizing data collected in the past years to explore associations between predictors and outcomes. ... optimized through hyperparameter tuning to a C parameter of 9.795846277645586 and gamma set to 'auto'. Feed-Forward Neural Networks (FFNNs), designed using Keras 2.12.0, consisted of an input layer with 64 ...

  29. Key findings

    The latest IEA Global Methane Tracker is based on the most recently available data on methane emissions from the energy sector and incorporates new scientific studies, measurement campaigns, and information collected from satellites. Analysis of this data reveals both signs of progress and some worrying trends.

  30. Dryad

    Links/relationships to ancillary data sets: Sequence data used to generate this data will be made available at the European Nucleotide Archive (ENA) ... Data from: Beware of the impact of land use legacy on genetic connectivity: A case study of the long-lived perennial Primula veris, Estonia. Dryad, Dataset.