An Introduction to Web Scraping for Research


Like web archiving, web scraping is a process by which you can collect data from websites and save it for further research or preserve it over time. Also like web archiving, web scraping can be done through manual selection or it can involve the automated crawling of web pages using pre-programmed scraping applications.

Unlike web archiving, which is designed to preserve the look and feel of websites, web scraping is mostly used for gathering textual data. Most web scraping tools also allow you to structure the data as you collect it. So, instead of massive unstructured text files, you can transform your scraped data into spreadsheet, CSV, or database formats that allow you to analyze and use it in your research.

There are many applications for web scraping. Companies use it for market and pricing research, weather services use it to track weather information, and real estate companies harvest data on properties. Researchers also use web scraping to study web forums and social media platforms such as Twitter and Facebook, to gather large collections of data or documents published on the web, and to monitor changes to web pages over time. If you are interested in identifying, collecting, and preserving textual data that exists online, there is almost certainly a scraping tool that can fit your research needs.

Please be advised that if you are collecting data from web pages, forums, social media, or other web materials for research purposes and it may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

How it Works   

The web is filled with text. Some of that text is organized in tables, populated from databases, altogether unstructured, or trapped in PDFs. Most text, though, is structured according to HTML or XHTML markup tags, which instruct browsers how to display it. These tags are designed to help text appear in readable ways on the web, and, like web browsers, web scraping tools can interpret these tags and follow instructions on how to collect the text they contain.
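For example, here is a minimal sketch of that idea in Python, using the Beautiful Soup library mentioned below (the HTML snippet and tag names are invented for illustration):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # A tiny, made-up HTML fragment: two names wrapped in <li class="name"> tags
    html = '<ul><li class="name">Ada Lovelace</li><li class="name">Alan Turing</li></ul>'
    soup = BeautifulSoup(html, "html.parser")

    # The scraper follows the tags rather than the visual layout
    names = [li.get_text() for li in soup.find_all("li", class_="name")]
    print(names)  # ['Ada Lovelace', 'Alan Turing']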

Web Scraping Tools

The most crucial step for initiating a web scraping project is to select a tool to fit your research needs. Web scraping tools can range from manual browser plug-ins, to desktop applications, to purpose-built libraries within popular programming languages. The features and capabilities of web scraping tools can vary widely and require different investments of time and learning. Some tools require subscription fees, but many are free and open access. 

Browser Plug-in Tools: these tools allow you to install a plugin to your Chrome or Firefox browser. Plug-ins often require more manual work in that you, the user, are going through the pages and selecting what you want to collect. Popular options include:

  • Scraper : a Chrome plugin
  • Web Scraper.io : available for Chrome and Firefox

Programming Languages: For large-scale, complex scraping projects, the best option is sometimes to use specific libraries within popular programming languages. These tools require more up-front learning but, once set up and running, are largely automated. It’s important to remember that you don’t need to be a programming expert to set up and use these tools, and there are often tutorials that can help you get started. Some popular tools designed for web scraping include:

  • Scrapy and Beautiful Soup : Python libraries [see tutorials here and here]
  • rvest : a package in R [see tutorial here]
  • Apache Nutch : a Java library [see tutorial here]

Desktop Applications: Downloading one of these tools to your computer can often provide familiar interface features and generally easy to learn workflows. These tools are often quite powerful, but are designed for enterprise contexts and sometimes come with data storage or subscription fees. Some examples include:

  • Parsehub : initially free, but includes data limits and subscription storage past those limits
  • Mozenda : a powerful subscription-based tool

Application Programming Interface (API): Technically, a web scraping tool is an Application Programming Interface (API) in that it helps the client (you the user) interact with data stored on a server (the text). It’s helpful to know that, if you’re gathering data from a large company like Google, Amazon, Facebook, or Twitter, they often have their own APIs that can help you gather the data. Using these ready-made tools can sometimes save time and effort and may be worth investigating before you initiate a project. 

The Ethics of Web Scraping 

As part of its introduction to web scraping (using Python), Library Carpentry has produced a detailed set of resources on the ethics of web scraping. These include explicit delineations of what is and is not legal as well as helpful guidelines and best practices for collecting data produced by others. The page also includes a Web Scraping Code of Conduct that provides quick advice about the most responsible ways to approach projects of this kind.

Overall, it’s important to remember that because web scraping involves the collection of data produced by others, it’s necessary to consider all the potential privacy and security implications involved in a project. Before you begin, ensure you understand what constitutes sensitive data on campus and reach out to both your IT department and your IRB about your project so that you have a data management plan in place before collecting any web data.


Role of Web Scraping in Modern Research – A Practical Guide for Researchers


Bhagyashree

  • January 23, 2024

Imagine you’re deep into research when a game-changing tool arrives – web scraping. It’s not just a regular data collector; think of it as an automated assistant that helps researchers efficiently gather online information. Picture this: data sits on websites in formats that are tricky to download in a structured way – web scraping steps in to simplify the process.

Techniques range from basic scripts in languages like Python to advanced operations with dedicated web scraping software. Researchers must navigate legal and ethical considerations, adhering to copyright laws and respecting website terms of use. It’s like embarking on a digital quest armed not only with coding skills but also a sense of responsibility in the vast online realm.

Understanding Legal and Ethical Considerations

When engaging in web scraping for research, it’s important to know about certain laws, like the Computer Fraud and Abuse Act (CFAA) in the United States and the General Data Protection Regulation (GDPR) in the European Union. These rules deal with unauthorized access to data and protecting people’s privacy. Researchers must ensure they:

  • Obtain data from websites with public access or with explicit permission.
  • Respect the terms of service provided by the website.
  • Avoid scraping personal data without consent in compliance with international privacy laws.
  • Implement ethical considerations, such as not harming the website’s functionality or overloading servers.

Neglecting these aspects can lead to legal consequences and damage the researcher’s reputation.

Choosing the Right Web Scraping Tool

When selecting a web scraping tool, researchers should consider several key factors:


  • Complexity of Tasks
  • Ease of Use
  • Customization
  • Data Export Options
  • Support and Documentation

By carefully evaluating these aspects, researchers can identify the web scraping tool that best aligns with their project requirements.

Data Collection Methods: API vs. HTML Scraping

When researchers gather data from web sources, they primarily employ two methods: API (Application Programming Interface) pulling and HTML scraping.

APIs serve as interfaces offered by websites, enabling the systematic retrieval of structured data, commonly formatted as JSON or XML. They are designed to be accessed programmatically and can provide a stable and efficient means of data collection, while typically respecting the website’s terms of service.

  • Often provides structured data
  • Designed for programmatic access
  • Generally more stable and reliable
  • May require authentication
  • Sometimes limited by rate limits or data caps
  • Potentially restricted access to certain data

HTML scraping, in contrast, involves extracting data directly from a website’s HTML code. This method can be used when no API is available, or when the API does not provide the required data.

  • Can access any data displayed on a webpage
  • No API keys or authentication needed
  • More susceptible to breakage if website layout changes
  • Data extracted is unstructured
  • Legal and ethical factors need to be considered

Researchers must choose the method that aligns with their data needs, technical capabilities, and compliance with legal frameworks.

Best Practices in Web Scraping for Research

web scraping for research

  • Respect Legal Boundaries : Confirm the legality of scraping a website and comply with Terms of Service.
  • Use APIs When Available : Prefer officially provided APIs as they are more stable and legal.
  • Limit Request Rate : To avoid server overload, throttle your scraping speed and automate polite waiting periods between requests (a minimal sketch follows this list).
  • Identify Yourself : Through your User-Agent string, be transparent about your scraping bot’s purpose and your contact information.
  • Cache Data : Save data locally to minimize repeat requests, thus reducing the load on the target server.
  • Handle Data Ethically : Protect private information and ensure data usage complies with privacy regulations and ethical guidelines.
  • Cite Sources : Properly attribute the source of scraped data in your scholarly work, giving credit to original data owners.
  • Use Robust Code : Anticipate and handle potential errors or changes in website structure gracefully to maintain research integrity.
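A minimal sketch of several of these practices in Python (the contact address and delay value are illustrative assumptions, and the requests library is assumed to be installed):

    import time
    import requests

    HEADERS = {
        # Identify yourself: a transparent User-Agent with contact details (placeholder address)
        "User-Agent": "research-scraper/0.1 (contact: researcher@example.org)"
    }

    cache = {}  # Cache data: keep pages you have already fetched

    def polite_get(url, delay_seconds=2.0):
        """Fetch a page with a polite pause between requests and a simple in-memory cache."""
        if url in cache:
            return cache[url]
        time.sleep(delay_seconds)  # Limit request rate: wait before each new request
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()  # Use robust code: fail loudly on HTTP errors
        cache[url] = response.text
        return cache[url]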

Use Cases: How Researchers Are Leveraging Web Scraping

Researchers are applying web scraping to diverse fields:

  • Market Research : Extracting product prices, reviews, and descriptions to analyze market trends and consumer behavior.
  • Social Science : Scraping social media platforms for public sentiment analysis and to study communication patterns.
  • Academic Research : Collecting large datasets from scientific journals for meta-analysis and literature review.
  • Healthcare Data Analysis : Aggregating patient data from various health forums and websites to study disease patterns.
  • Competitive Analysis : Monitoring competitor websites for changes in pricing, products, or content strategy.

Web Scraping in Modern Research

A recent article by Forbes explores the impact of web scraping on modern research, emphasizing the digital revolution’s transformation of traditional methodologies. Integration of tools like data analysis software and web scraping has shortened the journey from curiosity to discovery, allowing researchers to rapidly test and refine hypotheses. Web scraping plays a pivotal role in transforming the chaotic internet into a structured information repository, providing a multi-dimensional view of the information landscape.

The potential of web scraping in research is vast, catalyzing innovation and redefining disciplines, but researchers must navigate challenges related to data privacy, ethical information sharing, and maintaining methodological integrity for credible work in this new era of exploration.

Overcoming Common Challenges in Web Scraping

Researchers often encounter multiple hurdles while web scraping. To bypass website structures that complicate data extraction, consider employing advanced parsing techniques. When websites limit access, proxy servers can simulate various user locations, reducing the likelihood of getting blocked.

Overcome anti-scraping technologies by mimicking human behavior: adjust scraping speeds and patterns. Moreover, regularly update your scraping tools to adapt to web technologies’ rapid evolution. Finally, ensure legal and ethical scraping by adhering to the website’s terms of service and robots.txt protocols.

Web scraping, when conducted ethically, can be a potent tool for researchers. To harness its power:

  • Understand and comply with legal frameworks and website terms of service.
  • Implement robust data handling protocols to respect privacy and data protection.
  • Use scraping judiciously, avoiding overloading servers.

Responsible web scraping for research balances information gathering with respect for the digital ecosystems it draws on. The power of web scraping must be wielded thoughtfully, ensuring it remains a valuable aid to research, not a disruptive force.

Is web scraping detectable? 

Yes, websites can detect web scraping using measures like CAPTCHA or IP blocking, designed to identify automated scraping activities. Being aware of these detection methods and adhering to a website’s rules is crucial for individuals engaged in web scraping to avoid detection and potential legal consequences.

What is web scraping as a research method? 

Web scraping is a technique researchers use to automatically collect data from websites. By employing specialized tools, they can efficiently organize information from the internet, enabling a quicker analysis of trends and patterns. This not only streamlines the research process but also provides valuable insights, contributing to faster decision-making compared to manual methods.

Is it legal to use web scraped data for research? 

The legality of using data obtained through web scraping for research depends on the rules set by the website and prevailing privacy laws. Researchers need to conduct web scraping in a manner that aligns with the website’s guidelines and respects individuals’ privacy. This ethical approach ensures that the research is not only legal but also maintains its credibility and reliability.

Do data scientists use web scraping? 

Absolutely, data scientists frequently rely on web scraping as a valuable tool in their toolkit. This technique enables them to gather a substantial volume of data from various internet sources, facilitating the analysis of trends and patterns. While web scraping is advantageous, data scientists must exercise caution, ensuring that their practices align with ethical guidelines and the rules governing web scraping to maintain responsible and legal usage.

Scraping Web Data for Marketing Insights

Learn how to use web scraping and APIs to build valid web data sets for academic research.

Journal of Marketing (Vol. 86, Issue 5, 2022)

Learn how to scrape⚡️

Follow technical tutorials on using web scraping and APIs for data retrieval from the web.

Discover datasets and APIs

Browse our directory of public web datasets and APIs for use in academic research projects.

Seek inspiration from 400+ published papers

Explore the database of published papers in marketing using web data (2001-2022).



  • CAREER COLUMN
  • 08 September 2020

How we learnt to stop worrying and love web scraping

Nicholas J. DeVito, Georgia C. Richards & Peter Inglesby

Nicholas J. DeVito is a doctoral candidate and researcher at the EBM DataLab at the University of Oxford, UK.

Georgia C. Richards is a doctoral candidate and researcher at the EBM DataLab at the University of Oxford, UK.

Peter Inglesby is a software engineer at the EBM DataLab at the University of Oxford, UK.

In research, time and resources are precious. Automating common tasks, such as data collection, can make a project efficient and repeatable, leading in turn to increased productivity and output. You will end up with a shareable and reproducible method for data collection that can be verified, used and expanded on by others — in other words, a computationally reproducible data-collection workflow.


Nature 585, 621-622 (2020)

doi: https://doi.org/10.1038/d41586-020-02558-0

This is an article from the Nature Careers Community, a place for Nature readers to share their professional experiences and advice. Guest posts are encouraged.


How to use web scraping for online research

Mariachiara Faraon

Research takes time. Whether it is academic research or market research, speeding up the data extraction process can save you days of work and eliminate the possibility of human error. Manual research is alright when copying and pasting information from a few websites. But what about more extensive projects where you need to collect and sort data from hundreds of web pages?

Enter web scraping. Also referred to as data scraping, data mining, data extraction, screen scraping, or web harvesting, web scraping is an automated solution for all your research needs. You tell the scraper what information you are looking for, and it gets you the data — already neatly organized and structured, so you won’t even have to sort it.

What is online research?

Online research refers to using the internet to retrieve data. This includes any type of website accessed through any type of device. It's hard to find a good word to describe the amount of data on the internet — huge? Astronomical? Infinite?

Not to mention that to extract data manually, you would have to read entire web pages to determine whether they are relevant to your research or not, select relevant information, think of a structure and parameters to catalog it…

How can you use web scraping for online research?

Web scraping is the automated process of extracting structured information from a website. Simply put, your scraper will export data from a website and insert it tidily into a spreadsheet or a file. If you need to extract data for your research, this might just be the shortcut you have been looking for. Here are some examples of how you can use web scraping for online research:

Medical research : Monitoring clinical trial results, patient status, and disease incidence and detection (think of the COVID-19 pandemic) are only a few of the possible uses of web scraping in the healthcare and pharmaceutical field.

Academic research : Scholars search for data and facts for analysis, which leads to a greater body of knowledge. Whatever their field of research may be, web scraping certainly speeds up the data collection process and saves energy for analysis, whether at the university, think tank, or individual student level.


News research : We have mentioned a few instances in which web scraping was instrumental to investigative journalism. But you don’t need to be a journalist to use a news web scraper! You might just want to stay up to speed with all the new developments in your professional field (like web design, for example) or receive updates every time your soccer team scores a goal. A web scraper can do that for you and store the results neatly so that you can use them in reports, databases, or just keep them handy for later reference.

Market research : Market research is fundamental to the success of every business. You need to know what your competitors are doing and what consumers want in order to sell. As the saying goes, “keep your friends close and your enemies closer.” And thanks to web scraping, your enemies will be closer than ever. Review websites are great places to find customer feedback on your competitors’ performance and activities. Check out these actors to scrape Yelp or Yellow Pages and make web scraping part of your market research.


Advantages of using web scraping for online research

There are several advantages to using web scraping for research. Research can be a tedious, time-consuming, and expensive business. Sometimes, human or financial resources are scarce, resulting in poor and inaccurate results. Here are the most significant advantages of web scraping in research:

Speed : once you have set the parameters for your research, the web scraper sends your request to the website(s), brings you back the answer, and logs it into a file. It doesn’t need the time you would to read through the whole web page, select the data you want, copy and paste it into a separate file, and then tidy it all up. If the research project is highly ambitious, the scraper might take a few hours or days — still much less than it would take you manually.

Convenience : you can export all the extracted data to a spreadsheet or file on your device. You won’t have to waste extra time sorting it because the scraper has already done that for you, as any well-designed scraper will extract structured data. The file will also be available offline when you don’t have internet access.

Cost : research is a lot of work. That is why universities, governments, and organizations usually need to hire a research department or consultants. Web scraping drastically reduces the workforce needed for such projects, thereby cutting costs.

Also, data itself is expensive. If you need to find influencers from a target audience to promote a product, you may not have the capital to buy a database or a directory from a third party. Web scraping allows you to retrieve that data on your own.

Are there risks or downsides to web scraping?

A word of caution for web scraping beginners. The usual rules related to copyright and especially personal data apply to data extracted through web scraping – plus a few more. Web scraping is legal, but you need to respect international regulations and the target website's terms of service.

Here are the main things to keep in mind when web scraping:

Getting blocked : Whenever you send a request to a website, the website has to send an answer back to you. This happens very fast, but it still takes time and energy. If you send too many requests too fast, you risk significantly slowing down the whole website or even bringing it down. Alternatively, the website could recognize that the requests are automated and block your IP address (temporarily or permanently). To minimize this risk, consider setting a pause for the program between each request and, more importantly, using a proxy server. Apify Proxy was designed with web scraping in mind, while also respecting the websites scraped.

Sharing : You can’t always share the data you extracted from a website. Some content might be licensed or copyrighted. However, you can always share your web scraping code on platforms such as GitHub so that others might use it.


But I don’t know how to program 🤷‍♀️

You might lack the technical competence needed to build a web scraper yourself. Most people do. But don’t despair: Apify’s no-code tools for extracting data are user-friendly and accessible.

You don’t need to know how to code to use an Apify scraper. The actor scrapes the website’s HTML and extracts the data directly into an easily readable spreadsheet or text format. You can then access the file any time you want for your research, even offline.

Check whether someone in the Apify community has already created the scraper you need in Apify Store or request a custom scraper for your project.



Applications of Web Scraping in Economics and Finance

  • Piotr Śpiewanowski, Institute of Economics, Polish Academy of Sciences
  • Oleksandr Talavera, Department of Economics, University of Birmingham
  • Linh Vi, Department of Economics, University of Birmingham
  • https://doi.org/10.1093/acrefore/9780190625979.013.652
  • Published online: 18 July 2022

The 21st-century economy is increasingly built around data. Firms and individuals upload and store enormous amounts of data. Most of the produced data is stored on private servers, but a considerable part is made publicly available across the 1.83 billion websites available online. These data can be accessed by researchers using web-scraping techniques.

Web scraping refers to the process of collecting data from web pages either manually or using automation tools or specialized software. Web scraping is possible and relatively simple thanks to the regular structure of the code used for websites designed to be displayed in web browsers. Websites built with HTML can be scraped using standard text-mining tools, either with scripts in popular (statistical) programming languages such as Python, Stata, and R, or with stand-alone dedicated web-scraping tools. Some of those tools do not even require any prior programming skills.

Since about 2010, with the omnipresence of social and economic activities on the Internet, web scraping has become increasingly popular among academic researchers. In contrast to proprietary data, which might not be affordable due to substantial costs, web scraping can make interesting data sources accessible to everyone.

Thanks to web scraping, the data are now available in real time and with significantly more details than what has been traditionally offered by statistical offices or commercial data vendors. In fact, many statistical offices have started using web-scraped data, for example, for calculating price indices. Data collected through web scraping has been used in numerous economic and finance projects and can easily complement traditional data sources.

  • web scraping
  • online prices
  • online vacancies
  • web-crawler

Web Scraping and the Digital Economy

Today’s economy is increasingly built around data. Firms and individuals upload and store enormous amounts of data. This has been made possible thanks to the ongoing digital revolution. The price of data storage and data transfer has decreased to the point where the marginal (incremental) cost of storing and transmitting increasing volumes of data has fallen to virtually zero. The total volume of data created and stored rose from a mere 0.8 Zettabyte (ZB) or trillion gigabytes in 2009 to 33 ZB in 2018 and is expected to reach 175 ZB in 2025 (Reinsel et al., 2017). The stored data are shared at an unprecedented speed; in 2017 more than 46,000 GB of data (or four times the size of the entire U.S. Library of Congress) was transferred every second (United Nations Conference on Trade and Development (UNCTAD), 2019).

The digital revolution has produced a wealth of novel data that allows not only testing well-established economic hypotheses but also addressing questions on human interaction that have not been tested outside of the lab. Those new sources of data include social media, various crowd-sourcing projects (e.g., Wikipedia), GPS tracking apps, static location data, or satellite imagery. With the Internet of Things emerging, the scope of data available for scraping is set to increase in the future.

The growth of available data coincides with enormous progress in technology and software used to analyze it. Artificial intelligence (AI) techniques enable researchers to find meaningful patterns in large quantities of data of any type, allowing them to find useful information not only in data tables but also in unstructured text or even in pictures, voice recordings, and videos. The use of this new data, previously not available for quantitative analysis, enables researchers to ask new questions and helps avoid omitted variable bias by including information that has been known to others but not included in quantitative research data sets.

Natural language processing (NLP) techniques have been used for decades to convert unstructured text into structured data that machine-learning tools can analyze to uncover hidden connections. But the digital revolution expands the set of data sources useful to researchers to other media. For example, Gorodnichenko et al. ( 2021 ) have recently studied emotions embedded in the voice of Federal Reserve Board governors during press conferences. Using deep learning algorithms, they examined quantitative measures of vocal features of voice recordings such as voice pitch (indicating the level of highness/lowness of a tone) or frequency (indicating the variation in the pitch) to determine the mood/emotion of a speaker. The study shows that the tone of the voice reveals information that has been used by market participants. It is only a matter of time until conceptually similar tools will be used to read emotions from facial expressions.

Most of the produced data is stored on private servers, but a considerable part is made publicly available across the 1.83 billion websites available online that are now up for grabs for researchers equipped with basic web-scraping skills. 1

Many online services used in daily life, including search engines and price and news aggregators, would not be possible without web scraping. In fact, even Google, the most popular search engine as of 2021 , is just a large-scale web crawler. Automated data collection has also been used in business, for example, for market research and lead generation. Thus, unsurprisingly, this data collection method has also received growing attention in social research.

While new data types and data analysis tools offer new opportunities and one can expect a significant increase in use of video, voice, or picture analysis in the future, web scraping is used predominantly to collect text from websites. Using web scraping, researchers can extract data from various sources to build customized data sets to fit individual research needs.

The information available online on, inter alia, prices (e.g., Cavallo, 2017 ; Cavallo & Rigobon, 2016 ; Gorodnichenko & Talavera, 2017 ), auctions (e.g., Gnutzmann, 2014 ; Takahashi, 2018 ), job vacancies (e.g., Hershbein & Kahn, 2018 ; Kroft & Pope, 2014 ), or real estate and housing rental information (e.g., Halket & di Custoza, 2015 ; Piazzesi et al., 2020 ) has allowed refining answers to the well-known economic questions. 2

Thanks to web scraping, the data that until recently have been available only with a delay and in an aggregated form are now available in real time and with significantly more details than what has been traditionally offered by statistical offices or commercial data vendors. For example, in studies of prices, web scraping performed over extended periods of time allows the collection of price data from all vendors in a given area, with product details (including product identifier) in the desired granularity. Studies of the labor market or real estate markets benefit from extracting information from detailed ad descriptions.

The advantages of web scraping have been also noticed by statistical offices and central banks around the world. In fact, many statistical offices have started using web-scraped data, for example, for calculating price indices. However, data quality, sampling, and representativeness are major challenges and so is legal uncertainty around data privacy and confidentiality (Doerr et al., 2021 ). Although this is true for all data types, big data exacerbates the problem: most big data produced are not the final product, but rather a by-product of other applications.

What makes web scraping possible and relatively simple is the regular structure of the code used for websites designed to be displayed in web browsers. Websites built with HTML can be scraped using standard text-mining tools, either scripts in popular (statistical) programming languages such as Python or R or stand-alone dedicated web-scraping tools. Some of those tools do not even require any prior programming skills.

Web scraping allows collecting novel data that are unavailable in traditional data sets assembled by public institutions or commercial providers. Given the wealth of data available online, only researchers’ determination and coding skills determine the final scope of the data set to be explored. With web scraping, data can be collected at a very low cost per observation, compared to other methods. Taking control of the data-collecting process allows data to be gathered at the desired frequency from multiple sources in real time, while the entire process can be performed remotely. Although the set of available data is quickly growing, not all data can be collected online. For example, while price data are easily available, information on quantities sold is typically missing. 3

The rest of this article is organized as follows: “ Web Scraping Demystified ” presents the mechanics of web scraping and introduces some common web-scraping tools including code snippets that can be reused by readers in their own web-scraping exercises. “ Applications in Economics and Finance Research ” shows how those tools have been applied in economic research. Finally, “ Concluding Remarks ” wraps up the subject of web scraping, and suggestions for further reading are offered to help readers to master web-scraping skills.

Web Scraping Demystified

This section will present a brief overview of the most common web-scraping concepts and techniques. Knowing these fundamentals, researchers can decide which methods are most relevant and suitable for their projects.

Introduction to HTML

Before starting a web-scraping project, one should be familiar with Hypertext Markup Language (HTML). HTML is a global language used to create web pages and has been developed so that all kinds of computers/devices (e.g., PCs, handheld devices) should be able to understand it. HTML allows users to publish online documents with different contents (e.g., headings, text, tables, images, and hyperlinks), incorporate video clips or sounds, and display forms for searching information, ordering products, and so forth. An HTML document is composed of a series of elements, which serve as labels of different contents and tell the web browser how to display them correctly. For example, the <title> element contains the title of the HTML page, the <body> element accommodates all the content of the page, the <img> element contains images, and the <a> element enables creating links to other content. A website with the underlying HTML code can be found here. To view HTML in any browser, one can just right click and choose the Inspect (or Inspect Element) option.

The code here is taken from an HTML document and contains information on each book’s name, author, description, and price. 4
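A simplified sketch of such markup, using the class names described below with placeholder book details, looks roughly like this (the real listing is more elaborate):

    <div class="c-book__body">
      <a class="c-book__title" href="/example-book">Example Book Title</a>
      <span class="c-book__by">by An Example Author</span>
      <p class="c-book__description">A short description of the book.</p>
      <span class="c-book__price c-price">£9.99</span>
    </div>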

Each book’s information is enclosed in the <div> tag with attribute class and value “c-book__body”. The book title is contained in the <a> tag with attribute class and value “c-book__title”. Information on author names is embedded in the <span> tag with attribute class and value “c-book__by”. The description of the book is enclosed within the <p> tag with attribute class=“c-book__description”. Finally, the price information can be found in the <span> tag with attribute class=“c-book__price c-price”. Information on all other books listed on this website is coded in the same way. These regularities allow efficient scraping, which is described in the section “Responsible Web Scraping.”

To access an HTML page, one should identify its Uniform Resource Locator (URL), which is the web’s address specifying its location on the Internet. A dynamic URL is usually made up of two main parts, with the first one being the base URL, which lets the web browser know how to access the information specified in the server. The next part is the query string that usually follows a question mark. An example of a URL is https://wordery.com/educational-art-design-YQA?viewBy=grid&resultsPerPage=20&page=1, where the base part is https://wordery.com/educational-art-design-YQA, and the query string part is viewBy=grid&resultsPerPage=20&page=1, which consists of specific queries to the website: displaying products in a grid view (viewBy=grid), showing 20 products in each page (resultsPerPage=20), and loading the first page (page=1) of the Educational Art Design books category. The regular structure of URLs is another feature that makes web scraping easy. Changing the page number (the last digit in the preceding URL) from 1 to 2 will display the subsequent 20 results of the query. This process can continue until the last query result is reached. Thus, it is easy to extract all query results with a simple loop, as sketched below.
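As a rough sketch in Python (assuming the requests library and, as a placeholder, five result pages):

    import time
    import requests

    base_url = "https://wordery.com/educational-art-design-YQA"
    pages = []
    for page in range(1, 6):  # placeholder: fetch the first five result pages
        url = f"{base_url}?viewBy=grid&resultsPerPage=20&page={page}"
        pages.append(requests.get(url).text)
        time.sleep(2)  # pause between requests (see “Responsible Web Scraping”)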

Responsible Web Scraping

Data displayed by most websites is for public consumption. Recent court rulings indicate that since publicly available sites cannot require a user to agree to any Terms of Service before accessing data, users are free to use web crawlers to collect data from the sites. 5 Many websites, however, provide (more detailed) data to registered users only, with the registration being conditional on accepting a ban on automated web scraping.

Constraints on web scraping arise also from the potentially excessive use of the infrastructure of the website owners, potentially impairing the service quality. The process of web scraping involves making requests to the host, which in turn will have to process a request and then send a response back. A data-scraping project usually consists of querying repeatedly and loading a large number of web pages in a short period of time, which might lead to overwhelmed traffic, system overload, and potential damages to the server and its users. With this in mind, a few rules in building web scrapers have to be followed to avoid damages such as website crashes.

Before scraping a website’s data, one should check whether there are any restrictions specified by the target website for scraping activities. Most websites provide the Robots Exclusion Protocol (also known as the robots.txt file), which tells web crawlers the type/name of pages or files they can or cannot request from the website. This file can usually be found at the top-level directory of a website. For example, the robots.txt file of the Wordery website, available at https://wordery.com/robots.txt, states that:
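Based on the restrictions described below, the file contains directives along these lines (the paths are indicative, not verbatim):

    User-agent: *
    Allow: /
    Disallow: /basket
    Disallow: /checkout
    Disallow: /settings
    Disallow: /newsletter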

The User-agent line specifies the name of the bot and the accompanying rules are what the bot should adhere to. User-agent is a text string in the header of a request that identifies the type of device, operating system, and browser that are used to access a web page. Normally, if a User-agent is not provided when web scraping, the rules stated for all bots under User-agent: * section should be followed. Allow gives specific URLs that are allowed to request with bots, and, conversely, Disallow specifies disallowed URLs.

In the prior example, Wordery does not allow a web scraper to access and scrape pages containing keywords such as “basket,” “checkout,” “settings,” and “newsletter.” The restrictions stated in robots.txt are not legally binding; however, they are a standard that most web users adhere to. Furthermore, if access to data is conditional on accepting some terms of service of the website, it is highly recommended that all the references to web-scraping activities be checked to make sure the data are obtained legally. Moreover, it is strongly recommended that contact details are included in the User-agent part of the requests.

Web crawlers process text at a substantially faster pace than humans, which, in principle, allows sending many more requests (downloading more web pages) than a human user would send. Thus, one should make sure that the web-scraping process does not affect the performance or bandwidth of the web server in any way. Most web servers will automatically block an IP address, preventing further access to their pages, if they receive too many requests. To limit the risk of being blocked, the program should take temporary pauses between subsequent requests. For more bandwidth-intensive crawlers, scraping should be planned at the time when the targeted websites experience the least traffic, for example, nighttime.

Furthermore, depending on the scale of each project, it might be worthwhile informing the website’s owners of the project or asking whether they have the data available in a structured format. Contacting data providers directly may save on coding efforts, though the data owner’s consent is far from guaranteed.

Typically, the researchers’ data needs are not very disruptive to the websites being scraped and the data are not being used for commercial purposes or re-sold, thus the harm made to the data owners is negligible. Nonetheless, the authors are aware of legal cases filed against researchers using web-scraped data by companies that owned the data and whose interests were affected by publication of research results based on data scraped from those sources. Retaining anonymity of the data sources and the identities of the firms involved in most cases protects the researchers against such risks.

Web-Scraping Tools

There is a wide variety of available tools/techniques that can be used for web scraping and data retrieval. In general, some web-scraping tools require programming knowledge, such as the Requests and BeautifulSoup libraries in Python, the rvest package in R, or Scrapy, while some are more ready to use and require little to no technical expertise, such as iMacros or Visual Web Ripper. Many of the tools have been around for quite a while and have a large community of users (e.g., stackoverflow.com).

Python Requests and BeautifulSoup

One of the most popular web-scraping techniques is using the Requests and BeautifulSoup libraries in Python, which are available for both Python 2.x and 3.x. 6 In addition to having Python installed, it is required to install the necessary libraries, such as bs4 and requests. The next step is to build a web scraper that sends a request to the website’s server, asking the server to return the content of a specific page as an HTML file. The requests module in Python enables the performance of this task. For example, one can retrieve the HTML file from the Wordery online bookshop with the following code:
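A minimal version of that request (the User-Agent contact string is a placeholder):

    import requests

    url = "https://wordery.com/educational-art-design-YQA?viewBy=grid&resultsPerPage=20&page=1"
    headers = {"User-Agent": "research-scraper/0.1 (contact: researcher@example.org)"}  # placeholder contact details
    response = requests.get(url, headers=headers)
    html = response.text  # the raw HTML returned by the server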

Using the BeautifulSoup library helps transform the complex HTML file into an object with a nested data structure—the parse tree—which is easier to navigate and search. To do so, the HTML file has to be passed to the BeautifulSoup constructor to get the object soup as follows:
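A sketch of that step, reusing the html string retrieved above:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")  # parse the HTML into a navigable tree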

This example uses the HTML parser html.parser—a built-in parser in Python 3. From the object soup, one can parse each book’s name, author, price, and description from the web page as follows:
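A sketch of that parsing step, with selectors based on the class names introduced in “Introduction to HTML”:

    books = []
    for book in soup.find_all("div", class_="c-book__body"):
        title = book.find("a", class_="c-book__title")
        author = book.find("span", class_="c-book__by")
        price = book.find("span", class_="c-book__price")
        description = book.find("p", class_="c-book__description")
        books.append({
            "title": title.get_text(strip=True) if title else None,
            "author": author.get_text(strip=True) if author else None,
            "price": price.get_text(strip=True) if price else None,
            "description": description.get_text(strip=True) if description else None,
        })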

Scrapy

Scrapy is a popular open-source application framework written in Python for scraping websites and extracting structured data. 7 This is a programming-oriented method and requires coding skills. One of the main benefits of Scrapy is that it schedules and processes requests asynchronously, hence it can perform very quickly. This means that Scrapy does not need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. Even if some request fails or an error occurs, other requests can keep going. Scrapy is extensible, as it allows users to plug in their own functionality and work on different operating systems.

There are two ways of running Scrapy, which are running from the command line and running from a Python script using Scrapy API. Due to space limitations, this article offers only a brief introduction on how to run Scrapy from the command line. Scrapy is controlled through the Scrapy command-line tool (aka Scrapy tool) and its sub-commands (aka Scrapy commands). There are two types of Scrapy commands: global commands—those that can work without a Scrapy project—and project-only commands—those that work only from inside a Scrapy project. Some examples of global commands are: startproject to create a new Scrapy project or view to show the content of the given URL in the browser as Scrapy would “see” it.

To start a Scrapy project, a researcher would need to install the latest version of Python and pip, since Scrapy supports only Python 3.6+. First, one can create a new Scrapy project using the global command startproject as follows:
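With the project name used in this example, that command is:

    scrapy startproject wordery your_project_dir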

This command will create a Scrapy project named wordery under the your_project_dir directory. Specifying your_project_dir is optional; otherwise, your_project_dir will be the same as the current directory.

After creating the project, one needs to go inside the project’s directory by running the following command:
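Assuming the default case in which no separate project directory was specified:

    cd wordery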

The next task is to write a Python script to create the spider and save it in the file named wordery_spider.py under the wordery/spiders directory. Although adjustable, all Scrapy projects have the same directory structure by default. A Scrapy spider is a class that defines how a certain web page (or a group of pages) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from the pages (i.e., scraping items). An example of a spider named WorderySpider that scrapes and parses each book’s name, price, and description from the page https://wordery.com/educational-art-design-YQA is:
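A sketch of such a spider, with CSS selectors assumed from the class names described in “Introduction to HTML”:

    import scrapy

    class WorderySpider(scrapy.Spider):
        name = "wordery"
        start_urls = ["https://wordery.com/educational-art-design-YQA"]

        def parse(self, response):
            # Each book listing is wrapped in a <div class="c-book__body"> element
            for book in response.css("div.c-book__body"):
                yield {
                    "name": book.css("a.c-book__title::text").get(),
                    "price": book.css("span.c-book__price::text").get(),
                    "description": book.css("p.c-book__description::text").get(),
                }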

After creating the spider, the next step involves changing the directory to the top-level directory of the project and using the command crawl to run such a spider:
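With the spider above, that amounts to:

    scrapy crawl wordery -o books.json   # -o writes the scraped items to a file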

Finally, the output that looks like the following example can be easily converted into a tabular form:
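With the spider above, each scraped item would resemble the following (values are placeholders, not real listings):

    [
      {"name": "Example Book Title", "price": "£9.99", "description": "A short description of the book."},
      {"name": "Another Example Title", "price": "£12.50", "description": "Another short description."}
    ]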

iMacros

Another useful web scraping tool is iMacros, a web browser–based application for web automation popular since 2001. 8 It is provided as a web browser extension (available for Mozilla Firefox, Google Chrome, and Internet Explorer) or a stand-alone application. iMacros can be easy to start with and requires little to no programming skills. It allows users to record repetitious tasks once and to automatically replay such tasks when needed. Furthermore, there is also an iMacros API that enables users to write scripts in various Windows programming languages.

After installation, the iMacros add-on can be found on the top bar of the browser. An iMacros web-scraping project normally starts by recording a macro—a set of commands for the browser to perform—in the iMacros panel. iMacros can record mouse clicks on various elements of a web page and translate them into TAG commands. Although simple tasks (e.g., navigating to a URL) can be easily recorded, more complicated tasks (e.g., looping through all items) might require some modifications of the recorded code.

As an illustration, the macro shown here asks the browser to access and extract information from the page https://wordery.com/educational-art-design-YQA:

Stata

Stata is a popular statistical software package that enables researchers to carry out simple web-scraping tasks. First, one can use the readfile command to load the HTML file content from the website into a string. The next step is to extract product information, which requires intensive work to handle string variables. A wide range of Stata string functions can be applied, such as splitting strings, extracting substrings, and searching strings (i.e., regular expressions). Otherwise, one might also rely on user-written packages such as readhtml, although its functions are limited to reading tables and lists from web pages. However, text manipulation operations are still limited in Stata, and it is recommended to combine Stata with Python.

The following Stata codes can be used to parse books’ information from the Wordery website:

Other Tools

Visual Scrapers

Besides browser extensions, there are several ready-to-use data extraction desktop applications. For instance, iMacros has Professional Edition, which includes a stand-alone browser. This application is very easy to use because it allows scripting, recording, and data extraction. Similar applications are Visual Web Ripper, Helium Scraper, ScrapeStorm, and WebHarvy.

Cloud-Based Solutions

Cloud-based services are among the most flexible web-scraping tools since they are not operating system–dependent (they are accessible from the web browser and hence do not require installation), the extracted data is stored in the cloud, and their processing power is unrivaled by most systems. Most cloud-based services provide IP rotation to avoid getting blocked by the websites. Some cloud-based web-scraping providers are Octoparse, ParseHub, Diffbot, and Mozenda. These solutions have numerous advantages but might come with a higher cost than other web-scraping methods.

APIs

Recognizing the need for users to collect information, many websites also make their data available and directly retrievable through an open Application Programming Interface (API). While a user interface is designed for use by humans, APIs are designed for use by a computer or application. Web APIs act as intermediaries or means of communication between websites and users. Specifically, they determine the types of requests that can be made, how to make them, the data formats that should be used, and the rules to adhere to. APIs enable users to retrieve data quickly and flexibly by requesting it directly from the database of the website, using their own programming language of choice.
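As a minimal illustration in Python (the endpoint, parameters, and key below are placeholders, not a real service):

    import requests

    response = requests.get(
        "https://api.example.org/v1/jobs",              # placeholder endpoint
        params={"keywords": "economist", "page": 1},    # placeholder query parameters
        auth=("YOUR_API_KEY", ""),                      # many APIs accept a key via basic authentication
    )
    records = response.json()  # structured JSON, ready for analysis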

A large number of companies and organizations offer free public APIs. Examples of open APIs are those of social media networks such as Facebook and Twitter, governments/international organizations such as the United States, France, Canada, and the World Bank, as well as companies such as Google and Skyscanner. For example, Reed.co.uk, a leading U.K. job portal, offers an open job seeker API, which allows job seekers to search all the jobs posted on this site. To get the API key, one can access the web page and sign up with an email address; then, an API key will be sent to the registered mailbox. This jobseeker API provides detailed job information, such as the ID of the employer, profile ID of the employer, location of the job, type of job, posted wage (if available), and so forth.

If APIs are available, it is usually much easier to collect data through an API than through web scraping. Furthermore, data scraping with API is completely legal. The data copyrights remain with the data provider, but this is not a limitation for use of data in research.

However, there are several challenges in collecting data using an API. First of all, the types and amount of data that are freely available through an API might be very limited. Data owners often set rate limits, based on time, the time between two consecutive queries, or number of concurrent queries and the number of records returned per query, which can significantly slow down collection of large data sets. Also, the scope of data available through free APIs may be limited. Those restrictions sometimes are lifted in return for a fee. An example is Google’s Cloud Translation API for language detection and text translation. The first 500,000 characters of text per month will be free of charge, but then fees will be charged for any over-the-limit characters. Finally, some programming skills are required to use API, though for major data providers one can easily find API wrappers written by users of major statistical software.

Applications in Economics and Finance Research

The use of web scraping in economic research started nearly as soon as the first data started to be published on the Web (see Edelman, 2012 for a review of early web-scraping research in economics). In the early days of the Internet (and web scraping), data sets of a few hundred observations were considered rich enough to give insights sufficient for publication in a top journal (e.g., Roth & Ockenfels, 2002). As the technology matures and the amount of information available online increases, expectations on the size and quality of the scraped data are also getting higher. Low computation and storage costs allow the processing of data sets with billions of scraped observations (e.g., the Billion Prices Project, Cavallo & Rigobon, 2016). Web-scraping practice is present in a wide range of economic and finance research areas, including online prices, job vacancies, peer-to-peer lending, and house-sharing markets.

Online Prices

Together with the tremendous rise of e-commerce, online price data have received growing interest from economic researchers as an alternative to traditional data sources. The Billion Prices Project (BPP), created at MIT by Cavallo and Rigobon in 2008, gathers a massive number of prices every day from hundreds of online retailers around the world. While the U.S. Bureau of Labor Statistics gathers only about 80,000 prices on a monthly or bimonthly basis, the BPP can reach half a million price quotes in the United States each day. Collecting conventional-store prices is usually expensive and complicated, whereas retrieving online prices can come at a much lower cost (Cavallo & Rigobon, 2016). Moreover, detailed information for each product can be collected at a higher frequency (i.e., hourly or daily). Web scraping also allows researchers to quickly track the exit of products and the introduction of new products. Furthermore, Cavallo (2018) points out that the use of Internet price data can help mitigate measurement biases (i.e., time averaging and imputation of missing prices) in traditionally collected data.

Several works in the online price literature focus on very narrow market segments. For example, in the context of online book retailing, Boivin et al. (2012) collected more than 141,000 price quotes for 213 books sold on major online book websites in the United States (Amazon.com and BN.com) and Canada (Amazon.ca and Chapters.ca) every Monday, Wednesday, and Friday from June 2008 to June 2009. They documented extensive international market segmentation and pervasive deviations from the law of one price. In a study on air ticket prices, Dudás et al. (2017) used iMacros to gather more than 31,000 flight fares over a period of 182 days from three online travel agencies (Expedia, Orbitz, Cheaptickets) and two metasearch sites (Kayak, Skyscanner) for flights from Budapest to three popular European travel destinations (London, Paris, and Barcelona). They found that metasearch engines outperform online travel agencies in offering lower ticket prices and that no website constantly offers the lowest airfares.

As price-comparison websites (PCWs) have gained popularity among consumers looking for the cheapest prices, these shopping platforms have become a promising data source for various studies. For instance, Baye and Morgan (2009) used a program written in Perl to download almost 300,000 price quotes for 90 best-selling products sold at Shopper.com, the top PCW for consumer electronics goods, during 7 months between 2000 and 2001. Their study aimed to document the effect of brand advertising on price dispersion, which occurs when identical products are priced differently across sellers. In the same vein, employing nearly 16,000 prices across six product categories (Hard Drives, Software, GPS, TV, Projector Accessories, and DVDs) scraped from the leading PCW BizRate, Liu et al. (2012) examined the pricing strategies of sellers with different reputation levels. They showed that, on average, low-rated sellers charge considerably higher prices than high-rated sellers and that this negative price premium effect is even larger when market competition increases.

Lünnemann and Wintr (2011) conducted a comparative analysis with much broader product category coverage on online price stickiness in the United States and four large European countries (France, Germany, Italy, and the United Kingdom). They collected more than 5 million price quotes from leading PCWs at daily frequency during a 1-year period between December 2004 and December 2005. Their data set contains common product categories, including consumer electronics, entertainment, information technology, small household appliances, consumer durables, and services, from more than 1,000 sellers. They find that prices adjust more often in European online markets than in U.S. online markets. However, the data sets used in these studies cover only a short period of time (i.e., not exceeding a year). In a more recent study, Gorodnichenko and Talavera (2017) collected 20 million price quotes for more than 115,000 goods covering 55 good types in four main categories (computers, electronics, software, and cameras) from a PCW over 5 years for the U.S. and Canadian online markets. They showed that online prices tend to be more flexible than conventional retail prices: price adjustments occur more often in online stores than in offline stores, but the size of price changes in online stores is less than one-half of that in brick-and-mortar stores.

Because of different competitive environments, prices collected from PCWs might not be representative, as sellers who participate in these platforms tend to raise the frequency and lower the size of price changes (Ellison & Ellison, 2009). Alternatively, researchers seek to expand data coverage by scraping a larger number of websites at high frequency, or to focus on specific types of websites, especially in studies on the consumer price index (CPI). The construction of the CPI requires price data that are representative of retail transactions. Hence, instead of gathering data from online-only retailers, which may sell many products but account for only a small proportion of retail transactions, Cavallo (2018) collected daily prices from large multichannel retailers (e.g., Walmart) that sell goods both online and offline. Moreover, researchers may focus their data collection on product categories that are present in the official statistics, for which consumer expenditure weights are available. An example is the study of Faryna et al. (2018), who scraped online prices to compare online and official price statistics. Their data cover up to 46% of the consumer price inflation basket with more than 328 CPI sub-components. The data set includes 3 million price observations for more than 75,000 product categories of food, beverages, alcohol, and tobacco.

Despite the increasing efforts to scrape at a larger scale and for a longer duration to widen data coverage, only a limited number of studies feature data on the quantity of goods sold (Gorodnichenko & Talavera, 2017). To derive a quantity proxy, Chevalier and Goolsbee (2003) employed the observed sales rankings of approximately 20,000 books listed on Amazon.com and BarnesandNoble.com to estimate elasticities of demand and compute a price index for online books. Using a similar approach, a project of the U.K. Office for National Statistics estimated sales quantities using products' popularity rankings (i.e., the order in which products are displayed on a web page when sorted by popularity) to approximate expenditure weights of goods in consumer price statistics.9

Online Job Vacancies

Since about 2010, with the growing number of employers and job seekers relying on online job portals to advertise and find jobs, researchers have increasingly identified the online job market as a new data source for analyzing labor market dynamics and trends. In comparison with more traditional approaches, scraping job vacancies has the advantage of time and cost effectiveness (Kureková et al., 2015). Specifically, while results of official labor market surveys might take up to a year to become available, online vacancies are real-time data that can be collected in a much shorter time at low cost. Another key advantage is that the content of online job ads usually provides more detailed information than traditional newspaper sources.

Various research papers have focused their data collection on a single online job board. For instance, to examine gender discrimination, Kuhn and Shen (2013) collected more than a million vacancies posted on Zhaopin.com, the third-largest Chinese online job portal. Over one tenth of the job ads express a gender preference (male or female), which is more common in jobs requiring lower levels of skill. Some studies have employed APIs provided by online job portals to collect vacancy data. An example is the work of Capiluppi and Baravalle (2010), who developed a web spider to download job postings via the API of monster.co.uk, a leading online job board, to investigate demand for IT skills in the United Kingdom. Their data set covers more than 48,000 vacancies in the IT category during the 9-month period from September 2009 to May 2010.

However, data collected from a single website might not be representative of the overall job market. Alternatively, a large number of research papers rely on vacancy data scraped by a third party. The most well-known provider is Burning Glass Technologies (BGT), an analytics software company that scrapes, parses, and cleans vacancies from over 40,000 online job boards and company websites to create a near-universe of U.S. online job ads. Using a data set of almost 45 million online job postings from 2010 to 2015 collected by BGT, Deming and Kahn (2018) identified 10 commonly observed and recognizable skill groups to document the wage returns to cognitive and social skills. Other studies have used BGT data to address various labor market research issues, such as changes in skills demand and the nature of work (see, e.g., Adams et al., 2020; Djumalieva & Sleeman, 2018; Hershbein & Kahn, 2018), responses of the labor market to exogenous shocks (see, e.g., Forsythe et al., 2020; Javorcik et al., 2019), and responses of the labor market to technological developments (see, e.g., Acemoglu et al., 2020; Alekseeva et al., 2020; Deming & Noray, 2018).

Other Applications

Internet data enable scholars to gain more insight into peer-to-peer (P2P) economies, including online marketplaces (e.g., eBay, Amazon), housing and rental markets (e.g., Airbnb, Zoopla), and P2P lending platforms (e.g., Prosper, Renrendai). For the latter, given the availability of an API, numerous studies employ data provided by Prosper.com, a U.S.-based P2P lending website (see, e.g., Bertsch et al., 2017; Krumme & Herrero, 2009; Lin et al., 2013). For instance, using the Prosper API, Lin et al. (2013) obtained information on borrowers' credit histories, friendships, and the outcomes of loan requests for more than 56,000 loan listings between January 2007 and May 2008 to study information asymmetry in the P2P lending market. Specifically, they focused their analysis on borrowers' friendship networks and credit quality and showed that friendships increase the likelihood of successful funding, lower interest rates on funded loans, and lower default rates. Later, Bertsch et al. (2017) scraped more than 326,000 loan-hour observations of loan funding progress and borrower and loan listing characteristics from Prosper between November 2015 and January 2016 to examine the impact of the monetary normalization process on online lending markets.

In the context of online marketplaces, using a data set of 20,000 coin auctions scraped directly from eBay, Lucking-Reiley et al. (2007) documented the effect of sellers' feedback ratings on auction prices. They found that negative feedback ratings have a much larger effect on auction prices than positive feedback ratings. With regard to housing markets, Yilmaz et al. (2020) scraped more than 300,000 rental listings in 13 U.K. cities between 2015 and 2017 through the Zoopla API to explore seasonality in the rental market. Such seasonal patterns can be explained by students' higher rental demand around the start of an academic year; this effect becomes stronger when the distance to the university campus is controlled for. Wang and Nicolau (2017) investigated accommodation prices with Airbnb listings data from a third-party website, Insideairbnb.com, which provides data sourced from publicly available information on Airbnb.com. Similarly, obtaining data on Airbnb listings in Boston from a third-party provider, Rainmaker Insights, Inc., which collects listings of property for rent, Horn and Merante (2017) examined the impact of home sharing on the housing market. Their findings suggest that the increasing presence of Airbnb reduces the supply of housing offered for rent, thus increasing asking rents.

Concluding Remarks

The enormous amount of data available online on almost every topic should attract the interest of all empirical researchers. As an increasingly large part of our everyday activities moves online, a process accelerated by the COVID-19 pandemic, scraping the Internet will become the only way to find information about a large part of human activity.

Data collected through web scraping has been used in thousands of projects and has led to better understanding of price formation, auction mechanisms, labor markets, social interactions, and many more important topics. With new data regularly uploaded on various websites, the old answers can be verified in different settings and new research questions can be posed.

However, the exclusivity of data retrieved with web scraping often means that researchers are left alone to identify the potential pitfalls of the data. For the large public databases, there is broad knowledge in the public domain about database-specific issues typical of empirical research, such as sample selection, endogeneity, omitted variables, and errors-in-variables. In contrast, with novel data sets, the entire burden rests on the web-scraping researchers. A growing strand of methodological research highlighting those pitfalls and suggesting steps to adjust the collected sample for representativeness helps to overcome these difficulties (e.g., Konny et al., 2022; Kureková et al., 2015).

In contrast to proprietary data or sensitive registry data, which require written agreements and substantial funds and are thus off-limits to most researchers, especially in their early career stages, web scraping is available to everyone, and online data can be tapped with moderate ease. Abundant online resources and an active community willing to answer even the most complex technical questions make the learning process considerably easier. Thus, the return on investment in web-scraping skills is remarkably high, and not only for empirical researchers.

Further Reading

  • Bolton, P., Holmström, B., Maskin, E., Pissarides, C., Spence, M., Sun, T., Sun, T., Xiong, W., Yang, L., Chen, L., & Huang, Y. (2021). Understanding big data: Data calculus in the digital era. Luohan Academy Report.
  • Jarmin, R. S. (2019). Evolving measurement for an evolving economy: Thoughts on 21st century US economic statistics. Journal of Economic Perspectives, 33(1), 165–184.

References

  • Acemoglu, D., Autor, D., Hazell, J., & Restrepo, P. (2020). AI and jobs: Evidence from online vacancies (National Bureau of Economic Research No. w28257).
  • Adams, A., Balgova, M., & Matthias, Q. (2020). Flexible work arrangements in low wage jobs: Evidence from job vacancy data (IZA Discussion Paper No. 13691).
  • Alekseeva, L., Azar, J., Gine, M., Samila, S., & Taska, B. (2020). The demand for AI skills in the labor market (CEPR Discussion Paper No. DP14320).
  • Baye, M. R., & Morgan, J. (2009). Brand and price advertising in online markets. Management Science, 55(7), 1139–1151.
  • Bertsch, C., Hull, I., & Zhang, X. (2017). Monetary normalizations and consumer credit: Evidence from Fed liftoff and online lending (Sveriges Riksbank Working Paper No. 319).
  • Boivin, J., Clark, R., & Vincent, N. (2012). Virtual borders. Journal of International Economics, 86(2), 327–335.
  • Capiluppi, A., & Baravalle, A. (2010). Matching demand and offer in on-line provision: A longitudinal study of monster.com. In 12th IEEE International Symposium on Web Systems Evolution (WSE) (pp. 13–21). IEEE.
  • Cavallo, A. (2017). Are online and offline prices similar? Evidence from large multi-channel retailers. American Economic Review, 107(1), 283–303.
  • Cavallo, A. (2018). Scraped data and sticky prices. Review of Economics and Statistics, 100(1), 105–119.
  • Cavallo, A., & Rigobon, R. (2016). The billion prices project: Using online prices for measurement and research. Journal of Economic Perspectives, 30(2), 151–178.
  • Chevalier, J., & Goolsbee, A. (2003). Measuring prices and price competition online: Amazon.com and BarnesandNoble.com. Quantitative Marketing and Economics, 1(2), 203–222.
  • Deming, D., & Kahn, L. B. (2018). Skill requirements across firms and labor markets: Evidence from job postings for professionals. Journal of Labor Economics, 36(S1), S337–S369.
  • Deming, D. J., & Noray, K. L. (2018). STEM careers and technological change (National Bureau of Economic Research No. w25065).
  • Djumalieva, J., & Sleeman, C. (2018). An open and data-driven taxonomy of skills extracted from online job adverts (ESCoE Discussion Paper No. 2018–13).
  • Doerr, S., Gambacorta, L., & Garralda, J. M. S. (2021). Big data and machine learning in central banking (BIS Working Papers No. 930).
  • Dudás, G., Boros, L., & Vida, G. (2017). Comparing the temporal changes of airfares on online travel agency websites and metasearch engines. Tourism: An International Interdisciplinary Journal, 65(2), 187–203.
  • Edelman, B. (2012). Using Internet data for economic research. Journal of Economic Perspectives, 26(2), 189–206.
  • Ellison, G., & Ellison, S. F. (2009). Search, obfuscation, and price elasticities on the internet. Econometrica, 77(2), 427–452.
  • Faryna, O., Talavera, O., & Yukhymenko, T. (2018). What drives the difference between online and official price indexes? Visnyk of the National Bank of Ukraine, 243, 21–32.
  • Forsythe, E., Kahn, L. B., Lange, F., & Wiczer, D. (2020). Labor demand in the time of COVID-19: Evidence from vacancy postings and UI claims. Journal of Public Economics, 189, 104238.
  • Gnutzmann, H. (2014). Of pennies and prospects: Understanding behaviour in penny auctions. SSRN Electronic Journal, 2492108.
  • Gorodnichenko, Y., Pham, T., & Talavera, O. (2021). The voice of monetary policy (National Bureau of Economic Research No. w28592).
  • Gorodnichenko, Y., & Talavera, O. (2017). Price setting in online markets: Basic facts, international comparisons, and cross-border integration. American Economic Review, 107(1), 249–282.
  • Halket, J., & di Custoza, M. P. M. (2015). Homeownership and the scarcity of rentals. Journal of Monetary Economics, 76, 107–123.
  • Hershbein, B., & Kahn, L. B. (2018). Do recessions accelerate routine-biased technological change? Evidence from vacancy postings. American Economic Review, 108(7), 1737–1772.
  • Horn, K., & Merante, M. (2017). Is home sharing driving up rents? Evidence from Airbnb in Boston. Journal of Housing Economics, 38, 14–24.
  • Javorcik, B., Kett, B., Stapleton, K., & O'Kane, L. (2019). The Brexit vote and labour demand: Evidence from online job postings (Economics Series Working Papers No. 878). Department of Economics, University of Oxford.
  • Konny, C. G., Williams, B. K., & Friedman, D. M. (2022). Big data in the US consumer price index: Experiences and plans. In K. Abraham, R. Jarmin, B. Moyer, & M. Shapiro (Eds.), Big data for 21st century economic statistics. University of Chicago Press.
  • Kroft, K., & Pope, D. G. (2014). Does online search crowd out traditional search and improve matching efficiency? Evidence from Craigslist. Journal of Labor Economics, 32(2), 259–303.
  • Krumme, K. A., & Herrero, S. (2009). Lending behavior and community structure in an online peer-to-peer economic network. In 2009 International Conference on Computational Science and Engineering (Vol. 4, pp. 613–618). IEEE.
  • Kuhn, P., & Shen, K. (2013). Gender discrimination in job ads: Evidence from China. The Quarterly Journal of Economics, 128(1), 287–336.
  • Kureková, L. M., Beblavý, M., & Thum-Thysen, A. (2015). Using online vacancies and web surveys to analyse the labour market: A methodological inquiry. IZA Journal of Labor Economics, 4(1), 1–20.
  • Lin, M., Prabhala, N. R., & Viswanathan, S. (2013). Judging borrowers by the company they keep: Friendship networks and information asymmetry in online peer-to-peer lending. Management Science, 59(1), 17–35.
  • Liu, Y., Feng, J., & Wei, K. K. (2012). Negative price premium effect in online market—The impact of competition and buyer informativeness on the pricing strategies of sellers with different reputation levels. Decision Support Systems, 54(1), 681–690.
  • Lucking-Reiley, D., Bryan, D., Prasad, N., & Reeves, D. (2007). Pennies from eBay: The determinants of price in online auctions. The Journal of Industrial Economics, 55(2), 223–233.
  • Lünnemann, P., & Wintr, L. (2011). Price stickiness in the US and Europe revisited: Evidence from Internet prices. Oxford Bulletin of Economics and Statistics, 73(5), 593–621.
  • Piazzesi, M., Schneider, M., & Stroebel, J. (2020). Segmented housing search. American Economic Review, 110(3), 720–759.
  • Reinsel, D., Gantz, J., & Rydning, J. (2017). Data age 2025: The evolution of data to life-critical: Don't focus on big data; focus on the data that's big (IDC Whitepaper).
  • Roth, A. E., & Ockenfels, A. (2002). Last-minute bidding and the rules for ending second-price auctions: Evidence from eBay and Amazon auctions on the Internet. American Economic Review, 92(4), 1093–1103.
  • Takahashi, H. (2018). Strategic design under uncertain evaluations: Structural analysis of design-build auctions. The RAND Journal of Economics, 49(3), 594–618.
  • United Nations Conference on Trade and Development. (2019). Digital economy report 2019. Value creation and capture: Implications for developing countries.
  • Wang, D., & Nicolau, J. L. (2017). Price determinants of sharing economy–based accommodation rental: A study of listings from 33 cities on Airbnb.com. International Journal of Hospitality Management, 62, 120–131.
  • Yilmaz, O., Talavera, O., & Jia, J. (2020). Liquidity, seasonality, and distance to universities: The case of UK rental markets (Discussion Papers No. 20-11). Department of Economics, University of Birmingham.

1. See Internet Live Stats.

2. Hershbein and Kahn (2018) use data scraped by a vacancy ad aggregator, not data scraped directly by the authors.

3. Some researchers infer quantities from analyzing changes in items in stock, information available for a subset of online retailers.

4. See Wordery.

5. In a recent landmark case in the United States (hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985, 9th Cir., 2019), the U.S. Court of Appeals denied LinkedIn's request to prevent a small analytics company, hiQ, from scraping its data.

6. Documentation can be found at BeautifulSoup.

7. Documentation can be found at Scrapy.

8. Documentation can be found at iMacros.

9. See Office for National Statistics.


What Is Web Scraping? A Complete Beginner’s Guide


As the digital economy expands, the role of web scraping becomes ever more important. Read on to learn what web scraping is, how it works, and why it’s so important for data analytics.

The amount of data in our lives is growing exponentially. With this surge, data analytics has become a hugely important part of the way organizations are run. And while data has many sources, its biggest repository is on the web. As the fields of big data analytics , artificial intelligence , and machine learning grow, companies need data analysts who can scrape the web in increasingly sophisticated ways.

This beginner’s guide offers a total introduction to web scraping, what it is, how it’s used, and what the process involves. We’ll cover:

  • What is web scraping?
  • What is web scraping used for?
  • How does a web scraper function?
  • How to scrape the web (step-by-step)
  • What tools can you use to scrape the web?
  • What else do you need to know about web scraping?

Before we get into the details, though, let’s start with the simple stuff…

1. What is web scraping?

Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed. If you’ve ever copied and pasted content from a website into an Excel spreadsheet, this is essentially what web scraping is, but on a very small scale.

However, when people refer to ‘web scrapers,’ they’re usually talking about software applications. Web scraping applications (or ‘bots’) are programmed to visit websites, grab the relevant pages and extract useful information. By automating this process, these bots can extract huge amounts of data in a very short time. This has obvious benefits in the digital age, when big data—which is constantly updating and changing—plays such a prominent role. You can learn more about the nature of big data in this post.

What kinds of data can you scrape from the web?

If there’s data on a website, then in theory, it’s scrapable! Common data types organizations collect include images, videos, text, product information, customer sentiments and reviews (on sites like Twitter, Yell, or Tripadvisor), and pricing from comparison websites. There are some legal rules about what types of information you can scrape, but we’ll cover these later on.

2. What is web scraping used for?

Web scraping has countless applications, especially within the field of data analytics. Market research companies use scrapers to pull data from social media or online forums for things like customer sentiment analysis. Others scrape data from product sites like Amazon or eBay to support competitor analysis.

Meanwhile, Google regularly uses web scraping to analyze, rank, and index web content. Web scraping also allows Google to extract information from third-party websites before redirecting it to its own (for instance, it scrapes e-commerce sites to populate Google Shopping).

Many companies also carry out contact scraping, which is when they scrape the web for contact information to be used for marketing purposes. If you’ve ever granted a company access to your contacts in exchange for using their services, then you’ve given them permission to do just this.

There are few restrictions on how web scraping can be used. It’s essentially down to how creative you are and what your end goal is. From real estate listings, to weather data, to carrying out SEO audits, the list is pretty much endless!

However, it should be noted that web scraping also has a dark underbelly. Bad players often scrape data like bank details or other personal information to conduct fraud, scams, intellectual property theft, and extortion. It’s good to be aware of these dangers before starting your own web scraping journey. Make sure you keep abreast of the legal rules around web scraping. We’ll cover these a bit more in section six.

3. How does a web scraper function?

So, we now know what web scraping is, and why different organizations use it. But how does a web scraper work? While the exact method differs depending on the software or tools you’re using, all web scraping bots follow three basic principles:

  • Step 1: Making an HTTP request to a server
  • Step 2: Extracting and parsing (or breaking down) the website's code
  • Step 3: Saving the relevant data locally

Now let’s take a look at each of these in a little more detail.

Step 1: Making an HTTP request to a server

As an individual, when you visit a website via your browser, you send what's called an HTTP request. This is basically the digital equivalent of knocking on the door, asking to come in. Once your request is approved, you can then access that site and all the information on it. Just like a person, a web scraper needs permission to access a site. Therefore, the first thing a web scraper does is send an HTTP request to the site it's targeting.
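
In Python, this step is typically a single call to the requests library. The sketch below uses a hypothetical URL; the custom User-Agent header is an optional courtesy that identifies the scraper to the site.

```python
# A minimal sketch of step 1: requesting a page over HTTP.
# The target URL is hypothetical.
import requests

url = "https://example.com/books"
response = requests.get(
    url,
    headers={"User-Agent": "research-scraper/0.1 (contact: you@example.com)"},
    timeout=30,
)

if response.status_code == 200:
    html = response.text      # raw HTML to parse in the next step
else:
    print(f"Request failed with status {response.status_code}")
```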

Step 2: Extracting and parsing the website’s code

Once a website gives a scraper access, the bot can read and extract the site’s HTML or XML code. This code determines the website’s content structure. The scraper will then parse the code (which basically means breaking it down into its constituent parts) so that it can identify and extract elements or objects that have been predefined by whoever set the bot loose! These might include specific text, ratings, classes, tags, IDs, or other information.
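
As a small illustration, the snippet below parses an invented fragment of HTML with BeautifulSoup and pulls out the elements nested inside a "review" tag. Real pages will use different tags and class names.

```python
# A minimal sketch of step 2: parsing HTML and extracting predefined elements.
# The markup below is invented for illustration.
from bs4 import BeautifulSoup

html = """
<div class="review">
  <span class="title">A Good Book</span>
  <span class="rating">4.5 out of 5 stars</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for review in soup.find_all("div", class_="review"):
    title = review.find("span", class_="title").get_text(strip=True)
    rating = review.find("span", class_="rating").get_text(strip=True)
    print(title, "-", rating)
```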

Step 3: Saving the relevant data locally

Once the HTML or XML has been accessed, scraped, and parsed, the web scraper will then store the relevant data locally. As mentioned, the data extracted is predefined by you (having told the bot what you want it to collect). Data is usually stored as structured data, often in an Excel-friendly file such as a .csv or .xls format.
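
For example, a few lines of Python can write the extracted records to a CSV file. The field names below simply carry over from the parsing example and are assumptions, not a fixed schema.

```python
# A minimal sketch of step 3: saving extracted records locally as CSV.
import csv

records = [
    {"title": "A Good Book", "author": "J. Writer", "rating": "4.5"},
    {"title": "Another Book", "author": "A. Author", "rating": "3.8"},
]

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "rating"])
    writer.writeheader()
    writer.writerows(records)
```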

With these steps complete, you’re ready to start using the data for your intended purposes. Easy, eh? And it’s true…these three steps do make data scraping seem easy. In reality, though, the process isn’t carried out just once, but countless times. This comes with its own swathe of problems that need solving. For instance, badly coded scrapers may send too many HTTP requests, which can crash a site. Every website also has different rules for what bots can and can’t do. Executing web scraping code is just one part of a more involved process. Let’s look at that now.

4. How to scrape the web (step-by-step)

OK, so we understand what a web scraping bot does. But there’s more to it than simply executing code and hoping for the best! In this section, we’ll cover all the steps you need to follow. The exact method for carrying out these steps depends on the tools you’re using, so we’ll focus on the (non-technical) basics.

Step one: Find the URLs you want to scrape

It might sound obvious, but the first thing you need to do is to figure out which website(s) you want to scrape. If you’re investigating customer book reviews, for instance, you might want to scrape relevant data from sites like Amazon, Goodreads, or LibraryThing.

Step two: Inspect the page

Before coding your web scraper, you need to identify what it has to scrape. Right-clicking anywhere on the frontend of a website gives you the option to ‘inspect element’ or ‘view page source.’ This reveals the site’s backend code, which is what the scraper will read.

Step three: Identify the data you want to extract

If you’re looking at book reviews on Amazon, you’ll need to identify where these are located in the backend code. Most browsers automatically highlight selected frontend content with its corresponding code on the backend. Your aim is to identify the unique tags that enclose (or ‘nest’) the relevant content (e.g. <div> tags).

Step four: Write the necessary code

Once you've found the appropriate nesting tags, you'll need to incorporate them into your preferred scraping software. This basically tells the bot where to look and what to extract. It's commonly done using Python libraries, which do much of the heavy lifting. You need to specify exactly what data types you want the scraper to parse and store. For instance, if you're looking for book reviews, you'll want information such as the book title, author name, and rating.
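
Putting the pieces together, a scraper for this step might look like the sketch below. The URL and the CSS selectors (div.review, a.book-title, and so on) are assumptions for illustration; you would swap in whatever you found while inspecting the real page.

```python
# A hedged sketch of step four: requesting a page and extracting predefined
# fields. The URL and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/books/123/reviews"
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
reviews = []
for node in soup.select("div.review"):            # selector found via page inspection
    reviews.append({
        "book_title": node.select_one("a.book-title").get_text(strip=True),
        "author": node.select_one("span.author").get_text(strip=True),
        "rating": node.select_one("span.rating").get_text(strip=True),
    })

print(f"Scraped {len(reviews)} reviews")
```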

Step five: Execute the code

Once you’ve written the code, the next step is to execute it. Now to play the waiting game! This is where the scraper requests site access, extracts the data, and parses it (as per the steps outlined in the previous section).

Step six: Storing the data

After extracting, parsing, and collecting the relevant data, you'll need to store it. You can instruct your algorithm to do this by adding extra lines to your code. Which format you choose is up to you, but as mentioned, Excel-friendly formats are the most common. You can also run the extracted data through Python's re module (short for 'regular expressions') to produce a cleaner set of data that's easier to read.
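
As a small example of that clean-up step, the sketch below uses a regular expression to turn rating strings like "4.5 out of 5 stars" into plain numbers; the input strings are invented.

```python
# A minimal sketch of cleaning scraped text with regular expressions.
import re

raw_ratings = ["4.5 out of 5 stars", "3 out of 5 stars", "no rating yet"]

clean_ratings = []
for text in raw_ratings:
    match = re.search(r"(\d+(?:\.\d+)?)\s*out of 5", text)
    clean_ratings.append(float(match.group(1)) if match else None)

print(clean_ratings)   # [4.5, 3.0, None]
```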

Now that you've got the data you need, you're free to play around with it. Of course, as we often learn in our explorations of the data analytics process, web scraping isn't always as straightforward as it at first seems. It's common to make mistakes and you may need to repeat some steps. But don't worry, this is normal, and practice makes perfect!

5. What tools can you use to scrape the web?

We've covered the basics of how to scrape the web for data, but how does this work from a technical standpoint? Often, web scraping requires some knowledge of programming languages, the most popular for the task being Python. Luckily, Python comes with a huge number of open-source libraries that make web scraping much easier. These include:

BeautifulSoup

BeautifulSoup is a Python library commonly used to parse data from XML and HTML documents. By organizing the parsed content into more accessible trees, BeautifulSoup makes navigating and searching through large swathes of data much easier. It's the go-to tool for many data analysts.

Scrapy

Scrapy is a Python-based application framework that crawls and extracts structured data from the web. It's commonly used for data mining, information processing, and archiving historical content. As well as web scraping (which it was specifically designed for), it can be used as a general-purpose web crawler or to extract data through APIs.
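
To give a feel for the framework, here is a minimal Scrapy spider sketch. The site, start URL, and CSS selectors are hypothetical placeholders rather than a real target.

```python
# A minimal sketch of a Scrapy spider; the domain and selectors are invented.
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.example.com/bestsellers"]

    def parse(self, response):
        # Yield one record per listing, using CSS selectors.
        for book in response.css("article.book"):
            yield {
                "title": book.css("h2.title::text").get(),
                "author": book.css("span.author::text").get(),
                "rating": book.css("span.rating::text").get(),
            }
        # Follow the pagination link, if one exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as spider.py, a spider like this could be run with `scrapy runspider spider.py -o books.csv` to write its output straight to a CSV file.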

Pandas

Pandas is another multi-purpose Python library, used for data manipulation and indexing. It can be used to scrape the web in conjunction with BeautifulSoup. The main benefit of using pandas is that analysts can carry out the entire data analytics process in one language (avoiding the need to switch to other languages, such as R).
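
For instance, pandas can pull HTML tables from a page in one call and export the cleaned result, as in the sketch below. The URL is hypothetical, and read_html requires an HTML parser such as lxml or html5lib to be installed.

```python
# A minimal sketch of using pandas in a scraping workflow.
# The URL is hypothetical; read_html needs lxml or html5lib installed.
import pandas as pd

tables = pd.read_html("https://example.com/price-table")   # list of DataFrames, one per table
prices = tables[0]                                          # take the first table on the page
prices.columns = [str(c).strip().lower() for c in prices.columns]
prices.to_csv("prices.csv", index=False)
```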

Parsehub

A bonus tool, in case you're not an experienced programmer! Parsehub is a free online tool (to be clear, this one's not a Python library) that makes it easy to scrape online data. The only catch is that for full functionality you'll need to pay. But the free tool is worth playing around with, and the company offers excellent customer support.

There are many other tools available, from general-purpose scraping tools to those designed for more sophisticated, niche tasks. The best thing to do is to explore which tools suit your interests and skill set, and then add the appropriate ones to your data analytics arsenal!

6. What else do you need to know about web scraping?

We already mentioned that web scraping isn’t always as simple as following a step-by-step process. Here’s a checklist of additional things to consider before scraping a website.

Have you refined your target data?

When you’re coding your web scraper, it’s important to be as specific as possible about what you want to collect. Keep things too vague and you’ll end up with far too much data (and a headache!) It’s best to invest some time upfront to produce a clear plan. This will save you lots of effort cleaning your data in the long run.

Have you checked the site’s robots.txt?

Each website has what's called a robots.txt file. This must always be your first port of call. This file communicates with web scrapers, telling them which areas of the site are out of bounds. If a site's robots.txt disallows scraping on certain (or all) pages, then you should always abide by these instructions.

Have you checked the site’s terms of service?

In addition to the robots.txt, you should review a website’s terms of service (TOS). While the two should align, this is sometimes overlooked. The TOS might have a formal clause outlining what you can and can’t do with the data on their site. You can get into legal trouble if you break these rules, so make sure you don’t!

Are you following data protection protocols?

Just because certain data is available doesn’t mean you’re allowed to scrape it, free from consequences. Be very careful about the laws in different jurisdictions, and follow each region’s data protection protocols. For instance, in the EU, the General Data Protection Regulation (GDPR) protects certain personal data from extraction, meaning it’s against the law to scrape it without people’s explicit consent.

Are you at risk of crashing a website?

Big websites, like Google or Amazon, are designed to handle high traffic. Smaller sites are not. It’s therefore important that you don’t overload a site with too many HTTP requests, which can slow it down, or even crash it completely. In fact, this is a technique often used by hackers. They flood sites with requests to bring them down, in what’s known as a ‘denial of service’ attack. Make sure you don’t carry one of these out by mistake! Don’t scrape too aggressively, either; include plenty of time intervals between requests, and avoid scraping a site during its peak hours.

Be mindful of all these considerations, be careful with your code, and you should be happily scraping the web in no time at all.

7. In summary

In this post, we’ve looked at what data scraping is, how it’s used, and what the process involves. Key takeaways include:

  • Web scraping can be used to collect all sorts of data types: From images to videos, text, numerical data, and more.
  • Web scraping has multiple uses: From contact scraping and trawling social media for brand mentions to carrying out SEO audits, the possibilities are endless.
  • Planning is important: Taking time to plan what you want to scrape beforehand will save you effort in the long run when it comes to cleaning your data.
  • Python is a popular tool for scraping the web: Python libraries like BeautifulSoup, Scrapy, and pandas are all common tools for scraping the web.
  • Don’t break the law: Before scraping the web, check the laws in various jurisdictions, and be mindful not to breach a site’s terms of service.
  • Etiquette is important, too: Consider factors such as a site’s resources—don’t overload them, or you’ll risk bringing them down. It’s nice to be nice!

Data scraping is just one of the steps involved in the broader data analytics process. To learn about data analytics, why not check out our free, five-day data analytics short course ? We can also recommend the following posts:

  • Where to find free datasets for your next project
  • What is data quality and why does it matter?
  • Quantitative vs. qualitative data: What’s the difference?

Roadmap to Web Scraping: Use Cases, Methods & Tools in 2024


Data is critical for business, and the internet is a large data source, offering insights about vendors, products, services, and customers. Yet businesses still have difficulty automatically collecting data from numerous sources, especially the internet. Web scraping enables businesses to automatically extract public data from websites using web scraping tools.

In this article, we will dive into each critical aspect of web scraping, including what it is, how it works, its use cases and best practices.

What is web scraping?

Web scraping, sometimes called web crawling, is the process of extracting data from websites. For an in-depth analysis of leading web scraping tools, refer to our comprehensive guide.

How do web scraper tools and bots work?

The process of scraping a page involves making requests to the page and extracting machine-readable information from it. As shown in Figure 2, the general web scraping process consists of the following seven steps:

  • Identification of target URLs
  • If the website to be crawled uses anti-scraping technologies such as CAPTCHAs, the scraper may need to choose the appropriate proxy server solution to get a new IP address to send its requests from.
  • Making requests to these URLs to get HTML code
  • Using locators to identify the location of data in HTML code
  • Parsing the data string that contains information
  • Converting the scraped data into the desired format
  • Transferring the scraped data to the data storage of choice

Figure 2: The seven steps of the web scraping process

Bright Data offers its web scraper as a managed cloud service. Users can rely on coding or no-code interfaces to build scrapers that run on the infrastructure provided by its SaaS solution.


Which web crawler should you use?

The right web crawler tool or service depends on various factors, including the type of project, budget, and availability of technical personnel. The decision process for choosing a web crawler should look like the one outlined below:

We developed a data-driven web scraping vendor evaluation to help you select the right web scraper.

Figure 3: Roadmap for choosing the right web scraping tool

Top 10 web scraping applications/use cases

Data analytics & data science

1. Training predictive models: Predictive models require a large volume of data to improve the accuracy of their outputs. However, collecting a large volume of data is not easy for businesses that rely on manual processes. Web crawlers help data scientists extract the required data instead of collecting it manually.

2. Optimizing NLP models: NLP underpins conversational AI applications. A massive amount of data, especially data collected from the web, is necessary for optimizing NLP models. Web crawlers provide high-quality and current data for NLP model training.

Real Estate

3. Web scraping in real estate: Web scraping in real estate enables companies to extract property and consumer data. Scraped data helps real estate companies:

  • analyze the property market.
  • optimize their prices according to current market values and customers’ expectations.
  • set targeted advertisement.
  • analyze market cycles and forecast sales.

Oxylabs’ real estate scraper API allows users to access and gather various types of real estate data, including price history, property listings, and rental rates, bypassing anti-bot measures. 


Marketing & sales

4. Price scraping: Companies can leverage crawled data to improve their revenues. Web scrapers automatically extract competitors’ price data from websites. Price scraping enables businesses to: 

  • understand customers’ purchase behavior.
  • set their prices to stay competitive by tracking competitors' product prices online.
  • attract their competitors’ customers.

5. Scraping/Monitoring competitors' product data: Web scrapers help companies extract and monitor products' reviews, features, and stock availability from suppliers' product pages. It enables companies to analyze their competitors, generate leads, and monitor their customers.

6. Lead generation: Web scraping helps companies improve their lead generation performance while saving time and resources. A wealth of prospect data is available online for B2B and B2C companies. Web scraping helps companies collect up-to-date contact information for new customers to reach out to, such as social media accounts and email addresses.

Check out how to generate leads using Instagram search queries such as hashtags and keywords.

7. SEO monitoring: Web scraping helps content creators check primary SEO metrics, such as keyword rankings, dead links, and position in Google search results. Web crawlers collect publicly available competitor data from targeted websites, including keywords, URLs, customer reviews, and more. This enables companies to optimize their content to attract more views.

8. Market sentiment analysis: Using web scrapers in marketing enables companies to:

  • analyze and track their competitors’ performance on social media 
  • optimize their influencer marketing activities
  • track the actual performance of their ads

Human Resources

9. Improving recruitment processes: Web scrapers help recruiters automatically extract candidates' data from recruiting websites such as LinkedIn. Recruiters can leverage the extracted data to:

  • analyze and compare candidates’ qualifications.
  • collect candidates' contact information, such as email addresses and phone numbers.
  • collect salary ranges and adjust their salaries accordingly.
  • analyze competitors’ offerings and optimize their job offerings.

Finance & Banking

10. Credit rating: Credit rating is the process of evaluating a borrower's creditworthiness. Credit scores are calculated for individuals, businesses, companies, or governments. Web scrapers extract data about a business's financial status from public company resources to calculate credit rating scores.

Check out top 18 web scraping applications & use cases to learn more about web scraping use cases. 

Top 7 web scraping best practices

Here are the top seven web scraping best practices to help you apply web scraping effectively:

  • Use proxy servers: Many large website operators use anti-bot tools that need to be bypassed to crawl a large number of HTML pages. Using proxy servers and making requests through different IP addresses can help overcome these obstacles (a minimal code sketch follows this list). If you cannot decide which proxy server type is best for you, read our ultimate guide to proxy server types.
  • Use dynamic IP: Changing your IP from static to dynamic can also be useful to avoid being detected as a crawler and getting blocked.
  • Throttle your request rate: It is easier to detect crawlers if they make requests faster than humans, so slow down the pace of your requests.
  • Crawl during off-peak hours: A website's server may not respond if it gets too many requests simultaneously. Scheduling crawl times for the website's off-peak hours and pacing how the crawler interacts with the page can help avoid this issue.
  • Comply with GDPR: Scraping publicly available data from websites is generally permitted. However, under GDPR, it is illegal to scrape the personally identifiable information (PII) of an EU resident unless you have their explicit consent to do so.
  • Beware of terms & conditions: If you are going to scrape data from a website that requires login, you need to agree to its terms & conditions when you sign up. Some terms & conditions include web scraping policies that explicitly state that you are not allowed to scrape any data on the website.
  • Leverage machine learning: Scraping is turning into a cat & mouse game between content owners and content scrapers with both parties spending billions to overcome measures developed by the other party. We expect both parties to use machine learning to build more advanced systems.
  • Consider open source web scraping platforms: Open source is playing a larger role in software development, and this area is no different. Python remains highly popular, and we expect open source web scraping libraries such as Selenium, Puppeteer, and Beautiful Soup to shape web crawling processes in the near future.
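
As referenced in the proxy item above, here is a minimal sketch of routing requests through a proxy server with Python's requests library. The proxy address, credentials, and target URL are placeholders, not a real provider's details.

```python
# A minimal sketch of sending requests through a proxy.
# The proxy endpoint, credentials, and target URL are hypothetical.
import requests

proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get(
    "https://example.com/products",
    proxies=proxies,
    timeout=30,
)
print(response.status_code)
```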

What are the challenges of web scraping?

  • Complex website structures: Most web pages are based on HTML, and web page structures are widely divergent. Therefore when you need to scrape multiple websites, you need to build one scraper for each website.
  • Scraper maintenance can be costly: Websites change their page designs all the time. If the location of the data to be scraped changes, the crawler must be reprogrammed.
  • Anti-scraping tools used by websites: Anti-scraping tools enable web developers to manipulate content shown to bots and humans and also restrict bots from scraping the website. Some anti-scraping methods are IP blocking, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) , and honeypot traps.
  • Login requirement: Some information you want to extract from the web may require you to log in first. When a website requires login, the scraper needs to save the cookies sent with the requests so that the website recognizes the crawler as the same user who logged in earlier (see the sketch after this list).
  • Slow or unstable load speed: When websites load content slowly or fail to respond, refreshing the page may help, but the scraper may not know how to deal with such a situation.
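
For the login requirement above, a common approach in Python is to use a session object, which stores the cookies returned at login and sends them with every later request. The login URL, form field names, and credentials below are placeholders.

```python
# A minimal sketch of scraping behind a login using a persistent session.
# The login URL, form fields, and member page are hypothetical.
import requests

with requests.Session() as session:
    # The session keeps the cookies set by the login response.
    session.post(
        "https://example.com/login",
        data={"username": "my_user", "password": "my_password"},
    )
    page = session.get("https://example.com/members/listings")
    print(page.status_code)
```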

To learn more about web scraping challenges, check out web scraping: challenges & best practices

For more on web scraping

  • Web Scraping tools: Data-driven Benchmarking
  • Top 7 Python Web Scraping Libraries & Tools in 2023
  • The Ultimate Guide to Efficient Large-Scale Web Scraping [2023]

If you still have questions about the web scraping landscape, feel free to check out the sortable list of web scraping vendors .

This article was originally written by former AIMultiple industry analyst Izgi Arda Ozsubasi and reviewed by Cem Dilmegani


Scraping the Web for Public Health Gains: Ethical Considerations from a ‘Big Data’ Research Project on HIV and Incarceration

Stuart Rennie
1 UNC Bioethics Center, Department of Social Medicine, University of North Carolina at Chapel Hill

Mara Buchbinder

Eric Juengst, Lauren Brinkley-Rubinstein
2 Center for Health Equity, Department of Social Medicine, University of North Carolina at Chapel Hill

Colleen Blue
3 Institute for Global Health and Infectious Diseases, University of North Carolina at Chapel Hill

David L. Rosen

Web scraping involves using computer programs for automated extraction and organization of data from the Web for the purpose of further data analysis and use. It is frequently used by commercial companies, but also has become a valuable tool in epidemiological research and public health planning. In this paper, we explore ethical issues in a project that “scrapes” public websites of U.S. county jails as part of an effort to develop a comprehensive database (including individual-level jail incarcerations, court records and confidential HIV records) to enhance HIV surveillance and improve continuity of care for incarcerated populations. We argue that the well-known framework of Emanuel et al. (2000) provides only partial ethical guidance for the activities we describe, which lie at a complex intersection of public health research and public health practice. We suggest some ethical considerations from the ethics of public health practice to help fill gaps in this relatively unexplored area.

Introduction

The World Wide Web can be regarded as the largest database ever created in human history. Precise estimates of its magnitude—in terms of storage, communication and computation—are a matter of debate ( Pappas, 2016 ), but it is safe to say that the volume, speed and variety of data give the internet unprecedented potential to advance important social goals. As human actions are increasingly captured by digital technology, online data are becoming a highly valued information source for researchers of all stripes. In medicine and public health, many hope that vastly increased volumes of patient data will help overcome existing knowledge gaps, lead to health innovations and improve health outcomes.

One pervasive method of gathering digital data is the practice of ‘web scraping’. The massive amount of data available on the Web means that effective data collection and processing cannot be manually conducted by individual researchers or even large research teams. Web scraping, an alternative to manual data collection, entails the use of computer programs for automated extraction and organization of data from the Web for the purpose of further data analysis and use ( Krotov and Silva, 2018 ). Commercial companies are heavily reliant on web scraping to collect, for example, data on consumer preferences (e.g., in product reviews) and business competitors (e.g., prices) in real time to inform goals and strategy.

Web scraping has also become a valuable tool in epidemiological research and public health planning. With massive, publicly accessible health-related data available on the internet, epidemiologists increasingly need to be trained in computer programming, including web scraping ( Mooney et al. , 2015 ). Web scraping for public health purposes is not limited to health data; it can also include data of potential biomedical or public health significance, such as social media posts about meals or eating habits, records of court actions, or traffic patterns ( Vayena and Gasser, 2016 ; Richterich, 2018 ).

However, as its evocative name suggests, web scraping is not necessarily a benign procedure. In fall 2017, the San Francisco-based company Strava, described as a ‘social network for athletes’, announced an update to its global heat map of user activity that visually tracks the movements of users wearing Fitbits or other fitness trackers. Journalists and activists have raised ethical concerns regarding the fact that data scraped from Strava’s website could be used to track the location of military or intelligence personnel and identify individual users when combined with other information. These events are prompting the US military to rethink its policies regarding use of fitness trackers, which were previously encouraged to promote physical activity ( Hsu, 2018 ).

Considering such potential misuses, it is not always clear how to interpret and apply privacy standards and laws that were established prior to our highly interconnected, digitalized world ( Gold and Latonero, 2018 ). What level of control should individuals have over information posted by others about them? Are some individuals more at risk than others from research that ‘scrapes the web’ to glean publicly accessible information about them? What safeguards should be in place to prevent social harms that might be associated with the use of individualized online data in research and public health contexts? What other considerations should inform the just use of this enormous resource for research purposes?

To explore these questions in the context of a specific illustrative case, we discuss the ethics of an ongoing study in which we are 'scraping' public websites of US county jails to create a database of individual-level jail incarcerations. In collaboration with the North Carolina Health Department, we will combine these jail incarceration records with existing confidential HIV records to create a database that could, as we explain below, (i) inform new, enhanced forms of HIV surveillance incorporating jail populations and (ii) potentially contribute to future public health approaches for improving care for incarcerated populations. As such, this case study raises questions about the use of web scraping in both research and public health contexts. In the next section, we describe the case in some detail, and apply the widely used Emanuel et al. (2000) research ethics framework to raise and examine ethical concerns related to the use of web scraping in public health research. The Emanuel et al. framework was initially developed as guidance for the design and conduct of clinical research, but is often extended to other types of biomedical research, such as public health research. We use it as our point of departure in order to ask how well it can help address the issues raised by translational public health research that sits at the intersection of public health research and public health practice, like the web scraping activities of our case study. We argue that certain ethically salient aspects of 'big data' research in public health contexts are neglected by this influential framework. This is in part because of the ways in which such research would inform public health surveillance practices, which would also depend on ongoing web scraping to succeed. We conclude by suggesting how considerations from the ethics of public health practice can help to address these intersectional blind spots.

Case Study: Leveraging Big Data to Understand and Improve Continuity of Care among HIV-Positive Jail Inmates

HIV medications have been definitively shown to improve patients’ health and to prevent onward transmission of the virus ( Cohen et al. , 2011 ). And yet, in the USA, 40 per cent of people diagnosed with HIV are not retained in care, and therefore, do not have access to HIV medications ( Centers for Disease Control and Prevention, 2018 ). In this context, improving access to HIV care, including access to medications, is a central challenge in combatting the US HIV epidemic.

In 2017, our team began a National Institutes of Health-supported research project to develop a data system to improve HIV care for persons living with HIV in North Carolina who have had periods of incarceration in jail. Based on 2006 data, it was roughly estimated that among all adults in the USA infected with HIV, about one in six annually spend time incarcerated, mostly in local jails ( Spaulding et al. , 2009 ). Previous studies have demonstrated the disruptive impact of prison incarceration on HIV treatment and care ( Iroh et al. , 2015 ), but currently very little is known about access to medical services among HIV-positive persons before, during or after jail incarceration.

The major aim of our study is to improve continuity of care for justice-involved persons with HIV by improving estimates of the number of people with HIV who are passing through jails in North Carolina and by better understanding their engagement in care before, during and after incarceration. However, several challenges confronted us as we considered different study designs. Given the large number of county jails in the state (97) and the difficulties of recruiting jails as study sites, we determined that a prospective design across multiple sites was not practical. We were aware that some of the data necessary to answer our question did exist—the state Division of Public Health conducts HIV disease surveillance in which it uses a number of different data sources to determine if people living with HIV are routinely engaged in care. At the same time, there were no existing data sources to address continuity of care for people living with HIV in jails. The state’s public health data are integrated into a single database, and could in principle be linked to a database of jail incarceration records, stripped of identifiers and analyzed anonymously for our purposes. The barrier to this approach, however, is that jails operate independently, at the county level, and no single database of jail incarcerations is available to researchers.

Nevertheless, 29 of 97 jails in the state have public websites that provide some information about who is currently incarcerated in their facilities. These jails account for about half of all jail inmates in the state, or about 200,000 people each year. Although the jail websites can vary in their content and layout, they generally include incarcerated persons’ names, age or date of birth, arrest charge and date when the incarceration began. To create a database of jail incarcerations, we are web scraping these sites daily. The resulting individual incarceration records will be linked by county, name and date of birth (if available) or age, to a restricted set of state court records, which contain a more robust set of identifiers for people who have pending charges, including date of birth (if missing from the jail data) and partial social security number. As our research project unfolds, this enhanced set of identifiers will then be used by our partners within the state Division of Public Health to link the jail-court records to the state’s confidential HIV database, to create a deidentified statewide database of incarcerations involving HIV-infected individuals.
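To make the pipeline just described more concrete, here is a minimal sketch of a daily roster scrape in Python (the language of the scraping libraries mentioned elsewhere in this document). The URL, the table layout and the column order are hypothetical placeholders rather than the study’s actual code; real jail sites vary widely in structure and each would need its own parsing logic.

```python
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

# Hypothetical roster URL and HTML layout; real jail sites differ widely.
ROSTER_URL = "https://example-county.gov/jail/roster"


def scrape_roster(url):
    """Fetch a jail roster page and return one record per inmate row."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    # Assumes each inmate appears as a table row with a known column order.
    for row in soup.select("table.roster tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) < 4:
            continue
        records.append({
            "name": cells[0],
            "date_of_birth_or_age": cells[1],
            "charge": cells[2],
            "booking_date": cells[3],
            "scrape_date": date.today().isoformat(),
        })
    return records


if __name__ == "__main__":
    rows = scrape_roster(ROSTER_URL)
    if rows:
        out_path = f"roster_{date.today().isoformat()}.csv"
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
```

In a daily scrape of this kind, each run would append a dated CSV file that then feeds the record linkage step described below.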

The resulting database can be used to further the goals of public health surveillance in a number of ways, even with all personal identifiers removed. These include providing more precise information about: the burden of HIV-positive inmates in each jail sampled, the length of incarceration for HIV-positive inmates and the patterns of how persons living with HIV access HIV-related care before and after they are in jail. This enhanced surveillance information could be very useful for monitoring purposes and the allocation of resources for medical care among HIV-positive persons who are in and out of jail in North Carolina.

This database could also inform what is called a Data to Care (D2C) approach. D2C involves the use of HIV surveillance data routinely collected by state and local health departments to identify out-of-care individuals and re-link or re-engage them in care. Using this type of approach, state health departments could be notified in real time when a person living with HIV entered jail. This would enable the health department to engage that person in care while in jail, or to connect the individual to health-care providers upon release. This deployment of surveillance data is a shift from its traditional use (i.e. descriptive and monitoring purposes), and has been partly driven by biomedical advances in HIV treatment and prevention (Sweeney et al., 2013). The United States Centers for Disease Control and Prevention (CDC) recently included D2C as a condition of funding for state surveillance efforts, and it is currently being implemented in many states (Centers for Disease Control and Prevention, 2018). In other words, a systematic integration of jail, court and public health surveillance data could provide state public health agencies with a powerful tool to identify HIV-positive persons who have been jailed, and help them access medical care.

In linking public jail rosters to individualized court records for comparison against the state health department’s confidential HIV database, both our research project and the potential use of this system by the health department and the jails to enhance continuity of care will raise many questions about how best to protect human rights and ensure fairness in the context of web scraping-based surveillance. What is unique about this research project is that it scrapes personal identifiers from websites to develop a database that could contribute to health interventions with individuals (i.e., re-engaging them in HIV care). Most public health initiatives involving web scraping have a population rather than individual-level focus. For example, HealthMap scrapes data about disease outbreaks in real time worldwide from multiple digital sources (such as news feed aggregators, Twitter and agencies such as the World Health Organization) in a Google Map-inspired format. But no personal identifiers are gathered, and its purpose is to provide surveillance information rather than support interventions ( HealthMap, 2006 ). Alternatively, Facebook has created a system that monitors its users’ posts for language deemed to convey an imminent risk of suicide; when such a user is identified, first responders can be prompted to reach out to the individual to offer support ( Goggin, 2019 ). However, this project does not employ web scraping as Facebook is utilizing data posted to its own website.

One of the most influential frameworks for thinking through ethical issues in biomedical research is the one proposed by Emanuel et al. (2000) . Applying this framework to web scraping research helps to illuminate some important ethical considerations in designing such studies and implementing their results in public health programs. But just as importantly, that framework, designed as it is for conducting clinical trials, also fails to capture other ethical considerations raised by the implementation of web scraping for public health surveillance, which will also be critical to address in applying research findings to public health practice.

Applying the Emanuel Framework to Web Scraping in Public Health Research

The Emanuel et al. framework consists of eight principles widely held to be the default ethical requirements for the development, implementation and review of clinical research protocols involving human subjects. The framework has been adapted for use in developing countries (Emanuel et al., 2004), for the review of social science research (Wassenaar and Mamotte, 2012), and for specific areas of research, such as HIV phylogenetic studies (Mutenherwa et al., 2019). The requirements are considered universal by Emanuel et al., in the sense that they express widely recognized and accepted moral norms for research, although their use and interpretation can be shaped by contextual and cultural factors. Below, we briefly explain the requirements and relate them to the use of web scraping in our case study.

Social Value

To be ethically justified, it is necessary for a research study to contribute new information that could potentially lead to improvements in health and well-being. This could come in the form of hypothesis-testing, evaluations of interventions or epidemiological studies to help develop interventions. A research study without social value is ethically unjustified because it wastes valuable resources and exploits research participants by exposing them to risks without prospect of social or scientific benefits. The main social value of the research described in our case study is its potential to help improve care for persons living with HIV who become incarcerated. Disruptions in HIV care, and subsequent failure to achieve viral suppression, is an important medical and public health problem. In this case, research involving web scraping has potential social value because it contributes epidemiological information to a database aiming to improve HIV surveillance and continuity of care.

Scientific Validity

A research study with potential social value can nevertheless be ethically problematic if its methods are not sound and the resulting data are unreliable. According to Emanuel et al. , the hallmarks of scientific validity are a clear scientific objective and the use of an accepted methodology (including data analysis plan) appropriate for answering the research questions. Data validity has been a concern among those conducting public health research by combining large datasets. While some argue that using higher volumes of data may help avoid some methodological problems inherent in smaller sample sizes, large size by itself does not resolve other forms of error and bias ( Chiolero, 2013 ).

In our study, web scraping jail websites poses a number of methodological challenges. The rapid turnover of inmates at jails, as well as delays by jails in updating their websites, means that it is difficult to have a complete record of everyone who has passed through the jails, even when scraping jail websites every few hours. In addition, some inmates use aliases, which can complicate efforts to link information to one and the same individual. We have partly addressed this issue by including in our linkage process the aliases of all active defendants in the court data.
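As a rough illustration of the alias problem, the sketch below matches a jail booking record against court records using both primary names and aliases. It is a simplification under assumed field names (‘name’, ‘aliases’, ‘date_of_birth’); the study’s actual linkage relies on additional identifiers such as county and partial social security number, and the example data are invented.

```python
def normalize(name):
    """Crude normalization: lowercase, drop punctuation, collapse spaces."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())


def name_keys(court_record):
    """All names under which a defendant may appear, including aliases."""
    names = [court_record["name"]] + court_record.get("aliases", [])
    return {normalize(n) for n in names}


def match_court_records(jail_record, court_records):
    """Return court records whose name (or alias) matches the jail booking
    name and whose date of birth agrees, when both sources supply one."""
    target = normalize(jail_record["name"])
    hits = []
    for record in court_records:
        if target not in name_keys(record):
            continue
        dob_jail = jail_record.get("date_of_birth")
        dob_court = record.get("date_of_birth")
        if dob_jail and dob_court and dob_jail != dob_court:
            continue
        hits.append(record)
    return hits


# Invented example: the booking name matches an alias in the court data.
court = [{"name": "Jonathan Q. Smith", "aliases": ["John Smith"],
          "date_of_birth": "1985-03-14"}]
booking = {"name": "John Smith", "date_of_birth": "1985-03-14"}
print(match_court_records(booking, court))
```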

Decisions also need to be made as to what information should or should not be scraped from jail websites, as some web-published data may be superfluous to the needs of the project. For example, we discussed whether ‘mugshots’ (forensic photographic portraits) could ever be effectively used in differentiating between similar entries on a jail website (e.g. two entries with the same last name but slightly different spellings of a similar first name). Ultimately, we concluded that the resources necessary for manual or automated inspection of mugshots were beyond the scope of the project, and we therefore decided against collecting these images despite their potential to enhance our data validity. Moving forward, we will continue to assess scientific validity as we monitor and improve upon our process of linking data between the jails, court system and state public health department.

Fair Subject Selection

Inclusion and exclusion of research participants should not just advance scientific goals, but also be responsive to considerations of fairness. This requirement has several different dimensions, but in the context of health research, it concerns the equitable distribution of the burdens and benefits of research participation. In the past, research involving those in prison or jail was often exploitative in that it was designed to provide benefits for others, i.e. the general, non-incarcerated population ( Gostin, 2007 ). Selecting incarcerated individuals for research can be fair if the research responds to health problems specifically faced by this population and they are likely to benefit from the results. Our study collected incarceration data for the purpose of improving HIV care services for those who have been jailed. Fair subject selection, therefore, at least partly depends on the burdens and risks related to being in the database rather than not (see ‘favorable risk-benefit ratio’ below). In our study, inclusion in the database is primarily a result of whether the individual is incarcerated in a jail that chooses to publish a website with an inmate roster. Notably, about two-thirds of jails in our state (accounting for about half of all incarcerated persons) do not publish a website roster, and these tend to be lower resourced jails. Accordingly, people incarcerated in the lower resourced jails may be the least likely to be included in our research project. If future interventions based on this research lead to improved HIV care, those in lower resourced jails may be (at least initially) excluded from them.

In health-related studies involving technology (e.g. mobile phones), participation bias and fair subject selection are persistent challenges when access to the technology is limited. In our case study, it is a matter of institutions (i.e. jails) not publishing data online, rather than individuals not having the requisite devices. This reveals a dimension of fairness that does not enter the Emanuel et al. framework: power asymmetries, as they manifest in access to, use, and impact of information and communication technologies. In contemporary societies, data on individuals are routinely collected, stored and used by powerful institutions, both governmental and commercial. The purpose of such data collection and the motives of those who hold and use it are proper objects of ethical inquiry. In our case, inclusion or exclusion of potential research participants may depend not on scientific criteria, but on what information jails are willing to release; important data may be made inaccessible by jails in order to prevent public exposure of problems in the criminal justice system. In this context, collection of jail data (including use of web scraping) may be morally imperative whether jail authorities agree with it or not.

Favorable Risk–Benefit Ratio

Research invariably involves a certain degree of risk, and the ethical justification of exposure to risk depends on the relationship between risks to participants and benefits to the individual and society. Studies should be designed to minimize potential risks to participants without compromising the scientific validity of the research, while enhancing (when possible) potential benefits for individuals and society. The question of risk in our study is complicated. On the one hand, the potential benefits for incarcerated individuals and society are high, particularly if the research contributes to more effective surveillance and increases re-engagement of the affected population into care. One could also argue that this study aims to minimize risk, given that those who currently live with HIV but are not virally suppressed pose health risks to themselves and others. In addition, gathering information on jail websites that is already in the public domain would traditionally be understood as minimal risk, because the potential infringement of privacy piggybacks on an existing convention (at least in the USA).

On the other hand, whether the benefit–risk relationship is favorable depends heavily on the efficacy of the safeguards in place to protect sensitive information from inappropriate disclosure. It also depends on context, such as whether one is considering the use of data in enhanced surveillance or future use in a D2C approach. In the use of web scraping for enhanced surveillance, personal identifiers are being collected, even though the data will be anonymized for surveillance purposes. Furthermore, the information collected is sensitive and being processed in ways that could increase stigma: persons in the dataset have their HIV infection linked to formal criminal investigation and are characterized as appropriate targets of public health action. In addition, the meaning of ‘publicly available’ should be regarded critically. While it is true that ordinary citizens can gather some information about the incarcerated by visiting jail websites, web scraping can gather information in greater magnitude that can be used to generate insights about the incarcerated beyond the capacities of lay users ( Nissenbaum, 2009 ). Depending on how the data are used, web scraping can increase the visibility of a person’s incarceration history. In the enhanced surveillance use of the jail website data, where personally identifying information is removed, there is less risk of social harms to individuals while gaining potential individual and public health benefits. However, in the D2C use of the web scraped data, the risk of social harm to individuals could increase as their incarceration status is explicitly linked to court data and HIV status, and identifiers are retained to facilitate engagement in care.

Independent Review

Ethical review of research by third parties uninvolved in the research is meant to act as a counter to potential biases of researchers and to provide public assurance that the rights and welfare of research participants are adequately protected. The review should include assessment of scientific validity, social value, risks and benefits, the informed consent process and community engagement. None of this is unique to research involving web scraping. However, while web scraping has been practiced for decades, it is a relatively new approach to gathering information in health research and, as can happen with unfamiliar methods, members of ethics committees may reject studies that involve web scraping out of an abundance of caution. This is particularly likely in countries, such as the USA, where incarcerated populations are given extra research protections. Conversely, research ethics committees that are more familiar with the routine use of web scraping in other contexts may prematurely approve such studies without insight into potential risks and risk mitigation strategies. A challenge for research ethics committees with respect to web scraping and its applications is that protections for data have become highly technical; gone are the days when locking hard copies of data in cabinets, or even basic file encryption, was sufficient. Ethical evaluation of data safeguards increasingly requires input from information technology experts.

It is noteworthy that US Federal regulations have become less strict for studies not directly engaging human participants at the same time that researchers are increasingly using ‘big data’ resources and many in the research ethics community are concerned about the lack of protection offered by informed consent processes and the challenges of preserving anonymity (Barocas and Nissenbaum, 2014). This means that if there are ethical issues in studies using web scraping that are not captured by current regulations, much depends on research ethics committee expertise in sophisticated data processes and protections. While Emanuel et al. rightly assert that research ethics committee members should be ‘competent’, the competence required to assess ethical protections in web scraping is different from merely understanding research methodologies or ethics.

Informed Consent

The requirement for informed consent is based on the principle of respect for persons, where ‘respect’ is understood in terms of individuals having control over the decision to be part of a research study. The challenges of gaining valid informed consent in research generally, where individuals have adequate understanding of what participation involves and agree without inappropriate influence, are well known (Grady, 2015). In our study, individuals whose jail data are being gathered through web scraping did not consent to giving their information to jail authorities, to having this information collected from jail websites for research purposes, or to potential future uses. The same is true for their court and public health data. For its part, our Institutional Review Board considered our study minimal risk, approved it by expedited review in collaboration with a prison representative, and authorized a waiver of the requirement of informed consent on the basis of US federal regulations (45 CFR 46.116[d]).

From an ethical standpoint, one could argue against obtaining informed consent from incarcerated individuals in a number of ways. As mentioned earlier, the jail website data were already publicly available, and the benefits of the study appear to outweigh the risks. Furthermore, the purpose of the research was to contribute to surveillance efforts, and health surveillance is commonly conducted without individual consent when doing so clearly promotes public welfare. In addition, obtaining individual consent could undermine the scientific validity and social value of surveillance by introducing participation bias. Finally, obtaining individual consent from approximately 200,000 persons within the time span of our study would be practically impossible.

While some may find these reasons persuasive, there are still some loose ends. The potential D2C applications of our study, which would require identifiable information, raise questions about the appropriateness of waiving the requirement of informed consent. Imagine that the fully identified version of our database were used by public health agents to approach formerly incarcerated individuals who appear to have discontinued HIV care services. In that imagined scenario, the formerly incarcerated individuals may wonder how they came to be identified. The answer would be that they were contacted on the basis of an enhanced surveillance database. Will the beneficiaries of such enhanced surveillance and outreach feel that their privacy has been violated, considering how their data, some of which are sensitive, have been collected from various sources and used by agencies without their permission? Or will they appreciate the assistance? Sweeney et al. (2013) provide some evidence that D2C results in increased uptake of HIV services and (for the most part) acceptance of being contacted. While these data did not focus on incarcerated individuals, and do not put the ethical issue of nonconsent to rest, they at least suggest that health-related web scraping/surveillance/outreach initiatives may not necessarily erode the public’s trust.

Consent of web scrapers and web hosts must also be considered. While some web scrapers contact website hosts to request specific datasets, others simply scrape websites for the data they desire without asking permission. To some extent, website hosts can determine the parameters of their relationship with web scrapers by means of website architecture. Some websites have an application programming interface (API) that facilitates the gathering of information by users, including those who want to scrape their sites. One could reasonably assume that if a website has an API, its hosts agree to have their websites scraped within the framework set by the API. With websites that do not have an API, the situation is less clear, since the establishment of an API requires time and effort. Lack of an API may be due to, for example, limited human resources, and therefore, is not a reliable indicator of what resources the host wants to share. On the other hand, the position of site hosts regarding web scraping can be more reasonably inferred from the use of a variety of security methods to prevent web scraping, such as CAPTCHA (screening human from automated requests), blocking IP addresses and the use of honeypots (i.e. mechanisms to attract then block unauthorized users) and spider traps (i.e. mechanisms that cause the web crawling programs—also known as scripts—of unwanted users to crash or make an infinite number of requests).
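As a concrete illustration of how a scraper can look for and respect host-side signals, the sketch below consults a site’s robots.txt file (a standard convention by which hosts state which paths automated agents may fetch, complementary to the mechanisms listed above) and throttles its requests. The host name, user-agent string and delay are illustrative assumptions, not part of the study described here.

```python
import time
from typing import Optional
from urllib import robotparser

import requests

BASE_URL = "https://example-county.gov"  # hypothetical host
USER_AGENT = "research-scraper/0.1 (contact: researcher@example.edu)"

# Consult the host's robots.txt, which states which paths crawlers may fetch.
robots = robotparser.RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()


def polite_get(path: str, delay_seconds: float = 5.0) -> Optional[requests.Response]:
    """Fetch a page only if robots.txt allows it, and pause between requests."""
    url = f"{BASE_URL}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the host has signaled that this path is off-limits
    time.sleep(delay_seconds)  # throttle so the scrape does not burden the server
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```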

Much of the current ethical discourse on web scraping is about relationships between web scrapers and web hosts, and the norms that should govern the harvesting of website information ( Mitchell, 2015 ; Densmore, 2017 ; Krotov and Silva, 2018 ). A central question here is whether researchers using web scraping for health research purposes should hold themselves to moral standards more like those of biobank researchers than commercial web scrapers, and explicitly engage (and enter into agreements) with website hosts about the collection and use of the information they are gathering from them. If web scraping for health research purposes has the prospect for significant individual and social benefit, as well as exposure of identifiable persons to significant risk, one could argue that it is appropriate to enter into biobanking-like arrangements with hosts of websites from which data are scraped. Some emerging norms on big data research seem to point in this direction, though established practices that balance interests of web hosts and web scrapers are in their infancy ( Zook et al. , 2017 ). But there are also strong reasons not to go far in this direction. Much of the data collected through web scraping are not specifically clinical or necessarily related to health at all. As with our study, the web-scraped information may only have significance for health when combined with other datasets, and therefore, requiring biobanking-like arrangements with webhosts is likely to be inappropriate in the vast majority of cases. The potential downsides of stronger formal regulation of health research web scraping are increased study costs and slow processes, which could undercut potential research benefits. To complicate matters further, forging agreements with webhosts may sometimes be simply impracticable, such as when thousands of websites are involved or websites have poor governance structures.

Respect for Recruited Participants and Study Communities

As Emanuel et al. note, clinical researchers can have obligations to research participants after they have been recruited and provided informed consent, and may also have obligations to the communities from which those participants come. For individuals, these can include confidentiality protections, the right to withdraw, compensation for study-related harms, continued post-research medical services and dissemination of research results. These obligations do not apply to all research studies, and some are not easily applicable to big data health research, including web scraping. For example, our participants do not provide informed consent, and they do not know they are in a study, so there is no way for them to exercise a right to withdraw. At the front end, one could, for example, notify those entering jail that the information to be placed on jail websites about them will be used as part of a study, unless they opt out. The burden of this approach on researchers and jail administrators would be extremely high. On the back end, if our data informed a D2C approach, the person contacted would have the right to reject the offer of health services, but that would be refusal of care rather than study withdrawal. Compensation for psychological, social and other harms due to inappropriate disclosure of personal information is relevant to web scraping and big data research, though in the US compensation structures are generally more developed in the commercial sector than in the research context. Some, like Vayena et al. (2015 ), propose the establishment of monitoring boards to devise compensation schemes for digital epidemiological research and surveillance. Such protections might be thought excessive for web scraping research using publicly accessible online data, but researchers should have some sort of contingency plans for data breaches, particularly when web scraped data are combined with nonpublicly accessible information. Responsible dissemination of results from big data health research is complicated when individual consent is not involved and the research data were originally collected by other agencies (i.e., jails, courts, public health departments) for other purposes.

Engagement of communities in the research process has come to be regarded as scientifically and ethically desirable in many clinical and public health settings. In our study, meeting this requirement is challenging for two reasons. First, those who enter the North Carolina jails are research participants in the sense that at least some of them (in counties with jail websites) will thereby enter our database. While these individuals are the focus of the study and may be its potential future beneficiaries, the relationship between them and the researchers is much more remote than in much clinical research. The web scraping component of the research is indicative of this remoteness: it is an approach that does not involve interactions with the individuals from whom data are collected, but automated interactions between scripts developed by web scrapers and targeted websites. In addition, those individuals are not undergoing any sort of intervention, although the resultant database aims to inform future care efforts. Second, for studies like ours that rely on the internet and merge databases, it is unclear which community should be engaged. Persons who have been in jail are a very diverse group that may not share needs and priorities, and it may be very challenging to find individuals who could legitimately act as this group’s representatives.

Beyond Emanuel et al. : Additional Ethical Considerations for Public Health Applications of Web Scraping Research

Is the Emanuel framework sufficient as a tool to help ethically guide and evaluate web scraping in public health research? As Ballantyne writes, big data research reveals that the boundary between research ethics and public health ethics is more a matter of emphasis and orientation than a hard line between incompatible frameworks (Ballantyne, 2019). We therefore suggest that some considerations drawn from public health ethics, more specifically those offered by Childress et al. (2002) and Willison et al. (2014), should also be brought in where the goals, conduct and outcomes of such research overlap with those of surveillance activities and public health interventions.

In their classic text, Childress et al. propose five ‘justificatory conditions’ in cases where pursuit of a public health good (such as improved HIV care and viral suppression among the incarcerated) may permissibly override or impinge on other moral values (such as autonomy and confidentiality). Examining these conditions may help to further articulate the ethics of web scraping in public health research.

The most basic condition, effectiveness , much like ‘social value’ in the Emanuel et al. framework, refers to the possibility that the activities in question will help improve public health. The issue here is whether research to develop a jail/court/HIV database to be used in enhanced surveillance and D2C is likely to improve health outcomes for those who are living with HIV and are jailed. Establishing an enhanced surveillance database would not be ethically justified if it was clear that it would never be used for those purposes. Whether research contributes to actual effectiveness will only be known gradually, as changes are made to systems and their effects are empirically evaluated over time.

Proportionality refers to public health benefits outweighing infringed moral considerations, such as using the data of individuals without their consent. This speaks to the importance of examining all benefits of public health-related research along with its negative effects. An interesting question is whether our research should be ethically evaluated with regard to how its database may be used in surveillance and D2C, since the latter is not within the researcher’s control. However, if, for example, problems with the database could have an impact on persons—such as HIV-negative persons being more frequently contacted by public health services than they would otherwise—this seems to be a potential harm related to the translation of the research into practice. For translational research projects like this, researchers may have more obligation to work with public health authorities to anticipate and prevent such foreseeable harms.

Necessity refers in our case to whether a similarly useful database could have been created without, for example, bypassing informed consent. Is that approach necessary? It is currently difficult to see a viable alternative in our case, but it is important to critically reassess this question over time. Commentators have noticed a rise in rhetoric about informed consent being superfluous in big data research, reflecting a sense that such socially beneficial research is unobtrusive, that asking for consent would lead to consent bias, that the data are already public, and that people would not mind anyway (Richterich, 2018). It is important to resist this trend by remaining skeptical about claims that consent (or other ways of respecting autonomy) is unnecessary or unfeasible, and whenever possible, to design ways of pursuing public health goals compatible with individual autonomy and community agreement.

Least infringement , applied to our study, means that researchers should collect and disclose only the kind and amount of information necessary for the goal of empowering public health services to identify jailed persons living with HIV and re-engage them in care. Web scrapers should be guided by this consideration when they develop their scripts.
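In scripting terms, one simple way to honor least infringement is to whitelist the minimal set of fields the project needs and discard everything else at collection time. The sketch below is only an illustration under hypothetical field names and invented example data; the point is that the script itself, rather than a later cleaning step, enforces the restriction.

```python
# Only the fields needed for linkage and surveillance are retained; anything
# else published on the page (e.g. mugshot images) is never stored.
ALLOWED_FIELDS = {"name", "date_of_birth_or_age", "charge", "booking_date", "county"}


def minimize(raw_record):
    """Drop every scraped field that is not on the project's whitelist."""
    return {key: value for key, value in raw_record.items() if key in ALLOWED_FIELDS}


# Invented example record: the mugshot URL and bond amount are discarded.
scraped = {
    "name": "Jane Doe",
    "date_of_birth_or_age": "34",
    "charge": "(omitted)",
    "booking_date": "2020-02-01",
    "county": "Example County",
    "mugshot_url": "https://example-county.gov/img/12345.jpg",
    "bond_amount": "$5,000",
}
print(minimize(scraped))
```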

Public justification refers to the responsibility to explain and justify to the relevant parties why certain infringements (such as not obtaining informed consent) were made when collecting data for public health purposes. Sweeney et al. argue that public discussion, including solicitation of input from stakeholders, is crucial when planning and implementing a D2C approach using surveillance data. Ideally, the most affected population (persons living with HIV who have been jailed) should be part of such discussions. One could argue that public justification is more appropriately the responsibility of public health officials than of researchers. However, to the extent that research has informed the D2C approach, researchers too should explain and justify their contribution.

Finally, Willison et al. , in offering an ethical framework for public health evaluative projects, also mention social justice considerations. This dimension is not prominent in the Childress et al. justificatory conditions and is especially important in research (like ours) focused on those already vulnerable to structural injustice, inequalities in power and stigma. Research projects should consider the extent to which their aims, activities or results are likely to ameliorate, reinforce or worsen existing inequities, and what resources can and should be leveraged by researchers to mitigate disadvantages.

While there may not (yet) be an overarching ethical framework for guiding and evaluating empirical research that uses big data methods to inform public health practice, as we suggest above, we believe some of its fundamental ingredients can already be discerned in the existing research ethics and public health ethics literature. Like Willison et al. , we believe that ethics guidance for activities in the gray zone between research and public health practice requires building outward from fundamental research ethics principles to embrace considerations that include community and population interests.

Web scraping may seem like an innocuous activity as it involves the mechanical collection of publicly accessible data, much like the everyday task of searching for information via a search engine. However, when web scraping is used for public health research, there are ethical implications that may not be obvious at first sight. We provide an example of a study using web scraping to gather jail website data in order to develop an enhanced surveillance database that could contribute to a future D2C approach for persons living with HIV who have been incarcerated in jails. In examining our own study, we believe that the identified ethical concerns are not so great as to prohibit moving forward with the project. However, our ethical assessment will continue to evolve as the project advances. A key issue moving forward is to better understand the social value and costs of the D2C paradigm in the jail context. For example, the justification for not obtaining informed consent while linking personal data among jail, court and public health agencies as part of a D2C program depends on how successful researchers and public health authorities are in designing a system that meets the HIV-related health needs of people incarcerated in jails. Other open questions are whether HIV confidentiality can be protected during periods of incarceration and whether relevant communities can be adequately engaged to provide input regarding project activities. We will continue to monitor these issues and elicit input from relevant stakeholders as we continually assess the appropriateness of our project.

As Big Data health research continues to grow, it is likely that studies such as ours that combine web scraping, surveillance and initiatives to improve patient care will become more commonplace. Considerations from both research ethics and public health ethics will be relevant for their guidance and evaluation.

Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number R01AI129731. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This publication resulted (in part) from research supported by the University of North Carolina at Chapel Hill Center for AIDS Research (CFAR), an NIH funded program P30 AI050410.

Conflict of Interest

None declared.

References

  • Ballantyne A. (2019). Adjusting the Focus: A Public Health Ethics Approach to Data Research. Bioethics, 33, 357–366.
  • Barocas S., Nissenbaum H. (2014). Big Data’s End Run around Anonymity and Consent. In Lane J., Stodden V., Bender S., Nissenbaum H. (eds), Privacy, Big Data, and the Public Good. Cambridge: Cambridge University Press, pp. 44–75.
  • Centers for Disease Control and Prevention (2018). Understanding the HIV Care Continuum, available from: https://www.cdc.gov/hiv/pdf/library/factsheets/cdc-hiv-care-continuum.pdf [accessed 20 February 2020].
  • Centers for Disease Control and Prevention (n.d.). Data to Care, available from: https://www.cdc.gov/hiv/effective-interventions/respond/data-to-care/index.html [accessed 20 February 2020].
  • Childress J. F., Faden R. R., Gaare R. D., Gostin L. O., Kahn J., Bonnie R. J., Kass N. E., Mastroianni A. C., Moreno J. D., Nieburg P. (2002). Public Health Ethics: Mapping the Terrain. The Journal of Law, Medicine & Ethics, 30, 170–178.
  • Chiolero A. (2013). Big Data in Epidemiology: Too Big to Fail? Epidemiology, 24, 938–939.
  • Cohen M. S., Chen Y. Q., McCauley M., Gamble T., Hosseinipour M. C., Nagalingeswaran K., Hakim J. G., Kumwenda J., Grinsztejn B., Pilotto J. H. S., Godbole S. V., Mehendale S., Chariyalertsak S., Santos B. R., Mayer K. H., Hoffman I. F., Eshleman S. H., Piwowar-Manning E. M. T., Wang L., Makhema J., Mills L. A., de Bruyn G., Sanne I., Eron J., Gallant J., Havlir D., Swindells S., Ribaudo H., Elharrar V., Burns D., Taha T. E., Nielsen-Saines K., Celentano D., Essex M., Fleming T. R.; for the HPTN 052 Study Team (2011). Prevention of HIV-1 Infection with Early Antiretroviral Therapy. New England Journal of Medicine, 365, 493–505.
  • Densmore J. (2017). Ethics in Web Scraping, available from: https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01 [accessed 20 February 2020].
  • Emanuel E. J., Wendler D., Grady C. (2000). What Makes Clinical Research Ethical? The Journal of the American Medical Association, 283, 2701–2711.
  • Emanuel E. J., Wendler D., Killen J., Grady C. (2004). What Makes Clinical Research in Developing Countries Ethical? The Benchmarks of Ethical Research. The Journal of Infectious Diseases, 189, 930–937.
  • Goggin B. (2019). Inside Facebook’s Suicide Algorithm: Here’s How the Company Is Using Artificial Intelligence to Predict your Mental State from your Posts, available from: https://www.businessinsider.com/facebook-is-using-ai-to-try-to-predict-if-youre-suicidal-2018-12 [accessed 20 February 2020].
  • Gold Z., Latonero M. (2018). Robots Welcome? Ethical and Legal Considerations for Web Crawling and Scraping, available from: https://digitalcommons.law.uw.edu/wjlta/vol13/iss3/4/ [accessed 20 February 2020].
  • Gostin L. O. (2007). Biomedical Research Involving Prisoners: Ethical Values and Legal Regulation. The Journal of the American Medical Association, 297, 737–740.
  • Grady C. (2015). Enduring and Emerging Challenges of Informed Consent. New England Journal of Medicine, 372, 855–862.
  • HealthMap (2006). Computational Epidemiology Group, Informatics Program, Boston Children's Hospital, available from: https://www.healthmap.org/en/ [accessed 20 February 2020].
  • Hsu J. (2018). The Strava Heat Map and the End of Secrets, available from: https://www.wired.com/story/strava-heat-map-military-bases-fitness-trackers-privacy/ [accessed 20 February 2020].
  • Iroh P. A., Mayo H., Nijhawan A. E. (2015). The HIV Treatment Cascade before, during and after Incarceration: A Systematic Review and Data Synthesis. American Journal of Public Health, 105, e5–e16.
  • Krotov V., Silva L. (2018). Legality and Ethics of Web Scraping, available from: https://aisel.aisnet.org/amcis2018/DataScience/Presentations/17/ [accessed 20 February 2020].
  • Mitchell R. (2015). Web Scraping with Python: Collecting Data from the Modern Web. Sebastopol, CA: O’Reilly Media, Inc.
  • Mooney S. J., Westreich D. J., El-Sayed A. M. (2015). Epidemiology in the Era of Big Data. Epidemiology, 26, 390–394.
  • Mutenherwa F., Wassenaar D. R., Oliveira T. (2019). Ethical Issues Associated with HIV Phylogenetics in HIV Transmission Dynamics Research: A Review of the Literature Using the Emanuel Framework. Developing World Bioethics, 19, 25–35.
  • Nissenbaum H. (2009). Privacy in Context: Technology, Policy and the Integrity of Social Life. Stanford, CA: Stanford University Press.
  • Pappas S. (2016). How Big Is the Internet, Really?, available from: https://www.livescience.com/54094-how-big-is-the-internet.html [accessed 20 February 2020].
  • Richterich A. (2018). The Big Data Agenda: Data Ethics and Critical Data Studies. London: University of Westminster Press.
  • Spaulding A. C., Seals R. M., Page M. J., Brzozowski A. K., Rhodes W., Hammett T. M. (2009). HIV/AIDS among Inmates of and Releases from US Correctional Facilities, 2006: Declining Share of Epidemic but Persistent Public Health Opportunity. PLoS One, 4, e7558.
  • Sweeney P., Gardner L. I., Buchacz K., Garland P. M., Mugavero M. J., Bosshart J. T., Shouse R. L., Bertolli J. (2013). Shifting the Paradigm: Using HIV Surveillance Data as a Foundation for Improving HIV Care and Preventing HIV Infection. Milbank Quarterly, 91, 558–603.
  • Vayena E., Gasser U. (2016). Strictly Biomedical? Sketching the Ethics of the Big Data Ecosystem in Biomedicine. In Mittelstadt B., Floridi L. (eds), The Ethics of Biomedical Big Data. Cham, Switzerland: Springer International Publishing, pp. 17–39.
  • Vayena E., Salathe M., Madoff L. C., Brownstein J. S. (2015). Ethical Challenges of Big Data in Public Health. PLoS Computational Biology, 11, e1003904.
  • Wassenaar D., Mamotte N. (2012). Ethical Issues and Ethics Reviews in Social Science Research. In Leach M., Stevens M., Lindsay G., Ferrero A. and Korkut Y. (eds), Oxford Handbook of International Psychological Ethics. Oxford: Oxford University Press, pp. 268–282.
  • Willison D. J., Ondrusek N., Dawson A., Emerson C., Ferris L. E., Saginur R., Sampson H., Upshur R. (2014). What Makes Public Health Studies Ethical? Dissolving the Boundary between Research and Practice. BMC Medical Ethics, 15, 61.
  • Zook M., Barocas S., boyd d., Crawford K., Keller E., Gangadharan S. P., Goodman A., Hollander R., Koenig B. A., Metcalf J., Narayanan A., Nelson A., Pasquale F. (2017). Ten Simple Rules for Responsible Big Data Research. PLoS Computational Biology, 13, e1005399.


Web Scraping For Market Research

Market research is a crucial part of maintaining a competitive business. With web data extraction, market researchers can incorporate insightful data streams such as market trend analysis, pricing analysis, R&D, and competitor monitoring into their workflows to ensure their clients and businesses always have an edge in the market.

However, manually acquiring data for market research is a mundane, arduous task. Fortunately, we can now easily automate market research by using intelligently designed web crawlers.

In this guide, we’ll give you a sneak peek into five of the most common and impactful use cases that businesses use to produce actionable insights and keep a watchful eye over the health and competitiveness of their industry.

  • Market trend analysis
  • Price monitoring
  • Optimizing point of entry
  • Research & Development
  • Competitor monitoring

Market trend analysis

When performing a market trend analysis, web scraped data is perfectly suited to complement and enhance the productivity and accuracy of your research methods. Acquiring insights into any particular market scenario and the larger industry environment requires a great deal of data, and web scraping provides this data with a guaranteed level of reliability and accuracy.

Rigorous quality assurance standards are critical for this type of information, as subtle trends and market behaviors have been shown to be key indicators of market movement, and being a first mover unlocks tremendous opportunities for any organization.

Price monitoring

To make profitable pricing decisions, having access to a timely, reliable source of high-quality data is crucial. By scraping pricing data, market research teams are empowered to confidently advise their organizations and clients on how to best position products and services.

In an online world where prices change as rapidly as new products appear, automating a healthy stream of pricing data into your market research team is essential for ensuring you have an up-to-date, reliable benchmark against which you can either compete within the market or enter it. 

Optimizing point of entry

Where you position yourself and how you price your goods is as important as your product or service itself. By using web scraping to fuel market research into your industry and location, you can enter the market with confidence and a competitive upper hand.

With web scraping, a huge variety of relevant, essential information about a market can be aggregated extremely quickly, capable of fueling aggressive startup growth as well as new product launches into competitive industries.

Research and development

Precautions taken to ensure a healthy, thoughtful R&D cycle can dramatically reduce post-launch headaches. Whether your company produces enterprise software, video games, canned beverages, or electric cars, spending more time in R&D can be the difference between catastrophic failure and success.

Web scraped data is to R&D what buttresses are to ancient structures: it supports the entire process while guarding against missteps and disaster. Since the scope and capability of big data are driving an epochal change in research of every kind, using web scraping to generate data for R&D teams unlocks tremendous insight across every aspect of the cycle.

Competitor analysis

The traditional non-data-driven competitor monitoring processes employed by many businesses put them at real risk for disruption, creating blind spots and pain points for competitors to exploit. 

By integrating web scraped data into a systematic competitor monitoring process, market researchers provide businesses with a powerful advantage and can act quickly on competitor insights to maximize revenue, market share, and growth opportunities.

Investors, startups, and Fortune 500 businesses alike can utilize the sheer power of web scraped data to realize their fullest potential and endow their operations with the most effective, world-class solutions currently available.

As the internet continues to grow, the amount of data it generates grows with it, opening new opportunities for all types of organizations to improve their processes and make more informed decisions. Therefore, we firmly believe that now is the best time to act and that by incorporating such data streams into your organizational processes you can ensure your organization is disruption-proofed and fully prepared for the world of tomorrow.

Learn more about web scraping for market research data

Here at Zyte, we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients, ranging from government agencies and Fortune 100 companies to early-stage startups and individuals. During this time we have gained a tremendous amount of experience and expertise in web data extraction.

Here are some of our best resources if you want to deepen your web scraping knowledge:

  • Data use cases: Gain a competitive edge with web scraped product data
  • Data use cases: Fueling real estate's big data revolution with web scraping
  • Scrapy, Matplotlib, and MySQL: Real estate data analysis
  • The build in-house or outsource decision

What is Web Scraping and How to Use It

Suppose you want some information from a website. Let’s say a paragraph on Donald Trump! What do you do? Well, you can copy and paste the information from Wikipedia into your own file. But what if you want to get large amounts of information from a website as quickly as possible? Such as large amounts of data to train a machine learning algorithm? In such a situation, copying and pasting will not work! And that’s when you’ll need web scraping. Unlike the long and mind-numbing process of manually getting data, web scraping uses intelligent automation methods to get thousands or even millions of data points in a much shorter amount of time.

Table of Contents

  • What is Web Scraping?
  • How Web Scrapers Work
  • Types of Web Scrapers
  • Why is Python a popular programming language for Web Scraping?
  • What is Web Scraping used for?

If you are coming to a sticky end while trying to collect public data from websites, we have a solution for you. Smartproxy is a tool that offers a solution to deal with all the hurdles with a single tool. Their formula for scraping any website is: 40M+ pool of residential and data center proxies + powerful web scraper = Web Scraping API . This tool ensures that you get the needed data in raw HTML at a 100% success rate.

With Web Scraping API, you can collect real-time data from any city worldwide. You can rely on this tool even when scraping websites built with JavaScript and won’t face any hurdles. Additionally, Smartproxy offers four other scrapers to fit all your needs – enjoy eCommerce, SERP, Social Media Scraping APIs and a No-Code scraper that makes data gathering possible even for no-coders. Bring your data collection process to the next level from $50/month + VAT.

But before using Smartproxy or any other tool you must know what web scraping actually is and how it’s done. So let’s understand what Web scraping is in detail and how to use it to obtain data from other websites.

What is Web Scraping?

Web scraping is an automatic method of obtaining large amounts of data from websites. Most of this data is unstructured data in an HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are many different ways to perform web scraping to obtain data from websites, including using online services, particular APIs, or even writing your own web scraping code from scratch. Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format. This is the best option when it is available, but other sites either don’t allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In that situation, it’s best to use web scraping to collect the data.

How Web Scrapers Work

Web scraping requires two parts, namely the crawler and the scraper. The crawler is an automated program (often called a spider) that browses the web to find the particular data required by following links across the internet. The scraper, on the other hand, is a specific tool created to extract data from the website. The design of the scraper can vary greatly according to the complexity and scope of the project so that it can quickly and accurately extract the data.

Web Scrapers can extract all the data on particular sites or the specific data that a user wants . Ideally, it’s best if you specify the data you want so that the web scraper only extracts that data quickly. For example, you might want to scrape an Amazon page for the types of juicers available, but you might only want the data about the models of different juicers and not the customer reviews. 

So, when a web scraper needs to scrape a site, it is first given the URLs to visit. It then loads all the HTML code for those pages; a more advanced scraper may render the CSS and JavaScript as well. Next, the scraper pulls the required data out of the HTML and outputs it in the format the user specifies, most often an Excel spreadsheet or CSV file, though other formats such as JSON are also common.
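To make this workflow concrete, here is a minimal sketch in Python using the requests and Beautiful Soup libraries (both discussed below). The URL and the CSS selectors are placeholders that you would replace with the actual page and elements you want to scrape.

# Minimal scraping workflow: fetch HTML, extract fields, save to CSV.
# The URL and CSS selectors are placeholders, not a real target site.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"                 # placeholder URL
response = requests.get(url, timeout=10)             # step 1: load the HTML
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")   # step 2: parse the markup

rows = []
for item in soup.select("div.product"):              # step 3: extract only what you need
    name = item.select_one("h2.name")
    price = item.select_one("span.price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

with open("products.csv", "w", newline="", encoding="utf-8") as f:   # step 4: structured output
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

Running this against a real page simply requires swapping in that page's URL and the selectors that match its markup.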

Types of Web Scrapers

Web scrapers can be divided along several lines: self-built versus pre-built, browser extension versus standalone software, and cloud versus local.

You can build a self-built web scraper, but that requires advanced programming knowledge, and the more features you want, the more knowledge you need. Pre-built web scrapers, by contrast, are existing tools that you can download and run right away, and many come with advanced options you can customize.

Browser extension web scrapers are extensions you add to your browser. They are easy to run because they are integrated with the browser, but that integration also limits them: any advanced feature outside the scope of the browser simply cannot run. Software web scrapers do not have this limitation, since they are downloaded and installed on your computer. They are more complex than browser-based scrapers, but they offer advanced features that are not constrained by what a browser can do.

Cloud web scrapers run on an off-site server, usually provided by the company you buy the scraper from. Because the work happens remotely, your computer's resources stay free for other tasks. Local web scrapers, on the other hand, run on your own machine using local resources, so a scraper that demands a lot of CPU or RAM can slow your computer down and keep it from doing other work.

Why is Python a Popular Programming Language for Web Scraping?

Python seems to be in fashion these days! It is the most popular language for web scraping because it handles most of the steps involved with ease, and it has a range of libraries built specifically for the task. Scrapy is a very popular open-source web crawling framework written in Python; it is well suited both to web scraping and to extracting data through APIs. Beautiful Soup is another Python library widely used for web scraping: it builds a parse tree from a page's HTML, which you can then navigate, search, and modify to extract the data you need.
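As a rough illustration of the Scrapy approach, the sketch below defines a minimal spider. The start URL, CSS selectors, and field names are placeholders rather than a real site; once Scrapy is installed, the spider can be run with the command scrapy runspider example_spider.py -o output.csv.

# example_spider.py -- a minimal Scrapy spider (placeholder URL and selectors).
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/articles"]     # placeholder start page

    def parse(self, response):
        # Yield one item per article block found on the page.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").get(),
                "link": article.css("a::attr(href)").get(),
            }

        # Follow a "next page" link, if one exists, and parse it the same way.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Scrapy handles request scheduling, retries, and output formatting itself, which is what makes it attractive for larger crawls.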

What is Web Scraping Used for?

Web Scraping has multiple applications across various industries. Let’s check out some of these now!

1. Price Monitoring

Companies can scrape product data for their own products and for competing products to see how it affects their pricing strategies. They can then use this data to set optimal prices and maximize revenue.

2. Market Research

Web scraping can be used for market research by companies. High-quality web scraped data obtained in large volumes can be very helpful for companies in analyzing consumer trends and understanding which direction the company should move in the future. 

3. News Monitoring

Web scraping news sites can provide detailed reports on the current news to a company. This is even more essential for companies that are frequently in the news or that depend on daily news for their day-to-day functioning. After all, news reports can make or break a company in a single day!

4. Sentiment Analysis

If companies want to understand how consumers feel about their products, sentiment analysis is a must. They can use web scraping to collect posts about their products from social media sites such as Facebook and Twitter and gauge the general sentiment. This helps them build products people actually want and stay ahead of the competition.

5. Email Marketing

Companies can also use web scraping for email marketing, collecting email addresses from various sites and then sending bulk promotional and marketing emails to the people behind those addresses.


How Does Web Scraping For Research Purposes Affect The Market?


In the digital age, information is the key to success.

Online markets are becoming increasingly popular, making customer databases a precious asset.

Every piece of information on the World Wide Web can help companies tailor their products or target specific customers.

Understanding Web Scraping

Web scraping is the process of gathering data from various websites and storing it for future use.

It mainly involves collecting textual data, either manually or through automated means.

You can either scrape texts from websites yourself or hire web scraping companies.

These companies use automated software to extract content based on predefined criteria.

Web scraping tools also allow you to organize the collected data, making it easier to use later on.

For example, you can convert large texts into spreadsheets, which is highly beneficial for research purposes.

Organized data is much easier to process than raw text; price comparison data, for example, needs to be matched and arranged before it can be used.

Web Scraping Applications in Various Fields

Web scraping applications are versatile and extend across various fields.

They greatly impact the way businesses and organizations gather and analyze data.

Beyond real estate, weather forecasting, and market research, industries such as finance, healthcare, and e-commerce also make active use of web scraping.

The process may involve gathering user data stored on various websites and even scraping leads from multiple directories.

In finance, web scraping allows financial institutions to track market trends, gather data on competitors, and monitor customer sentiment.

Likewise, healthcare companies use web scraping to collect and analyze data on disease outbreaks, drug research, and patient feedback.

This ultimately leads to improvements in patient care and outcomes.

E-commerce businesses benefit from web scraping by obtaining data on product pricing, customer reviews, and competitor strategies.

This data helps them enhance their marketing and pricing models.

In the field of travel and tourism, companies actively employ web scraping to gather information on flight prices, hotel rates, and tourist attractions.

This data enables them to develop competitive packages and marketing strategies.

Companies in every sector use data from websites to improve their services and target specific customer segments.

Social Media Scraping for Marketplace Research

Social media web scraping plays a vital role in market research.

Analysts gather data from social media sites like Twitter and Facebook on specific topics.

They then use this information to create research reports.

Social media platforms provide a wealth of textual data.

With a single web scraping tool, you can access user information on relevant topics from around the world.

Besides social media research, many research agencies need web scraping tools to collect and process large amounts of data on their research subjects.

These subjects may include real estate, procurement, or human resources data.

Web scraping tools can greatly benefit researchers working on high-quality projects that require substantial data.

Navigating the Challenges of Market Data Scraping

Navigating the challenges of market data scraping not only involves ensuring data accuracy but also dealing with technical obstacles and resource constraints.

Websites may employ anti-scraping measures, such as CAPTCHAs or IP blocking, to prevent automated data extraction.

Researchers need to adapt their strategies to bypass these roadblocks while still respecting ethical boundaries.
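One practical approach, sketched below in Python, is to identify your scraper honestly, check the site's robots.txt file, and pace requests rather than disguise them.

The base URL, paths, and contact address in this sketch are placeholders.

# Polite request loop: honest User-Agent, robots.txt check, delay between requests.
# The URLs and contact address below are placeholders.
import time
from urllib import robotparser

import requests

BASE = "https://example.com"
HEADERS = {"User-Agent": "research-scraper/0.1 (contact: you@university.edu)"}

robots = robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

paths = ["/page1", "/page2", "/page3"]          # placeholder pages to fetch
for path in paths:
    url = BASE + path
    if not robots.can_fetch(HEADERS["User-Agent"], url):
        print("Skipping (disallowed by robots.txt):", url)
        continue
    response = requests.get(url, headers=HEADERS, timeout=10)
    print(url, response.status_code, len(response.text))
    time.sleep(2)                                # pause so the server is not overloaded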

Managing large volumes of collected data is another challenge that can consume time and resources.

Implementing database systems and utilizing cloud storage are effective data management strategies.

These strategies help researchers organize, store, and analyze the data more efficiently.
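As a small illustration of the database idea, the snippet below stores scraped records in a local SQLite file using only Python's standard library.

The table layout and sample rows are assumptions made for the example.

# Store scraped records in a local SQLite database (standard library only).
import sqlite3

rows = [
    ("Example product A", "19.99", "2024-01-01"),   # illustrative sample rows
    ("Example product B", "24.50", "2024-01-01"),
]

conn = sqlite3.connect("scraped_data.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS products (
           name TEXT,
           price TEXT,
           scraped_on TEXT
       )"""
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
conn.commit()

# Read the stored data back out for analysis.
for name, price, scraped_on in conn.execute("SELECT * FROM products"):
    print(name, price, scraped_on)
conn.close()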

Maintaining the quality of scraped data is crucial to producing reliable results.

This may require preprocessing techniques like data cleansing and transformation to identify and correct errors or inconsistencies in the extracted data.
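A typical cleansing pass might look like the sketch below, which assumes a pandas DataFrame of scraped rows with stray whitespace, duplicates, and prices stored as text.

The column names and values are placeholders for the example.

# Basic cleansing of scraped rows with pandas: trim whitespace, drop
# duplicates, and convert text prices to numbers. Data here is illustrative.
import pandas as pd

df = pd.DataFrame({
    "name": [" Juicer X ", "Juicer X", "Juicer Y", None],
    "price": ["$19.99", "$19.99", "24.50", "n/a"],
})

df["name"] = df["name"].str.strip()                  # remove stray whitespace
df = df.dropna(subset=["name"]).drop_duplicates()    # drop empty and duplicate rows

# Strip currency symbols and coerce unparseable values to NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

print(df)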

Furthermore, staying current with the latest web scraping technologies and tools is crucial for researchers.

It helps them optimize their data extraction processes and stay ahead of the competition.

Ultimately, collaboration and knowledge-sharing among researchers can greatly enhance the overall success of market data scraping projects.

By learning from each other’s experiences, researchers can refine their methodologies and address common challenges.

This collaborative approach helps ensure the data collected is of the highest quality.

This, in turn, contributes to more accurate market insights and data-driven decision-making in various industries.

This is especially true if you are looking for digital marketing data and strategies.

Choosing the Right Web Scraping Tool for Research

When starting a web scraping project, the first step is to identify the best web scraping tool for your research.

There are many options available, each with its own advantages and disadvantages.

From manual scraping plugins to purpose-built libraries, the variety of web scraping tools is extensive.

Here are some commonly used ones:

Browser plugin tools

The simplest type of web scraping tool is the browser plugin.

These are easy to install on your internet browser and allow you to search for specific texts on websites.

They are usually manual and require you to select the text content to store.

Plugin tools are suitable for small-scale research projects that need precise data on a particular topic.

Their simplicity and the control they give you over the collected data make them a popular choice.

Web crawling programs

Manual plugins are great for small projects, but they don't scale well to larger ones.

Developers can create web crawling programs using various programming languages.

These programs enable them to sift through massive amounts of text data and identify parts that meet predefined criteria.

They are ideal for larger research projects requiring data from multiple sources.

Once set up, web crawling programs run automatically with minimal manual intervention.

Numerous tutorials are available to help you create your own web crawling program.
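For orientation, here is a stripped-down version of the kind of crawler those tutorials walk you through: it starts from one page, follows links within the same site, and stops after a fixed number of pages.

The start URL and page limit are placeholders, and a real project would add politeness delays and better error handling.

# A minimal same-site crawler: fetch a page, collect its links, keep going
# until a page limit is reached. The start URL is a placeholder.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start_url = "https://example.com/"
max_pages = 10

domain = urlparse(start_url).netloc
to_visit = [start_url]
visited = set()

while to_visit and len(visited) < max_pages:
    url = to_visit.pop(0)
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    visited.add(url)
    print("Crawled:", url)

    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a", href=True):
        absolute = urljoin(url, link["href"])
        # Stay on the same site and skip pages already seen.
        if urlparse(absolute).netloc == domain and absolute not in visited:
            to_visit.append(absolute)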

Desktop applications

Desktop applications work much like web crawling programs but are more user-friendly for those without programming knowledge.

These are polished and compiled web crawling programs that are easy to use, even for non-experts.

APIs or Application Programming Interfaces

APIs enable you to interact with the data stored on specific websites.

While there are many general-purpose APIs, larger websites like Google and Amazon offer their own APIs.

These APIs enable users to collect and process data specifically from their platforms.
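As a generic illustration only, the request below targets a made-up endpoint rather than any real provider's API; real APIs each document their own endpoints, authentication, and rate limits.

# Requesting structured data from an API instead of scraping HTML.
# The endpoint, parameters, and token below are entirely fictional.
import requests

response = requests.get(
    "https://api.example.com/v1/posts",                # placeholder endpoint
    params={"topic": "web scraping", "limit": 5},
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder token
    timeout=10,
)
response.raise_for_status()

# APIs typically return JSON, which maps directly onto Python structures.
for post in response.json().get("results", []):
    print(post.get("title"), "-", post.get("url"))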

In Conclusion

Web scraping has significantly impacted how industries gather and examine data.

This impact has led to improved understanding and informed decision-making across various sectors.

While web scraping offers many benefits, it’s essential to address ethical concerns and maintain the privacy of individuals and organizations.

When choosing a web scraping tool, make sure it meets your needs while also following ethical guidelines and best practices.

This balance ensures that web scraping remains a valuable resource without crossing any boundaries.


Sandro Shubladze

