The New Era of Web Scraping: Does Zyte’s AI Scraping Change the Game?

Bartosz Sekiewicz
5 min read · Mar 5, 2024


In the fast-evolving tech world, the web scraping market continually offers new tools and solutions to simplify and automate data collection from the internet. One of the recent innovations catching attention is Zyte’s AI Scraping, a comprehensive, AI-powered solution that significantly simplifies the scraping process: you can start gathering data just by entering a website address.

Photo by Midjourney

Having completed my initial testing, I’d like to share some insights.

It’s a tool mainly for developers, not end users:
We have grown accustomed to the idea that AI-powered tools are highly advanced solutions capable of skipping numerous specialist stages in a process. Although Zyte’s AI Scraping appears intuitive, it proves most effective in the hands of those with web scraping experience and knowledge of tools like Scrapy. Websites are complex and full of challenges, and without the proper technical knowledge it is easy to end up with unreliable data (even with AI in the backend). Understanding a website’s specifics is crucial to leveraging AI Scraping’s capabilities effectively.

Time savings:

(…) to build and launch spiders, unblock websites, and extract data from a single UI three times faster than traditional scraping vendors and proxy APIs.

This advantage is most pronounced when the existing workflow is less mature and demands significant manual effort each time. Traditional scraping requires numerous iterations of website analysis to compile a complete and precise dataset. AI Scraping, by contrast, delivers a quick snapshot of the website’s content, making it easier to gauge the scope of work and, in some cases, to collect the required data immediately without developing the tool any further.
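In practice, that quick snapshot boils down to a single call to Zyte’s extraction API. A minimal sketch using only the standard library — the product URL is a placeholder, a real API key is required, and `"product": true` asks the API for AI-extracted product fields:

```python
import base64
import json
import urllib.request

ZYTE_API_KEY = "YOUR_API_KEY"  # placeholder; a real key is required


def build_payload(url):
    """Request body asking the Zyte API for AI-extracted product data."""
    return {"url": url, "product": True}


def fetch_product(url):
    """POST to Zyte's extraction endpoint and return the 'product' object."""
    # Zyte API uses HTTP Basic auth: the API key as username, empty password.
    token = base64.b64encode(f"{ZYTE_API_KEY}:".encode()).decode()
    req = urllib.request.Request(
        "https://api.zyte.com/v1/extract",
        data=json.dumps(build_payload(url)).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read()).get("product")
```

Calling `fetch_product("https://example.com/some-product")` with a valid key would return a dictionary of extracted fields (name, price, and so on) — no selectors written, which is exactly where the time savings come from.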

Broad coverage, but …
Undoubtedly, the tool can handle a vast array of website types. However, it’s essential to acknowledge the sheer diversity of technologies and methods used in website creation and data processing. Additionally, the continuous evolution of anti-scraping solutions adds another layer of complexity. The more sophisticated the security measures, the higher the scraping costs — embarking on data collection without reflection could be expensive.

Data quality:
There’s a prevalent misconception among less technical individuals that tasks performed by machines should attain 100% accuracy. However, this is an unrealistic expectation for machine learning algorithms. It’s essential to understand that the accuracy of the values returned by these tools can, at best, reach 99.9%, which inherently carries the risk of inaccuracies. Consequently, while automation conserves development time and resources, a portion of these savings must be judiciously allocated to rigorous data validation efforts.

On one hand, minor alterations to a website can completely break classic scraping scripts, necessitating a multi-stage data collection strategy. On the other, the same changes can mislead the extraction algorithms, producing incorrect outputs in a small but significant fraction of cases.
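Given that residual error rate, even a thin validation layer pays for itself. A sketch of record-level checks — the field names and plausibility thresholds here are illustrative assumptions, not any particular tool’s schema:

```python
# Basic sanity checks for scraped product records (illustrative fields).

def validate_record(record):
    """Return a list of issues found in a single scraped record."""
    issues = []
    if not record.get("name"):
        issues.append("missing name")
    price = record.get("price")
    if price is None:
        issues.append("missing price")
    elif not 0 < price < 10_000:  # implausible price range for this domain
        issues.append(f"suspicious price: {price}")
    if record.get("currency") not in {"USD", "EUR", "PLN"}:
        issues.append("unexpected currency")
    return issues


def validation_pass_rate(records):
    """Fraction of records that pass every check."""
    flagged = sum(1 for r in records if validate_record(r))
    return 1 - flagged / len(records)
```

Tracking the pass rate over time is a cheap early-warning signal: a sudden drop usually means the target site changed before anyone noticed.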

Striking the Optimal Balance:
The choice of a scraping method should emerge from a meticulous evaluation of the individual scenario and the business scale. Zyte’s AI Scraping, with its automation capabilities, offers a smoother pathway for managing projects at scale, enhancing the system’s adaptability to frequent changes across websites. While often incurring higher costs, these tools significantly reduce manual labour, which is especially beneficial when updates alter the appearance of target sites.

Conversely, developing bespoke scraping scripts within a carefully planned infrastructure may demand higher upfront investments yet lead to negligible recurrent expenses post-deployment. The primary concern with this approach is the potential for process disruption caused by modifications of websites. This risk intensifies when managing numerous such scripts, highlighting the importance of balancing initial costs against long-term efficiency and scalability in the broader context of business needs and operational scale.

Data Marketplaces as alternatives: If someone is scraping this data, why should I?

Before embarking on data scraping, it’s prudent to consider whether someone else has already undertaken this effort. Several initiatives, such as DataBoutique, are designed to streamline these processes, offering significant advantages. Notably, they reduce the load on the scraped website and, more importantly, decrease data costs.

For several months, I’ve managed my own “data store,” which currently features data from over 30 sources (WebDataWatch). A significant focus is placed on the quality and completeness of data. Drawing from my experience, below is a general cost comparison for scraping data from a retailer, viewed from different perspectives:

Website: H&M (across 25 regions)

  • Standard Scraping Approach: Traditionally, crafting, testing, and deploying a scraping tool requires 30 hours. A one-time data retrieval from 25 regions takes approximately 2 hours, translating to roughly $800 for the initial comprehensive download and $50 for each subsequent retrieval at $25 per hour.
  • Using Zyte’s AI Scraping: Assuming the extraction template is valid (which itself takes time to set up), retrieving a single region’s data requires at least 130k requests, costing a minimum of $70 with the Zyte API. That works out to at least $1750 for the first and each subsequent data collection.
  • Purchasing from a Data Marketplace: The cost ranges from $20 to $40 per dataset, resulting in $500 to $1000 for the initial and each subsequent data retrieval.
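The figures above can be reproduced with some back-of-envelope arithmetic — all rates and request counts are taken from the list and are rough estimates, not quotes:

```python
# Back-of-envelope reproduction of the H&M cost comparison above.

HOURLY_RATE = 25  # $/hour for development and retrieval work
REGIONS = 25

# Standard scraping: 30 h to build the tool + 2 h per full retrieval.
build_hours, retrieval_hours = 30, 2
standard_initial = (build_hours + retrieval_hours) * HOURLY_RATE  # $800
standard_repeat = retrieval_hours * HOURLY_RATE                   # $50

# Zyte AI Scraping: >=130k requests per region, >=$70 per region,
# paid again for every collection.
zyte_per_region = 70
zyte_per_collection = zyte_per_region * REGIONS                   # $1750

# Marketplace: $20-$40 per regional dataset.
marketplace_low = 20 * REGIONS                                    # $500
marketplace_high = 40 * REGIONS                                   # $1000

print(standard_initial, standard_repeat, zyte_per_collection,
      marketplace_low, marketplace_high)
```

The crossover is visible immediately: the bespoke script is cheapest per repeat run but carries the upfront build cost and maintenance risk, while the other two options charge roughly the same every time.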

In the first two scenarios, additional responsibilities include ensuring sufficient computing power and implementing a robust validation system — requirements not needed in the third scenario, which showcases the efficiency and cost-effectiveness of purchasing data directly from marketplaces.

Zyte’s AI Scraping stands out as an impactful tool that streamlines the web scraping process, ideally suited for sizable projects and developers with a deep understanding of web scraping techniques. Yet, it’s vital to meticulously assess the unique aspects of each project, such as its scale, the regularity of website updates, and budget constraints. Alternatives, such as data marketplaces, present economical options that are particularly appealing when faced with the high upfront costs of scraping or the potential for website modifications to disrupt the process. Therefore, making an informed decision requires a comprehensive evaluation of these considerations to enhance operational efficiency while effectively managing risks and expenses.


Bartosz Sekiewicz

I am an experienced data scientist with expertise in web scraping. I specialize in managing analytical processes.