There’s a lot of crap out there when you read about Artificial Intelligence projects, especially on LinkedIn, where I suspect half of the posts may have been created by AI bots. However, we recently implemented a process that included the use of an LLM, but in a very controlled fashion. The overall implementation process was pretty interesting, and I wanted to talk about some of the decisions I made, and why.
I obviously can’t share all the details of my current project, but at a high level it’s a custom application we are developing. One of the biggest challenges my team has faced is acquiring data to support the application. We tried to engage with several data vendors, and when we finally landed with one, we weren’t very happy with the quality and depth of the data they provided. I can say we were seeking information about companies in specific sectors. The obvious answer here was web scraping. If you’ve ever written code to scrape websites, you know that because there is no common standard, and sites are built with a wide array of frameworks, languages and formats, it’s just a mess.
One day during our standup, my product manager/data engineer suggested that instead of traditional web scraping, we capture images of web pages and feed them to an optical character recognition (OCR) model. This immediately piqued my interest; he had tried it as a one-off, and it seemed effective. That led me to test it against about 100 sites. I wrote some Python code on my machine using a package called Playwright to grab the home page and About Us page of each of my targets, and I initially ran the images through Azure Computer Vision, because I have a free account through my MVP award.
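If you want a feel for that first experiment, here is a minimal sketch of the idea: capture a full-page screenshot with Playwright and push it through the Azure Computer Vision Read API. The endpoint, key and file paths are placeholders, and the exact SDK calls may vary a little depending on which library version you have installed.

```python
import time
from playwright.sync_api import sync_playwright
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Placeholder credentials -- substitute your own Azure Computer Vision resource.
AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
AZURE_KEY = "<your-key>"

def screenshot(url: str, path: str) -> None:
    """Capture a full-page screenshot of a URL with headless Chromium."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=path, full_page=True)
        browser.close()

def ocr_image(path: str) -> str:
    """Send the screenshot to the Azure Computer Vision Read API and return the text."""
    client = ComputerVisionClient(AZURE_ENDPOINT, CognitiveServicesCredentials(AZURE_KEY))
    with open(path, "rb") as image:
        response = client.read_in_stream(image, raw=True)
    # The Read API is asynchronous: poll the operation until it finishes.
    operation_id = response.headers["Operation-Location"].split("/")[-1]
    while True:
        result = client.get_read_result(operation_id)
        if result.status not in ("notStarted", "running"):
            break
        time.sleep(1)
    lines = []
    if result.status == "succeeded":
        for page_result in result.analyze_result.read_results:
            lines.extend(line.text for line in page_result.lines)
    return "\n".join(lines)

if __name__ == "__main__":
    screenshot("https://example.com", "home.png")
    print(ocr_image("home.png"))
```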
I looked at the output and it was reasonably good: I had a CSV file with a domain name (which effectively acts as our primary key) and a long text description of the company and what they did. My plan was to feed this to an LLM, have it summarize what the company did, and pull out some other specific data features we were looking for. I first tried the Azure Document Services summary tool, and the results were pretty poor. I then used Azure AI Foundry to try one of the OpenAI models to see if I would get better results. I got a lovely summary, and my other data features were extracted as I expected. Now that I could see this working, I had to make it work in a production environment.
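The call itself is small. Here is a sketch of what it looks like against an OpenAI model deployed through Azure AI Foundry, using the AzureOpenAI client from the openai package; the endpoint, key, API version, deployment name and prompt are all placeholders rather than my production values.

```python
from openai import AzureOpenAI

# Placeholder endpoint, key and deployment for an Azure AI Foundry / Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-key>",
    api_version="2024-06-01",
)

SYSTEM_PROMPT = (
    "You summarize company websites. Use only the text provided by the user. "
    "Return JSON with keys: summary, industry, products."
)

def summarize(ocr_text: str) -> str:
    """Ask the model to summarize the OCR'd site text and extract a few features."""
    response = client.chat.completions.create(
        model="<your-deployment-name>",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
    )
    return response.choices[0].message.content
```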
I quickly threw together a script to scrape 200k websites, and I decided to get smart and split the load across 8 nodes. But I cheated: I just split the input file into 8 parts and ran the Python script on each node to do the screen scraping. I knew this was a bad idea, but it was late on a Friday and I wanted to get this going. Predictably, all of my worker nodes died over the weekend, and I had to start over from square one.
I’ve been working in Linux for a very long time, so the next part of this process was fairly easy for me, but I still learned a few new things. I used a package called Supervisor to keep the worker processes running, which let me build a proper cluster. I wrote some additional code so I could easily add nodes, and switched to pulling URLs off an AWS Simple Queue Service (SQS) queue. This gave me resumability and scale, and because the queue maintained state, a node being rebooted didn’t impact my workload. In fact, I added an additional script running as a service on my controller nodes that checked for unavailable worker nodes and simply rebooted them. I ultimately scaled this scraping cluster to 36 nodes, and we completed the process in about 2.5 days.
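Each worker under Supervisor can be as simple as a long-polling loop like the one below. This is a sketch, assuming a standard boto3 SQS consumer; the queue URL, region, message format and the process() function are placeholders, not my actual production code.

```python
import json
import boto3

# Placeholder queue URL -- every worker long-polls the same queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<account-id>/scrape-targets"
sqs = boto3.client("sqs", region_name="us-east-1")

def process(url: str) -> None:
    """Do this worker's job for one URL (scrape, OCR, or summarize)."""
    ...

def main() -> None:
    while True:
        # Long poll so idle workers don't hammer the SQS API.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            process(body["url"])
            # Delete only after successful processing; if a node dies mid-task,
            # the message becomes visible again and another worker picks it up.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    main()
```

Supervisor’s only job here is to start this script at boot and restart it if it crashes, which is why a rebooted node rejoins the cluster with no manual intervention.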

I was able to reuse the same cluster to perform the OCR and summarization tasks. Both were much less time-consuming than the scraping process, and I was able to get away with using eight nodes for each of them. The same basic idea applied: publish the data into the queue and let the worker nodes operate on it.
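The producer side is equally small. Something along these lines pushes the work items onto the queue in batches of ten, which is the SQS batch limit; again, the queue URL and input file are placeholders.

```python
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<account-id>/scrape-targets"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")

def publish(domains_file: str) -> None:
    """Read one domain per line and enqueue them in batches of 10."""
    with open(domains_file) as f:
        domains = [line.strip() for line in f if line.strip()]
    for i in range(0, len(domains), 10):
        batch = domains[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"url": f"https://{d}"})}
                for n, d in enumerate(batch)
            ],
        )

if __name__ == "__main__":
    publish("domains.txt")
```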
The summarization process is important to us: we wanted high-quality data and to avoid the risk of hallucinations that LLMs can have. I did a couple of things to reduce that risk. I dropped the temperature parameter to 0.1, greatly reducing the creativity of the model, and I carefully crafted a system prompt instructing the model to only use its input text to create a summary of the site. I ended up using one of the Amazon Nova models; you don’t need a big cutting-edge model to summarize text and extract features, which means the inference costs were extremely low.
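To illustrate the shape of those guardrails, here is a sketch of a low-temperature call to a Nova model through Amazon Bedrock’s Converse API. The model ID, region and prompt wording are placeholders, not the exact ones we run in production.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SYSTEM_PROMPT = (
    "Summarize the company described in the user's text. "
    "Use ONLY the provided text; if something is not in the text, say it is unknown."
)

def summarize(ocr_text: str) -> str:
    """Low-temperature summarization with a restrictive system prompt."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # placeholder; any Nova model ID works here
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": ocr_text}]}],
        inferenceConfig={"temperature": 0.1},
    )
    return response["output"]["message"]["content"][0]["text"]
```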
AI tools are best used when we tightly control the input data and put tight guardrails around the process. In this post, I wanted to demonstrate how you can take advantage of the benefits of an LLM at a low cost, and walk you through how I scaled the process into something production-ready.







