How to Run an AI Scraping Hub for Under $10 a Month
Running a web scraping service usually requires a substantial investment in infrastructure and computing power. However, with the right tech stack and a little ingenuity, you can create a robust AI scraping hub for under $10 a month. This post will guide you through the setup and use of an affordable scraping hub, based on my project on Github.
If you're curious about why I chose to use Cloudflare Workers for scraping and want to delve into the code specifics and the restrictions, I recommend checking out my previous article on Scraping with Cloudflare Workers and OpenAI. In that article, I thoroughly discuss the rationale and advantages/disadvantages of this particular setup.
This scraping hub project leverages a few key technologies: a Node.js Scheduler Service, two Cloudflare Workers, and a combination of Neon PostgreSQL, Redis via Upstash, and Render.com for deployment. Here's a quick rundown of how each component fits in:
Node.js Scheduler Service: This application polls the Neon PostgreSQL database every 15 minutes, retrieving the latest cron schedule from the
scraping_jobtable. The service is deployed on Render.com.
Cloudflare Worker 1 - Scraper: As the workhorse of the hub, this worker is responsible for executing the web scraping tasks.
Cloudflare Worker 2 - Apollo GraphQL Server: This worker provides a GraphQL API that allows clients to create, read, and update scraping cron jobs.
Neon PostgreSQL: A serverless PostgreSQL database where scraping job schedules and historical job results are stored.
Redis with Upstash.com: Used for caching the scraped data.
Here is a simplified diagram of how these services interact:
Cloudflare Worker 1 (Scraper) <-------- Node.js Scheduler Service -------> Neon PostgreSQL
^ ^ |
| | v
| | Redis (Cache)
Cloudflare Worker 2 (GraphQL Server) <----|
Initially, I considered using Rails with Sidekiq to create the UI dashboard and scheduling service. However, deploying Rails and Sidekiq turned out to be more complicated than anticipated, making the solution unattractive.
I also tried using Bun as a task scheduler, but it wasn't compatible with Neon Postgres or Node-Cron. Therefore, I settled on a simple Node Express server with a very basic dashboard for visualizing the jobs, their schedules, and the results of the last successful scrape.
Setting Up Your Scraping Hub
Here's a preview of what the scheduler dashboard looks like:
Bun Scheduler Dashboard
Uptime: 277 seconds
Job ID URL Selector Description Latest Content Cron Schedule Cron Schedule Described
fcb4fb7e-f20c-4486-b0c7-568d948257e4 https://www.somecreditunion.org/loans-cards/mortgage/mortgage-rates/conventional-fixed-rate-mortgages.html table, thead, tbody, tr, th, td There is a Rates tables which has the headers: 'Term' 'Interest Rates As Low As' 'Discount Points' 'APR As Low As' please fetch them in the following format: 'term' 'rate' 'discount_points' and 'apr'. The key for the json array should be 'rates' ❌ 0 7 * * * At 07:00 AM US Eastern
The scheduler-service application can be spun up using Docker Compose. In the root directory of the project, you will find a
docker-compose.yml file. Run the command
docker-compose up in your terminal to start the service, and ensure Docker is installed on your machine.
More specific instructions on how to use the individual Cloudflare workers are available in their respective README.md files in the worker directories. You can use the
wrangler CLI tool to deploy the workers to your Cloudflare account.
You might be surprised to learn that this entire setup, with its multiple services and advanced capabilities, costs less than $10 per month. Here's the cost breakdown:
|Neon Serverless PostgreSQL
Yes, you read that right. For the price of a fancy coffee, you can have a powerful web scraping hub running round the clock. Coupled with OpenAI's reasonable costs, this solution remains well below the $10 a month mark. Get that data.
If you're in need of a simple and cost-effective web scraping solution, this project might be the right fit for you. Check out the Github repository to get started. Embrace the world of affordable, efficient, and powerful web scraping, all for under $10 a month. Happy scraping!
I'd love to hear your thoughts on LinkedIn.