Scraping with Cloudflare Workers and OpenAI

What is a Cloudflare Worker? Why would I want to use one?

Cloudflare Workers is a powerful service that runs on Cloudflare's edge runtime. It's very similar to something like AWS Lambda, but it's much more lightweight and has a much faster cold start time. It's also considerably cheaper.

| Plan Type | Requests | Duration |
| --- | --- | --- |
| Free Plan | 100,000 / day | 10 ms CPU time / invocation |
| Paid Plan - Bundled | 10 million / month, +$0.50/million | 50 ms CPU time / invocation |
| Paid Plan - Unbound | 1 million / month, +$0.15/million | 400,000 GB-s, +$12.50/million GB-s |

Cloudflare Workers Pricing

You only get charged for the number of requests and the duration of each request; you don't get charged for memory usage or bandwidth. CPU time is the important part here. You do NOT pay for time spent waiting on fetch API requests, only for time spent actually processing the request. In my example today, while we're waiting for the scrape to resolve and while we're waiting for OpenAI to respond, we aren't being charged a penny. That's awesome!
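As a rough sketch of what that means in practice (the URL and function name here are just placeholders):

// Waiting on network I/O does not count toward CPU time on Workers.
async function roughBillingSketch() {
  const res = await fetch('https://example.com/page-to-scrape'); // idle wait: not billed as CPU time
  const html = await res.text();                                 // mostly network wait while the body streams in
  return html.split(/\s+/).length;                               // actual processing: this is what CPU time measures
}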

Watch the video "That's It, I'm Done With Serverless." from Theo - t3.gg on YouTube for a great explanation of Edge vs Serverless. tl;dr: Edge is faster and cheaper, but much more limited.

One of the limitations of the Cloudflare edge runtime is that you can't bring in just any old dependency from NPM. Cloudflare maintains a "Works on Workers" list you can check if you want to bring in an external dependency.

Unfortunately, you can't use typical Node scraping packages like cheerio or puppeteer on Cloudflare Workers. However, you can use plain old fetch, the same one built into the browser and Cloudflare's edge runtime.

I followed an example project from Cloudflare's website called Web Scraper. It was written by Adam Schwartz, and you can find the GitHub repo here.

This got me pretty far; its output was a JSON object with the scraped text data. I wanted to take it a step further and use OpenAI's API to generate a JSON Array from the scraped response, one that can be trusted and is ready to be inserted into a Redis cache or consumed by a frontend client.

How I'm scraping with a Cloudflare Worker

Here's what I did:

  • Created a new Cloudflare Worker using the wrangler cli
wrangler init scraper
  • Added my env variables to the wrangler.toml file
name = "scraper"
main = "src/worker.ts"
compatibility_date = "2023-05-25"

[vars]
OPEN_AI_API_KEY = "your-open-ai-api-key"
CORRECT_API_KEY = "your-correct-api-key"

I added a "CORRECT_API_KEY", although in hindsight it would probably be better named SECRET_KEY or something similar. This is just a simple way to add a layer of security to the endpoint.
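One note on those [vars]: they live in plain text in wrangler.toml. If you'd rather keep the keys out of source control, Wrangler can store them as encrypted secrets instead. They still show up on env the same way, so the Worker code below doesn't change:

wrangler secret put OPEN_AI_API_KEY
wrangler secret put CORRECT_API_KEY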

  • Started writing code
// The Scraper class comes from the Web Scraper example project linked above.
// Adjust this import to match where you place that file in your project.
import Scraper from './scraper';

// Bindings declared under [vars] in wrangler.toml
export interface Env {
  OPEN_AI_API_KEY: string;
  CORRECT_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // We only want to accept a POST request.
    if (request.method !== 'POST') {
      return new Response('Method not allowed', { status: 405 });
    }

    // Verify the request has the CORRECT_API_KEY, so if someone tries to call your endpoint and they don't know your secret they will get a 401 Unauthorized response
    const authHeader = request.headers.get('Authorization');
    if (!authHeader || authHeader !== `Bearer ${env.CORRECT_API_KEY}`) {
      return new Response('Unauthorized', { status: 401 });
    }

    // TS interface for the POST body. We want an href, description and selector
    interface FetchData {
      href: string;
      description: string;
      selector: string;
    }

    // Validate all of the required fields are present in the POST body
    const body: FetchData = (await request.json()) as FetchData;
    if (!body.href || !body.description || !body.selector) {
      return new Response('Bad request. Please include an href, selector, and description in the POST body.', { status: 400 });
    }

    // Check that the provided href is reachable before we try to scrape it
    const response = await fetch(body.href);
    if (!response.ok) {
      return new Response(`Failed to fetch ${body.href}`, { status: 404 });
    }

    // Initialize a new scraper and have it fetch the page we want to scrape
    const scraper = new Scraper();
    await scraper.fetch(body.href);

    // Use the scraper to get the text content of the specified elements
    const textContent = await scraper.querySelector(body.selector).getText({ spaced: false });

    // Serialize the text content to JSON
    const jsonData = JSON.stringify(textContent);

    try {
      // Send the JSON data to OpenAI's API
      const result = await sendToOpenAI(env, jsonData, body.href, body.description);

      // Return the result from OpenAI's API
      return new Response(JSON.stringify(result), { status: 200 });
    } catch (err: any) {
      console.error(err);

      // Do some error handling. If the error message includes "maximum context length is" then the JSON data provided was too large for OpenAI to process.
      if (err?.message?.includes('maximum context length is')) {
        return new Response('The HTML provided is too large for OpenAI to process.', { status: 413 });
      } else {
        return new Response(err?.message ? err.message : "OpenAI's API had a problem.", { status: 500 });
      }
    }
  },
};

Feel free to read the code comments, but basically I'm validating the request to my Cloudflare Worker, scraping the text content from the provided href and selector, serializing that text to JSON, sending it to OpenAI's API, and returning the result.
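To try the Worker out, Wrangler can run it locally and then push it to Cloudflare (the exact publish command depends on your Wrangler version):

# run locally (serves on http://localhost:8787 by default)
wrangler dev

# deploy to your workers.dev subdomain
wrangler publish   # or `wrangler deploy` on newer versions of Wrangler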

How I'm prompting OpenAI's API

To see how I'm crafting my prompt to send to OpenAI's API, check out the sendToOpenAI function below.

async function sendToOpenAI(env: Env, rawHtml: string, href: string, description: string) {
  const body = {
    model: 'gpt-3.5-turbo',
    messages: [
      {
        role: 'system',
        content: `You are a web scraper being asked to get content from ${href}. You have been provided with the scraped text values of certain DOM Element Tags. Your task is to parse these and extract the requested data as described. The format in your response should strictly be a JSON Array of the requested data with no additional text or formatting. If it's a singular item, it will be a JSON Array of 1 length. Here is a description of how to locate the content that we want from the site: ${description}. Treat this like you are an API endpoint, just return the JSON.`,
      },
      {
        role: 'user',
        content: `Parse the following scraped text values and locate the content as per the description provided: ${rawHtml}. Remember, no explanation. Just JSON.`,
      },
    ],
    temperature: 0.2,
  };

  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${env.OPEN_AI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  });

  if (!response.ok) {
    const errorMessage = await response.text();
    throw new Error(`OpenAI API error: ${errorMessage}`);
  }

  interface OpenAIChoice {
    message: {
      content: string;
    };
  }

  interface OpenAIResponse {
    choices: OpenAIChoice[];
  }

  const data: OpenAIResponse = await response.json();
  return data['choices'][0]['message']['content'];
}

The key aspect is the system and user prompts. I'm passing in the href for added context about the site it's working against, and the description to spell out the mission objective.

I am using the Chat Completions API, but you could also use the standard Text Completions API, which would work in a similar manner. I like the separation of a System and a User message, and I've had success using Chat Completions even though this is not a conversational task and there is no back and forth.

The temperature is set to 0.2, which is a low value. This means the response will be more predictable and less random. I've found this to be a good value for the types of code generation/JSON work I'm doing. ChatGPT's default is 0.7, which is much more creative.

I'm using gpt-3.5-turbo which is faster and cheaper than gpt-4. I've found it to be sufficient for these rather "basic" text transformation tasks that are well defined.
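Even with the low temperature, whatever calls this Worker shouldn't blindly trust that the model returned pure JSON. A minimal sketch of how a caller might validate the result (parseScrapedJson is a hypothetical helper, not part of the Worker above):

// Hypothetical caller-side helper: parse the string the Worker returns and
// fail loudly if the model slipped in anything that isn't valid JSON.
function parseScrapedJson<T = unknown>(raw: string): T {
  try {
    return JSON.parse(raw) as T;
  } catch {
    throw new Error(`OpenAI did not return valid JSON: ${raw.slice(0, 200)}`);
  }
}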

Making the request to your worker to scrape the data

curl --location 'https://scraper.yourcloudflareusername.workers.dev' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-correct-api-key' \
--data '{
    "href": "https://website.com/page-to-scrape",
    "selector": "table, thead, tbody, tr, th, td",
    "description": "There is a Rates tables which has the headers: '\''Term'\'' '\''Interest Rates As Low As'\'' '\''Discount Points'\'' '\''APR As Low As'\'' please fetch them in the following format: '\''term'\'' '\''rate'\'' '\''discount_points'\'' and '\''apr'\''. The key for the json array should be '\''rates'\''"
}'

In the href, you send the URL of the page you want to scrape.

In the selector, you can use CSS selectors to choose which elements to scrape.

In the description, you describe the data you want to scrape. This is used to help the AI understand what you want; the more detailed you are, the better the results will be. The default response is a JSON array, even if the result is a single value, so tell it what keys and values you want in your description.
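With the description in the example above, a successful response would look something like this (the values are made up; the shape just follows the keys the description asks for):

{
  "rates": [
    { "term": "15 Year Fixed", "rate": "5.25%", "discount_points": "0.5", "apr": "5.41%" },
    { "term": "30 Year Fixed", "rate": "5.75%", "discount_points": "0.5", "apr": "5.89%" }
  ]
}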

The source code

As always, you can find the source code for my projects on GitHub.

Conclusion

I think there are a lot of interesting use cases for this. In the future I may build upon this to create a Scraping Hub-style platform where you can schedule these requests as cron jobs, repeat them however often you need, and save historical results.
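Workers already have a scheduling primitive for that: Cron Triggers. A rough sketch of the wrangler.toml addition (the schedule below is just an example), paired with a scheduled() handler exported alongside fetch() in the Worker:

[triggers]
crons = ["0 6 * * *"]   # run once a day at 06:00 UTC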

I'd love to hear your thoughts on LinkedIn.