
How To Effortlessly Integrate Scraped Data Into Your SaaS

Elevate your LLM or application with third-party data from an API

You’ve all seen it.

The “GPT-4 powered” AI SaaS that promises to read your mind, make you an omelette and massage your toes.

Ok, I lied. We haven’t quite seen that yet.

But we have seen apps cook up a recipe for an omelette based on GPT-4’s training data.

What if you wanted to take that same app and train it exclusively on recipes contributed by mothers and grandmothers from across the world? You’d have the most premium AI-powered recipe app on the planet.

You could ask it if it’s a good idea to put ketchup on a steak and it would automatically report your inquiry to your nearest credit bureau.

Now how do you get the latest data, cleaned up and easy enough to feed to an AI model?

You could hire someone from Upwork to scrape data and hand you a big CSV file. That might get you a big data set, but you’d have to keep paying them every time you wanted fresh data.

You could hire someone to build an API that scraped the data to a database and kept it up-to-date automatically.

The first solution is the fastest way to start building. The second is ideal in the long run.

For a solution in the middle, we can look to Apify.

Apify has a store full of actors. Just like human actors, they perform actions based on a script.

Let’s say you’re building a travel app that uses AI in some way. You could use a TripAdvisor actor to get data on hotels, restaurants, things to do, and more.

What’s convenient about these actors is that they “act” like regular APIs.

Here’s an example using Node.js:

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "locationFullName": "Chicago",
    "maxItemsPerQuery": 10,
    "includeTags": true,
    "includeAttractions": true,
    "includeRestaurants": true,
    "includeHotels": true,
    "includeVacationRentals": false,
    "checkInDate": "",
    "checkOutDate": "",
    "includePriceOffers": true,
    "includeAiReviewsSummary": false,
    "language": "en",
    "currency": "USD"
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("<YOUR_ACTOR_ID>").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
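Once the items come back, you’ll usually want to flatten them into plain-text chunks before handing them to a model. Here’s a minimal sketch of that step; the field names (`name`, `rating`, `description`) are assumptions, so check the actual output schema of whichever actor you pick:

```javascript
// Turn raw dataset items into plain-text chunks an LLM can digest.
// The field names below are assumptions -- inspect your actor's real output.
function itemsToChunks(items) {
    return items
        .filter((item) => item.name) // skip records with no usable title
        .map((item) => {
            const parts = [item.name];
            if (item.rating) parts.push(`Rating: ${item.rating}`);
            if (item.description) parts.push(item.description);
            return parts.join(' | ');
        });
}

// Mock items shaped like a scraper's output, for illustration
const chunks = itemsToChunks([
    { name: 'Deep Dish Central', rating: 4.5, description: 'Chicago-style pizza.' },
    { name: 'Lakefront Hotel', rating: 4.0 },
    { rating: 3.0 }, // skipped: no name
]);
console.log(chunks);
```

Keeping each chunk small and self-describing pays off later, when you retrieve them for prompts.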

All you have to do is:

  1. Create an Apify account and log in to your account.

  2. Select an actor from the store.

  3. Click on the API dropdown on the actor page.

  4. Choose API clients and you’ll see code like the snippet above.
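If you’d rather skip the client library, that same dropdown also exposes plain HTTP endpoints. A sketch of building one such request URL, assuming Apify’s run-sync-get-dataset-items endpoint (which runs an actor and returns its dataset in a single call); the actor ID and token are placeholders:

```javascript
// Build the URL for Apify's run-sync-get-dataset-items endpoint,
// which runs an actor and returns its dataset items in one request.
// Actor ID and token here are placeholders, not real credentials.
function buildActorUrl(actorId, token) {
    return `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items?token=${token}`;
}

const url = buildActorUrl('<YOUR_ACTOR_ID>', '<YOUR_API_TOKEN>');
console.log(url);

// To actually run it (requires a real actor ID and token):
// const res = await fetch(url, {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify({ locationFullName: 'Chicago', maxItemsPerQuery: 10 }),
// });
// const items = await res.json();
```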

You can also play around with the actor in Apify’s console and check that it gives you the output you expect before you start using it like an API.

Different actors have different prices. In the example above, Tripadvisor Scraper charges you per set of results, and you can cap how many you get in the Max results input box.

Now you’re probably wondering, “Okay, I have all this data now; how do I feed it to an AI model?”

Look up “retrieval augmented generation,” also known as RAG.

That’s what you’re going to want to build.
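The gist of RAG: before each prompt, retrieve the scraped chunks most relevant to the user’s question and prepend them as context. A toy sketch of the retrieval step, where keyword overlap stands in for the embedding similarity you’d use in a real build:

```javascript
// Toy RAG retrieval: score stored chunks against a question by keyword
// overlap (a stand-in for real embedding similarity) and keep the top
// matches to stuff into the LLM prompt.
function retrieve(question, chunks, topK = 2) {
    const qWords = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
    return chunks
        .map((chunk) => {
            const words = chunk.toLowerCase().split(/\W+/);
            const score = words.filter((w) => qWords.has(w)).length;
            return { chunk, score };
        })
        .filter(({ score }) => score > 0)
        .sort((a, b) => b.score - a.score)
        .slice(0, topK)
        .map(({ chunk }) => chunk);
}

// Chunks shaped like the scraped travel data discussed above
const chunks = [
    'Deep Dish Central | Chicago-style pizza, rated 4.5',
    'Lakefront Hotel | rooms with a view of Lake Michigan',
    'Museum of Science | family-friendly exhibits',
];

const context = retrieve('Where can I get pizza in Chicago?', chunks);
console.log(context);

// The retrieved chunks then get prepended to the prompt, e.g.:
// `Answer using this context:\n${context.join('\n')}\n\nQuestion: ...`
```

In production you’d swap the keyword scorer for embeddings from a model API plus a vector store, but the shape of the pipeline stays the same.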

Maybe we’ll touch on it in a future post!
