Objective: Use LLM-generated scraping scripts to fetch and parse HTML within AVM’s sandbox.

Data Extraction

Delegate web scraping logic to an LLM, execute safely on AVM nodes, and obtain structured CSV/JSON without local risk.

Scenario: Bulk Scraping

Extract account balance tables from many pages of a DeFi dashboard in parallel.

Solution: Four-Step Extraction

  1. Fetch HTML
    Retrieve page content locally.
  2. Parse with LLM
    Prompt the LLM to write a BeautifulSoup script that extracts the table data as CSV.
  3. Run in AVM
    Execute parsing code with the runPython tool.
  4. Aggregate Results
    Combine CSV outputs for all URLs.

Example (TypeScript)

import { runPythonTool } from "@avm-ai/avm-vercel-ai";
import fetch from "node-fetch";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const tool = runPythonTool();

async function scrape(url: string) {
  // Step 1: fetch the raw HTML locally.
  const html = await fetch(url).then((r) => r.text());

  // Step 2: ask the LLM to write a parser. We only want the code back,
  // so no tools are passed here; the script is executed explicitly below.
  const prompt = `
Write a Python function execute(input) that uses BeautifulSoup to parse
the HTML in input["data"] and prints the table data as CSV.`;
  const { text: code } = await generateText({
    model: openai("gpt-4o"),
    prompt,
  });

  // Step 3: run the generated parser inside the AVM sandbox.
  const result = await tool.exec({ code, input: { data: html } });
  return result.output.stdout;
}
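Step 4 (aggregation) is not shown above. A minimal sketch of merging the per-URL CSV outputs, assuming every generated script emits the same header row (combineCsv is a hypothetical helper, not part of the AVM SDK):

```typescript
// Merge CSV strings from several scrape runs, keeping one header row.
// Assumes every generated script printed the same header (illustrative).
function combineCsv(outputs: string[]): string {
  const rows: string[] = [];
  for (const [i, csv] of outputs.entries()) {
    const lines = csv.trim().split("\n");
    // Keep the header only from the first output.
    rows.push(...(i === 0 ? lines : lines.slice(1)));
  }
  return rows.join("\n");
}

// Example: two scrape outputs with identical headers.
const combined = combineCsv([
  "account,balance\nalice,10",
  "account,balance\nbob,20",
]);
// combined === "account,balance\nalice,10\nbob,20"
```

Each element of the input array would be one `scrape(url)` result.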

Next Steps

  • Parallelize jobs via MCP concurrency.
  • Store results on IPFS.
  • Add retry and throttling mechanisms.
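The retry-and-throttling bullet can be sketched as a small wrapper around any scrape job; withRetry is an illustrative helper, not an AVM or AI SDK API:

```typescript
// Retry a flaky async job with exponential backoff (illustrative helper).
async function withRetry<T>(
  job: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await job();
    } catch (err) {
      lastError = err;
      // Back off between attempts: 500 ms, 1000 ms, 2000 ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

A scrape call could then be wrapped as `withRetry(() => scrape(url))`, with the backoff doubling as a simple throttle between retries.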