Web scraping AI that leverages the crawl4ai library to extract structured data from web pages using various large language models (LLMs).

Croups/auto-scraper-with-llms

Web Scraper with LLM Extraction Strategy

This project is a simple web scraping application built with Streamlit that leverages the crawl4ai library to extract structured data from web pages using various large language models (LLMs). Users can dynamically configure the extraction process by selecting a model, entering an API key, defining a custom schema, and providing extraction instructions.

Demo video: web_scraper.mp4

Features

  • Model Selection: Choose from multiple LLMs (e.g., gpt-4o-mini, gpt-4o, ollama/llama2, ollama/llama3).
  • API Key Input: Securely enter your OpenAI API key (or other required keys) directly in the sidebar.
  • Schema Definition: Define a JSON schema for data extraction (default schema extracts product name and price).
  • Custom Extraction Instructions: Provide tailored extraction instructions to suit your target webpage.
  • Asynchronous Crawling: Uses asynchronous functions for efficient web crawling and data extraction.
  • Interactive UI: Built with Streamlit for a user-friendly interface.
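
As a rough illustration of the asynchronous crawling idea, the sketch below runs several page fetches concurrently with `asyncio.gather`. It does not use crawl4ai: `fetch_page`, `crawl_all`, and `run_crawl` are hypothetical names, and the fetch only sleeps instead of touching the network, so treat it as a pattern rather than the app's actual code.

```python
import asyncio


async def fetch_page(url: str, delay: float = 0.01) -> dict:
    """Hypothetical stand-in for a real crawl; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return {"url": url, "status": "ok"}


async def crawl_all(urls: list[str]) -> list[dict]:
    # gather() schedules all coroutines concurrently and returns
    # their results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_page(u) for u in urls))


def run_crawl(urls: list[str]) -> list[dict]:
    # Synchronous entry point, e.g. for calling from a button handler.
    return asyncio.run(crawl_all(urls))
```

Because the fetches overlap while waiting, total time stays close to the slowest single fetch rather than the sum of all of them.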

Prerequisites

Installation

  1. Clone the Repository:
  2. Create a Virtual Environment (Optional but Recommended):
  • python -m venv venv
  • source venv/bin/activate # On Windows, use venv\Scripts\activate
  3. Install Dependencies and Run:
  • pip install -r requirements.txt
  • streamlit run app.py
  4. Configuration
  • API Key: If using a model that requires an API key (such as OpenAI's GPT models), either set the environment variable OPENAI_API_KEY or enter it in the sidebar when running the app.

  • Schema Definition: The sidebar allows you to define a JSON schema for extraction. The default is:

  • { "name": "str", "price": "str" }
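
The two configuration points above could be handled along these lines. This is a minimal sketch, not the app's code: the helper names (`resolve_api_key`, `parse_schema`, `conforms`), the rule that a sidebar value takes precedence over the environment variable, and the set of accepted type names are all assumptions made for illustration.

```python
import json
import os

# Default schema as shown in the sidebar.
DEFAULT_SCHEMA = '{ "name": "str", "price": "str" }'

# Assumption: type names in the schema map onto these Python types.
TYPE_NAMES = {"str": str, "int": int, "float": float}


def resolve_api_key(sidebar_value, env=None):
    """Return the sidebar key if given, else OPENAI_API_KEY, else None."""
    env = os.environ if env is None else env
    key = (sidebar_value or "").strip() or env.get("OPENAI_API_KEY", "").strip()
    return key or None


def parse_schema(text):
    """Parse the schema JSON into a {field: python_type} mapping."""
    schema = json.loads(text)
    if not isinstance(schema, dict):
        raise ValueError("schema must be a JSON object")
    parsed = {}
    for field, type_name in schema.items():
        if not isinstance(type_name, str) or type_name not in TYPE_NAMES:
            raise ValueError(f"unsupported type for field {field!r}: {type_name!r}")
        parsed[field] = TYPE_NAMES[type_name]
    return parsed


def conforms(record, schema):
    """Check that a record has exactly the schema's fields with matching types."""
    return set(record) == set(schema) and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )
```

For example, `conforms({"name": "Mug", "price": "9.99"}, parse_schema(DEFAULT_SCHEMA))` checks one extracted record against the default schema before it is displayed.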
  5. Contact

    - Feel free to contact me with any questions.

    - LinkedIn: www.linkedin.com/in/enes-koşar