This project is a simple web scraping application built with Streamlit that leverages the crawl4ai library to extract structured data from web pages using various large language models (LLMs). Users can dynamically configure the extraction process by selecting a model, entering an API key, defining a custom schema, and providing extraction instructions.
Demo video: `web_scraper.mp4`
Features

- Model Selection: Choose from multiple LLM providers (e.g. `gpt-4o-mini`, `gpt-4o`, `ollama/llama2`, `ollama/llama3`).
- API Key Input: Securely enter your OpenAI API key (or other required keys) directly in the sidebar.
- Schema Definition: Define a JSON schema for data extraction (default schema extracts product name and price).
- Custom Extraction Instructions: Provide tailored extraction instructions to suit your target webpage.
- Asynchronous Crawling: Uses asynchronous functions for efficient web crawling and data extraction.
- Interactive UI: Built with Streamlit for a user-friendly interface; a sketch of how these pieces fit together follows this list.
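A minimal sketch of how these features might combine, assuming a crawl4ai release in which `LLMExtractionStrategy` accepts `provider` and `api_token` directly (newer releases move these into config objects); the widget labels and the `extract` helper are illustrative, not the app's actual code:

```python
import asyncio
import json
import os

import streamlit as st
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Sidebar controls mirroring the features above: model, API key,
# JSON schema, and free-form extraction instructions.
model = st.sidebar.selectbox(
    "Model", ["gpt-4o-mini", "gpt-4o", "ollama/llama2", "ollama/llama3"]
)
api_key = st.sidebar.text_input("API Key", type="password")
schema_text = st.sidebar.text_area("JSON schema", '{ "name": "str", "price": "str" }')
instruction = st.sidebar.text_area(
    "Extraction instructions", "Extract product names and prices."
)

async def extract(url: str):
    """Crawl the page asynchronously and let the LLM extract structured records."""
    strategy = LLMExtractionStrategy(
        # crawl4ai expects a litellm-style provider string, e.g. "openai/gpt-4o-mini".
        provider=model if "/" in model else f"openai/{model}",
        api_token=api_key or os.environ.get("OPENAI_API_KEY"),
        schema=json.loads(schema_text),
        extraction_type="schema",
        instruction=instruction,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
    # extracted_content is a JSON string of records matching the schema.
    return json.loads(result.extracted_content)

url = st.text_input("URL to scrape")
if st.button("Extract") and url:
    st.json(asyncio.run(extract(url)))
```

Wrapping the crawl in `asyncio.run` keeps the Streamlit script itself synchronous while still using crawl4ai's asynchronous crawler.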
Installation

- Clone the Repository:
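  - `git clone <repository-url>` (replace `<repository-url>` with this repository's clone URL)
  - `cd <repository-folder>`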
- Create a Virtual Environment (Optional but Recommended):
  - `python -m venv venv`
  - `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install Dependencies and Run:
  - `pip install -r requirements.txt`
  - `streamlit run app.py`
Configuration

- API Key: If using a model that requires an API key (such as OpenAI's GPT models), either set the environment variable `OPENAI_API_KEY` or enter it in the sidebar when running the app (see the sketch after this list).
- Schema Definition: The sidebar allows you to define a JSON schema for extraction. The default is `{ "name": "str", "price": "str" }`.
CONTACT ME

Feel free to contact me with any questions:

- LinkedIn: www.linkedin.com/in/enes-koşar