This project is a simple web scraping application built with Streamlit that leverages the crawl4ai library to extract structured data from web pages using various large language models (LLMs). Users can dynamically configure the extraction process by selecting a model, entering an API key, defining a custom schema, and providing extraction instructions.
Demo video: `web_scraper.mp4`
Features

- Model Selection: Choose from multiple LLM providers (e.g. `gpt-4o-mini`, `gpt-4o`, `ollama/llama2`, `ollama/llama3`).
- API Key Input: Securely enter your OpenAI API key (or other required keys) directly in the sidebar.
- Schema Definition: Define a JSON schema for data extraction (default schema extracts product name and price).
- Custom Extraction Instructions: Provide tailored extraction instructions to suit your target webpage.
- Asynchronous Crawling: Uses asynchronous functions for efficient web crawling and data extraction.
- Interactive UI: Built with Streamlit for a user-friendly interface; a sketch of how these pieces fit together follows this list.
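A minimal sketch of how these features might combine, assuming a crawl4ai release in which `LLMExtractionStrategy` accepts `provider` and `api_token` directly (newer releases move these into config objects); the widget labels and the `extract` helper are illustrative, not the app's actual code:

```python
import asyncio
import json
import os

import streamlit as st
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Sidebar controls mirroring the features above: model, API key,
# JSON schema, and free-form extraction instructions.
model = st.sidebar.selectbox(
    "Model", ["gpt-4o-mini", "gpt-4o", "ollama/llama2", "ollama/llama3"]
)
api_key = st.sidebar.text_input("API Key", type="password")
schema_text = st.sidebar.text_area("JSON schema", '{ "name": "str", "price": "str" }')
instruction = st.sidebar.text_area(
    "Extraction instructions", "Extract product names and prices."
)

async def extract(url: str):
    """Crawl the page asynchronously and let the LLM extract structured records."""
    strategy = LLMExtractionStrategy(
        # crawl4ai expects a litellm-style provider string, e.g. "openai/gpt-4o-mini".
        provider=model if "/" in model else f"openai/{model}",
        api_token=api_key or os.environ.get("OPENAI_API_KEY"),
        schema=json.loads(schema_text),
        extraction_type="schema",
        instruction=instruction,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
    # extracted_content is a JSON string of records matching the schema.
    return json.loads(result.extracted_content)

url = st.text_input("URL to scrape")
if st.button("Extract") and url:
    st.json(asyncio.run(extract(url)))
```

Wrapping the crawl in `asyncio.run` keeps the Streamlit script itself synchronous while still using crawl4ai's asynchronous crawler.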
Installation

- Clone the Repository:
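  - `git clone <repository-url>` (replace `<repository-url>` with this repository's clone URL)
  - `cd <repository-folder>`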
- Create a Virtual Environment (Optional but Recommended):
  - `python -m venv venv`
  - `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install Dependencies and Run:
  - `pip install -r requirements.txt`
  - `streamlit run app.py`
Configuration

- API Key: If using a model that requires an API key (such as OpenAI's GPT models), either set the environment variable `OPENAI_API_KEY` or enter it in the sidebar when running the app (see the sketch after this list).
- Schema Definition: The sidebar allows you to define a JSON schema for extraction. The default is `{ "name": "str", "price": "str" }`.
CONTACT ME

Feel free to contact me with any questions:

- LinkedIn: www.linkedin.com/in/enes-koşar