Agentic Web Scraper API

An agentic AI approach to web scraping that uses an LLM to interpret natural-language queries and extract structured data from websites. Built with FastAPI, Google Gemini, Playwright, and BeautifulSoup.

Transparency Acknowledgement

This project was developed with the assistance of AI tools such as Cursor and ChatGPT for rapid prototyping. The final logic and structure were reviewed and modified by the developer.

Features

AI-Powered Query Interpretation: Uses Google Gemini to work out from a natural-language query which site and which content to scrape
JavaScript Rendering: Uses Playwright to handle dynamic content and SPAs
Structured Data Extraction: Extracts articles, links, headlines, and metadata
Async Processing: Built with async/await so that page loads and model calls do not block other requests
CORS Support: Ready for frontend integration
Production Ready: Includes error handling, logging, and health checks

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Install Playwright Browsers

playwright install chromium

3. Set Up Environment Variables

Copy the example environment file:

cp env.example .env

Edit .env and add your Gemini API key:

GEMINI_API_KEY=your_actual_gemini_api_key_here

4. Run the API

python main.py

The API will be available at http://localhost:8000

API Usage

Scrape Content

POST /scrape

Request body:

{
  "query": "Scrape Bitcoin articles from CNN"
}

Response:

{
  "success": true,
  "data": [
    {
      "title": "Bitcoin reaches new all-time high",
      "link": "https://www.cnn.com/2024/01/15/bitcoin-high",
      "summary": "Bitcoin surged to a new record high as institutional adoption continues...",
      "date": "2024-01-15",
      "source": "CNN",
      "category": "Cryptocurrency"
    }
  ],
  "message": "Successfully scraped 5 items from https://www.cnn.com"
}
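
A client will usually want this response as typed objects rather than raw dictionaries. The following is a minimal sketch, assuming the field names shown in the sample response above; `ScrapedItem`, `ScrapeResult`, and `parse_scrape_response` are hypothetical names and are not part of this project. Every item field is treated as optional, since the fields the API returns may vary by query.

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScrapedItem:
    # Field names mirror the sample response; all are optional by assumption.
    title: str = ""
    link: str = ""
    summary: Optional[str] = None
    date: Optional[str] = None
    source: Optional[str] = None
    category: Optional[str] = None


@dataclass
class ScrapeResult:
    success: bool
    data: List[ScrapedItem]
    message: str


def parse_scrape_response(body: str) -> ScrapeResult:
    """Convert the JSON text returned by POST /scrape into typed objects."""
    payload = json.loads(body)
    known = {f.name for f in fields(ScrapedItem)}
    # Ignore keys this sketch does not know about instead of failing on them.
    items = [
        ScrapedItem(**{k: v for k, v in raw.items() if k in known})
        for raw in payload.get("data", [])
    ]
    return ScrapeResult(
        success=bool(payload.get("success", False)),
        data=items,
        message=payload.get("message", ""),
    )
```

Unknown keys are dropped rather than rejected, so the parser keeps working if the API later adds fields.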

Health Check

GET /health

Returns the status of all components:

{
  "status": "healthy",
  "components": {
    "browser": true,
    "openai": true
  }
}
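
A deployment script or monitor can use this payload to decide whether the service is ready. The sketch below assumes only the shape shown above (a `status` string and a `components` map of booleans); the helper names `failing_components` and `is_ready` are hypothetical.

```python
from typing import Dict, List


def failing_components(health: Dict) -> List[str]:
    """Return the names of components that the health payload reports as not ready."""
    components = health.get("components", {})
    return sorted(name for name, ok in components.items() if not ok)


def is_ready(health: Dict) -> bool:
    """Ready only if the overall status is "healthy" and no component failed."""
    return health.get("status") == "healthy" and not failing_components(health)
```

For the sample response above, `is_ready` returns `True` and `failing_components` returns an empty list.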

Supported Query Types

The AI can understand various types of scraping requests:

News Articles: "Get latest tech news from TechCrunch"
Product Information: "Scrape iPhone reviews from Amazon"
Social Media: "Get trending posts from Reddit"
Blog Posts: "Extract articles from Medium about AI"
Custom Websites: "Scrape job listings from Indeed"

Architecture

Components

main.py - FastAPI application with endpoints and middleware
agent.py - Google Gemini agent for query interpretation and data extraction
browser.py - Playwright browser manager for page loading
scraper.py - BeautifulSoup HTML parser and data extractor

Flow

1. Query Interpretation: AI analyzes the natural language query to determine:
   - Target website URL
   - What elements to scrape
   - Scraping strategy
2. Page Loading: Playwright loads the page and renders JavaScript
3. Data Extraction: AI extracts structured data from the HTML content
4. Response: Returns formatted JSON with extracted data
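
The steps above can be sketched as a small async pipeline. This is an illustration of the flow only, not the project's actual code: `ScrapePlan`, `run_pipeline`, and the three `fake_*` stand-ins are hypothetical, and the real work is done by Gemini, Playwright, and BeautifulSoup in `agent.py`, `browser.py`, and `scraper.py`.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class ScrapePlan:
    url: str             # target website URL
    elements: List[str]  # what to scrape, e.g. ["headlines", "links"]
    strategy: str        # free-form label for the scraping strategy


async def run_pipeline(
    query: str,
    interpret: Callable[[str], Awaitable[ScrapePlan]],
    load_page: Callable[[str], Awaitable[str]],
    extract: Callable[[str, ScrapePlan], Awaitable[List[Dict]]],
) -> Dict:
    plan = await interpret(query)      # 1. query interpretation
    html = await load_page(plan.url)   # 2. page loading
    items = await extract(html, plan)  # 3. data extraction
    return {                           # 4. response
        "success": True,
        "data": items,
        "message": f"Successfully scraped {len(items)} items from {plan.url}",
    }


# Stand-ins so the sketch runs without a model, a browser, or network access.
async def fake_interpret(query: str) -> ScrapePlan:
    return ScrapePlan("https://example.com", ["headlines"], "static")


async def fake_load(url: str) -> str:
    return "<h1>Hello</h1>"


async def fake_extract(html: str, plan: ScrapePlan) -> List[Dict]:
    return [{"title": "Hello", "link": plan.url}]


if __name__ == "__main__":
    print(asyncio.run(run_pipeline("Get headlines", fake_interpret, fake_load, fake_extract)))
```

Passing the three stages in as callables keeps each one replaceable, which also makes the flow easy to test without calling the Gemini API.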

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `GEMINI_API_KEY` | Your Google Gemini API key | Required |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `LOG_LEVEL` | Logging level | `INFO` |
| `HEADLESS` | Run browser in headless mode | `true` |
| `BROWSER_TIMEOUT` | Browser timeout (ms) | `30000` |
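
As an illustration of how these variables and defaults fit together, here is a minimal sketch of a settings loader. The `Settings` class and `load_settings` function are hypothetical, and the project may read its configuration differently; only the variable names and defaults come from the table above. The accepted spellings for a true `HEADLESS` value are also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "INFO"
    headless: bool = True
    browser_timeout_ms: int = 30000


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        # The only variable without a default, so fail early and clearly.
        raise RuntimeError("GEMINI_API_KEY is required but not set")
    return Settings(
        gemini_api_key=key,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        # Assumption: "1", "true", and "yes" (any case) all mean headless.
        headless=env.get("HEADLESS", "true").strip().lower() in ("1", "true", "yes"),
        browser_timeout_ms=int(env.get("BROWSER_TIMEOUT", "30000")),
    )
```

Accepting a mapping argument lets tests pass a plain dictionary instead of modifying the real environment.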

Customization

You can customize the scraping behavior by modifying:

Website mappings in agent.py for common sites
CSS selectors in scraper.py for different content types
Browser settings in browser.py for different environments
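
To give a sense of what a website mapping might look like, here is a hypothetical sketch. The actual structure in `agent.py` is not shown in this README, so `WEBSITE_MAPPINGS`, its entries, and `resolve_site` are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical mapping from a site name to its base URL.
WEBSITE_MAPPINGS: Dict[str, str] = {
    "cnn": "https://www.cnn.com",
    "bbc": "https://www.bbc.com",
    "techcrunch": "https://techcrunch.com",
}


def resolve_site(query: str, mappings: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the base URL of the first known site named in the query, if any."""
    mappings = WEBSITE_MAPPINGS if mappings is None else mappings
    # Compare whole words, case-insensitively, ignoring trailing punctuation.
    words = {w.strip(".,!?\"'").lower() for w in query.split()}
    for name, url in mappings.items():
        if name in words:
            return url
    return None
```

A query that names no known site returns `None`, which is the point where a real agent would fall back to asking the model for the target URL.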

Development

Running in Development Mode

uvicorn main:app --reload --host 0.0.0.0 --port 8000

API Documentation

Once running, visit:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Testing

Test the API with curl:

curl -X POST "http://localhost:8000/scrape" \
     -H "Content-Type: application/json" \
     -d '{"query": "Get latest news from BBC"}'
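
The same request can be made from Python using only the standard library. This is a sketch that assumes the API is running locally on the default port; `build_scrape_request` and `scrape` are hypothetical helper names, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_scrape_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /scrape request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(query: str, base_url: str = "http://localhost:8000", timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response; requires a running server."""
    request = build_scrape_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `scrape("Get latest news from BBC")` returns the decoded response dictionary.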

Production Deployment

Docker

Create a Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install Chromium for Playwright together with the system libraries it needs
RUN playwright install --with-deps chromium

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "main.py"]

Environment Setup

For production, make sure you:

Set HEADLESS=true for server environments
Configure proper CORS origins
Set up logging and monitoring
Use environment-specific API keys

Troubleshooting

Common Issues

Playwright Installation: Make sure to run playwright install chromium
Gemini API Key: Verify your API key is valid
Browser Issues: Check whether your environment requires headless mode (servers without a display do)
Memory Usage: Large pages may require more memory allocation
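
Some of these checks can be run before starting the server. The sketch below is a hypothetical preflight helper, not part of this project. It only verifies that the `playwright` Python package is importable and that `GEMINI_API_KEY` is set; it cannot tell whether the Chromium browser itself has been downloaded or whether the key is valid.

```python
import importlib.util
import os
from typing import List, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems: List[str] = []
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("playwright") is None:
        problems.append("playwright package not found: run pip install -r requirements.txt")
    if not env.get("GEMINI_API_KEY", "").strip():
        problems.append("GEMINI_API_KEY is not set: add it to your .env file")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```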

Logs

Check the application logs for detailed error information. Log verbosity is controlled by the LOG_LEVEL environment variable.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.
