An agentic AI web scraper that uses an LLM to understand natural language queries and extract structured data from websites. Built with FastAPI, Google Gemini, Playwright, and BeautifulSoup.
This project was developed with the assistance of AI tools such as Cursor and ChatGPT for rapid prototyping. The final logic and structure were reviewed and modified by the developer.
- AI-Powered Query Interpretation: Uses Google Gemini to understand what to scrape from natural language queries
- JavaScript Rendering: Uses Playwright to handle dynamic content and SPAs
- Structured Data Extraction: Extracts articles, links, headlines, and metadata
- Async Processing: Built with async/await for better performance
- CORS Support: Ready for frontend integration
- Production Ready: Includes error handling, logging, and health checks
```
pip install -r requirements.txt
playwright install chromium
```

Copy the example environment file:

```
cp env.example .env
```

Edit `.env` and add your Gemini API key:

```
GEMINI_API_KEY=your_actual_gemini_api_key_here
```

Start the server:

```
python main.py
```

The API will be available at `http://localhost:8000`.
`POST /scrape`

Request body:

```json
{
  "query": "Scrape Bitcoin articles from CNN"
}
```

Response:

```json
{
  "success": true,
  "data": [
    {
      "title": "Bitcoin reaches new all-time high",
      "link": "https://www.cnn.com/2024/01/15/bitcoin-high",
      "summary": "Bitcoin surged to a new record high as institutional adoption continues...",
      "date": "2024-01-15",
      "source": "CNN",
      "category": "Cryptocurrency"
    }
  ],
  "message": "Successfully scraped 5 items from https://www.cnn.com"
}
```

`GET /health`

Returns the status of all components:

```json
{
  "status": "healthy",
  "components": {
    "browser": true,
    "gemini": true
  }
}
```

The AI can understand various types of scraping requests:
- News Articles: "Get latest tech news from TechCrunch"
- Product Information: "Scrape iPhone reviews from Amazon"
- Social Media: "Get trending posts from Reddit"
- Blog Posts: "Extract articles from Medium about AI"
- Custom Websites: "Scrape job listings from Indeed"
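The request and response payloads above map naturally onto Pydantic schemas. A sketch follows; field names mirror the example JSON, but the model names and which fields are optional are assumptions, not the project's actual definitions.

```python
from typing import List, Optional
from pydantic import BaseModel

class ScrapeRequest(BaseModel):
    query: str

class ScrapedItem(BaseModel):
    # title and link appear in every example item; the rest may be absent
    title: str
    link: str
    summary: Optional[str] = None
    date: Optional[str] = None
    source: Optional[str] = None
    category: Optional[str] = None

class ScrapeResponse(BaseModel):
    success: bool
    data: List[ScrapedItem]
    message: str
```

Declaring the response model on the endpoint gives free validation and keeps the Swagger docs in sync with what the scraper actually returns.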
- `main.py` - FastAPI application with endpoints and middleware
- `agent.py` - Google Gemini agent for query interpretation and data extraction
- `browser.py` - Playwright browser manager for page loading
- `scraper.py` - BeautifulSoup HTML parser and data extractor
1. Query Interpretation: AI analyzes the natural language query to determine:
   - Target website URL
   - What elements to scrape
   - Scraping strategy
2. Page Loading: Playwright loads the page and renders JavaScript
3. Data Extraction: AI extracts structured data from the HTML content
4. Response: Returns formatted JSON with extracted data
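The four steps above can be sketched as an async pipeline. The helpers here are stubbed stand-ins (canned return values) for the real agent, browser, and scraper modules; only the control flow reflects the design.

```python
import asyncio

async def interpret_query(query: str) -> dict:
    # Step 1: in the real app, Gemini turns the query into a scraping plan.
    # Stubbed with a fixed plan for illustration.
    return {"url": "https://example.com", "target": "articles"}

async def load_page(url: str) -> str:
    # Step 2: in the real app, Playwright renders the page (JavaScript included).
    return "<html><body><h1>Example</h1></body></html>"

def extract_items(html: str, plan: dict) -> list[dict]:
    # Step 3: in the real app, the AI plus BeautifulSoup pull structured data.
    return [{"title": "Example", "link": plan["url"]}]

async def run_pipeline(query: str) -> dict:
    plan = await interpret_query(query)
    html = await load_page(plan["url"])
    items = extract_items(html, plan)
    # Step 4: formatted JSON response
    return {"success": True, "data": items,
            "message": f"Successfully scraped {len(items)} items from {plan['url']}"}

result = asyncio.run(run_pipeline("Scrape articles from example.com"))
```

Keeping interpretation and page loading async means slow network steps don't block other requests.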
| Variable | Description | Default |
|---|---|---|
| `GEMINI_API_KEY` | Your Google Gemini API key | Required |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `LOG_LEVEL` | Logging level | `INFO` |
| `HEADLESS` | Run browser in headless mode | `true` |
| `BROWSER_TIMEOUT` | Browser timeout (ms) | `30000` |
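These variables might be read at startup roughly like this; the module-level names and the `env_bool` helper are illustrative, not the project's actual code.

```python
import os

def env_bool(name: str, default: bool) -> bool:
    # Treat "true"/"1"/"yes" (case-insensitive) as True
    return os.getenv(name, str(default)).strip().lower() in ("true", "1", "yes")

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")                  # required, no default
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
HEADLESS = env_bool("HEADLESS", True)
BROWSER_TIMEOUT = int(os.getenv("BROWSER_TIMEOUT", "30000"))  # milliseconds
```

Parsing booleans explicitly matters: `bool("false")` is `True` in Python, so a naive cast would silently force headed mode off.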
You can customize the scraping behavior by modifying:

- Website mappings in `agent.py` for common sites
- CSS selectors in `scraper.py` for different content types
- Browser settings in `browser.py` for different environments
```
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Once running, visit:

- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
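Beyond the interactive docs, the API can be called from a script. A minimal stdlib-only client sketch follows; the URL assumes the local default host and port.

```python
import json
import urllib.request

def build_scrape_request(query: str, base_url: str = "http://localhost:8000"):
    """Build a POST request for the /scrape endpoint."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def scrape(query: str) -> dict:
    # Sends the request to a running server and decodes the JSON response
    with urllib.request.urlopen(build_scrape_request(query)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```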
Test the API with curl:

```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"query": "Get latest news from BBC"}'
```

Create a Dockerfile:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install Playwright browsers (--deps also installs the OS libraries
# Chromium needs inside a slim image)
RUN playwright install --deps chromium

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "main.py"]
```

For production, ensure:
- Set `HEADLESS=true` for server environments
- Configure proper CORS origins
- Set up logging and monitoring
- Use environment-specific API keys
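For the logging item above, a basic setup might look like this; the format string and logger name are suggestions, not the project's actual configuration.

```python
import logging
import os

# Level comes from the LOG_LEVEL environment variable (default INFO);
# getattr falls back to INFO if the value is not a valid level name.
logging.basicConfig(
    level=getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO),
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("scraper")
logger.info("logging configured")
```

In production you would typically ship these records to a log aggregator rather than relying on stdout alone.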
- Playwright Installation: Make sure to run `playwright install chromium`
- Gemini API Key: Verify your API key is valid
- Browser Issues: Check whether headless mode is required for your environment
- Memory Usage: Large pages may require more memory allocation
Check the application logs for detailed error information. The API includes comprehensive logging for debugging.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.