Scraper Alpha

A first step in learning web scraping and documenting what can be improved to build a modular, secure scraper platform.

Overview

This project is a learning exercise in web scraping: a basic scraper that collects user data from an API and stores it in a MariaDB database with a comprehensive audit trail.

Features

  • Adaptive Scraping: Automatically detects the end of available data
  • Rate Limiting: Configurable delay between requests to avoid overloading the target server
  • Audit Trail: Tracks all data changes (INSERT/UPDATE) for compliance and debugging
  • UUID v7 Support: Uses time-ordered UUIDs for better database indexing
  • Error Handling: Graceful handling of network errors and missing data
  • Configuration-Based: Sensitive data stored in external config files

Project Structure

scraper-alpha/
├── scrape_swc_data.py      # Main scraper script
├── config.yaml.example     # Example configuration file
├── schema.sql              # Database schema
├── SCRAPER_SCRIPT_PLAN.md  # Future architecture plan
├── requirements.txt        # Python dependencies
└── README.md               # This file

Prerequisites

  • Python 3.7+
  • MariaDB or MySQL database
  • pip (Python package manager)

Installation

  1. Clone the repository

    git clone https://github.com/iamgerwin/scraper-alpha.git
    cd scraper-alpha

  2. Install dependencies

    pip install -r requirements.txt

  3. Set up the database

    mysql -u root -p < schema.sql

  4. Configure the scraper

    cp config.yaml.example config.yaml
    # Edit config.yaml with your database credentials and API endpoint

Configuration

Edit config.yaml with your settings (a loading sketch follows the example):

database:
  host: localhost
  port: 3306
  user: your_db_user
  password: your_db_password
  database: your_database_name
  charset: utf8mb4

api:
  base_url: https://your-api-endpoint.com/api/get-user-details
  user_id_param: userId

scraping:
  start_id: 1
  max_consecutive_404: 20
  rate_limit_delay: 0.5
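
As a rough illustration, the scraper can read this file and open a database connection along the following lines. This is a minimal sketch that assumes PyYAML and the pymysql driver; the actual dependencies live in requirements.txt and may differ.

# Minimal sketch: load config.yaml and connect to MariaDB.
# Assumes PyYAML and pymysql; the real script may use other libraries.
import yaml
import pymysql

with open("config.yaml", "r", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

db = config["database"]
connection = pymysql.connect(
    host=db["host"],
    port=db["port"],
    user=db["user"],
    password=db["password"],
    database=db["database"],
    charset=db["charset"],
)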

Usage

Run the scraper:

python scrape_swc_data.py

The scraper will (see the sketch after this list):

  1. Connect to the configured database
  2. Start fetching user data from the API (starting from start_id)
  3. Insert new records or update existing ones
  4. Track all changes in the audit trail
  5. Stop automatically after encountering consecutive failures
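
Roughly, that flow corresponds to a loop like the sketch below. It is illustrative only: the handle_record callback stands in for the insert/update and audit logic (sketched in later sections), and the real scrape_swc_data.py may be organized differently.

import time

import requests


def run_scraper(config, handle_record):
    """Illustrative main loop: fetch users by id and stop after repeated 404s.

    handle_record(user_id, payload) is a callback that upserts the row and
    writes the audit trail; the real script may be organized differently.
    """
    scraping = config["scraping"]
    api = config["api"]
    user_id = scraping["start_id"]
    consecutive_404 = 0

    while consecutive_404 < scraping["max_consecutive_404"]:
        response = requests.get(
            api["base_url"],
            params={api["user_id_param"]: user_id},
            timeout=10,
        )
        if response.status_code == 404:
            consecutive_404 += 1  # likely past the last available user
        else:
            consecutive_404 = 0
            handle_record(user_id, response.json())
        user_id += 1
        time.sleep(scraping["rate_limit_delay"])  # polite delay between requests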

Database Schema

swc_data Table

Stores scraped user information (see the upsert sketch after the column list):

  • id - UUID v7 primary key
  • id_external - External user ID from the API
  • full_name - User's full name
  • email_address - User's email
  • number_of_entries - Number of entries
  • accumulated_amount - Accumulated amount
  • created_at / updated_at - Timestamps
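
For illustration, a new or changed row could be written with an INSERT ... ON DUPLICATE KEY UPDATE statement like the sketch below. The column names come from the list above; the API field names, the unique key on id_external, and the helper itself are assumptions, not the actual code in scrape_swc_data.py.

import uuid

UPSERT_SQL = """
    INSERT INTO swc_data
        (id, id_external, full_name, email_address,
         number_of_entries, accumulated_amount)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        full_name = VALUES(full_name),
        email_address = VALUES(email_address),
        number_of_entries = VALUES(number_of_entries),
        accumulated_amount = VALUES(accumulated_amount)
"""

def upsert_user(connection, external_id, record):
    """Insert a scraped user or update the existing row (illustrative only)."""
    with connection.cursor() as cursor:
        cursor.execute(UPSERT_SQL, (
            str(uuid.uuid4()),               # stand-in; the project uses UUID v7
            external_id,
            record.get("fullName"),          # API field names are assumptions
            record.get("emailAddress"),
            record.get("numberOfEntries"),
            record.get("accumulatedAmount"),
        ))
    connection.commit()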

audit_trail Table

Tracks all data changes:

  • id - UUID v7 primary key
  • id_external - Related external user ID
  • field_name - Field that changed
  • old_value / new_value - Before and after values
  • change_type - INSERT or UPDATE
  • created_at - When the change occurred

Features in Detail

UUID v7 Implementation

Uses timestamp-prefixed UUIDs so that primary keys sort in creation order, which keeps index inserts efficient and gives rows a natural chronological ordering.
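
Older Python versions do not ship a v7 generator in the standard uuid module, so one way to produce these IDs is a small helper like the sketch below (the project may instead rely on a third-party package; this only illustrates the layout).

import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of a UUID v7: 48-bit millisecond timestamp followed by random bits."""
    ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80           # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random bits below it
    value &= ~(0xF << 76)
    value |= 0x7 << 76                               # version field = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                               # RFC 4122 variant bits
    return uuid.UUID(int=value)

Because the most significant bits are a timestamp, later IDs sort after earlier ones, which keeps primary-key inserts close to append-only.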

Audit Trail

Every data change is logged with (see the sketch after this list):

  • Which field changed
  • Old and new values
  • Type of change (INSERT/UPDATE)
  • Timestamp of change
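
A minimal sketch of how such records could be written is shown below. The record_audit name, the per-field diffing, and the assumption that created_at is filled by a database default are all illustrative; the actual implementation in scrape_swc_data.py may differ.

import uuid

AUDIT_SQL = """
    INSERT INTO audit_trail
        (id, id_external, field_name, old_value, new_value, change_type)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

def record_audit(connection, id_external, old_row, new_row):
    """Write one audit_trail row per changed field (illustrative sketch).

    old_row is None for a brand-new record; created_at is assumed to be
    set by a database default.
    """
    change_type = "INSERT" if old_row is None else "UPDATE"
    with connection.cursor() as cursor:
        for field, new_value in new_row.items():
            old_value = None if old_row is None else old_row.get(field)
            if old_value == new_value:
                continue  # unchanged fields are not logged
            cursor.execute(AUDIT_SQL, (
                str(uuid.uuid4()),  # stand-in; the project uses UUID v7
                id_external,
                field,
                old_value,
                new_value,
                change_type,
            ))
    connection.commit()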

Adaptive Scraping

  • Automatically detects when to stop (consecutive failures)
  • Configurable threshold for stopping
  • Tracks progress and statistics

Rate Limiting

  • Configurable delay between requests
  • Prevents overwhelming the target server
  • Follows best practices for ethical scraping

Future Improvements

See SCRAPER_SCRIPT_PLAN.md for a comprehensive plan to transform this into a production-grade, modular scraping platform with:

  • Multiple scraping strategies (sequential, paginated, authenticated, etc.)
  • Pluggable storage backends (MySQL, PostgreSQL, MongoDB, CSV, JSON)
  • Circuit breakers and fault tolerance
  • Self-healing capabilities
  • Comprehensive monitoring and observability
  • Plugin architecture for extensibility
  • Hot configuration reload
  • And much more...

Security Considerations

  • Never commit config.yaml to version control
  • Use strong database passwords
  • Limit database user permissions to only required operations
  • Review the target website's robots.txt and Terms of Service
  • Implement appropriate rate limiting

Best Practices Implemented

  • ✅ Configuration files for sensitive data
  • ✅ Comprehensive error handling
  • ✅ Audit trail for compliance
  • ✅ Rate limiting for ethical scraping
  • ✅ Structured logging
  • ✅ Database transactions for data integrity
  • ✅ Documentation and code comments

Lessons Learned

This project serves as a foundation for understanding:

  • Web scraping fundamentals
  • Database design and migrations
  • Audit trail implementation
  • Configuration management
  • Error handling and resilience patterns
  • Ethical scraping practices

Contributing

This is a learning project. Suggestions and improvements are welcome!

Roadmap

  • Add support for multiple data sources
  • Implement retry logic with exponential backoff
  • Add CLI arguments for configuration overrides
  • Create Docker containerization
  • Add comprehensive unit tests
  • Implement checkpoint/resume functionality
  • Add monitoring and alerting
  • Create web dashboard for statistics

License

MIT License - See LICENSE file for details

Author

John Gerwin De las Alas

Acknowledgments

This project is part of a learning journey to build better, more resilient scraping systems.


Note: Always ensure you have permission to scrape a website, and respect its robots.txt file and rate limits.
