A first step in learning web scraping, and a record of what can be improved on the way to a modular, secure scraper platform.
This project is a learning exercise in web scraping, featuring a basic scraper implementation that collects user data from an API and stores it in a MariaDB database with comprehensive audit trail tracking.
- Adaptive Scraping: Automatically detects the end of available data
- Rate Limiting: Configurable delay between requests to be respectful to servers
- Audit Trail: Tracks all data changes (INSERT/UPDATE) for compliance and debugging
- UUID v7 Support: Uses time-ordered UUIDs for better database indexing
- Error Handling: Graceful handling of network errors and missing data
- Configuration-Based: Sensitive data stored in external config files
```
scraper-alpha/
├── scrape_swc_data.py      # Main scraper script
├── config.yaml.example     # Example configuration file
├── schema.sql              # Database schema
├── SCRAPER_SCRIPT_PLAN.md  # Future architecture plan
├── requirements.txt        # Python dependencies
└── README.md               # This file
```
- Python 3.7+
- MariaDB or MySQL database
- pip (Python package manager)
- Clone the repository

  ```bash
  git clone https://github.com/iamgerwin/scraper-alpha.git
  cd scraper-alpha
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up the database

  ```bash
  mysql -u root -p < schema.sql
  ```

- Configure the scraper

  ```bash
  cp config.yaml.example config.yaml
  # Edit config.yaml with your database credentials and API endpoint
  ```
Edit config.yaml with your settings:
```yaml
database:
  host: localhost
  port: 3306
  user: your_db_user
  password: your_db_password
  database: your_database_name
  charset: utf8mb4

api:
  base_url: https://your-api-endpoint.com/api/get-user-details
  user_id_param: userId

scraping:
  start_id: 1
  max_consecutive_404: 20
  rate_limit_delay: 0.5
```

Run the scraper:

```bash
python scrape_swc_data.py
```

The scraper will:
- Connect to the configured database
- Start fetching user data from the API (beginning at `start_id`)
- Insert new records or update existing ones
- Track all changes in the audit trail
- Stop automatically after a configured number of consecutive failures (see the sketch below)
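A minimal sketch of that loop, assuming the `requests` library and treating `fetch_user` / `upsert_user` as hypothetical stand-ins for the script's real functions (the constants mirror `config.yaml`):

```python
import time
from typing import Optional

import requests

# Values mirroring config.yaml (hypothetical defaults for illustration).
BASE_URL = "https://your-api-endpoint.com/api/get-user-details"
USER_ID_PARAM = "userId"
START_ID = 1
MAX_CONSECUTIVE_404 = 20
RATE_LIMIT_DELAY = 0.5  # seconds between requests


def fetch_user(user_id: int) -> Optional[dict]:
    """Fetch one user record; return None when the API has nothing for this ID."""
    response = requests.get(BASE_URL, params={USER_ID_PARAM: user_id}, timeout=10)
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.json()


def upsert_user(record: dict) -> None:
    """Hypothetical stand-in for the insert-or-update and audit-trail logic."""
    print("would upsert:", record)


def run() -> None:
    user_id = START_ID
    consecutive_misses = 0
    while consecutive_misses < MAX_CONSECUTIVE_404:
        record = fetch_user(user_id)
        if record is None:
            consecutive_misses += 1   # another gap in the ID range
        else:
            consecutive_misses = 0    # reset: data is still flowing
            upsert_user(record)
        user_id += 1
        time.sleep(RATE_LIMIT_DELAY)  # stay polite to the server


if __name__ == "__main__":
    run()
```

The counter resets whenever a real record comes back, so only an unbroken run of misses stops the scraper.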
Stores scraped user information:
- `id` - UUID v7 primary key
- `id_external` - External user ID from the API
- `full_name` - User's full name
- `email_address` - User's email
- `number_of_entries` - Number of entries
- `accumulated_amount` - Accumulated amount
- `created_at` / `updated_at` - Timestamps
Tracks all data changes:
- `id` - UUID v7 primary key
- `id_external` - Related external user ID
- `field_name` - Field that changed
- `old_value` / `new_value` - Before and after values
- `change_type` - INSERT or UPDATE
- `created_at` - When the change occurred
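The authoritative column definitions live in `schema.sql`; purely as an illustration, the two record shapes could be modelled in Python like this (the types are assumptions, not taken from the schema):

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass
class User:
    id: str                     # UUID v7 primary key
    id_external: int            # external user ID from the API
    full_name: str
    email_address: str
    number_of_entries: int
    accumulated_amount: Decimal
    created_at: datetime
    updated_at: datetime


@dataclass
class AuditTrailEntry:
    id: str                     # UUID v7 primary key
    id_external: int            # related external user ID
    field_name: str             # which field changed
    old_value: str
    new_value: str
    change_type: str            # "INSERT" or "UPDATE"
    created_at: datetime
```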
Uses timestamp-based UUIDs for better database performance and natural ordering.
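Depending on the Python version, generating v7 UUIDs may require a third-party package or a small helper. A hand-rolled sketch following the RFC 9562 layout (48-bit millisecond timestamp, then version and variant bits, then random bits) could look like this:

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Generate a time-ordered UUID v7 per RFC 9562."""
    timestamp_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits

    value = timestamp_ms << 80        # top 48 bits: Unix timestamp in ms
    value |= rand                     # bottom 80 bits: random
    value &= ~(0xF << 76)             # clear the version nibble
    value |= 0x7 << 76                # set version = 7
    value &= ~(0x3 << 62)             # clear the variant bits
    value |= 0x2 << 62                # set variant = 10
    return uuid.UUID(int=value)


print(uuid7())  # string form sorts (approximately) by creation time
```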
Every data change is logged with:
- Which field changed
- Old and new values
- Type of change (INSERT/UPDATE)
- Timestamp of change
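One way to produce those records is to diff the previously stored row against the freshly scraped one and yield an entry per changed field; a minimal sketch with illustrative field names (not the script's actual helpers):

```python
from typing import Dict, Iterator, Optional

TRACKED_FIELDS = ("full_name", "email_address", "number_of_entries", "accumulated_amount")


def audit_changes(old: Optional[Dict], new: Dict) -> Iterator[Dict]:
    """Yield one audit-trail row per field that differs between old and new."""
    change_type = "INSERT" if old is None else "UPDATE"
    for field in TRACKED_FIELDS:
        old_value = None if old is None else old.get(field)
        new_value = new.get(field)
        if old_value != new_value:
            yield {
                "id_external": new["id_external"],
                "field_name": field,
                "old_value": old_value,
                "new_value": new_value,
                "change_type": change_type,
            }
```

Each yielded dict maps directly onto the audit-trail columns listed above.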
- Automatically detects when to stop (consecutive failures)
- Configurable threshold for stopping
- Tracks progress and statistics
- Configurable delay between requests
- Prevents overwhelming the target server
- Follows best practices for ethical scraping
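As an illustration, the configured delay can be enforced with a tiny helper that only sleeps for whatever time remains since the previous request (a sketch, not the script's actual implementation):

```python
import time


class RateLimiter:
    """Ensure at least `delay` seconds elapse between consecutive calls to wait()."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_call = time.monotonic()


limiter = RateLimiter(delay=0.5)   # matches rate_limit_delay in config.yaml
for user_id in range(1, 4):
    limiter.wait()                 # blocks just long enough to respect the delay
    print("fetching user", user_id)
```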
See SCRAPER_SCRIPT_PLAN.md for a comprehensive plan to transform this into a production-grade, modular scraping platform with:
- Multiple scraping strategies (sequential, paginated, authenticated, etc.)
- Pluggable storage backends (MySQL, PostgreSQL, MongoDB, CSV, JSON)
- Circuit breakers and fault tolerance
- Self-healing capabilities
- Comprehensive monitoring and observability
- Plugin architecture for extensibility
- Hot configuration reload
- And much more...
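To give a flavour of what "pluggable storage backends" could look like, here is an illustrative sketch (the class names are invented for this example and do not come from SCRAPER_SCRIPT_PLAN.md):

```python
import json
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Anything that can persist a scraped record can be swapped in."""

    @abstractmethod
    def save(self, record: dict) -> None:
        ...


class JsonLinesBackend(StorageBackend):
    """Example backend: append each record as one JSON line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")


def store(record: dict, backend: StorageBackend) -> None:
    backend.save(record)  # the scraper core never knows which backend it talks to
```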
- Never commit `config.yaml` to version control
- Use strong database passwords
- Limit database user permissions to only the required operations
- Review the target website's `robots.txt` and Terms of Service
- Implement appropriate rate limiting
- ✅ Configuration files for sensitive data
- ✅ Comprehensive error handling
- ✅ Audit trail for compliance
- ✅ Rate limiting for ethical scraping
- ✅ Structured logging
- ✅ Database transactions for data integrity
- ✅ Documentation and code comments
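The structured-logging item boils down to something along these lines (illustrative, not copied from the script):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("scraper")
logger.info("upserted user id_external=%s", 42)
```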
This project serves as a foundation for understanding:
- Web scraping fundamentals
- Database design and migrations
- Audit trail implementation
- Configuration management
- Error handling and resilience patterns
- Ethical scraping practices
This is a learning project. Suggestions and improvements are welcome!
- Add support for multiple data sources
- Implement retry logic with exponential backoff (see the sketch after this list)
- Add CLI arguments for configuration overrides
- Create Docker containerization
- Add comprehensive unit tests
- Implement checkpoint/resume functionality
- Add monitoring and alerting
- Create web dashboard for statistics
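For the exponential-backoff item above, one common shape is to double the wait after each failed attempt; a minimal sketch, assuming the `requests` library and made-up parameter defaults:

```python
import time

import requests


def get_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry transient failures, doubling the wait each time (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code < 500:   # 2xx and 4xx are not retried here
                return response
        except requests.RequestException:
            pass                             # network error: fall through to retry
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```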
MIT License - See LICENSE file for details
John Gerwin De las Alas
This project is part of a learning journey to build better, more resilient scraping systems.
Note: Always ensure you have permission to scrape a website and respect their robots.txt file and rate limits.