Scraper Alpha

A first step in learning web scraping and documenting what can be improved to build a modular, secure scraper platform.

Overview

This project is a learning exercise in web scraping: a basic scraper that collects user data from an API and stores it in a MariaDB database with a comprehensive audit trail.

Features

  • Adaptive Scraping: Automatically detects the end of available data
  • Rate Limiting: Configurable delay between requests to avoid overloading the target server
  • Audit Trail: Tracks all data changes (INSERT/UPDATE) for compliance and debugging
  • UUID v7 Support: Uses time-ordered UUIDs for better database indexing
  • Error Handling: Graceful handling of network errors and missing data
  • Configuration-Based: Sensitive data stored in external config files

Project Structure

scraper-alpha/
├── scrape_swc_data.py      # Main scraper script
├── config.yaml.example     # Example configuration file
├── schema.sql              # Database schema
├── SCRAPER_SCRIPT_PLAN.md  # Future architecture plan
├── requirements.txt        # Python dependencies
└── README.md               # This file

Prerequisites

  • Python 3.7+
  • MariaDB or MySQL database
  • pip (Python package manager)

Installation

  1. Clone the repository

    git clone https://github.com/iamgerwin/scraper-alpha.git
    cd scraper-alpha

  2. Install dependencies

    pip install -r requirements.txt

  3. Set up the database

    mysql -u root -p < schema.sql

  4. Configure the scraper

    cp config.yaml.example config.yaml
    # Edit config.yaml with your database credentials and API endpoint

Configuration

Edit config.yaml with your settings (a loading sketch follows the example):

database:
  host: localhost
  port: 3306
  user: your_db_user
  password: your_db_password
  database: your_database_name
  charset: utf8mb4

api:
  base_url: https://your-api-endpoint.com/api/get-user-details
  user_id_param: userId

scraping:
  start_id: 1
  max_consecutive_404: 20
  rate_limit_delay: 0.5
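
As a rough illustration, the scraper can read this file and open a database connection along the following lines. This is a minimal sketch that assumes PyYAML and the pymysql driver; the actual dependencies live in requirements.txt and may differ.

# Minimal sketch: load config.yaml and connect to MariaDB.
# Assumes PyYAML and pymysql; the real script may use other libraries.
import yaml
import pymysql

with open("config.yaml", "r", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

db = config["database"]
connection = pymysql.connect(
    host=db["host"],
    port=db["port"],
    user=db["user"],
    password=db["password"],
    database=db["database"],
    charset=db["charset"],
)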

Usage

Run the scraper:

python scrape_swc_data.py

The scraper will (see the sketch after this list):

  1. Connect to the configured database
  2. Start fetching user data from the API (starting from start_id)
  3. Insert new records or update existing ones
  4. Track all changes in the audit trail
  5. Stop automatically after encountering consecutive failures
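
Roughly, that flow corresponds to a loop like the sketch below. It is illustrative only: the handle_record callback stands in for the insert/update and audit logic (sketched in later sections), and the real scrape_swc_data.py may be organized differently.

import time

import requests


def run_scraper(config, handle_record):
    """Illustrative main loop: fetch users by id and stop after repeated 404s.

    handle_record(user_id, payload) is a callback that upserts the row and
    writes the audit trail; the real script may be organized differently.
    """
    scraping = config["scraping"]
    api = config["api"]
    user_id = scraping["start_id"]
    consecutive_404 = 0

    while consecutive_404 < scraping["max_consecutive_404"]:
        response = requests.get(
            api["base_url"],
            params={api["user_id_param"]: user_id},
            timeout=10,
        )
        if response.status_code == 404:
            consecutive_404 += 1  # likely past the last available user
        else:
            consecutive_404 = 0
            handle_record(user_id, response.json())
        user_id += 1
        time.sleep(scraping["rate_limit_delay"])  # polite delay between requests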

Database Schema

swc_data Table

Stores scraped user information (see the upsert sketch after the column list):

  • id - UUID v7 primary key
  • id_external - External user ID from the API
  • full_name - User's full name
  • email_address - User's email
  • number_of_entries - Number of entries
  • accumulated_amount - Accumulated amount
  • created_at / updated_at - Timestamps
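
For illustration, a new or changed row could be written with an INSERT ... ON DUPLICATE KEY UPDATE statement like the sketch below. The column names come from the list above; the API field names, the unique key on id_external, and the helper itself are assumptions, not the actual code in scrape_swc_data.py.

import uuid

UPSERT_SQL = """
    INSERT INTO swc_data
        (id, id_external, full_name, email_address,
         number_of_entries, accumulated_amount)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        full_name = VALUES(full_name),
        email_address = VALUES(email_address),
        number_of_entries = VALUES(number_of_entries),
        accumulated_amount = VALUES(accumulated_amount)
"""

def upsert_user(connection, external_id, record):
    """Insert a scraped user or update the existing row (illustrative only)."""
    with connection.cursor() as cursor:
        cursor.execute(UPSERT_SQL, (
            str(uuid.uuid4()),               # stand-in; the project uses UUID v7
            external_id,
            record.get("fullName"),          # API field names are assumptions
            record.get("emailAddress"),
            record.get("numberOfEntries"),
            record.get("accumulatedAmount"),
        ))
    connection.commit()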

audit_trail Table

Tracks all data changes:

  • id - UUID v7 primary key
  • id_external - Related external user ID
  • field_name - Field that changed
  • old_value / new_value - Before and after values
  • change_type - INSERT or UPDATE
  • created_at - When the change occurred

Features in Detail

UUID v7 Implementation

Uses timestamp-prefixed UUIDs so that primary keys sort in creation order, which keeps index inserts efficient and gives rows a natural chronological ordering.
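
Older Python versions do not ship a v7 generator in the standard uuid module, so one way to produce these IDs is a small helper like the sketch below (the project may instead rely on a third-party package; this only illustrates the layout).

import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of a UUID v7: 48-bit millisecond timestamp followed by random bits."""
    ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80           # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random bits below it
    value &= ~(0xF << 76)
    value |= 0x7 << 76                               # version field = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                               # RFC 4122 variant bits
    return uuid.UUID(int=value)

Because the most significant bits are a timestamp, later IDs sort after earlier ones, which keeps primary-key inserts close to append-only.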

Audit Trail

Every data change is logged with (see the sketch after this list):

  • Which field changed
  • Old and new values
  • Type of change (INSERT/UPDATE)
  • Timestamp of change
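
A minimal sketch of how such records could be written is shown below. The record_audit name, the per-field diffing, and the assumption that created_at is filled by a database default are all illustrative; the actual implementation in scrape_swc_data.py may differ.

import uuid

AUDIT_SQL = """
    INSERT INTO audit_trail
        (id, id_external, field_name, old_value, new_value, change_type)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

def record_audit(connection, id_external, old_row, new_row):
    """Write one audit_trail row per changed field (illustrative sketch).

    old_row is None for a brand-new record; created_at is assumed to be
    set by a database default.
    """
    change_type = "INSERT" if old_row is None else "UPDATE"
    with connection.cursor() as cursor:
        for field, new_value in new_row.items():
            old_value = None if old_row is None else old_row.get(field)
            if old_value == new_value:
                continue  # unchanged fields are not logged
            cursor.execute(AUDIT_SQL, (
                str(uuid.uuid4()),  # stand-in; the project uses UUID v7
                id_external,
                field,
                old_value,
                new_value,
                change_type,
            ))
    connection.commit()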

Adaptive Scraping

  • Automatically detects when to stop (consecutive failures)
  • Configurable threshold for stopping
  • Tracks progress and statistics

Rate Limiting

  • Configurable delay between requests
  • Prevents overwhelming the target server
  • Follows best practices for ethical scraping

Future Improvements

See SCRAPER_SCRIPT_PLAN.md for a comprehensive plan to transform this into a production-grade, modular scraping platform with:

  • Multiple scraping strategies (sequential, paginated, authenticated, etc.)
  • Pluggable storage backends (MySQL, PostgreSQL, MongoDB, CSV, JSON)
  • Circuit breakers and fault tolerance
  • Self-healing capabilities
  • Comprehensive monitoring and observability
  • Plugin architecture for extensibility
  • Hot configuration reload
  • And much more...

Security Considerations

  • Never commit config.yaml to version control
  • Use strong database passwords
  • Limit database user permissions to only required operations
  • Review the target website's robots.txt and Terms of Service
  • Implement appropriate rate limiting

Best Practices Implemented

  • ✅ Configuration files for sensitive data
  • ✅ Comprehensive error handling
  • ✅ Audit trail for compliance
  • ✅ Rate limiting for ethical scraping
  • ✅ Structured logging
  • ✅ Database transactions for data integrity
  • ✅ Documentation and code comments

Lessons Learned

This project serves as a foundation for understanding:

  • Web scraping fundamentals
  • Database design and migrations
  • Audit trail implementation
  • Configuration management
  • Error handling and resilience patterns
  • Ethical scraping practices

Contributing

This is a learning project. Suggestions and improvements are welcome!

Roadmap

  • Add support for multiple data sources
  • Implement retry logic with exponential backoff
  • Add CLI arguments for configuration overrides
  • Create Docker containerization
  • Add comprehensive unit tests
  • Implement checkpoint/resume functionality
  • Add monitoring and alerting
  • Create web dashboard for statistics

License

MIT License - See LICENSE file for details

Author

John Gerwin De las Alas

Acknowledgments

This project is part of a learning journey to build better, more resilient scraping systems.


Note: Always ensure you have permission to scrape a website, and respect its robots.txt file and rate limits.
