Noirscan is an extensible, ethical, and robust Python crawler designed to explore both the surface web and the dark web (via TOR). It is built for research, analysis, and data extraction in formats suitable for AI training, cybersecurity, and law enforcement use cases. Noirscan emphasizes modularity, ethical compliance, and high data quality.
Noirscan is still in the alpha stage; more functionality will be added.
- TOR & Surface Web Support: Crawl standard and .onion sites using SOCKS5H over TOR.
- Depth-Limited Recursive Crawling: Configurable crawl depth to control scope and resource use.
- User-Agent Randomization/Override: Mimic different browsers or set your own.
- Ethical Crawling: robots.txt parsing (with CLI bypass option) and clear warnings to users.
- Structured Data Extraction: Extracts page titles, cleaned main text, links, and more.
- Automatic Metadata Collection: Timestamps, IP, geolocation, and more.
- Rich Output: Saves results as structured, well-documented JSON for easy downstream analysis or ingestion.
- Colorized, User-Friendly CLI: Clear prompts, helpful summaries, and progress info.
- Modular Codebase: Easily extend scraping logic, add new extractors, or integrate with external systems.
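The TOR support listed above comes down to routing all traffic through the local SOCKS proxy. A minimal sketch with `requests` (assuming TOR's default port 9050 from the requirements below; `tor_session` and `TOR_PROXIES` are illustrative names, not Noirscan's actual API):

```python
import requests

# socks5h:// (note the trailing "h") makes the *proxy* resolve hostnames,
# which is required for .onion addresses to resolve inside the TOR network.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def tor_session() -> requests.Session:
    """Return a requests session that routes all traffic through TOR."""
    session = requests.Session()
    session.proxies.update(TOR_PROXIES)
    return session

# Usage (needs a running TOR daemon and the requests[socks] extra):
# page = tor_session().get("http://somedarkweb.onion", timeout=30)
```

The same session works for surface-web URLs, which is why one crawler can handle both.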
- Python 3.9+
- TOR running locally (default port 9050)
- All dependencies in `requirements.txt`
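Before crawling, it is worth verifying that TOR is actually listening on the expected port. A small stdlib-only check (the helper name is an assumption, not part of Noirscan):

```python
import socket

def tor_is_running(host: str = "127.0.0.1", port: int = 9050,
                   timeout: float = 3.0) -> bool:
    """Return True if something is accepting connections on TOR's SOCKS port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: TOR is not reachable.
        return False
```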
git clone https://github.com/TencoDev/Noirscan.git
cd Noirscan
pip install -r requirements.txt

# Basic crawl (surface web or dark web)
python noirscan/main.py --url "http://example.com" --depth 1
# With custom user agent and robots.txt enabled
python noirscan/main.py --url "http://somesite.com" --user-agent "MyBot/1.0" --ignore-robots False
# Crawl a .onion site with output saved to JSON
python noirscan/main.py --url "http://somedarkweb.onion" --save --depth 2
⚠️ Important: Noirscan is a research tool. You are responsible for all use.
- Always respect local laws and the terms of service of target sites.
- By default, Noirscan respects robots.txt and provides clear CLI warnings.
- For law enforcement or sensitive use, always consult legal counsel before deploying.
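The robots.txt handling described above can be done entirely with the standard library; a sketch of the idea (function name and the fail-open policy are assumptions about Noirscan's internals, not its documented behavior):

```python
import socket
from urllib import robotparser
from urllib.parse import urljoin, urlparse

socket.setdefaulttimeout(10)  # keep the robots.txt fetch from hanging

def allowed_by_robots(url: str, user_agent: str = "Noirscan") -> bool:
    """Check the target site's robots.txt; treat an unreachable file as allowed."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
    try:
        parser.read()
    except OSError:
        return True  # no robots.txt reachable: fail open
    return parser.can_fetch(user_agent, url)
```

A `--ignore-robots`-style bypass flag would simply skip this check.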
noirscan/
├── main.py # CLI entrypoint
├── crawler.py # Crawler logic (class-based, extensible)
├── utils.py # Utility functions (cleaning, printing, saving)
├── network.py # Network helpers (TOR, IP, geolocation)
├── models.py # ScrapedPage and data models
├── config.py # User agents, timeouts, config
└── output/ # (auto-created) JSON output files
requirements.txt
README.md
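The tree above names a `ScrapedPage` model in `models.py`. A dataclass is the natural shape for such a record; this sketch shows the idea, but the field names are assumptions, not Noirscan's actual schema:

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class ScrapedPage:
    """Hypothetical sketch of the per-page record Noirscan serializes to JSON."""
    url: str
    title: str = ""
    text: str = ""
    links: list = field(default_factory=list)
    crawled_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_dict(self) -> dict:
        """Flatten the record for json.dump into the output/ directory."""
        return asdict(self)
```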
- Custom Extraction: Add new parsing methods in `models.py` or utility functions in `utils.py`.
- Plugins: Integrate entity recognizers, ML/NLP pipelines, or custom exporters.
- CLI: Easily extendable; just add new flags and wire them to the `Crawler` class.
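As an example of the "custom extraction" route, a new `utils.py`-style extractor is just a function over the cleaned page text. This one (entirely hypothetical, not shipped with Noirscan) pulls e-mail addresses:

```python
import re

# Deliberately loose pattern: good enough for triage, not RFC-complete.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(text: str) -> list:
    """Hypothetical extractor: return unique e-mail addresses found in page text."""
    return sorted(set(EMAIL_RE.findall(text)))
```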
- Advanced Content Extraction: Use NLP/ML for better main text, keyword, language, and entity extraction.
- Session Management: Support for cookies, login, and authenticated crawling.
- Distributed Crawling: Multi-process or multi-machine support for large-scale scans.
- Database Export: Native support for MongoDB, Elastic, or SQL.
- UI Dashboard: Live crawl progress and data visualization.
- Unit & Integration Tests: Expand testing for reliability and CI/CD.
- Crawler Politeness: Configurable delays, concurrency, and auto-throttling.
- Multi-format Output: CSV, Parquet, and direct data-lake integration.
- Dark Web Intelligence: Plugins for dark web threat hunting, marketplace detection, etc.
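The "crawler politeness" roadmap item usually means a randomized inter-request delay, so the crawler does not hammer a host at a fixed cadence. A minimal sketch of what that could look like (nothing here exists in Noirscan yet; this is future work):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep base plus a random jitter between requests; return the delay used."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```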
Pull requests and suggestions are welcome!
Please open an issue to discuss any major changes first.
MIT License