An agent-first command-line utility and library that fetches web pages and extracts clean, structured, noise-free Markdown. Perfect for LLM prompts, RAG pipelines, and automated agent workflows.
mdget: Agent-First HTTP Client
When AI agents or LLMs browse the web, they are often overwhelmed by cookie banners, navigation menus, ads, tracker scripts, and complex layouts. mdget solves this by acting as a modern, agent-friendly replacement for curl or wget.
It doesn't just download raw HTML; it extracts the main article content (using a sophisticated scoring algorithm similar to Readability.js), converts the DOM into clean Markdown, resolves all relative paths to absolute URLs, and wraps the result in a predictable YAML metadata frontmatter envelope.
graph TD
A[URL Input] --> B[HTTP Fetcher]
B -->|Charset & Redirect Tracking| C[HTML Parser]
C --> D[Article Extractor: Readability Scoring]
D --> E[GFM Converter: Markdown Rendering]
E --> F[Envelope Packager: YAML Metadata]
F --> G[stdout / Output File]
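The last pipeline stage wraps the Markdown in a YAML frontmatter envelope. As a rough illustration of how a consumer might separate the two parts, the sketch below assumes the envelope opens and closes with a line containing only `---`, as in the output examples in this README. `split_envelope` and `scalar` are hypothetical helper names, not part of mdget's library API.

```rust
/// Splits mdget-style output into (frontmatter, body).
/// Returns None when the text does not open with a `---` line
/// or the closing `---` line is missing.
fn split_envelope(output: &str) -> Option<(String, String)> {
    let mut lines = output.lines();
    // The envelope must open with a line containing only `---`.
    if lines.next()?.trim_end() != "---" {
        return None;
    }
    let mut frontmatter = Vec::new();
    let mut closed = false;
    for line in lines.by_ref() {
        if line.trim_end() == "---" {
            closed = true;
            break;
        }
        frontmatter.push(line);
    }
    if !closed {
        return None;
    }
    // Everything after the closing marker is the Markdown body.
    let body: Vec<&str> = lines.collect();
    Some((frontmatter.join("\n"), body.join("\n")))
}

/// Looks up a top-level `key: value` scalar in the frontmatter text.
/// Hypothetical helper for illustration; a real consumer should use a YAML parser.
fn scalar<'a>(frontmatter: &'a str, key: &str) -> Option<&'a str> {
    frontmatter.lines().find_map(|line| {
        let (k, v) = line.split_once(':')?;
        if k == key { Some(v.trim()) } else { None }
    })
}
```

Because the split stops at the first closing marker, a `---` horizontal rule later in the Markdown body is left untouched.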
- Agent-Safe Fetching: Automatically detects charsets from HTTP headers or HTML meta tags, handles decompression, and respects client-side timeouts.
- Noise Reduction: Intelligently scores HTML nodes to remove headers, footers, sidebars, advertisements, and navigation links.
- Token Reduction & Optimization: Shrink content sizes for LLMs dynamically via compact mode (summarizing paragraphs to the first sentence, stripping code blocks) or hard word-count limits.
- Absolute URL Resolution: Rewrites all relative links (<a href="...">) and images (<img src="...">) into absolute URLs using the document base URL, ensuring agents can follow links or fetch assets.
- Multiple Output Formats: Choose between the default YAML frontmatter + markdown envelope, markdown-only (--no-frontmatter), or a single structured JSON object (--json) for direct API ingestion.
- Request Customization: Send custom headers (-H/--header), cookies (--cookie), or bearer tokens (--bearer) to authenticate or match browser behaviors.
- Rich Content Support: Automatically parses, converts, and formats HTML, JSON, XML/RSS/Atom feeds, plain text, and PDF files into structured Markdown.
- Table to GFM Conversion: Converts standard HTML tables into clean GitHub Flavored Markdown (GFM) tables.
- YAML Frontmatter: Wraps success and failure responses in a consistent YAML envelope to give agents structured access to response headers, redirect chains, and page metadata.
- Layered Configuration: Integrates defaults, TOML files (config.toml), environment variables, and CLI overrides seamlessly.
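To make the table conversion feature more concrete, here is a minimal sketch of rendering already-extracted cell text as a GFM table. It is not mdget's converter: the function names are hypothetical, and the choices to escape pipes and pad short rows are assumptions about what a reasonable converter would do.

```rust
/// Escapes characters that would break a GFM table cell.
fn escape_cell(cell: &str) -> String {
    cell.replace('|', "\\|").replace('\n', " ")
}

/// Joins cells into one pipe-delimited table row.
fn render_row(cells: &[String]) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Renders a header row and data rows as a GFM table.
/// Rows are padded or truncated to the header width.
fn to_gfm_table(header: &[&str], rows: &[Vec<&str>]) -> String {
    let head: Vec<String> = header.iter().map(|&c| escape_cell(c)).collect();
    // The delimiter row is what marks the block as a table in GFM.
    let rule: Vec<String> = header.iter().map(|_| "---".to_string()).collect();
    let mut out = vec![render_row(&head), render_row(&rule)];
    for row in rows {
        let cells: Vec<String> = (0..header.len())
            .map(|i| escape_cell(row.get(i).copied().unwrap_or("")))
            .collect();
        out.push(render_row(&cells));
    }
    out.join("\n")
}
```

Calling `to_gfm_table(&["Name", "Value"], &[vec!["a|b", "1"]])` produces a three-line table whose data row reads `| a\|b | 1 |`.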
Ensure you have Rust and Cargo installed (edition 2024, Rust 1.90+ recommended).
git clone https://github.com/hiranp/mdget.git
cd mdget
cargo install --path .
Fetch any web page and print the parsed Markdown with its YAML frontmatter to stdout:
mdget fetch https://example.com
Use the -o or --output flag to save the output directly to a file:
mdget fetch https://example.com -o article.md
You can override defaults with CLI arguments:
mdget fetch https://example.com \
--timeout 15 \
--max-redirects 3 \
--user-agent "MyAgent/1.0"
Customize outgoing HTTP requests using headers, cookies, and bearer tokens:
# Add custom headers
mdget fetch https://example.com -H "X-Custom-Header: value" -H "Accept: text/plain"
# Add custom cookies
mdget fetch https://example.com --cookie "session=xyz123" --cookie "theme=dark"
# Authenticate with a Bearer Token (overrides any manual Authorization header)
mdget fetch https://example.com --bearer "your-jwt-token"
Control the format of the output returned by mdget. By default, mdget prepends a YAML metadata frontmatter. You can alter this behavior with the following mutually exclusive flags:
- --no-frontmatter: Discard the YAML envelope and output raw Markdown body only.
- --json: Output a single formatted JSON object containing all metadata envelope keys and the markdown body under the markdown key.
Example:
# Get raw markdown body only
mdget fetch https://example.com --no-frontmatter
# Get output wrapped in a JSON envelope
mdget fetch https://example.com --json
Reduce the size of the retrieved page content for LLM contexts, RAG pipelines, or agents using the following parameters:
- --compact: Strips out code blocks and summarizes each paragraph down to its first sentence/excerpt.
- --max-body-words <WORDS>: Caps the output body word count to a maximum value, truncating any trailing content.
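The word cap can be pictured with a short sketch. This is an illustration only, assuming words are whitespace-separated and that the cut happens on a word boundary; how mdget actually counts words is not specified here, and `cap_words` is a hypothetical name. The returned flag corresponds loosely to the body_truncated field in the envelope.

```rust
/// Caps `body` at `max_words` whitespace-separated words.
/// Returns the (possibly shortened) text and whether anything was cut.
fn cap_words(body: &str, max_words: usize) -> (String, bool) {
    let mut count = 0;
    let mut in_word = false;
    for (idx, ch) in body.char_indices() {
        if ch.is_whitespace() {
            in_word = false;
        } else if !in_word {
            // A new word starts at byte offset `idx`.
            if count == max_words {
                // Cut before this word and drop trailing whitespace.
                return (body[..idx].trim_end().to_string(), true);
            }
            count += 1;
            in_word = true;
        }
    }
    // The body already fits within the limit.
    (body.to_string(), false)
}
```

For instance, `cap_words("one two three four", 2)` yields `("one two", true)` in this sketch, while a body at or under the limit is returned unchanged with `false`.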
Example:
mdget fetch https://example.com --compact --max-body-words 100
Generate shell completions dynamically:
# For zsh
mdget completion zsh > ~/.zsh/completions/_mdget
# For bash
mdget completion bash > ~/.bash_completion.d/mdget
mdget uses a layered configuration system, loading settings in the following order (lowest to highest priority):
- Default Values (e.g., info logging, default agent-safe HTTP settings).
- System-wide configuration file: /etc/mdget/config.toml
- User configuration file: ~/.config/mdget/config.toml (or platform equivalent, e.g. ~/Library/Application Support/mdget/config.toml on macOS).
- Environment variables: Prefixed with MDGET_ (e.g., MDGET_LOG_LEVEL=debug).
- Command-line arguments (e.g., --log-level debug).
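The precedence order above can be sketched as a lookup that walks the layers from highest to lowest priority. This is a simplified model, not mdget's configuration loader: each layer is assumed to be a flat map of dotted keys, `resolve` and `env_name` are hypothetical names, and the dot-to-underscore mapping for environment variables is an assumption inferred from the MDGET_LOG_LEVEL example.

```rust
use std::collections::HashMap;

/// Resolves one setting across layers given in lowest-to-highest priority order.
/// The highest-priority layer that defines the key wins.
fn resolve<'a>(key: &str, layers: &[&'a HashMap<String, String>]) -> Option<&'a str> {
    layers
        .iter()
        .rev() // check the highest-priority layer first
        .find_map(|&layer| layer.get(key).map(String::as_str))
}

/// Maps a dotted key such as `log.level` to an environment variable name.
/// The `MDGET_` prefix comes from the README; the rest of the mapping is assumed.
fn env_name(key: &str) -> String {
    format!("MDGET_{}", key.replace('.', "_").to_uppercase())
}
```

With layers passed as defaults, system file, user file, environment, and CLI, a `log.level` set on the command line shadows the same key from every lower layer, and a key missing everywhere resolves to `None`.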
[log]
level = "info"
[log.file]
enabled = false
path = "~/.cache/mdget/mdget.log"
level = "info"
mdget returns a standard envelope structure delimited by standard YAML markers (---).
---
success: true
url: https://example.com/
status: 200
title: Example Domain
description: This is a description of the example domain.
canonical_url: https://example.com/canonical
word_count: 19
body_word_count: 19
render_mode: full
body_word_limit: null
body_truncated: false
fetched_at: '2026-05-22T00:34:41.071405Z'
redirect_chain:
- https://example.com/
---
# Example Domain
This domain is for use in documentation examples without needing permission. Avoid use in operations.
[Learn more](https://iana.org/domains/example)
If a request fails (e.g., 404, network timeout, DNS failure), mdget outputs an error envelope with success: false and exits gracefully with a structured error log. For HTTP errors, status carries the response status code, as in the 404 example below.
---
success: false
url: https://httpbin.org/status/404
status: 404
error: http_404
message: 'HTTP 404: Not Found'
fetched_at: '2026-05-22T00:34:50.143208Z'
---
When the --json flag is provided, the output is formatted as a single JSON object:
{
"success": true,
"url": "https://example.com/",
"status": 200,
"title": "Example Domain",
"description": "This is a description of the example domain.",
"canonical_url": "https://example.com/canonical",
"word_count": 19,
"body_word_count": 19,
"render_mode": "full",
"body_word_limit": null,
"body_truncated": false,
"fetched_at": "2026-05-22T00:34:41.071405Z",
"redirect_chain": [
"https://example.com/"
],
"markdown": "# Example Domain\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)"
}
- Phase 1: HTTP Core & Extraction (Completed)
- Phase 2: Content Types & Output Modes (Completed)
- Phase 3: Caching & Resources (Planned)
- Phase 4: Authentication & Filtering (Planned)
git clone https://github.com/hiranp/mdget.git
cd mdget
Verify code changes and run integration tests:
cargo test
Keep the code clean and idiomatic:
# Format code
cargo fmt --all
# Run linter
cargo clippy --all-targets --all-features -- -D warnings
Contributions are highly appreciated! Please see CONTRIBUTING.md for environment setup instructions, style guides, and PR workflows.
By participating in this project, you agree to abide by the Contributor Covenant code of conduct (see CODE_OF_CONDUCT.md).
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments
- Built with ❤️ using the amazing Rust ecosystem
- Inspired by modern CLI best practices
- Thanks to all the crate maintainers for their excellent work