Skip to content

JNK234/Quibo

Repository files navigation

Agentic Blogging Assistant

📌 Overview

The Agentic Blogging Assistant is a sophisticated AI-powered content transformation platform that converts technical materials (Jupyter notebooks, Markdown files, Python scripts) into professionally structured blog posts. Built on a hierarchical agent-based architecture, it leverages advanced semantic processing, iterative refinement, and multi-provider LLM integration to produce high-quality technical content.

🎯 Key Features

Core Functionality

  • Multi-Format Content Processing: Supports .ipynb, .md, and .py file uploads with intelligent parsing
  • Hierarchical Agent Architecture: Specialized agents for content parsing, outline generation, and blog drafting
  • Advanced Semantic Search: ChromaDB vector storage with HyDE (Hypothetical Document Embeddings) for enhanced retrieval
  • Multi-LLM Provider Support: OpenAI, Claude, Gemini, Deepseek, and OpenRouter integration with factory pattern
  • Project Management System: Persistent project tracking with milestone-based workflow management
  • Iterative Quality Refinement: Multi-criteria scoring with feedback loops (completeness, accuracy, clarity, engagement)

Technical Capabilities

  • LangGraph Workflow Orchestration: Complex logic implemented as directed graphs with conditional routing
  • Section-Level Caching: Granular performance optimization with hash-based storage
  • Type-Safe State Management: Pydantic models throughout the system for validation and consistency
  • Content-Aware Chunking: Syntax-aware text splitting for different content types
  • Social Media Content Generation: Platform-specific promotional content with structured analysis

User Experience

  • Interactive Streamlit Interface: Tabbed workflow with real-time progress tracking
  • Version Control & Feedback: Section-level approval, editing, and regeneration capabilities
  • SEO Optimization: Intelligent title and summary generation
  • Export Capabilities: Multiple format support with professional formatting

🏗️ Technical Architecture

System Overview

The system employs a sophisticated multi-tier architecture designed for scalability, maintainability, and performance:

┌─────────────────────────────────────────┐
│           Frontend (Streamlit)          │
│   ┌─────────────────────────────────┐   │
│   │    Project Management UI        │   │
│   │    Workflow Components         │   │
│   │    State Management            │   │
│   └─────────────────────────────────┘   │
└─────────────────┬───────────────────────┘
                  │ REST API
┌─────────────────┴───────────────────────┐
│           Backend (FastAPI)             │
│   ┌─────────────────────────────────┐   │
│   │      Agent Orchestration        │   │
│   │   ┌─────────────────────────┐   │   │
│   │   │  ContentParsingAgent    │   │   │
│   │   │  OutlineGeneratorAgent  │   │   │
│   │   │  BlogDraftGenerator     │   │   │
│   │   │  BlogRefinementAgent    │   │   │
│   │   │  SocialMediaAgent       │   │   │
│   │   └─────────────────────────┘   │   │
│   └─────────────────────────────────┘   │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────┴───────────────────────┐
│        Infrastructure Layer             │
│   ┌─────────────┐  ┌─────────────────┐  │
│   │   ChromaDB  │  │  Multi-LLM      │  │
│   │   Vector    │  │  Provider       │  │
│   │   Storage   │  │  Abstraction    │  │
│   └─────────────┘  └─────────────────┘  │
└─────────────────────────────────────────┘

Agent Architecture Deep Dive

1. ContentParsingAgent

Purpose: Processes and semantically indexes technical content files

LangGraph Workflow:

validate_file → parse_content → chunk_content → prepare_metadata → store_content

Key Technical Decisions:

  • Content Deduplication: SHA256 hashing prevents redundant processing
  • Semantic Chunking: 1000-character chunks with 200-character overlap for optimal retrieval
  • Parser Factory Pattern: Extensible architecture supporting multiple file formats
  • Rich Metadata: Project scoping with timestamps and content type classification

2. OutlineGeneratorAgent

Purpose: Analyzes content and generates structured blog outlines with difficulty assessment

LangGraph Workflow:

analyze_content → assess_difficulty → identify_prerequisites → structure_outline → generate_final

Key Technical Decisions:

  • Multi-stage Analysis: Separates topic extraction from complexity assessment
  • Difficulty Classification: Automatic Beginner/Intermediate/Advanced categorization
  • Cache Optimization: Hash-based outline storage for performance
  • User Guidelines Integration: Flexible constraint system for customization

3. BlogDraftGeneratorAgent

Purpose: Generates complete blog drafts with iterative quality refinement

LangGraph Workflow:

semantic_content_mapper → section_generator → content_enhancer → code_example_extractor → quality_validator
                                                                                              ↓
section_finalizer ← [auto_feedback_generator → feedback_incorporator → quality_validator (loop)]

Key Technical Decisions:

  • HyDE RAG Implementation: Uses generate_hypothetical_document and retrieve_context_with_hyde for enhanced semantic matching
  • Quality-Driven Iteration: Multi-dimensional scoring with feedback loops
  • Section-by-Section Processing: Individual section generation with validation
  • Code Example Extraction: Dedicated node for extracting and formatting code blocks

Vector Storage & Semantic Intelligence

ChromaDB Implementation

  • Persistent Storage: SQLite backend in root/data/vector_store/
  • Embedding Functions: Uses EmbeddingFactory to support multiple providers
  • Single Collection Design: "content" collection with metadata filtering
  • Content Hash Deduplication: SHA256 hashing prevents duplicate storage

Retrieval Techniques

  • Content Chunking: Processes text into manageable chunks for embedding
  • Metadata Filtering: Project scoping and content type organization
  • Hash-Based Validation: Prevents redundant processing of identical content
  • Vector Similarity Search: Semantic matching for relevant content retrieval

Performance Optimization

Caching Implementation

  • Content Hash Caching: Prevents reprocessing of duplicate files
  • Outline Caching: TTL-based storage using FastAPI's TTLCache
  • Section Caching: Generated sections cached based on outline hashes
  • Job State Caching: Session management with in-memory state storage

Agent Coordination

  • Dependency Injection: Agents initialized with shared vector store and model instances
  • Async Operations: All agents support async processing
  • Resource Sharing: VectorStoreService and PersonaService shared across agents

Quality Assurance & Testing

Testing Infrastructure

  • Pytest Framework: Test suite located in root/backend/tests/
  • API Testing: Endpoint validation in tests/api/
  • Service Layer Testing: Unit tests in tests/services/
  • Test Configuration: pytest.ini with test discovery settings
  • Test Requirements: Separate test-requirements.txt for testing dependencies

Quality Features

  • Quality Validator Node: Automated section quality assessment
  • Feedback Generation: Auto-feedback generator for content improvement
  • Iterative Refinement: Feedback incorporation loops in blog draft generation
  • Content Enhancement: Dedicated enhancer node for improving section quality

🚀 Setup & Deployment

Prerequisites

  • Python 3.8+
  • Node.js (for frontend dependencies)
  • At least one LLM provider API key (OpenAI, Claude, Gemini, etc.)

Environment Configuration

Create a .env file in the root/ directory:

# LLM Provider Configuration (configure at least one)
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_claude_key_here
GOOGLE_API_KEY=your_gemini_key_here
DEEPSEEK_API_KEY=your_deepseek_key_here
OPENROUTER_API_KEY=your_openrouter_key_here

# Vector Storage Configuration
CHROMA_DB_PATH=./data/vector_store

# Embedding Provider (optional - defaults to sentence-transformers)
AZURE_OPENAI_ENDPOINT=your_azure_endpoint
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_VERSION=2023-05-15

Installation & Launch

Quick Start (Parallel Launch)

# Make launch script executable
chmod +x launch-parallel.sh

# Launch both backend and frontend
./launch-parallel.sh

Manual Setup

# Backend Setup
cd root
pip install -r requirements.txt
cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Frontend Setup (separate terminal)
cd root/frontend
pip install -r requirements.txt
streamlit run new_app_api.py

Testing

# Run comprehensive test suite
cd root/backend
pytest -v

# Run specific test categories
pytest tests/api/ -v          # API endpoint tests
pytest tests/services/ -v     # Service layer tests
pytest tests/utils/ -v        # Utility function tests

Development Features

Launch Scripts

  • Parallel Launch: launch-parallel.sh runs both services in the same terminal
  • Individual Launch: launch.sh for separate terminal deployment
  • Process Management: Built-in cleanup and signal handling

Storage Structure

  • Vector Storage: root/data/vector_store/ for ChromaDB persistence
  • Project Data: root/data/projects/ for project management storage
  • Upload Directory: root/data/uploads/ for file processing

📊 Key Technical Decisions & Rationale

Why LangGraph?

  • Complex Workflow Management: Blog generation requires conditional routing, iterative refinement, and state management
  • Debuggability: Graph visualization and step-by-step execution tracking
  • Extensibility: Easy addition of new nodes and workflows without architectural changes

Why ChromaDB?

  • Lightweight Deployment: SQLite backend for simple deployment scenarios
  • Embedding Flexibility: Support for multiple embedding providers
  • Developer Experience: Easy local development and debugging

Why FastAPI + Streamlit?

  • Separation of Concerns: Clean API layer decoupled from UI implementation
  • Development Velocity: Rapid prototyping and iteration capabilities
  • Production Ready: Async support, automatic API documentation, and robust error handling

Quality-First Design

The system emphasizes content quality through validation nodes and feedback loops rather than speed-only generation. This reflects the goal of producing high-quality technical content.

📁 File Structure

Core Components

root/
├── backend/                 # FastAPI application
│   ├── agents/             # Specialized agents with LangGraph workflows
│   │   ├── content_parsing/        # File processing agent
│   │   ├── outline_generator/      # Blog outline generation
│   │   ├── blog_draft_generator/   # Section-by-section drafting
│   │   └── blog_refinement/        # Content refinement and titles
│   ├── models/             # LLM providers and embeddings
│   ├── services/           # Vector store, personas, project management
│   ├── parsers/            # File format parsers (.ipynb, .md, .py)
│   └── tests/              # Comprehensive test suite
├── frontend/               # Streamlit interface
│   ├── components/         # UI components and project management
│   └── services/           # Frontend service layer
└── data/                   # Storage directories
    ├── vector_store/       # ChromaDB persistence
    ├── projects/           # Project management data
    └── uploads/            # Uploaded files

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages