oni-swr/picture-to-json

Picture to JSON - Document Conversion System

A Java Spring Boot application that converts pictures of documents (specifically signup forms) into structured JSON. It supports batch processing, manual correction of OCR results, and configurable field mapping.

Features

  • OCR Processing: Advanced text extraction using Tess4J (Tesseract)
  • Image Preprocessing: Automatic image enhancement using OpenCV
  • PDF Support: Direct PDF text extraction and rendering
  • Batch Processing: Process multiple documents simultaneously
  • Manual Corrections: Web interface for correcting OCR results
  • Field Mapping: Configurable mapping between form fields and JSON keys
  • Progress Tracking: Real-time processing status updates
  • REST API: Complete RESTful API with Swagger documentation
  • Docker Support: Containerized deployment with PostgreSQL
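
To illustrate the field mapping idea, here is a minimal sketch of how recognized form labels could be translated into JSON keys. It is not taken from this project's source: the class name `FieldMappingSketch`, the labels, and the keys are all hypothetical, and the real mapping is configurable rather than hard-coded.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class FieldMappingSketch {
    // Hypothetical mapping from printed form labels to JSON keys.
    static final Map<String, String> LABEL_TO_KEY = Map.of(
            "first name", "firstName",
            "last name", "lastName");

    /** Keeps only the OCR fields whose label has a mapping and renames them to JSON keys. */
    static Map<String, String> toJsonFields(Map<String, String> ocrFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : ocrFields.entrySet()) {
            // Normalize the label so that " First Name " and "first name" match.
            String label = entry.getKey().trim().toLowerCase(Locale.ROOT);
            String key = LABEL_TO_KEY.get(label);
            if (key != null) {
                result.put(key, entry.getValue().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> ocrFields = new LinkedHashMap<>();
        ocrFields.put("First Name", "John");
        ocrFields.put("Last Name", "Doe");
        System.out.println(toJsonFields(ocrFields));
    }
}
```

Labels without a mapping are dropped in this sketch; an application could just as well keep them under their original name for later manual correction.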

Tech Stack

Backend

  • Java 17 - Language and runtime (LTS release)
  • Spring Boot 3.2 - Main application framework
  • Spring Data JPA - Database operations
  • Spring Security - Authentication and authorization
  • Maven - Dependency management and build

OCR & Image Processing

  • Tess4J - Java wrapper for Tesseract OCR
  • OpenCV Java - Advanced image preprocessing
  • Apache PDFBox - PDF handling and rendering

Database

  • PostgreSQL - Production database with JSON support
  • H2 - Development and testing database

Tools

  • Docker - Containerization
  • Swagger/OpenAPI 3 - API documentation
  • JUnit 5 - Testing framework

Quick Start

Prerequisites

  • Java 17+
  • Maven 3.6+
  • Docker (optional)

Running Locally

  1. Clone the repository
git clone <repository-url>
cd picture-to-json
  2. Build the application
mvn clean package
  3. Run the application
mvn spring-boot:run

The application will start on http://localhost:8080/api

Using Docker

  1. Build and run with Docker Compose
docker-compose up --build

This will start both the application and PostgreSQL database.

API Endpoints

Document Processing

  • POST /api/documents/upload - Upload a single document
  • POST /api/documents/batch/upload - Upload multiple documents
  • POST /api/documents/{id}/process - Start processing a document
  • POST /api/documents/batch/process - Process multiple documents

Document Management

  • GET /api/documents - Get all documents (paginated)
  • GET /api/documents/{id} - Get document by ID
  • GET /api/documents/status/{status} - Get documents by status
  • PUT /api/documents/{id}/correct - Apply manual corrections

Status Values

  • PENDING - Document uploaded, waiting for processing
  • PROCESSING - Currently being processed
  • COMPLETED - Processing completed successfully
  • FAILED - Processing failed
  • CORRECTED - Manual corrections applied
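
A client that consumes these values may want to mirror them in its own code. The following is a sketch under stated assumptions, not the project's entity: the names `StatusSketch`, `parse`, and `isFinished` are hypothetical, and treating COMPLETED, FAILED, and CORRECTED as "no longer being processed" is an interpretation of the descriptions above.

```java
import java.util.Locale;

class StatusSketch {
    /** Hypothetical client-side mirror of the status values listed above. */
    enum DocumentStatus {
        PENDING, PROCESSING, COMPLETED, FAILED, CORRECTED;

        /** Parses a status name leniently, ignoring case and surrounding whitespace. */
        static DocumentStatus parse(String text) {
            return valueOf(text.trim().toUpperCase(Locale.ROOT));
        }

        /** Assumption: a document in any of these states is not waiting for or undergoing OCR. */
        boolean isFinished() {
            return this == COMPLETED || this == FAILED || this == CORRECTED;
        }
    }

    public static void main(String[] args) {
        for (DocumentStatus status : DocumentStatus.values()) {
            System.out.println(status + " finished=" + status.isFinished());
        }
    }
}
```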

API Documentation

Once the application is running, visit:

  • Swagger UI: http://localhost:8080/api/swagger-ui.html
  • OpenAPI JSON: http://localhost:8080/api/api-docs

Usage Examples

Upload and Process a Document

# Upload a document
curl -X POST "http://localhost:8080/api/documents/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@signup_form.jpg"

# Response: {"id": 1, "filename": "...", "status": "PENDING", ...}

# Start processing
curl -X POST "http://localhost:8080/api/documents/1/process"

# Check status
curl "http://localhost:8080/api/documents/1"
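
The status check can also be scripted from Java with the JDK's built-in `java.net.http` client. This is a minimal sketch, not code from this repository. It assumes the response contains a `"status"` field as in the sample response above and that no authentication header is needed, as in the curl examples. The class name `DocumentStatusPoller`, the 2-second interval, and the `UNKNOWN` and `TIMED_OUT` markers are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocumentStatusPoller {
    // Assumption: the document JSON carries a top-level "status" string field.
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Builds GET {baseUrl}/documents/{id}. */
    static HttpRequest statusRequest(String baseUrl, long id) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/documents/" + id))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Pulls the status value out of a response body without a JSON library. */
    static Optional<String> extractStatus(String json) {
        Matcher matcher = STATUS.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Polls until the status is neither PENDING nor PROCESSING, or attempts run out. */
    static String pollUntilFinished(HttpClient client, String baseUrl, long id, int maxAttempts)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            HttpResponse<String> response =
                    client.send(statusRequest(baseUrl, id), HttpResponse.BodyHandlers.ofString());
            // UNKNOWN is a local marker for "no status field found", not an API value.
            String status = extractStatus(response.body()).orElse("UNKNOWN");
            if (!status.equals("PENDING") && !status.equals("PROCESSING")) {
                return status;
            }
            Thread.sleep(2000); // hypothetical polling interval
        }
        return "TIMED_OUT"; // local marker, not an API value
    }

    public static void main(String[] args) {
        // No network call here: only show the request and parse an illustrative body.
        HttpRequest request = statusRequest("http://localhost:8080/api", 1);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(extractStatus("{\"id\": 1, \"status\": \"PENDING\"}").orElse("UNKNOWN"));
    }
}
```

With the application running, `pollUntilFinished(HttpClient.newHttpClient(), "http://localhost:8080/api", 1, 30)` would block until document 1 leaves the PENDING and PROCESSING states or 30 attempts have passed.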

Apply Manual Corrections

curl -X PUT "http://localhost:8080/api/documents/1/correct" \
  -H "Content-Type: application/json" \
  -d '{"correctedJson": "{\"firstName\":\"John\",\"lastName\":\"Doe\",\"email\":\"[email protected]\"}"}'
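
Note that `correctedJson` is a JSON document passed as a string, so its inner quotes must be escaped in the request body. The sketch below shows one way to build that body in Java without a JSON library. The class and method names are hypothetical, and in a real client a library such as Jackson would usually handle the escaping.

```java
class CorrectionPayload {
    /** Wraps raw text in double quotes and escapes it as a JSON string literal. */
    static String quote(String raw) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        // Remaining control characters need a \\uXXXX escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.append('"').toString();
    }

    /** Builds the PUT body: an object whose correctedJson field holds the document as a string. */
    static String correctionBody(String correctedJson) {
        return "{\"correctedJson\": " + quote(correctedJson) + "}";
    }

    public static void main(String[] args) {
        System.out.println(correctionBody("{\"firstName\":\"John\",\"lastName\":\"Doe\"}"));
    }
}
```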

Batch Processing

# Upload multiple files
curl -X POST "http://localhost:8080/api/documents/batch/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "[email protected]" \
  -F "[email protected]" \
  -F "[email protected]"

# Process batch
curl -X POST "http://localhost:8080/api/documents/batch/process" \
  -H "Content-Type: application/json" \
  -d '[1, 2, 3]'
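
The batch process call takes a plain JSON array of document IDs as its body. As a rough sketch of the same request from Java, again using `java.net.http` and assuming no authentication header is needed, the body can be generated from a list of IDs. `BatchProcessRequest` and its methods are hypothetical names, not part of this project.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

class BatchProcessRequest {
    /** Renders the IDs as a JSON array, for example [1, 2, 3]. */
    static String idsToJson(List<Long> ids) {
        return ids.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", ", "[", "]"));
    }

    /** Builds POST {baseUrl}/documents/batch/process with the ID array as JSON body. */
    static HttpRequest build(String baseUrl, List<Long> ids) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/documents/batch/process"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(idsToJson(ids)))
                .build();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; sending it requires a running server.
        HttpRequest request = build("http://localhost:8080/api", List.of(1L, 2L, 3L));
        System.out.println(request.method() + " " + request.uri());
        System.out.println(idsToJson(List.of(1L, 2L, 3L)));
    }
}
```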

Supported File Formats

  • Images: PNG, JPG, JPEG
  • Documents: PDF

Configuration

Application Properties

Key configuration options in application.yml:

app:
  upload:
    directory: /tmp/picture-to-json/uploads  # File storage location
  ocr:
    tesseract:
      data-path: /tmp/tessdata  # Tesseract data files
      language: eng             # OCR language

spring:
  servlet:
    multipart:
      max-file-size: 50MB      # Maximum file size
      max-request-size: 100MB  # Maximum request size

Development

Running Tests

mvn test

Building for Production

mvn clean package -Pproduction

Environment Profiles

  • default - Development with H2 database
  • test - Testing configuration
  • production - Production with PostgreSQL

Docker Deployment

Development

docker-compose up

Production

docker-compose -f docker-compose.yml -f docker-compose.prod.yml up

Monitoring

Health Check

  • GET /api/actuator/health - Application health status

Database Console (Development)

  • H2 Console: http://localhost:8080/api/h2-console
    • JDBC URL: jdbc:h2:mem:testdb
    • Username: sa
    • Password: (empty)

Architecture

├── controller/     # REST API endpoints
├── service/        # Business logic
├── repository/     # Data access layer
├── entity/         # JPA entities
├── dto/            # Data transfer objects
├── config/         # Spring configuration
├── exception/      # Error handling
└── util/           # Utility classes

Performance

  • Throughput: 100 documents processed in under 8 minutes (under 4.8 seconds per document on average)
  • Memory: < 2GB heap size for batch processing
  • Startup: < 30 seconds application startup
  • API Response: < 500ms for standard operations

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the MIT License.
