OrangePomeranian/Dingo_Dogs


Genomic-Analysis-of-Founder-Effects-and-Bottleneck-Events-in-Dingo-Dogs

Production-grade implementation of an end-to-end Whole Genome Sequencing (WGS) variant calling pipeline, coupled with a full-stack analytical dashboard for inspection, validation, and interpretability of results.

The backend pipeline orchestrates all stages of genomic data processing, from raw FASTQ acquisition to final annotated variants, using modular, reproducible Bash workflows coordinated by a master execution script. It covers data download and QC, read trimming, reference genome preparation, alignment, and dual variant calling with both GATK HaplotypeCaller (GVCF workflow) and BCFtools mpileup/call, followed by merging, filtering, annotation with SnpEff, and cross-tool comparison.
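The coordination role of the master script can be illustrated with a minimal sketch. The repository itself uses Bash; the Python below is only an illustration of the ordered, fail-fast execution described above. The `Stage` type, the `bash_stage` helper, and every stage name are hypothetical and not taken from the repository.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                # label shown in the execution trace
    run: Callable[[], None]  # performs the stage; raises on failure


def bash_stage(name: str, script: str) -> Stage:
    """Wrap a Bash script as a stage (script path is an assumption)."""
    def _run() -> None:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(["bash", script], check=True)
    return Stage(name, _run)


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed.
    """
    completed: List[str] = []
    for stage in stages:
        try:
            stage.run()
        except Exception as exc:
            raise RuntimeError(
                f"stage '{stage.name}' failed after {completed}"
            ) from exc
        completed.append(stage.name)
    return completed


# Hypothetical stage names following the order described above.
STAGE_NAMES = [
    "download_and_qc", "trim_reads", "prepare_reference", "align",
    "call_gatk", "call_bcftools", "merge", "filter",
    "annotate_snpeff", "compare_callers",
]
```

In a real run each entry would be built with something like `bash_stage("align", "scripts/align.sh")`, where the script path is again a placeholder.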

The pipeline is designed with explicit intermediate outputs, validation checkpoints, and structured result directories, ensuring reproducibility, traceability, and ease of debugging across all stages (QC, alignment, variant calling, filtering, annotation, and post-processing).
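One simple form a validation checkpoint could take is a check that every expected intermediate output exists and is non-empty before the next stage starts. The sketch below assumes that interpretation; the function name and the example paths are hypothetical, and the actual checks in the repository may differ.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(paths: Iterable[str]) -> List[str]:
    """Return the paths that are absent or empty.

    An empty list means the checkpoint passed.
    """
    problems = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(p))
    return problems


def checkpoint(stage_name: str, paths: Iterable[str]) -> None:
    """Raise if any expected output of a stage is missing or empty."""
    problems = missing_outputs(paths)
    if problems:
        raise FileNotFoundError(
            f"checkpoint after '{stage_name}' failed: {problems}"
        )
```

A call such as `checkpoint("align", ["results/alignment/sample.bam"])` (placeholder path) would then stop the run early instead of letting a later tool fail on a truncated file.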

On top of this, the project delivers a production-grade web dashboard that acts as a visualization and validation layer over the pipeline. It exposes pipeline state, execution trace, parameters, and outputs in a structured and interactive format, enabling both technical and biological inspection.

Core capabilities:

  • End-to-end automated WGS pipeline with modular Bash scripts and deterministic execution flow
  • Dual variant calling strategy (GATK vs BCFtools) with downstream harmonization and comparison
  • Variant filtering pipeline (missingness, SNP selection, LD pruning) and functional annotation (SnpEff)
  • Post-processing layer for contig normalization and chromosome mapping
  • Full observability of pipeline stages, inputs/outputs, and tool configurations
  • Interactive dashboard with:
    • Aggregated metrics (reads, variants, filtering impact)
    • Step-by-step execution trace with artifacts and parameters
    • Side-by-side comparison of variant callers (counts, overlap, distributions)
    • Variant filtering funnel and chromosome-level visualizations
    • Annotation summaries and consistency checks
  • Embedded LLM assistant enabling contextual querying over pipeline outputs and results
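Two of the filtering steps listed above, missingness filtering and SNP selection, can be sketched on plain Python records. This is an illustration of the idea only: the record layout, the function names, and the `max_missing=0.1` threshold are assumptions, not the pipeline's actual parameters, and LD pruning is left out.

```python
BASES = set("ACGT")
MISSING = {"./.", ".|.", "."}


def missing_rate(genotypes):
    """Fraction of samples whose genotype call is missing."""
    if not genotypes:
        return 1.0
    return sum(1 for g in genotypes if g in MISSING) / len(genotypes)


def is_biallelic_snp(ref, alt):
    """True when REF and ALT are each a single A/C/G/T base.

    Multi-allelic ALT strings such as 'A,T' and indels are rejected.
    """
    return ref in BASES and alt in BASES


def filter_variants(variants, max_missing=0.1):
    """Keep biallelic SNPs whose missingness is at most max_missing.

    max_missing is an illustrative threshold, not a value from the source.
    """
    return [
        v for v in variants
        if is_biallelic_snp(v["ref"], v["alt"])
        and missing_rate(v["gt"]) <= max_missing
    ]
```

Each variant here is a dict with `ref`, `alt`, and a `gt` list of per-sample genotype strings, which is a simplification of a VCF record.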

This project bridges raw bioinformatics execution with a modern data engineering and analytics layer, providing a reproducible, inspectable, and explainable environment for genomic variant analysis.


The dashboard is designed as a thin visualization and interpretation layer on top of bioinformatics workflows, with emphasis on transparency, comparability, and debugging of variant calling pipelines.

Main dashboard view of a Dingo WGS analysis pipeline, showing high-level pipeline metrics, sample and reference metadata, and a visual summary of variant processing stages.

Includes variant counts (GATK vs BCFtools), filtering funnel, pipeline stage progression, and quick access to detailed results and analyses.
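A filtering funnel of this kind boils down to per-stage variant counts plus the fraction retained at each step. The sketch below shows one way to derive those fractions; the function name and the stage labels in the usage note are hypothetical, and the numbers are made up for illustration rather than taken from any pipeline run.

```python
def funnel(counts):
    """Compute retention for an ordered list of (stage, n_variants) pairs.

    kept_vs_prev is relative to the preceding stage,
    kept_vs_raw is relative to the first stage.
    """
    if not counts:
        return []
    first = counts[0][1]
    prev = first
    rows = []
    for name, n in counts:
        rows.append({
            "stage": name,
            "n": n,
            "kept_vs_prev": n / prev if prev else 0.0,
            "kept_vs_raw": n / first if first else 0.0,
        })
        prev = n
    return rows
```

For example, `funnel([("raw", 1000), ("missingness", 800), ("snps_only", 600)])` reports that the last stage keeps 75% of the previous stage and 60% of the raw set.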


Detailed pipeline configuration and execution view, showing all tools used (with versions and roles) and a step-by-step breakdown of the workflow.

Includes preprocessing, alignment, variant calling, filtering, and annotation stages, with parameters, scripts, and generated outputs for each step, enabling full transparency and reproducibility of the analysis.


Step-by-step execution view of the genomic pipeline, presenting each stage from raw data quality control through trimming, alignment, variant calling, filtering, and annotation.

Displays key metrics, parameters, intermediate outputs, and comparison between GATK and BCFtools results, enabling traceability of how raw reads are transformed into final high-confidence variants.


Comparison view between GATK HaplotypeCaller and BCFtools pipelines, highlighting differences in variant detection.

Includes overlap analysis (shared vs unique variants), detailed metrics (raw counts, filtering stages, SNP/indel breakdown), and chromosome-level distributions, supported by visualizations and concise explanations of methodological differences between callers.
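The overlap analysis can be sketched as set arithmetic over variant keys. The code below assumes each variant is identified by a `(chrom, pos, ref, alt)` tuple and that both call sets have already been normalized to the same representation (for instance with `bcftools norm`), since the two callers can otherwise encode the same indel differently. The function names are hypothetical and this is not the dashboard's actual implementation.

```python
from collections import Counter


def variant_type(ref, alt):
    """Classify as 'snp' when both alleles are one base, else 'indel'.

    Anything that is not a single-base change is labelled 'indel'
    here for simplicity.
    """
    return "snp" if len(ref) == 1 and len(alt) == 1 else "indel"


def breakdown(keys):
    """Count SNPs and indels in a collection of variant keys."""
    counts = Counter(variant_type(ref, alt) for _, _, ref, alt in keys)
    return {"snp": counts["snp"], "indel": counts["indel"]}


def compare_callers(gatk, bcftools):
    """Return shared and caller-specific variants with type breakdowns."""
    g, b = set(gatk), set(bcftools)
    shared, union = g & b, g | b
    return {
        "shared": breakdown(shared),
        "gatk_only": breakdown(g - b),
        "bcftools_only": breakdown(b - g),
        # Jaccard index: shared variants over all distinct variants.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

Matching on exact keys is deliberately strict: a variant counts as shared only when position and both alleles agree.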


End-to-end pipeline view visualizing the full genomic workflow from raw data ingestion to final annotated variants.

Shows each stage (QC, trimming, alignment, variant calling, filtering, annotation, post-processing) in a sequential timeline with associated inputs, outputs, and parameters, providing a complete, traceable execution overview of the pipeline.


Integrated LLM assistant interface that enables querying pipeline results using natural language.

Provides contextual answers based on processed genomic data (variants, samples, annotations), with pre-defined prompts and full pipeline awareness to support interpretation, troubleshooting, and exploratory analysis.

