

DataFlow-Agent




Quickstart Docs Contributing

🔍 Project Overview

A core component of the DataFlow ecosystem: a state-driven, modular AI Agent framework that provides an extensible Agent / Workflow / Tool system for DataFlow. It ships with a CLI scaffold and visual pages, targeting dataflow/operator orchestration tasks such as operator recommendation, pipeline generation and debugging, operator Q&A, and web data collection.


🛠️ Feature Overview

The DataFlow Agent multi-functional platform (Gradio) includes 6 core feature modules:

  • PromptAgent Frontend: Generate and optimize operator Prompt Templates, making it easy to build up a reusable prompt repository.
  • Op Assemble Line: Quickly assemble Pipelines by selecting operators from the operator library, with support for debugging and running them.
  • Operator QA: A dedicated Q&A assistant for operators/tools that quickly answers questions about usage, parameters, examples, etc.
  • Operator Write: Generate custom operator code from natural language requirements, with an in-page test-and-debug loop.
  • Pipeline Rec: Automatically generate executable Pipelines from task descriptions, with multi-round iterative optimization.
  • Web Collection: Web data collection and structured transformation, bridging data production to data governance/training data.

📋 Feature Details

PromptAgent Frontend

Reuse existing operators to generate and iteratively optimize "operator Prompt Templates":

  • Inputs: Support passing task descriptions, operator names (op-name), parameter lists, output formats, etc. (optional)
  • Outputs: Generate directly reusable Prompt Templates or provide rewrite suggestions for easy accumulation in the prompt repository

Op Assemble Line

Filter suitable operators from the operator library, quickly assemble them into executable Pipelines, and support end-to-end debugging and running:

  • Operator Selection: Filter target operators by category to accurately match business requirements
  • Parameter Configuration: Configure operator parameters in JSON format and add to the Pipeline queue
  • One-click Run: Quickly execute the assembled Pipeline and support end-to-end effect verification
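As an illustration of the JSON parameter-configuration step, a queued Pipeline might look like the sketch below. The operator names and parameters here are purely hypothetical, not real entries from the operator library:

```python
import json

# Hypothetical operator configs queued into a Pipeline; the operator
# names and parameter keys below are illustrative only.
pipeline_queue = [
    {"operator": "TextCleaner", "params": {"lowercase": True, "strip_html": True}},
    {"operator": "Deduplicator", "params": {"threshold": 0.9}},
]

# The UI takes parameters as JSON, so every config must round-trip cleanly:
encoded = json.dumps(pipeline_queue)
decoded = json.loads(encoded)
print(decoded[1]["params"]["threshold"])  # 0.9
```

Each queue entry pairs an operator with its parameter object, so the assembled Pipeline can be serialized, edited, and re-run as a whole.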

Operator QA

A dedicated Q&A assistant for operators/tools to help quickly understand "how to use / what to use / what to note":

  • Operator Recommendation: Intelligently recommend suitable operators based on user requirements
  • Parameter Interpretation: Clearly explain operator input/output rules and the meaning of key parameters
  • Usage Examples: Provide directly reusable code snippets and scenario-based usage cases

Operator Write

Automatically generate DataFlow operator code from natural language requirements, with an in-page test-and-debug loop:

  • Code Generation: Generate standard-compliant operator implementation code based on target descriptions and constraints
  • Operator Matching: Align with existing operator specifications to facilitate inclusion of generated operators into the operator library
  • Debug Verification: Execute operators directly in the page and view execution results, debugging information, and running logs
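To make "standard-compliant operator implementation code" concrete, here is a toy sketch of what a generated operator could look like. The class shape, method name, and lack of a base class are assumptions for illustration; real DataFlow operators follow the project's own specification:

```python
# Toy operator sketch; the interface here is assumed, not DataFlow's actual spec.
class UppercaseOperator:
    """Uppercase a text field in each record."""

    def __init__(self, field: str = "text"):
        self.field = field

    def run(self, records: list[dict]) -> list[dict]:
        # Copy each record, replacing only the configured field.
        return [{**r, self.field: r[self.field].upper()} for r in records]

op = UppercaseOperator()
print(op.run([{"text": "hello dataflow"}]))
```

The in-page debug step then amounts to running such an operator on a small sample and inspecting the returned records and logs.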

Pipeline Rec

Automatically generate executable Pipelines from natural language task descriptions, supporting multi-round refinement:

  • Pipeline Generation: Map natural language tasks to operator combinations and execution sequences, outputting Pipeline code/JSON
  • Iterative Optimization: Perform secondary refinement based on the initial Pipeline to continuously improve pipeline adaptability
  • Artifact Output: Full artifacts such as Pipeline code, JSON configuration, and execution logs

Web Collection

Web data collection and structured transformation, bridging data production to data governance/training data:

  • Collection Configuration: Customize collection targets, data types, and collection scale
  • Structured Transformation: Automatically collect and output structured results
  • Result Viewing: Support viewing execution logs, data summaries, and structured output results

📊 Core Design

  • Unified State Model: Organize multi-agent execution processes around state objects such as MainState / DFState, with traceable and reusable state transitions.
  • Agent Plug-inization: Automatically discover/load Agents through a registration mechanism, enabling expansion of agent capabilities without modifying core code.
  • Workflow Orchestration: Orchestrate nodes based on graph structures (GraphBuilder), supporting complex process nesting and tool call links.
  • Tool Management: Uniformly inject pre-tools/post-tools through ToolManager to control tool permissions and execution boundaries.
  • Visual Pages: Built-in Gradio multi-page interface covering high-frequency scenarios such as operators/pipelines/prompts/web collection, ready to use out of the box.
  • CLI Scaffold: dfa create generates templates for workflow/agent/gradio page/prompt/state with a single command, reducing development cost.
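The agent plug-in point above is a registration mechanism. A minimal sketch of that pattern is shown below; the decorator name and registry dict are assumptions for illustration, not the framework's actual API:

```python
# Sketch of a decorator-based agent registry; names are hypothetical.
AGENT_REGISTRY: dict[str, type] = {}

def register_agent(name: str):
    """Register a class under `name` without touching core code."""
    def deco(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return deco

@register_agent("operator_qa")
class OperatorQAAgent:
    def handle(self, question: str) -> str:
        return f"answering: {question}"

# Core code looks agents up by name instead of importing them directly.
agent = AGENT_REGISTRY["operator_qa"]()
print(agent.handle("how do I configure an operator?"))
```

The point of the pattern is that new agents register themselves on import, so extending capabilities never requires editing the dispatch code.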

🚀 Quickstart

🔥 Quickstart with Google Colab

| Feature Name | Colab Tutorial Link |
| --- | --- |
| PromptAgent Frontend | Open In Colab |
| Op Assemble Line | Open In Colab |
| Operator QA | Open In Colab |
| Operator Write | Open In Colab |
| Pipeline Rec | Open In Colab |

🛠️ Environment Configuration and Installation

1) Clone the Repository

git clone https://github.com/OpenDCAI/DataFlow-Agent
cd DataFlow-Agent

2) Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Or use conda
conda create -n myenv python=3.11
conda activate myenv

3) Install Dependencies

Recommended for development/local debugging:

pip install -r requirements-data.txt
pip install -e .

Launch UI (Gradio)

Load only the dataflow-related page set (recommended):

python gradio_app/app.py --page_set data

Page Entries

PromptAgent Frontend / Op Assemble Line / Operator QA / Operator Write / Pipeline Rec / Web Collection

Port and Address Configuration

  • Port: Set via the environment variable GRADIO_SERVER_PORT or command line --server_port (default 7860)
  • Listening Address: Set via GRADIO_SERVER_NAME (default 0.0.0.0)
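The resolution order for these two settings can be sketched in Python. The variable names and defaults come from the bullets above; the helper function itself is mine:

```python
import os

def resolve_server_address(env=os.environ):
    """Resolve host/port from the documented env vars, falling back to defaults."""
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    host = env.get("GRADIO_SERVER_NAME", "0.0.0.0")
    return host, port

print(resolve_server_address({}))                                # ('0.0.0.0', 7860)
print(resolve_server_address({"GRADIO_SERVER_PORT": "8080"}))    # ('0.0.0.0', 8080)
```

A --server_port value passed on the command line would override the environment variable, per the bullet above.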

CLI Usage

View CLI help:

dfa --help

Common Scaffold Commands

dfa create --agent_name my_agent   
dfa create --wf_name my_workflow   
dfa create --gradio_name my_page   
dfa create --prompt_name my_prompt  
dfa create --state_name my_state       

Generated File Locations (Convention)

  • Workflow: dataflow_agent/workflow/wf_<name>.py
  • Agent: dataflow_agent/agentroles/common_agents/<name>_agent.py
  • Gradio Page: gradio_app/pages/page_<name>.py
  • Prompt Template: dataflow_agent/promptstemplates/resources/pt_<name>_repo.py
  • State: dataflow_agent/states/<name>_state.py

Workflows

Workflows live in dataflow_agent/workflow/ and follow the filename convention wf_*.py. At startup, the system attempts to import and register each workflow automatically; if a workflow depends on a missing external environment or package, it logs a notice and skips the import.

View currently successfully registered workflows:

python - <<'PY'
from dataflow_agent.workflow import list_workflows
print(sorted(list_workflows()))
PY

Running a workflow (using run_workflow as an example):

python - <<'PY'
import asyncio
from dataflow_agent.workflow import run_workflow
from dataflow_agent.state import MainState

async def main():
    state = MainState()
    out = await run_workflow("operator_qa", state)
    print(out)

asyncio.run(main())
PY

Configuration and Environment Variables

LLM Related

  • DF_API_URL: LLM API base URL (default: test)
  • DF_API_KEY: API key (default: test)
  • DATAFLOW_LOG_LEVEL: Log level (default: INFO)
  • DATAFLOW_LOG_FILE: Log file (default: dataflow_agent.log)

Path Related (Optional)

dataflow_agent/state.py first tries to obtain paths via dataflow.cli_funcs.paths.DataFlowPath; if that external package is unavailable, it falls back to environment variables:

  • DATAFLOW_DIR: Root path of the data directory (default: repository root)
  • DATAFLOW_STATICS_DIR: Statics directory (default: ./statics)
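The environment-variable fallback step can be sketched as follows. The resolution order matches the description above; the function name is mine, and the sketch deliberately covers only the fallback branch (the DataFlowPath accessor itself is not documented here):

```python
import os
from pathlib import Path

def fallback_statics_dir(env=os.environ) -> Path:
    # Used only when dataflow.cli_funcs.paths.DataFlowPath is unavailable,
    # per the fallback order described above.
    return Path(env.get("DATAFLOW_STATICS_DIR", "./statics"))

print(fallback_statics_dir({}))                                        # statics
print(fallback_statics_dir({"DATAFLOW_STATICS_DIR": "/data/statics"})) # /data/statics
```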

Documentation

Online Documentation (Feature Overview)

Quickly learn about the core feature modules of the DataFlow Agent platform by visiting: DataFlow Agent Official Documentation

Development Documentation (Local Development)

To study the DataFlow Agent design architecture, or to develop locally on top of it (e.g., custom workflows or agents), launch the local documentation site to view the development guidelines:

mkdocs serve

Local Access Address: http://127.0.0.1:8000/

Documentation Configuration File: mkdocs.yml


Project Structure

DataFlow-Agent/
├── dataflow_agent/          # Core framework code
├── gradio_app/              # Gradio Web interface
├── docs/                    # Documentation
├── static/                  # Static resources (README images, etc.)
├── script/                  # Script tools
└── tests/                   # Test cases

Roadmap

| Feature | Status | Subfeatures |
| --- | --- | --- |
| 🔄 Easy-DataFlow (Data Governance Pipeline) | ✅ Done | Pipeline recommendation / Operator writing / Visual orchestration / Prompt optimization / Web collection |
| 🎨 Workflow Visual Editor (Drag-and-Drop) | 🚧 In Progress | Drag-and-drop interface / 5 Agent modes / 20+ preset nodes |
| 💾 Trace Data Export (Training Data) | 🚧 In Progress | JSON/JSONL format / SFT format / DPO format |

Contributing

We welcome contributions in all forms!


License

Apache-2.0, see LICENSE.


Join the Community

Scan the QR code to join the DataFlow-Agent Community WeChat Group.
