

DataFlow-Agent




Quickstart Docs Contributing

🔍 Project Overview

A core component of the DataFlow ecosystem: a state-driven, modular AI Agent framework that provides an extensible Agent / Workflow / Tool system for DataFlow. It ships with a CLI scaffold and visual pages, targeting dataflow/operator orchestration tasks such as operator recommendation, pipeline generation and debugging, operator Q&A, and web data collection.


🛠️ Feature Overview

The DataFlow Agent multi-functional platform (Gradio) includes 6 core feature modules:

  • PromptAgent Frontend: Generate and optimize operator Prompt Templates, making it easy to build up a reusable prompt repository.
  • Op Assemble Line: Quickly assemble Pipelines by selecting operators from the operator library, with support for debugging and running them.
  • Operator QA: A dedicated Q&A assistant for operators/tools that quickly answers questions about usage, parameters, examples, etc.
  • Operator Write: Generate custom operator code from natural language requirements, with an in-page test-and-debug loop.
  • Pipeline Rec: Automatically generate executable Pipelines from task descriptions, with multi-round iterative optimization.
  • Web Collection: Web data collection and structured transformation, bridging data production to data governance/training data.

📋 Feature Details

PromptAgent Frontend

Reuse existing operators to generate and iteratively optimize "operator Prompt Templates":

  • Inputs: Support passing task descriptions, operator names (op-name), parameter lists, output formats, etc. (optional)
  • Outputs: Generate directly reusable Prompt Templates or provide rewrite suggestions for easy accumulation in the prompt repository

Op Assemble Line

Filter suitable operators from the operator library, quickly assemble them into executable Pipelines, and support end-to-end debugging and running:

  • Operator Selection: Filter target operators by category to accurately match business requirements
  • Parameter Configuration: Configure operator parameters in JSON format and add to the Pipeline queue
  • One-click Run: Quickly execute the assembled Pipeline and support end-to-end effect verification
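As an illustration of the JSON parameter-configuration step, a queued Pipeline might look like the sketch below. The operator names and parameters here are purely hypothetical, not real entries from the operator library:

```python
import json

# Hypothetical operator configs queued into a Pipeline; the operator
# names and parameter keys below are illustrative only.
pipeline_queue = [
    {"operator": "TextCleaner", "params": {"lowercase": True, "strip_html": True}},
    {"operator": "Deduplicator", "params": {"threshold": 0.9}},
]

# The UI takes parameters as JSON, so every config must round-trip cleanly:
encoded = json.dumps(pipeline_queue)
decoded = json.loads(encoded)
print(decoded[1]["params"]["threshold"])  # 0.9
```

Each queue entry pairs an operator with its parameter object, so the assembled Pipeline can be serialized, edited, and re-run as a whole.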

Operator QA

A dedicated Q&A assistant for operators/tools to help quickly understand "how to use / what to use / what to note":

  • Operator Recommendation: Intelligently recommend suitable operators based on user requirements
  • Parameter Interpretation: Clearly explain operator input/output rules and the meaning of key parameters
  • Usage Examples: Provide directly reusable code snippets and scenario-based usage cases

Operator Write

Automatically generate DataFlow operator code from natural language requirements, with an in-page test-and-debug loop:

  • Code Generation: Generate standard-compliant operator implementation code based on target descriptions and constraints
  • Operator Matching: Align with existing operator specifications to facilitate inclusion of generated operators into the operator library
  • Debug Verification: Execute operators directly in the page and view execution results, debugging information, and running logs
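To make "standard-compliant operator implementation code" concrete, here is a toy sketch of what a generated operator could look like. The class shape, method name, and lack of a base class are assumptions for illustration; real DataFlow operators follow the project's own specification:

```python
# Toy operator sketch; the interface here is assumed, not DataFlow's actual spec.
class UppercaseOperator:
    """Uppercase a text field in each record."""

    def __init__(self, field: str = "text"):
        self.field = field

    def run(self, records: list[dict]) -> list[dict]:
        # Copy each record, replacing only the configured field.
        return [{**r, self.field: r[self.field].upper()} for r in records]

op = UppercaseOperator()
print(op.run([{"text": "hello dataflow"}]))
```

The in-page debug step then amounts to running such an operator on a small sample and inspecting the returned records and logs.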

Pipeline Rec

Automatically generate executable Pipelines from natural language task descriptions, supporting multi-round refinement:

  • Pipeline Generation: Map natural language tasks to operator combinations and execution sequences, outputting Pipeline code/JSON
  • Iterative Optimization: Perform secondary refinement based on the initial Pipeline to continuously improve pipeline adaptability
  • Artifact Output: Full artifacts such as Pipeline code, JSON configuration, and execution logs

Web Collection

Web data collection and structured transformation, bridging data production to data governance/training data:

  • Collection Configuration: Customize collection targets, data types, and collection scale
  • Structured Transformation: Automatically collect and output structured results
  • Result Viewing: Support viewing execution logs, data summaries, and structured output results

📊 Core Design

  • Unified State Model: Organize multi-agent execution processes around state objects such as MainState / DFState, with traceable and reusable state transitions.
  • Agent Plug-inization: Automatically discover/load Agents through a registration mechanism, enabling expansion of agent capabilities without modifying core code.
  • Workflow Orchestration: Orchestrate nodes based on graph structures (GraphBuilder), supporting complex process nesting and tool call links.
  • Tool Management: Uniformly inject pre-tools/post-tools through ToolManager to control tool permissions and execution boundaries.
  • Visual Pages: Built-in Gradio multi-page interface covering high-frequency scenarios such as operators/pipelines/prompts/web collection, ready to use out of the box.
  • CLI Scaffold: dfa create generates templates for workflow/agent/gradio page/prompt/state with a single command, reducing development cost.
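The agent plug-in point above is a registration mechanism. A minimal sketch of that pattern is shown below; the decorator name and registry dict are assumptions for illustration, not the framework's actual API:

```python
# Sketch of a decorator-based agent registry; names are hypothetical.
AGENT_REGISTRY: dict[str, type] = {}

def register_agent(name: str):
    """Register a class under `name` without touching core code."""
    def deco(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return deco

@register_agent("operator_qa")
class OperatorQAAgent:
    def handle(self, question: str) -> str:
        return f"answering: {question}"

# Core code looks agents up by name instead of importing them directly.
agent = AGENT_REGISTRY["operator_qa"]()
print(agent.handle("how do I configure an operator?"))
```

The point of the pattern is that new agents register themselves on import, so extending capabilities never requires editing the dispatch code.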

🚀 Quickstart

🔥 Quickstart with Google Colab

| Feature Name | Colab Tutorial Link |
| --- | --- |
| PromptAgent Frontend | Open In Colab |
| Op Assemble Line | Open In Colab |
| Operator QA | Open In Colab |
| Operator Write | Open In Colab |
| Pipeline Rec | Open In Colab |

🛠️ Environment Configuration and Installation

1) Clone the Repository

git clone https://github.com/OpenDCAI/DataFlow-Agent
cd DataFlow-Agent

2) Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Or use conda
conda create -n myenv python=3.11
conda activate myenv

3) Install Dependencies

Recommended for development/local debugging:

pip install -r requirements-data.txt
pip install -e .

Launch UI (Gradio)

Load only the dataflow-related page set (recommended):

python gradio_app/app.py --page_set data

Page Entries

PromptAgent Frontend / Op Assemble Line / Operator QA / Operator Write / Pipeline Rec / Web Collection

Port and Address Configuration

  • Port: Set via the environment variable GRADIO_SERVER_PORT or command line --server_port (default 7860)
  • Listening Address: Set via GRADIO_SERVER_NAME (default 0.0.0.0)
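The resolution order for these two settings can be sketched in Python. The variable names and defaults come from the bullets above; the helper function itself is mine:

```python
import os

def resolve_server_address(env=os.environ):
    """Resolve host/port from the documented env vars, falling back to defaults."""
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    host = env.get("GRADIO_SERVER_NAME", "0.0.0.0")
    return host, port

print(resolve_server_address({}))                                # ('0.0.0.0', 7860)
print(resolve_server_address({"GRADIO_SERVER_PORT": "8080"}))    # ('0.0.0.0', 8080)
```

A --server_port value passed on the command line would override the environment variable, per the bullet above.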

CLI Usage

View CLI help:

dfa --help

Common Scaffold Commands

dfa create --agent_name my_agent   
dfa create --wf_name my_workflow   
dfa create --gradio_name my_page   
dfa create --prompt_name my_prompt  
dfa create --state_name my_state       

Generated File Locations (Convention)

  • Workflow: dataflow_agent/workflow/wf_<name>.py
  • Agent: dataflow_agent/agentroles/common_agents/<name>_agent.py
  • Gradio Page: gradio_app/pages/page_<name>.py
  • Prompt Template: dataflow_agent/promptstemplates/resources/pt_<name>_repo.py
  • State: dataflow_agent/states/<name>_state.py

Workflows

Workflows live in dataflow_agent/workflow/ and follow the filename convention wf_*.py. At startup, the system attempts to import and register each workflow automatically; if a workflow depends on a missing external environment or package, it logs a notice and skips the import.

View currently successfully registered workflows:

python - <<'PY'
from dataflow_agent.workflow import list_workflows
print(sorted(list_workflows()))
PY

Running a workflow (using run_workflow as an example):

python - <<'PY'
import asyncio
from dataflow_agent.workflow import run_workflow
from dataflow_agent.state import MainState

async def main():
    state = MainState()
    out = await run_workflow("operator_qa", state)
    print(out)

asyncio.run(main())
PY

Configuration and Environment Variables

LLM Related

  • DF_API_URL: LLM API base URL (default: test)
  • DF_API_KEY: API key (default: test)
  • DATAFLOW_LOG_LEVEL: Log level (default: INFO)
  • DATAFLOW_LOG_FILE: Log file (default: dataflow_agent.log)

Path Related (Optional)

dataflow_agent/state.py first tries to obtain paths via dataflow.cli_funcs.paths.DataFlowPath; if that external package is unavailable, it falls back to environment variables:

  • DATAFLOW_DIR: Root path of the data directory (default: repository root)
  • DATAFLOW_STATICS_DIR: Statics directory (default: ./statics)
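The environment-variable fallback step can be sketched as follows. The resolution order matches the description above; the function name is mine, and the sketch deliberately covers only the fallback branch (the DataFlowPath accessor itself is not documented here):

```python
import os
from pathlib import Path

def fallback_statics_dir(env=os.environ) -> Path:
    # Used only when dataflow.cli_funcs.paths.DataFlowPath is unavailable,
    # per the fallback order described above.
    return Path(env.get("DATAFLOW_STATICS_DIR", "./statics"))

print(fallback_statics_dir({}))                                        # statics
print(fallback_statics_dir({"DATAFLOW_STATICS_DIR": "/data/statics"})) # /data/statics
```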

Documentation

Online Documentation (Feature Overview)

Quickly learn about the core feature modules of the DataFlow Agent platform by visiting: DataFlow Agent Official Documentation

Development Documentation (Local Development)

To study the DataFlow Agent design architecture, or to develop locally on top of it (e.g., custom workflows or agents), launch the local documentation site to view the development guidelines:

mkdocs serve

Local Access Address: http://127.0.0.1:8000/

Documentation Configuration File: mkdocs.yml


Project Structure

DataFlow-Agent/
├── dataflow_agent/          # Core framework code
├── gradio_app/              # Gradio Web interface
├── docs/                    # Documentation
├── static/                  # Static resources (README images, etc.)
├── script/                  # Script tools
└── tests/                   # Test cases

Roadmap

| Feature | Status | Subfeatures |
| --- | --- | --- |
| 🔄 Easy-DataFlow (Data Governance Pipeline) | ✅ Done | Pipeline recommendation / Operator writing / Visual orchestration / Prompt optimization / Web collection |
| 🎨 Workflow Visual Editor (Drag-and-Drop) | 🚧 In Progress | Drag-and-drop interface / 5 Agent modes / 20+ preset nodes |
| 💾 Trace Data Export (Training Data) | 🚧 In Progress | JSON/JSONL format / SFT format / DPO format |

Contributing

We welcome contributions in all forms!


License

Apache-2.0, see LICENSE.


Join the Community

Scan the QR code to join the DataFlow-Agent Community WeChat Group.
