【Core Component of the DataFlow Ecosystem】A state-driven, modular AI Agent framework that provides an extensible Agent / Workflow / Tool system for DataFlow. It comes with a built-in CLI scaffold and visual pages, targeting "dataflow/operator orchestration" tasks (operator recommendation, pipeline generation/debugging, operator Q&A, web data collection, etc.).
The DataFlow Agent multi-functional platform (Gradio) includes 6 core feature modules:
- PromptAgent Frontend: Generate/optimize operator Prompt Templates for easy accumulation of reusable prompt repositories.
- Op Assemble Line: Quickly assemble Pipelines by selecting operators from the operator library, supporting debug and run.
- Operator QA: A dedicated Q&A assistant for operators/tools to quickly answer questions about usage, parameters, examples, etc.
- Operator Write: Generate custom operator code from natural language requirements, supporting in-page testing/debugging closed-loop.
- Pipeline Rec: Automatically generate executable Pipelines from task descriptions, supporting multi-round iterative optimization.
- Web Collection: Web data collection and structured transformation for the "data production → data governance/training data" link.
Reuse existing operators to generate and iteratively optimize "operator Prompt Templates":
- Inputs: Task descriptions, operator names (`op-name`), parameter lists, output formats, etc. (all optional)
- Outputs: Directly reusable Prompt Templates, or rewrite suggestions for easy accumulation in the prompt repository
Filter suitable operators from the operator library, quickly assemble them into executable Pipelines, and support end-to-end debugging and running:
- Operator Selection: Filter target operators by category to accurately match business requirements
- Parameter Configuration: Configure operator parameters in JSON format and add to the Pipeline queue
- One-click Run: Quickly execute the assembled Pipeline and support end-to-end effect verification
A dedicated Q&A assistant for operators/tools to help quickly understand "how to use / what to use / what to note":
- Operator Recommendation: Intelligently recommend suitable operators based on user requirements
- Parameter Interpretation: Clearly explain operator input/output rules and the meaning of key parameters
- Usage Examples: Provide directly reusable code snippets and scenario-based usage cases
Automatically generate DataFlow operator code from natural language requirements, supporting in-page testing/debugging closed-loop:
- Code Generation: Generate standard-compliant operator implementation code based on target descriptions and constraints
- Operator Matching: Align with existing operator specifications to facilitate inclusion of generated operators into the operator library
- Debug Verification: Execute operators directly in the page and view execution results, debugging information, and running logs
Automatically generate executable Pipelines from natural language task descriptions, supporting multi-round refinement:
- Pipeline Generation: Map natural language tasks to operator combinations and execution sequences, outputting Pipeline code/JSON
- Iterative Optimization: Perform secondary refinement based on the initial Pipeline to continuously improve pipeline adaptability
- Artifact Output: Full artifacts such as Pipeline code, JSON configuration, and execution logs
Web data collection and structured transformation for the "data production → data governance/training data" link:
- Collection Configuration: Customize collection targets, data types, and collection scale
- Structured Transformation: Automatically collect and output structured results
- Result Viewing: Support viewing execution logs, data summaries, and structured output results
- Unified State Model: Organize multi-agent execution processes around state objects such as `MainState` / `DFState`, with traceable and reusable state transitions.
- Agent Plug-inization: Automatically discover/load Agents through a registration mechanism, enabling expansion of agent capabilities without modifying core code.
- Workflow Orchestration: Orchestrate nodes based on graph structures (GraphBuilder), supporting complex process nesting and tool call links.
- Tool Management: Uniformly inject pre-tools/post-tools through `ToolManager` to control tool permissions and execution boundaries.
- Visual Pages: Built-in Gradio multi-page interface covering high-frequency scenarios such as operators/pipelines/prompts/web collection, ready to use out of the box.
- CLI Scaffold: `dfa create` generates templates for workflow/agent/gradio page/prompt/state in one step, reducing development costs.
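The registration mechanism described above can be sketched generically. The names below (`AGENT_REGISTRY`, `register_agent`, `OperatorQAAgent`) are illustrative only and are not the actual DataFlow Agent API:

```python
# Illustrative sketch of a registry-based plug-in mechanism.
# All names here are hypothetical, not the real DataFlow Agent API.
AGENT_REGISTRY = {}

def register_agent(name):
    """Class decorator that records an agent class under a lookup name."""
    def decorator(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return decorator

@register_agent("operator_qa")
class OperatorQAAgent:
    def run(self, state):
        # A real agent would read/update the shared state object here.
        return {"answer": f"handled by {type(self).__name__}"}

# Core code can now instantiate agents by name without importing them directly.
agent = AGENT_REGISTRY["operator_qa"]()
print(agent.run({})["answer"])  # → handled by OperatorQAAgent
```

Because discovery goes through the registry rather than hard-coded imports, new agents can be added by simply defining and decorating a class in a scanned module.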
| Feature Name | Colab Tutorial Link |
|---|---|
| PromptAgent Frontend | |
| Op Assemble Line | |
| Operator QA | |
| Operator Write | |
| Pipeline Rec | |
```bash
git clone https://github.com/OpenDCAI/DataFlow-Agent
cd DataFlow-Agent
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Or use conda
conda create -n myenv python=3.11
conda activate myenv
```

Recommended for development/local debugging:

```bash
pip install -r requirements-data.txt
pip install -e .
```
Load only the dataflow-related page set (recommended):

```bash
python gradio_app/app.py --page_set data
```

Pages included: PromptAgent Frontend / Op Assemble Line / Operator QA / Operator Write / Pipeline Rec / Web Collection

- Port: Set via the environment variable `GRADIO_SERVER_PORT` or the command-line flag `--server_port` (default 7860)
- Listening Address: Set via `GRADIO_SERVER_NAME` (default `0.0.0.0`)
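For example, to serve the page set on a different port, or bind to localhost only (the port/address values here are illustrative):

```shell
# Via environment variables (Gradio's standard server variables)
GRADIO_SERVER_PORT=8080 GRADIO_SERVER_NAME=127.0.0.1 python gradio_app/app.py --page_set data

# Or via the command-line flag
python gradio_app/app.py --page_set data --server_port 8080
```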
View CLI help:

```bash
dfa --help
```

Generate templates:

```bash
dfa create --agent_name my_agent
dfa create --wf_name my_workflow
dfa create --gradio_name my_page
dfa create --prompt_name my_prompt
dfa create --state_name my_state
```

Generated file locations:

- Workflow: `dataflow_agent/workflow/wf_<name>.py`
- Agent: `dataflow_agent/agentroles/common_agents/<name>_agent.py`
- Gradio Page: `gradio_app/pages/page_<name>.py`
- Prompt Template: `dataflow_agent/promptstemplates/resources/pt_<name>_repo.py`
- State: `dataflow_agent/states/<name>_state.py`
Workflows are located in `dataflow_agent/workflow/` with the filename convention `wf_*.py`. At system startup, each workflow is automatically imported and registered; if a workflow depends on an unavailable external environment or package, a message is logged and the import is skipped.
View the currently registered workflows:

```bash
python - <<'PY'
from dataflow_agent.workflow import list_workflows
print(sorted(list_workflows()))
PY
```

Running method (taking `run_workflow` as an example):
```bash
python - <<'PY'
import asyncio
from dataflow_agent.workflow import run_workflow
from dataflow_agent.state import MainState

async def main():
    state = MainState()
    out = await run_workflow("operator_qa", state)
    print(out)

asyncio.run(main())
PY
```

Supported environment variables:

- `DF_API_URL`: LLM API base URL (default `test`)
- `DF_API_KEY`: API key (default `test`)
- `DATAFLOW_LOG_LEVEL`: Log level (default `INFO`)
- `DATAFLOW_LOG_FILE`: Log file (default `dataflow_agent.log`)
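A minimal environment setup before launching the app might look like the following; the URL and key values are placeholders for your own LLM endpoint and credentials:

```shell
export DF_API_URL=https://your-llm-endpoint/v1   # placeholder URL
export DF_API_KEY=your-api-key                   # placeholder key
export DATAFLOW_LOG_LEVEL=DEBUG
export DATAFLOW_LOG_FILE=dataflow_agent.log
python gradio_app/app.py --page_set data
```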
`dataflow_agent/state.py` prioritizes obtaining paths through `dataflow.cli_funcs.paths.DataFlowPath`; if that external package is unavailable, it falls back to environment variables:
- `DATAFLOW_DIR`: Root path of the data directory (default: repository root path)
- `DATAFLOW_STATICS_DIR`: Statics directory (default `./statics`)
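A hedged sketch of this fallback behavior; the real implementation lives in `dataflow_agent/state.py` and the `DataFlowPath` accessor shown here is hypothetical:

```python
# Sketch of the "external package first, environment variable fallback" pattern.
# get_dataflow_dir() is a HYPOTHETICAL accessor, not a confirmed API.
import os
from pathlib import Path

def resolve_dataflow_dir() -> Path:
    try:
        # Preferred: resolve via the external dataflow package if installed.
        from dataflow.cli_funcs.paths import DataFlowPath  # type: ignore
        return Path(DataFlowPath.get_dataflow_dir())  # hypothetical accessor
    except ImportError:
        # Fallback: environment variable, defaulting to the repository root.
        return Path(os.environ.get("DATAFLOW_DIR", "."))

print(resolve_dataflow_dir())
```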
Quickly learn about the core feature modules of the DataFlow Agent platform by visiting: DataFlow Agent Official Documentation
To learn about the DataFlow Agent design architecture or to conduct local development based on DataFlow Agent (e.g., developing custom workflows, agents, etc.), launch the local documentation site to view the development guidelines:

```bash
mkdocs serve
```

Local access address: http://127.0.0.1:8000/

Documentation configuration file: `mkdocs.yml`
```
DataFlow-Agent/
├── dataflow_agent/   # Core framework code
├── gradio_app/       # Gradio Web interface
├── docs/             # Documentation
├── static/           # Static resources (README images, etc.)
├── script/           # Script tools
└── tests/            # Test cases
```
| Feature | Status | Subfeatures |
|---|---|---|
| 🔄 Easy-DataFlow (Data Governance Pipeline) | ✅ Done | Pipeline recommendation / Operator writing / Visual orchestration / Prompt optimization / Web collection |
| 🎨 Workflow Visual Editor (Drag-and-Drop) | 🚧 In Progress | Drag-and-drop interface / 5 Agent modes / 20+ preset nodes |
| 💾 Trace Data Export (Training Data) | 🚧 In Progress | JSON/JSONL format / SFT format / DPO format |
We welcome contributions in all forms!
- Submit Bugs / Feature Requests: https://github.com/OpenDCAI/DataFlow-Agent/issues
- Participate in Discussions: https://github.com/OpenDCAI/DataFlow-Agent/discussions
- Submit Code: https://github.com/OpenDCAI/DataFlow-Agent/pulls
- Contribution Guide: `docs/contributing.md`
Apache-2.0, see LICENSE.
- 📮 GitHub Issues: https://github.com/OpenDCAI/DataFlow-Agent/issues
- 🔧 GitHub Pull Requests: https://github.com/OpenDCAI/DataFlow-Agent/pulls
- 💬 Community Chat Group: Real-time communication with developers and contributors

