The open toolkit for
data & agent
engineering.
Write what you want to build. A Pipeline Builder Agent translates it into ADPL — the open format for agentic data pipelines. Import it into Pipeline CAD, run a live simulation, and generate all project files. One document drives the whole process.
From project description
to deployed pipeline.
Every step is connected. Click any node to open the corresponding tool or documentation.
.adpl JSON file — topology, config, and agent prompts in one portable document
Meet H.A.R.L.I.E.
This toolkit is kept alive by H.A.R.L.I.E. — a collective of 7 specialized agents running weekly: Scout researches the DE/AI landscape, Template Engineer writes new prompts, Pulse Writer summarises findings, Project Architect builds case studies, Market Watcher tracks tools, Pipeline Builder generates ADPL files, and Publisher ships it all to production.
"Consistency stops pipelines from drifting. Reliability stops agents from diverging. Agentic support stops humans from being overwhelmed. All three require the same thing: a shared, structured conversation."
Data Engineering meets Agent Engineering
One document — the ADPL file — connects your project description to a running, simulated, deployable pipeline. Explore the workflow, the format, the interface, and the autonomy model behind it.
ADPL — The Pipeline Document
ADPL (Agentic Data Pipeline Language) is the open JSON
format that connects every step in the workflow. One .adpl
file encodes the complete pipeline — topology, node configuration, orchestration
settings, quality checks, and embedded AI agent prompts.
Import a .adpl file into Pipeline CAD to instantly reconstruct the exact visual graph — no manual rebuilding. The agents section embeds full system prompts so monitoring agents can be deployed directly from the file.
"meta": { name, description, stack, autonomy_level },
"pipeline": { nodes[], edges[], orchestrator },
"agents": { setup, monitors[] }, // ★ new in v1.1
"ahi": { enabled, entry_types[], log_location },
"summary": { strengths[], risks[], next_steps[] }
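A minimal sketch of loading and sanity-checking an .adpl document against the v1.1 skeleton above. The top-level keys come from the skeleton; everything else — the helper name `load_adpl`, the `id`/`from`/`to` edge fields — is an illustrative assumption, not part of any official ADPL SDK.

```python
import json

# Required top-level sections per the ADPL v1.1 skeleton above.
REQUIRED_KEYS = {"meta", "pipeline", "agents", "ahi", "summary"}

def load_adpl(path: str) -> dict:
    """Load an .adpl file and check that the v1.1 sections exist."""
    with open(path) as f:
        doc = json.load(f)
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"not a valid ADPL v1.1 document, missing: {sorted(missing)}")
    # Every edge must reference declared nodes, or the visual graph
    # cannot be reconstructed in Pipeline CAD. Field names here
    # (id/from/to) are assumed for illustration.
    node_ids = {n["id"] for n in doc["pipeline"]["nodes"]}
    for edge in doc["pipeline"]["edges"]:
        if edge["from"] not in node_ids or edge["to"] not in node_ids:
            raise ValueError(f"edge references unknown node: {edge}")
    return doc
```

Because the file is plain JSON, the same check works in CI before a pipeline is imported or deployed.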
Data Pipelines
Move data between storage, transformation & presentation
Data executes.
Observations flow back.
Agent Pipelines
Move orders & decisions between specialized agents
How They Connect
Agent pipelines sit above data pipelines. Agents issue orders — data pipelines execute. Data pipelines feed observations back up. The ADPL file captures both layers in a single document: the pipeline topology and the agent prompts that govern it.
The Agent-Human Interface
Without a structured exchange layer, agent pipelines fail silently or act without context. Consider a typical failure: an agent detects an anomaly and writes it to a log file. A human restarts the pipeline without reading the log. The agent's finding is discarded, and the fix makes things worse.
This happens because informal signals — Slack messages, log files, dashboards — are not part of the pipeline. They are not queryable, not typed, not append-only. When something goes wrong, you cannot reconstruct who knew what, when, and what was decided.
The arrows between the two pipelines are not vague signals — they are structured, typed entries in a shared log. Both agents and humans read and write here using the same format. It is the only place where machine decisions and human intent meet.
| Entry Type | Direction | Purpose |
|---|---|---|
| observation | Agent → Human | What was found in the data — no action implied |
| recommendation | Agent → Human | Proposed action with evidence and expected outcome |
| alert | Agent → Human | Anomaly requiring immediate attention |
| order | Human → Agent | Directive: priority change, focus area, constraint |
| approval | Human → Agent | Confirms a recommendation should be acted on |
| override | Human → Agent | Cancel or modify a planned agent action |
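The entry types above can be sketched as a typed, append-only log that both agents and humans write to. The field names (`entry_type`, `author`, `payload`) and the JSON Lines file layout are illustrative assumptions, not a fixed ADPL schema — the point is that every entry is typed, attributed, and never rewritten.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# The six entry types from the Agent-Human Interface table.
ENTRY_TYPES = {"observation", "recommendation", "alert",
               "order", "approval", "override"}

@dataclass
class ExchangeEntry:
    entry_type: str   # must be one of ENTRY_TYPES
    author: str       # agent name or human identity
    payload: dict     # evidence, proposed action, directive, ...

    def append_to(self, log_path: str) -> None:
        """Append-only write: entries are never updated or deleted,
        so 'who knew what, when, and what was decided' stays
        reconstructable after the fact."""
        if self.entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {self.entry_type}")
        record = asdict(self)
        record["ts"] = datetime.now(timezone.utc).isoformat()
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
```

An append-only JSONL file is deliberately boring infrastructure: it is queryable with any tool, trivially diffable, and cannot silently lose a finding the way a Slack thread can.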
Navigate the Autonomy Spectrum
Select a level to explore what it means, what trust infrastructure it requires, and which concepts apply.
No agent pipeline. All decisions are made by humans who monitor dashboards, investigate anomalies, and manually fix failures. The data pipeline runs on schedule; the human is the only "agent" in the loop.
- How long does it take to discover a pipeline failure?
- How much engineer time goes to manual restarts?
- What would happen if no one checked the dashboard for a day?
An agent monitors the data pipeline and surfaces alerts: quality score dropped, schema changed, SLA breached. It observes and informs — but the decision to act still rests with a human. This is the first step toward agency.
- What are you monitoring, and how quickly does an alert reach the right person?
- Are your data contracts documented and enforced?
- How much is alert fatigue affecting your team's response quality?
The agent diagnoses known failure patterns and applies predefined playbook strategies autonomously — without waiting for a human. Schema drift? Add the column. Null spike? Quarantine the partition. This is the critical threshold: the first time the system acts without explicit human instruction.
- Do you have a documented playbook of known failure → fix pairs?
- Can you audit every autonomous action the system took?
- What's your rollback strategy if a remediation makes things worse?
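The L2 pattern above — known failure signatures mapped to predefined fixes, with everything else escalated — can be sketched as a simple dispatch table. The failure names, handler functions, and audit format are illustrative assumptions; a real playbook would execute against the warehouse rather than return strings.

```python
# Predefined remediations for known failure patterns (hypothetical
# handlers; real ones would run DDL or quarantine jobs).
def add_missing_column(ctx: dict) -> str:
    return f"ALTER TABLE {ctx['table']} ADD COLUMN {ctx['column']}"

def quarantine_partition(ctx: dict) -> str:
    return f"quarantined partition {ctx['partition']}"

PLAYBOOK = {
    "schema_drift": add_missing_column,
    "null_spike": quarantine_partition,
}

def remediate(failure: str, ctx: dict, audit: list) -> str:
    """Apply a known fix autonomously; escalate anything unknown.
    Every action, autonomous or escalated, lands in the audit trail."""
    handler = PLAYBOOK.get(failure)
    if handler is None:
        audit.append({"failure": failure, "action": "escalated"})
        return "escalated to human"
    action = handler(ctx)
    audit.append({"failure": failure, "action": action})
    return action
```

The audit list is what makes L2 safe to run: if a remediation makes things worse, the trail shows exactly which playbook entry fired and why.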
The agent encounters novel situations and reasons about them — it doesn't pick from a playbook, it generates solutions. It can write new dbt models, restructure transformation logic, or draft contract amendments. It evaluates its own confidence and escalates to humans when uncertain.
- How does the agent know when it's out of its depth?
- What prevents the agent from taking a confident but catastrophically wrong action?
- How do you evaluate the quality of the agent's reasoning, not just its output?
Multiple specialized agents collaborate on a shared mission, exchanging orders, findings, verdicts, and recommendations through structured protocols. No single agent controls everything — the collective self-organizes. Human governance sets constitutional constraints for the entire system. H.A.R.L.I.E. runs this site at L4: 7 agents, one weekly pipeline, one ADPL file per project.
- How do agents resolve conflicting recommendations?
- What prevents an echo chamber where agents reinforce each other's errors?
- Where does the human sit in the governance structure?
The Trust Equation
Autonomy without trust is recklessness. Trust without autonomy is waste.
The Trust Gap
When autonomy outpaces trust infrastructure. An agent that can rewrite SQL but has no guardrails against destructive operations.
The Sweet Spot
Autonomy and trust grow together. Each level of agency is backed by proportional contracts, audits, and oversight.
The Waste Gap
When trust infrastructure exists but autonomy is capped. Sophisticated monitoring, but humans still manually restart every failed job.
| Data Pipeline | Agent Pipeline | |
|---|---|---|
| Contract | Data contract (schema, SLA) | Behavioral contract (guardrails, escalation) |
| Quality | Data quality (completeness, accuracy) | Decision quality (appropriateness, reasoning) |
| Failure | Corrupt data downstream | Wrong action taken |
| Audit | Data lineage | Decision lineage (chain-of-thought) |
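A behavioral contract from the table above can be made concrete as a guardrail check that runs before any agent-issued statement executes. The regex list and the approval-set lookup are illustrative assumptions — a minimal sketch of the idea, not a production policy engine.

```python
import re

# Statements an agent may never run without an explicit human
# 'approval' entry in the exchange log (pattern list is illustrative).
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE|DELETE)\b", re.IGNORECASE)

def check_contract(sql: str, approvals: set, action_id: str) -> bool:
    """Return True if the agent may execute this statement.
    Destructive operations require that action_id was approved."""
    if DESTRUCTIVE.match(sql) and action_id not in approvals:
        return False  # blocked: destructive op without human approval
    return True
```

This is the data-contract idea transplanted to behavior: just as a schema contract rejects malformed rows, a behavioral contract rejects out-of-bounds actions before they reach production.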
Three Disciplines — One Stack
Data Engineering builds the pipeline. Agent Engineering governs it. ADPL captures both in one portable document — the link between the description you write and the system that runs.
Data Engineering
Data Pipeline Mastery
Build, move & transform data at scale. Design robust pipelines, architect lakehouses, write production SQL. The foundation — without solid data infrastructure, agents have nothing to govern.
Data Science & AI
The Intelligence Layer
The perception and reasoning capabilities that make agent judgment possible. ML models, embeddings, RAG, and vector search give agents eyes, ears, and understanding.
Agent Engineering
Agent Pipeline Design
Design the governance layer — agent personas, orchestration logic, self-healing loops, behavioral contracts. The system that decides what should happen.
Featured Projects
Real-world business cases demonstrating the power of AI-Augmented Data Engineering.
ElixirData Decision Pipeline
Governing an unpredictable AI component natively inside a deterministic Airflow DAG using ElixirData decision infrastructure.
Browser-Use Data Extraction
Automating data extraction from messy legacy web portals using browser-use and AI agents.
Browser-Use Data Extraction — Agentic Web Scraping to DuckDB
Agentic DOM parsing to scrape dynamic portals directly into DuckDB.
DeerFlow DE Agent Harness — Autonomous Pipeline Analysis with Sandboxed Execution
DeerFlow 2.0 (ByteDance, 43k stars) SuperAgent harness on LangGraph — a coordinator agent decomposes multi-source data quality investigations, fires parallel sub-agents into a Docker-sandboxed DuckDB environment, and persists confirmed findings to a skills store that compounds across sessions.
SQLMesh Transformation Pipeline — Vendor-Neutral dbt Alternative with Virtual Environments
SQLMesh (Linux Foundation, March 2026) replaces dbt Core in a retail analytics pipeline — zero-copy virtual dev/staging environments, column-level lineage, and CI/CD plan diffs running on DuckDB locally and Spark 4.x in production, illustrating the Spark 3.5 upgrade path.
Agent Knowledge Graph Pipeline — Persistent DE Project Memory with Cognee
LangGraph analyst agent with persistent dbt project memory. Cognee ingests manifest.json, schema YAMLs, and run history into a knowledge graph — agents traverse lineage and answer architectural questions across sessions without re-loading context.
MAF Agent Pipeline — MCP + AG-UI + HITL Production Pattern
Four-agent Microsoft Agent Framework pipeline with DuckDB and Airflow MCP servers, AG-UI real-time streaming to a CopilotKit dashboard, and HITL approval gates before any production table changes are applied.
Swarm Simulation Pipeline
Multi-agent swarm simulation for supply chain disruption forecasting. GraphRAG seeds the agent world from real logistics data; MiroFish runs thousands of parallel scenarios to predict outcomes.
Multimodal Document Pipeline
Local multimodal LLM pipeline for regulated financial document processing. Mistral Small 4 via Ollama extracts structured data from PDFs and screenshots — vision, reasoning, and SQL generation entirely on-premises.
Agent Context Database Pipeline — Long-Horizon Analysis with Tiered Memory
LangGraph platform analyst agent using OpenViking's L0/L1/L2 tiered context database for precise, low-token retrieval across 50-200 client data platform files — with automatic memory evolution after each session.
Local LLM Inference Pipeline — BitNet CPU Agents for Air-Gapped DE
CPU-native LLM pipeline using Microsoft BitNet b1.58 to run anomaly summarization, data quality reports, and SQL assistance on x86 servers without GPUs — designed for DSGVO-compliant air-gapped environments.
AG-UI Streaming Dashboard — Connecting Pipeline Agents to Live Frontends
LangGraph supply chain anomaly monitor wired to a React operations dashboard via AG-UI protocol — streaming token output, tool call progress, and agent state in real time, with HITL approval modals for critical remediation decisions.
Multi-Framework Agent Pipeline — A2A, HITL, and OTel Tracing
Google ADK coordinates risk analysis tasks via A2A protocol to OpenAI Agents SDK workers, with cross-framework OpenTelemetry tracing stitching every LLM call and human approval into a single auditable trace for regulatory compliance.
Kafka 4.2 Streaming Pipeline — Share Groups, Streams, and Lakehouse Sink
A logistics parcel-tracking pipeline uses Kafka 4.2 Share Groups for true queue semantics — any worker picks any message — with Kafka Streams DLQ routing and Delta Lake sink for lakehouse analytics via DuckDB.
Deep Agents Pipeline Analyst — LangChain Long-Horizon DE Agent
A LangChain Deep Agents harness analyses 60+ dbt models across context window limits using planning, filesystem-backed context offloading, and specialised subagents — producing full lineage impact reports with HITL review gates.
Databricks Agentic Lakehouse
An AI agent autonomously ingests, transforms, and monitors Delta tables via MCP OAuth — with exchange-layer approval gates for production writes.
HITL Approval Pipeline — Human-in-the-Loop Data Governance
Airflow 3.1 HITL tasks enforce mandatory human approval checkpoints in a regulated credit-scoring data pipeline. AI-generated risk summaries pre-populate approval forms, and every decision is logged to an immutable audit trail for EU AI Act compliance.
Timeline Prognose — Event Forecasting
End-to-end ML forecasting pipeline for event schedule data. Forecasts events with Historical Mean, Linear Regression & Random Forest models, detects anomalies, and exposes everything via an interactive dashboard with a local Qwen3.5 AI chat interface over MCP.
Agentic Data Pipeline
Self-healing pipeline that monitors all 3 virtual_data_source feeds. LangGraph agent detects anomalies, reasons via LLM, acts via MCP tools, and escalates only when it can't resolve alone.
API-to-Warehouse Ingestion
FastAPI product catalog → DuckDB with retry logic, pagination handling, schema drift detection, and JSON flattening. Production-grade REST ingestion patterns for any API.
Data Quality Gauntlet
Edge-case-heavy transaction CSV from virtual_data_source as a gauntlet for Great Expectations + dbt + Soda. Catches duplicate IDs, null card data, geolocation conflicts, orphaned transactions.
Multi-Source ELT Pipeline
PostgreSQL + FastAPI + CSV → dbt → DuckDB in a single Airflow-orchestrated ELT pipeline. Three source types, one unified mart. Starter data from virtual_data_source.
Data Quality & Testing Framework
Multi-layer quality framework with dbt tests, Great Expectations, Soda, and pytest. CI gating blocks broken pipelines before they reach production data consumers.
Real-Time BI Dashboard with DuckDB & dbt
Event streaming to Grafana via Kafka, Flink, DuckDB, and dbt incremental models. Sub-minute dashboard freshness without a cloud warehouse — on commodity hardware.
Infrastructure-as-Code Data Platform
Full data platform defined in Terraform and deployed via GitHub Actions to Kubernetes. Airflow with KEDA autoscaling, Kafka via Strimzi — reproducible from zero in under 30 minutes.
Local Data Engineering Knowledge Base (RAG)
Ingest PDFs, Markdown, and Jupyter notebooks into a local ChromaDB vector store. Enables agents to answer questions from internal project context without cloud exposure.
Agent Pipeline Concept
How to build scalable agent workflows: persona creation, orchestration logic, and quality control (critic). The framework for autonomous systems.
Real-Time Predictive Maintenance (IoT)
Anomaly detection for industrial sensors using Dagster, dbt, and DuckDB. Enriched with AI enrichment (anomaly detection) and data trust (sensor contracts).
Personalized Customer Churn Prevention
360-degree customer profiles with Airflow and DuckDB. Includes AI-driven health scores and integration blueprints for Salesforce & reporting tools.
Dynamic Supply Chain Optimization
Inventory optimization based on market trends. Uses Dagster assets, dbt forecasting models, and data trust quality monitoring for supplier data.
AI-Driven Business Analytics Pipeline
How to leverage Airflow, dbt, and LLMs to turn passive business streams into proactive executive insights. Includes semantic quality gates and automated trend narratives.
Market Watch
Open source tools and LLMs for Data Engineering & AI Agent workflows — curated and kept current by the Market Watcher agent.
The Human Role in Data Engineering
AI handles the repetitive. Agents handle the predictable. What remains — judgment, architecture, trust, communication — is irreducibly human.
What Humans Bring That Machines Can't
Architecture & Design
Data Engineers don't just build pipelines — they design systems. Choosing between Kimball and Data Vault, evaluating Databricks vs. Snowflake, defining scalability patterns: these are judgment calls, not algorithms.
Business Translation
The most valuable skill in the market: understanding what the business actually needs, and translating that into technical reality. Stakeholders speak outcomes — Data Engineers speak pipelines. Bridging that gap is human work.
Ownership & Accountability
Agents execute. Humans own. Data quality, reliability, and compliance decisions carry consequences. Someone has to sign off on what flows through the system — and that someone is human.
Strategic Direction
Choosing which problems to solve, which tools to adopt, and which technical debt to carry is a strategic act. Data Engineers set the standards, define the roadmap, and decide what "good" looks like.
Innovation & Evaluation
New tools emerge weekly. Evaluating Polars vs. Pandas, DuckDB vs. BigQuery, MCP vs. custom APIs — this requires hands-on expertise and contextual judgment that no agent can fully replicate.
Ethics & Privacy
Who decides that anonymized mobility data is handled responsibly? Who ensures AI pipelines don't encode bias? These questions sit at the intersection of data, society, and conscience — a human domain.
Technical Skills — What the Market Demands
Aggregated from 120+ active Data Engineer listings on karriere.at (March 2026)
Languages
Pipeline & Orchestration
Cloud & Platforms
Databases & Modeling
DevOps & Infra
AI / ML Integration
Soft Skills — The Human Edge
These appear in nearly every listing — and no agent can fake them.
Explain technical concepts clearly — to developers and to management.
Find pragmatic solutions — even when requirements are unclear.
Break complex systems into manageable parts and prioritize.
Work in cross-functional teams — engineering, data science, business.
Anticipate problems before they escalate. Show initiative.
The tool landscape changes fast. Those who stand still fall behind.
What Drives the Market in 2026
Batch is no longer enough. Spark Streaming, Flink, Kafka — real-time architectures are becoming the standard.
Nearly every job listing names one or both. Cloud-native data platforms are no longer a trend — they are a prerequisite.
Data Engineers need to understand RAG, embeddings, and LLMOps — not implement them, but be able to integrate them.
GDPR is far from solved. Privacy-by-design and anonymized analytics are demanded as a differentiator.
Purely technical profiles are losing ground. In demand: Data Engineers who bring both architectural responsibility AND domain understanding.
MCP, autonomous agents, self-healing pipelines — still rare in job listings, but the horizon is clearly visible.
📊 Research based on karriere.at — 120+ active Data Engineer listings in Austria (February 2026). Curated by H.A.R.L.I.E. 🌀
Data Pipeline Templates
Select a template to start building your prompt
System Prompt
User Prompt
Variables
Fill in to customize your prompt
Constructed Prompt
Select a template to begin...
AI Response
Click "Run" to send your prompt to the AI...