The data pipelines running in production at most companies were designed for a world that no longer exists.
That world had predictable consumers. Daily ETL jobs ran on schedule. Analytical queries came from a known set of dashboards and notebooks. Human engineers, who understood the implicit contracts between systems, were the ones writing new queries and orchestrating new flows.
That world is gone. The new consumer of your data pipeline is an agent. It writes its own queries, calls APIs you didn't design for, retries on failure in patterns no human would, and orchestrates flows the original architects never anticipated.
From your data pipeline's perspective, an agent is a consumer that doesn't know any of the implicit contracts the pipeline was designed around — and breaks them with breathtaking efficiency.

The Trust Assumptions Your Pipeline Doesn't Know It's Making
Every production pipeline carries a stack of implicit trust assumptions that nobody documented because nobody needed to. The schema is what the DDL says. The semantics of a column are what the team agreed they were in a Slack thread two years ago. The query patterns are what the analytics team has been running for six quarters.
What an agent sees when it reads your data warehouse:
- Column names, but not the cultural context that explains them
- Data types, but not the semantics encoded in their values
- Query syntax, but not the cost model behind execution
- Tables, but not the implicit ownership and update contracts that govern them
The Fintech Reconciliation Break
A mid-sized fintech had a daily reconciliation pipeline that compared payment processor records against internal ledgers. The pipeline had run reliably for three years, every night at 2 AM, completing in roughly 40 minutes.
When they introduced a finance ops agent, one of its early actions was to "verify the reconciliation status" by querying the underlying tables — several times during business hours.
What broke:
- The reconciliation job assumed exclusive access to a temporary aggregation table. Agent queries that touched the table while the job was running collided with it.
- Some agent queries triggered cache invalidation, slowing other analytical workloads.
- One agent query produced a plan that ran up a six-figure compute bill before it was killed.
- The pipeline's daily SLA was missed for the first time in eighteen months.
None of this was a bug in the pipeline. The agent was a new consumer that no one had told the pipeline about — and the pipeline didn't have the means to defend itself.
The Schema Contract Problem
The first thing agents break is the schema contract — not the formal DDL, but the human-meaningful semantic contract that lives on top of it.
A column called revenue_amount is, in DDL terms, a numeric column. In your team's actual usage, it might mean gross revenue before refunds, booked revenue for the period, or net revenue after returns. A human analyst learns the right interpretation by asking. An agent reads the column name and proceeds.
The failure mode isn't loud. The agent doesn't crash. It produces a number, presents it confidently, and moves on. Downstream consumers treat the number as authoritative. The wrongness propagates.
What Practitioners Are Doing About It
The patterns that work:
- Semantic layers and metric registries. Centralized definitions of business metrics that agents can query, rather than letting them roll their own from raw tables.
- Data contracts at table boundaries. Machine-readable contracts that describe not just types but semantics, valid value ranges, and update guarantees.
- Agent-readable column descriptions. Comprehensive column metadata that captures the semantic context human team members would have known by default.
- Restricted query surfaces. Agents query against curated views or semantic APIs rather than against raw tables.
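One way to picture a machine-readable contract is the minimal sketch below. Everything in it is an illustrative assumption rather than a standard: the ColumnContract and TableContract classes, the analytics.orders_daily table, the finance-data owner, and the choice of "gross revenue before refunds" as the meaning of revenue_amount. The point is only that the semantics a human learns by asking can live next to the schema, where an agent can read them and a validator can enforce them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnContract:
    name: str
    dtype: str
    meaning: str                      # the semantics a human would learn by asking
    unit: Optional[str] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass(frozen=True)
class TableContract:
    table: str
    owner: str
    update_guarantee: str
    columns: tuple

    def describe(self, column: str) -> str:
        """Return the text an agent should read before using a column."""
        for col in self.columns:
            if col.name == column:
                unit = f" ({col.unit})" if col.unit else ""
                return f"{self.table}.{col.name}{unit}: {col.meaning}"
        raise KeyError(f"{column} is not part of the contract for {self.table}")

    def violations(self, row: dict) -> list:
        """List contract violations for one row; an empty list means it conforms."""
        problems = []
        for col in self.columns:
            if col.name not in row:
                problems.append(f"missing column {col.name}")
                continue
            value = row[col.name]
            if col.min_value is not None and value < col.min_value:
                problems.append(f"{col.name}={value} below {col.min_value}")
            if col.max_value is not None and value > col.max_value:
                problems.append(f"{col.name}={value} above {col.max_value}")
        return problems


# Hypothetical contract: table name, owner, and the chosen meaning of
# revenue_amount are assumptions made for illustration.
orders_contract = TableContract(
    table="analytics.orders_daily",
    owner="finance-data",
    update_guarantee="fully refreshed once per day",
    columns=(
        ColumnContract(
            name="revenue_amount",
            dtype="numeric",
            meaning="gross revenue before refunds",
            unit="USD",
            min_value=0.0,
        ),
    ),
)

print(orders_contract.describe("revenue_amount"))
print(orders_contract.violations({"revenue_amount": -5.0}))
```

In a real platform the same information would more likely live in a catalog or contract file than in application code; the shape of the data matters more than where it is stored.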

The Cardinality Problem: Agents Query at Unexpected Scale
Pipelines are tuned for known query patterns. The optimizer is configured for them, the storage is partitioned for them, the cost limits are set for them. Agents generate queries the optimizer wasn't expecting.
What goes wrong in practice:
- An agent answering "what's our top product by region this month" runs a query that scans a year of partitioned data because it didn't know about the partition-pruning convention.
- An agent doing exploratory analysis runs a SELECT * on a billion-row table, generating a multi-thousand-dollar BigQuery scan.
- An agent retrying a failed query escalates to a runaway cost incident before any rate limiter catches it.
A B2B SaaS company connected a customer-facing analytics agent to their data warehouse. In one month, a customer's cross-account analysis triggered query patterns the warehouse had never been tuned for. The bill for that customer's traffic alone that month: $61,000. The customer's contract revenue: $18,000.
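A guard in front of the warehouse can catch most of these before they cost anything. The sketch below is a rough illustration, not a production parser: admit_query, ConsumerBudget, and QueryRejected are hypothetical names, the SQL checks are crude text matching, and estimated_bytes is assumed to come from the warehouse's own estimate (for example a dry run), which is not shown here.

```python
import re
from dataclasses import dataclass


class QueryRejected(Exception):
    """Raised when a query is refused before it reaches the warehouse."""


@dataclass
class ConsumerBudget:
    consumer_id: str
    max_bytes_per_query: int
    max_bytes_per_day: int
    bytes_used_today: int = 0


def admit_query(sql: str, estimated_bytes: int, budget: ConsumerBudget,
                partition_column: str) -> None:
    """Reject the query or admit it; on admission, charge it to the budget."""
    if re.search(r"\bselect\s+\*", sql, flags=re.IGNORECASE):
        raise QueryRejected("SELECT * is not allowed for agent consumers")
    # Crude textual check: the partition column must appear after WHERE.
    parts = re.split(r"\bwhere\b", sql, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) < 2 or partition_column.lower() not in parts[1].lower():
        raise QueryRejected(f"query must filter on {partition_column}")
    if estimated_bytes > budget.max_bytes_per_query:
        raise QueryRejected("estimated scan exceeds the per-query limit")
    if budget.bytes_used_today + estimated_bytes > budget.max_bytes_per_day:
        raise QueryRejected("daily scan budget exhausted")
    budget.bytes_used_today += estimated_bytes


# Hypothetical consumer, limits, and table names, chosen only for the example.
budget = ConsumerBudget("analytics-agent", max_bytes_per_query=100, max_bytes_per_day=150)
pruned = "SELECT sku, region FROM sales WHERE event_date >= '2024-06-01'"

admit_query(pruned, estimated_bytes=80, budget=budget, partition_column="event_date")
try:
    # A retry of the same query would push the consumer past its daily budget.
    admit_query(pruned, estimated_bytes=80, budget=budget, partition_column="event_date")
except QueryRejected as exc:
    print(f"rejected: {exc}")
```

Charging the budget per consumer identity rather than per query is what stops the retry case: each retry is individually cheap, and only the running total reveals the problem.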
The Idempotency and Concurrency Problem
Most batch pipelines are built on idempotency assumptions that are quietly fragile. The job runs once a night. Nobody else writes to the staging table during that window. Agents do not respect the implicit "I run once" contract. They retry. They run on demand. They invoke the same job multiple times in the same hour.
What this produces:
- Race conditions. Two agent-triggered runs of the same job overlap on a shared staging table, producing inconsistent intermediate state.
- Idempotency violations. The job, run twice, produces different outputs because internal counters or temp tables don't reset cleanly.
- State drift. Agent retries, after partial failures, leave intermediate state the original pipeline didn't expect.
A logistics company introduced agents to handle customer service automation. One agent flow triggered an onboarding pipeline directly when customers asked status questions. About 0.3% of new shippers ended up with duplicate records in the order management system, requiring manual cleanup. The pipeline hadn't changed. The new consumer's behavior was incompatible with the pipeline's idempotency assumptions.
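A common defense against this class of failure is to require an idempotency key on every trigger and to serialize runs that share a key. The sketch below is a minimal in-memory illustration under stated assumptions: IdempotentRunner and the "onboard:shipper-42" key format are hypothetical, and a real system would keep the keys and results in a durable store with an expiry policy rather than in a process-local dictionary.

```python
import threading


class IdempotentRunner:
    """Runs a job at most once per idempotency key, even under concurrent triggers."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the two dictionaries
        self._key_locks = {}             # key -> lock serializing runs for that key
        self._results = {}               # key -> result of the completed run

    def run(self, key, job):
        with self._guard:
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:                   # overlapping triggers for one key wait here
            with self._guard:
                if key in self._results:
                    return self._results[key]
            # If job() raises, nothing is recorded, so a later retry runs it again.
            result = job()
            with self._guard:
                self._results[key] = result
            return result


created = []


def onboard_shipper():
    # Stand-in for the real onboarding pipeline.
    created.append("shipper-42")
    return "created"


runner = IdempotentRunner()
threads = [
    threading.Thread(target=runner.run, args=("onboard:shipper-42", onboard_shipper))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(created))  # five triggers, one onboarding run
```

The key has to be derived from the business entity (here, the shipper) rather than from the request, otherwise every retry looks like a new job.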
The Lineage and Provenance Problem
Agent-generated records often lack provenance metadata. When an agent errors, the partial outputs it produced may persist in downstream systems with no signal that they're untrusted. Multi-agent workflows produce records with chains of provenance that no system was built to track.
A healthcare data platform used agents to enrich incoming patient referrals with relevant context. When a regulator asked for the lineage of a specific enriched record, the platform team discovered they couldn't fully answer. The original pipeline had clean lineage. The agent-enriched records had partial lineage. Reconstructing the lineage took weeks of forensic effort.
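The gap in that story is a write path that never recorded who wrote what. A minimal sketch of a provenance record and a lookup index follows. The Provenance and LineageIndex names, the record identifiers, and the "complete"/"partial" status values are assumptions made for illustration, and the index is held in memory only to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    record_id: str
    agent_id: str
    run_id: str
    written_at: datetime
    input_ids: tuple
    status: str          # "complete", or "partial" if the agent errored mid-run


class LineageIndex:
    """Answers "which agent wrote this record, when, with what inputs"."""

    def __init__(self):
        self._by_record = {}

    def record_write(self, record_id, agent_id, run_id, input_ids, status="complete"):
        entry = Provenance(record_id, agent_id, run_id,
                           datetime.now(timezone.utc), tuple(input_ids), status)
        self._by_record[record_id] = entry
        return entry

    def who_wrote(self, record_id):
        return self._by_record[record_id]

    def trace(self, record_id):
        """Return every upstream record id reachable through recorded inputs."""
        seen = []
        stack = [record_id]
        while stack:
            entry = self._by_record.get(stack.pop())
            if entry is None:
                continue                 # a source record with no agent write
            for parent in entry.input_ids:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen

    def untrusted(self):
        """Record ids whose writing run did not finish cleanly."""
        return [rid for rid, e in self._by_record.items() if e.status != "complete"]


# Hypothetical two-agent chain over one referral.
index = LineageIndex()
index.record_write("referral-1-enriched", "enrichment-agent", "run-7", ["referral-1"])
index.record_write("referral-1-summary", "summary-agent", "run-8",
                   ["referral-1-enriched", "guideline-3"], status="partial")

print(index.who_wrote("referral-1-summary").agent_id)
print(sorted(index.trace("referral-1-summary")))
print(index.untrusted())
```

Recording the status at write time is what lets downstream systems tell a finished enrichment from the leftovers of a failed run.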
What an Agent-Grade Pipeline Looks Like
The fix is not to lecture agents about respecting implicit assumptions. The fix is to build pipelines that make the assumptions explicit and enforce them at the system level.
An agent-grade pipeline includes:
- Explicit data contracts at every boundary. Schema, semantics, update guarantees, ownership — all machine-readable.
- Semantic layer access by default. Agents query metrics and entities, not raw tables.
- Per-consumer quotas and budgets. Cost and concurrency limits enforced at the agent identity level.
- Quarantine zones for agent writes. Outputs are validated before propagating to trusted downstream systems.
- Reverse lineage indexes. Every record can answer "which agent wrote me, when, with what inputs."
- Reconciliation pipelines. Independent computation of critical metrics, with alerts on disagreement.
This is more infrastructure than most teams are running today. It is also the infrastructure that makes agent-augmented data platforms safe at scale.
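Two of the items above, quarantine zones and reconciliation, are small enough to sketch directly. The code below is an illustration under assumptions: the Quarantine class, the validator names, the record fields, and the 0.1% tolerance are all hypothetical, and a real quarantine would be a separate table or topic rather than a Python list.

```python
import math


def reconcile(primary: float, independent: float, rel_tol: float = 0.001) -> bool:
    """True when two independently computed values agree within rel_tol."""
    return math.isclose(primary, independent, rel_tol=rel_tol)


class Quarantine:
    """Holds agent writes until every validator passes."""

    def __init__(self, validators):
        self._validators = validators    # name -> function(record) -> bool
        self.held = []                   # (record, failed validator names)
        self.promoted = []               # records released to trusted systems

    def submit(self, record: dict) -> bool:
        failures = [name for name, check in self._validators.items()
                    if not check(record)]
        if failures:
            self.held.append((record, failures))
            return False
        self.promoted.append(record)
        return True


# Hypothetical validators for an agent-written payment record.
quarantine = Quarantine({
    "has_provenance": lambda r: "agent_id" in r,
    "non_negative_amount": lambda r: r.get("amount", 0) >= 0,
})

quarantine.submit({"agent_id": "finance-ops-agent", "amount": 120.0})
quarantine.submit({"amount": -40.0})

print(len(quarantine.promoted), len(quarantine.held))

# Alert when the pipeline's figure and an independent recomputation disagree.
if not reconcile(primary=1000.0, independent=1010.0):
    print("reconciliation mismatch: investigate before publishing")
```

The two pieces cover different failures: quarantine stops a bad record at the point of entry, while reconciliation catches totals that drifted despite every individual record looking valid.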
The data pipeline you trust is the first thing agents break. The pipeline that survives is the one rebuilt to expect them.