Composable Transformations in SQL With Pipe Syntax

Learn how BigQuery's new pipe syntax makes data crunching easier than ever.

SQL, the trusty backbone of data processing for decades, is famously stable, even conservative, when it comes to innovation. Unlike rapidly evolving languages such as Python or JavaScript, SQL’s core clause structure has remained largely unchanged since its inception.

This makes sense: a language that powers mission-critical business processes across the globe must prioritize reliability and backward compatibility.

But does this mean innovation in SQL is stagnant? Far from it. The real challenge with SQL lies in making the language more intuitive while keeping its declarative nature intact. Changes often focus on ergonomics—how easily engineers can express complex logic.

In this area, Google’s recent introduction of the pipe syntax for BigQuery represents a significant step forward.

Enter The Pipe: Linear, Intuitive, and Flexible

The pipe syntax introduces a streamlined way to write SQL queries that mimic the order in which data engineers logically think about their queries. Instead of following rigid, clause-based query structures (e.g., FROM, SELECT, WHERE, GROUP BY), pipe syntax offers a more linear, modular approach. You begin with a dataset and progressively transform it, step by step.

This is more than syntactic sugar; it's a shift in how engineers and analysts can reason about their code. If you've worked with functional, pipeline-oriented languages like R or Scala, the syntax will feel familiar.

Functions are piped from one to the next, transforming the dataset incrementally, much like R’s magrittr or the tidyverse pipelines.

Pipe Syntax in Action: A Better Way to Query

Let’s take a typical query scenario: aggregating and filtering data from a table. Normally, with standard SQL, you might need to create subqueries or common table expressions (CTEs) to make it manageable. Pipe syntax removes that friction.

Consider the following basic table:

CREATE TABLE mydataset.produce AS (
  SELECT 'apples' AS item, 2 AS sales, 'fruit' AS category
  UNION ALL
  SELECT 'apples' AS item, 7 AS sales, 'fruit' AS category
  UNION ALL
  SELECT 'bananas' AS item, 15 AS sales, 'fruit' AS category
);
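
For comparison, here's one way to write the filter-and-aggregate in standard SQL. Notice that the clause order (SELECT first, FROM and WHERE after) doesn't match the order in which the data is actually processed:

SELECT item, SUM(sales) AS total_sales, COUNT(*) AS num_sales
FROM mydataset.produce
WHERE sales > 0
GROUP BY item;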

With pipe syntax, the same query is written sequentially, reflecting the logical flow:

FROM mydataset.produce
|> WHERE sales > 0
|> AGGREGATE SUM(sales) AS total_sales, COUNT(*) AS num_sales
   GROUP BY item;
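
Run against the sample table, both versions return the same two rows: apples with a total_sales of 9 across 2 sales, and bananas with a total_sales of 15 from a single sale.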

This "flow" of data transformations feels natural: first filter the rows, then aggregate. In traditional SQL, you might need to shuffle between clauses to get the same result.

Pipe syntax has the following key characteristics:

  • Pipe operators can be applied in any order, any number of times.
  • Pipe syntax works anywhere standard syntax is supported: queries, views, table-valued functions, and other contexts.
  • Pipe syntax can be mixed with standard syntax in the same query. For example, subqueries can use different syntax from the parent query (see the sketch after this list).
  • A pipe operator can see every alias that exists in the table preceding the pipe.
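
As a sketch of that mixing, a pipe query can serve as a subquery inside an otherwise standard query, reusing the produce table from above:

SELECT item, total_sales
FROM (
  FROM mydataset.produce
  |> AGGREGATE SUM(sales) AS total_sales
     GROUP BY item
)
WHERE total_sales > 10;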

Pipe syntax differs from standard syntax in the following ways:

  • Queries can start with a FROM clause.
  • The SELECT pipe operator doesn't perform aggregation. You must use the AGGREGATE pipe operator instead.
  • Filtering is always done with the WHERE pipe operator, which can be applied anywhere. The WHERE pipe operator, which replaces HAVING and QUALIFY, can filter the results of aggregation or window functions (see the example after this list).
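
Here's what that looks like in practice: where standard SQL would need HAVING, a pipe query simply applies WHERE after the aggregation:

FROM mydataset.produce
|> AGGREGATE SUM(sales) AS total_sales
   GROUP BY item
|> WHERE total_sales > 10;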

For more information and a complete list of pipe operators, see Pipe query syntax.

A Functional Flavor in a Declarative World

Engineers familiar with functional programming might see parallels here. Pipe syntax brings the concept of "composable transformations" into SQL. You can build queries step by step, adding transformations as you go, without jumping back and forth between SELECT and FROM or repeating columns in multiple clauses. It’s much like how you might apply functions in Haskell or R.

In fact, R users have been piping data with the %>% operator for years. The key advantage in both cases is readability. With pipe syntax, you read the query from top to bottom, following the logical sequence of operations, which makes debugging easier and helps break down complex queries into understandable chunks.

# Pipe example: multiply `x` by 2, filter rows where y > 5, and calculate the sum of `y`

library(dplyr)  # provides the %>% pipe plus mutate(), filter(), and summarise()

data <- data.frame(x = 1:5, y = 4:8)  # small sample data frame

result <- data %>%
  mutate(x2 = x * 2) %>%
  filter(y > 5) %>%
  summarise(sum_y = sum(y))

# View result
print(result)

R example using pipes

A Real-World Problem: Sales and Customer Insights

Let’s dive into a more practical use case that reflects a typical scenario for engineers and data analysts.

Suppose you’re working with e-commerce data, and you need to generate a customer sales report. You have two main tables:

  • customers: containing customer details like ID, region, and account creation date.
  • orders: containing order details like customer ID, product category, order amount, and order date.
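
To make the example concrete, here is a minimal, hypothetical schema for the two tables (dataset qualification omitted for brevity). The column names match the queries below; the types are assumptions:

CREATE TABLE customers (
  customer_id STRING,
  region STRING,
  account_creation_date DATE
);

CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,          -- references customers.customer_id
  product_category STRING,
  order_amount NUMERIC,
  order_date DATE
);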

Your goal is to generate insights about customer spending trends by region, focusing on:

  • Average order value (AOV) per customer.
  • The number of repeat customers.
  • Filtering only recent customers who joined within the last two years.

Here's how this query might have been written in standard SQL. It's not too complex but requires some focus:

WITH recent_customers AS (
    SELECT customer_id, region
    FROM customers
    WHERE account_creation_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
),
customer_spending AS (
    SELECT customer_id, SUM(order_amount) AS total_spent, COUNT(order_id) AS num_orders
    FROM orders
    GROUP BY customer_id
),
repeat_customers AS (
    SELECT customer_id, total_spent, num_orders
    FROM customer_spending
    WHERE num_orders > 1
)
SELECT region, AVG(total_spent) AS avg_order_value, COUNT(customer_id) AS repeat_customers
FROM repeat_customers
JOIN recent_customers USING (customer_id)
GROUP BY region;

This query is fine, but it involves several CTEs, which can be tedious to debug. Let's now rewrite it using pipe syntax and see how the flow improves both readability and maintainability.

The Same Query with Pipe Syntax

FROM customers
|> WHERE account_creation_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
|> JOIN orders USING(customer_id)
|> AGGREGATE SUM(order_amount) AS total_spent, COUNT(order_id) AS num_orders
   GROUP BY customer_id, region
|> WHERE num_orders > 1
|> AGGREGATE AVG(total_spent) AS avg_order_value, COUNT(customer_id) AS repeat_customers
   GROUP BY region;

Let’s break it down:

  1. Filter recent customers: We start by selecting only those customers who created accounts in the last two years.
  2. Join orders: Instead of defining a separate CTE for customer_spending, we join directly with the orders table.
  3. Aggregate spending: The pipe syntax allows us to perform an aggregation (SUM and COUNT) without defining multiple layers of CTEs.
  4. Filter repeat customers: Instead of using a separate CTE for repeat_customers, we just add a WHERE condition to filter customers with more than one order.
  5. Region-level aggregation: We finish with another aggregation to calculate the average order value per region and count the number of repeat customers.

The pipe syntax helps make the query linear and intuitive. We’re able to build the query in the same order that we would think about the problem. Every step in the pipeline transforms the data logically and predictably, making it easier to debug and maintain.

In the old approach, we had to define multiple nested CTEs which, while useful, can obscure the intent of the query. With pipe syntax, the query flow becomes more readable.
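
A practical side effect: because each step stands on its own, you can debug a pipe query by running only a prefix of the pipeline. For example, stopping after the first aggregation lets you inspect the per-customer totals before any further filtering:

FROM customers
|> WHERE account_creation_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
|> JOIN orders USING(customer_id)
|> AGGREGATE SUM(order_amount) AS total_spent, COUNT(order_id) AS num_orders
   GROUP BY customer_id, region;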

Why Engineers Should Care

The pipe syntax is more than a novelty—it’s an opportunity to rethink how we structure queries, especially complex ones.

It encourages breaking down logic into discrete, manageable steps, making queries easier to maintain, debug, and modify. For data engineers dealing with complex ETL processes or analysts working on large datasets, the cognitive load reduction can be substantial.

In the end, SQL may not be the fastest-moving language in the tech world, but when innovations like this happen, they matter. They make everyday tasks simpler, more efficient, and frankly, more enjoyable for engineers and analysts alike.