DEV Community

1.4.9 Planner Preprocessing: Subquery Pull-up, Predicate Pushdown, Equivalence Classes

1.4.9 Planner Preprocessing: Subquery Pull-up, Predicate Pushdown, Equivalence Classes

So far we have seen the planner build candidate paths and rank them by cost, choose join methods and join orders, and estimate those costs from statistics. But before any of that cost comparison begins, the planner does something else first. It takes the incoming Query tree and rewrites it once into a shape that is easier to optimize. That rewriting is preprocessing.

In processing order, preprocessing is the planner's very first step. Yet why it pays off only becomes clear once you understand cost and joins. Why it is cheap to shrink the rows entering a join is explained by join cost, and why it matters to manufacture a missing join condition to avoid a Cartesian product is explained by join order. That is why a step that runs first in processing is the last one we look at in this book.

Preprocessing Rewrites the Query Tree Before Cost Comparison Starts

The planner's entry point runs preprocessing once for each Query it receives. The function that gathers this work is subquery_planner. The name makes it sound like it only deals with subqueries, but it actually does everything that should happen exactly once per Query object. Only after that preparation finishes does the cost-based candidate comparison begin.

One thing is worth pinning down here. The rewriter from chapter 1.3 also rewrites the Query tree, but its purpose is different from this preprocessing. The rewriter expands views into their contents, injects RLS policies as conditions, and fills in default values for columns omitted from an INSERT. In other words, it settles what the query actually means. Preprocessing does not touch that settled meaning at all. It leaves the rows that will come out exactly as they are, and rearranges only the shape of the tree so the same result can be produced more cheaply. If the rewriter finishes deciding "what to do," preprocessing is the first trimming of "how to do it."

Why Rewrite at All?

The Query tree the rewriter hands over still carries the shape the SQL author wrote. People write for readability, not for the planner's convenience. For clarity they wrap parts of the query in subqueries and pile several conditions into one WHERE clause. Leave that shape alone, and the cost and join comparison never sees the better candidates. A condition that only needs to filter one table tags along all the way to the join and inflates the join's input, and with no join condition spelled out in the SQL, the only join order available between two tables is a Cartesian product.

Preprocessing tidies up this shape so the cost comparison gets to see all its candidates. The five sections that follow look at that tidying one piece at a time.

Pull Subqueries Up Into the Parent Query

The first thing it does is pull subqueries in the FROM clause up into the parent query. Merging a subquery into the parent as a single level is called flattening, or subquery pull-up, in the sense of lifting the subquery upward. It runs in the opposite direction from the predicate pushdown we will see later, which pushes conditions down.

Take this query as an example:

SELECT *
FROM ( SELECT * FROM customers WHERE status = 'active' ) c
JOIN orders o ON c.id = o.customer_id;

The author wrapped the part that selects only active customers in a subquery to keep it readable. If the planner leaves this subquery as is, selecting active customers and joining with orders split into two stages. The inner subquery first produces an intermediate result of "active customers," and the outer query joins that result with orders. The problem is that because these two stages are processed separately, the planner cannot put customers and orders together into its join-order search. customers is trapped inside the subquery while orders sits outside, so the two belong to different processing units.

Flattening removes this boundary. It pulls the subquery up into the parent query, conceptually turning the query above into this shape:

SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.status = 'active';

The SQL above is not text that actually gets re-emitted; it just renders the result of the tree transformation in SQL form. Now customers and orders are two tables in the same query, and the planner can put both into its join-order search to compare which one to read first and how to join them. The WHERE status = 'active' condition also moves outward and becomes a target for the predicate pushdown we will see shortly.

Not every subquery can be merged. If the subquery contains aggregation (GROUP BY or aggregate functions), window functions, DISTINCT, LIMIT, ordering, or HAVING, it is not merged. These features need the subquery computed as one closed result for its meaning to be preserved. For instance, if the subquery computes per-customer order totals with GROUP BY, mixing and merging it with the outer table breaks the aggregation grouping. So the planner only pulls up simple subqueries that produce the same result either way, and plans the rest as closed units on their own.

IN and EXISTS subqueries that appear in a WHERE clause or a JOIN condition are handled similarly. Conditions like WHERE customer_id IN (SELECT ...) or WHERE EXISTS (SELECT ...) are converted into joins when possible. Converting them lets the planner pick an order and method like any other join and optimize it, instead of checking the subquery separately for each outer row. This internal form of an IN or EXISTS turned into a join is called a semi-join, and it is not SQL syntax. The author does not write SEMI JOIN; the planner turns the IN or EXISTS into a kind of join internally.

Compute the Expressions That Can Be Reduced to Constants Ahead of Time

Next the planner scans the expressions in the query and computes ahead of time the parts that can be computed early. This is called constant folding.

WHERE amount > 2 + 2   -- as written
WHERE amount > 4       -- after constant folding

2 + 2 is a constant with no reason to be re-added for each row, so it is reduced to 4 ahead of time. There are cases where the result is determined even with non-constant parts mixed in. For example, if some view or condition expands into a form like WHERE is_vip OR true (is_vip being a boolean column for whether a customer is a VIP), it is always true whether is_vip is true or false, so it reduces to WHERE true and the condition itself disappears. Conversely, if AND false creeps in, that condition is always false and gets tidied away entirely.

That said, not every function is computed ahead of time. A function that can return a different value each time it is called must not be precomputed. nextval(), which advances a sequence every time it is called, is the classic example. PostgreSQL precomputes a function only when it is marked immutable (the property of always returning the same output for the same input); otherwise it reduces only the arguments as far as possible and leaves the function call itself for execution time.

This step has two effects. One is that the executor stops repeating the same computation for every row. Instead of adding 2 + 2 a million times across a million rows, it reduces it to 4 once when the plan is built. The other is that with expressions simplified, the transformations that follow recognize conditions more easily. For instance, the equivalence class we will see at the end of this section looks for "column = constant" conditions like region_id = 5 to spread to other tables, and if the expression sits unreduced as region_id = 2 + 3, it fails to recognize that as a constant comparison and misses the chance to spread it. Constant folding has to reduce 2 + 3 to 5 ahead of time for the next step to work.

Turn an Outer Join Into a Simpler Inner Join

An outer join costs more than an inner join. First, the difference between the two. An inner join produces only rows that have a match in both tables. An outer join, specifically a LEFT JOIN, keeps every row of the left table and, where there is no match on the right, manufactures a row with the right columns filled with NULL. This NULL-filling adds work and narrows the planner's freedom to reorder joins.

Yet there are cases where a LEFT JOIN the author wrote produces exactly the same result as an inner join. The planner finds those cases and turns the outer join into a simpler inner join. This does not mean an inner join ranks above an outer join; it means simplifying toward the side without the extra work of NULL-filling.

The key to the decision is the WHERE condition applied to the finished join. In SQL, WHERE applies to the entire result after FROM and JOIN have been processed.

SELECT *
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.amount > 100;

The LEFT JOIN was used to keep every customer. But the join result has WHERE o.amount > 100 on it. A row where orders is NULL-filled because there is no matching order has o.amount as NULL, and NULL > 100 is not true, so that row is dropped by WHERE. In other words, the NULL-filled rows cannot survive into the final result anyway. So there is no reason to do the NULL-filling in the first place. The planner turns this LEFT JOIN into an inner join. Conceptually the query above becomes this:

SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.amount > 100;

For this reasoning to hold, the WHERE condition has to be the kind that "can never be true once NULL comes in." o.amount > 100 is such a condition. If amount is NULL the comparison cannot be true, so the NULL-filled rows are guaranteed to be filtered out. (PostgreSQL calls an operator that cannot be true on NULL input a strict operator. Most comparison operators, =, >, <, and so on, qualify.)

If the condition were o.customer_id IS NULL instead, it is the opposite. customer_id is the join key, so it is NULL only in rows that were NULL-filled for lack of a match. Letting only those rows through means "select only customers with no orders," which the planner recognizes as an anti-join. Like the semi-join, the anti-join is not SQL syntax but a kind of join internal to the planner.

The reason simplifying outer joins matters is that the fewer outer joins there are, the more freely the planner can shuffle the join order.

Push Each Condition Down to the Table It References

When you write several conditions joined by AND in a WHERE clause, they sit bundled together at the top of the query in the SQL text. But each of those conditions actually touches a different table. Some look at only one table, some at two together. The planner pushes each condition down toward the table it actually references, so it applies right where that table is read. This transformation is called predicate pushdown, and internally PostgreSQL calls it qual distribution. qual is short for qualification, PostgreSQL's shorthand for the filter conditions that go into a WHERE or ON.

SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region_id = 5;

c.region_id = 5 references only customers. Treat this condition literally as "a filter on the joined result," and it filters by region only after orders and customers are fully joined. The rows to be discarded were dragged all the way to the join. Instead the planner pushes this condition down to where customers is read. Conceptually it becomes this:

SELECT *
FROM orders o
JOIN ( SELECT * FROM customers WHERE region_id = 5 ) c
  ON o.customer_id = c.id;

The "after" above does not mean the tree actually turns into a subquery; it renders in SQL form the fact that region_id = 5 is applied first where customers is read. customers narrows to only the customers in region 5 first, and only those few customers join with orders.

The difference is clear in numbers. If there are 100,000 customers nationwide and 5,000 in region 5, then without pushing the condition down, all 100,000 customers are joined with orders and then filtered. Push it down, and only 5,000 enter the join. A join costs less the smaller its input, so the same result comes at far less cost.

As a result of this distribution, each table's workspace (RelOptInfo) accumulates a list of conditions that apply only to that table. Each condition goes into the list wrapped in an object called a RestrictInfo, whose additional contents the earlier section on the path tree covered. A join condition that references two tables together is handled not at one table but at the join, and among those, an equality join condition like o.customer_id = c.id gets one more level of special handling. That handling is the very last transformation we will see next.

Pull Out Missing Conditions by Grouping Equal Values Together

The last transformation fills in missing conditions by grouping equality conditions together. When several = conditions hang off a WHERE clause, the planner gathers them into a group of values known to be equal to each other. This group is called an equivalence class.

The principle is simple. If a = b and b = c, then a = c is also true. The planner can pull out this a = c, which is nowhere in the SQL, and use it as a new join condition. Why this is useful shows up in join order.

SELECT *
FROM orders o, customers c, regions r
WHERE o.region_id = c.region_id
  AND c.region_id = r.id;

-- conditions written by the author: orders-customers, customers-regions
-- condition pulled out by the planner: o.region_id = r.id (orders-regions)

The author wrote only the orders-customers condition and the customers-regions condition. There is no condition directly linking orders and regions. If the planner used only the written conditions, then trying to join orders and regions first, with no condition linking them, the only thing it could make is a Cartesian product. A Cartesian product is a join that produces every combination of rows from two tables, so 100,000 orders and 100 regions yield ten million rows. Almost always a disaster, so the planner discards that join order. But if the equivalence class pulls out o.region_id = r.id as a new condition, the planner can now join orders and regions directly without producing a Cartesian product, opening up many more join orders for cost comparison.

Comments

No comments yet. Start the discussion.