This is just a quick note to throw out a couple of the lesser-known options for the opt_estimate() hint – and they may be the variants that are most likely to be useful since they address a problem where the optimizer can produce consistently bad cardinality estimates. The first is the “group by” option – a hint that I once would have called a “strategic” hint but which more properly ought to be called a “query block” hint. Here’s the simplest possible example (tested under 12.2, 18.3 and 19.2):
rem
rem     Script:         opt_est_gby.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2019
rem

create table t1
as
select
        rownum                  id,
        mod(rownum,200)         n1,
        lpad(rownum,10,'0')     v1,
        rpad('x',100)           padding
from
        dual
connect by
        level <= 3000
;

set autotrace on explain

prompt  =============================
prompt  Baseline cardinality estimate
prompt  (correct cardinality is 10)
prompt  Estimate will be 200
prompt  =============================

select  /*+ qb_name(main) */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
;
I’ve generated a table of 3,000 rows with a column n1 holding 15 rows each of 200 distinct values. The query then aggregates on mod(n1,10) so it has to return 10 rows, but the optimizer doesn’t have a mechanism for inferring this and produces the following plan – the Rows value from the HASH GROUP BY at operation 1 is the only thing we’re really interested in here:
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   200 |   800 |    10  (10)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |   200 |   800 |    10  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------
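If you want to confirm the arithmetic behind the “correct cardinality is 10” comment rather than take my word for it, a quick check query (not part of the script above, just a sanity check) does the job:

select  count(distinct mod(n1,10))      group_count
from    t1
;

-- group_count = 10: mod(rownum,200) produces the values 0 to 199, and taking mod 10 of those leaves only the ten values 0 to 9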
It looks as if the optimizer’s default position is to use num_distinct from the underlying column as the estimate for the aggregate. We can work around this in the usual two ways with an opt_estimate() hint. First, let’s tell the optimizer that it’s going to over-estimate the cardinality by a factor of 10:
select  /*+
                qb_name(main)
                opt_estimate(@main group_by, scale_rows = 0.1)
        */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    20 |    80 |    10  (10)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |    20 |    80 |    10  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------
The hint uses group_by as the critical option parameter, and I’ve used the standard scale_rows=nnn to set a scaling factor that adjusts the result of the default calculation. At 10% (0.1) this gives us an estimate of 20 rows.
Alternatively, we could simply tell the optimizer how many rows we want it to believe will be generated for the aggregate – let’s just tell it that the result will be 10 rows.
select  /*+
                qb_name(main)
                opt_estimate(@main group_by, rows = 10)
        */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    10 |    40 |    10  (10)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |    10 |    40 |    10  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------
We use the same group_by as the critical parameter, with rows=nnn.
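A side note: since the hint is addressed to a query block by name it ought to follow the usual “global hint” conventions, so something like the following variant – with the aggregate inside an inline view carrying the qb_name and the opt_estimate() hint written at the top of the outer query block – is the sort of thing I’d expect to behave the same way. Treat it as a sketch, though; it isn’t one of the tests I ran for this note:

select  /*+
                qb_name(wrapper)
                opt_estimate(@main group_by, rows = 10)
        */
        v.grp, v.cnt
from    (
        select  /*+ qb_name(main) */
                mod(n1,10)      grp,
                count(*)        cnt
        from    t1
        group by
                mod(n1,10)
        )       v
;

-- expectation (untested here): the hint finds query block main by name, so the group by estimate should again be 10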
Next steps
After an aggregation there’s often a “having” clause, so you might consider using the group_by option to fix up the cardinality of the having clause if you know what the normal effect of the having clause should be. For example: “having count(*) > NNN” will use the optimizer’s standard 5% “guess” and “having count(*) = NNN” will use the standard 1% guess (there’s a quick sketch of the equality case a little further down). However, having seen the group_by option I took a guess that there might be a having option to the opt_estimate() hint as well, so I tried it. With autotrace enabled, here are three queries: first the unhinted baseline (which uses the standard 5% on my having clause), then a couple of others with hints to tweak the cardinality:
select  /*+ qb_name(main) */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
having
        count(*) > 100
;

select  /*+
                qb_name(main)
                opt_estimate(@main having scale_rows=0.4)
        */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
having
        count(*) > 100
;

select  /*+
                qb_name(main)
                opt_estimate(@main group_by scale_rows=2)
                opt_estimate(@main having scale_rows=0.3)
        */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
having
        count(*) > 100
;
The first query gives us the baseline cardinality of 10 (5% of 200). The second query scales the having cardinality down by a factor of 0.4 (which means an estimate of 4). The final query first doubles the group by cardinality (to 400), then scales the having cardinality (which would have become 20) down by a factor of 0.3 with the nett effect of producing a cardinality of 6. Here are the plans.
----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |    10 |    40 |    10  (10)| 00:00:01 |
|*  1 |  FILTER             |      |       |       |            |          |    -- 10
|   2 |   HASH GROUP BY     |      |    10 |    40 |    10  (10)| 00:00:01 |    -- 200
|   3 |    TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
----------------------------------------------------------------------------

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     4 |    16 |    10  (10)| 00:00:01 |
|*  1 |  FILTER             |      |       |       |            |          |    -- 4
|   2 |   HASH GROUP BY     |      |     4 |    16 |    10  (10)| 00:00:01 |    -- 200
|   3 |    TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
----------------------------------------------------------------------------

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     6 |    24 |    10  (10)| 00:00:01 |
|*  1 |  FILTER             |      |       |       |            |          |    -- 6
|   2 |   HASH GROUP BY     |      |     6 |    24 |    10  (10)| 00:00:01 |    -- 400
|   3 |    TABLE ACCESS FULL| T1   |  3000 | 12000 |     9   (0)| 00:00:01 |
----------------------------------------------------------------------------
It’s a little sad that the FILTER operation shows no estimate while the HASH GROUP BY operation shows the estimate after the application of the having clause. It would be nice to see the plan reporting the figures which I’ve added at the ends of the lines for operations 1 and 2.
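For completeness, here’s what the 1% “guess” for the equality case that I mentioned earlier would look like. I didn’t run this particular variant, so treat the figure in the comment as the expectation rather than a demonstrated result – the value 300 happens to be the actual count for every group in this data set:

select  /*+ qb_name(main) */
        mod(n1,10), count(*)
from    t1
group by
        mod(n1,10)
having
        count(*) = 300
;

-- expected estimate: 1% of the 200-row group by estimate, i.e. a cardinality of 2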
You may wonder why one would want to increase the estimate for the group by then reduce it for the having. While I’m not going to go to the trouble of creating a worked example it shouldn’t be too hard to appreciate the idea that the optimizer might use complex view merging to postpone a group by until after a join – so increasing the estimate for a group by might be necessary to ensure that that particular transformation doesn’t happen, while following this up with a reduction to the having might then ensure that the next join is a nested loop rather than a hash join. Of course, if you don’t need to be this subtle you might simply take advantage of yet another option to the opt_estimate() hint, the query_block option – but that will (probably) appear in the next article in this series.