.comment-link {margin-left:.6em;}

Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Saturday, August 20, 2005

Partition Not-Quite-Pruning

I'm in the middle of a migration of a fact table from one structure to another. The key difference is in the partitioning key -- the old structure was partitioned according to the month in which the data arrived, wheras the new structure is partitioned according to the fiscal month of the transaction. It's a very slight but very important difference, because it is the fiscal month that the users will query by, hence partition pruning is very much more likely to occur in the new structure.

The data is being migrated one new-table partition at a time, to allow MV maintenance-without-tears. Now a partition of the new table sources data almost entirely from one partition of the old table, but not quite. Probably around 99.5% of the new partition's data will come from a single partition of the old table (and will represent about 99.5% of that old partition's data). The other 0.5% can come from pretty much any other partition of the old table.

So, here's the problem -- does the optimizer scan the entire old table in lovely efficient multiblock reads (direct, reads, because of the query parallelism), or does it use single block reads and the bitmap index on fiscal_month? Either one of these is unsatisfactory, so here's the solution I came up with.

Instead of the straightforward statement:

Select *
From old_table
Where fiscal_month = '01-jan-2005'

... we instead can use:

Select *
From old_table
Where fiscal_month = '01-jan-2005' and
file_load_month = '01-jan-2005'
Union All
Select *
From old_table
Where fiscal_month = '01-jan-2005' and
file_load_month != '01-jan-2005'

The first result set selects the 99.5% of the required rows that represent 99.5% of the old tables file_load_month-based partition for '01-jan-2004', and it does this with an efficient (multiblock) full scan of a single partition.

The second result set selects the 0.5% of the required rows that are in any partition other than that for file_load_month '01-jan-2004', and it does this with single block reads through the combination of the bitmap indexes on fiscal_month and file_load_month.

So in this case we got the best of both worlds. The query time dropped from around 56 seconds to 17 seconds.

Now, what should we call this? "Partition Not-Quite-Pruning"? "Partition Semipruning?" ... ?


Continued here ...

8 Comments:

At 6:07 AM, Anonymous Anonymous said...

Interesting. Can you please post the execution plans of both the queries?. I assume you have compared the outputs of both queries and confirmed it is the same.

 
At 6:17 AM, Anonymous Anonymous said...

I don't really see the issue here. May be I am missing somethng. Both quereis are against the old table. The old table is partitioned by the file_load _date. The first query don't have that column in the 'where clause'. So no partition pruning is happening. The second query is selecting the records based on file_load_date which is the parition key. So partition pruning happens and the query is faster. What is the issue here?.

 
At 9:49 AM, Blogger David Aldridge said...

Janu,

shall do. See the followup posts to this. I'll get some plans that are more instructive than those I currently have.

Anon,

The issue is that the clustering of the required values in a known partition of the fact table allows us to do something that the optimizer won't do, which is retrieve the required rows using a hybrid solution that is partly partition pruning and partly not partition pruning. We leverage out own knowledge of the data patterns to make the query more efficient.

As I point out in the followup post, the optimizer actually has enough information to do this, but doesn't.

 
At 2:33 PM, Anonymous Anonymous said...

This post remindes me a few of the postings by the "world famous Oracle expert".

I don't think you have a good understanding of what you are posting!. This comment is specific this thread does not apply to most of the threads in this blog.

 
At 4:47 PM, Blogger David Aldridge said...

I think my understanding is good, but maybe my explanation is not. Perhaps I need to draw a diagram or something. I hate doing that, but I'll see what I can do.

 
At 4:50 PM, Blogger Thomas Kyte said...

anonymous -- I see what he is doing here and it is quite nice actually.

old_table is partitioned by file_load_month.

Hence:

Select *
From old_table
Where fiscal_month = '01-jan-2005' and
file_load_month = '01-jan-2005'

will tend to full scan the single partition that is that file_load_month (which according to Dave - has MOST of the data for that fiscal year, what he is after)

And then add to that:

Select *
From old_table
Where fiscal_month = '01-jan-2005' and
file_load_month != '01-jan-2005'

which will either

a) full scan EVERY OTHER partition
b) use the bitmap index he set up on fiscal_month and the bitmap index on file_load_month to find the very few rows that did not get loaded in the same file_load_month.


Conceptually -- I see it easily:

full scan one partition, getting MOST of the data

Add to it the remaining data acquired via a bitmap index combine.

Like he said -- best of both worlds.

 
At 6:24 PM, Anonymous Anonymous said...

Almost sounds like the best of 3 worlds. The new query, which runs faster, is being put in place temporarily while they move the data model in line with the data usage.

 
At 4:04 PM, Anonymous Anonymous said...

Hi David,

nice solution, but there is IMO an additional problem. In many cases there is some rule limiting the delay of the loaded data, so the query can be extended as follows:

Select *
From old_table
Where fiscal_month = '01-jan-2005'
and file_load_month between timestamp1 and timestamp2

The rules differs dependent on business, but probably in lot of cases the data can't be loaded in the future and there is some deadline to close the case.
From this point of view the possibility to access each partition with a different access path is of course important, but additional feature of pruning the partition based on a predicate with a non partition key column (as fiscal month in our case) will be even more important.
In your case where only .5% of data is spread there is no really a big difference, but assuming 50% of data is spread in 3 partitions of 100 this will be an issue.
The key is to simple not access those partition without 01/2005 fiscal_month data. Whether a partition contains a particular fiscal month or not can be cheaply verified if this column is indexed…
In the mean time a workaround can be used creating a view filtering the business criterion and limiting the technical load dimension; e.g.

create view my_fical_month_01_2005 as
Select *
From old_table
Where fiscal_month = '01-jan-2005'
and file_load_month between '01-apr-2005' and '01-jan-2005'

Regards,

Jaromir D.B. Nemec

 

Post a Comment

<< Home