.comment-link {margin-left:.6em;}

Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Monday, May 22, 2006

The Prime Number Few

I was chasing links through Wikipedia the other day, which always leads somewhere interesting.

In this case I fell into the deep waters of linguistics and learned about garden path sentences. They caught my eye in particular because the article includes one of my favourite jokes, albeit with a slightly different construction to my usual one.

"If time flies like an arrow, do fruit flies like a banana?"

Now that I've read a deconstruction of the joke, it seems a little less funny though.

Thursday, May 18, 2006

The Three Pillars of Oracle Data Warehousing

Introduction

This is a basic topic for Oracle data warehousing beginners, based on some ideas that I'm hoping will stop buzzing round in my head if I commit them to virtual paper.

There are three Oracle features that provide a foundation for successful data warehousing:
  • Partitioning
  • Parallelism
  • Psummary Tables (the "p" is silent)
Here are the benefits that they bring to the system.


Partitioning

In a data warehouse we are interested in accessing large numbers of rows that are identified through some common attribute. Very often the same small numbers of attributes are used over and over again to filter the data. The most common is some form of date filter ("give me sales for this January"), and there are often others that are used very commonly ("give me inventory levels for repairable items", "give me backorders for California").

In an OLTP system we would use an index or a full tablescan to get this subset of data, but that can be inefficient if we are selecting a large number of rows representing a subset of the total table data because of the high number of logical i/o's required and the single block reads that are used to access the table.

In a data warehouse we can partition according to attributes commonly used as filters by the users, or according to logical child attributes of them (eg. users commonly filter by month but we partition by day). This achieves two important aims.
  1. Rows of data that are commonly used together are physically co-located.
  2. The optimizer can treat each partition as a table and can perform fast scans of multiple partitions to access the data ("partition pruning").
We can also use multicolumn or composite partitioning to partition by multiple attributes, so that we can partition both by date and by location and get partition pruning on either or both.

There are also benefits to manipulating bulk data -- we can use partition-based DDL operations to load and unload data from a table (a partition exchange) or to delete old data (partition drop or truncate).


Parallelism

In a parallel query operation the physical locations in which the required rows are held are subdivided into a number of ranges, each of which is then scanned by a different process
(a parallel query slave). The processes that read the data then pass it on to either a single query coordinator process or to another set of slaves which themselves pass the results to the query coordinator.

Insert, update and delete operations can also be parallelized to provide faster bulk data changes.

One of the key differences between serial and parallel queries are that the parallel queries read data directly from disk instead of checking for the required blocks in the SGAs block buffer area, and pass the result set directly to the user's PGA, thus completely bypassing the SGA. Read consistency is maintained through the query coordinator requesting a checkpoint to have commited dirty buffers of the scanned object written to disk before the PQ slaves start reading.

The coordination required in creating and controlling query slaves leads to an overall increase in resource usage which can easily overload a system, but when correctly implemented spare system resources can be used to improve the response time of queries that access or manipulate a large amount of data.

You probably need a lower degree of parallelism than you expect.


Psummary tables

OK, "summaries" then.

The static nature of data warehouse data allows the results of some frequently executed queries to be pre-computed and cached as summary tables. These are often created through the Oracle materialized view object type, although the use of an MV is not mandatory.

The query rewrite function of the cost-based optimizer can use the metadata stored in materialized view definitions or through declarations of query equivalence made through the DBMS_ADVANCED_REWRITE package (10g+ only) to redirect queries against large base tables (typically data warehouse fact tables) to the smaller summary tables. In most cases the summary table does not have to be an exact match for the query.

The major challenge in using summary tables is in finding an efficient mechanism for maintaining consistency between the summary and its base table(s).

The major benefit is that almost any end-user query can be executed extremely quickly with the correct use of summary tables.


PSummary


  • Partitioning allows efficient access to relatively large subsets of data and efficient bulk manipulation of data.
  • Parallelism allows the work of a single user process to be shared among multiple slave processes, leading to faster completion time but higher resource usage.
  • Psummaries provide extremely fast performance, but their maintenance and monitoring can require non-trivial efforts.
Now, hopefully these thoughts will leave my head alone.

Strange Thing To Carry

My kids (3, 4 & 6) intercepted and detained a pedestrian passing our house the other evening on the grounds that he was in Public Possession of Dogs, and they needed to subject him to the usual line of questioning: "Names?", "Ages?", "Boys or Girls?", "Do they bite?" etc..

Strangely none of them noticed what I saw immediately, which was the holstered handgun on his hip. I'm sure he had a license to do so, but what he was expecting might happen to him in our Nice Suburban Neighborhood is anyone's guess.

On the other hand one of my wife's co-workers reported a sighting of a mountain lion about two miles southeast of us a couple of years ago. I'm inclined to wonder whether *ahem* alcohol was a factor in that case because we're definitely on the wrong side of the city for that kind of wildlife -- antelope, deer, foxes and the occasional coyote maybe, but how a lion would pass unnoticed around Colorado Springs from the mountains on the west to the plains on the eastern side is a mystery to me.

However the mountains to the west of us aparantly have the highest density of mountain lions throughout the Rockies, which leads to some basic precautions when hiking with the kids -- don't let them straggle or walk too far in front is all it really amounts to. (That link contains other interesting information on bubonic plague, hanta virus, avalanches etc. by the way). Sightings and other less fortunate encounters seem to be pretty common up around Boulder though.

Here's a prime example of the kind of plump, tender morsel that a lion would enjoy most.



It's amazing how fast those legs will carry him if he's told that there's a lion behind him!

Thursday, May 11, 2006

Getting the (Sub)Partition Name for a Row

Adapted from a response to a question posted on Oracle-l, here are three methods for finding out what partition or subpartition a table row is stored in, or which partition or subpartition a row will be stored in.


Logical Inference

For range-based or list-based partitioning or for range-list composite partitioning it is often feasible to infer the partition or subpartition name from the values of the partition and subpartition key columns, as long as a sensible partition naming convention has been chosen.

For example, for a range partitioned table defined in part by ...

Create table T (col1 date ...)
Partition By Range (col1)
(
Partition Y2006_M01 values less than (date '2006-02-01'),
Partition Y2006_M02 values less than (date '2006-03-01'),
Partition Y2006_M03 values less than (date '2006-04-01'),
...
)


...we can use a SQL function to provide the partition name:

To_Char(col1,'"Y"YYYY"_M"MM')


Note the use of double-quotes in the above formula to identify literal strings in the date format.

This is a fast and simple method, although an historically varying granularity on the partitions or a complex partitioning scheme could make it difficult to maintain.


DBMS_MView.PMarker

In cases where it is difficult or impossible to do perform the logical inference (hash partitioning, for example, or with complex multicolumn range/list partitioning) there are a couple of other techniques that you can use.

In the DBMS_MVIEW package there is a PMARKER function that returns the data_object_id for the object in which the row resides, and you can join to user/dba/all_objects using that.

Here's a funky example script using variable numbers of hash subpartitions ...

drop table t
/

create table t (col1 , col2 , col3 )
partition by range (col1)
subpartition by hash (col2)
(
partition p1 values less than (2) subpartitions 2,
partition p2 values less than (3) subpartitions 4,
partition p3 values less than (4) subpartitions 8
)
as
select mod(rownum,3)+1,
floor(dbms_random.value(1,256)),
floor(dbms_random.value(1,10))
from all_objects
where rownum < 101
/

with obj as
(select --+ materialize
data_object_id,
subobject_name
from user_objects
where object_name = 'T' and
object_type = 'TABLE SUBPARTITION')
select subobject_name,
col3
from T,
obj
where data_object_id = DBMS_MView.PMarker(t.rowid)
order by 1
/


My unsubstantiated guess is that this uses the file#/block# of the rowid to perform a look-up on the extent that contains the row. It therefore also ought to be possible to "hand-roll" a similar method based on extracting the file# and block# from the rowid's and joining to the dba/user/all_extents view (or a materialized subset of it, for performance reasons) to get the segment properties. I don't know if I'd care to go that route myself.


TBL$OR$IDX$PART$NUM()

The TBL$OR$IDX$PART$NUM() function gives you the appropriate partition number for a table value or set of values. It's an undocumented (except through metalink, if that counts) function with a couple of magic numbers in it, but the general format to use is ...

TBL$OR$IDX$PART$NUM("PARTITIONED_TABLE_NAME", 0, d#, p#, "COLUMN_NAME")


The easiest way to get the appropriate format and magic numbers to use for this function is to run a trace on a "with validation" partition exchange against the table of interest, and you'll pretty much get the complete SQL that you need.

So you might end up with something similar to ...

with   utp as
(select --+ materialize
partition_position,
partition_name
from user_tab_partitions
where table_name = 'MY_TABLE')
select utp.partition_name,
last_name,
first_name
from my_table,
utp
where utp.partition_position = TBL$OR$IDX$PART$NUM("MY_TABLE", 0, 0,
65535, "PART_COL")
/


This method has the advantage that you can just supply it with arbitrary values for the partition key column(s) and you will get the appropriate partition number/name. Therefore you can use it to answer the question "Which partitions will these rows go into?". If you use partition exchanges "without validation" then you can also use it to run periodic checks on whether your ETL process has been putting rows in the wrong partitions.

The disadvantage is that it is undocumented and has those pesky magic numbers, or course.


Summary

My preference is to use the method of logical inference, if possible.

The other two are evidently based on very different methodologies -- one being based on the physical location of the row and the other on values of the partition key column(s), so they can be applied to different situations. The TBL$OR$IDX$PART$NUM method is rather more amenable to performance tuning by pre-aggregating the table data based on distinct partition keys prior to applying the function call.

Wednesday, May 03, 2006

Machine Music

Oooh, this is nice.

http://video.google.com/videoplay?docid=-5503582578132361295

Monday, May 01, 2006

A Day Without Immigrants

So today is apparantly "A Day Without Immigrants" and I just realised that this means me as well, technically speaking. Does it not seem strange that so immigrants should be held to be the source of so many problems in America, of all countries? And people in the southern border lands appear to be particularly prone to that finger-pointing, when they are living on land that was originally part of a Mexico too weakened by its own war for independence to be able to defend it from the young states to the north.

I'm coming up to the eighth anniversary of my "stepping off the boat" as it were -- it's easy to work out as I just add on one year and a couple of days to my wedding anniversary, which is how long it took for my permanent resident visa to come through, plus a couple of weeks to get the last things packed and get a flight. I missed my first wedding anniversary by just five days I think. From the visa (and later job) interview questions I was asked I was apparantly suspected of being an economic migrant, which made the enormous paycut that I took to move from London, UK to Dayton, Ohio rather ironic.

On the other hand, I don't really think of myself as an immigrant. I'm think I'm really someone who is just hanging out here for a while ... seeing the sights, doing a bit of shopping, being married, having kids etc.. On the other hand I've only been back to England once in all that time, for my brother's wedding in Cambridge last year. With my parents living in Spain and my parents-in-law living near Rome we tend to pay fleeting visits to Heathrow every now and then just to see if they've finished building it yet (which they haven't) and to stock-up on the few goodies that are available there and not from our local Fine Imported Goods emporium -- ie. the commissary at Peterson AFB. They sell Flakes, Crunchies, Milk Chocolate Hobnobs etc, although they fly off the shelves because everyone panic-buys them apparantly.

So anyway I get itchy feet everytime I see Oracle blogsters writing of their travels, and we're regularly overcome by an urge to move back to Europe. Any part of it at all will do. We were recently thwarted in an attempted move to Slovenia or Romania, and now we're trying for Germany. My brother tells me we should be moving to Denmark -- in fact a friend of a friend ended up there after some kind of incident involving rowing to Iceland in a replica of a Viking longboat, during which voyage he apparantly had an affair with the wife of the group leader which must have involved either extraordinary levels of discretion or some extraordinarily tense mealtimes -- but frankly it's probably Hobson's choice where we land. If the USAF doesn't have positions for a Developmental Engineer at least available then it's not on the list.

So where was I?

Ah yes, the immigrant day thing. Well it turns out that most of my work colleagues will be in a Customer Acceptance Test for most of the day so they wouldn't notice me not being here (here being 1,200 miles from them) anyway. However, I applaud the priniciple and I may go down the Monica's Taco Shop for a breakfast burrito to show support for the cause.