The Oracle Sponge Blog -- An Oracle Data Warehousing Resource

Monday, June 12, 2006

Moving to WordPress

Blogger irritates me. Let's see if Wordpress makes me happy.

http://oraclesponge.wordpress.com/

I've used the supplied import functionality to move over posts and comments, and it seems to have gone pretty well. Internal links need to be resolved to the new UR, and some text formatting has gone astray, but I can work my way through that.

The category functionality is great, and the interface is much more snappy, so I'll probably stick with it.

Friday, June 02, 2006

Modelling of Condominiums

A condominium in this case being a geographical area of which two of more sovereignties agree to share dominium, but that title ought to do wonders for my statistics.

Anyway, I was thinking about the mapping between cities and zip codes today, which is a many-to-many relationship, and the thought popped into my head, "I wonder if there are any cities which belong to multiple countries?" In other words, are there any cities which are part of a condominium?

So I went looking for one, not expecting to find any, and I was right -- there appear to be none. Probably a good job too.

In fact the only decent example of a condominium currently in effect seems to be that of Pheasant Island, which is a condominium of France and Spain. It apparantly is shared on a six-monthly basis between the two nations, which leads to an interesting modelling problem -- a cyclic time-dependent relationship between geography and sovereignty. In other words, it's a time-share! (Stats again).

Anyway, I'll leave that as "an exercise for the reader".

It also raises anther question: back when one needed a passport to cross between the two countries, was a crime commited if a passport-less person was present on the island at the time that sovereignty changed? Perhaps none -- perhaps the condominium agreement allowed people of either nationality to be present on the island at any time, regardless of which state currently was in charge of it. But if the agreement did not allow this then could you charge a person with crossing a border if they were stationary at the time the offence occured?

These thoughts usually turn-up on Fridays, oddly enough. Must be something to do with blood sugar.

Monday, May 22, 2006

The Prime Number Few

I was chasing links through Wikipedia the other day, which always leads somewhere interesting.

In this case I fell into the deep waters of linguistics and learned about garden path sentences. They caught my eye in particular because the article includes one of my favourite jokes, albeit with a slightly different construction to my usual one.

"If time flies like an arrow, do fruit flies like a banana?"

Now that I've read a deconstruction of the joke, it seems a little less funny though.

Thursday, May 18, 2006

The Three Pillars of Oracle Data Warehousing

Introduction

This is a basic topic for Oracle data warehousing beginners, based on some ideas that I'm hoping will stop buzzing round in my head if I commit them to virtual paper.

There are three Oracle features that provide a foundation for successful data warehousing:

Partitioning
Parallelism
Psummary Tables (the "p" is silent)

Here are the benefits that they bring to the system.

Partitioning

In a data warehouse we are interested in accessing large numbers of rows that are identified through some common attribute. Very often the same small numbers of attributes are used over and over again to filter the data. The most common is some form of date filter ("give me sales for this January"), and there are often others that are used very commonly ("give me inventory levels for repairable items", "give me backorders for California").

In an OLTP system we would use an index or a full tablescan to get this subset of data, but that can be inefficient if we are selecting a large number of rows representing a subset of the total table data because of the high number of logical i/o's required and the single block reads that are used to access the table.

In a data warehouse we can partition according to attributes commonly used as filters by the users, or according to logical child attributes of them (eg. users commonly filter by month but we partition by day). This achieves two important aims.

Rows of data that are commonly used together are physically co-located.
The optimizer can treat each partition as a table and can perform fast scans of multiple partitions to access the data ("partition pruning").

We can also use multicolumn or composite partitioning to partition by multiple attributes, so that we can partition both by date and by location and get partition pruning on either or both.

There are also benefits to manipulating bulk data -- we can use partition-based DDL operations to load and unload data from a table (a partition exchange) or to delete old data (partition drop or truncate).

Parallelism

In a parallel query operation the physical locations in which the required rows are held are subdivided into a number of ranges, each of which is then scanned by a different process
(a parallel query slave). The processes that read the data then pass it on to either a single query coordinator process or to another set of slaves which themselves pass the results to the query coordinator.

Insert, update and delete operations can also be parallelized to provide faster bulk data changes.

One of the key differences between serial and parallel queries are that the parallel queries read data directly from disk instead of checking for the required blocks in the SGAs block buffer area, and pass the result set directly to the user's PGA, thus completely bypassing the SGA. Read consistency is maintained through the query coordinator requesting a checkpoint to have commited dirty buffers of the scanned object written to disk before the PQ slaves start reading.

The coordination required in creating and controlling query slaves leads to an overall increase in resource usage which can easily overload a system, but when correctly implemented spare system resources can be used to improve the response time of queries that access or manipulate a large amount of data.

You probably need a lower degree of parallelism than you expect.

Psummary tables

OK, "summaries" then.

The static nature of data warehouse data allows the results of some frequently executed queries to be pre-computed and cached as summary tables. These are often created through the Oracle materialized view object type, although the use of an MV is not mandatory.

The query rewrite function of the cost-based optimizer can use the metadata stored in materialized view definitions or through declarations of query equivalence made through the DBMS_ADVANCED_REWRITE package (10g+ only) to redirect queries against large base tables (typically data warehouse fact tables) to the smaller summary tables. In most cases the summary table does not have to be an exact match for the query.

The major challenge in using summary tables is in finding an efficient mechanism for maintaining consistency between the summary and its base table(s).

The major benefit is that almost any end-user query can be executed extremely quickly with the correct use of summary tables.

PSummary

Partitioning allows efficient access to relatively large subsets of data and efficient bulk manipulation of data.
Parallelism allows the work of a single user process to be shared among multiple slave processes, leading to faster completion time but higher resource usage.
Psummaries provide extremely fast performance, but their maintenance and monitoring can require non-trivial efforts.

Now, hopefully these thoughts will leave my head alone.

Strange Thing To Carry

My kids (3, 4 & 6) intercepted and detained a pedestrian passing our house the other evening on the grounds that he was in Public Possession of Dogs, and they needed to subject him to the usual line of questioning: "Names?", "Ages?", "Boys or Girls?", "Do they bite?" etc..

Strangely none of them noticed what I saw immediately, which was the holstered handgun on his hip. I'm sure he had a license to do so, but what he was expecting might happen to him in our Nice Suburban Neighborhood is anyone's guess.

On the other hand one of my wife's co-workers reported a sighting of a mountain lion about two miles southeast of us a couple of years ago. I'm inclined to wonder whether *ahem* alcohol was a factor in that case because we're definitely on the wrong side of the city for that kind of wildlife -- antelope, deer, foxes and the occasional coyote maybe, but how a lion would pass unnoticed around Colorado Springs from the mountains on the west to the plains on the eastern side is a mystery to me.

However the mountains to the west of us aparantly have the highest density of mountain lions throughout the Rockies, which leads to some basic precautions when hiking with the kids -- don't let them straggle or walk too far in front is all it really amounts to. (That link contains other interesting information on bubonic plague, hanta virus, avalanches etc. by the way). Sightings and other less fortunate encounters seem to be pretty common up around Boulder though.

Here's a prime example of the kind of plump, tender morsel that a lion would enjoy most.

It's amazing how fast those legs will carry him if he's told that there's a lion behind him!

Thursday, May 11, 2006

Getting the (Sub)Partition Name for a Row

Adapted from a response to a question posted on Oracle-l, here are three methods for finding out what partition or subpartition a table row is stored in, or which partition or subpartition a row will be stored in.

Logical Inference

For range-based or list-based partitioning or for range-list composite partitioning it is often feasible to infer the partition or subpartition name from the values of the partition and subpartition key columns, as long as a sensible partition naming convention has been chosen.

For example, for a range partitioned table defined in part by ...

Create table T (col1 date ...)
Partition By Range (col1)
(
Partition Y2006_M01 values less than (date '2006-02-01'),
Partition Y2006_M02 values less than (date '2006-03-01'),
Partition Y2006_M03 values less than (date '2006-04-01'),
...
)

...we can use a SQL function to provide the partition name:

To_Char(col1,'"Y"YYYY"_M"MM')

Note the use of double-quotes in the above formula to identify literal strings in the date format.

This is a fast and simple method, although an historically varying granularity on the partitions or a complex partitioning scheme could make it difficult to maintain.

DBMS_MView.PMarker

In cases where it is difficult or impossible to do perform the logical inference (hash partitioning, for example, or with complex multicolumn range/list partitioning) there are a couple of other techniques that you can use.

In the DBMS_MVIEW package there is a PMARKER function that returns the data_object_id for the object in which the row resides, and you can join to user/dba/all_objects using that.

Here's a funky example script using variable numbers of hash subpartitions ...

drop table t
/

create table t (col1 , col2 , col3 )
partition by range (col1)
subpartition by hash (col2)
(
partition p1 values less than (2) subpartitions 2,
partition p2 values less than (3) subpartitions 4,
partition p3 values less than (4) subpartitions 8
)
as
select mod(rownum,3)+1,
      floor(dbms_random.value(1,256)),
      floor(dbms_random.value(1,10))
from   all_objects
where rownum < 101
/

with   obj as
      (select --+ materialize
              data_object_id,
              subobject_name
       from   user_objects
       where  object_name = 'T' and
              object_type = 'TABLE SUBPARTITION')
select subobject_name,
      col3
from   T,
      obj
where  data_object_id = DBMS_MView.PMarker(t.rowid)
order by 1
/

My unsubstantiated guess is that this uses the file#/block# of the rowid to perform a look-up on the extent that contains the row. It therefore also ought to be possible to "hand-roll" a similar method based on extracting the file# and block# from the rowid's and joining to the dba/user/all_extents view (or a materialized subset of it, for performance reasons) to get the segment properties. I don't know if I'd care to go that route myself.

TBL$OR$IDX$PART$NUM()

The TBL$OR$IDX$PART$NUM() function gives you the appropriate partition number for a table value or set of values. It's an undocumented (except through metalink, if that counts) function with a couple of magic numbers in it, but the general format to use is ...

TBL$OR$IDX$PART$NUM("PARTITIONED_TABLE_NAME", 0, d#, p#, "COLUMN_NAME")

The easiest way to get the appropriate format and magic numbers to use for this function is to run a trace on a "with validation" partition exchange against the table of interest, and you'll pretty much get the complete SQL that you need.

So you might end up with something similar to ...

with   utp as
      (select --+ materialize
              partition_position,
              partition_name
       from   user_tab_partitions
       where  table_name = 'MY_TABLE')
select utp.partition_name,
      last_name,
      first_name
from   my_table,
      utp
where  utp.partition_position = TBL$OR$IDX$PART$NUM("MY_TABLE", 0, 0,
65535, "PART_COL")
/

This method has the advantage that you can just supply it with arbitrary values for the partition key column(s) and you will get the appropriate partition number/name. Therefore you can use it to answer the question "Which partitions will these rows go into?". If you use partition exchanges "without validation" then you can also use it to run periodic checks on whether your ETL process has been putting rows in the wrong partitions.

The disadvantage is that it is undocumented and has those pesky magic numbers, or course.

Summary

My preference is to use the method of logical inference, if possible.

The other two are evidently based on very different methodologies -- one being based on the physical location of the row and the other on values of the partition key column(s), so they can be applied to different situations. The TBL$OR$IDX$PART$NUM method is rather more amenable to performance tuning by pre-aggregating the table data based on distinct partition keys prior to applying the function call.

Wednesday, May 03, 2006

Machine Music

Oooh, this is nice.

http://video.google.com/videoplay?docid=-5503582578132361295

Monday, May 01, 2006

A Day Without Immigrants

So today is apparantly "A Day Without Immigrants" and I just realised that this means me as well, technically speaking. Does it not seem strange that so immigrants should be held to be the source of so many problems in America, of all countries? And people in the southern border lands appear to be particularly prone to that finger-pointing, when they are living on land that was originally part of a Mexico too weakened by its own war for independence to be able to defend it from the young states to the north.

I'm coming up to the eighth anniversary of my "stepping off the boat" as it were -- it's easy to work out as I just add on one year and a couple of days to my wedding anniversary, which is how long it took for my permanent resident visa to come through, plus a couple of weeks to get the last things packed and get a flight. I missed my first wedding anniversary by just five days I think. From the visa (and later job) interview questions I was asked I was apparantly suspected of being an economic migrant, which made the enormous paycut that I took to move from London, UK to Dayton, Ohio rather ironic.

On the other hand, I don't really think of myself as an immigrant. I'm think I'm really someone who is just hanging out here for a while ... seeing the sights, doing a bit of shopping, being married, having kids etc.. On the other hand I've only been back to England once in all that time, for my brother's wedding in Cambridge last year. With my parents living in Spain and my parents-in-law living near Rome we tend to pay fleeting visits to Heathrow every now and then just to see if they've finished building it yet (which they haven't) and to stock-up on the few goodies that are available there and not from our local Fine Imported Goods emporium -- ie. the commissary at Peterson AFB. They sell Flakes, Crunchies, Milk Chocolate Hobnobs etc, although they fly off the shelves because everyone panic-buys them apparantly.

So anyway I get itchy feet everytime I see Oracle blogsters writing of their travels, and we're regularly overcome by an urge to move back to Europe. Any part of it at all will do. We were recently thwarted in an attempted move to Slovenia or Romania, and now we're trying for Germany. My brother tells me we should be moving to Denmark -- in fact a friend of a friend ended up there after some kind of incident involving rowing to Iceland in a replica of a Viking longboat, during which voyage he apparantly had an affair with the wife of the group leader which must have involved either extraordinary levels of discretion or some extraordinarily tense mealtimes -- but frankly it's probably Hobson's choice where we land. If the USAF doesn't have positions for a Developmental Engineer at least available then it's not on the list.

So where was I?

Ah yes, the immigrant day thing. Well it turns out that most of my work colleagues will be in a Customer Acceptance Test for most of the day so they wouldn't notice me not being here (here being 1,200 miles from them) anyway. However, I applaud the priniciple and I may go down the Monica's Taco Shop for a breakfast burrito to show support for the cause.

Oracle Sponge -- Now Moved To Wordpress

Monday, June 12, 2006

Moving to WordPress

Friday, June 02, 2006

Modelling of Condominiums

Monday, May 22, 2006

The Prime Number Few

Thursday, May 18, 2006

The Three Pillars of Oracle Data Warehousing

Strange Thing To Carry

Thursday, May 11, 2006

Getting the (Sub)Partition Name for a Row

Wednesday, May 03, 2006

Machine Music

Monday, May 01, 2006

A Day Without Immigrants

About Me

Spongeworthy Links

Some Articles

Practical Guide To Oracle Data Warehousing

Best Of The Sponge (IMHO)

Materialized View Articles

Sponge Lists

Spongeworthy Documentation

Version 9i

Version 10G

Previous Posts

Archives