Tuesday, March 02, 2010

Three table inner join vs. the staging table

or as we perl users say, "There's more than one way to do it."

I was faced with an odd situation at work a few days ago, and it led me to do a lot more raw SQL tinkering than I typically do. The situation takes a little setup. (I'll say up front that if you don't fancy yourself a geek, don't torture yourself with the rest of this post, but tune in next week for tales of back-to-back concerts at the Newport!)

My team performs application integration, and our preferred method of doing so is with web services that are modeled. "Modeled" in our parlance refers to being managed by a piece of software called webMethods Modeler. It provides pretty pictures of running services, complete with checkmarks as steps get completed, or big red Xs if they fail.

The services we provide mainly move documents around, doing any needed transformation on them. We take all the possible fields that a business process could want, and create a master document definition called a "canonical", whose definition resembles, but is not congruent with, the normal lay usage. We take inbound documents and map them to the canonical format, and subscribers to the documents get the canonical mapped to their preferred format. Because of this, a complete process definition has two models: one for the publisher, one for each subscriber.


There is a database that correlates conceptual processes with a source and target model. There is another database used internally by webMethods that records the ID number of models that ran recently, and another table in that same database correlating model ID numbers with friendly names. OK, that's the basic setup.

The situation I ran into was that in our development server, the monitoring tool we use to look at conceptual business processes was showing me output from a model that wasn't related to it. I had to prove that my code wasn't at fault, and the easiest way to do that was to find another affected process. Except that in our development server, not many processes run with any regularity, so I couldn't just poke around in the monitoring tool and hope to stumble upon another service that ran recently.

No, I had to take it down to the database level. I searched the "recently executed models" table and got some ID numbers, then took those ID numbers and found their friendly model names, and then took those names and searched for them in the database that correlates source and target models with process names. I asked the same question I always do when I have to do a lot of low-level manhandling: Why wasn't there a button I could click that would do all this for me?

So that's what I'm in the process of designing right now, with the magic of SQL joins. Here is what I'm allowed to show you of my database session, trying different techniques of correlation. The first thing I did I can't show you, as that would expose database names, but basically I created a link between the monitoring tool's database and the webMethods system database that records model information.

Here is the "SQLPlus" commandline client. It's pretty basic, just a prompt, and it feeds most of what you type directly to the database, and has a few local odds and ends to help you with output formatting.

>sqlplus (user)@(database)

SQL*Plus: Release 9.2.0.8.0 - Production on Tue Mar 2 11:41:46 2010

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Enter password:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters and Data Mining options

SQL>

That's the prompt. "Enter your command here." For those of you who never used a commandline client of any sort, that's ok, and more power to you, but it's all we had back in the day, and for most one-off problems I'm much more comfortable with a prompt and a keyboard than with a dialog box and a mouse.

First, I set my output size to 40 lines per page and 120 columns, turn off wrapping (if something is too wide for the screen then, it will just get truncated), and specify the maximum width I want to see on the screen of a few table columns I know I'll be dealing with.

SQL> set pagesize 40
SQL> set linesize 120
SQL> set wrap off
SQL> column processkey format a20
SQL> column processlabel format a50
SQL> column processname format a35
SQL> column sourceprocess format a35
SQL> column targetprocess format a35

Next, here is sample of what's in "wmprocessrecent", the table that holds the ID numbers and timestamps of recently executed models.
SQL> select processkey,ctrl_insert_dt from wmprocessrecent where rownum < 20 order by ctrl_insert_dt desc;

PROCESSKEY           CTRL_INSE
-------------------- ---------
P5EL4ASIT13          01-MAR-10
P5EL4ASIT13          26-FEB-10
P5EL4ASIT13          26-FEB-10
P5EL4ASIT13          26-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P5EL4ASIT13          25-FEB-10
P123MGJ8KCSNZ        24-FEB-10
P123MGJ8KCSNZ        24-FEB-10
P123MGJ8KCSNZ        24-FEB-10
P123MGJ8KCSNZ        24-FEB-10

19 rows selected.

The "where rownum < 20" doesn't refer to the specific, unique rowid in the database, it refers to the number of output rows printed so far. My model, P5EL4ASIT13, was the only thing that had run in the last few days, apparently, so I had to dig deeper in the database to find something else that had run within the last week.

Here are a few rows of the process definition table, which correlates model ID numbers with friendly names.
SQL> select processkey,processlabel from wmprocessdefinition where rownum < 10;

PROCESSKEY           PROCESSLABEL
-------------------- --------------------------------------------------
PJGIFH9TRT           AEP_Wausau_X937_Process
PICKR3H4G00          AEP_CookNP_IndusPassport_Activity_Process
P1IVU4MWFR792        IFW_UDR_HeatRate_Process
P3T5UGBOX1NA         AEP_NMarketPJM_EnergySchedule_Process
P1IVU4MX2IPYR        IFW_PGads_SparkSpreadAndTemp_Process
PICKR3H3XGJ          AEP_CookNP_PeopleSoft_Department_Process
PICKR3H4CP0          AEP_CookNP_Department_Process
P1UAYJQK7            MyDummyTest_Process [ca - this is a dev server, after all]
P54KZKH9TC           AEP_PocketScout_StormData_FW_Process

9 rows selected.

...and the table in the remote database that correlates conceptual process names with source and target models.
SQL> select processname,sourceprocess,targetprocess from ifw_monitor_config@ifw where rownum < 10;

PROCESSNAME                         SOURCEPROCESS                       TARGETPROCESS
----------------------------------- ----------------------------------- -----------------------------------
AGCSchedules.WT.GCS[RT]             AEP_WT_Outbound_GenLoadSchedules_Pr AEP_GCS_GenLoadSchedules_Process
GenLoadSchedules.WT                 AEP_WT_Outbound_GenLoadSchedules_Pr
PositionManager-Deal                IFW_Solarc_Deal_Process             IFW_PositionManager_Deal_Process
TrueFactorsAndCoefficients.UDR.PBKR IFW_UDR_HeatRate_Process            AEP_PBroker_TrueFactorsAndCoefficie
MeterGeneration.CEAS.PGads[HRLY]    IFW_CEAS_Meter_Generation_Process   IFW_PGads_Generation_Process
ResourcePlan.GT.WT[MN/DA]_t         AEP_WebTrader_ResourcePlan_Process  AEP_WebTrader_ResourcePlan_Process
IncDecAwards.PBKR.PI[DLY]           IFW_PBroker_IncDecAwards_Process    IFW_PI_IncDecAwards_Process
GenerationAwards.PI[DLY]            IFW_PI_GenAwards_Process
LMP-Hourly.eDataFeed.nMarket[HRLY]_ AEP_nMarket_HourlyLMP_Process       AEP_nMarket_HourlyLMP_Process

9 rows selected.

(The formatting steps from above were to help that last query fit on the screen properly.)

My main goal here is to get a unique list of model ID numbers (processkey from wmprocessrecent), look up their friendly names (processlabel from wmprocessdefinition), and take all those friendly names and see what processname from ifw_monitor_config contains them as either a sourceprocess or targetprocess. This would be a three-way join, which in definition works on the Cartesian product of the tables, but in practice database optimizations prevent that from happening. A good thing, too, as it would be complete overkill and run the chance of bringing your database to its knees.

Speaking of optimizations, I'm interested in which syntactic approach to this problem will be the least resource intensive, and the main tool at my disposal for determining this is the Oracle timer. If a method runs fast, it probably uses fewer resources. So let's turn the timer on:

SQL> set timing on

My first attempt is an explicit nested join, and unless you live and breathe SQL, it will look pretty damned confusing.

SQL> select distinct processname from ifw_monitor_config@ifw a inner join
2 (select processlabel from wmprocessdefinition inner join wmprocessrecent using (processkey)) b
3 on b.processlabel in (a.sourceprocess, a.targetprocess);

PROCESSNAME
-----------------------------------
BilateralContract.WT.ODS[RT]
BilateralContract.WT.PT[RT]
zzPTMSReplacement.WT.SettlementCent
BilateralContract.WebTrader.EI[RT]
AcctngChangeNotification.PeopleSoft
Wausau-AEP-Internal

6 rows selected.

Elapsed: 00:00:00.08

First, the 2 and 3 under "SQL>" are not part of the query, they are SQLPlus prompting for more lines of input since the command hasn't been finalized with a semi-colon at the end. Rather than deconstruct the odd query itself, here is what the same query would look like without the "inner join x on y" syntax:

SQL> select distinct processname
2 from ifw_monitor_config@ifw a, WMPROCESSDEFINITION b, WMPROCESSRECENT c
3 where b.processkey = c.processkey and b.processlabel in (a.sourceprocess, a.targetprocess);

PROCESSNAME
-----------------------------------
BilateralContract.WT.ODS[RT]
BilateralContract.WT.PT[RT]
zzPTMSReplacement.WT.SettlementCent
BilateralContract.WebTrader.EI[RT]
AcctngChangeNotification.PeopleSoft
Wausau-AEP-Internal

6 rows selected.

Elapsed: 00:00:00.08

Much better to look at for database newbies. The top line defines all the columns I want to return, in this case just "processname", and "distinct" removes duplicates. There are three tables involved, and line 2 defines shortcut references to them as tables a, b, and c. Line 3 defines the predicates, the things that must be true in order to return the particular row you're looking at. Unsurprisingly, the query returned the same results in the same time, as the differences between the two queries are mainly syntactic sugar.

If the tables were larger, that .08 second query would have taken a lot longer and the time would have been more meaningful to compare against, but the small dev tables will work well enough for illustration here.

A second approach to the problem is to create a database table to stage the results of one join, and use the new table as part of a second join. There is a minor, one-time startup cost for creating the staging table:

SQL> create table ca_temp
2 (processlabel VARCHAR2(1024));

Table created.

Elapsed: 00:00:00.02

Here I'm giving the table one column, "processlabel", the only thing I'm concerned about in the first join. Next I populate it with the friendly names of models that have run recently.

SQL> insert into ca_temp
2 (select distinct processlabel from wmprocessdefinition
3 inner join wmprocessrecent using (processkey));

5 rows created.

Elapsed: 00:00:00.04

Here the "inner join" syntax is a little less confusing. "processkey" is the column common to both tables, and we're using it to join them into one set, returning unique friendly names from wmprocessdefinition, and inserting them into the staging table.

While the table creation is a one-time cost, the insertion into it could be a periodic cost, rather than a per-query cost. If I refresh this table every hour, say, then my only per-query cost will be the final step, correlating models with processes:

SQL> select distinct processname from ifw_monitor_config@ifw a inner join ca_temp b
2 on b.processlabel in (a.sourceprocess, a.targetprocess);

PROCESSNAME
-----------------------------------
BilateralContract.WT.ODS[RT]
BilateralContract.WT.PT[RT]
zzPTMSReplacement.WT.SettlementCent
BilateralContract.WebTrader.EI[RT]
AcctngChangeNotification.PeopleSoft
Wausau-AEP-Internal

6 rows selected.

Elapsed: 00:00:00.02

Another approach to reducing the visual overhead of the three-table join is to use a "view". A view is a query that looks to the user like a table. You query against it, and the database interpolates your query with the view definition's query.

The view below contains the same query that is used above to insert into the staging table.

SQL> create view recent_models as
2 select distinct processlabel
3 from WMPROCESSDEFINITION inner join WMPROCESSRECENT
4 using (processkey);

View created.

Elapsed: 00:00:00.06

Here there is also a one-time setup cost, but no speedup on the final query:

SQL> select distinct processname from ifw_monitor_config@ifw a inner join recent_models b
2 on b.processlabel in (a.sourceprocess, a.targetprocess);

PROCESSNAME
-----------------------------------
BilateralContract.WT.ODS[RT]
BilateralContract.WT.PT[RT]
zzPTMSReplacement.WT.SettlementCent
BilateralContract.WebTrader.EI[RT]
AcctngChangeNotification.PeopleSoft
Wausau-AEP-Internal

6 rows selected.

Elapsed: 00:00:00.08

Semantically, the query against the "recent_models" view is the same as the three-table joins, and runs in the same time (sometimes a scratch slower, as some up-front work is done to create the final SQL statement).

Based on my (completely unscientific) tests (that would make a statistician cry), the staging table that updates periodically is the most cost-efficient, especially since in my team's usage, it will be OK for the data to be slightly stale. We will be more interested in models that have run in the current day or earlier than in real-time results. A hybrid of the view and the staging table is the "materialized view". It is basically a staging table with built-in refresh controls, and not a true view at all.

Unfortunately, I don't have privileges to create a materialized view in our database, but the creation SQL would go something like this to update the query hourly:

create materialized view mv_recent_models
next sysdate + 1/24
as select distinct processlabel
from wmprocessdefinition inner join wmprocessrecent
using (processkey);

So what was the end result of all this? I found another process suffering from the same problem. We found the problem with our monitoring tool, and all was well. In other words: it wasn't me!

No comments:

Post a Comment