An example of why referential integrity is a good thing
As I wrap-up my latest gig, I look back to see just how I don’t miss working with the mainframe or AS/400. It has been almost six years since I have had to work exclusively with either, and I haven’t particularly miss them. So I’m querying data, and I get finished with a simple data fetch only to find out that I’m unexpectedly missing rows. WTF? I’m writing the code correctly, and I am following the little cheat-sheet that I was given. I think to myself “HTH can I be missing rows?? Reality then strikes me as I remember that I’m querying an AS/400, and most likely the data structures, schemas, and objects are probably not created correctly. And this turns out to be the case here. I find that there is no concept of referential integrity in the tables that I am working with! Moreover, working with this data I’ve already found violations of any and all normal forms, records referencing other records within the same table as you might find in old ISAM and VSAM files, entities with multiple primary keys, minimal referential integrity, mainframe low characters (which can oftentimes be very hard to remove), and my favorite - invalid dates. Rule #1 when working with the mainframe or the AS/400: you’re going to find crap, so be ready for it. How could I have forgotten?
Back to my example. After I realize that neither table knows what the other is doing, I think that I can use some OUTER JOINS to find the missing rows. The problem though is that I’d have to do either multiple outer joins or a FULL OUTER JOIN; I don’t particularly like using FULL OUTER JOINS, however, since you have to use ISNULL or COALESCE in order to remove the columns returning NULL values. Therefore, below is an example that you can investigate to find rows that could be missing in more than one table of a join. In my example here, the transactions table acts as a “ledger” table; you would think that a ledger table might have rows for each business day, but this cannot be assumed. The PlantDetail table and the Transaction tables are mutually exclusive – neither can be the “drive” table, if you will, for querying these two joined tables. My solution, then, is to create a list of plants and dates, and use this to drive the query.
IF OBJECT_ID ('Plant') IS NOT NULL
DROP TABLE Plant
GO
IF OBJECT_ID ('Transactions') IS NOT NULL
DROP TABLE Transactions
GO
IF OBJECT_ID ('PlantDetail') IS NOT NULL
DROP TABLE PlantDetail
GO
CREATE TABLE Plant (PlantId int, Name varchar (10)
, CONSTRAINT Pid PRIMARY KEY (PlantId))
GO
CREATE TABLE Transactions
(PlantId int, TransDt datetime, TransType char (2)
, CONSTRAINT Tid PRIMARY KEY(PlantId, TransDt))
GO
CREATE TABLE PlantDetail
(PlantId int, DetailDt datetime, Value varchar (30)
, CONSTRAINT Pdid PRIMARY KEY (PlantId, DetailDt))
GO
INSERT Plant VALUES (1,'North'), (2,'South'), (3, 'East'), (4, 'West');
GO
INSERT Transactions VALUES (1, '8/1/09', 'AA')
INSERT Transactions VALUES (2, '8/1/09', 'AA')
GO
INSERT PlantDetail VALUES (3, '8/2/09', 'Cloudy');
GO
--solution
SELECT P.PlantId, R.dt, dp.Value
FROM Plant p INNER JOIN
(
SELECT TransDt as dt, PlantId as PlantId
FROM Transactions
UNION ALL
SELECT DetailDt, PlantId
FROM PlantDetail
) r
ON P.PlantId = R.PlantId
LEFT OUTER JOIN Transactions t
ON t.PlantId = r.PlantId
AND t.TransDt = r.dt
LEFT OUTER JOIN PlantDetail dp
ON r.PlantId = dp.PlantId
AND r.dt= dp.DetailDt
GO
This one is much easier to implement than using FULL OUTER JOINs and then doing and ISNULL or COALESCE for the rows that may or may not be in either the Transactions or PlantDetail table.
Lee
----------------------------------------
Wait until you see the blog on using the SSIS task for SAPBW
eee81387-d6ad-47ba-9ea5-5512d548bdca|0|.0