Lee posted on April 3, 2011 09:01

I get to work with a lot of smart DBAs and database developers, and there aren’t too many smarter than my friend Harrison Doan. Harry works for Transamerica Life Insurance, and is a super DBA and database practitioner who specializes in data warehousing.  A former actuary, so you know he’s smart. What’s an actuary?  It’s those dudes that do this crazy wacked-out math that neither you nor I could touch. That engineering and trigonometry based calculus that you and I squeaked by as a freshman in college is probably the stuff he took in the seventh grade.

Harry has been showing me some of his tricks that I will be sharing (with his permission of course), and one that I liked in particular is his method for finding the difference between two tables.  You could use some third party tool, tablediff/tablediff gui, or whatever you choose, but I challenge you to take your little tool and test it against his method. I think that you’ll find that you wasted your money on it, especially on extremely large tables. Keep reading.

When he showed me his technique, I decided to try a couple that came to mind if I were tasked to find differences.  Let’s set up the data and tables:

Code:

DROP TABLE #tbl_01, #tbl_02
GO
CREATE TABLE #tbl_01
        (id int
        ,value1 varchar (10)
        ,value2 int
        )
CREATE TABLE #tbl_02
        (id int
        ,value1 varchar (10)
        ,value2 int
        )
INSERT #tbl_01 VALUES (1,'a',100)
INSERT #tbl_02 VALUES (1,'a',999)
INSERT #tbl_01 VALUES (2,'a',333)
INSERT #tbl_02 VALUES (2,'a',333)
INSERT #tbl_02 VALUES (4,'a',444)

SELECT * FROM #tbl_01
SELECT * FROM #tbl_02

Running the above, obviously the id value of 1 has a value2 of 100 in one table and 999 in the other. There is also a missing row besides a diff in my example.  So, let’s brainstorm to think of ways to find the different row. Here’s a few:

Ex. 1:

SELECT *
FROM #tbl_01 as a
INNER JOIN #tbl_02 b
    ON a.id = b.id
WHERE a.value1!=b.value1
OR a.value2!=b.value2

This one is OK. The row that is different pops out quickly. Has a problem if there is a missing row (id 4).

Ex. 2:

SELECT *
FROM #tbl_01 as a
LEFT OUTER JOIN #tbl_02 b
    ON a.id = b.id
WHERE b.id IS NULL
UNION ALL
SELECT *
FROM #tbl_01 as a
RIGHT OUTER JOIN #tbl_02 b
    ON a.id = b.id
WHERE a.id IS NULL

 

This one is an old stand-by for finding missing rows; I’ve simply added a UNION ALL so that I can see rows from ‘both sides’ of the query. Will be a lot of code and joins when there are a bunch of columns. Row for ID 4 pops out, but the difference in values doesn’t.

Ex. 3:

SELECT a.id, b.id, a.value1, b.value1, a.value2, b.value2
FROM
        (SELECT *
        FROM #tbl_01
        ) a
FULL OUTER JOIN
        (SELECT *
        FROM #tbl_02) b
        ON a.id = b.id

Maybe one you’ve never thought about…using a FULL OUTER JOIN. Works OK. Missing row from one table pops out this time, and you get the different row (id 1).

Ex. 4:

SELECT *
FROM #tbl_01 a
WHERE NOT EXISTS
        (select *
        from #tbl_02 b
        where a.id = b.id
        and a.value1 = b.value1
        and a.value2 = b.value2)
UNION 
SELECT *
FROM #tbl_02 b
WHERE NOT EXISTS
        (select *
        from #tbl_01 a
        where a.id = b.id
        and a.value1 = b.value1
        and a.value2 = b.value2)

Similar to example 2, this is sort of ugly. Well, very ugly. Lots of code, column joins, etc.  NOT isn’t good here. Missing row pops out and you get the different row, but a pain to write this one if you had to. I’d say for this one… no, not going to use.

There are others certainly, but these four are a few that came to mind quickly. You may have a favorite, and it will probably be like on of these or some derivation thereof – there’s only a few ways to find the diff between rows in two tables – EXISTS, NOT EXISTS, IN and NOT IN, OUTER JOINS, etc.  Writing one that shows both the different row and the missing row requires a slight modification of the query to the old OUTER JOIN as in example 2. Notice, however, a few things that are common amongst all of these techniques:

1) Code will become a pain to write when you have lots of columns. Try doing any of these if the table has 100 columns. You better have a text editor such as UltraEdit to help.

2) All require some sort of a join – Again, not easy to write, so if you have a ton of columns you have to join everything. A text editor a must. Also, performance deteriorate quickly because all of these joined columns won’t have any indexes on them (unless you take the time to put them on each column) but again this becomes impractical. Try it with 400 columns on two 50 million row tables.

3) Your going to have to put a WHERE clause on the query – For our examples, we had only a few rows, but this is inevitable, because you only want the diffs.

So, what’s Harry’s trick?  Check this out:

SELECT MIN(id) AS id, value1, value2
FROM   
        (SELECT id, value1, value2
        FROM #tbl_01
        UNION ALL
        SELECT id, value1, value2
        FROM #tbl_02) X
GROUP BY  value1, value2
HAVING COUNT(1)=1


Let’s think about this one for a moment. First, it’s simple to write: no joins left or right, no EXISTS, and no unclear code.  If I had a 200 column table to compare, I could simply drag the columns over from SSMS. No need to alias either side of the join, and no WHERE clause is necessary.  Not also that if you order the results, they are together, so pasting in Excel is great for doing comparisons on wide tables.  This is his “stacking” technique using UNION ALL for speed. Very simple, straightforward, and elegant.

Note that you could also use BINARY_CHECKSUM() in combination with this method, but I’m not comfortable that the damn thing even works on a consistent basis, so I’ll opt-out of this one. You can try it, but be careful. Here I get IDs but no columns. It works in our example with few rows:

SELECT MIN(ID), CKSUM
FROM (
    SELECT id, BINARY_CHECKSUM(*) AS CKSUM
    FROM #tbl_01
    UNION ALL
    SELECT id, BINARY_CHECKSUM(*)
    FROM #tbl_02) Z
GROUP BY CKSUM
HAVING COUNT(1)=1

 

Let’s test this thing on some data and see wut up. This will take a couple of minutes to run, depending on what your system is. I’m using a RAID 0 set of 2.5” disks in a little enclosure on Hyper-V. Didn’t take long to create, we’re going for 50 million rows here in our comparison:

USE tempdb
DROP TABLE tbl1, tbl2
GO
SELECT big.id, SPACE(1) AS col1, SPACE(2) AS col2, SPACE(3) as col3
INTO tbl1
FROM
    (SELECT TOP 50000000 ROW_NUMBER () OVER (ORDER BY a.id) AS id
    FROM sys.syscolumns AS a, sys.syscolumns AS b, sys.syscolumns AS c, sys.syscolumns AS d
    ) big

SELECT *
INTO tbl2
FROM tbl1
 

Make some changes to one of the tables:

UPDATE tbl2
SET col1 = 'a'
WHERE id = 76002358

UPDATE tbl2
SET col2 = 'b'
WHERE id = 14385068

UPDATE tbl2
SET col3 = 'c' 
WHERE id = 144856112

Let’s try this one for comparison. I’m going to take the GTO out of the garage, take a pit-stop :), and get ready to cut the grass. I’ll let this run for a few and be back. Note that I am going to prime this once so my files get expanded, clear the buffers and cache, so we’re comparing apples-to-apples as closely as we can.  Since we may or may not know that ID is the primary key for the table, let’s just join all columns as well. I’ll turn on stats IO and time too:

SELECT *
FROM
        (SELECT *
        FROM tbl1
        ) a
FULL OUTER JOIN
        (SELECT *
        FROM tbl2) b
        ON a.id = b.id
WHERE a.col1!=b.col1
OR a.col2!=b.col2
OR a.col3!=b.col3

image

SELECT MIN(id) AS id, col1, col2, col3
FROM   
        (SELECT id, col1, col2, col3
        FROM tbl1
        UNION ALL
        SELECT id, col1, col2, col3
        FROM tbl2) X
GROUP BY  col1, col2, col3
HAVING COUNT(1)=1

image

Crazy, eh?  Pretty awesome. The FULL OUTER JOIN took nearly ten minutes, while Harry’s took 41 seconds.  Sweet!  I’ll follow up with the reason why it’s so much better in another blog. So try this technique next time you’re hunting differences between two tables and see how it works for you.

Thanks for reading,

Lee Everest

 

--------------------

I didn’t get invited to speak at SQL Saturday 63. Well darn.


Posted in: TSQL  Tags:
blog comments powered by Disqus

by Lee Everest, M.S.

Info

Poll

Do you use Azure or cloud in your organization?



Show Results

Ads

Search


Month List

Calendar

«  May 2012  »
MoTuWeThFrSaSu
30123456
78910111213
14151617181920
21222324252627
28293031123
45678910
View posts in large calendar

Tags

Disclaimer
The opinions, code, examples, et.al. expressed herein are my own personal opinions and do not represent my employer's view in any way, shape form, or fashion.  All code for demonstration purposes - no guarantees, either written or implied, are made.

© Copyright 2012 Lee Everest's SQL Server, etc. weblog