I started a blog post several months ago on this but haven't had time to explore it further. If you recall, my old boss at the old job (Mindy Curnutt - have no idea where she is these days or what she's doing as we've both left the place...) had an idea of using XML to archive old data. It's not a bad idea and the storage is pretty compact. The problem is that when you try to query it, XML querying is designed to go across rows and not query the XML as efficiently within the row. Having said this, below is something that I worked up today and it's really not a bad way to move the data off to eventually remove from a production system.
First, let's create the table and a few queries:
USE AdventureWorks
GO
IF (OBJECT_ID(N'dbo.PurchaseOrderArchive') IS NOT NULL)
DROP TABLE dbo.PurchaseOrderArchive
GO
CREATE TABLE PurchaseOrderArchive
(id int IDENTITY (1,1) NOT NULL PRIMARY KEY
,Data xml
,LogDate datetime
,LogUser varchar (20) DEFAULT session_user
)
GO
INSERT PurchaseOrderArchive (Data,LogDate)
SELECT (SELECT * FROM Purchasing.PurchaseOrderHeader FOR XML AUTO, ELEMENTS)
,GETDATE()
SELECT * FROM PurchaseOrderArchive
We're using AdventureWorks, 2005 version, and are going to try and archive off some old data. Let's first try to get the thing working. I am going to use for this example the records straight out of the PurchaseOrderHeader table; if you saw the other blog post of mine, you can also join tables and archive data as well in combination. For the table, I use Joe Celko's favorite designation for a primary key (IDENTITY attribute), a column called "Data" which will store every record that we're archiving, a log date, and a log user. You can add whatever you like here. As you can see, I do a simple INSERT of the data, followed by a SELECT statement to make sure that I have a row inserted. So far, so good.
SELECT Data.query('/Purchasing.PurchaseOrderHeader/PurchaseOrderID') AS PurchaseOrderIDs
FROM PurchaseOrderArchive
SELECT Data.value('(/Purchasing.PurchaseOrderHeader/PurchaseOrderID)[3631]', 'int')
,Data.value('(/Purchasing.PurchaseOrderHeader/ShipDate)[3631]', 'datetime')
,Data.value('(/Purchasing.PurchaseOrderHeader/OrderDate)[3631]', 'datetime')
,Data.value('(/Purchasing.PurchaseOrderHeader/TotalDue)[3631]', 'money')
FROM PurchaseOrderArchive
Now, querying this thing is a bit different than querying normal XML. The first query I simply use the query method for XML to pluck out the IDs. Unfortunately this doesn't do me a whole lot of good. Next, I use the value method for XML. Notice that in order to get a specific record, I need to use the Data.value method for each attribute. Sort of klugey. No need for any type of WHERE clause here, because the row that I want is specified in the singleton [3631] specification for that particular PurchaseOrderID. This is sort of the strange part. I could do a simple query without the singleton syntax, but I only get the first ID! Remember that all of the IDs are in the column and not across rows. This is where this "technique" differs when using XML. Still not a bad idea though, cramming all of the rows of the table into one XML row!
When I create an XML index, I get some quirky behavior as well. The index seems to help out some, but I am not sure if it's original design was intended for what we're doing here. Nonetheless, check out what the results are using the index for our data storage.
CREATE PRIMARY XML INDEX idx_xxx ON PurchaseOrderArchive (Data)
Interesting, eh? I first clear out the cache, and then run the query three times with no index; then I create the index and run the query three times. The speed is 3x without the index, but the reads and writes are also considerably higher. I'd have to compare with other XML queries to see if they follow suit.
Finally, let's alter the table and take only the shipments for the 2000 calendar year, and store those off. The code might look something like so:
SELECT COUNT(1), CAST(CAST(OrderDate AS varchar (11)) AS datetime)
FROM Purchasing.PurchaseOrderHeader
GROUP BY CAST(CAST(OrderDate AS varchar (11)) AS datetime)
ORDER BY 2
DELETE PurchaseOrderArchive
GO
ALTER TABLE PurchaseOrderArchive
ADD ShipYear int
GO
INSERT PurchaseOrderArchive (Data,LogDate, ShipYear)
SELECT (SELECT * FROM Purchasing.PurchaseOrderHeader
WHERE YEAR(CAST(CAST(OrderDate AS varchar (11)) AS datetime))= 2002
FOR XML AUTO, ELEMENTS )
,GETDATE()
,2002
GO
SELECT * FROM PurchaseOrderArchive
GO
Conclusion (if there is any): Use your imagination for expanding what I have done here for storing or archiving off old data. This method is unproven, untested, and may not even work well in a production setting. However, based on what I have here, I can see where you could actually store off a ton of rows very easily in XML and actually query the data; while it's not a mature or well-thought method to grab the data, and probably shouldn't be used for something that you would go fetch often, you can get to your data and display the archived results.
Lee
----------------------------------
What great creativity for so early in the morning.
del.icio.us Tags:
Archive data,
archive old data,
XML,
archive using xml,
SQL 2005,
SQL Server 2005,
SQL Server 2008,
Lee Everest.SQL Server blog,
SQL blog,
SQL Server weblog,
XML data type