Modifying our delete statement
Before we begin, a couple of things. First off, let’s clean up the code, both formatting and capitalization. I always offer to my students at North Lake that when you write something down on paper, take ownership in it. After all, it has your name on it. This goes for your code as well. Take pride in what you do – what you leave lets everyone know what kind of SQL programmer that you are. And to a lesser degree, remember that some poor old slob will have to come along and read or even change the garbage that you left. For example, how does this look?
At some point you may be asked to remove old records in SQL Server. Normally a process has been created to first offload the data, either by archiving or by moving to a more suitable storage outlet, such as a star-schema design. In the first part, I showed you how to write a faster (and sweeter looking) delete statement than what the SQLCat folks at Microsoft wrote for MySpace.com. In part 2, I show how to modify this script to allow you to dynamically change the number of rows for a given iteration.
Get your act together
drop table t1
create table t1 (a int primary key, b char (100))
declare @i int
set @i=1
SET NOCOUNT ON
while (@i<100000)
begin
insert into t1 values (@i,'x');
set @i=@i+1
end
Compared to this?
SET NOCOUNT ON
DROP TABLE t1
GO
CREATE TABLE t1
(a int primary key
,b char (100)
)
GO
DECLARE @i int
SET @i=1
WHILE (@i<100000)
BEGIN
INSERT INTO t1 VALUES (@i,'x');
SET @i=@i+1
END
Are you kidding me? Take the time to write your code with proper capitalization, formatting, and indentions. It will leave a more professional look in your work. Ok, enough life-lessons for today, let’s move on. Secondly, if you are going to create some data to test, their looping process might be better suited with one of the great tools available to generate data. If you don’t have access to one, use something like the below script. This allows you to 1) get results a lot faster, and 2) doesn’t run all of the RAM out of your computer. This one creates the same data set as did their script. Notice how I used up over 700 MB RAM by doing a simple loop. Think sets of records and not row by row in the database. For this example, run the following script in your dev environment.
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
IF OBJECT_ID ('t1') IS NOT NULL
DROP TABLE t1;
GO
CREATE TABLE t1 (a int IDENTITY NOT NULL PRIMARY KEY, b char (100))
SET ROWCOUNT 5000000
INSERT INTO t1 (b)
SELECT 'a'
FROM
(SELECT 1 as Col
FROM sys.syscolumns A, sys.syscolumns b) as Records

Take a step back and think about your users
Alright, so I showed you that my delete was faster than the one that SQLCat came up with. This is all fine and dandy, but when deleting rows in a table, speed is not the only thing up for consideration here. You need to determine the optimal “chunk size” for removing data. If a delete in any way, shape, form, or fashion causes pain on your users, then it’s no good. And this is the first question that you must answer – simply comparing two sets of code, reads/writes, or the time that it took is going about deleting rows is entirely the wrong way to approach this problem. Would you sacrifice either users losing connections or losing business at the expense of getting your data deleted? Of course not. Note that my process that I will describe here can be left running as long as it takes to delete the data; you could schedule and leave it running for a weekend to gently remove rows while being kind to your user community.
Maximize your deletes
Back to the example script. If 10,000 rows is ok for a delete, what about 25,000 rows, is that better? Not if it starts causing pain. At what point do we start affecting users by holding locks too long on the data? How can we test this? Remember that SQL Server will treat different scenarios differently. For instance, if your table is 100,000 rows, and you delete 25,000 rows, you are more likely to get a different behavior with regards to the locking mechanism placed on the table vs. deleting 10 rows in a 100,000 row table. You might actually have SQL Server issue a full table lock or other undesirable escalation in order to delete the rows, not good for your users running transactions at the same time.
A DELETE script of a higher order
In order to be able to answer the question of how many rows can we safely remove at a time, let’s first modify my script from part 1 and put some controls into place. What we want to do is 1) add a target number of rows, which will be the maximum row count to delete for a particular, 2) add a “chunk size” variable to the script, which will be put into the TOP clause of the delete statement (remember that this is new for SQL Server 2005/2008 – see Books Online), and 3) add the additional code to support dynamically changing this “chunk size”; this is where we will define the number of rows that we will remove in a given pass, which will allow us to dynamically change the size of the delete in each iteration. But before running the script, create the following table and insert a value of 1 for the chunk size.
CREATE TABLE Chunksize
(id int identity (1,1) NOT NULL
,Chunk int )
GO
DECLARE @Rows int
,@Chunksize int
,@Target int
GO
SELECT @Rows=0
SELECT @Chunksize = Chunk from Chunksize where id=1
SELECT @Target = 3000000
There:
DELETE TOP (@Chunksize) FROM t1
SELECT @Rows = @Rows + @@ROWCOUNT
IF (@Rows < @Target)
BEGIN
SELECT @Chunksize = Chunk from Chunksize where id=1
GOTO There
END
GO
Here’s the skinny on the script:
1. @Rows – this variable will hold the number of rows that we will delete. When we reach the target variable value, then the script will stop.
2. @Chunksize – this variable holds the value that will go into the TOP clause parameter.
3. @Target – the total number of rows for a session. If the script is stopped, remember that you would change this value.
4. Chunksize table – this is the table that you will create and, in another window, change the value of the chunk on-the-fly as the script executes.
The chunk concept is extremely cool in my opinion. In order to dynamically change the number of rows that will be deleted for each iteration, you can change the value in another connection than where your script is running by updating the current value. So, if you start out with 1000 as the chunk size value, and decide that, based on other scripts that you can use to monitor blocking, your users are unaffected, you can up this value and delete even more rows to maximize your delete process. So let’s fire this thing off. Execute the script, and then in another window write a select statement to see the rows being removed. In that same window, update the chunk to 1000, and then witness the rows as they are removed at a faster rate.
Conclusion
Deleting rows from a table after old records have been archived is a necessary process for large production systems. Getting the job done quickly, however, is not the main concern here. You must make sure that your process does not affect your users; your users and their ability to continue to use the application within is the main concern. Simply issuing a delete statement to the table, or choosing a given option that runs faster over another option is not a wise option for a DBA. Using this script you can control the number of rows that are deleted out of a table.
Next time, we’ll look at how you could monitor the chunk size and how the process is treating your users, allowing us to determine how many rows can be safely deleted.
Lee
-----------------------------
“Although she lives with seven other men, she's not easy.”
