This article describes the ways of estimating data warehouse performance
Ever wonder why it takes so long to execute the Query? It works fine now but how long will it take to run in 3 years?
This article is not database specific
Below is the basic Star schema of a retail shop
It has one fact table and 3 dimensions
Let's have a close look at the Fact table structure and particularly fields types:
The STATE_CODE field: The type is nvarchar most likely it uses 2 bytes to store one character, plus it has nil as terminator so in total it is 5 bytes
Some of the databases store the length of a string, which could take up to 2-4 bytes
Let's do the same for the rest of the fields
|Field Name||Field Length|
|Total row size||45|
So how long it will take to run the following query?
Select STATE_CODE, PRODUCT_CODE, PERIOD_CODE, Sum(AMOUNT) as AMOUNT
GROUP BY STATE_CODE, PRODUCT_CODE, PERIOD_CODE
This SQL will use the full scan of the table: EG it will read the entire table and calculate summary
To answer this question we need to know our HDD performance and the number of rows in the table
HDtach can easily measure the hard drive performance
Let assume that we have 10 million rows and our RAID can do 50mb per sec sequential read
450 000 000 / (50 *1048576) = 8.58 sec
Those people who have actually run the queries against the table with 10 million records would say that this number is far too optimistic.
- The actual row size is much bigger
- It also holds some additional data such as row CRC or rowid's
- Plus data is stored in blocks and they do have some additional data as well
- Plus the table itself stores some additional information.
- HDD performance is not realistic as well in a multi-user environment
So what is the best way to estimate the time taken to execute the query?
We believe that practical approach gives the best results:
- Create a table
- Populate table with some dummy data
- Count number of rows
- Get actual table size using database-specific functions
SQL Server provides a built-in stored procedure that you can run to easily show the size of a table, including the size of the indexes. Which might surprise you.
Then use the formula table size/HDD performance to calculate the time
and the last step ran the SQL during peak times and low times
and compare the results.
If you have historical data this approach would also help you estimate how long will it take to run the SQL in 3 years.