This article describes ways to estimate data warehouse query performance.

Ever wondered why it takes so long to execute a query? It works fine now, but how long will it take to run in 3 years?

This article is not database-specific.

Below is a basic star schema for a retail shop.
It has one fact table and three dimension tables.

Star Schema

Let's have a close look at the fact table structure, and particularly the field types:

The STATE_CODE field: the type is nvarchar, which most likely uses 2 bytes to store one character; add a null terminator, and a two-character state code takes 5 bytes in total.

Some databases also store the length of the string, which can take an additional 2 to 4 bytes.

Let's do the same for the rest of the fields:

Field Name      Field Length (bytes)
STATE_CODE      5
PRODUCT_CODE    21
PERIOD_CODE     13
AMOUNT          6
Total row size  45
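The per-field arithmetic above can be sketched in Python. The byte counts follow the article's assumptions (2 bytes per nvarchar character plus a 1-byte terminator; a 6-byte numeric for AMOUNT); the character lengths are inferred from the totals in the table.

```python
# Rough per-row size estimate, using the assumptions above:
# nvarchar stores 2 bytes per character plus a 1-byte terminator.
def nvarchar_bytes(chars: int) -> int:
    return 2 * chars + 1

row_size = (
    nvarchar_bytes(2)     # STATE_CODE  (2 chars)  ->  5 bytes
    + nvarchar_bytes(10)  # PRODUCT_CODE (10 chars) -> 21 bytes
    + nvarchar_bytes(6)   # PERIOD_CODE  (6 chars)  -> 13 bytes
    + 6                   # AMOUNT (assumed 6-byte numeric)
)
print(row_size)  # 45
```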


So how long will it take to run the following query?

SELECT STATE_CODE, PRODUCT_CODE, PERIOD_CODE, SUM(AMOUNT) AS AMOUNT
FROM FACT_TABLE
GROUP BY STATE_CODE, PRODUCT_CODE, PERIOD_CODE

This SQL will perform a full table scan: it will read the entire table and calculate the summary.

To answer this question we need to know our HDD performance and the number of rows in the table.

HD Tach can easily measure hard drive performance.

Let's assume that we have 10 million rows and our RAID can do 50 MB per second of sequential read:

45 bytes × 10,000,000 rows = 450,000,000 bytes
450,000,000 / (50 × 1,048,576) ≈ 8.58 sec
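That naive estimate can be checked in a couple of lines of Python, using the row size and throughput figures above:

```python
ROWS = 10_000_000
ROW_SIZE_BYTES = 45             # total row size from the table above
READ_SPEED = 50 * 1_048_576     # 50 MB/s sequential read, in bytes/s

table_bytes = ROWS * ROW_SIZE_BYTES        # 450,000,000 bytes
scan_seconds = table_bytes / READ_SPEED
print(round(scan_seconds, 2))  # 8.58
```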

Anyone who has actually run queries against a table with 10 million records will say that this number is far too optimistic:

  • The actual row size is much bigger
  • Each row also holds additional data, such as a CRC or rowid
  • Data is stored in blocks, and blocks carry some additional data as well
  • The table itself stores some additional information
  • HDD performance is also unrealistic in a multi-user environment
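To get a feel for how much these effects matter, the naive estimate can be adjusted with correction factors. The multipliers below are purely illustrative assumptions, not measurements; real overhead varies by database and workload and must be measured.

```python
naive_bytes = 10_000_000 * 45          # naive table size from above

# Hypothetical correction factors (assumptions for illustration only):
overhead_factor = 2.0                  # row headers, rowids, block/page overhead
effective_read_speed = 50 * 1_048_576 * 0.5   # throughput halved by concurrent load

adjusted_seconds = naive_bytes * overhead_factor / effective_read_speed
print(round(adjusted_seconds, 1))      # roughly 4x the naive 8.58 s estimate
```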
So what is the best way to estimate the time taken to execute the query?

We believe a practical approach gives the best results:

  • Create a table
  • Populate the table with some dummy data
  • Count the number of rows
  • Get the actual table size using database-specific functions

SQL Server provides a built-in stored procedure that you can run to easily show the size of a table, including the size of its indexes, which might surprise you.

Syntax:

sp_spaceused 'TableName'

Then use the formula table size / HDD read speed to calculate the time.
As a last step, run the SQL during peak times and low times
and compare the results.
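The formula above can be sketched as a small helper that turns a measured table size (sp_spaceused reports sizes in KB) into a scan-time estimate. The input numbers here are placeholders to replace with your own measurements.

```python
# Convert a measured table size into a sequential-scan time estimate.
def scan_time_seconds(table_size_kb: float, read_mb_per_sec: float) -> float:
    return table_size_kb * 1024 / (read_mb_per_sec * 1_048_576)

# e.g. sp_spaceused reports 1,200,000 KB and HD Tach measured 50 MB/s
print(round(scan_time_seconds(1_200_000, 50), 1))
```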

If you have historical data, this approach will also help you estimate how long it will take to run the SQL in 3 years.
