In many cases, there are several alternative ways to fill a table with identical values, but these ways can differ considerably in performance. When populating very large databases (tens of millions of records), this difference can become critical in terms of time. In this article, we consider product performance, present the results of our measurements for some standard tasks, and give several recommendations for non-standard cases.
Note: all tests in this article were run on the following system: Intel P4 (3.4 GHz), 1024 MB RAM, Windows XP Pro, local MS SQL Server 2000 (Developer Edition).
Unique Value Generation
Let us start with adding unique values for some field to the database. There are two ways to ensure value uniqueness: set the corresponding check box in the generation rule, or use a unique index on the server. In the first case, the data generation software creates unique values on its own and checks each new value against the list of those already generated. Such checks take time and consume additional RAM, but they avoid extra server queries. The second way delegates the rejection of duplicate records to the server.
The main disadvantage of the second method is that DTM Data Generator cannot guarantee that the number of records the user specifies in the rule is actually added. If 1,000 records are to be inserted and 100 of them are rejected, only 900 records end up in the database. The first method almost always delivers the requested number of records, as long as unique values remain available.
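For reference, the server-side method boils down to a unique index. One common way to let the server reject duplicates without failing the whole load is the IGNORE_DUP_KEY index option (whether the generator uses this option or handles rejection errors itself is not specified here). A minimal T-SQL sketch; the table and index names are hypothetical:

```sql
-- Server-side uniqueness: the index rejects duplicate values.
-- With IGNORE_DUP_KEY, a duplicate INSERT is discarded with a warning
-- rather than raising an error, which is exactly why fewer rows than
-- requested may end up in the table.
CREATE TABLE dbo.TestInt (Value int NOT NULL)

CREATE UNIQUE INDEX IX_TestInt_Value
    ON dbo.TestInt (Value)
    WITH IGNORE_DUP_KEY   -- SQL Server 2005 and later: WITH (IGNORE_DUP_KEY = ON)
```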
Let us see what actually happens. We create two tables, one with a varchar field and the other with an integer field, and try to insert 10,000 unique values into each of them using each of the above methods:
| | First method (by Data Generator), records/sec | Second method (unique index), records/sec |
|---|---|---|
| Integer value | 800 | 710 (10% rejected) |
| Varchar(750) value | 764 | 780 |
When the generation range is large enough, there is no substantial difference between the two methods.
Now let us change the task so that the number of values that can be generated is considerably reduced: specify the range from 0 to 50,000 for the integer field and a 5-character mask for the varchar field:
| | First method (by Data Generator), records/sec | Second method (unique index), records/sec |
|---|---|---|
| Integer value in range | 450 | 770 (10% rejected) |
| Varchar value by mask | 373 | 800 |
As we can see, in this case the performance of the first method can be considerably lower: with 10,000 values drawn from only about 50,000 possibilities, the program has to generate many extra values, check them, and reject the duplicates. Note, however, that if the connection to the server is slow or timeouts are large, the ratio is likely to reverse, because rejecting each duplicate on the server then costs a full round trip, while the first method's checks remain local.
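To see why the narrowed range hurts, here is a rough T-SQL analog of the generate-check-reject loop (a sketch only: the real checks happen in the generator's memory, not on the server; the table is the hypothetical dbo.TestInt from above):

```sql
-- Draw random values in [0, 50000]; skip any value already present.
-- The narrower the range, the more draws are wasted on duplicates:
-- by the time 10,000 of ~50,000 values are taken, roughly every
-- fifth draw collides.
DECLARE @inserted int, @v int
SET @inserted = 0

WHILE @inserted < 10000
BEGIN
    SET @v = CONVERT(int, RAND() * 50001)
    IF NOT EXISTS (SELECT 1 FROM dbo.TestInt WHERE Value = @v)
    BEGIN
        INSERT INTO dbo.TestInt (Value) VALUES (@v)
        SET @inserted = @inserted + 1
    END
END
```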
Transaction Size and Data Generation Process Speed
The next important factor that seriously influences performance is the transaction "size", that is, the number of table rows inserted within a single transaction. Let us change the task as follows: insert 100,000 varchar values into the table, varying the transaction size from 50 to 5,000 rows per transaction and measuring the resulting performance:
| Rows per transaction | Performance (rows/second) |
|---|---|
| 50 | 687 |
| 100 | 734 |
| 200 | 801 |
| 500 | 782 |
| 1000 | 740 |
| 2500 | 721 |
| 5000 | 673 |
We stopped and restarted the database server after running each rule in order to prevent the results from being influenced by the record cache.
Note: the optimal transaction size depends on the software and hardware configuration of the database server. You will probably need to experiment to find the optimal value for your server.
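For readers who want to see what "rows per transaction" means on the client side, the following is a minimal T-SQL sketch of batching: 100,000 rows committed 500 at a time. The table name and the generated values are hypothetical:

```sql
-- Insert 100,000 rows, committing once per 500-row batch.
CREATE TABLE dbo.TestVarchar (Value varchar(750) NOT NULL)

DECLARE @i int, @j int
SET @i = 0

WHILE @i < 100000
BEGIN
    BEGIN TRANSACTION
    SET @j = 0
    WHILE @j < 500                 -- one batch = one transaction
    BEGIN
        INSERT INTO dbo.TestVarchar (Value)
        VALUES ('row-' + CONVERT(varchar(16), @i + @j))
        SET @j = @j + 1
    END
    COMMIT TRANSACTION             -- the log is flushed once per batch
    SET @i = @i + 500
END
```

Committing too often makes the server flush the transaction log every few rows; committing too rarely makes it hold large amounts of uncommitted data, which is why the numbers above peak in the middle of the range.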
Database Fill Out Methods and Performance
Let us now compare the performance of the program when the list of values is specified in three different ways: directly in the rule, by loading the list from a file, and by using the Values Library. We will use the table with a single varchar field created in one of the examples above and fill it with 10,000 values; the numbers are rows per second:
| List (5 values) | List in file (50 values) | Values Library (400 values) |
|---|---|---|
| 490 | 473 | 448 |
As we can see, under comparable conditions all three methods perform practically identically, and the result hardly depends on the size of the list.
Custom Database Population Performance Factors
There is a way to considerably speed up test data loading for Microsoft SQL Server users. If the server is installed locally (i.e. DTM Data Generator runs on the same computer as the server), the program offers built-in support for the BCP utility (see the corresponding option in the Settings dialog box). If the server is remote, we recommend creating an output text file instead of inserting the data directly, and then loading this file with the command-line BCP utility.
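If you go the text-file route, the file can also be loaded from T-SQL with BULK INSERT, which is functionally close to "bcp in". A minimal sketch; the path, table name, and terminators are assumptions:

```sql
-- Load a generated text file in bulk; much faster than row-by-row INSERTs.
BULK INSERT dbo.TestVarchar
FROM 'C:\data\generated.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-delimited columns
    ROWTERMINATOR   = '\n',   -- one record per line
    BATCHSIZE       = 500,    -- ties in with the transaction-size advice above
    TABLOCK                   -- allows minimal logging where the recovery model permits
)
```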
In conclusion, let us mention some factors that are critical for performance and can be changed or customized by the user. To increase performance, you should:
- Temporarily disable all triggers on the tables being filled, except when they are needed to ensure data integrity.
- If possible, drop indexes, except in the unique-value scenarios mentioned above.
- Temporarily disable the transaction mechanism for the specific client for the period of test data generation, if your database server permits it.
The prologue and epilogue scripts of a rule or of the whole project are a convenient way to disable and re-enable the mentioned objects.
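A possible prologue/epilogue pair for SQL Server might look like this (a sketch; the table, trigger, and index names are hypothetical, and the DROP INDEX form shown is the SQL Server 2000 syntax):

```sql
-- Prologue: disable triggers and drop a non-unique index before the fill.
ALTER TABLE dbo.TestVarchar DISABLE TRIGGER ALL
DROP INDEX dbo.TestVarchar.IX_TestVarchar_Value

-- ... the data generation rule runs here ...

-- Epilogue: restore everything afterwards.
ALTER TABLE dbo.TestVarchar ENABLE TRIGGER ALL
CREATE INDEX IX_TestVarchar_Value ON dbo.TestVarchar (Value)
```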
See Also
- Data Scrambling.
- Using DTM Data Generator to create test data sets.
- Test data generation patterns.