The second aspect of database sizing is capacity. Capacity describes the maximum volume of data that the database can hold.
Since VoltDB is an in-memory database, capacity is constrained by the total memory of all the nodes in the cluster. Of course, servers should never be sized too exactly: it is important to allow for growth over time and to account for other parts of the database server that also use memory.
Chapter 4, Sizing Memory explains in detail how memory is assigned by the VoltDB server for database content. Use that chapter to perform accurate sizing when you have a known schema. However, as a rough estimate, you can use the following table to approximate the space required for each column. By adding up the columns for each table and index (including index pointers) and then multiplying by the expected number of rows, you can determine the total amount of memory required to store the database contents.
Table 3.1. Quick Estimates for Memory Usage By Datatype
Datatype | Bytes in Table | Bytes in Index
---|---|---
TINYINT | 1 | 1
SMALLINT | 2 | 2
INTEGER | 4 | 4
BIGINT | 8 | 8
DOUBLE | 8 | 8
DECIMAL | 16 | 16
TIMESTAMP | 8 | 8
VARCHAR[a] or VARBINARY (less than 64 bytes) | length + 1 | length + 1
VARCHAR[a] or VARBINARY (64 bytes or greater) | length | 8
index pointers | n/a | 40 or 48[b]

[a] For VARCHAR columns declared in characters, rather than in bytes, the length is calculated as four bytes for every character. In other words, for storage calculations a string column declared as VARCHAR(16) has the same length as a column declared as VARCHAR(64 BYTES).

[b] 40 bytes for tree indexes, 48 for hash indexes.
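The estimates above can be turned into a quick back-of-the-envelope calculation. The following sketch applies Table 3.1's per-column byte counts; the column specs and example table are hypothetical, not a VoltDB API:

```python
# Approximate bytes per column in table storage, from Table 3.1
TABLE_BYTES = {
    "TINYINT": 1, "SMALLINT": 2, "INTEGER": 4, "BIGINT": 8,
    "DOUBLE": 8, "DECIMAL": 16, "TIMESTAMP": 8,
}

def varchar_table_bytes(length_chars):
    """VARCHAR declared in characters: 4 bytes per character.
    Under 64 bytes: length + 1; 64 bytes or greater: length."""
    length = 4 * length_chars
    return length + 1 if length < 64 else length

def estimate_table_mb(column_bytes, rows):
    """Content size in MB for one table: sum of column bytes times rows."""
    return sum(column_bytes) * rows / (1024 * 1024)

# Hypothetical table: BIGINT key, VARCHAR(16), TIMESTAMP; 10 million rows
cols = [TABLE_BYTES["BIGINT"], varchar_table_bytes(16), TABLE_BYTES["TIMESTAMP"]]
print(round(estimate_table_mb(cols, 10_000_000)))  # 763 (MB)
```

Repeat the same calculation for each index (adding the 40- or 48-byte pointer per row) and sum across all tables and indexes to get the total content size.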
You must also account for the memory required by the server process itself. If you know how many tables the database will contain and how many sites per host will be used, you can calculate the server process memory requirements using the following formula:
384MB + (10MB X number of tables) + (128MB X sites per host)
This formula assumes you use K-safety, which is recommended for all production environments. If the cluster is also the master database for database replication, you should increase the multiplier for sites per host from 128 to 256 megabytes:
384MB + (10MB X number of tables) + (256MB X sites per host)
If you do not know how many tables the database will contain or how many sites per host you expect to use, you can use 2 gigabytes as a rough estimate for the server process size for moderately sized databases and servers. But be aware that you may need to increase that estimate once your actual configuration is defined.
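The two server-process formulas above differ only in the sites-per-host multiplier, so they can be expressed as one small function; the function name and the example inputs (20 tables, 8 sites per host) are illustrative assumptions:

```python
def server_process_mb(num_tables, sites_per_host, dr_master=False):
    """Server process memory estimate in MB, per the formulas above.
    The sites-per-host multiplier rises from 128 MB to 256 MB when the
    cluster is the master database for database replication."""
    per_site = 256 if dr_master else 128
    return 384 + 10 * num_tables + per_site * sites_per_host

print(server_process_mb(20, 8))                  # 384 + 200 + 1024 = 1608
print(server_process_mb(20, 8, dr_master=True))  # 384 + 200 + 2048 = 2632
```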
Finally, your estimate of the memory required for the server overall is the combination of the memory required for the content and the memory for the server process itself, plus 30% as a buffer.
Server memory = ( content + server process ) + 30%
When sizing for a cluster, where the content is distributed across the servers, the calculation for the memory required for content on each server is the total content size divided by the number of servers, plus some percentage for replicated tables. For example, if 20% of the tables are replicated, a rough estimate of the space required for each server is given by the following equation:
Per server memory = ( ( content / servers ) + 20% + server process ) + 30%
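Putting the single-server and cluster formulas together, a rough estimate can be computed as follows. This is a sketch under the assumptions of this section (20% replicated tables, a 2 GB server process); the function names and inputs are illustrative:

```python
def server_memory_mb(content_mb, process_mb):
    """Single server: (content + server process) plus a 30% buffer."""
    return (content_mb + process_mb) * 1.3

def per_server_memory_mb(total_content_mb, servers, replicated_fraction,
                         process_mb):
    """Cluster: divide content across the servers, add a percentage for
    replicated tables, add the server process, then a 30% buffer."""
    content = (total_content_mb / servers) * (1 + replicated_fraction)
    return (content + process_mb) * 1.3

# Hypothetical cluster: 8 GB of content on 4 nodes, 20% replicated,
# 2 GB server process
print(round(per_server_memory_mb(8192, 4, 0.20, 2048)))  # 5857 (MB)
```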
When sizing memory for VoltDB servers, it is important to keep in mind the following points:
- Memory usage includes not only storage for the data, but also temporary storage for processing transactions, managing queues, and the server processes themselves.
- Even in the best partitioning schemes, partitioning is never completely balanced. Make allowances for variations in load across the servers.
- If memory usage exceeds approximately 70% of total memory, the operating system can start paging and swapping, severely impacting performance. Keep memory usage per server within 50-70% of total memory.
Memory technology and density are advancing so rapidly (much like the increase in processor cores per server) that it is feasible to configure a small number of servers with extremely large memory capacities that provide capacity and performance equivalent to a larger number of smaller servers. However, the amount of memory in use can impact the performance of other aspects of database management, such as snapshots and failure recovery. The next section discusses some of the trade-offs to consider when sizing for these features.