Projects:Support Partial Indexes On Nullable Columns/Specs
Contents |
Functional Requirements
The main goal is to add a mechanism that helps to improve the performance of the indexes which can be added to nullable foreign key columns with a high ratio of null values.
Partial Database Indexes
A partial index is an index where it is possible to specify the rows that are indexed. This kind of indexes are useful for commonly used WHERE conditions that use constant values.
An index helps to get faster queries, but depending on the case they can waste a lot of disk space if most of the indexed data is rarely needed. With a partial index it is possible to index just the table data that is most commonly used, this way it will contain fewer rows reducing the amount of disk space used by the index.
Therefore, partial indexes provide the following advantages:
- Save disk space, as they index a subset of rows from the original table.
- Save time on inserts, because it is only needed to write to the partial index when new rows match the filter.
Not Null Partial Indexes On Nullable Columns
As mentioned before, the main goal of this project is to improve the performance of the indexes that are added to nullable foreign keys with a high ratio of null values.
Taking into account the advantages mentioned in the previous section, in this type of cases it seems reasonably to have a partial index to keep just those rows with NOT NULL values on those columns. Depending on the rows we want to access this type of indexes can have some drawbacks, but in general, this will provide us better performance.
For example, lets consider the following scenario where we have data distributed as follows:
Table Name | Column Name | Total Rows | Null Count |
c_order | c_return_reason_id | 6644690 | 6626194 |
According to this table, the 99.72% of the rows in c_order has a NULL value on the c_return_reason_id column. This means that most of the orders are barely returned.
Imagine that now we create an index for the c_return_reason_id column on the c_order table:
CREATE INDEX c_order_return_reason ON c_order (c_return_reason_id);
This index will be useful to improve queries where we want to know information about the returned orders, for example:
SELECT dateordered FROM c_order WHERE c_return_reason_id = '0B10695DC880488ABA24F70BB2D1442A'; -- 'Incorrect Product' return reason
The new c_order_return_reason index will fulfill its purpose, but it also will include many rows that are never searched by this type of queries, namely all the orders which are not a return from customer/vendor. We are making the query very fast even though it wastes a lot of disk space.
Now, lets suppose that we are able to create a partial index as follows:
CREATE INDEX c_order_return_reason ON c_order (c_return_reason_id) WHERE c_return_reason_id IS NOT NULL;
In this case, the index only contains the rows that satisfy the WHERE clause, i.e., it only contains the orders that have been returned. Therefore we are achieving the following improvements:
- Reducing the disk space usage of the index.
- Since the returned orders represent the 0.28% of our orders, comparing to the previous index this is a significant improvement in write performance: we essentially get this index for free in terms of insert time.
Nevertheless, a partial index can have some drawbacks depending on the case. Following the example, our partial index is not indexing those rows with NULL value on c_return_reason_id, therefore we will have slower queries if they are as follows:
SELECT dateordered FROM c_order WHERE c_return_reason_id IS NULL;
This is something that must be taken into account before creating a partial index: we have to analyze which type of queries we are using most in order to decide if we may use a partial index.
Not Null Partial Indexes: Oracle vs PostgresSQL
In an Oracle database, it does not include rows in an index if all indexed columns are NULL. That means that every index is a partial index.
This is not the behavior in PostgresSQL databases, where we will need to define the index as partial.
Affected Foreign Keys
In the following sheet it can be found the nullable foreign key columns present in core which the partial indexes based on NOT NULL expression would be applicable.
Technical Requirements
The effort within this project will be intended to provide the ability of implementing partial indexes in PostgresSQL. Besides, the XML definition of this new elements should not be considered in Oracle but at the same time we must keep the information in order to avoid its losing when exporting the database.
Two different approaches have been considered to decide how partial indexes should be applied on the nullable foreign keys of the core tables: explicit and implicit.
Explicit Approach
In this approach the decision is delegated to the developer. Each developer decides whether the index should be partial or not at the time of its creation. This approach has the following benefits:
- More control over the partial index generation.
- Easier to revert if it is detected that the index should not be partial.
- More appropriate if we decide to support different definitions of partial indexes and not just for nullable foreign key columns.
Implicit Approach
The partial indexes are created automatically during the update.database process giving us the following benefits:
- Change is transparent for developers.
- Homogeneous behavior with Oracle.
If this approach is taken, it should also be considered if the indexes must be created for every nullable column (even if it already has an index) or just for those columns which do not have an index.
In the first case, the time spent for the update can grow significantly, because of the fact that some indexes would be recreated.
Further Considerations
Exporting Partial Index Information into XML
As mentioned before, Oracle does not index nullable columns, so the definition of the NOT NULL partial indexes does not apply in this case. But this definition needs to be stored somehow because otherwise when the index is exported from Oracle to the XML files it would lose the partial index information.
To solve this problem, one possible approach could be the same as the one currently used by DBSourceManager in order to keep the operator class information. This approach consists of store this kind of information into a table comment with the following format:
indexName1.indexColumn1.operatorClass=operatorClass1$indexName2.indexColumn2.operatorClass=operatorClass2$...
These comments are updated each time an index that defines an operator class is added or removed, both when it is done at the same time of the table creation or after it. This information must also not be taken into account when comparing indexes in Oracle in order to prevent unneeded index recreations.
The main limitation of this approach is that comments in Oracle has a limited size of 4000 characters. This would limit the number of partial indexes that can be created for every table.
For this reason, it should be considered the possibility of inserting information at column level, using a column comment. This way the number of indexes that could be defined per table would be high enough.
Exporting Changes on Index to a Configuration Script
It must be reviewed the ability of exporting the index definition within a Configuration Script. This way, it would be possible to revert an index definition in a custom template making the database index configuration more adaptable to the particularities of every environment.
Known Limitations
Due to the existing limitation in Oracle which prevents to index more than once the same set of columns, if we define two partial indexes for the same column but with a different where clause it will work in PostgresSQL but that XML model will not be valid in Oracle. This is a known issue reported here.