The art of pagination – Offset vs. value based paging
The main goal of a database is to persist large data sets efficiently, consistently and durable. A database must answer client requests and provide required data respectively. The only way to provide large datasets to clients is to use a paging mechanism. Paging returns only a subset of the entire result set of a database query. As required, a further subset can be read from the table subsequently. This article describes why large amount of data should only be read as subsets, what a Paginator is and how it affects the performance of an application.
It’s well known that the database will need more time to response a client request if the amount of data in a table increases stronger. This problem reinforces when a query involves more than one table. The relational database model causes this problem. Related data is logically linked with primary and foreign key relations. The primary task of a database is to keep the consistency of the data. While executing a query, which involves a high amount of data, the database needs many internal resources to finish the query. The database needs two steps to execute a simple query which involves more than one table. The first step: The database creates a Cartesian product with the data of the involved tables. Next step: The database selects only the rows, which satisfies the relation. The consequence is that the amount of database operations increases proportionally to the growth of data.
It’s absolutely necessary to split the results of a query into subsequences to archive constant response time. At the splitting process, it is very important to keep paging transitions fluently. That means, rows can’t appear twice or miss completely in the result. For this purpose, the so-called Paginator comes into play.
The pagination of a document is the process to add page numbers on a header or/and a footer with a Paginator. This principle can be used for the problem, which is described above. The Paginator divides a query into sub queries at the correct position. The page size, which is the amount of rows in a subset, is freely selectable. A user of a web application can request the next results of a query with page numbers or “previous-next” buttons. For pagination a unique sort order is required. A query without an ORDER BY clause randomizes the order of the corresponding results and this may possibly lead to incorrect results. There are basically two different approaches to implement a Paginator. The next two chapters will describe the implementation approaches and show the pros and cons of each approach.
In the offset method, the database counts up all rows until the desired row has been found. The rows before the desired row are skipped. Therefore, SQL provides the keywords OFFSET and FETCH FIRST … VALUES ONLY. The keyword OFFSET defines the start position of the read pointer. The other keyword FETCH FIRST … VALUES ONLY limits the maximal amount of tuples of the query. For example OFFSET 100 LIMIT 20 will retrieve 20 rows starting with the 100th row.
This scheme could be also done with JPA. It provides the methods setFirstResult(int offset) for a OFFSET-clause and setMaxResult(int limit) for a FETCH FIRST-clause. Let’s have a look to the usage:
TypedQuery<Employee>query=entityManager.createQuery("SELECT e FROM Employee e ORDER BY e.name",Employee.class);
As you can see, the methods decorate the TypedQuery instance. That’s a big advantage because the main JPQL query instance stays untouched. So, it’s possible to execute the extensions under a certain condition.
This way of paging is very easy to implement, but the counting process is very inefficient. Assume a table contains 100000 rows. A query should retrieve the next 10 rows from row 50000. This could be expressed with following SQL query:
The database skips 49999 rows to retrieve the next 10 rows from row 50000. It’s wasting a high amount of CPU and I/O and this may lead to very long response times. The farther down a result set you page the bigger the waste in CPU and I/O. The following illustration describes the problem:
Other key factors, i.e. latency and utilization of the database, won’t be considered in this analysis. These elements also increase the response time of a request.
Value based method
In this method, the next records are returned based on the last record of the previous page. Instead of line numbers, the last row is used as an “offset” value. This requires the uniqueness of the search column. If this criterion is not fulfilled, the primary key of the table can be used to create a unique sort order. Through this approach, the offset condition is moved to the WHERE clause and an additional keyword OFFSET has been saved. A major advantage of this approach is that now a database index can be used to search the lower limit, if an index which contains the search columns exists. Consequently, the entries of the previous pages are really skipped, which is now much more efficient. This fact is illustrated by the following graph:
A disadvantage of this method is the complexity of the query, when the lookup column is not unique. That can be seen in the following example:
In addition, this query must be formulated manually and can’t be delegated to a particular JPA provider, because no API is provided for this approach of paging.
To proof the hypotheses, a test has been carried out and the results of both approaches have been compared. The results are summarized in the next chapters.
This section explains the structure of the database table. The present table is built very primitive. It contains an INTEGER column (as primary key), a DATE column, a VARCHAR column, a TIMESTAMP column and a DECIMAL column.
The following table shows the columns of the table:
A test for each approach has been done 3 times. The end result based on the average measures of all tests. A test consists of 20000 database queries. The page contains 50 rows. That means one test retrieves 1 million rows of a table. So the data table contains exact 1 million rows respectively. The inserted data has been generated. For this test a DB2 instance has been used. The database is running on the same machine, which also fires up the queries to the database. Thus, the results are not affected by additional network latency. The test application is a simple CLI, which is based on Java. It uses JDBC to connect and execute database queries.
The following table contains the results of each test. The data has been collected out of the snapshot monitor of the DB2 instance.
As you can see from the comparison, the results of the value based approach are outstanding. The value based method needs less read operations and it needs no addition effort for sorting. These factors are the main reasons, why this approach needs less execution time to retrieve 1 million rows from table. Why is this approach much faster than the offset method? This can be realized with the following execution plans, which have been generated with the DB2 dashboard.
The left plan shows the execution plan of the offset method. According to the execution plan, the database executes the table scan operation (TBSCAN) operation twice. An index has been created for both approaches. This causes the high amount of read operations and also the high execution time.
The right execution plan shows the execution process of the value based method. The row of the previous page is used to find the first row of the next page. Therefore the database uses the index to retrieve the start point of the next page. The results are fetched and sent back to the client. If you compare both execution plans, then you can see that no table scan operations have been executed. This leads to a very efficient query execution process and less execution time to retrieve the same result.
Response time of each query
Furthermore the response time of each query has been investigated. The following diagram shows the results:
This diagram shows, that the response time of the offset method increases the farther is scrolled. The response time of the 20000th query is 4 seconds in average. In contrast, the response times of the values based also increases but the 20000th query takes 200 milliseconds in average. This is 20 times faster than the offset method.
This article compares the 2 basic approaches to implement a Paginator to increase the performance of SQL queries. A paging mechanism to archive constant response times is strictly required, if the amount of data in the database is very high. I described two approaches how a Paginator could be implemented. The offset approach is a very simple to implement and it’s also easy to understand the underlying mechanism. But the results of the tests shows, that the counting operation is very inefficient, which needs much CPU and I/O operations. The value based approach uses the last row of the previous page as a “offset” to retrieve the next page from the database. Tests verified that this approach works very efficient and the response times are really constant.
The comparison of the described methods testifies that the value based method very appropriate as a paging mechanism. The complexity and the lack of support in the particular JPA providers are the biggest disadvantages. The advantages and drawbacks should be weighed before realization.