Technical writings of Shkrt
Most of the time, when implementing a pagination feature in an application, we use the most trivial approach - a limit and offset combination. Of course, in some cases this type of pagination is enough - when we do not have to care about performance, for example in small applications, blogs, lightweight admin interfaces, and so on.
Now we will look at an example application. It has an articles table with about 700 thousand records. The table is not sharded or partitioned; the index that matters for this post is a composite one on published_at and id, whose name, index_articles_on_published_at_and_id, appears in the EXPLAIN plans below.
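In Rails migration terms it could be declared like this (a sketch; the actual migration is an assumption):

```ruby
# Composite index backing the ordered pagination queries below.
add_index :articles, [:published_at, :id],
          name: "index_articles_on_published_at_and_id"
```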
ActiveRecord example:
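Something like this (a sketch - the Article model name is an assumption, while the conditions, ordering, and limit come from the query plans below):

```ruby
page = 1      # requested page number
per_page = 10 # rows per page

Article.where(status: "published", rubric: "society", preview: false)
       .order(published_at: :desc, id: :desc)
       .limit(per_page)
       .offset((page - 1) * per_page)
```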
This produces roughly the following SQL query:
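```sql
-- Feed of published "society" articles, newest first,
-- paginated with limit/offset.
SELECT "articles".*
FROM "articles"
WHERE "articles"."status" = 'published'
  AND "articles"."rubric" = 'society'
  AND "articles"."preview" = FALSE
ORDER BY "articles"."published_at" DESC, "articles"."id" DESC
LIMIT 10 OFFSET 0;
```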
Let’s look at the EXPLAIN output for this query:
```
Limit  (cost=0.42..15.47 rows=10 width=128)
  ->  Index Scan Backward using index_articles_on_published_at_and_id on articles  (cost=0.42..626361.59 rows=416338 width=128)
        Filter: ((NOT preview) AND (status = 'published') AND (rubric = 'society'))
```
This query’s cost will increase as the offset grows. Let’s check this assumption by increasing the offset to 2000:
```
Limit  (cost=3009.33..3024.38 rows=10 width=128)
  ->  Index Scan Backward using index_articles_on_published_at_and_id on articles  (cost=0.42..626361.59 rows=416338 width=128)
        Filter: ((NOT preview) AND (status = 'published') AND (rubric = 'society'))
```
The cost is already about 3000. What if we set the offset to 20000? The table is big enough, and our pagination can theoretically reach this far:
```
Limit  (cost=30089.50..30104.54 rows=10 width=128)
  ->  Index Scan Backward using index_articles_on_published_at_and_id on articles  (cost=0.42..626361.59 rows=416338 width=128)
        Filter: ((NOT preview) AND (status = 'published') AND (rubric = 'society'))
```
The cost numbers are terrifying now, and they will only grow bigger. The trick has been explained in this blog post and in numerous conference talks. The problem is that on every page we have to scan through limit + offset records. Given that the limit parameter is 10, when the offset is 0 we scan through 10 records; when the offset is 200, through 210 records; when it is 2000, through 2010 records, and so on. So the deeper the page, the bigger the query cost becomes.

To solve this problem, we can implement keyset pagination. Keyset pagination does not rely on offset and instead uses a combination of values from the last fetched row - not all of its values, but those we use for ordering. In our example, we order by the published_at and id columns. To proceed to the next page, we memorize the published_at and id values of the last fetched row, and take limit rows whose published_at and id values are bigger (or smaller, depending on the ordering direction) than the memorized ones. For example, the last row of the first page can be this:
```
   id    |                    title                     |       published_at
---------+----------------------------------------------+----------------------------
 1030984 | A-Bank Card Holders Are No Longer In Charge  | 2017-11-19 14:50:17.18685
```
So, without using offset, our pagination SQL query will now look roughly like this:
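```sql
-- Keyset pagination: seek past the memorized values of the last
-- fetched row instead of counting skipped rows with OFFSET.
SELECT "articles".*
FROM "articles"
WHERE "articles"."status" = 'published'
  AND "articles"."rubric" = 'society'
  AND "articles"."preview" = FALSE
  AND "articles"."published_at" < '2017-11-19 14:50:17.18685'
  AND "articles"."id" < 1030984
ORDER BY "articles"."published_at" DESC, "articles"."id" DESC
LIMIT 10;
```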
The output of EXPLAIN for the first page now looks like this:
```
Limit  (cost=0.42..14.29 rows=10 width=128)
  ->  Index Scan Backward using index_articles_on_published_at_and_id on articles  (cost=0.42..627756.86 rows=412252 width=128)
        Index Cond: ((published_at < '2017-11-19 14:50:17.18685'::timestamp without time zone) AND (id < 1030984))
        Filter: ((NOT preview) AND (status = 'published') AND (rubric = 'society'))
```
And for a record that was created about two weeks earlier and roughly matches the offset of 2000:
```
Limit  (cost=0.42..14.29 rows=10 width=128)
  ->  Index Scan Backward using index_articles_on_published_at_and_id on articles  (cost=0.43..627755.37 rows=412212 width=128)
        Index Cond: ((published_at < '2017-11-04 08:52:29.889916'::timestamp without time zone) AND (id < 1022805))
        Filter: ((NOT preview) AND (status = 'published') AND (rubric = 'society'))
```
As you can see, with this approach, increasing the page number has no performance impact at all. Of course, this kind of pagination requires the presence of a matching index - in this example, the composite published_at, id index.
Also, we can use PostgreSQL’s ROW() constructor for the same purpose, which is handy when we do not order by an autoincrementing column (id in our example).
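A sketch of the previous query rewritten with a row-value comparison (same memorized values as above):

```sql
-- ROW() compares tuples lexicographically, matching the order given
-- by ORDER BY published_at DESC, id DESC.
SELECT "articles".*
FROM "articles"
WHERE "articles"."status" = 'published'
  AND "articles"."rubric" = 'society'
  AND "articles"."preview" = FALSE
  AND ROW("articles"."published_at", "articles"."id")
      < ROW('2017-11-19 14:50:17.18685'::timestamp, 1030984)
ORDER BY "articles"."published_at" DESC, "articles"."id" DESC
LIMIT 10;
```

Unlike two independent comparisons, the row-value form also handles ties on published_at correctly, since it falls back to comparing id only when the timestamps are equal.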
Of course, all of these queries can easily be translated into your favorite ORM’s DSL, as the sketch below shows. In conclusion, I would advise against using offset pagination on relatively big datasets: even when the record count reaches only tens of thousands, offset’s impact on performance becomes noticeable. But offset is fine for applications operating on relatively small datasets.
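For instance, the keyset query might look like this in ActiveRecord (a sketch; the Article model name is an assumption, and the memorized values are taken from the example row above):

```ruby
# Values memorized from the last row of the previous page.
last_published_at = Time.parse("2017-11-19 14:50:17.18685")
last_id = 1030984

Article.where(status: "published", rubric: "society", preview: false)
       .where("(published_at, id) < (?, ?)", last_published_at, last_id)
       .order(published_at: :desc, id: :desc)
       .limit(10)
```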
Suggested reading:
Five ways to paginate in Postgres, from the basic to the exotic
[ postgresql, activerecord ]