Variants of instant recovery
Hewlett Packard Labs
Traditional theory and practice of write-ahead logging and of database recovery focus on three failure classes. These are transaction failures (typically due to deadlocks) resolved by rollback; system failures (typically power or software faults) resolved by restart with log analysis, “redo,” and “undo” phases; and media failures (typically hardware faults) resolved by restore operations that combine multiple types of backups and log replay.
The recent addition of single-page failures as a fourth failure class and new techniques for single- page recovery have opened new opportunities far beyond its original aim of immediate, lossless repair of single-page wear-out in novel or traditional storage hardware. In the contexts of system and media failures, efficient single-page recovery enables on-demand incremental “redo” and “undo” as part of system restart or media restore operations. This can give the illusion of practically instantaneous recovery: instant restart permits processing new queries and updates seconds after system reboot and instant restore permits resuming queries and updates on empty replacement media as if those were already fully recovered. In the context of node and network failures, instant restart and instant restore combine to enable practically instant failover from a failing database node to one holding merely an out-of-date backup and a log archive, yet without loss of data or of transactional integrity.
1 Context and problem description
Modern hardware differs from hardware of 25 years ago, when many of the database recovery techniques used today were designed. Current hardware includes large memory and large buffer pools with many millions of pages and therefore, after system failures, many in-doubt pages and long restart recovery; semiconductor storage with single-page failures due to localized wear-out; high-density disks with single-page failures due to cross-track effects, e.g., in shingled recording; and high-capacity storage devices with long restore recovery after media failures, e.g., multiple hours even at their maximal advertised bandwidth.
Our techniques [GGS 14, SGH 15] for instant recovery employ and build on many proven techniques, in particular write-ahead logging, checkpoints, log archiving, sorting, and indexing. The foundation are two new ideas. First, single-page repair enables incremental recovery fast enough to run on demand without imposing major delays in query and transaction processing. Second, log archiving not only compresses the log records but also partially sorts or even indexes the log archive, which enables multiple access patterns, all reasonably efficient. In other words, the contributions of “instant recovery” are ubiquitous fine-grained on-demand recovery and a novel data organization of log archives that permits both efficient archiving and efficient restore operations, i.e., both creation and usage of the log archive.
These foundations enable incremental recovery actions invoked on demand, in particular after system failures (producing an impression of “instant restart”), after media failures (for an impression of “instant restore”), and after failover (“instant failover” and “instant failback”). In addition to incremental and thus apparently instant recovery, new techniques speed up offline backup and offline restore operations. In particular, differential and incremental backups become obsolete and full backups can be created efficiently without imposing any load on the active server process or its network.
The problem of recovery methods being out-of-date for today’s hardware exists equally for file systems, databases, key-value stores, and contents indexes in information retrieval and internet search. Similarly, the techniques and solutions discussed below apply not only databases, even if initially framed in database terms, but also to file systems, key-value stores, and contents indexes. In other words, the problems, techniques, and solutions apply to practically all persistent digital storage technologies that employ write-ahead logging.
The fundamental technique underlying all novel techniques summarized below are transactions and write-ahead logging. Transactions group multiple reads, writes, increments, etc. on multiple persistent objects into a single indivisible operation. Each transaction provides the ACID properties: atomicity (all-or-nothing failure atomicity), consistency (taking the database from one consistent state to another), isolation (also known as concurrency control or synchronization atomicity), and durability (surviving any failure, once committed). Write-ahead logging requires logging “redo” and “undo” information on “stable storage” before writing a database page from the buffer pool back to its old location on persistent storage.
Traditional theory and traditional implementations of database recovery focus on three failure classes. Recovery from a transaction failure employs transaction rollback or “undo.” Recovery from a system failure requires restart with log analysis, “redo” of all pre-crash updates, and “undo” of transactions that remained incomplete due to the failure. Recovery from a media failure requires restoring a backup and replaying the recovery log between backup and failure. In most systems, the log records of each transaction form a linked list (the per-transaction chain of log records). In some systems, e.g., Oracle and SQL Server, the log records pertaining to the same database page form another linked list (the per-page chain of log records). The pointers in these linked lists are really log sequence number (LSN) values, typically byte addresses within the recovery log. To form the per-page chain, each log record contains the prior PageLSN value of the data page, which recovery from system and media failures relies on for “exactly-once” application of log record.
Some of our novel techniques rely on b-tree indexes. Those have been ubiquitous in databases, key- value stores, and file systems for multiple decades. Some new techniques rely specifically on foster b- trees or on partitioned b-trees. For example, self-repairing indexes work best with indexes that, like Foster b-trees, guarantee that each page has only one incoming pointer at all times. For another example, instant restore works for any database structures and indexes but it requires a partitioned b-tree (or an equivalent) for the log archive.
Since backup and restore are anything but “hot” topics in contemporary research on databases, key-value stores, or file systems, there is little recent related work. Principal sources of related techniques are Gray [G 78] and Mohan [M 93, MHL 92, MN 93]. Not much research has been published on backup and restore in the last two decades, although all of the techniques below could have been invented 20 years ago.
3 Specific techniques and solutions
This research effort has produced the following new techniques.
3.1 Single-page failures and repair
i. Single-page repair: If the recovery log embeds per-page chains of log records, the history of each database page is readily accessible. In the first design, a central data structure (termed the “page recovery index”) held anchors to these linked lists. The page recovery index requires an update before the buffer pool evicts a dirty page. Recovery is required after the buffer pool reads an inconsistent or out-of-date page from persistent storage. – The advantages over prior techniques are that a local failure on flash storage or on shingled disks does not imply replacement and recovery of the entire device; and that recovery from local failure takes a second or less, not hours.
ii. Self-repairing indexes: This design requires entries in a page recovery index only for root pages of user-defined indexes. For all non-root index nodes, the parent node carries the necessary information. Specifically, instead of a key-pointer pair for each child node, a parent node requires a key-pointer-LSN triple for each child node. – The advantages over traditional indexes are more reliable and available indexes in production systems; and more efficient quality assurance and root cause analysis during software development.
iii, Write elision: The buffer pool may “forget” to write a dirty page when evicting it, relying on single- page recovery to bring the page up-to-date after it is read the next time. – This technique pays off if recovery (based on log records still in memory) is faster (or otherwise cheaper) than writing a dirty database page.
iv. Read elision: Instead of reading, modifying, and writing a database page, it is sufficient to log an update (typically an insertion) and register the new LSN value in the page recovery index or in the parent page such that the deferred update can be applied after the page is read the next time. – The technique gives the advantages of buffer trees without their increased tree depth.
v. Deferred “undo:” After a media failure in a traditional system, incomplete transactions with updates on the failed media can go neither forward (run to completion) nor backward (roll back). If such transactions hold locks on other parts of the database, these locks must remain in force, possibly stalling other transactions and applications. With deferred “undo,” such incomplete transactions can roll back by merely logging the “undo” actions without applying them to the database, i.e., the failed media. After media recovery, self-repairing indexes and single-page repair will apply the “undo” actions before new transactions may access the data on the replacement media. – The advantage is that a media failure on one device does not spread un-availability to other devices due to stalled transactions and their locks.
3.2 System failures and restart
vi. Backward log analysis: After a system failure, log analysis recovers the pre-crash server state, in particular transaction manager, lock manager, and buffer pool. With checkpoint intervals typically much longer than most transactions, the traditional forward log analysis (scanning the pre-crash recovery log from the last checkpoint to the crash) uselessly performs lock acquisition and release for many “winner” transactions, i.e., those that committed before the crash. – The advantages over traditional forward log analysis are simpler logic and less effort, since there is no longer a need to save the locations of checkpoint log records in some well-known place and since backward log analysis can easily restrict lock acquisition to the loser transactions, i.e., those without a commit log record.
vii. Parallel “redo” and “undo:” ARIES papers suggest parallel “redo” and parallel “undo,” both based on a single-threaded scan of the pre-crash recovery log. If log analysis recovers the entire server state, including the set of in-doubt database pages and the set of loser transactions, “redo” and “undo” can process the elements of these sets in any order and even in parallel. – The advantages over traditional methods are a high degree of parallelism and even a dynamic degree of parallelism.
viii. Instant restart: After log analysis has recovered the pre-crash server state, new transactions may start. If a new transaction attempts to access an in-doubt database page, “redo” (in effect single- page repair) can recover the page and make it available. If a new transaction attempts to acquire a lock conflicting with a lock held by a loser transaction, “undo” can roll back the loser transaction and release its locks. Thus, new transactions guide both “redo” and “undo” operations. Concurrently to new user transactions, traditional or parallel “redo” and “undo” may finish the restart recovery as quickly as possible. – The advantage over prior techniques is almost immediate availability after a reboot, thus a much lower mean time to repair and perhaps two additional nines in system availability, e.g., from 99% availability to 99.99% availability.
ix. Fast reboot: When a database server needs rebooting, e.g., for a security patch in the operating system, a large buffer pool may hold many dirty pages and thus a shutdown may take considerable time. An alternative to flushing those dirty pages is crashing the server deliberately, relying on restart recovery to bring the affected database pages up-to-date. In the past, there was no way around an offline period, either for flushing dirty pages during shutdown or for “redo” recovery during restart. Modern machines with larger memory and larger buffer pools make the problem worse, not better. With instant restart, the server may simply write a checkpoint and then terminate without flushing dirty pages from the buffer pool. Instant restart and its log analysis are fast as there are no log records after the last pre-crash checkpoint. – Instant restart eliminates the offline period as it permits immediate shutdown as well as new transactions immediately after reboot.
3.3 Media failures and restore
x. Fast log replay: The recovery log typically uses a device chosen for latency disregarding capacity, whereas the log archive typically uses a device chosen for long-term reliability disregarding latency. Some systems merely copy log records, others also compress them, and some transform the server history into a database history by stripping server events such as checkpoints, by omitting transaction rollback information, by stripping transaction identifiers from log records, by aggregating multiple successive changes to the same database page, etc. Our new technique supports all those ideas but also partially sorts log records on the identifier of the affected database page. Partial sorting here means that log archiving performs the run generation step of an external merge sort but leaves the final merge step to restore operations, should they become necessary. After the replacement device holds the most recent full backup, a single step merges the runs of the log archive and applies the fully sorted log records. Thus, no additional step is required for sorting yet log replay can be much more efficient than restoring traditional incremental backups. – With this technique, all incremental backups are obsolete such that production systems can switch from daily (incremental) backups to weekly or even monthly (full) backups, with no loss in restore performance, i.e., no loss in reliability or availability.
xi. Single-phase restore: With a sufficiently small number of runs in the log archive (ensured if necessary by a daily merge step of one day’s log records), a single merge step can combine the most recent full backup (sorted on database page identifiers), incremental backups (if any), and all runs in the log archive. – This technique enables complete restoration of an up-to-date device or database in the time a traditional restore operation merely restores the most recent full backup, i.e., eliminating the time for restoring incremental backups and for log replay, thus cutting restore times to a fraction (e.g., a quarter of the time traditionally required).
xii. Instant backup: In a traditional system, a backup command may take hours to produce a single large backup file. However, a small set of files permitting equally fast restore is just as good as a traditional backup file. Thus, an old backup and a log archive in runs ready for merging can serve the purpose of a traditional backup file. In order to mimic the effect of a traditional backup command, the log archiving logic has to switch from one run to another as part of the backup command. – The advantage of the technique is that a user request for a full database backup may run minutes, with little dependency on the database size, whereas a traditional backup command runs hours, depending on database and device sizes.
xiii. (Remote) virtual backup: A traditional backup operation imposes significant, even if temporary, load on the database server and perhaps its network connection to the desired location of the backup file. Thus, provisioning must scale the server and perhaps its network for this load. An alternative method for creation of a new up-to-date backup file merely merges an old backup file and the log archive. If backup and log archive are on the same network node, the primary database server might not even be involved in creation of a new backup. – This technique creates a new, up-to-date backup without any load on the database server or its network connection.
xiv. Instant restore: A (full or incremental) backup operation writes page images to a backup file ordered by their database page identifiers, omitting free database pages and often compressing allocated database pages, e.g., by omitting free space within pages. Due to compression, the location of a database page in a backup is not obvious; but due to the sort order, it is easy and efficient to create a b-tree index mapping each database page identifier to a byte offset in the backup file. Similarly, while sorting and merging runs in a log archive, it is easy and efficient to build a b-tree index on log records in the log archive, with a partition in the partitioned b-tree for each run in the log archive. With indexes on backup files and the log archive, the logic of single-phase restore can run for individual database pages or for runs of contiguous database pages, thus enabling efficient local restore operations. Instead of running a restore operation strictly in the order of page identifiers, user transactions and their data needs can guide on-demand local restore, with traditional bulk restore running in the background. – The advantage is much higher availability of information after a data loss, practically immediately after a replacement device is available.
xv. Hot restore: If a media failure occurs but a replacement device is available immediately, e.g., a spare that is formatted but empty, then active transactions may resume after a short delay required for initiating instant restore. – The advantage of resuming transactions rather than aborting them is particularly valuable for long-running transactions, e.g., index creation, but also increases general transaction processing throughput in the event of a media failure.
xvi. Transient restore: If a media failure occurs with no replacement device immediately available, transactions may run with database pages recovered in the buffer pool, queries and updates using the buffer pool, and buffer pool evictions relying on write elision and subsequent single-page repair. For a database page needed and evicted repeatedly, single-page recovery runs repeatedly. – While clearly not a desirable way to run a database or a business, it some cases is it better to limp along than to stand still entirely.
xvii.Effective optimizations for free space: If the log archive indicates de-allocation of a database page after a backup, there is no need to read this page from the backup media or to write it to the replacement media. While not particularly interesting for an individual page, this optimization is useful for large contiguous runs of pages, e.g., removal of a materialized view, a secondary index, or an entire table with all its indexes. In fact, the restore logic should ignore not only the backup but also the log records preceding de-allocation. The same idea applies after space is re-used: there is no need to read the backup. It is not sufficient to guide the restore logic with the space allocation map because free space management only captures the current state but not the history of a page, in particular de-allocation and re-allocation. – Immediately reading all log records for a database page or a run of pages can save substantial effort during restore operations.
xviii. Incremental backup guided by log archive: If an incremental backup is wanted, it can use the log archive to decide which database pages to include. Merging runs of the log archive, retaining only the page identifiers of modified database pages, produces a list of pages for inclusion in the incremental backup. The sort order of this list permits efficient reading in the database and efficient writing in the backup. A log record about page de-allocation permits reducing the backup volume. – There is no longer any need for an expensive in-database structure to guide differential or incremental backups, if they are actually wanted.
xix. Inline database reorganization: A restore operation may modify the database representation, e.g., remove ghost records (invalid records) from database pages. If the restore operation employs the database buffer pool beyond the input buffers of the merge logic, cross-page reorganization can include load balancing among siblings in a tree structure or adoption in foster b-trees (and other forms of overflow resolution in other storage structures). The restore logic can include a variety of compaction and defragmentation operations, even removal of secondary indexes and of materialized views if those are fully contained within the restore storage space. Restore may also include creation of secondary indexes or materialized and indexed views. – At the cost of more complex logic, a restore operation can include various database optimizations, saving much I/O compared to separate database reorganization and immediately including the optimized database pages in the next backup.
xx. Online database migration: When migrating all data from one media (disk, table space, etc.) to another, there used to be three methods: taking those data offline (no updates) for the duration of a copy operation, temporarily applying each update in both places, or moving stale data (eventually catching up based on the recovery log). Instant restore offers a new choice: “restoring” into the new space with the old data serving in the role of the backup. – The new method avoids duplicate efforts for immediate maintenance or for log-based catch-up, yet it runs online and guided by actual data usage.
3.4 Node failure and failover
xxi. Instant failover: Instant failover assumes nodes in a network and log shipping between them. Using a buffer pool, the database, and a recovery log, a primary node executes queries and updates; while a secondary node holds an out-of-date database backup and continuously adds to the log archive. Their principal communication is continuous log shipping from the primary node to the secondary node. In the traditional approach to high availability and fast failover, the secondary node holds a complete copy of the database and always keeps it up-to-date by immediate application of all log records received via log shipping. An alternative is mirroring – instead of log records, the primary node ships database pages as it writes them to its local database. Compared to log shipping, mirroring saves the effort to apply log records and it saves reading database pages that require changes. On the other hand, mirroring ships entire pages even when only a few bytes have changed. Moreover, both mirroring and log shipping require and maintain a full up-to-date copy of the database on each secondary node. In instant failover, the secondary node does not require an up- to-date copy of the database. It merely requires empty space for the database, a full database backup (days, weeks, or even months old), and a log archive covering the entire interval since the database backup. Both database backup and log archive must support fast access by page identifier, typically using indexes similar to those required for instant restore. Failover requires recovery of both server state and database contents. The server state includes transaction manager, lock manager, and buffer pool. Recovery or re-initialization of server state is quite similar to system restart, i.e., principally relies on log analysis. Recovery of database contents is quite similar to media restore operations, i.e., principally merges backup and log archive. – Instant failover provides the high availability of mirroring and of log shipping without the high network bandwidth of mirroring or the local costs for reading database pages and replaying log records; but most importantly, without the need of an up-to-date copy of the database. An out-of-date, compressed backup plus the log archive suffices for instant failover.
xxii. Instant failback: When a former primary node reconnects with a cluster (e.g., after a reboot), it should resume its workload yet its database is out-of-date. Traditional failback requires a catch-up effort in addition to shipping the most recent log records. Instant failback permits resuming the workload immediately after the log archive is available to the former primary node, using techniques of single-page repair and instant restore to catch up individual database pages on demand. – The advantage over traditional failback methods is faster resumption of the original workload partitioning among nodes in a cluster.
xxiii. Elasticity by failover: If a workload permits partitioning into more partitions than there are nodes in a cluster, each such partition can fail over to any other node practically any time. While this is relatively simple for stateless nodes, it has not been possible for other nodes, e.g., databases. – Instant failover and failback for individual partitions permits growing and shrinking the set of nodes assigned to a workload, with high reliability and availability yet with low overhead during transaction processing and during failover.
xxiv. Failover pools: In a cluster with multiple nodes, traditional techniques for high availability require up-to-date database copies on each secondary node, with a continuous workload due to mirroring or log shipping. Thus, each primary node requires a suitable number of dedicated secondary nodes. Since instant failover does not require an up-to-date database but merely access to a backup and the log archive, any node may take over the workload of any node in a cluster and a few spare nodes may serve as secondary nodes for a large cluster. – A small pool of spares even in a large cluster, yet with excellent reliability and availability, reduces data center size, power, cost, etc. by a factor almost equal to the redundancy desired for each node. For a workload requiring N primary nodes and resilience against the failure of any K nodes, a traditional data center requires N×(K+1) nodes whereas a cluster with a failover pool requires N+K nodes. For example, if transaction processing requires 100 nodes and resilience against any double failure, traditional failover design require 100×(1+2)=300 nodes whereas a failover pool requires only 100+2=102 nodes.
3.5 File systems
xxv. Self-repairing file systems: File systems hold data pages and various kinds of metadata, e.g., directories, allocation information, indirection blocks, metadata and contents indexes, etc. The metadata almost invariably fits into the constraints of a key-value store or a traditional database index; therefore, it could benefit from all the database techniques above. The data pages are different since they do not permit page headers and thus PageLSN values, which are required for traditional write-ahead logging and recovery. A self-repairing file system relies on write-ahead logging, employs self-repairing indexes for all metadata, and adapts self-repairing indexes for data pages. Instead of a PageLSN value in each data page, its parent page (e.g., the indirection block) augments each child pointer not only with an LSN value but also with a CRC value (cyclic redundancy check or other parity). When the buffer pool loads a data page, it verifies its CRC value and, in case of a failure, recovers the correct page contents using the LSN value and the recovery log or log archive. – The advantages of self-repairing file systems are the same as those of self- repairing indexes and thus self-repairing databases: transactional updates, efficient incremental consistency checks, efficient local repair, high reliability, and high availability; all with write-ahead logging rather than mirroring, which is more expensive.
xxvi. Instant (direct) log archiving: When writing a newly allocated page or when overwriting an existing page in its entirety, traditional logging writes two copies of the page to the recovery log (because the log is usually mirrored) and then another copy during log archiving (which usually uses a RAID- 6 device). Thus, logging is more expensive than mirroring for pages of a new index (in a database, in a key-value store, or in a file system) or of a new file (in a file system). Instant log archiving reverses this imbalance by writing the new pages directly to the log archive, of course in runs sorted and indexed on page identifier, and writes merely short log records about the page allocation to the recovery log. Thus, new pages are written not four times (as required when logging and archiving in dual copy), not three times (as required for mirroring resilient to two failures) but only two times (in the data store and the log archive) plus some redundancy required by RAID for the log archive, e.g., RAID-6. – The advantage over traditional mirroring file systems is a reduction in overall write volume, in addition to the other benefits of instant recovery such as instant single-page repair, instant system restart, instant media restore, instant node failover, obsolete incremental backups, and remote virtual backups.
Summary and conclusions
In summary, modern hardware triggered research into detection and recovery of single-page failures, which created opportunities for on-demand, incremental, seemingly instant recovery from system failures, media failures, and node failures. Moreover, efficient single-page recovery creates opportunities for more reliable data stores, e.g., in the form of self-repairing indexes, and for more efficient operations, e.g., using write elision and read elision.
For incremental recovery from media failures, two further innovations were required: indexing backup files and partially sorted, indexed log archives. Partially sorted log archives, optionally indexed for on-demand use, permit both efficient creation from the active recovery log and efficient use during restore operations. Archiving employs principally the logic of run generation well known from external merge sort; restoration employs principally the merge logic well known from external merge sort.
Even a very large log archive, if sorted, permits efficient log replay: instead of many random I/O operations in the database, database pages from the backup media and log records from the log archive only require merging. Large units of I/O ensure restore bandwidth near the limits of the devices employed. Thus, there is no longer any use for differential and incremental backup media; restore operation may simply merge the last full backup with the log archive. Similar logic also permits creating a new, up-to-date full backup without touching the active database; merging the previous backup with the log archive gives the same result with almost the same effort.
In compute clusters with independent nodes, techniques from instant restart and instant restore combine to instant failover. The new primary node recovers the failed, old primary node’s server state as in instant restart and its database contents as in instant restore. Compared to traditional mirroring and log shipping, instant failover does not require each secondary node to keep a complete copy of the data store up-to-date at all times. Instead, a secondary node merely requires access to a backup and a log archive partially sorted and indexed as for instant restore. With much less effort and state required on each secondary node, a few spare nodes can provide failover and availability for a large set of primary nodes. In the past, each primary node required a few spare node for high availability. In contrast, instant failover based on a small shared pool of failover nodes enables substantial savings in data center size, equipment, power, and cost.
Finally, based on cyclic redundancy checks or other parity calculations, self-repairing index techniques can be adapted to data pages that lack page headers, in particular the PageLSN values fundamental to the all ARIES recovery methods. Thus, all instant recovery technique can apply not only to databases and key-value stores but also to future file systems.
[GGS 14] Goetz Graefe, Wey Guy, Caetano Sauer: Instant recovery with write-ahead logging: page repair, system restart, and media restore. Synthesis Lectures on Data Management, Morgan & Claypool Publishers 2014, pp. 1-85. A revised and extended 2nd edition is imminent.
[MHL 92] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, Peter M. Schwarz: ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS 17(1): 94-162 (1992).
[MN 93] C. Mohan, Inderpal Narang: An efficient and flexible method for archiving a database. ACM SIGMOD 1993: 139-146.
[SGH 15] Caetano Sauer, Goetz Graefe, Theo Härder: Single-pass restore after a media failure. BTW 2015: 217-236.