It’s well understood that backing up your database is, arguably, the single most important step in preventing data loss and building redundancy into your application. A thoroughly redundant backup plan will include having replica servers on standby and filesystem level backups for disaster recovery.
In this post…
The second most important step is verifying those same backups. Unfortunately, this step is not implemented often enough, and has precipitated several publicized failures to fully recover from an outage.
This article will detail how we use Python and LXC to verify PostgreSQL filesystem level backups created with the pg_basebackup tool.
System requirements:
The following is a list of software we used to implement this project, but any flavor of Linux and other versions of Python should work.
Prerequisites:
This article assumes you already have a filesystem level backup to work from. For reference, we run the following command on a replica instance to create the two backup tar files.
pg_basebackup -h localhost -U username -D /var/lib/pgsql/backup -Ft -Xs -z
Running this command will result in two files, named base.tar.gz and pg_wal.tar.gz. The -Ft option tells postgres to create tar files. The -Xs option tells postgres to stream the write-ahead log log in parallel, giving us the second tar file, so that we have a standalone backup. And finally, the -z option simply gzips the tar file output.
Step 1
Create the container and a shared directory
import lxc
import time
import datetime
import os
#create the container
c = lxc.Container("postgresql")
c.create(
"download",
lxc.LXC_CREATE_QUIET,
{"dist": "centos", "release": "7", "arch": "amd64"})
#append the mount point for the shared directory to the config file
#this should be the host directory where the backup files are located
mountentry = "/var/lib/pgsql/backup/ media/postgresql none bind,create=dir 0 0 "
if not mountentry in c.get_config_item("lxc.mount.entry"):
c.append_config_item("lxc.mount.entry", mountentry)
#start the container
c.start()
#wait for the network
while not c.get_ips(timeout=1):
print('Waiting for network...')
#sleep for a little longer just to be sure
print('Network found, pausing a moment to let it finish starting...')
time.sleep(5)
Before starting the container we’ve created, we need to share the backup directory so that the database can be restored inside the container. We’ll also use this directory to write and read text files so that the container can communicate its status with the host.
After we start the container, we need to wait for the network to start before we can continue on to the next step. I added a five second buffer because I found the network still needed a little more time after the container returned true for get_ips().
Step 2
Install PostgreSQL inside the container
#install repo and postgresql10 packages using yum
c.attach_wait(lxc.attach_run_command,
["yum", "install", "https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm", "-y"])
c.attach_wait(lxc.attach_run_command,
["yum", "install", "postgresql10-server", "-y"])
c.attach_wait(lxc.attach_run_command,
["yum", "install", "postgresql10-contrib", "-y"])
We use YUM to install the PostgreSQL repo and server packages. If you are using a different package manager, you’ll need to modify the code to reflect that.
Step 3
Uncompress backup files
#uncompress backup files into postgres data directory on the container
c.attach_wait(lxc.attach_run_command,
["mkdir", "/var/lib/pgsql/10/wal_archive"])
c.attach_wait(lxc.attach_run_command,
["chmod", "0700", "/var/lib/pgsql/10/wal_archive"])
c.attach_wait(lxc.attach_run_command,
["tar", "-xzvf", "/media/postgresql/base.tar.gz", "-C", "/var/lib/pgsql/10/data"])
c.attach_wait(lxc.attach_run_command,
["tar", "-xzvf", "/media/postgresql/pg_wal.tar.gz", "-C", "/var/lib/pgsql/10/wal_archive"])
c.attach_wait(lxc.attach_run_command,
["chown", "-R", "postgres:postgres", "/var/lib/pgsql/10/wal_archive"])
c.attach_wait(lxc.attach_run_command,
["chown", "-R", "postgres:postgres", "/var/lib/pgsql/10/data"])
Create a directory to store the log archives that will be played back once the database has started. The base files will go in the default postgres data directory. Since we’re running commands as root inside the container we need to give ownership of the files to the postgres user.
Step 4
Change the PostgreSQL configuration for performance (optional)
#modify postgresql.conf and comment out the following configuration settings.
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/shared_buffers/#shared_buffers/g", "/var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/effective_cache_size/#effective_cache_size/g", "/var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/archive_mode/#archive_mode/g", "/var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/maintenance_work_mem/#maintenance_work_mem/g", "/var/lib/pgsql/10/data/postgresql.conf"])
#second, append the same configuration items with more suitable settings
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "echo 'shared_buffers = 2GB' >> /var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "echo 'effective_cache_size = 4GB' >> /var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "echo 'archive_mode = off' >> /var/lib/pgsql/10/data/postgresql.conf"])
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "echo 'maintenance_work_mem = 250MB' >> /var/lib/pgsql/10/data/postgresql.conf"])
There is a good chance that the container is not configured the same as the production replica server. Therefore, we need to change some of our settings so PostgreSQL will run. We did this by commenting out the production lines and appending more container-friendly entries to the end of the postgresql.conf file.
Step 5
Promote the replica to primary, and start the database server
#modify recovery.conf so the database will not be in standby mode when it starts up
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/standby_mode/#standby_mode/g", "/var/lib/pgsql/10/data/recovery.conf"])
c.attach_wait(lxc.attach_run_command,
["sed", "-i", "s/primary_conninfo/#primary_conninfo/g", "/var/lib/pgsql/10/data/recovery.conf"])
#start postgresql
c.attach_wait(lxc.attach_run_command,
["systemctl", "start", "postgresql-10"])
Remember, we pull this backup from a replica server, which was running as a standby using streaming replication. We need the database to complete the recovery process and before starting up as a primary, so we comment out the above lines to turn off standby and streaming replication.
With that out of the way, we start the server (if systemctl hangs at this point, see the caveats section below for help).
Step 6
Wait for the database to start and complete its recovery
#wait for archives to catch up
#create status file in shared dir so both host can keep tabs on container output
c.attach_wait(lxc.attach_run_command,
["touch", "/media/postgresql/status"])
#check that recovery has completed by looping through text file until the container has updated it. Give up after 30 seconds.
i = 0
while len(open("/var/lib/pgsql/backup/status").readlines()) == 0 and i < 30:
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "grep 'consistent recovery state reached' /var/lib/pgsql/10/data/pg_log/postgresql-" + str(datetime.date.today()) + ".log > /media/postgresql/status"])
print('Waiting for recovery to complete...')
time.sleep(1)
i += 1
#check that database is ready to accept connections
i = 0
while len(open("/var/lib/pgsql/backup/status").readlines()) == 1 and i < 30:
c.attach_wait(lxc.attach_run_command,
["sh", "-c", "grep 'database system is ready to accept connections' /var/lib/pgsql/10/data/pg_log/postgresql-" + str(
datetime.date.today() ) + ".log >> /media/postgresql/status"])
print('Waiting for database to accept connections...')
time.sleep(1)
i += 1
After we’ve started the database we need to wait for it to complete the recovery process. Unfortunately, the host can’t communicate directly with PostgreSQL running inside the container, so we use text files to bridge that disconnect.
PostgreSQL must finish playing back the log archives before it will be ready to accept connections. We’ll wait 30 seconds at most for the database to recover, and another 30 seconds at most for database to finish starting up. If PostgreSQL is not online by then, we can deduce that our backup is failing.
Step 7
Verify the database by querying it
#create text file in shared dir and set perms.
#the container will write the output to it, host will read to verify backup
c.attach_wait(lxc.attach_run_command,
["touch", "/media/postgresql/date"])
c.attach_wait(lxc.attach_run_command,
["chgrp", "postgres", "/media/postgresql/date"])
c.attach_wait(lxc.attach_run_command,
["chmod", "g+w", "/media/postgresql/date"])
Create the text file that we’ll use to communicate with PostgreSQL in the shared directory.
#run a query against the database to make sure we have recent data, write to date file
c.attach_wait(lxc.attach_run_command,
["su", "-", "postgres", "-c", "psql dbname -t -c 'SELECT datetime FROM table ORDER BY id DESC LIMIT 1' > /media/postgresql/date"])
#read the date file
with open("/var/lib/pgsql/backup/date", 'r') as datefileio:
backupdate = datefileio.read().strip()
#convert date string to datetime object
backupdate = datetime.datetime.strptime(backupdate, '%Y-%m-%d %H:%M:%S')
#if the date is too old, something is wrong with our backup
if backupdate < datetime.datetime.utcnow() - datetime.timedelta(1):
print("Backup data is old")
We pull daily backups from replica server, so we need to verify that the data is not older than 24 hours. We decided to verify the backup by querying one of the tables that we know is updated frequently with timestamp data. If we find the most recent timestamp is outside our acceptable window, the backup is not valid.
The above code is an example. You’re requirements for querying the database will vary based on how your data is structured.
Step 8
Tear down the container and delete text files
#stop and destroy container
c.stop()
c.destroy()
#delete status and date text files
os.remove("/var/lib/pgsql/backup/status")
os.remove("/var/lib/pgsql/backup/date")
Assuming we are going to run this script repeatedly, we need to clean up after we’re done. That means stopping and destroying the container, as well as removing our text files. Now we have a clean slate for the next run.
Caveat
The examples in this article are based on code we’ve implemented to verify our production backups. It works, but we have had to implement one workaround by adding this code just after postgresql10 packages are installed.
#restart the container, because systemctl is buggy
c.stop()
time.sleep(10)
c.start()
time.sleep(10)
Why are we restarting the container? Because, the postgresql10 install upgraded systemctl, which then caused systemctl to hang. It’s most likely a bug, so we chose to implement this workaround instead of chasing bugfoot down a rabbit hole.
Please reach out to us if you have encountered any this bug, especially if you have a solution. We could use the solidarity.
Questions? Comments? Criticism?
If you’ve got any ideas for improvements, or better way to verify PostgreSQL backups, let us know! Meanwhile, I’ve posted the code in its entirety at GitHub for those interested in using it.
I recently published a list of The World’s Best Python Courses in 2020. It is a great additional resource for the Python information covered in this post.
It is based on more than 10,000 real user reviews and can be checked out here
https://www.courseduck.com/programming/python/