Secure & Reliable Backups With Restic

11/15/2024 5:16:35 AM

Backups are a core component of any infrastructure, including your own. In this post I'm going to detail how I have backups configured for my personal infrastructure. Much of this post will include snippets from a closed-source configuration management tool I built called confck, but the concepts and configuration should apply broadly. This setup handles my relatively small 2-3 TB of data across 3 hosts without issue.

Why Restic?

Prior to my current setup I had backups configured with borg. While the main motivator behind the change was performance (borg can be painfully slow), I also wanted to play with something new. In the end I think either borg or restic would work fine for my application, but I've been swayed by the minor advantages of restic (static Go binaries, regular updates & fixes, etc).

Approach

Our backup process starts on my host jak which acts as both the storage pool for backups and a coordinator for executing restic. This host has a script called run-backups which is executed via a systemd unit & timer. The basic setup for this job is as follows:

const backupService = new Unit("backup", {
  description: "run backups",
  service: {
    type: "oneshot",
    execStart: "/usr/local/bin/run-backups",
    user: "restic",
    group: "restic",
  },
}).dependsOn(user);

new Unit(
  "backup",
  backupService.makeTimerConfig({
    onCalendar: "*-*-* 04:00:00 America/Los_Angeles",
  }),
  { enabled: true }
).dependsOn(backupService);
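
Before relying on a schedule like this, it can be worth checking how systemd interprets the calendar expression. The following is a minimal sketch using systemd-analyze calendar, which parses an OnCalendar expression and reports its normalized form and next elapse time. The helper name check_schedule is hypothetical and not part of confck.

```shell
#!/bin/bash
# Sketch: ask systemd how it interprets an OnCalendar expression.
# check_schedule is a hypothetical helper name, not part of the author's tooling.

check_schedule() {
    # systemd-analyze exits non-zero if the expression cannot be parsed.
    systemd-analyze calendar "$1"
}

# Example (not executed here):
#   check_schedule "*-*-* 04:00:00 America/Los_Angeles"
```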

Next we get to the actual backup coordinator script, which first needs to load our encrypted restic password. This password is stored within an encrypted ZFS partition that must be manually unlocked after a reboot. While it'd be ideal to implement a zero-trust approach where hosts do not have access to the encryption key, it's convoluted in practice and isn't something my risk profile calls for (i.e. these hosts are all trusted). I'd love to see a backup tool which uses public-key encryption to make this style of backup trivial to implement, but for now I'm OK with this approach.

#!/bin/bash

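set -euo pipefail

# Assign first, then export, so a failed read (e.g. the encrypted storage is
# still locked) aborts the script instead of exporting an empty password.
RESTIC_PASSWORD="$(cat /mnt/storage/nsa/keys/restic.txt)"
export RESTIC_PASSWORD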

Once we've loaded our password into an env var we can go host by host and SSH in (passing through our RESTIC_PASSWORD env variable), executing our per-host backup. Note that SendEnv only has an effect if sshd on the target host is configured to accept the variable (AcceptEnv RESTIC_PASSWORD). The sequential approach keeps things simple and reduces overall load on the storage pool/coordinator.

START=$(date +%s.%N)
for TARGET_HOST in <%= it.hosts.map(host => host.name).join(" ") %>; do
    echo "Running backup on $TARGET_HOST"
    ssh -p 50000 -o SendEnv=RESTIC_PASSWORD restic@$TARGET_HOST "/usr/local/bin/backup-restic"
done
END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)

echo "Finished running backups in $DIFF seconds"
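
Assuming these snippets run in a single script with set -e in effect, the loop above stops at the first host whose backup fails. One possible variation, sketched here for the case where you'd rather attempt every host and report failures at the end, is shown below. The function names run_host_backup and run_all_backups are hypothetical, and the ssh invocation is simply copied from the loop above.

```shell
#!/bin/bash
# Sketch: back up hosts sequentially, remember which ones failed, and return
# a non-zero status at the end instead of aborting mid-loop.
# run_host_backup and run_all_backups are hypothetical helper names.

run_host_backup() {
    # Assumed to mirror the ssh call used by the coordinator script.
    ssh -p 50000 -o SendEnv=RESTIC_PASSWORD "restic@$1" "/usr/local/bin/backup-restic"
}

run_all_backups() {
    failed=""
    for target_host in "$@"; do
        echo "Running backup on $target_host"
        # Testing the command in an if keeps set -e from exiting on failure.
        if ! run_host_backup "$target_host"; then
            echo "Backup failed on $target_host" >&2
            failed="$failed $target_host"
        fi
    done

    if [ -n "$failed" ]; then
        echo "Hosts with failed backups:$failed" >&2
        return 1
    fi
    return 0
}

# Example (not executed here; host names are placeholders):
#   run_all_backups host-a host-b host-c
```

Returning non-zero at the end still marks the systemd unit as failed, so any alerting based on the unit's result keeps working.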

The actual script being run on each host is also relatively simple and just passes through our paths and list of excludes as arguments:

#!/bin/bash

export RESTIC_PROGRESS_FPS=0.01666

restic -q --json -r /mnt/jak/restic backup \
    <% it.include.forEach(function (path) { %>
    <%= path %> \
    <% }) %>
    <% it.exclude.forEach(function (path) { %>
    --exclude "<%= path %>" \
    <% }) %>

And finally we need to delete old snapshots and prune the restic repository. I'm not a huge fan of the restic forget syntax as I think it requires wayyyyy too much thinking for something that could be (and is in other implementations) simpler. Luckily this falls into the "set it and forget it" territory so I'm not bothered by it.

echo "Deleting old snapshots"
restic -r /mnt/storage/restic forget --keep-daily=30 --keep-weekly=5 --keep-monthly=12 --keep-yearly=10

echo "Running prune"
restic -r /mnt/storage/restic prune
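
Since the forget flags take some thought, one way to sanity-check a retention policy is to preview it first. The sketch below runs the same policy with restic's --dry-run flag, which lists what would be kept and removed without deleting anything. Each --keep-* flag keeps the most recent snapshot for each of the last N days, weeks, months or years that have a snapshot, and a snapshot is retained if any rule selects it. The helper name preview_forget is hypothetical.

```shell
#!/bin/bash
# Sketch: preview a retention policy without removing any snapshots.
# preview_forget is a hypothetical helper name; RESTIC_PASSWORD is assumed
# to be exported already, as in the coordinator script.

preview_forget() {
    repo="$1"
    restic -r "$repo" forget --dry-run \
        --keep-daily=30 --keep-weekly=5 --keep-monthly=12 --keep-yearly=10
}

# Example (not executed here):
#   preview_forget /mnt/storage/restic
```

restic forget also accepts --prune, which runs the prune step right after forgetting, if a single command is preferred over two.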

Restic Binary & Permissions

To install the restic binary across my machines I utilize my apt repository to mirror the restic binary from the GitHub releases page.

Additionally, as I run restic under an unprivileged user account, we need to configure the binary so it can read the entire filesystem. Historically this would be done with something like suid, but on modern Linux we can use the capabilities system. At the time of writing I've done this with a basic setcap on the binary, but based on this thread I'll probably move to the systemd approach soon (it's quite a bit more secure overall, since it removes an entire class of attack surface against the binary itself).

The basic setup for the above is just:

const pkg = new Package("restic");

new FileCapabilities("/usr/bin/restic", [
  [Capability.CAP_DAC_READ_SEARCH, "ep"],
]).dependsOn(pkg);
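
For readers not using a configuration management tool, the same capability can be granted by hand. This is a minimal sketch, assuming the binary lives at /usr/bin/restic as above and that a restic group exists. The ownership and mode restriction is an assumption rather than part of the setup described here; it is included because a file capability applies to anyone who can execute the binary. The helper name grant_restic_read is hypothetical.

```shell
#!/bin/bash
# Sketch: grant restic read access to all files via a file capability.
# Must be run as root. grant_restic_read is a hypothetical helper name.

grant_restic_read() {
    binary="$1"
    group="$2"

    # Assumption: limit execution to root and the backup group, since any user
    # who can run the binary gains its ability to read every file.
    chown "root:$group" "$binary" || return 1
    chmod 750 "$binary" || return 1

    # CAP_DAC_READ_SEARCH bypasses read and directory-search permission checks.
    # Set it after chown, because changing ownership can clear file capabilities.
    setcap cap_dac_read_search=+ep "$binary" || return 1

    # Print the resulting capabilities so the change can be verified.
    getcap "$binary"
}

# Example (not executed here):
#   grant_restic_read /usr/bin/restic restic
```

File capabilities are stored on the file itself, so they need to be reapplied whenever the binary is replaced, for example by a package upgrade.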

Storage & Offsite

The final pieces of the puzzle describe where the backup data actually resides. Data is written via NFS (for offsite machines, over Tailscale) to my primary ZFS storage array. This gives all the data a RAID10 level of redundancy should any drives fail. Finally, data is synced into a cloud-based long-term object storage solution, which gives us the final, offsite tier of redundancy.

Monitoring

Finally, backups are useless without monitoring and verification. For monitoring I use my custom monitoring solution to graph and alert on the basic telemetry. In the following image you can see the impact a restic upgrade had on the backup duration, as upgrading seemingly causes a lot of recomputation between versions.

image showing a dashboard for monitoring backups, displaying a graph of historical backup durations, the time since the last backup was completed, and a detailed log of the previous backup

The alert part of the dashboard code looks like so:

const BACKUP_WARNING_AGE = 1000 * 60 * 60 * 24;
const BACKUP_ALERT_AGE = 1000 * 60 * 60 * 28;

export const backupAgeAlert = alert("last backup age", async function* () {
  const result = await query<{ when: number }>(`
    SELECT toUnixTimestamp(when) as when FROM yamon.logs
    WHERE service='systemd' AND data='Finished run backups.'
    ORDER BY when DESC
    LIMIT 1; 
  `);

  const sinceLastBackup =
    new Date().getTime() - new Date(result.data[0].when * 1000).getTime();

  yield check(
    "jak",
    [
      [Status.OK, sinceLastBackup < BACKUP_WARNING_AGE],
      [Status.WARNING, sinceLastBackup < BACKUP_ALERT_AGE],
      [Status.ALERT, true],
    ],
    `Last backup was ${humanizeDuration(sinceLastBackup)} ago`
  );
});

The system works and I've been alerted to issues before (in this case, a host reboot causing the encryption key to be unloaded):

image showing two alert messages in discord indicating an issue with backups, and that the issue was resolved

Finally, I currently run a manual verification process monthly. This mostly just verifies that the entire process is working and gives me confidence that I can pull files out of a backup when I need to. It'd be a fun extension in the future to play around with an automated approach to backup verification, but for now that's been overkill.
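
As a rough idea of what an automated check could look like, the sketch below combines restic check with a restore of a single known file. It is not the manual process described above, just one possible starting point. The helper name verify_backup and its arguments are hypothetical, and the percentage form of --read-data-subset assumes a reasonably recent restic release.

```shell
#!/bin/bash
# Sketch: automated spot check of a restic repository.
# verify_backup and its arguments are hypothetical; RESTIC_PASSWORD is assumed
# to be exported already, as in the coordinator script.

verify_backup() {
    repo="$1"      # repository path
    host="$2"      # host whose latest snapshot should be tested
    sample="$3"    # absolute path of a file known to be in the backup

    target="$(mktemp -d)" || return 1

    # Check the repository structure and re-read a random 5% of the pack data.
    if ! restic -r "$repo" check --read-data-subset=5%; then
        rm -rf "$target"
        return 1
    fi

    # Restore only the sample file from the host's most recent snapshot.
    if ! restic -r "$repo" restore latest --host "$host" \
            --target "$target" --include "$sample"; then
        rm -rf "$target"
        return 1
    fi

    # restic recreates the original path underneath the target directory.
    if [ -s "$target$sample" ]; then
        echo "Restored $sample from latest snapshot of $host"
        status=0
    else
        echo "Restore of $sample produced no data" >&2
        status=1
    fi
    rm -rf "$target"
    return "$status"
}

# Example (not executed here; host and file are placeholders):
#   verify_backup /mnt/storage/restic some-host /etc/hostname
```

A non-empty restored file only shows that the pipeline works end to end. Comparing against the live file is less reliable, since the file may have changed since the snapshot was taken.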