Manual DB sync-up in HADR mode

The Zoom High-Availability and Disaster Recovery (HADR) module automatically keeps all configured Zoom server peers in sync, so that they all serve the same data repository. However, in some situations, such as when setting up a new peer or after the file system holding a peer’s DB location becomes permanently unavailable, the DB must be synced manually. This guide describes several ways to do this, listed in increasing order of complexity.

If the server locale differs across peers, only Option 1 can be used for syncing.

Instructions

  1. First, identify a suitable peer whose DB directories will be manually synced to the DB locations of all other peers.
    • Monitor the HADR Dashboard Status in any running Zoom server Webmin.
    • The peer with the highest Last Delivered value is the one to perform a DB sync from.
    • If possible, wait until its Last Delivered value equals the highest Last Proposed value among all the peers. This ensures that no data from intervening operations is lost after the sync.
    • Ensure that the selected peer is in writer mode.
      • The dlg-data.lck file in its DB redo directory must contain the name of this peer.
  2. Sync or copy the DB directories in one of the ways given below:
    • Option 1: If all peers can be stopped, no HADR history is required, and there are no client connections during checkpointing
      1. Restart all peers and wait for checkpoint to complete.
      2. Delete the following two dirs inside the DB location of this peer:
        1. db/<PEER_NAME>
        2. db/redo/<PEER_NAME>
      3. Replace the entire db dir of every other peer with this peer’s db dir. Either a plain copy or rsync (with --delete-after) can be used for this.
    • Option 2: If all peers can be stopped
      1. Preferably, restart all peers and wait for checkpoint to complete.
      2. Stop all peers.
      3. Rsync the HADR redo and DBs in the following order:
        1. db/redo/<PEER_NAME> –> db/redo/<OTHER-PEER_NAME>
        2. db/redo/filedata
        3. db/<PEER_NAME> –> db/<OTHER-PEER_NAME>
      4. Rsync the db dir from this peer to all other peers with the --delete-after option, excluding the HADR db dirs listed above and db/redo/dlg-data.lck.
    • Option 3: If not all peers can be stopped
      1. Ensure that the selected peer has finished its checkpoint. Then edit its dlg-data.lck file so that it contains some other text (not the server name); if the peer then crashes and restarts, it will come up in reader mode.
      2. Stop all the other peers.
      3. Rsync the HADR redo and DBs in the following order:
        1. db/redo/<PEER_NAME> –> db/redo/<OTHER-PEER_NAME>
        2. db/redo/filedata
        3. db/<PEER_NAME> –> db/<OTHER-PEER_NAME>
      4. Rsync the db dir from this peer to all other peers with the following args and in the given order:
        1. --exclude=db/<PEER_NAME>
        2. --exclude=db/redo/<PEER_NAME>
        3. --exclude=db/redo/dlg-data.lck
        4. --delete-after
        5. Rsync order:
          1. db/redo/dlg-*.data
          2. db/redo/filedata
          3. db/
  3. Restart all the peers.
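
The peer selection in step 1 and the rsync arguments of Option 3, step 4 can be sketched in Python as below. This is only an illustration under stated assumptions, not part of the HADR tooling: the PeerStatus structure, the peer names, and the target path are hypothetical, the dashboard values are assumed to be copied in by hand, and <PEER_NAME> in the exclude arguments is assumed to mean the selected peer. The -a and -R rsync flags are additions (not given in the steps above) so that the commands can be run from the directory that contains db/ and the exclude paths match as written. The commands are only printed, never executed.

```python
from dataclasses import dataclass


@dataclass
class PeerStatus:
    # Values copied by hand from the HADR Dashboard (hypothetical structure).
    name: str
    last_delivered: int
    last_proposed: int


def pick_source_peer(peers):
    # Step 1: the peer with the highest Last Delivered value is the sync source.
    return max(peers, key=lambda p: p.last_delivered)


def is_caught_up(source, peers):
    # True once Last Delivered equals the highest Last Proposed among all peers.
    return source.last_delivered == max(p.last_proposed for p in peers)


def lock_names_peer(lock_text, peer_name):
    # Writer-mode check: dlg-data.lck must contain the selected peer's name.
    return peer_name in lock_text


def option3_step4_commands(peer_name, target):
    # Builds (does not run) the Option 3, step 4 commands in the listed order.
    # Assumptions: run from the directory that contains db/; -a and -R are
    # added so that attributes and relative paths are preserved; `target` is
    # an rsync destination for that same parent directory on the other peer.
    flags = (
        f"--exclude=db/{peer_name} "
        f"--exclude=db/redo/{peer_name} "
        "--exclude=db/redo/dlg-data.lck "
        "--delete-after"
    )
    sources = ["db/redo/dlg-*.data", "db/redo/filedata", "db/"]
    return [f"rsync -a -R {flags} {src} {target}" for src in sources]


# Hypothetical dashboard readings for two peers.
peers = [
    PeerStatus("peer-a", last_delivered=120, last_proposed=120),
    PeerStatus("peer-b", last_delivered=118, last_proposed=120),
]
source = pick_source_peer(peers)
print(source.name, is_caught_up(source, peers))
for command in option3_step4_commands(source.name, "peer-b:/hypothetical/parent/"):
    print(command)
```

Printing the commands instead of running them leaves room to review each one against the actual DB location before anything is copied or deleted.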
