ZooKeeper Log Files: Transaction and Snapshot Logging System

This chapter analyzes ZooKeeper's logging system, focusing on transaction logs and snapshot logs rather than standard log.info output.

FileTxnSnapLog Analysis

Let's examine the FileTxnSnapLog class, which is responsible for storing transaction and snapshot logs. First, we'll analyze its constructor:

public FileTxnSnapLog(File dataDir, File snapDir) throws IOException {
  LOG.debug("Opening datadir:{} snapDir:{}", dataDir, snapDir);

  this.dataDir = new File(dataDir, version + VERSION);
  this.snapDir = new File(snapDir, version + VERSION);

  boolean enableAutocreate = Boolean.valueOf(
    System.getProperty(ZOOKEEPER_DATADIR_AUTOCREATE,
                       ZOOKEEPER_DATADIR_AUTOCREATE_DEFAULT));

  trustEmptySnapshot = Boolean.getBoolean(ZOOKEEPER_SNAPSHOT_TRUST_EMPTY);
  LOG.info(ZOOKEEPER_SNAPSHOT_TRUST_EMPTY + " : " + trustEmptySnapshot);

  if (!this.dataDir.exists()) {
    if (!enableAutocreate) {
      throw new DatadirException("Missing data directory "
                                 + this.dataDir
                                 + ", automatic data directory creation is disabled ("
                                 + ZOOKEEPER_DATADIR_AUTOCREATE
                                 + " is false). Please create this directory manually.");
    }

    if (!this.dataDir.mkdirs() && !this.dataDir.exists()) {
      throw new DatadirException("Unable to create data directory "
                                 + this.dataDir);
    }
  }
  if (!this.dataDir.canWrite()) {
    throw new DatadirException("Cannot write to data directory " + this.dataDir);
  }

  if (!this.snapDir.exists()) {
    if (!enableAutocreate) {
      throw new DatadirException("Missing snap directory "
                                 + this.snapDir
                                 + ", automatic data directory creation is disabled ("
                                 + ZOOKEEPER_DATADIR_AUTOCREATE
                                 + " is false). Please create this directory manually.");
    }

    if (!this.snapDir.mkdirs() && !this.snapDir.exists()) {
      throw new DatadirException("Unable to create snap directory "
                                 + this.snapDir);
    }
  }
  if (!this.snapDir.canWrite()) {
    throw new DatadirException("Cannot write to snap directory " + this.snapDir);
  }

  if (!this.dataDir.getPath().equals(this.snapDir.getPath())) {
    checkLogDir();
    checkSnapDir();
  }

  txnLog = new FileTxnLog(this.dataDir);
  snapLog = new FileSnap(this.snapDir);
}

The constructor takes two key parameters:

dataDir: The directory for data files
snapDir: The directory for snapshot files

Note that these parameters represent top-level directories. ZooKeeper stores actual files in the version-2 subdirectory. The constructor performs several validation steps:

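Reads whether automatic directory creation is enabled
Creates each missing directory, or fails if creation is disabled or unsuccessful
Validates write permissions on both directories
Runs checkLogDir and checkSnapDir when the two directories differ
Creates the txnLog and snapLog member variables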
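
To make the existence and permission checks concrete, here is a minimal sketch of the same pattern applied to a single directory. The class DirCheck and the method ensureWritableDir are hypothetical names used only for illustration; they are not part of ZooKeeper, and a plain IOException stands in for DatadirException.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not ZooKeeper code: mirrors the constructor's checks for one directory.
final class DirCheck {
    static File ensureWritableDir(File dir, boolean autoCreate) throws IOException {
        if (!dir.exists()) {
            if (!autoCreate) {
                throw new IOException("Missing directory " + dir
                        + ", automatic creation is disabled");
            }
            // mkdirs() also returns false when another process created the
            // directory first, so existence is re-checked before failing.
            if (!dir.mkdirs() && !dir.exists()) {
                throw new IOException("Unable to create directory " + dir);
            }
        }
        if (!dir.canWrite()) {
            throw new IOException("Cannot write to directory " + dir);
        }
        return dir;
    }
}
```

The second existence check after mkdirs() is the detail worth noticing: it keeps the method from failing when the directory appears between the two calls.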

After understanding the constructor, let's analyze the restore method, which recovers the database by reading snapshots and transaction logs:

public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
  // Load the most recent valid snapshot; returns its zxid, or -1 if none exists
  long deserializeResult = snapLog.deserialize(dt, sessions);
  // Create transaction log
  FileTxnLog txnLog = new FileTxnLog(dataDir);

  RestoreFinalizer finalizer = () -> {
    long highestZxid = fastForwardFromEdits(dt, sessions, listener);
    return highestZxid;
  };

  // No snapshot could be loaded
  if (-1L == deserializeResult) {
    // ...but the transaction log contains entries
    if (txnLog.getLastLoggedZxid() != -1) {
      // Check if we trust empty snapshots
      if (!trustEmptySnapshot) {
        // Throw exception if we don't trust empty snapshots
        throw new IOException(EMPTY_SNAPSHOT_WARNING + "Something is broken!");
      } else {
        LOG.warn("{}This should only be allowed during upgrading.",
                 EMPTY_SNAPSHOT_WARNING);
        // Execute data recovery operation
        return finalizer.run();
      }
    }
    // Snapshot and log are both empty: save an initial snapshot
    save(dt, (ConcurrentHashMap<Long, Integer>) sessions);
    return 0;
  }

  // Execute data recovery operation
  return finalizer.run();
}

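In this method, we first deserialize the most recent valid snapshot into the DataTree; the return value is that snapshot's zxid, or -1 if no snapshot was found. We then open the transaction log and define a RestoreFinalizer that replays the log through fastForwardFromEdits. If a snapshot was loaded, we run the finalizer and return the highest zxid it reaches. If no snapshot was found but the transaction log is not empty, the state is suspicious: we throw an exception unless trustEmptySnapshot is set, in which case we log a warning and replay the log anyway. If both the snapshot and the transaction log are empty, we save an initial snapshot and return 0.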
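
The branching in restore can be summarized as a small decision function. This is only a sketch of the control flow in the code above; RestoreDecision, Action, and decide are hypothetical names, and the actual snapshot loading and log replay are left out.

```java
// Hypothetical sketch of the branches in restore(); it decides, it does not restore.
final class RestoreDecision {
    enum Action { REPLAY_LOGS, SAVE_EMPTY_SNAPSHOT, FAIL }

    static Action decide(long snapshotZxid, long lastLoggedZxid, boolean trustEmptySnapshot) {
        if (snapshotZxid != -1L) {
            // A snapshot was loaded: replay the transaction log on top of it.
            return Action.REPLAY_LOGS;
        }
        if (lastLoggedZxid == -1L) {
            // Neither snapshot nor log entries: treat as a fresh server.
            return Action.SAVE_EMPTY_SNAPSHOT;
        }
        // Log entries without any snapshot: only acceptable if explicitly trusted.
        return trustEmptySnapshot ? Action.REPLAY_LOGS : Action.FAIL;
    }
}
```

For example, decide(-1L, 5L, false) yields FAIL, which corresponds to the IOException thrown in restore.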

Next, let's analyze the fastForwardFromEdits method, which reads transaction logs and applies them to the database:

public long fastForwardFromEdits(DataTree dt, Map<Long, Integer> sessions,
                                 PlayBackListener listener) throws IOException {
  // Read transaction log
  TxnIterator itr = txnLog.read(dt.lastProcessedZxid + 1);
  // Get maximum zxid from data tree
  long highestZxid = dt.lastProcessedZxid;
  // Transaction header information
  TxnHeader hdr;
  try {
    while (true) {
      // Get transaction header
      hdr = itr.getHeader();
      // If header is null, return the maximum zxid from data tree
      if (hdr == null) {
        //empty logs
        return dt.lastProcessedZxid;
      }
      // If transaction header has zxid less than highest zxid and highest zxid is not 0
      if (hdr.getZxid() < highestZxid && highestZxid != 0) {
        // Log error
        LOG.error("{}(highestZxid) > {}(next log) for type {}",
                  highestZxid, hdr.getZxid(), hdr.getType());
      } else {
        // Update highest zxid
        highestZxid = hdr.getZxid();
      }
      try {
        // Process transaction
        processTransaction(hdr, dt, sessions, itr.getTxn());
      } catch (KeeperException.NoNodeException e) {
        throw new IOException("Failed to process transaction type: " +
                              hdr.getType() + " error: " + e.getMessage(), e);
      }
      // Process transaction completion callback
      listener.onTxnLoaded(hdr, itr.getTxn());
      if (!itr.next())
        break;
    }
  } finally {
    if (itr != null) {
      itr.close();
    }
  }
  return highestZxid;
}

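In this method, we read transaction logs starting from the last processed zxid + 1. If the log holds no entries, we return the DataTree's last processed zxid unchanged. Otherwise, we iterate through the entries: if an entry's zxid is lower than the highest zxid seen so far, we log an error; otherwise, we update the highest zxid. In both cases the transaction is still applied through processTransaction and reported to the listener. A NoNodeException raised during processing is rethrown as an IOException, and the iterator is always closed in the finally block.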
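
The zxid bookkeeping of this loop can be isolated in a few lines. The sketch below works on a plain array of zxids instead of a TxnIterator; ReplaySketch and replay are hypothetical names, and applying the transaction is reduced to a comment.

```java
// Hypothetical sketch: tracks the highest zxid the way fastForwardFromEdits does.
final class ReplaySketch {
    static long replay(long lastProcessedZxid, long[] loggedZxids) {
        long highest = lastProcessedZxid;
        for (long zxid : loggedZxids) {
            if (zxid < highest && highest != 0) {
                // Out-of-order entry: reported, but the highest zxid is kept.
                System.err.printf("%d (highestZxid) > %d (next log)%n", highest, zxid);
            } else {
                highest = zxid;
            }
            // The real method applies the transaction here in both cases.
        }
        return highest;
    }
}
```

With an empty array the method returns lastProcessedZxid, matching the early return for empty logs in the original.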

Next, let's analyze the processTransaction method, which applies a transaction to the database:

public void processTransaction(TxnHeader hdr, DataTree dt,
                               Map<Long, Integer> sessions, Record txn)
  throws KeeperException.NoNodeException {
  // Transaction processing result storage object
  ProcessTxnResult rc;
  // Handle different transaction header types
  switch (hdr.getType()) {
      // Create session
    case OpCode.createSession:
      // Add session information to session container
      sessions.put(hdr.getClientId(),
                   ((CreateSessionTxn) txn).getTimeOut());
      if (LOG.isTraceEnabled()) {
        ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                                 "playLog --- create session in log: 0x"
                                 + Long.toHexString(hdr.getClientId())
                                 + " with timeout: "
                                 + ((CreateSessionTxn) txn).getTimeOut());
      }
      // Process transaction
      rc = dt.processTxn(hdr, txn);
      break;
      // Close session
    case OpCode.closeSession:
      // Remove session from session container
      sessions.remove(hdr.getClientId());
      if (LOG.isTraceEnabled()) {
        ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                                 "playLog --- close session in log: 0x"
                                 + Long.toHexString(hdr.getClientId()));
      }
      // Process transaction
      rc = dt.processTxn(hdr, txn);
      break;
      // Default processing
    default:
      // Process transaction
      rc = dt.processTxn(hdr, txn);
  }

  if (rc.err != Code.OK.intValue()) {
    LOG.debug(
      "Ignoring processTxn failure hdr: {}, error: {}, path: {}",
      hdr.getType(), rc.err, rc.path);
  }
}

In this method, we handle different types of transactions. For createSession transactions, we add the session and its timeout to the sessions map. For closeSession transactions, we remove the session from the sessions map. In every case, including these two, the transaction is then applied through the DataTree's processTxn method. A result code other than OK is only logged at debug level, so a failed transaction does not abort the replay.

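Finally, let's analyze the addCommittedProposal method, which adds a committed proposal to the in-memory committed log. Note that in the ZooKeeper source this method belongs to ZKDatabase rather than FileTxnSnapLog; it is covered here because it works with the same zxid bookkeeping: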

public void addCommittedProposal(Request request) {
  // Write lock
  WriteLock wl = logLock.writeLock();
  try {
    // Lock
    wl.lock();
    // If committed log size exceeds default maximum commit count
    if (committedLog.size() > commitLogCount) {
      // Remove first element
      committedLog.removeFirst();
      // Set minimum committed log zxid to first transaction log
      minCommittedLog = committedLog.getFirst().packet.getZxid();
    }
    // If committed log is empty
    if (committedLog.isEmpty()) {
      // Set both minimum and maximum committed log zxid to request zxid
      minCommittedLog = request.zxid;
      maxCommittedLog = request.zxid;
    }

    // Serialize the request into a byte array
    byte[] data = SerializeUtils.serializeRequest(request);
    // Construct data packet
    QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid, data, null);
    // Create proposal
    Proposal p = new Proposal();
    p.packet = pp;
    p.request = request;
    // Add proposal to committed log
    committedLog.add(p);
    // Update maximum committed log zxid to proposal zxid
    maxCommittedLog = p.packet.getZxid();
  } finally {
    // Unlock
    wl.unlock();
  }
}

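In this method, we add a committed proposal to the committed log while holding the write lock. If the log already holds more than commitLogCount entries, we remove the oldest proposal and set the minimum committed log zxid to the new first entry. If the log is empty, both the minimum and the maximum zxid are initialized to the request's zxid. We then wrap the serialized request in a PROPOSAL packet, append the proposal to the log, and update the maximum committed log zxid.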
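
The bounded-window behavior is easier to see with the proposals reduced to bare zxids. The following sketch is an illustration only: CommittedLogSketch is a hypothetical class, it uses synchronized instead of the read/write lock, and it stores Long values instead of Proposal objects.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the committed-log window, keeping only zxids.
final class CommittedLogSketch {
    private final ArrayDeque<Long> log = new ArrayDeque<>();
    private final int commitLogCount;
    long minCommittedLog;
    long maxCommittedLog;

    CommittedLogSketch(int commitLogCount) {
        if (commitLogCount < 1) {
            throw new IllegalArgumentException("commitLogCount must be at least 1");
        }
        this.commitLogCount = commitLogCount;
    }

    synchronized void add(long zxid) {
        // Same comparison as the original: trim only once the size exceeds the limit.
        if (log.size() > commitLogCount) {
            log.removeFirst();
            minCommittedLog = log.getFirst();
        }
        if (log.isEmpty()) {
            minCommittedLog = zxid;
            maxCommittedLog = zxid;
        }
        log.addLast(zxid);
        maxCommittedLog = zxid;
    }

    synchronized int size() {
        return log.size();
    }
}
```

Because the check uses > rather than >=, the window settles at commitLogCount + 1 entries: with a limit of 2, adding zxids 1 through 5 leaves entries 3, 4, and 5.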

TxnLog Analysis

The TxnLog interface defines the transaction log functionality. It has several methods for appending, reading, and truncating transaction logs.

append Method Analysis

The append method appends a transaction to the transaction log:

public synchronized boolean append(TxnHeader hdr, Record txn)
  throws IOException {

  if (hdr == null) {
    return false;
  }

  if (hdr.getZxid() <= lastZxidSeen) {
    LOG.warn("Current zxid " + hdr.getZxid()
             + " is <= " + lastZxidSeen + " for "
             + hdr.getType());
  } else {
    lastZxidSeen = hdr.getZxid();
  }
  // Log stream is empty
  if (logStream == null) {
    if (LOG.isInfoEnabled()) {
      LOG.info("Creating new log file: " + Util.makeLogName(hdr.getZxid()));
    }

    // Create log file
    logFileWrite = new File(logDir, Util.makeLogName(hdr.getZxid()));
    // Create file output stream
    fos = new FileOutputStream(logFileWrite);
    // Initialize log stream
    logStream = new BufferedOutputStream(fos);
    // Initialize output archive
    oa = BinaryOutputArchive.getArchive(logStream);
    // Construct file header
    FileHeader fhdr = new FileHeader(TXNLOG_MAGIC, VERSION, dbId);
    // Serialize file header
    fhdr.serialize(oa, "fileheader");
    // Write to file
    logStream.flush();
    // Calculate current channel size and set to file padding
    filePadding.setCurrentSize(fos.getChannel().position());
    // Add file output stream to collection
    streamsToFlush.add(fos);
  }
  // Pad file
  filePadding.padFile(fos.getChannel());
  // Serialize transaction header and transaction
  byte[] buf = Util.marshallTxnEntry(hdr, txn);
  // Check if serialization result is null or empty
  if (buf == null || buf.length == 0) {
    throw new IOException("Faulty serialization for header " +
                          "and txn");
  }
  // Get checksum algorithm
  Checksum crc = makeChecksumAlgorithm();
  // Update checksum
  crc.update(buf, 0, buf.length);
  // Write checksum to output archive
  oa.writeLong(crc.getValue(), "txnEntryCRC");
  // Write transaction to output archive
  Util.writeTxnBytes(oa, buf);

  return true;
}

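In this method, we first check if the transaction header is null. If it is, we return false. We then check if the transaction zxid is less than or equal to the last seen zxid. If it is, we log a warning but still append the entry. Otherwise, we update the last seen zxid.

We then check if the log stream is null. If it is, we create a new log file named after the current zxid and initialize the log stream and output archive. We serialize the file header (magic number, version, and dbId), flush it, record the current file position for padding, and add the file output stream to streamsToFlush.

We then pad the file through filePadding so that it has room for the entry. We serialize the transaction header and the transaction into a single byte array and reject a null or empty result. Finally, we compute a checksum over that byte array and write the checksum to the log first, followed by the transaction bytes.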
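
The ordering of checksum and payload is the part that is easiest to get wrong, so here is a small round-trip sketch. It is not ZooKeeper's on-disk format: TxnFrameSketch is a hypothetical class, the framing (checksum, length, payload) is simplified, and Adler32 is assumed as the checksum algorithm because the snapshot code in this chapter uses it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Hypothetical sketch: one log entry framed as checksum, length, payload.
final class TxnFrameSketch {
    static byte[] frame(byte[] payload) throws IOException {
        Checksum crc = new Adler32();
        crc.update(payload, 0, payload.length);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(crc.getValue()); // checksum goes first, as in append()
            out.writeInt(payload.length);
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    static byte[] unframe(byte[] framed) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            long expected = in.readLong();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            Checksum crc = new Adler32();
            crc.update(payload, 0, payload.length);
            if (crc.getValue() != expected) {
                throw new IOException("CRC check failed");
            }
            return payload;
        }
    }
}
```

Writing the checksum ahead of the payload lets a reader verify each entry as soon as it has been read, without looking ahead in the file.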

read Method Analysis

The read method reads a transaction log from a given zxid:

public TxnIterator read(long zxid, boolean fastForward) throws IOException {
    return new FileTxnIterator(logDir, zxid, fastForward);
}

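In this method, we create a new FileTxnIterator over the log directory, starting at the given zxid, and return it. The fastForward flag is passed straight through to the iterator. The single-argument read(zxid) calls that appear elsewhere in this chapter use a separate overload.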

getLastLoggedZxid Method Analysis

The getLastLoggedZxid method returns the last logged zxid:

public long getLastLoggedZxid() {
    // Get all log files from log directory
    File[] files = getLogFiles(logDir.listFiles(), 0);
    // Determine maximum log number
    // If log file count > 0, get zxid from name, otherwise return -1
    long maxLog = files.length > 0 ?
            Util.getZxidFromName(files[files.length - 1].getName(), LOG_FILE_PREFIX) : -1;

    // Set zxid to maximum log number initially
    long zxid = maxLog;
    TxnIterator itr = null;
    try {
        // Convert log directory to transaction log
        FileTxnLog txn = new FileTxnLog(logDir);
        // Read TxnIterator from transaction log
        itr = txn.read(maxLog);
        // Process data in TxnIterator until no more entries
        while (true) {
            if (!itr.next()) {
                break;
            }
            // Get header information and update zxid
            TxnHeader hdr = itr.getHeader();
            zxid = hdr.getZxid();
        }
    } catch (IOException e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        close(itr);
    }
    return zxid;
}

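In this method, we first get all log files from the log directory. If there are any, we take the zxid encoded in the name of the last file in the returned array; otherwise, we start from -1. This value is the initial result.

We then create a FileTxnLog object and read the transaction log starting from that zxid. We iterate through the remaining entries and update the result with each header's zxid, so the value returned is the zxid of the last entry actually found. An IOException during the scan is logged as a warning rather than propagated, and the iterator is closed in the finally block.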
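
The method relies on log file names carrying a zxid. As a rough illustration, and assuming the name is the prefix followed by a dot and the zxid in hexadecimal (which is what Util.makeLogName and Util.getZxidFromName suggest, although their bodies are not shown in this chapter), the mapping could be sketched like this. LogNameSketch and both method bodies are hypothetical.

```java
// Hypothetical sketch of the file-name <-> zxid mapping; the hex format is an assumption.
final class LogNameSketch {
    static String makeLogName(long zxid) {
        return "log." + Long.toHexString(zxid);
    }

    // Returns the zxid encoded in the name, or -1 if the name does not match the prefix.
    static long zxidFromName(String name, String prefix) {
        String[] parts = name.split("\\.");
        if (parts.length == 2 && parts[0].equals(prefix)) {
            try {
                return Long.parseLong(parts[1], 16);
            } catch (NumberFormatException e) {
                return -1L;
            }
        }
        return -1L;
    }
}
```

Under this assumption, the name gives only the zxid of the first entry in a file, which is why getLastLoggedZxid still has to scan the file to reach the last entry.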

truncate Method Analysis

The truncate method truncates the transaction log to a given zxid:

public boolean truncate(long zxid) throws IOException {
  FileTxnIterator itr = null;
  try {
    // Create FileTxnIterator object
    itr = new FileTxnIterator(this.logDir, zxid);
    // Get input stream from FileTxnIterator
    PositionInputStream input = itr.inputStream;
    // If input stream is null, throw exception
    if (input == null) {
      throw new IOException("No log files found to truncate! This could " +
                            "happen if you still have snapshots from an old setup or " +
                            "log files were deleted accidentally or dataLogDir was changed in zoo.cfg.");
    }
    // Get position from input stream
    long pos = input.getPosition();
    // Truncate log file
    RandomAccessFile raf = new RandomAccessFile(itr.logFile, "rw");
    raf.setLength(pos);
    raf.close();
    // If there are subsequent log files, delete them
    while (itr.goToNextLog()) {
      // Delete log file
      if (!itr.logFile.delete()) {
        LOG.warn("Unable to truncate {}", itr.logFile);
      }
    }
  } finally {
    close(itr);
  }
  return true;
}

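In this method, we first create a FileTxnIterator positioned at the given zxid and get its input stream. If there is no input stream, no log file was found and we throw an exception. Otherwise, we take the current position of the input stream and truncate the current log file to that length.

We then delete every subsequent log file, logging a warning for any file that cannot be deleted.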

commit Method Analysis

The commit method commits the transaction log:

public synchronized void commit() throws IOException {
  // Log stream is not null, flush it
  if (logStream != null) {
    logStream.flush();
  }
  // Loop through file output streams and flush them
  for (FileOutputStream log : streamsToFlush) {
    // Flush output stream
    log.flush();
    // If force sync is enabled, force channel to write out data
    if (forceSync) {
      // Get current nanosecond time
      long startSyncNS = System.nanoTime();
      // Get channel
      FileChannel channel = log.getChannel();
      // Force channel to write out data
      channel.force(false);
      // Calculate time difference
      syncElapsedMS = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startSyncNS);
      // If time difference exceeds maximum allowed sync time, log warning
      if (syncElapsedMS > fsyncWarningThresholdMS) {
        // If server stats is not null, increment fsync threshold exceed count
        if (serverStats != null) {
          serverStats.incrementFsyncThresholdExceedCount();
        }
        LOG.warn("fsync-ing the write ahead log in "
                 + Thread.currentThread().getName()
                 + " took " + syncElapsedMS
                 + "ms which will adversely effect operation latency. "
                 + "File size is " + channel.size() + " bytes. "
                 + "See the ZooKeeper troubleshooting guide");
      }
    }
  }
  // Close and remove every stream except the most recent one
  while (streamsToFlush.size() > 1) {
    streamsToFlush.removeFirst().close();
  }
}

In this method, we first flush the log stream if it is not null. We then loop through the file output streams and flush each one. If force sync is enabled, we force the channel to write its data to disk and measure how long that took. If the elapsed time exceeds fsyncWarningThresholdMS, we increment the fsync threshold counter (when serverStats is available) and log a warning.

Finally, while more than one file output stream remains in streamsToFlush, we remove and close the oldest one, so only the most recent stream stays open.
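
The flush-then-force sequence with timing can be tried in isolation. The sketch below writes to an ordinary file and measures the force call the same way commit does; FsyncSketch, writeAndSync, and the threshold parameter are hypothetical names for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: append data, force it to disk, and report how long the sync took.
final class FsyncSketch {
    static long writeAndSync(File file, byte[] data, long warnThresholdMs) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(file, true)) {
            fos.write(data);
            fos.flush();
            long startNs = System.nanoTime();
            FileChannel channel = fos.getChannel();
            // false: force file content only, metadata updates are not required
            channel.force(false);
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
            if (elapsedMs > warnThresholdMs) {
                System.err.println("fsync took " + elapsedMs + " ms, file size is "
                        + channel.size() + " bytes");
            }
            return elapsedMs;
        }
    }
}
```

Measuring only the force call, as the original does, keeps the warning focused on disk latency rather than on buffering or serialization time.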

SnapShot Analysis

The SnapShot interface defines the snapshot functionality. It has several methods for serializing and deserializing snapshots.

deserialize Method Analysis

The deserialize method deserializes a snapshot from a file:

public long deserialize(DataTree dt, Map<Long, Integer> sessions)
  throws IOException {
  // Find up to 100 valid snapshots
  List<File> snapList = findNValidSnapshots(100);
  // If no valid snapshots are found, return -1
  if (snapList.size() == 0) {
    return -1L;
  }
  File snap = null;
  // Flag to indicate if a valid snapshot is found
  boolean foundValid = false;
  // Loop through snapshots
  for (int i = 0, snapListSize = snapList.size(); i < snapListSize; i++) {
    // Get current snapshot
    snap = snapList.get(i);
    LOG.info("Reading snapshot " + snap);
    // Open snapshot input stream and checksum stream
    try (InputStream snapIS = new BufferedInputStream(new FileInputStream(snap));
         CheckedInputStream crcIn = new CheckedInputStream(snapIS, new Adler32())) {
      // Convert checksum stream to input archive
      InputArchive ia = BinaryInputArchive.getArchive(crcIn);
      // Deserialize data tree from input archive
      deserialize(dt, sessions, ia);
      // Get checksum from checksum stream
      long checkSum = crcIn.getChecksum().getValue();
      // Get value from input archive
      long val = ia.readLong("val");
      // If value and checksum do not match, throw exception
      if (val != checkSum) {
        throw new IOException("CRC corruption in snapshot :  " + snap);
      }
      // Set flag to indicate a valid snapshot is found
      foundValid = true;
      // Break out of loop
      break;
    } catch (IOException e) {
      LOG.warn("problem reading snap file " + snap, e);
    }
  }
  // If no valid snapshot is found, throw exception
  if (!foundValid) {
    throw new IOException("Not able to find valid snapshots in " + snapDir);
  }
  // Calculate zxid from snapshot file name and set it to data tree's last processed zxid
  dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
  // Return zxid
  return dt.lastProcessedZxid;
}

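In this method, we first find up to 100 valid snapshots and return -1 if there are none. We then try the snapshots in list order: each one is deserialized into the DataTree, and the checksum computed while reading is compared with the value stored in the file. The loop stops at the first snapshot that passes this check; a snapshot that fails is logged and skipped. If no snapshot passes, we throw an exception. Otherwise, we set the last processed zxid to the zxid encoded in the snapshot's file name and return it.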

serialize Method Analysis

The serialize method serializes a snapshot to a file:

public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
  throws IOException {
  // Check if snapshot is closed
  if (!close) {
    // Open output stream and checksum output stream
    try (OutputStream sessOS = new BufferedOutputStream(new FileOutputStream(snapShot));
         CheckedOutputStream crcOut = new CheckedOutputStream(sessOS, new Adler32())) {
      // Get output archive
      OutputArchive oa = BinaryOutputArchive.getArchive(crcOut);
      // Create file header
      FileHeader header = new FileHeader(SNAP_MAGIC, VERSION, dbId);
      // Serialize data
      serialize(dt, sessions, oa, header);
      // Get checksum
      long val = crcOut.getChecksum().getValue();
      // Write to archive
      oa.writeLong(val, "val");
      oa.writeString("/", "path");
      sessOS.flush();
    }
  } else {
    throw new IOException("FileSnap has already been closed");
  }
}

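In this method, we first check whether the FileSnap has been closed, and throw an exception if it has. Otherwise, we open the output stream and wrap it in a checksum output stream. We then create a file header and serialize the data. We take the checksum of everything written so far and write it to the output archive, followed by the string "/". This trailing string is what isValidSnapshot later looks for at the end of the file. Finally, we flush the output stream.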
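
The pairing between serialize and deserialize comes down to a checksum trailer written through a checked stream. The sketch below reproduces that idea with plain java.io streams instead of ZooKeeper's archives; SnapshotChecksumSketch is a hypothetical class and the body format (a length followed by raw bytes) is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical sketch: a body followed by the Adler32 checksum of that body.
final class SnapshotChecksumSketch {
    static byte[] write(byte[] body) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CheckedOutputStream crcOut = new CheckedOutputStream(sink, new Adler32());
             DataOutputStream out = new DataOutputStream(crcOut)) {
            out.writeInt(body.length);
            out.write(body);
            // Checksum of everything written so far, captured before the trailer.
            long val = crcOut.getChecksum().getValue();
            out.writeLong(val);
        }
        return sink.toByteArray();
    }

    static byte[] read(byte[] file) throws IOException {
        try (CheckedInputStream crcIn =
                 new CheckedInputStream(new ByteArrayInputStream(file), new Adler32());
             DataInputStream in = new DataInputStream(crcIn)) {
            byte[] body = new byte[in.readInt()];
            in.readFully(body);
            // Capture the running checksum before reading the stored value,
            // because reading the trailer also passes through crcIn.
            long computed = crcIn.getChecksum().getValue();
            long stored = in.readLong();
            if (stored != computed) {
                throw new IOException("CRC corruption in snapshot");
            }
            return body;
        }
    }
}
```

The order of operations on the read side mirrors the original deserialize: the checksum is taken first, and only then is the stored value read and compared.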

findMostRecentSnapshot Method Analysis

The findMostRecentSnapshot method finds the most recent snapshot:

public File findMostRecentSnapshot() throws IOException {
  // Find the single most recent valid snapshot
  List<File> files = findNValidSnapshots(1);
  if (files.size() == 0) {
    return null;
  }
  return files.get(0);
}

In this method, we ask findNValidSnapshots for a single snapshot and return it, or return null if no valid snapshot exists.

isValidSnapshot Method Analysis

The isValidSnapshot method checks if a snapshot is valid:

public static boolean isValidSnapshot(File f) throws IOException {
  // Check if file is null or has invalid zxid
  if (f == null || Util.getZxidFromName(f.getName(), FileSnap.SNAPSHOT_FILE_PREFIX) == -1)
    return false;
  // Convert file object to RandomAccessFile
  try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
    // Return false if file length is less than 10 bytes
    if (raf.length() < 10) {
      return false;
    }
    raf.seek(raf.length() - 5);
    byte bytes[] = new byte[5];
    int readlen = 0;
    int l;
    while (readlen < 5 &&
           (l = raf.read(bytes, readlen, bytes.length - readlen)) >= 0) {
      readlen += l;
    }
    // Return false if read length does not match byte array length
    if (readlen != bytes.length) {
      LOG.info("Invalid snapshot " + f
               + " too short, len = " + readlen);
      return false;
    }
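    // Wrap the five trailing bytes in a ByteBuffer
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    // Read the first four bytes as an int (the length of the trailing string)
    int len = bb.getInt();
    // Read the fifth byte (the string's single character)
    byte b = bb.get();
    // The trailer must be a one-character string equal to "/"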
    if (len != 1 || b != '/') {
      LOG.info("Invalid snapshot " + f + " len = " + len
               + " byte = " + (b & 0xff));
      return false;
    }
  }

  return true;
}

In this method, we check if a snapshot file is valid by verifying several conditions:

The file reference is not null and its name contains a valid zxid
The file size is at least 10 bytes
The last 5 bytes can be read in full
The first four of those bytes, read as an integer, must equal 1, and the fifth byte must be '/'

These checks confirm that the snapshot was written to completion, since serialize writes the "/" string last. They do not verify the checksum; that happens later, in deserialize.
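
The trailer test can be expressed compactly on an in-memory byte array. This is a sketch of the same comparison, not the ZooKeeper method: SnapshotTrailerSketch and hasValidTrailer are hypothetical names, and the file name check is omitted.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: checks that the content ends with a length-1 string "/".
final class SnapshotTrailerSketch {
    static boolean hasValidTrailer(byte[] file) {
        if (file.length < 10) {
            return false;
        }
        // View only the last five bytes: a 4-byte big-endian length and one character.
        ByteBuffer tail = ByteBuffer.wrap(file, file.length - 5, 5);
        int len = tail.getInt();
        byte b = tail.get();
        return len == 1 && b == '/';
    }
}
```

A valid file therefore ends with the bytes 00 00 00 01 2F, which is how a one-character string "/" looks when written as a length followed by its content.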

Summary

This chapter provides a comprehensive analysis of ZooKeeper's logging mechanisms, focusing on two primary types of logs:

Transaction Logs: Implemented through the TxnLog interface, these logs record all modifications to the ZooKeeper data tree in a sequential manner. They are crucial for maintaining consistency and enabling recovery in case of failures.
Snapshot Logs: Implemented through the SnapShot interface, these logs provide point-in-time snapshots of the entire ZooKeeper data tree. They optimize recovery time by providing a baseline state from which to replay transaction logs.

Key components analyzed include:

FileTxnSnapLog: The central class managing both transaction and snapshot logs
TxnLog Interface: Defines the contract for transaction log operations
SnapShot Interface: Specifies the requirements for snapshot management
File Management: Detailed examination of file naming, structure, and validation
Data Integrity: Implementation of checksums and validation mechanisms
Recovery Process: How logs are used to restore system state

This analysis demonstrates ZooKeeper's robust approach to maintaining data consistency and durability in a distributed system through careful log management and validation.

For further reading on distributed systems logging and consistency mechanisms, consider exploring:

ZooKeeper's consensus protocol implementation
Comparison with other distributed logging systems
Performance optimization techniques for log management
Advanced recovery scenarios and failure handling
