Adding another box to my home network, I went through my usual fiddling with rsync to sync some of its data with other hosts on the network. To make sure I was correctly understanding the man pages, I did a little googling and ran across Bryan Pendleton's post that linked to the rsync algorithm description. I had always wondered how rsync worked under the covers, and Bryan said the algorithm paper was an easy read, so I read it. Wow, what a nice piece of work!
Before describing the algorithm, let's look at the magic. The unix rsync utility synchronizes the contents of files and directories. It is designed to work efficiently when the things being synchronized are on different computers connected by a network. At the end of the algorithm paper, the authors describe a test they ran using tar archives of different linux kernel source distributions as the "source" and "destination" files. The tars contained source files with lots of changes - about 2MB of combined diffs. Computing and applying the changes needed to make the destination the same as the source using rsync moved less data than diff generated and took less time. For those not familiar with it, diff is a unix command that determines the differences between two files, producing an output stream displaying old and new versions of lines that have changed. What is impressive here is that determining the differences remotely (i.e., without the two files being available in the same process space), then communicating and applying the changes, took less time (and passed less data) than just determining and presenting the changes locally would have using diff.
So how do they do that? There are two core ideas underneath the algorithm that make these kinds of results possible. The first is to focus on the most common scenario in file version changes: contiguous blocks inserted, deleted or replaced. Consider the following simple example. File A contains "a b c d e f g h i" and B contains "a b z w f g h i". What has happened here is that "c d e" in the middle of A has been replaced by "z w" in B. So to make A like B, we can keep "a b" and "f g h i" as is and just replace the middle bit. The rsync algorithm is optimized to find the unchanged blocks, even if they have "moved," and to focus only on communicating and applying the necessary changes.
The second core idea is that hash functions can be used in place of file data in determining differences. If two blocks have different hashes, they must be different. If they have the same hash, they are probably the same, with the probability depending on the strength of the hash function and the block size. By combining a weak and a strong hash function, with the weak one very fast to compute, the algorithm can rule out most non-matches very quickly, using the strong hash to verify whenever the weak one signals a possible match. Regardless of the size of the blocks being analyzed, hashes are small, fixed-size byte arrays, so communicating them across the network instead of passing actual block data back and forth represents a big savings.
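To make the weak/strong pairing concrete, here is a toy sketch in Python. These are not rsync's actual checksums (the real weak checksum is a rolling checksum and the strong one is a cryptographic digest); the point is just that both hashes have a small, fixed size no matter how big the block is.

    import hashlib

    def weak_hash(block: bytes) -> int:
        # Toy weak checksum in the spirit of rsync's: a plain byte sum plus a
        # position-weighted sum, packed into 32 bits. Cheap, but collision-prone.
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def strong_hash(block: bytes) -> bytes:
        # Stand-in for the strong checksum; MD5 here purely for illustration.
        return hashlib.md5(block).digest()

    block = b"x" * 500                  # a 500-byte block...
    print(len(strong_hash(block)))      # ...has a 16-byte strong hash
    print(weak_hash(block) < 2 ** 32)   # ...and a weak checksum that fits in 4 bytes: True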
Using these ideas, the algorithm is simple. First, the host holding the "receiving file" (the one that needs to be modified; call this host B) divides its file into fixed-size blocks (say, 500 bytes per block), computes both a weak and a strong hash for each block, and sends these hashes to the "sending host." The sending host (call this one A) then moves through its file in jumps as follows. Starting at the first byte in the file, A computes the weak hash of the block that starts at that position. If it can find that hash in the set of weak block hashes that B sent, it then checks the strong hash. If this also matches, A sends a message to B communicating this fact and "jumps" to the start of the next block. If it can't find a match, it moves one byte forward and tries again. Whenever A finds a match, it tells B which of B's blocks matched, along with any literal bytes it had to skip since the previous match. At the end, B has all of the information that it needs to reconstruct A's file.
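Here is a rough Python sketch of both sides, with the toy checksums from above repeated so the snippet stands alone. The names receiver_signatures and sender_delta are mine, not rsync's, and for clarity the sender recomputes the weak checksum from scratch at every position; the real algorithm gets its speed from being able to "roll" the weak checksum forward one byte at a time, but the control flow is the same idea.

    import hashlib

    def weak_hash(block: bytes) -> int:
        # Same toy weak checksum as in the sketch above.
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def strong_hash(block: bytes) -> bytes:
        # Same stand-in strong checksum as above (MD5, for illustration only).
        return hashlib.md5(block).digest()

    def receiver_signatures(data: bytes, block_size: int) -> dict:
        # B's side: hash each fixed-size block and remember where it lives.
        sigs = {}
        for index in range(0, len(data), block_size):
            block = data[index:index + block_size]
            sigs.setdefault(weak_hash(block), []).append((index, strong_hash(block)))
        return sigs

    def sender_delta(data: bytes, sigs: dict, block_size: int) -> list:
        # A's side: slide through the file, emitting ("match", receiver_offset)
        # when one of B's blocks is found and ("literal", bytes) for skipped data.
        delta, literal, pos = [], bytearray(), 0
        while pos < len(data):
            block = data[pos:pos + block_size]
            hit = None
            for offset, strong in sigs.get(weak_hash(block), []):
                if strong == strong_hash(block):   # weak hit, confirm with strong
                    hit = offset
                    break
            if hit is not None:
                if literal:
                    delta.append(("literal", bytes(literal)))
                    literal = bytearray()
                delta.append(("match", hit))
                pos += block_size        # jump to the start of the next block
            else:
                literal.append(data[pos])
                pos += 1                 # no match: slide forward one byte
        if literal:
            delta.append(("literal", bytes(literal)))
        return delta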
To make it clearer what is going on here, let's look at the simple example above and assume a block size of two letters. B starts by sending both strong and weak hashes of "a b", "z w", "f g" and "h i". A then starts moving through its file, looking at the first block "a b", and immediately finds a match (first checking the weak hash and then, when it succeeds, the strong one). When A moves to "c d", that block is not found (only the weak hashes have to be checked, since none of them match), so it moves on to "d e" and again has no joy. Next comes "e f", which again does not match. Then "f g" matches, and A tells B that, along with the fact that it had to skip "c d e", which it sends literally to B for insertion. Finally, A jumps to the final block, "h i", which matches. What B has to do is now clear: start with its first block, insert "c d e" and follow with its last two blocks.
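Here is that walk-through run through the sketch above, treating each letter as a single byte and dropping the spaces; the file contents and block size are just the example's.

    # Uses receiver_signatures and sender_delta from the sketch above.
    a_file = b"abcdefghi"    # A's file (the sender)
    b_file = b"abzwfghi"     # B's file (the receiver)

    sigs = receiver_signatures(b_file, 2)
    delta = sender_delta(a_file, sigs, 2)
    print(delta)
    # [('match', 0), ('literal', b'cde'), ('match', 4), ('match', 6)]

    # B rebuilds A's file from its own blocks plus the literal bytes it was sent.
    rebuilt = b"".join(
        b_file[arg:arg + 2] if kind == "match" else arg
        for kind, arg in delta
    )
    assert rebuilt == a_file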
This dive down into the guts of the algorithm shows us how to be nice to rsync, which is to use it on files whose change sets consist mostly of contiguous blocks inserted, deleted or replaced. If, for example, the source distros used in the test described above had all of their line endings changed from one convention to another, the results would not have been nearly as good. Last night, I inadvertently included directories that held VM images - similarly named images on different hosts. The results were not pretty. Of course, you can't always control what goes into rsync, but knowing a little about how it works can help you understand and avoid some performance problems.
Two more comments before we end this walk. First, while the "be nice to rsync" comments above are definitely relevant when rsync is patching individual files, most uses of rsync spend most of their time transferring new files between hosts rather than patching changed files. The only thing you can do to make that easier is not to use rsync as a replacement for cp / scp for bulk file transfer, or to expect it to perform as well as they do there.
My last comment is that, while the probability is very small, rsync can fail to pick up changes due to hash collisions. The algorithm does include a final whole-file hash check, so the probability that a changed block would hash to the same value as a block on the receiver and that the whole-file hash would also match is ridiculously small; but it is not zero. It is an interesting problem to estimate this probability, given file and block sizes.
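For a back-of-the-envelope feel (not a careful analysis), here is the kind of estimate I mean, assuming a 32-bit weak checksum, 128-bit strong and whole-file hashes, made-up file and block sizes, and treating every hash as uniform and independent, which is not quite right:

    # Crude union-bound estimate only; all the assumptions above apply.
    file_size = 100 * 2 ** 20    # 100 MB file (hypothetical)
    block_size = 700             # bytes per block (hypothetical)

    pair_collision = 2.0 ** -32 * 2.0 ** -128     # weak AND strong collide for one block pair
    candidate_pairs = file_size * (file_size / block_size)
    p_false_block_match = candidate_pairs * pair_collision
    p_undetected = p_false_block_match * 2.0 ** -128   # whole-file check must also collide

    print(f"{p_false_block_match:.1e}")   # roughly 1e-35 with these numbers
    print(f"{p_undetected:.1e}")          # roughly 3e-74 - not zero, but close enough for me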