notes/old/draft-korn-vcdiff-01.txt - subversion - Git at Google

 [ This paper makes reference to the "Sfio" library.  See
   http://www.research.att.com/sw/tools/sfio/ for more.  -kff]


 Network Services Research Lab                  David Korn  and  Kiem-Phong Vo
                                                                     AT&T Labs
                                                Submission:     March 09, 2000
                                                Expiration: September 09, 2000


         The VCDIFF Generic Differencing and Compression Data Format
                        <draft-korn-vcdiff-01.txt>


 Status of this Memo

     This document is an Internet-Draft and is in full conformance
     with all provisions of Section 10 of RFC2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


 Abstract

     This memo describes a general and efficient data format suitable
     for encoding compressed and/or differencing data so that they can
     be easily transported among computers.

 Table of Contents:

     1.  Executive Summary ............................................  1
     2.  Algorithm Conventions ........................................  3
     3.  Delta Instructions ...........................................  3
     4.  Vcdiff Encoding Format .......................................  4
     5.  Instruction Code Tables ...................................... 13
     6.  Compression and Differencing Algorithms ...................... 18
     7.  Summary ...................................................... 23
         ACKNOWLEDGEMENTS ............................................. 23
         REFERENCES ................................................... 23
         AUTHOR'S ADDRESS ............................................. 24


 1.  Executive Summary

     Compression and differencing techniques can greatly improve storage
     and transmission of files and file versions.  Since files are often
     transported across machines with distinct architectures and performance
     characteristics, such data should be encoded in a form that is portable
     and can be decoded with little or no knowledge of the encoders.
     This document describes Vcdiff, a new compact portable encoding format
     that is independent of encoding algorithms and also efficient to decode.

     Data differencing is the process of computing a compact and invertible

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     encoding of a "target file" given a "source file".  Data compression
     is similar but without the use of source data.  The UNIX utilities diff,
     compress, and gzip are well-known examples of data differencing and
     compression tools.  For data differencing, the computed encoding is
     called a "delta file", and, for data compression, it is called
     a "compressed file".  Delta and compressed files are good for storage
     and transmission because they are often smaller than the originals.

     Data differencing and data compression are traditionally treated
     as distinct types of data processing.  However, compression can be
     thought of as a special case of differencing in which the source
     data is empty. This blending of differencing and compression was first
     introduced in the Vdelta technique [1] by Korn and Vo.  The basic idea
     is to unify the string parsing scheme used in the Lempel-Ziv'77 style
     compressors [2], and the block-move technique of Tichy [3].  Loosely
     speaking, this works as follows:

         a. Concatenate source and target data.
         b. Parse the data from left to right just like LZ'77 but
 	   make sure that a parsed segment starts target data.
         c. Start to output when reaching target data.

     Parsing is based on string matching algorithms such as suffix trees [4]
     or hashing with different time and space performance characteristics.
     Vdelta uses a new string matching algorithm that performs competitive
     to other techniques [5].  However, even with a memory-efficient and fast
     string matching algorithm, the computing resource requirements can be
     prohibitive for processing large files. The standard way to deal with
     this is to partition input files into "windows" to be processed
     separately. Here, except for some unpublished work by Vo, little has
     been done on designing effective windowing schemes.  Current techniques,
     including Vdelta, simply use windows of the same size with corresponding
     addresses across source and target files.

     String matching and windowing algorithms have large influence on the
     compression rate of delta and compressed files. However, it is desirable
     to have a portable encoding format that is independent of such algorithms.
     This enables construction of client-server applications in which a server
     may serve clients with unknown computing characteristics.  Unfortunately,
     all current differencing and compressing tools, including Vdelta, fall
     short in this resspect. Their storage formats are closely intertwined
     with the implemented algorithms.

     The encoding format Vcdiff proposed here addresses the above issues.
     Vcdiff achieves the below characteristics:

 	Output compactness:
             The basic encoding format compactly represents compressed or delta
 	    files. In specific cases, applications can further extend it with
 	    "secondary encoders" (e.g., a Huffman or arithmetic encoder) to
 	    achieve more compression.
 	Data portability:
 	    The basic encoding format is free from byte order and word size
 	    issues for integer representations.
     	Algorithm genericity:
 	    The decoding algorithm for the basic encoding format is independent
 	    from string matching and windowing algorithms.
     	Decoding efficiency:
 	    The decoding algorithm for the basic encoding format runs in time
 	    proportional to the size of the target file and uses space

                                    -2-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 	    proportional to the maximal window size.

     The Vcdiff data format and the algorithms for decoding data shall be
     described next.  Since Vcdiff treats compression as a special case of
     differencing, we shall use the term "delta file" to indicate the
     compressed output for both cases.


 2. Algorithm Conventions

     Algorithms to encode and decode the Vcdiff data format will be presented
     in the ANSI-C language. To ease the presentation, we shall generally omit
     error checking. The below conventions will be observed:

 	a. Tokens with all upper cases letters will be C macros.
 	b. Variables with capitalized names are global.
         c. Code fragments will be presented with line numbers to be used
 	   in the subsequent COMMENTS sections.

     Data will be encoded in byte units.  For portability, control data
     generated by Vcdiff shall be limited to the lower eight bits of a byte
     even on machines with larger bytes. The bits in a byte are named from
     right to left so that the first bit has the lowest value, 1<<0 or 1,
     and the eighth bit has value 1<<7 or 128.

     To facilitate the algorithms, we shall assume a few types and functions
     for I/O on streams.  To be definite, we shall use interfaces similar to
     that provided in the Sfio library. Below are descriptions of some of
     these types and functions. Others can be found in reference [6].

 	uchar_t:
 	    This is the type "unsigned char".

 	int_t:
 	    This is the largest integer type on the platform in use.

 	Sfio_t:
 	    This is the type of a stream.

 	Sfio_t* sfstropen(uchar_t* buf, int n):
 	    This is not an Sfio function but it can be easily implemented
 	    on top of the Sfio primitive sfnew(). sfstropen() creates a stream
 	    from a given buffer with a given size. We shall assume that such
 	    a stream is both readable and writable. As with Sfio, a write stream
 	    will extend the buffer as necessary to accomodate output data.


 3.  Delta Instructions

     A target file is partitioned into non-overlapping sections or windows
     to be processed separately. A target window T of length n may be
     compared against some source data segment S of length m.  Such a source
     data segment may come from some earlier part of the target file or
     it may come from the source file, if there is one. It is assumed that
     there is sufficient memory so that both T and S can be processed
     in main memory.

     For string processing, we treat S and T as substrings of a superstring U
     formed by concatenating T and S like this:


                                    -3-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


         s[0]s[1]...s[m-1]t[0]t[1]...t[n-1]

     The index of a byte in S or T is referred to by its location in U.
     For example, T[i] is referred to as U[m+i].

     The instructions to encode and direct the reconstruction of a target
     window are called delta instructions.  There are three types:

 	ADD:
             This instruction has two arguments, a size s and a sequence of
 	    s bytes to be copied to reconstruct the target window.
 	COPY:
             This instruction has two arguments, a size s and an address p
 	    in the string U. The arguments specify the substring of U that
 	    must be copied. We shall assert that such a substring must be
 	    entirely contained in either S or T.
 	RUN:
             This instruction has two arguments, a size s and a byte c that
 	    will be repeated s times.


     Below are example source and target strings and the delta instructions
     that encode the target string in terms of the source string.

         a b c d e f g h i j k l m n o p
         a b c d w x y z e f g h e f g h e f g h e f g h z z z z

         COPY  4, 0
         ADD   4, w x y z
         COPY  4, 4
         COPY 12, 24
 	RUN   4, z

     Note that the third COPY instruction copies data from T itself since
     address 24 is position 8 in T.  In addition, parts of the data to be
     copied by this instruction will be reconstructed during its execution.
     This allows efficient encoding of periodic sequences, i.e., sequences
     with regularly repeated subsequences. The RUN instruction is a compact
     way to encode a sequence repeating the same byte even though such
     a sequence can be thought of as a periodic sequence with period 1.


 4.  Vcdiff Encoding Format

     A large target file is partitioned into non-overlapping sections called
     "target windows". In practice, window sizes should be chosen so that
     each window can be completely processed in memory during both encoding
     and decoding.  A target window may be compared against some source data
     segment or none. In the compression case, such a source segment would
     be from some earlier part of the target file while, for differencing,
     it may also be from the source file.


 4.1 Delta File Layout

     A Vcdiff delta file is organized into sections as follows:

                 4-byte header

                                    -4-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


                 Section 1
                 Section 2
 		...

     	Header:
 	    The header of a delta file consists of 4 bytes to identify it as
 	    a Vcdiff delta file. The first three bytes are: 0xd6, 0xc3, 0xc4,
 	    i.e., the ASCII letters 'V', 'C' and 'D' or-ed with the eighth bit.
 	    The fourth byte indicates a version number which is currently 0.

     	Sections:
 	    Following the 4-byte header are a number of sections. Each section
 	    encodes either an instruction code table (Section 5) or a window.
 	    Each section starts with an indicator byte. If the 8th bit of this
 	    byte is on, the section encodes an instruction table; otherwise,
 	    it encodes a window. The remaining bits of this byte are used to
 	    encode other information as necessary.


 4.1.1 The Format of a Window

     Each window is organized as below. Note that items bracketed are present
     only when certain corresponding bits in the indicator byte are on.

 		Indicator byte
 		Length of target window
 		Length of instruction dataset
 		Length of raw dataset
 		[Index of code table]
 		[Secondary instruction compressor]
 		[Length of compressed instruction dataset]
 		[Secondary data compressor]
 		[Length of compressed raw dataset]
 		[Source segment size]
 		[Source segment position]
 		Instruction dataset
 		Raw dataset

     	Indicator byte:
 	    The bits of the indicator byte for a window are as follows:

 		0  T  I  D  WT  WS  X  X

 	    0:	This is the 0-bit indicating that this section is a window
 		of data to be decoded.

 	    T:	This bit, if on, indicates that a non-default instruction
 		code table should be used for decoding. In this case, the
 		item [Index of code table] is a single byte indicating the
 		index of this code table.

 	    I:	This bit, if on, indicates that the instruction dataset
 		has been compressed with a secondary compressor. In this
 		case, the item [Secondary instruction compressor] is a single
 		byte indicating the secondary compressor while the item
 		[Length of compressed instruction dataset] has the length
 		of the compressed instruction dataset encoded as a
 		variable-sized integer.

 	    D:	This bit is similar to the I-bit but it is for the raw

                                    -5-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 		dataset.

 	    WT:	This bit, if on, indicates that a source data segment was
 		used to compare the target window with. Further, this source
 		segment is from an earlier part of the target file. In this
 		case, the size and position of the source segment are given
 		in [Source segment size] and [Source segment position].
 		Both of these items are encoded as variable-sized integers.

 	    WS: This bit is similar to WT but, if on, indicates that the
 		source data segment is from the source file.

 	    X:	These bits are currently unused.

     	Length of target window:
 	    This is a variable-sized integer indicating the size of the
 	    target window.

 	Lengths of instruction and raw datasets:
 	    The delta instructions ADD and RUN have associated raw data
 	    (unmatched bytes for ADD and the repeating byte for RUN).
 	    The Vcdiff encoding format maintains two separate datasets,
 	    one for the raw data and one for the instructions. The lengths
 	    of the "instruction dataset" and the "raw dataset" are stored
 	    respectively as two variable-sized integers.

     	Instruction dataset:
 	    This is the set of bytes that define the delta instructions.
 	    If I was 1, the dataset has been encoded by the indicated
 	    secondary encoder.

         Raw dataset:
 	    This is the set of raw data accompanying the ADD and RUN
 	    instructions.  If D was 1, the dataset has been encoded
 	    by the indicated secondary encoder.


 4.1.2 Processing a Delta File

     Below is the basic algorithm to process a delta file:

      1. sfgetc(Delta);
      2. sfgetc(Delta);
      3. sfgetc(Delta);
      4. sfgetc(Delta);

      6. for(;;)
      7. {   if((indi = sfgetc(Delta)) < 0)
      8.          break;
      9.     if((indi & (1<<7)) != 0)
     10.          code_define();
     11.     else
     12.     {    n_tar = sfgetu(Delta);
     13.          tar = malloc(n_tar);
     14.          win_inflate(indi, tar, n_tar, NULL, 0);
     15.          sfwrite(Target, tar, n_tar);
     16.          free(tar);
     17.     }
     18. }

                                    -6-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 COMMENTS.

    1-4: These lines read the 4-byte header.
    7-8: These lines read the indicator byte and terminate if reached end-of-file.
   9-10: These lines define a new code table.
  12-16: These lines decode the target window and output it to the target file.


     Next is the function to recompute a target window:

      1. win_inflate(int indi, uchar_t* tar, int n_tar, uchar_t* src, int n_src)
      2. {   int      n_inst, n_data;
      3.     uchar_t  *inst, *data;
      4.     int      s_inst, s_data;
      5.     int      p_src, i_code, f_inst, f_data;
      6.     Sfio_t   *sf, *instf, *dataf;

      7.     n_inst = sfgetu(Delta);
      8.     n_data = sfgetu(Delta);

      9.     if(indi & (1<<6))
     10.          i_code = sfgetc(Delta);

     11.     if(indi & (1<<5))
     12.     {    f_inst = sfgetc(Delta);
     13.          s_inst = sfgetu(Delta);
     14.     }

     15.     if(indi & (1<<4))
     16.     {    f_data = sfgetc(Delta);
     17.          s_data = sfgetu(Delta);
     18.     }

     19.     if(indi & ((1<<3)|(1<<2)) )
     20.     {    n_src = sfgetu(Delta);
     21.          p_src = sfgetu(Delta);
     22.     }

     23.     inst = malloc(n_inst);
     24.     if(indi & (1<<5))
     25.     {    sfread(Delta, inst, s_inst);
     26.          decompress(inst, n_inst, s_inst, f_inst);
     27.     }
     28.     else sfread(Delta, inst, n_inst);
     29.     instf = sfstropen(inst, n_inst);

     30.     data = malloc(n_data);
     31.     if(indi & (1<<4))
     32.     {    sfread(Delta, data, s_data);
     33.          decompress(data, n_data, s_data, f_data);
     34.     }
     35.     else sfread(Delta, data, n_data);
     36.     dataf = sfstropen(data, n_data);

     37.     if(indi & ((1<<3)|(1<<2)) )
     38.     {    src = malloc(n_src);
     39.          sf = (indi & (1<<3)) ? Target : Source;
     40.          sfseek(sf, p_src, SEEK_SET);
     41.          sfread(sf, src, n_src);

                                    -7-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     42.     }

     43.     win_decode(tar, n_tar, src, n_src, i_code, instf, dataf);
     44.     sfclose(instf); sfclose(dataf); free(inst); free(data);
     45. }

 COMMENTS.

    7-8: These lines read the sizes of the instruction and raw datasets.
   9-10: These lines read the index of the instruction code table if needed.
  11-14: These lines read the secondary instruction compressor indicator and
 	the size of the compressed instruction dataset.
  15-18: These lines read the secondary data compressor indicator and the
 	size of the compressed raw dataset.
  19-22: These lines read the size and position of the source data segment.
 	Note that win_inflate() may also be called to decode an instruction
 	code table (Section 5). In that case, "src" will be given and
 	the WT and WS bits should be zero.
  23-36: These lines read the instruction and raw datasets.  If these were
 	compressed, decompress() is called to reconstruct them. The function
 	decompress() uses the method index f_inst or f_data to decide on
 	the proper operations. Note that these datasets are then made into
 	streams via the sfstropen() calls to ease data accessing later.
  37-42: These lines construct the source data segment if any.
     43: This line calls win_decode() (Section 4.4) to decode the delta
 	instructions.
     44: This line frees resources allocated for processing.


 4.2 Integer Encoding Formats

     Vcdiff compactly encodes integer values using the same formats introduced
     in the Sfio library [6]. The code presented below is not quite correct with
     respect to type handling as in Sfio but they show the ideas.


 4.2.1 Encoding Integers Using a Variable-Sized Format

     The variable-sized encoding of integers follows the same algorithm
     used in the Sfio library. This treats an integer as a number in
     base 128 so that each digit can be coded in one byte. The eighth bit
     of such a byte is the continuation bit. It is 1 if there are more
     digits to come or 0 if it is the last digit.

                         +---------------------+
                         | 11100000 | 00111001 |
                         +---------------------+
 			  byte 0       byte 1

 				Table 1

     Table 1 shows the variable sized encoding of 12345.  The bytes in
     the encoding are presented in binary to make clear the use of
     the continuation bit. Omitting the continuation bit, the encoding
     of 12345 uses two bytes with values 96 and 57.

     Below are the algorithms:

      1. int sfputu(Sfio_t* f, int_t v)
      2. {   uchar_t c[2*sizeof(int_t)], *s;

                                    -8-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


      3.     s = &c[sizeof(c)-1];
      4.     *s = v & 127;
      5.     while((v >>= 7) != 0)
      6.          *--s = (v & 127) | (1<<7);

      7.     sfwrite(f, s, (&c[sizeof(c)]) - s);
      8.     return  &c[sizeof(c)]-s;
      9. }

     10. int_t sfgetu(Sfio_t* f)
     11. {   int_t b, v;

     12.     for(v = 0;; )
     13.     {    b = sfgetc(f);
     14.          v = (v << 7) | (b & 127);
     15.          if(!(b & 128) )
     16.               return v;
     17.     }
     18. }

 COMMENTS.

      1: This line declares the formal arguments to sfputu(), a stream f
 	to store the encoding and the value v to be encoded.
      2: This line declares an array large enough to store the encoding.
    4-6: These lines extract digits in base 128 and store them in
 	the array. Note that the right-most digit is extracted first and
 	does not carry a continuation bit.
    7-8: These lines write the encoding out to the given stream f and
 	return the length of that encoding.
  10-18: These lines define the decoding algorithm.


 4.2.2 Encoding Integers with a Given Range

     The address of a COPY instruction is an unsigned integer with a known
     range, i.e., it cannot be larger than the current position.  Such a
     value can be encoded more compactly than as done in the Sfio scheme.
     For example, if a value is in the range 0-255, it can be encoded with
     a single byte. On the other hand, if the same value happens to be larger
     than 127, it would require two bytes to encode in the variable-sized
     scheme. Below are the algorithms to encode integers with known ranges:


      1. int sfputm(Sfio_t* f, int_t v, int_t max)
      2. {   uchar_t c[sizeof(int_t)], *s;
      3.     s = &c[sizeof(c)-1];
      4.     *s = v & 255;
      5.     while((max >>= 8) != 0)
      6.     {   v >>= 8;
      7.         *--s = v & 255;
      8.     }
      9.     sfwrite(f, s, (&c[sizeof(c)]) - s);
     10.     return (&c[sizeof(c)]) - s;
     11. }

     12. int_t sfgetm(Sfio_t* f, int_t max)
     13. {   int_t v;
     14.     for(v = 0;;)

                                    -9-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     15.     {   v = (v << 8) | sfgetc(f);
     16.         if((max >>= 8) == 0)
     17.             return v;
     18.     }
     19. }


 4.3 Delta Instruction Coding Format

     Delta instructions are encoded as control bytes with associated data.
     Each control byte is an index into an instruction code table of 256
     entries. Each code entry is of the type Vcdcode_t as defined below:

      1. typedef struct _vcdinst_s
      2. {   unsigned char type;  /* instruction type         */
      3.     unsigned char size;  /* >0 if size is coded here */
      4.     unsigned char mode;  /* address mode for COPYs   */
      5. } Vcdinst_t;

      6. typedef struct _vcdcode_s
      7. {   Vcdinst_t     inst1; /* first instruction        */
      8.     Vcdinst_t     inst2; /* second instruction       */
      9. } Vcdcode_t;

     Thus, each control byte identifies up to two instructions.  Below are
     the different instruction types:

      1. #define VCD_NOOP  0      /* not an instruction       */
      2. #define VCD_ADD   1      /* an ADD instruction       */
      3. #define VCD_RUN   2      /* a  RUN instruction       */
      4. #define VCD_COPY  3      /* a COPY instruction       */


     Each instruction has a size n of the data involved.  If the field
     Vcdinst_t.size is non-zero, it is the value for n.  Otherwise,
     n is encoded next in the instruction dataset as a variable-sized
     integer.  If the instruction is a COPY, the copy address will follow
     next in the instruction dataset. Its encoding depends on some
     addressing scheme to be discussed next.

     A COPY address can be encoded in different ways. The field
     Vcdinst_t.mode has values in the range [0-7] and defines the below
     address encoding modes:

      1. #define VCD_SELF  0      /* coded as itself          */
      2. #define VCD_HERE  1      /* from current position    */
      3. #define VCD_SAME  2      /* index into "same" cache  */
      4. #define VCD_NEAR  3      /* index into "near" cache  */

     VCD_SAME and VCD_NEAR indicate two address caching methods designed
     to take advantage of the heuristic that successive copying addresses
     tend to be the same or fairly close to one another.  Note that as
     discussed below, there are 4 VCD_NEAR addresses corresponding to
     Vcdinst_t.mode values in the range [VCD_NEAR,VCD_NEAR+3].

     Let A be the COPY address and H the current location in the target
     data. Below are the encodings:

     VCD_SELF: A is the next integer in the instruction dataset that is
 	encoded in the range [0-H].

                                    -10-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     VCD_HERE: Let B be the next byte in the instruction dataset.
 	Then A is H-B. That is, the distance from A to H, H-A, must have
 	been in the range [0-255].

     VCD_SAME: The "same" cache keeps 256 addresses. Let B be the next byte
 	in the instruction dataset. Then A is same[B].

     VCD_NEAR: The "near" cache keeps 4 addresses.  Let B be the next byte
 	in the instruction dataset and t be Vcdinst_t.mode-VCD_NEAR.
 	Then, A is near[t]+B-127.


     Below are the algorithms to maintain address caches.

      1. typedef struct _cache_s
      2. {   int   n;
      3.     int   near[4];
      4.     int   same[256];
      5. } Cache_t;

      6. cache_init(Cache_t* ka)
      7. {    int i;
      8.      for(i = 0; i < 4; ++i)
      9.          ka->near[i] = 0;
     10.      ka->n = 0;
     11.      for(i = 0; i < 256; ++i)
     12.          ka->same[i] = 0;
     13. }


     14. cache_update(Cache_t* ka, int_t addr)
     15. {    ka->near[ka->n] = addr;
     16.      if((ka->n += 1) >= 4)
     17.          ka->n = 0;
     18.      ka->same[addr&255] = addr;
     19. }

 COMMENTS.

    1-5: These lines define the caches. As discussed, the "near" cache
 	has 4 addresses and the "same" cache has 256 addresses.
   6-13: These lines initialize addresses in the caches to zero.
  14-19: These lines update the caches with a given address.  Note that
 	the "near" cache is updated in a round-robin manner and the lower
 	eight bits of an address is its index in the "same" cache.


     Below is the function to encode a COPY address:

      1. int addr_encode(Cache_t* ka, int addr, int here, int* best)
      2. {   int  i, d, mode = -1;

      3.     if((d = (here-addr)) < 256)
      4.     {    *best = d;
      5.          mode = VCD_HERE;
      6.     }
      7.     else if(ka->same[d = addr&255] == addr)
      8.     {    *best = d;

                                    -11-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


      9.          mode = VCD_SAME;
     10.     }
     11.     else
     12.     {    for(i = 0; i < 4; ++i)
     13.          {    if((d = addr - ka->k_near[i]) >= -127 && d <= 128)
     14.               {    *best = d + 127;
     15.                    mode = VCD_NEAR+i;
     16.               }
     17.          }
     18.     }
     19.     if(mode < 0)
     20.     {    *best = addr;
     21.          mode = VCD_SELF;
     22.     }

     23.     cache_update(ka,addr);
     24.     return mode;
     25. }

 COMMENTS.

      1: This lines declare the formal arguments. "addr" is the address to be
 	encoded. "here" is the current location in the target data.  "best" is
 	used to return the value to be used to encode "addr".
    3-6: If "addr" is within the range [0-255] away from the current location
 	"here", then the addressing mode is VCD_HERE and the address is encoded
 	as a single byte showing this distance.
   7-10: Using the lower eight bits of "addr" to index the "same" cache, if the
 	address in the cache exactly matches "addr", then VCD_SAME is used
 	and "addr" is encoded as this index.

  12-18: Check each address in the "near" cache to see if "addr" is within
 	[-127,128] away from such an address.  If there is one, the addressing
 	mode is VCD_NEAR plus the index of the address and "addr" is encoded
 	as the distance plus 127 (so the encoded value is in the range [0-255]).
  19-22: If none of the above addressing modes applies, then VCD_SELF is used
 	and "addr" is encoded as a value in the range [0-here].
     23: This line updates the address caches.


     Below is the function to decode a COPY address:

      1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* instf)
      2. {   int  addr;

      3.     if(type == VCD_SELF)
      4.          addr = sfgetm(instf, here);
      5.     else if(type == VCD_HERE)
      6.          addr = here - sfgetc(instf);
      7.     else if(type == VCD_SAME)
      8.          addr = ka->same[sfgetc(instf)];
      9.     else addr = ka->near[type] + sfgetc(instf) - 127;

     10.     cache_update(ka, addr);
     11.     return addr;
     12. }


 4.4 Decoding A Target Window


                                    -12-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     The algorithm to decode a target window is as follows:

      1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
 		   int i_code, Sfio_t* instf, Sfio_t* dataf)
      2. {    int_t     here, size, addr, i;
      3.      Cache_t   ka;
      4.      Vcdinst_t *inst;
      5.      Vcdcode_t *code, *table = Code[i_code];

      6.      cache_init(&ka);
      7.      for(here = 0; here < n_tar; )
      8.      {    code = &table[sfgetc(instf)];
      9.           for(i = 1; i <= 2; ++i)
     10.           {    inst = i == 1 ? &code->inst1 : &code->inst2;
     11.                if(inst->type == VCD_NOOP)
     12.                     continue;
     13.                if((size = inst->size) == 0)
     14.                     size = sfgetu(instf);
     15.                if(inst->type == VCD_ADD)
     16.                {    for(; size > 0; --size)
     17.                          tar[here++] = sfgetc(dataf);
     18.                }
     19.                else if(inst->type == VCD_RUN)
     20.                {    int  c = sfgetc(dataf);
     21.                     for(; size > 0; --size)
     22.                          tar[here++] = c;
     23.                }
     24.                else if(inst->type == VCD_COPY)
     25.                {    uchar_t* from;
     26.                     addr = addr_decode(&ka,here,inst->mode,instf);
     27.                     from = addr < nsrc ? src+addr : tar+addr-nsrc;
     28.                     for(; size > 0; --size)
     29.                          tar[here++] = *from++;
     30.                }
     31.           }
     32.      }
     33. }

 COMMENTS.

      5: "table" is initialized to be the instruction code table to be used.
 	We assume that "Code" is a global variable pointing to the list of 256
 	possible code tables.
      6: This line initializes the address caches.
      8: This line reads a control byte from the instruction dataset and gets
 	the corresponding code table to decode instructions.
   9-32: These lines process the given pair of delta instructions. Note that
 	the data for VCD_ADD and VCD_RUN are read from the raw dataset.


 5.  Instruction Code Tables

     The delta instructions are encoded based on some instruction code table.
     The Vcdiff format allows applications to tailor such code tables to the
     particular data characteristics to enhance output compactness. Up to 256
     instruction tables with indices in the range [0,255] are allowed.
     However, the first 8 indices [0,7] are reserved by Vcdiff.


 5.1 The Encoding of a Code Table in a Delta File

                                    -13-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     It is desirable to compactly encode a code table in a delta file.
     To accomplish this, we first represent a table to be output as a string.
     Since each table entry encodes two instructions and each is 3 bytes,
     the total string size is 6*256 or 1536 bytes.  Such a string can then
     be compared against some other code table string (e.g., that of table 0)
     using the window encoding algorithm win_deflate() to reduce the output
     data. Thus, the format for a code table in a delta file looks like this:

 		Code table to compare with
 		Encoded data

     	Code table to compare with:
 	    This is a single byte indicating the code table used to compare
 	    the current table with.

     	Encoded data:
 	    The data uses the same format as that of a window (Section 4.1.1)
 	    where the target data is the code string of the table being encoded
 	    and the source data segment is the code string of the code table
 	    with the above index.


     Two functions tab2str() and str2tab() are used for converting between
     a code table and its code string.  Below is the description of tab2str().
     str2tab() is just as straightforward so its description will be omitted.
     Note that bytes of the same type are grouped together to induce more
     matching when code strings are compared.

      1. tab2str(Vcdcode_t* tab, uchar_t data[6*256])
      2. {    int  i, n;
      3.      n = 0;
      4.      for(i = 0; i < 256; ++i)
      5.           data[n++] = tab[i].inst1.type;
      6.      for(i = 0; i < 256; ++i)
      7.           data[n++] = tab[i].inst2.type;
      8.      for(i = 0; i < 256; ++i)
      9.           data[n++] = tab[i].inst1.size;
     10.      for(i = 0; i < 256; ++i)
     11.           data[n++] = tab[i].inst2.size;
     12.      for(i = 0; i < 256; ++i)
     13.           data[n++] = tab[i].inst1.mode;
     14.      for(i = 0; i < 256; ++i)
     15.           data[n++] = tab[i].inst2.mode;
     16. }


     Below is the function code_define() to retrieve an encoded instruction
     code table from the delta file.

      1. code_define()
      2. {   uchar_t src[6*256], tar[6*256];
      3.     int     index;

      4.     index = sfgetc(Delta);
      5.     tab2str(Code[sfgetc(Delta)], src);
      8.     win_inflate(0, tar, 1536, src, 1536);
      9.     Code[index] = str2tab(data);
     10. }


                                    -14-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 COMMENTS.

    4:	This line reads the index to install the new code table.
    5:	This line constructs the source string to be compared with.
    8-9: These lines construct the new table and install it.


 5.2 Reserved Instruction Code Tables

     A COPY instruction with small data size may use more encoding space than
     the data itself. Thus, it is good to restrict COPY instructions to some
     minimum sizes. In turn, a code table can take advantage of such
     a restriction to enhance compactness. Vcdiff provides two reserved
     instruction code tables with indices 0 and 1 for minimium COPY sizes of
     4 and 3 respectively. As noted, the starting code table has index 0
     so it is optimized for minimum COPY size of 4. Note also that even when
     a table optimized for a particular minimum COPY size, this does not mean
     that a COPY instruction with a smaller size cannot be encoded. It simply
     means that the size of such an instruction must be coded separately.


            TYPE    SIZE      MODE    TYPE    SIZE     MODE    INDEX
         -------------------------------------------------------------
         a.  RUN     0                NOOP                       0
         b.  ADD     0                NOOP                       1
         c.  ADD   [1,14]             NOOP                     [2,15]
         d. COPY     0       [0,6]    NOOP                    [16,22]
         e. COPY   [M,M+10]  [0,6]    NOOP                    [23,99]
         f. COPY     0       [0,6]     ADD   [1,4]           [100,127]
         g. COPY   [M,M+3]   [0,6]     ADD   [1,4]           [128,239]
         h. COPY   [M,M+3]    SELF    COPY   [M,M+3]   SELF  [240,255]
         -------------------------------------------------------------

                             	Table 2


     For a given minimum COPY size M, Table 2 shows how the 256 code values
     are partitioned into different categories of indices:

     	a. Index 0 defines a RUN instruction whose size is always coded
 	   separately as an unsigned integer. This is signified by having
 	   the "size" field being 0.
     	b. Index 1 defines an ADD instruction with size coded separately.
     	c. Indices 2-15 define 14 ADD instructions with sizes in
 	   the range [1,14].
     	d. Indices 16-22 define 7 COPY instructions for the 7 different
 	   address encoding modes discusses in Section 4.3. The sizes of
 	   these instructions are coded separately.
     	e. Indices 23-99 define 77 COPY instructions using all 7 addressing
 	   modes and having sizes in the range [M,M+10].
     	f. Indices 100-127 defines 28 pairs of COPY and ADD instructions.
 	   The COPY sizes are coded separately while the ADD sizes are
 	   in the range [1,4].
     	g. Indices 128-239 define 112 pairs of COPY and ADD instructions
 	   whose sizes are in the ranges [M,M+3] and [1,4] respectively.
     	h. Finally, indices 240-255 defines 16 pairs of COPY and COPY
 	   instructions with sizes in the range [M,M+3] and both using
 	   the same VCD_SELF addressing mode.

     Next we present the algorithms to construct an instruction code table.

                                    -15-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     Where appropriate, the algorithms also construct the inverted tables
     usable for data encoding. Any part of a table not explicitly set will
     be assumed to be initialized to be either zero or the null pointer as
     the case warranted.


      1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7],
                     int copyadd[8][7][5], int copycopy[8][8], int M)
      2. {
      3.     tab[0].inst1.type = VCD_RUN;
      4.     tab[0].inst1.size = 0;
      5.     tab[0].inst2.type = VCD_NOOP;

      6.     mk_add(tab, add);
      7.     mk_copy(tab, copy, M);
      8.     mk_copyadd(tab, copyadd, M);
      9.     mk_copycopy(tab, copycopy, M);
     10. }

 COMMENTS.

      1: This line declares the arguments for mk_codetab().  The inverted
 	tables add[], copy[][], copyadd[][][], and copycopy[][] are designed
 	to be indexed by sizes and address modes. The argument M is the minimum
 	COPY size used for the table.
    3-5: These lines construct the RUN instruction.
    6-9: These lines call various subroutines to construct different parts of
 	the instruction table.


     Below is the algorithm to construct the single ADD instructions:

      1. mk_add(Vcdcode_t tab[256], int add[16])
      2. {   int s, n;

      3.     tab[1].inst1.type = VCD_ADD;
      4.     tab[1].inst1.size = 0;
      5.     tab[1].inst2.type = VCD_NOOP;
      6.     add[0] = 1;

      7.     for(n = 2, s = 1; s <= 14; ++s, ++n)
      8.     {   tab[n].inst1.type = VCD_ADD;
      9.         tab[n].inst1.size = s;
     10.         tab[n].inst2.type = VCD_NOOP;
     11.         add[s] = n;
     12.     }
     13. }

 COMMENTS.

    3-6: These lines construct the ADD instruction whose size is to be coded
 	separately.  Location 0 in the inverted table add[] points to this code.
   7-12: These lines construct the ADD instructions with size in
 	the range [1-14] as discussed earlier.


     Below is the function to construct the single COPY instructions:

      1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M)

                                    -16-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


      2. {   int s, m, n;

      3.     for(n = 16, m = 0; m < 7; ++m, ++n)
      4.     {   tab[n].inst1.type = VCD_COPY;
      5.         tab[n].inst1.size = 0;
      6.         tab[n].inst1.mode = m;
      7.         tab[n].inst2.type = VCD_NOOP;
      8.         copy[0][m] = n;
      9.     }

     10.     for(s = M; s <= M+10; ++s)
     11.     {   for(m = 0; m < 7; ++m, ++n)
     12.         {   tab[n].inst1.type = VCD_COPY;
     13.             tab[n].inst1.size = s;
     14.             tab[n].inst1.mode = m;
     15.             tab[n].inst2.type = VCD_NOOP;
     16.             copy[s][m] = n;
     17.         }
     18.     }
     20. }

 COMMENTS.

    3-9: These lines construct the 7 COPY instructions with whose data sizes
 	are coded separately.
  10-18: These lines construct the 77 COPY instructions whose data sizes
 	are in the range [M,M+10].


     The below function constructs the COPY/ADD instruction pairs:

      1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M)
      2. {   int s, m, a, n;
      3.     for(n = 100, m = 0; m < 7; ++m)
      4.     {   for(a = 1; a <= 4; ++a, ++n)
      5.         {   tab[n].inst1.type = VCD_COPY;
      6.             tab[n].inst1.size = 0;
      7.             tab[n].inst1.mode = m;
      8.             tab[n].inst2.type = VCD_ADD;
      9.             tab[n].inst2.size = a;
     10.             copyadd[0][m][a] = n;
     11.         }
     12.     }
     13.     for(s = M; s <= M+3; ++s)
     14.     {   for(m = 0; m < 7; ++m)
     15.         {   for(a = 1; a <= 4; ++a, ++n)
     16.             {   tab[n].inst1.type = VCD_COPY;
     17.                 tab[n].inst1.size = s;
     18.                 tab[n].inst1.mode = m;
     19.                 tab[n].inst2.type = VCD_ADD;
     20.                 tab[n].inst2.size = a;
     21.                 copyadd[s][m][a] = n;
     22.             }
     23.         }
     24.     }
     25. }


     Finally, the algorithm to construct the 16 COPY/COPY pairs:


                                    -17-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


      1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M)
      2. {   int s, c, n;
      3.     for(n = 240, s = M; s <= M+3; ++s)
      4.     {   for(c = M; c <= M+3; ++c, ++n)
      5.         {   tab[n].inst1.type = VCD_COPY;
      6.             tab[n].inst1.size = s;
      7.             tab[n].inst1.mode = VCD_SELF;
      8.             tab[n].inst2.type = VCD_COPY;
      9.             tab[n].inst2.size = c;
     10.             tab[n].inst2.mode = VCD_SELF;
     11.             copycopy[s][c] = n;
     12.         }
     13.     }
     14. }


 6.  Compression and Differencing Algorithms

     Sections 4 and 5 describe how to reconstruct a target file from data in
     a delta file. In this section, we discuss briefly general architectures
     for compressors and differencers. A few abstract functions are assumed:

     int win_target(uchar_t** tar);

         This function may be called many times to get sequential windows of
 	data from the target file to be compressed. It returns the length of
 	the window while the window itself is returned in "*tar".

     int win_match(uchar_t* tar, int n_tar, int p_tar,
 		  uchar_t** src, int* n_src, int_t* p_src);

 	This function computes a source data segment, if any, to be compared
 	with a target window. It returns 0, 1 or 2 for no source segment,
 	source segment from the source file, or source segment from
 	the target file. The arguments "src", "n_src" and "p_src" are used
 	to return the source data segment, its size, and its position in
 	either the source file or target file.

     int str_match(uchar_t* tar, int n_tar, int p_tar,
 		  uchar_t* src, int n_src, int* match);

 	This function computes a string in either the source data segment or
 	the target window that matches with a prefix of the target data being
 	processed. Such a matched string can be encoded as a COPY instruction.
 	The argument "p_tar" indicates the current position in the target
 	window to be processed. The function returns the length of the match
 	and, if that is positive, "*match" is set to the match position.

     int run_check(uchar_t* tar, int n_tar, int p_tar);

 	This function checks to see if there is a run of the same bytes
 	starting at the current location in the target string.


     The performance of an encoder depends on what algorithms are used to
     implement the above functions. However, as shown earlier, decoding
     performance is completely independent of such choices.

     Below is the algorithm for encoding a target file.  We shall assume

                                    -18-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     that there is no secondary encoding for the resulting instruction and
     raw datasets.

      1. sfputc(Delta,0xd6);
      2. sfputc(Delta,0xc3);
      3. sfputc(Delta,0xc4);
      4. sfputc(Delta,0);

      5. for(p_tar = 0; ; p_tar += n_tar)
      6. {   if((n_tar = win_target(&tar)) <= 0)
      7.         break;
      9.     win_deflate(tar, n_tar);
     10. }

 COMMENTS.

    1-4: These lines output the 4-byte header for a delta file.
      6: This line calls win_target() to get a target window. If the target file
 	has been completely processed, the return value is non-positive and
 	the loop terminates.
      8: This line computes a source data segment to be compared with the given
 	target window.
      9: This line calls win_deflate() to compute the delta instructions and
 	output the data to the delta file.


     Below is the function win_deflate():

      1. win_deflate(uchar_t* tar, int n_tar)
      2. {
      3.     int     type, indi, n_src;
      4.     int_t   p_src;
      5.     uchar_t *src;
      6.     Sfio_t  *instf, *dataf;

      7.     type = win_match(tar, n_tar, &src, &n_src, &p_src);
      8.     instf = sfstropen(NULL,0);
      9.     dataf = sfstropen(NULL,0);
     10.     win_encode(tar, n_tar, src, n_src, instf, dataf);

     11.     indi = type == 0 ? (1<<2) : type == 1 ? (1<<3) : 0;
     12.     sfputc(Delta, indi);

     13.     sfputu(Delta, n_tar);
     14.     sfputu(Delta, sfsize(instf));
     15.     sfputu(Delta, sfsize(dataf));

     16.     sfmove(instf,Delta,-1,-1);
     17.     sfmove(dataf,Delta,-1,-1);

     18.     sfclose(instf); sfclose(dataf);
     18. }


 COMMENTS.

      7: This line calls win_match() to obtain a source data segment.
    8-9: These lines create two streams to output instructions and raw data.
     10: This line calls win_encode() to compute the delta instructions.
  11-12: These lines output the indicator byte.

                                    -19-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


  13-15: These lines output the sizes of the target window and instruction
 	and raw datasets.
  16-17: These lines use the Sfio function sfmove() to output the instruction
 	and raw datasets to the delta file.


     Below is the function to compute encoding of a target window:

      1. int win_encode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
                        Sfio_t* instf, Sfio_t* dataf)
      2. {   int      add, here, addr, s, size;
      3.     Cache_t  ka;

      4.     cache_init(&ka);

      5.     for(here = 0, size = 0, add = -1; here < n_tar; )
      6.     {   if( (s = run_check(tar, n_tar, here)) > 0)
      7.              addr = -1;
      8.         else s = str_match(tar,n_tar,here,src,n_src,&addr);
      9.         if(s > 0)
     10.         {    if(add >= 0)
     11.                   size += add_encode(here-add,tar+here-add,instf,dataf);
     12.              if(addr < 0)
     13.                   size += run_encode(s,tar[here],instf,dataf);
     14.              else size += copy_encode(&ka,s,addr,here,instf);

     15.              add = -1;
     16.              here += s;
     17.         }
     18.         else
     19.         {    if(add < 0)
     20.                   add = here;
     21.              here += 1;
     22.         }
     23.     }
     24.     if(add >= 0)
     25.          size += add_encode(here-add,tar+here-add,instf,dataf);
     26.     else size += copy_encode(0,0,0,0);

     27.     return size;
     28. }

 COMMENTS.

      4: This line initializes the address caches.
    6-8: These lines check to see if there is a RUN or a COPY instruction.
 	The function run_check() is straightforward to implement so its
 	description will be omitted.
   9-17: These lines output the appropriate delta instruction.
  19-22: These lines are executed if there is no RUN or COPY instruction.
 	The "add" variable is set to collect the data for an ADD instruction.
  24-25: These lines output the last unmatched part of the target window
 	in an ADD instruction.
     26: This line outputs the last saved COPY instruction, if any. See the
 	function copy_encode() below.
     27: This line returns the number of bytes used for encoding the data.


     To finish up the encoding algorithm, we need to describe the functions
     add_encode(), run_encode() and copy_encode(). We shall assume that

                                    -20-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     the default code instruction table 0 is used. The below code declares
     the tables and define a few convenience macro functions.

      1. Vcdcode_t  Tab[256];
      2. Vcdcode_t* Add[16];
      3. Vcdcode_t* Copy[15][7];
      4. Vcdcode_t* Copyadd[8][7][5];
      5. Vcdcode_t* Copycopy[8][8];

      6. #define    INDEXADD(a)     ((a >= 1 && a <= 14) ? a : 0)
      7. #define    INDEXCOPY(c)    ((c >= 4 && c <= 14) ? c : 0)
      8. #define    INDEXCOPYADD(c) ((c >= 4 && c <=  7) ? c : 0)
      9. #define    ISTINYADD(a)    ((a >= 1 && a <= 4) ? 1 : 0)
     10. #define    ISTINYCOPY(c)   ((c >= 4 && c <= 7) ? 1 : 0)

 COMMENTS.

    6-8: These lines define functions to compute the index into the tables
         Add, Copy, and Copyadd for any given ADD or COPY size.
   9-10: These lines define functions to test if a given size is small.
 	These are used to determine if certain pair of instructions could
 	be coded using a single byte code.


     Below is the function to encode a COPY instruction.

      1. int Size, Addr, Here, Mode;

      2. int copy_encode(Cache_t* ka,int size,int addr,int here,Sfio_t* instf)
      3. {   int code, mode = 0, best = 0, ndel = 0;

      4.     if(size > 0)
      5.         mode = addr_encode(ka, addr, here, &best);

      6.     if(Size > 0)
      7.     {   if(Mode == mode && mode == VCD_SELF &&
      8.            ISTINYCOPY(Size) && ISTINYCOPY(size) )
      9.         {    code = Copycopy[Size][size];
     10.              ndel += sfputc(instf, code);
     11.              ndel += sfputm(instf, Addr, Here);
     12.              ndel += sfputm(instf, addr, here);
     13.              Size = -1;
     14.              return ndel;
     15.         }
     16.         code = Copy[INDEXCOPY(Size)][Mode];
     17.         ndel += sfputc(instf, code);
     18.         if(Tab[code].inst1.size == 0)
     19.              ndel += sfputu(instf, Size);
     20.         if(Mode == VDC_SELF)
     21.              ndel += sfputm(instf, Addr, Here);
     22.         else ndel += sfputc(instf, Addr);
     23.     }

     24.     Size = size;
     25.     Addr = best;
     26.     Mode = mode;
     27.     Here = here;

     28.     return ndel;
     29. }

                                    -21-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 COMMENTS.

      1: This line defines variables to save a COPY instruction so that it may
         be coded jointly with another instruction later.
    4-5: These lines compute the mode and value to encode the given COPY
         address if there is one (i.e., "size" is positive).
   6-23: These lines process a previously saved COPY instruction.
   7-15: These lines output a code that encode both COPY instructions.
 	Note that this is done only when the data sizes are small and both
 	instructions are using the VCD_SELF addressing mode (Table 2).
  16-22: These lines output the previously saved COPY instruction singly.
  24-27: These lines save the current COPY instruction if it has not
         been merged to a previous COPY instruction as discussed.


     Below is the function to output an ADD instruction:

      1. int add_encode(int size, uchar_t* data, Sfio_t* instf, Sfio_t* dataf)
      2. {   int code, ndel = 0;

      3.     if(Size > 0)
      4.     {   if(ISTINYADD(size))
      5.         {    code = Copyadd[INDEXCOPYADD(Size)][Mode][size];
      6.              ndel += sfputc(instf, code);
      7.              if(Tab[code].inst1.size == 0)
      8.                   ndel += sfputu(instf, Size);
      9.              if(Mode == VCD_SELF)
     10.                   ndel += sfputm(instf, Addr, Here);
     11.              else ndel += sfputc(instf, Addr);
     12.              for(; size > 0; --size, ++data)
     13.                   ndel += sfputc(dataf, *data);
     14.              return ndel;
     15.         }
     16.         else ndel += copy_encode(0,0,0,0);
     17.     }

     18.     code = Add[INDEXADD(size)];
     19.     ndel += sfputc(instf, code);
     20.     if(Tab[code].inst1.size == 0)
     21.         ndel += sfputu(instf, size);
     22.     ndel += size;
     23.     for(; size > 0; --size, ++data)
     24.         sfputc(dataf, *data);

     25.     return ndel;
     26. }


 COMMENTS.

   3-17: These lines consider the case when there is a previously saved COPY
 	instruction. Then, if the ADD size is small, a single code is output
 	for both COPY and ADD instructions.  Otherwise, the COPY instruction
 	is output by itself.
  18-24: These lines output an ADD instruction by itself.


     Below is the function to output a RUN instruction:


                                    -22-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


      1. int run_encode(int size, int byte, Sfio_t* instf, Sfio_t* dataf)
      2. {   int ndel = 0;
      3.     if(Size > 0)
      4.         ndel += copy_encode(0,0,0,0);
      5.     ndel += sfputc(instf,0);
      6.     ndel += sfputu(instf,size);
      7.     ndel += sfputc(dataf,byte);
      8.     return ndel;
      9. }

 COMMENTS.

    3-4: These lines output a previously saved COPY instruction, if any.
    5-7: These lines output the RUN instruction.


 7.  Summary

     We have described a general and portable encoding format for compression
     and differencing. The format is compact. For compression only, using
     the same LZ-77 string parsing strategy and without any secondary encoders,
     the typical compression rate is better than Unix compress and close to gzip.
     For differencing, the same data format is better than all known methods.
     Ignoring application-specific secondary encoder issues, the decoding
     algorithms run in linear time and require working space proportional
     to window sizes.


 ACKNOWLEDGEMENTS

     Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
     who provided much encouragement to publicize Vcdiff.


 REFERENCES

     [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
         Practical Reusable Unix Software, Editor B. Krishnamurthy,
         John Wiley & Sons, Inc., 1995.

     [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
         Compression, IEEE Transactions on Information Theory,
         23(3):337-343, May 1977.

     [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
         ACM Transactions on Computer Systems, 2(4):309-321, November 1984.

     [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
         Algorithm, Journal of the ACM, 23:262-272, 1976.

     [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
         Algorithms, IEEE Software Configuration and Maintenance Workshop,
         1996.

     [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
         Accepted for publication in Software Practice & Experience, 1999.


 AUTHOR'S ADDRESS

                                    -23-

 RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     Kiem-Phong Vo (main contact)
     AT&T Labs, Room D223
     180 Park Avenue
     Florham Park, NJ 07932

     Phone: 973-360-8630
     Email: kpv@research.att.com


     David G. Korn
     AT&T Labs, Room D237
     180 Park Avenue
     Florham Park, NJ 07932

     Phone: 973-360-8602
     Email: dgk@research.att.com


 EXPIRATION DATE

     September 09, 2000


                                    -24-