    Oh Most Noble and Fragrant Emacs, please be in -*- outline -*- mode!

              Newline Conversion and Keyword Substitution
              ===========================================

 * Newline conversion

 We've finally settled on a proposal articulated by Greg Hudson,
 derived from Ben's original proposal plus much list discussion.

 Here's Greg's mail with the proposal, then a few clarifying selections
 from the followup discussion:

    Alright, I'll make a proposal which is like yours but (in my
    opinion) a little clearer.  First, let's look at the different use
    cases:

    1. The most common case--text files which want native line endings.
       These should be stored in the repository using LF line endings,
       and in the working dir using native line endings.

    2. Binary files.  These files we don't want to touch at all.

    3. Text files which, for one reason or another, want a specific
       line ending format regardless of platform.  These should be
       stored in the repository and in the working directory using the
       specified line ending.  We probably don't have to worry so much
       about data safety for these files since a particular, odd
       behavior has been specified for them.

    There are, of course, a hundred different ways we could arrange the
    metadata.  I propose an "svn:newline-style" property with the
    possible values "none", "native", "LF", "CR", and "CRLF".  The
    values mean:

    none: Use case 2.  Don't do any newline translation.

    native: Use case 1.  Store with LF in repository, and with native
    line endings in the working copy.

    LF, CR, CRLF: Use case 3.  Store with specified format in the
    repository and in the working copy.

    On commit, we apply the following rules to transform the data
    committed to the server:

    If newline-style is none, do nothing.

    If newline-style is native, translate <native newline style> -> LF.
    If we notice any CRs or LFs which aren't part of a native-style
    newline, abort the commit.

    If newline-style is LF, CR, or CRLF, translate <native newline
    style> -> <requested newline style>.  If we notice any CRs or LFs
    which aren't part of a native-style newline and aren't part of a
    requested-style newline, abort the commit.  If the commit succeeds,
    apply the <native newline style> -> <requested newline style>
    translation to the working copy as well, so that it matches what we
    would get from a checkout of the new rev.

    On checkout, we translate LF -> <native newline style> if
    newline-style is native; otherwise, we leave the file alone.

    For now, let's say the default value of svn:newline-style is none.
    In the future, we'll want to think about things like how to enable
    newline-translation over the whole repository except for files
    which don't appear to be text.

    I think that's a complete proposal.  Some possible variations:

    Variation 1: If newline-style is native, on commit, translate
    <first newline style seen> -> LF.  If we see any CRs or LFs which
    don't match the first newline style seen, abort the commit.

    Variation 2: If newline-style is native, before commit, examine the
    file to see if it uses only the native newline style.  If it
    doesn't, set the newline-style property to "none" and commit with
    no translation.

    Variation 3: Combine variations 1 and 2; if newline-style is
    native, then before commit, examine the file to see if it uses a
    single consistent newline style.  If it does, translate <that
    newline style> -> LF; if not, commit with newline-style set to
    "none" and no translation.

    Variation 4: If newline-style is native, then on commit, we edit a
    property "svn:newline-conversion" to something like "CRLF LF" to
    show what conversion we did.  This enables mechanical reversal of
    the translation if the file is later determined to be binary.
    (Particularly useful with variations 1 or 3 where the transform
    might not be obvious from the platform where the file was checked
    in.)

 We decided to hold off on doing any of the "variations".  We'll just
 do the basic proposal first, then see how well things work out.
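
 As a rough illustration of the "native" rules in the basic proposal,
 here is a minimal Python sketch.  The names (NEWLINES, CommitAborted,
 commit_native, checkout_native) are made up for this note and are not
 part of any Subversion API; the native style is passed in explicitly
 rather than detected from the platform.

```python
# Hypothetical helper names; a sketch of the "native" rules only.
NEWLINES = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n"}


class CommitAborted(Exception):
    """Raised when the commit rules say the commit must be aborted."""


def commit_native(data: bytes, native_name: str) -> bytes:
    """Translate <native newline style> -> LF for the repository.

    Aborts if any CR or LF is not part of a native-style newline.
    """
    native = NEWLINES[native_name]
    pieces = data.split(native)
    for piece in pieces:
        # Whatever is left between native newlines must be newline-free.
        if b"\r" in piece or b"\n" in piece:
            raise CommitAborted("CR or LF outside a native-style newline")
    return b"\n".join(pieces)


def checkout_native(data: bytes, native_name: str) -> bytes:
    """Translate LF -> <native newline style> for the working copy."""
    return data.replace(b"\n", NEWLINES[native_name])
```

 Because the commit side refuses anything but pure native-style input,
 checkout_native(commit_native(data, n), n) gives back the original
 bytes whenever the commit is accepted.  Files with newline-style
 "none" would simply bypass both functions.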

 Next, Greg responded to an observation by William Uther, in which
 William pointed out that the behavior is not reversible when
 svn:newline-style is LF, CR, or CRLF.  Greg agrees, but explains why
 that's okay:

    On Fri, 2001-12-14 at 15:48, William Uther wrote:
    > --On Friday, 14 December 2001 1:16 PM -0500 Greg Hudson
    > <ghudson@MIT.EDU> wrote:

    > >   If newline-style is LF, CR, or CRLF, translate <native
    > > newline style> -> <requested newline style>.  If we notice any
    > > CRs or LFs which aren't part of a native-style newline and
    > > aren't part of a requested-style newline, abort the commit.  If
    > > the commit succeeds, apply the <native newline style> ->
    > > <requested newline style> translation to the working copy as
    > > well, so that it matches what we would get from a checkout of
    > > the new rev.
    >
    > I don't think this preserves reversibility.  If a file contains
    > BOTH <native-style newline> and <requested-style newline> then
    > you need to abort.  If you translate just <native-style newline>
    > then you can't undo the transformation - you don't know which
    > newlines need to be untransformed.

    This particular transform (for files marked CRLF, CR, or LF) is not
    reversible.  See where I said:

    "We probably don't have to worry so much about data safety for
     these files since a particular, odd behavior has been specified
     for them."

    However, let's add a possible variation to my proposal, for those
    who are still uncomfortable with data-destroying transformations
    applied to such files:

    Variation 5: If the file is marked CRLF, CR, or LF, we translate
    <native-style newline> to <requested-style newline> during commit,
    and abort the commit if we notice any kind of mixing of newline
    styles.  (Can also combine with variation 1.)

 Colin Putney also followed up to William Uther's post, agreeing with
 Greg and explaining why in even more detail:

    > I don't think this preserves reversibility.  If a file contains
    > BOTH <native-style newline> and <requested-style newline> then
    > you need to abort.  If you translate just <native-style newline>
    > then you can't undo the transformation - you don't know which
    > newlines need to be untransformed.

    > Stated simply: You should only translate when the newline style
    > is entirely consistent.  Anything else removes the inconsistency
    > and hence loses information.

    True, this scheme doesn't preserve reversibility. But in this case
    that's OK, because the newline-style decrees what the newline style
    must be. If there are native-style newlines mixed in with the
    requested-style newlines, this is probably the result of corruption
    by some native-newline-obsessive user tool. So the non-reversible
    transform will actually undo the corruption.

    For example, take the file foo.dsp, which has a newline-style of
    CRLF. It's stored in the repository with CRLF newlines and on
    checkout, no transformation is done. If Linus checks out the file
    and edits it in an old version of emacs, any lines he adds will be
    terminated with a bare LF. Since this is his native style of
    newline, the transformation Greg described will undo this damage.

    If the newline-style is set to a specific newline-style (i.e. CR,
    LF, or CRLF), then we know that (1) the file is text, not binary,
    and (2) any other style of newline present is corruption.

    A file should not be marked with a specific newline style unless
    (1) the user does so explicitly, or (2) it matches some heuristic when
    it's added, *and* the file contents conform to that newline style.

    So the only real possibility for corruption is if some user tool
    creates a binary file that matches a heuristic for a specific
    newline style. In our running example, William creates a vector
    graphics file called foo.dsp and adds it. By chance, this file
    happens to have CRLFs scattered through it, but no bare CRs, LFs,
    '\0' characters or other harbingers of binary files. On the commit,
    svn will notice the extension, set the newline-style to CRLF and
    send it to the repository. William may get an error if he tries to
    commit a change that introduces a bare CR or LF, but he won't
    corrupt the file.

    Linus can corrupt the file if he makes a change that introduces a
    bare LF, which will get transformed into CRLF on
    commit. Alternatively, Madeleine (was that her name?) can introduce
    a bare CR and commit, which will also corrupt the file.

    That's a pretty long string of unlikely coincidences though, while
    the opposite case, where this transformation *fixes* corruption, is
    quite common.

    Colin

 Finally, Greg followed up, confirming again what he meant by that
 portion of the proposal, but also saying that he could go the other
 way on the question too:

    > +1 on Greg Hudson's latest proposal -- and I think we're now
    > ready to Actually Do It. :-)

    I hope so.  For a while I was afraid we had hit our first failure
    to achieve livable consensus.  My apologies for not realizing the
    reversibility thing until two days and several thousand lines of
    misguided debate had already gone by.

    > My assumption is that "in the working copy" means both text-base
    > and working file, for the sake of an efficient is-modified-p
    > test, and since the repository file is just an automatic
    > transform off the text-base anyway.

    Actually, I was assuming that text-base would be a verbatim copy of
    the repository contents.  But that's kind of an implementation
    detail; let's leave that up to Ben (assuming he's doing the
    implementation).

    > Otherwise, then the is-modified-p check has to be tweaked in a
    > way that will make modifiedness checks a lot slower in some
    > cases.

    No... it just means that if the mod times force a contents check,
    you have to translate the text-base contents as you compare them
    against the normal contents.  That's "a teeny tiny bit slower,"
    not, "a lot slower."

    > The second sentence of the above paragraph isn't about allowing
    > mixed-style files.  It's saying that if the entire file is native
    > format, allow that (and transform when necessary), OR if the
    > entire file is in the requested style, then allow that too.  The
    > latter situation could happen if someone used an LF-style tool
    > under Windows, for example, so that when an LF-style file got
    > saved, the whole thing would be LF-style now, not native style.
    > No reason to disallow this.
    >
    > Right?

    See my last message, as well as Colin Putney's argument.  In
    summary, that's not actually what I meant, but I don't really care
    either way.

 We're implementing Greg's original meaning, since (as Colin Putney
 pointed out) the chance of "bad" data corruption is very small, and
 the chance that it would undo corruption is actually greater.
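
 To make the chosen behavior for LF, CR, and CRLF files concrete, here
 is a minimal Python sketch of the commit-time transform under Greg's
 original meaning.  The names (NEWLINES, CommitAborted, commit_fixed)
 are hypothetical and the native style is passed in by hand; this is an
 illustration of the rule as stated above, not the actual
 implementation.

```python
# Hypothetical names; a sketch of the LF/CR/CRLF commit rule only.
NEWLINES = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n"}


class CommitAborted(Exception):
    """Raised when the commit rules say the commit must be aborted."""


def commit_fixed(data: bytes, requested_name: str, native_name: str) -> bytes:
    """Translate <native newline style> -> <requested newline style>.

    Requested-style newlines pass through unchanged.  Any CR or LF that
    belongs to neither style aborts the commit.
    """
    requested = NEWLINES[requested_name]
    native = NEWLINES[native_name]
    # Try the longer sequence first so CRLF is not read as CR then LF.
    accepted = sorted({requested, native}, key=len, reverse=True)
    out = bytearray()
    i = 0
    while i < len(data):
        for seq in accepted:
            if data.startswith(seq, i):
                out += requested
                i += len(seq)
                break
        else:
            if data[i] in b"\r\n":
                raise CommitAborted("CR or LF in neither accepted style")
            out.append(data[i])
            i += 1
    return bytes(out)
```

 In Colin's foo.dsp example (requested CRLF, native LF), a file with a
 stray bare LF and a clean all-CRLF file both come out as the same
 all-CRLF bytes.  That is exactly why the transform is not reversible,
 and also why it repairs the damage done by Linus's editor.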


 * Keyword substitution

 Quick summary: there's one property, named "svn:keywords".  Its value
 is a whitespace-separated list of keywords to expand.  Since keywords
 have both long and short forms, both ways are allowed in the value of
 the svn:keywords property.  Here are all the keywords:

    "LastChangedBy"       "Author"         ---> either one expands author
    "LastChangedDate"     "Date"           ---> either one expands date
    "LastChangedRevision" "Revision" "Rev" ---> any one expands rev
    "HeadURL"             "URL"            ---> either one expands url

 Here are some example values of the property:

    "Rev LastChangedDate HeadURL"

    "Author\nDate  \n  LastChangedRevision       URL"

 Unrecognized words are ignored; absence of the property is the same as
 an empty value or a value with no valid keywords in it.

 Keywords (long and short forms) are case-sensitive, as in CVS.
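
 The parsing rules above are small enough to sketch in a few lines of
 Python.  The names KEYWORD_ALIASES and parse_keywords, and the
 canonical labels "author", "date", "rev", and "url", are chosen for
 this note only; the sketch covers reading the property value, not the
 expansion itself.

```python
# Hypothetical names; maps every accepted spelling to one canonical label.
KEYWORD_ALIASES = {
    "LastChangedBy": "author",
    "Author": "author",
    "LastChangedDate": "date",
    "Date": "date",
    "LastChangedRevision": "rev",
    "Revision": "rev",
    "Rev": "rev",
    "HeadURL": "url",
    "URL": "url",
}


def parse_keywords(value):
    """Return the set of keywords to expand for an svn:keywords value.

    A missing property (None) is treated like an empty value.  Words are
    split on any whitespace, matched case-sensitively, and ignored when
    unrecognized.
    """
    if value is None:
        return set()
    return {KEYWORD_ALIASES[w] for w in value.split() if w in KEYWORD_ALIASES}
```

 With this sketch, both example values above parse as expected: the
 first yields rev, date, and url, and the second yields all four
 keywords.  A value such as "rev" yields nothing, since matching is
 case-sensitive.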