*Chapter 3*
h1. Lexical Structure
This chapter specifies the lexical structure of the Groovy programming language.
Programs are written in Unicode ($3.1), but lexical translations are provided ($3.2) so that Unicode escapes ($3.3) can be used to include any Unicode character using only ASCII characters.
Line terminators are defined ($3.4) to support the different conventions of existing host systems while maintaining consistent line numbers.
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements ($3.5), which are white space ($3.6), comments ($3.7), and tokens.
The tokens are the identifiers ($3.8), keywords ($3.9), literals ($3.10), separators ($3.11), and operators ($3.12) of the syntactic grammar.
{anchor:3.1}
h2. 3.1 Unicode
Programs are written using the Unicode character set.
Information about this character set and its associated character encodings may be found at:
bq. http://www.unicode.org
The Java platform tracks the Unicode specification as it evolves.
The precise version of Unicode used by a given release is specified in the documentation of the class {{Character}}.
Versions of the Groovy programming language up to and including 1.0 final use Unicode version 3.0 because J2SE 1.4 does.
Upgrades to newer versions of the Unicode Standard occurred in J2SE 5.0 (to Unicode 4.0).
*J2SE 5.0+*
The Unicode standard was originally designed as a fixed-width 16-bit character encoding.
It has since been changed to allow for characters whose representation requires more than 16 bits.
The range of legal code points since J2SE 5.0 is now U+0000 to U+10FFFF, using the hexadecimal U+n _notation_.
Characters whose code points are greater than U+FFFF are called supplementary characters.
To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16.
In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF).
For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.
The Groovy programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
A few APIs, primarily in the {{Character}} class, use 32-bit integers to represent code points as individual entities.
The Java platform provides methods to convert between the two representations.
This book uses the terms _code_ _point_ and _UTF-16_ _code_ _unit_ where the representation is relevant, and the generic term _character_ where the representation is irrelevant to the discussion.
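As an informal illustration of the two representations, the following Java sketch converts a supplementary code point to its UTF-16 code units and back, using methods of {{Character}} and {{String}} available since J2SE 5.0. The class name {{CodePointDemo}} and the helper {{codeUnitCount}} are hypothetical names chosen for this example only.

```java
final class CodePointDemo {
    /** Returns how many UTF-16 code units are needed to encode one code point. */
    static int codeUnitCount(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // MATHEMATICAL BOLD ITALIC SMALL A, a supplementary character (above U+FFFF).
        int boldItalicA = 0x1D482;
        char[] units = Character.toChars(boldItalicA);

        // Print each code unit of the surrogate pair in hexadecimal: d835, then dc82.
        for (char unit : units) {
            System.out.printf("%04x%n", (int) unit);
        }

        String text = new String(units);
        System.out.println(text.length());                          // code units: 2
        System.out.println(text.codePointCount(0, text.length()));  // code points: 1

        // Converting the surrogate pair back yields the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == boldItalicA);
    }
}
```

A character in the range U+0000 to U+FFFF, such as U+0061, needs a single code unit, so {{codeUnitCount(0x61)}} returns 1.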
*All Platforms*
Except for comments ($3.7), identifiers, and the contents of string literals ($3.10.5), all input elements ($3.5) in a program are formed only from ASCII characters (or Unicode escapes ($3.3) which result in ASCII characters). ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode character encoding are the ASCII characters.
Note: Groovy has no character literals; see $3.10.4.
{anchor:3.2}
h2. 3.2 Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes ($3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form {{\\\uxxxx}}, where {{xxxx}} is a hexadecimal value, represents the UTF-16 code unit whose encoding is {{xxxx}}. This translation step allows any program to be expressed using only ASCII characters.
1. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators ($3.4).
1. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements ($3.5) which, after white space ($3.6) and comments ($3.7) are discarded, comprise the tokens ($3.5) that are the terminal symbols of the syntactic grammar ($2.3).
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. Thus the input characters {{a\-\-b}} are tokenized ($3.5) as {{a}}, {{--}}, {{b}}, which is not part of any grammatically correct program, even though the tokenization {{a}}, {{-}}, {{-}}, {{b}} could be part of a grammatically correct program.
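The longest-match rule can be sketched with a deliberately tiny tokenizer. This is not the Groovy lexer: the operator set below is a hypothetical four-entry subset, and only identifiers, white space, and those operators are recognized. It is meant only to show why {{a\-\-b}} yields three tokens rather than four.

```java
import java.util.ArrayList;
import java.util.List;

final class LongestMatchDemo {
    // Hypothetical operator subset; longer operators are listed before their prefixes
    // so that the first match found is also the longest one.
    private static final String[] OPERATORS = {"--", "-=", "-", "="};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {          // white space only separates tokens
                i++;
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) { // consume the longest identifier
                int start = i;
                while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {             // first hit is the longest operator
                if (input.startsWith(op, i)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));   // [a, --, b]
        System.out.println(tokenize("a- -b"));  // [a, -, -, b]
    }
}
```

Only the second input, where white space separates the two minus signs, produces the four-token sequence.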
{anchor:3.3}
h2. 3.3 Unicode Escapes
Implementations first recognize _Unicode_ _escapes_ in their input, translating the ASCII characters {{\\\u}} followed by four hexadecimal digits to the UTF-16 code unit ($3.1) with the indicated hexadecimal value, and passing all other characters unchanged.
*J2SE 5.0+*
Representing supplementary characters requires two consecutive Unicode escapes.
*All Platforms*
This translation step results in a sequence of Unicode input characters:
{code}
UnicodeInputCharacter:
UnicodeEscape
RawInputCharacter
UnicodeEscape:
'\\\' UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
'u'
UnicodeMarker 'u'
RawInputCharacter:
any Unicode character
HexDigit: one of
'0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
'a' 'b' 'c' 'd' 'e' 'f'
'A' 'B' 'C' 'D' 'E' 'F'
{code}
The {{\\\}}, {{u}}, and hexadecimal digits here are all ASCII characters.
In addition to the processing implied by the grammar, for each raw input character that is a backslash {{\\\}}, input processing must consider how many other {{\\\}} characters contiguously precede it, separating it from a non-{{\\\}} character or the start of the input stream.
If this number is even, then the {{\\\}} is eligible to begin a Unicode escape;
if the number is odd, then the {{\\\}} is not eligible to begin a Unicode escape.
For example, the raw input {{"\\\\\\u0061=\\\u0061"}} results in the eleven characters {{" \\\ \\\ u 0 0 6 1 = a "}} ({{\\\u0061}} is the Unicode encoding of the character "a").
If an eligible {{\\\}} is not followed by {{u}}, then it is treated as a _RawInputCharacter_ and remains part of the escaped Unicode stream.
If an eligible {{\\\}} is followed by {{u}}, or more than one {{u}}, and the last {{u}} is not followed by four hexadecimal digits, then a compile-time error occurs.
The character produced by a Unicode escape does not participate in further Unicode escapes.
For example, the raw input {{\\\u005cu005a}} results in the six characters {{\\\ u 0 0 5 a}}, because {{005c}} is the Unicode value for {{\\\}}.
It does not result in the character {{Z}}, which is Unicode character {{005a}}, because the {{\\\}} that resulted from the {{\\\u005c}} is not interpreted as the start of a further Unicode escape.
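The rules above (the even/odd backslash count, one or more {{u}} markers, exactly four ASCII hexadecimal digits, and no re-scanning of produced characters) can be sketched in Java as follows. This is an illustrative reading of the text, not the Groovy compiler's implementation; the class and method names are hypothetical, and a malformed escape is reported with an exception where a compiler would issue a compile-time error.

```java
final class UnicodeEscapeDemo {
    /** Returns the value of an ASCII hexadecimal digit, or -1 for any other character. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Translates eligible Unicode escapes in raw input; all other characters pass unchanged. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous raw backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean eligible = c == '\\' && backslashRun % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // UnicodeMarker: one or more 'u' characters
                }
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(raw.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value);
                i = j + 4;
                backslashRun = 0; // the produced character never starts a further escape
                continue;
            }
            out.append(c);
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Java string literals double each backslash; the raw inputs are the two examples above.
        System.out.println(translate("\"\\\\u0061=\\u0061\"")); // eleven characters, ending in =a"
        System.out.println(translate("\\u005cu005a"));           // six characters, not Z
    }
}
```

Note that the sketch counts backslashes in the raw input only, so a backslash produced by an escape is never counted toward the run that precedes a later backslash.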
The Groovy programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools.
The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra {{u}} - for example, {{\\\uxxxx}} becomes {{\\\uuxxxx}} - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single {{u}} each.
This transformed version is equally acceptable to a compiler for the Groovy programming language ("Groovy compiler") and represents the exact same program.
The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple {{u}}'s are present to a sequence of Unicode characters with one fewer {{u}}, while simultaneously converting each escape sequence with a single {{u}} to the corresponding single Unicode character.
Implementations should use the {{\\\uxxxx}} notation as an output format to display Unicode characters when a suitable font is not available.
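The ASCII transformation and its inverse might look like the following Java sketch. It is a simplified illustration under stated assumptions: the names {{toAscii}} and {{restore}} are hypothetical, {{restore}} assumes every escape in its input is well formed, and the sketch does not treat the corner case of a non-ASCII character that directly follows an odd number of backslashes.

```java
final class AsciiTransformDemo {
    /** Adds one 'u' to every existing escape and escapes every non-ASCII code unit. */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes immediately before index i
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                out.append(c);
                boolean eligible = backslashRun % 2 == 0;
                if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                    out.append('u'); // the extra marker; the original 'u' follows next
                }
                backslashRun++;
            } else {
                if (c > 0x7f) {
                    out.append(String.format("\\u%04x", (int) c)); // single-u escape
                } else {
                    out.append(c);
                }
                backslashRun = 0;
            }
        }
        return out.toString();
    }

    /** Inverse of toAscii: multi-u escapes lose one 'u', single-u escapes become characters. */
    static String restore(String ascii) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\' && backslashRun % 2 == 0
                    && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < ascii.length() && ascii.charAt(j) == 'u') {
                j++;
            }
            int markers = j - (i + 1);
            if (markers > 1) {
                out.append('\\');
                for (int k = 1; k < markers; k++) {
                    out.append('u');
                }
                out.append(ascii, j, j + 4); // keep the four hexadecimal digits as text
            } else {
                out.append((char) Integer.parseInt(ascii.substring(j, j + 4), 16));
            }
            i = j + 4;
            backslashRun = 0;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Source text containing one non-ASCII character and one existing escape.
        String original = "caf" + (char) 0xe9 + " = \\u0061";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(restore(ascii).equals(original)); // true
    }
}
```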
{anchor:3.4}
h2. 3.4 Line Terminators
Implementations next divide the sequence of Unicode input characters into lines by recognizing _line_ _terminators_.
This definition of lines determines the line numbers produced by a Groovy compiler or other system component.
It also specifies the termination of the {{//}} form of a comment ($3.7).
{code}
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
InputCharacter:
UnicodeInputCharacter but not CR or LF
{code}
Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two.
The result is a sequence of line terminators and input characters, which are the terminal symbols for the third step in the tokenization process.
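A minimal Java sketch of this line recognition follows. The class and method names are hypothetical; the only behaviour taken from the text is that CR, LF, and CR LF each count as exactly one line terminator.

```java
import java.util.ArrayList;
import java.util.List;

final class LineTerminatorDemo {
    /** Splits text into lines, treating CR, LF, and CR LF each as a single terminator. */
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                // CR immediately followed by LF is one terminator, so skip the LF.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString()); // final line without a terminator
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\rc\nd")); // [a, b, c, d]
        System.out.println(splitLines("a\n\rb").size()); // 3: LF then CR are two terminators
    }
}
```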
{anchor:3.5}
h2. 3.5 Input Elements and Tokens
The input characters and line terminators that result from escape processing ($3.3) and then input line recognition ($3.4) are reduced to a sequence of _input_ _elements_.
Those input elements that are not white space ($3.6) or comments ($3.7) are tokens. The tokens are the terminal symbols of the syntactic grammar ($2.3).
This process is specified by the following productions:
{code}
Input:
InputElements(opt) Sub(opt)
InputElements:
InputElement
InputElements InputElement
InputElement:
WhiteSpace
Comment
Token
Token:
Identifier
Keyword
Literal
Separator
Operator
Sub:
the ASCII SUB character, also known as "control-Z"
{code}
White space ($3.6) and comments ($3.7) can serve to separate tokens that, if adjacent, might be tokenized in another manner. For example, the ASCII characters {{-}} and {{=}} in the input can form the operator token {{-=}} ($3.12) only if there is no intervening white space or comment.
As a special concession for compatibility with certain operating systems, the ASCII SUB character ({{\\\u001a}}, or control-Z) is ignored if it is the last character in the escaped input stream.
Consider two tokens {{x}} and {{y}} in the resulting input stream. If {{x}} precedes {{y}}, then we say that {{x}} is _to_ _the_ _left_ _of_ {{y}} and that {{y}} is _to_ _the_ _right_ _of_ {{x}}.
For example, in this simple piece of code:
{code}
class Empty {
}
{code}
we say that the {{\}}} token is to the right of the {{\{}} token, even though it appears, in this two-dimensional representation on paper, downward and to the left of the {{\{}} token.
This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.
{anchor:3.6}
h2. 3.6 White Space
_White_ _Space_ is defined as the ASCII space, horizontal tab, and form feed characters, as well as line terminators ($3.4).
{code}
WhiteSpace:
the ASCII SP character, also known as "space"
the ASCII HT character, also known as "horizontal tab"
the ASCII FF character, also known as "form feed"
LineTerminator
{code}
{anchor:3.7}
h2. 3.7 Comments
There are three kinds of _comments_:
{table}
Form | Description
{{/\* text \*/}} | A _traditional_ _comment_: all the text from the ASCII characters {{/\*}} to the ASCII characters {{\*/}} is ignored (as in C and C++).
{{// text}} | An _end_-_of_-_line_ _comment_: all the text from the ASCII characters {{//}} to the end of the line is ignored (as in C++).
{{# text}} | A _shell_ _comment_: all the text from the ASCII character {{#}} to the end of the line is ignored (as in Unix shell scripts).
{table}
These comments are formally specified by the following productions:
{code}
Comment:
TraditionalComment
EndOfLineComment
ShellComment
TraditionalComment:
'/' '*' CommentTail
EndOfLineComment:
'/' '/' CharactersInLine(opt)
ShellComment:
'#' CharactersInLine(opt)
CommentTail:
'*' CommentTailStar
NotStar CommentTail
CommentTailStar:
'/'
'*' CommentTailStar
NotStarNotSlash CommentTail
NotStar:
InputCharacter but not '*'
LineTerminator
NotStarNotSlash:
InputCharacter but not '*' or '/'
LineTerminator
CharactersInLine:
InputCharacter
CharactersInLine InputCharacter
{code}
These productions imply all of the following properties:
- Comments do not nest.
- {{#}}, {{/\*}} and {{\*/}} have no special meaning in comments that begin with {{//}}.
- {{#}} and {{//}} have no special meaning in comments that begin with {{/\*}} or {{/\*\*}}.
- {{//}}, {{/\*}} and {{\*/}} have no special meaning in comments that begin with {{#}}.
As a result, the text:
{code}
/* a comment which... /* # // /** ends here: */
{code}
is a single complete comment.
The lexical grammar implies that comments do not occur within string literals ($3.10.5) or regex literals ($3.10.todo).
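The properties listed above can be checked with a small comment-stripping sketch in Java. It follows the productions for the three comment forms only; string and regex literals are deliberately not handled, so it is not a substitute for a real lexer. The class and method names are hypothetical.

```java
final class CommentStripDemo {
    /** Removes traditional, end-of-line, and shell comments from the input. */
    static String stripComments(String input) {
        StringBuilder out = new StringBuilder();
        int n = input.length();
        int i = 0;
        while (i < n) {
            if (input.startsWith("/*", i)) {
                // A traditional comment ends at the first closing marker: no nesting.
                int end = input.indexOf("*/", i + 2);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated traditional comment");
                }
                i = end + 2;
            } else if (input.startsWith("//", i) || input.charAt(i) == '#') {
                // Skip to the end of the line; the line terminator itself is kept.
                while (i < n && input.charAt(i) != '\r' && input.charAt(i) != '\n') {
                    i++;
                }
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example from the text is one complete comment, so nothing remains.
        System.out.println(stripComments("/* a comment which... /* # // /** ends here: */").isEmpty());
        // Because comments do not nest, the text after the first closing marker survives.
        System.out.println(stripComments("a /* /* */ b */"));
    }
}
```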
{anchor:3.8}
h2. 3.8 Identifiers
An _identifier_ is an unlimited-length sequence of _Java_ _letters_ and _Java_ _digits_, the first of which must be a Java letter. An identifier cannot have the same spelling (Unicode character sequence) as a keyword ($3.9), boolean literal ($3.10.3), or the null literal ($3.10.7). todo - other literals????
{code}
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
{code}
Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.
A "Java letter" is a character for which the method {{Character.isJavaIdentifierStart(int)}} returns {{true}}. A "Java letter-or-digit" is a character for which the method {{Character.isJavaIdentifierPart(int)}} returns {{true}}.
The Java letters include uppercase and lowercase ASCII Latin letters {{A}}-{{Z}} ({{\\\u0041}}-{{\\\u005a}}), and {{a}}-{{z}} ({{\\\u0061}}-{{\\\u007a}}), and, for historical reasons, the ASCII underscore ({{_}}, or {{\\\u005f}}) and dollar sign ({{$}}, or {{\\\u0024}}). The {{$}} character should be used only in mechanically generated source code or, rarely, to access preexisting names on legacy systems.
The "Java digits" include the ASCII digits {{0}}-{{9}} ({{\\\u0030}}-{{\\\u0039}}).
Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.
Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A ({{A}}, {{\\\u0041}}), LATIN SMALL LETTER A ({{a}}, {{\\\u0061}}), GREEK CAPITAL LETTER ALPHA ({{Α}}, {{\\\u0391}}), CYRILLIC SMALL LETTER A ({{а}}, {{\\\u0430}}) and MATHEMATICAL BOLD ITALIC SMALL A ({{𝒂}}, {{\\\ud835\\\udc82}}) are all different.
Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE ({{Á}}, {{\\\u00c1}}) could be considered to be the same as a LATIN CAPITAL LETTER A ({{A}}, {{\\\u0041}}) immediately followed by a NON-SPACING ACUTE ({{́}}, {{\\\u0301}}) when sorting, but these are different in identifiers. See _The_ _Unicode_ _Standard_, Volume 1, pages 412ff for details about decomposition, and see pages 626-627 of that work for details about sorting.
Examples of identifiers are:
{code}
String i3 αφεψ MAX_VALUE isLetterOrDigit
{code}
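The _IdentifierChars_ production can be sketched in Java with the two {{Character}} methods named above, iterating by code point so that supplementary characters are handled. The class and method names are hypothetical, and the check covers the character rules only: the exclusion of keywords and of the boolean and null literals is not applied here.

```java
final class IdentifierDemo {
    /** Checks that s is a Java letter followed by zero or more Java letters-or-digits. */
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Advance by code point, which may span one or two UTF-16 code units.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierChars("MAX_VALUE")); // true
        System.out.println(isIdentifierChars("i3"));        // true
        System.out.println(isIdentifierChars("3i"));        // false: starts with a digit

        // A supplementary character is a single Java letter made of two code units.
        String boldItalicA = new String(Character.toChars(0x1D482));
        System.out.println(isIdentifierChars(boldItalicA)); // true

        // Composed and decomposed forms are different character sequences.
        String composed = String.valueOf((char) 0x00C1);
        String decomposed = "A" + (char) 0x0301;
        System.out.println(composed.equals(decomposed));    // false
    }
}
```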
{anchor:3.9}
h2. 3.9 Keywords
The following character sequences, formed from ASCII letters, are reserved for use as _keywords_ and cannot be used as identifiers ($3.8):
{table}
Keywords: | one of | | |
{{abstract}} | {{continue}} | {{for}} | {{new}} | {{switch}}
{{assert}} | {{default}} | {{if}} | {{package}} | {{synchronized}}
{{boolean}} | {{do}} | {{goto}} | {{private}} | {{this}}
{{break}} | {{double}} | {{implements}} | {{protected}} | {{throw}}
{{byte}} | {{else}} | {{import}} | {{public}} | {{throws}}
{{case}} | {{enum}} | {{instanceof}} | {{return}} | {{transient}}
{{catch}} | {{extends}} | {{int}} | {{short}} | {{try}}
{{char}} | {{final}} | {{interface}} | {{static}} | {{void}}
{{class}} | {{finally}} | {{long}} | {{strictfp}} | {{volatile}}
{{const}} | {{float}} | {{native}} | {{super}} | {{while}}
| | | |
{{any}} | {{def}} | {{threadsafe}} | |
{{as}} | {{in}} | {{with}} | |
{table}
The keywords {{any}}, {{const}}, {{goto}}, {{threadsafe}} and {{with}} are reserved, even though they are not currently used. This may allow a Groovy compiler to produce better error messages if these keywords incorrectly appear in programs.
While {{true}} and {{false}} might appear to be keywords, they are technically Boolean literals ($3.10.3). Similarly, while {{null}} might appear to be a keyword, it is technically the null literal ($3.10.7).
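The distinction between keywords, the boolean and null literals, and ordinary identifiers can be sketched in Java as a simple lookup. The keyword set is copied from the table in this section; the class name, method name, and returned labels are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class KeywordDemo {
    // The reserved words listed in the keyword table of this section.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "abstract", "continue", "for", "new", "switch",
            "assert", "default", "if", "package", "synchronized",
            "boolean", "do", "goto", "private", "this",
            "break", "double", "implements", "protected", "throw",
            "byte", "else", "import", "public", "throws",
            "case", "enum", "instanceof", "return", "transient",
            "catch", "extends", "int", "short", "try",
            "char", "final", "interface", "static", "void",
            "class", "finally", "long", "strictfp", "volatile",
            "const", "float", "native", "super", "while",
            "any", "def", "threadsafe", "as", "in", "with"));

    /** Classifies a word that already has the shape of IdentifierChars. */
    static String classify(String word) {
        if (KEYWORDS.contains(word)) {
            return "keyword";
        }
        if (word.equals("true") || word.equals("false")) {
            return "boolean literal";
        }
        if (word.equals("null")) {
            return "null literal";
        }
        return "identifier";
    }

    public static void main(String[] args) {
        System.out.println(classify("def"));       // keyword
        System.out.println(classify("goto"));      // keyword (reserved but unused)
        System.out.println(classify("true"));      // boolean literal
        System.out.println(classify("null"));      // null literal
        System.out.println(classify("MAX_VALUE")); // identifier
    }
}
```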