java.lang.Object
  com.ibm.icu.text.BreakIterator
    com.ibm.icu.text.RuleBasedBreakIterator
      com.ibm.icu.text.RuleBasedBreakIterator_Old
A subclass of BreakIterator whose behavior is specified using a list of rules.
There are two kinds of rules, which are separated by semicolons: substitutions and regular expressions.
A substitution rule defines a name that can be used in place of an expression. It consists of a name, an equals sign, and an expression. (There can be no whitespace on either side of the equals sign.) To keep its syntactic meaning intact, the expression must be enclosed in parentheses or square brackets. A substitution is visible after its definition and is filled in using simple textual substitution. When a substitution is used, its name is enclosed in curly braces; the curly braces are optional in the substitution's definition. Substitution definitions can contain other substitutions, as long as those substitutions have been defined first. Substitutions are generally used to make the regular expressions (which can get quite complex) shorter and easier to read. They typically define either character categories or commonly-used subexpressions.
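The textual-substitution behavior described above can be sketched in a few lines of Java. This is an illustration only, not the ICU parser: the class `SubstitutionSketch`, the helper `expand`, and the names `digit` and `number` are all hypothetical. It shows why a definition is wrapped in parentheses or brackets (so it survives being pasted into a larger expression) and why a substitution must be defined before it is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SubstitutionSketch {
    // Hypothetical helper: replaces every {name} reference with the text of
    // its definition, using plain (non-regex) string replacement.
    static String expand(String expr, Map<String, String> defs) {
        String out = expr;
        for (Map.Entry<String, String> e : defs.entrySet()) {
            out = out.replace("{" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> defs = new LinkedHashMap<>();
        defs.put("digit", "[0-9]");
        // A later definition may refer to an earlier one; it is expanded
        // at the point where it is defined.
        defs.put("number", expand("({digit}+)", defs));
        // "\\." in Java source is the two characters \. seen by a rule parser.
        System.out.println(expand("{number}\\.{digit}*", defs));
        // prints ([0-9]+)\.[0-9]*
    }
}
```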
There is one special substitution. If the description defines a substitution called "_ignore_", the expression must be a [] expression, and the expression defines a set of characters (the "ignore characters") that will be transparent to the BreakIterator. A sequence of characters will break the same way it would if any ignore characters it contains are taken out. Break positions never occur before ignore characters, except when the character before the ignore characters is a line or paragraph terminator.
A regular expression uses a syntax similar to the normal Unix regular-expression syntax, and defines a sequence of characters to be kept together. With one significant exception, the iterator uses a longest-possible-match algorithm when matching text to regular expressions. The iterator also treats descriptions containing multiple regular expressions as if they were ORed together (i.e., as if they were separated by |).
The special characters recognized by the regular-expression parser are as follows:
* Specifies that the expression preceding the asterisk may occur any number of times (including not at all).
+ Specifies that the expression preceding the plus sign may occur one or more times (i.e., it must occur at least once).
? Specifies that the expression preceding the question mark may occur once or not at all (i.e., it makes the preceding expression optional).
() Encloses a sequence of characters. If followed by * or +, the sequence repeats. If followed by ?, the sequence is optional. Otherwise, the parentheses are just a grouping device and a way to delimit the ends of expressions containing |.
| Separates two alternative sequences of characters. Either one sequence or the other, but not both, matches this expression. The | character can only occur inside ().
. Matches any character.
*? Specifies a non-greedy asterisk. *? works the same way as *, except when there is overlap between the last group of characters in the expression preceding the * and the first group of characters following the *. When there is this kind of overlap, * will match the longest sequence of characters that match the expression before the *, and *? will match the shortest sequence of characters matching the expression before the *?. For example, if you have "xxyxyyyxyxyxxyxyxyy" in the text, "x[xy]*x" will match through to the last x (i.e., "xxyxyyyxyxyxxyxyx", leaving only the trailing "yy" unmatched), but "x[xy]*?x" will only match the first two xes ("xx").
[] Specifies a group of alternative characters. A [] expression will match any single character that is specified in the [] expression. For more on the syntax of [] expressions, see below.
/ Specifies where the break position should go if text matches this expression. (e.g., "[a-z]*/[:Zs:]*[0-9]" will match if the iterator sees a run of letters, followed by a run of whitespace, followed by a digit, but the break position will actually go before the whitespace). Expressions that don't contain / put the break position at the end of the matching text.
\ Escape character. The \ itself is ignored, but causes the next character to be treated as a literal character. This has no effect for many characters, but for the characters listed above, this deprives them of their special meaning. (There are no special escape sequences for Unicode characters, or tabs and newlines; these are all handled by a higher-level protocol. In a Java string, "\n" will be converted to a literal newline character by the time the regular-expression parser sees it. Of course, this means that \ sequences that are visible to the regexp parser must be written as \\ when inside a Java string.) All characters in the ASCII range except for letters, digits, and control characters are reserved characters to the parser and must be preceded by \ even if they currently don't mean anything.
! If ! appears at the beginning of a regular expression, it tells the regexp parser that this expression specifies the backwards-iteration behavior of the iterator, and not its normal iteration behavior. This is generally only used in situations where the automatically-generated backwards-iteration behavior doesn't produce satisfactory results and must be supplemented with extra client-specified rules.
(all others) All other characters are treated as literal characters, which must match the corresponding character(s) in the text exactly.
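The difference between * and *? in the example above can be checked with the JDK's java.util.regex package. That package is not the rule parser described here and its syntax differs in other respects, but for this particular pattern its greedy and reluctant quantifiers behave in the way the text describes. The class and method names below are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "xxyxyyyxyxyxxyxyxyy";
        // Greedy: runs through to the last x in the text.
        System.out.println(firstMatch("x[xy]*x", text));   // xxyxyyyxyxyxxyxyx
        // Reluctant: stops at the first x that completes a match.
        System.out.println(firstMatch("x[xy]*?x", text));  // xx
    }
}
```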
Within a [] expression, a number of other special characters can be used to specify groups of characters:
- Specifies a range of matching characters. For example "[a-p]" matches all lowercase Latin letters from a to p (inclusive). The - sign specifies ranges of continuous Unicode numeric values, not ranges of characters in a language's alphabetical order: "[a-z]" doesn't include capital letters, nor does it include accented letters such as a-umlaut.
^ Inverts the expression. All characters the expression includes are excluded, and vice versa (i.e., it has the effect of saying "all Unicode characters except..."). This character only has its special meaning when it's the first character in the [] expression. (Generally, you only see the ^ character inside a nested [] expression used in conjunction with the syntax below.)
(all others) All other characters are treated as literal characters. (For example, "[aeiou]" specifies just the letters a, e, i, o, and u.)
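Because a range is defined by numeric character values, membership is a plain comparison. The following sketch (class and method names are illustrative, not part of ICU) shows why "[a-z]" excludes both capital letters and a-umlaut.

```java
final class RangeSketch {
    // True if c lies in the inclusive range of character values lo..hi,
    // which is how a "[lo-hi]" range is described above.
    static boolean inRange(char c, char lo, char hi) {
        return c >= lo && c <= hi;
    }

    public static void main(String[] args) {
        System.out.println(inRange('m', 'a', 'p'));      // true
        System.out.println(inRange('q', 'a', 'p'));      // false
        System.out.println(inRange('A', 'a', 'z'));      // false: U+0041 is below U+0061
        System.out.println(inRange('\u00e4', 'a', 'z')); // false: a-umlaut is U+00E4, above U+007A
    }
}
```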
[] expressions can nest. There are some other characters that have special meaning only when used in conjunction with nested [] expressions:
:: Within a nested [] expression, a pair of colons containing a one- or two-letter code matches all characters in the corresponding Unicode category. The :: expression has to be the only thing inside the [] expression. The two-letter codes are the same as the two-letter codes in the Unicode database (for example, "[[:Sc:][:Sm:]]" matches all currency symbols and all math symbols). Specifying a one-letter code is the same as specifying all two-letter codes that begin with that letter (for example, "[[:L:]]" matches all letters, and is equivalent to "[[:Lu:][:Ll:][:Lo:][:Lm:][:Lt:]]"). Anything other than a valid two-letter Unicode category code or a single letter that begins a valid Unicode category code is illegal within the colons.
| Two nested [] expressions juxtaposed or separated only by a | character are merged together into a single [] expression matching all the characters in either of the original [] expressions (e.g., "[[ab][bc]]" is equivalent to "[abc]", and so is "[[ab]|[bc]]"). NOTE: "[ab][bc]" is NOT the same thing as "[[ab][bc]]". The first expression will match two characters: an a or b followed by either another b or a c. The second expression will match a single character, which may be a, b, or c. The nesting is required for the expressions to merge together.
& Two nested [] expressions with only & between them will match any character that appears in both nested [] expressions (this is a set intersection). (e.g., "[[ab]&[bc]]" will only match the letter b.)
- Two nested [] expressions with - between them will match any character that appears in the first nested [] expression but not the second one (this is an asymmetrical set difference). (e.g., "[[:Sc:]-[$]]" matches any currency symbol except the dollar sign. "[[ab]-[bc]]" will match only the letter a. This has exactly the same effect as "[[ab]&[^bc]]".)
For a more complete explanation, see http://oss.software.ibm.com/icu/docs/papers/text_boundary_analysis_in_java/index.html.
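The three combining operators are ordinary set union, intersection, and difference. The sketch below models them with java.util sets and uses the JDK's Character.getType to stand in for the [:Sc:] category lookup. It illustrates the semantics only; the class and method names are hypothetical and none of this is how the rule parser is implemented.

```java
import java.util.Set;
import java.util.TreeSet;

final class NestedSetSketch {
    // Builds the set of characters that a simple "[...]" expression lists.
    static Set<Character> set(String chars) {
        Set<Character> s = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            s.add(c);
        }
        return s;
    }

    // "[[a][b]]" or "[[a]|[b]]": characters in either set.
    static Set<Character> union(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    // "[[a]&[b]]": characters in both sets.
    static Set<Character> intersection(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    // "[[a]-[b]]": characters in the first set but not the second.
    static Set<Character> difference(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    // Models "[[:Sc:]-[$]]": any currency symbol except the dollar sign.
    static boolean currencyButNotDollar(int codePoint) {
        return Character.getType(codePoint) == Character.CURRENCY_SYMBOL
                && codePoint != '$';
    }

    public static void main(String[] args) {
        System.out.println(union(set("ab"), set("bc")));        // [a, b, c]
        System.out.println(intersection(set("ab"), set("bc"))); // [b]
        System.out.println(difference(set("ab"), set("bc")));   // [a]
        System.out.println(currencyButNotDollar('\u20ac'));     // true (euro sign)
        System.out.println(currencyButNotDollar('$'));          // false
    }
}
```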
For examples, see the resource data (which is annotated).
- Author:
- Richard Gillam
- Status:
- Internal. This API is Internal Only and can change at any time.
Nested Class Summary
protected class RuleBasedBreakIterator_Old.Builder
- The Builder class has the job of constructing a RuleBasedBreakIterator_Old from a textual description.
Field Summary
protected static byte IGNORE
- A token used as a character-category value to identify ignore characters.
Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator: WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT
Fields inherited from class com.ibm.icu.text.BreakIterator: DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
Constructor Summary
RuleBasedBreakIterator_Old(String description)
- Constructs a RuleBasedBreakIterator_Old according to the description provided.
Method Summary
protected static void checkOffset(int offset, CharacterIterator text)
- Throw IllegalArgumentException unless begin <= offset < end.
Object clone()
- Clones this iterator.
int current()
- Returns the current iteration position.
void debugDumpTables()
- Dump out a more-or-less human readable form of the complete state table and character class definitions.
static void debugPrintln(String s)
boolean equals(Object that)
- Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
int first()
- Sets the current iteration position to the beginning of the text.
int following(int offset)
- Sets the iterator to refer to the first boundary position following the specified position.
int getRuleStatus()
- Deprecated. This is a draft API and might change in a future release of ICU.
int getRuleStatusVec(int[] fillInArray)
- Deprecated. This is a draft API and might change in a future release of ICU.
CharacterIterator getText()
- Return a CharacterIterator over the text being analyzed.
protected int handleNext()
- This method is the actual implementation of the next() method.
protected int handlePrevious()
- This method backs the iterator back up to a "safe position" in the text.
int hashCode()
- Compute a hashcode for this BreakIterator.
boolean isBoundary(int offset)
- Returns true if the specified position is a boundary position.
int last()
- Sets the current iteration position to the end of the text.
protected int lookupBackwardState(int state, int category)
- Given a current state and a character category, looks up the next state to transition to in the backwards state table.
protected int lookupCategory(char c)
- Looks up a character's category (i.e., its category for breaking purposes, not its Unicode category).
protected int lookupState(int state, int category)
- Given a current state and a character category, looks up the next state to transition to in the state table.
protected RuleBasedBreakIterator_Old.Builder makeBuilder()
- Creates a Builder.
int next()
- Advances the iterator to the next boundary position.
int next(int n)
- Advances the iterator either forward or backward the specified number of steps.
int preceding(int offset)
- Sets the iterator to refer to the last boundary position before the specified position.
int previous()
- Advances the iterator backwards, to the last boundary preceding this one.
void setText(CharacterIterator newText)
- Set the iterator to analyze a new piece of text.
String toString()
- Returns the description used to create this iterator.
protected void writeSwappedInt(int x, DataOutputStream out, boolean littleEndian)
protected void writeSwappedShort(short x, DataOutputStream out, boolean littleEndian)
void writeTablesToFile(FileOutputStream file, boolean littleEndian)
- Write the RBBI runtime engine state transition tables to a file.
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator: getInstanceFromCompiledRules
Methods inherited from class java.lang.Object: finalize, getClass, notify, notifyAll, wait, wait, wait
Field Detail
IGNORE
protected static final byte IGNORE
- A token used as a character-category value to identify ignore characters
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.0.
Constructor Detail
RuleBasedBreakIterator_Old
public RuleBasedBreakIterator_Old(String description)
- Constructs a RuleBasedBreakIterator_Old according to the description provided. If the description is malformed, throws an IllegalArgumentException. Normally, instead of constructing a RuleBasedBreakIterator_Old directly, you'll use the factory methods on BreakIterator to create one indirectly from a description in the framework's resource files. You'd use this constructor when you want special behavior not provided by the built-in iterators.
- Status:
- Stable ICU 2.0.
Method Detail
makeBuilder
protected RuleBasedBreakIterator_Old.Builder makeBuilder()
- Creates a Builder.
- Status:
- Stable ICU 2.0.
clone
public Object clone()
- Clones this iterator.
- Overrides:
- clone in class RuleBasedBreakIterator
- Returns:
- A newly-constructed RuleBasedBreakIterator_Old with the same behavior as this one.
- Status:
- Stable ICU 2.0.
equals
public boolean equals(Object that)
- Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
- Overrides:
- equals in class RuleBasedBreakIterator
- Status:
- Stable ICU 2.0.
toString
public String toString()
- Returns the description used to create this iterator
- Overrides:
- toString in class RuleBasedBreakIterator
- Status:
- Stable ICU 2.0.
hashCode
public int hashCode()
- Compute a hashcode for this BreakIterator
- Overrides:
- hashCode in class RuleBasedBreakIterator
- Returns:
- A hash code
- Status:
- Stable ICU 2.0.
debugDumpTables
public void debugDumpTables()
- Dump out a more-or-less human readable form of the complete state table and character class definitions
- Status:
- Internal. This API is Internal Only and can change at any time.
writeTablesToFile
public void writeTablesToFile(FileOutputStream file, boolean littleEndian) throws IOException
- Write the RBBI runtime engine state transition tables to a file. Formerly used to export the tables to the C++ RBBI Implementation. Now obsolete, as C++ builds its own tables.
- Throws:
IOException
- Status:
- Internal. This API is Internal Only and can change at any time.
writeSwappedShort
protected void writeSwappedShort(short x, DataOutputStream out, boolean littleEndian) throws IOException
- Throws:
IOException
- Status:
- Internal. This API is Internal Only and can change at any time.
writeSwappedInt
protected void writeSwappedInt(int x, DataOutputStream out, boolean littleEndian) throws IOException
- Throws:
IOException
- Status:
- Internal. This API is Internal Only and can change at any time.
first
public int first()
- Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
- Overrides:
- first in class RuleBasedBreakIterator
- Returns:
- The offset of the beginning of the text.
- Status:
- Stable ICU 2.0.
last
public int last()
- Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
- Overrides:
- last in class RuleBasedBreakIterator
- Returns:
- The text's past-the-end offset.
- Status:
- Stable ICU 2.0.
next
public int next(int n)
- Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
- Overrides:
- next in class RuleBasedBreakIterator
- Parameters:
- n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
- Returns:
- The character offset of the boundary position n boundaries away from the current one.
- Status:
- Stable ICU 2.0.
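The equivalence between next(n) and repeated calls to next() or previous() can be tried out with the JDK's java.text.BreakIterator, which exposes the same navigation methods. The JDK word iterator is used here only as a stand-in, on the assumption that RuleBasedBreakIterator_Old (an internal class) is not available to the reader; the class and helper names are illustrative.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class NextNDemo {
    // Creates a stand-in iterator positioned at the start (n >= 0)
    // or at the end (n < 0) of the text.
    private static BreakIterator positioned(String text, int n) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        if (n >= 0) {
            it.first();
        } else {
            it.last();
        }
        return it;
    }

    // Moves n boundaries with a single next(n) call.
    static int viaNextN(String text, int n) {
        return positioned(text, n).next(n);
    }

    // Moves n boundaries by calling next() or previous() |n| times.
    static int viaLoop(String text, int n) {
        BreakIterator it = positioned(text, n);
        int pos = it.current();
        for (int i = 0; i < Math.abs(n) && pos != BreakIterator.DONE; i++) {
            pos = (n > 0) ? it.next() : it.previous();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "one two three";
        for (int n = -2; n <= 3; n++) {
            System.out.println(n + ": " + viaNextN(text, n) + " == " + viaLoop(text, n));
        }
    }
}
```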
next
public int next()
- Advances the iterator to the next boundary position.
- Overrides:
- next in class RuleBasedBreakIterator
- Returns:
- The position of the first boundary after this one.
- Status:
- Stable ICU 2.0.
previous
public int previous()
- Advances the iterator backwards, to the last boundary preceding this one.
- Overrides:
- previous in class RuleBasedBreakIterator
- Returns:
- The position of the last boundary position preceding this one.
- Status:
- Stable ICU 2.0.
checkOffset
protected static final void checkOffset(int offset, CharacterIterator text)
- Throw IllegalArgumentException unless begin <= offset < end.
- Status:
- Stable ICU 2.0.
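A minimal sketch of the stated contract (begin <= offset < end), written against java.text.CharacterIterator. This follows the one-line description above and is not copied from the ICU source, which may differ in detail; the class name and the `accepts` helper are hypothetical.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class CheckOffsetSketch {
    // Throws IllegalArgumentException unless begin <= offset < end.
    static void checkOffset(int offset, CharacterIterator text) {
        if (offset < text.getBeginIndex() || offset >= text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds: " + offset);
        }
    }

    // Convenience wrapper for the demonstration: true if the check passes.
    static boolean accepts(int offset, CharacterIterator text) {
        try {
            checkOffset(offset, text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        CharacterIterator text = new StringCharacterIterator("hello"); // begin 0, end 5
        System.out.println(accepts(0, text));  // true
        System.out.println(accepts(4, text));  // true
        System.out.println(accepts(5, text));  // false: equal to end
        System.out.println(accepts(-1, text)); // false: before begin
    }
}
```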
following
public int following(int offset)
- Sets the iterator to refer to the first boundary position following the specified position.
- Overrides:
- following in class RuleBasedBreakIterator
- Parameters:
- offset - The position from which to begin searching for a break position.
- Returns:
- The position of the first break after the specified position.
- Status:
- Stable ICU 2.0.
preceding
public int preceding(int offset)
- Sets the iterator to refer to the last boundary position before the specified position.
- Overrides:
- preceding in class RuleBasedBreakIterator
- Parameters:
- offset - The position to begin searching for a break from.
- Returns:
- The position of the last boundary before the starting position.
- Status:
- Stable ICU 2.0.
isBoundary
public boolean isBoundary(int offset)
- Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
- Overrides:
- isBoundary in class RuleBasedBreakIterator
- Parameters:
- offset - The offset to check.
- Returns:
- True if "offset" is a boundary position.
- Status:
- Stable ICU 2.0.
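The offset-based queries following(), preceding(), and isBoundary() can be exercised with the JDK's java.text.BreakIterator, used here purely as a stand-in that offers methods of the same names. Its word rules are the JDK's own, not a description passed to RuleBasedBreakIterator_Old, and the class and helper names are illustrative. For the text "one two" the JDK word iterator reports boundaries at 0, 3, 4, and 7.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryQueryDemo {
    // Creates a stand-in word iterator over the given text.
    static BreakIterator words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        return it;
    }

    // Returns where the iterator is left after an isBoundary(offset) call.
    static int positionAfterIsBoundary(String text, int offset) {
        BreakIterator it = words(text);
        it.isBoundary(offset);
        return it.current();
    }

    public static void main(String[] args) {
        BreakIterator it = words("one two");
        System.out.println(it.following(1));  // 3: first boundary after offset 1
        System.out.println(it.preceding(4));  // 3: last boundary before offset 4
        System.out.println(it.isBoundary(1)); // false: offset 1 is inside "one"
        System.out.println(it.isBoundary(4)); // true
        // The query also moves the iterator, so current() changes.
        System.out.println(positionAfterIsBoundary("one two", 1));
    }
}
```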
current
public int current()
- Returns the current iteration position.
- Overrides:
- current in class RuleBasedBreakIterator
- Returns:
- The current iteration position.
- Status:
- Stable ICU 2.0.
getRuleStatus
public int getRuleStatus()
- Deprecated. This is a draft API and might change in a future release of ICU.
- Return the status tag from the break rule that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. For rules that do not specify a status, a default value of 0 is returned. If more than one rule applies, the numerically largest of the possible status values is returned.
Note that for old style break iterators (implemented by this class), no status can be declared, and a status of zero is always assumed.
- Overrides:
- getRuleStatus in class RuleBasedBreakIterator
- Returns:
- The status from the break rule that determined the most recently returned break position.
- Status:
- Draft ICU 3.0.
- Status:
- Deprecated. This is a draft API and might change in a future release of ICU.
getRuleStatusVec
public int getRuleStatusVec(int[] fillInArray)
- Deprecated. This is a draft API and might change in a future release of ICU.
- Get the status (tag) values from the break rule(s) that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. The default status value for rules that do not explicitly provide one is zero.
Note that for old style break iterators (implemented by this class), no status can be declared, and a status of zero is always assumed.
- Overrides:
- getRuleStatusVec in class RuleBasedBreakIterator
- Parameters:
- fillInArray - An array to be filled in with the status values.
- Returns:
- The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
- Status:
- Draft ICU 3.0.
- Status:
- Deprecated. This is a draft API and might change in a future release of ICU.
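The return-value contract of getRuleStatusVec() (report the total number of values available, even when the caller's array is too small) can be sketched as follows. The class `StatusVecSketch` and the method `fill` are hypothetical and only model the documented behavior; for the old-style iterators implemented by this class the available values would always be a single zero.

```java
final class StatusVecSketch {
    // Hypothetical helper: copies as many status values as fit into
    // fillInArray and returns the total number that were available.
    static int fill(int[] available, int[] fillInArray) {
        int n = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, n);
        return available.length;
    }

    public static void main(String[] args) {
        int[] small = new int[1];
        // Two values are available but only one fits; the total is still reported.
        System.out.println(fill(new int[] {3, 7}, small)); // 2
        System.out.println(small[0]);                      // 3
        // Old-style iterators: a single status of zero.
        int[] out = new int[4];
        System.out.println(fill(new int[] {0}, out));      // 1
    }
}
```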
getText
public CharacterIterator getText()
- Return a CharacterIterator over the text being analyzed. This version of this method returns the actual CharacterIterator we're using internally. Changing the state of this iterator can have undefined consequences. If you need to change it, clone it first.
- Overrides:
- getText in class RuleBasedBreakIterator
- Returns:
- An iterator over the text being analyzed.
- Status:
- Stable ICU 2.0.
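The advice to clone the returned CharacterIterator before changing it can be followed as shown below. The JDK's java.text.BreakIterator stands in for this class (an assumption made so the example runs without ICU), and the class and method names are illustrative. Only the copy is repositioned, so the break iterator's own position is unaffected.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class GetTextCloneDemo {
    // Returns {position before, position after, index of the clone}.
    static int[] demo() {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText("one two");
        it.first();
        int before = it.next();

        // Work on a clone instead of the iterator's internal text object.
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        copy.setIndex(copy.getBeginIndex());

        return new int[] {before, it.current(), copy.getIndex()};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("before=" + r[0] + " after=" + r[1] + " cloneIndex=" + r[2]);
    }
}
```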
setText
public void setText(CharacterIterator newText)
- Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.
- Overrides:
- setText in class RuleBasedBreakIterator
- Parameters:
- newText - An iterator over the text to analyze.
- Status:
- Stable ICU 2.0.
handleNext
protected int handleNext()
- This method is the actual implementation of the next() method. All iteration is routed through it. It initializes the state machine to state 1 and advances through the text character by character until it reaches the end of the text or the state machine transitions to state 0. The return value is updated every time the state machine passes through a possible end state.
- Status:
- Stable ICU 2.0.
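The loop described for handleNext() can be sketched with a toy state table. Everything here is hypothetical: the two character categories, the four states, the transition table, and the class name are invented for illustration and are far simpler than the tables built from a real description. The sketch keeps a run of letters together and otherwise breaks after a single character.

```java
final class StateMachineSketch {
    private static final int STOP = 0;  // state 0 ends the scan
    private static final int START = 1; // state 1 is the initial state

    // Hypothetical categories: column 0 = letter, column 1 = anything else.
    private static final int[][] NEXT_STATE = {
        {STOP, STOP}, // state 0: stop
        {2, 3},       // state 1: start
        {2, STOP},    // state 2: inside a run of letters
        {STOP, STOP}, // state 3: consumed one non-letter character
    };
    private static final boolean[] IS_END_STATE = {false, false, true, true};

    static int category(char c) {
        return Character.isLetter(c) ? 0 : 1;
    }

    // Returns the next boundary at or after start, per the toy table.
    static int handleNext(String text, int start) {
        int state = START;
        int result = start;
        int pos = start;
        while (pos < text.length()) {
            state = NEXT_STATE[state][category(text.charAt(pos))];
            if (state == STOP) {
                break;
            }
            pos++;
            // Remember the position each time a possible end state is reached.
            if (IS_END_STATE[state]) {
                result = pos;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "ab cd";
        System.out.println(handleNext(text, 0)); // 2
        System.out.println(handleNext(text, 2)); // 3
        System.out.println(handleNext(text, 3)); // 5
    }
}
```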
handlePrevious
protected int handlePrevious()
- This method backs the iterator back up to a "safe position" in the text. This is a position that we know, without any context, must be a break position. The various calling methods then iterate forward from this safe position to the appropriate position to return. (For more information, see the description of buildBackwardsStateTable() in RuleBasedBreakIterator_Old.Builder.)
- Status:
- Stable ICU 2.0.
lookupCategory
protected int lookupCategory(char c)
- Looks up a character's category (i.e., its category for breaking purposes, not its Unicode category)
- Status:
- Internal. This API is Internal Only and can change at any time.
lookupState
protected int lookupState(int state, int category)
- Given a current state and a character category, looks up the next state to transition to in the state table.
- Status:
- Internal. This API is Internal Only and can change at any time.
lookupBackwardState
protected int lookupBackwardState(int state, int category)
- Given a current state and a character category, looks up the next state to transition to in the backwards state table.
- Status:
- Internal. This API is Internal Only and can change at any time.
debugPrintln
public static void debugPrintln(String s)
- Status:
- Internal. This API is Internal Only and can change at any time.
Copyright (c) 2004 IBM Corporation and others.