I recently asked about parsing IGES files with Xtext. The IGES specification includes Hollerith strings, which are written as an integer giving the number of characters, followed by an 'H' and then the string itself. To parse such tokens, you recommended that I use a custom lexer. I got decent parsing to work with this approach, but I am curious whether the way I am lexing is suboptimal or not recommended.
To handle a token like 9Hmy String, for example, I added a terminal rule in my grammar called HOLLERITH with this definition:
terminal HOLLERITH:
    INT 'H' . ;
I then created a new CustomIGESLexer that extends the generated InternalIGESLexer, and overrode the mTokens() method so that it checks for Hollerith strings first and otherwise lets the internal lexer handle the token. I would like to know whether this is a good approach: I do not want to write an entirely new lexer, only to provide custom lexing for the Hollerith strings. The code looks something like this:
@Override
public void mTokens() throws RecognitionException {
    if (isHollerith()) {
        // Hollerith strings take priority over every generated rule
        myRULE_HOLLERITH();
    } else {
        super.mTokens();
    }
}

private void myRULE_HOLLERITH() throws RecognitionException {
    try {
        int _type = RULE_HOLLERITH;
        int _channel = DEFAULT_TOKEN_CHANNEL;
        //... get the token, match the characters with match()
        state.type = _type;
        state.channel = _channel;
    } finally {
    }
}
I tried to mirror the style of the internal lexer when writing the custom rules. The isHollerith() method just checks for an integer followed immediately by an 'H':
private boolean isHollerith() {
    int index = 1;
    int cur = input.LA(index);
    // See if an int starts the string
    while (cur >= '0' && cur <= '9') {
        index++;
        cur = input.LA(index);
    }
    // Followed by an 'H'
    return index > 1 && cur == 'H';
}
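For reference, the matching step elided in myRULE_HOLLERITH() amounts to reading the count, skipping the 'H', and then consuming exactly that many characters regardless of what they are. The following is only a standalone sketch of that logic, not the actual lexer code: the HollerithSketch class and its tokenLength() and payload() methods are hypothetical names made up for illustration, and a plain String stands in for the lexer's character stream. Inside the lexer, the same steps would presumably be expressed with input.LA(1) and input.consume().

```java
// Standalone illustration only: the class and method names are hypothetical,
// and a plain String stands in for the lexer's character stream.
final class HollerithSketch {

    /**
     * Returns the total length of the Hollerith token starting at offset
     * (digits + 'H' + payload), or -1 if no complete token starts there.
     */
    static int tokenLength(String text, int offset) {
        int i = offset;
        int count = 0;
        // Read the decimal character count.
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            count = count * 10 + (text.charAt(i) - '0');
            if (count > text.length()) {
                return -1; // payload cannot fit; this also avoids int overflow
            }
            i++;
        }
        // There must be at least one digit, followed by an 'H'.
        if (i == offset || i >= text.length() || text.charAt(i) != 'H') {
            return -1;
        }
        i++; // skip the 'H'
        // The payload is exactly count characters, whatever they are.
        if (i + count > text.length()) {
            return -1;
        }
        return i + count - offset;
    }

    /** Returns the payload of the token starting at offset, or null if none. */
    static String payload(String text, int offset) {
        int length = tokenLength(text, offset);
        if (length < 0) {
            return null;
        }
        // Only digits precede the delimiter, so the first 'H' is the delimiter.
        int hIndex = text.indexOf('H', offset);
        return text.substring(hIndex + 1, offset + length);
    }

    public static void main(String[] args) {
        String input = "9Hmy String,3HabH;";
        System.out.println(tokenLength(input, 0)); // length of "9Hmy String"
        System.out.println(payload(input, 12));    // payload may itself contain 'H'
    }
}
```

Because the payload length comes from the count rather than from a closing delimiter, the payload can contain any character, including 'H', commas, and semicolons. That is why a plain terminal rule cannot describe the token on its own.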
This might be a terrible way to customize the lexer rules, but it works for now.
Thank you,
Kasper Gammeltoft
Oak Ridge National Lab,
Computer Science & Mathematics Division
Computer Science Research Group