Community
Participate
Working Groups
I tried doing a huge search in my workspace for a string, looking in all files ... not that smart, but I was desperate :) After a while, I started getting "Error: could not match input" in the console, thousands of times. I did a few Ctrl-Breaks while this was going on, and from the stack traces, it appears to be happening in HTMLHeadTokenizer. I'm not sure which file (or files) it is getting this on (and not sure how to easily tell), but at worst I guess I can write a small test that calls the tokenizer on each of my files to see if it is file-specific.
Created attachment 142410 [details] two sample stack traces
Findings so far: it's not the input, per se. I wrote a small test to invoke the tokenizer on all the files in my workspace (all 175000 of them!). With debugging on, the problem does not occur, but with debugging off, it always occurs. The test itself is single threaded, but I'm running it on a multiprocessor machine, so it could still be related to context switching. The pattern is that it starts to fail at different spots, anywhere from the 2000th file to the 6000th file. Then, oddly, once it fails, it will fail for 10 to 20 files in a row, then have a few successes, then 10 to 20 failures in a row, etc. This doesn't make a lot of sense, given that a new instance of the tokenizer is created for each file (so it's not as if one could be left in a "bad" state), but maybe it will provide a hint of what the problem is.
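The test loop described above is roughly the following shape. This is a minimal runnable sketch, not the actual test source: `tokenizeOneFile` is a hypothetical stand-in for constructing a fresh HTMLHeadTokenizer per file and draining its tokens, and the sample directory is a temp dir rather than a real workspace.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TokenizerLoopSketch {
    // Stand-in for: new HTMLHeadTokenizer(reader) + getNextToken() until EOF.
    // Here we just drain the reader so the sketch is self-contained.
    static int tokenizeOneFile(Path file) throws IOException {
        int count = 0;
        try (Reader reader = Files.newBufferedReader(file)) {
            for (int c = reader.read(); c != -1; c = reader.read()) {
                count++; // real code would drive the JFlex state machine here
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Small sample workspace; the real test walked ~175000 files.
        Path dir = Files.createTempDirectory("jitTest");
        for (int i = 0; i < 5; i++) {
            Files.write(dir.resolve("f" + i + ".html"), "<html></html>".getBytes());
        }
        int files = 0, failures = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                files++;
                try {
                    // A brand-new "tokenizer" per file, matching the report:
                    // no state carries over, yet failures still come in runs.
                    tokenizeOneFile(p);
                } catch (RuntimeException e) {
                    failures++; // "Error: could not match input" would land here
                }
            }
        }
        System.out.println(files + " files, " + failures + " failures");
    }
}
```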
Just to cross-reference, there is a similar issue in bug 278123.
I have not been able to reproduce this using IBM's 1.5 JRE nor Sun's 1.5 JRE, only IBM's 1.6 VM (specifically, IBM's 1.6 SR5). That does not necessarily mean it is a VM problem, but ... I tried marking everything synchronized and volatile and the error still occurred. Instead of a Reader as input, I also tried InputStream and BufferedInputStream with the same (error) results -- the only (small) difference was that with a BufferedInputStream I was getting an "IllegalStateException" ... even more of a sign that it is some sort of "corrupt" program data. Perhaps something is going wrong when the code is converted to JIT code (which would explain, perhaps, why it takes a while to show up: methods typically have to be called 20 to 50 thousand times before they are JITed).
Note: I don't think the "IllegalState" exception was due to using buffered input, per se, since even after changing back to Reader, I still get IllegalStateException. It is the one that prints out as

Error: could not match input
java.lang.IllegalStateException: Instance: 56361820 offset:0 state:0

In any case, I think that is just another symptom of a JIT bug. I think I have narrowed the root error down to a JIT bug, which I'll document here in detail, just so next time I need to do this, I won't have to look it all up again.

I followed the procedures and options given in
http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=/com.ibm.java.doc.igaa/_1vg0001321a8887-11c8a42abcc-7fff_1001.html
Another good command-line reference is
http://www.ibm.com/developerworks/java/jdk/linux/6/sdkandruntimeguide.lnx.html#cmdline

First I tried -Xint. No error. (Note: -Xint, "interpreted mode", is equivalent to -Xnojit -Xnoaot, where AOT stands for ahead-of-time compilation.)
Then I tried -Xnojit. No error.
Then I tried disabling the specific methods involved. First I excluded HTMLHeadTokenizer from JIT. Error still occurred. Then I excluded IntStack from JIT. Error still occurred. Then I excluded both IntStack and HTMLHeadTokenizer, in that order, from JIT. No error!

The syntax to exclude methods and classes from JIT processing is complex, but as best I can tell, the statement to do it is

-Xjit:exclude={com/ibm/ISecurityLocalObjectBaseL13Impl/CSIServerRI.send_exception(Lorg/eclipse/wst/html/core/internal/contenttype/IntStack;)V|org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.*}

Someone could use this as a work-around until the root problem is fixed.
To help further diagnose the probable JIT problem, I ran further tests, as described in the above reference from publib.boulder.ibm.com.

First, I tried -Xjit:count=0,disableInlining and the problem occurred as it had before. Then I tried -Xjit:count=0,disableInlining,optLevel=scorching and the problem did not occur. Note that 'scorching' is the highest setting for optimization (that is, the most optimization). That was surprising to me, as I would have thought the highest setting ("scorching") would be the same as not specifying any optimization level at all, but apparently something more goes on by default than when the level is pinned to 'scorching'?

At any rate, this can possibly serve as a better work-around, allowing some JITing to go on instead of excluding these methods completely. The command-line parameter for this pinned optimization would be

-Xjit:{org/eclipse/wst/html/core/internal/contenttype/IntStack;)V|org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.*}(optLevel=scorching)

If this has to be used in practice (not sure the VM team recommends it), it makes a big difference in performance: in an informal test, scanning 10000 files takes about 90 seconds with JIT excluding these two classes, but only 45 seconds with the JIT optimization level pinned to 'scorching'.
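The informal 90-vs-45-second comparison above can be reproduced with a harness along these lines. This is a hypothetical sketch, not the actual benchmark: the inner loop is a stand-in for tokenizing one file, and the idea is to run the same program twice with different IBM_JAVA_OPTIONS settings (exclude vs. optLevel=scorching) and compare the printed times.

```java
public class TimingSketch {
    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = 0; // accumulated so the JIT cannot dead-code the loop away
        for (int file = 0; file < 10_000; file++) {
            // Stand-in for "new tokenizer, drain tokens" on one file.
            for (int c = 0; c < 1_000; c++) {
                checksum += (file ^ c);
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("scanned 10000 files in " + elapsedMs
                + " ms (checksum " + checksum + ")");
    }
}
```

Run once with `IBM_JAVA_OPTIONS` set to the exclude form and once with the `(optLevel=scorching)` form, and compare the reported milliseconds.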
Created attachment 142581 [details] test jar

I could reproduce the same issue on Linux. I created a standalone, Java-only test jar to make demonstrating and testing the issue easier. The packages have been renamed, but otherwise it's pretty much the same code. I've stored the source in our source editing CVS under the 'development' directory:
sourceediting/development/org.eclipse.wst.sse.tokenizerJitTest

To invoke the jar, I use a script similar to the following:
= = = = = = = = = = = = = = = =
# Sample script to invoke the test code

export JAVA_HOME=/shared/webtools/apps/ibm-java-ppc-605

# If no options below are uncommented, the test jar should exhibit the
# error condition after a few thousand files.
# You can use almost any directory, as long as there are thousands of files
# that can be accessed for read.

# reduce optimization level
#export IBM_JAVA_OPTIONS=-Xjit:"{org/eclipse/wst/sse/tokenizerJitTest/IntStack;)V|org/eclipse/wst/sse/tokenizerJitTest/HTMLHeadTokenizer.*}(optLevel=scorching)"

# exclude methods
#export IBM_JAVA_OPTIONS=-Xjit:exclude="{com/ibm/ISecurityLocalObjectBaseL13Impl/CSIServerRI.send_exception(Lorg/eclipse/wst/sse/tokenizerJitTest/IntStack;)V|org/eclipse/wst/sse/tokenizerJitTest/HTMLHeadTokenizer.*}"

# capture log
#export IBM_JAVA_OPTIONS=-Xjit:verbose="{compileStart|compileEnd},vlog=filename.log,exclude={com/ibm/ISecurityLocalObjectBaseL13Impl/CSIServerRI.send_exception(Lorg/eclipse/wst/sse/tokenizerJitTest/IntStack;)V|org/eclipse/wst/sse/tokenizerJitTest/HTMLHeadTokenizer.*}"

${JAVA_HOME}/jre/bin/java -jar testTokenizerJIT.jar $HOME
= = = = = = = = = = = = = = = =
I found a better reference for how to specify methods to exclude:
http://www-01.ibm.com/support/docview.wss?rs=180&uid=swg21294023
I also discovered that, using the logging capability, it's easier to learn the right form of the specific methods that have been compiled into JIT code. I started by excluding all methods of HTMLHeadTokenizer and IntStack, and then removed them one at a time until the error showed up. Using that method, the faulty method is 'primGetNextToken'. Thus, the correct work-around would be to use

-Xjit:exclude={org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.primGetNextToken()Ljava/lang/String;}

In several shell scripts, I've found the bracket part had to be quoted:

-Xjit:exclude="{org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.primGetNextToken()Ljava/lang/String;}"
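The "right form" of a method for -Xjit:exclude is the slash-separated internal class name plus the JVM method descriptor (as in primGetNextToken()Ljava/lang/String; above). Rather than reading it out of the verbose JIT log, it can also be derived via reflection. This is an illustrative sketch of my own (not from the IBM docs), shown for String.concat as a stand-in, and handling only void and object types in the descriptor:

```java
import java.lang.reflect.Method;

public class ExcludeSignature {
    // Minimal JVM type-descriptor mapping: void and reference types only
    // (enough for signatures like primGetNextToken()Ljava/lang/String;).
    static String jvmName(Class<?> c) {
        if (c == void.class) {
            return "V";
        }
        return "L" + c.getName().replace('.', '/') + ";";
    }

    // Builds the class.method(descriptor) string used inside -Xjit:exclude={...}
    static String excludeForm(Method m) {
        StringBuilder sb = new StringBuilder();
        sb.append(m.getDeclaringClass().getName().replace('.', '/'));
        sb.append('.').append(m.getName()).append('(');
        for (Class<?> p : m.getParameterTypes()) {
            sb.append(jvmName(p));
        }
        sb.append(')').append(jvmName(m.getReturnType()));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Method m = String.class.getMethod("concat", String.class);
        // Prints: java/lang/String.concat(Ljava/lang/String;)Ljava/lang/String;
        System.out.println(excludeForm(m));
    }
}
```

Paste the printed string inside -Xjit:exclude={...} for the method you want to keep out of the JIT.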
In addition to HTMLHeadTokenizer, I also checked the following tokenizers: CSSHeadTokenizer, CSSTokenizer, XMLHeadTokenizer, XMLTokenizer, JSPHeadTokenizer, JSPTokenizer. Only the last one, JSPTokenizer, also exhibited the odd "no match found" behavior. In this case, I narrowed it down to two methods that had to be excluded from JIT compiling to avoid errors. The similar 'primGetNextToken' had to be excluded, and that got rid of the "no matching input" errors. But then a "null pointer" would eventually start occurring in the constructor (which is impossible, given that the only thing there is an assignment statement). Another symptom, if these two methods were not excluded, was that the tokenizer would occasionally hang (or loop indefinitely). Thus, the total work-around, so far, is 3 methods (one for HTMLHead, and two for JSP):

-Xjit:exclude={org/eclipse/jst/jsp/core/internal/parser/internal/JSPTokenizer.primGetNextToken()Ljava/lang/String;},exclude={org/eclipse/jst/jsp/core/internal/parser/internal/JSPTokenizer.<init>(Ljava/io/Reader;)V},exclude={org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.primGetNextToken()Ljava/lang/String;}

Are there any other JFlex-based scanners I've omitted? JSF? DTD?
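To make the "impossible" NullPointerException concrete: a JFlex-generated tokenizer constructor taking a Reader is essentially a single field assignment, so an NPE raised inside it cannot come from the Java source itself, only from miscompiled code. The sketch below is an illustration in that shape, not the real JSPTokenizer source (the field name follows JFlex's zzReader convention):

```java
import java.io.Reader;
import java.io.StringReader;

public class ConstructorSketch {
    static class Tokenizer {
        // JFlex's conventional field for the input source.
        private Reader zzReader;

        // The entire constructor body: one plain field assignment.
        // Nothing here is dereferenced, so an NPE inside this
        // constructor cannot originate from this source code.
        Tokenizer(Reader in) {
            this.zzReader = in;
        }
    }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer(new StringReader("<html/>"));
        System.out.println("reader set: " + (t.zzReader != null));
    }
}
```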
Wow, Good work David!
Just to add a bit more documentation. I did find another JFlex-based parser, JSPedCSSTokenizer, but it (also) did not exhibit the problem. I have checked in, to sse/development, a project that uses all these tokenizers in test Eclipse applications. The project is named org.eclipse.wst.sse.tokenizerJitTestApps. The reason is that for some, such as JSPTokenizer, it would be hard to pull them out into a Java-only test as I did for HTMLHeadTokenizer.

I have opened a bug on IBM's JRE:
https://eureka.hursley.ibm.com//eureka/servlet/userlist?form=pmr&pmrsearcharg=58679,001,866
I think that is a "closed" system, so I will update this bug when they respond. I'm not sure what their process is; maybe it moves to their open system eventually?

I have also added the JIT exclude work-around while running our unit tests:

<env key="IBM_JAVA_OPTIONS" value="-Xjit:exclude={org/eclipse/jst/jsp/core/internal/parser/internal/JSPTokenizer.primGetNextToken()Ljava/lang/String;},exclude={org/eclipse/jst/jsp/core/internal/parser/internal/JSPTokenizer.&lt;init&gt;(Ljava/io/Reader;)V},exclude={org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.primGetNextToken()Ljava/lang/String;}" />

Maybe that will help cut down on the number of intermittent unit test failures we get. With all that, I don't think there's more for us to do here, so marking as "not eclipse".
Just to update status: the JIT team reproduced the issue, and while they continue to investigate, they have found another work-around (besides excluding the method completely from being compiled): specify 'disableLookahead' for that method. For the test case, that'd be

-Xjit:{org/eclipse/wst/sse/tokenizerJitTest/HTMLHeadTokenizer.primGetNextToken*}(disableLookahead)
For our WTP code, I've tried the suggested "disableLookahead" and it does avoid the bug in my special test cases. The VM option would be:

-Xjit:{org/eclipse/wst/html/core/internal/contenttype/HTMLHeadTokenizer.primGetNextToken()Ljava/lang/String;}(disableLookahead),{org/eclipse/jst/jsp/core/internal/parser/internal/JSPTokenizer.primGetNextToken()Ljava/lang/String;}(disableLookahead)
*** Bug 278123 has been marked as a duplicate of this bug. ***
Just to give a final update. A fix was created for this JIT bug, and is planned to be in SR6 ... "later in the year" (they don't give exact dates for SRs).