Bug 162381 - [Indexer][Scanner] Strange null characters during C source parsing
Summary: [Indexer][Scanner] Strange null characters during C source parsing
Status: RESOLVED FIXED
Alias: None
Product: CDT
Classification: Tools
Component: cdt-core (show other bugs)
Version: 3.1.1   Edit
Hardware: PC Windows 2000
: P3 major (vote)
Target Milestone: 4.0 M5   Edit
Assignee: Markus Schorn CLA
QA Contact:
URL:
Whiteboard:
Keywords: contributed
Depends on:
Blocks:
 
Reported: 2006-10-26 07:08 EDT by Cheong, Jeong-Sik CLA
Modified: 2008-06-20 10:41 EDT (History)
3 users (show)

See Also:


Attachments
avoid unexpected null character problem (1.13 KB, patch)
2006-10-27 08:33 EDT, Cheong, Jeong-Sik CLA
bjorn.freeman-benson: iplog+
Details | Diff
improve search of long array and avoid unexpected null character (8.59 KB, patch)
2006-10-27 08:37 EDT, Cheong, Jeong-Sik CLA
bjorn.freeman-benson: iplog+
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Cheong, Jeong-Sik CLA 2006-10-26 07:08:39 EDT
I'm developing NS chart generator for C source code based on Eclipse CDT.
During development, I experienced bad performance & OutOfMemory Error caused by CDT indexer.

I tried identifying what was wrong at debug mode and found a strange thing.

During parsing C source, so many IASTProblem objects are generated in DOMScanner.handleProblem() and cause OutOfMemory Error. As I understand, These objects are generated to log SCANNER_BAD_CHARACTER problems. In detail, the problem is caused by null character( '\0' ) found in C source.

many null character => many IASTProblem objects => OutOfMemory Error.

In order to identify which source includes null character, I modify BaseScanner.fetchToken() as following,

BaseScanner.java
line 2026
============================================
          // skip over anything we don't handle
                char [] x = new char [1];
                x[0] = buffer[pos];
                if( x[0] == '\0') {
                	System.out.println("Suspicious String:");
                	int s = Math.max(0, pos-100);
                	for( int i=s; i<pos; i++ ){
                		System.out.print( buffer[i]);
                	}
                	System.out.println();
                }else {
                	handleProblem( IASTProblem.SCANNER_BAD_CHARACTER, pos, x );
                }
============================================

I tested it using Apache Httpd 2.2.3 source code and I got following log.
This log was generated during parsing "mod_auth_basic.c" file and logged string was part of "ap_config.h". 
But, "ap_config.h" file did not include any null character.

============================================
Suspicious String:
VERSION__ >= 199901L)
#define AP_HAVE_DESIGNATED_INITIALIZER
#endif

#endif /* AP_CONFIG_H */

Suspicious String:
ERSION__ >= 199901L)
#define AP_HAVE_DESIGNATED_INITIALIZER
#endif

#endif /* AP_CONFIG_H */

Suspicious String:
RSION__ >= 199901L)
#define AP_HAVE_DESIGNATED_INITIALIZER
#endif

#endif /* AP_CONFIG_H */
============================================

I guess that unintentional null charaters are appended to buffer during macro extenstion of C source file and these cause waste of CPU and memory.

Because I don't understand CDT inside, My guess may be wrong or right. 

Please check my guess, and I hope my observation would be helpful to improve CDT.

In addition, I found linear search logic for long array from several position of CDT source code. As we know, performance of linear search is O(n) or in proportion to data size. so I think we can consider changing it into other algorithm as binary search for handling heavy project. ( of course, only when the array is sorted )

examples of method using linear search for array.
============================
ArrayUtil.append()
_CompositeContext.findContextContainingOffset()
LocationMap.getLocation()
Comment 1 Cheong, Jeong-Sik CLA 2006-10-26 22:57:41 EDT
I found out where null characters are inserted.

In CodeReader.load() method, unintentional null characters are inserted during reading file stream and decoding to character array.

CodeReader.load() line 102.
===============================
		CharBuffer charBuffer = Charset.forName(encoding).decode(byteBuffer);
		if (charBuffer.hasArray())
			return charBuffer.array();
		// Got to copy it out
		char[] buff = new char[charBuffer.length()];
		charBuffer.get(buff);
		return buff;
===============================

In order to aviod this problem, I modified it like following,

===============================
		CharBuffer charBuffer = Charset.forName(encoding).decode(byteBuffer);
		//if (charBuffer.hasArray())
		//	return charBuffer.array();
		// Got to copy it out
		char[] buff = new char[charBuffer.length()];
		charBuffer.get(buff,0,charBuffer.length());
		return buff;
===============================

and I checked SCANNER_BAD_CHARACTER problem caused by null char didn't occur in BaseScanner.fetchToken() method after this modification.

My solution is proper? please check.

regards,
Comment 2 Cheong, Jeong-Sik CLA 2006-10-27 01:28:24 EDT
As summary, 
In my PC ( it is old and very slow :) ), I tested fast indexer using Apache Httpd 2.2.3 source project.

when using original CDT core, 
my eclipse crashed after about 4 minutes with OutOfMemory Error generating 300KB PDOM index.

when using CDT core modified by me,
my eclipse completed index after about 4 minute generating 4MB PDOM index.

Following is my modification :
1. modify CodeReader.load() in order to avoid unintentional null character.
2. replace linear search algorithm with binary search algorithm in following mehtods, 
   ArrayUtil.append()
   _CompositeContext.findContextContainingOffset()
   LocationMap.getLocation()

regards,
Comment 3 Markus Schorn CLA 2006-10-27 04:54:00 EDT
(In reply to comment #2)
> Following is my modification :
> 1. modify CodeReader.load() in order to avoid unintentional null character.
> 2. replace linear search algorithm with binary search algorithm in following
> mehtods, 
>    ArrayUtil.append()
>    _CompositeContext.findContextContainingOffset()
>    LocationMap.getLocation()
> regards,
Can you help out by providing a patch?
Comment 4 Cheong, Jeong-Sik CLA 2006-10-27 05:57:16 EDT
(In reply to comment #3)
> (In reply to comment #2)
> Can you help out by providing a patch?

I think publishing new patch is beyond my authority 
and I already submit all information that I found for authorized developers of CDT.

Most of all, because I don't understand CDT inside well, I can not sure my modification has no side effect. we need careful check of experts before applying something.

regards,
Comment 5 Markus Schorn CLA 2006-10-27 06:16:53 EDT
> I think publishing new patch is beyond my authority 
> and I already submit all information that I found for authorized developers of
> CDT.
Everybody that has a bugzilla account can attach patches to bugs. See http://wiki.eclipse.org/index.php/CDT/contributing

> Most of all, because I don't understand CDT inside well, I can not sure my
> modification has no side effect. we need careful check of experts before
> applying something.
A patch will be applied only after it has been looked at. You seem to have enough expertise to provide helpful input.

Comment 6 Cheong, Jeong-Sik CLA 2006-10-27 08:33:36 EDT
Created attachment 52833 [details]
avoid unexpected null character problem

it slightly change way to get character array from C source file.
Comment 7 Cheong, Jeong-Sik CLA 2006-10-27 08:37:15 EDT
Created attachment 52834 [details]
improve search of long array and avoid unexpected null character

in addition attachment #52833 [details],
it changes search logic of long array from linear search into binary search.

but I'm not sure my change is correct or not. please check.
Comment 8 Cheong, Jeong-Sik CLA 2006-10-27 11:07:55 EDT
Even though my suggested patches are based on source code in CVS( it may be 4.0.0 rc? ), My investigation covered CDT 3.1.1 not 4.0.

Actually, I have seen CDT 4.0 code first time only to make patch files,
and I feel CDT 4.0 is somewhat different from 3.1.1 in many points of view.
Maybe my patches can not apply to CDT 4.0, I think.

with CDT core 4.0.0(?), I tested fast index facility on Apache Httpd 2.2.3 project.

when my patches was not applied: 
About 4 minutes later since indexer started, 
eclipse crashed with OutOfMemory Error and 300KB index file.

after applying attachment #52833 [details],
About 20 minutes later since indexer started,
Indexer was still working and generated 2MB index file. I stopped indexer by force.

after applying attachment #52834 [details],
About 20 minutes later since indexer started,
Indexer was still working and generated 20MB index file. I stopped indexer by force.

Actually, I was surprised the result because I expected indexing would complete with 4MB index file in 4~5 minutes just like CDT 3.1.1 after applying attachment #52834 [details].

regards,


Comment 9 Markus Schorn CLA 2006-10-30 10:43:07 EST
Thanks Jeong-Sik! I have applied your first patch. There were bugs in the
second one, I have made changes to the LocationMap along your ideas.
I stayed away from changing ArrayUtil, see bug 162755.

Fixed in 4.0 > 20061030.
Comment 10 Doug Schaefer CLA 2008-06-03 15:00:42 EDT
assigning
Comment 11 Doug Schaefer CLA 2008-06-03 15:01:04 EDT
done