Bug 266658 - [console][i18n] Console internal buffer corrupts multi-byte characters
Summary: [console][i18n] Console internal buffer corrupts multi-byte characters
Status: RESOLVED DUPLICATE of bug 545769
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Debug
Version: 3.4.2
Hardware: PC Windows 2000
Importance: P3 normal with 2 votes
Target Milestone: ---
Assignee: Platform-Debug-Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords: accessibility, helpwanted
Duplicates: 337711
Depends on:
Blocks:
 
Reported: 2009-03-02 04:44 EST by Robert Lu CLA
Modified: 2019-11-22 07:44 EST
CC List: 8 users

See Also:


Description Robert Lu CLA 2009-03-02 04:44:05 EST
Build ID: M20090211-1700

Steps To Reproduce:
1. Run the following Java program in a Japanese Windows environment:
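    // The class name is arbitrary; any class with this main method reproduces the problem.
    public class ConsoleBufferTest {
        public static void main(String[] args) {
            // 3-byte prefix that shifts the following lines relative to the buffer boundary
            System.out.print("aaa");
            for (int i = 0; i < 1000; i++) {
                for (int j = 0; j < 31; j++) {
                    System.out.print((char) (0x3042)); // HIRAGANA LETTER A
                }
                System.out.println();
            }
        }
    }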

2. The Unicode character U+3042 is the Japanese hiragana letter "A". The expected output of the above program is 1000 lines, each containing 31 characters (except for the first line, which also carries the "aaa" prefix).
However, at least in my environment, the last character of every 128th line is corrupted, and two "?" are displayed in the console in its place.


More information:
The character 0x3042 was chosen just as an example. The bug can be reproduced with almost any 2-byte character (when encoded in SJIS or MS932).

My system environment is Japanese Windows 2000 Service Pack 4 (5.00.2195). I changed the Eclipse console buffer size to 1,000,000 characters. The other Eclipse settings are left at their defaults.

It seems that the console has an internal buffer of 8192 bytes, so if a multi-byte character is cut by the buffer boundary, it cannot be displayed correctly in the console.

In my test program, each line of output sends 64 bytes to standard output (31 characters * 2 bytes, plus a 2-byte line separator). The three "a"s printed before these lines shift the output so that the 8192-byte buffer boundary cuts the last character of every 128th line.
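
The arithmetic above can be checked with a short sketch. The class and method names are hypothetical, and it assumes what the report assumes: every read fills the buffer with exactly 8192 bytes, each character takes 2 bytes, and the line separator is 2 bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds the lines whose last character straddles a buffer boundary. */
final class BoundaryCheck {
    static final int BUFFER_SIZE = 8192;  // buffer size stated in the report
    static final int PREFIX_BYTES = 3;    // the "aaa" prefix
    static final int CHARS_PER_LINE = 31;
    static final int BYTES_PER_CHAR = 2;  // assumed MS932 size of U+3042
    static final int NEWLINE_BYTES = 2;   // assumed Windows line separator

    /** Returns the 1-based numbers of the lines whose last character is split. */
    static List<Integer> splitLines(int totalLines) {
        int bytesPerLine = CHARS_PER_LINE * BYTES_PER_CHAR + NEWLINE_BYTES; // 64
        List<Integer> result = new ArrayList<>();
        for (int line = 1; line <= totalLines; line++) {
            long lineStart = PREFIX_BYTES + (long) (line - 1) * bytesPerLine;
            // Offset of the first byte of the last character on this line.
            long lastCharStart = lineStart + (long) (CHARS_PER_LINE - 1) * BYTES_PER_CHAR;
            // The character is split when its two bytes land in different buffers.
            if (lastCharStart / BUFFER_SIZE != (lastCharStart + 1) / BUFFER_SIZE) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitLines(1000));
    }
}
```

For 1000 lines this prints [128, 256, 384, 512, 640, 768, 896], which matches the observation that every 128th line is affected.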

The bug can be avoided by adding a thread wait, for example after every 100 lines of output. (That is, the internal buffer is flushed while the thread is waiting.)
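
A minimal sketch of that workaround, assuming a pause after every 100 lines is enough for the console reader to drain its input. The class name, the method signature, and the 50 ms pause are hypothetical, and the pause only shifts where the reads end rather than fixing the decoding itself.

```java
import java.io.PrintStream;

/** Hypothetical variant of the test program that pauses periodically. */
final class PausedOutput {
    /** Prints the test pattern, flushing and sleeping after every pauseEvery lines. */
    static void print(PrintStream out, int lines, int pauseEvery, long pauseMillis) {
        out.print("aaa");
        for (int i = 1; i <= lines; i++) {
            for (int j = 0; j < 31; j++) {
                out.print((char) 0x3042); // HIRAGANA LETTER A
            }
            out.println();
            if (i % pauseEvery == 0) {
                out.flush();
                try {
                    // Gives the reader on the other end of the pipe time to catch up.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt status
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        print(System.out, 1000, 100, 50L);
    }
}
```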
Comment 1 Robert Lu CLA 2009-03-04 09:55:21 EST
I checked the Eclipse source code and found that the 8192-byte buffer is in the class org.eclipse.debug.internal.core.OutputStreamMonitor.

In the private method read(), an 8192-byte buffer is used to read from the connected stream, and its contents are then decoded into a String. So if a multi-byte character straddles the end of the buffer, it cannot be decoded correctly.
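
The effect can be reproduced outside Eclipse with a small sketch. This is not the OutputStreamMonitor code: the class and method names are hypothetical, and UTF-8 is used instead of MS932 so that it runs on any JRE. Each chunk is converted to a String on its own, so a character cut by a chunk boundary is lost.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: decoding fixed-size chunks independently corrupts split characters. */
final class ChunkedDecodeDemo {
    /** Decodes data in independent chunks of chunkSize bytes and joins the results. */
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // Each chunk is decoded without knowledge of its neighbours.
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u3042\u3042";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // 1 + 3 + 3 = 7 bytes
        // A chunk size of 5 places the boundary inside the second U+3042.
        String chunked = decodeInChunks(bytes, 5);
        System.out.println(chunked.equals(text));        // false
        System.out.println(chunked.contains("\uFFFD"));  // true
    }
}
```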

I think a possible solution is to check, whenever exactly 8192 bytes are read from the stream, whether the tail of the buffer was decoded successfully. If it was not, push the tail back to the input stream (or reserve it locally and prepend it to the next read).
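
One way to sketch the "reserve it locally" idea is with java.nio.charset.CharsetDecoder, which leaves an incomplete trailing sequence unconsumed when told that more input may follow. The class below is a hypothetical illustration under that assumption, not the fix that went into Eclipse.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes a stream chunk by chunk, carrying over incomplete tails. */
final class CarryOverDecoder {
    private final CharsetDecoder decoder;
    // Undecoded tail of the previous chunk, ready for reading.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    CarryOverDecoder(Charset charset) {
        decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    private CharBuffer newOutput(int inputBytes) {
        // Worst-case output size for the input, plus slack for a replacement character.
        return CharBuffer.allocate((int) Math.ceil(inputBytes * (double) decoder.maxCharsPerByte()) + 2);
    }

    /** Decodes the first length bytes of chunk; an incomplete tail is kept for the next call. */
    String decode(byte[] chunk, int length) {
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + length);
        in.put(pending).put(chunk, 0, length);
        in.flip();
        CharBuffer out = newOutput(in.remaining());
        decoder.decode(in, out, false); // false: more input may follow
        pending = in.slice();           // bytes the decoder could not consume yet
        out.flip();
        return out.toString();
    }

    /** Call once at end of stream; dangling bytes become replacement characters. */
    String finish() {
        CharBuffer out = newOutput(pending.remaining());
        decoder.decode(pending, out, true);
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u3042\u3042".getBytes(StandardCharsets.UTF_8);
        CarryOverDecoder d = new CarryOverDecoder(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        sb.append(d.decode(bytes, 5)); // the chunk ends inside the second character
        sb.append(d.decode(java.util.Arrays.copyOfRange(bytes, 5, 7), 2));
        sb.append(d.finish());
        System.out.println(sb.toString().equals("a\u3042\u3042"));
    }
}
```

Keeping a single decoder for the whole stream also preserves any shift state for stateful encodings, which decoding each chunk with a fresh String constructor cannot do.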
Comment 2 Pawel Piech CLA 2011-06-08 15:02:21 EDT
*** Bug 337711 has been marked as a duplicate of this bug. ***
Comment 3 Joachim Kanbach CLA 2013-12-11 02:47:35 EST
I think I have encountered a variation of this bug which doesn't even involve any code points beyond Latin-1. With the following code, there's usually at least one line which contains two unprintable characters and one expected character, like this:

...
µÈ
µÉ
��Ê
µË
µÌ
µÍ
...

The malformed character is a different one each time, i.e. it's completely random, but I think only 2-byte characters are affected. This happens both on Linux and Windows, although it seems it happens more often on Linux (in a VM). Converting the line in question from the example gives a hex code of

EF BF BD EF BF BD C3 8A (C3 8A is the "Ê")

An example from another run looks like this:

...
Vª
V«
V��
V­
V®
...

Hex: 56 EF BF BD EF BF BD (56 is the "V")
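
For reference, EF BF BD is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER). The sketch below, with a hypothetical class name and assuming the console stream is UTF-8, shows how a 2-byte character such as "µ" (C2 B5) turns into two replacement characters when its bytes are decoded separately.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo relating the hex dumps above to U+FFFD. */
final class ReplacementBytes {
    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decode the two bytes of U+00B5 separately, as if a buffer boundary fell between them.
        String first = new String(new byte[] { (byte) 0xC2 }, StandardCharsets.UTF_8);
        String second = new String(new byte[] { (byte) 0xB5 }, StandardCharsets.UTF_8);
        String line = first + second + "\u00CA"; // followed by U+00CA, as in the first example
        System.out.println(hex(line.getBytes(StandardCharsets.UTF_8)));
        // prints "EF BF BD EF BF BD C3 8A"
    }
}
```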



Code to reproduce (note: may have to try a couple of times before it happens!):

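import java.util.ArrayList;
import java.util.List;

// The class name is arbitrary.
public class Latin1PairsTest
{
   public static void main( String[] args )
   {
      // Collect all printable characters in the Latin-1 range.
      List<Character> myList = new ArrayList<Character>();

      for ( char c = 0; c < 256; c++ )
      {
         if ( !Character.isISOControl( c ) )
         {
            myList.add( c );
         }
      }

      // Print every two-character combination, one per line.
      for ( Character myChar : myList )
      {
         for ( Character myNestedChar : myList )
         {
            char[] myArray = new char[] { myChar, myNestedChar };
            String s = new String( myArray );
            System.out.println( s );
         }
      }
   }
}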
Comment 4 Ted Shaneyfelt CLA 2015-02-21 03:34:10 EST
I also got this bug with C++, and it happens any time you print about a screenful or more of multibyte characters to the console. "���", that is, several occurrences of the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD), spuriously appear in the output in place of some other multibyte character that was sent to the console stream.

It looks like this has been a known problem for about 5 years now. Is anyone looking at it now? I've tried flushing and yielding, hoping that it would allow the buffer to reset, but no such luck.

The problem occurs around line 21 or 24 when running immediately after a build, but on subsequent runs it happens on line 15. I'm running it in a VirtualBox virtual machine with a nearly clean install of Linux Mint 17 XFCE 64-bit, on a MacBook Pro running OS X Snow Leopard. I can provide a 4GB virtual machine image if there's any problem reproducing it. Here's the C++ program I used to reproduce the problem:

#include <iostream>
#include <locale>
using namespace std;
int main() {
	// Use the environment's locale so that wcout can encode the wide characters.
	locale::global(locale(""));
	for (int i=0;i<40;i++) {
		for (int j=0;j<3;j++) {
			wcout << L"▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁";
		}
		wcout << i << endl;
	}
	return 0;
}

And a few lines of the result:
▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁14
▁▁▂▂��▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁15
▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁▁▁▂▂▂▃▃▄▅▆▆▇▇▇████▇▇▇▆▆▅▄▃▃▂▂▂▁▁16

You can see that line 15 has two spurious replacement characters here in place of a missing multi-byte character.

However, it seems to happen less often when running the program as follows (I'm not sure that making it less predictable is good, but here it is anyway):

I have set up an External Tool Configuration to run my program with valgrind. (On Debian-type systems, run sudo apt-get install valgrind before doing this.)

In the Main tab for a new tool, use the following:
Location:     /usr/bin/valgrind  
Arguments:    -q  ${project_loc}/Debug/${project_name}
Name:         valgrind

Then to run the program, click the project in the project explorer, and click the run tool button. Sometimes it runs without errors this way, but not always.
Comment 5 Sarika Sinha CLA 2015-02-22 22:05:19 EST
No one is looking at it currently, but we will be happy to review the patch contributions.
Comment 6 Ted Shaneyfelt CLA 2015-03-08 14:18:24 EDT
My ugly, OS-dependent workaround is to spawn a process that launches a new console window and runs the program under test in it.

	if (window) {
		stringstream launch;
		launch << "xfce4-terminal -H -x "<< argv[0] ;
		system(launch.str().c_str());
	}

If you do this in a thread, be sure to keep the launch string around until after the thread terminates. I'd personally like to see an option in Eclipse to launch in a real console window like this without modifying the source of the unit under test, and/or to have a completely new console within Eclipse that would support things like ANSI emulation for cursor control and be able to clear its screen. Ideally it would have options to allow you to buffer, rewind, and replay the output step by step, and it would support Unicode. Just make it behave like a real console. Whether on Windows, Linux, or OS X, you get this behavior in the real console these days.
Comment 7 Paul Pazderski CLA 2019-11-22 07:44:24 EST
We fixed bug 545769 in Eclipse 4.12. Should be the same problem as this.

*** This bug has been marked as a duplicate of bug 545769 ***