191263 – [Connection] 100% CPU used in TimeoutOutputStream.writeUntilDone

Bug 191263 - [Connection] 100% CPU used in TimeoutOutputStream.writeUntilDone

Summary: [Connection] 100% CPU used in TimeoutOutputStream.writeUntilDone

Status:	ASSIGNED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	CVS (show other bugs)
Version:	3.3
Hardware:	PC Windows XP

Importance:	P3 major (vote)
Target Milestone:	---
Assignee:	platform-cvs-inbox
QA Contact:

URL:
Whiteboard:
Keywords:	performance

Depends on:	221378
Blocks:
	Show dependency tree

Reported:	2007-06-06 10:03 EDT by Martin Aeschlimann
Modified:	2019-09-06 16:17 EDT (History)
CC List:	5 users (show)

See Also:

Attachments
Thread dump of Eclipse eating 100% CPU (15.13 KB, text/plain) 2007-09-28 09:28 EDT, Martin Oberhuber	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Aeschlimann

2007-06-06 10:03:35 EDT

20070606

I got in a state where Eclipse took over almost all of the CPU time for the last 30 minutes. CTrl + Break didn't work, but I could connect to the VM with a profiled.

According to the profiler, all time is spend in 
org.eclipse.team.internal.core.streams.TimeoutOutputStream.runThread()
  org.eclipse.team.internal.core.streams.TimeoutOutputStream.writeUntilDone()
   java.io.BufferedOutputStream.flush()
    com.jcraft.jsch.Channel$1.write(int)
     com.jcraft.jsch.Channel$1.write(byte[], int, int)

Ui was not bloicked, but just very slow.
No operations shown in the progress view.
I had to kill Eclipse to continue.

Comment 1 Michael Valenta

2007-06-06 10:22:33 EDT

This could be related to the fact that I just ran an rlog command against the server to extract all the userids that have committed code to the repository so we could include them in the IP log. The command took quite a while to run. This may have lead to a situation where the connection never timed out but sending and receiving data may have been affected.

It would be good to know what you were doing so we can try and reproduce this. We're you synchronizing?

Comment 2 Martin Aeschlimann

2007-06-06 10:39:29 EDT

I think it was something that ran in the background, maybe quick diff.
I later managed to close all editors (that took 15 minutes) but CPU was still blocked.

Comment 3 Atsuhiko Yamanaka

2007-06-07 03:28:06 EDT

Is that problem reproducible?

In JSch 0.1.31 source code, I found lines, where CPU time is eaten, but 
I have not found the exact reason.  It seems that the channel was not been
correctly initialized due to some rare reason. If that problem occurs on jsch
0.1.28, Eclipse SDK will not be blocked, but ArrayIndexOutOfBoundException
will be thrown.

I have modified jsch 0.1.31 to work around such blocking and named it as 0.1.31.1.  It is available at
  http://eclipse.jcraft.com/plugins/com.jcraft.jsch_0.1.31.1.jar
and you can update it automatically at http://eclipse.jcraft.com/site.xml

Is it acceptable to use 0.1.31.1?  If not, IMHO, Eclipse 3.3 should use
jsch 0.1.28.  Of course, we will publicly release the source code of 
0.1.31.1, and the next formal release 0.1.34 will also include same fix.

Comment 4 Atsuhiko Yamanaka

2007-06-08 02:35:40 EDT

I have a question for the reporter.

At that time, did you get an error message dialog like 
"I/O has been interrupted."?

Comment 5 Martin Aeschlimann

2007-06-08 03:33:33 EDT

I can't reproduce this problem and it is the first time I've seen it happen.

There was no error dialog and also no error in the log, but maybe it got lost because I had to kill the process.

Comment 6 Michael Valenta

2007-06-08 08:53:19 EDT

At this stage in the release cycle it is too late to make any changes (i.e. we are shipping with 0.1.31). If, over the next few weeks, we start seeing more reports like this, we can investigate the possibility of including 0.1.31.1 in 3.3.1.

Comment 7 Atsuhiko Yamanaka

2007-06-08 10:32:56 EDT

(In reply to comment #5)
> I can't reproduce this problem and it is the first time I've seen it happen.
> There was no error dialog and also no error in the log, but maybe it got lost
> because I had to kill the process.

Thank you for information.

If that problem written at comment #3 had occurred, any data will not be
transfered and the error dialog will appear soon thanks to TimeoutOutputStream.
So, I guess that data had been sent continuously.  Now, I think that
my comment #3 is not related to this problem and Michael's guess at #1 may be 
right.  I'll withdraw jsch 0.1.31.1.

Comment 8 Michael Valenta

2007-07-18 11:54:45 EDT

I believe this is caused by what I described in comment 1. As such, it is not necessary to address this for 3.3.1.

Comment 9 Martin Oberhuber

2007-09-28 09:28:14 EDT

Created attachment 79389 [details]
Thread dump of Eclipse eating 100% CPU

I think I just ran into the very same issue, with the brand-new Eclipse 3.3.1

Here is what I did:
* Start Eclipse, Team > Synchronize
  --> Found several files to be synchronized, because local resources were not
      up-to-date
  --> Dbl click each of them in a row to open compare viewer (about 50 of them).
      Compare viewer opened and closed immediately because it found that there
      were not really diffs and just the local file had been out of sync.
* Clear Error Log because it showed lots of "Resource out of Sync"
* Select all projects, Team > Tag as Version "R2_0_1"
  --> Tagging ran for a while, but I did _not_ put it into background
* When tagging was done, open a few editors, with History open
* Work in a TM Terminal

At that point, I noticed Eclipse taking 100% CPU.
I ran jconsole, and the StackTrace tool from http://www.adaptj.com/root/main/tracehowtos

Trying to find out who the culprit was, I closed all open editors and all Perspectives. Result was the Thread dump attached.

I find it odd that there are _two_ instances of TimeoutOutputStream. These seem to be the only ones who are always RUNNABLE so I assume it's them eating the CPU.

Comment 10 Martin Oberhuber

2007-09-28 09:55:05 EDT

I did not see any other errors logged, or any error dialog. Progress View was empty. I wonder how likely this scenario is to occur? I wonder who is really hogging the CPU by constantly running code?

In the jconsole, I see the memory usage run like a sawtooth pattern, so it looks like somebody is constantly creating objects which gc later removes.

I inspected the code in JSch's Channel.java around line 275, and I'm not sure but it looks like this could loop forever in the case where the remote minimum packet size (rmpsize) is very small or even still zero, and dataLen==0:

        public void write(byte[] buf, int s, int l) throws java.io.IOException{
          /*...*/
          byte[] _buf=buffer.buffer;   //initialized to rmpsize
          int _bufl=_buf.length;       //_bufl==rmpsize, e.g. 0
          while(l>0){                  //user wants to write e.g. 32 bytes
            int _l=l;
            if(l>_bufl-(14+dataLen)-32-20){  //_bufl very small so this holds true
              _l=_bufl-(14+dataLen)-32-20;   //_l gets negative
            }
            if(_l<=0){
              flush();                 //does nothing if dataLen==0
              continue;                //back to the loop, l is unchanged!!
            }
        }

Atsuhiko could you investigate? I'm wondering under what circumstances this OutputStream is used and how likely this issue is to occur... (I tried debugging it but never dropped into the code in question)

Comment 11 Martin Oberhuber

2007-09-28 09:59:50 EDT

I also noticed that when making multiple thread dumps at different times, they all looked exactly the same: particularly, the TimeoutInputStream's were always running in the same Thread, locking the same objects, being always RUNNABLE and never in a stack frame below flush()

So I tend to assume that this issue is indeed an endless loop in the OutputStream's write() method with a very specific set of parameters given such that the "l" variable is never modified because flush() returns immediately, and thus the while() loop never ends.

Comment 12 Atsuhiko Yamanaka

2007-09-28 11:28:20 EDT

May I ask you to download
  http://eclipse.jcraft.com/plugins/com.jcraft.jsch_0.1.31.1.jar
to your plugins/ directory?
# You may need to remove com.jcraft.jsch_0.1.31.jar to enable that new jar.

As I wrote at comment #3
 > In JSch 0.1.31 source code, I found lines, where CPU time is eaten, but 
 > I have not found the exact reason.  It seems that the channel was not been
 > correctly initialized due to some rare reason.

That new jar has creafully initialized the channel.

Comment 13 Martin Oberhuber

2007-09-28 11:39:57 EDT

Would you ask me to perform any tests with 0.1.31.1 or would it just be to make my environment safe, and rule out potential problem candidates in case it happens again for me?

I'd still like to know under what circumstances Eclipse Team/CVS uses Channel.getOutputStream() such that I'd have a better chance trying to reproduce the issue / verify the fix.

Thanks!

Comment 14 Atsuhiko Yamanaka

2007-09-28 11:58:59 EDT

(In reply to comment #13)
> Would you ask me to perform any tests with 0.1.31.1 or would it just be to make
> my environment safe, and rule out potential problem candidates in case it
> happens again for me?

It seems you can reproduce that problem very easily, so I just ask you to test
if you can reproduce it with 0.1.31.1 or not.  
If you can, my guess is not right.

> I'd still like to know under what circumstances Eclipse Team/CVS uses
> Channel.getOutputStream() such that I'd have a better chance trying to
> reproduce the issue / verify the fix.

out=channel.getOutputStream();
channel.connect();
out.write("??");

If the channel is not ready before writing data to the output stream,
that problem will happen.  The channel will be initialized by other
thread and, due to heavy load, there is the possibility that 
the initialization may be delayed.

Comment 15 Martin Oberhuber

2007-09-28 15:00:42 EDT

No it's not easy to reproduce... looks like the system must be under really heavy load, like Michael indicated with his running a large remote cvs rlog.

My several thread dumps were all from one and the same occurrence, I haven't been able to reproduce it since (with the old jsch-0.1.31). I guess I'm still going to install jsch-0.1.31.1 though.

Comment 16 Atsuhiko Yamanaka

2007-10-16 22:21:54 EDT

I have not been able to reproduce this problem yet,
but I have found the bug, which may cause this problem and fixed it.

That fix has been included in jsch-0.1.35, which is already released.
If you have opportunities to run a large remote cvs rlog, please try
  http://eclipse.jcraft.com/plugins/com.jcraft.jsch_0.1.35.jar
  (or install it via the update site, http://eclipse.jcraft.com/site.xml)

As for changes since 0.1.31 included in Eclipse SDK 3.3,
there was not any code from others(third parities), so I hope that
the legal review will not take so much time.

Comment 17 Michael Valenta

2007-10-17 08:31:42 EDT

Atsuhiko, since you are the provider of the Jsch client and an Eclipse committer, it may be easiest if you submit the request to the IP team for permission to include the latest JSch client in Eclipse. You can issue the request by going to the My Foundation Portal at:

 https://dev.eclipse.org/portal/myfoundation/portal/portal.php

Once you log in using your dev.eclipse user name and password, you can click on the [request] link in the IP Policy section. The fields should be fairly easy for you to fill out.

Comment 18 Atsuhiko Yamanaka

2007-10-22 01:54:38 EDT

(In reply to comment #17)
> Atsuhiko, since you are the provider of the Jsch client and an Eclipse
> committer, it may be easiest if you submit the request to the IP team for
> permission to include the latest JSch client in Eclipse. You can issue the
> request by going to the My Foundation Portal at:
>  https://dev.eclipse.org/portal/myfoundation/portal/portal.php

Thank you for your suggestion.
I have filed it;  CQ 1826.

Comment 19 Eclipse Webmaster

2019-09-06 16:17:26 EDT

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.