Bug 230831 - [efs][ssh][performance] NPE and Connection lost during sequental accessing for all files from Remote SSH project
Summary: [efs][ssh][performance] NPE and Connection lost during sequental accessing fo...
Status: RESOLVED NOT_ECLIPSE
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 3.0   Edit
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: 3.0   Edit
Assignee: Martin Oberhuber CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords: helpwanted, performance
Depends on: 256581
Blocks:
  Show dependency tree
 
Reported: 2008-05-07 04:17 EDT by Andrei Sobolev CLA
Modified: 2008-11-26 04:45 EST (History)
2 users (show)

See Also:


Attachments
Exception traces (8.78 KB, text/plain)
2008-05-07 04:17 EDT, Andrei Sobolev CLA
no flags Details
JSch-0.1.38 preview (196.87 KB, application/x-java-archive)
2008-05-07 04:59 EDT, Martin Oberhuber CLA
no flags Details
Exception with 0_1_38 (3.10 KB, text/plain)
2008-05-07 05:40 EDT, Andrei Sobolev CLA
no flags Details
Configuration details (739.33 KB, text/plain)
2008-05-07 06:04 EDT, Andrei Sobolev CLA
no flags Details
RSE SSH Log (35.52 KB, text/plain)
2008-05-08 04:31 EDT, Andrei Sobolev CLA
no flags Details
Solaris disconnect log (178.98 KB, text/plain)
2008-05-14 01:11 EDT, Andrei Sobolev CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrei Sobolev CLA 2008-05-07 04:17:33 EDT
Created attachment 99016 [details]
Exception traces

Hi,

I DLTK we need to scan all files in project during build.
But for big projects in in some configurations like, connecting to Solaris, we have see connection lost dialog message "RSEG 1058", and appropriate RSE exception.

I've discovered that this errors are caused by NPE in jsch.

This happen for both 3.0M5 and 3.0M6 versions.

Error logs are in attachment.
I've tested with Eclipses 3.2.1 and 3.4(m5,m6,m7).

On servers we use:
Linux - OpenSSH 4.7_p1-r6


Any ideas?
Thanks.
Comment 1 Martin Oberhuber CLA 2008-05-07 04:59:36 EDT
Created attachment 99023 [details]
JSch-0.1.38 preview

Hm... apparently this happens from the EFS provider, which uses the JSch Streams support. 

The Platform switched to a newer JSch version (0.1.37) in Eclipse 3.4M6, with bug 221378. So we're only going to focus on Eclipse Platform M6 or later, with RSE 3.0M6 or later.

Since then, JSch-0.1.38 was released on Apr/22 with the following announcement:
http://sourceforge.net/mailarchive/forum.php?thread_name=200804221521.AAA16510%40jcraft.com&forum_name=jsch-users
"- bugfix: NPE should not be thrown at unexpeced session drop.  FIXED."

Attached is an unofficial "eclipsified" version of JSch-0.1.38. Just drop it into your plugins (or dropins) folder and start eclipse -clean, it should get picked up (you can rename JSch-0.1.37 to 0.1.37.orig if you want to be super safe). Please try if this improves the situation for you.
Comment 2 Andrei Sobolev CLA 2008-05-07 05:40:41 EDT
Created attachment 99030 [details]
Exception with 0_1_38

No, using of 0.1.38 not helps at all.
But exception now are different. I've attached log.
Comment 3 Martin Oberhuber CLA 2008-05-07 05:47:37 EDT
It would be interesting to know what JVM you are using, since java.io.PipedInputStream is part of your backtrace and that class is known to be problematic at times.

An you attach a complete configuration log (Eclipse Help > About > Configuration Details > Copy&Paste into a file).

A complete Thread Dump from your running Eclipse (while file transfers are ongoing ) could also be interesting, to see how many SSH transfers are going on in parallel. 
Comment 4 Andrei Sobolev CLA 2008-05-07 06:04:35 EDT
Created attachment 99039 [details]
Configuration details

I've using Java(TM) SE Runtime Environment (build 1.6.0_02-b05).

As for connections, I've have only one "Connection thread".
Comment 5 Martin Oberhuber CLA 2008-05-07 06:29:47 EDT
Just to be safe, could you try JDK 6 Update 6 and / or Java 5.0 update 15
http://java.sun.com/javase/downloads/index_jdk5.jsp 
Comment 6 Andrei Sobolev CLA 2008-05-08 04:30:25 EDT
Hi,
I've checked with jdk1.6.0_06 problem still exist.

Also in last tests I've could see about 10-15 threads with names like "Connect Thread SOLARIS.LOCAL session". I suppose this is JSch sessions, in which disconnect are happen?

Also, I've enabled RSE SSH tracing and attached log then disconnect are happen.

Part of log:
>>>>
2008-05-08 15:15:38.137 | Worker-4 | SftpFileService.getInputStream tomAmp.tcl
2008-05-08 15:15:38.220 | Worker-4 | SftpFileService.getInputStream tomAmp.tcl ok
2008-05-08 15:15:38.292 | Worker-4 | SftpFileService.getInputStream s_user_type.test
2008-05-08 15:15:38.383 | Worker-4 | SftpFileService.getInputStream s_user_type.test ok
2008-05-08 15:15:38.389 | Worker-4 | SftpFileService.getInputStream atsSR.tcl
2008-05-08 15:15:38.389 | ModalContext | SshConnectorService.disconnect
2008-05-08 15:15:38.391 | ModalContext | SshConnectorService.sessionDisconnect
2008-05-08 15:15:38.394 | ModalContext | SftpFileService.uninitService
2008-05-08 15:15:38.395 | ModalContext | SftpFileService.disconnect
2008-05-08 15:15:38.441 | Worker-4 | SftpFileService.getInputStream /export/home/haiodo/tcl/ats5.0.0/ats_lib/autoeasy/lib/atsSR.tcl failed: com.jcraft.jsch.JSchException: session is down
2008-05-08 15:15:38.460 | Worker-4 | SshConnectorService.connecting...
2008-05-08 15:15:39.023 | Worker-4 | SshConnectorService.connected
2008-05-08 15:15:39.024 | Worker-4 | SftpFileService.initService
>>>>

I'm wondering what ModalContext is, and why it calls disconnect method of SshConnectorService?

Best regards,
Andrei.
Comment 7 Andrei Sobolev CLA 2008-05-08 04:31:00 EDT
Created attachment 99236 [details]
RSE SSH Log
Comment 8 Andrei Sobolev CLA 2008-05-08 04:34:14 EDT
One more point:
My current implementation will call IFile.getContents(true) for a ten times if first time it was failed.

Is it possibly on RSE side to wait until connection will be reestablished?
Comment 9 Hemang Lavana CLA 2008-05-08 10:50:33 EDT
Adding myself on the CC list.
Comment 10 Andrei Sobolev CLA 2008-05-12 10:10:42 EDT
I've found what in most cases disconnects are happen then we try to create more than ten ssh channels.

I've updated my code, and now disconnects are almost not happen.

Is it possibly to handle such situations in RSE?
For example if RSE create more what one SSH connection, or wait, if we have more than ten requests?
Comment 11 Andrei Sobolev CLA 2008-05-14 01:11:52 EDT
Created attachment 100117 [details]
Solaris disconnect log

I've enabled tracing into JSch implementation and found what disconnects are happen because of following error:
SSH_MSG_DISCONNECT: 2 Could not create socket pairs: Too many open files

This happen on each 244 channel creation.

May be this is ssh server implementation limitation?
Or it is bug in JSch, and channels are not properly closed?
Comment 12 Martin Oberhuber CLA 2008-05-15 18:06:35 EDT
JSch definitely does close its channel(s), when the client application closes it. 244 channels is a whole lot, and it looks like you might be hitting a server limitation here. Most (openSSH) servers support 10 concurrent channels, some smaller servers support only one -- which is very problematic for RSE, see bug 209090.

In your case, I'd recommend checking why there are so many open channels. Looks like you're using the EFS integration and your client(s) don't close their Streams so this might be a leak in your client code.
Comment 13 Martin Oberhuber CLA 2008-05-15 18:07:14 EDT
BTW, I have requested an upgrade to JSch-0.1.38 with bug 232416.
Comment 14 Andrei Sobolev CLA 2008-05-16 00:37:12 EDT
Hi Martin,

244 is not a count of channels created.
In one time I've have only one channel created.

Following scenario:
1) Get file A -> open channel -> read -> close channel
2) Get file B -> open channel -> read -> close channel.
etc.

But on each 244 request I've have this problem.
Comment 15 Martin Oberhuber CLA 2008-05-16 03:46:42 EDT
Atsuhiko, do you have an idea what could be going wrong in JSch?
Comment 16 Atsuhiko Yamanaka CLA 2008-05-16 06:30:24 EDT
(In reply to comment #1)

> Since then, JSch-0.1.38 was released on Apr/22 with the following announcement:
> http://sourceforge.net/mailarchive/forum.php?thread_name=200804221521.AAA16510%40jcraft.com&forum_name=jsch-users
> "- bugfix: NPE should not be thrown at unexpeced session drop.  FIXED."

I think this fix does not related the reported problem.
That fix is for NPE at unexpected socket drop during file transfer.

The reported problem has occured in establishing sftp channel as you know.
Comment 17 Atsuhiko Yamanaka CLA 2008-05-16 06:35:47 EDT
(In reply to comment #14)
> 244 is not a count of channels created.
> In one time I've have only one channel created.
> Following scenario:
> 1) Get file A -> open channel -> read -> close channel
> 2) Get file B -> open channel -> read -> close channel.
> etc.
> But on each 244 request I've have this problem.

Have you explicitly closed the input-stream from RSE?
I mean the stream from SftpFileService#getInputStream.  

  |com.jcraft.jsch.JSchException: java.lang.NullPointerException
  |	at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:234)
  |	at com.jcraft.jsch.Channel.connect(Channel.java:199)
  |	at com.jcraft.jsch.Channel.connect(Channel.java:144)
  |	at org.eclipse.rse.internal.services.ssh.files.SftpFileService.getInputStream(SftpFileService.java:1123)

Comment 18 Andrei Sobolev CLA 2008-05-16 06:41:46 EDT
Yes, I've close streams before opening next one.
Comment 19 Atsuhiko Yamanaka CLA 2008-05-16 07:08:55 EDT
Where can I get the source code of org.eclipse.dltk.*?
Comment 20 Andrei Sobolev CLA 2008-05-16 07:17:03 EDT
You could obtain code from dev.eclipse.org cvs

root: dev.eclipse.org/technology/
projects: org.eclipse.dltk/core/plugins/org.eclipse.dltk.*

rse intergation are in:
org.eclipse.dltk/rse/plugins/org.eclipse.dltk.rse.*
Comment 21 Atsuhiko Yamanaka CLA 2008-05-16 07:50:16 EDT
(In reply to comment #20)
> You could obtain code from dev.eclipse.org cvs
> root: dev.eclipse.org/technology/
> projects: org.eclipse.dltk/core/plugins/org.eclipse.dltk.*
> rse intergation are in:
> org.eclipse.dltk/rse/plugins/org.eclipse.dltk.rse.*

Thank you for informations.

I want to reproduce that problem on my environment.
How to do it?

Which version I should download from 
  http://download.eclipse.org/technology/dltk/downloads/  ?
Then please introduce the steps to re-produce that problem.

It seems resources have been leaked on the sshd.
I guess that files or channles are not closed on the remote
due to some reasons.

Comment 22 Atsuhiko Yamanaka CLA 2008-05-17 09:37:22 EDT
I could reproduce the problem without DLTK and RSE
and have found the reason at last.

The following code will make that error in sshd with
"Could not create socket pairs: Too many open files",

    session.connect();
    Channel c=null;
    int i=0;
    try{
      for(i=0; i<1024; i++){
        c=session.openChannel("sftp");
        c.connect();
        c.disconnect();
      }
    }
    catch(Exception e){
      System.out.println(i);
      System.out.println(e);
    }

The message "Could not ..." has come from OpenSSH's session.c[1],

    int inout[2], err[2];
    /* Uses socket pairs to communicate with the program. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, inout) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, err) < 0)
      packet_disconnect("Could not create socket pairs: %.100s", strerror(errno));

I reviewed that session.c and have found a bug, which has missed to
free 'err' for subsystem channels.  It seems that this bug has been fixed
just a week ago[2]

Anyway, jsch can not work around this bug unfortunately,
and all we can do is to re-use sftp channel as possible as.

[1] http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/session.c
[2] http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/session.c.diff?r1=1.236&r2=1.237
Comment 23 Martin Oberhuber CLA 2008-05-20 15:03:08 EDT
Whow! Thanks Atsuhiko for this in-depth analysis.

We could try and pool the Sftp Channels for re-use, but the drawback here is that we need to keep the open Channel(s) around so they are not available for others even if we idle.

I'll investigate this a bit more to see what (Performance) impact it has. Perhaps just keeping one or two Channels available in a Pool would be sufficient since Eclipse hardly ever opens multiple Streams in parallel. I'm inclined to doing this if it also boosts our Performance, although the original problem here is in OpenSSHd so it's NOT_ECLIPSE.

Andrei could you help out a bit here, e.g. writing a Unittest that I could use to measure Performance before and after any change?

Tentatively targeting RC3 for a fix but could get in earlier.
Comment 24 Martin Oberhuber CLA 2008-05-26 13:26:39 EDT
I've written a Unittests that creates 255 files through SSH EFS connection (IFileStore.openOutputStream()) and have not hit the exception. It took 77,556 seconds on my connection.

I then hacked SftpFileService to always re-use a single shared channel; that's possible in my Unittest because I KNOW that I always access the same channel sequentially. With the hacked single channel, the test was done in 4,516 seconds on my connection.

This is a cut down to 6% of the original time, or a 17 times performance improvement. Sufficient to go and shoot for an implementation which re-uses channels, although it'll be considerably harder to do when multiple channels can be used simultaneously.
Comment 25 Martin Oberhuber CLA 2008-06-09 19:58:01 EDT
Given that what makes the issue "major" is not an Eclipse problem, but a bug in OpenSSH, only the performance issue remains.

We'll need to defer addressing that to RSE 3.0.1 -- is it OK for you to decrease severity to "Normal" since there is a workaround available for the Major issue (i.e. upgrade your buggy OpenSSH version)?

I'm keeping Priority P2 such that we don't forget adressing this early in 3.0.1 since lots of people will benefit from the performance improvement, and the fix is likely risky so we'd better address it early in the stream.
Comment 26 Martin Oberhuber CLA 2008-09-16 12:39:56 EDT
Bulk update of target milestone
Comment 27 Martin Oberhuber CLA 2008-11-26 04:45:37 EST
Closing this as NOT_ECLIPSE since as per comment 22, the NPE problem was due to a bug in OpenSSH. As per comment 10, a workaround to alleviate the situation is avoiding opening too many channels in client code.

I have opened bug 256581 in order to track the remaining performance issue. Once the implementation is changed as proposed in that bug, the problem should go away since we'll use way fewer channels and they will be more long-lived.