
[ptp-dev] Runtime System Status Update

I committed some code earlier that brings me up to this status point. The changelog includes some of these comments, but I'm adding more here about what needs to be done next.

Where we are: You can now run OMPI jobs through PTP. When a job starts, we are notified of its job identifier, which I then use to populate the runtime model - signifying that our Universe knows of one new Job. I have changed the Job viewer - it now lists the jobs on the left and, when you click on one, shows some statistics for that Job on the right. The information is minimal right now, but we'll obviously add more in the future. You can also click on a job and then click the Terminate icon to send a kill message to the Control System for that job. This identifies the job and sends it down to the subsystem correctly - however, at this time I'm not actually doing a kill in OMPI.

What we aren't: There are a few immediate problems and things to do.
   * I am capturing the correct jobID at the control layer but am not
     actually killing the job yet with OMPI.  I believe I know how to
     do this, so this should be easy and likely the next step.
   * I can't yet get information about the Job that I've just started
     except its JobID.  I believe I know how to get some basic
     information, but what I really need is the ID of each process
     that started and, for each process, the ID of the Node it is
     running on.  It's unclear to me at this time whether this is
     just something I don't know how to do yet or whether it's
     unimplemented in ORTE.  I'll be determining this in the near
     future.
   * (BIGGEST PROBLEM/ANNOYANCE) I can start the ORTE daemon from PTP
     through the JNI interface, and I can also tell it to clean
     itself up and stop.  When I do this I can see the process start
     with the correct args, and when I stop it I can see it clean
     itself up perfectly.  Once it's started I can communicate with
     it - talking to the registry.  However, I cannot spawn an MPI
     job.  What's worse is that I also cannot spawn an MPI job at the
     command line if I've started the daemon in this way.  I've tried
     several exec() calls as well as the cheesy system() function -
     all with the same results.  The MPI programs just return
     immediately.  If I debug the program I can see that it gets
     assigned a JobID by ORTE, but ORTE immediately sends me an ABORT
     and a TERMINATED message, saying the job is done.  Frankly, I
     don't believe the job ever starts.
   * Greg has written some code that does state-of-health monitoring
     on bproc.  I need to integrate this with the JNI library and,
     subsequently, into the UI so that the UI actually reflects the
     cluster it sees.

What can you do: You can test these improvements if you want. Sadly, you have to start the orted on the console before you run PTP. If you do, it works perfectly: you can start MPI jobs all day, concurrently, watch the JobViewer fill up, and look at the messages coming out of the processes' stdout/stderr. You can pretend to kill a job and see in the print statements that it finds the right job. You can play around with the preferences and configure the ORTEd, and if you do things wrong or get errors from the OMPI/JNI layer you'll even get nice, helpful popup error boxes.

--
-- Nathan
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndebard@xxxxxxxx
---------------------------------------------------------------------


