Re: [ptp-dev] Re: Adding SLURM support to PTP 3.0

Hi Jie,

I've completed the changes as I described. Each resource manager that wishes to extend the icons displayed for the different model elements needs to provide an org.eclipse.ptp.ui.runtimeModelPresentation extension. This extension supplies a class that implements org.eclipse.ptp.ui.IRuntimeModelPresentation. You can use the jobStatus, processStatus and nodeStatus attributes (or your own attributes) to supply information that this class uses to select the appropriate images or text. See the Open MPI resource manager for details on what this class needs to do. I'll be using this for the PBS RM, but I haven't implemented it yet.

Let me know if you have any questions.

Cheers,
Greg
 
On Sep 20, 2009, at 11:17 AM, JiangJie wrote:

Hi Greg,

Also see answers below.


From: g.watson@xxxxxxxxxxxx
Subject: Re: [ptp-dev] Re: Adding SLURM support to PTP 3.0
Date: Thu, 17 Sep 2009 17:45:15 -0400
To: yangtzj@xxxxxxxxxxx


I think you've confused job states with process states (EXITED and EXITED_SIGNALED are process states), but I get your meaning. I think this highlights a weakness in the current implementation that needs to be addressed. Job states are only important for two things: 1) the launcher needs to know when the job begins running and when it is terminated; and 2) to display an icon next to the job in the view. For process states, it is only important to know when the process has exited, to be able to change its state to suspended when it's being debugged, and to change its icon otherwise.

I think the way to handle this correctly is as follows:

1. Replace the job and process enumerated state attributes with string attributes. This string can be set to anything that the resource manager desires, so you can use all the normal job states that SLURM supports. You can also either set the process state attribute or not.

2. Provide a means of allowing resource managers to map job/process states into icons. This should be relatively easy to do using WorkbenchAdapters that the RM provides.

3. Provide explicit interfaces to test for/set important states:

boolean IPJob.isRunning(); // defined as all processes running
boolean IPJob.isTerminated(); // defined as all processes terminated
void IPJob.setAllRunning(boolean running); // set all processes to running
void IPJob.setAllTerminated(); // set all processes to terminated

boolean IPProcess.isRunning();
void IPProcess.setRunning(boolean running);
boolean IPProcess.isSuspended();
void IPProcess.setSuspended(boolean suspended);
boolean IPProcess.isTerminated();
void IPProcess.setTerminated(boolean terminated);

Resource managers will need to call these interfaces explicitly rather than just setting states as they do now.
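
As a rough illustration of the proposal above, here is a minimal Java sketch. It is not PTP code: JobSketch and ProcSketch are hypothetical stand-ins for IPJob and IPProcess, the state string is a free-form value chosen by the RM, and the rules that a terminated process is no longer running or suspended and that a job with no processes is neither running nor terminated are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class StateSketch {
    /** Hypothetical stand-in for IPProcess; not the real PTP interface. */
    static class ProcSketch {
        private String stateText = ""; // free-form string chosen by the RM
        private boolean running;
        private boolean suspended;
        private boolean terminated;

        String getStateText() { return stateText; }
        void setStateText(String text) { stateText = text; }

        boolean isRunning() { return running; }
        void setRunning(boolean value) {
            running = value;
            // Assumption: a running process is neither suspended nor terminated.
            if (value) { suspended = false; terminated = false; }
        }

        boolean isSuspended() { return suspended; }
        void setSuspended(boolean value) {
            suspended = value;
            if (value) { running = false; }
        }

        boolean isTerminated() { return terminated; }
        void setTerminated(boolean value) {
            terminated = value;
            if (value) { running = false; suspended = false; }
        }
    }

    /** Hypothetical stand-in for IPJob; not the real PTP interface. */
    static class JobSketch {
        private final List<ProcSketch> processes = new ArrayList<>();
        private String stateText = ""; // e.g. a SLURM job state name

        JobSketch(int processCount) {
            for (int i = 0; i < processCount; i++) {
                processes.add(new ProcSketch());
            }
        }

        List<ProcSketch> getProcesses() { return processes; }
        String getStateText() { return stateText; }
        void setStateText(String text) { stateText = text; }

        /** Defined as all processes running (false for an empty job). */
        boolean isRunning() {
            if (processes.isEmpty()) { return false; }
            for (ProcSketch p : processes) {
                if (!p.isRunning()) { return false; }
            }
            return true;
        }

        /** Defined as all processes terminated (false for an empty job). */
        boolean isTerminated() {
            if (processes.isEmpty()) { return false; }
            for (ProcSketch p : processes) {
                if (!p.isTerminated()) { return false; }
            }
            return true;
        }

        void setAllRunning(boolean running) {
            for (ProcSketch p : processes) { p.setRunning(running); }
        }

        void setAllTerminated() {
            for (ProcSketch p : processes) { p.setTerminated(true); }
        }
    }

    public static void main(String[] args) {
        JobSketch job = new JobSketch(2);
        job.setStateText("RUNNING");
        job.setAllRunning(true);
        System.out.println("running: " + job.isRunning());
        job.setAllTerminated();
        System.out.println("terminated: " + job.isTerminated());
    }
}
```

The point of the sketch is that the display string and the launcher-relevant flags are separate: the RM can store any state text it likes, while the launcher only ever asks isRunning() and isTerminated().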

Let me know what you think of this idea.

>>>>>>>>>>>>>>>>>>>>>>>>>
Sounds like a good idea.
If the current enumerated job states are replaced by string attributes provided by the remote runtime resource management system,
then, since different RMs define job states differently, it is up to each RM to decide whether a job is running or terminated.
Right?

If the underlying RM can provide a state for each process in a job, the job state can be derived from its process states.
For an RM like SLURM that provides job state instead of process state, the process state can be derived from the job state.
It seems very flexible.
But I think just knowing that a process/job has terminated is not enough. It would be better to also provide a description of the termination reason,
that is, whether the job/process termination was caused by an error, a timeout, a node failure, or something else. This information can be displayed in the process view.
>>>>>>>>>>>>>>>>>>>>>>>>>>
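
One way to picture the suggestion above is the following Java sketch, which derives the flags a launcher cares about, plus a termination reason, from a job state string. The state names used are standard SLURM job states; the Derived class, the derive method and the reason strings are hypothetical and exist only for this illustration.

```java
public class SlurmStateSketch {
    /** Hypothetical result type: what the UI and launcher would need to know. */
    static final class Derived {
        final boolean running;
        final boolean terminated;
        final String reason; // empty unless the job has terminated

        Derived(boolean running, boolean terminated, String reason) {
            this.running = running;
            this.terminated = terminated;
            this.reason = reason;
        }
    }

    /** Maps a SLURM job state name to process-level flags and a reason. */
    static Derived derive(String slurmJobState) {
        switch (slurmJobState) {
            case "RUNNING":
                return new Derived(true, false, "");
            case "COMPLETED":
                return new Derived(false, true, "completed normally");
            case "FAILED":
                return new Derived(false, true, "error");
            case "TIMEOUT":
                return new Derived(false, true, "timeout");
            case "NODE_FAIL":
                return new Derived(false, true, "node failure");
            case "CANCELLED":
                return new Derived(false, true, "cancelled");
            default:
                // PENDING, SUSPENDED and unrecognized states: neither running
                // nor terminated in this sketch.
                return new Derived(false, false, "");
        }
    }

    public static void main(String[] args) {
        Derived d = derive("TIMEOUT");
        System.out.println(d.terminated + " (" + d.reason + ")");
    }
}
```

Every process in the job would receive the same derived flags, since SLURM reports state per job in this scenario; the reason string is what could be shown in the process view.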


3. The third question is about debug support in PTP. The current implementation of the PTP debugger requires it to (1) write a routing file with the runtime information (node ID, node IP address/hostname) used to create the binomial tree for the sdm client and all sdm servers, and (2) start the sdm client process locally.

As for the routing file, since SLURM can't provide enough information for the routing file during the sdm server job launch, our solution is to attach to the srun process (which launches the sdm server job) and get the required information (by reading from a data structure in the srun address space) after the sdm server job has started running. So we write the routing file in ptp_slurm_proxy instead of in the PTP debugger. However, the other existing runtime systems (such as MPICH2, ORTE, LL, PE) rely on the PTP debugger to create the routing file. How can this conflict be resolved? Or is there any way I can tell whether the current RMS is SLURM or something else?

Then there is the launch of the sdm client. The current implementation limits the PTP debugger to running on the same node where the runtime system/resource manager runs. However, since the PTP debugger uses a socket to communicate with the sdm client, it is in principle possible to support remote debugging (where the PTP debugger runs on one machine, and the sdm client/servers run on another). My suggestion is to let the runtime proxy launch both the sdm client process and the sdm server job in response to the debug-job launch request. In this way, remote debugging is feasible.

You can now turn off the auto generation of the routing file and launch of the client sdm using the IResourceManagerConfiguration.needsDebuggerLaunchHelp() interface. If your implementation of this returns false (the default implementation of AbstractResourceManagerServiceProvider does this), it will be left up to the RM to generate the routing file and launch the client. See the org.eclipse.ptp.rm.ibm.pe.* resource manager for what is required.
>>>>>>>
Thanks for your tips. It seems that overriding the needsDebuggerLaunchHelp() method in SLURMServiceProvider to return false will do what I want.
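
For illustration only, the decision described above could be sketched in Java as follows. This does not use the real PTP classes: ConfigSketch stands in for IResourceManagerConfiguration, SlurmConfigSketch for SLURMServiceProvider, and planDebugLaunch is a hypothetical helper showing who ends up responsible for the routing file and the sdm client.

```java
public class DebugLaunchSketch {
    /** Hypothetical stand-in for IResourceManagerConfiguration. */
    interface ConfigSketch {
        boolean needsDebuggerLaunchHelp();
    }

    /** Hypothetical stand-in for a SLURM service provider. */
    static class SlurmConfigSketch implements ConfigSketch {
        @Override
        public boolean needsDebuggerLaunchHelp() {
            // The SLURM proxy writes the routing file and starts the sdm
            // client itself, so no help from the debugger is requested.
            return false;
        }
    }

    /** Describes which side prepares the debug session. */
    static String planDebugLaunch(ConfigSketch config) {
        if (config.needsDebuggerLaunchHelp()) {
            return "debugger writes routing file and starts sdm client";
        }
        return "resource manager writes routing file and starts sdm client";
    }

    public static void main(String[] args) {
        System.out.println(planDebugLaunch(new SlurmConfigSketch()));
    }
}
```

With this arrangement the debugger never needs to know that the RMS is SLURM; it only asks the configuration whether help is needed.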

