RE: [ptp-dev] Re: Adding SLURM support to PTP 3.0


Greg,

I have run into some problems while porting SLURM support to PTP 3.0.

1. The job states currently defined in PTP are "STARTING, RUNNING, EXITED, EXITED_SIGNALED, SUSPENDED, ERROR, UNKNOWN". SLURM, however, defines its job states as "PENDING, RUNNING, SUSPENDED, COMPLETE, CANCELLED, FAILED, NODEFAIL, TIMEOUT". Several of these (PENDING, COMPLETE, CANCELLED, FAILED, NODEFAIL, TIMEOUT) are not defined in PTP at present, so I would have to extend the job state enumeration to support them. However, this changes a core definition in PTP. Is it OK to extend the job state definition in PTP?
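
To make the mismatch concrete, here is a minimal sketch of what happens if the SLURM states are squeezed into the existing set instead of extending it. The class and enum names are hypothetical (they only mirror the two lists quoted above, not the real PTP or SLURM types), and the mapping itself is an assumption chosen for illustration.

```java
// Hypothetical sketch: these enums copy the state lists quoted in the message;
// they are not the actual PTP or SLURM definitions.
class JobStateMappingSketch {
    enum PtpJobState { STARTING, RUNNING, EXITED, EXITED_SIGNALED, SUSPENDED, ERROR, UNKNOWN }
    enum SlurmJobState { PENDING, RUNNING, SUSPENDED, COMPLETE, CANCELLED, FAILED, NODEFAIL, TIMEOUT }

    // Assumed mapping onto the existing PTP states. It is lossy:
    // FAILED, NODEFAIL and TIMEOUT all collapse into ERROR.
    static PtpJobState toPtp(SlurmJobState s) {
        switch (s) {
            case PENDING:   return PtpJobState.STARTING;
            case RUNNING:   return PtpJobState.RUNNING;
            case SUSPENDED: return PtpJobState.SUSPENDED;
            case COMPLETE:  return PtpJobState.EXITED;
            case CANCELLED: return PtpJobState.EXITED_SIGNALED;
            case FAILED:
            case NODEFAIL:
            case TIMEOUT:   return PtpJobState.ERROR;
            default:        return PtpJobState.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // Print the mapping so the collapsed states are easy to spot.
        for (SlurmJobState s : SlurmJobState.values()) {
            System.out.println(s + " -> " + toPtp(s));
        }
    }
}
```

With a mapping like this the user can no longer tell a timed-out job from one that lost a node, which is the reason for asking about extending the enumeration.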

2. Because SLURM doesn't report a state for each individual process of a job, we have to set each process state to the job state. The process state enumeration in ProcessAttribute.java is therefore extended with the additional "PENDING, CANCELLED, FAILED, NODEFAIL, TIMEOUT" states. This modification forces changes to other source files, such as MPICH2RuntimeSystem.java and OpenMPIRuntimeSystemJob.java, where process states are handled in a switch-case statement.
Is it a good idea to modify such source files in order to support SLURM?
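
A rough sketch of the situation in point 2 follows. Everything here is hypothetical: the enum is not the real ProcessAttribute.java, the original process states are assumed to parallel the job states listed in point 1, and describe() only stands in for the kind of switch found in the other runtime systems.

```java
// Hypothetical sketch, not the actual PTP sources.
class ProcessStateSketch {
    // Assumed original states, followed by the states added for SLURM.
    enum ProcessState {
        STARTING, RUNNING, EXITED, EXITED_SIGNALED, SUSPENDED, ERROR, UNKNOWN,
        PENDING, CANCELLED, FAILED, NODEFAIL, TIMEOUT
    }

    // SLURM gives no per-process state, so every process copies the job state.
    static ProcessState[] mirrorJobState(ProcessState jobState, int numProcs) {
        ProcessState[] states = new ProcessState[numProcs];
        java.util.Arrays.fill(states, jobState);
        return states;
    }

    // A switch written before the SLURM states existed. The default branch
    // absorbs the new states, so this code needs no change to keep compiling.
    static String describe(ProcessState state) {
        switch (state) {
            case STARTING:        return "starting";
            case RUNNING:         return "running";
            case SUSPENDED:       return "suspended";
            case EXITED:
            case EXITED_SIGNALED: return "finished";
            case ERROR:           return "error";
            default:              return "unhandled (" + state + ")";
        }
    }
}
```

Switches that have a default branch tolerate the new values; those that enumerate every state explicitly are the ones that would need editing.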

3. The third question is about debug support in PTP. The current implementation of the PTP debugger requires two steps: (1) write a routing file containing the runtime information (node ID, node IP address/hostname) needed to build the binomial tree connecting the sdm client and all sdm servers; (2) start the sdm client process locally.
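
For readers unfamiliar with the binomial tree mentioned in step (1), here is a minimal sketch of one common numbering scheme. It assumes node ID 0 is the root (the sdm client) and IDs 1..n-1 are the sdm servers; the actual layout used by sdm may differ, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a binomial tree over node IDs 0..n-1, with 0 as root.
class BinomialTreeSketch {
    // The parent is obtained by clearing the lowest set bit of the ID.
    static int parent(int id) {
        return id == 0 ? -1 : (id & (id - 1));
    }

    // The children of a node are id + 2^k for every power of two below the
    // node's lowest set bit (any power of two for the root), limited to id < n.
    static List<Integer> children(int id, int n) {
        List<Integer> out = new ArrayList<>();
        int limit = (id == 0) ? n : Integer.lowestOneBit(id);
        for (int bit = 1; bit > 0 && bit < limit && id + bit < n; bit <<= 1) {
            out.add(id + bit);
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 8; // one client plus seven servers
        for (int id = 0; id < n; id++) {
            System.out.println(id + ": parent=" + parent(id) + " children=" + children(id, n));
        }
    }
}
```

Each node only needs the address of its parent and children, which is why the routing file must list the ID and hostname of every node.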

As for the routing file: SLURM can't provide enough information to write it at the time the sdm server job is launched. Our solution is to attach to the srun process (which launches the sdm server job) and, once the sdm server job has started running, read the required information from a data structure in srun's address space. So we write the routing file in ptp_slurm_proxy instead of in the PTP debugger. However, the other existing runtime systems (such as MPICH2, ORTE, LL, PE) rely on the PTP debugger to create the routing file. How should this conflict be resolved? Alternatively, is there a way for me to determine whether the current RMS is SLURM or something else?

Next is the launch of the sdm client. The current implementation limits the PTP debugger to running on the same node as the runtime system/resource manager. However, since the PTP debugger communicates with the sdm client over a socket, it is possible in principle to support remote debugging (where the PTP debugger runs on one machine and the sdm client/servers run on another). My suggestion is to let the runtime proxy launch both the sdm client process and the sdm server job in response to the debug-job launch request. That would make remote debugging feasible.


