Re: [ptp-user] slurm tasks not honoured?

I think a more general question would be:
With a slurm resource manager running (which seems to work fine), how do I enter specific mvapich2 settings so that the correct launch procedure is followed when the job is submitted to slurm?

The PBS resource manager seems to have all the options for changing the MPI command etc., but I can't find the equivalent for slurm.
(Our system changed from PBS to slurm a few months ago, and this is my first attempt to set things up since then.)
(We appear to be running slurm-2.3.0-pre5.)
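
For comparison, here is a minimal sketch of the manual srun command line that the submit parameters in the log quoted below would correspond to. It is not what the PTP slurm proxy actually does internally; build_srun_command is a hypothetical helper, and the mapping from PTP parameter names to srun options is my assumption. Whether mvapich2 needs further options (for example a particular --mpi plugin) depends on how it was built and is not covered here.

```python
import shlex


def build_srun_command(params):
    """Build an srun argument list from PTP-style submit parameters.

    The parameter names mirror the 'job submit commands' in the proxy log;
    the mapping to srun options is an assumption made for illustration.
    """
    cmd = [
        "srun",
        f"--ntasks={params['jobNumProcs']}",      # number of MPI tasks
        f"--nodes={params['jobNumNodes']}",       # number of nodes
        f"--partition={params['jobPartition']}",  # slurm partition
        f"--time={params['jobTimeLimit']}",       # time limit in minutes
        f"{params['execPath']}/{params['execName']}",
    ]
    cmd.extend(params.get("progArgs", []))
    return cmd


# Values copied from the proxy log in the original message.
params = {
    "jobNumProcs": 16,
    "jobNumNodes": 1,
    "jobPartition": "stdMem",
    "jobTimeLimit": 55,
    "execPath": "/project/csvis/biddisco/eiger/build/pv-os/bin",
    "execName": "pvserver",
    "progArgs": ["-rc", "-ch=148.187.14.220", "--use-offscreen-rendering"],
}

print(shlex.join(build_srun_command(params)))
```

Running the printed command by hand and comparing its scontrol output with that of the PTP-launched job should show whether the task count is being lost by the proxy or somewhere else.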

thanks (hopefully)

JB


-----Original Message-----
From: ptp-user-bounces@xxxxxxxxxxx [mailto:ptp-user-bounces@xxxxxxxxxxx] On Behalf Of Biddiscombe, John A.
Sent: 20 February 2012 12:29
To: PTP User list
Subject: [ptp-user] slurm tasks not honoured?

After seeing the email about the release of PTP 5.0.5, I updated Eclipse, downloaded the proxy zip file, and recompiled utils, proxy and sdm.
All seems fine, but when I run a job, the number of tasks always seems to be 1.

Launching with 16 tasks on one node produces the following output (note the exception, which occurs on every job launch):

SLURM@Local: ptp_slurm_proxy: Job step aborted: Waiting up to 2 seconds for job step to finish.
SLURM@Local: Send Job/Process StateChange Event: state=32772
SLURM@Local: job[15974] iothread exit on EOF/ERROR of stdout fd
SLURM@Local: job[15974] iothread exit on Error/EOF of stderr fd.
SLURM@Local: Send Job/Process StateChange Event: state=4
SLURM@Local: Job[15974] no longer exist in SLURM. Romove it!
SLURM@Local: SLURM_SubmitJob (2):
SLURM@Local: job submit commands:
SLURM@Local:    jobTimeLimit=55
SLURM@Local:    launchedByPTP=true
SLURM@Local:    jobNumProcs=16
SLURM@Local:    execPath=/project/csvis/biddisco/eiger/build/pv-os/bin
SLURM@Local:    progArgs=-rc
SLURM@Local:    progArgs=-ch=148.187.14.220
SLURM@Local:    progArgs=--use-offscreen-rendering
SLURM@Local:    jobNumNodes=1
SLURM@Local:    execName=pvserver
SLURM@Local:    jobPartition=stdMem
SLURM@Local:    jobSubId=JOB_13297370315374
SLURM@Local: Job[15975] io thread create done.
SLURM@Local: Send Job/Process StateChange Event: state=1
java.lang.NullPointerException
        at org.eclipse.ptp.ui.views.MachinesNodesView$JobListener.handleEvent(MachinesNodesView.java:111)
        at org.eclipse.ptp.rmsystem.AbstractResourceManagerMonitor.fireJobChanged(AbstractResourceManagerMonitor.java:241)
        at org.eclipse.ptp.rmsystem.AbstractResourceManager.fireJobChanged(AbstractResourceManager.java:510)
        at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManager.fireJobChanged(AbstractRuntimeResourceManager.java:145)
        at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.doUpdateJobs(AbstractRuntimeResourceManagerMonitor.java:988)
        at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.handleEvent(AbstractRuntimeResourceManagerMonitor.java:348)
        at org.eclipse.ptp.rtsystem.AbstractRuntimeSystem.fireRuntimeJobChangeEvent(AbstractRuntimeSystem.java:90)
        at org.eclipse.ptp.rtsystem.AbstractProxyRuntimeSystem.handleEvent(AbstractProxyRuntimeSystem.java:368)
        at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.fireProxyRuntimeJobChangeEvent(AbstractProxyRuntimeClient.java:249)
        at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.processRunningEvent(AbstractProxyRuntimeClient.java:677)
        at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.runStateMachine(AbstractProxyRuntimeClient.java:937)
        at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient$StateMachineThread.run(AbstractProxyRuntimeClient.java:94)
        at java.lang.Thread.run(Thread.java:736)

Running scontrol show job ID --details then gives this:

JobId=15975 Name=pvserver
   UserId=biddisco(20569) GroupId=csstaff(1000)
   Priority=11025 Account=csstaff QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:08 TimeLimit=00:55:00 TimeMin=N/A
   SubmitTime=12:23:51 EligibleTime=12:23:51
   StartTime=12:23:51 EndTime=12:24:59
   PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
   Partition=stdMem AllocNode:Sid=eiger220:4509
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=eiger200
   BatchHost=eiger200
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=eiger200 CPU_IDs=1 Mem=0
   MinCPUsNode=1 MinMemoryCPU=12000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=(null)
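
The mismatch between the requested and allocated task counts can be checked mechanically. A minimal sketch, assuming the scontrol output consists of whitespace-separated key=value pairs as shown above and one CPU per task (CPUs/Task=1); parse_scontrol and allocation_matches are hypothetical helper names, not part of PTP or slurm:

```python
def parse_scontrol(text):
    """Parse `scontrol show job` output into a dict of key -> value strings.

    Assumes whitespace-separated key=value tokens with no spaces inside
    values, which holds for the output shown above. If a key appears more
    than once, the first occurrence wins.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields.setdefault(key, value)
    return fields


def allocation_matches(fields, requested_tasks):
    """Return True if the job was given at least one CPU per requested task."""
    return int(fields["NumCPUs"]) >= requested_tasks


# A few lines copied from the scontrol output above.
sample = """
JobId=15975 Name=pvserver
   Partition=stdMem AllocNode:Sid=eiger220:4509
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
"""

fields = parse_scontrol(sample)
print(fields["NumCPUs"], allocation_matches(fields, 16))
```

For the job above the check fails: the submit parameters request jobNumProcs=16, but slurm reports NumCPUs=1.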

I suspect the generated slurm parameters are wrong: the job is submitted with jobNumProcs=16, but scontrol reports NumCPUs=1. Is it possible to edit them by hand? (I think there was a template somewhere, but I can't remember where and can't find it.)

It's quite possible I'm doing something wrong as I'm new to this.

Any advice welcome. 
thanks

JB


_______________________________________________
ptp-user mailing list
ptp-user@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-user

