Re: [ptp-user] slurm tasks not honoured?

Greg

Thanks very much. I have upgraded everything and tested the slurm startup. Things haven't worked quite right, but I'm now sure that slurm is broken on the cluster. mpiexec -n ... works fine, but srun -n ... fails to produce the correct jobs. Once this is fixed, I'll try again.
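
One way to compare the two launchers is to have each start a trivial program and count how many tasks answer. This is only a sketch, assuming both `srun` and `mpiexec` are on the PATH inside an allocation and that `hostname` prints one line per task. The helper names `count_tasks` and `launched_tasks` are made up for illustration and are not part of PTP or SLURM.

```python
import subprocess
from collections import Counter


def count_tasks(output):
    """Count tasks per host, assuming each task printed one hostname line."""
    return Counter(line.strip() for line in output.splitlines() if line.strip())


def launched_tasks(launcher):
    """Run `<launcher> hostname` and return how many tasks produced output."""
    result = subprocess.run(
        list(launcher) + ["hostname"],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return sum(count_tasks(result.stdout).values())


if __name__ == "__main__":
    # 16 matches the task count requested in the thread; adjust as needed.
    for launcher in (["srun", "-n", "16"], ["mpiexec", "-n", "16"]):
        try:
            print(" ".join(launcher), "->", launched_tasks(launcher), "tasks")
        except (OSError, subprocess.SubprocessError) as exc:
            print(" ".join(launcher), "failed:", exc)
```

If the two counts differ for the same `-n` value, the problem most likely lies in the launcher or cluster configuration rather than in PTP.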

How can I change the srun command in the job script? There are lots of options, but I can't seem to make them behave the way I expect/want. (Where is the XML that holds the defaults etc. actually kept? If I could modify that to use mpiexec, I might be OK.)

In the meantime, I have been perusing the ptp developers pages and decided to try installing all that stuff. I used to know how to work with large java projects and it'd be handy to be able to fix some bugs without help.

ta

JB

-----Original Message-----
From: ptp-user-bounces@xxxxxxxxxxx [mailto:ptp-user-bounces@xxxxxxxxxxx] On Behalf Of Greg Watson
Sent: 12 March 2012 17:33
To: PTP User list
Subject: Re: [ptp-user] slurm tasks not honoured?

John,

The new SLURM support is available as follows:

1. Download the Eclipse Juno M5 build from here: http://www.eclipse.org/downloads/packages/eclipse-ide-parallel-application-developers/junom5

2. Go to Help>Install New Software... and add the following update site: http://download.eclipse.org/tools/ptp/builds/juno/nightly

3. Click on Parallel Tools Platform and update it.

4. When Eclipse restarts, switch to the System Monitoring perspective and select "Add Resource Manager..." from the context menu. Add a "SLURM-Generic-Batch" RM.

This RM is just a copy of the BG/Q version with the BG-specific parts removed. Let me know if it works for you, or what problems you run into.

Cheers,
Greg
 
On Feb 21, 2012, at 10:42 AM, Biddiscombe, John A. wrote:

> Greg,
> 
> I'm away from the office until Friday, but if I can help, then 
> consider me volunteered. I have no clue where to start, so if you can 
> send instructions that I can follow to experiment, then please do and 
> I'll happily play with things when I get back.
> 
> JB
> 
> -----Original Message-----
> From: ptp-user-bounces@xxxxxxxxxxx 
> [mailto:ptp-user-bounces@xxxxxxxxxxx] On Behalf Of Greg Watson
> Sent: 21 February 2012 15:40
> To: PTP User list
> Subject: Re: [ptp-user] slurm tasks not honoured?
> 
> John,
> 
> I'm not sure that the current SLURM resource manager has been very thoroughly tested, so it's possible you're seeing some bugs with this implementation. Ideally we would like to transition from this version to the new RM framework (the one used for PBS), but need someone who has access to a SLURM system to volunteer to write/test a configuration file.
> 
> Regards,
> Greg
> 
> On Feb 20, 2012, at 11:18 AM, Biddiscombe, John A. wrote:
> 
>> I think a more general question would be: with a slurm resource manager running (which seems to be fine), how do I enter specific mvapich2 settings to ensure that, when the job is submitted to slurm, the correct launch procedure is followed?
>> 
>> The PBS resource manager seems to have all the options for changing the MPI command etc., but I can't find the equivalent for slurm.
>> (Our system changed from PBS to slurm a few months ago and this is my first attempt to set things up since then.)
>> (We are using slurm-2.3.0-pre5 by the look of things.)
>> 
>> thanks (hopefully)
>> 
>> JB
>> 
>> 
>> -----Original Message-----
>> From: ptp-user-bounces@xxxxxxxxxxx [mailto:ptp-user-bounces@xxxxxxxxxxx] On Behalf Of Biddiscombe, John A.
>> Sent: 20 February 2012 12:29
>> To: PTP User list
>> Subject: [ptp-user] slurm tasks not honoured?
>> 
>> Seeing the email about the release of ptp 5.0.5, I updated eclipse, downloaded the proxy zip file, and recompiled utils, proxy and sdm. All seems fine, but when I run a job, the number of tasks always seems to be 1.
>> 
>> Launching with 16 tasks on one node, it outputs this (note the exception on every job launch):
>> 
>> SLURM@Local: ptp_slurm_proxy: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> SLURM@Local: Send Job/Process StateChange Event: state=32772
>> SLURM@Local: job[15974] iothread exit on EOF/ERROR of stdout fd
>> SLURM@Local: job[15974] iothread exit on Error/EOF of stderr fd.
>> SLURM@Local: Send Job/Process StateChange Event: state=4
>> SLURM@Local: Job[15974] no longer exist in SLURM. Romove it!
>> SLURM@Local: SLURM_SubmitJob (2):
>> SLURM@Local: job submit commands:
>> SLURM@Local:    jobTimeLimit=55
>> SLURM@Local:    launchedByPTP=true
>> SLURM@Local:    jobNumProcs=16
>> SLURM@Local:    execPath=/project/csvis/biddisco/eiger/build/pv-os/bin
>> SLURM@Local:    progArgs=-rc
>> SLURM@Local:    progArgs=-ch=148.187.14.220
>> SLURM@Local:    progArgs=--use-offscreen-rendering
>> SLURM@Local:    jobNumNodes=1
>> SLURM@Local:    execName=pvserver
>> SLURM@Local:    jobPartition=stdMem
>> SLURM@Local:    jobSubId=JOB_13297370315374
>> SLURM@Local: Job[15975] io thread create done.
>> SLURM@Local: Send Job/Process StateChange Event: state=1 
>> java.lang.NullPointerException
>>       at org.eclipse.ptp.ui.views.MachinesNodesView$JobListener.handleEvent(MachinesNodesView.java:111)
>>       at org.eclipse.ptp.rmsystem.AbstractResourceManagerMonitor.fireJobChanged(AbstractResourceManagerMonitor.java:241)
>>       at org.eclipse.ptp.rmsystem.AbstractResourceManager.fireJobChanged(AbstractResourceManager.java:510)
>>       at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManager.fireJobChanged(AbstractRuntimeResourceManager.java:145)
>>       at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.doUpdateJobs(AbstractRuntimeResourceManagerMonitor.java:988)
>>       at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.handleEvent(AbstractRuntimeResourceManagerMonitor.java:348)
>>       at org.eclipse.ptp.rtsystem.AbstractRuntimeSystem.fireRuntimeJobChangeEvent(AbstractRuntimeSystem.java:90)
>>       at org.eclipse.ptp.rtsystem.AbstractProxyRuntimeSystem.handleEvent(AbstractProxyRuntimeSystem.java:368)
>>       at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.fireProxyRuntimeJobChangeEvent(AbstractProxyRuntimeClient.java:249)
>>       at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.processRunningEvent(AbstractProxyRuntimeClient.java:677)
>>       at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.runStateMachine(AbstractProxyRuntimeClient.java:937)
>>       at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient$StateMachineThread.run(AbstractProxyRuntimeClient.java:94)
>>       at java.lang.Thread.run(Thread.java:736)
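
To see exactly what the proxy asked SLURM for, the `key=value` lines that follow "job submit commands:" in the log above can be collected into a dictionary. A minimal sketch, assuming the log format is exactly as pasted here. `parse_submit_attributes` is a hypothetical helper, not a PTP API, and it reads only the first submit block it finds.

```python
from collections import defaultdict

PREFIX = "SLURM@Local:"


def parse_submit_attributes(log_text):
    """Return {attribute: [values]} from the first 'job submit commands' block.

    Repeated attributes such as progArgs keep every value, in order.
    """
    attrs = defaultdict(list)
    in_block = False
    for raw in log_text.splitlines():
        # Drop email quote markers (">>") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if not line.startswith(PREFIX):
            continue
        body = line[len(PREFIX):].strip()
        if body.startswith("job submit commands"):
            in_block = True
            continue
        if in_block:
            # Split on the first "=" only, so "-ch=148.187.14.220" stays whole.
            key, sep, value = body.partition("=")
            if not sep or " " in key:
                break  # first line that is not key=value ends the block
            attrs[key].append(value)
    return dict(attrs)
```

For the log above this gives `jobNumProcs` as `["16"]` and `jobNumNodes` as `["1"]`, so the request leaving PTP does carry 16 tasks.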
>> 
>> and doing a scontrol show job ID --details gives this
>> 
>> JobId=15975 Name=pvserver
>>  UserId=biddisco(20569) GroupId=csstaff(1000)
>>  Priority=11025 Account=csstaff QOS=normal
>>  JobState=COMPLETED Reason=None Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
>>  DerivedExitCode=0:0
>>  RunTime=00:01:08 TimeLimit=00:55:00 TimeMin=N/A
>>  SubmitTime=12:23:51 EligibleTime=12:23:51
>>  StartTime=12:23:51 EndTime=12:24:59
>>  PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
>>  Partition=stdMem AllocNode:Sid=eiger220:4509
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=eiger200
>>  BatchHost=eiger200
>>  NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
>>    Nodes=eiger200 CPU_IDs=1 Mem=0
>>  MinCPUsNode=1 MinMemoryCPU=12000M MinTmpDiskNode=0
>>  Features=(null) Gres=(null) Reservation=(null)
>>  Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=(null)
>>  WorkDir=(null)
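
The mismatch is visible in the output above: 16 tasks were requested, but the job shows NumCPUs=1. A rough sketch of checking this automatically, assuming the `scontrol show job` output is whitespace-separated `Key=Value` pairs with no spaces inside values. `parse_scontrol` and `allocation_matches` are illustrative names, and treating NumCPUs >= requested tasks as "honoured" is only a heuristic.

```python
import re


def parse_scontrol(text):
    """Parse `scontrol show job` output into a flat {key: value} dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ")  # tolerate email quote markers
        for key, value in re.findall(r"(\S+?)=(\S*)", line):
            fields[key] = value
    return fields


def allocation_matches(fields, requested_tasks):
    """Heuristic: did the job get at least one CPU per requested task?"""
    return int(fields.get("NumCPUs", 0)) >= requested_tasks


if __name__ == "__main__":
    sample = """JobId=15975 Name=pvserver
 NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
 Partition=stdMem AllocNode:Sid=eiger220:4509"""
    job = parse_scontrol(sample)
    print("NumCPUs:", job["NumCPUs"], "honoured:", allocation_matches(job, 16))
```

In practice the text would come from running `scontrol show job <id> --details` on the cluster; the sample here is copied from the output quoted above.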
>> 
>> I suspect the generation of the slurm params is fishy. Is it possible to edit them by hand? (I think there was a template somewhere, but I can't remember/find it). 
>> 
>> It's quite possible I'm doing something wrong as I'm new to this.
>> 
>> Any advice welcome. 
>> thanks
>> 
>> JB
>> 
>> 
>> _______________________________________________
>> ptp-user mailing list
>> ptp-user@xxxxxxxxxxx
>> https://dev.eclipse.org/mailman/listinfo/ptp-user
> 


