Bug 372482 - [LML] Problem mapping jobs to nodes
Summary: [LML] Problem mapping jobs to nodes
Status: ASSIGNED
Alias: None
Product: PTP
Classification: Tools
Component: RM
Version: 5.0.4
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 normal
Target Milestone: ---
Assignee: Wolfgang Frings CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-24 08:24 EST by Greg Watson CLA
Modified: 2013-09-14 16:36 EDT
CC List: 3 users

See Also:


Description Greg Watson CLA 2012-02-24 08:24:55 EST
I have been doing some work on getting my SLURM LML code to work for clusters, not just Blue Gene.  While doing this I discovered a potential bug in the mapping of jobs to nodes in the display.  When running SLURM with my simulator (no jobs initially), jobs were always allocated the lowest-numbered nodes and moved upwards as needed.  When this was first displayed in PTP, however, the jobs appeared to be distributed to a random set of nodes, which was unexpected.  Investigating the code, I found the remapping of SLURM-generated node names to LML node numbers, which determines the order of nodes in the display.  The code snippet from LML_gen_nodedisplay.pm is below:

409 sub _get_system_size_cluster  {
410    my($self) = shift;
411    my($indataref) = $self->{INDATA};
412    my($numnodes);
413    my ($key,$ref,$name,$ncores);
414
415    keys(%{$self->{LMLFH}->{DATA}->{OBJECT}}); # reset iterator
416    while(($key,$ref)=each(%{$self->{LMLFH}->{DATA}->{OBJECT}})) {
417	next if($ref->{type} ne 'node');
418	$name=$ref->{name};
419	$ncores=$self->{LMLFH}->{DATA}->{INFODATA}->{$key}->{ncores};
420	if(!defined($ncores)) {
421	    print "_get_system_size_cluster: suspect node: $name, assuming 1 cores\n"  if($self->{VERBOSE});
422	    $ncores=1;
423	}
424	if($ncores<0) {
425	    print "_get_system_size_cluster: suspect node: $name negative number of cores, assuming 1 cores\n"  if($self->{VERBOSE});
426	    $ncores=1;
427	}
428	push(@{$self->{NODESIZES}->{$ncores}},$name);
429    }
430
431
432    $numnodes=0;
433    foreach $ncores (sort {$a <=> $b} keys %{$self->{NODESIZES}}) {
434	foreach $name (@{$self->{NODESIZES}->{$ncores}}) {
435	    # register new node 
436	    if(!exists($self->{NODEMAPPING}->{$name})) {
437		$self->{NODEMAPPING}->{$name}=sprintf($self->{NODENAMENAMASK},$numnodes);
438#		print "_get_system_size_cluster: remap '$name' -> '$self->{NODEMAPPING}->{$name}'\n";
439		$numnodes++;
440	    } else {
441		print "ERROR: _get_system_size_cluster: duplicate node '$name' -> '$self->{NODEMAPPING}->{$name}'\n";
442	    }
443	}
444	printf("_get_system_size_cluster: found %4d nodes of size: %d\n", scalar @{$self->{NODESIZES}->{$ncores}},$ncores) 
445	    if($self->{VERBOSE});
446    }
447    printf("_get_system_size_cluster: Cluster found of size: %d\n",$numnodes) if($self->{VERBOSE});
448    
449    return($numnodes);
450 }

Lines 416-428 iterate through all the node definitions and push each one onto a list indexed by the number of cores in the node.  Lines 433-443 then iterate through all the nodes again, sorted by increasing core count, and at line 437 remap each node to an internal node number.  The problem is that the order of the nodes in the first loop (lines 416-428) is random, not the order of the node definitions in the XML file (nd000001, nd000002, ...): according to the Perl documentation, "each" visits hash entries in an unspecified order.  Consequently the mapping to the internal node numbers in the second loop is random as well, and thus jobs appear distributed to random nodes across the cluster.  While this might be intentional, since a user doesn't really care where on the machine a particular job is running, it is confusing because the PTP display does not in any way reflect the numbering of nodes on the actual machine.

The simple solution that I found is to sort the second loop by changing line 434 to:

434	foreach $name (sort @{$self->{NODESIZES}->{$ncores}}) {

Alternatively you could sort the first loop.
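To illustrate, here is a standalone sketch of the two-pass remapping with the proposed sort applied.  The data and the "node%06d" name mask are hypothetical stand-ins, not the actual LML structures or NODENAMENAMASK value:

```perl
use strict;
use warnings;

# Hypothetical node objects keyed by object id, standing in for
# $self->{LMLFH}->{DATA}->{OBJECT}.
my %object = (
    obj1 => { type => 'node', name => 'nd000003' },
    obj2 => { type => 'node', name => 'nd000001' },
    obj3 => { type => 'node', name => 'nd000002' },
);

# Pass 1: collect node names.  each() visits hash entries in an
# unspecified order, so @names comes out in no particular order.
my @names;
while ( my ($key, $ref) = each %object ) {
    next if $ref->{type} ne 'node';
    push @names, $ref->{name};
}

# Pass 2: assign internal node numbers.  Sorting here makes the
# mapping deterministic; without the sort it inherits the hash order.
my %nodemapping;
my $numnodes = 0;
for my $name (sort @names) {
    $nodemapping{$name} = sprintf( "node%06d", $numnodes++ );
}

print "$_ -> $nodemapping{$_}\n" for sort keys %nodemapping;
# nd000001 -> node000000
# nd000002 -> node000001
# nd000003 -> node000002
```

With the sort in place, the internal numbering follows the lexical order of the scheduler's node names on every run; without it, rerunning the script can produce a different mapping each time.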

The other interesting thing I found is that if the cluster is made up of nodes with different core counts, the PTP display always shows the lower core count nodes before the higher core count nodes, regardless of the node numbering used on the machine.  This is caused by the sorting of the node lists by core count on line 433.  This is obviously intentional, but again could be confusing to users.
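A small sketch of that grouping effect, using a hypothetical mixed cluster and a stand-in "node%06d" mask: the numeric sort over the NODESIZES keys (as on line 433) numbers all the small nodes before all the large ones, whatever the machine's own node names suggest.

```perl
use strict;
use warnings;

# Hypothetical mixed cluster: the 16-core nodes carry the lowest
# machine names, the 4-core nodes the highest.
my %nodesizes = (
    16 => [ 'nd000001', 'nd000002' ],
    4  => [ 'nd000003', 'nd000004' ],
);

# Group-by-core-count numbering, as in the second loop of
# _get_system_size_cluster (with the names also sorted).
my %nodemapping;
my $numnodes = 0;
for my $ncores ( sort { $a <=> $b } keys %nodesizes ) {
    for my $name ( sort @{ $nodesizes{$ncores} } ) {
        $nodemapping{$name} = sprintf( "node%06d", $numnodes++ );
    }
}

print "$_ -> $nodemapping{$_}\n" for sort keys %nodemapping;
# nd000001 -> node000002   (16-core nodes displayed after ...)
# nd000002 -> node000003
# nd000003 -> node000000   (... the 4-core nodes, which get the
# nd000004 -> node000001   lowest internal numbers)
```

So even with fully deterministic sorting, the displayed order can still diverge from the machine's own numbering whenever core counts differ.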

If you agree this issue is a bug, then similar changes will be required for the analogous PBS cluster code in the same file.

IMHO, the PTP display should reflect the logical node numbering used by the machine's scheduler, not a random ordering.
Comment 1 Greg Watson CLA 2012-02-24 08:25:29 EST
Reported by Simon Wail