I have been doing some work on getting my SLURM LML code to work for clusters and not just Blue Gene. While doing this I discovered a potential bug in the mapping of jobs to nodes in the display. When running SLURM with my simulator (no jobs initially), jobs were always allocated the lowest numbered nodes and moved upwards as needed. When this was first displayed in PTP, however, the jobs appeared to be distributed across a random set of nodes, which was unexpected. On investigating the code, I found the remapping of SLURM-generated node names to LML node numbers, which determines the order of nodes in the display. The code snippet from LML_gen_nodedisplay.pm is below:

409 sub _get_system_size_cluster {
410     my($self) = shift;
411     my($indataref) = $self->{INDATA};
412     my($numnodes);
413     my ($key,$ref,$name,$ncores);
414
415     keys(%{$self->{LMLFH}->{DATA}->{OBJECT}}); # reset iterator
416     while(($key,$ref)=each(%{$self->{LMLFH}->{DATA}->{OBJECT}})) {
417         next if($ref->{type} ne 'node');
418         $name=$ref->{name};
419         $ncores=$self->{LMLFH}->{DATA}->{INFODATA}->{$key}->{ncores};
420         if(!defined($ncores)) {
421             print "_get_system_size_cluster: suspect node: $name, assuming 1 core\n" if($self->{VERBOSE});
422             $ncores=1;
423         }
424         if($ncores<0) {
425             print "_get_system_size_cluster: suspect node: $name negative number of cores, assuming 1 core\n" if($self->{VERBOSE});
426             $ncores=1;
427         }
428         push(@{$self->{NODESIZES}->{$ncores}},$name);
429     }
430
431
432     $numnodes=0;
433     foreach $ncores (sort {$a <=> $b} keys %{$self->{NODESIZES}}) {
434         foreach $name (@{$self->{NODESIZES}->{$ncores}}) {
435             # register new node
436             if(!exists($self->{NODEMAPPING}->{$name})) {
437                 $self->{NODEMAPPING}->{$name}=sprintf($self->{NODENAMENAMASK},$numnodes);
438 #               print "_get_system_size_cluster: remap '$name' -> '$self->{NODEMAPPING}->{$name}'\n";
439                 $numnodes++;
440             } else {
441                 print "ERROR: _get_system_size_cluster: duplicate node '$name' -> '$self->{NODEMAPPING}->{$name}'\n";
442             }
443         }
444         printf("_get_system_size_cluster: found %4d nodes of size: %d\n", scalar @{$self->{NODESIZES}->{$ncores}},$ncores)
445             if($self->{VERBOSE});
446     }
447     printf("_get_system_size_cluster: Cluster found of size: %d\n",$numnodes) if($self->{VERBOSE});
448
449     return($numnodes);
450 }

Lines 416-428 iterate through all the node definitions and push each one onto a list keyed by the number of cores in the node. Lines 433-443 then iterate through the nodes again, sorted by increasing core count, and remap each node to an internal numbered node at line 437. The problem is that the order of the nodes in the initial loop (lines 416-428) is random and does not follow the order of the node definitions in the XML file (nd000001, nd000002, ...). According to the Perl documentation, "each" traverses a hash in an apparently random order. Therefore in the second loop the mapping to internal node numbers is random as well, and jobs appear distributed to random nodes across the cluster. While this might be intentional, as a user doesn't really care where on the machine a particular job is running, it is confusing because the PTP display does not in any way reflect the numbering of nodes on the actual machine.

The simple solution I found is to sort the second loop by changing line 434 to:

434         foreach $name (sort @{$self->{NODESIZES}->{$ncores}}) {

Alternatively, the first loop could be sorted instead.

The other interesting thing I found is that if the cluster is made up of nodes with different core counts, the PTP display always shows the lower core count nodes before the higher core count nodes, regardless of the node numbering used on the machine. This is caused by the sorting of the node lists by core count on line 433. This is obviously intentional, but again could be confusing to users.

If you agree this is a bug, similar changes are needed for the analogous PBS cluster code in the same file.
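To illustrate the effect of the proposed fix, here is a small sketch of the remapping logic in Python rather than the original Perl (the "nd%06d" name format stands in for the NODENAMENAMASK pattern and is an assumption, as is the helper name). Without sorting, the internal numbering depends on whatever order the names arrive in from the hash iteration; with sorting, the mapping is deterministic and follows the node names.

```python
# Python sketch of _get_system_size_cluster's remapping logic (illustrative,
# not the actual LML code). remap_nodes and the "nd%06d" mask are assumptions.

def remap_nodes(nodes, sort_names=False):
    """nodes: list of (name, ncores) in arbitrary (hash-iteration) order.
    Returns a dict like NODEMAPPING: scheduler name -> internal name."""
    # First loop (lines 416-428): bucket node names by core count.
    nodesizes = {}
    for name, ncores in nodes:
        nodesizes.setdefault(ncores, []).append(name)

    # Second loop (lines 433-443): walk buckets by increasing core count
    # and hand out sequential internal node numbers.
    mapping = {}
    numnodes = 0
    for ncores in sorted(nodesizes):
        names = nodesizes[ncores]
        if sort_names:  # the proposed fix at line 434
            names = sorted(names)
        for name in names:
            mapping[name] = "nd%06d" % numnodes  # assumed name mask
            numnodes += 1
    return mapping

# Simulate the arbitrary order Perl's each() can produce.
nodes = [("nd000003", 4), ("nd000001", 4), ("nd000002", 4)]

# Unsorted: nd000003 happens to become internal node 0.
print(remap_nodes(nodes))
# Sorted: internal numbering follows the scheduler's node names.
print(remap_nodes(nodes, sort_names=True))
```

With the sort in place, nd000001 always maps to the first internal node, no matter what order the hash iteration delivered the names in.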
IMHO, the PTP display should reflect the logical node numbering used by the machine's scheduler, not some random ordering.
Reported by Simon Wail