Re: [ptp-user] System Monitoring ** System Type == IBM LoadLeveler

Dear Christoph,

you are right: the monitoring server scripts cannot map jobs to their
compute nodes, because the nodes are registered under fully qualified
names like "p076.dkrz.de", while most "nodelist" attributes use only the
first token, such as "p076".
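
To illustrate the mismatch, here is a minimal Python sketch. It assumes
that comparing only the first dot-separated token of each name is an
acceptable way to relate the two forms. The helper short_name is
hypothetical and is not part of the PTP server scripts.

```python
def short_name(node_id: str) -> str:
    """Return the first dot-separated token of a node name."""
    return node_id.split(".", 1)[0]


# Values taken from this thread: the node table uses the long form,
# most nodelist attributes use the short form.
node_table_id = "p076.dkrz.de"
nodelist_name = "p076"

# A direct comparison fails, which is why the job cannot be mapped.
direct_match = node_table_id == nodelist_name

# Comparing the short forms of both names succeeds.
normalized_match = short_name(node_table_id) == short_name(nodelist_name)

print(direct_match, normalized_match)
```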

The fastest way to teach the monitoring server the correct names is the
following:

1) Go to the .eclipsesettings folder on your target system
2) Create the file .LML_da_options with the following line as its content:
nocheckrequest=1

3) Replace the file samples/layout_default.xml with the layout.xml file
from one of the tmp_<hostname>_<pid> directories that were produced
while keeptmp=1 was set

4) Edit the samples/layout_default.xml file: replace the entire
nodedisplay element

<nodedisplay id="nd_1" title="system: blizzard2">
...
</nodedisplay>

with

<nodedisplay id="nd_1" title="system: blizzard2">
    <scheme>
        <el1 tagname="node" min="0" max="253" mask="p%03d">
            <el2 tagname="core" min="0" max="63" mask="-c%02d">
            </el2>
        </el1>
    </scheme>
    <data>
        <el1>
        </el1>
    </data>
</nodedisplay>
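
As a rough illustration of which names such a scheme describes, the
following Python sketch expands the two masks. It assumes that min and
max are inclusive, that the masks are printf-style, and that the el2
name is appended to the el1 name. These are assumptions made for
illustration, and expand_scheme is a hypothetical helper, not a function
of the server scripts.

```python
def expand_scheme(node_min, node_max, node_mask, core_min, core_max, core_mask):
    """Yield every name produced by a two-level scheme with printf-style masks."""
    for node in range(node_min, node_max + 1):
        for core in range(core_min, core_max + 1):
            # Assumption: the core mask is appended to the node mask.
            yield (node_mask % node) + (core_mask % core)


# Attribute values copied from the scheme above.
names = set(expand_scheme(0, 253, "p%03d", 0, 63, "-c%02d"))

print(len(names))
print("p076-c32" in names)
print("p076.dkrz.de-c32" in names)
```

Under these assumptions only short names such as p076-c32 are generated,
and none of them carries the domain suffix.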

5) Refresh the monitoring perspective in Eclipse

6) As soon as the system monitoring perspective works, you can delete
the .LML_da_options file.
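
Steps 2 and 6 can be sketched in Python as follows. The helper names
write_option and remove_options are hypothetical, and the example runs
in a temporary directory instead of the real .eclipsesettings folder, so
it does not touch an actual installation.

```python
import tempfile
from pathlib import Path

OPTIONS_FILE = ".LML_da_options"


def write_option(settings_dir: Path, line: str) -> Path:
    """Create the options file with a single line of content (step 2)."""
    path = settings_dir / OPTIONS_FILE
    path.write_text(line + "\n")
    return path


def remove_options(settings_dir: Path) -> None:
    """Delete the options file if it exists (step 6)."""
    path = settings_dir / OPTIONS_FILE
    if path.exists():
        path.unlink()


# Demonstrated in a throwaway directory so that nothing real is changed.
with tempfile.TemporaryDirectory() as tmp:
    settings = Path(tmp)
    created = write_option(settings, "nocheckrequest=1")
    content = created.read_text()
    remove_options(settings)
    still_there = created.exists()
```

On the target system, settings_dir would be the .eclipsesettings folder
from step 1.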

Note that this procedure is only a workaround. We are planning an
enhancement for PTP that will allow this layout file to be configured
directly from the Eclipse client.

Moreover, I found a couple of jobs whose nodelist attributes contain
fully qualified names (e.g. the job with id j000160 in your tmp
directory example). So a few jobs list fully qualified node names as
their compute nodes, while most jobs use the short node names. With the
adapted layout file, the jobs with fully qualified node names will no
longer be mapped correctly. In order to debug this issue, I would need
the output of the following commands on your target system:

llq -l
llstatus -l

It is possible that the server scripts have to be adapted for this
target system, since the command-line output might differ from that of
other LoadLeveler versions.

Best regards,

Carsten


On 11/10/12 14:06, Christoph Pospiech wrote:
Once again, with correct subject line.

On Wednesday, November 07, 2012 12:00:09
Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:

> Dear Christoph,
>
> this sounds like the server part of PTP's monitoring system is unable to
> map the running jobs to the compute nodes. By clicking on a job all
> nodes are grayed out, which do not belong to this job. This might look
> like all nodes are highlighted, since there are no compute nodes mapped
> to any job.
>
> To get more information about reasons for this behavior you can try the
> following:
>
> 1. On the remote machine, go to the ".eclipsesettings" directory,
> located in your home directory
> 2. Create a file called ".LML_da_options" containing a single line
> "keeptmp=1" (no quotes).
> 3. Restart the monitor.
> 4. You should now find a directory called "tmp_<hostname>_<pid>" in the
> ".eclipsesettings" directory. It should contain an error log file, plus
> a bunch of other files. Check these files to see if you can see the
> cause of the error.
> 5. Remember to remove the ".LML_da_options" file once you have finished.
>
> Best regards,
>
> Carsten

Dear Carsten,

it looks as if your guess is correct. LML_da.errlog shows many error
messages like this.

insert_job_into_nodedisplay: Error: could not map node p076-c32
insert_job_into_nodedisplay: Error: could not map node p118-c00

In file jobs_LML.xml, I can find jobs in status running with a list of
nodes specified, for example the following XML stanza.

<info oid="j000076" type="short">
  <data key="queue"          value="cluster"/>
  <data key="dispatchdate"   value="Sat Nov 10 05:29:37 CET 2012"/>
  <data key="favored"        value="No"/>
  <data key="name"           value="o3so2000"/>
  <data key="step"           value="pio01.dkrz.de.2809055.0"/>
  <data key="group"          value="mh0469"/>
  <data key="owner"          value="m214074"/>
  <data key="queuedate"      value="Sat Nov 10 05:29:12 CET 2012"/>
  <data key="restart"        value="yes"/>
  <data key="state"          value="Running"/>
  <data key="nodelist"       value="(p076,32)(p076,33)(p076,34)(p076,35)
(p076,36)(p076,37)(p076,38)(p076,39)(p076,40)(p076,41)(p076,42)(p076,43)
(p076,44)(p076,45)(p076,46)(p076,47)(p076,48)(p076,49)(p076,50)(p076,51)
(p076,52)(p076,53)(p076,54)(p076,55)(p076,56)(p076,57)(p076,58)(p076,59)
(p076,60)(p076,61)(p076,62)(p076,63)(p076,64)(p076,65)(p076,66)(p076,67)
(p076,68)(p076,69)(p076,70)(p076,71)(p076,72)(p076,73)(p076,74)(p076,75)
(p076,76)(p076,77)(p076,78)(p076,79)(p076,80)(p076,81)(p076,82)(p076,83)
(p076,84)(p076,85)(p076,86)(p076,87)(p076,88)(p076,89)(p076,90)(p076,91)
(p076,92)(p076,93)(p076,94)(p076,95)"/>
  <data key="wall"           value="28800"/>
  <data key="wallsoft"       value="28800"/>
  <data key="classprio"      value="50"/>
  <data key="groupprio"      value="50"/>
  <data key="status"         value="RUNNING"/>
  <data key="totalcores"     value="64"/>
  <data key="totaltasks"     value="64"/>
</info>
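
For reference, a nodelist value of this shape can be split into
node/core pairs with a short Python sketch. It assumes that each pair
has the form "(node,core)" and that the name in the error log is built
as node, "-c", and a two-digit core index, as in "p076-c32". The helpers
parse_nodelist and display_name are hypothetical and are not taken from
the server scripts.

```python
import re

# Matches one "(node,core)" pair; the node may be short or fully qualified.
PAIR = re.compile(r"\(([^,()\s]+),(\d+)\)")


def parse_nodelist(value: str):
    """Return (node, core) tuples from a nodelist attribute value."""
    return [(node, int(core)) for node, core in PAIR.findall(value)]


def display_name(node: str, core: int) -> str:
    """Build a name in the form seen in LML_da.errlog, e.g. p076-c32."""
    return "%s-c%02d" % (node, core)


# Shortened sample in the same shape as the stanza, line break included.
sample = "(p076,32)(p076,33)\n(p076,34)"
pairs = parse_nodelist(sample)
names = [display_name(node, core) for node, core in pairs]

print(names)
```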

In file nodes_LML.xml, the nodes are known by their long name, like this
(taking p076, as it is mentioned in the previous XML stanza).

<info oid="nd000076" type="short">
  <data key="ncores"         value="64"/>
  <data key="availmem"       value="38295 mb"/>
  <data key="physmem"        value="124160 mb"/>
  <data key="state"          value="Busy"/>
  <data key="id"             value="p076.dkrz.de"/>
</info>

Could it be that this is a p076 vs. p076.dkrz.de issue?

At any rate, I wrapped the whole directory
/pf/k/k205001/.eclipsesettings/tmp_blizzard2_29360720 into a tar.bz2
file and placed it on juqueen.fz-juelich.de:/homec/ibm/pospiech

pospiech@juqueen2:~ $ pwd
/homec/ibm/pospiech
pospiech@juqueen2:~ $ ls -l tmp_blizzard2_29360720.tar.bz2
-rw-r--r-- 1 pospiech apache 124368 Nov 10 13:55 tmp_blizzard2_29360720.tar.bz2

Can you please have a look? Thanks!

--

Mit freundlichen Grüßen / Kind regards

Dr. Christoph Pospiech
High Performance & Parallel Computing
Phone: +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospiech@xxxxxxxxxx

-------------------------------------------------------------------------------------------------------------------------------------------

IBM Deutschland GmbH / Chairman of the Supervisory Board: Martin Jetter
Management: Martina Koederitz (Chair), Reinhard Reschke,
Dieter Scholz, Gregor Pillen, Joachim Heel, Christian Noll
Registered office: Ehningen / Court of registration: Amtsgericht
Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940




------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Amtsgericht Dueren, No. HR B 3498
Chairman of the Supervisory Board: MinDir Dr. Karl Eugen Huthmacher
Board of Directors: Prof. Dr. Achim Bachem (Chairman),
Karsten Beneke (Vice Chairman), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
