Re: [ptp-user] System Monitoring ** System Type == IBM LoadLeveler

 

Once again, with correct subject line.

 

On Wednesday, November 07, 2012 12:00:09

Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:

> Dear Christoph,

>

> this sounds like the server part of PTP's monitoring system is unable to

> map the running jobs to the compute nodes. Clicking on a job grays out

> all nodes that do not belong to it. Since no compute nodes are mapped

> to any job, this might look as if all nodes were highlighted.

>

> To get more information about reasons for this behavior you can try the

> following:

>

> 1. On the remote machine, go to the ".eclipsesettings" directory,

> located in your home directory

> 2. Create a file called ".LML_da_options" containing a single line

> "keeptmp=1" (no quotes).

> 3. Restart the monitor.

> 4. You should now find a directory called "tmp_<hostname>_<pid>" in the

> ".eclipsesettings" directory. It should contain an error log file, plus

> a bunch of other files. Check these files to see if you can see the

> cause of the error.

> 5. Remember to remove the ".LML_da_options" file once you have finished.

>

> Best regards,

>

> Carsten

 

Dear Carsten,

It looks as if your guess is correct. LML_da.errlog shows many error messages like these:

insert_job_into_nodedisplay: Error: could not map node p076-c32

insert_job_into_nodedisplay: Error: could not map node p118-c00
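
To see which node names fail to map, and how often, the log can be summarized with a few lines of Python. This is only a sketch: the pattern is taken from the two messages quoted above, other message formats are ignored, and the function name `unmapped_nodes` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the two log lines quoted above (assumption: all
# relevant messages follow exactly this format).
UNMAPPED = re.compile(
    r"insert_job_into_nodedisplay: Error: could not map node (\S+)"
)

def unmapped_nodes(log_text):
    """Count how often each node name appears in 'could not map node' messages."""
    return Counter(UNMAPPED.findall(log_text))

# Sample input: the two lines from LML_da.errlog shown above.
sample = """\
insert_job_into_nodedisplay: Error: could not map node p076-c32
insert_job_into_nodedisplay: Error: could not map node p118-c00
"""

counts = unmapped_nodes(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```

For the real file, the text of LML_da.errlog from the tmp directory would be passed in instead of `sample`.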

In file jobs_LML.xml, I can find jobs in state Running with a node list specified, for example in the following XML stanza:

<info oid="j000076" type="short">

 <data key="queue"          value="cluster"/>

 <data key="dispatchdate"   value="Sat Nov 10 05:29:37 CET 2012"/>

 <data key="favored"        value="No"/>

 <data key="name"           value="o3so2000"/>

 <data key="step"           value="pio01.dkrz.de.2809055.0"/>

 <data key="group"          value="mh0469"/>

 <data key="owner"          value="m214074"/>

 <data key="queuedate"      value="Sat Nov 10 05:29:12 CET 2012"/>

 <data key="restart"        value="yes"/>

 <data key="state"          value="Running"/> 

 <data key="nodelist"       value="(p076,32)(p076,33)(p076,34)(p076,35)

(p076,36)(p076,37)(p076,38)(p076,39)(p076,40)(p076,41)(p076,42)(p076,43)

(p076,44)(p076,45)(p076,46)(p076,47)(p076,48)(p076,49)(p076,50)(p076,51)

(p076,52)(p076,53)(p076,54)(p076,55)(p076,56)(p076,57)(p076,58)(p076,59)

(p076,60)(p076,61)(p076,62)(p076,63)(p076,64)(p076,65)(p076,66)(p076,67)

(p076,68)(p076,69)(p076,70)(p076,71)(p076,72)(p076,73)(p076,74)(p076,75)

(p076,76)(p076,77)(p076,78)(p076,79)(p076,80)(p076,81)(p076,82)(p076,83)

(p076,84)(p076,85)(p076,86)(p076,87)(p076,88)(p076,89)(p076,90)(p076,91)

(p076,92)(p076,93)(p076,94)(p076,95)"/>

 <data key="wall"           value="28800"/>

 <data key="wallsoft"       value="28800"/>

 <data key="classprio"      value="50"/>

 <data key="groupprio"      value="50"/>

 <data key="status"         value="RUNNING"/>

 <data key="totalcores"     value="64"/>

 <data key="totaltasks"     value="64"/>

</info>

In file nodes_LML.xml, however, the nodes are listed under their long names. Here is the entry for p076, the node mentioned in the previous XML stanza:

<info oid="nd000076" type="short">

 <data key="ncores"         value="64"/>

 <data key="availmem"       value="38295 mb"/>

 <data key="physmem"        value="124160 mb"/>

 <data key="state"          value="Busy"/>

 <data key="id"             value="p076.dkrz.de"/>

</info>

Could this be a mismatch between the short host name and the fully qualified one (p076 vs. p076.dkrz.de)?
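
One way to test this hypothesis is to compare the host names in a job's nodelist with the ids in nodes_LML.xml. The following Python sketch is not part of PTP or LML_da. The helper names are hypothetical, and the assumption that every nodelist entry has the form (host,core) is inferred only from the stanzas above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helpers for illustration; not part of PTP or LML_da.

def data_value(info, key):
    """Return the value of the <data> child with the given key, or None."""
    for data in info.findall("data"):
        if data.get("key") == key:
            return data.get("value")
    return None

def nodelist_hosts(nodelist):
    """Extract distinct host names from a string like '(p076,32)(p076,33)'.
    Assumes each entry is '(host,core)', as in the job stanza above."""
    return set(re.findall(r"\(([^,()\s]+),\d+\)", nodelist))

def classify(job_hosts, node_ids):
    """Split job hosts into exact matches, short-name-only matches, and unknown."""
    short_names = {node_id.split(".", 1)[0] for node_id in node_ids}
    exact = job_hosts & node_ids
    short_only = (job_hosts & short_names) - exact
    unknown = job_hosts - exact - short_only
    return exact, short_only, unknown

# Abbreviated copies of the two stanzas quoted above.
job = ET.fromstring(
    '<info oid="j000076" type="short">'
    '<data key="nodelist" value="(p076,32)(p076,33) (p076,34)"/>'
    '</info>'
)
node = ET.fromstring(
    '<info oid="nd000076" type="short">'
    '<data key="id" value="p076.dkrz.de"/>'
    '</info>'
)

job_hosts = nodelist_hosts(data_value(job, "nodelist"))
node_ids = {data_value(node, "id")}
exact, short_only, unknown = classify(job_hosts, node_ids)
print("exact:", sorted(exact))
print("short name only:", sorted(short_only))
print("unknown:", sorted(unknown))
```

Hosts that land under "short name only" are named differently in the two files, which would be consistent with the suspicion above. It does not by itself show that this is what LML_da trips over.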

At any rate, I packed the whole directory

/pf/k/k205001/.eclipsesettings/tmp_blizzard2_29360720 into a tar.bz2 file and

placed it on juqueen.fz-juelich.de:/homec/ibm/pospiech:

pospiech@juqueen2:~ $ pwd

/homec/ibm/pospiech

pospiech@juqueen2:~ $ ls -l tmp_blizzard2_29360720.tar.bz2 

-rw-r--r-- 1 pospiech apache 124368 Nov 10 13:55 tmp_blizzard2_29360720.tar.bz2

Could you please have a look? Thanks!

--

 

Mit freundlichen Grüßen / Kind regards

 

Dr. Christoph Pospiech

High Performance & Parallel Computing

Phone: +49-351 86269826

Mobile: +49-171-765 5871

E-Mail: christoph.pospiech@xxxxxxxxxx

-------------------------------------------------------------------------------------------------------------------------------------------

IBM Deutschland GmbH / Chairman of the Supervisory Board: Martin Jetter

Management: Martina Koederitz (Chair), Reinhard Reschke, Dieter Scholz, Gregor Pillen, Joachim Heel, Christian Noll

Registered office: Ehningen / Court of registration: Amtsgericht Stuttgart, HRB 14562 / WEEE Reg. No. DE 99369940

 

