Re: [ptp-user] ptp-user Digest, Vol 72, Issue 5

On Wednesday, November 07, 2012 12:00:09 Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:

> Dear Christoph,
>
> this sounds like the server part of PTP's monitoring system is unable to
> map the running jobs to the compute nodes. Clicking on a job grays out all
> nodes that do not belong to that job. This might look as if all nodes are
> highlighted, since no compute nodes are mapped to any job.
>
> To get more information about the reasons for this behavior, you can try
> the following:
>
> 1. On the remote machine, go to the ".eclipsesettings" directory,
>    located in your home directory.
> 2. Create a file called ".LML_da_options" containing a single line
>    "keeptmp=1" (no quotes).
> 3. Restart the monitor.
> 4. You should now find a directory called "tmp_<hostname>_<pid>" in the
>    ".eclipsesettings" directory. It should contain an error log file, plus
>    a bunch of other files. Check these files to see if you can find the
>    cause of the error.
> 5. Remember to remove the ".LML_da_options" file once you have finished.
>
> Best regards,
>
> Carsten

 

Dear Carsten,

 

It looks as if your guess is correct. LML_da.errlog contains many error messages like these:

insert_job_into_nodedisplay: Error: could not map node p076-c32

insert_job_into_nodedisplay: Error: could not map node p118-c00
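
To see which hosts are affected, the log can be summarised per host. The following is only a rough sketch: the pattern is taken from the two messages quoted above, the reading of the "-c<NN>" suffix as a core index is my assumption (it matches the "(p076,32)" entries in the job's nodelist), and the names UNMAPPED and unmapped_hosts are made up for illustration.

```python
import re
from collections import Counter

# Pattern copied from the two quoted log lines; other message formats
# that LML_da.errlog may contain are not handled here.
UNMAPPED = re.compile(
    r"insert_job_into_nodedisplay: Error: could not map node (\S+)"
)


def unmapped_hosts(log_text):
    """Count 'could not map node' messages per host.

    Assumption: names look like '<host>-c<core>', e.g. 'p076-c32'.
    """
    counts = Counter()
    for match in UNMAPPED.finditer(log_text):
        # Strip the assumed '-c<core>' suffix to get the host part.
        host = match.group(1).rsplit("-c", 1)[0]
        counts[host] += 1
    return counts


sample = """\
insert_job_into_nodedisplay: Error: could not map node p076-c32
insert_job_into_nodedisplay: Error: could not map node p118-c00
"""

for host, count in sorted(unmapped_hosts(sample).items()):
    print(host, count)
```

For the real log, one would read the contents of LML_da.errlog into a string and pass that in place of the sample text.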

 

In jobs_LML.xml, I can find jobs in state Running with a list of nodes specified, for example in the following XML stanza:

<info oid="j000076" type="short">
<data key="queue" value="cluster"/>
<data key="dispatchdate" value="Sat Nov 10 05:29:37 CET 2012"/>
<data key="favored" value="No"/>
<data key="name" value="o3so2000"/>
<data key="step" value="pio01.dkrz.de.2809055.0"/>
<data key="group" value="mh0469"/>
<data key="owner" value="m214074"/>
<data key="queuedate" value="Sat Nov 10 05:29:12 CET 2012"/>
<data key="restart" value="yes"/>
<data key="state" value="Running"/>
<data key="nodelist" value="(p076,32)(p076,33)(p076,34)(p076,35)(p076,36)(p076,37)(p076,38)(p076,39)(p076,40)(p076,41)(p076,42)(p076,43)(p076,44)(p076,45)(p076,46)(p076,47)(p076,48)(p076,49)(p076,50)(p076,51)(p076,52)(p076,53)(p076,54)(p076,55)(p076,56)(p076,57)(p076,58)(p076,59)(p076,60)(p076,61)(p076,62)(p076,63)(p076,64)(p076,65)(p076,66)(p076,67)(p076,68)(p076,69)(p076,70)(p076,71)(p076,72)(p076,73)(p076,74)(p076,75)(p076,76)(p076,77)(p076,78)(p076,79)(p076,80)(p076,81)(p076,82)(p076,83)(p076,84)(p076,85)(p076,86)(p076,87)(p076,88)(p076,89)(p076,90)(p076,91)(p076,92)(p076,93)(p076,94)(p076,95)"/>
<data key="wall" value="28800"/>
<data key="wallsoft" value="28800"/>
<data key="classprio" value="50"/>
<data key="groupprio" value="50"/>
<data key="status" value="RUNNING"/>
<data key="totalcores" value="64"/>
<data key="totaltasks" value="64"/>
</info>

 

In nodes_LML.xml, however, the nodes are known by their long names, like this (taking p076, since it appears in the previous XML stanza):

<info oid="nd000076" type="short">
<data key="ncores" value="64"/>
<data key="availmem" value="38295 mb"/>
<data key="physmem" value="124160 mb"/>
<data key="state" value="Busy"/>
<data key="id" value="p076.dkrz.de"/>
</info>

 

Could it be that this is a short name vs. fully qualified name issue (p076 vs. p076.dkrz.de)?
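
To illustrate what I mean, here is a small sketch that compares the host names in a job's nodelist with a node id, once exactly and once after dropping the domain part. It uses shortened copies of the two stanzas above. The helper names data_value and job_hosts are made up, and I do not know whether LML_da compares names this way internally; this only demonstrates the suspected mismatch.

```python
import re
import xml.etree.ElementTree as ET


def data_value(info_xml, key):
    """Return the value of the <data key=...> child of an <info> stanza."""
    root = ET.fromstring(info_xml)
    for data in root.findall("data"):
        if data.get("key") == key:
            return data.get("value")
    return None


def job_hosts(nodelist):
    """Extract host names from a nodelist such as '(p076,32)(p076,33)'."""
    return set(re.findall(r"\(([^,()]+),\d+\)", nodelist))


# Shortened copies of the stanzas from jobs_LML.xml and nodes_LML.xml.
job_xml = (
    '<info oid="j000076" type="short">'
    '<data key="nodelist" value="(p076,32)(p076,33)"/>'
    "</info>"
)
node_xml = (
    '<info oid="nd000076" type="short">'
    '<data key="id" value="p076.dkrz.de"/>'
    "</info>"
)

hosts = job_hosts(data_value(job_xml, "nodelist"))
node_id = data_value(node_xml, "id")

exact_match = node_id in hosts                     # long name as-is
short_match = node_id.split(".", 1)[0] in hosts    # domain part dropped

print("exact:", exact_match, "short:", short_match)
```

With these two stanzas the exact comparison fails and the short-name comparison succeeds, which would be consistent with the "could not map node" errors.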

 

At any rate, I packed the whole directory /pf/k/k205001/.eclipsesettings/tmp_blizzard2_29360720 into a tar.bz2 file and placed it on juqueen.fz-juelich.de:/homec/ibm/pospiech:

 

pospiech@juqueen2:~ $ pwd
/homec/ibm/pospiech
pospiech@juqueen2:~ $ ls -l tmp_blizzard2_29360720.tar.bz2
-rw-r--r-- 1 pospiech apache 124368 Nov 10 13:55 tmp_blizzard2_29360720.tar.bz2

 

Can you please have a look? Thanks!

--

 

Mit freundlichen Grüßen / Kind regards

 

Dr. Christoph Pospiech

High Performance & Parallel Computing

Phone: +49-351 86269826

Mobile: +49-171-765 5871

E-Mail: christoph.pospiech@xxxxxxxxxx


 

