Re: [ptp-user] Eclipse PTP on K - monitoring / Layout

Dear Peter,

find my answers below:

On 01/27/15 10:30, Peter Bryzgalov wrote:
Dear Carsten,
Thank you for the detailed explanation.
But I still have questions that I cannot find answers to in the documentation.

1. Custom layout for FX10 system.

The problem I have is that node names don’t map well to hierarchy units.
We have a 2-level hierarchy: “Tofu units” and “boards”.

Nodes are grouped like the following:

Is it possible to create a schemehint that groups nodes like this?
The problem is that to tell which Tofu unit a node belongs to, we need to
mask the last two characters of the node name. But then we have no
characters left for the board-level mask.

Is there some kind of pre-mapping step for node names? Is it possible to
change node names before they are mapped inside the schemehint?

Unfortunately, no: there is no pre-mapping available that you could use at this point. You would have to adjust the LML_DA scripts, which gather node and job information, and map the real hardware names to logical names there.

As the described naming scheme does not map to the hierarchy (Tofu-boards-cores), you can only use a two-level hierarchy of nodes and cores if you want to stick with the real hardware names. If you map the hardware names to logical names in LML_DA, you will lose the reference to the real hardware names. So far, we do not show any additional information on the nodes except for their names. Thus, when you change node names to logical names, you can no longer see the hardware names in the monitoring perspective.
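
As a rough illustration of such a renaming step, here is a minimal Python sketch. It is not part of LML_DA (whose scripts are written in Perl), and everything in it is an assumption made for illustration: the lookup table entries are placeholders, the logical name format "tofuNN-bNN-nNN" is made up, and the real grouping of nodes into Tofu units and boards has to come from the site. The reverse table shows one way to keep the hardware names recoverable outside the monitoring data.

```python
# Hypothetical lookup table: real hardware name -> (tofu unit, board, node).
# The entries are placeholders; the real grouping must come from the site.
HW_TO_POSITION = {
    "0xFF010000": (0, 0, 0),
    "0xFF010001": (0, 0, 1),
    "0xFF010002": (0, 1, 0),
}


def logical_name(hw_name):
    """Return a logical name whose parts line up with the layout levels."""
    unit, board, node = HW_TO_POSITION[hw_name]
    return "tofu%02d-b%02d-n%02d" % (unit, board, node)


# Reverse table, kept outside the monitoring data, to recover hardware names.
LOGICAL_TO_HW = {logical_name(hw): hw for hw in HW_TO_POSITION}

print(logical_name("0xFF010002"))       # tofu00-b01-n00
print(LOGICAL_TO_HW["tofu00-b01-n00"])  # 0xFF010002
```

With names of this shape, each hierarchy level has its own fixed-width field that a mask can address.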


2. The documentation says that “In the latest Parallel Tools
Platform (PTP) build it is also allowed to embed a customized LML layout
into the monitor-data element of the target system configuration”.
Does this mean that if we include the layout in the TSC like this, we don't
need a layout on the server (supercomputer)? If we have layouts defined both in
the TSC and on the server, which one takes precedence?

Embedding layout into TSC doesn’t work for me.

I have:
Eclipse for Parallel Application Developers
Version: Luna Service Release 1 (4.4.1)
Build id: 20140925-1800

I guess you already know how to add a customized TSC, as described at http://wiki.eclipse.org/PTP/FAQ#Q:_How_do_I_customize_an_existing_Target_System_Configuration.3F

Yes, if you include the layout in the TSC, no additional layout is needed on the remote system. Can you give me more information on how embedding the layout fails? As a starting point, have a look at http://git.eclipse.org/c/ptp/org.eclipse.ptp.git/tree/rms/org.eclipse.ptp.rm.jaxb.contrib/data/de.fz-juelich.juqueen.ll_bg.xml, which contains a layout definition.

In general, the layout in the TSC takes precedence over the layout file placed as the default layout on the remote system. However, I do not know how you provide your layout file to LML_DA on the remote system.



3. How can I include jobs into tl_WAIT table? They all (including queued
ones) appear in tl_Run table.

To do this, use the pattern element in the column with key="status" within your tablelayout, e.g. for the running jobs table use:

<column cid="8" pos="7" width="0.3" active="true" key="status">
 <pattern>
  <select rel="=" value="RUNNING" />
 </pattern>
</column>

For the waiting jobs table use:
<column cid="8" pos="7" width="0.3" active="true" key="status">
 <pattern>
  <select rel="!=" value="RUNNING" />
 </pattern>
</column>

This should filter the jobs according to their status attribute. If that does not work, check the status attributes of the job data produced by LML_DA.
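
To make the intended effect of the two patterns concrete, here is a small Python sketch that reads a column definition like the ones above and applies its select rules to a list of jobs. It only illustrates the filtering logic and is not how PTP implements it. The function name, the second job and its "SUBMITTED" status are made up for the example, and only the "=" and "!=" relations are handled.

```python
import xml.etree.ElementTree as ET

COLUMN = """<column cid="8" pos="7" width="0.3" active="true" key="status">
 <pattern>
  <select rel="!=" value="RUNNING" />
 </pattern>
</column>"""


def column_filter(column_xml):
    """Build a predicate over job dicts from the column's <select> rules."""
    column = ET.fromstring(column_xml)
    key = column.get("key")
    rules = [(s.get("rel"), s.get("value")) for s in column.iter("select")]

    def accept(job):
        value = job.get(key)
        for rel, expected in rules:
            if rel == "=" and value != expected:
                return False
            if rel == "!=" and value == expected:
                return False
        return True

    return accept


jobs = [
    {"step": "2696094.jj28b01", "status": "RUNNING"},
    {"step": "example.1", "status": "SUBMITTED"},  # made-up waiting job
]
waiting = [j["step"] for j in jobs if column_filter(COLUMN)(j)]
print(waiting)  # ['example.1']
```

If every job passes the "=" rule for RUNNING, the status values in the job data are the first thing to inspect.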

Best regards,

Carsten



Kind regards,
Peter


On 23 Jan, 2015, at 19:43, Carsten Karbach <c.karbach@xxxxxxxxxxxxx
<mailto:c.karbach@xxxxxxxxxxxxx>> wrote:

Dear Peter,

a pure layout file does not need to include the nodedisplay element,
but only the nodedisplaylayout element. In the generated LML file,
which is filled with data, nodedisplay/scheme and
nodedisplaylayout/schemehint will be identical. You should use the
nodedisplaylayout element to define the layout of your nodedisplay.

The same applies to table and tablelayout. In your layout file you
should only use the tablelayout element. Unfortunately, the column
names cannot be changed. The values of the key attribute need to match
the job data generated by LML_da. E.g. for

<info oid="j000116" type="short">
<data key="name"           value="my job"/>
<data key="owner"          value="karbach"/>
<data key="totalcores"     value="8"/>
<data key="group"          value="unknown"/>
...
</info>

possible values for the tablelayout key attributes are name, owner,
totalcores and so forth.
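
A quick way to check which key values are available is to list the data keys of one job entry. The following Python sketch does this for the excerpt above (without the elided lines) and reports column keys that have no matching job data. The function names and the "user" key are made up for the example.

```python
import xml.etree.ElementTree as ET

INFO = """<info oid="j000116" type="short">
<data key="name"           value="my job"/>
<data key="owner"          value="karbach"/>
<data key="totalcores"     value="8"/>
<data key="group"          value="unknown"/>
</info>"""


def available_keys(info_xml):
    """Return the key attributes of all <data> children of an <info> element."""
    return [d.get("key") for d in ET.fromstring(info_xml).findall("data")]


def unknown_columns(column_keys, info_xml):
    """Return the column keys that do not occur in the job data."""
    known = set(available_keys(info_xml))
    return [k for k in column_keys if k not in known]


print(available_keys(INFO))                      # ['name', 'owner', 'totalcores', 'group']
print(unknown_columns(["owner", "user"], INFO))  # ['user']
```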

The tablelayout only allows changing the order and widths of columns.

Best regards,

Carsten

On 01/22/15 07:06, Peter Bryzgalov wrote:
Hi,

I have some questions about the layout file. I attach a screenshot of my
layout source file with questions.

Kind regards,
Peter


On 20 Jan, 2015, at 21:21, Carsten Karbach <c.karbach@xxxxxxxxxxxxx
<mailto:c.karbach@xxxxxxxxxxxxx>> wrote:

Dear Peter,

the scripts running on your remote system should generate LML files
for jobs and nodes. You can look into these files following the
instructions given at
https://wiki.eclipse.org/PTP/System_Monitoring_FAQ#Q:_How_do_I_debug_the_server_part_of_PTP.27s_system_monitoring_capability.3F

Jobs are mapped to the nodes through the following procedure:
A running job should have a nodelist or vnodelist attribute. Here is
an example for a job on a cluster:
<info oid="j000116" type="short">
<data key="name"           value="my job"/>
<data key="owner"          value="karbach"/>
<data key="totalcores"     value="8"/>
<data key="nodelist"
value="(jj13c41,7)(jj13c41,6)(jj13c41,5)(jj13c41,4)(jj13c41,3)(jj13c41,2)(jj13c41,1)(jj13c41,0)"/>
<data key="group"          value="unknown"/>
<data key="state"          value="Running"/>
<data key="ppn"            value="8"/>
<data key="queuedate"      value="Sun Jun  2 16:03:21 2013"/>
<data key="queue"          value="jsc"/>
<data key="spec"           value="1:ppn=8"/>
<data key="dispatchdate"   value="Tue Jun  4 05:47:43 2013"/>
<data key="status"         value="RUNNING"/>
<data key="step"           value="2696094.jj28b01"/>
<data key="totaltasks"     value="8"/>
</info>

The nodelist attribute contains one entry for each used core. In the
above example the job uses cores 0 to 7 on the node jj13c41. Thus,
the nodelist attribute has the form (<nodename>,<core-id>)*.
The vnodelist is a shorter form, which only lists how many cores are
used on each node. For this example the equivalent vnodelist attribute
would be "(jj13c41,8)".
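
The relation between the two forms can be sketched in a few lines of Python. This is only an illustration of the formats described above, not code from LML_da, and the function names are made up.

```python
import re
from collections import OrderedDict


def parse_nodelist(nodelist):
    """Return (node, core_id) pairs from a string such as "(jj13c41,7)(jj13c41,6)"."""
    pairs = re.findall(r"\(([^,()]+),(\d+)\)", nodelist)
    return [(node, int(core)) for node, core in pairs]


def to_vnodelist(nodelist):
    """Collapse a per-core nodelist into the shorter per-node form."""
    counts = OrderedDict()
    for node, _core in parse_nodelist(nodelist):
        counts[node] = counts.get(node, 0) + 1
    return "".join("({},{})".format(node, n) for node, n in counts.items())


# The nodelist of the example job: cores 7 down to 0 on node jj13c41.
nodelist = "".join("(jj13c41,{})".format(c) for c in range(7, -1, -1))
print(to_vnodelist(nodelist))  # (jj13c41,8)
```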

When you concatenate the mask attributes in your nodedisplay layout
(in your example "0xFF0100%01d%01x"), you have to be able to generate
all node names by using this format string in a "printf" command.
Unfortunately, the hexadecimal format is not supported. To use
hexadecimal numbers, replace mask="%01x" with
map="0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f".
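
To check a layout by hand, you can generate the node names it describes and compare them with the names in the nodes LML file. The Python sketch below does this for the two levels of the example. It is an illustration only: the function name is made up, I assume that the n-th entry of the map belongs to the n-th id counted from min, and I use max=15 for the node level because the map has 16 entries.

```python
def level_names(min_id, max_id, mask=None, name_map=None):
    """Names for one layout level, from a printf-style mask or a comma-separated map."""
    if name_map is not None:
        names = name_map.split(",")
        # Assumption: entry i - min_id of the map names the element with id i.
        return [names[i - min_id] for i in range(min_id, max_id + 1)]
    return [mask % i for i in range(min_id, max_id + 1)]


chassis = level_names(0, 6, mask="0xFF0100%01d")
nodes = level_names(0, 15, name_map="0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f")

# Concatenating the parts of each level yields the full node names.
all_names = [c + n for c in chassis for n in nodes]
print(len(all_names), all_names[0], all_names[-1])  # 112 0xFF010000 0xFF01006f
```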

These node names need to be identical to the node names listed in the
nodes LML file. E.g. for the cluster above the nodes LML file contains
the following information for node jj13c41:

<object id="nd002177" name="jj13c41" type="node"/>
...
<info oid="nd002177" type="short">
<data key="availmem"       value="23498124kb"/>
<data key="id"             value="jj13c41"/>
<data key="ncores"         value="16"/>
<data key="physmem"        value="24732484kb"/>
<data key="state"          value="Idle"/>
</info>

The mask/map attributes in your nodedisplay layout are used to
generate regular expressions. LML_da can use them to extract the IDs
on each level from the actual node name in order to map a node to your
layout.
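
The idea can be sketched as follows in Python. This is not the code used by LML_da, only a simplified illustration under my own assumptions: the function names are made up, a mask "%0Nd" is treated as exactly N digits, and a map is treated as a list of alternatives whose position gives the id.

```python
import re

NODE_MAP = "0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f"


def mask_to_regex(mask):
    """Turn a printf-style mask such as "0xFF0100%01d" into a regex with one group."""
    out = []
    for part in re.split(r"(%0?\d*d)", mask):
        m = re.fullmatch(r"%0?(\d*)d", part)
        if m:
            width = m.group(1)
            out.append(r"(\d{%s})" % width if width else r"(\d+)")
        else:
            out.append(re.escape(part))  # literal text between placeholders
    return "".join(out)


def map_to_regex(name_map):
    """Turn a comma-separated map into a group matching any of its entries."""
    return "(" + "|".join(re.escape(n) for n in name_map.split(",")) + ")"


def extract_ids(node_name):
    """Return (chassis id, node id) for a node name, or None if it does not match."""
    pattern = "^" + mask_to_regex("0xFF0100%01d") + map_to_regex(NODE_MAP) + "$"
    m = re.match(pattern, node_name)
    if m is None:
        return None
    return int(m.group(1)), NODE_MAP.split(",").index(m.group(2))


print(extract_ids("0xFF01003a"))  # (3, 10)
```

A node name that does not match the generated expression cannot be placed in the layout, which is one reason why the data element may stay empty.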

In addition to the replacement of the hexadecimal mask with the map
attribute, you also need to add a core-level element, defining how
many cores are configured in each node. LML_da currently always
requires the lowest level to look like the following:

<el3 tagname="core" min="0" max="15" mask="-c%02d"></el3>

Note that the mask attribute cannot be changed here. What you should
adjust are the min/max attributes depending on how many cores are
available in each node.
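
For illustration, the following Python sketch expands such a core-level element into the name parts it describes, so that the effect of the min/max attributes is visible. The function name is made up, and the sketch only mirrors the printf-style expansion described earlier in this mail.

```python
import xml.etree.ElementTree as ET

EL3 = '<el3 tagname="core" min="0" max="15" mask="-c%02d"></el3>'


def core_suffixes(el_xml):
    """Expand the mask of a layout element for every id from min to max."""
    el = ET.fromstring(el_xml)
    lo, hi = int(el.get("min")), int(el.get("max"))
    return [el.get("mask") % i for i in range(lo, hi + 1)]


suffixes = core_suffixes(EL3)
print(len(suffixes), suffixes[0], suffixes[-1])  # 16 -c00 -c15
```

For a node with 8 cores you would set max="7", which yields -c00 to -c07.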

I hope that helps.

Best regards,

Carsten

On 01/20/15 10:46, Peter Bryzgalov wrote:
Hi,

In order to add the monitoring feature to Eclipse PTP on the “K” computer, I am
working on a customised LML layout. For tests I use a smaller sibling of
“K” named “FX10”.

I customised the Perl scripts and the layout description file
(samples/layout_default_PJM.xml), and now in the monitoring perspective
I can see a table with the jobs running on the computer. The node layout
that I created with the nodedisplaylayout tag is displayed correctly.

What I can’t figure out is how to map jobs to the node layout. There
is information on the nodes used by every job in the jobs table, but I’m
not sure if its format is correct.

Here is what I see:
Job 75612 is not displayed on the nodes layout.


In temporary files directory in datastep_LML2LML.xml file I have:

<nodedisplay id="nd_1" title="system: fx02p08">
<scheme>
   <el1 tagname="chassis" min="0" max="6" mask="0xFF0100%01d">
       <el2 tagname="node" min="0" max="16" mask="%01x">
       </el2>
   </el1>
</scheme>
<data>
   <el1 oid="empty" min="0" max="6">
   </el1>
</data>
</nodedisplay>


<data> is empty, and so no job is displayed in the node layout.


By the way, nodes are numbered with hexadecimal numbers. Is it OK to
use mask “%01x”?

Kind regards,
Peter





On 19 Jan, 2015, at 18:13, Carsten Karbach <c.karbach@xxxxxxxxxxxxx
<mailto:c.karbach@xxxxxxxxxxxxx>> wrote:

Dear Peter,

thanks for the hint on the broken link. It was moved to
http://llview.fz-juelich.de/LML/OnlineDocumentation/lmldoc.html. I have
updated the documentation page with the new link. On the LML
documentation page you can find a section about LML layout files, which
is located at
http://llview.fz-juelich.de/LML/OnlineDocumentation/layouts.html.

There are also some layout examples integrated into PTP. You can find
them at
http://git.eclipse.org/c/ptp/org.eclipse.ptp.git/tree/rms/org.eclipse.ptp.rm.jaxb.contrib/data.
E.g. take a look at the configuration files
de.fz-juelich.judge.torque.batch.xml,
de.fz-juelich.juqueen.ll_bg.xml or
de.fz-juelich.juropa.torque.batch.xml. They all contain <monitor-data>
elements, which represent the layout definitions for each site.

See also
https://wiki.eclipse.org/images/7/7e/Carsten-Karbach-31july2013-Slides.pdf
for an introduction to creating your own monitoring layouts.

Regarding your first question: A summary of the monitoring architecture
is given in the presentation here:
https://wiki.eclipse.org/images/d/d0/PTPUserDev2012_Monitoring_Karbach_Frings.pdf.
When you need to develop your own batch system adapter, you basically
have to write scripts for gathering three types of data: jobs, nodes and
global system information. You have to write one script for each of
these types, where each script generates an LML file. All subsequent
steps of LML_da should be handled automatically. Find examples of these
scripts for all supported batch systems at
http://git.eclipse.org/c/ptp/org.eclipse.ptp.git/tree/rms/org.eclipse.ptp.rm.lml.da/rms.

Best regards,

Carsten

On 01/19/15 09:05, Peter Bryzgalov wrote:
Hi,

I am working on adapting Eclipse PTP to the “K” computer and its PJM batch
system made by Fujitsu. I have basic job running and profiling with TAU
features working. Now I'm working on monitoring.
I have only a very basic understanding of the workflow and LML. Where can I
find specifications and examples of what the output files should look like at
every step of the workflow? I also need instructions on creating a
layout definition file.

There is a link on wiki.eclipse.org/PTP/designs/scalability
<http://wiki.eclipse.org/PTP/designs/scalability> to the LML specification.
Unfortunately, the http://llview.zam.kfa-juelich.de site is not working.

Kind regards,
Peter Bryzgalov

RIKEN AICS
HPC Usability Research Team
Research Scientist


_______________________________________________
ptp-user mailing list
ptp-user@xxxxxxxxxxx
To change your delivery options, retrieve your password, or
unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/ptp-user




------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
