Re: [ptp-user] Eclipse PTP on K - monitoring

Dear Carsten,
Thank you for the detailed explanation!
Can't wait to test it tomorrow!

Kind regards,
Peter


> On 20 Jan, 2015, at 21:21, Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:
> 
> Dear Peter,
> 
> the scripts running on your remote system should generate LML files for jobs and nodes. You can look into these files following the instructions given at https://wiki.eclipse.org/PTP/System_Monitoring_FAQ#Q:_How_do_I_debug_the_server_part_of_PTP.27s_system_monitoring_capability.3F
> 
> Jobs are mapped to the nodes through the following procedure:
> A running job should have a nodelist or vnodelist attribute. Here is an example for a job on a cluster:
> <info oid="j000116" type="short">
> <data key="name"           value="my job"/>
> <data key="owner"          value="karbach"/>
> <data key="totalcores"     value="8"/>
> <data key="nodelist" value="(jj13c41,7)(jj13c41,6)(jj13c41,5)(jj13c41,4)(jj13c41,3)(jj13c41,2)(jj13c41,1)(jj13c41,0)"/>
> <data key="group"          value="unknown"/>
> <data key="state"          value="Running"/>
> <data key="ppn"            value="8"/>
> <data key="queuedate"      value="Sun Jun  2 16:03:21 2013"/>
> <data key="queue"          value="jsc"/>
> <data key="spec"           value="1:ppn=8"/>
> <data key="dispatchdate"   value="Tue Jun  4 05:47:43 2013"/>
> <data key="status"         value="RUNNING"/>
> <data key="step"           value="2696094.jj28b01"/>
> <data key="totaltasks"     value="8"/>
> </info>
> 
> The nodelist attribute contains one entry for each used core. In the above example the job uses cores 0 to 7 on node jj13c41. Thus, the nodelist attribute has the form (<nodename>,<core-id>)*.
> The vnodelist is a shorter form that only lists how many cores are used on each node. For this example, the equivalent vnodelist attribute would be "(jj13c41,8)".
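
To check whether a generated nodelist has the expected form, the two formats described above can be compared with a few lines of Python. This is only a minimal sketch: the function names parse_nodelist and to_vnodelist are made up for illustration and are not part of LML_da.

```python
import re
from collections import Counter


def parse_nodelist(nodelist):
    """Return (node_name, core_id) pairs from a string like '(jj13c41,7)(jj13c41,6)'."""
    pairs = re.findall(r"\(([^,()]+),(\d+)\)", nodelist)
    return [(name, int(core)) for name, core in pairs]


def to_vnodelist(nodelist):
    """Collapse a per-core nodelist into the per-node vnodelist form."""
    # Counter keeps the order in which node names first appear.
    counts = Counter(name for name, _ in parse_nodelist(nodelist))
    return "".join(f"({name},{n})" for name, n in counts.items())


nodelist = (
    "(jj13c41,7)(jj13c41,6)(jj13c41,5)(jj13c41,4)"
    "(jj13c41,3)(jj13c41,2)(jj13c41,1)(jj13c41,0)"
)
print(to_vnodelist(nodelist))  # (jj13c41,8)
```

An empty result from parse_nodelist would suggest that the attribute does not follow the (<nodename>,<core-id>)* form.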
> 
> When you concatenate the mask attributes in your nodedisplay layout (in your example "0xFF0100%01d%01x"), you have to be able to generate all node names by using this format string in a "printf" command. Unfortunately, the hexadecimal format (%x) is not supported. To use hexadecimal numbers, replace mask="%01x" with map="0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f".
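
A rough way to see which node names a layout would produce is to expand the levels by hand. The sketch below assumes that a level with a map attribute simply contributes each listed value, and that a level with a mask contributes the printf result for every ID from min to max. The dictionaries and the names level_values and node_names are illustrative only and do not mirror LML_da internals.

```python
from itertools import product

HEX_MAP = "0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f".split(",")


def level_values(level):
    """Expand one layout level into the strings it can contribute to a node name."""
    if "map" in level:
        return list(level["map"])
    # printf-style expansion of the mask for every ID in [min, max]
    return [level["mask"] % i for i in range(level["min"], level["max"] + 1)]


def node_names(levels):
    """Concatenate the per-level strings in every combination."""
    return ["".join(parts) for parts in product(*(level_values(l) for l in levels))]


levels = [
    {"mask": "0xFF0100%01d", "min": 0, "max": 6},  # chassis level
    {"map": HEX_MAP},                              # node level, hex via map
]
names = node_names(levels)
print(len(names), names[0], names[-1])  # 112 0xFF010000 0xFF01006f
```

Every name printed this way should also appear as a node name in the nodes LML file.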
> 
> These node names need to be identical to the node names listed in the nodes LML file. E.g. for the cluster above the nodes LML file contains the following information for node jj13c41:
> 
> <object id="nd002177" name="jj13c41" type="node"/>
> ...
> <info oid="nd002177" type="short">
> <data key="availmem"       value="23498124kb"/>
> <data key="id"             value="jj13c41"/>
> <data key="ncores"         value="16"/>
> <data key="physmem"        value="24732484kb"/>
> <data key="state"          value="Idle"/>
> </info>
> 
> The mask/map attributes in your nodedisplay layout are used to generate regular expressions. LML_da can use them to extract the IDs on each level from the actual node name in order to map a node to your layout.
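
The idea of turning mask/map attributes into regular expressions can be sketched as follows. This is not the LML_da code (which is written in Perl), only an assumption-laden illustration: mask_to_regex handles just the %d and %0Nd conversions, and map_to_regex builds a plain alternation of the listed values.

```python
import re

HEX_MAP = "0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f".split(",")


def mask_to_regex(mask):
    """Turn a printf-style mask such as '-c%02d' into a regex with capture groups."""
    out, pos = [], 0
    for m in re.finditer(r"%(0?)(\d*)d", mask):
        out.append(re.escape(mask[pos:m.start()]))  # literal text before the conversion
        zero_padded, width = m.group(1), m.group(2)
        # Zero-padded fixed width -> exactly that many digits, otherwise any digits.
        out.append(r"(\d{%s})" % width if zero_padded and width else r"(\d+)")
        pos = m.end()
    out.append(re.escape(mask[pos:]))
    return "".join(out)


def map_to_regex(values):
    """A map level matches exactly one of its listed values."""
    return "(" + "|".join(re.escape(v) for v in values) + ")"


pattern = re.compile("^" + mask_to_regex("0xFF0100%01d") + map_to_regex(HEX_MAP) + "$")
match = pattern.match("0xFF01003a")
chassis = int(match.group(1))
node = HEX_MAP.index(match.group(2))  # position in the map is the level ID
print(chassis, node)  # 3 10
```

A node name that does not match the combined pattern cannot be placed in the layout, which is one reason <data> may stay empty.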
> 
> In addition to the replacement of the hexadecimal mask with the map attribute, you also need to add a core-level element, defining how many cores are configured in each node. LML_da currently always requires the lowest level to look like the following:
> 
> <el3 tagname="core" min="0" max="15" mask="-c%02d"></el3>
> 
> Note that the mask attribute cannot be changed here. What you should adjust are the min/max attributes depending on how many cores are available in each node.
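
Since only min/max change at the core level, the element can be derived from the ncores value reported in the nodes LML file. A small sketch, assuming cores are numbered from 0 and every node has the same core count; core_level_element is a hypothetical helper name.

```python
def core_level_element(ncores, level=3):
    """Build the lowest-level layout element for nodes with `ncores` cores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    # The mask stays fixed at -c%02d; only max depends on the core count.
    return (
        f'<el{level} tagname="core" min="0" max="{ncores - 1}" '
        f'mask="-c%02d"></el{level}>'
    )


print(core_level_element(16))
# <el3 tagname="core" min="0" max="15" mask="-c%02d"></el3>
```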
> 
> I hope that helps.
> 
> Best regards,
> 
> Carsten
> 
> On 01/20/15 10:46, Peter Bryzgalov wrote:
>> Hi,
>> 
>> In order to add a monitoring feature to Eclipse PTP on the “K” computer I am
>> working on a customised LML layout. For tests I use a smaller sibling of
>> “K” named “FX10”.
>> 
>> I customised the perl scripts and layout description file
>> (samples/layout_default_PJM.xml), and now in the monitoring perspective
>> I can see a table with the jobs running on the computer. The nodes layout
>> that I created with the nodedisplaylayout tag is displayed correctly.
>> 
>> What I can’t figure out is how I can map jobs to the nodes layout. There
>> is information on the nodes used by every job in the jobs table, but I’m
>> not sure if its format is correct.
>> 
>> Here is what I see:
>> Job 75612 is not displayed on the nodes layout.
>> 
>> 
>> In the temporary files directory, the datastep_LML2LML.xml file contains:
>> 
>> <nodedisplay id="nd_1" title="system: fx02p08">
>> <scheme>
>>     <el1 tagname="chassis" min="0" max="6" mask="0xFF0100%01d">
>>         <el2 tagname="node" min="0" max="16" mask="%01x">
>>         </el2>
>>     </el1>
>> </scheme>
>> <data>
>>     <el1 oid="empty" min="0" max="6">
>>     </el1>
>> </data>
>> </nodedisplay>
>> 
>> 
>> <data> is empty, so no job is displayed in the nodes layout.
>> 
>> 
>> By the way, nodes are numbered with hexadecimal numbers. Is it OK to
>> use mask “%01x”?
>> 
>> Kind regards,
>> Peter
>> 
>> 
>> 
>> 
>> 
>>> On 19 Jan, 2015, at 18:13, Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:
>>> 
>>> Dear Peter,
>>> 
>>> thanks for the hint on the broken link. It was moved to
>>> http://llview.fz-juelich.de/LML/OnlineDocumentation/lmldoc.html. I have
>>> updated the documentation page with the new link. On the LML
>>> documentation page you can find a section about LML layout files, which
>>> is located at
>>> http://llview.fz-juelich.de/LML/OnlineDocumentation/layouts.html.
>>> 
>>> There are also some layout examples integrated into PTP. You can find
>>> them at
>>> http://git.eclipse.org/c/ptp/org.eclipse.ptp.git/tree/rms/org.eclipse.ptp.rm.jaxb.contrib/data.
>>> E.g. take a look at the configuration files
>>> de.fz-juelich.judge.torque.batch.xml, de.fz-juelich.juqueen.ll_bg.xml or
>>> de.fz-juelich.juropa.torque.batch.xml. They all contain <monitor-data>
>>> elements, which represent the layout definitions for each site.
>>> 
>>> See also
>>> https://wiki.eclipse.org/images/7/7e/Carsten-Karbach-31july2013-Slides.pdf
>>> for an introduction to creating your own monitoring layouts.
>>> 
>>> Regarding your first question: A summary on the monitoring architecture
>>> is given in the presentation here:
>>> https://wiki.eclipse.org/images/d/d0/PTPUserDev2012_Monitoring_Karbach_Frings.pdf.
>>> When you need to develop your own batch system adapter, you basically
>>> have to write scripts for gathering three types of data: jobs, nodes and
>>> global system information. You have to write one script for each of
>>> these types, where each script generates an LML file. All subsequent
>>> steps of LML_da should be handled automatically. Find examples for these
>>> scripts for all supported batch systems at
>>> http://git.eclipse.org/c/ptp/org.eclipse.ptp.git/tree/rms/org.eclipse.ptp.rm.lml.da/rms.
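
The supported adapters linked above are Perl scripts. Purely to illustrate the kind of output a jobs script has to produce, here is a minimal Python sketch that builds one <info> element in the form shown in Carsten's later reply. The helper name job_info_element is made up, and the enclosing LML document structure is deliberately left out because it is not described in this thread.

```python
import xml.etree.ElementTree as ET


def job_info_element(oid, attributes):
    """Build an <info oid=... type="short"> element with one <data> child per attribute."""
    info = ET.Element("info", oid=oid, type="short")
    for key, value in attributes.items():
        ET.SubElement(info, "data", key=key, value=str(value))
    return info


info = job_info_element(
    "j000116",
    {"name": "my job", "owner": "karbach", "totalcores": 8, "state": "Running"},
)
print(ET.tostring(info, encoding="unicode"))
```

The nodes and system scripts would emit their own attributes in the same key/value style.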
>>> 
>>> Best regards,
>>> 
>>> Carsten
>>> 
>>> On 01/19/15 09:05, Peter Bryzgalov wrote:
>>>> Hi,
>>>> 
>>>> I am working on adapting Eclipse PTP to the “K” computer and its PJM batch
>>>> system made by Fujitsu. Basic job running and profiling with TAU are
>>>> already working. Now I'm working on monitoring.
>>>> I have only a very basic understanding of the workflow and LML. Where can
>>>> I find specifications and examples of what the output files should look
>>>> like at every step of the workflow? I also need instructions on creating
>>>> a layout definition file.
>>>> 
>>>> There is a link to the LML specification on
>>>> http://wiki.eclipse.org/PTP/designs/scalability.
>>>> Unfortunately, the http://llview.zam.kfa-juelich.de site is not working.
>>>> 
>>>> Kind regards,
>>>> Peter Bryzgalov
>>>> 
>>>> RIKEN AICS
>>>> HPC Usability Research Team
>>>> Research Scientist
>>>> 
>>>> 
>>>> _______________________________________________
>>>> ptp-user mailing list
>>>> ptp-user@xxxxxxxxxxx
>>>> To change your delivery options, retrieve your password, or
>>>> unsubscribe from this list, visit
>>>> https://dev.eclipse.org/mailman/listinfo/ptp-user
>>>> 
>>> 
>>> 
>>> 
>>> ------------------------------------------------------------------------------------------------
>>> ------------------------------------------------------------------------------------------------
>>> Forschungszentrum Juelich GmbH
>>> 52425 Juelich
>>> Sitz der Gesellschaft: Juelich
>>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>>> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>>> Prof. Dr. Sebastian M. Schmidt
>>> ------------------------------------------------------------------------------------------------
>>> ------------------------------------------------------------------------------------------------
>>> 