[news.eclipse.technology.cosmos] COSMOS for high-performance computing

I lead a project at Los Alamos National Laboratory charged with revamping (replacing) our current monitoring infrastructure for HPC systems. Our environment consists of several Linux clusters of 1,000 to 10,000 nodes each, and by monitoring we mean real-time alerting, system event investigation, and regular, reasonably detailed reporting of system interrupts.


Our requirements documentation identifies several concepts in common with the COSMOS project, such as the importance of a system model. However, we're having trouble extracting the details from the current documentation, and the June 2008 general release date is problematic for us.


We're currently talking with GroundWork and Zenoss (only one of which seems to be involved with COSMOS) about extending one of their infrastructures to meet our needs. Is COSMOS release 0.4 something we should consider as the basis for a project that must deliver software for use in a production HPC environment, or should we just not spend the time?

Regardless of the answer to that question, what is the proper way to learn the core COSMOS principles, beyond what we can glean from the Eclipse site? For instance, is an HPC environment an eventual target, or is the focus on networks and application servers? All the products and infrastructures we've surveyed seem to share some biases (for instance, each piece of data is treated as atomically relevant, with little room for higher-level correlations), and I personally would like to understand whether those biases are real or whether we just misunderstand some underlying concepts.

Thanks,
Rand