I do have some questions about this.
1) If I understood correctly, we need orte_output and
orte_show_help in order to distinguish the application's
stdout/stderr from the MPI library's output? Who applies the filter:
the local daemon or the HNP? How do we make sure that the remote
outputs are not interleaved?
2) Who actually generates the error message? In your item #2, I
wonder how you distinguish between what needs to be printed once
(such as the PML initialization error) and what is supposed to be
printed multiple times (such as a BTL TCP connection failure). If the
HNP is managing these error messages, this will force us to always
install all error files; otherwise this approach cannot work in a
heterogeneous environment (for example, where the local installation
doesn't have InfiniBand support but the remote one includes it).
3) What is the OMPI layer supposed to use: opal_output,
orte_output, or maybe ompi_output?
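
To make question 2 concrete, here is a minimal sketch of duplicate
filtering at a single aggregation point. It is not the ORTE
implementation, which I have not seen; the names dedup_table,
dedup_report and dedup_count, and the key strings used below, are
hypothetical. The point it illustrates is that "print once" versus
"print several times" depends entirely on what goes into the key.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 32     /* hypothetical limit, enough for a sketch */
#define MAX_KEY  128

/* One entry per distinct message key seen so far. */
struct seen_msg {
    char key[MAX_KEY];  /* caller-chosen identity of the message */
    int  count;         /* number of reports received for this key */
};

struct dedup_table {
    struct seen_msg entries[MAX_KEYS];
    int n;
};

/* Record one report.
 * Returns 1 the first time a key is seen (caller should print),
 * 0 for a duplicate (caller should stay quiet),
 * and -1 if the table is full. */
static int dedup_report(struct dedup_table *t, const char *key)
{
    for (int i = 0; i < t->n; i++) {
        if (strcmp(t->entries[i].key, key) == 0) {
            t->entries[i].count++;
            return 0;
        }
    }
    if (t->n == MAX_KEYS) {
        return -1;
    }
    /* snprintf truncates over-long keys and always NUL-terminates. */
    snprintf(t->entries[t->n].key, MAX_KEY, "%s", key);
    t->entries[t->n].count = 1;
    t->n++;
    return 1;
}

/* Total number of reports received for a key (0 if never seen). */
static int dedup_count(const struct dedup_table *t, const char *key)
{
    for (int i = 0; i < t->n; i++) {
        if (strcmp(t->entries[i].key, key) == 0) {
            return t->entries[i].count;
        }
    }
    return 0;
}
```

With a key such as "pml:init-failed", every process maps to the same
entry, so the message is printed once and dedup_count() gives the
number for the "same error from N other processes" notice. If a TCP
connection failure puts the peer into the key (say
"btl-tcp:connect-failed:peer=3"), failures toward different peers
stay distinct and are each printed. Whoever chooses the key decides
which behavior we get.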
On May 9, 2008, at 5:52 PM, Jeff Squyres wrote:
> Per the teleconf this week, Ralph and I worked up two new features
> that we're nearly ready to put back in the trunk:
> 1. IBM+LANL needed a way to XML-ize all output that comes out of OMPI
> so that 3rd party tools can parse and use it intelligently (e.g., the
> PTP debugger can now distinguish between OMPI error messages and
> stderr from the MPI app).
> 2. In order to do #1, we created separate logical channels (vs. just
> throwing everything in stderr and letting IOF relay it back to the
> HNP) for the following:
> - stdout/stderr from the MPI app
> - opal_show_help() messages (***)
> - opal_output*() messages (***)
> As a side effect, we now filter show_help() messages and only print
> them *once* at the HNP (this has been a very long-standing goal of
> mine). So if your MPI app barfs, you will no longer see the same
> show_help() error message N times -- you'll see it only once, possibly
> accompanied with a "...and we got the same error message from N other
> processes" notice.
> (***) To make both #1 and #2 work, we had to raise the abstraction
> level. That is, there had to be job-level intelligence about the
> different kinds of output. So we have created orte_output() (and
> friends) and orte_show_help(). The OPAL variants still exist, but
> they *SHOULD NOT BE USED* by the MPI layer. Specifically, the OPAL
> variants are for what OPAL does best: single process stuff. The ORTE
> variants provide the job-level intelligence, such as duplicate
> show_help filtering, relaying to the HNP in a different channel than
> IOF, etc.
> So when this stuff hits the trunk, you'll see a ton of
> s/opal_output/orte_output/g and s/opal_show_help/orte_show_help/g
> changes throughout the code base. Do not be alarmed. :-)
> Jeff Squyres
> Cisco Systems