Something did occur to me that *might* help with the problem of detecting
when the seed is running. There is an option to orted "--report-uri pipe"
that will cause the orted to write its uri to the specified pipe. This
comes after the orted has completed orte_init, and so it *should* be ready
at that time for you to connect to it.
So you might try using that option when you kick off the seed, and then
reading from the pipe until you get the uri back. Or you can just wait to
see when the pipe closes since the orted closes the pipe immediately after
writing to it.
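Here is a rough sketch of the reading side, just to illustrate the idea.
It assumes you created the pipe with pipe() before launching the seed and
that the orted ends up holding the write end; how the write end is handed
to the orted depends on how you launch it and is not shown. The function
name read_until_closed is made up for this example and is not part of ORTE.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an ORTE function): read from fd until the
 * writer closes its end of the pipe (EOF) or the buffer is full.
 * The result is always NUL-terminated.
 * Returns the number of bytes stored, or -1 on a read error.
 */
static ssize_t read_until_closed(int fd, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0) {
        return -1;              /* no room even for the terminator */
    }

    while (used < len - 1) {
        ssize_t n = read(fd, buf + used, len - 1 - used);
        if (n > 0) {
            used += (size_t)n;  /* got part (or all) of the uri */
        } else if (n == 0) {
            break;              /* EOF: the writer closed the pipe */
        } else if (errno != EINTR) {
            return -1;          /* real error; EINTR just retries */
        }
    }

    buf[used] = '\0';
    return (ssize_t)used;
}
```

One thing to watch: after you fork/exec the seed, close your own copy of
the write end before calling this, otherwise read() will never see EOF
and you will block even after the orted has closed its end.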
There is still some stuff that the orted does before it accepts commands
sent directly to it etc., but that shouldn't impact your ability to connect.
Let me know how that goes. If we need to do so, we can shift the timing of
that report-uri output so it comes a little later in the orted's setup.
On 2/3/07 6:51 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
> On 2/2/07 8:44 AM, "Greg Watson" <gwatson_at_[hidden]> wrote:
>> We're launching a seed daemon so that we can get registry persistence
>> across multiple job launches. However, there is a race condition
>> between launching the daemon and the first call to orte_init() that
>> can result in a bus error. We set the OMPI_MCA_universe and
>> OMPI_MCA_orte_univ_exist environment variables prior to calling
>> orte_init() so that orte knows how to connect to the daemon, but if
>> the daemon hasn't started this causes a bus error in
>> orte_rds_base_close(). Stack trace below.
>> Exception: EXC_BAD_ACCESS (0x0001)
>> Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c
>> Thread 0 Crashed:
>> 0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66
>> 1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121
>> 2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648
>> 3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898
>> 4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25
>> 5 libopen-rte.0.dylib 0x000a0190 orte_init + 81
> Hmmm...can you tell me which version you are working with? Obviously, that
> shouldn't happen. My best initial guess is that rds is being opened, but
> hasn't selected components yet when we try to contact the universe. When
> that fails and we call finalize, rds tries to "close" a component list that
> is NULL. I can look into that.
>> A related question: is there any way to check for the daemon other
>> than calling orte_init()? At the moment we just sleep for a few
>> seconds after launching the daemon, but this is obviously not a very
>> satisfactory solution. I can't see any places where this is done in
>> the source.
> There is a "setup_hnp" function that is supposed to do what you describe,
> but I cannot swear that it works right now - I doubt it has been tested in
> some time. Getting that to work properly is on my "to-do" list for the next
> go-around. Meantime, I don't have any immediate solutions other than
>> devel mailing list