Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-02-03 08:51:16


On 2/2/07 8:44 AM, "Greg Watson" <gwatson_at_[hidden]> wrote:

> We're launching a seed daemon so that we can get registry persistence
> across multiple job launches. However, there is a race condition
> between launching the daemon and the first call to orte_init() that
> can result in a bus error. We set the OMPI_MCA_universe and
> OMPI_MCA_orte_univ_exist environment variables prior to calling
> orte_init() so that orte knows how to connect to the daemon, but if
> the daemon hasn't started this causes a bus error in
> orte_rds_base_close(). Stack trace below.
>
> Exception: EXC_BAD_ACCESS (0x0001)
> Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c
>
> Thread 0 Crashed:
> 0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66
> 1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121
> 2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648
> 3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898
> 4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25
> 5 libopen-rte.0.dylib 0x000a0190 orte_init + 81
>

Hmmm...can you tell me which version you are working with? Obviously, that
shouldn't happen. My best initial guess is that rds is being opened, but
hasn't selected components yet when we try to contact the universe. When
that fails and we call finalize, rds tries to "close" a component list that
is NULL. I can look into that.
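
To illustrate the suspected failure mode: the fault address in the trace
(0x0000001c) is a small offset from zero, which is consistent with reading a
field through a NULL pointer. Below is a minimal sketch of a close routine
that tolerates being called before any components were selected. All names
here (component_list_t, framework_t, framework_close) are hypothetical
stand-ins for illustration, not the actual rds code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a framework's list of opened components. */
typedef struct component_list {
    size_t count;
    void **items;
} component_list_t;

/* Hypothetical framework state: the list stays NULL until selection runs. */
typedef struct framework {
    component_list_t *components;
} framework_t;

/* Close that is safe to call before components were selected.
 * Returns 0 on success. */
int framework_close(framework_t *fw)
{
    assert(fw != NULL);

    if (fw->components == NULL) {
        return 0;               /* nothing was opened, nothing to release */
    }

    free(fw->components->items);
    free(fw->components);
    fw->components = NULL;      /* makes a repeated close harmless */
    return 0;
}
```

With a guard like this, a finalize that runs after a failed universe contact
would return cleanly instead of dereferencing the unset list.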

> A related question: is there any way to check for the daemon other
> than calling orte_init()? At the moment we just sleep for a few
> seconds after launching the daemon, but this is obviously not a very
> satisfactory solution. I can't see any places where this is done in
> the source.
>

There is a "setup_hnp" function that is supposed to do what you describe,
but I cannot swear that it works right now - I doubt it has been tested in
some time. Getting that to work properly is on my "to-do" list for the next
go-around. In the meantime, I don't have any immediate solution other than
"sleep".

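If a fixed sleep has to stay for now, one way to make it less fragile is to
poll with a bounded number of short waits instead of one long one. The sketch
below assumes a POSIX system (nanosleep). The names wait_for_daemon, probe_fn
and countdown_probe are hypothetical, and the probe itself is left abstract,
since this thread does not identify a safe check other than orte_init().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical probe: returns true once the daemon can be contacted. */
typedef bool (*probe_fn)(void *ctx);

/* Call probe() up to max_attempts times, sleeping interval_ms between
 * failed attempts. Returns true as soon as the probe succeeds. */
bool wait_for_daemon(probe_fn probe, void *ctx,
                     int max_attempts, long interval_ms)
{
    assert(probe != NULL);

    struct timespec delay;
    delay.tv_sec = interval_ms / 1000;
    delay.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (probe(ctx)) {
            return true;
        }
        if (attempt + 1 < max_attempts) {
            nanosleep(&delay, NULL);    /* wait before the next try */
        }
    }
    return false;
}

/* Demo probe standing in for a real check: succeeds once the counter
 * pointed to by ctx has been decremented to zero. */
bool countdown_probe(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

A caller would launch the daemon, then call something like
wait_for_daemon(my_probe, NULL, 50, 100) and only proceed to orte_init() when
it returns true, giving up with an error otherwise.
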
> Thanks,
>
> Greg