On Feb 3, 2007, at 6:51 AM, Ralph Castain wrote:
> On 2/2/07 8:44 AM, "Greg Watson" <gwatson_at_[hidden]> wrote:
>> We're launching a seed daemon so that we can get registry persistence
>> across multiple job launches. However, there is a race condition
>> between launching the daemon and the first call to orte_init() that
>> can result in a bus error. We set the OMPI_MCA_universe and
>> OMPI_MCA_orte_univ_exist environment variables prior to calling
>> orte_init() so that orte knows how to connect to the daemon, but if
>> the daemon hasn't started this causes a bus error in
>> orte_rds_base_close(). Stack trace below.
>> Exception: EXC_BAD_ACCESS (0x0001)
>> Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c
>> Thread 0 Crashed:
>> 0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66
>> 1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121
>> 2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648
>> 3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898
>> 4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25
>> 5 libopen-rte.0.dylib 0x000a0190 orte_init + 81
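
For the race described above, one caller-side workaround is to hold off on orte_init() until the seed daemon is known to be up. The following is only a rough sketch of a bounded retry loop. The names wait_until_ready, ready_probe_fn and probe_countdown are hypothetical and not part of ORTE, and how to actually detect that the daemon is ready is left as an assumption behind the probe callback.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness probe: returns true once the daemon is reachable.
 * What "reachable" means is an assumption left to the caller. */
typedef bool (*ready_probe_fn)(void *ctx);

/* Poll `probe` up to max_attempts times, sleeping delay_ms between tries.
 * Returns true if the probe succeeded, false if we gave up. */
static bool wait_until_ready(ready_probe_fn probe, void *ctx,
                             int max_attempts, long delay_ms)
{
    assert(probe != NULL);
    for (int i = 0; i < max_attempts; i++) {
        if (probe(ctx)) {
            return true;
        }
        if (i + 1 < max_attempts) {
            /* Back off before the next attempt. */
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
    }
    return false;
}

/* Demo probe for illustration only: reports "not ready" until the
 * counter pointed to by ctx reaches zero. */
static bool probe_countdown(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```

The idea would be to call orte_init() only after wait_until_ready() returns true, and to report a clean error otherwise. This only narrows the window on the caller's side; it does not address the crash in the failure path itself.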
> Hmmm...can you tell me which version you are working with? Obviously,
> that shouldn't happen. My best initial guess is that rds is being
> opened, but hasn't selected components yet when we try to contact the
> universe. If that fails and we call finalize, rds tries to "close" a
> component list that is NULL. I can look into that.
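
If that guess is right, the usual fix for this kind of problem is to make the close path tolerate a framework that was opened but never had its components selected. A minimal sketch of that pattern follows. The types and the function framework_close are hypothetical stand-ins for illustration, not the actual orte_rds_base_close() code or the real ORTE data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real ORTE types. */
typedef struct component_list {
    int count;                      /* number of selected components */
} component_list_t;

typedef struct framework {
    component_list_t *components;   /* NULL until components are selected */
} framework_t;

/* Close a framework. Safe to call even if selection never happened,
 * e.g. when finalize runs after an early failure during init. */
static int framework_close(framework_t *fw)
{
    assert(fw != NULL);
    if (fw->components == NULL) {
        return 0;   /* nothing was selected, so nothing to close */
    }
    free(fw->components);
    fw->components = NULL;          /* makes a second close harmless */
    return 0;
}
```

With a guard like this, finalize can be called from any point in the init sequence without dereferencing a NULL list.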