I don't know whether the Fedora RPMs are built with -g, or whether Fedora provides a debuginfo RPM that you could install so that you can attach a debugger and dig into OMPI's internals yourself.
If that doesn't work, you might need to build from source yourself, link against the external hwloc (you said you could replicate the error this way), and compile with -g (e.g., "./configure CFLAGS=-g LDFLAGS=-g ..."). That would let you attach gdb and see what's going on.
Alternatively, you could add some opal_output(0, "printf like args here"); statements in the orte_util_nidmap_init() function to see where it's failing (look in orte/util/nidmap.c).
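As a rough sketch of that tracing approach (this is not OMPI code): trace() below is a hypothetical stand-in for opal_output(0, ...), and run_steps(), step_ok() and step_fail() are made-up placeholders for the successive steps inside a routine like orte_util_nidmap_init(). The idea is just that the last message printed tells you which step failed.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for opal_output(0, "printf like args here").
 * It formats a printf-style message and writes it to stderr. */
void trace(const char *fmt, ...)
{
    va_list ap;

    fputs("[trace] ", stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}

/* A step returns 0 on success and non-zero on failure. */
typedef int (*step_fn)(void);

/* Placeholder steps, for illustration only. */
int step_ok(void)   { return 0; }
int step_fail(void) { return 1; }

/* Run each step in order, tracing before every step and on failure.
 * Returns the index of the first failing step, or -1 if all succeed. */
int run_steps(const step_fn *steps, const char *const *names, int nsteps)
{
    for (int i = 0; i < nsteps; i++) {
        trace("%s: entering step %d (%s)", __func__, i, names[i]);
        int rc = steps[i]();
        if (rc != 0) {
            trace("%s: step %d (%s) failed, rc=%d", __func__, i, names[i], rc);
            return i;
        }
    }
    trace("%s: all %d steps completed", __func__, nsteps);
    return -1;
}
```

In the real source you would not need any of this scaffolding; you would simply drop opal_output(0, ...) calls between the existing statements in orte/util/nidmap.c and rebuild.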
On Jul 23, 2013, at 9:36 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> I see - I didn't look at the redhat bug list. Sadly, I have no idea how to debug it. The Fedora package is built optimized, so no OMPI debugging output is available and a debugger won't tell us a lot.
> Best guess is that there is something in the build that doesn't match the user's system. The nidmap_init routine unpacks a buffer that contains a bunch of process mapping info that mpirun packed into it - don't usually see an error in there.
> On Jul 23, 2013, at 5:57 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>> On Jul 23, 2013, at 8:54 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> Yes, it's curious that they can't reproduce your issue,
>>> Guess I missed this - where does it say that they can't reproduce the issue?? I'm suspicious because build-from-source produced a working result.
>> Orion mentioned it in https://bugzilla.redhat.com/show_bug.cgi?id=986409.
>> Jeff Squyres
>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/