This question was buried in an earlier question, and I got no replies,
so I'll try reposting it with a more enticing subject.
I have a multithreaded openmpi code where each task has N+1 threads:
the N threads send nonblocking messages that are received by the one
remaining thread on the other tasks. When I run this code with 2 tasks
(5+1 threads each) on a single node with 12 cores, after about a
million messages have been exchanged, the tasks segfault after
printing the following error:
read an unknown type of header
The debugger drops me on line 674 of btl_sm_component.c, in the
default case of a switch on the fragment type. There's a comment there
that reads:
* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.
It seems like whoever wrote that code wasn't quite sure what was going
on, and I guess the assumption was wrong, because dereferencing that
result seems to be what's causing the segfault. Does anyone here know
what could cause this error? If I run the code with the tcp btl
instead of sm, it runs fine, albeit with somewhat lower performance.
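In case anyone wants to reproduce the workaround, this is roughly how
I'm forcing the btl selection on the command line (the application
name here is a placeholder):

```shell
# Force the tcp btl (plus self for loopback) instead of sm:
mpirun --mca btl tcp,self -np 2 ./my_app

# Alternatively, exclude only the sm btl and let Open MPI
# choose among the remaining ones:
mpirun --mca btl ^sm -np 2 ./my_app
```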
This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell
PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, built with
the Intel compilers, 12.3.174. I've attached the ompi_info output.