>come from the BTL headers where the fields do not have the same
>alignment inside. The original question was asked by Nysal Jan on an
>email with the subject "SEGV in EM64T <--> PPC64 communication" on
>Oct. 11 2006. Unfortunately, we still have the same problem.
I'm forwarding that email. Further investigation showed that the same issue exists with a few other ob1 headers as well. A 64-bit build doesn't have this problem. I'm not sure whether this is the same issue that you are facing. You could test whether the attached patch works for you (although this is not the right solution). Building with -malign-double might also work, but I haven't tried that.
******************************************************************
Hi Jeff,
I'm using the r12014M revision of the trunk.
I'm getting a SEGV (backtrace included) when running the OSU bandwidth benchmark on a heterogeneous set of 2 nodes (an EM64T and a PPC64).
A 32-bit build, compiled with gcc, was used. The problem was tracked down to a difference in the size of the mca_btl_tcp_hdr_t structure on these two architectures.
struct mca_btl_tcp_hdr_t {
    mca_btl_base_header_t base; /* a uint8_t */
    uint8_t type;
    uint16_t count;
    uint64_t size;
};
This structure has a size of 12 bytes on EM64T (no padding here) and 16 bytes on PPC64 (4 bytes of padding are added before 'size').
http://docs.sun.com/app/docs/doc/816-5138/6mba6ua5t?a=view mentions that 'long long' has a 4-byte alignment on i386, which might explain why the structure is only 12 bytes in the 32-bit build on EM64T.
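To make the layout difference concrete, here is a minimal sketch that reports the size of the structure and the offset of 'size' as laid out by the compiler. It also shows a variant with an explicit padding field, which is one possible way to get the same layout on both ABIs. All demo_* names and the 'reserved' field are hypothetical and are not part of Open MPI, and this is not meant as the fix that should go into the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mca_btl_base_header_t (described above as a uint8_t). */
typedef uint8_t demo_base_header_t;

/* Same field order as mca_btl_tcp_hdr_t; padding is left to the ABI. */
struct demo_hdr_natural {
    demo_base_header_t base;
    uint8_t  type;
    uint16_t count;
    uint64_t size;   /* offset 4 if uint64_t is 4-byte aligned, 8 if 8-byte aligned */
};

/* Assumption: an explicit 4-byte filler ('reserved', hypothetical) places
 * 'size' at offset 8 whether uint64_t needs 4-byte or 8-byte alignment. */
struct demo_hdr_padded {
    demo_base_header_t base;
    uint8_t  type;
    uint16_t count;
    uint32_t reserved;
    uint64_t size;
};

/* Small accessors so the values can be printed or checked at run time. */
size_t demo_natural_size(void)        { return sizeof(struct demo_hdr_natural); }
size_t demo_natural_size_offset(void) { return offsetof(struct demo_hdr_natural, size); }
size_t demo_padded_size(void)         { return sizeof(struct demo_hdr_padded); }
size_t demo_padded_size_offset(void)  { return offsetof(struct demo_hdr_padded, size); }
```

Printing demo_natural_size() on each node of a heterogeneous pair should reproduce the 12 versus 16 difference, while the padded variant is expected to report 16 on both.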
The failure happens in mca_btl_tcp_endpoint_recv_handler() when trying to invoke reg->cbfunc() and reg->cbfunc is NULL.
Assuming the receiver side is EM64T:
frag->iov[0].iov_len = sizeof(frag->hdr) (so assigned 12 bytes on EM64T)
thus the readv() in mca_btl_tcp_frag_recv() reads 12 bytes into the first vector instead of 16 and from there on everything goes wrong.
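The mismatch can be illustrated without a socket. The sketch below assumes the sender writes the header with the 16-byte layout ('size' at offset 8) and the receiver interprets the same bytes with the 12-byte layout ('size' at offset 4). The demo_* names and constants are hypothetical, and byte-order conversion between the two hosts is deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layouts, taken from the sizes reported above. */
enum {
    DEMO_SENDER_HDR_LEN  = 16, /* PPC64 side: padding before 'size' */
    DEMO_SENDER_SIZE_OFF = 8,
    DEMO_RECVER_HDR_LEN  = 12, /* EM64T side: no padding */
    DEMO_RECVER_SIZE_OFF = 4
};

/* Serialize a header the way the 16-byte layout would place it on the wire. */
static void demo_pack_sender(uint8_t *wire, uint8_t type,
                             uint16_t count, uint64_t size)
{
    memset(wire, 0, DEMO_SENDER_HDR_LEN);   /* padding bytes are zeroed here */
    wire[1] = type;
    memcpy(wire + 2, &count, sizeof(count));
    memcpy(wire + DEMO_SENDER_SIZE_OFF, &size, sizeof(size));
}

/* Return the 'size' value a receiver using the 12-byte layout would see. */
uint64_t demo_misread_size(uint64_t sent_size)
{
    uint8_t wire[DEMO_SENDER_HDR_LEN];
    uint64_t seen;

    demo_pack_sender(wire, 1, 1, sent_size);
    /* The receiver only consumes the first 12 bytes as the header and
     * reads 'size' from offset 4, i.e. from padding plus half of the
     * real field. */
    memcpy(&seen, wire + DEMO_RECVER_SIZE_OFF, sizeof(seen));
    return seen;
}

/* Header bytes left in the stream, later treated as the start of the payload. */
int demo_leftover_bytes(void)
{
    return DEMO_SENDER_HDR_LEN - DEMO_RECVER_HDR_LEN;
}
```

With a sent size of 64, demo_misread_size() does not return 64 on either a little-endian or a big-endian host, and the 4 leftover header bytes shift every subsequent read.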
******************************************************************