Open MPI Development Mailing List Archives

From: Nysal Jan (jnysal_at_[hidden])
Date: 2006-11-04 03:37:58


>come from the BTL headers where the fields do not have the same
>alignment inside. The original question was asked by Nysal Jan on an
>email with the subject "SEGV in EM64T <--> PPC64 communication" on
>Oct. 11 2006. Unfortunately, we still have the same problem.

I'm forwarding that email. Further investigation showed that the same issue
exists in a few other ob1 headers as well. A 64-bit build doesn't have
this problem. I'm not sure whether this is the same issue that you are
facing. You could test whether the attached patch works for you (although it
is not the right solution). Building with -malign-double might also
work, but I haven't tried that.

******************************************************************
Hi Jeff,
I'm using the r12014M revision of the trunk.
I'm getting a SEGV (backtrace included) when running the OSU bandwidth
benchmark on a heterogeneous set of 2 nodes (an EM64T and a PPC64).
A 32-bit build, compiled with gcc, was used. The problem was tracked down to
a difference in the size of the mca_btl_tcp_hdr_t structure on these two
architectures.

struct mca_btl_tcp_hdr_t {
    mca_btl_base_header_t base; /* a uint8_t */
    uint8_t type;
    uint16_t count;
    uint64_t size;
};

This structure has a size of 12 bytes on EM64T (no padding) and 16 bytes
on PPC64 (4 bytes of padding are added before 'size').
http://docs.sun.com/app/docs/doc/816-5138/6mba6ua5t?a=view mentions that
'long long' has a 4-byte alignment on i386, which might explain why the
structure is only 12 bytes on EM64T.
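
To make the arithmetic behind the 12 vs. 16 byte sizes concrete, here is a
minimal sketch that models the usual C layout rules for the four fields above.
The helper names (align_up, tcp_hdr_model_size) are made up for illustration
and are not part of Open MPI; the only input is the alignment assumed for
uint64_t (4 as on 32-bit i386, 8 as on 32-bit PPC).

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/*
 * Model the layout of a struct with the same field sequence as
 * mca_btl_tcp_hdr_t: uint8_t, uint8_t, uint16_t, uint64_t.
 * align64 is the alignment the ABI gives uint64_t (assumed 4 or 8).
 * Stores the offset of 'size' in *size_offset and returns the struct size.
 */
static size_t tcp_hdr_model_size(size_t align64, size_t *size_offset)
{
    size_t off = 0;
    size_t max_align = 2;          /* uint16_t is the widest field so far */

    off += 1;                      /* base  (uint8_t)  at offset 0 */
    off += 1;                      /* type  (uint8_t)  at offset 1 */
    off = align_up(off, 2);
    off += 2;                      /* count (uint16_t) at offset 2 */

    off = align_up(off, align64);  /* padding before 'size', if any */
    if (size_offset != NULL)
        *size_offset = off;
    off += 8;                      /* size  (uint64_t) */

    if (align64 > max_align)
        max_align = align64;
    return align_up(off, max_align); /* tail padding to struct alignment */
}
```

Calling tcp_hdr_model_size(4, &off) gives a size of 12 with 'size' at offset
4, and tcp_hdr_model_size(8, &off) gives 16 with 'size' at offset 8, which
matches the two sizes reported above.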

The failure happens in mca_btl_tcp_endpoint_recv_handler() when it tries to
invoke reg->cbfunc() and reg->cbfunc is NULL.
Assuming the receiver side is EM64T:
 frag->iov[0].iov_len = sizeof(frag->hdr) (so 12 bytes on EM64T)
 so the readv() in mca_btl_tcp_frag_recv() reads 12 bytes into the first
vector instead of 16, and from there on everything goes wrong.
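
The effect of the short read can be sketched in isolation. The code below is
an illustration only, not Open MPI code: struct hdr_fields, encode_16,
decode_12 and simulate_mismatch are hypothetical names, byte-order conversion
is ignored, and the sender is assumed to use the 16-byte layout while the
receiver parses with the 12-byte one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SENDER_HDR_LEN = 16, RECEIVER_HDR_LEN = 12 };

/* Hypothetical mirror of the header fields, independent of any padding. */
struct hdr_fields {
    uint8_t  base;
    uint8_t  type;
    uint16_t count;
    uint64_t size;
};

/* Sender layout: 'size' at offset 8, bytes 4..7 are zeroed padding. */
static void encode_16(const struct hdr_fields *h,
                      unsigned char out[SENDER_HDR_LEN])
{
    memset(out, 0, SENDER_HDR_LEN);
    out[0] = h->base;
    out[1] = h->type;
    memcpy(out + 2, &h->count, sizeof h->count);
    memcpy(out + 8, &h->size, sizeof h->size);
}

/* Receiver layout: 'size' at offset 4, only 12 bytes are consumed. */
static void decode_12(const unsigned char in[RECEIVER_HDR_LEN],
                      struct hdr_fields *h)
{
    h->base = in[0];
    h->type = in[1];
    memcpy(&h->count, in + 2, sizeof h->count);
    memcpy(&h->size, in + 4, sizeof h->size);
}

/*
 * Encode with the 16-byte layout, decode with the 12-byte layout.
 * Returns how many header bytes stay unread in the stream; the receiver
 * would treat them as the start of the payload or of the next header.
 */
static size_t simulate_mismatch(const struct hdr_fields *sent,
                                struct hdr_fields *seen)
{
    unsigned char wire[SENDER_HDR_LEN];

    assert(sent != NULL && seen != NULL);
    encode_16(sent, wire);
    decode_12(wire, seen);
    return SENDER_HDR_LEN - RECEIVER_HDR_LEN;
}
```

In this model the first three fields still decode correctly, but 'size' is
assembled from the padding plus half of the real value, and 4 bytes of the
header are left in the stream, so every later read is misaligned.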
******************************************************************
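
One possible way to make such a header layout-stable is to spell out the
padding so that no ABI needs to insert any. This is a sketch under
assumptions, not necessarily what the attached patch does: the type names
below are hypothetical, the base header is taken to be a uint8_t as the
comment in the struct says, and C11 is assumed for _Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mca_btl_base_header_t (described as a uint8_t). */
typedef uint8_t example_base_header_t;

/* Same fields as the TCP header, plus explicit padding before 'size'. */
struct example_tcp_hdr_padded {
    example_base_header_t base;
    uint8_t  type;
    uint16_t count;
    uint32_t reserved;  /* explicit padding: 'size' now starts at offset 8 */
    uint64_t size;
};

/* Fail the build if a compiler lays the struct out differently. */
_Static_assert(offsetof(struct example_tcp_hdr_padded, size) == 8,
               "'size' must start at offset 8");
_Static_assert(sizeof(struct example_tcp_hdr_padded) == 16,
               "header must be 16 bytes on every platform");
```

With the padding explicit, 'size' sits at offset 8 whether uint64_t is
4-byte or 8-byte aligned, so sizeof() should agree on both sides of the
connection.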