Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] dropping a pls module into an Open MPI build
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-01-23 22:19:55


I'm sorry, but now I am totally confused. Are you saying that you are having
problems with the default rsh component in the distributed 1.2.3 code?? Or
are you having a problem with your customized version? What compiler are you
using? If it's your customized version, did you make sure to change the
names of the data structures and modules as I pointed out?

We regularly work on Macs, both PPC and Intel based (I develop and test on
both every day), and I have -never- seen this problem in our code base.
Hence my confusion.

Thanks
Ralph

On 1/23/08 8:08 PM, "Dean Dauger, Ph. D." <d_at_[hidden]> wrote:

> Hi All,
>
> I think I have a possible explanation for this problem. Previously
> orterun was jumping to 0x00000000:
>
>> [Rotarran-X-5:04475] Failing at address: 0x0
>> [ 1] [0xbffff828, 0x00000000] (-P-)
>
> On a hunch I tried changing the number of bools in the
> orte_pls_rsh_component_t data structure of pls_rsh.h. That produced
> another bus error, this time with orterun jumping to 0x80000000. So I went
> further and changed the layout of the orte_pls_rsh_component_t struct
> from something like this:
>
> bool reap;
> bool assume_same_shell;
> bool force_rsh;
> char** agent_argv;
> int agent_argc;
> char* agent_path;
>
> to this:
>
> char** agent_argv;
> char* agent_path;
> int agent_argc;
> int unusedInt;
> bool reap;
> bool assume_same_shell;
> bool force_rsh;
> bool unusedB;
>
> recompiled, dropped the new .la and .so pieces in, and then it all
> worked.
>
> My hunch is that I'm having a data alignment problem. Perhaps the
> pointer reference to _launch of the pls module is stored after the
> orte_pls_rsh_component_t struct, but the alignment the existing build
> assumes differs from that of my newly compiled pls module.
> Apple usually compiles with every type on its "natural" alignment in
> memory (PowerPC always liked it that way and the habit has stuck), and
> three bools followed by a char** suggest there could be padding.
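The suspected padding can be checked directly with `offsetof`. The sketch below is a minimal, hypothetical reduction: the names `rsh_fields_original`, `rsh_fields_reordered`, `original_padding_before_argv`, and `print_layouts` are invented for illustration, and the structs keep only the fields quoted above (plus the two filler fields of the reordered version), not the full orte_pls_rsh_component_t. It reports where each member lands on whatever target it is compiled for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins: only the fields quoted in the message are kept;
 * the real orte_pls_rsh_component_t has further members. */
struct rsh_fields_original {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;  /* pointer alignment may force padding before this */
    int agent_argc;
    char *agent_path;   /* on LP64 targets, padding may precede this too */
};

struct rsh_fields_reordered {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};

/* The first member of any struct starts at offset 0. */
static_assert(offsetof(struct rsh_fields_reordered, agent_argv) == 0,
              "first member starts at offset 0");

/* Bytes of padding the compiler inserted between force_rsh and agent_argv. */
size_t original_padding_before_argv(void)
{
    size_t end_of_bools = offsetof(struct rsh_fields_original, force_rsh)
                        + sizeof(bool);
    return offsetof(struct rsh_fields_original, agent_argv) - end_of_bools;
}

/* Print member offsets and total size of both layouts. */
void print_layouts(void)
{
    printf("original:  argv@%zu argc@%zu path@%zu size=%zu\n",
           offsetof(struct rsh_fields_original, agent_argv),
           offsetof(struct rsh_fields_original, agent_argc),
           offsetof(struct rsh_fields_original, agent_path),
           sizeof(struct rsh_fields_original));
    printf("reordered: argv@%zu path@%zu argc@%zu reap@%zu size=%zu\n",
           offsetof(struct rsh_fields_reordered, agent_argv),
           offsetof(struct rsh_fields_reordered, agent_path),
           offsetof(struct rsh_fields_reordered, agent_argc),
           offsetof(struct rsh_fields_reordered, reap),
           sizeof(struct rsh_fields_reordered));
    printf("padding after the three bools (original): %zu bytes\n",
           original_padding_before_argv());
}
```

Calling `print_layouts()` from any `main` prints the offsets for the current target. With 1-byte bools, the original order leaves 1 byte of padding after the three bools when pointers are 4-byte aligned and 5 bytes when they are 8-byte aligned, which is exactly the kind of gap two builds could disagree about.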
>
> The real problem is not whether there is padding but whether we agree
> on it. I don't know who put which memory-alignment compiler flag in
> which makefile or ./configure line, but if I rearrange the struct into
> the second layout above there is no ambiguity, and orterun() calls
> _launch just fine in both the rsh module and my own.
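The "what do we agree on" point can be illustrated by a sketch that assumes two builds seeing the same declaration under different alignment settings. Here `#pragma pack(1)` (an extension supported by GCC, Clang, and MSVC) merely stands in for whatever setting the other build might have used; nothing here claims that the Open MPI build uses it. All struct, macro, and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulation of two builds that disagree on alignment.
 * #pragma pack(1) only stands in for "some other alignment setting". */

/* Original member order, natural alignment. */
struct orig_natural {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;
};

/* Original member order, 1-byte packing (no padding inserted). */
#pragma pack(push, 1)
struct orig_packed {
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    char **agent_argv;
};
#pragma pack(pop)

/* With pack(1), the struct is exactly the sum of its members. */
static_assert(sizeof(struct orig_packed) == 3 * sizeof(bool) + sizeof(char **),
              "pack(1) removes all padding");

/* Reordered layout: widest members first, filler fields included. */
struct reord_natural {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};

#pragma pack(push, 1)
struct reord_packed {
    char **agent_argv;
    char *agent_path;
    int agent_argc;
    int unusedInt;
    bool reap;
    bool assume_same_shell;
    bool force_rsh;
    bool unusedB;
};
#pragma pack(pop)

/* True if member m sits at the same offset in struct a and struct b. */
#define SAME_OFFSET(a, b, m) (offsetof(struct a, m) == offsetof(struct b, m))

/* Would both builds look for agent_argv in the same place (original order)? */
bool original_order_agrees(void)
{
    return SAME_OFFSET(orig_natural, orig_packed, agent_argv);
}

/* Would both builds agree on every member offset (reordered layout)? */
bool reordered_order_agrees(void)
{
    return SAME_OFFSET(reord_natural, reord_packed, agent_argv)
        && SAME_OFFSET(reord_natural, reord_packed, agent_path)
        && SAME_OFFSET(reord_natural, reord_packed, agent_argc)
        && SAME_OFFSET(reord_natural, reord_packed, unusedInt)
        && SAME_OFFSET(reord_natural, reord_packed, reap)
        && SAME_OFFSET(reord_natural, reord_packed, assume_same_shell)
        && SAME_OFFSET(reord_natural, reord_packed, force_rsh)
        && SAME_OFFSET(reord_natural, reord_packed, unusedB);
}
```

With 1-byte bools and pointer alignment greater than 1, the two simulated builds place `agent_argv` at different offsets in the original order, while every member offset of the reordered layout matches. The total `sizeof` can still differ if the reordered struct ends in tail padding (for example with 8-byte pointers), so anything stored directly after the struct only lines up when the members already fill it to a multiple of its alignment, as they do with 4-byte pointers and ints.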
>
> Thanks for your help,
> Dean
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel