Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] dropping a pls module into an Open MPI build
From: Dean Dauger, Ph. D. (d_at_[hidden])
Date: 2008-01-24 12:54:14


> I'm sorry, but now I am totally confused. Are you saying that you
> are having problems with the default rsh component in the
> distributed 1.2.3 code??

Yes ...

> Or are you having a problem with your customized version?

and yes. Each exhibited the same problem, a bus error.

> What compiler are you using? If it's your customized version, did
> you make sure to change the names of the data structures and
> modules as I pointed out?

gcc 4.0.1, the default of Leopard. Yes, in the customized version, I
did change the names of the data structures, subroutines, support
file names, and where it says "rsh" just like you said.

> We regularly work on Macs, both PPC and Intel based (I develop and
> test on both every day), and I have -never- seen this problem in
> our code base. Hence my confusion.

I'm sorry for the confusion. I'm starting with the shipping Mac OS X 10.5.1
"Leopard", which contains its own build of Open MPI (v1.2.3 according
to "orterun -version"). So I assumed that the v1.2.3 branch from
svn.open-mpi.org was the same code Apple used to build the Open MPI
that ships in Leopard.

My goal was to build a new pls module based on the pls_rsh module's
source code, substituting my own name for "rsh" as you said, but
I encountered a bus error. To be sure I hadn't made a mistake
in my custom module, I rebuilt the unmodified pls_rsh module and
discovered the same problem.

Then, after downloading the Open MPI from opensource.apple.com
(suspecting it was different), I tried recompiling the pls_rsh module
from that source code, dropped in just the resulting mca_pls_rsh.la
and mca_pls_rsh.so into the existing /usr/lib/openmpi of Leopard,
overwriting Leopard's versions, and the bus error happened the same
as before.

That's where I was with my first post to this list.

My last post reported the discovery that rearranging the elements of
orte_pls_rsh_component_t, without changing anything else about the
pls_rsh code, affects whether the bus error occurs. I then padded out
orte_pls_rsh_component_t and my "orte_pls_dean_component_t" by hand
so that they would be "data alignment agnostic", if you will.
As a result the bus error no longer occurs and both pls modules now
run as they should.
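
A minimal sketch of what padding a structure by hand can look like. The field names and sizes below are hypothetical stand-ins, not the real members of orte_pls_rsh_component_t, and this is not the actual change made to the pls modules. The idea is to order members from largest to smallest and spell out the trailing padding, so that the compiler has no room to insert padding of its own and the layout no longer depends on the alignment rules in effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a component structure. Small members sit
 * in front of larger ones, so the compiler normally inserts hidden
 * padding, and the amount depends on the alignment rules in effect. */
struct fragile_component {
    uint8_t  debug;
    uint64_t timeout;
    uint8_t  reap;
    uint32_t delay;
};

/* Same (hypothetical) members, ordered largest first with the
 * padding written out by hand. Every member lands on a multiple of
 * its own size, so no hidden padding is needed. */
struct agnostic_component {
    uint64_t timeout;   /* offset 0  */
    uint32_t delay;     /* offset 8  */
    uint8_t  debug;     /* offset 12 */
    uint8_t  reap;      /* offset 13 */
    uint8_t  pad[2];    /* explicit padding up to 16 bytes */
};

/* Bytes the compiler added beyond the declared members. */
size_t implicit_padding_fragile(void) {
    return sizeof(struct fragile_component)
         - (sizeof(uint8_t) + sizeof(uint64_t)
            + sizeof(uint8_t) + sizeof(uint32_t));
}

size_t implicit_padding_agnostic(void) {
    return sizeof(struct agnostic_component)
         - (sizeof(uint64_t) + sizeof(uint32_t)
            + 2 * sizeof(uint8_t) + 2 * sizeof(uint8_t));
}

/* Runtime check that the hand-padded layout is what we expect. */
void check_agnostic_layout(void) {
    assert(offsetof(struct agnostic_component, timeout) == 0);
    assert(offsetof(struct agnostic_component, delay) == 8);
    assert(offsetof(struct agnostic_component, debug) == 12);
    assert(offsetof(struct agnostic_component, reap) == 13);
    assert(sizeof(struct agnostic_component) == 16);
}
```

Calling check_agnostic_layout() once at start-up would catch a layout surprise early instead of leaving it to show up later as a bus error.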

My hypothesis: Apple's procedure for building Open MPI into Leopard had
a side effect, so that structures shared between its object code files
follow a data alignment different from the one I get when I simply
recompile Open MPI straight from its source.
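
To illustrate the kind of mismatch this hypothesis describes, here is a small sketch in which the same member list is laid out under two different alignment settings. It is only an illustration, not a diagnosis of Apple's build: the struct names are hypothetical, and `#pragma pack` (an extension supported by GCC and Clang) is used purely as a stand-in for whatever build difference might be involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as seen by code built with the default alignment rules. */
struct natural_view {
    uint8_t  flag;
    uint32_t value;
};

/* The same members as seen by code built with 1-byte packing. */
#pragma pack(push, 1)
struct packed_view {
    uint8_t  flag;
    uint32_t value;
};
#pragma pack(pop)

/* Returns 1 only if both builds agree on size and member offsets.
 * If they disagree, code using one view would read `value` from the
 * wrong bytes of an object created by code using the other view. */
int layouts_agree(void) {
    return sizeof(struct natural_view) == sizeof(struct packed_view)
        && offsetof(struct natural_view, value)
           == offsetof(struct packed_view, value);
}

/* Runtime check of the packed layout, which is fully determined. */
void check_packed_layout(void) {
    assert(offsetof(struct packed_view, value) == 1);
    assert(sizeof(struct packed_view) == 5);
}
```

On common ABIs the default layout places `value` at offset 4, so layouts_agree() returns 0 here. A component compiled with one layout and dropped in next to libraries compiled with the other would then be working from inconsistent offsets.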

I'm not saying anyone is to blame, but I recognize that those
builds have different timelines. I predict that if I overwrote all
of Leopard's Open MPI object code with my own build, it would all
run too.

For my needs, I have a sufficient workaround: realign my data
structures to be "agnostic". I'm sharing this little discovery just
in case it might help somebody else out there; for all I know it
could happen on non-Macs too.

Thanks,
   Dean