> I'm sorry, but now I am totally confused. Are you saying that you
> are having problems with the default rsh component in the
> distributed 1.2.3 code??
Yes ...
> Or are you having a problem with your customized version?
and yes. Each exhibited the same problem, a bus error.
> What compiler are you using? If it's your customized version, did
> you make sure to change the names of the data structures and
> modules as I pointed out?
gcc 4.0.1, the default on Leopard. Yes, in the customized version I
changed the names of the data structures, subroutines, and support
files, and every other place where it says "rsh", just as you said.
> We regularly work on Macs, both PPC and Intel based (I develop and
> test on both every day), and I have -never- seen this problem in
> our code base. Hence my confusion.
Sorry for the confusion. I'm starting with the shipping Mac OS X 10.5.1
"Leopard", which contains its own build of Open MPI (v1.2.3 according
to "orterun -version"). So I assumed that the v1.2.3 branch from
svn.open-mpi.org was the same code Apple used to build the Open MPI
that ships in Leopard.
My goal was to build a new pls module based on the pls_rsh module's
source code, replacing "rsh" with my own name as you suggested, but
I encountered a bus error. To be sure I hadn't made a mistake
somewhere in my custom module, I rebuilt the unmodified pls_rsh module
and discovered the same problem.
Then, suspecting that Apple's source was different, I downloaded the
Open MPI source from opensource.apple.com, recompiled the pls_rsh
module from it, and dropped just the resulting mca_pls_rsh.la and
mca_pls_rsh.so into Leopard's existing /usr/lib/openmpi, overwriting
Leopard's versions. The bus error happened exactly as before.
That's where I was with my first post to this list.
My last post reported the discovery that rearranging the elements of
orte_pls_rsh_component_t, without changing anything else in the
pls_rsh code, changes whether the bus error occurs. I then padded out
orte_pls_rsh_component_t and my "orte_pls_dean_component_t" by hand
so that they would be "data alignment agnostic", if you will.
As a result, the bus error no longer occurs and both pls modules now
run as they should.
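
To illustrate what hand padding for "alignment agnostic" layout can
look like, here is a minimal sketch in C. The struct and field names
(demo_natural, demo_padded, debug, timeout, reap, delay) are
hypothetical and are NOT the real members of orte_pls_rsh_component_t.
The sketch assumes fixed-width fields ordered largest first, with the
tail filled by an explicit pad array so that the compiler has no
implicit padding left to insert on mainstream ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fields for illustration only; this is NOT the real
 * orte_pls_rsh_component_t. Small and large members are interleaved,
 * so the compiler has to insert hidden padding between them. */
struct demo_natural {
    uint8_t  debug;
    uint64_t timeout;
    uint8_t  reap;
    uint32_t delay;
};

/* Same fields, largest first, with the tail padded by hand. Every
 * member already sits on a multiple of its own size, so the offsets
 * do not depend on the alignment rule the compiler applies. */
struct demo_padded {
    uint64_t timeout;
    uint32_t delay;
    uint8_t  debug;
    uint8_t  reap;
    uint8_t  pad[2];   /* explicit padding up to 16 bytes */
};

/* Bytes taken by the four real fields: 8 + 4 + 1 + 1. */
enum { DEMO_FIELD_BYTES = 14 };

/* Bytes the compiler added on its own to demo_natural. */
static inline size_t demo_natural_hidden(void) {
    return sizeof(struct demo_natural) - DEMO_FIELD_BYTES;
}

/* Bytes the compiler added on its own to demo_padded (pad[] excluded). */
static inline size_t demo_padded_hidden(void) {
    return sizeof(struct demo_padded) - DEMO_FIELD_BYTES - 2;
}

#ifdef LAYOUT_DEMO_MAIN   /* build with -DLAYOUT_DEMO_MAIN to run standalone */
int main(void) {
    printf("natural: size=%zu hidden=%zu\n",
           sizeof(struct demo_natural), demo_natural_hidden());
    printf("padded:  size=%zu hidden=%zu\n",
           sizeof(struct demo_padded), demo_padded_hidden());
    assert(demo_padded_hidden() == 0);
    return 0;
}
#endif
```

The numbers printed for demo_natural depend on the platform, which is
exactly the point; the hand-padded version should report no hidden
padding wherever each type's alignment is no larger than its size.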
My hypothesis: Apple's procedure for building Open MPI into Leopard
had a side effect, so that structures shared between its object code
and a module must follow a different data alignment than the one I
get when I simply recompile Open MPI straight from its source.
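
The general class of problem in this hypothesis can be sketched as
follows. This is an assumption-laden illustration, not a diagnosis of
Apple's build: it uses "#pragma pack" (supported by gcc and clang)
only as a stand-in for "two builds that disagree about alignment",
and the names view_default, view_packed and write_default_then_read
are hypothetical. One side writes a field at the offset its layout
dictates; the other side reads the same bytes at the offset its own
layout dictates and gets a different value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Layout as seen by a build using the default alignment rules. */
struct view_default {
    uint8_t  debug;
    uint32_t delay;
};

/* Same declaration as seen by a build that packs to 1 byte.
 * Stand-in only: what Apple's build really did is unknown. */
#pragma pack(push, 1)
struct view_packed {
    uint8_t  debug;
    uint32_t delay;
};
#pragma pack(pop)

/* Read a uint32_t from a raw byte offset without a misaligned access. */
static inline uint32_t read_u32_at(const unsigned char *base, size_t offset) {
    uint32_t v;
    memcpy(&v, base + offset, sizeof v);
    return v;
}

/* The "host" stores delay using the default layout; the "module"
 * then reads it back at whatever offset its own layout gives. */
static inline uint32_t write_default_then_read(uint32_t value,
                                               size_t reader_offset) {
    unsigned char buf[sizeof(struct view_default)] = {0};
    memcpy(buf + offsetof(struct view_default, delay), &value, sizeof value);
    return read_u32_at(buf, reader_offset);
}

#ifdef LAYOUT_DEMO_MAIN   /* build with -DLAYOUT_DEMO_MAIN to run standalone */
int main(void) {
    uint32_t sent = 0xAABBCCDDu;
    printf("offset of delay: default=%zu packed=%zu\n",
           offsetof(struct view_default, delay),
           offsetof(struct view_packed, delay));
    printf("sent=%08x matching view=%08x mismatched view=%08x\n",
           (unsigned)sent,
           (unsigned)write_default_then_read(sent,
               offsetof(struct view_default, delay)),
           (unsigned)write_default_then_read(sent,
               offsetof(struct view_packed, delay)));
    return 0;
}
#endif
```

If the mismatched field were a pointer rather than an integer,
dereferencing the garbage value could plausibly end in a crash such
as a bus error, and padding the struct by hand would make both views
agree.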
I'm not saying anyone is to blame; I'm only noting that those builds
were produced on different timelines. I predict that if I overwrote
all of Leopard's Open MPI object code, it would all run too.
For my needs, I have a sufficient workaround: realign my data
structures to be "agnostic". I'm sharing this little discovery just
in case it might help somebody else out there; for all I know it
could happen on non-Macs too.
Thanks,
Dean