To recap: the problem was that if orted was launched from Eclipse (on
OS X) then subsequent attempts to run a program (using mpirun or
whatever) returned immediately. If orted was launched from anywhere
else (java, command line, etc.) it worked fine.
Turning on daemon logging showed that the program was aborting
immediately because the execv() of the ssh command to the remote
machine was failing with errno=14 (EFAULT). Clearly there was
some environment difference, and after much checking it became
apparent that the difference was that the Eclipse-launched orted did
not have $(OMPI_INSTALL) in its path. The orte_pls_rsh_launch()
function checks if you're launching onto the local or a remote
machine. For a local machine (as in this case), it calls
opal_path_findv() to find the local path of orted. Unfortunately,
because $(OMPI_INSTALL) is not in the local path, this fails and
returns NULL. The NULL is then passed as the first argument to
execv(), which fails with EFAULT.
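
To illustrate the kind of check that is missing, here is a minimal
sketch. It is not the actual Open MPI code: find_in_path() is a
hypothetical stand-in for a PATH lookup such as opal_path_findv(), and
exec_from_path() is a made-up wrapper. The only point is that a NULL
lookup result is reported and never reaches execv().

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a PATH lookup such as opal_path_findv().
 * Returns a malloc'd path to the first entry called `name` in the
 * colon-separated `path_list` that passes access(X_OK), or NULL if
 * there is none. The caller frees the result. */
static char *find_in_path(const char *name, const char *path_list)
{
    if (name == NULL || path_list == NULL) {
        return NULL;
    }

    char *dirs = strdup(path_list);   /* strtok_r modifies its input */
    if (dirs == NULL) {
        return NULL;
    }

    char *result = NULL;
    char *save = NULL;
    for (char *dir = strtok_r(dirs, ":", &save); dir != NULL;
         dir = strtok_r(NULL, ":", &save)) {
        char candidate[PATH_MAX];
        int n = snprintf(candidate, sizeof candidate, "%s/%s", dir, name);
        if (n < 0 || (size_t)n >= sizeof candidate) {
            continue;                 /* path too long, skip this entry */
        }
        if (access(candidate, X_OK) == 0) {
            result = strdup(candidate);
            break;
        }
    }

    free(dirs);
    return result;
}

/* Hypothetical wrapper: look up `name`, then exec it. Does not return
 * on success. On failure returns -1 with errno set; a failed lookup is
 * reported as ENOENT instead of handing NULL to execv(). */
static int exec_from_path(const char *name, char *const argv[],
                          const char *path_list)
{
    char *exec_path = find_in_path(name, path_list);
    if (exec_path == NULL) {
        fprintf(stderr, "%s: not found in path\n", name);
        errno = ENOENT;
        return -1;
    }

    execv(exec_path, argv);

    /* Only reached if execv() itself failed. */
    int saved_errno = errno;
    free(exec_path);
    errno = saved_errno;
    return -1;
}
```

With a check like this, the failure would have shown up as "orted: not
found in path" rather than as an unexplained EFAULT.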
The problem is easily reproducible by taking $(OMPI_INSTALL) out of
your path, running $(OMPI_INSTALL)/orted, then trying to run
something with mpirun.
Why did it work from the command line? On OS X, the shell gets its
PATH from ~/.bash_profile, etc. (which in this case contained
OMPI_INSTALL), but applications launched from the window system get
their path from the loginwindow app, which looks in
~/.MacOSX/environment.plist for environment variables (which didn't
contain OMPI_INSTALL). I suspect, but haven't tried, that launching
Eclipse from the command line would have worked.
I'm not sure why the logic is there to look up the path again for
local launches, since it should be the same as the path in the
component. It should certainly check for a NULL return though.