Adding the MCA parameter 'plm_rsh_no_tree_spawn' solves the problem.
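For anyone hitting the same issue, here is how I set it; the parameter can go on the mpirun command line or in the environment (this reuses the paths and hostfile from my original command, so adjust them for your setup):

```shell
# Option 1: pass the MCA parameter directly on the command line
mpirun --mca plm_rsh_no_tree_spawn 1 --prefix /mnt/embedded_root/openmpi \
    -np 4 --map-by node -hostfile hostfile ./helloworld

# Option 2: export it in the environment before invoking mpirun
export OMPI_MCA_plm_rsh_no_tree_spawn=1
```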
If I understand correctly, the first layer of daemons covers three nodes, and
when there are more than three nodes a second layer of daemons is spawned.
So my problem happens when MPI processes are launched by the second
layer of daemons, is that correct? I think that is very likely; the second
layer of daemons may be missing some environment settings.
It would be really helpful if I could solve the underlying problem, though. Are
there any documents I can read on the way the daemons work? Do you have any
suggestions on how I can debug the issue?
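One check I can try myself, since each first-layer daemon must ssh onward to the next node: run the second hop manually and see whether it works non-interactively (the IP addresses here are the ones from my hostfile):

```shell
# mpirun sshes to the first-layer node; that node's daemon then sshes onward.
# BatchMode=yes makes ssh fail instead of prompting, exposing key problems.
ssh 10.0.0.16 ssh -o BatchMode=yes 10.0.0.17 hostname
```

If this prints "Host key verification failed." it would match the error I saw with hostnames, and would suggest the intermediate node's known_hosts is missing entries for the other nodes.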
On Sat, Apr 12, 2014 at 9:00 AM, <users-request_at_[hidden]> wrote:
> The problem is with the tree-spawn nature of the rsh/ssh launcher. For
> scalability, mpirun only launches a first "layer" of daemons. Each of those
> daemons then launches another layer in a tree-like fanout. The default
> pattern is such that you first notice it when you have four nodes in your
> allocation.
> You have two choices:
> * you can just add the MCA param
> plm_rsh_no_tree_spawn=1 to your environment/cmd line
> * you can resolve the tree spawn issue so that a daemon on one of your
> nodes is capable of ssh-ing a daemon on another node
> Either way will work.
> On Apr 11, 2014, at 11:17 AM, Allan Wu <allwu_at_[hidden]> wrote:
> > Hello everyone,
> > I am running a simple helloworld program on several nodes using OpenMPI
> 1.8. Running on a single node or a small number of nodes succeeds, but when
> I tried to run the same binary on four different nodes, problems occurred.
> > I am using 'mpirun' command line like the following:
> > # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node
> -hostfile hostfile ./helloworld
> > And my hostfile looks something like this:
> > 10.0.0.16
> > 10.0.0.17
> > 10.0.0.18
> > 10.0.0.19
> > When executing this command, it will result in an error message "sh:
> syntax error: unexpected word", and the program will deadlock. When I added
> "--debug-devel" the output is in the attachment "err_msg_0.txt". In the
> log, "fpga0" is the hostname of "10.0.0.16" and "fpga1" is for "10.0.0.17"
> and so on.
> > However, the weird part is that after I remove one line in the hostfile,
> the problem goes away. It does not matter which host I remove; as long as
> there are fewer than four hosts, the program executes without any problem.
> > I also tried using hostname in the hostfile, as:
> > fpga0
> > fpga1
> > fpga2
> > fpga3
> > And the same problem occurs, and the error message becomes "Host key
> verification failed.". I have setup public/private key pairs on all nodes,
> and each node can ssh to any node without problems. I also attached the
> message of --debug-devel as "err_msg_1.txt".
> > I'm running MPI programs on embedded ARM processors. I have previously
> posted questions on cross-compilation on the develop mailing list, which
> contains the setup I used. If you need the information please refer to
> http://www.open-mpi.org/community/lists/devel/2014/04/14440.php, and the
> output of 'ompi_info --all' is also attached to this email.
> > Please let me know if I need to provide more information. Thanks in advance.
> > Regards,
> > --
> > Di Wu (Allan)
> > PhD student, VAST Laboratory,
> > Department of Computer Science, UC Los Angeles
> > Email: allwu_at_[hidden]
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users