Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun problem when running on more than three hosts with OpenMPI 1.8
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-04-11 14:27:16


The problem is with the tree-spawn nature of the rsh/ssh launcher. For scalability, mpirun only launches a first "layer" of daemons. Each of those daemons then launches another layer in a tree-like fanout. The default pattern is such that you first notice it when you have four nodes in your allocation.

You have two choices:

* you can just add the MCA param plm_rsh_no_tree_spawn=1 to your environment/cmd line

* you can resolve the tree spawn issue so that a daemon on one of your nodes is capable of ssh-ing a daemon on another node
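
As a minimal sketch of the first option, the parameter can be set through the environment. Open MPI reads MCA parameters from variables named OMPI_MCA_<param>; the command-line form in the comment simply adds --mca to the mpirun command from the original report, and is untested on that setup.

```shell
# Disable tree spawn so that mpirun itself launches every daemon over ssh.
export OMPI_MCA_plm_rsh_no_tree_spawn=1
echo "OMPI_MCA_plm_rsh_no_tree_spawn=$OMPI_MCA_plm_rsh_no_tree_spawn"

# Equivalent command-line form (same parameter, passed with --mca):
# mpirun --mca plm_rsh_no_tree_spawn 1 --prefix /mnt/embedded_root/openmpi \
#     -np 4 --map-by node -hostfile hostfile ./helloworld
```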

Either way will work.
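
For the second option, one way to find the broken hop is to test every node-to-node ssh path non-interactively, since with tree spawn a daemon on one compute node has to ssh to another. The sketch below is a hypothetical helper, not part of Open MPI: check_pairs and the SSH override variable are names made up for illustration, and it assumes the hosts are reachable from the node where it runs.

```shell
#!/bin/sh
# Hypothetical helper: verify that each node can ssh to every other node
# without a password or host-key prompt.
# SSH may be overridden (for example SSH=true) for a dry run.
SSH="${SSH:-ssh}"

check_pairs() {
    # Arguments: the hostnames exactly as they appear in the hostfile.
    failures=0
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of prompting, which is
            # what a daemon launched without a terminal would experience.
            if $SSH -n -o BatchMode=yes "$src" ssh -o BatchMode=yes "$dst" true; then
                echo "ok:   $src -> $dst"
            else
                echo "FAIL: $src -> $dst"
                failures=$((failures + 1))
            fi
        done
    done
    # Succeed only if every pair worked.
    [ "$failures" -eq 0 ]
}
```

Calling it as `check_pairs fpga0 fpga1 fpga2 fpga3` prints one line per ordered pair. A FAIL line for a node-to-node hop would be consistent with the "Host key verification failed." message, for instance when the source node's known_hosts file has no entry for the destination.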
Ralph

On Apr 11, 2014, at 11:17 AM, Allan Wu <allwu_at_[hidden]> wrote:

> Hello everyone,
>
> I am running a simple helloworld program on several nodes using OpenMPI 1.8. Running on a single node or a small number of nodes succeeds, but when I tried to run the same binary on four different nodes, problems occurred.
>
> I am using an 'mpirun' command line like the following:
> # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile hostfile ./helloworld
> And my hostfile looks something like this:
> 10.0.0.16
> 10.0.0.17
> 10.0.0.18
> 10.0.0.19
>
> Executing this command results in the error message "sh: syntax error: unexpected word", and the program deadlocks. When I added "--debug-devel", I got the output in the attachment "err_msg_0.txt". In the log, "fpga0" is the hostname of "10.0.0.16", "fpga1" is the hostname of "10.0.0.17", and so on.
>
> However, the weird part is that after I remove one line from the hostfile, the problem goes away. It does not matter which host I remove: as long as there are fewer than four hosts, the program executes without any problem.
>
> I also tried using hostnames in the hostfile, as:
> fpga0
> fpga1
> fpga2
> fpga3
> The same problem occurs, but the error message becomes "Host key verification failed.". I have set up public/private key pairs on all nodes, and each node can ssh to any other node without problems. I also attached the output of --debug-devel as "err_msg_1.txt".
>
> I'm running MPI programs on embedded ARM processors. I previously posted questions about cross-compilation on the devel mailing list, and that post describes the setup I used. If you need the information, please refer to http://www.open-mpi.org/community/lists/devel/2014/04/14440.php. The output of 'ompi_info --all' is also attached to this email.
>
> Please let me know if I need to provide more information. Thanks in advance!
>
> Regards,
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory,
> Department of Computer Science, UC Los Angeles
> Email: allwu_at_[hidden]
> <err_msg_0.txt><err_msg_1.txt><log.tar.gz>