On Apr 13, 2014, at 11:42 AM, Allan Wu <allwu_at_[hidden]> wrote:
> Thanks, Ralph!
> Adding the MCA parameter 'plm_rsh_no_tree_spawn' solves the problem.
> If I understand correctly, the first layer of daemons covers three nodes, and when there are more than three nodes a second layer of daemons is spawned. So my problem happens when MPI processes are launched by the second layer of daemons, is that correct?
Yes, that is correct
> I think that is very likely: the second layer of daemons may be missing some environment settings.
> It would be really helpful if I could solve the problem, though. Are there any documents on how the daemons work? Do you have any suggestions on how I can debug the issue?
The easiest way to debug the issue is to add "-mca plm_base_verbose 5 --debug-daemons" to your command line. This will show the commands being used in the launch and allow ssh errors to reach the screen.
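Put together with the command line from the original report, the debugging invocation might look like the sketch below (the hostfile path and binary name are the ones from this thread; the script only assembles and prints the command so the added flags are easy to see):

```shell
# Sketch: the job from the original report with launch debugging enabled.
# Assembling the argument list first makes the added flags stand out.
MPIRUN_ARGS="-mca plm_base_verbose 5 --debug-daemons -np 4 --map-by node -hostfile hostfile"
echo "mpirun --prefix /mnt/embedded_root/openmpi $MPIRUN_ARGS ./helloworld"
```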
> On Sat, Apr 12, 2014 at 9:00 AM, <users-request_at_[hidden]> wrote:
> The problem is with the tree-spawn nature of the rsh/ssh launcher. For scalability, mpirun only launches a first "layer" of daemons. Each of those daemons then launches another layer in a tree-like fanout. The default pattern is such that you first notice it when you have four nodes in your allocation.
> You have two choices:
> * you can just add the MCA param plm_rsh_no_tree_spawn=1 to your environment/cmd line
> * you can resolve the tree spawn issue so that a daemon on one of your nodes is capable of ssh-ing a daemon on another node
> Either way will work.
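For the first choice, the MCA parameter can be set in any of the usual ways; a sketch (the hostfile and binary names are the ones from this thread):

```shell
# Three equivalent ways to disable tree spawn; pick one.

# 1. On the mpirun command line:
#      mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -hostfile hostfile ./helloworld

# 2. As an environment variable, exported before calling mpirun:
export OMPI_MCA_plm_rsh_no_tree_spawn=1

# 3. Persistently, in the per-user parameter file ~/.openmpi/mca-params.conf:
#      plm_rsh_no_tree_spawn = 1

echo "$OMPI_MCA_plm_rsh_no_tree_spawn"
```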
> On Apr 11, 2014, at 11:17 AM, Allan Wu <allwu_at_[hidden]> wrote:
> > Hello everyone,
> > I am running a simple helloworld program on several nodes using Open MPI 1.8. Running the same command on a single node or a small number of nodes succeeds, but when I tried to run the same binary on four different nodes, problems occurred.
> > I am using 'mpirun' command line like the following:
> > # mpirun --prefix /mnt/embedded_root/openmpi -np 4 --map-by node -hostfile hostfile ./helloworld
> > And my hostfile looks something like this:
> > 10.0.0.16
> > 10.0.0.17
> > 10.0.0.18
> > 10.0.0.19
> > Executing this command results in the error message "sh: syntax error: unexpected word", and then the program deadlocks. When I added "--debug-devel", I got the output in the attachment "err_msg_0.txt". In the log, "fpga0" is the hostname of "10.0.0.16", "fpga1" is "10.0.0.17", and so on.
> > However, the weird part is that after I remove one line from the hostfile, the problem goes away. It does not matter which host I remove; as long as there are fewer than four hosts, the program executes without any problem.
> > I also tried using hostnames in the hostfile, as:
> > fpga0
> > fpga1
> > fpga2
> > fpga3
> > The same problem occurs, but the error message becomes "Host key verification failed." I have set up public/private key pairs on all nodes, and each node can ssh to any other node without problems. I also attached the --debug-devel output as "err_msg_1.txt".
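(A note on the "Host key verification failed" symptom: it usually means the non-interactive ssh started by a daemon hit the interactive host-key prompt. One sketch of a fix, assuming the fpga* hostnames and 10.0.0.x addresses used in this thread, is an ~/.ssh/config entry on every node; pre-populating each node's known_hosts with ssh-keyscan is a stricter alternative.)

```
# ~/.ssh/config on every node (sketch; host patterns assume the
# fpga* names and 10.0.0.x addresses used in this thread).
# Disables the interactive host-key prompt for cluster nodes only.
Host fpga* 10.0.0.*
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```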
> > I'm running MPI programs on embedded ARM processors. I have previously posted a question about cross-compilation on the devel mailing list, which describes the setup I used. If you need that information, please refer to http://www.open-mpi.org/community/lists/devel/2014/04/14440.php; the output of 'ompi_info --all' is also attached to this email.
> > Please let me know if I need to provide more information. Thanks in advance!
> > Regards,
> > --
> > Di Wu (Allan)
> > PhD student, VAST Laboratory,
> > Department of Computer Science, UC Los Angeles
> > Email: allwu_at_[hidden]
> > <err_msg_0.txt> <err_msg_1.txt> <log.tar.gz>
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users