IIRC, 1.6.5 defaults to *not* using the tree spawn. We changed the default in the 1.7 series because launch performance with the tree spawn is so much better.
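
For anyone who wants the 1.6-style flat launch on a newer build, here is a minimal sketch. It assumes the usual Open MPI convention that an MCA parameter can be set through an `OMPI_MCA_`-prefixed environment variable as well as on the command line. The parameter name is the one quoted in this thread; the host list is only illustrative.

```shell
# Disable the tree spawn for every mpirun started from this shell
# (intended to match passing "-mca plm_rsh_no_tree_spawn 1" each time).
export OMPI_MCA_plm_rsh_no_tree_spawn=1

# Illustrative run; the host names are placeholders taken from the thread.
# mpirun -np 4 -host node0,node1,node2,node3 hostname
```

This only sidesteps the tree spawn. If the tree spawn is left enabled, the intermediate nodes still need working ssh to the nodes they launch on.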


On Nov 11, 2013, at 8:22 AM, Christoffer Hamberg <christoffer.hamberg@gmail.com> wrote:

I re-configured the ssh keys now and for some reason it seems to work. But what baffles me is that the same ssh configuration worked for the other installation (1.6.5) but not for this one.

Thanks for the help!


2013/11/11 Reuti <reuti@staff.uni-marburg.de>
On 11.11.2013, at 10:04, Christoffer Hamberg wrote:

> (Correction: I mixed up the output of the first two examples in my first mail, so it fails on the first one.)
>
> ubuntu@node0:~$ mpirun --leave-session-attached -mca plm_base_verbose 5 -np 4 -host node0,node1,node2,node3 hostname
> [node0:01486] mca:base:select:(  plm) Querying component [slurm]
> [node0:01486] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
> [node0:01486] mca:base:select:(  plm) Querying component [rsh]
> [node0:01486] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [node0:01486] mca:base:select:(  plm) Selected component [rsh]
> [node2:26962] mca:base:select:(  plm) Querying component [rsh]
> [node2:26962] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [node2:26962] mca:base:select:(  plm) Selected component [rsh]
> [node1:11477] mca:base:select:(  plm) Querying component [rsh]
> [node1:11477] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [node1:11477] mca:base:select:(  plm) Selected component [rsh]
> Host key verification failed.
>
>
> ubuntu@node0:~$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -host node0,node1,node2,node3 hostname
> node0
> node1
> node2
> node3
>
> So it definitely looks like a problem with the tree spawn. Any clue how I could proceed?

Is passphraseless ssh also possible between the nodes themselves, not only from node0 to each node? With host-based authentication you can also enable it for all users at once, without having to prepare ssh keys for each of them.
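
A rough way to check this is sketched below. It assumes the node names from this thread (override them with the `HOSTS` variable) and that the launching host can already reach every node. The helper name `pair_checks` and the `RUN` switch are made up for this example. It prints one nested ssh command per ordered pair of nodes, and only executes them when `RUN=1` is set.

```shell
#!/bin/sh
# Host names are taken from the thread; set HOSTS to override them.
HOSTS="${HOSTS:-node0 node1 node2 node3}"

# Print one nested-ssh check per ordered pair of distinct hosts:
# log in to $src, and from there try to log in to $dst.
pair_checks() {
    for src in $HOSTS; do
        for dst in $HOSTS; do
            if [ "$src" != "$dst" ]; then
                # BatchMode=yes makes ssh fail instead of prompting for a
                # password or for host key confirmation.
                echo "ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true"
            fi
        done
    done
}

# Dry run by default; set RUN=1 to actually execute each check.
pair_checks | while read -r cmd; do
    if [ "${RUN:-0}" = "1" ]; then
        # stdin is redirected so ssh does not swallow the remaining lines.
        if $cmd </dev/null; then
            echo "ok:   $cmd"
        else
            echo "FAIL: $cmd"
        fi
    else
        echo "$cmd"
    fi
done
```

Any pair reported as FAIL is a hop that an intermediate node could not make during a tree-based launch, for example because the target's host key is missing from that node's known_hosts file.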

-- Reuti


> /Christoffer
>
>
> 2013/11/11 Ralph Castain <rhc@open-mpi.org>
> Add --enable-debug to your configure and run it with the following additional options
>
> --leave-session-attached -mca plm_base_verbose 5
>
> Let's see where it fails during the launch phase. Offhand, the only thing that message means to me is that the ssh keys are botched on at least one node. Keep in mind that we use a tree-based launch, and so when you have more than two nodes, one or more of the intermediate nodes are executing an ssh.
>
> One way to see if that's the problem is to launch without the tree spawn: add
>
> -mca plm_rsh_no_tree_spawn 1
>
> to your cmd line and see if it works.
>
>
>
> On Nov 10, 2013, at 9:24 AM, Christoffer Hamberg <christoffer.hamberg@gmail.com> wrote:
>
>> Hi,
>>
>> I'm having some strange problems running Open MPI (1.9a1r29559) with Java bindings on a Calxeda Highbank ARM server running Ubuntu 12.10 (GNU/Linux 3.5.0-43-highbank armv7l).
>>
>> The problem arises when I try to run a job on more than 3 nodes (I have a total of 8).
>> Note: It's the same error for any of the node[0-7].
>>
>> ubuntu@node0:~$ mpirun -np 4 -host node0,node1,node2 hostname
>> Host key verification failed.
>>
>> ubuntu@node0:~$ mpirun -np 4 -host node0,node1,node2,node3 hostname
>> node0
>> node0
>> node1
>> node2
>>
>> Not running the job on the current node also gives "Host key verification failed" with only 3 nodes:
>>
>> ubuntu@node0:~$ mpirun -np 4 -host node1,node3,node5 hostname
>> Host key verification failed.
>>
>> But not on 2 nodes:
>> ubuntu@node0:~$ mpirun -np 4 -host node1,node3 hostname
>> node1
>> node1
>> node3
>> node3
>>
>> I've configured it with the following:
>> ./configure --prefix=/opt/openmpi-1.9-java --without-openib --enable-static --with-threads=posix --enable-mpi-thread-multiple --enable-mpi-java --with-jdk-bindir=/usr/lib/jvm/java-7-openjdk-armhf/bin --with-jdk-headers=/usr/lib/jvm/java-7-openjdk-armhf/include
>>
>> I also have Open MPI 1.6.5 (without Java bindings) installed, and it runs without any problems on all nodes, so the SSH setup that the error points to should not be the problem.
>>
>> Any ideas?
>>
>> Regards,
>> Christoffer
>> _______________________________________________
>> users mailing list
>> users@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users