Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem running on multiple nodes with Java bindings
From: Christoffer Hamberg (christoffer.hamberg_at_[hidden])
Date: 2013-11-11 11:39:11


That explains it, thank you for the quick answer.

2013/11/11 Ralph Castain <rhc_at_[hidden]>

> IIRC, 1.6.5 defaults to *not* using the tree spawn. We changed it in 1.7
> series because the launch performance is so much better.
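
To get the 1.6.5 launch behaviour on the newer series without passing the flag on every command line, the parameter can also be set persistently. MCA parameters can be given through an `OMPI_MCA_`-prefixed environment variable or in a per-user `$HOME/.openmpi/mca-params.conf` file. The following is only a minimal sketch assuming a bash shell; the helper name `set_no_tree_spawn` is invented for illustration, and only the parameter name comes from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Open MPI): record plm_rsh_no_tree_spawn
# in a per-user MCA parameter file so that every mpirun picks it up.
#   $1 = value (1 disables the tree spawn)
#   $2 = parameter file (default: ~/.openmpi/mca-params.conf)
set_no_tree_spawn() {
    local value="${1:-1}"
    local file="${2:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # Drop any earlier setting of the same parameter, then append the new one.
    grep -v '^plm_rsh_no_tree_spawn[[:space:]]*=' "$file" > "$file.tmp" || true
    printf 'plm_rsh_no_tree_spawn = %s\n' "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# One-off alternative for the current shell only:
# export OMPI_MCA_plm_rsh_no_tree_spawn=1
```

Calling `set_no_tree_spawn 1` once on the node that runs mpirun should be enough. Note that this only hides the ssh problem between the compute nodes; it does not fix it.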
>
>
> On Nov 11, 2013, at 8:22 AM, Christoffer Hamberg <
> christoffer.hamberg_at_[hidden]> wrote:
>
> I re-configured the ssh keys now and for some reason it seems to work. But
> what baffles me is that the same ssh configuration worked for the other
> installation (1.6.5) but not for this one.
>
> Thanks for the help!
>
>
> 2013/11/11 Reuti <reuti_at_[hidden]>
>
>> On 11.11.2013 at 10:04, Christoffer Hamberg wrote:
>>
>> > (Correction: I mixed up the output of the first two examples in my
>> > first mail, so it fails on the first one)
>> >
>> > ubuntu_at_node0:~$ mpirun --leave-session-attached -mca plm_base_verbose 5 -np 4 -host node0,node1,node2,node3 hostname
>> > [node0:01486] mca:base:select:( plm) Querying component [slurm]
>> > [node0:01486] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>> > [node0:01486] mca:base:select:( plm) Querying component [rsh]
>> > [node0:01486] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> > [node0:01486] mca:base:select:( plm) Selected component [rsh]
>> > [node2:26962] mca:base:select:( plm) Querying component [rsh]
>> > [node2:26962] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> > [node2:26962] mca:base:select:( plm) Selected component [rsh]
>> > [node1:11477] mca:base:select:( plm) Querying component [rsh]
>> > [node1:11477] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> > [node1:11477] mca:base:select:( plm) Selected component [rsh]
>> > Host key verification failed.
>> >
>> >
>> > ubuntu_at_node0:~$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -host node0,node1,node2,node3 hostname
>> > node0
>> > node1
>> > node2
>> > node3
>> >
>> > So it definitely looks like a problem with the tree spawn. Any clue how
>> > I could proceed?
>>
>> Is passphraseless ssh also possible between the nodes? With hostbased
>> authentication it can also be enabled for all users without having to
>> prepare the ssh keys.
>>
>> -- Reuti
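
The question above can be checked mechanically by trying a non-interactive ssh from every node to every other node, not only from the node that runs mpirun. The following is only a sketch: the function name `check_pairs` and the overridable `SSH` variable are made up for this example, and it assumes the first hop from the local machine to each node already works without a prompt.

```shell
#!/usr/bin/env bash
# Sketch: check non-interactive ssh from every node to every other node.
# SSH can be overridden (e.g. with a stub) to dry-run the loop.
SSH="${SSH:-ssh}"

check_pairs() {
    local src dst failed=0
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of prompting for a
            # password, a passphrase, or an unknown host key.
            if $SSH -n -o BatchMode=yes "$src" \
                    ssh -n -o BatchMode=yes "$dst" true 2>/dev/null; then
                echo "ok   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
                failed=1
            fi
        done
    done
    return "$failed"
}
```

For the cluster in this thread the call would be something like `check_pairs node0 node1 node2 node3`. Any `FAIL` line names a pair where an intermediate node could not start a daemon on its child during a tree spawn.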
>>
>>
>> > /Christoffer
>> >
>> >
>> > 2013/11/11 Ralph Castain <rhc_at_[hidden]>
>> > Add --enable-debug to your configure and run it with the following
>> > additional options
>> >
>> > --leave-session-attached -mca plm_base_verbose 5
>> >
>> > Let's see where it fails during the launch phase. Offhand, the only
>> > thing that message means to me is that the ssh keys are botched on at least
>> > one node. Keep in mind that we use a tree-based launch, and so when you
>> > have more than two nodes, one or more of the intermediate nodes are
>> > executing an ssh.
>> >
>> > One way to see if that's the problem is to launch without the tree
>> > spawn: add
>> >
>> > -mca plm_rsh_no_tree_spawn 1
>> >
>> > to your cmd line and see if it works.
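
To make the tree-based launch described above more concrete, here is a toy model of which node would ssh to which. It assumes a plain binary tree rooted at the node running mpirun. That is purely an assumption for illustration: the real topology is chosen by Open MPI and may differ, and `launch_plan` is a made-up name.

```shell
#!/usr/bin/env bash
# Illustration only: model the launch as a binary tree rooted at the node
# running mpirun (first argument). Not Open MPI's actual routing code.
# Prints "<launcher> -> <target>" for every ssh hop.
launch_plan() {
    local nodes=("$@")
    local i parent
    for ((i = 1; i < ${#nodes[@]}; i++)); do
        # In a binary tree stored as an array, the parent of slot i
        # is slot (i - 1) / 2 (integer division).
        parent=$(( (i - 1) / 2 ))
        echo "${nodes[parent]} -> ${nodes[i]}"
    done
}
```

In this model, `launch_plan node0 node1 node2 node3` prints `node0 -> node1`, `node0 -> node2` and `node1 -> node3`: the last hop is an ssh started on node1, not on node0, so node1 needs working keys and a known host entry for node3. With `-mca plm_rsh_no_tree_spawn 1`, every hop starts from node0 instead.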
>> >
>> >
>> >
>> > On Nov 10, 2013, at 9:24 AM, Christoffer Hamberg <
>> > christoffer.hamberg_at_[hidden]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm having some strange problems running Open MPI (1.9a1r29559) with
>> >> Java bindings on a Calxeda highbank ARM Server running Ubuntu 12.10
>> >> (GNU/Linux 3.5.0-43-highbank armv7l).
>> >>
>> >> The problem arises when I try to run a job on more than 3 nodes (I
>> >> have a total of 8).
>> >> Note: It's the same error for any of the node[0-7].
>> >>
>> >> ubuntu_at_node0:~$ mpirun -np 4 -host node0,node1,node2 hostname
>> >> Host key verification failed.
>> >>
>> >> ubuntu_at_node0:~$ mpirun -np 4 -host node0,node1,node2,node3 hostname
>> >> node0
>> >> node0
>> >> node1
>> >> node2
>> >>
>> >> Leaving the current node out of the host list also gives "Host key
>> >> verification failed" with only 3 nodes.
>> >>
>> >> ubuntu_at_node0:~$ mpirun -np 4 -host node1,node3,node5 hostname
>> >> Host key verification failed.
>> >>
>> >> But not on 2 nodes:
>> >> ubuntu_at_node0:~$ mpirun -np 4 -host node1,node3 hostname
>> >> node1
>> >> node1
>> >> node3
>> >> node3
>> >>
>> >> I've configured it with the following:
>> >> ./configure --prefix=/opt/openmpi-1.9-java --without-openib --enable-static --with-threads=posix --enable-mpi-thread-multiple --enable-mpi-java --with-jdk-bindir=/usr/lib/jvm/java-7-openjdk-armhf/bin --with-jdk-headers=/usr/lib/jvm/java-7-openjdk-armhf/include
>> >>
>> >> I have Open MPI 1.6.5 (without Java bindings) installed and it runs
>> >> without any problems on all nodes, so SSH, which is what the error
>> >> points to, should not be the problem.
>> >>
>> >> Any ideas?
>> >>
>> >> Regards,
>> >> Christoffer
>> >> _______________________________________________
>> >> users mailing list
>> >> users_at_[hidden]
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users