Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem running on multiple nodes with Java bindings
From: Christoffer Hamberg (christoffer.hamberg_at_[hidden])
Date: 2013-11-11 04:04:20


(Correction: I mixed up the output of the first two examples in my first
mail, so it fails on the first one)

ubuntu_at_node0:~$ mpirun --leave-session-attached -mca plm_base_verbose 5 -np 4 -host node0,node1,node2,node3 hostname
[node0:01486] mca:base:select:( plm) Querying component [slurm]
[node0:01486] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[node0:01486] mca:base:select:( plm) Querying component [rsh]
[node0:01486] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node0:01486] mca:base:select:( plm) Selected component [rsh]
[node2:26962] mca:base:select:( plm) Querying component [rsh]
[node2:26962] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node2:26962] mca:base:select:( plm) Selected component [rsh]
[node1:11477] mca:base:select:( plm) Querying component [rsh]
[node1:11477] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node1:11477] mca:base:select:( plm) Selected component [rsh]
Host key verification failed.

ubuntu_at_node0:~$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 4 -host node0,node1,node2,node3 hostname
node0
node1
node2
node3

So it definitely looks like a problem with the tree spawn. Any clue how I
could proceed?

/Christoffer
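
Since the tree-based launch described in the quoted reply has intermediate
nodes running ssh themselves, one possible next step is to check
non-interactive ssh between every pair of nodes, not only from node0
outwards. The following is a minimal sketch under that assumption, not an
Open MPI tool: check_pairs and DRY_RUN are hypothetical names introduced
here, and BatchMode=yes is a standard OpenSSH option that makes ssh fail
instead of prompting.

```sh
#!/bin/sh
# Hypothetical helper (not part of Open MPI): for every ordered pair of
# nodes, log in to the first node and, from there, ssh to the second one.
# BatchMode=yes makes ssh fail instead of prompting, so an unknown host key
# shows up as a FAIL line rather than an interactive question.
# Set DRY_RUN=1 to print the commands instead of running them.
check_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            # Skip the node talking to itself.
            [ "$src" = "$dst" ] && continue
            cmd="ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true"
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "$cmd"
            elif $cmd </dev/null >/dev/null 2>&1; then
                echo "ok   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
            fi
        done
    done
}

# Example call (assumes the node names used in this thread):
#   check_pairs node0 node1 node2 node3
```

Any pair reported as FAIL would be a candidate for the hop that breaks
under tree spawn; running the printed command by hand on that pair should
show the actual ssh error.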

2013/11/11 Ralph Castain <rhc_at_[hidden]>

> Add --enable-debug to your configure and run it with the following
> additional options
>
> --leave-session-attached -mca plm_base_verbose 5
>
> Let's see where it fails during the launch phase. Offhand, the only thing
> that message means to me is that the ssh keys are botched on at least one
> node. Keep in mind that we use a tree-based launch, and so when you have
> more than two nodes, one or more of the intermediate nodes are executing an
> ssh.
>
> One way to see if that's the problem is to launch without the tree spawn:
> add
>
> -mca plm_rsh_no_tree_spawn 1
>
> to your cmd line and see if it works.
>
>
>
> On Nov 10, 2013, at 9:24 AM, Christoffer Hamberg <
> christoffer.hamberg_at_[hidden]> wrote:
>
> Hi,
>
> I'm having some strange problems running Open MPI (1.9a1r29559) with Java
> bindings on a Calxeda Highbank ARM server running Ubuntu 12.10 (GNU/Linux
> 3.5.0-43-highbank armv7l).
>
> The problem arises when I try to run a job on more than 3 nodes (I have a
> total of 8).
> Note: It's the same error for any of the node[0-7].
>
> ubuntu_at_node0:~$ mpirun -np 4 -host node0,node1,node2 hostname
> Host key verification failed.
>
> ubuntu_at_node0:~$ mpirun -np 4 -host node0,node1,node2,node3 hostname
> node0
> node0
> node1
> node2
>
> Leaving the current node out of the job also gives "Host key
> verification failed" with only 3 nodes:
>
> ubuntu_at_node0:~$ mpirun -np 4 -host node1,node3,node5 hostname
> Host key verification failed.
>
> But not on 2 nodes:
> ubuntu_at_node0:~$ mpirun -np 4 -host node1,node3 hostname
> node1
> node1
> node3
> node3
>
> I've configured it with the following:
> ./configure --prefix=/opt/openmpi-1.9-java --without-openib
> --enable-static --with-threads=posix --enable-mpi-thread-multiple
> --enable-mpi-java --with-jdk-bindir=/usr/lib/jvm/java-7-openjdk-armhf/bin
> --with-jdk-headers=/usr/lib/jvm/java-7-openjdk-armhf/include
>
> I have Open MPI 1.6.5 (without Java bindings) installed and it runs without
> any problems on all nodes, so SSH itself, which the error points to, should
> not be the problem.
>
> Any ideas?
>
> Regards,
> Christoffer
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users