Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error using hostfile
From: jody (jody.xha_at_[hidden])
Date: 2011-07-09 07:04:01


Hi,
If your LD_LIBRARY_PATH is not set for non-interactive logins,
then successful interactive runs on the remote machines are not
sufficient evidence that the paths are correct.

Check this FAQ
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
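For example, on typical Linux clusters sshd starts a non-interactive shell that reads ~/.bashrc, so exporting the paths there covers mpirun launches. A sketch, assuming the /usr/local/openmpi1.4.3 prefix mentioned later in this thread (adjust to your install):

```shell
# Candidate lines for ~/.bashrc on every node; place them above any
# "if not interactive, return" guard so non-interactive shells reach them.
# The prefix below is the one used elsewhere in this thread.
export PATH=/usr/local/openmpi1.4.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi1.4.3/lib:$LD_LIBRARY_PATH
```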

To see if your variables are set correctly for
non-interactive sessions on your nodes,
you can execute
  mpirun --hostfile hostfile -np 4 printenv
and scan the output for PATH and LD_LIBRARY_PATH.
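If you prefer not to scan by eye, the same check can be piped through grep; the pattern below is my own sketch, not from the FAQ:

```shell
# Keep only the two variables of interest from printenv-style output.
filter='^(PATH|LD_LIBRARY_PATH)='
# On the cluster you would run:
#   mpirun --hostfile hostfile -np 4 printenv | grep -E "$filter"
# Demonstrated here against the local environment:
printenv | grep -E "$filter"
```

Every rank should report a PATH and LD_LIBRARY_PATH pointing at the same Open MPI installation.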

Hope this helps
  Jody

On Sat, Jul 9, 2011 at 12:25 AM, Mohan, Ashwin <ashmohan_at_[hidden]> wrote:
> Thanks Ralph.
>
>
>
> I have emailed the network admin on the firewall issue.
>
>
>
> About the PATH and LD_LIBRARY_PATH issue, is it sufficient evidence that the
> paths are set correctly if I am able to compile and run successfully on the
> individual nodes mentioned in the machine file?
>
>
>
> Thanks,
> Ashwin.
>
>
>
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Ralph Castain
> Sent: Friday, July 08, 2011 1:58 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error using hostfile
>
>
>
> Is there a firewall in the way? The error indicates that daemons were
> launched on the remote machines, but failed to communicate back.
>
>
>
> Also, check that your remote PATH and LD_LIBRARY_PATH are being set to the
> right place to pickup this version of OMPI. Lots of systems deploy with
> default versions that may not be compatible, so if you wind up running a
> daemon on the remote node that comes from another installation, things won't
> work.
>
>
>
>
>
> On Jul 8, 2011, at 10:52 AM, Mohan, Ashwin wrote:
>
> Hi,
>
> I am following up on a previously posted error. Based on the previous
> recommendation, I set up passwordless SSH login.
>
>
>
> I created a hostfile comprising 4 nodes (each node having 4 slots). I
> tried to run my job on 4 slots but get no output, so I end up killing
> the job. I am trying to run a simple MPI program on 4 nodes and trying to
> figure out what the issue could be. What can I check to ensure that I can
> run jobs on all 4 nodes (each node has 4 slots)?
>
>
>
> Here is the simple MPI program I am trying to execute on 4 nodes
>
> **************************
>
> if (my_rank != 0)
> {
>     /* every non-root rank sends one greeting to rank 0 */
>     sprintf(message, "Greetings from the process %d!", my_rank);
>     dest = 0;
>     MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
>              MPI_COMM_WORLD);
> }
> else
> {
>     /* rank 0 receives and prints one greeting from each other rank */
>     for (source = 1; source < p; source++)
>     {
>         MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD,
>                  &status);
>         printf("%s\n", message);
>     }
> }
>
>
>
> ****************************
>
> My hostfile looks like this:
>
>
>
> [amohan_at_myocyte48 ~]$ cat hostfile
>
> myocyte46
>
> myocyte47
>
> myocyte48
>
> myocyte49
>
> *******************************
>
>
>
> I use the following run command: mpirun --hostfile hostfile -np 4 new46
>
> and receive a blank screen, so I have to kill the job.
>
>
>
> OUTPUT ON KILLING JOB:
>
> mpirun: killing job...
>
> --------------------------------------------------------------------------
>
> mpirun noticed that the job aborted, but has no info as to the process
>
> that caused that situation.
>
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> mpirun was unable to cleanly terminate the daemons on the nodes shown
>
> below. Additional manual cleanup may be required - please refer to
>
> the "orte-clean" tool for assistance.
>
> --------------------------------------------------------------------------
>
>         myocyte46 - daemon did not report back when launched
>
>         myocyte47 - daemon did not report back when launched
>
>         myocyte49 - daemon did not report back when launched
>
>
>
> Thanks,
>
> Ashwin.
>
>
>
>
>
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Ralph Castain
> Sent: Wednesday, July 06, 2011 6:46 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error using hostfile
>
>
>
> Please see http://www.open-mpi.org/faq/?category=rsh#ssh-keys
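The recipe in that FAQ boils down to generating a key with an empty passphrase and installing it on each node. A minimal sketch assuming OpenSSH defaults (the demo writes the key to a temporary directory; for real use, target ~/.ssh/id_rsa):

```shell
# Generate an RSA key pair with an empty passphrase.
keydir=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$keydir/id_rsa" -q
# Then, per node (not executed here; hostnames from this thread):
#   ssh-copy-id -i "$keydir/id_rsa.pub" myocyte46
#   ssh -o BatchMode=yes myocyte46 true   # must succeed with no prompt
```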
>
>
>
>
>
> On Jul 6, 2011, at 5:09 PM, Mohan, Ashwin wrote:
>
>
> Hi,
>
>
>
> I use the following command (mpirun --prefix /usr/local/openmpi1.4.3 -np 4
> hello) to successfully execute a simple hello world command on a single
> node.  Each node has 4 slots.  Following the successful execution on one
> node, I wish to employ 4 nodes and for this purpose wrote a hostfile. I
> submitted my job using the following command:
>
>
>
> mpirun --prefix /usr/local/openmpi1.4.3 -np 4 --hostfile hostfile hello
>
>
>
> Copied below is the output. How do I go about fixing this issue?
>
>
>
> **********************************************************************
>
>
>
> amohan_at_myocyte48's password: amohan_at_myocyte47's password:
>
> Permission denied, please try again.
>
> amohan_at_myocyte48's password:
>
> Permission denied, please try again.
>
> amohan_at_myocyte47's password:
>
> Permission denied, please try again.
>
> amohan_at_myocyte47's password:
>
> Permission denied, please try again.
>
> amohan_at_myocyte48's password:
>
>
>
> Permission denied (publickey,gssapi-with-mic,password).
>
> --------------------------------------------------------------------------
>
> A daemon (pid 22085) died unexpectedly with status 255 while attempting
>
> to launch so we are aborting.
>
>
>
> There may be more information reported by the environment (see above).
>
>
>
> This may be because the daemon was unable to find all the needed shared
>
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>
> location of the shared libraries on the remote nodes and this will
>
> automatically be forwarded to the remote nodes.
>
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> mpirun noticed that the job aborted, but has no info as to the process
>
> that caused that situation.
>
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> mpirun was unable to cleanly terminate the daemons on the nodes shown
>
> below. Additional manual cleanup may be required - please refer to
>
> the "orte-clean" tool for assistance.
>
> --------------------------------------------------------------------------
>
>         myocyte47 - daemon did not report back when launched
>
>         myocyte48 - daemon did not report back when launched
>
>
>
> **********************************************************************
>
>
>
> Thanks,
>
> Ashwin.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>