Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
From: Richard Bardwell (richard_at_[hidden])
Date: 2012-02-14 07:36:23


While trying to debug the MPI_Waitall hang on the remote node, I wrote a simple test code.

If we run the code below with 2 processes on a single local machine, we send the number 1 and receive the number 1 back.

If we run the same code across a local node and a remote node, we send the number 1 but get 32767 back. Any ideas?

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define PCPU 8
int rank, nproc;

void mpisend(int ok);
int mpirecv(void);

int main(int argc, char *argv[])
{
   int k, i1;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nproc);

   if (rank == 0) {
      /* Rank 0 sends the value 1 to every other rank. */
      i1 = 1;
      mpisend(i1);
   } else {
      /* Every other rank receives one integer from rank 0. */
      k = mpirecv();
      printf("R%d: recvd %d\n", rank, k);
   }
   MPI_Finalize();
   return 0;
}

/* Post a non-blocking send of 'ok' to every other rank.
   Note that the send requests are never completed before returning. */
void mpisend(int ok)
{
   int m;
   int tag = 201;
   MPI_Request request[PCPU];

   for (m = 1; m < nproc; m++) {
      printf("R%d->%d\n", rank, m);
      MPI_Isend(&ok, 1, MPI_INT, m, tag + m, MPI_COMM_WORLD, &request[m-1]);
   }
}

/* Post a non-blocking receive from rank 0 and wait for it to complete. */
int mpirecv(void)
{
   int hrecv;
   int tag = 201;
   MPI_Request request[PCPU];
   MPI_Status status[PCPU];

   MPI_Irecv(&hrecv, 1, MPI_INT, 0, tag + rank, MPI_COMM_WORLD, &request[rank-1]);
   MPI_Waitall(1, &request[rank-1], &status[rank-1]);
   return hrecv;
}
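As an aside, one possible issue with the code itself (a sketch only, not a confirmed cause of the 32767): the MPI standard requires every non-blocking request to be completed, e.g. with MPI_Wait or MPI_Waitall, before MPI_Finalize, but the Isend requests in mpisend() are never completed. A version of mpisend() that waits on its sends before returning would look roughly like this, assuming nproc-1 never exceeds PCPU, as the code above already does:

/* Sketch only: same send loop as in mpisend() above, but completing
   every MPI_Isend with MPI_Waitall before returning. */
void mpisend(int ok)
{
   int m;
   int tag = 201;
   MPI_Request request[PCPU];
   MPI_Status status[PCPU];

   for (m = 1; m < nproc; m++) {
      printf("R%d->%d\n", rank, m);
      MPI_Isend(&ok, 1, MPI_INT, m, tag + m, MPI_COMM_WORLD, &request[m-1]);
   }

   /* Block until all posted sends have completed; until then the
      send buffer 'ok' must stay valid. */
   MPI_Waitall(nproc - 1, request, status);
}

Either way, completing the requests keeps the example standard-conforming, even if the cross-node failure turns out to be an installation or environment problem.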

----- Original Message -----
From: "Jeff Squyres" <jsquyres_at_[hidden]>
To: "Open MPI Users" <users_at_[hidden]>
Sent: Tuesday, February 14, 2012 11:13 AM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes

> Make sure that your LD_LIBRARY_PATH is being set in your shell startup files for *non-interactive logins*.
>
> For example, ensure that LD_LIBRARY_PATH is set properly, even in this case:
>
> -----
> ssh some-other-node env | grep LD_LIBRARY_PATH
> -----
>
> (note that this is different from "ssh some-other-node echo $LD_LIBRARY_PATH", because the "$LD_LIBRARY_PATH" will be evaluated on
> the local node, even before ssh is invoked)
>
> I mention this because some shell startup files distinguish between interactive and non-interactive logins; they sometimes
> terminate early for non-interactive logins. Look for "exit" statements, or conditional blocks that are only invoked during
> interactive logins, for example.
>
>
>
> On Feb 14, 2012, at 5:40 AM, Richard Bardwell wrote:
>
>> Jeff,
>>
>> I wiped out all versions of openmpi on all the nodes including the distro installed version.
>> I reinstalled version 1.4.4 on all nodes.
>> I now get the error that libopen-rte.so.0 cannot be found when running mpiexec across
>> different nodes, even though the LD_LIBRARY_PATH for all nodes points to /usr/local/lib
>> where the file exists. Any ideas ?
>>
>> Many Thanks
>>
>> Richard
>>
>> ----- Original Message -----
>> From: "Jeff Squyres" <jsquyres_at_[hidden]>
>> To: "Open MPI Users" <users_at_[hidden]>
>> Sent: Monday, February 13, 2012 6:28 PM
>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>
>>
>>> You might want to fully uninstall the distro-installed version of Open MPI on all the nodes (e.g., Red Hat may have installed a
>>> different version of Open MPI, and that version is being found in your $PATH before your custom-installed version).
>>>
>>>
>>> On Feb 13, 2012, at 12:12 PM, Richard Bardwell wrote:
>>>
>>>> OK, 1.4.4 is happily installed on both machines. But, I now get a really
>>>> weird error when running on the 2 nodes. I get
>>>> Error: unknown option "--daemonize"
>>>> even though I am just running with -np 2 -hostfile test.hst
>>>>
>>>> The program runs fine on 2 cores if running locally on each node.
>>>>
>>>> Any ideas ??
>>>>
>>>> Thanks
>>>>
>>>> Richard
>>>> ----- Original Message -----
>>>> From: "Gustavo Correa" <gus_at_[hidden]>
>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>> Sent: Monday, February 13, 2012 4:22 PM
>>>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>>>
>>>>
>>>>> On Feb 13, 2012, at 11:02 AM, Richard Bardwell wrote:
>>>>>> Ralph
>>>>>> I had done a make clean in the 1.2.8 directory if that is what you meant ?
>>>>>> Or do I need to do something else ?
>>>>>> I appreciate your help on this by the way ;-)
>>>>> Hi Richard
>>>>> You can install in a different directory, totally separate from 1.2.8.
>>>>> Create a new work directory [which is not the final installation directory, just work, say /tmp/openmpi/1.4.4/work].
>>>>> Launch the OpenMPI 1.4.4 configure script from this new work directory with the --prefix pointing to your desired installation
>>>>> directory [e.g. /home/richard/openmpi/1.4.4/].
>>>>> I am assuming this is NFS mounted on the nodes [if you have a cluster].
>>>>> [Check all options with 'configure --help'.]
>>>>> Then do make, make install.
>>>>> Finally set your PATH and LD_LIBRARY_PATH to point to the new installation directory,
>>>>> to prevent mixing with the old 1.2.8.
>>>>> I have a number of OpenMPI versions here, compiled with various compilers,
>>>>> and they coexist well this way.
>>>>> I hope this helps,
>>>>> Gus Correa
>>>>>> ----- Original Message -----
>>>>>> From: Ralph Castain
>>>>>> To: Open MPI Users
>>>>>> Sent: Monday, February 13, 2012 3:41 PM
>>>>>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>>>>> You need to clean out the old attempt - that is a stale file
>>>>>> Sent from my iPad
>>>>>> On Feb 13, 2012, at 7:36 AM, "Richard Bardwell" <richard_at_[hidden]> wrote:
>>>>>>> OK, I installed 1.4.4, rebuilt the exec and guess what ...... I now get some weird errors as below:
>>>>>>> mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ras_dash_host
>>>>>>> along with a few other files
>>>>>>> even though the .so / .la files are all there !
>>>>>>> ----- Original Message -----
>>>>>>> From: Ralph Castain
>>>>>>> To: Open MPI Users
>>>>>>> Sent: Monday, February 13, 2012 2:59 PM
>>>>>>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>>>>>> Good heavens - where did you find something that old? Can you use a more recent version?
>>>>>>> Sent from my iPad
>>>>>>>
>>>>>>>> Gentlemen
>>>>>>>> I am struggling to get MPI working when the hostfile contains different nodes.
>>>>>>>> I get the error below. Any ideas ?? I can ssh without password between the two
>>>>>>>> nodes. I am running 1.2.8 MPI on both machines.
>>>>>>>> Any help most appreciated !!!!!
>>>>>>>> MPITEST/v8_mpi_test> mpiexec -n 2 --debug-daemons -hostfile test.hst /home/sharc/MPITEST/v8_mpi_test/mpitest
>>>>>>>> Daemon [0,0,1] checking in as pid 10490 on host 192.0.2.67
>>>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>> Open MPI developer):
>>>>>>>> orte_rml_base_select failed
>>>>>>>> --> Returned value -13 instead of ORTE_SUCCESS
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
>>>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
>>>>>>>> Open RTE was unable to initialize properly. The error occured while
>>>>>>>> attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
>>>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received kill_local_procs
>>>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1158
>>>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>> [linux-tmpw:10489] ERROR: A daemon on node 192.0.2.68 failed to start as expected.
>>>>>>>> [linux-tmpw:10489] ERROR: There may be more information available from
>>>>>>>> [linux-tmpw:10489] ERROR: the remote shell (see above).
>>>>>>>> [linux-tmpw:10489] ERROR: The daemon exited unexpectedly with status 243.
>>>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received exit
>>>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1190
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpiexec was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>> --------------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>