Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Program hangs when run in the remote host ...
From: souvik bhattacherjee (souvik99_at_[hidden])
Date: 2009-09-19 09:33:40


Hi Gus (and all Open MPI users),

Thanks for your interest in my problem. However, it seems to me that I had
already taken care of the points you raised in your earlier mails. I have
listed them below, point by point, with your comments quoted between
asterisks and my replies following each.

1) You mentioned: "*I would guess you installed OpenMPI only on ict1, not on
ict2*". However, as I had stated initially: "*I had installed openmpi-1.3.3
separately on two of my machines ict1 and ict2*".
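In case it is useful, here is a quick check (a sketch, assuming the same prefix on both hosts and working ssh) to confirm that the install really exists on each machine:

```shell
# Verify the Open MPI binaries and libraries exist on both hosts.
# The hostnames and prefix are the ones used in this thread.
for h in ict1 ict2; do
    echo "== $h =="
    ssh "$h" "ls /usr/local/openmpi-1.3.3/bin/mpirun \
                 /usr/local/openmpi-1.3.3/lib/libmpi.so.0"
done
```

Both hosts list the same files, so the binaries and libraries are in place on each side.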

2) Next you said: "*I am guessing this, because you used a prefix under
/usr/local*". However, I had installed it as follows:
*$ cd openmpi-1.3.3
$ mkdir build
$ cd build
$ ../configure --prefix=/usr/local/openmpi-1.3.3/
# make all install*

3) Next as you pointed out: "* ...not a typical name of an NFS mounted
directory. Using an NFS mounted directory is another way to make OpenMPI
visible to all nodes *".
Let me repeat that I am not going for an NFS-mounted installation, as the
first point in this list makes clear.

4) In your next mail: "*If you can ssh passwordless from ict1 to ict2 *and*
vice versa*". Again, as I had mentioned earlier: "*As a prerequisite, I can
ssh between them without a password or passphrase (I did not supply a
passphrase at all).*"
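To rule out any hidden prompt on the non-interactive path (the one mpirun actually uses), the check below forces ssh to fail rather than ask for a password; this is only a sketch of what I ran, in both directions:

```shell
# BatchMode=yes makes ssh error out instead of prompting, so a
# password/passphrase prompt shows up as a visible failure here.
ssh -o BatchMode=yes ict2 hostname   # run this on ict1
ssh -o BatchMode=yes ict1 hostname   # run this on ict2
```

Each command prints the remote hostname immediately, with no prompt in either direction.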

5) Further, you said: "*If your /etc/hosts file on *both* machines list ict1
and ict2 and their IP addresses*". Let me mention here that both files
already list both machines and their IP addresses.
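For reference, the entries look like the fragment below on *both* machines (the addresses here are made-up placeholders, not my real ones):

```
192.168.10.1   ict1
192.168.10.2   ict2
```

One pitfall I checked for: on some distributions /etc/hosts maps the machine's own hostname to 127.0.0.1, which can prevent the remote daemon from connecting back; neither of my files does that.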

6) Finally as you said: " *In case you have a /home directory on each
machine (i.e. /home is not NFS mounted) if your .bashrc files on *both*
machines set the PATH
and LD_LIBRARY_PATH to point to the OpenMPI directory. *"

Again, as I had mentioned previously: "*Also .bash_profile and .bashrc had
the following lines written into them:

PATH=$PATH:/usr/local/openmpi-1.3.3/bin/
LD_LIBRARY_PATH=/usr/local/openmpi-1.3.3/lib/*"
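One detail I am double-checking on my side: a plain assignment in .bashrc is not passed to child processes unless the variable is exported (PATH usually already is; LD_LIBRARY_PATH usually is not), so the daemon that mpirun starts over ssh on the remote node may not see it. A sketch with explicit exports:

```shell
# Export both variables so they reach non-interactive child shells,
# such as the orted daemon that mpirun launches over ssh.
export PATH=$PATH:/usr/local/openmpi-1.3.3/bin
export LD_LIBRARY_PATH=/usr/local/openmpi-1.3.3/lib
```

It is also worth making sure .bashrc is actually sourced for non-interactive ssh sessions, since some distributions' stock .bashrc returns early for non-interactive shells.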
As an additional bit of information (which might assist you in the
investigation), I am using *Mandriva 2009.1* on all of my systems.

I hope this helps. I am eagerly awaiting a response.

Thanks,

On 9/18/09, Gus Correa <gus_at_[hidden]> wrote:
>
> Hi Souvik
>
> Also worth checking:
>
> 1) If you can ssh passwordless from ict1 to ict2 *and* vice versa.
> 2) If your /etc/hosts file on *both* machines list ict1 and ict2
> and their IP addresses.
> 3) In case you have a /home directory on each machine (i.e. /home is
> not NFS mounted) if your .bashrc files on *both* machines set the PATH
> and LD_LIBRARY_PATH to point to the OpenMPI directory.
>
> Gus Correa
>
> Gus Correa wrote:
>
>> Hi Souvik
>>
>> I would guess you installed OpenMPI only on ict1, not on ict2.
>> If that is the case you won't have the required OpenMPI libraries
>> on ict2:/usr/local, and the job won't run on ict2.
>>
>> I am guessing this, because you used a prefix under /usr/local,
>> which tends to be a "per machine" directory,
>> not a typical name of an NFS
>> mounted directory.
>> Using an NFS mounted directory is another way to make
>> OpenMPI visible to all nodes.
>> See this FAQ:
>> http://www.open-mpi.org/faq/?category=building#where-to-install
>>
>> I hope this helps,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> souvik bhattacherjee wrote:
>>
>>> Dear all,
>>>
>>> I am quite new to Open MPI. Recently, I installed openmpi-1.3.3
>>> separately on two of my machines, ict1 and ict2. These machines are
>>> dual-socket quad-core (Intel Xeon E5410), i.e. each has 8 cores, and they
>>> are connected by a Gigabit Ethernet switch. As a prerequisite, I can ssh
>>> between them without a password or passphrase (I did not supply a
>>> passphrase at all). Thereafter,
>>>
>>> $ cd openmpi-1.3.3
>>> $ mkdir build
>>> $ cd build
>>> $ ../configure --prefix=/usr/local/openmpi-1.3.3/
>>>
>>> Then as a root user,
>>>
>>> # make all install
>>>
>>> Also .bash_profile and .bashrc had the following lines written into them:
>>>
>>> PATH=$PATH:/usr/local/openmpi-1.3.3/bin/
>>> LD_LIBRARY_PATH=/usr/local/openmpi-1.3.3/lib/
>>>
>>> ----------------------------------------------------------------------
>>>
>>>
>>>
>>> $ cd ../examples/
>>> $ make
>>> $ mpirun -np 2 --host ict1 hello_c
>>> hello_c: error while loading shared libraries: libmpi.so.0: cannot open
>>> shared object file: No such file or directory
>>> hello_c: error while loading shared libraries: libmpi.so.0: cannot open
>>> shared object file: No such file or directory
>>>
>>> $ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 2 --host ict1 hello_c
>>> Hello, world, I am 1 of 2
>>> Hello, world, I am 0 of 2
>>>
>>> But the program hangs when ....
>>>
>>> $ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 2 --host ict1,ict2
>>> hello_c
>>> This command does not produce any output. Running top on either machine
>>> does not show any hello_c process. However, when I press Ctrl+C, the
>>> following output appears:
>>>
>>> ^Cmpirun: killing job...
>>>
>>> --------------------------------------------------------------------------
>>>
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>>
>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>>
>>> ict2 - daemon did not report back when launched
>>>
>>> $
>>>
>>> The same thing repeats itself when hello_c is run from ict2. Since the
>>> program does not produce any error message, it is difficult to locate
>>> where I might have gone wrong.
>>>
>>> Did anyone of you encounter this problem or anything similar ? Any help
>>> would be much appreciated.
>>>
>>> Thanks,
>>>
>>> --
>>>
>>> Souvik
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>
>

-- 
Souvik