Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi hangs when running on more than one node (unless i use --debug-daemons )
From: Advanced Computing Group University of Padova (acg.unipd_at_[hidden])
Date: 2010-12-29 04:10:07


Thank you, Ralph.
Your suspicion seems quite interesting :)
I tried running the same program from node 192.168.1/2.11, also using
192.168.2.12, while "tracing" the activity on .12.
I attach the two files (_succ: with --debug-daemons, _fail: without
--debug-daemons).
I notice that the orted daemon on the second node is invoked in a different
way.....
Moreover, when I launch without --debug-daemons, a process called orted......
remains active on the second node after I kill (Ctrl+C) the command on the
first node.
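
To check for such a leftover daemon on the second node, something like the following could be run there. This is only a sketch: list_orted is a helper name made up for this example, and matching on the word "orted" in the command line is an assumption about how the daemon shows up in ps.

```shell
# Hypothetical helper: list orted processes owned by the current user.
list_orted() {
    # [o]rted keeps grep from matching its own command line.
    ps -u "$(id -un)" -o pid=,args= | grep -w '[o]rted' \
        || echo "no orted processes found"
}

list_orted
# A stray daemon can then be stopped by PID with: kill <pid>
```

Running it once after a failed launch and once after a successful one shows whether the daemon only lingers in the failing case.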

Can you continue to help me?

On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> All --debug-daemons really does is keep the ssh session open after
> launching the remote daemon and turn on some output. Otherwise, we close
> that session as most systems only allow a limited number of concurrent ssh
> sessions to be open.
>
> I suspect you have a system setting that kills any running job upon ssh
> close. It would be best if you removed that restriction. If you cannot, then
> you can always run your MPI jobs with --no-daemonize. This will keep the ssh
> session open, but without all the debug output.
>
> That flag is just shorthand for an MCA param, so you can set it in your
> environ or put it in your default MCA param file.
>
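As a sketch of the MCA-parameter route described above: this assumes the parameter behind --no-daemonize is named orte_no_daemonize (worth confirming with ompi_info on your install) and that the per-user parameter file lives at $HOME/.openmpi/mca-params.conf. MCA_PARAMS_FILE is just a variable local to this snippet, not an Open MPI setting.

```shell
# Assumption: --no-daemonize maps to the MCA parameter orte_no_daemonize.
# Confirm the name on your install, e.g.: ompi_info --param all all | grep daemonize

# Option 1: set it in the environment (MCA params are read from OMPI_MCA_<name>).
export OMPI_MCA_orte_no_daemonize=1

# Option 2: put it in the per-user default MCA param file.
# MCA_PARAMS_FILE is a snippet-local variable, not something Open MPI reads.
MCA_PARAMS_FILE="${MCA_PARAMS_FILE:-$HOME/.openmpi/mca-params.conf}"
mkdir -p "$(dirname "$MCA_PARAMS_FILE")"
# Append the setting only if it is not already present.
grep -qs '^orte_no_daemonize' "$MCA_PARAMS_FILE" \
    || echo 'orte_no_daemonize = 1' >> "$MCA_PARAMS_FILE"
```

With either option in place, mpirun should behave as if --no-daemonize had been given on the command line.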
>
> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of Padova
> wrote:
>
> Yes, I've tested them.
> In fact, with the --debug-daemons switch everything works fine! (And I see
> that on the nodes a process called orted... is started whenever I launch a
> test application.)
> I believe this is an environment variable problem....
>
> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang <solarbikedz_at_[hidden]> wrote:
>
>> Have you tested your ssh key setup, firewall, and switch settings to
>> ensure all nodes are talking to each other?
>>
>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of
>> Padova <acg.unipd_at_[hidden]> wrote:
>>
>>> using openmpi 1.4.2
>>>
>>>
>>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University of
>>> Padova <acg.unipd_at_[hidden]> wrote:
>>>
>>>> Hi,
>>>> I am building a small 16-node, Gentoo-based cluster.
>>>> I successfully installed Open MPI and successfully ran some simple,
>>>> small parallel test programs on a single host, but...
>>>> I can't run a parallel program on more than one node.
>>>>
>>>>
>>>> The nodes are cloned (so they are identical).
>>>> The mpiuser account (and its ssh certificates) uses /home/mpiuser, which
>>>> is an NFS share.
>>>> I modified .bashrc:
>>>>
>>>> -------------------------
>>>> export PATH=/usr/bin:$PATH
>>>> export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
>>>>
>>>> # already present below
>>>> if [[ $- != *i* ]] ; then
>>>>     # Shell is non-interactive. Be done now!
>>>>     return
>>>> fi
>>>> -------------------------
>>>>
>>>> The very strange behaviour is that using --debug-daemons lets my
>>>> program run successfully.....
>>>>
>>>> Thank you in advance, and sorry for my bad English.
>>>>
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>>
>> --
>> David Zhang
>> University of California, San Diego
>>
>