Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi hangs when running on more than one node (unless i use --debug-daemons )
From: Advanced Computing Group University of Padova (acg.unipd_at_[hidden])
Date: 2010-12-29 04:10:34


On Wed, Dec 29, 2010 at 10:10 AM, Advanced Computing Group University of
Padova <acg.unipd_at_[hidden]> wrote:

> Thank you Ralph,
> Your suspicion seems quite interesting :)
> I tried running the same program from node 192.168.1/2.11, also using
> 192.168.2.12, while "tracing" the activity on .12.
> I attach the two files (_succ: using --debug-daemons, _fail: without
> --debug-daemons).
> I notice that the orted daemon on the second node is invoked in a different
> way.....
> Moreover, when I launch without --debug-daemons, a process called orted......
> remains active on the second node after I kill (ctrl+c) the command on the
> first node.
>
> Can you continue to help me?
>
>
> On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> All --debug-daemons really does is keep the ssh session open after
>> launching the remote daemon and turn on some output. Otherwise, we close
>> that session as most systems only allow a limited number of concurrent ssh
>> sessions to be open.
>>
>> I suspect you have a system setting that kills any running job upon ssh
>> close. It would be best if you removed that restriction. If you cannot, then
>> you can always run your MPI jobs with --no-daemonize. This will keep the ssh
>> session open, but without all the debug output.
>>
>> That flag is just shorthand for an MCA param, so you can set it in your
>> environ or put it in your default MCA param file.
>>
>>
>> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of Padova
>> wrote:
>>
>> Yes, I've tested them.
>> In fact, using the --debug-daemons switch everything works fine! (And I see
>> that on the nodes a process called orted... is started whenever I launch a
>> test application.)
>> I believe this is an environment variable problem....
>>
>> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang <solarbikedz_at_[hidden]> wrote:
>>
>>> Have you tested your ssh key setup, firewall, and switch settings to
>>> ensure all nodes are talking to each other?
>>>
>>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of
>>> Padova <acg.unipd_at_[hidden]> wrote:
>>>
>>>> I am using Open MPI 1.4.2.
>>>>
>>>>
>>>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University of
>>>> Padova <acg.unipd_at_[hidden]> wrote:
>>>>
>>>>> Hi,
>>>>> I am building a small 16-node Gentoo-based cluster.
>>>>> I successfully installed Open MPI and successfully ran some simple
>>>>> small parallel test programs on a single host, but...
>>>>> I can't run a parallel program on more than one node.
>>>>>
>>>>>
>>>>> The nodes are cloned (so they are identical).
>>>>> The mpiuser (and its ssh certificates) uses /home/mpiuser, which is an
>>>>> NFS share.
>>>>> I modified .bashrc:
>>>>>
>>>>> -------------------------
>>>>> PATH=/usr/bin:$PATH ; export PATH ;
>>>>> LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>
>>>>> # already present below
>>>>> if [[ $- != *i* ]] ; then
>>>>> # Shell is non-interactive. Be done now!
>>>>> return
>>>>> fi
>>>>> ---------------------
>>>>>
>>>>> The very strange behaviour is that using --debug-daemons lets
>>>>> my program run successfully.....
>>>>>
>>>>> Thank you in advance, and sorry for my bad English.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>
>>>
>>> --
>>> David Zhang
>>> University of California, San Diego
>>>
>>>
>>
>
>
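
A minimal sketch of Ralph's suggestion above: setting the flag as an MCA parameter instead of passing --no-daemonize on every mpirun command line. The OMPI_MCA_ environment prefix and the per-user file $HOME/.openmpi/mca-params.conf are standard Open MPI mechanisms. The parameter name orte_no_daemonize and the helper set_mca_default are assumptions made for illustration; the thread does not name the parameter, so check it on your installation (for example with: ompi_info --param all all | grep daemon) before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): record "name = value" in an
# MCA parameter file, creating the directory if needed and skipping the
# write when the parameter is already set there.
set_mca_default() {
    file=$1; name=$2; value=$3
    mkdir -p "$(dirname "$file")"
    grep -q "^${name}[[:space:]]*=" "$file" 2>/dev/null ||
        printf '%s = %s\n' "$name" "$value" >> "$file"
}

# ASSUMPTION: orte_no_daemonize is the parameter behind --no-daemonize.
# Verify the real name with ompi_info before using it.
PARAM=orte_no_daemonize

# Option 1: set it in the environment of the shell that runs mpirun.
export OMPI_MCA_orte_no_daemonize=1

# Option 2: make it the per-user default (uncomment to apply).
# set_mca_default "$HOME/.openmpi/mca-params.conf" "$PARAM" 1
```

Since /home/mpiuser is an NFS share in the setup described above, the per-user file would be visible on every node, which is why option 2 may be the more convenient of the two there.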