
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi hangs when running on more than one node (unless i use --debug-daemons )
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-12-29 10:23:13


Both look perfectly right to me. The difference is only because your "success" one still has the ssh session active.

It looks to me like something is preventing communication when the ssh session is terminated, but I have no clue why.

Given the small cluster size, I would just add this to your default param file and not worry about it:

orte_leave_session_attached = 1
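For reference, a minimal sketch of adding that line to the per-user default MCA parameter file (`~/.openmpi/mca-params.conf` is the usual per-user location; adjust the path if your site uses a system-wide file instead):

```shell
# Create the per-user MCA parameter directory if it does not exist,
# then append the parameter suggested above.
mkdir -p "$HOME/.openmpi"
echo "orte_leave_session_attached = 1" >> "$HOME/.openmpi/mca-params.conf"
```

Open MPI reads this file at startup, so the setting applies to every subsequent mpirun without extra flags.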

On Dec 29, 2010, at 2:10 AM, Advanced Computing Group University of Padova wrote:

>
>
> On Wed, Dec 29, 2010 at 10:10 AM, Advanced Computing Group University of Padova <acg.unipd_at_[hidden]> wrote:
> Thank you Ralph,
> Your suspicion seems quite interesting :)
> I tried to run the same program from node 192.168.1/2.11, also using 192.168.2.12 and "tracing" the activity on .12.
> I attach the two files (_succ: with --debug-daemons, _fail: without --debug-daemons).
> I notice that the orted daemon on the second node is invoked differently.....
> Moreover, when I launch without --debug-daemons, a process called orted...... remains active on the second node even after I kill (Ctrl+C) the command on the first node.
>
> Can you continue to help me?
>
>
> On Tue, Dec 28, 2010 at 8:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> All --debug-daemons really does is keep the ssh session open after launching the remote daemon and turn on some output. Otherwise, we close that session as most systems only allow a limited number of concurrent ssh sessions to be open.
>
> I suspect you have a system setting that kills any running job upon ssh close. It would be best if you removed that restriction. If you cannot, then you can always run your MPI jobs with --no-daemonize. This will keep the ssh session open, but without all the debug output.
>
> That flag is just shorthand for an MCA param, so you can set it in your environment or put it in your default MCA param file.
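As a sketch of what setting the param in the environment looks like (the parameter name `orte_no_daemonize` is an assumption inferred from the `--no-daemonize` flag; Open MPI picks up MCA params from `OMPI_MCA_`-prefixed environment variables):

```shell
# Set the MCA parameter through the environment.
# Param name orte_no_daemonize is inferred from the --no-daemonize flag,
# so verify it with `ompi_info --param orte all` on your installation.
export OMPI_MCA_orte_no_daemonize=1

# Equivalently, pass it on the mpirun command line:
#   mpirun --mca orte_no_daemonize 1 -np 4 ./my_app
```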
>
>
> On Dec 28, 2010, at 3:31 AM, Advanced Computing Group University of Padova wrote:
>
>> Yes, I've tested them.
>> In fact, with the --debug-daemons switch everything works fine! (And I see that a process called orted... is started on the nodes whenever I launch a test application.)
>> I believe this is an environment-variables problem....
>>
>> On Mon, Dec 27, 2010 at 10:16 PM, David Zhang <solarbikedz_at_[hidden]> wrote:
>> have you tested your ssh key setup, fire wall, and switch settings to ensure all nodes are talking to each other?
>>
>> On Mon, Dec 27, 2010 at 1:07 AM, Advanced Computing Group University of Padova <acg.unipd_at_[hidden]> wrote:
>> using openmpi 1.4.2
>>
>>
>> On Fri, Dec 24, 2010 at 11:17 AM, Advanced Computing Group University of Padova <acg.unipd_at_[hidden]> wrote:
>> Hi,
>> I am building a small 16-node Gentoo-based cluster.
>> I successfully installed Open MPI and successfully ran some simple small parallel test programs on a single host, but...
>> I can't run a parallel program on more than one node.
>>
>>
>> The nodes are cloned (so they are identical).
>> The mpiuser account (and its ssh keys) uses /home/mpiuser, which is an NFS share.
>> I modified .bashrc:
>>
>> -------------------------
>> export PATH=/usr/bin:$PATH
>> export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
>>
>> # already present below
>> if [[ $- != *i* ]] ; then
>>     # Shell is non-interactive. Be done now!
>>     return
>> fi
>> -------------------------
>>
>> The very strange behaviour is that using --debug-daemons lets my program run successfully.....
>>
>> Thank you in advance, and sorry for my bad English.
>>
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> --
>> David Zhang
>> University of California, San Diego
>>
>
>
>
>
> <dump_succ.txt><dump_fail.txt>