Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi problems/many cpus per node
From: Daniel Davidson (danield_at_[hidden])
Date: 2012-12-17 11:28:55


The 1.7 release candidate appears to have the same problem: no matter
how many processes I request, I get the same result with the new version.

[root_at_compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:69417] mca:base:select:( odls) Querying component
[default]
[compute-2-1.local:69417] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-1.local:69417] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Querying component
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-0.local:24486] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
WILDCARD

However, from the head node:

[root_at_biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun
-host compute-2-0,compute-2-1 -v -np 50 hostname

This displays 25 hostnames from each system.
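
For anyone who wants to confirm the per-node counts without counting lines
by eye, the output can be tallied with a short script. This is only a
minimal sketch and not part of Open MPI; the function name tally_hostnames
and the sample input are made up for illustration, and it assumes every
hostname is printed on its own line with no spaces in it.

```python
from collections import Counter


def tally_hostnames(output: str) -> Counter:
    """Count how often each hostname appears in `mpirun ... hostname` output."""
    counts = Counter()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and verbose daemon/debug lines, which start with '['.
        if not line or line.startswith("["):
            continue
        # Skip anything containing whitespace (e.g. "Killed by signal 2.");
        # assumption: a hostname is a single token.
        if len(line.split()) != 1:
            continue
        counts[line] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only, not captured from the cluster.
    sample = "\n".join(["compute-2-0.local"] * 25 + ["compute-2-1.local"] * 25)
    for host, n in sorted(tally_hostnames(sample).items()):
        print(f"{host}: {n}")
```

In practice the mpirun output would be saved to a file or piped in and passed
to the function as one string.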

Thank you again for the help so far,

Dan

On 12/17/2012 08:31 AM, Daniel Davidson wrote:
> I will give this a try, but wouldn't that be an issue as well if the
> process was run on the head node or another node? So long as the mpi
> job is not started on either of these two nodes, it works fine.
>
> Dan
>
> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>> It must be making contact or ORTE wouldn't be attempting to launch
>> your application's procs. Looks more like it never received the
>> launch command. Looking at the code, I suspect you're getting caught
>> in a race condition that causes the message to get "stuck".
>>
>> Just to see if that's the case, you might try running this with the
>> 1.7 release candidate, or even the developer's nightly build. Both
>> use a different timing mechanism intended to resolve such situations.
>>
>>
>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <danield_at_[hidden]>
>> wrote:
>>
>>> Thank you for the help so far. Here is the information that the
>>> debugging gives me. Looks like the daemon on the non-local node
>>> never makes contact. If I step NP back two though, it does.
>>>
>>> Dan
>>>
>>> [root_at_compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>> compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca
>>> odls_base_verbose 5 hostname
>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component
>>> [default]
>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component
>>> [default] set priority to 1
>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component
>>> [default]
>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component
>>> [default]
>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component
>>> [default] set priority to 1
>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component
>>> [default]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info
>>> updating nidmap
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
>>> unpacking data to launch job [49524,1]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
>>> adding new jobdat for job [49524,1]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
>>> unpacking 1 app_contexts
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],0] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],1] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],1] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local
>>> list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],2] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],3] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],3] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local
>>> list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],4] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],5] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],5] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local
>>> list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],6] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],7] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],7] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local
>>> list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],8] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],9] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],9] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local
>>> list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],10] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],11] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],11] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],12] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],13] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],13] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],14] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],15] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],15] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],16] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],17] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],17] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],18] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],19] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],19] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],20] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],21] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],21] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],22] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],23] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],23] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],24] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],25] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],25] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],26] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],27] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],27] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],28] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],29] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],29] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],30] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],31] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],31] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],32] on daemon 1
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - checking proc [[49524,1],33] on daemon 0
>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>> - found proc [[49524,1],33] for me!
>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my
>>> local list
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384
>>> processors for 17 children and locally set oversubscribed to false
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],1]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],3]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],5]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],7]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],9]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],11]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],13]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],15]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],17]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],19]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],21]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],23]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],25]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],27]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],29]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],31]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child
>>> [[49524,1],33]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job
>>> [49524,1] launch status
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch
>>> report to myself
>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44857 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44858 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44859 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44860 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44861 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44862 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44863 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44865 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44866 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44867 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44869 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44870 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44871 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44872 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44873 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44874 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
>>> process 44875 terminated
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/33/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],33] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/31/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],31] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/29/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],29] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/27/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],27] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/25/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],25] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/23/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],23] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/21/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],21] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/19/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],19] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/17/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],17] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/15/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],15] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/13/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],13] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/11/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],11] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/9/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],9] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/7/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],7] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/5/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],5] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/3/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],3] terminated normally
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
>>> abort file
>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/1/abort
>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
>>> process [[49524,1],1] terminated normally
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],25]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],15]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],11]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],13]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],19]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],9]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],17]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],31]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],7]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],21]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],5]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],33]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],23]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],3]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],29]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],27]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
>>> child [[49524,1],1]
>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting
>>> all procs in [49524,1] terminated
>>> ^Cmpirun: killing job...
>>>
>>> Killed by signal 2.
>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working
>>> on WILDCARD
>>>
>>>
>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>> Sorry - I forgot that you built from a tarball, and so debug isn't
>>>> enabled by default. You need to configure with --enable-debug.
>>>>
>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <danield_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Oddly enough, adding this debugging info lowered the number of
>>>>> processes that can be used from 46 to 42. When I run the MPI job,
>>>>> it fails, giving only the information that follows:
>>>>>
>>>>> [root_at_compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>> compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
>>>>> odls_base_verbose 5 hostname
>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying
>>>>> component [default]
>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of
>>>>> component [default] set priority to 1
>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected
>>>>> component [default]
>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying
>>>>> component [default]
>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of
>>>>> component [default] set priority to 1
>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected
>>>>> component [default]
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>> compute-2-1.local
>>>>>
>>>>>
>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to
>>>>>> each node (to start the local daemon). The only difference is the
>>>>>> number of fork/exec's being done on each node, and the number of
>>>>>> file descriptors being opened to support those fork/exec's.
>>>>>>
>>>>>> It certainly looks like your limits are high enough. When you say
>>>>>> it "fails", what do you mean - what error does it report? Try
>>>>>> adding:
>>>>>>
>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>
>>>>>> to your cmd line - this will report all the local proc launch
>>>>>> debug and hopefully show you a more detailed error report.
>>>>>>
>>>>>>
>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson
>>>>>> <danield_at_[hidden]> wrote:
>>>>>>
>>>>>>> I have had to cobble together two machines in our Rocks cluster
>>>>>>> without using the standard installation. They have EFI-only BIOS
>>>>>>> on them and Rocks doesn't like that, so it is the only workaround.
>>>>>>>
>>>>>>> Everything works great now, except for one thing. MPI jobs
>>>>>>> (openmpi or mpich) fail when started from one of these nodes
>>>>>>> (via qsub or by logging in and running the command) if 24 or
>>>>>>> more processors are needed on another system. However, if the
>>>>>>> originator of the MPI job is the headnode or any of the
>>>>>>> preexisting compute nodes, it works fine. Right now I am
>>>>>>> guessing ssh client or ulimit problems, but I cannot find any
>>>>>>> difference. Any help would be greatly appreciated.
>>>>>>>
>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>
>>>>>>> Examples:
>>>>>>>
>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>> [root_at_compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>
>>>>>>> This does not work; it prints 24 hostnames for compute-2-1:
>>>>>>> [root_at_compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>
>>>>>>> These both work, print 64 hostnames from each node:
>>>>>>> [root_at_biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>> [root_at_compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>
>>>>>>> [root_at_compute-2-1 ~]# ulimit -a
>>>>>>> core file size          (blocks, -c) 0
>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>> scheduling priority             (-e) 0
>>>>>>> file size               (blocks, -f) unlimited
>>>>>>> pending signals                 (-i) 16410016
>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>> open files                      (-n) 4096
>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>> real-time priority              (-r) 0
>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>> max user processes              (-u) 1024
>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>> file locks                      (-x) unlimited
>>>>>>>
>>>>>>> [root_at_compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>> Host *
>>>>>>>     CheckHostIP no
>>>>>>>     ForwardX11 yes
>>>>>>>     ForwardAgent yes
>>>>>>>     StrictHostKeyChecking no
>>>>>>>     UsePrivilegedPort no
>>>>>>>     Protocol 2,1
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>