Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi problems/many cpus per node
From: Daniel Davidson (danield_at_[hidden])
Date: 2012-12-19 17:01:19


I figured this out.

ssh was working, but scp was not, due to an MTU mismatch between the
systems. Adding MTU=1500 to my
/etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem.
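
For anyone who hits the same issue, here is a minimal sketch of what the
interface file looks like after the change. Only the MTU line is the
actual fix; the DEVICE/BOOTPROTO/ONBOOT lines are illustrative and will
differ on other systems:

  # /etc/sysconfig/network-scripts/ifcfg-eth2  (illustrative sketch)
  DEVICE=eth2
  BOOTPROTO=static
  ONBOOT=yes
  # force the standard 1500-byte MTU so it matches the other nodes
  MTU=1500

After editing the file, restart the interface (ifdown eth2 && ifup eth2,
or service network restart) and check the value with ifconfig eth2. A
quick way to spot a path-MTU mismatch between two hosts is
ping -M do -s 1472 compute-2-0, which fails if a full 1500-byte frame
(1472 bytes of data plus 28 bytes of ICMP/IP headers) cannot pass
unfragmented.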

Dan

On 12/17/2012 04:12 PM, Daniel Davidson wrote:
> Yes, it does.
>
> Dan
>
> [root_at_compute-2-1 ~]# ssh compute-2-0
> Warning: untrusted X11 forwarding setup failed: xauth key data not
> generated
> Warning: No xauth data; using fake authentication data for X11
> forwarding.
> Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
> [root_at_compute-2-0 ~]# ssh compute-2-1
> Warning: untrusted X11 forwarding setup failed: xauth key data not
> generated
> Warning: No xauth data; using fake authentication data for X11
> forwarding.
> Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
> [root_at_compute-2-1 ~]#
>
>
>
> On 12/17/2012 03:39 PM, Doug Reeder wrote:
>> Daniel,
>>
>> Does passwordless ssh work? You need to make sure that it does.
>>
>> Doug
>> On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:
>>
>>> I would also add that scp seems to be creating the file in the /tmp
>>> directory of compute-2-0, and that /var/log/secure is showing ssh
>>> connections being accepted. Is there anything in ssh that can limit
>>> connections that I need to look out for? My guess is that it is
>>> part of the client prefs and not the server prefs since I can
>>> initiate the mpi command from another machine and it works fine,
>>> even when it uses compute-2-0 and 1.
>>>
>>> Dan
>>>
>>>
>>> [root_at_compute-2-1 /]# date
>>> Mon Dec 17 15:11:50 CST 2012
>>> [root_at_compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
>>> compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca
>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>> [compute-2-1.local:70237] mca:base:select:( plm) Querying component
>>> [rsh]
>>> [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on
>>> agent ssh : rsh path NULL
>>>
>>> [root_at_compute-2-0 tmp]# ls -ltr
>>> total 24
>>> -rw-------. 1 root root 0 Nov 28 08:42 yum.log
>>> -rw-------. 1 root root 5962 Nov 29 10:50
>>> yum_save_tx-2012-11-29-10-50SRba9s.yumtx
>>> drwx------. 3 danield danield 4096 Dec 12 14:56
>>> openmpi-sessions-danield_at_compute-2-0_0
>>> drwx------. 3 root root 4096 Dec 13 15:38
>>> openmpi-sessions-root_at_compute-2-0_0
>>> drwx------ 18 danield danield 4096 Dec 14 09:48
>>> openmpi-sessions-danield_at_compute-2-0.local_0
>>> drwx------ 44 root root 4096 Dec 17 15:14
>>> openmpi-sessions-root_at_compute-2-0.local_0
>>>
>>> [root_at_compute-2-0 tmp]# tail -10 /var/log/secure
>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root
>>> from 10.1.255.226 port 49483 ssh2
>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session):
>>> session opened for user root by (uid=0)
>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from
>>> 10.1.255.226: 11: disconnected by user
>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session):
>>> session closed for user root
>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root
>>> from 10.1.255.226 port 49484 ssh2
>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session):
>>> session opened for user root by (uid=0)
>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from
>>> 10.1.255.226: 11: disconnected by user
>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session):
>>> session closed for user root
>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root
>>> from 10.1.255.226 port 49485 ssh2
>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session):
>>> session opened for user root by (uid=0)
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 12/17/2012 11:16 AM, Daniel Davidson wrote:
>>>> After a very long time (15 minutes or so) I finally received the
>>>> following, in addition to what I just sent earlier:
>>>>
>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc
>>>> working on WILDCARD
>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc
>>>> working on WILDCARD
>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc
>>>> working on WILDCARD
>>>> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending
>>>> orted_exit commands
>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc
>>>> working on WILDCARD
>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc
>>>> working on WILDCARD
>>>>
>>>> Firewalls are down:
>>>>
>>>> [root_at_compute-2-1 /]# iptables -L
>>>> Chain INPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain FORWARD (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>> [root_at_compute-2-0 ~]# iptables -L
>>>> Chain INPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain FORWARD (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> On 12/17/2012 11:09 AM, Ralph Castain wrote:
>>>>> Hmmm...and that is ALL the output? If so, then it never succeeded
>>>>> in sending a message back, which leads one to suspect some kind of
>>>>> firewall in the way.
>>>>>
>>>>> Looking at the ssh line, we are going to attempt to send a message
>>>>> from node 2-0 to node 2-1 on the 10.1.255.226 address. Is that
>>>>> going to work? Anything preventing it?
>>>>>
>>>>>
>>>>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson
>>>>> <danield_at_[hidden]> wrote:
>>>>>
>>>>>> These nodes have not been locked down in a way that would prevent
>>>>>> jobs from being launched from the backend, at least not on
>>>>>> purpose. The added logging returns the information below:
>>>>>>
>>>>>> [root_at_compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
>>>>>> compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca
>>>>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying
>>>>>> component [rsh]
>>>>>> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on
>>>>>> agent ssh : rsh path NULL
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Query of
>>>>>> component [rsh] set priority to 10
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying
>>>>>> component [slurm]
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Skipping
>>>>>> component [slurm]. Query failed to return a module
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying
>>>>>> component [tm]
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Skipping
>>>>>> component [tm]. Query failed to return a module
>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Selected
>>>>>> component [rsh]
>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias
>>>>>> 69655 nodename hash 3634869988
>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent
>>>>>> ssh : rsh path NULL
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Querying
>>>>>> component [default]
>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Query of
>>>>>> component [default] set priority to 1
>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Selected
>>>>>> component [default]
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
>>>>>> creating map
>>>>>> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working
>>>>>> unmanaged allocation
>>>>>> [compute-2-1.local:69655] [[32341,0],0] using dash_host
>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
>>>>>> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node
>>>>>> compute-2-1.local
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new
>>>>>> daemon [[32341,0],1]
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
>>>>>> assigning new daemon [[32341,0],1] to node compute-2-0
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0
>>>>>> (bash)
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same
>>>>>> remote shell as local shell
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0
>>>>>> (bash)
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template
>>>>>> argv:
>>>>>> /usr/bin/ssh <template>
>>>>>> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ;
>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ;
>>>>>> export LD_LIBRARY_PATH ;
>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ;
>>>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted
>>>>>> -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid
>>>>>> <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri
>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314"
>>>>>> -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca
>>>>>> odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca
>>>>>> orte_leave_session_attached 1
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0
>>>>>> not a child of mine
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node
>>>>>> compute-2-0 to launch list
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating
>>>>>> launch event
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch
>>>>>> of daemon [[32341,0],1]
>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing:
>>>>>> (//usr/bin/ssh) [/usr/bin/ssh compute-2-0
>>>>>> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ;
>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ;
>>>>>> export LD_LIBRARY_PATH ;
>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ;
>>>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted
>>>>>> -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1
>>>>>> -mca orte_ess_num_procs 2 -mca orte_hnp_uri
>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314"
>>>>>> -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca
>>>>>> odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca
>>>>>> orte_leave_session_attached 1]
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data
>>>>>> not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11
>>>>>> forwarding.
>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Querying
>>>>>> component [rsh]
>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent
>>>>>> ssh : rsh path NULL
>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Query of
>>>>>> component [rsh] set priority to 10
>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Selected
>>>>>> component [rsh]
>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Querying
>>>>>> component [default]
>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Query of
>>>>>> component [default] set priority to 1
>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Selected
>>>>>> component [default]
>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent
>>>>>> ssh : rsh path NULL
>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/17/2012 10:37 AM, Ralph Castain wrote:
>>>>>>> ?? That was all the output? If so, then something is indeed
>>>>>>> quite wrong as it didn't even attempt to launch the job.
>>>>>>>
>>>>>>> Try adding -mca plm_base_verbose 5 to the cmd line.
>>>>>>>
>>>>>>> I was assuming you were using ssh as the launcher, but I wonder
>>>>>>> if you are in some managed environment? If so, then it could be
>>>>>>> that launch from a backend node isn't allowed (e.g., on
>>>>>>> gridengine).
>>>>>>>
>>>>>>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson
>>>>>>> <danield_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> This looks to be having issues as well, and I cannot get any
>>>>>>>> number of processors to give me a different result with the new
>>>>>>>> version.
>>>>>>>>
>>>>>>>> [root_at_compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun
>>>>>>>> -host compute-2-0,compute-2-1 -v -np 50
>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 hostname
>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying
>>>>>>>> component [default]
>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of
>>>>>>>> component [default] set priority to 1
>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected
>>>>>>>> component [default]
>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying
>>>>>>>> component [default]
>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of
>>>>>>>> component [default] set priority to 1
>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected
>>>>>>>> component [default]
>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
>>>>>>>> working on WILDCARD
>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
>>>>>>>> working on WILDCARD
>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
>>>>>>>> working on WILDCARD
>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc
>>>>>>>> working on WILDCARD
>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc
>>>>>>>> working on WILDCARD
>>>>>>>>
>>>>>>>> However from the head node:
>>>>>>>>
>>>>>>>> [root_at_biocluster openmpi-1.7rc5]#
>>>>>>>> /home/apps/openmpi-1.7rc5/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -v -np 50 hostname
>>>>>>>>
>>>>>>>> Displays 25 hostnames from each system.
>>>>>>>>
>>>>>>>> Thank you again for the help so far,
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>>>>>>>>> I will give this a try, but wouldn't that be an issue as well
>>>>>>>>> if the process was run on the head node or another node? So
>>>>>>>>> long as the MPI job is not started on either of these two
>>>>>>>>> nodes, it works fine.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>>>>>>>>> It must be making contact or ORTE wouldn't be attempting to
>>>>>>>>>> launch your application's procs. Looks more like it never
>>>>>>>>>> received the launch command. Looking at the code, I suspect
>>>>>>>>>> you're getting caught in a race condition that causes the
>>>>>>>>>> message to get "stuck".
>>>>>>>>>>
>>>>>>>>>> Just to see if that's the case, you might try running this
>>>>>>>>>> with the 1.7 release candidate, or even the developer's
>>>>>>>>>> nightly build. Both use a different timing mechanism intended
>>>>>>>>>> to resolve such situations.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson
>>>>>>>>>> <danield_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you for the help so far. Here is the information that
>>>>>>>>>>> the debugging gives me. Looks like the daemon on the
>>>>>>>>>>> non-local node never makes contact. If I step np back by two,
>>>>>>>>>>> though, it does.
>>>>>>>>>>>
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>> [root_at_compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun
>>>>>>>>>>> -host compute-2-0,compute-2-1 -v -np 34
>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 hostname
>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of
>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of
>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:update:daemon:info updating nidmap
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:construct_child_list unpacking data to launch job
>>>>>>>>>>> [49524,1]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:construct_child_list adding new jobdat for job [49524,1]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:construct_child_list unpacking 1 app_contexts
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],0] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],1] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],1] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],2] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],3] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],3] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],4] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],5] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],5] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],6] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],7] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],7] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],8] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],9] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],9] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],10] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],11] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],11] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],12] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],13] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],13] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],14] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],15] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],15] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],16] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],17] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],17] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],18] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],19] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],19] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],20] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],21] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],21] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],22] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],23] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],23] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],24] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],25] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],25] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],26] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],27] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],27] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],28] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],29] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],29] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],30] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],31] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],31] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],32] on daemon 1
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - checking proc [[49524,1],33] on daemon 0
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing
>>>>>>>>>>> child list - found proc [[49524,1],33] for me!
>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to
>>>>>>>>>>> my local list
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found
>>>>>>>>>>> 384 processors for 17 children and locally set
>>>>>>>>>>> oversubscribed to false
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],1]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],3]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],5]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],7]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],9]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],11]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],13]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],15]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],17]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],19]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],21]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],23]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],25]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],27]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],29]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],31]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working
>>>>>>>>>>> child [[49524,1],33]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch
>>>>>>>>>>> reporting job [49524,1] launch status
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging
>>>>>>>>>>> launch report to myself
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting
>>>>>>>>>>> waitpids
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44857 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44858 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44859 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44860 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44861 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44862 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44863 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44865 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44866 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44867 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44869 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44870 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44871 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44872 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44873 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44874 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
>>>>>>>>>>> child process 44875 terminated
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/33/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],33] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/31/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],31] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/29/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],29] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/27/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],27] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/25/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],25] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/23/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],23] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/21/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],21] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/19/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],19] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/17/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],17] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/15/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],15] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/13/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],13] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/11/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],11] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/9/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],9] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/7/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],7] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/5/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],5] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/3/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],3] terminated normally
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> checking abort file
>>>>>>>>>>> /tmp/openmpi-sessions-root_at_compute-2-1.local_0/3245604865/1/abort
>>>>>>>>>>>
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
>>>>>>>>>>> child process [[49524,1],1] terminated normally
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],25]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],15]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],11]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],13]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],19]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],9]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],17]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],31]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],7]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],21]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],5]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],33]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],23]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],3]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],29]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],27]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0]
>>>>>>>>>>> odls:notify_iof_complete for child [[49524,1],1]
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete
>>>>>>>>>>> reporting all procs in [49524,1] terminated
>>>>>>>>>>> ^Cmpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>> Killed by signal 2.
>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc
>>>>>>>>>>> working on WILDCARD
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>>>>>>>>> Sorry - I forgot that you built from a tarball, and so
>>>>>>>>>>>> debug isn't enabled by default. You need to configure
>>>>>>>>>>>> --enable-debug.
>>>>>>>>>>>>
>>>>>>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson
>>>>>>>>>>>> <danield_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oddly enough, adding this debugging info lowered the
>>>>>>>>>>>>> number of processes that can be used from 46 down to 42.
>>>>>>>>>>>>> When I run the MPI job, it fails, giving only the
>>>>>>>>>>>>> information that follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root_at_compute-2-1 ssh]#
>>>>>>>>>>>>> /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>>>>> compute-2-0,compute-2-1 -v -np 44
>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 hostname
>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying
>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of
>>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected
>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying
>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of
>>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected
>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being
>>>>>>>>>>>>>> done to each node (to start the local daemon). The only
>>>>>>>>>>>>>> difference is the number of fork/exec's being done on
>>>>>>>>>>>>>> each node, and the number of file descriptors being
>>>>>>>>>>>>>> opened to support those fork/exec's.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It certainly looks like your limits are high enough. When
>>>>>>>>>>>>>> you say it "fails", what do you mean - what error does it
>>>>>>>>>>>>>> report? Try adding:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> to your cmd line - this will report all the local proc
>>>>>>>>>>>>>> launch debug and hopefully show you a more detailed error
>>>>>>>>>>>>>> report.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson
>>>>>>>>>>>>>> <danield_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have had to cobble together two machines in our Rocks
>>>>>>>>>>>>>>> cluster without using the standard installation; they
>>>>>>>>>>>>>>> have EFI-only BIOS on them and Rocks doesn't like that,
>>>>>>>>>>>>>>> so this is the only workaround.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Everything works great now, except for one thing. MPI
>>>>>>>>>>>>>>> jobs (Open MPI or MPICH) fail when started from one of
>>>>>>>>>>>>>>> these nodes (via qsub or by logging in and running the
>>>>>>>>>>>>>>> command) if 24 or more processors are needed on another
>>>>>>>>>>>>>>> system. However, if the originator of the MPI job is the
>>>>>>>>>>>>>>> head node or any of the preexisting compute nodes, it
>>>>>>>>>>>>>>> works fine. Right now I am guessing ssh client or ulimit
>>>>>>>>>>>>>>> problems, but I cannot find any difference. Any help
>>>>>>>>>>>>>>> would be greatly appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Examples:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>>>>>>>>> [root_at_compute-2-1 ~]#
>>>>>>>>>>>>>>> /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This does not work; it prints only 24 hostnames, all from compute-2-1:
>>>>>>>>>>>>>>> [root_at_compute-2-1 ~]#
>>>>>>>>>>>>>>> /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These both work, print 64 hostnames from each node
>>>>>>>>>>>>>>> [root_at_biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun
>>>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>> [root_at_compute-0-2 ~]#
>>>>>>>>>>>>>>> /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [root_at_compute-2-1 ~]# ulimit -a
>>>>>>>>>>>>>>> core file size (blocks, -c) 0
>>>>>>>>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>>>>>>>>> scheduling priority (-e) 0
>>>>>>>>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>>>>>>>>> pending signals (-i) 16410016
>>>>>>>>>>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>>>>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>>>>>>>>> open files (-n) 4096
>>>>>>>>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>>>>>>>>> real-time priority (-r) 0
>>>>>>>>>>>>>>> stack size (kbytes, -s) unlimited
>>>>>>>>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>>>>>>>>> max user processes (-u) 1024
>>>>>>>>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>>>>>>>>> file locks (-x) unlimited
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [root_at_compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>>>>>>> Host *
>>>>>>>>>>>>>>> CheckHostIP no
>>>>>>>>>>>>>>> ForwardX11 yes
>>>>>>>>>>>>>>> ForwardAgent yes
>>>>>>>>>>>>>>> StrictHostKeyChecking no
>>>>>>>>>>>>>>> UsePrivilegedPort no
>>>>>>>>>>>>>>> Protocol 2,1
>>>>>>>>>>>>>>>