Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2009-03-10 12:50:53


Latest status - 1.4a1r20757 (yesterday's SVN):
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded. Can not communicate with peer'
errors.

e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer

The nodes are O.K. ...
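
One possible next step (a sketch only, assuming this build still exposes the
usual oob tcp MCA parameters; ompi_info will list the exact names) would be
to raise the OOB retry limit and see whether the errors persist, e.g.

# list the oob tcp parameters this build understands
ompi_info --param oob tcp

# if oob_tcp_peer_retries is among them, raise it and rerun a trivial job
# ($HOSTS is the same machinefile used in the scripts quoted below)
mpirun --mca oob_tcp_peer_retries 120 -machinefile $HOSTS -np 256 hostname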

Any ideas folks?

DM

On Sat, 28 Feb 2009, Ralph Castain wrote:

> I think I have this figured out - will fix on Monday. I'm not sure why Jeff's
> conditions are all required, especially the second one. However, the
> fundamental problem is that we pull information from two sources regarding
> the number of procs in the job when unpacking a buffer, and the two sources
> appear to be out-of-sync with each other in certain scenarios.
>
> The details are beyond the user list. I'll respond here again once I get it
> fixed.
>
> Ralph
>
> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>
>> Unfortunately, I think I have reproduced the problem as well -- with SVN
>> trunk HEAD (r20655):
>>
>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed
>> in file base/odls_base_default_fns.c at line 566
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>> Notice that I'm not trying to run an MPI app -- it's just "uptime".
>>
>> The following things seem to be necessary to make this error occur for me:
>>
>> 1. --bynode
>> 2. set some mca parameter (any mca parameter)
>> 3. -np value less than the size of my slurm allocation
>>
>> If I remove any of those, it seems to run fine (see the sketch below).
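>>
>> A minimal reproduction of those three conditions might look like this
>> (a sketch only; the allocation size is a placeholder - anything larger
>> than the -np value should do):
>>
>> # grab a SLURM allocation of, say, 4 nodes
>> salloc -N 4
>> # inside the allocation: any MCA parameter + --bynode + an undersized -np
>> mpirun --mca bogus foo --bynode -np 2 uptime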
>>
>>
>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>
>>> With further investigation, I have reproduced this problem. I think I was
>>> originally testing against a version that was not recent enough. I do not
>>> see it with r20594 which is from February 19. So, something must have
>>> happened over the last 8 days. I will try and narrow down the issue.
>>>
>>> Rolf
>>>
>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>> I just tried trunk-1.4a1r20458 and I did not see this error, although my
>>>> configuration was rather different. I ran across 100 2-CPU SPARC nodes,
>>>> np=256, connected with TCP.
>>>> Hopefully George's comment helps out with this issue.
>>>> One other thought: to see whether SGE has anything to do with this,
>>>> create a hostfile and run the job outside of SGE.
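>>>> For example (hostnames and slot counts here are just placeholders for
>>>> the real nodes), from a plain shell outside of SGE:
>>>> cat > myhosts <<EOF
>>>> s0128 slots=8
>>>> s0156 slots=8
>>>> EOF
>>>> mpirun -np 16 -machinefile myhosts hostname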
>>>> Rolf
>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and
>>>>> everything checks out fine. However, your job is running under SGE and
>>>>> thus using qrsh, so it is possible the SGE support is having a problem.
>>>>>
>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>
>>>>> Ralph
>>>>>
>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>
>>>>>> It looks like the system doesn't know what nodes the procs are to be
>>>>>> placed upon. Can you run this with --display-devel-map? That will tell
>>>>>> us where the system thinks it is placing things.
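>>>>>>
>>>>>> For example, adding the flag to the existing command line should be
>>>>>> enough (a sketch based on the unexpanded command quoted further down;
>>>>>> the -x options are left out here for brevity):
>>>>>>
>>>>>> mpirun --display-devel-map --prefix $PREFIX -np %PE% $MCA \
>>>>>>   -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd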
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>
>>>>>>> Maybe it's my pine mailer.
>>>>>>>
>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>>>>>> Shanghai nodes, running a standard benchmark called stmv.
>>>>>>>
>>>>>>> The basic error message, which occurs 31 times, looks like this:
>>>>>>>
>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>
>>>>>>> The mpirun command has long paths in it, sorry. It invokes a special
>>>>>>> binding script which in turn launches the NAMD run. This works with an
>>>>>>> older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs), but
>>>>>>> not for this 256 proc run, where the older SVN hangs indefinitely
>>>>>>> polling some completion (sm or openib). So, I was trying later SVNs
>>>>>>> with this 256 proc run, hoping the error would go away.
>>>>>>>
>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>
>>>>>>> EAGER_SIZE=32767
>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>
>>>>>>> and, unexpanded
>>>>>>>
>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>>> -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>>>>>
>>>>>>> and, expanded
>>>>>>>
>>>>>>> mpirun --prefix
>>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>> -np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma
>>>>>>> -x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts
>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>> stmv.namd
>>>>>>>
>>>>>>> This is all via Sun Grid Engine.
>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>
>>>>>>> DM
>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>
>>>>>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>>>>>> provide a little explanation of what you are doing, what the system
>>>>>>>> looks like, what is supposed to happen, etc.? I can barely parse your
>>>>>>>> cmd line...
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>
>>>>>>>>> Today's and yesterday's.
>>>>>>>>>
>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>
>>>>>>>>> + mpirun --prefix
>>>>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>> -np 256 --mca btl sm,openib,self
>>>>>>>>> -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>>>>> -machinefile /tmp/48269.1.all.q/newhosts
>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>> stmv.namd
>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file
>>>>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>
>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mostyn
>>>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>> --
>>>
>>> =========================
>>> rolf.vandevaart_at_[hidden]
>>> 781-442-3043
>>> =========================
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>