Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2009-02-27 09:34:47


I just tried trunk-1.4a1r20458 and I did not see this error, although my
configuration was rather different: I ran across 100 2-CPU SPARC nodes
with np=256, connected via TCP.
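
(For reference, selecting the TCP BTL explicitly looks roughly like the
following sketch; the hostfile name and executable are placeholders, not
taken from my actual run:)

  mpirun -np 256 --mca btl tcp,sm,self -machinefile myhosts ./a.out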

Hopefully George's comment helps out with this issue.

One other thought: to see whether SGE has anything to do with this,
create a hostfile by hand and run the job outside of SGE.
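
For example, a minimal sketch (the node names, slot counts, and process
count below are placeholders, not taken from your cluster):

  cat > myhosts <<EOF
  node01 slots=8
  node02 slots=8
  EOF
  mpirun -np 16 -machinefile myhosts ./namd2 stmv.namd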

Rolf

On 02/26/09 22:10, Ralph Castain wrote:
> FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and
> everything checks out fine. However, your run is under SGE and thus
> using qrsh, so it is possible the SGE support is having a problem.
>
> Perhaps one of the Sun OMPI developers can help here?
>
> Ralph
>
> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>
>> It looks like the system doesn't know what nodes the procs are to be
>> placed upon. Can you run this with --display-devel-map? That will tell
>> us where the system thinks it is placing things.
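>>
>> For example, a minimal sketch reusing the placeholders from the
>> unexpanded mpirun command quoted below; only --display-devel-map is
>> new, everything else is unchanged:
>>
>>   mpirun --display-devel-map --prefix $PREFIX -np %PE% $MCA \
>>       -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
>>       -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
>>       -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd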
>>
>> Thanks
>> Ralph
>>
>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>
>>> Maybe it's my pine mailer.
>>>
>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>> Shanghai nodes, running a standard benchmark called stmv.
>>>
>>> The basic error message, which occurs 31 times, looks like:
>>>
>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>
>>> The mpirun command has long paths in it, sorry. It invokes a special
>>> binding script which in turn launches the NAMD run. This works with an
>>> older SVN revision, 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs),
>>> but not for this 256-proc run, where the older SVN hangs indefinitely
>>> polling for some completion (sm or openib). So I was trying later SVNs
>>> with this 256-proc run, hoping the error would go away.
>>>
>>> Here's some of the invocation again. Hope you can read it:
>>>
>>> EAGER_SIZE=32767
>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>
>>> and, unexpanded
>>>
>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>> OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>> -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>
>>> and, expanded
>>>
>>> mpirun --prefix
>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>> -np 256 --mca btl sm,openib,self -x
>>> OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>> -machinefile /tmp/48292.1.all.q/newhosts
>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>> stmv.namd
>>>
>>> This is all via Sun Grid Engine.
>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>
>>> DM
>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>
>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>> provide a little explanation of what you are doing, what the system
>>>> looks like, what is supposed to happen, etc? I can barely parse your
>>>> cmd line...
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>
>>>>> Today's and yesterday's.
>>>>>
>>>>> 1.4a1r20643_svn
>>>>>
>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>   -np 256 --mca btl sm,openib,self
>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>   -machinefile /tmp/48269.1.all.q/newhosts
>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>   stmv.namd
>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>
>>>>> Built with Intel compilers 10.1.015.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mostyn
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================