
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Latest SVN failures
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2009-02-27 17:05:38


With further investigation, I have reproduced this problem. I think I
was originally testing against a version that was not recent enough. I
do not see it with r20594, which is from February 19, so something must
have happened over the last 8 days. I will try to narrow down the issue.

Rolf

On 02/27/09 09:34, Rolf Vandevaart wrote:
>
> I just tried trunk-1.4a1r20458 and I did not see this error, although my
> configuration was rather different. I ran across 100 2-CPU SPARC nodes,
> np=256, connected with TCP.
>
> Hopefully George's comment helps out with this issue.
>
> One other thought: to see whether SGE has anything to do with this,
> create a hostfile and run the job outside of SGE.
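A minimal sketch of that check, assuming password-less ssh/rsh to the nodes so mpirun can launch without qrsh; the host names, slot counts, and the shortened namd2 path are placeholders based on the 8-core nodes described later in the thread:

  # placeholder hostfile listing the allocated nodes, 8 slots each
  cat > myhosts <<EOF
  s0128 slots=8
  s0156 slots=8
  s0164 slots=8
  EOF

  # same binary, launched directly over ssh/rsh instead of through qrsh
  mpirun -np 24 -machinefile myhosts --mca btl sm,openib,self ./namd2 stmv.namd

If the error disappears when launched this way, the SGE/qrsh integration becomes the prime suspect.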
>
> Rolf
>
> On 02/26/09 22:10, Ralph Castain wrote:
>> FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
>> and everything checks out fine. However, this is running under SGE and
>> thus using qrsh, so it is possible the SGE support is having a problem.
>>
>> Perhaps one of the Sun OMPI developers can help here?
>>
>> Ralph
>>
>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>
>>> It looks like the system doesn't know what nodes the procs are to be
>>> placed upon. Can you run this with --display-devel-map? That will
>>> tell us where the system thinks it is placing things.
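For reference, a sketch of that flag added to a stripped-down launch line (paths shortened; everything else as in the failing run):

  mpirun -np 256 --display-devel-map \
      --mca btl sm,openib,self -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd

The map output lists each node the mapper selected and which ranks it placed there, so an empty or truncated map would point at the daemons never having learned the node list.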
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>
>>>> Maybe it's my pine mailer.
>>>>
>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>>> Shanghai nodes running a standard benchmark called stmv.
>>>>
>>>> The basic error message, which occurs 31 times is like:
>>>>
>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>
>>>> The mpirun command has long paths in it, sorry. It's invoking a
>>>> special binding script which in turn launches the NAMD run. This
>>>> works on an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and
>>>> 512 procs) but not for this 256 proc run, where the older SVN hangs
>>>> indefinitely polling some completion (sm or openib). So, I was trying
>>>> later SVNs with this 256 proc run, hoping the error would go away.
>>>>
>>>> Here's some of the invocation again. Hope you can read it:
>>>>
>>>> EAGER_SIZE=32767
>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>
>>>> and, unexpanded
>>>>
>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2
>>>> stmv.namd
>>>>
>>>> and, expanded
>>>>
>>>> mpirun --prefix
>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>> -np 256 --mca btl sm,openib,self -x
>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>>>> OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts
>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>> stmv.namd
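As a side note, the exported OMPI_MCA_* variables above are just the environment-variable spelling of MCA parameters; a sketch of the same launch with the values passed as --mca arguments instead, which removes the need for the -x flags (variable names as in the unexpanded form above):

  mpirun --prefix $PREFIX -np 256 \
      --mca btl sm,openib,self \
      --mca btl_openib_use_eager_rdma 0 \
      --mca btl_openib_eager_limit 32767 \
      --mca btl_self_eager_limit 32767 \
      --mca btl_sm_eager_limit 32767 \
      -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd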
>>>>
>>>> This is all via Sun Grid Engine.
>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>
>>>> DM
>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>
>>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>>> provide a little explanation of what you are doing, what the system
>>>>> looks like, what is supposed to happen, etc.? I can barely parse
>>>>> your cmd line...
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>
>>>>>> Today's and yesterday's.
>>>>>>
>>>>>> 1.4a1r20643_svn
>>>>>>
>>>>>> + mpirun --prefix
>>>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>> -np 256 --mca btl sm,openib,self -x
>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>> -machinefile /tmp/48269.1.all.q/newhosts
>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>> stmv.namd
>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file
>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file
>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file
>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file
>>>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>
>>>>>> Built with Intel compilers 10.1.015.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mostyn
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================