Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-26 21:21:41


It looks like the system doesn't know what nodes the procs are to be
placed upon. Can you run this with --display-devel-map? That will tell
us where the system thinks it is placing things.
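
As a rough sketch, assuming the same variables as in your unexpanded
command below (PREFIX, HOSTS, MPI_BINDER, NAMD2; the fallback values and
NP here are only placeholders), adding the flag would look something like
this. The -x exports are left out for brevity, and the script only prints
the command so it can be checked before launching:

```shell
#!/bin/sh
# Placeholder defaults (assumptions, not real paths); override via the environment.
PREFIX="${PREFIX:-/path/to/openmpi}"
NP="${NP:-256}"
HOSTS="${HOSTS:-newhosts}"
MPI_BINDER="${MPI_BINDER:-./mpi_binder.MRL}"
NAMD2="${NAMD2:-./namd2}"

# Same launch line as before, with --display-devel-map added.
set -- mpirun --prefix "$PREFIX" -np "$NP" --display-devel-map \
    --mca btl sm,openib,self -machinefile "$HOSTS" \
    "$MPI_BINDER" "$NAMD2" stmv.namd

# Print the command for review; use `exec "$@"` instead to actually launch it.
echo "$@"
```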

Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:

> Maybe it's my pine mailer.
>
> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
> Shanghai nodes, running a standard benchmark called stmv.
>
> The basic error message, which occurs 31 times, looks like this:
>
> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in
> file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at
> line 595
>
> The mpirun command has long paths in it, sorry. It's invoking a special
> binding script which in turn launches the NAMD run. This works with an
> older SVN build, 1.4a1r20123, for 16, 32, 64, 128 and 512 procs, but not
> for this 256-proc run, where the older build hangs indefinitely polling
> for some completion (sm or openib). So I was trying later SVN builds with
> this 256-proc run, hoping the error would go away.
>
> Here's some of the invocation again. Hope you can read it:
>
> EAGER_SIZE=32767
> export OMPI_MCA_btl_openib_use_eager_rdma=0
> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>
> and, unexpanded
>
> mpirun --prefix $PREFIX -np %PE% $MCA -x
> OMPI_MCA_btl_openib_use_eager_rdma -x
> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2
> stmv.namd
>
> and, expanded
>
> mpirun
>   --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>   -np 256 --mca btl sm,openib,self
>   -x OMPI_MCA_btl_openib_use_eager_rdma
>   -x OMPI_MCA_btl_openib_eager_limit
>   -x OMPI_MCA_btl_self_eager_limit
>   -x OMPI_MCA_btl_sm_eager_limit
>   -machinefile /tmp/48292.1.all.q/newhosts
>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>   stmv.namd
>
> This is all via Sun Grid Engine.
> The OS as indicated above is SuSE SLES 10 SP2.
>
> DM
> On Thu, 26 Feb 2009, Ralph Castain wrote:
>
>> I'm sorry, but I can't make any sense of this message. Could you
>> provide a little explanation of what you are doing, what the system
>> looks like, what is supposed to happen, etc? I can barely parse your
>> cmd line...
>>
>> Thanks
>> Ralph
>>
>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>
>>> Today's and yesterday's.
>>>
>>> 1.4a1r20643_svn
>>>
>>> + mpirun
>>>   --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>   -np 256 --mca btl sm,openib,self
>>>   -x OMPI_MCA_btl_openib_use_eager_rdma
>>>   -x OMPI_MCA_btl_openib_eager_limit
>>>   -x OMPI_MCA_btl_self_eager_limit
>>>   -x OMPI_MCA_btl_sm_eager_limit
>>>   -machinefile /tmp/48269.1.all.q/newhosts
>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>   stmv.namd
>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>
>>> Made with INTEL compilers 10.1.015.
>>>
>>>
>>> Regards,
>>> Mostyn
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users