Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-26 22:10:14


FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
and everything checks out fine. However, the failing job is running under
SGE and thus launching via qrsh, so it is possible the SGE support is
having a problem.

Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

> It looks like the system doesn't know what nodes the procs are to be
> placed upon. Can you run this with --display-devel-map? That will
> tell us where the system thinks it is placing things.
>
> Thanks
> Ralph
>
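A minimal sketch of the rerun suggested above, written as a dry run that only prints the command. `--display-devel-map` is the `mpirun` option named in the message; `PREFIX`, `MCA`, `HOSTS`, `MPI_BINDER` and `NAMD2` are the variable names from the poster's unexpanded command, while `NP` and all default values are placeholders assumed here for illustration.

```shell
#!/bin/sh
# Dry run: print the mpirun command with --display-devel-map added.
# Nothing is launched. NP and every default value below are placeholders,
# not paths or settings taken from the poster's system.
build_devel_map_cmd() {
    cmd="mpirun --prefix ${PREFIX:-/path/to/openmpi} -np ${NP:-256}"
    # Ask mpirun to print where it believes each process will be placed.
    cmd="$cmd --display-devel-map ${MCA:---mca btl sm,openib,self}"
    cmd="$cmd -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit"
    cmd="$cmd -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit"
    cmd="$cmd -machinefile ${HOSTS:-hosts} ${MPI_BINDER:-./mpi_binder} ${NAMD2:-./namd2} stmv.namd"
    printf '%s\n' "$cmd"
}
```

Calling `build_devel_map_cmd` prints the command line for review; running the printed command inside the same SGE job would then show the map alongside any errors.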
> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>
>> Maybe it's my pine mailer.
>>
>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>> Shanghai nodes running a standard benchmark called stmv.
>>
>> The basic error message, which occurs 31 times, looks like this:
>>
>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in
>> file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at
>> line 595
>>
>> The mpirun command has long paths in it, sorry. It's invoking a special
>> binding script which in turn launches the NAMD run. This works on an older
>> SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for
>> this 256 proc run, where the older SVN hangs indefinitely polling some
>> completion (sm or openib). So, I was trying later SVNs with this 256 proc
>> run, hoping the error would go away.
>>
>> Here's some of the invocation again. Hope you can read it:
>>
>> EAGER_SIZE=32767
>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>
>> and, unexpanded
>>
>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>> OMPI_MCA_btl_openib_use_eager_rdma -x
>> OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2
>> stmv.namd
>>
>> and, expanded
>>
>> mpirun --prefix
>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>> -np 256 --mca btl sm,openib,self
>> -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>> -machinefile /tmp/48292.1.all.q/newhosts
>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>> stmv.namd
>>
>> This is all via Sun Grid Engine.
>> The OS as indicated above is SuSE SLES 10 SP2.
>>
>> DM
>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>
>>> I'm sorry, but I can't make any sense of this message. Could you
>>> provide a little explanation of what you are doing, what the system
>>> looks like, what is supposed to happen, etc? I can barely parse your
>>> cmd line...
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>
>>>> Today's and yesterday's.
>>>>
>>>> 1.4a1r20643_svn
>>>>
>>>> + mpirun --prefix
>>>> /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>> -np 256 --mca btl sm,openib,self
>>>> -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>> -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>> -machinefile /tmp/48269.1.all.q/newhosts
>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>> /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>> stmv.namd
>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file
>>>> ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>
>>>> Made with INTEL compilers 10.1.015.
>>>>
>>>>
>>>> Regards,
>>>> Mostyn
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
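The ORTE_ERROR_LOG records quoted above can be tallied per reporting host, which may help show whether every node fails or only some. The following is a rough sketch, not something from the thread: the function name `summarize_orte_errors` is hypothetical, and it assumes the output has been saved to a file with each record on a single unwrapped line of the form `[host:pid] [[a,b],n] ORTE_ERROR_LOG: ...`. The last number in the second bracketed field is taken at face value as an identifier for the reporting process.

```shell
#!/bin/sh
# Tally ORTE_ERROR_LOG records by host and by the trailing number of the
# second bracketed field, e.g. "[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: ...".
# Assumes one record per line; wrapped lines must be rejoined first.
summarize_orte_errors() {
    awk '/ORTE_ERROR_LOG/ {
        host = $1
        sub(/^\[/, "", host)      # drop the leading "["
        sub(/:.*/, "", host)      # drop ":pid]"
        id = $2
        sub(/^.*,/, "", id)       # keep what follows the last comma
        sub(/\]$/, "", id)        # drop the trailing "]"
        count[host " " id]++
    }
    END { for (k in count) print k, count[k] }' "$1" | sort -k2,2n
}
```

Each output line has the form `host id count`, sorted numerically by the id column, so gaps or patterns in the reporting processes are easy to spot.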