Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Latest SVN failures
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-02-26 22:59:37


Last time I got such an error was when the shared libraries on my head
node didn't match the ones loaded by the compute nodes. It was a simple
LD_LIBRARY_PATH mistake on my part, and it was the last time I
didn't build my tree with --enable-mpirun-prefix-by-default.
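
For reference, that flag is an Open MPI configure-time option; a
minimal sketch, with a placeholder install prefix, looks like:

    ./configure --prefix=/path/to/openmpi/install \
        --enable-mpirun-prefix-by-default
    make all install

With it, mpirun behaves as if --prefix had been given, so PATH and
LD_LIBRARY_PATH are set up on the remote nodes and this kind of
mismatch is avoided.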

   george.

On Feb 26, 2009, at 22:10, Ralph Castain wrote:

> FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
> and everything checks out fine. However, this is running under SGE
> and thus using qrsh, so it is possible the SGE support is having a
> problem.
>
> Perhaps one of the Sun OMPI developers can help here?
>
> Ralph
>
> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>
>> It looks like the system doesn't know what nodes the procs are to
>> be placed upon. Can you run this with --display-devel-map? That
>> will tell us where the system thinks it is placing things.
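>>
>> For example, an abbreviated sketch of your mpirun line with the flag
>> added (the -x exports are omitted here for brevity):
>>
>>    mpirun --display-devel-map --prefix $PREFIX -np 256 \
>>        --mca btl sm,openib,self -machinefile $HOSTS \
>>        $MPI_BINDER $NAMD2 stmv.namd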
>>
>> Thanks
>> Ralph
>>
>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>
>>> Maybe it's my pine mailer.
>>>
>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
>>> Shanghai nodes running a standard benchmark called stmv.
>>>
>>> The basic error message, which occurs 31 times, looks like:
>>>
>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>
>>> The mpirun command has long paths in it, sorry. It's invoking a
>>> special binding script which in turn launches the NAMD run. This
>>> works on an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128,
>>> and 512 procs) but not for this 256 proc run, where the older SVN
>>> hangs indefinitely polling some completion (sm or openib). So, I
>>> was trying later SVNs with this 256 proc run, hoping the error
>>> would go away.
>>>
>>> Here's some of the invocation again. Hope you can read it:
>>>
>>> EAGER_SIZE=32767
>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>
>>> and, unexpanded
>>>
>>> mpirun --prefix $PREFIX -np %PE% $MCA \
>>>     -x OMPI_MCA_btl_openib_use_eager_rdma \
>>>     -x OMPI_MCA_btl_openib_eager_limit \
>>>     -x OMPI_MCA_btl_self_eager_limit \
>>>     -x OMPI_MCA_btl_sm_eager_limit \
>>>     -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>
>>> and, expanded
>>>
>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>     -np 256 --mca btl sm,openib,self \
>>>     -x OMPI_MCA_btl_openib_use_eager_rdma \
>>>     -x OMPI_MCA_btl_openib_eager_limit \
>>>     -x OMPI_MCA_btl_self_eager_limit \
>>>     -x OMPI_MCA_btl_sm_eager_limit \
>>>     -machinefile /tmp/48292.1.all.q/newhosts \
>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
>>>     stmv.namd
>>>
>>> This is all via Sun Grid Engine.
>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>
>>> DM
>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>
>>>> I'm sorry, but I can't make any sense of this message. Could you
>>>> provide a
>>>> little explanation of what you are doing, what the system looks
>>>> like, what is
>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>
>>>>> Today's and yesterday's.
>>>>>
>>>>> 1.4a1r20643_svn
>>>>>
>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>     -np 256 --mca btl sm,openib,self
>>>>>     -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>     -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>     -machinefile /tmp/48269.1.all.q/newhosts
>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>     /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>     stmv.namd
>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>
>>>>> Made with INTEL compilers 10.1.015.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mostyn
>>>>>