
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Latest SVN failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-27 18:14:44


Unfortunately, I think I have reproduced the problem as well -- with
SVN trunk HEAD (r20655):

[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

Notice that I'm not trying to run an MPI app -- it's just "uptime".

The following things seem to be necessary to make this error occur for
me:

1. --bynode
2. set some MCA parameter (any MCA parameter)
3. -np value less than the size of my SLURM allocation

If I remove any of those, it seems to run fine.
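
For anyone trying this at home: a minimal pair of commands along these
lines (assuming a SLURM allocation larger than -np; the MCA parameter
name itself doesn't matter) shows the difference for me:

  # fails: --bynode + any MCA param + -np smaller than the allocation
  mpirun --mca bogus foo --bynode -np 2 uptime

  # seems to run fine: same thing without --bynode
  mpirun --mca bogus foo -np 2 uptime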

On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:

> With further investigation, I have reproduced this problem. I think
> I was originally testing against a version that was not recent
> enough. I do not see it with r20594, which is from February 19. So,
> something must have happened over the last 8 days. I will try to
> narrow down the issue.
>
> Rolf
>
> On 02/27/09 09:34, Rolf Vandevaart wrote:
>> I just tried trunk-1.4a1r20458 and I did not see this error,
>> although my configuration was rather different. I ran across 100
>> 2-CPU SPARC nodes, np=256, connected with TCP.
>> Hopefully George's comment helps out with this issue.
>> One other thought: to see whether SGE has anything to do with this,
>> create a hostfile and run outside of SGE, along the lines sketched
>> below.
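>> A minimal sketch (hostnames here are hypothetical) would be
>> something like:
>>
>>   % cat hostfile
>>   node1 slots=2
>>   node2 slots=2
>>   % mpirun -hostfile hostfile -np 4 uptime
>>
>> run from a plain shell rather than through qrsh; if the error goes
>> away, that points at the SGE support.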
>> Rolf
>> On 02/26/09 22:10, Ralph Castain wrote:
>>> FWIW: I tested the trunk tonight using both SLURM and rsh
>>> launchers, and everything checks out fine. However, this is
>>> running under SGE and thus using qrsh, so it is possible the SGE
>>> support is having a problem.
>>>
>>> Perhaps one of the Sun OMPI developers can help here?
>>>
>>> Ralph
>>>
>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>
>>>> It looks like the system doesn't know what nodes the procs are to
>>>> be placed upon. Can you run this with --display-devel-map? That
>>>> will tell us where the system thinks it is placing things.
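>>>>
>>>> (Just add the flag to your existing command line, e.g.:
>>>>
>>>>   mpirun --display-devel-map -np 256 ... namd2 stmv.namd
>>>>
>>>> where the "..." stands for the rest of your current options, and
>>>> send us the output.)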
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>
>>>>> Maybe it's my pine mailer.
>>>>>
>>>>> This is a NAMD run on 256 procs across 32 dual-socket quad-core
>>>>> AMD Shanghai nodes, running a standard benchmark called stmv.
>>>>>
>>>>> The basic error message, which occurs 31 times, looks like:
>>>>>
>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>
>>>>> The mpirun command has long paths in it, sorry. It invokes a
>>>>> special binding script which in turn launches the NAMD run. This
>>>>> works on an older SVN at level 1.4a1r20123 (for 16, 32, 64, 128,
>>>>> and 512 procs), but not for this 256 proc run, where the older
>>>>> SVN hangs indefinitely polling some completion (sm or openib).
>>>>> So, I was trying later SVNs with this 256 proc run, hoping the
>>>>> error would go away.
>>>>>
>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>
>>>>> EAGER_SIZE=32767
>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
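>>>>>
>>>>> (If I understand the MCA machinery right, each OMPI_MCA_<name>
>>>>> environment variable is equivalent to passing --mca <name> <value>
>>>>> on the mpirun command line, e.g. --mca btl_sm_eager_limit 32767;
>>>>> the -x flags below just re-export the variables to the remote
>>>>> processes.)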
>>>>>
>>>>> and, unexpanded
>>>>>
>>>>> mpirun --prefix $PREFIX -np %PE% $MCA \
>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma \
>>>>>   -x OMPI_MCA_btl_openib_eager_limit \
>>>>>   -x OMPI_MCA_btl_self_eager_limit \
>>>>>   -x OMPI_MCA_btl_sm_eager_limit \
>>>>>   -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd
>>>>>
>>>>> and, expanded
>>>>>
>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>>>   -np 256 --mca btl sm,openib,self \
>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
>>>>>   -machinefile /tmp/48292.1.all.q/newhosts \
>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
>>>>>   stmv.namd
>>>>>
>>>>> This is all via Sun Grid Engine.
>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>
>>>>> DM
>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>
>>>>>> I'm sorry, but I can't make any sense of this message. Could
>>>>>> you provide a
>>>>>> little explanation of what you are doing, what the system looks
>>>>>> like, what is
>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>
>>>>>>> Today's and yesterday's.
>>>>>>>
>>>>>>> 1.4a1r20643_svn
>>>>>>>
>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron \
>>>>>>>   -np 256 --mca btl sm,openib,self \
>>>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit \
>>>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit \
>>>>>>>   -machinefile /tmp/48269.1.all.q/newhosts \
>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL \
>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2 \
>>>>>>>   stmv.namd
>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>
>>>>>>> Built with the Intel 10.1.015 compilers.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mostyn
>>>>>>>
>
>
> --
>
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================

-- 
Jeff Squyres
Cisco Systems