Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
From: Gus Correa (gus_at_[hidden])
Date: 2011-05-20 14:30:13


Hi Salvatore

Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared memory BTL for intra-node traffic,
whereas your other choice, "--mca btl_tcp_if_include ib0", will not.
Could this be the problem?

Here we use "--mca btl openib,self,sm"
to enable the shared memory BTL for intra-node traffic as well,
and it works just fine on programs that do use collective calls.
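
For example, something along these lines (hostfile and executable
names are just placeholders):

  mpirun --hostfile myhosts -np 64 \
         --mca btl openib,sm,self ./my_mpi_app

whereas the TCP-over-IPoIB alternative you mention would look like:

  mpirun --hostfile myhosts -np 64 \
         --mca btl tcp,sm,self --mca btl_tcp_if_include ib0 ./my_mpi_app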

My two cents,
Gus Correa
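
P.S. If it helps to take IMB out of the picture, a bare-bones
collective test along these lines (just a sketch; compile with mpicc
and launch with mpirun as above) exercises the same MPI_Barrier code
path:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A single barrier across all ranks; a hang here would match
     * the ompi_coll_tuned_barrier_* backtraces quoted below. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("barrier completed on %d ranks\n", size);

    MPI_Finalize();
    return 0;
}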

Salvatore Podda wrote:
> We are still struggling with these problems. Actually, the new version of
> the Intel compilers does
> not seem to be the real issue. We run into the same errors using
> the `gcc' compilers as well.
> We succeeded in building an openmpi-1.2.8 rpm (with different compiler
> flavours) from the installation
> of the cluster section where everything seems to work well. We are now
> running an extensive IMB benchmark campaign.
>
> However, yes, this happens only when we use --mca btl openib,self; on
> the contrary, if we use
> --mca btl_tcp_if_include ib0 everything works well.
> Yes, we can try the flag you suggest. I can check the FAQ and the
> open-mpi.org documentation,
> but could you kindly explain the meaning of this flag?
>
> Thanks
>
> Salvatore Podda
>
> On 20/mag/11, at 03:37, Jeff Squyres wrote:
>
>> Sorry for the late reply.
>>
>> Other users have seen something similar but we have never been able to
>> reproduce it. Is this only when using IB? If you use "mpirun --mca
>> btl_openib_cpc_if_include rdmacm", does the problem go away?
>>
>>
>> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>>
>>> I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
>>> but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
>>> then the collective hangs go away. I don't know what, if anything,
>>> the higher optimization buys you when compiling openmpi, so I'm not
>>> sure if that's an acceptable workaround or not.
>>>
>>> My system is similar to yours - Intel X5570 with QDR Mellanox IB
>>> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
>>> using IMB 3.2.2 with a single iteration of Barrier to reproduce the
>>> hang, and it happens 100% of the time for me when I invoke it like this:
>>>
>>> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>>>
>>> The hang happens on the first Barrier (64 ranks), and each of the
>>> participating ranks has this backtrace:
>>>
>>> __poll (...)
>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_recursivedoubling () from
>>> [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_dec_fixed () from
>>> [instdir]/lib/libmpi.so.0
>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>> IMB_barrier ()
>>> IMB_init_buffers_iter ()
>>> main ()
>>>
>>> The one non-participating rank has this backtrace:
>>>
>>> __poll (...)
>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_dec_fixed () from
>>> [instdir]/lib/libmpi.so.0
>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>> main ()
>>>
>>> If I use more nodes I can get it to hang with 1ppn, so that seems to
>>> rule out the sm btl (or interactions with it) as a culprit at least.
>>>
>>> I can't reproduce this with openmpi 1.5.3, interestingly.
>>>
>>> -Marcus
>>>
>>>
>>> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>>>> Dear all,
>>>>
>>>> we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3,
>>>> with Intel Composer XE 2011 (aka 12.0).
>>>> However, we found a threshold in the number of cores (depending on the
>>>> application: IMB, xhpl or user applications,
>>>> and on the number of required cores) above which the application hangs
>>>> (a sort of deadlock).
>>>> Building openmpi with 'gcc' and 'pgi' does not show the same
>>>> limits.
>>>> Are there any known incompatibilities of openmpi with this version of
>>>> the Intel compilers?
>>>>
>>>> The characteristics of our computational infrastructure are:
>>>>
>>>> Intel processors E7330, E5345, E5530 and E5620
>>>>
>>>> CentOS 5.3, CentOS 5.5.
>>>>
>>>> Intel composer XE 2011
>>>> gcc 4.1.2
>>>> pgi 10.2-1
>>>>
>>>> Regards
>>>>
>>>> Salvatore Podda
>>>>
>>>> ENEA UTICT-HPC
>>>> Department for Computer Science Development and ICT
>>>> Facilities Laboratory for Science and High Performance Computing
>>>> C.R. Frascati
>>>> Via E. Fermi, 45
>>>> PoBox 65
>>>> 00044 Frascati (Rome)
>>>> Italy
>>>>
>>>> Tel: +39 06 9400 5342
>>>> Fax: +39 06 9400 5551
>>>> Fax: +39 06 9400 5735
>>>> E-mail: salvatore.podda_at_[hidden]
>>>> Home Page: www.cresco.enea.it
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>