Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
From: Gus Correa (gus_at_[hidden])
Date: 2011-05-20 14:30:13


Hi Salvatore

Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared-memory btl for intra-node traffic,
whereas your other choice "--mca btl_tcp_if_include ib0" will not.
Could this be the problem?

Here we use "--mca btl openib,self,sm",
to enable the shared-memory btl for intra-node traffic as well,
and it works just fine on programs that do use collective calls.
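For example (the process count and host file name below are just
placeholders), an invocation along these lines:

   mpirun --mca btl openib,self,sm -np 64 -hostfile myhosts ./IMB-MPI1 barrier

uses openib between nodes, sm between ranks on the same node, and self
for a rank sending to itself.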

My two cents,
Gus Correa

Salvatore Podda wrote:
> We are still struggling with these problems. Actually, the new
> version of the Intel compilers does not seem to be the real issue: we
> run into the same errors when using the `gcc' compilers as well.
> We succeeded in building an openmpi-1.2.8 rpm (with different
> compiler flavours) from the installation on the section of the
> cluster where everything seems to work well. We are now running an
> extensive IMB benchmark campaign.
>
> However, yes, this happens only when we use --mca btl openib,self;
> on the contrary, if we use --mca btl_tcp_if_include ib0 everything
> works well.
> Yes, we can try the flag you suggest. I can check the FAQ and the
> open-mpi.org documentation, but could you kindly explain the meaning
> of this flag?
>
> Thanks
>
> Salvatore Podda
>
> On 20/May/11, at 03:37, Jeff Squyres wrote:
>
>> Sorry for the late reply.
>>
>> Other users have seen something similar but we have never been able to
>> reproduce it. Is this only when using IB? If you use "mpirun --mca
>> btl_openib_cpc_if_include rdmacm", does the problem go away?
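>> For instance (the process count and binary below are just
>> placeholders taken from the IMB runs in this thread), something like:
>>
>>   mpirun --mca btl openib,sm,self --mca btl_openib_cpc_if_include rdmacm -np 64 ./IMB-MPI1 barrier
>>
>> i.e. your usual command line with only that one extra MCA parameter
>> added.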
>>
>>
>> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>>
>>> I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
>>> but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
>>> then the collective hangs go away. I don't know what, if anything,
>>> the higher optimization buys you when compiling openmpi, so I'm not
>>> sure if that's an acceptable workaround or not.
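>>> (Just to sketch what I mean; the install prefix and compiler names
>>> are whatever you use on your system:
>>>
>>>   ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-O1 --prefix=[instdir]
>>>   make all install
>>>
>>> i.e. the same build, only with the C optimization level dropped from
>>> -O2/-O3 to -O1.)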
>>>
>>> My system is similar to yours - Intel X5570 with QDR Mellanox IB
>>> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
>>> using IMB 3.2.2 with a single iteration of Barrier to reproduce the
>>> hang, and it happens 100% of the time for me when I invoke it like this:
>>>
>>> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>>>
>>> The hang happens on the first Barrier (64 ranks) and each of the
>>> participating ranks has this backtrace:
>>>
>>> __poll (...)
>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>> IMB_barrier ()
>>> IMB_init_buffers_iter ()
>>> main ()
>>>
>>> The one non-participating rank has this backtrace:
>>>
>>> __poll (...)
>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>> main ()
>>>
>>> If I use more nodes I can get it to hang with 1ppn, so that seems to
>>> rule out the sm btl (or interactions with it) as a culprit at least.
>>>
>>> I can't reproduce this with openmpi 1.5.3, interestingly.
>>>
>>> -Marcus
>>>
>>>
>>> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>>>> Dear all,
>>>>
>>>> we succeeded in building several versions of openmpi, from 1.2.8 to
>>>> 1.4.3, with Intel Composer XE 2011 (aka 12.0).
>>>> However, we found a threshold in the number of cores (depending on
>>>> the application: IMB, xhpl or user applications, and on the number of
>>>> required cores) above which the application hangs (a sort of
>>>> deadlock).
>>>> Builds of openmpi with 'gcc' and 'pgi' do not show the same limits.
>>>> Are there any known incompatibilities of openmpi with this version of
>>>> the Intel compilers?
>>>>
>>>> The characteristics of our computational infrastructure are:
>>>>
>>>> Intel processors E7330, E5345, E5530 and E5620
>>>>
>>>> CentOS 5.3, CentOS 5.5.
>>>>
>>>> Intel composer XE 2011
>>>> gcc 4.1.2
>>>> pgi 10.2-1
>>>>
>>>> Regards
>>>>
>>>> Salvatore Podda
>>>>
>>>> ENEA UTICT-HPC
>>>> Department for Computer Science Development and ICT
>>>> Facilities Laboratory for Science and High Performance Computing
>>>> C.R. Frascati
>>>> Via E. Fermi, 45
>>>> PoBox 65
>>>> 00044 Frascati (Rome)
>>>> Italy
>>>>
>>>> Tel: +39 06 9400 5342
>>>> Fax: +39 06 9400 5551
>>>> Fax: +39 06 9400 5735
>>>> E-mail: salvatore.podda_at_[hidden]
>>>> Home Page: www.cresco.enea.it
>>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>