Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
From: Salvatore Podda (salvatore.podda_at_[hidden])
Date: 2011-05-24 08:34:52


OK! I now get the meaning of the "--mca btl_openib_cpc_include rdmacm"
parameter.
However, as I just said, we are meanwhile running several IMB tests on
openmpi 1.2.8, and in this (our) build either RDMA CM support is not
implemented or it was not included at compile time.
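One way to tell which of the two cases applies is to ask the installed build itself. The sketch below uses ompi_info to list the openib BTL parameters and look for the connection-manager (CPC) settings; the install prefix is an assumption, substitute your own:

```shell
# Sketch: check whether this Open MPI build exposes the rdmacm
# connection manager for the openib BTL.
# OMPI_PREFIX is an assumption -- point it at your installation.
OMPI_PREFIX=/opt/openmpi-1.4.3

# A build with RDMA CM support shows cpc-related parameters
# mentioning rdmacm; a build without it shows nothing here.
$OMPI_PREFIX/bin/ompi_info --param btl openib | grep -i cpc
```

Note that older 1.2.x builds may simply lack this CPC entirely, in which case no cpc parameters will appear regardless of how the build was configured.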

Salvatore Podda

On 20 May 2011, at 03:37, Jeff Squyres wrote:

> Sorry for the late reply.
>
> Other users have seen something similar but we have never been able
> to reproduce it. Is this only when using IB? If you use "mpirun --
> mca btl_openib_cpc_if_include rdmacm", does the problem go away?
>
>
> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>
>> I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
>> but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
>> then the collective hangs go away. I don't know what, if anything,
>> the higher optimization buys you when compiling openmpi, so I'm not
>> sure whether that's an acceptable workaround.
>>
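For the archives: rebuilding Open MPI at the reduced optimization level described above might look like the following sketch. The version, install prefix, and use of the Intel compiler names are assumptions here; adjust them for your site:

```shell
# Sketch: rebuild Open MPI with the Intel compilers at -O1, the
# workaround described above. Version, prefix, and parallelism
# are assumptions -- adapt to your installation.
cd openmpi-1.4.3
./configure --prefix=/opt/openmpi-1.4.3-intel \
    CC=icc CXX=icpc F77=ifort FC=ifort \
    CFLAGS=-O1 CXXFLAGS=-O1 FFLAGS=-O1 FCFLAGS=-O1
make -j4 all && make install
```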
>> My system is similar to yours - Intel X5570 with QDR Mellanox IB
>> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
>> using IMB 3.2.2 with a single iteration of Barrier to reproduce the
>> hang, and it happens 100% of the time for me when I invoke it like
>> this:
>>
>> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>>
>> The hang happens on the first Barrier (64 ranks) and each of the
>> participating ranks have this backtrace:
>>
>> __poll (...)
>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>> IMB_barrier ()
>> IMB_init_buffers_iter ()
>> main ()
>>
>> The one non-participating rank has this backtrace:
>>
>> __poll (...)
>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>> main ()
>>
>> If I use more nodes I can get it to hang with 1ppn, so that seems
>> to rule out the sm btl (or interactions with it) as a culprit at
>> least.
>>
>> I can't reproduce this with openmpi 1.5.3, interestingly.
>>
>> -Marcus
>>
>>
>> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>>> Dear all,
>>>
>>> we succeeded in building several versions of openmpi, from 1.2.8 to
>>> 1.4.3, with Intel composer XE 2011 (aka 12.0).
>>> However, we found a threshold in the number of cores (depending on
>>> the application: IMB, xhpl or user applications, and on the number
>>> of required cores) above which the application hangs (a sort of
>>> deadlock).
>>> Builds of openmpi with 'gcc' and 'pgi' do not show the same limits.
>>> Are there any known incompatibilities of openmpi with this version
>>> of the intel compilers?
>>>
>>> The characteristics of our computational infrastructure are:
>>>
>>> Intel processors E7330, E5345, E5530 and E5620
>>>
>>> CentOS 5.3, CentOS 5.5.
>>>
>>> Intel composer XE 2011
>>> gcc 4.1.2
>>> pgi 10.2-1
>>>
>>> Regards
>>>
>>> Salvatore Podda
>>>
>>> ENEA UTICT-HPC
>>> Department for Computer Science Development and ICT
>>> Facilities Laboratory for Science and High Performance Computing
>>> C.R. Frascati
>>> Via E. Fermi, 45
>>> PoBox 65
>>> 00044 Frascati (Rome)
>>> Italy
>>>
>>> Tel: +39 06 9400 5342
>>> Fax: +39 06 9400 5551
>>> Fax: +39 06 9400 5735
>>> E-mail: salvatore.podda_at_[hidden]
>>> Home Page: www.cresco.enea.it
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
