Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
From: Salvatore Podda (salvatore.podda_at_[hidden])
Date: 2011-05-20 09:29:11


We are still struggling with these problems. Actually, the new
version of the Intel compilers does not seem to be the real issue:
we run into the same errors with the `gcc' compilers as well.
We did succeed in building an openmpi-1.2.8 rpm (with different
compiler flavours) from the installation on the section of the
cluster where everything seems to work well. We are now running
an extensive IMB benchmark campaign.

However, yes, this happens only when we use --mca btl openib,self;
on the contrary, if we use --mca btl_tcp_if_include ib0 everything
works well (see the sketch below).
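
For reference, a sketch of the three invocations being compared
(using IMB-MPI1 as a stand-in for the real applications, and
assuming the working TCP case also restricts the btl list to
tcp,self; adjust to match our actual runs):

   # hangs: OpenFabrics btl only
   mpirun --mca btl openib,self ./IMB-MPI1 barrier
   # works: TCP forced over the IPoIB interface
   mpirun --mca btl tcp,self --mca btl_tcp_if_include ib0 ./IMB-MPI1 barrier
   # the test you suggest: keep openib, select the rdmacm connection manager
   mpirun --mca btl openib,self --mca btl_openib_cpc_if_include rdmacm ./IMB-MPI1 barrier
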
Yes, we can try the flag you suggest. I will check the FAQ and the
open-mpi.org documentation, but could you kindly explain the meaning
of this flag?

Thanks

Salvatore Podda

On 20 May 2011, at 03:37, Jeff Squyres wrote:

> Sorry for the late reply.
>
> Other users have seen something similar but we have never been able
> to reproduce it. Is this only when using IB? If you use "mpirun --
> mca btl_openib_cpc_if_include rdmacm", does the problem go away?
>
>
> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>
>> I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
>> but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
>> then the collective hangs go away. I don't know what, if anything,
>> the higher optimization buys you when compiling openmpi, so I'm not
>> sure if that's an acceptable workaround or not.
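>>
>> In case it helps, a minimal sketch of what I mean by dropping the
>> optimization level at build time (the compiler names and install
>> prefix here are placeholders, not my exact configure line):
>>
>> # ./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-O1 --prefix=[instdir]
>> # make all install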
>>
>> My system is similar to yours - Intel X5570 with QDR Mellanox IB
>> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
>> using IMB 3.2.2 with a single iteration of Barrier to reproduce the
>> hang, and it happens 100% of the time for me when I invoke it like
>> this:
>>
>> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>>
>> The hang happens on the first Barrier (64 ranks) and each of the
>> participating ranks has this backtrace:
>>
>> __poll (...)
>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>> IMB_barrier ()
>> IMB_init_buffers_iter ()
>> main ()
>>
>> The one non-participating rank has this backtrace:
>>
>> __poll (...)
>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>> main ()
>>
>> If I use more nodes I can get it to hang with 1ppn, so that seems
>> to rule out the sm btl (or interactions with it) as a culprit at
>> least.
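>>
>> For the record, a sketch of the 1-ppn run; the use of orterun's
>> -npernode option here is my assumption of how the ranks were spread
>> across nodes, not a copy of the actual command:
>>
>> # salloc -N 65 orterun -n 65 -npernode 1 ./IMB-MPI1 -npmin 64 -iter 1 barrier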
>>
>> I can't reproduce this with openmpi 1.5.3, interestingly.
>>
>> -Marcus
>>
>>
>> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>>> Dear all,
>>>
>>> we succeeded in building several versions of openmpi, from 1.2.8
>>> to 1.4.3, with Intel composer XE 2011 (aka 12.0).
>>> However, we found a threshold in the number of cores (depending
>>> on the application: IMB, xhpl or user applications, and on the
>>> number of required cores) above which the application hangs
>>> (a sort of deadlock).
>>> Builds of openmpi with 'gcc' and 'pgi' do not show the same
>>> limits.
>>> Are there any known incompatibilities of openmpi with this
>>> version of the Intel compilers?
>>>
>>> The characteristics of our computational infrastructure are:
>>>
>>> Intel processors E7330, E5345, E5530 and E5620
>>>
>>> CentOS 5.3, CentOS 5.5.
>>>
>>> Intel composer XE 2011
>>> gcc 4.1.2
>>> pgi 10.2-1
>>>
>>> Regards
>>>
>>> Salvatore Podda
>>>
>>> ENEA UTICT-HPC
>>> Department for Computer Science Development and ICT
>>> Facilities Laboratory for Science and High Performance Computing
>>> C.R. Frascati
>>> Via E. Fermi, 45
>>> PoBox 65
>>> 00044 Frascati (Rome)
>>> Italy
>>>
>>> Tel: +39 06 9400 5342
>>> Fax: +39 06 9400 5551
>>> Fax: +39 06 9400 5735
>>> E-mail: salvatore.podda_at_[hidden]
>>> Home Page: www.cresco.enea.it
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>