
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
From: Salvatore Podda (salvatore.podda_at_[hidden])
Date: 2011-05-24 07:29:57


Apologies, I forgot to edit the subject line.
I am sending it again with a sensible subject.

Salvatore

Begin forwarded message:

> From: Salvatore Podda <salvatore.podda_at_[hidden]>
> Date: 24 May 2011 12:46:17 GMT+02:00
> To: gus_at_[hidden]
> Cc: users open-mpi <users_at_[hidden]>
> Subject: Re: users Digest, Vol 1911, Issue 3
>
> Sorry for the late reply, but, as I just said, we are attempting
> to restore full operation on part of our cluster.
>
> Yes, it was a typo: I usually add the "sm" flag to the "--mca btl"
> option. However, I think this is not mandatory, as I suppose
> openmpi follows the so-called "Law of Least Astonishment" here as
> well and adopts "sm" for the intra-node communication; in other
> words, leaving out the "sm" string does not mean "do not use
> shared memory".
> Indeed, nothing changes whether I remove or add this string, and if
> I run an mpi job on a single multicore node without this flag all
> works well.
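>
> One way to double-check which btls are actually selected (just a
> sketch; the verbosity level is an arbitrary "high enough" value) is
> to run with the btl framework's verbose output enabled:
>
>     mpirun --mca btl openib,self --mca btl_base_verbose 100 \
>            -np 2 ./IMB-MPI1 barrier
>
> The startup messages should show which btl components were opened
> and selected.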
>
> Thanks
>
> Salvatore
>
>
>
> On 20 May 2011, at 20:53, users-request_at_[hidden] wrote:
>
>> Message: 1
>> Date: Fri, 20 May 2011 14:30:13 -0400
>> From: Gus Correa <gus_at_[hidden]>
>> Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
>> XE 2011 (aka 12.0)
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <4DD6B335.2090403_at_[hidden]>
>> Content-Type: text/plain; charset=us-ascii; format=flowed
>>
>> Hi Salvatore
>>
>> Just in case ...
>> You say you have problems when you use "--mca btl openib,self".
>> Is this a typo in your email?
>> I guess this will disable the shared memory btl intra-node,
>> whereas your other choice "--mca btl_tcp_if_include ib0" will not.
>> Could this be the problem?
>>
>> Here we use "--mca btl openib,self,sm",
>> to enable the shared memory btl intra-node as well,
>> and it works just fine on programs that do use collective calls.
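>>
>> Concretely, that would look something like this (an illustrative
>> sketch; substitute your own process count and executable):
>>
>>     mpirun --mca btl openib,self,sm -np 64 ./IMB-MPI1 barrier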
>>
>> My two cents,
>> Gus Correa
>>
>> Salvatore Podda wrote:
>>> We are still struggling with these problems. Actually, the new
>>> version of the intel compilers does not seem to be the real
>>> issue: we run into the same errors using the `gcc' compilers as
>>> well.
>>> We succeeded in building an openmpi-1.2.8 rpm (with different
>>> compiler flavours) from the installation of the cluster section
>>> where all seems to work well. We are now running an extensive IMB
>>> benchmark campaign.
>>>
>>> However, yes, this happens only when we use --mca btl openib,self;
>>> on the contrary, if we use --mca btl_tcp_if_include ib0 all works
>>> well.
>>> Yes, we can try the flag you suggest. I can check the FAQ and the
>>> open-mpi.org documentation, but could you be so kind as to explain
>>> the meaning of this flag?
>>>
>>> Thanks
>>>
>>> Salvatore Podda
>>>
>>> On 20 May 2011, at 03:37, Jeff Squyres wrote:
>>>
>>>> Sorry for the late reply.
>>>>
>>>> Other users have seen something similar but we have never been
>>>> able to reproduce it. Is this only when using IB? If you use
>>>> "mpirun --mca btl_openib_cpc_if_include rdmacm", does the problem
>>>> go away?
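>>>>
>>>> In full, that would be something along these lines (illustrative
>>>> only; adjust the process count and executable to your setup):
>>>>
>>>>     mpirun --mca btl openib,self,sm \
>>>>            --mca btl_openib_cpc_if_include rdmacm \
>>>>            -np 64 ./IMB-MPI1 barrier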
>>>>
>>>>
>>>> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>>>>
>>>>> I've seen the same thing when I build openmpi 1.4.3 with Intel
>>>>> 12, but only when I have -O2 or -O3 in CFLAGS. If I drop it down
>>>>> to -O1 then the collective hangs go away. I don't know what, if
>>>>> anything, the higher optimization buys you when compiling
>>>>> openmpi, so I'm not sure whether that's an acceptable workaround.
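>>>>>
>>>>> For reference, the workaround build is roughly this (a sketch;
>>>>> the Intel compiler names and the [instdir] prefix are
>>>>> placeholders for your own paths):
>>>>>
>>>>>     ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
>>>>>         CFLAGS="-O1" --prefix=[instdir]
>>>>>     make all install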
>>>>>
>>>>> My system is similar to yours - Intel X5570 with QDR Mellanox IB
>>>>> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self.
>>>>> I'm using IMB 3.2.2 with a single iteration of Barrier to
>>>>> reproduce the hang, and it happens 100% of the time for me when I
>>>>> invoke it like this:
>>>>>
>>>>> # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
>>>>>
>>>>> The hang happens on the first Barrier (64 ranks), and each of
>>>>> the participating ranks has this backtrace:
>>>>>
>>>>> __poll (...)
>>>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>>>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>>>> IMB_barrier ()
>>>>> IMB_init_buffers_iter ()
>>>>> main ()
>>>>>
>>>>> The one non-participating rank has this backtrace:
>>>>>
>>>>> __poll (...)
>>>>> poll_dispatch () from [instdir]/lib/libopen-pal.so.0
>>>>> opal_event_loop () from [instdir]/lib/libopen-pal.so.0
>>>>> opal_progress () from [instdir]/lib/libopen-pal.so.0
>>>>> ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
>>>>> ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
>>>>> PMPI_Barrier () from [instdir]/lib/libmpi.so.0
>>>>> main ()
>>>>>
>>>>> If I use more nodes I can get it to hang with 1ppn, so that
>>>>> seems to rule out the sm btl (or interactions with it) as a
>>>>> culprit at least.
>>>>>
>>>>> I can't reproduce this with openmpi 1.5.3, interestingly.
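>>>>>
>>>>> In case it helps anyone reproduce this without IMB, here is a
>>>>> minimal stand-alone barrier test (an untested sketch) that should
>>>>> exercise the same code path:
>>>>>
>>>>>     #include <mpi.h>
>>>>>     #include <stdio.h>
>>>>>
>>>>>     int main(int argc, char **argv)
>>>>>     {
>>>>>         int rank;
>>>>>         MPI_Init(&argc, &argv);               /* start the MPI runtime */
>>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>         MPI_Barrier(MPI_COMM_WORLD);          /* the call that hangs */
>>>>>         if (rank == 0)
>>>>>             printf("barrier completed\n");
>>>>>         MPI_Finalize();
>>>>>         return 0;
>>>>>     }
>>>>>
>>>>>     $ mpicc barrier.c -o barrier
>>>>>     $ salloc -N 9 orterun -n 65 ./barrier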
>>>>>
>>>>> -Marcus
>>>>>
>>>>>
>>>>> On 05/10/2011 03:37 AM, Salvatore Podda wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> we succeeded in building several versions of openmpi, from
>>>>>> 1.2.8 to 1.4.3, with Intel composer XE 2011 (aka 12.0).
>>>>>> However, we found a threshold in the number of cores (depending
>>>>>> on the application: IMB, xhpl or user applications, and on the
>>>>>> number of required cores) above which the application hangs (a
>>>>>> sort of deadlock).
>>>>>> Builds of openmpi with `gcc' and `pgi' do not show the same
>>>>>> limits.
>>>>>> Are there any known incompatibilities of openmpi with this
>>>>>> version of the intel compilers?
>>>>>>
>>>>>> The characteristics of our computational infrastructure are:
>>>>>>
>>>>>> Intel processors E7330, E5345, E5530 and E5620
>>>>>>
>>>>>> CentOS 5.3, CentOS 5.5.
>>>>>>
>>>>>> Intel composer XE 2011
>>>>>> gcc 4.1.2
>>>>>> pgi 10.2-1
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Salvatore Podda
>>>>>>
>>>>>> ENEA UTICT-HPC
>>>>>> Department for Computer Science Development and ICT
>>>>>> Facilities Laboratory for Science and High Performance Computing
>>>>>> C.R. Frascati
>>>>>> Via E. Fermi, 45
>>>>>> PoBox 65
>>>>>> 00044 Frascati (Rome)
>>>>>> Italy
>>>>>>
>>>>>> Tel: +39 06 9400 5342
>>>>>> Fax: +39 06 9400 5551
>>>>>> Fax: +39 06 9400 5735
>>>>>> E-mail: salvatore.podda_at_[hidden]
>>>>>> Home Page: www.cresco.enea.it
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>
>>
>