Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Mixed Mellanox and Qlogic problems
From: David Warren (warren_at_[hidden])
Date: 2011-07-13 19:46:35


I finally got access to the systems again (the original ones are part of
our real-time system). I thought I would try one other test I had set up
first: I went to OFED 1.6, and it started running with no errors, so it
must have been an OFED bug. Now I just have the speed problem. Does
anyone have a way to make the mixture of mlx4 and QLogic hardware work
together without slowing down?
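(For anyone searching the archives later: a sketch of the kind of command
line I mean, which forces a transport both node types support; I have not
benchmarked this, and ./my_app stands in for our binary:)

    # Force the ob1 PML so the PSM MTL is never selected;
    # both node types then use the openib (verbs) BTL.
    mpirun -np 24 -machinefile dwhosts --byslot --bind-to-core \
        --mca pml ob1 --mca btl openib,sm,self ./my_app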

On 07/07/11 17:19, Jeff Squyres wrote:
> Huh; wonky.
>
> Can you set the MCA parameter "mpi_abort_delay" to -1 and run your job again? This will prevent all the processes from dying when MPI_ABORT is invoked. Then attach a debugger to one of the still-live processes after the error message is printed. Can you send the stack trace? It would be interesting to know what is going on here -- I can't think of a reason that would happen offhand.
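>
> For example, the workflow would look something like this (a sketch; gdb is assumed to be installed, ./my_app stands in for the application, and the PID comes from the "[host:pid]" prefix in the abort message):
>
>     # keep the aborting processes alive instead of letting them exit
>     mpirun --mca mpi_abort_delay -1 -np 24 -machinefile dwhosts ./my_app
>
>     # on the node named in the error message (e.g. "[n16:9438]"),
>     # attach to the surviving process and capture the stack trace
>     gdb -p 9438
>     (gdb) bt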
>
>
> On Jun 30, 2011, at 5:03 PM, David Warren wrote:
>
>
>> I have a cluster with mostly Mellanox ConnectX hardware and a few nodes with QLogic QLE7340s. After looking through the web, FAQs, etc., I built openmpi-1.5.3 with both PSM and openib support. If I run within the same hardware type, it is fast and works fine. If I run across both types without specifying an MTL (e.g. mpirun -np 24 -machinefile dwhosts --byslot --bind-to-core --mca btl ^tcp ...), it dies with
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> [n16:9438] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> ...
>> I can make it run by giving it a bad MTL, e.g. -mca mtl psm,none. All the processes run after complaining that the MTL "none" does not exist. However, the mixed job is still slow (about 10% slower than either set of nodes running alone).
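>>
>> (A cleaner way to get the same effect, sketched here but untested on this cluster, would be to exclude the PSM MTL by name rather than listing a nonexistent component; ./my_app stands in for the real binary:)
>>
>>     # "^psm" disables the PSM MTL, so Open MPI falls back to the
>>     # ob1 PML and the verbs (openib) BTL on every node
>>     mpirun -np 24 -machinefile dwhosts --byslot --bind-to-core \
>>         --mca mtl ^psm --mca btl ^tcp ./my_app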
>>
>> Pertinent info:
>> On the Qlogic Nodes:
>> OFED: QLogic-OFED.SLES11-x86_64.1.5.3.0.22
>> On the Mellanox Nodes:
>> OFED-1.5.2.1-20101105-0600
>>
>> All nodes:
>> Debian Lenny, kernel 2.6.32.41
>> OpenSM
>> limit | grep memorylocked gives "unlimited" on all nodes.
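>>
>> (The bash equivalent of that csh check, for anyone verifying their own nodes:)
>>
>>     ulimit -l        # should print "unlimited"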
>>
>> Configure line:
>> ./configure --with-libnuma --with-openib --prefix=/usr/local/openmpi-1.5.3 --with-psm=/usr --enable-btl-openib-failover --enable-openib-connectx-xrc --enable-openib-rdmacm
>>
>> I thought that with 1.5.3 I was supposed to be able to do this. Am I just wrong, or does anyone see what I am doing wrong?
>>
>> Thanks
>> Attachments: mellanox_devinfo.gz, mellanox_ifconfig.gz, ompi_info_output.gz, qlogic_devinfo.gz, qlogic_ifconfig.gz, warren.vcf