Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-06-18 11:38:55


Sorry for the confusion; I was asking George which one wins. I'm not active in the MX portion of the OMPI code base, so I don't know which one is better / should be used.

On Jun 18, 2010, at 8:19 AM, guillaume ranquet wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,
>
> sorry for the very long delay, I didn't understand that you were waiting
> for an answer from me on this (the debate seemed to be between maintainers).
> do not hesitate to bug me if I haven't answered after a few days.
>
> to answer briefly:
> - yes, I've tested the patch submitted on this thread by Scott and it
> solved my issues.
> - no, I haven't tested the patch submitted by George, I can have a quick
> try if needed.
>
> as for "which one wins", I'm quite sure you have more clues than me on
> the subject :)
>
>
> On 06/07/2010 09:49 PM, Jeff Squyres wrote:
>> George --
>>
>> Scott's patch was different than the one you applied. Apparently, his fixes this user's problem (I don't know if Guillaume tested yours).
>>
>> Which one wins?
>>
>>
>>
>> On Jun 3, 2010, at 9:49 AM, Scott Atchley wrote:
>>
>>> On Jun 3, 2010, at 8:54 AM, guillaume ranquet wrote:
>>>
>>>> granquet_at_bordeplage-15 ~ $ mpirun --mca btl mx,openib,sm,self --mca pml
>>>> ^cm --mca mpi_leave_pinned 0 ~/bwlat/mpi_helloworld
>>>> [bordeplage-15.bordeaux.grid5000.fr:02707] Error in mx_init (error No MX
>>>> device entry in /dev.)
>>>> Hello world from process 0 of 1
>>>>
>>>> it works :)
>>>
>>> Jeff, you may want to change this message to opal_output_verbose(). It is in $OMPI/ompi/mca/common/common_mx.c.
>>>
>>>>> Ok. I think that OMPI is trying to open the MX MTL first. It fails at
>>>>> mx_init() (the first error message) but it had already created some
>>>>> mpool resources. It then tries to open the MX BTL and it skips the MX
>>>>> initialization and returns SUCCESS. The MX BTL then tries to call
>>>>> mx_get_info() which fails and prints the second message.
>>>>>
>>>>> Try the attached patch. It tries to clean up if mx_init() fails and
>>>>> does not return SUCCESS on subsequent attempts to initialize MX.
>>>>>
>>>>> Scott
>>>>
>>>> I tried your patch and it seems to correct the issue:
>>>>
>>>> configured with: --prefix=$HOME/openmpi-1.4.2-nomx-bin/
>>>> --with-openib=/usr --with-mx=/usr
>>>>
>>>> $ ~/openmpi-1.4.2-nomx-bin/bin/mpirun ~/bwlat/mpi_helloworld
>>>> [bordeplage-15.bordeaux.grid5000.fr:22406] Error in mx_init (error No MX
>>>> device entry in /dev.)
>>>> Hello world from process 0 of 1
>>>
>>> Excellent.
>>>
>>>> don't hesitate if you need further testing :)
>>>
>>> Thanks for all your assistance!
>>>
>>>> do you plan on applying this patch on next release? (1.4.3?)
>>>
>>> Jeff, I leave this up to you and George.
>>>
>>> Scott
>>>
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.15 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iQEcBAEBAgAGBQJMG46CAAoJEEzIl7PMEAli+2MH/19oFkY+JM1l/1hfRIKVrSl4
> +tzpWuPdrRFBODqKrZz6TTvZTBqCHar0M6FLPVTr3wvTRVMgEbdlBwr6u7GUBdVP
> 3XJw25jFUKkaAOM8PbDI7V3FMZ6oyF7Xxefo2EBCRvp9lVeop6Y0c01fXz9LS6F+
> SYn8mi5bmn58GKd8xKLvK2zgGDwdw5CRQRdWGPOfHVo4hcosvv0d55RhpDs1/U1C
> YRabXwCM0ZU251bYLwhZCjVPZZMfrQBy8oEc1DBiHOXPnc1c25GBwMxL5WPRkR+b
> xXHM2PECDACLZYKAtb/CZh94DXWxTbsMKxM9N37zf48avgKyqQYJdkwrUSlDsxc=
> =zGo1
> -----END PGP SIGNATURE-----
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
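
The failure mode Scott describes in the quoted message (resources created before mx_init() fails, then a later initialization attempt that skips the work and reports SUCCESS) and the behaviour he attributes to his patch (clean up on failure, and do not return SUCCESS on later attempts) can be illustrated with a small stand-alone C sketch. This is not the Open MPI source and not the actual patch: every name below (sketch_common_initialize, fake_device_init, the SKETCH_* codes, and the flag standing in for the mpool resources) is hypothetical, and the reference-counting layout is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is NOT the Open MPI source. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

static int  init_refcount  = 0;     /* successful, still-active initializations  */
static bool init_failed    = false; /* set once the device library has failed    */
static bool pool_allocated = false; /* stands in for the mpool resources         */

/* Stand-in for mx_init() on a node with no MX device: it always fails. */
static int fake_device_init(void)
{
    return SKETCH_ERROR;
}

/* Shared initialization called by more than one component
 * (in the thread: the MX MTL first, then the MX BTL). */
int sketch_common_initialize(void)
{
    /* Remember an earlier failure so later callers do not see SUCCESS. */
    if (init_failed) {
        return SKETCH_ERROR;
    }

    /* Already initialized by another component: just count the new user. */
    if (init_refcount > 0) {
        init_refcount++;
        return SKETCH_SUCCESS;
    }

    /* Resources are set up before the device library is initialized. */
    pool_allocated = true;

    if (fake_device_init() != SKETCH_SUCCESS) {
        pool_allocated = false;  /* clean up what was created above */
        init_failed = true;      /* make the failure sticky         */
        return SKETCH_ERROR;
    }

    init_refcount = 1;
    return SKETCH_SUCCESS;
}

/* Release one reference; free the resources when the last user leaves. */
int sketch_common_finalize(void)
{
    assert(init_refcount > 0);
    if (--init_refcount == 0) {
        pool_allocated = false;
    }
    return SKETCH_SUCCESS;
}

/* Helper so callers can check that nothing was left behind. */
bool sketch_pool_is_allocated(void)
{
    return pool_allocated;
}
```

With this layout, both the first and the second call to sketch_common_initialize() report an error on a node without the device, and no resources remain allocated for a later finalize step to trip over.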