Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.
From: guillaume ranquet (guillaume.ranquet_at_[hidden])
Date: 2010-06-03 08:54:25


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/02/2010 07:51 PM, Jeff Squyres wrote:
> From your prior mails:
>
> - there's no segv when ptmalloc is disabled at run-time via the env var
> - there's no segv when MX is completely disabled (both BTL and MTL)
>
> What happens if you run with only MX? I *assume* that works with no segv...?

I think this has already been tried:
on a node with only MX and no IB hardware, everything runs fine.

> It might be interesting to see what happens if you run with:
>
> mpirun --mca btl mx,openib,sm,self --mca pml ^cm --mca mpi_leave_pinned 0 ...yourapp...
>
> This should run with both verbs and MX, and the memory manager is in place at run-time, but it isn't being used to track memory. That's slightly different than having the memory manager in place at run-time *and* using it to track memory.
>

granquet_at_bordeplage-15 ~ $ mpirun --mca btl mx,openib,sm,self --mca pml ^cm --mca mpi_leave_pinned 0 ~/bwlat/mpi_helloworld
[bordeplage-15.bordeaux.grid5000.fr:02707] Error in mx_init (error No MX device entry in /dev.)
Hello world from process 0 of 1

it works :)
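
For reference, the same parameters can also be passed through the environment, since Open MPI reads variables named OMPI_MCA_<parameter>. A minimal sketch reusing the values from the command above; the guard around mpirun is an addition of mine so that the snippet does nothing harmful on a machine without Open MPI or without the test binary:

```shell
# Same MCA parameters as the mpirun command line above, set via the environment.
export OMPI_MCA_btl='mx,openib,sm,self'
export OMPI_MCA_pml='^cm'
export OMPI_MCA_mpi_leave_pinned=0

# Only launch if mpirun and the test program are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$HOME/bwlat/mpi_helloworld" ]; then
    mpirun "$HOME/bwlat/mpi_helloworld"
fi
```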

>> the goal is to run the same version on every node (for the
>> sake of simplicity).
>> the current plans were targeting 1.4.1.
>> I don't think our users would mind upgrading to 1.4.2.
>
> FWIW, it *is* the same version on all nodes -- you're just running with different MCA parameter values. Also FWIW, the sysadmin can hide these MCA params in a system-level file so that users don't have to deal with them, if that works for you. See:
>
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>

thank you for the pointer, setting the MCA parameters in
openmpi-mca-params.conf would do the trick.
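
A minimal sketch of what that could look like, assuming the system-wide file lives under etc/ in the install prefix. The OMPI_PREFIX variable and the scratch default are hypothetical and only there so the snippet can be tried without touching a real installation; the two values are the ones used on the command line for nodes without MX in this thread:

```shell
# Hypothetical: point OMPI_PREFIX at the real --prefix used at configure time.
# The default is a scratch directory so the sketch is safe to run as-is.
prefix="${OMPI_PREFIX:-/tmp/openmpi-mca-demo}"
conf="$prefix/etc/openmpi-mca-params.conf"

mkdir -p "$prefix/etc"

# One "name = value" pair per line; lines starting with '#' are comments.
cat > "$conf" <<'EOF'
# Transports for nodes without MX hardware
btl = openib,sm,self
pml = ^cm
EOF

cat "$conf"
```

With the file in place, a plain "mpirun ~/bwlat/mpi_helloworld" should pick up these values without users having to pass any --mca options.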

On 06/02/2010 08:12 PM, Scott Atchley wrote:
> On Jun 2, 2010, at 1:31 PM, guillaume ranquet wrote:
>
>> granquet_at_bordeplage-9 ~/openmpi-1.4.2 $ ~/openmpi-1.4.2-bin/bin/mpirun --mca btl openib,sm,self --mca pml ^cm ~/bwlat/mpi_helloworld
>> Hello world from process 0 of 1
>> granquet_at_bordeplage-9 ~/openmpi-1.4.2 $
>>
>> I can tell it works :)
>
> Ok. I think that OMPI is trying to open the MX MTL first. It fails at mx_init() (the first error message) but it had already created some mpool resources. It then tries to open the MX BTL and it skips the MX initialization and returns SUCCESS. The MX BTL then tries to call mx_get_info() which fails and prints the second message.
>
> Try the attached patch. It tries to clean up if mx_init() fails and does not return SUCCESS on subsequent attempts to initialize MX.
>
> Scott
>
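
To make sure I follow the failure mode described above, here is a toy sketch of it. This is not the Open MPI code (which is C) and not the patch itself: every name below is hypothetical, and the "attempted" flag is my assumption about how a second caller could end up skipping initialization. The first caller plays the MX MTL, the second plays the MX BTL:

```shell
# Hypothetical stand-in for mx_init(): always fails, as on a node without MX.
fake_mx_init() { return 1; }

# Buggy variant: the flag is set before the result is known, so a later
# caller skips initialization and reports success.
buggy_attempted=0
buggy_init() {
    if [ "$buggy_attempted" -eq 1 ]; then
        return 0                # skips init and wrongly returns success
    fi
    buggy_attempted=1
    fake_mx_init
}

# Fixed variant: remember the outcome and report it to every caller.
fixed_state=unset               # one of: unset, ok, failed
fixed_init() {
    if [ "$fixed_state" = unset ]; then
        if fake_mx_init; then
            fixed_state=ok
        else
            fixed_state=failed  # real code would also release resources here
        fi
    fi
    [ "$fixed_state" = ok ]
}

# First call stands for the MTL, second call for the BTL.
if buggy_init; then buggy_mtl=ok; else buggy_mtl=failed; fi
if buggy_init; then buggy_btl=ok; else buggy_btl=failed; fi
if fixed_init; then fixed_mtl=ok; else fixed_mtl=failed; fi
if fixed_init; then fixed_btl=ok; else fixed_btl=failed; fi

echo "buggy: MTL=$buggy_mtl BTL=$buggy_btl"
echo "fixed: MTL=$fixed_mtl BTL=$fixed_btl"
```

In the buggy variant the second caller believes MX is usable and carries on, which matches the second error message; in the fixed variant both callers see the failure.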

I tried your patch and it seems to correct the issue:

configured with: --prefix=$HOME/openmpi-1.4.2-nomx-bin/ --with-openib=/usr --with-mx=/usr

$ ~/openmpi-1.4.2-nomx-bin/bin/mpirun ~/bwlat/mpi_helloworld
[bordeplage-15.bordeaux.grid5000.fr:22406] Error in mx_init (error No MX device entry in /dev.)
Hello world from process 0 of 1

don't hesitate to ask if you need further testing :)

do you plan on applying this patch in the next release (1.4.3?)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJMB6YAAAoJEEzIl7PMEAliiB4H/RrZwjALxXGAQ9H6EqPuPBJy
z5VWInUbT4kCCgsQPpd2G8oJjnskM+HTgyvwHIdjyaVtGft6aZexM+Vqf1CxGnLB
TXBopYSQbHf7S20KcENMRT+7Miel+bZ1lvm0vBasdw3FBnOK2Io9uaAYx702u61P
5DUztK/ujFgzwW9AyxuF2AZOsgLQhevo6hz0JrtgPGNVruAU+AT1HFLZAB+wiK7n
xejREXuULASJsqDoRu9JxCFqAJJpOXzmGCgjePUDX/lQQxfeS+o2L7NoJ82G6CCF
0SN9uoKhD0TV6MfL6fvzvzqhLz0JPlY6FqPAeWxSJGmHfj97pIFaqSYgq8a7J+I=
=3pXJ
-----END PGP SIGNATURE-----