
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.
From: guillaume ranquet (guillaume.ranquet_at_[hidden])
Date: 2010-06-02 08:42:50


Hi,

Yes, I have multiple clusters: some with InfiniBand, some with MX, some
nodes with both Myrinet and InfiniBand hardware, and others with
Ethernet only.

I reproduced it on a vanilla 1.4.1 and 1.4.2 with and without the
--with-mx switch.

This is the output I get on a node with Ethernet and InfiniBand hardware;
note the error regarding MX.

$ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
[bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX
device entry in /dev.)
[bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init:
mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
Hello world from process 0 of 1
[bordeplage-9:32365] *** Process received signal ***
[bordeplage-9:32365] Signal: Segmentation fault (11)
[bordeplage-9:32365] Signal code: Address not mapped (1)
[bordeplage-9:32365] Failing at address: 0x7f53bb7bb360
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 32365 on node
bordeplage-9.bordeaux.grid5000.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I recompiled 1.4.2 with --with-openib --without-mx and the problem is
gone (no segfault, no error message).
It seems you aimed at the right spot.
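For the record, the rebuild was essentially the following (the prefix
just mirrors the install path I ran from above; adjust to taste):

$ ./configure --prefix=$HOME/openmpi-1.4.2-bin --with-openib --without-mx
$ make && make install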

Now the problem is that I need support for both.
I could compile two versions of Open MPI and deploy the appropriate
version on each cluster, with support either for mx or for openib...
but that's quite painful, and how should I manage nodes with both?

For now I'll stick to a version of Open MPI compiled with support for
both kinds of hardware and --without-memory-manager, unless the list
has a better idea? Roughly what I have in mind is sketched below.
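This is just my local recipe, reusing the configure options from the
first mail together with --without-memory-manager (nothing official):

$ ./configure --prefix=/usr --with-openib=/usr --with-mx=/usr \
      --without-memory-manager
$ make && make install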

Thanks for the input, much appreciated.
If you need further info, I can recompile everything with -g, fire up
gdb, and locate the segfault more precisely (along the lines below).
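Roughly this, assuming core dumps are enabled on the node (nothing Open
MPI specific, just the usual debugging steps):

$ ulimit -c unlimited
$ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld   # segfaults, leaves a core file
$ gdb ~/bwlat/mpi_helloworld core
(gdb) bt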

On 06/01/2010 03:34 PM, Jeff Squyres wrote:
> Are you running on nodes with both MX and OpenFabrics?
>
> I don't know if this is a well-tested scenario -- there may be some strange interactions in the registered memory management between MX and OpenFabrics verbs.
>
> FWIW, you should be able to disable Open MPI's memory management at run time in the 1.4 series by setting the environment variable OMPI_MCA_memory_ptmalloc2_disable to 1 (for good measure, ensure that it's set on all nodes where you are running Open MPI).
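Noted, thanks. If I read that right, the run-time variant would look
something like this, with -x forwarding the variable to the remote
nodes (untested on my side):

$ export OMPI_MCA_memory_ptmalloc2_disable=1
$ mpirun -x OMPI_MCA_memory_ptmalloc2_disable -machinefile nodefile ./mpi_helloworld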
>
>
>
> On May 31, 2010, at 11:02 AM, guillaume ranquet wrote:
>
> we use a slightly modified openmpi-1.4.1
>
> the patch is here:
> <diff>
> --- ompi/mca/btl/tcp/btl_tcp_proc.c.orig	2010-03-23 14:01:28.000000000 +0100
> +++ ompi/mca/btl/tcp/btl_tcp_proc.c	2010-03-23 14:01:50.000000000 +0100
> @@ -496,7 +496,7 @@
>                          local_interfaces[i]->ipv4_netmask)) {
>                  weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
>              } else {
> -                weights[i][j] = CQ_PRIVATE_DIFFERENT_NETWORK;
> +                weights[i][j] = CQ_NO_CONNECTION;
>              }
>              best_addr[i][j] = peer_interfaces[j]->ipv4_endpoint_addr;
>          }
> </diff>
>
> I actually just discovered the existence of this patch;
> I'm planning to run tests with a vanilla 1.4.1 and, if possible, a 1.4.2 ASAP.
>
>
> On 05/31/2010 04:18 PM, Ralph Castain wrote:
>>>> What OMPI version are you using?
>>>>
>>>> On May 31, 2010, at 5:37 AM, guillaume ranquet wrote:
>>>>
>>>> Hi,
>>>> I'm new to the list and quite new to the world of MPI.
>>>>
>>>> a bit of background:
>>>> I'm a sysadmin and have to provide a working environment (Debian-based)
>>>> for researchers to work with MPI: I'm _NOT_ an Open MPI user - I know C,
>>>> but that's all.
>>>>
>>>> I compile Open MPI with the following configure options: --prefix=/usr
>>>> --with-openib=/usr --with-mx=/usr
>>>> (yes, everything goes in /usr)
>>>>
>>>> When running an MPI application (any application) on a machine equipped
>>>> with InfiniBand hardware, I get a segmentation fault during
>>>> MPI_Finalize().
>>>> The code runs just fine on machines that have no InfiniBand devices.
>>>>
>>>> <code>
>>>> #include <stdio.h>
>>>> #include <unistd.h>   /* for sleep() */
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>>     int i = 0, rank, size;
>>>>
>>>>     MPI_Init (&argc, &argv);               /* start MPI */
>>>>     MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>>>     MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>>>     while (i == 0)   /* spin until i is changed, e.g. from a debugger */
>>>>         sleep (5);
>>>>     printf ("Hello world from process %d of %d\n", rank, size);
>>>>     MPI_Finalize ();
>>>>     return 0;
>>>> }
>>>> </code>
>>>>
>>>> My gdb-fu is quite rusty, but I get the vague idea that it happens
>>>> somewhere in MPI_Finalize() (I can probably dig a bit there to find
>>>> exactly where, if it's relevant).
>>>>
>>>> I'm running it with:
>>>> $ mpirun --mca orte_base_help_aggregate 0 --mca plm_rsh_agent oarsh
>>>> -machinefile nodefile ./mpi_helloworld
>>>>
>>>>
>>>> After various tests, it was suggested that I try recompiling Open MPI
>>>> with the --without-memory-manager configure option.
>>>> It actually solves the issue and everything runs fine.
>>>>
>>>> From what I understand (correct me if I'm wrong), the "memory manager" is
>>>> used with InfiniBand RDMA to keep a somewhat persistent memory region
>>>> available on the device instead of destroying/recreating it every time.
>>>> So disabling it is only a "performance tuning" issue that disables the
>>>> Open MPI "leave_pinned" option?
>>>>
>>>> The various questions I have:
>>>> is this bug/behaviour known?
>>>> if so, is there a better workaround?
>>>> as I'm not an Open MPI user, I don't really know whether it's considered
>>>> acceptable to have this option disabled.
>>>> does the list want more details on this bug?
>>>>
>>>>
>>>> thanks,
>>>> Guillaume Ranquet.
>>>> Grid5000 support-staff.
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users
