On Jun 2, 2010, at 8:42 AM, guillaume ranquet wrote:
> yes, I have multiple clusters, some with infiniband, some with mx, some
> nodes with both Myrinet and Infiniband hardware and others with ethernet.
> I reproduced it on a vanilla 1.4.1 and 1.4.2 with and without the
> --with-mx switch.
Note that per http://www.open-mpi.org/faq/?category=building#default-build, even if you don't specify --with-mx, if Open MPI's configure is able to find the MX libs/headers, we'll still build support for it.
> this is the output I get on a node with ethernet and infiniband hardware.
> note the Error regarding mx.
> $ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
> [bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX
> device entry in /dev.)
> [bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init:
> mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
I'm guessing the MX BTL is designed to be noisy when it fails, on the assumption that if MX is down, you probably want to know it.
George/Myricom -- can you confirm?
> Hello world from process 0 of 1
> [bordeplage-9:32365] *** Process received signal ***
> [bordeplage-9:32365] Signal: Segmentation fault (11)
> [bordeplage-9:32365] Signal code: Address not mapped (1)
> [bordeplage-9:32365] Failing at address: 0x7f53bb7bb360
What happens if you run:
~/openmpi-1.4.2-bin/bin/mpirun --mca btl openib,sm,self ~/bwlat/mpi_helloworld
(i.e., MX support is still compiled in, but remove MX from the run-time)
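For reference, here is a small sketch of the two ways to spell that selection: listing the BTLs to use, or excluding one with a leading ^. The install prefix and test program are taken from your output above; the OMPI_BIN and APP variables and the two helper functions are made up for illustration, and the script only prints the command lines so you can check them before running anything.

```shell
# Sketch only: prints mpirun command lines instead of running them.
# OMPI_BIN and APP are illustrative names; the defaults come from the output above.
OMPI_BIN=${OMPI_BIN:-$HOME/openmpi-1.4.2-bin/bin}
APP=${APP:-$HOME/bwlat/mpi_helloworld}

# Inclusive form: use only the BTLs named in $1 (comma-separated).
btl_include_cmd() {
    echo "$OMPI_BIN/mpirun --mca btl $1 $APP"
}

# Exclusive form: a leading ^ drops the BTLs named in $1 and keeps the rest.
btl_exclude_cmd() {
    echo "$OMPI_BIN/mpirun --mca btl ^$1 $APP"
}

btl_include_cmd openib,sm,self   # MX compiled in, but not used at run time
btl_exclude_cmd mx               # same intent, written as an exclusion
```

The exclusion form is handy on mixed clusters because it does not need to know which other BTLs a node has; it only removes the one under suspicion.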
> I recompiled a 1.4.2 --with-openib --without-mx and the problem is gone
> (no segfault, no error message).
> seems you aimed at the right spot.
> now the problem is that I need support for both.
> I could compile two versions of openmpi and deploy appropriate versions
> on each cluster with support for either mx or openib... but
> it's quite painful and well, how should I manage nodes with both?
> for now I'll be sticking to a version of openmpi compiled with both
> hardware support and --without-memory-manager.
> unless the list has a better idea?
I'm still guessing that there's some weird interaction between the memory management of those two plugins (MX and verbs). I don't know of anyone else who has this kind of configuration where it could be tested / debugged. :-(
Per the above suggestion, let's see what happens if you run without MX and/or without openib via mpirun command-line options. If that fixes the problem, you would only have to change command-line params when you run -- not maintain 2 OMPI installs.
Additionally, you might be able to leave both plugins enabled but setenv the OMPI_MCA_memory_ptmalloc2_disable environment variable to 1; this will disable the OMPI memory management stuff. Note that this is not a normal MCA parameter -- you cannot set it on the command line or in a file; it *must* be set as an environment variable (for boring, technical reasons -- I can explain if you care).
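A minimal sketch of the environment-variable route, assuming a Bourne-style shell (where csh's setenv becomes a plain VAR=value prefix or an export). The helper name run_without_ptmalloc2 is made up for illustration; it just prints the command line it would use, with both BTLs left enabled.

```shell
# Sketch only: prints the command line rather than running it.
# run_without_ptmalloc2 is an illustrative helper, not part of Open MPI.
run_without_ptmalloc2() {
    # The variable has to be in the environment; it is not honored
    # via --mca on the command line or in a parameter file.
    echo "env OMPI_MCA_memory_ptmalloc2_disable=1 $*"
}

run_without_ptmalloc2 "$HOME/openmpi-1.4.2-bin/bin/mpirun" "$HOME/bwlat/mpi_helloworld"
```

If the segfault goes away with this set, that points at the memory manager interaction rather than at either BTL on its own.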