Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.
From: Scott Atchley (atchley_at_[hidden])
Date: 2010-06-02 10:24:21

On Jun 2, 2010, at 9:54 AM, Jeff Squyres wrote:

>> This is the output I get on a node with Ethernet and InfiniBand hardware.
>> Note the error regarding MX.
>> $ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
>> [] Error in mx_init (error No MX device entry in /dev.)

This comes from ompi_common_mx_initialize(). It fails because there is no MX device and prints the message above with:

opal_output(0, "Error in mx_init (error %s)\n", mx_strerror(mx_return));
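
For reference, that failure path is roughly the following (a condensed sketch, not the exact source; the real error code it returns may differ):

mx_return_t mx_return;

mx_return = mx_init();
if (MX_SUCCESS != mx_return) {
    opal_output(0, "Error in mx_init (error %s)\n", mx_strerror(mx_return));
    return OPAL_ERROR;    /* stand-in for whatever error code the real function uses */
}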

>> [] mca_btl_mx_component_init: mx_get_info(MX_NIC_COUNT) failed with status 4 (MX not initialized.)
> I'm guessing the MX BTL is designed to be noisy when it fails, on the assumption that if MX is down, you probably want to know it.
> George/Myricom -- can you confirm?

This is odd. The ompi_common_mx_initialize() above does not return OPAL_SUCCESS to mca_btl_mx_component_init(), so mca_btl_mx_component_init() should return NULL and never call mx_get_info(). This message, too, comes from an opal_output(0, ...).
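
If that check were honored, the component init would short-circuit roughly like this (simplified; the mx_get_info() argument list here is the standard MX one, not copied from the Open MPI source):

if (OPAL_SUCCESS != ompi_common_mx_initialize()) {
    return NULL;    /* no MX: never reach mx_get_info() */
}
/* only now is it safe to query the NIC count */
mx_get_info(NULL, MX_NIC_COUNT, NULL, 0, &nic_count, sizeof(nic_count));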

I will let George comment on the verbosity.

It looks like ompi_common_mx_initialize() is doing things that affect memory before calling mx_init(), such as setting ompi_mpi_leave_pinned to 1 and setting mpool_resources.regcache_clean = mx__regcache_clean.

There is a chicken-and-egg scenario. The BTL needs to set a registration-cache environment variable before calling mx_init(), but the changes to the mpool resources should probably wait until mx_init() has succeeded, in case MX is not available.
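
Something along these lines is what I have in mind (a sketch only; MX_RCACHE is my guess at the environment variable's name, so check the actual source):

/* chicken: the regcache variable must be in place before mx_init() */
setenv("MX_RCACHE", "1", 1);

if (MX_SUCCESS != mx_init()) {
    /* egg: MX is absent, so leave the memory settings alone */
    return OPAL_ERROR;
}

/* only now, with MX known to be present, adjust the pinning/mpool knobs */
ompi_mpi_leave_pinned = 1;
mpool_resources.regcache_clean = mx__regcache_clean;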

Does the same error happen if he tries on an MX host that does not have IB?