On Jun 2, 2010, at 9:54 AM, Jeff Squyres wrote:
>> this is the output I get on a node with ethernet and infiniband hardware.
>> note the Error regarding mx.
>> $ ~/openmpi-1.4.2-bin/bin/mpirun ~/bwlat/mpi_helloworld
>> [bordeplage-9.bordeaux.grid5000.fr:32365] Error in mx_init (error No MX
>> device entry in /dev.)
This comes from ompi_common_mx_initialize(). mx_init() fails because there is no MX device, and the message above is printed with:
opal_output(0, "Error in mx_init (error %s)\n", mx_strerror(mx_return));
>> [bordeplage-9.bordeaux.grid5000.fr:32365] mca_btl_mx_component_init:
>> mx_get_info(MX_NIC_COUNT) failed with status 4(MX not initialized.)
> I'm guessing the MX BTL is designed to be noisy when it fails, on the assumption that if MX is down, you probably want to know it.
> George/Myricom -- can you confirm?
This is odd. The ompi_common_mx_initialize() call above does not return OPAL_SUCCESS to mca_btl_mx_component_init(), so mca_btl_mx_component_init() should return NULL and never call mx_get_info(). This second message is also printed with an opal_output(0, ...).
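To make the control flow I would expect concrete, here is a minimal, self-contained sketch. Every sketch_* name is a hypothetical stand-in written for this message; none of it is the actual Open MPI or MX code. It only shows the shape: if the common init fails, the component init returns NULL before anything queries the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins, not the real Open MPI or MX symbols. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

static int sketch_mx_available = 0;    /* simulates whether an MX device exists */
static int sketch_get_info_calls = 0;  /* counts queries that need an initialized MX */
static int sketch_module = 0;          /* dummy "module" handed back on success */

/* Plays the role of the common init: fails when there is no device. */
static int sketch_common_mx_initialize(void)
{
    if (!sketch_mx_available) {
        fprintf(stderr, "Error in mx_init (simulated: no MX device)\n");
        return SKETCH_ERROR;
    }
    return SKETCH_SUCCESS;
}

/* Plays the role of a NIC-count query; only valid after a successful init. */
static int sketch_get_nic_count(int *count)
{
    assert(count != NULL);
    sketch_get_info_calls++;
    *count = 1;
    return SKETCH_SUCCESS;
}

/* Plays the role of the component init: NULL means "this BTL is unusable". */
static int *sketch_component_init(void)
{
    int nic_count = 0;

    /* Bail out before touching the device if the common init failed. */
    if (sketch_common_mx_initialize() != SKETCH_SUCCESS) {
        return NULL;
    }
    if (sketch_get_nic_count(&nic_count) != SKETCH_SUCCESS || nic_count == 0) {
        return NULL;
    }
    sketch_module = nic_count;
    return &sketch_module;
}
```

With sketch_mx_available left at 0, sketch_component_init() returns NULL and sketch_get_info_calls stays at 0, so only the first error message would be printed.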
I will let George comment on the verbosity.
It looks like ompi_common_mx_initialize() does things that affect memory before calling mx_init(), such as setting ompi_mpi_leave_pinned to 1 and setting mpool_resources.regcache_clean = mx__regcache_clean.
There is a chicken-and-egg problem here. The BTL needs to set a registration cache environment variable before calling mx_init(), but altering the mpool resources should probably wait until after mx_init() has succeeded, in case MX is not available.
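One possible ordering, sketched below as a self-contained example: set the environment variable first, call the init, and only commit the global changes once the init has succeeded. All names here (the sketch_* functions, struct sketch_state, and the SKETCH_RCACHE variable) are hypothetical placeholders, not the real Open MPI or MX identifiers, and the real fix may need to look different.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv()/unsetenv() */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the global state that should change only on success. */
struct sketch_state {
    int leave_pinned;
    void (*regcache_clean)(void *ptr, size_t len);
};

static struct sketch_state sketch_global = { 0, NULL };
static int sketch_device_present = 0;  /* simulates whether the device exists */

/* Hypothetical cache-clean hook; does nothing in this sketch. */
static void sketch_regcache_clean(void *ptr, size_t len)
{
    assert(ptr != NULL || len == 0);
    (void)ptr;
    (void)len;
}

/* Plays the role of the library init, which reads its environment on entry. */
static int sketch_lib_init(void)
{
    if (getenv("SKETCH_RCACHE") == NULL) {
        return -1;
    }
    return sketch_device_present ? 0 : -1;
}

static int sketch_initialize(void)
{
    /* Step 1: the environment variable must be in place before the init. */
    if (setenv("SKETCH_RCACHE", "2", 1) != 0) {
        return -1;
    }

    /* Step 2: try the init; on failure, undo step 1 and touch nothing else. */
    if (sketch_lib_init() != 0) {
        unsetenv("SKETCH_RCACHE");
        return -1;
    }

    /* Step 3: only now commit the changes that affect the rest of the process. */
    sketch_global.leave_pinned = 1;
    sketch_global.regcache_clean = sketch_regcache_clean;
    return 0;
}
```

On a host without the device, sketch_initialize() fails and sketch_global is left exactly as it was, which is the behavior I would want when MX is not available.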
Does the same error happen if he tries on an MX host that does not have IB?