Open MPI User's Mailing List Archives

From: Reese Faucette (reese_at_[hidden])
Date: 2007-01-02 15:52:17


Subject: Ompi failing on mx only

Hi, Gary-
This looks like a config problem rather than a code problem so far. Could you send the output of mx_info from node-1 and from node-2? Also, forgive me for counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if omitting the value defaults to this behavior, but I have to ask.)
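For reference, the two invocations in question would look roughly like this (a sketch only; I believe the bare form exports whatever value LD_LIBRARY_PATH has in the launching shell, but check the mpirun man page):

  # export LD_LIBRARY_PATH with whatever value the launching shell has
  mpirun -x LD_LIBRARY_PATH ... ./cpi

  # export an explicit value
  mpirun -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} ... ./cpi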

Also, have you tried the MX MTL as opposed to the BTL, i.e. --mca pml cm --mca mtl mx,self? (It looks like you did.)
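Spelled out against the command quoted below, the two variants would be something like the following (prefix, hostfile, and process count taken from Gary's mail; only the MCA arguments differ):

  # MX through the BTL (what was run below)
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

  # MX through the MTL, with the cm PML
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca pml cm --mca mtl mx,self ./cpi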

"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff " makes it look like your fabric may not be fully mapped or that you may have a down link.

thanks,
-reese
Myricom, Inc.

  I was initially using 1.1.2 and moved to 1.2b2 because of a hang in MPI_Bcast() which 1.2b2 is reported to fix, and it seems to have done so. My compute nodes each have two dual-core Xeons on Myrinet with MX. The problem is getting OMPI running on MX only. My machine file is as follows:

  node-1 slots=4 max-slots=4
  node-2 slots=4 max-slots=4
  node-3 slots=4 max-slots=4

  'mpirun' with the minimum number of processes needed to reproduce the error:
          mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

  I don't believe there's anything wrong with the hardware, as I can ping over MX between this failed node and the master just fine. So I tried a different set of 3 nodes and got the same error; it always fails on the second node of any group of nodes I choose.