
Open MPI User's Mailing List Archives


Subject: [OMPI users] Trouble with PSM "Could not detect network connectivity"
From: Blosch, Edwin L (edwin.l.blosch_at_[hidden])
Date: 2012-11-02 11:49:18


I am hitting a problem where something called "PSM" fails to start, which in turn prevents my job from running. The command and output are below. I would like to understand what's going on. Apparently this version of Open MPI was built with support for PSM, but if PSM is not available, why fail when other transports are? Also, I believe my command tells Open MPI to use nothing but self and sm, so why would it try to use PSM at all?

Thanks in advance for any help...

user_at_machinename:~> /usr/mpi/intel/openmpi-1.4.3/bin/ompi_info -all | grep psm
                 MCA mtl: psm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA mtl: parameter "mtl_psm_connect_timeout" (current value: "180", data source: default value)
                 MCA mtl: parameter "mtl_psm_debug" (current value: "1", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_unit" (current value: "-1", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_port" (current value: "0", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_service_level" (current value: "0", data source: default value)
                 MCA mtl: parameter "mtl_psm_ib_pkey" (current value: "32767", data source: default value)
                 MCA mtl: parameter "mtl_psm_priority" (current value: "0", data source: default value)

Here is my command:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -n 1 --mca btl_base_verbose 30 --mca btl self,sm /release/cfd/simgrid/P_OPT.LINUX64
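For what it's worth, if the btl list above only restricts BTL components and the PSM MTL gets opened through a different code path anyway, a variant that might sidestep PSM entirely (I am guessing at the right MCA selection here) would be:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -n 1 --mca pml ob1 --mca btl self,sm /release/cfd/simgrid/P_OPT.LINUX64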

and here is the output:

[machinename:01124] mca: base: components_open: Looking for btl components
[machinename:01124] mca: base: components_open: opening btl components
[machinename:01124] mca: base: components_open: found loaded component self
[machinename:01124] mca: base: components_open: component self has no register function
[machinename:01124] mca: base: components_open: component self open function successful
[machinename:01124] mca: base: components_open: found loaded component sm
[machinename:01124] mca: base: components_open: component sm has no register function
[machinename:01124] mca: base: components_open: component sm open function successful
machinename.1124ipath_userinit: assign_context command failed: Network is down
machinename.1124can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Could not detect network connectivity
--------------------------------------------------------------------------
[machinename:01124] mca: base: close: component self closed
[machinename:01124] mca: base: close: unloading component self
[machinename:01124] mca: base: close: component sm closed
[machinename:01124] mca: base: close: unloading component sm
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.