Open MPI User's Mailing List Archives

Subject: [OMPI users] Pointers for understanding failure messages on NetBSD
From: Kevin.Buckley_at_[hidden]
Date: 2009-11-29 18:15:35


Hi there,

I recently compiled OpenMPI 1.3.3 for a NetBSD platform
as part of an attempt to get some MPI-based codes running
on the SGE cycle stealing grid we have in the School here.

I should point out that this has not been done within the
pkgsrc build system as yet but that I found I was able to
get a working environment by starting out with:

./configure --prefix=/vol/grid/pkg/openmpi-1.3.3 \
  --with-sge --disable-dlopen --enable-contrib-no-build=vt

OK, following a recent rebuild of the underlying NetBSD OS
on the machines which participate in our grid, I am now seeing
the following error message when trying to run a simple mpirun
on a single box:

$ mpirun -n 4 hello_f77
[somebox.ecs.vuw.ac.nz:04414] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
 Hello, world, I am 0 of 4
 Hello, world, I am 1 of 4
 Hello, world, I am 2 of 4
 Hello, world, I am 3 of 4

Whilst this runs, I was not seeing the error before the OS rebuild.
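
As a first step towards reading the opal_ifinit message, the numeric errno
can be turned into its symbolic name and system message. What follows is
only a minimal sketch in Python: the helper name describe_errno is
hypothetical (it is not part of Open MPI), and it assumes the script is run
on the same machine that printed the message, since errno numbering and
message text are platform-specific.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for a numeric errno value.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    # errno.errorcode maps numeric values to names such as 'ENXIO'
    # for the platform the interpreter is running on.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message text for that value.
    message = os.strerror(code)
    return name, message


if __name__ == "__main__":
    # 6 is the value reported by opal_ifinit in the output above.
    name, message = describe_errno(6)
    print(f"errno=6 is {name}: {message}")
```

On the Unix-like systems I am aware of, 6 maps to ENXIO, which would suggest
that the ioctl was issued against an interface the kernel no longer
considers configured, but that reading is an assumption until checked on
the affected host.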

When running on a "server" machine within the grid, a machine I am told
should not be any different to the workstation I was using above in
respect of user environment, I get a different error and find that the
job does not run at all.

This case seems to produce an error message that is oft reported within
the OpenMPI community:

$ mpirun -n 4 hello_f77
[somebox2.ecs.vuw.ac.nz:25244] [[51186,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 150
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
...

  orte_rml_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[somebox2.ecs.vuw.ac.nz:25244] [[51186,0],0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
...

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[somebox2.ecs.vuw.ac.nz:25244] [[51186,0],0] ORTE_ERROR_LOG: Error in file orterun.c at line 473

Anyone like to suggest what I might do to better understand and so
possibly correct these issues?

Kevin

-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand