Open MPI Development Mailing List Archives

Subject: [OMPI devel] New odls component fails
From: Alex Margolin (alex.margolin_at_[hidden])
Date: 2012-03-17 13:51:42


Hi,

I want to launch Open MPI processes through a wrapper process: instead of
launching "hello" four times, I want "mpirun -n 4 hello" to run
"mosrun -w hello" four times. To do this I've cloned the "default" component
in orte/mca/odls (from trunk) - see the attached patch.
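
To make the intent concrete, here is a minimal standalone sketch of the
argv rewrite the cloned component is meant to perform before exec'ing the
child. This is illustration only, not the attached patch and not the real
odls code path; the helper name and the hard-coded wrapper are made up:

/* Sketch only: prepend "mosrun -w" to an argv before exec'ing it.
 * The helper name and hard-coded wrapper are hypothetical and do not
 * reflect the real odls internals or the attached patch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Build a new NULL-terminated argv with "mosrun -w" prepended. */
static char **wrap_argv(char *const orig[])
{
    int n = 0;
    while (NULL != orig[n]) {
        n++;
    }
    /* Two extra slots for "mosrun" and "-w"; the NULL terminator is copied. */
    char **wrapped = malloc((n + 3) * sizeof(char *));
    if (NULL == wrapped) {
        return NULL;
    }
    wrapped[0] = "mosrun";
    wrapped[1] = "-w";
    memcpy(&wrapped[2], orig, (n + 1) * sizeof(char *));
    return wrapped;
}

int main(void)
{
    char *app[] = { "hello", NULL };  /* what mpirun would normally exec */
    char **argv = wrap_argv(app);
    if (NULL == argv) {
        return 1;
    }
    execvp(argv[0], argv);            /* effectively runs "mosrun -w hello" */
    perror("execvp");                 /* reached only if the exec fails */
    return 1;
}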

I'm getting an error related to mosrun, and I'd like to configure
Open MPI to avoid it. I'm running on my laptop ("singularity"), which is
the only node.
I suspect the error (full output at the bottom) is caused by the
following lines, which indicate that system calls not supported by mosrun
are being invoked:

MOSRUN: system-call 'sched_getaffinity' not supported under MOSIX
MOSRUN: Shared memory (MAP_SHARED) not supported under MOSIX

As the lines state, both the sched_getaffinity() syscall (and related
calls) and mmap() with MAP_SHARED are unsupported. I've tried to find all
the relevant call sites in the Open MPI code and disable them, but to no
avail (a small standalone probe for these two operations is sketched
right after the shell listing below):

alex_at_singularity:~/huji/openmpi-trunk$ find . -name .ompi_ignore
./opal/mca/shmem/mmap/.ompi_ignore
./opal/mca/shmem/posix/.ompi_ignore
./opal/mca/hwloc/hwloc132/.ompi_ignore
./opal/mca/timer/altix/.ompi_ignore
./opal/mca/memory/linux/.ompi_ignore
./orte/mca/plm/xgrid/.ompi_ignore
./orte/mca/plm/submit/.ompi_ignore
./orte/mca/sensor/heartbeat/.ompi_ignore
./ompi/mca/fs/lustre/.ompi_ignore
./ompi/mca/rcache/rb/.ompi_ignore
./ompi/mca/coll/sm/.ompi_ignore
./ompi/mca/coll/demo/.ompi_ignore
./ompi/mca/pml/example/.ompi_ignore
./ompi/mca/op/x86/.ompi_ignore
./ompi/mca/op/example/.ompi_ignore
./ompi/mca/btl/sm/.ompi_ignore
./ompi/mca/btl/template/.ompi_ignore
./ompi/mca/mpool/sm/.ompi_ignore
./ompi/mca/common/sm/.ompi_ignore
./ompi/mca/vprotocol/example/.ompi_ignore
alex_at_singularity:~/huji/openmpi-trunk$ cat command
./autogen.sh ; ./configure CFLAGS=-m64 CXXFLAGS=-m64
--prefix=/home/alex/huji/ompit --disable-hwloc --disable-mmap-shmem
--disable-posix-shmem --disable-sysv-shmem
--enable-mca-no-build=maffinity,paffinity ; make ; make install
alex_at_singularity:~/huji/openmpi-trunk$
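
For reference, a small standalone probe along the following lines (a
sketch, independent of Open MPI) can be run under "mosrun -w" to confirm
which of the two operations actually fails:

/* Sketch only: probes the two operations MOSRUN complains about,
 * outside of Open MPI. Run it under "mosrun -w" to see which fails. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <sys/mman.h>

int main(void)
{
    cpu_set_t set;
    if (0 != sched_getaffinity(0, sizeof(set), &set)) {
        perror("sched_getaffinity");
    } else {
        printf("sched_getaffinity: ok\n");
    }

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == p) {
        perror("mmap(MAP_SHARED)");
    } else {
        printf("mmap(MAP_SHARED): ok\n");
        munmap(p, 4096);
    }
    return 0;
}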

Can anyone help me determine where these system calls are being made and
how to disable them? Or is this perhaps another, unrelated problem?
The attached module is part of a system I'm building (along with the BTL
module I've mentioned in the past - still working on it...), in the hope
of contributing it to the Open MPI community upon completion.

Thanks a lot,
Alex

P.S. Here is the full output of the error:

alex_at_singularity:~/huji/benchmarks/simple$ ~/huji/ompit/bin/mpirun -mca
orte_debug 100 -n 1 hello
[singularity:15041] mca: base: component_find: unable to open
/home/alex/huji/ompit/lib/openmpi/mca_paffinity_hwloc:
/home/alex/huji/ompit/lib/openmpi/mca_paffinity_hwloc.so: undefined
symbol: opal_hwloc_topology (ignored)
[singularity:15041] mca: base: component_find: unable to open
/home/alex/huji/ompit/lib/openmpi/mca_rmaps_rank_file:
/home/alex/huji/ompit/lib/openmpi/mca_rmaps_rank_file.so: undefined
symbol: opal_hwloc_binding_policy (ignored)
[singularity:15041] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/35712/0/0
[singularity:15041] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/0
[singularity:15041] top: openmpi-sessions-alex_at_singularity_0
[singularity:15041] tmp: /tmp
[singularity:15041] mpirun: reset PATH:
/home/alex/huji/ompit/bin:/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/default-java/bin::/usr/local/apache-maven-3.0.3/bin
[singularity:15041] mpirun: reset LD_LIBRARY_PATH: /home/alex/huji/ompit/lib
[singularity:15041] [[35712,0],0] hostfile: checking hostfile
/home/alex/huji/ompit/etc/openmpi-default-hostfile for nodes
[singularity:15041] [[35712,0],0] hostfile: filtering nodes through
hostfile /home/alex/huji/ompit/etc/openmpi-default-hostfile
[singularity:15041] defining message event: grpcomm_bad_module.c 165
[singularity:15041] progressed_wait: base/plm_base_launch_support.c 297
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by
[[35712,0],0] for tag 1
[singularity:15041] [[35712,0],0] orte:daemon:send_relay
[singularity:15041] [[35712,0],0] orte:daemon:send_relay - recipient
list is empty!
[singularity:15041] [[35712,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
   MPIR_being_debugged = 0
   MPIR_debug_state = 1
   MPIR_partial_attach_ok = 1
   MPIR_i_am_starter = 0
   MPIR_forward_output = 0
   MPIR_proctable_size = 1
   MPIR_proctable:
     (i, host, exe, pid) = (0, singularity,
/home/alex/huji/benchmarks/simple/hello, 15042)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL

MOSRUN: system-call 'sched_getaffinity' not supported under MOSIX
MOSRUN: Shared memory (MAP_SHARED) not supported under MOSIX

[singularity:15042] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/35712/1/0
[singularity:15042] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/1
[singularity:15042] top: openmpi-sessions-alex_at_singularity_0
[singularity:15042] tmp: /tmp
[singularity:15041] [[35712,0],0] orted_recv_cmd: received message from
[[35712,1],0]
[singularity:15041] defining message event: orted/orted_comm.c 172
[singularity:15041] [[35712,0],0] orted_recv_cmd: reissued recv
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by
[[35712,1],0] for tag 1
[singularity:15041] [[35712,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor: processing
commands completed
[singularity:15042] OPAL dss:unpack: got type 33 when expecting type 12
[singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../orte/util/nidmap.c at line 429
[singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../../../orte/mca/ess/base/ess_base_nidmap.c at line 62
[singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../../../../orte/mca/ess/env/ess_env_module.c at line 173
[singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in
file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   ompi_mpi_init: orte_init failed
   --> Returned "Pack data mismatch" (-22) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[singularity:15042] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!
[singularity:15041] defining message event: iof_hnp_read.c 293
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by
[[35712,0],0] for tag 1
[singularity:15041] [[35712,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_IOF_COMPLETE
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor: processing
commands completed
[singularity:15041] defining message event: base/odls_base_default_fns.c
2532
[singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by
[[35712,0],0] for tag 1
[singularity:15041] [[35712,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_WAITPID_FIRED
[singularity:15041] sess_dir_finalize: proc session dir not empty - leaving
[singularity:15041] [[35712,0],0]:errmgr_default_hnp.c(948) updating
exit status to 1
-------------------------------------------------------
While the primary job terminated normally, 1 process returned
a non-zero exit code.. Further examination may be required.
-------------------------------------------------------
[singularity:15041] sess_dir_finalize: job session dir not empty - leaving
[singularity:15041] [[35712,0],0] Releasing job data for [35712,0]
[singularity:15041] [[35712,0],0] Releasing job data for [35712,1]
[singularity:15041] sess_dir_finalize: proc session dir not empty - leaving
orterun: exiting with status 1
alex_at_singularity:~/huji/benchmarks/simple$