Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New odls component fails
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-03-17 14:16:26


I don't think you need to .ompi_ignore all those components. First, you need to use the --without-hwloc option (you misspelled it below as --disable-hwloc).
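For reference, here is a sketch of the corrected configure line (all flags other than the hwloc one are copied from the command you quoted below; adjust the prefix for your setup):

```shell
# The quoted configure command, with --without-hwloc substituted
# for the nonexistent --disable-hwloc flag.
./configure CFLAGS=-m64 CXXFLAGS=-m64 --prefix=/home/alex/huji/ompit \
    --without-hwloc \
    --disable-mmap-shmem --disable-posix-shmem --disable-sysv-shmem \
    --enable-mca-no-build=maffinity,paffinity
```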

Assuming you removed the relevant code from your clone of the default odls module, I suspect the calls are being made in ompi/runtime/ompi_mpi_init.c. If a process detects that it isn't bound, it checks whether it should bind itself. I thought that code was also turned "off" when configuring --without-hwloc, so you might have to check it.
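To locate the suspected call sites yourself, a recursive grep from the top of the source tree should narrow things down (the patterns below are just the two symbols mosrun complained about, not an exhaustive list):

```shell
# Search the tree for the calls MOSIX rejects; run from the
# top of the Open MPI source tree.
grep -rn "sched_getaffinity" ompi/ opal/ orte/
grep -rn "MAP_SHARED" ompi/ opal/ orte/
```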

Shared memory is a separate issue. If you want/need to avoid it, run with -mca btl ^sm; this will turn off all shared memory calls.
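For example (using "hello" to stand in for your actual binary; the quotes just keep the shell from interpreting the caret):

```shell
# Exclude the shared-memory BTL; the leading ^ negates the selection,
# i.e. "use every BTL except sm".
mpirun -mca btl '^sm' -n 4 hello

# The same selection via an environment variable:
export OMPI_MCA_btl=^sm
mpirun -n 4 hello
```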

On Mar 17, 2012, at 11:51 AM, Alex Margolin wrote:

> Hi,
>
> I want to launch Open MPI processes via a wrapper process: instead of running "hello" x 4, I want to run "mosrun -w hello" x 4 when I start it with "mpirun -n 4 hello". I've cloned the "default" component in orte/mca/odls (from trunk) - see the patch attached.
>
> I'm getting an error related to mosrun, and I want to configure Open MPI to avoid it. I'm running on my laptop ("singularity"), which is the only node.
> I suspect my error (full output at the bottom) is caused by the following lines, which indicate system calls that mosrun does not support:
>
> MOSRUN: system-call 'sched_getaffinity' not supported under MOSIX
> MOSRUN: Shared memory (MAP_SHARED) not supported under MOSIX
>
> As the lines state, both the sched_getaffinity() syscall (and the like) and mmap() with MAP_SHARED are unsupported. I've tried to find all the relevant instances in the Open MPI code and disable them, but to no avail:
>
> alex_at_singularity:~/huji/openmpi-trunk$ find . -name .ompi_ignore
> ./opal/mca/shmem/mmap/.ompi_ignore
> ./opal/mca/shmem/posix/.ompi_ignore
> ./opal/mca/hwloc/hwloc132/.ompi_ignore
> ./opal/mca/timer/altix/.ompi_ignore
> ./opal/mca/memory/linux/.ompi_ignore
> ./orte/mca/plm/xgrid/.ompi_ignore
> ./orte/mca/plm/submit/.ompi_ignore
> ./orte/mca/sensor/heartbeat/.ompi_ignore
> ./ompi/mca/fs/lustre/.ompi_ignore
> ./ompi/mca/rcache/rb/.ompi_ignore
> ./ompi/mca/coll/sm/.ompi_ignore
> ./ompi/mca/coll/demo/.ompi_ignore
> ./ompi/mca/pml/example/.ompi_ignore
> ./ompi/mca/op/x86/.ompi_ignore
> ./ompi/mca/op/example/.ompi_ignore
> ./ompi/mca/btl/sm/.ompi_ignore
> ./ompi/mca/btl/template/.ompi_ignore
> ./ompi/mca/mpool/sm/.ompi_ignore
> ./ompi/mca/common/sm/.ompi_ignore
> ./ompi/mca/vprotocol/example/.ompi_ignore
> alex_at_singularity:~/huji/openmpi-trunk$ cat command
> ./autogen.sh ; ./configure CFLAGS=-m64 CXXFLAGS=-m64 --prefix=/home/alex/huji/ompit --disable-hwloc --disable-mmap-shmem --disable-posix-shmem --disable-sysv-shmem --enable-mca-no-build=maffinity,paffinity ; make ; make install
> alex_at_singularity:~/huji/openmpi-trunk$
>
> Can anyone help me determine where the code calling these system calls is, so I can disable it? Or is this another, unrelated problem?
> The attached module is part of a system I'm building (along with the BTL module I've mentioned in the past - still working on it...) in the hope of contributing it to the Open MPI community upon completion.
>
> Thanks a lot,
> Alex
>
> P.S. Here is the full output of the error:
>
> alex_at_singularity:~/huji/benchmarks/simple$ ~/huji/ompit/bin/mpirun -mca orte_debug 100 -n 1 hello
> [singularity:15041] mca: base: component_find: unable to open /home/alex/huji/ompit/lib/openmpi/mca_paffinity_hwloc: /home/alex/huji/ompit/lib/openmpi/mca_paffinity_hwloc.so: undefined symbol: opal_hwloc_topology (ignored)
> [singularity:15041] mca: base: component_find: unable to open /home/alex/huji/ompit/lib/openmpi/mca_rmaps_rank_file: /home/alex/huji/ompit/lib/openmpi/mca_rmaps_rank_file.so: undefined symbol: opal_hwloc_binding_policy (ignored)
> [singularity:15041] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/0/0
> [singularity:15041] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/0
> [singularity:15041] top: openmpi-sessions-alex_at_singularity_0
> [singularity:15041] tmp: /tmp
> [singularity:15041] mpirun: reset PATH: /home/alex/huji/ompit/bin:/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/default-java/bin::/usr/local/apache-maven-3.0.3/bin
> [singularity:15041] mpirun: reset LD_LIBRARY_PATH: /home/alex/huji/ompit/lib
> [singularity:15041] [[35712,0],0] hostfile: checking hostfile /home/alex/huji/ompit/etc/openmpi-default-hostfile for nodes
> [singularity:15041] [[35712,0],0] hostfile: filtering nodes through hostfile /home/alex/huji/ompit/etc/openmpi-default-hostfile
> [singularity:15041] defining message event: grpcomm_bad_module.c 165
> [singularity:15041] progressed_wait: base/plm_base_launch_support.c 297
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by [[35712,0],0] for tag 1
> [singularity:15041] [[35712,0],0] orte:daemon:send_relay
> [singularity:15041] [[35712,0],0] orte:daemon:send_relay - recipient list is empty!
> [singularity:15041] [[35712,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 1
> MPIR_proctable:
> (i, host, exe, pid) = (0, singularity, /home/alex/huji/benchmarks/simple/hello, 15042)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
>
> MOSRUN: system-call 'sched_getaffinity' not supported under MOSIX
> MOSRUN: Shared memory (MAP_SHARED) not supported under MOSIX
>
> [singularity:15042] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/1/0
> [singularity:15042] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/35712/1
> [singularity:15042] top: openmpi-sessions-alex_at_singularity_0
> [singularity:15042] tmp: /tmp
> [singularity:15041] [[35712,0],0] orted_recv_cmd: received message from [[35712,1],0]
> [singularity:15041] defining message event: orted/orted_comm.c 172
> [singularity:15041] [[35712,0],0] orted_recv_cmd: reissued recv
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by [[35712,1],0] for tag 1
> [singularity:15041] [[35712,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor: processing commands completed
> [singularity:15042] OPAL dss:unpack: got type 33 when expecting type 12
> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/util/nidmap.c at line 429
> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../orte/mca/ess/base/ess_base_nidmap.c at line 62
> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../../orte/mca/ess/env/ess_env_module.c at line 173
> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/runtime/orte_init.c at line 132
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Pack data mismatch" (-22) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** The MPI_Init() function was called before MPI_INIT was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [singularity:15042] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> [singularity:15041] defining message event: iof_hnp_read.c 293
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by [[35712,0],0] for tag 1
> [singularity:15041] [[35712,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_IOF_COMPLETE
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor: processing commands completed
> [singularity:15041] defining message event: base/odls_base_default_fns.c 2532
> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by [[35712,0],0] for tag 1
> [singularity:15041] [[35712,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_WAITPID_FIRED
> [singularity:15041] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:15041] [[35712,0],0]:errmgr_default_hnp.c(948) updating exit status to 1
> -------------------------------------------------------
> While the primary job terminated normally, 1 process returned
> a non-zero exit code.. Further examination may be required.
> -------------------------------------------------------
> [singularity:15041] sess_dir_finalize: job session dir not empty - leaving
> [singularity:15041] [[35712,0],0] Releasing job data for [35712,0]
> [singularity:15041] [[35712,0],0] Releasing job data for [35712,1]
> [singularity:15041] sess_dir_finalize: proc session dir not empty - leaving
> orterun: exiting with status 1
> alex_at_singularity:~/huji/benchmarks/simple$
> <odls_mosix.diff>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel