Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New odls component fails
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-03-18 14:53:56

On Mar 17, 2012, at 4:18 PM, Alex Margolin wrote:

> On 03/17/2012 08:16 PM, Ralph Castain wrote:
>> I don't think you need to .ompi_ignore all those components. First, you need to use the --without-hwloc option (you misspelled it below as --disable-hwloc).
> I missed it, thank you.
>> Assuming you removed the relevant code from your clone of the default odls module, I suspect the calls are being made in ompi/runtime/ompi_mpi_init.c. If the process detects it isn't bound, it looks to see if it should bind itself. I thought that code was also turned "off" if we configured without-hwloc, so you might have to check it.
> I didn't remove any code from the default module. Should I have? (All I added was inserting "mosrun -w" before the app name in the argv)

No, using --without-hwloc will turn off all the memory and cpu binding calls.

> Could you please explain what you mean by "bound" and how I can bind processes?

Binding means telling the OS to restrict execution of this process to the specified CPUs. You can also ask it to restrict all malloc'd memory to a region local to those CPUs; this is where some of your earlier error messages came from.
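
To make "binding" concrete, here is a minimal sketch, assuming a Linux host, that uses Python's os.sched_getaffinity and os.sched_setaffinity. It is not what Open MPI itself calls (that goes through hwloc), it covers only the CPU side and not memory binding, and the helper name bind_to_one_cpu is made up for illustration.

```python
import os

def bind_to_one_cpu(pid=0):
    """Restrict `pid` (0 = the calling process) to one CPU it may already use.

    Hypothetical helper for illustration only; Linux-only.
    """
    allowed = os.sched_getaffinity(pid)   # CPUs the OS currently lets us run on
    target = min(allowed)                 # pick one of them
    os.sched_setaffinity(pid, {target})   # ask the OS to restrict execution to it
    return target

if __name__ == "__main__":
    before = os.sched_getaffinity(0)
    cpu = bind_to_one_cpu()
    print("allowed before:", sorted(before))
    print("bound to:", cpu, "-> allowed now:", sorted(os.sched_getaffinity(0)))
```

A process that is "not bound" in the sense above is one whose allowed set is still every CPU on the node.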

> Also, I'm now getting a similar error, but a quick check shows ess_base_nidmap.c doesn't exist in the trunk:
> ...
> [singularity:01899] OPAL dss:unpack: got type 22 when expecting type 16
> [singularity:01899] [[46635,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../orte/mca/ess/base/ess_base_nidmap.c at line 57
> [singularity:01899] [[46635,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../../orte/mca/ess/env/ess_env_module.c at line 173
> [singularity:01899] [[46635,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/runtime/orte_init.c at line 132

This is typically caused by stale libraries in your install area. Did you rm -rf your prior installation before rebuilding? Did you recompile your application after you rebuilt?

These files no longer exist in the trunk, as you noted, so if something is still looking for them, you either didn't clean out the old installation or forgot to recompile the application after rebuilding OMPI.
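
As a rough illustration of why mismatched builds produce "got type X when expecting type Y": a typed buffer tags each value with a numeric type ID when packing and checks that ID when unpacking. If the two sides were built from versions that number the types differently, the check fails. The sketch below is a toy, not OPAL's dss code; the type-ID tables, function names, and exception are all invented, and the IDs are chosen only to echo the log above.

```python
import struct

# Hypothetical type-ID tables for two builds; the numbers are invented.
OLD_BUILD = {"int32": 16, "string": 22}
NEW_BUILD = {"int32": 22, "string": 16}

class PackMismatch(Exception):
    """Raised when the type tag in the buffer is not the one expected."""

def pack_int32(value, type_ids):
    # One byte of type tag followed by a 4-byte big-endian payload.
    return struct.pack("!Bi", type_ids["int32"], value)

def unpack_int32(buf, type_ids):
    tag, value = struct.unpack("!Bi", buf)
    expected = type_ids["int32"]
    if tag != expected:
        raise PackMismatch(f"got type {tag} when expecting type {expected}")
    return value

if __name__ == "__main__":
    buf = pack_int32(42, NEW_BUILD)        # sender uses the new numbering
    print(unpack_int32(buf, NEW_BUILD))    # matching receiver: fine
    try:
        unpack_int32(buf, OLD_BUILD)       # stale receiver: tag check fails
    except PackMismatch as err:
        print("error:", err)
```

Here the payload is perfectly valid; only the two sides' idea of the type numbering differs, which is what a leftover library from an older install would cause.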

> --------------------------------------------------------------------------
> ...
>> Shared memory is a separate issue. If you want/need to avoid it, then run with -mca btl ^sm and this will turn off all shared memory calls.
> After my last post I tried to rebuild and then even the simplest app wouldn't start. Turns out I disabled all the shmem (mmap, posix, sysv) and orte wouldn't start without any (so I had to turn it back on). Could you tell me if there is a way to run the application without making any mmap() calls with MAP_SHARED? Currently, mosrun is run with -w asking it to fail (return -1) on any such system-call.

ORTE doesn't use shared memory, but I suspect the opal shmem framework objects when it can't find any usable component. We shouldn't error out for that reason, but the code currently does. Edit the file opal/mca/shmem/base/shmem_base_select.c and change line 174 to return OPAL_SUCCESS. You may hit other problems down the line, since the system may not react well to having no component there, but give it a try.

Worst case, you may have to add a "null" component to the opal/mca/shmem framework that does nothing, just so the framework has a defined module instead of a bunch of NULL function pointers.
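
The idea of a "null" component, sketched in Python rather than the C the framework is written in: every entry point exists but does nothing or reports "not supported", so callers never hit a missing function. The method names, the error code, and the select_module helper are hypothetical and are not the real opal shmem module interface.

```python
NOT_SUPPORTED = -1  # hypothetical error code

class NullShmem:
    """Do-nothing module: every operation is defined, none touches shared memory."""
    name = "null"

    def segment_create(self, size):
        return NOT_SUPPORTED      # refuse politely instead of crashing

    def segment_attach(self, segment):
        return None               # nothing to attach to

    def segment_detach(self, segment):
        return 0                  # detaching nothing always succeeds

def select_module(candidates):
    """Return the first usable module, falling back to NullShmem instead of None."""
    for module in candidates:
        if module is not None:
            return module
    return NullShmem()

if __name__ == "__main__":
    module = select_module([])    # no real component available
    print(module.name, module.segment_create(4096))
```

The point of the fallback is that selection always yields an object with the full set of methods, so later code can call them unconditionally.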

> Thanks for your help,
> Alex
>> On Mar 17, 2012, at 11:51 AM, Alex Margolin wrote:
>>> [singularity:15041] [[35712,0],0] orted_recv_cmd: received message from [[35712,1],0]
>>> [singularity:15041] defining message event: orted/orted_comm.c 172
>>> [singularity:15041] [[35712,0],0] orted_recv_cmd: reissued recv
>>> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor called by [[35712,1],0] for tag 1
>>> [singularity:15041] [[35712,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
>>> [singularity:15041] [[35712,0],0] orte:daemon:cmd:processor: processing commands completed
>>> [singularity:15042] OPAL dss:unpack: got type 33 when expecting type 12
>>> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/util/nidmap.c at line 429
>>> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../orte/mca/ess/base/ess_base_nidmap.c at line 62
>>> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../../../../orte/mca/ess/env/ess_env_module.c at line 173
>>> [singularity:15042] [[35712,1],0] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/runtime/orte_init.c at line 132
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>> ompi_mpi_init: orte_init failed
>>> --> Returned "Pack data mismatch" (-22) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>>> *** This is disallowed by the MPI standard.
>>> *** Your MPI job will now abort.