Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Spawn error: "Data unpack would read past end of buffer" (-26) instead of "Success"
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-09-06 14:11:27


Hmmm...well, nothing definitive there, I'm afraid.

All I can suggest is to remove/reduce the threading. Like I said, we aren't terribly thread safe at this time. I suspect you're stepping into one of those non-safe areas here.

Hopefully we will do better in later releases.
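
For anyone hitting the same problem: a quick way to see how much thread support a given build will actually grant is to check the level returned by MPI_Init_thread. The following is a minimal sketch, not code from this thread; the error message text is illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multi-thread support; the library may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* At a lower level, concurrent MPI calls from several threads are
         * not guaranteed to be safe: serialize them or drop the extra
         * threads, as suggested above. */
        fprintf(stderr, "requested MPI_THREAD_MULTIPLE, got level %d\n",
                provided);
    }

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}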

On Sep 6, 2011, at 1:20 PM, Simone Pellegrini wrote:

> On 09/06/2011 04:58 PM, Ralph Castain wrote:
>> On Sep 6, 2011, at 12:49 PM, Simone Pellegrini wrote:
>>
>>> On 09/06/2011 02:57 PM, Ralph Castain wrote:
>>>> Hi Simone
>>>>
>>>> Just to clarify: is your application threaded? Could you please send the OMPI configure cmd you used?
>>> yes, it is threaded. There are basically 3 threads: one for outgoing messages (MPI_Send), one for incoming messages (MPI_Iprobe / MPI_Recv), and one for spawning.
>>>
>>> I am not sure what you mean by the OMPI configure cmd I used... I simply do mpirun --np 1 ./executable
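
Below is a minimal sketch of the receive thread in a layout like the one described above, with every MPI call funneled through one mutex so the send and spawn threads never enter MPI concurrently. This is not Simone's actual code; the buffer size, shutdown flag, and message handling are illustrative assumptions.

#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int shutting_down = 0;    /* set by the main thread on exit */

void *recv_thread(void *arg)
{
    char buf[4096];                       /* assumes messages fit in 4 KB */
    (void)arg;

    while (!shutting_down) {
        int flag = 0;
        MPI_Status status;

        pthread_mutex_lock(&mpi_lock);
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            int count;
            MPI_Get_count(&status, MPI_CHAR, &count);
            MPI_Recv(buf, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        pthread_mutex_unlock(&mpi_lock);

        /* process the received message outside the lock ... */
    }
    return NULL;
}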
>> How was OMPI configured when it was installed? If you didn't install it, then provide the output of ompi_info - it will tell us.
> [@arch-moto tasksys]$ ompi_info
> Package: Open MPI nobody_at_alderaan Distribution
> Open MPI: 1.5.3
> Open MPI SVN revision: r24532
> Open MPI release date: Mar 16, 2011
> Open RTE: 1.5.3
> Open RTE SVN revision: r24532
> Open RTE release date: Mar 16, 2011
> OPAL: 1.5.3
> OPAL SVN revision: r24532
> OPAL release date: Mar 16, 2011
> Ident string: 1.5.3
> Prefix: /usr
> Configured architecture: x86_64-unknown-linux-gnu
> Configure host: alderaan
> Configured by: nobody
> Configured on: Thu Jul 7 13:21:35 UTC 2011
> Configure host: alderaan
> Built by: nobody
> Built on: Thu Jul 7 13:27:08 UTC 2011
> Built host: alderaan
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C compiler family name: GNU
> C compiler version: 4.6.1
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: /usr/bin/gfortran
> Fortran90 compiler abs:
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: yes, progress: no)
> Sparse Groups: no
> Internal debug support: yes
> MPI interface warnings: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
> Symbol vis. support: yes
> MPI extensions: affinity example
> FT Checkpoint support: no (checkpoint thread: no)
> MPI_MAX_PROCESSOR_NAME: 256
> MPI_MAX_ERROR_STRING: 256
> MPI_MAX_OBJECT_NAME: 64
> MPI_MAX_INFO_KEY: 36
> MPI_MAX_INFO_VAL: 256
> MPI_MAX_PORT_NAME: 1024
> MPI_MAX_DATAREP_STRING: 128
> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.5.3)
> MCA memory: linux (MCA v2.0, API v2.0, Component v1.5.3)
> MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.3)
> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
> MCA carto: file (MCA v2.0, API v2.0, Component v1.5.3)
> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
> MCA timer: linux (MCA v2.0, API v2.0, Component v1.5.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: inter (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.5.3)
> MCA io: romio (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: bfo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: csum (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: v (MCA v2.0, API v2.0, Component v1.5.3)
> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.3)
> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.3)
> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA odls: default (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ras: cm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rml: oob (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: cm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: direct (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: linear (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: radix (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: slave (MCA v2.0, API v2.0, Component v1.5.3)
> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.5.3)
> MCA plm: rshd (MCA v2.0, API v2.0, Component v1.5.3)
> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.5.3)
> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: env (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: slave (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: tool (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.5.3)
> MCA notifier: command (MCA v2.0, API v1.0, Component v1.5.3)
> MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.5.3)
>
>
>>
>>>> Adding the debug flags just changes the race condition. Interestingly, those values only impact the behavior of mpirun, so it looks like the race condition is occurring there.
>>> The problem is that the error is totally nondeterministic. Sometimes it happens, sometimes it does not, but the error message gives me no clue where the error is coming from. Is it a problem with my code or internal to MPI?
>> Can't tell, but it is likely an effect of the threading. Race conditions within threaded environments are common, and OMPI isn't particularly thread safe, especially when it comes to comm_spawn.
>>
>>> cheers, Simone
>>>>
>>>> On Sep 6, 2011, at 3:01 AM, Simone Pellegrini wrote:
>>>>
>>>>> Dear all,
>>>>> I am developing an MPI application which heavily uses MPI_Comm_spawn. Usually everything works fine for the first hundred spawns, but after a while the application exits with a curious message:
>>>>>
>>>>> [arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/grpcomm_base_modex.c at line 349
>>>>> [arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_bad_module.c at line 518
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>> likely to abort. There are many reasons that a parallel process can
>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>> problems. This failure appears to be an internal failure; here's some
>>>>> additional information (which may only be relevant to an Open MPI
>>>>> developer):
>>>>>
>>>>> ompi_proc_set_arch failed
>>>>> --> Returned "Data unpack would read past end of buffer" (-26) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [arch-top:27712] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>>> [arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/grpcomm_base_modex.c at line 349
>>>>> [arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_bad_module.c at line 518
>>>>> *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [arch-top:27714] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>>> [arch-top:27226] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>>> [arch-top:27226] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>
>>>>> Also, using MPI_Init instead of MPI_Init_thread does not help; the same error occurs.
>>>>>
>>>>> Strangely, the error does not occur if I run the code with debugging enabled (-mca plm_base_verbose 5 -mca rmaps_base_verbose 5).
>>>>>
>>>>> I am using Open MPI 1.5.3.
>>>>>
>>>>> cheers, Simone
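
For context, the usage pattern described above (repeated spawning of short-lived workers) boils down to a parent loop roughly like the sketch below. This is not the original program; the worker executable name, spawn count, and token exchange are hypothetical, and the worker side (MPI_Comm_get_parent, receive, disconnect) is omitted.

#include <mpi.h>

int main(int argc, char **argv)
{
    int i;

    MPI_Init(&argc, &argv);               /* simplified: single-threaded parent */

    for (i = 0; i < 200; i++) {           /* "works for the first hundred spawns" */
        MPI_Comm child;
        int token = i;

        /* Spawn one worker and get an intercommunicator to it. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

        /* Send a token to rank 0 of the spawned group. */
        MPI_Send(&token, 1, MPI_INT, 0, 0, child);

        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}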