
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] process kill signal 59
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-30 15:04:41


On Oct 30, 2012, at 11:55 AM, Sandra Guija <sguija_at_[hidden]> wrote:

> I am able to change the memory size parameters, so if I increase memory size (currently 2 gb) or add caches, it could be a solution?

Could be.

> or is the program that is using too much memory?

Hard to tell. In the case you show, we are aborting because we don't see enough memory to support the shared memory system. You can adjust that size by setting the MCA params for shared memory - see "ompi_info --param btl sm".
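[For anyone reading this in the archive: the two knobs mentioned in this thread look like the following. This is a sketch; exact parameter names differ between Open MPI releases, so trust the ompi_info output on your own install, not the placeholder below.]

```shell
# List the tunable MCA parameters of the shared-memory (sm) BTL:
ompi_info --param btl sm

# Any parameter reported there can be set on the mpirun command line, e.g.
# (the parameter name here is a placeholder -- use one listed by ompi_info):
#   mpirun -np 4 --hostfile nodes -mca <sm_param_name> <value> magic10000

# Alternatively, disable the sm BTL entirely, as suggested later in the thread:
mpirun -np 4 --hostfile nodes --bynode -mca btl ^sm magic10000
```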

On the other hand, your program is clearly huge. 10k x 10k = 100M entries, so you are using close to a Gbyte (assuming doubles) just to store the array in one process.
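[To make that arithmetic explicit — a quick sketch using shell arithmetic; the 10,000 x 10,000 figure and the 2026 MB of RAM are from this thread:]

```shell
# One dense 10,000 x 10,000 matrix of 8-byte doubles:
echo $(( 10000 * 10000 * 8 ))                 # 800000000 bytes
echo $(( 10000 * 10000 * 8 / 1024 / 1024 ))   # 762 MiB per matrix, per process
```

A matrix multiplication typically holds three such matrices (two inputs and the result), so even a single process needs well over 2 GiB — more than the 2026 MB this machine reports.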

>
> Thanks a lot for your input, I appreciate it.
>
> Sandra Guija
>
> From: rhc_at_[hidden]
> Date: Tue, 30 Oct 2012 11:50:28 -0700
> To: devel_at_[hidden]
> Subject: Re: [OMPI devel] process kill signal 59
>
> Yeah, you're using too much memory for the shared memory system. Run with -mca btl ^sm on your cmd line - it'll run slower, but you probably don't have a choice.
>
>
> On Oct 30, 2012, at 11:38 AM, Sandra Guija <sguija_at_[hidden]> wrote:
>
> Yes, I think it is related to my program too: when I run a 1000x1000 matrix multiplication, the program works.
> When I run the 10,000x10,000 matrix on just one machine, I get this:
> mca_common_sm_mmap_init: mmap failed with errno=12
> mca_mpool_sm_init: unable to shared memory mapping ( /tmp/openmpi-sessions-mpiu_at_tango_0/default-universe-1529/1/shared_mem_pool .tango)
> mca_common_sm_mmap_init: /tmp/openmpi-sessions-mpiu_at_tango_0/default-universe-1529/1/shared_mem_pool .tango failed with errno=2
> mca_mpool_sm_init: unable to shared memory mapping ( /tmp/openmpi-sessions-mpiu_at_tango_0/default-universe-1529/1/shared_mem_pool .tango)
> PML add procs failed
> --> Returned "Out of resource" (-2) instead of "Success" (0)
>
> This is the result when I run free -m:
>              total       used       free     shared    buffers     cached
> Mem:          2026         54       1972          0          6         25
> -/+ buffers/cache:          22        511
> Swap:          511          0        511
>
> Sandra Guija
>
> From: rhc_at_[hidden]
> Date: Tue, 30 Oct 2012 10:33:02 -0700
> To: devel_at_[hidden]
> Subject: Re: [OMPI devel] process kill signal 59
>
> Ummm...not sure what I can say about that with so little info. It looks like your process died for some reason that has nothing to do with us - a bug in your "magic10000" program?
>
>
> On Oct 30, 2012, at 10:24 AM, Sandra Guija <sguija_at_[hidden]> wrote:
>
> Hello,
> I am running a 10,000x10,000 matrix multiplication in 4 processors/1 core and I get the following error:
> mpirun -np 4 --hostfile nodes --bynode magic10000
>
> mpirun noticed that job rank 1 with PID 635 on node slave1 exited on signal 59 (Real-time signal 25).
> 2 additional processes aborted (not shown)
> 1 process killed (possibly by Open MPI)
>
> The hostfile ("nodes") contains:
> master
> slave1
> slave2
> slave3
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>