On Oct 30, 2012, at 11:55 AM, Sandra Guija <sguija@hotmail.com> wrote:

I am able to change the memory size parameters, so if I increase the memory size (currently 2 GB) or add caches, could that be a solution?

could be

or is it the program that is using too much memory?

Hard to tell. In the case you show, we are aborting because we don't see enough memory to support the shared memory system. You can adjust that size by setting the MCA params for shared memory - see "ompi_info --param btl sm".
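As a sketch of what that looks like in practice (the exact parameter names vary between Open MPI versions, so treat the ones below as illustrative and verify them against your own `ompi_info` output first):

```shell
# List the tunable parameters of the shared-memory BTL and mpool:
ompi_info --param btl sm
ompi_info --param mpool sm

# Hypothetical example: shrink the backing file the sm mpool creates so it
# fits in available memory. Confirm these parameter names exist in the
# ompi_info output above before using them.
mpirun --mca mpool_sm_min_size 67108864 --mca mpool_sm_max_size 134217728 \
       -np 4 --hostfile nodes --bynode magic10000
```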

On the other hand, your program is clearly huge. 10k x 10k = 100M entries, so you are using close to a Gbyte (assuming doubles) just to store the array in one process.
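A back-of-the-envelope check of that claim (assuming the program keeps the two dense operands plus the result resident as arrays of 8-byte doubles; the "three matrices" framing is an assumption about a naive implementation, not something stated in the thread):

```shell
# One 10,000 x 10,000 matrix of 8-byte doubles:
bytes_per_matrix=$((10000 * 10000 * 8))
echo "$bytes_per_matrix bytes per matrix"     # 800000000 bytes, ~0.75 GiB

# A naive C = A * B needs three such matrices resident at once:
echo "$((3 * bytes_per_matrix)) bytes total"  # 2400000000 bytes, ~2.2 GiB
```

On a 2 GB node that leaves nothing for the OS, the MPI layer, or the shared-memory backing file, which is consistent with the failure shown below.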



Thanks for your input, I really appreciate it.

Sandra Guija


From: rhc@open-mpi.org
Date: Tue, 30 Oct 2012 11:50:28 -0700
To: devel@open-mpi.org
Subject: Re: [OMPI devel] process kill signal 59

Yeah, you're using too much memory for the shared memory system. Run with -mca btl ^sm on your cmd line - it'll run slower, but you probably don't have a choice.
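Applied to the command from later in this thread, that would look like the following (a sketch: `^sm` excludes the shared-memory BTL, so on-node traffic falls back to a slower transport such as TCP):

```shell
# Same job as before, but with the shared-memory BTL disabled:
mpirun -mca btl ^sm -np 4 --hostfile nodes --bynode magic10000
```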


On Oct 30, 2012, at 11:38 AM, Sandra Guija <sguija@hotmail.com> wrote:

Yes, I think it is related to my program too. When I run a 1000x1000 matrix multiplication, the program works.
When I run the 10,000x10,000 matrix on only one machine, I get this:
mca_common_sm_mmap_init: mmap failed with errno=12
mca_mpool_sm_init: unable to create shared memory mapping (/tmp/openmpi-sessions-mpiu@tango_0/default-universe-1529/1/shared_mem_pool.tango)
mca_common_sm_mmap_init: /tmp/openmpi-sessions-mpiu@tango_0/default-universe-1529/1/shared_mem_pool.tango failed with errno=2
mca_mpool_sm_init: unable to create shared memory mapping (/tmp/openmpi-sessions-mpiu@tango_0/default-universe-1529/1/shared_mem_pool.tango)
PML add procs failed
--> Returned "Out of resource" (-2) instead of "Success" (0)
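For reference, the errno values in that log are standard POSIX error codes; they can be decoded like this (a quick sketch using Python's errno table, nothing Open MPI-specific):

```shell
# Decode errno 12 and errno 2 from the log above:
python3 -c 'import errno, os
for e in (12, 2):
    print(e, errno.errorcode[e], "-", os.strerror(e))'
# 12 ENOMEM - Cannot allocate memory
# 2 ENOENT - No such file or directory
```

So the first failure is the mmap of the shared-memory file running out of memory (ENOMEM); the later ENOENT is likely just follow-on cleanup not finding the file it failed to create.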

this is the result when I run free -m
                    total   used   free   shared   buffers   cached
Mem:                 2026     54   1972        0         6       25
-/+ buffers/cache:     22    511
Swap:                 511      0    511
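For what it's worth, the `-/+ buffers/cache` row of `free -m` reports `used` minus buffers and page cache, i.e. the memory actually held by applications. With the numbers above (the pasted row's values look slightly garbled in transcription, so this is a sanity check, not a correction):

```shell
# used - buffers - cached = memory held by applications, in MB,
# using the values from the free -m output above:
echo $((54 - 6 - 25))    # 23
```

So almost all of the 2 GB is free before the job starts; the shortage only appears once the processes allocate their matrices and the sm backing file.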

Sandra Guija


From: rhc@open-mpi.org
Date: Tue, 30 Oct 2012 10:33:02 -0700
To: devel@open-mpi.org
Subject: Re: [OMPI devel] process kill signal 59

Ummm...not sure what I can say about that with so little info. It looks like your process died for some reason that has nothing to do with us - a bug in your "magic10000" program?


On Oct 30, 2012, at 10:24 AM, Sandra Guija <sguija@hotmail.com> wrote:

Hello, 
I am running a 10,000x10,000 matrix multiplication on 4 processors (1 core each) and I get the following error:
mpirun -np 4 --hostfile nodes --bynode magic10000

mpirun noticed that job rank 1 with PID 635 on node slave1 exited on signal 59 (Real-time signal 25).
2 additional processes aborted (not shown)
1 process killed (possibly by Open MPI)

The hostfile (nodes) contains:
master
slave1
slave2
slave3


_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

