Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] shared memory (sm) module not working properly?
From: Nicolas Bock (nicolasbock_at_[hidden])
Date: 2010-01-19 17:41:03


Thanks, that explains it :)
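
(For the archives: the practical workaround, as with the 16-process case further
down in this thread, is to also allow a BTL that can reach the spawned job, e.g.
something like

  mpirun -mca btl self,sm,tcp -np 1 ./main

so that tcp can carry the parent/child traffic; presumably sm is still used
within each job.)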

On Tue, Jan 19, 2010 at 15:01, Ralph Castain <rhc_at_[hidden]> wrote:

> Shared memory doesn't extend between comm_spawn'd parent/child processes in
> Open MPI. Perhaps someday it will, but not yet.
>
>
> On Jan 19, 2010, at 1:14 PM, Nicolas Bock wrote:
>
> Hello list,
>
> I think I understand better now what's happening, although I still don't
> know why. I have attached two small C codes that demonstrate the problem.
> The code in main.c uses MPI_Comm_spawn() to start the code in the second
> source, child.c. I can force the issue by running the main.c code with
>
> mpirun -mca btl self,sm -np 1 ./main
>
> and get this error:
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[26121,2],0]) is on host: mujo
> Process 2 ([[26121,1],0]) is on host: mujo
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
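>
> (The attached sources don't show up inline here, so here is roughly what the
> two files boil down to: a simplified sketch rather than the exact attachments,
> with a single illustrative send/receive.)
>
>   /* main.c: spawn one copy of ./child and send it a token. */
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Comm child;
>       int token = 42;
>
>       MPI_Init(&argc, &argv);
>       /* With -mca btl self,sm, no BTL can connect the parent job to the
>          spawned child job, hence the "unable to reach each other" abort. */
>       MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
>                      MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
>       MPI_Send(&token, 1, MPI_INT, 0, 0, child);
>       MPI_Finalize();
>       return 0;
>   }
>
>   /* child.c: receive the token from the parent over the parent intercomm. */
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Comm parent;
>       int token;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_get_parent(&parent);
>       MPI_Recv(&token, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
>       printf("child received %d from parent\n", token);
>       MPI_Finalize();
>       return 0;
>   }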
>
> Is that because the spawned process is in a different group? They are still
> all running on the same host, so at least in principle they should be able
> to communicate with each other via shared memory.
>
> nick
>
>
>
> On Fri, Jan 15, 2010 at 16:08, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
>
>> Dunno. Do lower np values succeed? If so, at what value of np does the
>> job no longer start?
>>
>> Perhaps it's having a hard time creating the shared-memory backing file in
>> /tmp. I think this is a 64-Mbyte file. If this is the case, try reducing
>> the size of the shared area per this FAQ item:
>> http://www.open-mpi.org/faq/?category=sm#decrease-sm
>> Most notably, reduce mpool_sm_min_size below 67108864.
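>>
>> For example, something along these lines should shrink the backing file
>> (the value here, roughly half the 64-MB default, is just an illustration
>> of the MCA syntax, not a tuned recommendation):
>>
>>   mpirun -np 16 -mca btl self,sm -mca mpool_sm_min_size 33554432 job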
>>
>> Also note trac ticket 2043, which describes problems with the sm BTL
>> exposed by GCC 4.4.x compilers. You need to get a sufficiently recent build
>> to solve this. But, those problems don't occur until you start passing
>> messages, and here you're not even starting up.
>>
>>
>> Nicolas Bock wrote:
>>
>> Sorry, I forgot to give more details on what versions I am using:
>>
>> OpenMPI 1.4
>> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
>> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>>
>> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock <nicolasbock_at_[hidden]> wrote:
>>
>>> Hello list,
>>>
>>> I am running a job on a machine with four quad-core AMD Opterons. This
>>> machine has 16 cores, which I can verify by looking at /proc/cpuinfo.
>>> However, when I run a job with
>>>
>>> mpirun -np 16 -mca btl self,sm job
>>>
>>> I get this error:
>>>
>>>
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[56972,2],0]) is on host: rust
>>> Process 2 ([[56972,1],0]) is on host: rust
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> By adding the tcp btl I can run the job. I don't understand why Open MPI
>>> claims that a pair of processes cannot reach each other; after all, every
>>> processor core should have access to all of the memory. Do I need to set
>>> some other btl limit?
>>>
>>
>
> <main.c><child.c>
>