Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] v1.3: mca_common_sm_mmap_init error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-03 09:31:27


On Feb 2, 2009, at 4:48 PM, Prentice Bisbal wrote:

> No. I was running just a simple "Hello, world" program to test v1.3
> when these errors occurred. And as soon as I reverted to 1.2.8, the
> errors disappeared.

FWIW, OMPI allocates shared memory based on the number of peers on the
host. This allocation happens during MPI_INIT, not during the first
MPI_SEND/MPI_RECV call. So even if you're running "hello world", you
could still be running out of shared memory space.
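
To make that concrete: even a minimal program like the sketch below
(just an illustration, not Prentice's actual code) exercises the sm
setup, because the mmap'ed backing file under the session directory is
created inside MPI_Init, before any message is posted.

  /* Minimal illustration: Open MPI sets up its shared-memory backing
     file during MPI_Init, so even a do-nothing "hello world" can hit
     the error above. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);               /* sm file created here */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello, world from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }

If you want to rule the sm component in or out, something like
"mpirun --mca btl ^sm ..." (which excludes sm entirely) plus a quick
check of the free space under /tmp on each node should tell you
whether the backing file is the culprit.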

> Interestingly enough, I just upgraded my cluster to PU_IAS 5.3, and
> now I can't reproduce the problem, but HPL fails with a segfault,
> which I'll report in a separate e-mail to start a new thread for
> that problem.
>
> --
> Prentice
>
> Jeff Squyres wrote:
>> Could the nodes be running out of shared memory and/or temp
>> filesystem space?
>>
>>
>> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
>>
>>>
>>> I have not seen this before. I assume that for some reason, the
>>> shared memory transport layer cannot create the file it uses for
>>> communicating within a node. Open MPI then selects some other
>>> transport (TCP, openib) to communicate within the node, so the
>>> program runs fine.
>>>
>>> The code has not changed that much from 1.2 to 1.3, but it is a
>>> little different. Let me see if I can reproduce the problem.
>>>
>>> Rolf
>>>
>>> Mostyn Lewis wrote:
>>>> Sort of ditto but with SVN release at 20123 (and earlier):
>>>>
>>>> e.g.
>>>>
>>>> [r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
>>>> [r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
>>>> [r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
>>>> [r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
>>>> [r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
>>>> [r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
>>>> [r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
>>>> [r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
>>>> [r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
>>>> [r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2
>>>>
>>>> File not found in sm.
>>>>
>>>> 10 of them across 32 nodes (8 cores per node, 2 sockets x quad-core).
>>>> "Apparently harmless"?
>>>>
>>>> DM
>>>>
>>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>>
>>>>> I just installed OpenMPI 1.3 with tight integration for SGE.
>>>>> Version 1.2.8 was working just fine for several months in the
>>>>> same arrangement.
>>>>>
>>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>>> standard error file:
>>>>>
>>>>> mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
>>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
>>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
>>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
>>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
>>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2
>>>>>
>>>>> I've tested 3-4 different times, and the number of hosts that
>>>>> produces this error varies, as well as which hosts produce it.
>>>>> My program seems to run fine, but it's just a simple
>>>>> "Hello, World!" program. Any ideas? Is this a bug in 1.3?
>>>>>
>

-- 
Jeff Squyres
Cisco Systems