Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] v1.3: mca_common_sm_mmap_init error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-01 06:43:14


Could the nodes be running out of shared memory and/or temp filesystem
space?
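[One quick check along these lines (a hypothetical diagnostic sketch, not from the thread; errno=2 in the messages below is ENOENT):]

```shell
# Free space in the usual temp and shared-memory locations; the sm
# backing file in these reports lives under /tmp/<job>/openmpi-sessions-*.
df -h /tmp /dev/shm 2>/dev/null

# Decode errno=2 from the error messages:
python3 -c "import os; print(os.strerror(2))"
# prints: No such file or directory
```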

On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:

>
> I have not seen this before. I assume that for some reason, the
> shared memory transport layer cannot create the file it uses for
> communicating within a node. Open MPI then selects some other
> transport (TCP, openib) to communicate within the node so the
> program runs fine.
>
> The code has not changed that much from 1.2 to 1.3, but it is a
> little different. Let me see if I can reproduce the problem.
>
> Rolf
>
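[A rough way to test the fallback theory above (command sketch; the binary name ./hello and process count are placeholders -- `--mca btl ^sm` is the standard Open MPI syntax for excluding a BTL component):]

```shell
# While the job runs, check on a suspect node that the session directory
# actually exists (path pattern taken from the logs below):
ls -l /tmp/*/openmpi-sessions-*/ 2>/dev/null

# Rerun with the sm BTL excluded; if the errors disappear and the job
# still runs, intra-node traffic is indeed falling back to TCP/openib:
mpirun --mca btl ^sm -np 8 ./hello
```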
> Mostyn Lewis wrote:
>> Sort of ditto but with SVN release at 20123 (and earlier):
>>
>> e.g.
>>
>> [r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_46_0/25682/1/shared_mem_pool.r2250_46
>> failed with errno=2
>> [r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_63_0/25682/1/shared_mem_pool.r2250_63
>> failed with errno=2
>> [r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_57_0/25682/1/shared_mem_pool.r2250_57
>> failed with errno=2
>> [r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_68_0/25682/1/shared_mem_pool.r2250_68
>> failed with errno=2
>> [r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_50_0/25682/1/shared_mem_pool.r2250_50
>> failed with errno=2
>> [r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_49_0/25682/1/shared_mem_pool.r2250_49
>> failed with errno=2
>> [r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_66_0/25682/1/shared_mem_pool.r2250_66
>> failed with errno=2
>> [r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_58_0/25682/1/shared_mem_pool.r2250_58
>> failed with errno=2
>> [r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_69_0/25682/1/shared_mem_pool.r2250_69
>> failed with errno=2
>> [r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/
>> openmpi-sessions-mostyn_at_r2250_60_0/25682/1/shared_mem_pool.r2250_60
>> failed with errno=2
>>
>> File not found in sm (errno=2 is ENOENT).
>>
>> 10 such errors across 32 nodes (8 cores per node: 2 sockets x quad-core).
>> "Apparently harmless"?
>>
>> DM
>>
>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>
>>> I just installed OpenMPI 1.3 with tight integration for SGE. Version
>>> 1.2.8 was working just fine for several months in the same
>>> arrangement.
>>>
>>> Now that I've upgraded to 1.3, I get the following errors in my
>>> standard
>>> error file:
>>>
>>> mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-
>>> prent
>>> ice_at_node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed
>>> with
>>> errno=2
>>> [node23.aurora:20601] mca_common_sm_mmap_init: open
>>> /tmp/968.1.all.q/openmpi-sessions-prent
>>> ice_at_node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed
>>> with
>>> errno=2
>>> [node46.aurora:12118] mca_common_sm_mmap_init: open
>>> /tmp/968.1.all.q/openmpi-sessions-prent
>>> ice_at_node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed
>>> with
>>> errno=2
>>> [node15.aurora:12421] mca_common_sm_mmap_init: open
>>> /tmp/968.1.all.q/openmpi-sessions-prent
>>> ice_at_node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed
>>> with
>>> errno=2
>>> [node20.aurora:12534] mca_common_sm_mmap_init: open
>>> /tmp/968.1.all.q/openmpi-sessions-prent
>>> ice_at_node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed
>>> with
>>> errno=2
>>> [node16.aurora:12573] mca_common_sm_mmap_init: open
>>> /tmp/968.1.all.q/openmpi-sessions-prent
>>> ice_at_node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed
>>> with
>>> errno=2
>>>
>>> I've tested 3-4 different times; the number of hosts that produce
>>> this error varies, as does which hosts produce it. My program
>>> seems to run fine, but it's just a simple "Hello, World!" program.
>>> Any ideas? Is this a bug in 1.3?
>>>
>>>
>>> -- Prentice
>>> --
>>> Prentice Bisbal
>>> Linux Software Support Specialist/System Administrator
>>> School of Natural Sciences
>>> Institute for Advanced Study
>>> Princeton, NJ
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>

-- 
Jeff Squyres
Cisco Systems