Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] v1.3: mca_common_sm_mmap_init error
From: Reuti (reuti_at_[hidden])
Date: 2009-02-02 03:42:29


On 01.02.2009 at 12:43, Jeff Squyres wrote:

> Could the nodes be running out of shared memory and/or temp
> filesystem space?

I still see this issue, and it happens only from time to time. But
even though SGE's qrsh is used automatically, the more severe problem
is that on the slave nodes the orted daemons are pushed into
daemonland (note the parent PID of 1 below) and are no longer bound
to the sge_shepherd:

  3173 1 /usr/sge/bin/lx24-x86/sge_execd
  3431 1 orted --daemonize -mca ess env -mca orte_ess_jobid 81199104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 811
  3432 3431 \_ /home/reuti/mpihello
  3433 3431 \_ /home/reuti/mpihello
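
To rule out Jeff's suggestion about running out of space, the slave
nodes can be checked while a job is running; the same kind of listing
also shows whether orted still hangs below the sge_shepherd. This is
only a sketch, assuming standard Linux tools and a tmpfs mounted on
/dev/shm:

  # free space in the temp filesystem and the shared memory filesystem
  df -h /tmp /dev/shm

  # the PPID column shows whether orted is still a child of
  # sge_shepherd or has been reparented to init (PID 1)
  ps -eo pid,ppid,args | grep -E 'sge_shepherd|orted|mpihello'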

-- Reuti

>
> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
>
>>
>> I have not seen this before. I assume that for some reason, the
>> shared memory transport layer cannot create the file it uses for
>> communicating within a node. Open MPI then selects some other
>> transport (TCP, openib) to communicate within the node so the
>> program runs fine.
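>>
>> As a quick check (only a sketch; ./mpihello and -np 8 are just
>> placeholders for the real job), the sm BTL can be excluded
>> explicitly. If the warnings disappear and the job still runs, the
>> fallback to another transport is indeed what is happening:
>>
>>   # run with the shared memory BTL excluded; on-node traffic then
>>   # goes over one of the remaining transports (tcp, openib, ...)
>>   mpirun --mca btl ^sm -np 8 ./mpihello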
>>
>> The code has not changed that much from 1.2 to 1.3, but it is a
>> little different. Let me see if I can reproduce the problem.
>>
>> Rolf
>>
>> Mostyn Lewis wrote:
>>> Sort of ditto, but with an SVN checkout at revision 20123 (and earlier):
>>>
>>> e.g.
>>>
>>> [r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
>>> [r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
>>> [r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
>>> [r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
>>> [r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
>>> [r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
>>> [r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
>>> [r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
>>> [r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
>>> [r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn_at_r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2
>>>
>>> File not found in sm (errno=2 is ENOENT).
>>>
>>> 10 of them across 32 nodes (8 cores per node (2 sockets x quad-core))
>>> "Apparently harmless"?
>>>
>>> DM
>>>
>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>
>>>> I just installed Open MPI 1.3 with tight integration for SGE.
>>>> Version 1.2.8 was working just fine for several months in the
>>>> same arrangement.
>>>>
>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>> standard error file:
>>>>
>>>> mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice_at_node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2
>>>>
>>>> I've tested 3-4 different times, and both the number of hosts
>>>> that produce this error and which hosts produce it vary. My
>>>> program seems to run fine, but it's just a simple "Hello, World!"
>>>> program. Any ideas? Is this a bug in 1.3?
>>>>
>>>>
>>>> -- Prentice
>>>> --
>>>> Prentice Bisbal
>>>> Linux Software Support Specialist/System Administrator
>>>> School of Natural Sciences
>>>> Institute for Advanced Study
>>>> Princeton, NJ
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>