Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Tmpdir work for first process only
From: Clement Kam Man Chu (Clement.Chu_at_[hidden])
Date: 2007-11-15 10:04:57


Jeff Squyres wrote:

Thanks for your reply. I am using the PBS job scheduler and I requested 16
CPUs to run 400 processes, but I don't know how many processes are allocated
to each CPU. Do you think that is the problem?

Clement
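
For reference, 400 processes on 16 CPUs averages out to 25 processes per CPU. A rough way to see what PBS actually granted is sketched below. It assumes $PBS_NODEFILE lists one hostname per granted slot, which is common but depends on the PBS setup, and the function names are made up for illustration. It only counts slots; it does not show how mpirun maps ranks onto them.

```python
import os
from collections import Counter


def slots_per_host(nodefile_lines):
    """Count how many slots (one non-empty line each) were granted per host."""
    return Counter(line.strip() for line in nodefile_lines if line.strip())


def average_procs_per_slot(num_procs, slots):
    """Average number of MPI processes that must share each granted slot."""
    total = sum(slots.values())
    if total == 0:
        raise ValueError("nodefile lists no slots")
    return num_procs / total


if __name__ == "__main__":
    # Assumption: PBS sets PBS_NODEFILE inside the job environment.
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        raise SystemExit("PBS_NODEFILE is not set; run this inside a PBS job")
    with open(path) as fh:
        slots = slots_per_host(fh)
    for host, count in sorted(slots.items()):
        print(f"{host}: {count} slot(s)")
    ratio = average_procs_per_slot(400, slots)
    print(f"average processes per slot for -np 400: {ratio:.1f}")
```

If every slot turns out to be on one host, all 400 processes (and the pipes set up for them) land on that single machine.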
> Are you running all of these processes on the same machine, or
> multiple different machines?
>
> If you're running 400 processes on the same machine, it may well be
> that you are simply running out of memory or other OS resources. In
> particular, I've never seen iof fail that way before (iof is our I/O
> forwarding subsystem).
>
> Looking at the iof code, the error you're seeing occurs when iof is
> trying to create a pipe between our OMPI "helper daemon" and the newly
> spawned user executable and fails. The only reason that I can guess
> for why that would happen is if the maximum number of pipes has already
> been created on the machine and the OS refuses to create any more...?
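
One way to probe the pipe-limit guess on the node is a small script like the one below. It is only a sketch and is not part of Open MPI; the helper name count_creatable_pipes and its cap argument are made up for illustration. It prints the per-process open-file limit and counts how many pipes a single process can open before the OS refuses.

```python
import errno
import os
import resource


def count_creatable_pipes(cap=100_000):
    """Open pipes until the OS refuses (or cap is reached), close them all,
    and return how many pipes were created."""
    fds = []
    try:
        while len(fds) // 2 < cap:
            try:
                fds.extend(os.pipe())  # each pipe consumes two descriptors
            except OSError as exc:
                # EMFILE: per-process limit hit; ENFILE: system-wide table full.
                if exc.errno in (errno.EMFILE, errno.ENFILE):
                    break
                raise
    finally:
        for fd in fds:
            os.close(fd)
    return len(fds) // 2


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft} hard={hard}")
    print(f"pipes this process could create: {count_creatable_pipes()}")
```

Each pipe uses two file descriptors, so a daemon that keeps pipe ends open for every child it spawns may reach the soft limit well before 400 children. If the number printed is small, raising the limit (for example with ulimit -n in the job script) would be the first thing to try.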
>
>
>
> On Nov 14, 2007, at 9:36 PM, Clement Kam Man Chu wrote:
>
>
>> Hi,
>>
>> I have figured out why the tmpdir parameter works only for the first
>> process. I hit another problem when I tried to run 400 processes (no
>> problem with fewer than 400 processes). I got the error
>> "ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at
>> line 106". I have attached the output below:
>>
>> [ac27:12442] [0,0,0] setting up session dir with
>> [ac27:12442] tmpdir /jobfs/z07/247752.ac-pbs
>> [ac27:12442] universe default-universe-12442
>> [ac27:12442] user kxc565
>> [ac27:12442] host ac27
>> [ac27:12442] jobid 0
>> [ac27:12442] procid 0
>> [ac27:12442] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442/0/0
>> [ac27:12442] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442/0
>> [ac27:12442] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442
>> [ac27:12442] top: openmpi-sessions-kxc565_at_ac27_0
>> [ac27:12442] tmp: ??
>> [ac27:12442] [0,0,0] contact_file /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442/universe-setup.txt
>> [ac27:12442] [0,0,0] wrote setup file
>> [ac27:12447] [0,0,1] setting up session dir with
>> [ac27:12447] universe default-universe-12442
>> [ac27:12447] user kxc565
>> [ac27:12447] host ac27
>> [ac27:12447] jobid 0
>> [ac27:12447] procid 1
>> [ac27:12447] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442/0/1
>> [ac27:12447] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442/0
>> [ac27:12447] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565_at_ac27_0/default-universe-12442
>> [ac27:12447] top: openmpi-sessions-kxc565_at_ac27_0
>> [ac27:12447] tmp: /jobfs/z07/247752.ac-pbs
>> [ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
>> [ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
>> [ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
>> [ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
>> [ac27:12442] spawn: in job_state_callback(jobid = 1, state = 0x80)
>> mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
>> [ac27:12447] sess_dir_finalize: job session dir not empty - leaving
>> [ac27:12447] sess_dir_finalize: proc session dir not empty - leaving
>> [ac27:12442] sess_dir_finalize: proc session dir not empty - leaving
>>
>>
>> Thanks,
>> Clement
>>
>> Clement Kam Man Chu wrote:
>>
>>> Hi,
>>>
>>> I am using Open MPI 1.2.3 on an ia64 machine. I typed
>>> "mpirun -d --tmpdir /home/565/kxc565/tmpdir -mca btl sm -np 400 ./testprogram".
>>> I found that only the first process uses my parameter setting to store
>>> its tmp files; the second process uses the default setting and stores
>>> its tmp files in the /tmp directory. How can I make all processes store
>>> their tmp files in the directory I specify? I have attached the output
>>> from Open MPI for more details.
>>> Thanks for any help.
>>>
>>> Cheers,
>>> Clement
>>>
>>>
>>> [ac27:27928] [0,0,0] setting up session dir with
>>> [ac27:27928] tmpdir /home/565/kxc565/tmpdir
>>> [ac27:27928] universe default-universe-27928
>>> [ac27:27928] user kxc565
>>> [ac27:27928] host ac27
>>> [ac27:27928] jobid 0
>>> [ac27:27928] procid 0
>>> [ac27:27928] procdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928/0/0
>>> [ac27:27928] jobdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928/0
>>> [ac27:27928] unidir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928
>>> [ac27:27928] top: openmpi-sessions-kxc565_at_ac27_0
>>> [ac27:27928] tmp: ?
>>> [ac27:27928] [0,0,0] contact_file /home/565/kxc565/tmpdir/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928/universe-setup.txt
>>> [ac27:27928] [0,0,0] wrote setup file
>>> [ac27:27932] [0,0,1] setting up session dir with
>>> [ac27:27932] universe default-universe-27928
>>> [ac27:27932] user kxc565
>>> [ac27:27932] host ac27
>>> [ac27:27932] jobid 0
>>> [ac27:27932] procid 1
>>> [ac27:27932] procdir: /tmp/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928/0/1
>>> [ac27:27932] jobdir: /tmp/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928/0
>>> [ac27:27932] unidir: /tmp/openmpi-sessions-kxc565_at_ac27_0/default-universe-27928
>>> [ac27:27932] top: openmpi-sessions-kxc565_at_ac27_0
>>> [ac27:27932] tmp: /tmp
>>> [ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
>>> [ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
>>> [ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
>>> [ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
>>> [ac27:27928] spawn: in job_state_callback(jobid = 1, state = 0x80)
>>> mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
>>> [ac27:27932] sess_dir_finalize: job session dir not empty - leaving
>>> [ac27:27932] sess_dir_finalize: proc session dir not empty - leaving
>>> [ac27:27928] sess_dir_finalize: proc session dir not empty - leaving
>>>
>>>
>>>
>> --
>> Clement Kam Man Chu
>> Research Assistant
>> Faculty of Information Technology
>> Monash University, Caulfield Campus
>> Ph: 61 3 9903 2355

-- 
Clement Kam Man Chu
Research Assistant
Faculty of Information Technology
Monash University, Caulfield Campus
Ph: 61 3 9903 2355