I am using this system: http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact configuration of the file system. Here is the output of "df -h":
Filesystem Size Used Avail Use% Mounted on
/dev/sda6 919G 16G 857G 2% /
tmpfs 32G 0 32G 0% /dev/shm
/dev/sda5 139M 33M 100M 25% /boot
6.5T 678G 5.5T 11% /scratch
6.5T 678G 5.5T 11% /var/spool/mail
1.2P 136T 1.1P 12% /work1
1.2P 793T 368T 69% /work4
1.2P 509T 652T 44% /work3
1.2P 521T 640T 45% /work2
728T 286T 443T 40% /p/cwfs
728T 286T 443T 40% /p/CWFS1
728T 286T 443T 40% /p/CWFS2
728T 286T 443T 40% /p/CWFS3
728T 286T 443T 40% /p/CWFS4
728T 286T 443T 40% /p/CWFS5
728T 286T 443T 40% /p/CWFS6
728T 286T 443T 40% /p/CWFS7
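(In case it helps to tell which of these mounts are local and which are network file systems, here is a quick sketch; `stat -f -c %T` is a GNU coreutils option, which I am assuming is available on the login node:)

```shell
#!/bin/sh
# Print the filesystem type of a directory. Types such as nfs, lustre,
# or panfs indicate network storage; ext4, xfs, or tmpfs are local.
fs_type() {
    stat -f -c %T "$1"
}

fs_type /          # root disk (local on this node)
fs_type /dev/shm   # tmpfs: RAM-backed, always node-local
```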
1. My home directory is /home/yanb.
My simulation files are located at /work3/yanb.
The default TMPDIR set by the system is just /work3/yanb.
2. I did try leaving TMPDIR unset and letting it default; that corresponds to case 1 and case 2 below.
Case1: #export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
The job fails for no apparent reason (no error output).
Case2: #export TMPDIR=/home/yanb/tmp
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
The job warns that the shared-memory backing file is on a network file system.
3. With "export TMPDIR=/tmp", the job fails the same way, for no apparent reason.
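(For reference, the relevant part of my job script for this case looks roughly like the sketch below; the walltime line is an assumption I've added for completeness, while the mpirun line matches case 1:)

```shell
#!/bin/sh
#PBS -l walltime=00:30:00
# Sketch: force Open MPI's session directory onto node-local /tmp
# instead of the Lustre-backed default (/work3/yanb on this system).
export TMPDIR=/tmp

TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile "$PBS_NODEFILE" \
       ./paraEllip3d input.txt
```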
4. FYI, "ls /" gives:
ELT apps cgroup hafs1 hafs12 hafs2 hafs5 hafs8 home lost+found mnt p root selinux tftpboot var work3
admin bin dev hafs10 hafs13 hafs3 hafs6 hafs9 lib media net panfs sbin srv tmp work1 work4
app boot etc hafs11 hafs15 hafs4 hafs7 hafs_x86_64 lib64 misc opt proc scratch sys usr work2 workspace
From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 17:24
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
If you are using the university cluster, chances are that /home is not local, but on an NFS share, or perhaps Lustre (which you may have mentioned before, I don't remember).
Maybe "df -h" will show what is local and what is not.
It works for NFS, where file systems are prefixed with the server name, but I don't know about Lustre.
Did you try just not to set TMPDIR and let it default?
If the default TMPDIR is on Lustre (did you say this? anyway, I don't
remember), you could perhaps try to force it to /tmp:
If the cluster nodes are diskfull /tmp is likely to exist and be local to the cluster nodes.
[But the cluster nodes may be diskless ... :( ]
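A quick way to check is the sketch below (assuming GNU df is available); run it on a compute node, for instance from an interactive job:

```shell
#!/bin/sh
# Report whether /tmp on this node is a local disk or a network mount.
fs=$(df -PT /tmp | awk 'NR==2 {print $2}')
case "$fs" in
    nfs*|lustre*|panfs*) echo "/tmp is on a NETWORK file system ($fs)" ;;
    *)                   echo "/tmp looks local ($fs)" ;;
esac
```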
I hope this helps,
On 03/03/2014 07:10 PM, Beichuan Yan wrote:
> How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem? I don't know how to tell whether a directory is on a local or a network file system.
> -----Original Message-----
> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Monday, March 03, 2014 16:57
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
> How about setting TMPDIR to a local filesystem?
> On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan.yan_at_[hidden]> wrote:
>> I agree there are two cases for pure-MPI mode: 1. The job fails for no apparent reason; 2. The job complains about a shared-memory file on a network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp" (/home/yanb/tmp is my local directory). The default TMPDIR points to a Lustre directory.
>> There is no other output. I checked my job with "qstat -n" and found that the processes were actually not started on the compute nodes, even though PBS Pro had "started" my job.
>>> 3. Then I tested pure-MPI mode: OpenMP is turned off, and each compute node runs 16 processes (so MPI's shared memory is clearly used). Four combinations of "TMPDIR" and "TCP" were tested:
>>> case 1:
>>> #export TMPDIR=/home/yanb/tmp
>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE
>>> ./paraEllip3d input.txt
>>> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014 End Prologue v2.5
>>> Mon Mar 3 15:47:16 EST 2014
>>> -bash: line 1: 448597 Terminated /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>>> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014 Statistics
>>> =00:03:24 End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
>> It looks like you have two general cases:
>> 1. The job fails for no apparent reason (like above), or
>> 2. The job complains that your TMPDIR is on a shared filesystem.
>> I think the real issue, then, is to figure out why your jobs are failing with no output.
>> Is there anything in the stderr output?
>> Jeff Squyres
>> For corporate legal information go to:
>> users mailing list