Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] ibm/io/file_status_get_count
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-11-04 08:56:06


On Oct 28, 2011, at 1:59 AM, Eugene Loh wrote:

> In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with:
>
> File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 0) with return value
> FFFFFFFF and errno 5.
> - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running
> on all the machines, and mount the directory with the 'noac' option (no attribute caching).
> - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
> ADIOI_Set_lock:: Input/output error
> ADIOI_Set_lock:offset 0, length 1
>
> One of the curious things (to us) about this test is that no one else appears to run it. Looking back through a lot of MTT results, essentially the only results reported are Oracle's. Almost no non-Oracle results for this test have been reported in the last few months. Is there something special about this test we should know about?

Not that I'm aware of.

I see why Cisco skipped it -- I didn't have the "io" directory listed in my list of IBM directories to traverse. Doh! That's been fixed.

(Cisco's MTT runs look like they need a bit of TLC -- I'm guessing IB is down on a node or two, resulting in a lot of false failures, but I likely won't have time to look at them until after SC :-( )

> P.S. We're also interested in understanding the error message better. I suppose that's more appropriately taken up with ROMIO folks, which I will do, but if anyone on this list has useful information I'd love to hear it. The error apparently comes when MPI_File_get_size sets a lock. Each process has its own file and the test usually passes, so it's unclear to me what the problem is. Further, the error message discussing NFS and Lustre strikes me as rather speculative. We tend to run these tests repeatedly on the same file systems from the same test nodes. Anyone have any idea how sound the NFSv3/lockd/noac advice is or what the real issue is here?

No. You'll need to ask Rob Latham.
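
For anyone chasing this in the meantime: the lock named in the error message (F_SETLKW, F_RDLCK, whence 0, offset 0, length 1) can be attempted outside of MPI, which may help tell a file system problem from a ROMIO one. The following is only a minimal sketch, assuming Python 3 on a POSIX system. The function name try_read_lock is hypothetical, and this is not ROMIO's code; it just requests the same kind of lock. errno 5 is EIO, which matches the "Input/output error" text above.

```python
import errno
import fcntl
import os


def try_read_lock(path):
    """Request a blocking shared (read) lock on byte 0 of `path`.

    Returns 0 if the lock was taken and released, otherwise the errno
    reported by the operating system (for example errno.EIO, which is 5).
    """
    # Create the file if needed; a read lock needs the file open for reading.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_SH without LOCK_NB is a blocking read lock:
        # length 1, start 0, relative to the beginning of the file.
        fcntl.lockf(fd, fcntl.LOCK_SH, 1, 0, os.SEEK_SET)
        # Release the same byte range again.
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

Calling try_read_lock on a file in the directory the test writes to, repeatedly and from each test node, would show whether the file system ever refuses the lock on its own. A nonzero return can be turned into a name with errno.errorcode.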

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/