Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] ibm/io/file_status_get_count
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2011-11-04 12:05:56


On 11/4/2011 5:56 AM, Jeff Squyres wrote:
> On Oct 28, 2011, at 1:59 AM, Eugene Loh wrote:
>> In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with:
>>
>> File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 0) with return value
>> FFFFFFFF and errno 5.
>> - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running
>> on all the machines, and mount the directory with the 'noac' option (no attribute caching).
>> - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
>> ADIOI_Set_lock:: Input/output error
>> ADIOI_Set_lock:offset 0, length 1
>>
>> One of the curious things (to us) about this test is that no one else appears to run it. Looking back through a lot of MTT results, essentially the only results reported are Oracle. Almost no non-Oracle results for this test have been reported in the last few months. Is there something special about this test we should know about?
> Not that I'm aware of.
>
> I see why Cisco skipped it -- I didn't have the "io" directory listed in my list of IBM directories to traverse. Doh! That's been fixed.
>
> (Cisco's MTT runs look like they need a bit of TLC -- I'm guessing IB is down on a node or two, resulting in a lot of false failures, but I likely won't have time to look at them until after SC :-( )
Yeah. In our recent experience, everyone's MTT runs seem to need lots
of TLC. Anyhow, thanks for the feedback: it appears no one has been
deliberately avoiding this particular test for some reason we were
simply unaware of.
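
For context, the ADIOI_Set_lock failure quoted above is an ordinary
POSIX advisory-lock request. A minimal sketch of roughly what ROMIO
asks the kernel to do (simplified; the real ADIOI_Set_lock code retries
and formats the error text shown above) is:

/* Sketch of the blocking read-lock request reported in the error:
 * cmd F_SETLKW, type F_RDLCK, whence 0, offset 0, length 1. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

static int set_read_lock(int fd)
{
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type   = F_RDLCK;    /* read lock */
    lock.l_whence = SEEK_SET;   /* whence 0 */
    lock.l_start  = 0;          /* offset 0 */
    lock.l_len    = 1;          /* length 1 */

    if (fcntl(fd, F_SETLKW, &lock) == -1) {
        /* errno 5 (EIO) here typically means the file system's lock
         * manager could not service the request at all, rather than
         * that another process holds a conflicting lock. */
        fprintf(stderr, "lock failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

An EIO from this call, as opposed to a conflicting-lock failure, is at
least consistent with the lockd/mount-option advice in the message.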
>> P.S. We're also interested in understanding the error message better. I suppose that's more appropriately taken up with ROMIO folks, which I will do, but if anyone on this list has useful information I'd love to hear it. The error apparently comes when MPI_File_get_size sets a lock. Each process has its own file and the test usually passes, so it's unclear to me what the problem is. Further, the error message discussing NFS and Lustre strikes me as rather speculative. We tend to run these tests repeatedly on the same file systems from the same test nodes. Anyone have any idea how sound the NFSv3/lockd/noac advice is or what the real issue is here?
> No. You'll need to ask Rob Latham.
Thanks. He replied to my inquiry on the MPICH list. The main answer is
that all bets on robustness are off on NFS and the message might be a
little misleading.
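
For reference, the failing call sequence in the test is presumably
along these lines (a sketch based only on the description above; the
file name, access mode, and error handling are illustrative and not
taken from the ibm/io test itself):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    char filename[64];
    MPI_File fh;
    MPI_Offset size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process opens its own file, as described above. */
    snprintf(filename, sizeof(filename), "testfile.%d", rank);
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* This is the call during which ROMIO reportedly takes the lock
     * and the intermittent EIO surfaces. */
    MPI_File_get_size(fh, &size);
    printf("rank %d: size = %lld\n", rank, (long long)size);

    MPI_File_close(&fh);
    MPI_File_delete(filename, MPI_INFO_NULL);
    MPI_Finalize();
    return 0;
}

Since each rank uses its own file, there is no cross-process contention
for the lock, which is part of why the intermittent failure is puzzling.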