I have verified that disabling UAC does not fix the problem. xhpl.exe starts, threads spin up on both machines, and CPU usage sits at 80-90%, but no progress is ever made.
 
From this state, pressing Ctrl-Break on the head node yields the following output:

[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline [[20816,0],0] lost
 
 
 
> From: users-request@open-mpi.org
> Subject: users Digest, Vol 1911, Issue 1
> To: users@open-mpi.org
> Date: Fri, 20 May 2011 08:14:13 -0400
>
> Send users mailing list submissions to
> users@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-request@open-mpi.org
>
> You can reach the person managing the list at
> users-owner@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
> 1. Re: Error: Entry Point Not Found (Zhangping Wei)
> 2. Re: Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test (George Bosilca)
> 3. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Jeff Squyres)
> 4. Re: Error: Entry Point Not Found (Jeff Squyres)
> 5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011 (aka
> 12.0) (Jeff Squyres)
> 6. Re: Openib with > 32 cores per node (Jeff Squyres)
> 7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres)
> 8. Re: Trouble with MPI-IO (Jeff Squyres)
> 9. Re: Trouble with MPI-IO (Tom Rosmond)
> 10. Re: Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test (David Büttner)
> 11. Re: Trouble with MPI-IO (Jeff Squyres)
> 12. Re: MPI_Alltoallv function crashes when np > 100 (Jeff Squyres)
> 13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only
> sometimes... (Jeff Squyres)
> 14. Re: Trouble with MPI-IO (Jeff Squyres)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 09:13:53 -0700 (PDT)
> From: Zhangping Wei <zhangping_wei@yahoo.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: users@open-mpi.org
> Message-ID: <101342.7961.qm@web111818.mail.gq1.yahoo.com>
> Content-Type: text/plain; charset="gb2312"
>
> Dear Paul,
>
> I checked the 'mpirun -np N <cmd>' form you mentioned, but it gave the same
> problem.
>
> I guess it may be related to the system I am using, because I have used it correctly on
> another XP 32-bit system.
>
> I look forward to more advice. Thanks.
>
> Zhangping
>
>
>
>
> ________________________________
> From: "users-request@open-mpi.org" <users-request@open-mpi.org>
> To: users@open-mpi.org
> Sent: Thursday, 19 May 2011 11:00:02
> Subject: users Digest, Vol 1910, Issue 2
>
> Send users mailing list submissions to
> users@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-request@open-mpi.org
>
> You can reach the person managing the list at
> users-owner@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
> 1. Re: Error: Entry Point Not Found (Paul van der Walt)
> 2. Re: Openib with > 32 cores per node (Robert Horton)
> 3. Re: Openib with > 32 cores per node (Samuel K. Gutierrez)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 16:14:02 +0100
> From: Paul van der Walt <paul@denknerd.nl>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <BANLkTinjZ0CNtchQJCZYhfGSnR51jPuP7w@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> Hi,
>
> On 19 May 2011 15:54, Zhangping Wei <zhangping_wei@yahoo.com> wrote:
> > 4, I use command window to run it in this way: 'mpirun -n 4 **.exe', then I
>
> Probably not the problem, but shouldn't that be 'mpirun -np N <cmd>' ?
>
> Paul
>
> --
> O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 May 2011 16:37:56 +0100
> From: Robert Horton <r.horton@qmul.ac.uk>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <1305819476.9663.148.camel@moelwyn>
> Content-Type: text/plain; charset="UTF-8"
>
> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> > Hi,
> >
> > Try the following QP parameters that only use shared receive queues.
> >
> > -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >
>
> Thanks for that. If I run the job over 2 x 48 cores it now works and the
> performance seems reasonable (I need to do some more tuning) but when I
> go up to 4 x 48 cores I'm getting the same problem:
>
> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> Any thoughts?
>
> Thanks,
> Rob
> --
> Robert Horton
> System Administrator (Research Support) - School of Mathematical Sciences
> Queen Mary, University of London
> r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 May 2011 09:59:13 -0600
> From: "Samuel K. Gutierrez" <samuel@lanl.gov>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <B3E83138-9AF0-48C0-871C-DBBB2E712E12@lanl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote
>
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >>
> >> Try the following QP parameters that only use shared receive queues.
> >>
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> > error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Any thoughts?
>
> How much memory does each node have? Does this happen at startup?
>
> Try adding:
>
> -mca btl_openib_cpc_include rdmacm
>
> I'm not sure if your version of OFED supports this feature, but maybe using XRC
> may help. I **think** other tweaks are needed to get this going, but I'm not
> familiar with the details.
>
> Hope that helps,
>
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
>
> >
> > Thanks,
> > Rob
> > --
> > Robert Horton
> > System Administrator (Research Support) - School of Mathematical Sciences
> > Queen Mary, University of London
> > r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1910, Issue 2
> **************************************
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 May 2011 08:48:03 -0800
> From: George Bosilca <bosilca@eecs.utk.edu>
> Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <FCAC66F9-FDB5-48BB-A800-263D8A4F9337@eecs.utk.edu>
> Content-Type: text/plain; charset=iso-8859-1
>
> David,
>
> I do not see any mechanism restricting access to the requests to a single thread. What is the thread model you're using?
>
> From an implementation perspective, your code is correct only if you initialize the MPI library with MPI_THREAD_MULTIPLE and the library grants that level. Otherwise, MPI assumes the application is single-threaded, and the behavior is implementation dependent. Please read the MPI standard's description of MPI_Init_thread for more details.
>
> Regards,
> george.
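George's precondition can be sketched in a few lines; this is a generic illustration of requesting and checking the thread level (assuming an installed MPI implementation; it is not code from the thread):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: request MPI_THREAD_MULTIPLE and verify the library granted it.
 * If "provided" is lower, calling MPI concurrently from several threads
 * is not safe and the behavior is implementation dependent. */
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... multi-threaded MPI code would go here ... */
    MPI_Finalize();
    return 0;
}
```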
>
> On May 19, 2011, at 02:34, David Büttner wrote:
>
> > Hello,
> >
> > I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if it is done. I do this repeatedly in the outer loop of my code. The MPI_Test is used in the inner loop to check if some function can be called which depends on the received data.
> > The program regularly crashed (only when not using printf...) and after debugging it I figured out the following problem:
> >
> > In MPI_Isend I have an invalid read of memory. I fixed the problem with not re-using a
> >
> > MPI_Request req_s, req_r;
> >
> > but by using
> >
> > MPI_Request* req_s;
> > MPI_Request* req_r
> >
> > and re-allocating them before the MPI_Isend/recv.
> >
> > The documentation says that MPI_Wait and MPI_Test (if successful) deallocate the request object and set the handle to MPI_REQUEST_NULL.
> > It also says that MPI_Isend and MPI_Irecv allocate the object and associate it with the request handle.
> >
> > As I understand this, it either means I can use a pointer to MPI_Request which I don't have to initialize (that doesn't work but crashes), or that I can use an MPI_Request pointer which I have initialized with malloc(sizeof(MPI_Request)) (or pass the address of an MPI_Request req), which is set and unset in the functions. But this version crashes, too.
> > What works is using a pointer which I allocate before the MPI_Isend/recv and free after MPI_Wait in every iteration. In other words: it only works if I don't reuse any kind of MPI_Request, and instead recreate one every time.
> >
> > Is this what it should be like? I believe that reusing the memory would be a lot more efficient (fewer calls to malloc...). Am I missing something here? Or am I doing something wrong?
> >
> >
> > Let me provide some more detailed information about my problem:
> >
> > I am running the program on a 30 node infiniband cluster. Each node has 4 single core Opteron CPUs. I am running 1 MPI Rank per node and 4 threads per rank (-> one thread per core).
> > I am compiling with mpicc of OpenMPI using gcc below.
> > Some pseudo-code of the program can be found at the end of this e-mail.
> >
> > I was able to reproduce the problem using different numbers of nodes, and even using one node only. The problem does not arise when I put printf debugging information into the code. This pointed me in the direction of a memory problem, where some write accesses memory it is not supposed to.
> > I ran the tests using valgrind with --leak-check=full and --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait depending on whether I had the threads spin in a loop for MPI_Test to return success or used MPI_Wait respectively.
> >
> > I would appreciate your help with this. Am I missing something important here? Is there a way to re-use the request across iterations other than the way I thought it should work?
> > Or is there a way to re-initialize the allocated memory before the MPI_Isend/recv so that I at least don't have to call free and malloc each time?
> >
> > Thank you very much for your help!
> > Kind regards,
> > David Büttner
> >
> > _____________________
> > Pseudo-Code of program:
> >
> > MPI_Request* req_s;
> > MPI_Request* req_r;
> > OUTER-LOOP
> > {
> >   if(0 == threadid)
> >   {
> >     req_s = malloc(sizeof(MPI_Request));
> >     req_r = malloc(sizeof(MPI_Request));
> >     MPI_Isend(..., req_s);
> >     MPI_Irecv(..., req_r);
> >   }
> >   pthread_barrier_wait
> >   INNER-LOOP (while NOT_DONE or RET)
> >   {
> >     if(TRYLOCK && NOT_DONE)
> >     {
> >       if(MPI_TEST(req_r))
> >       {
> >         Call_Function_A;
> >         NOT_DONE = 0;
> >       }
> >     }
> >     RET = Call_Function_B;
> >   }
> >   pthread_barrier_wait
> >   if(0 == threadid)
> >   {
> >     MPI_WAIT(req_s);
> >     MPI_WAIT(req_r);
> >     free(req_s);
> >     free(req_r);
> >   }
> > }
> > _____________
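For comparison, here is a minimal single-threaded sketch that reuses plain MPI_Request variables across iterations. No malloc/free is needed: a successful MPI_Wait (or MPI_Test) resets each handle to MPI_REQUEST_NULL, after which the same variable can be passed to the next MPI_Isend/MPI_Irecv. This is a generic illustration, not code from the thread, and it sidesteps the multi-thread access question entirely:

```c
#include <mpi.h>

/* Sketch: MPI_Request is an opaque handle that can live in an ordinary
 * stack variable and be reused once a completion call has finished it. */
void exchange_loop(int peer, int iters)
{
    MPI_Request req_s, req_r;        /* reused every iteration */
    double sbuf = 0.0, rbuf = 0.0;

    for (int i = 0; i < iters; i++) {
        MPI_Isend(&sbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req_s);
        MPI_Irecv(&rbuf, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req_r);
        /* After these return, both handles are MPI_REQUEST_NULL again. */
        MPI_Wait(&req_s, MPI_STATUS_IGNORE);
        MPI_Wait(&req_r, MPI_STATUS_IGNORE);
    }
}
```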
> >
> >
> > --
> > David Büttner, Informatik, Technische Universität München
> > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> "To preserve the freedom of the human mind then and freedom of the press, every spirit should be ready to devote itself to martyrdom; for as long as we may think as we will, and speak as we think, the condition of man will proceed in improvement."
> -- Thomas Jefferson, 1799
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 May 2011 21:22:48 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7
> workgroup
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <278274F0-BF00-4498-950F-9779E0083C5A@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Unfortunately, our Windows guy (Shiqing) is off getting married and will be out for a little while. :-(
>
> All that I can cite is the README.WINDOWS.txt file in the top-level directory. I'm afraid that I don't know much else about Windows. :-(
>
>
> On May 18, 2011, at 8:17 PM, Jason Mackay wrote:
>
> > Hi all,
> >
> > My thanks to all those involved for putting together this Windows binary release of OpenMPI! I am hoping to use it in a small Windows based OpenMPI cluster at home.
> >
> > Unfortunately my experience so far has not exactly been trouble free. It seems that, due to the fact that this release is using WMI, there are a number of settings that must be configured on the machines in order to get this to work. These settings are not documented in the distribution at all. I have been experimenting with it for over a week on and off and as soon as I solve one problem, another one arises.
> >
> > Currently, after much searching, reading, and tinkering with DCOM settings etc..., I can remotely start processes on all my machines using mpirun but those processes cannot access network shares (e.g. for binary distribution) and HPL (which works on any one node) does not seem to work if I run it across multiple nodes, also indicating a network issue (CPU sits at 100% in all processes with no network traffic and never terminates). To eliminate permission issues that may be caused by UAC I tried the same setup on two domain machines using an administrative account to launch and the behavior was the same. I have read that WMI processes cannot access network resources and I am at a loss for a solution to this newest of problems. If anyone knows how to make this work I would appreciate the help. I assume that someone has gotten this working and has the answers.
> >
> > I have searched the mailing list archives and I found other users with similar problems but no clear guidance on the threads. Some threads make references to Microsoft KB articles but do not explicitly tell the user what needs to be done, leaving each new user to rediscover the tricks on their own. One thread made it appear that testing had only been done on Windows XP. Needless to say, security has changed dramatically in Windows since XP!
> >
> > I would like to see OpenMPI for Windows be usable by a newcomer without all of this pain.
> >
> > What would be fantastic would be:
> > 1) a step-by-step procedure for how to get OpenMPI 1.5 working on Windows
> > a) preferably in a bare Windows 7 workgroup environment with nothing else (i.e. no Microsoft Cluster Compute Pack, no domain etc...)
> > 2) inclusion of these steps in the binary distribution
> > 3) bonus points for a script which accomplishes these things automatically
> >
> > If someone can help with (1), I would happily volunteer my time to work on (3).
> >
> > Regards,
> > Jason
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 19 May 2011 21:26:43 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <F830EC35-FC9B-4801-B2A3-50F54D2152A4@cisco.com>
> Content-Type: text/plain; charset=windows-1252
>
> On May 19, 2011, at 10:54 AM, Zhangping Wei wrote:
>
> > 4, I use command window to run it in this way: 'mpirun -n 4 **.exe', then I met the error: 'entry point not found: the procedure entry point inet_pton could not be located in the dynamic link library WS2_32.dll'
>
> Unfortunately our Windows developer/maintainer is out for a little while (he's getting married); he pretty much did the Windows stuff by himself, so none of the rest of us know much about it. :(
>
> inet_pton is a standard function call relating to IP addresses that we use in the internals of OMPI; I'm not sure why it wouldn't be found on Windows XP (Shiqing did cite that the OMPI Windows port should work on Windows XP).
>
> This post seems to imply that inet_ntop is only available on Vista and above:
>
> http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e40465f2-41b7-4243-ad33-15ae9366f4e6/
>
> So perhaps Shiqing needs to put in some kind of portability workaround for OMPI, and the current binaries won't actually work for XP...?
>
> I can't say that for sure because I really know very little about Windows; we'll unfortunately have to wait until he returns to get a definitive answer. :-(
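For reference, one conventional workaround when inet_pton() is unavailable on pre-Vista Windows is to wrap WSAStringToAddressA(), which dates back to much earlier Windows releases. The sketch below is purely illustrative: compat_inet_pton4 is a hypothetical name, and this is not OMPI's actual fix.

```c
/* Hypothetical IPv4-only fallback for inet_pton() on old Windows.
 * WSAStringToAddressA() parses a dotted-quad string into a sockaddr. */
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>

static int compat_inet_pton4(const char *src, struct in_addr *dst)
{
    struct sockaddr_in sa;
    int len = sizeof(sa);
    if (WSAStringToAddressA((char *)src, AF_INET, NULL,
                            (struct sockaddr *)&sa, &len) != 0)
        return 0;          /* not a valid IPv4 address string */
    *dst = sa.sin_addr;
    return 1;              /* success, mirroring inet_pton()'s convention */
}
#endif
```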
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 19 May 2011 21:37:49 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
> XE 2011 (aka 12.0)
> To: Open MPI Users <users@open-mpi.org>
> Cc: Giovanni Bracco <giovanni.bracco@enea.it>, Agostino Funel
> <agostino.funel@enea.it>, Fiorenzo Ambrosino
> <fiorenzo.ambrosino@enea.it>, Guido Guarnieri
> <guido.guarnieri@enea.it>, Roberto Ciavarella
> <roberto.ciavarella@enea.it>, Salvatore Podda
> <salvatore.podda@enea.it>, Giovanni Ponti <giovanni.ponti@enea.it>
> Message-ID: <45362608-B8B0-4ADE-9959-B35C5690A6F3@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry for the late reply.
>
> Other users have seen something similar but we have never been able to reproduce it. Is this only when using IB? If you use "mpirun --mca btl_openib_cpc_if_include rdmacm", does the problem go away?
>
>
> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>
> > I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collective hangs go away. I don't know what, if anything, the higher optimization buys you when compiling openmpi, so I'm not sure if that's an acceptable workaround or not.
> >
> > My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a single iteration of Barrier to reproduce the hang, and it happens 100% of the time for me when I invoke it like this:
> >
> > # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
> >
> > The hang happens on the first Barrier (64 ranks) and each of the participating ranks have this backtrace:
> >
> > __poll (...)
> > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > IMB_barrier ()
> > IMB_init_buffers_iter ()
> > main ()
> >
> > The one non-participating rank has this backtrace:
> >
> > __poll (...)
> > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > main ()
> >
> > If I use more nodes I can get it to hang with 1ppn, so that seems to rule out the sm btl (or interactions with it) as a culprit at least.
> >
> > I can't reproduce this with openmpi 1.5.3, interestingly.
> >
> > -Marcus
> >
> >
> > On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> >> Dear all,
> >>
> >> we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3,
> >> with Intel composer XE 2011 (aka 12.0).
> >> However, we found a threshold in the number of cores (depending on the
> >> application: IMB, xhpl or user applications,
> >> and on the number of required cores) above which the application hangs
> >> (a sort of deadlock).
> >> Building openmpi with 'gcc' and 'pgi' does not show the same limits.
> >> Are there any known incompatibilities of openmpi with this version of
> >> the Intel compilers?
> >>
> >> The characteristics of our computational infrastructure are:
> >>
> >> Intel processors E7330, E5345, E5530 e E5620
> >>
> >> CentOS 5.3, CentOS 5.5.
> >>
> >> Intel composer XE 2011
> >> gcc 4.1.2
> >> pgi 10.2-1
> >>
> >> Regards
> >>
> >> Salvatore Podda
> >>
> >> ENEA UTICT-HPC
> >> Department for Computer Science Development and ICT
> >> Facilities Laboratory for Science and High Performance Computing
> >> C.R. Frascati
> >> Via E. Fermi, 45
> >> PoBox 65
> >> 00044 Frascati (Rome)
> >> Italy
> >>
> >> Tel: +39 06 9400 5342
> >> Fax: +39 06 9400 5551
> >> Fax: +39 06 9400 5735
> >> E-mail: salvatore.podda@enea.it
> >> Home Page: www.cresco.enea.it
> >> _______________________________________________
> >> users mailing list
> >> users@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 19 May 2011 22:01:00 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <C18C4827-D305-484A-9DAE-290902D40DB3@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> What Sam is alluding to is that the OpenFabrics driver code in OMPI is sucking up oodles of memory for each IB connection that you're using. The receive_queues param that he sent tells OMPI to use all shared receive queues (instead of defaulting to one per-peer receive queue and the rest shared receive queues -- the per-peer RQ sucks up all the memory when you multiply it by N peers).
>
>
> On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote:
>
> > Hi,
> >
> > On May 19, 2011, at 9:37 AM, Robert Horton wrote
> >
> >> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >>> Hi,
> >>>
> >>> Try the following QP parameters that only use shared receive queues.
> >>>
> >>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>>
> >>
> >> Thanks for that. If I run the job over 2 x 48 cores it now works and the
> >> performance seems reasonable (I need to do some more tuning) but when I
> >> go up to 4 x 48 cores I'm getting the same problem:
> >>
> >> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
> >> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> >> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> >> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> >> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >>
> >> Any thoughts?
> >
> > How much memory does each node have? Does this happen at startup?
> >
> > Try adding:
> >
> > -mca btl_openib_cpc_include rdmacm
> >
> > I'm not sure if your version of OFED supports this feature, but maybe using XRC may help. I **think** other tweaks are needed to get this going, but I'm not familiar with the details.
> >
> > Hope that helps,
> >
> > Samuel K. Gutierrez
> > Los Alamos National Laboratory
> >
> >
> >>
> >> Thanks,
> >> Rob
> >> --
> >> Robert Horton
> >> System Administrator (Research Support) - School of Mathematical Sciences
> >> Queen Mary, University of London
> >> r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
> >>
> >> _______________________________________________
> >> users mailing list
> >> users@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 19 May 2011 22:04:46 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <0DCF20B8-CA5C-4746-8187-A2DFF39B15DD@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 13, 2011, at 8:31 AM, francoise.roch@obs.ujf-grenoble.fr wrote:
>
> > Here is the MUMPS portion of code (in zmumps_part1.F file) where the slaves call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0 before :
> >
> > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>
> I re-indented so that I could read it better:
>
> CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
> IF ( id%PAR .eq. 0 ) THEN
> IF ( id%MYID .eq. MASTER ) THEN
> color = MPI_UNDEFINED
> ELSE
> color = 0
> END IF
> CALL MPI_COMM_SPLIT( id%COMM, color, 0,
> & id%COMM_NODES, IERR )
> id%NSLAVES = id%NPROCS - 1
> ELSE
> CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
> id%NSLAVES = id%NPROCS
> END IF
>
> IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
> CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
> ENDIF
>
> That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are collective, meaning that all processes in the communicator must call them. In the first case, only some processes are calling MPI_COMM_SPLIT. Is there some other logic that forces the rest of the processes to call MPI_COMM_SPLIT, too?
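The collective pattern Jeff describes, in which every rank calls MPI_Comm_split and non-members opt out with MPI_UNDEFINED, looks roughly like this (a generic C sketch, not the MUMPS code; split_off_master is an illustrative name):

```c
#include <mpi.h>

/* Sketch: MPI_Comm_split is collective, so EVERY rank in comm must call
 * it. Ranks that should not belong to the new communicator pass
 * MPI_UNDEFINED as the color and get MPI_COMM_NULL back. */
MPI_Comm split_off_master(MPI_Comm comm)
{
    int rank;
    MPI_Comm workers;

    MPI_Comm_rank(comm, &rank);
    int color = (rank == 0) ? MPI_UNDEFINED : 0;  /* master opts out */
    MPI_Comm_split(comm, color, 0, &workers);     /* called by all ranks */
    return workers;  /* MPI_COMM_NULL on the master, valid comm elsewhere */
}
```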
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 19 May 2011 22:30:03 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <EEFB638F-72F1-4208-8EA2-4F25F610C47B@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Props for that testio script. I think you win the award for "most easy to reproduce test case." :-)
>
> I notice that some of the lines went over 72 columns, so I renamed the file x.f90 and changed all the comments from "c" to "!" and joined the two &-split lines. The error about implicit type for lenr went away, but then when I enabled better type checking by using "use mpi" instead of "include 'mpif.h'", I got the following:
>
> x.f90:99.77:
>
> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> 1
> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> I looked at our mpi F90 module and see the following:
>
> interface MPI_Type_indexed
> subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
> integer, intent(in) :: count
> integer, dimension(*), intent(in) :: array_of_blocklengths
> integer, dimension(*), intent(in) :: array_of_displacements
> integer, intent(in) :: oldtype
> integer, intent(out) :: newtype
> integer, intent(out) :: ierr
> end subroutine MPI_Type_indexed
> end interface
>
> I don't quite grok the syntax of the "allocatable" type ijdisp, so that might be the problem here...?
>
> Regardless, I'm not entirely sure if the problem is the >72 character lines, but then when that is gone, I'm not sure how the allocatable stuff fits in... (I'm not enough of a Fortran programmer to know)
>
>
>
>
> On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
>
> > I would appreciate someone with experience with MPI-IO look at the
> > simple fortran program gzipped and attached to this note. It is
> > imbedded in a script so that all that is necessary to run it is do:
> > 'testio' from the command line. The program generates a small 2-D input
> > array, sets up an MPI-IO environment, and write a 2-D output array
> > twice, with the only difference being the displacement arrays used to
> > construct the indexed datatype. For the first write, simple
> > monotonically increasing displacements are used, for the second the
> > displacements are 'shuffled' in one dimension. They are printed during
> > the run.
> >
> > For the first case the file is written properly, but for the second the
> > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > Although the program is compiled as an mpi program, I am running on a
> > single processor, which makes the problem more puzzling.
> >
> > The program should be relatively self-explanatory, but if more
> > information is needed, please ask. I am on an 8 core Xeon based Dell
> > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> >
> > T. Rosmond
> >
> >
> > <testio.gz><info_ompi.gz>_______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 19 May 2011 20:24:25 -0700
> From: Tom Rosmond <rosmond@reachone.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <1305861865.4284.104.camel@cedar.reachone.com>
> Content-Type: text/plain
>
> Thanks for looking at my problem. Sounds like you did reproduce my
> problem. I have added some comments below
>
> On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> > Props for that testio script. I think you win the award for "most easy to reproduce test case." :-)
> >
> > I notice that some of the lines went over 72 columns, so I renamed the file x.f90 and changed all the comments from "c" to "!" and joined the two &-split lines. The error about implicit type for lenr went away, but then when I enabled better type checking by using "use mpi" instead of "include 'mpif.h'", I got the following:
>
> What fortran compiler did you use?
>
> In the original script my Intel compile used the -132 option,
> allowing up to that many columns per line. I still think in
> F77 fortran much of the time, and use 'c' for comments out
> of habit. The change to '!' doesn't make any difference.
>
>
> > x.f90:99.77:
> >
> > call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> > 1
> > Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> Hmmm, very strange, since I am looking right at the MPI standard
> documents with that routine documented. I too get this compile failure
> when I switch to 'use mpi'. Could that be a problem with the Open MPI
> fortran libraries???
> >
> > I looked at our mpi F90 module and see the following:
> >
> > interface MPI_Type_indexed
> > subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
> > integer, intent(in) :: count
> > integer, dimension(*), intent(in) :: array_of_blocklengths
> > integer, dimension(*), intent(in) :: array_of_displacements
> > integer, intent(in) :: oldtype
> > integer, intent(out) :: newtype
> > integer, intent(out) :: ierr
> > end subroutine MPI_Type_indexed
> > end interface
> >
> > I don't quite grok the syntax of the "allocatable" type ijdisp, so that might be the problem here...?
>
> Just a standard F90 'allocatable' statement. I've written thousands
> just like it.
> >
> > Regardless, I'm not entirely sure if the problem is the >72 character lines, but then when that is gone, I'm not sure how the allocatable stuff fits in... (I'm not enough of a Fortran programmer to know)
> >
> Anyone else out there who can comment?
>
>
> T. Rosmond
>
>
>
> >
> > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
> >
> > > I would appreciate someone with experience with MPI-IO looking at the
> > > simple Fortran program gzipped and attached to this note. It is
> > > embedded in a script so that all that is necessary to run it is to type
> > > 'testio' at the command line. The program generates a small 2-D input
> > > array, sets up an MPI-IO environment, and writes a 2-D output array
> > > twice, with the only difference being the displacement arrays used to
> > > construct the indexed datatype. For the first write, simple
> > > monotonically increasing displacements are used, for the second the
> > > displacements are 'shuffled' in one dimension. They are printed during
> > > the run.
> > >
> > > For the first case the file is written properly, but for the second the
> > > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > > Although the program is compiled as an MPI program, I am running on a
> > > single processor, which makes the problem more puzzling.
> > >
> > > The program should be relatively self-explanatory, but if more
> > > information is needed, please ask. I am on an 8 core Xeon based Dell
> > > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> > >
> > > T. Rosmond
> > >
> > >
> > > <testio.gz><info_ompi.gz>_______________________________________________
> > > users mailing list
> > > users@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
>
>
> ------------------------------
>
> Message: 10
> Date: Fri, 20 May 2011 09:25:14 +0200
> From: David Büttner <david.buettner@in.tum.de>
> Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <4DD6175A.1080403@in.tum.de>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hello,
>
> Thanks for the quick answer. I am sorry that I forgot to mention this: I
> did compile Open MPI with MPI_THREAD_MULTIPLE support, and I check that
> required == provided after the MPI_Init_thread call.
>
> > I do not see any mechanism for protecting the accesses to the requests to a single thread? What is the thread model you're using?
> >
> Again I am sorry that this was not clear: in the pseudo code below I
> wanted to indicate the access protection I do by thread-id-dependent
> calls, i.e. if(0 == threadid), and by using trylock(...) (with
> pthread mutexes). In the code all accesses concerning one MPI_Request
> (which are pthread-global pointers in my case) are protected and happen
> in sequential order, i.e. MPI_Isend/recv returns before any thread is
> allowed to call the corresponding MPI_Test, and no one can call MPI_Test
> any more once a thread is allowed to call MPI_Wait.
> I did this in the same manner before with other MPI implementations, but
> also on the same machine with the same (untouched) OpenMPI
> implementation, also using pthreads and MPI in combination, but I used
>
> MPI_Request req;
>
> instead of
>
> MPI_Request* req;
> (and later)
> req = (MPI_Request*)malloc(sizeof(MPI_Request));
>
>
> In my recent (problem) code, I also tried not using pointers, but got
> the same problem. Also, as I described in the first mail, I tried
> everything concerning the memory allocation of the MPI_Request objects.
> I tried not calling malloc at all; I guessed this wouldn't work, but the
> Open MPI documentation says this:
>
> " Nonblocking calls allocate a communication request object and
> associate it with the request handle (the argument request). "
> [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and
>
> " [...] if the communication object was created by a nonblocking send or
> receive, then it is deallocated and the request handle is set to
> MPI_REQUEST_NULL."
> [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly
> different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php]
>
> So I thought that it might do some kind of optimized memory stuff
> internally.
>
> I also tried allocating req (for each used MPI_Request) once before the
> first use and deallocating it after the last use (which I thought was the
> way it was supposed to work), but that crashes also.
>
> I tried replacing the pointers with global variables
>
> MPI_Request req;
>
> which didn't do the job...
>
> The only thing that seems to work is what I mentioned below: Allocate
> every time I am going to need it in the MPI_Isend/recv, use it in
> MPI_Test/Wait and after that deallocate it by hand each time.
> I don't think it is supposed to work like this, since I would have to
> call malloc and free so often (for multiple MPI_Request objects in
> each iteration) that it will most likely limit performance...
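For reference, the pattern the MPI standard intends is a plain MPI_Request variable declared once, passed by address, and reused after each completion, with no per-iteration malloc/free. The sketch below illustrates only that handle lifecycle; to keep it compilable without an MPI installation, the type and the two calls are trivial stand-ins (MPI_Request_sketch, fake_isend, fake_wait are inventions for this sketch, not real MPI API).

```c
#include <stddef.h>

/* Stand-in for the real opaque handle type; in Open MPI the real one is a
 * pointer, in MPICH an int.  The application never looks inside it. */
typedef int MPI_Request_sketch;
#define REQUEST_NULL (-1)

/* fake_isend "allocates" the internal object and fills in the handle,
 * like MPI_Isend; fake_wait completes the operation and resets the handle
 * to REQUEST_NULL, like MPI_Wait. */
static void fake_isend(MPI_Request_sketch *req) { *req = 1; }
static void fake_wait(MPI_Request_sketch *req)  { *req = REQUEST_NULL; }

/* One request variable, declared once and reused every iteration.
 * Returns 0 if the handle was properly reset after each completion. */
int run_iterations(int iters)
{
    MPI_Request_sketch req_s = REQUEST_NULL;
    for (int i = 0; i < iters; i++) {
        fake_isend(&req_s);     /* MPI_Isend(..., &req_s)              */
        /* ... overlapped computation would go here ...                */
        fake_wait(&req_s);      /* MPI_Wait(&req_s, MPI_STATUS_IGNORE) */
        if (req_s != REQUEST_NULL)
            return 1;
    }
    return 0;
}
```

If this pattern crashes with the real MPI calls in place, the handle handling itself is usually not the culprit; heap corruption elsewhere in the program (which the valgrind reports hint at) is the more likely cause.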
>
> Anyway I still have the same problem and am still unclear on what kind
> of memory allocation I should be doing for the MPI_Requests. Is there
> anything else (besides MPI_THREAD_MULTIPLE support, thread access
> control, sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for
> one MPI_Request object) I need to take care of? If not, what could I do
> to find the source of my problem?
>
> Thanks again for any kind of help!
>
> Kind regards,
> David
>
>
>
> > > From an implementation perspective, your code is correct only if you initialize the MPI library with MPI_THREAD_MULTIPLE and if the library accepts. Otherwise, there is an assumption that the application is single threaded, or that the MPI behavior is implementation dependent. Please read the MPI standard regarding to MPI_Init_thread for more details.
> >
> > Regards,
> > george.
> >
> > On May 19, 2011, at 02:34, David Büttner wrote:
> >
> >> Hello,
> >>
> >> I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if it is done. I do this repeatedly in the outer loop of my code. The MPI_Test is used in the inner loop to check if some function can be called which depends on the received data.
> >> The program regularly crashed (only when not using printf...) and after debugging it I figured out the following problem:
> >>
> >> In MPI_Isend I have an invalid read of memory. I fixed the problem by not re-using a
> >>
> >> MPI_Request req_s, req_r;
> >>
> >> but by using
> >>
> >> MPI_Request* req_s;
> >> MPI_Request* req_r
> >>
> >> and re-allocating them before the MPI_Isend/recv.
> >>
> >> The documentation says that in MPI_Wait and MPI_Test (if successful) the request objects are deallocated and the handles set to MPI_REQUEST_NULL.
> >> It also says that MPI_Isend and MPI_Irecv allocate the objects and associate them with the request handles.
> >>
> >> As I understand this, this either means I can use a pointer to MPI_Request which I don't have to initialize for this (it doesn't work but crashes), or that I can use an MPI_Request pointer which I have initialized with malloc(sizeof(MPI_Request)) (or pass the address of an MPI_Request req), which is set and unset in the functions. But this version crashes, too.
> >> What works is using a pointer which I allocate before the MPI_Isend/recv and free after MPI_Wait in every iteration. In other words: it only works if I don't reuse any kind of MPI_Request, only if I recreate one every time.
> >>
> >> Is this what it should be like? I believe that reusing the memory would be a lot more efficient (fewer calls to malloc...). Am I missing something here? Or am I doing something wrong?
> >>
> >>
> >> Let me provide some more detailed information about my problem:
> >>
> >> I am running the program on a 30-node InfiniBand cluster. Each node has 4 single-core Opteron CPUs. I am running 1 MPI rank per node and 4 threads per rank (one thread per core).
> >> I am compiling with Open MPI's mpicc, with gcc underneath.
> >> Some pseudo-code of the program can be found at the end of this e-mail.
> >>
> >> I was able to reproduce the problem using different numbers of nodes and even using one node only. The problem does not arise when I put printf debugging information into the code. This pointed me in the direction of a memory problem, where some write accesses memory it is not supposed to.
> >> I ran the tests using valgrind with --leak-check=full and --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait, depending on whether I had the threads spin in a loop waiting for MPI_Test to return success or used MPI_Wait, respectively.
> >>
> >> I would appreciate your help with this. Am I missing something important here? Is there a way to re-use the request in the different iterations other than I thought it should work?
> >> Or is there a way to re-initialize the allocated memory before the MPI_Isend/recv so that I at least don't have to call free and malloc each time?
> >>
> >> Thank you very much for your help!
> >> Kind regards,
> >> David Büttner
> >>
> >> _____________________
> >> Pseudo-Code of program:
> >>
> >> MPI_Request* req_s;
> >> MPI_Request* req_r;
> >>
> >> OUTER-LOOP
> >> {
> >>   if(0 == threadid)
> >>   {
> >>     req_s = malloc(sizeof(MPI_Request));
> >>     req_r = malloc(sizeof(MPI_Request));
> >>     MPI_Isend(..., req_s);
> >>     MPI_Irecv(..., req_r);
> >>   }
> >>   pthread_barrier_wait(...);
> >>   INNER-LOOP (while NOT_DONE or RET)
> >>   {
> >>     if(TRYLOCK && NOT_DONE)
> >>     {
> >>       if(MPI_TEST(req_r))
> >>       {
> >>         Call_Function_A;
> >>         NOT_DONE = 0;
> >>       }
> >>     }
> >>     RET = Call_Function_B;
> >>   }
> >>   pthread_barrier_wait(...);
> >>   if(0 == threadid)
> >>   {
> >>     MPI_WAIT(req_s);
> >>     MPI_WAIT(req_r);
> >>     free(req_s);
> >>     free(req_r);
> >>   }
> >> }
> >> _____________
> >>
> >>
> >> --
> >> David Büttner, Informatik, Technische Universität München
> >> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> >>
> >> _______________________________________________
> >> users mailing list
> >> users@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > "To preserve the freedom of the human mind then and freedom of the press, every spirit should be ready to devote itself to martyrdom; for as long as we may think as we will, and speak as we think, the condition of man will proceed in improvement."
> > -- Thomas Jefferson, 1799
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> David Büttner, Informatik, Technische Universität München
> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
>
>
>
> ------------------------------
>
> Message: 11
> Date: Fri, 20 May 2011 06:23:21 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <A5B121E9-E664-49D0-AE54-2CFE527129D2@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 19, 2011, at 11:24 PM, Tom Rosmond wrote:
>
> > What fortran compiler did you use?
>
> gfortran.
>
> > In the original script my Intel compile used the -132 option,
> > allowing up to that many columns per line.
>
> Gotcha.
>
> >> x.f90:99.77:
> >>
> >> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> >> 1
> >> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
> >
> > Hmmm, very strange, since I am looking right at the MPI standard
> > documents with that routine documented. I too get this compile failure
> > when I switch to 'use mpi'. Could that be a problem with the Open MPI
> > fortran libraries???
>
> I think that that error is telling us that there's a compile-time mismatch -- that the signature of what you've passed doesn't match the signature of OMPI's MPI_Type_indexed subroutine.
>
> >> I looked at our mpi F90 module and see the following:
> >>
> >> interface MPI_Type_indexed
> >> subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
> >> integer, intent(in) :: count
> >> integer, dimension(*), intent(in) :: array_of_blocklengths
> >> integer, dimension(*), intent(in) :: array_of_displacements
> >> integer, intent(in) :: oldtype
> >> integer, intent(out) :: newtype
> >> integer, intent(out) :: ierr
> >> end subroutine MPI_Type_indexed
> >> end interface
>
> Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 12
> Date: Fri, 20 May 2011 07:26:19 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_Alltoallv function crashes when np > 100
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <F9F71854-B9DD-459F-999D-8A8AEF8D6006@cisco.com>
> Content-Type: text/plain; charset=GB2312
>
> I missed this email in my INBOX, sorry.
>
> Can you be more specific about what exact error is occurring? You just say that the application crashes...? Please send all the information listed here:
>
> http://www.open-mpi.org/community/help/
>
>
> On Apr 26, 2011, at 10:51 PM, ?????? wrote:
>
> > It seems that the constant SOMAXCONN, which is used by the listen() system call, causes this problem. Can anybody help me resolve this?
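As a quick sanity check on that theory, the compile-time constant can be printed directly; note that on Linux the kernel's runtime backlog limit is separate (see /proc/sys/net/core/somaxconn), and somaxconn_value below is just a hypothetical helper name for this sketch.

```c
#include <sys/socket.h>   /* defines SOMAXCONN */

/* Returns the compile-time default backlog cap that listen(fd, SOMAXCONN)
 * would request; historically 128 with glibc.  The kernel may silently
 * clamp larger backlog arguments to its runtime limit. */
int somaxconn_value(void)
{
    return SOMAXCONN;
}
```

If connection setup really is hitting this cap at np >= 150, raising the runtime limit (sysctl net.core.somaxconn on Linux) would be the thing to try.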
> >
> > 2011/4/25 ?????? <xjun.meng@gmail.com>
> > Dear all,
> >
> > As I mentioned, when I launched an application with mpirun and np = 150 (or bigger), the application, which used the MPI_Alltoallv function, would crash. The problem recurs no matter how many nodes we use.
> >
> > The version of Open MPI: 1.4.1 or 1.4.3
> > The OS: Red Hat Linux, kernel 2.6.32
> >
> > BTW, my nodes had enough memory to run the application, and the MPI_Alltoall function worked well in my environment.
> > Has anybody met the same problem? Thanks.
> >
> >
> > Best Regards
> >
> >
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 20 May 2011 07:28:28 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error,
> but only sometimes...
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <CAEF632E-757B-49EE-B545-5CCCBC712247@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry for the super-late reply. :-\
>
> Yes, ERR_TRUNCATE means that the receiver didn't have a large enough buffer.
>
> Have you tried upgrading to a newer version of Open MPI? 1.4.3 is the current stable release (I have a very dim and not guaranteed to be correct recollection that we fixed something in the internals of collectives somewhere with regards to ERR_TRUNCATE...?).
>
>
> On Apr 25, 2011, at 4:44 PM, Wei Hao wrote:
>
> > Hi:
> >
> > I'm running openmpi 1.2.8. I'm working on a project where one part involves communicating an integer, representing the number of data points I'm keeping track of, to all the processors. The line is simple:
> >
> > MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD);
> >
> > where np and geo_N are integers, np is the result of a local calculation, and geo_N has been declared on all the processors. geo_N is nondecreasing. This line works the first time I call it (geo_N goes from 0 to some other integer), but if I call it later in the program, I get the following error:
> >
> >
> > [woodhen-039:26189] *** An error occurred in MPI_Allreduce
> > [woodhen-039:26189] *** on communicator MPI_COMM_WORLD
> > [woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated
> > [woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye)
> >
> >
> > As I understand it, MPI_ERR_TRUNCATE means that the output buffer is too small, but I'm not sure where I've made a mistake. It's particularly frustrating because it seems to work fine the first time. Does anyone have any thoughts?
> >
> > Thanks
> > Wei
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 14
> Date: Fri, 20 May 2011 08:14:07 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <42DB03B3-9CF4-4ACB-AA20-B857E5F76087@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:
>
> > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
>
> Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile error (even though they're allocatable -- so allocate was a red herring, sorry). That's all that "use mpi" is complaining about -- that the function signatures didn't match.
>
> use mpi is your friend -- even if you don't use F90 constructs much. Compile-time checking is a Very Good Thing (you were effectively "getting lucky" by passing in the 2D arrays, I think).
>
> Attached is my final version. And with this version, I see the hang when running it with the "T" parameter.
>
> That being said, I'm not an expert on the MPI IO stuff -- your code *looks* right to me, but I could be missing something subtle in the interpretation of MPI_FILE_SET_VIEW. I tried running your code with MPICH 1.3.2p1 and it also hung.
>
> Rob (ROMIO guy) -- can you comment on this code? Is it correct?
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: x.f90
> Type: application/octet-stream
> Size: 3820 bytes
> Desc: not available
> URL: <http://www.open-mpi.org/MailArchives/users/attachments/20110520/53a5461b/attachment.obj>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1911, Issue 1
> **************************************