Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup
From: Damien (damien_at_[hidden])
Date: 2011-05-20 14:58:21


MPI can get through your firewall, right?
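
For what it's worth, an inbound firewall exception usually has to exist on every node
for mpirun, any daemons it launches, and the application binary itself. On Windows 7
that can be added with something like the following (the rule name and program path
below are only placeholders):

  netsh advfirewall firewall add rule name="Open MPI app" dir=in action=allow program="C:\path\to\xhlp.exe" enable=yes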

Damien

On 20/05/2011 12:53 PM, Jason Mackay wrote:
> I have verified that disabling UAC does not fix the problem. xhlp.exe
> starts, threads spin up on both machines, CPU usage is at 80-90% but
> no progress is ever made.
>
> From this state, Ctrl-break on the head node yields the following output:
>
> [REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0]
> mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
> [REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
> [REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
> [REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
> [REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
> [REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
> [REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to
> lifeline [[20816,0],0] lost
>
>
>
> > From: users-request_at_[hidden]
> > Subject: users Digest, Vol 1911, Issue 1
> > To: users_at_[hidden]
> > Date: Fri, 20 May 2011 08:14:13 -0400
> >
> > Send users mailing list submissions to
> > users_at_[hidden]
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > or, via email, send a message with subject or body 'help' to
> > users-request_at_[hidden]
> >
> > You can reach the person managing the list at
> > users-owner_at_[hidden]
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of users digest..."
> >
> >
> > Today's Topics:
> >
> > 1. Re: Error: Entry Point Not Found (Zhangping Wei)
> > 2. Re: Problem with MPI_Request, MPI_Isend/recv and
> > MPI_Wait/Test (George Bosilca)
> > 3. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Jeff Squyres)
> > 4. Re: Error: Entry Point Not Found (Jeff Squyres)
> > 5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011 (aka
> > 12.0) (Jeff Squyres)
> > 6. Re: Openib with > 32 cores per node (Jeff Squyres)
> > 7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres)
> > 8. Re: Trouble with MPI-IO (Jeff Squyres)
> > 9. Re: Trouble with MPI-IO (Tom Rosmond)
> > 10. Re: Problem with MPI_Request, MPI_Isend/recv and
> > MPI_Wait/Test (David Büttner)
> > 11. Re: Trouble with MPI-IO (Jeff Squyres)
> > 12. Re: MPI_Alltoallv function crashes when np > 100 (Jeff Squyres)
> > 13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only
> > sometimes... (Jeff Squyres)
> > 14. Re: Trouble with MPI-IO (Jeff Squyres)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Thu, 19 May 2011 09:13:53 -0700 (PDT)
> > From: Zhangping Wei <zhangping_wei_at_[hidden]>
> > Subject: Re: [OMPI users] Error: Entry Point Not Found
> > To: users_at_[hidden]
> > Message-ID: <101342.7961.qm_at_[hidden]>
> > Content-Type: text/plain; charset="gb2312"
> >
> > Dear Paul,
> >
> > I checked the way 'mpirun -np N <cmd>' you mentioned, but it was the
> same
> > problem.
> >
> > I guess it may be related to the system I used, because I have used it
> correctly on
> > another XP 32-bit system.
> >
> > I look forward to more advice. Thanks.
> >
> > Zhangping
> >
> >
> >
> >
> > ________________________________
> > ???????? "users-request_at_[hidden]" <users-request_at_[hidden]>
> > ???????? users_at_[hidden]
> > ?????????? 2011/5/19 (????) 11:00:02 ????
> > ?? ???? users Digest, Vol 1910, Issue 2
> >
> > Send users mailing list submissions to
> > users_at_[hidden]
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > or, via email, send a message with subject or body 'help' to
> > users-request_at_[hidden]
> >
> > You can reach the person managing the list at
> > users-owner_at_[hidden]
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of users digest..."
> >
> >
> > Today's Topics:
> >
> > 1. Re: Error: Entry Point Not Found (Paul van der Walt)
> > 2. Re: Openib with > 32 cores per node (Robert Horton)
> > 3. Re: Openib with > 32 cores per node (Samuel K. Gutierrez)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Thu, 19 May 2011 16:14:02 +0100
> > From: Paul van der Walt <paul_at_[hidden]>
> > Subject: Re: [OMPI users] Error: Entry Point Not Found
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <BANLkTinjZ0CNtchQJCZYhfGSnR51jPuP7w_at_[hidden]>
> > Content-Type: text/plain; charset=UTF-8
> >
> > Hi,
> >
> > On 19 May 2011 15:54, Zhangping Wei <zhangping_wei_at_[hidden]> wrote:
> > > 4, I use command window to run it in this way: 'mpirun -n 4
> **.exe', then I
> >
> > Probably not the problem, but shouldn't that be 'mpirun -np N <cmd>' ?
> >
> > Paul
> >
> > --
> > O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
> >
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Thu, 19 May 2011 16:37:56 +0100
> > From: Robert Horton <r.horton_at_[hidden]>
> > Subject: Re: [OMPI users] Openib with > 32 cores per node
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <1305819476.9663.148.camel_at_moelwyn>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> > > Hi,
> > >
> > > Try the following QP parameters that only use shared receive queues.
> > >
> > > -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> > >
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> >
> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> > error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job
> will now abort)
> >
> > Any thoughts?
> >
> > Thanks,
> > Rob
> > --
> > Robert Horton
> > System Administrator (Research Support) - School of Mathematical
> Sciences
> > Queen Mary, University of London
> > r.horton_at_[hidden] - +44 (0) 20 7882 7345
> >
> >
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Thu, 19 May 2011 09:59:13 -0600
> > From: "Samuel K. Gutierrez" <samuel_at_[hidden]>
> > Subject: Re: [OMPI users] Openib with > 32 cores per node
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <B3E83138-9AF0-48C0-871C-DBBB2E712E12_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi,
> >
> > On May 19, 2011, at 9:37 AM, Robert Horton wrote
> >
> > > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> > >> Hi,
> > >>
> > >> Try the following QP parameters that only use shared receive queues.
> > >>
> > >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> > >>
> > >
> > > Thanks for that. If I run the job over 2 x 48 cores it now works
> and the
> > > performance seems reasonable (I need to do some more tuning) but
> when I
> > > go up to 4 x 48 cores I'm getting the same problem:
> > >
> >
> >[compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> > >] error creating qp errno says Cannot allocate memory
> > > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job
> will now
> > >abort)
> > >
> > > Any thoughts?
> >
> > How much memory does each node have? Does this happen at startup?
> >
> > Try adding:
> >
> > -mca btl_openib_cpc_include rdmacm
> >
> > I'm not sure if your version of OFED supports this feature, but
> maybe using XRC
> > may help. I **think** other tweaks are needed to get this going, but
> I'm not
> > familiar with the details.
> >
> > Hope that helps,
> >
> > Samuel K. Gutierrez
> > Los Alamos National Laboratory
> >
> >
> > >
> > > Thanks,
> > > Rob
> > > --
> > > Robert Horton
> > > System Administrator (Research Support) - School of Mathematical
> Sciences
> > > Queen Mary, University of London
> > > r.horton_at_[hidden] - +44 (0) 20 7882 7345
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> >
> >
> >
> > ------------------------------
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > End of users Digest, Vol 1910, Issue 2
> > **************************************
> > -------------- next part --------------
> > HTML attachment scrubbed and removed
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Thu, 19 May 2011 08:48:03 -0800
> > From: George Bosilca <bosilca_at_[hidden]>
> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and
> > MPI_Wait/Test
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <FCAC66F9-FDB5-48BB-A800-263D8A4F9337_at_[hidden]>
> > Content-Type: text/plain; charset=iso-8859-1
> >
> > David,
> >
> > I do not see any mechanism for protecting the accesses to the
> requests to a single thread? What is the thread model you're using?
> >
> > From an implementation perspective, your code is correct only if
> you initialize the MPI library with MPI_THREAD_MULTIPLE and if the
> library accepts it. Otherwise, there is an assumption that the
> application is single threaded, or that the MPI behavior is
> implementation dependent. Please read the MPI standard regarding
> MPI_Init_thread for more details.
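> >
> > As a minimal sketch of both points (this is not your code, just the
> > pattern in plain C): initialize with MPI_Init_thread and check the
> > provided level, and note that a request variable declared once can be
> > reused across iterations, because MPI_Wait/MPI_Test complete it and
> > reset it to MPI_REQUEST_NULL before the next MPI_Isend/MPI_Irecv
> > overwrites it:
> >
> > #include <mpi.h>
> >
> > int main(int argc, char **argv)
> > {
> >     int provided;
> >     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> >     if (provided < MPI_THREAD_MULTIPLE) {
> >         /* the library did not grant full thread support */
> >         MPI_Abort(MPI_COMM_WORLD, 1);
> >     }
> >
> >     int sendbuf = 42, recvbuf = -1;
> >     MPI_Request reqs[2];                   /* declared once, reused */
> >     for (int iter = 0; iter < 10; iter++) {
> >         /* send-to-self just to keep the sketch self-contained */
> >         MPI_Irecv(&recvbuf, 1, MPI_INT, 0, 0, MPI_COMM_SELF, &reqs[0]);
> >         MPI_Isend(&sendbuf, 1, MPI_INT, 0, 0, MPI_COMM_SELF, &reqs[1]);
> >         /* ... overlap other work here ... */
> >         MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* both reset to MPI_REQUEST_NULL */
> >     }
> >
> >     MPI_Finalize();
> >     return 0;
> > }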
> >
> > Regards,
> > george.
> >
> > On May 19, 2011, at 02:34, David Büttner wrote:
> >
> > > Hello,
> > >
> > > I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I
> am using MPI_Isend and MPI_Irecv for communication and
> MPI_Test/MPI_Wait to check if it is done. I do this repeatedly in the
> outer loop of my code. The MPI_Test is used in the inner loop to check
> if some function can be called which depends on the received data.
> > > The program regularly crashed (only when not using printf...) and
> after debugging it I figured out the following problem:
> > >
> > > In MPI_Isend I have an invalid read of memory. I fixed the problem
> with not re-using a
> > >
> > > MPI_Request req_s, req_r;
> > >
> > > but by using
> > >
> > > MPI_Request* req_s;
> > > MPI_Request* req_r
> > >
> > > and re-allocating them before the MPI_Isend/recv.
> > >
> > > The documentation says, that in MPI_Wait and MPI_Test (if
> successful) the request-objects are deallocated and set to
> MPI_REQUEST_NULL.
> > > It also says, that in MPI_Isend and MPI_Irecv, it allocates the
> Objects and associates it with the request object.
> > >
> > > As I understand this, this either means I can use a pointer to
> MPI_Request which I don't have to initialize for this (it doesn't work
> but crashes), or that I can use a MPI_Request pointer which I have
> initialized with malloc(sizeof(MPI_REQUEST)) (or passing the address
> of a MPI_Request req), which is set and unset in the functions. But
> this version crashes, too.
> > > What works is using a pointer, which I allocate before the
> MPI_Isend/recv and which I free after MPI_Wait in every iteration. In
> other words: It only works if I don't reuse any kind of MPI_Request.
> Only if I recreate one every time.
> > >
> > > Is this what it should be like? I believe that reuse of the
> memory would be a lot more efficient (fewer calls to malloc...). Am I
> missing something here? Or am I doing something wrong?
> > >
> > >
> > > Let me provide some more detailed information about my problem:
> > >
> > > I am running the program on a 30 node infiniband cluster. Each
> node has 4 single core Opteron CPUs. I am running 1 MPI Rank per node
> and 4 threads per rank (-> one thread per core).
> > > I am compiling with mpicc of OpenMPI using gcc below.
> > > Some pseudo-code of the program can be found at the end of this
> e-mail.
> > >
> > > I was able to reproduce the problem using different amount of
> nodes and even using one node only. The problem does not arise when I
> put printf-debugging information into the code. This pointed me into
> the direction that I have some memory problem, where some write
> accesses some memory it is not supposed to.
> > > I ran the tests using valgrind with --leak-check=full and
> --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait
> depending on whether I had the threads spin in a loop for MPI_Test to
> return success or used MPI_Wait respectively.
> > >
> > > I would appreciate your help with this. Am I missing something
> important here? Is there a way to re-use the request in the different
> iterations other than I thought it should work?
> > > Or is there a way to re-initialize the allocated memory before the
> MPI_Isend/recv so that I at least don't have to call free and malloc
> each time?
> > >
> > > Thank you very much for your help!
> > > Kind regards,
> > > David Büttner
> > >
> > > _____________________
> > > Pseudo-Code of program:
> > >
> > > MPI_Request* req_s;
> > > MPI_Request* req_w;
> > > OUTER-LOOP
> > > if(0 == threadid)
> > > {
> > > req_s = malloc(sizeof(MPI_Request));
> > > req_r = malloc(sizeof(MPI_Request));
> > > MPI_Isend(..., req_s)
> > > MPI_Irecv(..., req_r)
> > > }
> > > pthread_barrier
> > > INNER-LOOP (while NOT_DONE or RET)
> > > if(TRYLOCK && NOT_DONE)
> > > {
> > > if(MPI_TEST(req_r))
> > > {
> > > Call_Function_A;
> > > NOT_DONE = 0;
> > > }
> > >
> > > }
> > > RET = Call_Function_B;
> > > }
> > > pthread_barrier_wait
> > > if(0 == threadid)
> > > {
> > > MPI_WAIT(req_s)
> > > MPI_WAIT(req_r)
> > > free(req_s);
> > > free(req_r);
> > > }
> > > _____________
> > >
> > >
> > > --
> > > David Büttner, Informatik, Technische Universität München
> > > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > "To preserve the freedom of the human mind then and freedom of the
> press, every spirit should be ready to devote itself to martyrdom; for
> as long as we may think as we will, and speak as we think, the
> condition of man will proceed in improvement."
> > -- Thomas Jefferson, 1799
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Thu, 19 May 2011 21:22:48 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7
> > workgroup
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <278274F0-BF00-4498-950F-9779E0083C5A_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Unfortunately, our Windows guy (Shiqing) is off getting married and
> will be out for a little while. :-(
> >
> > All that I can cite is the README.WINDOWS.txt file in the top-level
> directory. I'm afraid that I don't know much else about Windows. :-(
> >
> >
> > On May 18, 2011, at 8:17 PM, Jason Mackay wrote:
> >
> > > Hi all,
> > >
> > > My thanks to all those involved for putting together this Windows
> binary release of OpenMPI! I am hoping to use it in a small Windows
> based OpenMPI cluster at home.
> > >
> > > Unfortunately my experience so far has not exactly been trouble
> free. It seems that, due to the fact that this release is using WMI,
> there are a number of settings that must be configured on the machines
> in order to get this to work. These settings are not documented in the
> distribution at all. I have been experimenting with it for over a week
> on and off and as soon as I solve one problem, another one arises.
> > >
> > > Currently, after much searching, reading, and tinkering with DCOM
> settings etc..., I can remotely start processes on all my machines
> using mpirun but those processes cannot access network shares (e.g.
> for binary distribution) and HPL (which works on any one node) does
> not seem to work if I run it across multiple nodes, also indicating a
> network issue (CPU sits at 100% in all processes with no network
> traffic and never terminates). To eliminate permission issues that may
> be caused by UAC I tried the same setup on two domain machines using
> an administrative account to launch and the behavior was the same. I
> have read that WMI processes cannot access network resources and I am
> at a loss for a solution to this newest of problems. If anyone knows
> how to make this work I would appreciate the help. I assume that
> someone has gotten this working and has the answers.
> > >
> > > I have searched the mailing list archives and I found other users
> with similar problems but no clear guidance on the threads. Some
> threads make references to Microsoft KB articles but do not explicitly
> tell the user what needs to be done, leaving each new user to
> rediscover the tricks on their own. One thread made it appear that
> testing had only been done on Windows XP. Needless to say, security
> has changed dramatically in Windows since XP!
> > >
> > > I would like to see OpenMPI for Windows be usable by a newcomer
> without all of this pain.
> > >
> > > What would be fantastic would be:
> > > 1) a step-by-step procedure for how to get OpenMPI 1.5 working on
> Windows
> > > a) preferably in a bare Windows 7 workgroup environment with
> nothing else (i.e. no Microsoft Cluster Compute Pack, no domain etc...)
> > > 2) inclusion of these steps in the binary distribution
> > > 3) bonus points for a script which accomplishes these things
> automatically
> > >
> > > If someone can help with (1), I would happily volunteer my time to
> work on (3).
> > >
> > > Regards,
> > > Jason
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Thu, 19 May 2011 21:26:43 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] Error: Entry Point Not Found
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <F830EC35-FC9B-4801-B2A3-50F54D2152A4_at_[hidden]>
> > Content-Type: text/plain; charset=windows-1252
> >
> > On May 19, 2011, at 10:54 AM, Zhangping Wei wrote:
> >
> > > 4, I use command window to run it in this way: 'mpirun -n 4 **.exe',
> then I met the error: 'entry point not found: the procedure entry
> point inet_pton could not be located in the dynamic link library
> WS2_32.dll'
> >
> > Unfortunately our Windows developer/maintainer is out for a little
> while (he's getting married); he pretty much did the Windows stuff by
> himself, so none of the rest of us know much about it. :(
> >
> > inet_pton is a standard function call relating to IP addresses that
> we use in the internals of OMPI; I'm not sure why it wouldn't be found
> on Windows XP (Shiqing did cite that the OMPI Windows port should work
> on Windows XP).
> >
> > This post seems to imply that inet_ntop is only available on Vista
> and above:
> >
> >
> http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e40465f2-41b7-4243-ad33-15ae9366f4e6/
> >
> > So perhaps Shiqing needs to put in some kind of portability
> workaround for OMPI, and the current binaries won't actually work for
> XP...?
> >
> > I can't say that for sure because I really know very little about
> Windows; we'll unfortunately have to wait until he returns to get a
> definitive answer. :-(
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 5
> > Date: Thu, 19 May 2011 21:37:49 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
> > XE 2011 (aka 12.0)
> > To: Open MPI Users <users_at_[hidden]>
> > Cc: Giovanni Bracco <giovanni.bracco_at_[hidden]>, Agostino Funel
> > <agostino.funel_at_[hidden]>, Fiorenzo Ambrosino
> > <fiorenzo.ambrosino_at_[hidden]>, Guido Guarnieri
> > <guido.guarnieri_at_[hidden]>, Roberto Ciavarella
> > <roberto.ciavarella_at_[hidden]>, Salvatore Podda
> > <salvatore.podda_at_[hidden]>, Giovanni Ponti <giovanni.ponti_at_[hidden]>
> > Message-ID: <45362608-B8B0-4ADE-9959-B35C5690A6F3_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Sorry for the late reply.
> >
> > Other users have seen something similar but we have never been able
> to reproduce it. Is this only when using IB? If you use "mpirun --mca
> btl_openib_cpc_if_include rdmacm", does the problem go away?
> >
> >
> > On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
> >
> > > I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
> but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
> then the collectives hangs go away. I don't know what, if anything,
> the higher optimization buys you when compiling openmpi, so I'm not
> sure if that's an acceptable workaround or not.
> > >
> > > My system is similar to yours - Intel X5570 with QDR Mellanox IB
> running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
> using IMB 3.2.2 with a single iteration of Barrier to reproduce the
> hang, and it happens 100% of the time for me when I invoke it like this:
> > >
> > > # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
> > >
> > > The hang happens on the first Barrier (64 ranks) and each of the
> participating ranks have this backtrace:
> > >
> > > __poll (...)
> > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_barrier_intra_recursivedoubling () from
> [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_barrier_intra_dec_fixed () from
> [instdir]/lib/libmpi.so.0
> > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > > IMB_barrier ()
> > > IMB_init_buffers_iter ()
> > > main ()
> > >
> > > The one non-participating rank has this backtrace:
> > >
> > > __poll (...)
> > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
> > > ompi_coll_tuned_barrier_intra_dec_fixed () from
> [instdir]/lib/libmpi.so.0
> > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > > main ()
> > >
> > > If I use more nodes I can get it to hang with 1ppn, so that seems
> to rule out the sm btl (or interactions with it) as a culprit at least.
> > >
> > > I can't reproduce this with openmpi 1.5.3, interestingly.
> > >
> > > -Marcus
> > >
> > >
> > > On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> > >> Dear all,
> > >>
> > >> we succeed in building several version of openmpi from 1.2.8 to
> 1.4.3
> > >> with Intel composer XE 2011 (aka 12.0).
> > >> However we found a threshold in the number of cores (depending
> from the
> > >> application: IMB, xhpl or user applications
> > >> and from the number of required cores) above which the
> application hangs
> > >> (sort of deadlocks).
> > >> The building of openmpi with 'gcc' and 'pgi' does not show the
> same limits.
> > >> Are there any known incompatibilities of openmpi with this
> version of
> > >> Intel compilers?
> > >>
> > >> The characteristics of our computational infrastructure are:
> > >>
> > >> Intel processors E7330, E5345, E5530 e E5620
> > >>
> > >> CentOS 5.3, CentOS 5.5.
> > >>
> > >> Intel composer XE 2011
> > >> gcc 4.1.2
> > >> pgi 10.2-1
> > >>
> > >> Regards
> > >>
> > >> Salvatore Podda
> > >>
> > >> ENEA UTICT-HPC
> > >> Department for Computer Science Development and ICT
> > >> Facilities Laboratory for Science and High Performance Computing
> > >> C.R. Frascati
> > >> Via E. Fermi, 45
> > >> PoBox 65
> > >> 00044 Frascati (Rome)
> > >> Italy
> > >>
> > >> Tel: +39 06 9400 5342
> > >> Fax: +39 06 9400 5551
> > >> Fax: +39 06 9400 5735
> > >> E-mail: salvatore.podda_at_[hidden]
> > >> Home Page: www.cresco.enea.it
> > >> _______________________________________________
> > >> users mailing list
> > >> users_at_[hidden]
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >>
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 6
> > Date: Thu, 19 May 2011 22:01:00 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] Openib with > 32 cores per node
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <C18C4827-D305-484A-9DAE-290902D40DB3_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > What Sam is alluding to is that the OpenFabrics driver code in OMPI
> is sucking up oodles of memory for each IB connection that you're
> using. The receive_queues param that he sent tells OMPI to use all
> shared receive queues (instead of defaulting to one per-peer receive
> queue and the rest shared receive queues -- the per-peer RQ sucks up
> all the memory when you multiple it by N peers).
> >
> >
> > On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote:
> >
> > > Hi,
> > >
> > > On May 19, 2011, at 9:37 AM, Robert Horton wrote
> > >
> > >> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> > >>> Hi,
> > >>>
> > >>> Try the following QP parameters that only use shared receive queues.
> > >>>
> > >>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> > >>>
> > >>
> > >> Thanks for that. If I run the job over 2 x 48 cores it now works
> and the
> > >> performance seems reasonable (I need to do some more tuning) but
> when I
> > >> go up to 4 x 48 cores I'm getting the same problem:
> > >>
> > >>
> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
> > >> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > >> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > >> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > >> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job
> will now abort)
> > >>
> > >> Any thoughts?
> > >
> > > How much memory does each node have? Does this happen at startup?
> > >
> > > Try adding:
> > >
> > > -mca btl_openib_cpc_include rdmacm
> > >
> > > I'm not sure if your version of OFED supports this feature, but
> maybe using XRC may help. I **think** other tweaks are needed to get
> this going, but I'm not familiar with the details.
> > >
> > > Hope that helps,
> > >
> > > Samuel K. Gutierrez
> > > Los Alamos National Laboratory
> > >
> > >
> > >>
> > >> Thanks,
> > >> Rob
> > >> --
> > >> Robert Horton
> > >> System Administrator (Research Support) - School of Mathematical
> Sciences
> > >> Queen Mary, University of London
> > >> r.horton_at_[hidden] - +44 (0) 20 7882 7345
> > >>
> > >> _______________________________________________
> > >> users mailing list
> > >> users_at_[hidden]
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 7
> > Date: Thu, 19 May 2011 22:04:46 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <0DCF20B8-CA5C-4746-8187-A2DFF39B15DD_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > On May 13, 2011, at 8:31 AM, francoise.roch_at_[hidden] wrote:
> >
> > > Here is the MUMPS portion of code (in zmumps_part1.F file) where
> the slaves call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0
> before :
> > >
> > > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
> >
> > I re-indented so that I could read it better:
> >
> > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
> > IF ( id%PAR .eq. 0 ) THEN
> > IF ( id%MYID .eq. MASTER ) THEN
> > color = MPI_UNDEFINED
> > ELSE
> > color = 0
> > END IF
> > CALL MPI_COMM_SPLIT( id%COMM, color, 0,
> > & id%COMM_NODES, IERR )
> > id%NSLAVES = id%NPROCS - 1
> > ELSE
> > CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
> > id%NSLAVES = id%NPROCS
> > END IF
> >
> > IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
> > CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
> > ENDIF
> >
> > That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are
> collective, meaning that all processes in the communicator must call
> them. In the first case, only some processes are calling
> MPI_COMM_SPLIT. Is there some other logic that forces the rest of the
> processes to call MPI_COMM_SPLIT, too?
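> >
> > For reference, the collective usage pattern (sketched in C rather than
> > the Fortran above; "comm" stands for the parent communicator, id%COMM
> > in the MUMPS code): every rank calls MPI_Comm_split, and ranks that
> > should be excluded pass MPI_UNDEFINED as the color and get
> > MPI_COMM_NULL back:
> >
> >   int rank;
> >   MPI_Comm newcomm;
> >   MPI_Comm_rank(comm, &rank);
> >   /* exclude rank 0 (the master), but every rank still makes the call */
> >   int color = (rank == 0) ? MPI_UNDEFINED : 0;
> >   MPI_Comm_split(comm, color, 0, &newcomm);
> >   /* rank 0 gets newcomm == MPI_COMM_NULL here */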
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 8
> > Date: Thu, 19 May 2011 22:30:03 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] Trouble with MPI-IO
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <EEFB638F-72F1-4208-8EA2-4F25F610C47B_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Props for that testio script. I think you win the award for "most
> easy to reproduce test case." :-)
> >
> > I notice that some of the lines went over 72 columns, so I renamed
> the file x.f90 and changed all the comments from "c" to "!" and joined
> the two &-split lines. The error about implicit type for lenr went
> away, but then when I enabled better type checking by using "use mpi"
> instead of "include 'mpif.h'", I got the following:
> >
> > x.f90:99.77:
> >
> > call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> > 1
> > Error: There is no specific subroutine for the generic
> 'mpi_type_indexed' at (1)
> >
> > I looked at our mpi F90 module and see the following:
> >
> > interface MPI_Type_indexed
> > subroutine MPI_Type_indexed(count, array_of_blocklengths,
> array_of_displacements, oldtype, newtype, ierr)
> > integer, intent(in) :: count
> > integer, dimension(*), intent(in) :: array_of_blocklengths
> > integer, dimension(*), intent(in) :: array_of_displacements
> > integer, intent(in) :: oldtype
> > integer, intent(out) :: newtype
> > integer, intent(out) :: ierr
> > end subroutine MPI_Type_indexed
> > end interface
> >
> > I don't quite grok the syntax of the "allocatable" type ijdisp, so
> that might be the problem here...?
> >
> > Regardless, I'm not entirely sure if the problem is the >72
> character lines, but then when that is gone, I'm not sure how the
> allocatable stuff fits in... (I'm not enough of a Fortran programmer
> to know)
> >
> >
> >
> >
> > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
> >
> > > I would appreciate someone with experience with MPI-IO look at the
> > > simple fortran program gzipped and attached to this note. It is
> > > imbedded in a script so that all that is necessary to run it is do:
> > > 'testio' from the command line. The program generates a small 2-D
> input
> > > array, sets up an MPI-IO environment, and write a 2-D output array
> > > twice, with the only difference being the displacement arrays used to
> > > construct the indexed datatype. For the first write, simple
> > > monotonically increasing displacements are used, for the second the
> > > displacements are 'shuffled' in one dimension. They are printed during
> > > the run.
> > >
> > > For the first case the file is written properly, but for the
> second the
> > > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > > Although the program is compiled as an mpi program, I am running on a
> > > single processor, which makes the problem more puzzling.
> > >
> > > The program should be relatively self-explanatory, but if more
> > > information is needed, please ask. I am on an 8 core Xeon based Dell
> > > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> > >
> > > T. Rosmond
> > >
> > >
> > >
> <testio.gz><info_ompi.gz>_______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 9
> > Date: Thu, 19 May 2011 20:24:25 -0700
> > From: Tom Rosmond <rosmond_at_[hidden]>
> > Subject: Re: [OMPI users] Trouble with MPI-IO
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <1305861865.4284.104.camel_at_[hidden]>
> > Content-Type: text/plain
> >
> > Thanks for looking at my problem. Sounds like you did reproduce my
> > problem. I have added some comments below
> >
> > On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> > > Props for that testio script. I think you win the award for "most
> easy to reproduce test case." :-)
> > >
> > > I notice that some of the lines went over 72 columns, so I renamed
> the file x.f90 and changed all the comments from "c" to "!" and joined
> the two &-split lines. The error about implicit type for lenr went
> away, but then when I enabled better type checking by using "use mpi"
> instead of "include 'mpif.h'", I got the following:
> >
> > What fortran compiler did you use?
> >
> > In the original script my Intel compile used the -132 option,
> > allowing up to that many columns per line. I still think in
> > F77 fortran much of the time, and use 'c' for comments out
> > of habit. The change to '!' doesn't make any difference.
> >
> >
> > > x.f90:99.77:
> > >
> > > call
> mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> > > 1
> > > Error: There is no specific subroutine for the generic
> 'mpi_type_indexed' at (1)
> >
> > Hmmm, very strange, since I am looking right at the MPI standard
> > documents with that routine documented. I too get this compile failure
> > when I switch to 'use mpi'. Could that be a problem with the Open MPI
> > fortran libraries???
> > >
> > > I looked at our mpi F90 module and see the following:
> > >
> > > interface MPI_Type_indexed
> > > subroutine MPI_Type_indexed(count, array_of_blocklengths,
> array_of_displacements, oldtype, newtype, ierr)
> > > integer, intent(in) :: count
> > > integer, dimension(*), intent(in) :: array_of_blocklengths
> > > integer, dimension(*), intent(in) :: array_of_displacements
> > > integer, intent(in) :: oldtype
> > > integer, intent(out) :: newtype
> > > integer, intent(out) :: ierr
> > > end subroutine MPI_Type_indexed
> > > end interface
> > >
> > > I don't quite grok the syntax of the "allocatable" type ijdisp, so
> that might be the problem here...?
> >
> > Just a standard F90 'allocatable' statement. I've written thousands
> > just like it.
> > >
> > > Regardless, I'm not entirely sure if the problem is the >72
> character lines, but then when that is gone, I'm not sure how the
> allocatable stuff fits in... (I'm not enough of a Fortran programmer
> to know)
> > >
> > Anyone else out there who can comment?
> >
> >
> > T. Rosmond
> >
> >
> >
> > >
> > > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
> > >
> > > > I would appreciate someone with experience with MPI-IO look at the
> > > > simple fortran program gzipped and attached to this note. It is
> > > > imbedded in a script so that all that is necessary to run it is do:
> > > > 'testio' from the command line. The program generates a small
> 2-D input
> > > > array, sets up an MPI-IO environment, and write a 2-D output array
> > > > twice, with the only difference being the displacement arrays
> used to
> > > > construct the indexed datatype. For the first write, simple
> > > > monotonically increasing displacements are used, for the second the
> > > > displacements are 'shuffled' in one dimension. They are printed
> during
> > > > the run.
> > > >
> > > > For the first case the file is written properly, but for the
> second the
> > > > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > > > Although the program is compiled as an mpi program, I am running
> on a
> > > > single processor, which makes the problem more puzzling.
> > > >
> > > > The program should be relatively self-explanatory, but if more
> > > > information is needed, please ask. I am on an 8 core Xeon based Dell
> > > > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> > > >
> > > > T. Rosmond
> > > >
> > > >
> > > >
> <testio.gz><info_ompi.gz>_______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> >
> >
> >
> > ------------------------------
> >
> > Message: 10
> > Date: Fri, 20 May 2011 09:25:14 +0200
> > From: David Büttner <david.buettner_at_[hidden]>
> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and
> > MPI_Wait/Test
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <4DD6175A.1080403_at_[hidden]>
> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> >
> > Hello,
> >
> > thanks for the quick answer. I am sorry that I forgot to mention
> this: I
> > did compile OpenMPI with MPI_THREAD_MULTIPLE support and test if
> > required == provided after the MPI_Thread_init call.
> >
> > > I do not see any mechanism for protecting the accesses to the
> requests to a single thread? What is the thread model you're using?
> > >
> > Again I am sorry that this was not clear: In the pseudo code below I
> > wanted to indicate the access-protection I do by thread-id dependent
> > calls if(0 == thread-id) and by using the trylock(...) (using
> > pthread-mutexes). In the code all accesses concerning one MPI_Request
> > (which are pthread-global-pointers in my case) are protected and called
> > in sequential order, i.e. MPI_Isend/recv returns before any
> thread is
> > allowed to call the corresponding MPI_Test and no-one can call MPI_Test
> > any more when a thread is allowed to call MPI_Wait.
> > I did this in the same manner before with other MPI implementations,
> but
> > also on the same machine with the same (untouched) OpenMPI
> > implementation, also using pthreads and MPI in combination, but I used
> >
> > MPI_Request req;
> >
> > instead of
> >
> > MPI_Request* req;
> > (and later)
> > req = (MPI_Request*)malloc(sizeof(MPI_Request));
> >
> >
> > In my recent (problem) code, I also tried not using pointers, but got
> > the same problem. Also, as I described in the first mail, I tried
> > everything concerning the memory allocation of the MPI_Request objects.
> > I tried not calling malloc. This I guessed wouldn't work, but the
> > OpenMPI documentation says this:
> >
> > " Nonblocking calls allocate a communication request object and
> > associate it with the request handle the argument request). "
> > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and
> >
> > " [...] if the communication object was created by a nonblocking
> send or
> > receive, then it is deallocated and the request handle is set to
> > MPI_REQUEST_NULL."
> > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly
> > different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php]
> >
> > So I thought that it might do some kind of optimized memory stuff
> > internally.
> >
> > I also tried allocating req (for each used MPI_Request) once before the
> > first use and deallocation after the last use (which I thought was the
> > way it was supposed to work), but that crashes also.
> >
> > I tried replacing the pointers through global variables
> >
> > MPI_Request req;
> >
> > which didn't do the job...
> >
> > The only thing that seems to work is what I mentioned below: Allocate
> > every time I am going to need it in the MPI_Isend/recv, use it in
> > MPI_Test/Wait and after that deallocate it by hand each time.
> > I don't think that this is supposed to be like this since I have to
> do a
> > call to malloc and free so often (for multiple MPI_Request objects in
> > each iteration) that it will most likely limit performance...
> >
> > Anyway I still have the same problem and am still unclear on what kind
> > of memory allocation I should be doing for the MPI_Requests. Is there
> > anything else (besides MPI_THREAD_MULTIPLE support, thread access
> > control, sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for
> > one MPI_Request object) I need to take care of? If not, what could I do
> > to find the source of my problem?
> >
> > Thanks again for any kind of help!
> >
> > Kind regards,
> > David
> >
> >
> >
> > > > From an implementation perspective, your code is correct only if
> you initialize the MPI library with MPI_THREAD_MULTIPLE and if the
> library accepts it. Otherwise, there is an assumption that the
> application is single threaded, or that the MPI behavior is
> implementation dependent. Please read the MPI standard regarding
> MPI_Init_thread for more details.
> > >
> > > Regards,
> > > george.
> > >
> > > On May 19, 2011, at 02:34, David Büttner wrote:
> > >
> > >> Hello,
> > >>
> > >> I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I
> am using MPI_Isend and MPI_Irecv for communication and
> MPI_Test/MPI_Wait to check if it is done. I do this repeatedly in the
> outer loop of my code. The MPI_Test is used in the inner loop to check
> if some function can be called which depends on the received data.
> > >> The program regularly crashed (only when not using printf...) and
> after debugging it I figured out the following problem:
> > >>
> > >> In MPI_Isend I have an invalid read of memory. I fixed the
> problem with not re-using a
> > >>
> > >> MPI_Request req_s, req_r;
> > >>
> > >> but by using
> > >>
> > >> MPI_Request* req_s;
> > >> MPI_Request* req_r
> > >>
> > >> and re-allocating them before the MPI_Isend/recv.
> > >>
> > >> The documentation says, that in MPI_Wait and MPI_Test (if
> successful) the request-objects are deallocated and set to
> MPI_REQUEST_NULL.
> > >> It also says, that in MPI_Isend and MPI_Irecv, it allocates the
> Objects and associates it with the request object.
> > >>
> > >> As I understand this, this either means I can use a pointer to
> MPI_Request which I don't have to initialize for this (it doesn't work
> but crashes), or that I can use a MPI_Request pointer which I have
> initialized with malloc(sizeof(MPI_REQUEST)) (or passing the address
> of a MPI_Request req), which is set and unset in the functions. But
> this version crashes, too.
> > >> What works is using a pointer, which I allocate before the
> MPI_Isend/recv and which I free after MPI_Wait in every iteration. In
> other words: It only works if I don't reuse any kind of MPI_Request.
> Only if I recreate one every time.
> > >>
> > >> Is this what it should be like? I believe that reuse of the
> memory would be a lot more efficient (fewer calls to malloc...). Am I
> missing something here? Or am I doing something wrong?
> > >>
> > >>
> > >> Let me provide some more detailed information about my problem:
> > >>
> > >> I am running the program on a 30 node infiniband cluster. Each
> node has 4 single core Opteron CPUs. I am running 1 MPI Rank per node
> and 4 threads per rank (-> one thread per core).
> > >> I am compiling with mpicc of OpenMPI using gcc below.
> > >> Some pseudo-code of the program can be found at the end of this
> e-mail.
> > >>
> > >> I was able to reproduce the problem using different amount of
> nodes and even using one node only. The problem does not arise when I
> put printf-debugging information into the code. This pointed me into
> the direction that I have some memory problem, where some write
> accesses some memory it is not supposed to.
> > >> I ran the tests using valgrind with --leak-check=full and
> --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait
> depending on whether I had the threads spin in a loop for MPI_Test to
> return success or used MPI_Wait respectively.
> > >>
> > >> I would appreciate your help with this. Am I missing something
> important here? Is there a way to re-use the request in the different
> iterations other than I thought it should work?
> > >> Or is there a way to re-initialize the allocated memory before
> the MPI_Isend/recv so that I at least don't have to call free and
> malloc each time?
> > >>
> > >> Thank you very much for your help!
> > >> Kind regards,
> > >> David Büttner
> > >>
> > >> _____________________
> > >> Pseudo-Code of program:
> > >>
> > >> MPI_Request* req_s;
> > >> MPI_Request* req_w;
> > >> OUTER-LOOP
> > >> if(0 == threadid)
> > >> {
> > >> req_s = malloc(sizeof(MPI_Request));
> > >> req_r = malloc(sizeof(MPI_Request));
> > >> MPI_Isend(..., req_s)
> > >> MPI_Irecv(..., req_r)
> > >> }
> > >> pthread_barrier
> > >> INNER-LOOP (while NOT_DONE or RET)
> > >> if(TRYLOCK&& NOT_DONE)
> > >> {
> > >> if(MPI_TEST(req_r))
> > >> {
> > >> Call_Function_A;
> > >> NOT_DONE = 0;
> > >> }
> > >>
> > >> }
> > >> RET = Call_Function_B;
> > >> }
> > >> pthread_barrier_wait
> > >> if(0 == threadid)
> > >> {
> > >> MPI_WAIT(req_s)
> > >> MPI_WAIT(req_r)
> > >> free(req_s);
> > >> free(req_r);
> > >> }
> > >> _____________
> > >>
> > >>
> > >> --
> > >> David Büttner, Informatik, Technische Universität München
> > >> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> > >>
> > >> _______________________________________________
> > >> users mailing list
> > >> users_at_[hidden]
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > "To preserve the freedom of the human mind then and freedom of the
> press, every spirit should be ready to devote itself to martyrdom; for
> as long as we may think as we will, and speak as we think, the
> condition of man will proceed in improvement."
> > > -- Thomas Jefferson, 1799
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > --
> > David Büttner, Informatik, Technische Universität München
> > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> >
> >
> >
> > ------------------------------
> >
> > Message: 11
> > Date: Fri, 20 May 2011 06:23:21 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] Trouble with MPI-IO
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <A5B121E9-E664-49D0-AE54-2CFE527129D2_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > On May 19, 2011, at 11:24 PM, Tom Rosmond wrote:
> >
> > > What fortran compiler did you use?
> >
> > gfortran.
> >
> > > In the original script my Intel compile used the -132 option,
> > > allowing up to that many columns per line.
> >
> > Gotcha.
> >
> > >> x.f90:99.77:
> > >>
> > >> call
> mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> > >> 1
> > >> Error: There is no specific subroutine for the generic
> 'mpi_type_indexed' at (1)
> > >
> > > Hmmm, very strange, since I am looking right at the MPI standard
> > > documents with that routine documented. I too get this compile failure
> > > when I switch to 'use mpi'. Could that be a problem with the Open MPI
> > > fortran libraries???
> >
> > I think that that error is telling us that there's a compile-time
> mismatch -- that the signature of what you've passed doesn't match the
> signature of OMPI's MPI_Type_indexed subroutine.
> >
> > >> I looked at our mpi F90 module and see the following:
> > >>
> > >> interface MPI_Type_indexed
> > >> subroutine MPI_Type_indexed(count, array_of_blocklengths,
> array_of_displacements, oldtype, newtype, ierr)
> > >> integer, intent(in) :: count
> > >> integer, dimension(*), intent(in) :: array_of_blocklengths
> > >> integer, dimension(*), intent(in) :: array_of_displacements
> > >> integer, intent(in) :: oldtype
> > >> integer, intent(out) :: newtype
> > >> integer, intent(out) :: ierr
> > >> end subroutine MPI_Type_indexed
> > >> end interface
> >
> > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 12
> > Date: Fri, 20 May 2011 07:26:19 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] MPI_Alltoallv function crashes when np > 100
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <F9F71854-B9DD-459F-999D-8A8AEF8D6006_at_[hidden]>
> > Content-Type: text/plain; charset=GB2312
> >
> > I missed this email in my INBOX, sorry.
> >
> > Can you be more specific about what exact error is occurring? You
> just say that the application crashes...? Please send all the
> information listed here:
> >
> > http://www.open-mpi.org/community/help/
> >
> >
> > On Apr 26, 2011, at 10:51 PM, ?????? wrote:
> >
> > > It seems that the SOMAXCONN constant used by the listen()
> system call causes this problem. Can anybody help me resolve this
> question?
> > >
> > > 2011/4/25 ?????? <xjun.meng_at_[hidden]>
> > > Dear all,
> > >
> > > As I mentioned, when I ran an application with mpirun and the parameter
> "np = 150 (or bigger)", the application that used the MPI_Alltoallv
> function would crash. The problem would recur no matter how many nodes
> we used.
> > >
> > > The edition of OpenMPI: 1.4.1 or 1.4.3
> > > The OS: linux redhat 2.6.32
> > >
> > > BTW, my nodes had enough memory to run the application, and the
> MPI_Alltoall function worked well at my environment.
> > > Did anybody meet the same problem? Thanks.
> > >
> > >
> > > Best Regards
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 13
> > Date: Fri, 20 May 2011 07:28:28 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error,
> > but only sometimes...
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <CAEF632E-757B-49EE-B545-5CCCBC712247_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Sorry for the super-late reply. :-\
> >
> > Yes, ERR_TRUNCATE means that the receiver didn't have a large enough
> buffer.
> >
> > Have you tried upgrading to a newer version of Open MPI? 1.4.3 is
> the current stable release (I have a very dim and not guaranteed to be
> correct recollection that we fixed something in the internals of
> collectives somewhere with regards to ERR_TRUNCATE...?).
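> >
> > One way this typically shows up with collectives (a hypothetical sketch
> > in C, not your code) is the ranks disagreeing on the count, so the
> > posted receive buffer is smaller than the message that arrives:
> >
> >   int rank, n;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >   n = (rank == 0) ? 2 : 1;       /* counts must match on every rank */
> >   int in[2] = {0, 0}, out[2];
> >   /* mismatched counts like this can surface as MPI_ERR_TRUNCATE */
> >   MPI_Allreduce(in, out, n, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
> >
> > So it is worth double-checking that every rank calls the MPI_Allreduce
> > with exactly the same count and datatype on every pass through the code.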
> >
> >
> > On Apr 25, 2011, at 4:44 PM, Wei Hao wrote:
> >
> > > Hi:
> > >
> > > I'm running openmpi 1.2.8. I'm working on a project where one part
> involves communicating an integer, representing the number of data
> points I'm keeping track of, to all the processors. The line is simple:
> > >
> > > MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD);
> > >
> > > where np and geo_N are integers, np is the result of a local
> calculation, and geo_N has been declared on all the processors. geo_N
> is nondecreasing. This line works the first time I call it (geo_N goes
> from 0 to some other integer), but if I call it later in the program,
> I get the following error:
> > >
> > >
> > > [woodhen-039:26189] *** An error occurred in MPI_Allreduce
> > > [woodhen-039:26189] *** on communicator MPI_COMM_WORLD
> > > [woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated
> > > [woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > >
> > >
> > > As I understand it, MPI_ERR_TRUNCATE means that the output buffer
> is too small, but I'm not sure where I've made a mistake. It's
> particularly frustrating because it seems to work fine the first time.
> Does anyone have any thoughts?
> > >
> > > Thanks
> > > Wei
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 14
> > Date: Fri, 20 May 2011 08:14:07 -0400
> > From: Jeff Squyres <jsquyres_at_[hidden]>
> > Subject: Re: [OMPI users] Trouble with MPI-IO
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <42DB03B3-9CF4-4ACB-AA20-B857E5F76087_at_[hidden]>
> > Content-Type: text/plain; charset="us-ascii"
> >
> > On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:
> >
> > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
> >
> > Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the
> compile error (even though they're allocatable -- so allocate was a
> red herring, sorry). That's all that "use mpi" is complaining about --
> that the function signatures didn't match.
> >
> > use mpi is your friend -- even if you don't use F90 constructs much.
> Compile-time checking is Very Good Thing (you were effectively
> "getting lucky" by passing in the 2D arrays, I think).
> >
> > Attached is my final version. And with this version, I see the hang
> when running it with the "T" parameter.
> >
> > That being said, I'm not an expert on the MPI IO stuff -- your code
> *looks* right to me, but I could be missing something subtle in the
> interpretation of MPI_FILE_SET_VIEW. I tried running your code with
> MPICH 1.3.2p1 and it also hung.
> >
> > Rob (ROMIO guy) -- can you comment this code? Is it correct?
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > -------------- next part --------------
> > A non-text attachment was scrubbed...
> > Name: x.f90
> > Type: application/octet-stream
> > Size: 3820 bytes
> > Desc: not available
> > URL:
> <http://www.open-mpi.org/MailArchives/users/attachments/20110520/53a5461b/attachment.obj>
> >
> > ------------------------------
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > End of users Digest, Vol 1911, Issue 1
> > **************************************
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users