MPI can get through your firewall, right?
Damien
On 20/05/2011 12:53 PM, Jason Mackay wrote:
I have verified that disabling UAC does not fix the problem.
xhlp.exe starts, threads spin up on both machines, CPU usage is at
80-90% but no progress is ever made.
From this state, Ctrl-break on the head node yields the
following output:
[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline [[20816,0],0] lost
> From: users-request@open-mpi.org
> Subject: users Digest, Vol 1911, Issue 1
> To: users@open-mpi.org
> Date: Fri, 20 May 2011 08:14:13 -0400
>
> Send users mailing list submissions to
> users@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-request@open-mpi.org
>
> You can reach the person managing the list at
> users-owner@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
> 1. Re: Error: Entry Point Not Found (Zhangping Wei)
> 2. Re: Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test (George Bosilca)
> 3. Re: v1.5.3-x64 does not work on Windows 7 workgroup
> (Jeff Squyres)
> 4. Re: Error: Entry Point Not Found (Jeff Squyres)
> 5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011
> (aka 12.0) (Jeff Squyres)
> 6. Re: Openib with > 32 cores per node (Jeff Squyres)
> 7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres)
> 8. Re: Trouble with MPI-IO (Jeff Squyres)
> 9. Re: Trouble with MPI-IO (Tom Rosmond)
> 10. Re: Problem with MPI_Request, MPI_Isend/recv and
> MPI_Wait/Test (David Büttner)
> 11. Re: Trouble with MPI-IO (Jeff Squyres)
> 12. Re: MPI_Alltoallv function crashes when np > 100
> (Jeff Squyres)
> 13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only
> sometimes... (Jeff Squyres)
> 14. Re: Trouble with MPI-IO (Jeff Squyres)
>
>
>
----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 09:13:53 -0700 (PDT)
> From: Zhangping Wei <zhangping_wei@yahoo.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: users@open-mpi.org
> Message-ID:
<101342.7961.qm@web111818.mail.gq1.yahoo.com>
> Content-Type: text/plain; charset="gb2312"
>
> Dear Paul,
>
> I tried the 'mpirun -np N <cmd>' form you mentioned, but the problem
> was the same.
>
> I guess it may be related to the system I am using, because I have used
> it correctly on another 32-bit XP system.
>
> I look forward to more advice. Thanks.
>
> Zhangping
>
>
>
>
> ________________________________
> From: "users-request@open-mpi.org" <users-request@open-mpi.org>
> To: users@open-mpi.org
> Sent: 2011/5/19 (Thu) 11:00:02
> Subject: users Digest, Vol 1910, Issue 2
>
>
>
> Today's Topics:
>
> 1. Re: Error: Entry Point Not Found (Paul van der Walt)
> 2. Re: Openib with > 32 cores per node (Robert Horton)
> 3. Re: Openib with > 32 cores per node (Samuel K. Gutierrez)
>
>
>
----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 16:14:02 +0100
> From: Paul van der Walt <paul@denknerd.nl>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<BANLkTinjZ0CNtchQJCZYhfGSnR51jPuP7w@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> Hi,
>
> On 19 May 2011 15:54, Zhangping Wei <zhangping_wei@yahoo.com> wrote:
> > 4, I use command window to run it in this way: "mpirun -n 4 **.exe", then I
>
> Probably not the problem, but shouldn't that be 'mpirun -np N <cmd>'?
>
> Paul
>
> --
> O< ascii ribbon campaign - stop html mail -
www.asciiribbon.org
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 May 2011 16:37:56 +0100
> From: Robert Horton <r.horton@qmul.ac.uk>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <1305819476.9663.148.camel@moelwyn>
> Content-Type: text/plain; charset="UTF-8"
>
> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> > Hi,
> >
> > Try the following QP parameters that only use shared receive queues.
> >
> > -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >
>
> Thanks for that. If I run the job over 2 x 48 cores it now works and the
> performance seems reasonable (I need to do some more tuning) but when I
> go up to 4 x 48 cores I'm getting the same problem:
>
> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> Any thoughts?
>
> Thanks,
> Rob
> --
> Robert Horton
> System Administrator (Research Support) - School of
Mathematical Sciences
> Queen Mary, University of London
> r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 May 2011 09:59:13 -0600
> From: "Samuel K. Gutierrez" <samuel@lanl.gov>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<B3E83138-9AF0-48C0-871C-DBBB2E712E12@lanl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote
>
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >>
> >> Try the following QP parameters that only use shared receive queues.
> >>
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> > error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Any thoughts?
>
> How much memory does each node have? Does this happen at startup?
>
> Try adding:
>
> -mca btl_openib_cpc_include rdmacm
>
> I'm not sure if your version of OFED supports this feature, but maybe
> using XRC may help. I **think** other tweaks are needed to get this
> going, but I'm not familiar with the details.
>
> Hope that helps,
>
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
>
> >
> > Thanks,
> > Rob
> > --
> > Robert Horton
> > System Administrator (Research Support) - School of
Mathematical Sciences
> > Queen Mary, University of London
> > r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1910, Issue 2
> **************************************
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 May 2011 08:48:03 -0800
> From: George Bosilca <bosilca@eecs.utk.edu>
> Subject: Re: [OMPI users] Problem with MPI_Request,
MPI_Isend/recv and
> MPI_Wait/Test
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<FCAC66F9-FDB5-48BB-A800-263D8A4F9337@eecs.utk.edu>
> Content-Type: text/plain; charset=iso-8859-1
>
> David,
>
> I do not see any mechanism that restricts access to the requests to a
> single thread. What thread model are you using?
>
> From an implementation perspective, your code is correct only if you
> initialize the MPI library with MPI_THREAD_MULTIPLE and the library
> accepts it. Otherwise, there is an assumption that the application is
> single threaded, or that the MPI behavior is implementation dependent.
> Please read the section of the MPI standard on MPI_Init_thread for
> more details.
>
> Regards,
> george.
>
> On May 19, 2011, at 02:34, David Büttner wrote:
>
> > Hello,
> >
> > I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am
> > using MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait
> > to check if it is done. I do this repeatedly in the outer loop of my
> > code. The MPI_Test is used in the inner loop to check if some function
> > can be called which depends on the received data.
> > The program regularly crashed (only when not using printf...) and
> > after debugging it I figured out the following problem:
> >
> > In MPI_Isend I have an invalid read of memory. I fixed the problem by
> > not re-using a
> >
> > MPI_Request req_s, req_r;
> >
> > but by using
> >
> > MPI_Request* req_s;
> > MPI_Request* req_r;
> >
> > and re-allocating them before the MPI_Isend/recv.
> >
> > The documentation says that in MPI_Wait and MPI_Test (if successful)
> > the request objects are deallocated and set to MPI_REQUEST_NULL.
> > It also says that MPI_Isend and MPI_Irecv allocate the objects and
> > associate them with the request object.
> >
> > As I understand this, it either means I can use a pointer to
> > MPI_Request which I don't have to initialize for this (it doesn't work
> > but crashes), or that I can use an MPI_Request pointer which I have
> > initialized with malloc(sizeof(MPI_Request)) (or pass the address of
> > an MPI_Request req), which is set and unset in the functions. But this
> > version crashes, too.
> > What works is using a pointer, which I allocate before the
> > MPI_Isend/recv and which I free after MPI_Wait in every iteration. In
> > other words: it only works if I don't reuse any kind of MPI_Request,
> > only if I recreate one every time.
> >
> > Is this what it should be like? I believe that reusing the memory
> > would be a lot more efficient (fewer calls to malloc...). Am I missing
> > something here? Or am I doing something wrong?
> >
> >
> > Let me provide some more detailed information about my problem:
> >
> > I am running the program on a 30 node infiniband cluster. Each node
> > has 4 single core Opteron CPUs. I am running 1 MPI rank per node and 4
> > threads per rank (-> one thread per core).
> > I am compiling with mpicc of OpenMPI, which uses gcc underneath.
> > Some pseudo-code of the program can be found at the end of this
> > e-mail.
> >
> > I was able to reproduce the problem using different numbers of nodes
> > and even using one node only. The problem does not arise when I put
> > printf-debugging information into the code. This pointed me in the
> > direction that I have some memory problem, where some write accesses
> > memory it is not supposed to.
> > I ran the tests using valgrind with --leak-check=full and
> > --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait
> > depending on whether I had the threads spin in a loop for MPI_Test to
> > return success or used MPI_Wait respectively.
> >
> > I would appreciate your help with this. Am I missing something
> > important here? Is there a way to re-use the request in the different
> > iterations other than how I thought it should work?
> > Or is there a way to re-initialize the allocated memory before the
> > MPI_Isend/recv so that I at least don't have to call free and malloc
> > each time?
> >
> > Thank you very much for your help!
> > Kind regards,
> > David Büttner
> >
> > _____________________
> > Pseudo-Code of program:
> >
> > MPI_Request* req_s;
> > MPI_Request* req_r;
> > OUTER-LOOP
> >   if(0 == threadid)
> >   {
> >     req_s = malloc(sizeof(MPI_Request));
> >     req_r = malloc(sizeof(MPI_Request));
> >     MPI_Isend(..., req_s)
> >     MPI_Irecv(..., req_r)
> >   }
> >   pthread_barrier
> >   INNER-LOOP (while NOT_DONE or RET)
> >   {
> >     if(TRYLOCK && NOT_DONE)
> >     {
> >       if(MPI_TEST(req_r))
> >       {
> >         Call_Function_A;
> >         NOT_DONE = 0;
> >       }
> >
> >     }
> >     RET = Call_Function_B;
> >   }
> >   pthread_barrier_wait
> >   if(0 == threadid)
> >   {
> >     MPI_WAIT(req_s)
> >     MPI_WAIT(req_r)
> >     free(req_s);
> >     free(req_r);
> >   }
> > _____________
> >
> >
> > --
> > David Büttner, Informatik, Technische Universität München
> > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> >
>
> "To preserve the freedom of the human mind then and freedom
of the press, every spirit should be ready to devote itself to
martyrdom; for as long as we may think as we will, and speak as we
think, the condition of man will proceed in improvement."
> -- Thomas Jefferson, 1799
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 May 2011 21:22:48 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows
7
> workgroup
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<278274F0-BF00-4498-950F-9779E0083C5A@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Unfortunately, our Windows guy (Shiqing) is off getting married and
> will be out for a little while. :-(
>
> All that I can cite is the README.WINDOWS.txt file in the top-level
> directory. I'm afraid that I don't know much else about Windows. :-(
>
>
> On May 18, 2011, at 8:17 PM, Jason Mackay wrote:
>
> > Hi all,
> >
> > My thanks to all those involved for putting together this Windows
> > binary release of OpenMPI! I am hoping to use it in a small Windows
> > based OpenMPI cluster at home.
> >
> > Unfortunately my experience so far has not exactly been trouble free.
> > It seems that, due to the fact that this release is using WMI, there
> > are a number of settings that must be configured on the machines in
> > order to get this to work. These settings are not documented in the
> > distribution at all. I have been experimenting with it for over a week
> > on and off and as soon as I solve one problem, another one arises.
> >
> > Currently, after much searching, reading, and tinkering with DCOM
> > settings etc..., I can remotely start processes on all my machines
> > using mpirun but those processes cannot access network shares (e.g.
> > for binary distribution) and HPL (which works on any one node) does
> > not seem to work if I run it across multiple nodes, also indicating a
> > network issue (CPU sits at 100% in all processes with no network
> > traffic and never terminates). To eliminate permission issues that may
> > be caused by UAC I tried the same setup on two domain machines using
> > an administrative account to launch and the behavior was the same. I
> > have read that WMI processes cannot access network resources and I am
> > at a loss for a solution to this newest of problems. If anyone knows
> > how to make this work I would appreciate the help. I assume that
> > someone has gotten this working and has the answers.
> >
> > I have searched the mailing list archives and I found other users with
> > similar problems but no clear guidance on the threads. Some threads
> > make references to Microsoft KB articles but do not explicitly tell
> > the user what needs to be done, leaving each new user to rediscover
> > the tricks on their own. One thread made it appear that testing had
> > only been done on Windows XP. Needless to say, security has changed
> > dramatically in Windows since XP!
> >
> > I would like to see OpenMPI for Windows be usable by a newcomer
> > without all of this pain.
> >
> > What would be fantastic would be:
> > 1) a step-by-step procedure for how to get OpenMPI 1.5 working on
> >    Windows
> >    a) preferably in a bare Windows 7 workgroup environment with
> >       nothing else (i.e. no Microsoft Cluster Compute Pack, no domain
> >       etc...)
> > 2) inclusion of these steps in the binary distribution
> > 3) bonus points for a script which accomplishes these things
> >    automatically
> >
> > If someone can help with (1), I would happily volunteer my time to
> > work on (3).
> >
> > Regards,
> > Jason
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 19 May 2011 21:26:43 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<F830EC35-FC9B-4801-B2A3-50F54D2152A4@cisco.com>
> Content-Type: text/plain; charset=windows-1252
>
> On May 19, 2011, at 10:54 AM, Zhangping Wei wrote:
>
> > 4, I use command window to run it in this way: "mpirun -n 4 **.exe",
> > then I met the error: "entry point not found: the procedure entry
> > point inet_pton could not be located in the dynamic link library
> > WS2_32.dll"
>
> Unfortunately our Windows developer/maintainer is out for a little
> while (he's getting married); he pretty much did the Windows stuff by
> himself, so none of the rest of us know much about it. :(
>
> inet_pton is a standard function call relating to IP addresses that we
> use in the internals of OMPI; I'm not sure why it wouldn't be found on
> Windows XP (Shiqing did cite that the OMPI Windows port should work on
> Windows XP).
>
> This post seems to imply that inet_ntop is only available on Vista and
> above:
>
> http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e40465f2-41b7-4243-ad33-15ae9366f4e6/
>
> So perhaps Shiqing needs to put in some kind of portability workaround
> for OMPI, and the current binaries won't actually work for XP...?
>
> I can't say that for sure because I really know very little about
> Windows; we'll unfortunately have to wait until he returns to get a
> definitive answer. :-(
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 19 May 2011 21:37:49 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel
composer
> XE 2011 (aka 12.0)
> To: Open MPI Users <users@open-mpi.org>
> Cc: Giovanni Bracco <giovanni.bracco@enea.it>, Agostino
Funel
> <agostino.funel@enea.it>, Fiorenzo Ambrosino
> <fiorenzo.ambrosino@enea.it>, Guido Guarnieri
> <guido.guarnieri@enea.it>, Roberto Ciavarella
> <roberto.ciavarella@enea.it>, Salvatore Podda
> <salvatore.podda@enea.it>, Giovanni Ponti
<giovanni.ponti@enea.it>
> Message-ID:
<45362608-B8B0-4ADE-9959-B35C5690A6F3@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry for the late reply.
>
> Other users have seen something similar but we have never been able to
> reproduce it. Is this only when using IB? If you use
> "mpirun --mca btl_openib_cpc_include rdmacm", does the problem go
> away?
>
>
> On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
>
> > I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but
> > only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then
> > the collective hangs go away. I don't know what, if anything, the
> > higher optimization buys you when compiling openmpi, so I'm not sure
> > if that's an acceptable workaround or not.
> >
> > My system is similar to yours - Intel X5570 with QDR Mellanox IB
> > running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
> > using IMB 3.2.2 with a single iteration of Barrier to reproduce the
> > hang, and it happens 100% of the time for me when I invoke it like
> > this:
> >
> > # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
> >
> > The hang happens on the first Barrier (64 ranks) and each of the
> > participating ranks has this backtrace:
> >
> > __poll (...)
> > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > IMB_barrier ()
> > IMB_init_buffers_iter ()
> > main ()
> >
> > The one non-participating rank has this backtrace:
> >
> > __poll (...)
> > poll_dispatch () from [instdir]/lib/libopen-pal.so.0
> > opal_event_loop () from [instdir]/lib/libopen-pal.so.0
> > opal_progress () from [instdir]/lib/libopen-pal.so.0
> > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
> > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
> > PMPI_Barrier () from [instdir]/lib/libmpi.so.0
> > main ()
> >
> > If I use more nodes I can get it to hang with 1ppn, so that seems to
> > rule out the sm btl (or interactions with it) as a culprit at least.
> >
> > I can't reproduce this with openmpi 1.5.3, interestingly.
> >
> > -Marcus
> >
> >
> > On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> >> Dear all,
> >>
> >> we succeeded in building several versions of openmpi, from 1.2.8 to
> >> 1.4.3, with Intel composer XE 2011 (aka 12.0).
> >> However, we found a threshold in the number of cores (depending on
> >> the application: IMB, xhpl or user applications, and on the number of
> >> required cores) above which the application hangs (sort of
> >> deadlocks).
> >> Building openmpi with 'gcc' and 'pgi' does not show the same limits.
> >> Are there any known incompatibilities of openmpi with this version of
> >> the Intel compilers?
> >>
> >> The characteristics of our computational
infrastructure are:
> >>
> >> Intel processors E7330, E5345, E5530 and E5620
> >>
> >> CentOS 5.3, CentOS 5.5.
> >>
> >> Intel composer XE 2011
> >> gcc 4.1.2
> >> pgi 10.2-1
> >>
> >> Regards
> >>
> >> Salvatore Podda
> >>
> >> ENEA UTICT-HPC
> >> Department for Computer Science Development and ICT
> >> Facilities Laboratory for Science and High Performance Computing
> >> C.R. Frascati
> >> Via E. Fermi, 45
> >> PoBox 65
> >> 00044 Frascati (Rome)
> >> Italy
> >>
> >> Tel: +39 06 9400 5342
> >> Fax: +39 06 9400 5551
> >> Fax: +39 06 9400 5735
> >> E-mail: salvatore.podda@enea.it
> >> Home Page: www.cresco.enea.it
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 19 May 2011 22:01:00 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Openib with > 32 cores per node
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<C18C4827-D305-484A-9DAE-290902D40DB3@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> What Sam is alluding to is that the OpenFabrics driver code in OMPI is
> sucking up oodles of memory for each IB connection that you're using.
> The receive_queues param that he sent tells OMPI to use all shared
> receive queues (instead of defaulting to one per-peer receive queue and
> the rest shared receive queues -- the per-peer RQ sucks up all the
> memory when you multiply it by N peers).
>
>
> On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote:
>
> > Hi,
> >
> > On May 19, 2011, at 9:37 AM, Robert Horton wrote
> >
> >> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >>> Hi,
> >>>
> >>> Try the following QP parameters that only use shared receive queues.
> >>>
> >>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>>
> >>
> >> Thanks for that. If I run the job over 2 x 48 cores it now works and the
> >> performance seems reasonable (I need to do some more tuning) but when I
> >> go up to 4 x 48 cores I'm getting the same problem:
> >>
> >> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> >> error creating qp errno says Cannot allocate memory
> >> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> >> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> >> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> >> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >>
> >> Any thoughts?
> >
> > How much memory does each node have? Does this happen at startup?
> >
> > Try adding:
> >
> > -mca btl_openib_cpc_include rdmacm
> >
> > I'm not sure if your version of OFED supports this feature, but maybe
> > using XRC may help. I **think** other tweaks are needed to get this
> > going, but I'm not familiar with the details.
> >
> > Hope that helps,
> >
> > Samuel K. Gutierrez
> > Los Alamos National Laboratory
> >
> >
> >>
> >> Thanks,
> >> Rob
> >> --
> >> Robert Horton
> >> System Administrator (Research Support) - School of
Mathematical Sciences
> >> Queen Mary, University of London
> >> r.horton@qmul.ac.uk - +44 (0) 20 7882 7345
> >>
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 19 May 2011 22:04:46 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI
1.4.1
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<0DCF20B8-CA5C-4746-8187-A2DFF39B15DD@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 13, 2011, at 8:31 AM, francoise.roch@obs.ujf-grenoble.fr wrote:
>
> > Here is the MUMPS portion of code (in zmumps_part1.F file) where the
> > slaves call MPI_COMM_DUP; id%PAR and MASTER are initialized to 0
> > before:
> >
> > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>
> I re-indented so that I could read it better:
>
>       CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>       IF ( id%PAR .eq. 0 ) THEN
>         IF ( id%MYID .eq. MASTER ) THEN
>           color = MPI_UNDEFINED
>         ELSE
>           color = 0
>         END IF
>         CALL MPI_COMM_SPLIT( id%COMM, color, 0,
>      &       id%COMM_NODES, IERR )
>         id%NSLAVES = id%NPROCS - 1
>       ELSE
>         CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
>         id%NSLAVES = id%NPROCS
>       END IF
>
>       IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
>         CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
>       ENDIF
>
> That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are
> collective, meaning that all processes in the communicator must call
> them. In the first case, only some processes are calling
> MPI_COMM_SPLIT. Is there some other logic that forces the rest of the
> processes to call MPI_COMM_SPLIT, too?
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 19 May 2011 22:30:03 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID:
<EEFB638F-72F1-4208-8EA2-4F25F610C47B@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Props for that testio script. I think you win the award for "most easy
> to reproduce test case." :-)
>
> I notice that some of the lines went over 72 columns, so I renamed the
> file x.f90 and changed all the comments from "c" to "!" and joined the
> two &-split lines. The error about implicit type for lenr went away,
> but then when I enabled better type checking by using "use mpi" instead
> of "include 'mpif.h'", I got the following:
>
> x.f90:99.77:
>
> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> 1
> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> I looked at our mpi F90 module and see the following:
>
> interface MPI_Type_indexed
>   subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
>     integer, intent(in) :: count
>     integer, dimension(*), intent(in) :: array_of_blocklengths
>     integer, dimension(*), intent(in) :: array_of_displacements
>     integer, intent(in) :: oldtype
>     integer, intent(out) :: newtype
>     integer, intent(out) :: ierr
>   end subroutine MPI_Type_indexed
> end interface
>
> I don't quite grok the syntax of the "allocatable" type ijdisp, so that
> might be the problem here...?
>
> Regardless, I'm not entirely sure if the problem is the >72 character
> lines, but then when that is gone, I'm not sure how the allocatable
> stuff fits in... (I'm not enough of a Fortran programmer to know)
>
>
>
>
> On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
>
> > I would appreciate it if someone with MPI-IO experience could look at
> > the simple fortran program gzipped and attached to this note. It is
> > embedded in a script so that all that is necessary to run it is to
> > type 'testio' at the command line. The program generates a small 2-D
> > input array, sets up an MPI-IO environment, and writes a 2-D output
> > array twice, with the only difference being the displacement arrays
> > used to construct the indexed datatype. For the first write, simple
> > monotonically increasing displacements are used; for the second the
> > displacements are 'shuffled' in one dimension. They are printed during
> > the run.
> >
> > For the first case the file is written properly, but for the second
> > the program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted
> > manually. Although the program is compiled as an mpi program, I am
> > running on a single processor, which makes the problem more puzzling.
> >
> > The program should be relatively self-explanatory, but if more
> > information is needed, please ask. I am on an 8 core Xeon based Dell
> > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> >
> > T. Rosmond
> >
> >
> >
> > <testio.gz><info_ompi.gz>
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 19 May 2011 20:24:25 -0700
> From: Tom Rosmond <rosmond@reachone.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <1305861865.4284.104.camel@cedar.reachone.com>
> Content-Type: text/plain
>
> Thanks for looking at my problem. It sounds like you did reproduce it.
> I have added some comments below.
>
> On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> > Props for that testio script. I think you win the award for "most easy
> > to reproduce test case." :-)
> >
> > I notice that some of the lines went over 72 columns, so I renamed the
> > file x.f90 and changed all the comments from "c" to "!" and joined the
> > two &-split lines. The error about implicit type for lenr went away,
> > but then when I enabled better type checking by using "use mpi" instead
> > of "include 'mpif.h'", I got the following:
>
> What fortran compiler did you use?
>
> In the original script my Intel compile used the -132 option,
> allowing up to that many columns per line. I still think in
> F77 fortran much of the time, and use 'c' for comments out
> of habit. The change to '!' doesn't make any difference.
>
>
> > x.f90:99.77:
> >
> > call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> > 1
> > Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> Hmmm, very strange, since I am looking right at the MPI standard
> documents with that routine documented. I too get this compile failure
> when I switch to 'use mpi'. Could that be a problem with the Open MPI
> Fortran libraries?
> >
> > I looked at our mpi F90 module and see the following:
> >
> > interface MPI_Type_indexed
> > subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
> > integer, intent(in) :: count
> > integer, dimension(*), intent(in) :: array_of_blocklengths
> > integer, dimension(*), intent(in) :: array_of_displacements
> > integer, intent(in) :: oldtype
> > integer, intent(out) :: newtype
> > integer, intent(out) :: ierr
> > end subroutine MPI_Type_indexed
> > end interface
> >
> > I don't quite grok the syntax of the "allocatable" type ijdisp, so
> > that might be the problem here...?
>
> Just a standard F90 'allocatable' statement. I've written thousands
> just like it.
> >
> > Regardless, I'm not entirely sure if the problem is the >72-character
> > lines, but then when that is gone, I'm not sure how the allocatable
> > stuff fits in... (I'm not enough of a Fortran programmer to know)
> >
> Anyone else out there who can comment?
>
>
> T. Rosmond
>
>
>
> >
> > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
> >
> > > I would appreciate it if someone with MPI-IO experience would look at
> > > the simple Fortran program gzipped and attached to this note. It is
> > > embedded in a script, so all that is necessary to run it is to type
> > > 'testio' on the command line. The program generates a small 2-D input
> > > array, sets up an MPI-IO environment, and writes a 2-D output array
> > > twice, with the only difference being the displacement arrays used to
> > > construct the indexed datatype. For the first write, simple
> > > monotonically increasing displacements are used; for the second, the
> > > displacements are 'shuffled' in one dimension. They are printed during
> > > the run.
> > >
> > > For the first case the file is written properly, but for the second
> > > the program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted
> > > manually. Although the program is compiled as an MPI program, I am
> > > running on a single processor, which makes the problem more puzzling.
> > >
> > > The program should be relatively self-explanatory, but if more
> > > information is needed, please ask. I am on an 8-core Xeon-based Dell
> > > workstation running Scientific Linux 5.5, Intel Fortran 12.0.3, and
> > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> > >
> > > T. Rosmond
> > >
> > >
> > >
> > > <testio.gz><info_ompi.gz>
> > > _______________________________________________
> > > users mailing list
> > > users@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
>
>
> ------------------------------
>
> Message: 10
> Date: Fri, 20 May 2011 09:25:14 +0200
> From: David Büttner <david.buettner@in.tum.de>
> Subject: Re: [OMPI users] Problem with MPI_Request,
MPI_Isend/recv and
> MPI_Wait/Test
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <4DD6175A.1080403@in.tum.de>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hello,
>
> thanks for the quick answer. I am sorry that I forgot to mention this: I
> did compile OpenMPI with MPI_THREAD_MULTIPLE support, and I test whether
> required == provided after the MPI_Init_thread call.
>
> > I do not see any mechanism for protecting the accesses to the requests
> > to a single thread? What is the thread model you're using?
> >
> Again, I am sorry that this was not clear: in the pseudo-code below I
> wanted to indicate the access protection I implement through
> thread-id-dependent calls, if(0 == thread-id), and through trylock(...)
> (using pthread mutexes). In the code, all accesses to a given MPI_Request
> (these are pthread-global pointers in my case) are protected and made
> in sequential order, i.e. MPI_Isend/MPI_Irecv has returned before any
> thread is allowed to call the corresponding MPI_Test, and no one can call
> MPI_Test any more once a thread is allowed to call MPI_Wait.
> I have done this in the same manner before with other MPI implementations,
> and also on the same machine with the same (untouched) OpenMPI
> implementation, likewise combining pthreads and MPI, but there I used
>
> MPI_Request req;
>
> instead of
>
> MPI_Request* req;
> (and later)
> req = (MPI_Request*)malloc(sizeof(MPI_Request));
>
>
> In my recent (problematic) code, I also tried not using pointers, but got
> the same problem. Also, as I described in the first mail, I tried
> everything concerning the memory allocation of the MPI_Request objects.
> I tried not calling malloc. I guessed this wouldn't work, but the
> OpenMPI documentation says this:
>
> "Nonblocking calls allocate a communication request object and
> associate it with the request handle (the argument request)."
> [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and
>
> "[...] if the communication object was created by a nonblocking send or
> receive, then it is deallocated and the request handle is set to
> MPI_REQUEST_NULL."
> [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly
> different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php]
>
> So I thought that it might do some kind of optimized memory management
> internally.
>
> I also tried allocating req (for each MPI_Request used) once before the
> first use and deallocating it after the last use (which I thought was the
> way it was supposed to work), but that also crashes.
>
> I tried replacing the pointers with global variables
>
> MPI_Request req;
>
> which didn't do the job...
>
> The only thing that seems to work is what I mentioned below: allocate
> the request each time I need it for MPI_Isend/MPI_Irecv, use it in
> MPI_Test/MPI_Wait, and then deallocate it by hand each time.
> I don't think it is supposed to work like this, since I have to call
> malloc and free so often (for multiple MPI_Request objects in
> each iteration) that it will most likely limit performance...
>
> Anyway, I still have the same problem and am still unclear about what kind
> of memory allocation I should be doing for the MPI_Requests. Is there
> anything else (besides MPI_THREAD_MULTIPLE support, thread access
> control, and the sequential order of MPI_Isend/MPI_Irecv, MPI_Test and
> MPI_Wait for one MPI_Request object) I need to take care of? If not, what
> could I do to find the source of my problem?
>
> Thanks again for any kind of help!
>
> Kind regards,
> David
>
>
>
> > From an implementation perspective, your code is correct only if you
> > initialize the MPI library with MPI_THREAD_MULTIPLE and if the library
> > accepts it. Otherwise, there is an assumption that the application is
> > single threaded, or that the MPI behavior is implementation dependent.
> > Please read the MPI standard regarding MPI_Init_thread for more details.
> >
> > Regards,
> > george.
> >
> > On May 19, 2011, at 02:34 , David Büttner wrote:
> >
> >> Hello,
> >>
> >> I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am
> >> using MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait
> >> to check whether it is done. I do this repeatedly in the outer loop of
> >> my code. MPI_Test is used in the inner loop to check whether some
> >> function that depends on the received data can be called.
> >> The program regularly crashed (only when not using printf...), and
> >> after debugging it I identified the following problem:
> >>
> >> In MPI_Isend I have an invalid read of memory. I fixed the problem by
> >> not re-using a
> >>
> >> MPI_Request req_s, req_r;
> >>
> >> but by using
> >>
> >> MPI_Request* req_s;
> >> MPI_Request* req_r;
> >>
> >> and re-allocating them before the MPI_Isend/recv.
> >>
> >> The documentation says that in MPI_Wait and MPI_Test (if successful)
> >> the request objects are deallocated and set to MPI_REQUEST_NULL.
> >> It also says that MPI_Isend and MPI_Irecv allocate the objects and
> >> associate them with the request handle.
> >>
> >> As I understand it, this means either that I can use a pointer to
> >> MPI_Request which I don't have to initialize (this doesn't work but
> >> crashes), or that I can use an MPI_Request pointer which I have
> >> initialized with malloc(sizeof(MPI_Request)) (or pass the address of
> >> an MPI_Request req), which is set and unset in the functions. But this
> >> version crashes, too.
> >> What works is using a pointer which I allocate before the
> >> MPI_Isend/MPI_Irecv and free after MPI_Wait in every iteration. In
> >> other words: it only works if I don't reuse any kind of MPI_Request,
> >> only if I recreate one every time.
> >>
> >> Is this what it should be like? I believe that reusing the memory would
> >> be a lot more efficient (fewer calls to malloc...). Am I missing
> >> something here? Or am I doing something wrong?
> >>
> >>
> >> Let me provide some more detailed information about
my problem:
> >>
> >> I am running the program on a 30-node InfiniBand cluster. Each node has
> >> 4 single-core Opteron CPUs. I am running 1 MPI rank per node and 4
> >> threads per rank (-> one thread per core).
> >> I am compiling with mpicc of OpenMPI using gcc below.
> >> Some pseudo-code of the program can be found at the end of this e-mail.
> >>
> >> I was able to reproduce the problem using different numbers of nodes,
> >> and even using only one node. The problem does not arise when I put
> >> printf debugging output into the code. This suggested to me that I have
> >> a memory problem, where some write accesses memory it is not supposed
> >> to.
> >> I ran the tests using valgrind with --leak-check=full and
> >> --show-reachable=yes, which pointed me either to MPI_Isend or to
> >> MPI_Wait, depending on whether I had the threads spin in a loop waiting
> >> for MPI_Test to return success or used MPI_Wait, respectively.
> >>
> >> I would appreciate your help with this. Am I missing something
> >> important here? Is there a way to re-use the request across iterations
> >> other than the way I thought it should work?
> >> Or is there a way to re-initialize the allocated memory before the
> >> MPI_Isend/MPI_Irecv so that I at least don't have to call free and
> >> malloc each time?
> >>
> >> Thank you very much for your help!
> >> Kind regards,
> >> David Büttner
> >>
> >> _____________________
> >> Pseudo-Code of program:
> >>
> >> MPI_Request* req_s;
> >> MPI_Request* req_r;
> >> OUTER-LOOP
> >> if(0 == threadid)
> >> {
> >> req_s = malloc(sizeof(MPI_Request));
> >> req_r = malloc(sizeof(MPI_Request));
> >> MPI_Isend(..., req_s)
> >> MPI_Irecv(..., req_r)
> >> }
> >> pthread_barrier
> >> INNER-LOOP (while NOT_DONE or RET)
> >> if(TRYLOCK&& NOT_DONE)
> >> {
> >> if(MPI_TEST(req_r))
> >> {
> >> Call_Function_A;
> >> NOT_DONE = 0;
> >> }
> >>
> >> }
> >> RET = Call_Function_B;
> >> }
> >> pthread_barrier_wait
> >> if(0 == threadid)
> >> {
> >> MPI_WAIT(req_s)
> >> MPI_WAIT(req_r)
> >> free(req_s);
> >> free(req_r);
> >> }
> >> _____________
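The inner-loop protection in the pseudo-code above (a trylock around the completion test, so that only one thread runs Function A) can be sketched in plain pthreads. This is only an illustration under stated assumptions, not the poster's actual code: fake_test() is a hypothetical stand-in for MPI_Test(req_r), and run_workers(), polls_left and function_a_calls are invented names. No MPI calls are made.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 16

static pthread_mutex_t test_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool not_done;          /* NOT_DONE in the pseudo-code */
static atomic_int  polls_left;        /* stand-in: "arrives" after N polls */
static atomic_int  function_a_calls;  /* how often Function A ran */

/* Hypothetical stand-in for MPI_Test(req_r): true once data "arrived". */
static bool fake_test(void)
{
    return atomic_fetch_sub(&polls_left, 1) <= 1;
}

static void *worker(void *arg)
{
    (void)arg;
    while (atomic_load(&not_done)) {
        /* Only the thread that wins the trylock may poll the request. */
        if (pthread_mutex_trylock(&test_lock) == 0) {
            if (atomic_load(&not_done) && fake_test()) {
                atomic_fetch_add(&function_a_calls, 1); /* Call_Function_A */
                atomic_store(&not_done, false);
            }
            pthread_mutex_unlock(&test_lock);
        }
        /* Call_Function_B would run here on every thread. */
    }
    return NULL;
}

/* Runs nthreads workers; returns how often Function A ran, -1 on error. */
int run_workers(int nthreads, int polls_until_complete)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;
    bool failed = false;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    atomic_store(&not_done, true);
    atomic_store(&polls_left, polls_until_complete);
    atomic_store(&function_a_calls, 0);

    for (; started < nthreads; started++) {
        if (pthread_create(&tid[started], NULL, worker, NULL) != 0) {
            failed = true;
            atomic_store(&not_done, false); /* let started workers exit */
            break;
        }
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return failed ? -1 : atomic_load(&function_a_calls);
}
```

Calling run_workers(4, 1000) mirrors the 4-threads-per-rank setup described above. Re-checking NOT_DONE after the lock is taken is what keeps a second thread from running Function A.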
> >>
> >>
> >> --
> >> David Büttner, Informatik, Technische Universität München
> >> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
> >>
> >> _______________________________________________
> >> users mailing list
> >> users@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > "To preserve the freedom of the human mind then and
freedom of the press, every spirit should be ready to devote
itself to martyrdom; for as long as we may think as we will, and
speak as we think, the condition of man will proceed in
improvement."
> > -- Thomas Jefferson, 1799
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> David Büttner, Informatik, Technische Universität München
> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676
>
>
>
> ------------------------------
>
> Message: 11
> Date: Fri, 20 May 2011 06:23:21 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <A5B121E9-E664-49D0-AE54-2CFE527129D2@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 19, 2011, at 11:24 PM, Tom Rosmond wrote:
>
> > What fortran compiler did you use?
>
> gfortran.
>
> > In the original script my Intel compile used the -132 option,
> > allowing up to that many columns per line.
>
> Gotcha.
>
> >> x.f90:99.77:
> >>
> >> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
> >> 1
> >> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
> >
> > Hmmm, very strange, since I am looking right at the MPI standard
> > documents with that routine documented. I too get this compile failure
> > when I switch to 'use mpi'. Could that be a problem with the Open MPI
> > fortran libraries???
>
> I think that error is telling us that there's a compile-time mismatch --
> that the signature of what you've passed doesn't match the signature of
> OMPI's MPI_Type_indexed subroutine.
>
> >> I looked at our mpi F90 module and see the following:
> >>
> >> interface MPI_Type_indexed
> >> subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
> >> integer, intent(in) :: count
> >> integer, dimension(*), intent(in) :: array_of_blocklengths
> >> integer, dimension(*), intent(in) :: array_of_displacements
> >> integer, intent(in) :: oldtype
> >> integer, intent(out) :: newtype
> >> integer, intent(out) :: ierr
> >> end subroutine MPI_Type_indexed
> >> end interface
>
> Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 12
> Date: Fri, 20 May 2011 07:26:19 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_Alltoallv function crashes when
np > 100
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <F9F71854-B9DD-459F-999D-8A8AEF8D6006@cisco.com>
> Content-Type: text/plain; charset=GB2312
>
> I missed this email in my INBOX, sorry.
>
> Can you be more specific about what exact error is occurring? You just
> say that the application crashes...? Please send all the information
> listed here:
>
> http://www.open-mpi.org/community/help/
>
>
> On Apr 26, 2011, at 10:51 PM, ?????? wrote:
>
> > It seems that the constant SOMAXCONN, which is used by the listen()
> > system call, causes this problem. Can anybody help me resolve this
> > issue?
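Whether SOMAXCONN is really the cause is not established anywhere in this thread. As a minimal sketch for inspecting the values involved, the helper below returns the compile-time constant from <sys/socket.h>, and a second helper reads the Linux runtime limit from /proc/sys/net/core/somaxconn. Both function names are hypothetical, and the /proc path is Linux-specific (the helper returns -1 where it is unavailable).

```c
#include <stdio.h>
#include <sys/socket.h>

/* Compile-time backlog constant that listen() callers commonly pass. */
int compiled_somaxconn(void)
{
    return SOMAXCONN;
}

/* Linux only: runtime cap on the listen() backlog, or -1 if unreadable. */
int runtime_somaxconn(void)
{
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Comparing the two numbers with the process count (np = 150 in the report below) would at least show whether the backlog is smaller than the number of peers that might connect at once.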
> >
> > 2011/4/25 ?????? <xjun.meng@gmail.com>
> > Dear all,
> >
> > As I mentioned, when I ran an application under mpirun with the
> > parameter "np = 150" (or bigger), the application, which uses the
> > MPI_Alltoallv function, would crash. The problem would recur no matter
> > how many nodes we used.
> >
> > The version of OpenMPI: 1.4.1 or 1.4.3
> > The OS: Linux Red Hat 2.6.32
> >
> > BTW, my nodes had enough memory to run the application, and the
> > MPI_Alltoall function worked well in my environment.
> > Did anybody encounter the same problem? Thanks.
> >
> >
> > Best Regards
> >
> >
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 20 May 2011 07:28:28 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with
MPI_Allreduce() error,
> but only sometimes...
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <CAEF632E-757B-49EE-B545-5CCCBC712247@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry for the super-late reply. :-\
>
> Yes, ERR_TRUNCATE means that the receiver didn't have a large enough
> buffer.
>
> Have you tried upgrading to a newer version of Open MPI? 1.4.3 is the
> current stable release (I have a very dim and not guaranteed to be
> correct recollection that we fixed something in the internals of
> collectives somewhere with regards to ERR_TRUNCATE...?).
>
>
> On Apr 25, 2011, at 4:44 PM, Wei Hao wrote:
>
> > Hi:
> >
> > I'm running openmpi 1.2.8. I'm working on a project where one part
> > involves communicating an integer, representing the number of data
> > points I'm keeping track of, to all the processors. The line is simple:
> >
> > MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD);
> >
> > where np and geo_N are integers, np is the result of a local
> > calculation, and geo_N has been declared on all the processors. geo_N
> > is nondecreasing. This line works the first time I call it (geo_N goes
> > from 0 to some other integer), but if I call it later in the program, I
> > get the following error:
> >
> >
> > [woodhen-039:26189] *** An error occurred in MPI_Allreduce
> > [woodhen-039:26189] *** on communicator MPI_COMM_WORLD
> > [woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated
> > [woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye)
> >
> >
> > As I understand it, MPI_ERR_TRUNCATE means that the output buffer is
> > too small, but I'm not sure where I've made a mistake. It's
> > particularly frustrating because it seems to work fine the first time.
> > Does anyone have any thoughts?
> >
> > Thanks
> > Wei
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 14
> Date: Fri, 20 May 2011 08:14:07 -0400
> From: Jeff Squyres <jsquyres@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <users@open-mpi.org>
> Message-ID: <42DB03B3-9CF4-4ACB-AA20-B857E5F76087@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:
>
> > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
>
> Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile
> error (even though they're allocatable -- so allocate was a red herring,
> sorry). That's all that "use mpi" is complaining about -- that the
> function signatures didn't match.
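To make the "1D arrays" point concrete: MPI_Type_indexed takes a count plus two flat integer arrays of that length. The sketch below fills such arrays for a hypothetical ni x nj grid stored column-major (as Fortran would store it), with one element per block and an optional column shuffle. The function name, the grid shape and the perm argument are assumptions for illustration. They are not taken from the attached program, and no MPI call is made here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill flat (1-D) block-length and displacement arrays of length ni*nj.
 * perm[j] is the source column for output column j (identity = unshuffled).
 * Returns the number of entries written, or -1 on invalid input.
 */
int build_indexed_layout(int ni, int nj, const int *perm,
                         int *blocklens, int *displs)
{
    int k = 0;

    if (ni <= 0 || nj <= 0 || perm == NULL ||
        blocklens == NULL || displs == NULL)
        return -1;

    for (int j = 0; j < nj; j++) {
        if (perm[j] < 0 || perm[j] >= nj)
            return -1;
        for (int i = 0; i < ni; i++) {
            blocklens[k] = 1;                /* one element per block */
            displs[k] = perm[j] * ni + i;    /* column-major offset */
            k++;
        }
    }
    assert(k == ni * nj);
    return k;
}
```

In C the result would then be handed to MPI_Type_indexed(k, blocklens, displs, MPI_FLOAT, &newtype). The Fortran binding expects the same shape, which is what the "use mpi" interface enforces at compile time.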
>
> "use mpi" is your friend -- even if you don't use F90 constructs much.
> Compile-time checking is a Very Good Thing (you were effectively
> "getting lucky" by passing in the 2D arrays, I think).
>
> Attached is my final version. And with this version, I see the hang
> when running it with the "T" parameter.
>
> That being said, I'm not an expert on the MPI IO stuff -- your code
> *looks* right to me, but I could be missing something subtle in the
> interpretation of MPI_FILE_SET_VIEW. I tried running your code with
> MPICH 1.3.2p1 and it also hung.
>
> Rob (ROMIO guy) -- can you comment on this code? Is it correct?
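One thing that may be worth checking, although nobody in this thread has confirmed it as the cause: the MPI standard's rules for file views say the displacements in a filetype's typemap must be non-negative and monotonically nondecreasing. The first write uses monotonically increasing displacements and the second uses shuffled ones, so a quick check of the displacement array before building the filetype could look like the sketch below. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if displs[0..n-1] contains no negative value and never
 * decreases from one entry to the next.
 */
bool displacements_ok_for_filetype(const int *displs, int n)
{
    assert(displs != NULL || n <= 0);

    for (int i = 0; i < n; i++) {
        if (displs[i] < 0)
            return false;
        if (i > 0 && displs[i] < displs[i - 1])
            return false;
    }
    return true;
}
```

If the shuffled array fails this check, one possible workaround is to keep the file-side displacements sorted and apply the permutation on the memory side instead. That is a suggestion to try, not a verified fix for this program.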
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: x.f90
> Type: application/octet-stream
> Size: 3820 bytes
> Desc: not available
> URL: <http://www.open-mpi.org/MailArchives/users/attachments/20110520/53a5461b/attachment.obj>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1911, Issue 1
> **************************************