
Subject: Re: [OMPI devel] ompi-ps broken or just changed?
From: Ashley Pittman (ashley_at_[hidden])
Date: 2009-05-19 09:07:33


Ralph,

At least part of the problem is to do with error reporting: orte-ps is
hitting the error case for a stale HNP at around line 258 and is trying
to report the error via orte_show_help(); however, that function makes
an RPC into orterun, which is silently ignoring it for some reason.
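
To illustrate the kind of "silent drop" I mean, here is a purely
hypothetical sketch (not the real orted command processor; the command
names are invented): a relayed report only gets acted on if the
receiving dispatch loop recognises the command, and anything else is
discarded with no log and no status back to the sender.

#include <stdio.h>

/* Hypothetical dispatcher, NOT actual Open MPI code. */
enum cmd { CMD_PS_REQUEST = 1, CMD_SHOW_HELP = 2 };

static void process_command(enum cmd c)
{
    switch (c) {
    case CMD_PS_REQUEST:
        printf("answering ps request\n");
        break;
    /* CMD_SHOW_HELP is never handled here... */
    default:
        break;      /* silently dropped: no log, no error to the sender */
    }
}

int main(void)
{
    process_command(CMD_SHOW_HELP);     /* vanishes without a trace */
    return 0;
}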

The failure itself seems to come from a timeout at comm.c:1114, where
the client process isn't waiting long enough for orterun to reply and
returns ORTE_ERR_SILENT instead. I can't think of anything to suggest
here other than increasing the timeout?
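
For what it's worth, this is roughly the shape of the bounded wait I am
describing - a hypothetical sketch only, not the actual orte/util/comm
code; the function name and timeout value are made up:

#include <poll.h>
#include <stdio.h>

/* Hypothetical sketch: a client waits a bounded time for the daemon's
 * reply on a socket.  If the daemon is slow to answer, a short timeout
 * makes the caller give up and return a "silent" error instead of the
 * data, which is why bumping the timeout (or retrying) might help. */
static int wait_for_reply(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    int rc = poll(&pfd, 1, timeout_ms);    /* block for at most timeout_ms */
    if (rc == 0) {
        fprintf(stderr, "no reply within %d ms\n", timeout_ms);
        return -1;                         /* the ORTE_ERR_SILENT-style path */
    }
    if (rc < 0) {
        perror("poll");
        return -1;
    }
    return 0;                              /* reply is ready to be read */
}

int main(void)
{
    /* Example: wait up to two seconds for anything to arrive on stdin. */
    return wait_for_reply(0, 2000) == 0 ? 0 : 1;
}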

Ashley,

On Mon, 2009-05-18 at 17:06 +0100, Ashley Pittman wrote:
> It's certainly helped and now runs for me; however, if I run mpirun
> under valgrind and then ompi-ps in another window, Valgrind reports
> errors and ompi-ps doesn't list the job, so there is clearly something
> still amiss. I'm trying to do some more diagnostics now.
>
> ==32362== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==32362==    at 0x41BF10C: writev (writev.c:46)
> ==32362==    by 0x4EAAD52: mca_oob_tcp_msg_send_handler (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAC505: mca_oob_tcp_peer_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAEF89: mca_oob_tcp_send_nb (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EA20BE: orte_rml_oob_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4EA2359: orte_rml_oob_send_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4050738: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x804B15F: orterun (orterun.c:757)
> ==32362==  Address 0x448507c is 20 bytes inside a block of size 512 alloc'd
> ==32362==    at 0x402613C: realloc (vg_replace_malloc.c:429)
> ==32362==    by 0x42556B7: opal_dss_buffer_extend (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256C4F: opal_dss_pack_int32 (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x403A60D: orte_dt_pack_job (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256FFB: opal_dss_pack (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x40506F7: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
>
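
For anyone who hasn't run into this class of report before, here is a
stand-alone snippet (illustrative only, not Open MPI code) that triggers
the same warning: the buffer handed to writev() was grown with realloc()
but only partially filled, so some of the bytes passed to the kernel are
uninitialised. In a report like this it usually means either an
uninitialised value was packed into the buffer or space was reserved
but never filled.

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    /* Grow a buffer the way a packing layer might (cf. opal_dss_buffer_extend). */
    char *buf = realloc(NULL, 512);
    if (buf == NULL) return 1;

    memcpy(buf, "hdr", 3);              /* only the first 3 bytes are written */

    struct iovec iov = { .iov_base = buf, .iov_len = 64 };
    writev(STDOUT_FILENO, &iov, 1);     /* 61 of the 64 bytes are uninitialised,
                                           so Valgrind flags the syscall param */
    free(buf);
    return 0;
}
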
> On Mon, 2009-05-18 at 08:22 -0600, Ralph Castain wrote:
> > Aha! Thanks for spotting the problem - I had to move that var init to
> > cover all cases, but it should be working now with r21249
> >
> >
> >
> > On May 18, 2009, at 8:08 AM, Ashley Pittman wrote:
> >
> > >
> > > Ralph,
> > >
> > > This patch fixed it: num_nodes was being used uninitialised, and
> > > hence the client was getting a bogus value for the number of nodes.
> > >
> > > Ashley,
> > >
> > > On Mon, 2009-05-18 at 10:09 +0100, Ashley Pittman wrote:
> > >> No joy I'm afraid; now I get errors when I run it. This is a single
> > >> node job run with the command line "mpirun -n 3 ./a.out". I've
> > >> attached
> > >> the strace output and gzipped /tmp files from the machine.
> > >> Valgrind on
> > >> the ompi-ps process doesn't show anything interesting.
> > >>
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/util/comm/comm.c at line 242
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/tools/orte-ps/orte-ps.c at line 818
> > >>
> > >> Ashley.
> > >>
> > >> On Sat, 2009-05-16 at 08:15 -0600, Ralph Castain wrote:
> > >>> This is fixed now, Ashley - sorry for the problem.
> > >>>
> > >>>
> > >>> On May 15, 2009, at 4:47 AM, Ashley Pittman wrote:
> > >>>
> > >>>> On Thu, 2009-05-14 at 22:49 -0600, Ralph Castain wrote:
> > >>>>> It is definitely broken at the moment, Ashley. I have it pretty
> > >>>>> well
> > >>>>> fixed, but need/want to cleanup some corner cases that have
> > >>>>> plagued
> > >>>>> us
> > >>>>> for a long time.
> > >>>>>
> > >>>>> Should have it for you sometime Friday.
> > >>>>
> > >>>> Ok, thanks. I might try switching to slurm in the meantime; I
> > >>>> know my code works with that.
> > >>>>
> > >>>> Can you let me know when it's fixed on or off list and I'll do an
> > >>>> update.
> > >>>>
> > >>>> Ashley,
> > >>>>
> > > <ompi-ps.patch>