Hi Josh.
If I checkpoint the mpirun process from another terminal during execution, I get this output:

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
[clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
--------------------------------------------------------------------------
Error: The process below has failed. There is no checkpoint available for
this job, so we are terminating the application since automatic
recovery cannot occur.
Internal Name: [[41167,1],0]
MCW Rank: 0
--------------------------------------------------------------------------
[clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
[clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
--------------------------------------------------------------------------
Notice: The job has been successfully recovered from the
last checkpoint.
--------------------------------------------------------------------------
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
[clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
[clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
[clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovery_complete
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
[clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
[clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
[clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
[clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26

[hmeyer@clus9 ~]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082
--------------------------------------------------------------------------
Error: The Job identified by PID (18082) was not able to migrate processes in this
job. This could be caused by any of the following:
- Invalid node or rank specified
- No processes on the indicated node can by migrated
- Process migration was not enabled for this job. Make sure to indicate
the proper AMCA file: "-am ft-enable-cr-recovery".
--------------------------------------------------------------------------

I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value.

[hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
[clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
[clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
--------------------------------------------------------------------------
Warning: Could not find any processes to migrate on the nodes specified.
You provided the following:
Nodes: node9
Procs: (null)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Notice: The processes have been successfully migrated to/from the specified
machines.
--------------------------------------------------------------------------
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
--------------------------------------------------------------------------
Error: The process below has failed. There is no checkpoint available for
this job, so we are terminating the application since automatic
recovery cannot occur.
Internal Name: [[62740,1],0]
MCW Rank: 0
--------------------------------------------------------------------------
[clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
[clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster).
https://svn.open-mpi.org/trac/ompi/changeset/24306
One thing I just noticed in your original email was that you are specifying the wrong parameter for migration (it is different than the standard C/R functionality for backwards compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down the problem.
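
For reference, a minimal sketch of how such a backtrace could be collected, assuming gdb is installed and core dumps are enabled in the shell that reproduces the crash. The helper name bt_cmd and the core file name are hypothetical; substitute the path of whichever binary actually crashed and the core file it left behind.

```shell
#!/bin/sh
# Sketch only: bt_cmd is a hypothetical helper, not part of Open MPI.
# It prints the gdb invocation that dumps a backtrace of every thread
# from a core file without entering interactive mode.
bt_cmd() {
    # $1 = path to the binary that crashed, $2 = path to its core file
    printf 'gdb -batch -ex "thread apply all bt" %s %s\n' "$1" "$2"
}

# Before reproducing the crash, allow core files in that shell:
#   ulimit -c unlimited
# After the crash, print the command for the binary and core file
# (core.15620 is an assumed file name, used only for illustration):
bt_cmd ./whoami core.15620
```

Copying the printed command into the shell and redirecting its output to a file gives a text backtrace that can be attached to a reply. Building with --enable-debug, as in the configure line quoted below, keeps the symbols that make the backtrace readable.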
Thanks,
Josh
On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
> Josh.
>
> ompi-checkpoint and its restart are now working great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error.
>
> Best regards.
>
> Hugo Meyer
>
> 2011/1/26 Hugo Meyer <meyer.hugo@gmail.com>
> Thanks Josh.
>
> I've already checked prelink, and it is set to "no".
>
> I'm going to try with the trunk head, and then I'll let you know how it goes.
>
> Best regards.
>
> Hugo Meyer
>
> 2011/1/25 Joshua Hursey <jjhursey@open-mpi.org>
>
> Can you try with the current trunk head (r24296)?
> I just committed a fix for the C/R functionality in which restarts were getting stuck. This will likely affect the migration functionality, but I have not had an opportunity to test just yet.
>
> Another thing to check is that prelink is turned off on all of your machines.
> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> Let me know if the problem persists, and I'll dig into it a bit more.
>
> Thanks,
> Josh
>
> On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>
> > Hello @ll
> >
> > I've got a problem when I try to use the ompi-migrate command.
> >
> > What I'm doing is running, for example, the following application on one node of a cluster (both processes will run on the same node):
> >
> > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> >
> > Then, on the same node, I try to migrate the processes to another node:
> >
> > ompi-migrate -x node9 -t node3 14914
> >
> > And then i get this message:
> >
> > [clus9:15620] *** Process received signal ***
> > [clus9:15620] Signal: Segmentation fault (11)
> > [clus9:15620] Signal code: Address not mapped (1)
> > [clus9:15620] Failing at address: (nil)
> > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
> > [clus9:15620] *** End of error message ***
> > Segmentation fault
> >
> > I assume that maybe there is something wrong with the thread level, but I have configured Open MPI like this:
> >
> > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> >
> > Checkpoint and restart work fine, but when I restore an application that has more than one process, it is restored and runs until the last line before MPI_FINALIZE(), and then the processes never finalize; I assume that they never call MPI_FINALIZE(). With a single process, ompi-checkpoint and ompi-restart work great.
> >
> > Best regards.
> >
> > Hugo Meyer
> > _______________________________________________
> > devel mailing list
> > devel@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey