Checkpointing with mpiexec
Pete Wyckoff
pw at osc.edu
Thu Jun 19 11:34:12 EDT 2008
artpol84 at gmail.com wrote on Mon, 16 Jun 2008 14:28 +0700:
> I am trying to use mpiexec with a checkpointing program that handles all
> sockets and descriptors in the program. The first problem I faced when
> checkpointing the entire mpiexec is this:
> When I restart from the checkpointed image, the restoring program searches
> for temporary files created by PBS and fails when it does not find them. Is
> it possible to divide mpiexec into 2 parts:
> 1. Gathering information about execution resources from PBS
> 2. Starting the program using predetermined temporary files (not dependent on
> query ID and so on).
But mpiexec doesn't keep any files open. It does all its querying
via sockets to the PBS server and to the local PBS mom. So I think
that we've got things more or less as you need them already.
However, the bigger problem is how to recreate these connections to
the existing (non-restarted) PBS. Mpiexec would likely need to be
involved in the restart process, as it must find out the new TM task
ids for the restarted tasks, and register obits for them.
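As a rough illustration of that bookkeeping, here is a minimal sketch in
Python. It is not mpiexec code and makes no real PBS or TM calls: the Task
record, reattach(), and the two callbacks lookup_tid and register_obit are
hypothetical stand-ins for however a restarted mpiexec would learn the new
task ids and ask to be notified when each task exits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Hypothetical per-process record a launcher might keep."""
    rank: int                      # rank of the process within the job
    node: int                      # node the process runs on
    tid: Optional[int]             # task id known before the checkpoint (now stale)
    obit_registered: bool = False  # True once exit notification is requested


def reattach(
    tasks: List[Task],
    lookup_tid: Callable[[int, int], int],  # hypothetical: (node, rank) -> new task id
    register_obit: Callable[[int], None],   # hypothetical: request exit notification
) -> Dict[int, int]:
    """Replace stale task ids with new ones and re-register obits.

    Returns a map from old task id to new task id.
    """
    remap: Dict[int, int] = {}
    for task in tasks:
        old_tid = task.tid
        new_tid = lookup_tid(task.node, task.rank)
        task.tid = new_tid
        register_obit(new_tid)
        task.obit_registered = True
        if old_tid is not None:
            remap[old_tid] = new_tid
    return remap
```

The point of the sketch is only that every id saved in the checkpoint image
is stale after a restart, so each one has to be replaced and its obit
requested again before mpiexec can wait on the tasks as usual.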
I'm curious how much of this you've thought through. If you have ideas
about what mpiexec should do, please keep sharing them.
-- Pete