cleaning up SHM segments
Ben Webb
ben at salilab.org
Thu Apr 1 15:47:14 EST 2004
On Thu, Apr 01, 2004 at 10:09:42AM +0100, Victoria Pennington wrote:
> This subject has arisen a few times on the list, but I wonder if anyone
> has a conclusive answer to the issue of shared memory segments being
> left on nodes after an MPI job (with shared memory support) has died?
>
> My idea is to use a PBS epilogue script to run cleanipcs (as the user).
> This will of course get rid of ALL the user's SHM segments on each node
> on which the job was run, regardless of which job they were attached to,
> but I'm assuming that only one such job would be running on each node
> at any one time.
You don't say which MPI library you're using, but we've used MPICH in
the past and had this problem. See the end of the first section at
http://bellatrix.pcl.ox.ac.uk/~ben/pbs/
for my solution. Basically, I patched MPICH so that it wrote out a state
file listing all the shared memory segments and semaphores it allocated.
If the job crashed, I then ran a short C program on all the nodes to
remove them. This is safe to run periodically from crontab, as it doesn't
affect running jobs.
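To give a rough idea of what such a cleanup program can look like, here is
a minimal sketch. It is not the program from the page above. The state file
format ("shm <id>" or "sem <id>", one per line) and the function names
parse_state_line and cleanup_from_state_file are assumptions made up for
this illustration; only shmctl() and semctl() with IPC_RMID are the
standard System V calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>

/* ASSUMPTION: hypothetical state-file format, one IPC object per line,
 * either "shm <id>" or "sem <id>". The real MPICH patch and LAM's state
 * file may look quite different. */
enum ipc_kind { IPC_KIND_NONE = 0, IPC_KIND_SHM, IPC_KIND_SEM };

/* Parse one line of the state file. Returns the object kind and stores
 * its id in *id, or returns IPC_KIND_NONE if the line is not recognised. */
enum ipc_kind parse_state_line(const char *line, int *id)
{
    char tag[8];

    assert(line != NULL && id != NULL);
    if (sscanf(line, "%7s %d", tag, id) != 2)
        return IPC_KIND_NONE;
    if (strcmp(tag, "shm") == 0)
        return IPC_KIND_SHM;
    if (strcmp(tag, "sem") == 0)
        return IPC_KIND_SEM;
    return IPC_KIND_NONE;
}

/* Try to remove every object listed in the state file at `path`.
 * Returns the number of objects removed, or -1 if the file cannot be
 * opened. Ids that no longer exist are silently skipped. */
int cleanup_from_state_file(const char *path)
{
    char line[128];
    int id, removed = 0;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        switch (parse_state_line(line, &id)) {
        case IPC_KIND_SHM:
            /* Marks the segment for destruction. */
            if (shmctl(id, IPC_RMID, NULL) == 0)
                removed++;
            break;
        case IPC_KIND_SEM:
            /* Removes the whole semaphore set; semnum is ignored. */
            if (semctl(id, 0, IPC_RMID) == 0)
                removed++;
            break;
        default:
            break;
        }
    }
    fclose(fp);
    return removed;
}
```

Because it only touches ids named in the state file of a crashed job, a
program along these lines leaves the user's other segments alone, which is
the point of keeping a state file in the first place.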
If you use LAM instead, it already maintains a state file in /tmp which
contains this information, and it's pretty easy to parse.
cleanipcs has the disadvantage that it will break anything else of the
user's that uses shared memory, for example if the user is running
multiple jobs on the same node (as on an SMP machine), or if you run PBS
on workstation machines.
Ben
--
ben at salilab.org http://salilab.org/~ben/
"I was a modest, good-humoured boy. It is Oxford that has made
me insufferable."
- Max Beerbohm