Elena Pourmal

hdf5_ep

HDF5
hdf5
Albert Cheng
345a9a3c237
Albert Cheng committed 063e4b2e2eb on 11 Aug 2010
[svn-r19230] Reset alarm_seconds back to 20 minutes.

Description:
honest3 v1.8 failed in the parallel test.  It got stuck in the same
testpar/testphdf5 subtest (cbhsssdrpio).  This is an old problem.
Upon closer inspection, the testphdf5, when terminated, had clocked
up 1hr 9min 46 sec wall clock time.  Honest1 system also sent a message
that an mpi process has used up 30+ CPU minutes which exceeded their login
node cpu time limit and they killed the process.  I also did a hand-run
of testphdf5. All subtests before cbhsssdrpio completed in a few minutes.
Therefore, it is safe to say the majority of the 70 minutes of wall clock
time was spent in the sub-test cbhsssdrpio. It also used up lots of CPU
time.  cbhsssdrpio is likely looping infinitely.

Since MPI applications are prone to infinite looping due to message deadlock,
testphdf5 has built-in protection that gives each subtest at most 20 minutes
of wall-clock time to run.  When the 20 minutes of wall-clock time is
exceeded, testphdf5 attempts to terminate itself.  This prevents unnecessary
CPU time consumption in infinite looping.
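
The protection described above can be sketched roughly as follows.  This is
not the testphdf5 code (which is C); it is a minimal Python illustration of a
wall-clock alarm around one subtest, assuming a POSIX system and the main
thread.  The names run_subtest and SubtestTimeout are hypothetical, and where
testphdf5 would terminate itself, this sketch just reports "timeout".

```python
import signal


class SubtestTimeout(Exception):
    """Raised when a subtest exceeds its wall-clock budget (hypothetical name)."""


def _on_alarm(signum, frame):
    # Runs in the main thread when the real-time timer expires.
    raise SubtestTimeout("subtest exceeded its wall-clock limit")


def run_subtest(func, limit_seconds):
    """Run func() under a wall-clock alarm; return "ok" or "timeout".

    A real harness would terminate the process on timeout; returning a
    status here is an assumption made to keep the sketch easy to try.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    # ITIMER_REAL counts wall-clock time, not CPU time.
    signal.setitimer(signal.ITIMER_REAL, limit_seconds)
    try:
        func()
        return "ok"
    except SubtestTimeout:
        return "timeout"
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```

With a 20-minute limit the call would be run_subtest(some_subtest, 20 * 60);
a subtest stuck in a loop is then cut off instead of burning CPU time
indefinitely.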

But that clock limit was changed to 30 and then 60 minutes.  I should have
noticed the change mentioned by Quincey, but failed to.  IMO, 20 minutes of
wall-clock time is more than sufficient for each subtest of testphdf5 to
complete.  If a subtest takes longer than 20 minutes, it is likely looping
infinitely.  Giving it more time will not help.

If a subtest of testphdf5 takes more than 20 minutes, it should be broken
down into smaller tests that finish well under 20 minutes so that it is
much easier to see progress and identify any deadlock problems.

In view of this, I am changing the testphdf5 time limit back to 20 minutes.
This will at least stop the CPU TIME exceeding limits and annoying the
system administrators.

Maybe there could be a provision, such as an environment variable like
$HDF5_ALARM_SECOND, to modify the alarm duration for an individual execution.
Even so, that should be used only temporarily, to see if an execution just
needs a little more time.
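
One way such a provision might look is sketched below.  This is only an
illustration in Python: the variable $HDF5_ALARM_SECOND is just the suggestion
made above and is not an existing setting, and the helper name alarm_seconds
and the fallback rules (unset, non-numeric, or non-positive values fall back
to the default) are assumptions.

```python
import os

# Default per-subtest limit: 20 minutes, expressed in seconds.
DEFAULT_ALARM_SECONDS = 20 * 60


def alarm_seconds(environ=None):
    """Return the alarm duration in seconds.

    Reads the proposed (hypothetical) HDF5_ALARM_SECOND variable and falls
    back to the 20-minute default when it is unset or not a positive integer.
    """
    env = os.environ if environ is None else environ
    raw = env.get("HDF5_ALARM_SECOND")
    if raw is None:
        return DEFAULT_ALARM_SECONDS
    try:
        value = int(raw)
    except ValueError:
        # Ignore garbage rather than silently disabling the protection.
        return DEFAULT_ALARM_SECONDS
    return value if value > 0 else DEFAULT_ALARM_SECONDS
```

Falling back to the default on bad input keeps the protection in place, so a
typo in the variable cannot turn the alarm off by accident.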

Tested: just eyeballed, as the change is trivial.
