Ticket #10092 (closed: fixed)

Opened 6 years ago

Last modified 5 years ago

Provide method to allow custom interfaces to execute a set of algorithms sequentially parallel with the UI

Reported by: Dan Nixon Owned by: Dan Nixon
Priority: major Milestone: Release 3.3
Component: Framework Keywords:
Cc: samuel.jackson@… Blocked By:
Blocking: #7860, #10057, #10065, #10083, #10215 Tester: Martyn Gigg

Description

Better support for running a set of algorithms which must be run sequentially (one depends on the data of another) separate from the UI thread in a custom interface.

Change History

comment:1 Changed 6 years ago by Dan Nixon

  • Status changed from new to assigned

comment:2 Changed 6 years ago by Dan Nixon

  • Summary changed from Provide method to allow custom interfaces to execute a set of algorithms sequntially parallel with the UI to Provide method to allow custom interfaces to execute a set of algorithms sequentially parallel with the UI

comment:3 Changed 6 years ago by Dan Nixon

  • Status changed from assigned to inprogress

Work in progress commit, started to add framework for seq. algos

Refs #10092

Changeset: 4fd4b2561fe6fbbfa4120d30a08500fdde009f43

comment:4 Changed 6 years ago by Dan Nixon

Added batch algorithm runner

Essentially a deque feeding an AlgorithmRunner

Refs #10092

Changeset: 63923f7d48afb875ad6d3f6bbddeaca467cfc015

comment:5 Changed 6 years ago by Dan Nixon

  • Blocking 10065 added

comment:6 Changed 6 years ago by Dan Nixon

Refactoring, removed unit test

Refs #10092

Changeset: 5d2bfc5732fa71dbd4a79bd2b575ea0acb3d8615

comment:7 Changed 6 years ago by Dan Nixon

Fix build errors and actually use the value of the save checkbox

Refs #10092

Changeset: d0e40835d8e7cee826d19d77b068459244b3990a

comment:8 Changed 6 years ago by Dan Nixon

To enable testing of this I added a separate SaveNexus algorithm to Indirect Moments.

To test:

  • Ensure logging level is set to at least Information
  • Attempt to run Indirect Moments with the Save option enabled
  • Note that the log indicates that the SofqwMoments and SaveNexus algorithms were triggered by BatchAlgorithmRunner
  • Ensure that Moments and SaveNexus were completed correctly

comment:9 Changed 6 years ago by Dan Nixon

  • Status changed from inprogress to verify
  • Cc samuel.jackson@… added
  • Resolution set to fixed

comment:10 Changed 6 years ago by Dan Nixon

  • Status changed from verify to reopened
  • Resolution fixed deleted

comment:11 Changed 6 years ago by Dan Nixon

  • Status changed from reopened to inprogress

Replaced old implementation with one which should be unit testable

Refs #10092

Changeset: 6692668f64947554b5d2863d6906b605a6671042

comment:12 Changed 6 years ago by Dan Nixon

Added unit test, fixed workspace property issue

Refs #10092

Changeset: 08cab69df0e42a09efbfdb99a7086b5e3cdf4f80

comment:13 Changed 6 years ago by Dan Nixon

Fixed build error and static analysis warnings

Refs #10092

Changeset: ac908977ecef3f44f5c62c8eb57dee5fc315c543

comment:14 Changed 6 years ago by Dan Nixon

Fix second compile error

Refs #10092

Changeset: 7617e9979fe1e8086ba2bbb70258fa932bbec7c0

comment:15 Changed 6 years ago by Dan Nixon

  • Status changed from inprogress to verify
  • Resolution set to fixed

Addition to testing, there is a new unit test BatchAlgorithmRunnerTest, which should be passing on all platforms now.

comment:16 Changed 6 years ago by Martyn Gigg

  • Status changed from verify to reopened
  • Resolution fixed deleted

The new test, BatchAlgorithmRunnerTest, is good to have but it seems to be hanging on Mac and Windows 8, but unfortunately not all of the time.

http://builds.mantidproject.org/job/develop_incremental/label=osx-10.8-build/1411/console

The log seems to suggest that it had run the first algorithm but then was stuck for some reason. One thing I would suggest is not having such generic workspace names, e.g. ws1, in the tests. As the names go in to a singleton object that exists for each process then it maybe the case that some lock on a workspace is being hit somehow. I would suggest prefixing the names with the test name.

Lets then leave it on develop for a day to see if it hangs again.

comment:17 Changed 6 years ago by Dan Nixon

  • Status changed from reopened to inprogress

Renamed test workspaces

Refs #10092

Changeset: b7d3c0704cbe4b3fcb6cee28e2ef55cbafaa28fb

comment:18 Changed 6 years ago by Dan Nixon

Included a catch all in algorithm runner

Should ensure that unit tests cannot hang if starting an algorithm throws a non standard exception

Refs #10092

Changeset: 4be794b85296bc35fc584a1f1094985b710e810e

comment:19 Changed 6 years ago by Dan Nixon

  • Status changed from inprogress to verify
  • Resolution set to fixed

comment:20 Changed 6 years ago by Dan Nixon

I'm assuming this issue was caused by code that waits for the algorithms to finish entering an infinite loop as en exception thrown when starting the next algorithm in the handler for the Poco notification, this should no longer be an issue with the added additional catch statement.

comment:21 Changed 6 years ago by Martyn Gigg

  • Status changed from verify to reopened
  • Resolution fixed deleted

I've just seen Windows 8 and Mac hang again on the BatchAlgorithmRunnerTest, unfortunately.

It attached to the process with gdb and all it said was that it was stuck constantly checking isExecuting() in the while loop within in the test. I tried again manually on the machine but the test was successful so it's something that is only happening when the system is under additional load I think.

Is there a particular reason why the algorithms themselves are executed asynchronously rather than having the whole BatchAlgorithmRunner object in a separate thread and have that execute each algorithm serially? That would greatly simplify the algorithm scehduling.

comment:22 Changed 6 years ago by Martyn Gigg

The BatchAlgorithmRunnerTest has hung again on Win8 & OS X. Can we temporarily disable the unit test (just comment the body out) while it is looked at?

comment:23 Changed 6 years ago by Dan Nixon

  • Status changed from reopened to inprogress

Work in progress - run all algorithms on an ActiveMethod

Refs #10092

Changeset: da71d31e0014d8e2ae2c4698ea5593457d931ded

comment:24 Changed 6 years ago by Dan Nixon

Temp. removal of unit test

Refs #10092

Changeset: 8e69f5e075f130b83d50f504c083d48eac55cfb6

comment:25 Changed 6 years ago by Dan Nixon

Fix failing develop build

Refs #10092

Changeset: 9d7c5c508cecd97981fc9627e890dac02c54ac08

comment:26 Changed 6 years ago by Dan Nixon

New batch algorithm runner implmentation

Added basic unit tests, removed changes made to indirect moments

Refs #10092

Changeset: f634034d3016a62e2b52e09d3692c7e5697aab36

comment:27 Changed 6 years ago by Dan Nixon

Tidy up, commenting and refactoring

Refs #10092

Changeset: 8eafb5b1866dace264d3174a1912e62035321516

comment:28 Changed 6 years ago by Dan Nixon

Temp. removal of failing unit tests

Refs #10092

Changeset: dc3044fb9d4babdab10cb17907e351d6fd149615

comment:29 Changed 6 years ago by Dan Nixon

  • Blocking 10215 added

comment:30 Changed 6 years ago by Dan Nixon

This is a stack trace from Ubuntu 12.04:

(gdb) r BatchAlgorithmRunnerTest
Starting program: /home/dan/mantid-build/bin/MantidQtAPITest BatchAlgorithmRunnerTest
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running 3 tests
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4426eaf in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ffff4426eaf in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000054f157 in Poco::MutexImpl::lockImpl (this=0x7ffff7de453b) at /usr/include/Poco/Mutex_POSIX.h:81
#2  0x000000000054f2fc in Poco::Mutex::lock (this=0x7ffff7de453b) at /usr/include/Poco/Mutex.h:174
#3  0x0000000000553098 in Poco::ScopedLock<Poco::Mutex>::ScopedLock (this=0x7fffffffd770, mutex=...) at /usr/include/Poco/ScopedLock.h:59
#4  0x00007ffff53f6eca in Poco::NotificationCenter::addObserver (this=0x7ffff7de4523, observer=...) at src/NotificationCenter.cpp:59
#5  0x00007ffff7b47757 in MantidQt::API::BatchAlgorithmRunner::executeBatch (this=0x7fffffffd800)
    at /home/dan/mantid/Code/Mantid/MantidQt/API/src/BatchAlgorithmRunner.cpp:61
#6  0x00000000005504c7 in BatchAlgorithmRunnerTest::test_basicBatch (this=0xa75000) at /home/dan/mantid/Code/Mantid/MantidQt/API/test/BatchAlgorithmRunnerTest.h:70
#7  0x0000000000550d43 in TestDescription_suite_BatchAlgorithmRunnerTest_test_basicBatch::runTest (this=0x82ec40)
    at /home/dan/mantid-build/MantidQt/API/test/BatchAlgorithmRunnerTest.cpp:26
#8  0x00000000005565bd in CxxTest::RealTestDescription::run (this=0x82ec40) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/RealDescriptions.cpp:96
#9  0x0000000000559b13 in CxxTest::TestRunner::runTest (this=0x7fffffffdbff, td=...) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:94
#10 0x0000000000559a05 in CxxTest::TestRunner::runSuite (this=0x7fffffffdbff, sd=...) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:79
#11 0x00000000005598a0 in CxxTest::TestRunner::runWorld (this=0x7fffffffdbff) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:63
#12 0x0000000000559663 in CxxTest::TestRunner::runAllTests (listener=...) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:24
#13 0x000000000055f6a2 in CxxTest::XUnitPrinter::run (this=0x7fffffffdcc0) at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/XUnitPrinter.h:30
#14 0x0000000000560572 in CxxTest::Main<CxxTest::XUnitPrinter> (tmp=..., argc=2, argv=0x7fffffffe268)
    at /home/dan/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestMain.h:106
#15 0x00000000005555f8 in main (argc=2, argv=0x7fffffffe268) at /home/dan/mantid-build/MantidQt/API/test/MantidQtAPITest_runner.cpp:27

comment:31 Changed 6 years ago by Dan Nixon

And RHEL6:

(gdb) r BatchAlgorithmRunnerTest
Starting program: /home/ibn32760/mantid-build/bin/MantidQtAPITest BatchAlgorithmRunnerTest
[Thread debugging using libthread_db enabled]
Running 3 testswarning: "/usr/lib/debug/opt/rh/mantidlibs/root/usr/lib64/libQtWebKit.so.4.9.3.debug": separate debug info file has no debug info
terminate called after throwing an instance of 'Poco::SystemException'
  what():  System exception

Program received signal SIGABRT, Aborted.
0x0000003528632925 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install gstreamer-0.10.29-1.el6.x86_64 gstreamer-plugins-base-0.10.29-2.el6.x86_64 libxml2-2.7.6-14.el6_5.2.x86_64 nss-pam-ldapd-0.7.5-18.2.el6_4.x86_64 qscintilla-2.4.6-2.el6.x86_64 sqlite-3.6.20-1.el6.x86_64
(gdb) bt
#0  0x0000003528632925 in raise () from /lib64/libc.so.6
#1  0x0000003528634105 in abort () from /lib64/libc.so.6
#2  0x0000003532ebea5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003532ebcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003532ebcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003532ebcd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007ffff0edd9bf in Poco::NotificationCenter::addObserver(Poco::AbstractObserver const&) () from /usr/lib64/libPocoFoundationd.so.11
#7  0x00007ffff7d6348b in MantidQt::API::BatchAlgorithmRunner::executeBatch (this=0x7fffffffd7f0)
    at /home/ibn32760/mantid/Code/Mantid/MantidQt/API/src/BatchAlgorithmRunner.cpp:61
#8  0x000000000055071f in BatchAlgorithmRunnerTest::test_basicBatch (this=0x1298000)
    at /home/ibn32760/mantid/Code/Mantid/MantidQt/API/test/BatchAlgorithmRunnerTest.h:70
#9  0x000000000055103d in TestDescription_suite_BatchAlgorithmRunnerTest_test_basicBatch::runTest (this=0x830840)
    at /home/ibn32760/mantid-build/MantidQt/API/test/BatchAlgorithmRunnerTest.cpp:26
#10 0x00000000005568cb in CxxTest::RealTestDescription::run (this=0x830840)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/RealDescriptions.cpp:96
#11 0x0000000000559f5f in CxxTest::TestRunner::runTest (this=0x7fffffffdc0f, td=...)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:94
#12 0x0000000000559e45 in CxxTest::TestRunner::runSuite (this=0x7fffffffdc0f, sd=...)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:79
#13 0x0000000000559cfd in CxxTest::TestRunner::runWorld (this=0x7fffffffdc0f)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:63
#14 0x0000000000559a67 in CxxTest::TestRunner::runAllTests (listener=...)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestRunner.h:24
#15 0x000000000055ff66 in CxxTest::XUnitPrinter::run (this=0x7fffffffdcd0)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/XUnitPrinter.h:30
#16 0x0000000000560e0d in CxxTest::Main<CxxTest::XUnitPrinter> (tmp=..., argc=2, argv=0x7fffffffe268)
    at /home/ibn32760/mantid/Code/Mantid/TestingTools/cxxtest/cxxtest/TestMain.h:106
#17 0x0000000000555979 in main (argc=2, argv=0x7fffffffe268) at /home/ibn32760/mantid-build/MantidQt/API/test/MantidQtAPITest_runner.cpp:27

comment:32 Changed 6 years ago by Dan Nixon

Fix for RHEL6 Poco exception

Refs #10092

Changeset: fcc66858206652207b8712c95fe7674b71bfe4ae

comment:33 Changed 6 years ago by Dan Nixon

Added additional unit tests

Refs #10092

Changeset: 5ca90c761fdafae3151f382afc15faa6690b9117

comment:34 Changed 6 years ago by Dan Nixon

  • Status changed from inprogress to verify
  • Resolution set to fixed

comment:35 Changed 6 years ago by Dan Nixon

New to test (replaces comment 8 and 15):

  • Ensure unit test (BatchAlgorithmRunnerTest) is passing reliably
  • Force an error on the Indirect Data Reduction > Moments interface (set EMin and EMax to the same value), this is a crude test to ensure the Qt signals are emitted property, you should get warnings in the Results Log from the runner and an alert dialog box

It may be worth someone testing this on Ubuntu 12.04 (since there is no longer a Jenkins job) as it was previously failing on this platform (in a similar way as OSX and RHEL6 so I assume it should be fixed tested and works as intended)

Last edited 6 years ago by Dan Nixon (previous) (diff)

comment:36 Changed 6 years ago by Dan Nixon

  • Status changed from verify to reopened
  • Resolution fixed deleted

comment:37 Changed 6 years ago by Dan Nixon

  • Blocking 7860 added

(In #7860) Batch runner ticket replaces the standard algorithm runner, need Qt signal form either in the UI code.

comment:38 Changed 6 years ago by Dan Nixon

  • Status changed from reopened to inprogress

More unit tests, fixed Poco notification issue

Remove observer when algorithm fails, added unit tests for queue length.

Refs #10092

Changeset: 765982d8b85c156871aaef70827a5d4ea218c478

comment:39 Changed 6 years ago by Dan Nixon

Merge branch 'feature/10092_sequential_async_algorithm_runner' into develop

Conflicts:

Code/Mantid/MantidQt/CustomInterfaces/src/IndirectDataReductionTab.cpp

Refs #10092

Changeset: 21135669ebf4d8fe503dffa453c59ac7e858d27d

comment:40 Changed 6 years ago by Dan Nixon

Force Git to see new file change

Refs #10092

Changeset: 2f09314cd7eebdcafa640112fce9330c0abdd220

comment:41 Changed 6 years ago by Dan Nixon

  • Status changed from inprogress to verify
  • Resolution set to fixed

comment:42 Changed 6 years ago by Martyn Gigg

  • Status changed from verify to verifying
  • Tester set to Martyn Gigg

comment:43 Changed 6 years ago by Martyn Gigg

I haven't seen the build servers hang any more while executing the unit test so it would appear that the threading issue has been resolved.

comment:44 Changed 6 years ago by Martyn Gigg

  • Status changed from verifying to closed

Merge remote-tracking branch 'origin/feature/10092_sequential_async_algorithm_runner'

Conflicts:

Code/Mantid/Framework/PythonInterface/plugins/algorithms/WorkflowAlgorithms/SofQWMoments.py

Full changeset: bed7ab69fb911493abc55c5863cfe8f7987101d9

comment:45 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to github issue 10934

Note: See TracTickets for help on using tickets.