Ticket #4133 (closed: fixed)

Opened 9 years ago

Last modified 5 years ago

Check/fix possible race conditions

Reported by: Martyn Gigg Owned by: Martyn Gigg
Priority: critical Milestone: Release 2.0
Component: Mantid Keywords:
Cc: taylorrj@… Blocked By:
Blocking: Tester: Peter Peterson

Description

We have been getting random unit tests failures recently that would indicate we have some race conditions somewhere in the code.

Tools such as helgrind, drd can help with this.

Attachments

helgrind.convertunits (505.5 KB) - added by Martyn Gigg 9 years ago.

Change History

Changed 9 years ago by Martyn Gigg

comment:1 Changed 9 years ago by Martyn Gigg

The attached file shows the output after running the convert units test through helgrind. It inidcates a number of problems most notably around the parameter map caching.

comment:2 Changed 9 years ago by Martyn Gigg

  • Cc taylorrj@… added

comment:3 Changed 9 years ago by Martyn Gigg

  • Status changed from new to accepted

comment:4 Changed 9 years ago by Martyn Gigg

Refs #4133. Fixed some race conditions in various places.

The valgrind tool helgrind flagged up some race conditions in the code. This is not a complete list of those that need fixing but some major ones that would affect large portions of the code

Changeset: b04a3eaa36407e9caabf0ea45ae1e76a3eb1b44a

comment:5 Changed 9 years ago by Russell Taylor

Not surprisingly, this has had an impact on the performance tests. Looking at the RHEL6 tests run at ORNL, the following tests show noticeable slowdowns:

  • CreateGroupingWorkspaceTestPerformance.test_TOPAZ_2010
  • DiffractionFocussing2TestPerformance.test_SNAP_event_one_group_dontPreserveEvents
  • DiffractionFocussing2TestPerformance.test_SNAP_event_six_groups
  • DiffractionFocussing2TestPerformance.test_SNAP_event_six_groups_dontPreserveEvents
  • Q1D2TestPerformance.test_slow_performance
  • InstrumentRayTracerTestPerformance.test_TOPAZ

It's not all doom and gloom though. The following tests got faster at this point!

  • LoadEventNexusTestPerformance.testDefaultLoad
  • LoadEventPreNexusTestPerformance.testDefaultLoad
  • MaskDetectorsInShapeTestPerformance.testMaskingLotsOfDetectors
  • NearestNeighboursTestPerformance.testUsingNumberOfNeighbours

comment:6 Changed 9 years ago by Martyn Gigg

Refs #4133. Swap assign for a fill to avoid race condition warning.

Changeset: a99e0c533e418b8c0d4c738a3eb0aa523c84dab7

comment:7 Changed 9 years ago by Martyn Gigg

I'm guessing the ones that have slowed down all do a lot of getPos calls which means they now have to go through the critical block to get in to the cache. I suppose one way of avoiding that would be switching to storing the absolute positions rather relative positions. That way it wouldn't have to go into the cache. Although if you moved something it would still have to go into the parameter map so I don't really have a feel for how much gain we would get from that

comment:8 Changed 9 years ago by Martyn Gigg

  • Status changed from accepted to verify
  • Resolution set to fixed

I ran the WISH script through helgrind last night and it didn't flag anything up either.

comment:9 Changed 9 years ago by Martyn Gigg

  • Status changed from verify to reopened
  • Resolution fixed deleted

comment:10 Changed 9 years ago by Martyn Gigg

Refs #4133. More race conditions found.

The HFIR system test randomly failed on RHEL last night. Helgrind flagged up a problem with the ParameterMap access again. This will probably affect the performance but correctness is of higher importance.

Changeset: 9775b281b348f086b4d5b47a01ba9bbc782a330a

comment:11 Changed 9 years ago by Martyn Gigg

  • Status changed from reopened to accepted

comment:12 Changed 9 years ago by Martyn Gigg

  • Status changed from accepted to verify
  • Resolution set to fixed

comment:13 Changed 9 years ago by Peter Peterson

  • Status changed from verify to verifying
  • Tester set to Peter Peterson

comment:14 Changed 9 years ago by Peter Peterson

  • Status changed from verifying to closed

Looks good to me.

comment:15 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to github issue 4980

Note: See TracTickets for help on using tickets.