Ticket #793 (closed: fixed)

Opened 11 years ago

Last modified 5 years ago

Memory issues

Reported by: Nick Draper Owned by: Russell Taylor
Priority: critical Milestone: Iteration 20
Component: Keywords:
Cc: Blocked By:
Blocking: Tester:

Description (last modified by Nick Draper) (diff)

Aziz has been having problems running the same analysis lots of times in a loop.

ppt presentation sent separately via email as it is too big to attach.

Attachments

MantidLoop.ppt (1.2 MB) - added by Russell Taylor 11 years ago.

Change History

comment:1 Changed 11 years ago by Nick Draper

  • Description modified (diff)

comment:2 Changed 11 years ago by Russell Taylor

  • Status changed from new to assigned
  • Summary changed from Python: SystemExit errors when Looping on large data sets to Memory issues

We have two problems here: memory becomes fragmented over time (the reason for Aziz's crashes - it's a bad alloc when it can't find enough contiguous memory for a spectrum) and the reserved but unused memory is still not being reported correctly on Linux when running within MantidPlot (Qt is hiding it somewhere!).

I believe the solution to both these problems may be the same: segregate the workspace data from everything else. This involves writing a custom allocator to use with our vectors, or even replacing the storage class completely.
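For illustration only, a custom allocator of the kind mentioned above might look roughly like the sketch below. The names `WorkspaceAllocator`, `workspaceBytesInUse` and `SegregatedVec` are hypothetical and do not come from the Mantid source. The sketch simply forwards to `std::malloc` and keeps a byte count; a real implementation would presumably draw from a dedicated pool or arena so that workspace data lives apart from other allocations.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical running total of bytes currently held by workspace data.
inline std::size_t &workspaceBytesInUse() {
  static std::size_t bytes = 0;
  return bytes;
}

// Minimal C++11 allocator. All workspace storage goes through one place,
// so the backing store could later be swapped for a dedicated pool.
template <typename T> struct WorkspaceAllocator {
  using value_type = T;

  WorkspaceAllocator() noexcept = default;
  template <typename U>
  WorkspaceAllocator(const WorkspaceAllocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    // Guard against overflow in n * sizeof(T).
    if (n > static_cast<std::size_t>(-1) / sizeof(T))
      throw std::bad_alloc();
    // Placeholder backing store: plain malloc (an assumption of this sketch).
    void *p = std::malloc(n * sizeof(T));
    if (p == nullptr)
      throw std::bad_alloc();
    workspaceBytesInUse() += n * sizeof(T);
    return static_cast<T *>(p);
  }

  void deallocate(T *p, std::size_t n) noexcept {
    workspaceBytesInUse() -= n * sizeof(T);
    std::free(p);
  }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const WorkspaceAllocator<T> &,
                const WorkspaceAllocator<U> &) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const WorkspaceAllocator<T> &,
                const WorkspaceAllocator<U> &) noexcept {
  return false;
}

// Hypothetical typedef: a vector of doubles backed by the allocator above.
using SegregatedVec = std::vector<double, WorkspaceAllocator<double>>;
```

If all data vectors are declared through a single typedef, changing the storage type becomes a one-line change, which is one plausible reason for wanting a typedef used consistently across the code base.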

comment:3 Changed 11 years ago by Russell Taylor

(In [3159]) Greatly expand the use of the MantidVec typedef. Re #793.

comment:4 Changed 11 years ago by Russell Taylor

(In [3160]) Revert workspace_iterator changes for now. Re #793.

comment:5 Changed 11 years ago by Russell Taylor

(In [3162]) Fix failures. Re #793.

comment:6 Changed 11 years ago by Russell Taylor

(In [3164]) Remove debug symbols from Linux/Mac release build. Remove compile warnings from CylinderAbsorption, without breaking it this time. Re #793.

comment:7 Changed 11 years ago by Nick Draper

  • Milestone changed from Iteration 19 to Iteration 20

Moved as part of the end of Iteration 19

comment:8 Changed 11 years ago by Russell Taylor

(In [3198]) Change the value of ManagedWorkspace.LowerMemoryLimit back to 80%. At some point it had got put up to 90, which can be a bit too high. Re #793.

comment:9 Changed 11 years ago by Nick Draper

  • Priority changed from major to critical

comment:10 Changed 11 years ago by Russell Taylor

(In [3216]) More extension of the use of the MantidVec typedef, mostly in the tests. Re #793.

comment:11 Changed 11 years ago by Russell Taylor

(In [3218]) Change typedef used in Histogram1DTest. Re #793.

comment:12 Changed 11 years ago by Russell Taylor

(In [3220]) Fix test. Re #793.

comment:13 Changed 11 years ago by Russell Taylor

(In [3226]) Added a line to MemoryManager that tells malloc on Linux to use a different memory allocation system call (mmap) from the 'usual' one (sbrk) for requests above a threshold, which is given in bytes as the second argument. The effect of this is that, for the current threshold value of 8*4096, storage for workspaces having 4096 or more bins per spectrum will be allocated using mmap. This should mean that memory is returned to the kernel as soon as a workspace is deleted, preventing things going to managed workspaces when they shouldn't. It will also hopefully reduce memory fragmentation. Potential downsides to look out for are whether this memory allocation technique makes things noticeably slower and whether it wastes memory (mmap allocates in blocks of the system page size).

Also cleared a couple of warnings elsewhere. Re #793.

comment:14 Changed 11 years ago by Russell Taylor

(In [3228]) Use MantidVec typedef where necessary in MantidPlot. Re #793.

comment:15 Changed 11 years ago by Russell Taylor

(In [3235]) Add a call to use the 'Low Fragmentation Heap' in Windows (http://msdn.microsoft.com/en-us/library/aa366750%28VS.85%29.aspx). This allows a multi-loop script that was previously failing with a bad-alloc after ~250 iterations to get up to at least 850 (the most tested so far). Initial tests show no speed penalty, but an increase in overall memory footprint. Re #793.
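For reference, the linked MSDN page describes enabling the Low Fragmentation Heap via `HeapSetInformation`. A minimal sketch of that call is given below; the wrapper name `enableLowFragmentationHeap` is hypothetical, the choice of the default process heap is an assumption, and the actual commit may differ in detail.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to switch the default process heap to the Low Fragmentation Heap.
// Returns true on success; always returns false on non-Windows platforms.
bool enableLowFragmentationHeap() {
#ifdef _WIN32
  ULONG heapType = 2; // 2 selects the low-fragmentation heap
  return HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                            &heapType, sizeof(heapType)) != 0;
#else
  return false;
#endif
}
```

The setting applies per heap, so any additional heaps would each need the same call.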

comment:16 Changed 11 years ago by Russell Taylor

  • Status changed from assigned to testing
  • Resolution set to fixed

The changes above appear to have taken us to an acceptable position. Much more stable (i.e. long term) running has been reported on Linux and my own tests on Windows indicate that you can run for a very long time (indefinitely?) in MantidScript (memory usage was completely stable), although there is a memory leak somewhere in MantidPlot, which eventually leads to a crash (not sure if it's when it runs out of memory or some other problem before that). Nevertheless, MantidPlot lasts long enough for most practical needs.

Changed 11 years ago by Russell Taylor

comment:17 Changed 11 years ago by Russell Taylor

  • Status changed from testing to closed

Response from Aziz after he tested this out:

Good news again: I ran a more realistic test overnight of Mantid stability, by looking at various and different RAW data in the network archive, skipping the processing for absent vanadium and files

With preliminary scripts I managed to retrieve Vruns, sample and V absorption params, cal files and all run numbers of previously Ariel-generated gss files in any powder analysis PC machine directory structure for different cycles and users. This input was successfully fed into the Mantid processing, so as to have a mirror of the Ariel data treatment done in the past.

The script ran for 7 hours producing a mirror of the last 3 cycles (whose results have yet to be compared). This has generated ~800 processed runs during which I haven't noticed ANY memory leak. Mantid then crashed when trying to load an existing Raw file, but unfortunately, and despite a proper mantid.user.properties, the log was not created/saved, so I can't tell if that's due to fragmentation or to network access.

Anyway, this represents the most complicated thing we'd ever do, so I believe the test is a success, and that the crash represents only a small risk for normal operation.

comment:18 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to github issue 1641
