Ticket #793 (closed: fixed)
Memory issues
| Reported by: | Nick Draper | Owned by: | Russell Taylor |
|---|---|---|---|
| Priority: | critical | Milestone: | Iteration 20 |
| Component: | | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Tester: | |
Description (last modified by Nick Draper)
Aziz has been having problems running the same analysis lots of times in a loop.
ppt presentation sent separately via email as it is too big to attach.
Attachments
Change History
comment:2 Changed 11 years ago by Russell Taylor
- Status changed from new to assigned
- Summary changed from Python: SystemExit errors when Looping on large data sets to Memory issues
We have two problems here: memory becomes fragmented over time (the cause of Aziz's crashes: a bad_alloc is thrown when malloc can't find enough contiguous memory for a spectrum), and reserved-but-unused memory is still not being reported correctly on Linux when running within MantidPlot (Qt is hiding it somewhere!).
I believe the solution to both of these problems may be the same: segregate the workspace data from everything else. This involves writing a custom allocator to use with our vectors, or even replacing the storage class completely.
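To illustrate the "custom allocator" idea, here is a minimal sketch of an allocator that could route a vector's storage through a dedicated routine, keeping spectrum data apart from general-purpose heap allocations. None of these names exist in Mantid; they are hypothetical, and `std::malloc` stands in for whatever segregated storage (e.g. an mmap-backed pool) a real implementation would use.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical allocator: a real version could draw from a dedicated
// mmap-backed region so workspace data never fragments the main heap.
template <typename T>
struct WorkspaceAllocator {
    using value_type = T;
    WorkspaceAllocator() = default;
    template <typename U>
    WorkspaceAllocator(const WorkspaceAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // std::malloc is a stand-in for the segregated storage routine.
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};

template <typename T, typename U>
bool operator==(const WorkspaceAllocator<T>&, const WorkspaceAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const WorkspaceAllocator<T>&, const WorkspaceAllocator<U>&) { return false; }

// A spectrum's bin data stored via the custom allocator:
using SpectrumData = std::vector<double, WorkspaceAllocator<double>>;

double sum_bins(const SpectrumData& s) {
    double total = 0.0;
    for (double v : s) total += v;  // the data itself is used as normal
    return total;
}
```

The point of the allocator approach is that it plugs into the existing `std::vector`-based storage without changing any calling code, only the vector's template arguments.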
comment:7 Changed 11 years ago by Nick Draper
- Milestone changed from Iteration 19 to Iteration 20
Moved as part of the end of Iteration 19
comment:10 Changed 11 years ago by Russell Taylor
comment:11 Changed 11 years ago by Russell Taylor
comment:12 Changed 11 years ago by Russell Taylor
comment:13 Changed 11 years ago by Russell Taylor
(In [3226]) Added a line to MemoryManager that tells malloc on Linux to use a different memory allocation system call (mmap) from the 'usual' one (sbrk) for requests above the threshold given by the second argument (in bytes). The effect is that, for the current threshold value of 8*4096, storage for workspaces having 4096 or more bins per spectrum will be allocated using mmap. This should mean that memory is returned to the kernel as soon as a workspace is deleted, preventing things going to managed workspaces when they shouldn't, and will also hopefully reduce memory fragmentation. Potential downsides to look out for are whether this memory allocation technique makes things noticeably slower and whether it wastes memory (mmap allocates in blocks of the system page size).
Also cleared a couple of warnings elsewhere. Re #793.
comment:14 Changed 11 years ago by Russell Taylor
comment:15 Changed 11 years ago by Russell Taylor
(In [3235]) Add a call to use the 'Low Fragmentation Heap' in Windows (http://msdn.microsoft.com/en-us/library/aa366750%28VS.85%29.aspx). This allows a multi-loop script that was previously failing with a bad-alloc after ~250 iterations to get up to at least 850 (the most tested so far). Initial tests show no speed penalty, but an increase in overall memory footprint. Re #793.
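The Windows API call behind this change is `HeapSetInformation` with the `HeapCompatibilityInformation` class, where the value 2 requests the Low Fragmentation Heap (as documented at the MSDN link above). A sketch, with a hypothetical wrapper name, guarded so it compiles as a no-op off Windows:

```cpp
#include <cassert>
#if defined(_WIN32)
#include <windows.h>
#endif

// Enable the Low Fragmentation Heap on the process's default heap.
// Returns true if the LFH was enabled; false on failure or off Windows.
bool enableLowFragmentationHeap() {
#if defined(_WIN32)
    ULONG lfh = 2;  // 2 = request the Low Fragmentation Heap
    return HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                              &lfh, sizeof(lfh)) != 0;
#else
    return false;  // Windows-only heap tuning; no-op elsewhere
#endif
}
```

Note that the call can fail (and the LFH stay disabled) when the process runs under a debugger, which is worth keeping in mind when reproducing the bad-alloc behaviour described above.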
comment:16 Changed 11 years ago by Russell Taylor
- Status changed from assigned to testing
- Resolution set to fixed
The changes above appear to have taken us to an acceptable position. Much more stable (i.e. long-term) running has been reported on Linux, and my own tests on Windows indicate that you can run for a very long time (indefinitely?) in MantidScript (memory usage was completely stable). There is, however, a memory leak somewhere in MantidPlot, which eventually leads to a crash (not sure if it's when it runs out of memory or some other problem before that). Nevertheless, MantidPlot lasts long enough for most practical needs.
comment:17 Changed 11 years ago by Russell Taylor
- Status changed from testing to closed
Response from Aziz after he tested this out:
Good news again: I ran a more realistic overnight test of Mantid stability, by looking at various RAW data in the network archive, skipping the processing for absent vanadium and files.
With preliminary scripts I managed to retrieve Vruns, sample and V absorption params, cal files, and all run numbers of previously Ariel-generated gss files in any powder analysis PC machine directory structure, for different cycles and users. This input was successfully fed into the Mantid processing, so as to have a mirror of the Ariel data treatment done in the past.
The script ran for 7 hours, producing a mirror of the last 3 cycles (whose results have yet to be compared). This generated ~800 processed runs, during which I didn't notice ANY memory leak. Mantid then crashed when trying to load an existing Raw file, but unfortunately, and despite a proper mantid.user.properties, the log was not created/saved, so I can't tell whether that's due to fragmentation or to network access.
Anyway, this represents the most complicated thing we'd ever do, so I believe the test is a success, and that the crash represents only a small risk for normal operation.
comment:18 Changed 5 years ago by Stuart Campbell
This ticket has been transferred to github issue 1641