Ticket #2187 (closed: fixed)
Investigate using alternative mallocs (e.g. tcmalloc) for performance
Reported by: | Janik Zikovsky | Owned by: | Janik Zikovsky |
---|---|---|---|
Priority: | major | Milestone: | Iteration 27 |
Component: | Mantid | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Tester: | Russell Taylor |
Description
There are a few alternative mallocs that help with multithreaded object allocation: tcmalloc, nedmalloc. Research how to use these and maybe run some speed tests in Mantid.
Change History
comment:2 Changed 10 years ago by Janik Zikovsky
Results so far:
- Downloaded and compiled libunwind (needed a flag): export CFLAGS=-U_FORTIFY_SOURCE
- Downloaded and installed google-perf-tools (includes tcmalloc).
Then, using:
export LD_PRELOAD="/usr/local/lib/libtcmalloc.so"
I ran a test that loaded a large TOPAZ event file (run 1715). Time went down from 42 seconds to 24 seconds.
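For reference, the steps above as a rough shell sketch; the tarball directory names and the test script name are illustrative placeholders, not the exact ones used:

export CFLAGS=-U_FORTIFY_SOURCE                      # lets libunwind build
( cd libunwind-*/ && ./configure && make && sudo make install )
( cd google-perftools-*/ && ./configure && make && sudo make install )   # provides libtcmalloc.so
export LD_PRELOAD="/usr/local/lib/libtcmalloc.so"    # preload tcmalloc without relinking Mantid
python topaz_1715_load_test.py                       # hypothetical driver for the TOPAZ 1715 load test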
Modified CMake to link to libtcmalloc.so.
Everything compiles and the tests run fine. The same test now runs in about 30 seconds.
HOWEVER, Python crashes with a segfault when you import MantidFramework. This is where I am at now.
comment:3 Changed 10 years ago by Janik Zikovsky
More notes:
Looks like the Python API uses dlopen to load its libraries, and there is a warning about that in the tcmalloc docs. I got around the segfault by compiling tcmalloc with Thread Local Storage (TLS) turned off.
I ran a memory leak tester that simply loads the same event file 25 times (a sketch is included at the end of this comment):
- Without tcmalloc: the reported memory usage went up to 7.1 GB virtual / 6.5 GB resident.
- With tcmalloc: memory usage topped out at 4.3 GB virtual / 4.0 GB resident.
No performance results yet.
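For reference, a minimal sketch of what such a tester can look like, reusing the same Mantid Python API calls that appear in the timing script later in this ticket; the workspace name is just a placeholder, and memory usage is watched from outside the script (e.g. with top):

import MantidFramework
MantidFramework.mtd.initialise()

# Load the same event file 25 times into a placeholder workspace;
# watch virtual/resident memory from outside (e.g. with top).
for i in range(25):
    LoadSNSEventNexus("/home/8oz/data/TOPAZ_1715_event.nxs", "leak_test")
    print "Finished load", i + 1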
comment:4 Changed 10 years ago by Janik Zikovsky
Performance testing results:
--- Standard allocator: ---
12.4525589943 seconds for TOPAZ instrument loading
14.2845599651 seconds for TOPAZ 1715 loading
32.47803092 seconds PG3_1370 data reduction
59.2151498795 seconds elapsed total
--- tcmalloc with Thread Local Storage turned off: ---
9.26182699203 seconds for TOPAZ instrument loading
8.47131490707 seconds for TOPAZ 1715 loading
29.6662449837 seconds PG3_1370 data reduction
47.3993868828 seconds elapsed total
So the improvement is across the board, but it is more noticeable for TOPAZ (perhaps because it has lots of pixels with relatively few events in each, which means lots of small allocations).
comment:5 Changed 10 years ago by Janik Zikovsky
Follow-up: This is the Python code used in the tests above:
import time
import MantidFramework
MantidFramework.mtd.initialise()

t0 = time.time()
LoadEmptyInstrument("/home/8oz/Code/Mantid/Code/Mantid/Instrument/TOPAZ_Definition.xml", "topaz_instrument")
t1 = time.time()
LoadSNSEventNexus("/home/8oz/data/TOPAZ_1715_event.nxs", "topaz")
t2 = time.time()
if 1:
    calib = "../../../../Test/AutoTestData/pg3_mantid_det.cal"
    data_file = "/home/8oz/data/PG3_1370_event.nxs"
    wksp = "pg3"
    LoadSNSEventNexus(data_file, wksp)
    AlignDetectors(InputWorkspace=wksp, OutputWorkspace=wksp, CalibrationFile=calib)
    DiffractionFocussing(InputWorkspace=wksp, OutputWorkspace=wksp, GroupingFileName=calib)
    # Sort(InputWorkspace=wksp, SortBy="Time of Flight")
    ConvertUnits(InputWorkspace=wksp, OutputWorkspace=wksp, Target="TOF")
    NormaliseByCurrent(InputWorkspace=wksp, OutputWorkspace=wksp)
t3 = time.time()
print
print
print t1 - t0, " seconds for TOPAZ instrument loading"
print t2 - t1, " seconds for TOPAZ 1715 loading"
print t3 - t2, " seconds PG3_1370 data reduction"
print t3 - t0, " seconds elapsed total"
comment:6 Changed 10 years ago by Janik Zikovsky
Extra note: when we have a statically-linked, "supercomputing" Mantid (Vickie's ticket), we might enable TLS for TCMalloc.
comment:7 Changed 10 years ago by Janik Zikovsky
Another memory usage test, this time from within MantidPlot, loading PG3_1370 15 times in a script:
- With tcmalloc: 4.7 GB virtual / 4.0 GB resident memory; seemed stable here.
- With the standard allocator: 6.0 GB virtual / 5.1 GB resident memory; looked like it would keep going up slowly.
comment:8 Changed 10 years ago by Janik Zikovsky
Continued: still 4.7 GB virtual / 4.0 GB resident memory after 125 loads.
comment:10 Changed 10 years ago by Janik Zikovsky
System tests were run with and without TCMalloc. All tests pass (a couple that failed in both cases were removed):
Without TCMalloc:
- 374.4 sec user
- 2:55.45 time elapsed
With TCMalloc:
- 351.4 sec user
- 2:45.50 time elapsed
comment:11 Changed 10 years ago by Janik Zikovsky
(Which is about a 6% speedup in user time using TCMalloc: 374.4 s down to 351.4 s.)
comment:12 Changed 10 years ago by Russell Taylor
comment:13 Changed 10 years ago by Russell Taylor
comment:14 Changed 10 years ago by Janik Zikovsky
comment:15 Changed 10 years ago by Janik Zikovsky
comment:16 Changed 10 years ago by Janik Zikovsky
- Status changed from accepted to verify
- Resolution set to fixed
TCMalloc is now integrated in CMake, so it is up to developers whether to try it or not. Later it could be added to the Linux build, but I am closing this ticket now.
comment:17 Changed 10 years ago by Russell Taylor
- Status changed from verify to verifying
- Tester set to Russell Taylor
comment:18 Changed 10 years ago by Russell Taylor
- Status changed from verifying to closed
The CMake build will link to tcmalloc if it finds it on the system (a rough sketch of such an optional link is below). Unfortunately, RHEL only has a 32-bit google-perf-tools in the EPEL repo, so we'd have to build our own if we want to use it there.
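For illustration only, a minimal sketch of what an optional tcmalloc link can look like in CMake; this is not the actual Mantid build logic, and MyTarget is a placeholder target name:

# Look for libtcmalloc; if it is not found, nothing changes.
find_library ( TCMALLOC_LIB NAMES tcmalloc tcmalloc_minimal )
if ( TCMALLOC_LIB )
  message ( STATUS "Linking against tcmalloc: ${TCMALLOC_LIB}" )
  target_link_libraries ( MyTarget ${TCMALLOC_LIB} )
endif ()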
comment:19 Changed 5 years ago by Stuart Campbell
This ticket has been transferred to GitHub issue 3034.