Ticket #9215 (closed: fixed)

Opened 7 years ago

Last modified 5 years ago

Speed up data loading

Reported by: Arturs Bekasovs Owned by: Raquel Alvarez Banos
Priority: major Milestone: Backlog
Component: Muon Keywords: ALC
Cc: Blocked By: #9213, #11382
Blocking: #11319 Tester: Karl Palmen

Description (last modified by Arturs Bekasovs) (diff)

Data loading and integration currently take a very long time, which makes it painful to experiment with integration parameters or to change the set of runs. This ticket is for looking into that problem.

One suggested solution is to keep some of the loaded raw data in memory and re-use it for integration. Investigate how feasible this is, given the number of files we usually deal with.
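A minimal sketch of the in-memory idea, in Python for brevity. Everything here is illustrative: load_run and integrate are hypothetical names, the counts are fake data standing in for a real file read, and the cache size of 64 runs is an arbitrary assumption that would have to be tuned to the memory actually available.

```python
from functools import lru_cache


@lru_cache(maxsize=64)  # assumed bound; tune to available memory
def load_run(run_number):
    """Pretend to read raw counts for one run from disk (the expensive step)."""
    # Fake forward/backward detector counts per time bin (hypothetical data).
    forward = tuple(100 + run_number + i for i in range(10))
    backward = tuple(80 + i for i in range(10))
    return forward, backward


def integrate(run_number, first_bin, last_bin):
    """Integrate the cached raw counts over the bins [first_bin, last_bin)."""
    forward, backward = load_run(run_number)
    return sum(forward[first_bin:last_bin]), sum(backward[first_bin:last_bin])


# Changing the integration limits re-uses the cached data: one load, two integrations.
integrate(1, 0, 5)
integrate(1, 2, 8)
print(load_run.cache_info().misses)  # number of real loads so far
```

The bounded cache is the point of the sketch: it keeps memory use predictable when the number of runs is large, at the cost of reloading runs that have been evicted.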

Another thing to investigate: make PlotAsymmetryByLogValue use multiple threads. Loading and integrating each file is an independent operation and probably spends a good part of its time waiting for I/O, so this should improve performance considerably on multi-core machines.
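A rough sketch of the per-file threading idea, again in Python rather than in the algorithm's own code. The function names are hypothetical, the sleep stands in for file I/O, and the asymmetry formula (F - B) / (F + B) is a simplified assumption. It only applies if the file reader can safely be called from several threads at once.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_and_integrate(run_number):
    """Load one run and return (run_number, integrated asymmetry)."""
    time.sleep(0.01)  # stands in for waiting on file I/O
    forward = [100 + run_number] * 10  # fake forward counts
    backward = [80] * 10               # fake backward counts
    f, b = sum(forward), sum(backward)
    return run_number, (f - b) / (f + b)


def process_runs(run_numbers, max_workers=4):
    """Process every run independently on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; each item is an independent task.
        return dict(pool.map(load_and_integrate, run_numbers))


results = process_runs([1, 2, 3])
```

Because each run is independent, no locking is needed around the results; the only shared resource is whatever library reads the files.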

Attachments

PlotAsymmetryByLogValue.txt (612 bytes) - added by Raquel Alvarez Banos 6 years ago.
PlotAsymmetryByLogValueTimes.txt (930 bytes) - added by Raquel Alvarez Banos 6 years ago.
run_plotasymmetry.py (2.2 KB) - added by Raquel Alvarez Banos 6 years ago.

Change History

comment:1 Changed 7 years ago by Nick Draper

  • Status changed from new to assigned

comment:2 Changed 6 years ago by Arturs Bekasovs

  • Keywords ALC added
  • Description modified (diff)
  • Summary changed from [ALC] Reduce loading times when playing with parameters to Reduce loading times when playing with parameters

comment:3 Changed 6 years ago by Arturs Bekasovs

  • type changed from enhancement to task
  • Description modified (diff)
  • Summary changed from Reduce loading times when playing with parameters to Speed up data loading

comment:4 Changed 6 years ago by Arturs Bekasovs

  • Description modified (diff)

comment:5 Changed 6 years ago by Arturs Bekasovs

  • Description modified (diff)

comment:6 Changed 6 years ago by Anders Markvardsen

  • Owner changed from Arturs Bekasovs to Anders Markvardsen

comment:7 Changed 6 years ago by Anders Markvardsen

  • Owner changed from Anders Markvardsen to Karl Palmen

comment:8 Changed 6 years ago by Anders Markvardsen

  • Owner changed from Karl Palmen to Raquel Alvarez Banos

comment:9 Changed 6 years ago by Raquel Alvarez Banos

I see different issues to address here:

  1. Parallelize data loading
  2. Parallelize asymmetry calculation
  3. Add intelligence so that only new datasets are loaded (this is ticket #6931)

All three are related to PlotAsymmetryByLogValue and require some variables to be declared static, so that their values can be re-used from one call to the next. 1 and 2 currently belong to the same for loop, so they should be split into separate loops if we want the user to be able to play with integration limits without having to reload all the datasets every time. I will be creating a new ticket for this task, which will be blocked by the current ticket and will be blocking #6931.
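The split described above could look roughly like the following Python sketch. It is not the algorithm's actual code: AsymmetryScan and fake_loader are hypothetical names, an instance attribute plays the role of the static variables, and the asymmetry is simplified to (F - B) / (F + B).

```python
class AsymmetryScan:
    """Keeps loaded runs between calls so only new runs are read."""

    def __init__(self, loader):
        self._loader = loader  # callable: run number -> (forward, backward)
        self._loaded = {}      # state that survives from one call to the next

    def load(self, run_numbers):
        """Loop 1: load only the runs not already in memory."""
        new_runs = [r for r in run_numbers if r not in self._loaded]
        for run in new_runs:
            self._loaded[run] = self._loader(run)
        return new_runs

    def integrate(self, run_numbers, first_bin, last_bin):
        """Loop 2: integrate already-loaded runs over [first_bin, last_bin)."""
        results = {}
        for run in run_numbers:
            forward, backward = self._loaded[run]
            f = sum(forward[first_bin:last_bin])
            b = sum(backward[first_bin:last_bin])
            results[run] = (f - b) / (f + b)
        return results


def fake_loader(run):
    """Hypothetical stand-in for reading one run file."""
    return [100 + run] * 10, [80] * 10


scan = AsymmetryScan(fake_loader)
scan.load([1, 2])             # reads runs 1 and 2
scan.load([1, 2, 3])          # reads only run 3
a = scan.integrate([1, 2, 3], 0, 5)
b = scan.integrate([1, 2, 3], 2, 8)  # new limits, no reloading
```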

comment:10 Changed 6 years ago by Raquel Alvarez Banos

  • Blocking 11319 added

comment:11 Changed 6 years ago by Raquel Alvarez Banos

Just to clarify comment 9:

  • In this ticket, I will be parallelizing PlotAsymmetryByLogValue, which means that data loading + asymmetry calculation will stay within the same loop.
  • In ticket #11319 the two processes will be split into separate loops, and asymmetry calculation will be parallelized as well.
Last edited 6 years ago by Raquel Alvarez Banos (previous) (diff)

Changed 6 years ago by Raquel Alvarez Banos

comment:12 Changed 6 years ago by Raquel Alvarez Banos

After discussion with Martyn, I am creating a new ticket that will be blocking this one. See issue #11324.

comment:13 Changed 6 years ago by Raquel Alvarez Banos

  • Blocked By 11382 added

comment:13 Changed 6 years ago by Raquel Alvarez Banos

  • Blocked By 11382 removed

And another one #11382

comment:14 Changed 6 years ago by Raquel Alvarez Banos

  • Blocked By 11382 added

comment:16 Changed 6 years ago by Raquel Alvarez Banos

  • Status changed from assigned to inprogress

Re #9215 Main loop parallelization

Changeset: ab13759a53435f722fbd3580ca99c091c0b787a1

comment:17 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Get rid of scoped workspaces which are not thread-safe

Changeset: da6b8da313ac10e9458c5b9dc31f024c27ea9c1a

comment:18 Changed 6 years ago by Raquel Alvarez

  • Status changed from inprogress to verify
  • Resolution set to fixed

This is being verified as pull request #449.

Changed 6 years ago by Raquel Alvarez Banos

Changed 6 years ago by Raquel Alvarez Banos

comment:19 Changed 6 years ago by Karl Palmen

  • Status changed from verify to verifying
  • Tester set to Karl Palmen

comment:20 Changed 6 years ago by Raquel Alvarez Banos

I have attached two files that I used to check performance: "run_plotasymmetry.py" is the Python script I ran in Mantid before and after this fix, and "PlotAsymmetryByLogValueTimes.txt" is a brief summary of the results.

comment:21 Changed 6 years ago by Raquel Alvarez Banos

It seems that I can't follow the approach I had planned (see the commits in comments 16 and 17). The reason is that muon NeXus files (e.g. MUSR... and HIFI...) are in the old HDF4 format, which cannot be safely accessed from multiple threads (this was causing the build to fail on all platforms except Windows). The only solution is to load the NeXus files in serial and then analyse the workspaces in parallel.
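As an illustration of the ordering only (serial load, then parallel analysis), here is a small Python sketch. The names load_run, analyse and process are hypothetical, the data is fake, and the asymmetry is simplified to (F - B) / (F + B). In CPython, threads will not speed up pure-Python arithmetic, so this shows the structure rather than the expected gain.

```python
from concurrent.futures import ThreadPoolExecutor


def load_run(run_number):
    """Hypothetical reader, assumed NOT safe to call from several threads."""
    return [100 + run_number] * 10, [80] * 10


def analyse(item):
    """Compute the integrated asymmetry for one already-loaded run."""
    run_number, (forward, backward) = item
    f, b = sum(forward), sum(backward)
    return run_number, (f - b) / (f + b)


def process(run_numbers):
    # Step 1: load every file one after another, on the calling thread only.
    loaded = [(run, load_run(run)) for run in run_numbers]
    # Step 2: the data is now in memory, so the analysis can run concurrently.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(analyse, loaded))


results = process([1, 2, 3])
```

The trade-off is memory: all loaded runs are held at once between the two steps, so they should be released once the analysis is finished.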

comment:22 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Update algorithm to allow data to be loaded in serial

Changeset: b91c6d62425afd326281d20d35c10a84769519ce

comment:23 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Store loaded data in vectors

Changeset: ef53f97e213e6da01ecb78e42d46858d948e1f43

comment:24 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Apply corrections and grouping if requested

Changeset: cc53d936258140f12cfbffcff4fa017c74139872

comment:25 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Fix bug and clear vectors

Changeset: c472788e6eeea74c0ccef31a25fa9318c7232ed5

comment:26 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Move progress report to loading step

Changeset: a0b31c70515c15d82efc374d996e41e831d08626

comment:27 Changed 6 years ago by Raquel Alvarez Banos

comment:28 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Fix compilation error on rhel7

Changeset: 6641a885c7a2993654a93ae176d66e0079226432

comment:29 Changed 6 years ago by Raquel Alvarez Banos

Re #9215 Replace omp command by macro

Changeset: 067ce917f1f15ece54708e7e2b66cd0f8e11b5ff

comment:30 Changed 6 years ago by Raquel Alvarez

Jenkins, retest this please

comment:31 Changed 6 years ago by Karl Palmen

  • Status changed from verifying to closed

Merge pull request #449 from mantidproject/9215_Speed_up_data_loading

Speed up data loading

Full changeset: 381e0374db3920e5ce13d700c0206f82f523f9ec

comment:32 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to github issue 10058
