Ticket #6385 (assigned)

Opened 8 years ago

Last modified 5 years ago

Race condition in ConvertToMD

Reported by: Vickie Lynch
Owned by: Vickie Lynch
Priority: major
Milestone: Backlog
Component: Framework
Keywords:
Cc:
Blocked By:
Blocking:
Tester:

Description

There is a race condition in ConvertToMD that sometimes causes it to fail when run through NX, but not when run through ssh or on localhost.

Change History

comment:1 Changed 8 years ago by Vickie Lynch

  • Status changed from new to accepted

comment:2 Changed 8 years ago by Vickie Lynch

Tried NX on outback2 and topaz, with both event and matrix2d workspaces, and with both ConvertToMD and ConvertToDiffractionMDWorkspace, repeating each conversion 5 times. Did not see a crash.

comment:3 Changed 8 years ago by Vickie Lynch

  • Milestone changed from Release 2.4 to Release 2.5

comment:4 Changed 8 years ago by Vickie Lynch

Hi Vickie,

The scripts that have the problem are currently in my home directory:

/SNS/users/eu7/MANTID_SCD_REDUCTION. To see the problem, copy the scripts

MDReduceOneSCD_Run.py, ReduceDictionary.py, ReduceSCD_Parallel.py

and the configuration file:

MDSAPPHIRE_JUNE_SPHERE.config

to a subdirectory of your home directory. You will need to edit the configuration file and change the output directory to a subdirectory you can write to. The scripts write out a bunch of intermediate results so it's best to direct the output to a dedicated subdirectory.

The peculiar problem can be demonstrated as follows:

  1. Connect to the analysis cluster using NX.
  2. In a command window within the NX session, ssh to localhost.
  3. In that ssh session, cd to the directory with the scripts and configuration file and type "python ReduceSCD_Parallel.py MDSAPPHIRE_JUNE_SPHERE.config". In about 3 minutes the script should finish and produce the main output, "SAPPHIRE_JUNE_SPHERE_Rhombohedral_R.integrate", along with many intermediate files. That is, the scripts work fine when run through an ssh session or on a local machine.
  4. Exit the ssh session to get back to the command window running directly in the NX session.
  5. Delete all the files made by the script.
  6. In the direct command window, cd to the directory and run the script again. This time, after about 3 minutes the script will "hang".
  7. While the script "hangs", check the output directory from a different window. The intermediate files for runs 5637-5644 should all have been created, but none of the final result files ("SAPPHIRE_JUNE_SPHERE*") will exist yet.
  8. In the window where the script was started, press <control>c; within a few seconds the script should terminate normally, writing out all of the final result files.
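Per the description below, the hang happens because the driving script keeps waiting on a last child process that never exits. As a diagnostic aid, here is a minimal sketch of a driver that waits on each child with a timeout and reports which ones got stuck, instead of blocking forever. It assumes each run is launched as a separate command via Python's subprocess module; the real ReduceSCD_Parallel.py may start its processes differently, and run_all and timeout_s are hypothetical names.

```python
import subprocess
import sys


def run_all(commands, timeout_s):
    """Start every command, then wait on each one in turn (hypothetical helper).

    Returns the indices of commands that had not exited within timeout_s
    seconds of being waited on. Those processes are terminated so the
    driver itself never blocks indefinitely.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    stuck = []
    for index, proc in enumerate(procs):
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            stuck.append(index)
            proc.terminate()  # sends SIGTERM on POSIX systems
            proc.wait()       # reap the terminated child
    return stuck


if __name__ == "__main__":
    # One child exits immediately; the other sleeps long enough to time out.
    demo = [
        [sys.executable, "-c", "pass"],
        [sys.executable, "-c", "import time; time.sleep(30)"],
    ]
    print(run_all(demo, timeout_s=2))  # prints [1]
```

With a suitable timeout, a hang like the one in step 6 would show up as a non-empty list of stuck command indices, which identifies the process that failed to terminate.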

The puzzle is why the last separate process does not terminate properly when run in an NX session, which causes the driving script to hang. I've never seen the problem when running on my local systems or from an ssh session, but it happens consistently in an NX session.

Thanks for trying this, and for any suggestion you might have on how to fix this.

NOTE: The script SegFaultDemo.py is a MUCH simpler script showing a problem in NX. Running the script directly from a command window in NX causes a segfault, but running it from an ssh session works :-(

Dennis

comment:5 Changed 8 years ago by Vickie Lynch

Scripts run correctly under NX if the LD_PRELOAD environment variable is cleared:

import os

os.environ['LD_PRELOAD'] = ''
Last edited 8 years ago by Vickie Lynch
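A detail worth keeping in mind with this workaround: assigning to os.environ inside a running interpreter does not unload anything the dynamic linker already preloaded into that process; it only changes the environment inherited by processes started afterwards, such as the separate processes a driving script launches. The following sketch shows that effect using only the standard library; run_without_preload is a hypothetical helper, not part of Mantid.

```python
import os
import subprocess
import sys


def run_without_preload(args):
    """Run a child process with LD_PRELOAD cleared (hypothetical helper).

    The parent's own environment is left untouched; only the copy passed
    to the child has LD_PRELOAD set to an empty string.
    """
    env = dict(os.environ)
    env["LD_PRELOAD"] = ""
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # The child reports the LD_PRELOAD value it inherited.
    probe = [sys.executable, "-c", "import os; print(repr(os.environ.get('LD_PRELOAD')))"]
    print(run_without_preload(probe).stdout.strip())  # prints ''
```

Passing an explicit env keeps the change scoped to one child; assigning os.environ['LD_PRELOAD'] = '' as in the comment above has the same effect on every child started afterwards.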

comment:6 Changed 8 years ago by Vickie Lynch

Refs #6385 environment variable for nx python

Changeset: 737c01a1e5e8546e7a69d4ac9214601208e2ebad

comment:7 Changed 8 years ago by Vickie Lynch

  • Milestone changed from Release 2.5 to Release 2.4

comment:8 Changed 8 years ago by Nick Draper

  • Milestone changed from Release 2.4 to Release 2.5

Moved at the code freeze for release 2.4

comment:9 Changed 8 years ago by Vickie Lynch

  • Status changed from accepted to verify
  • Resolution set to fixed
  • type changed from enhancement to defect
  • Milestone changed from Release 2.5 to Release 2.4

Tested using NX on the SNS analysis computers; it works for stand-alone Python scripts with both the new and old APIs. Use this script from Dennis to test:

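import sys

# Make the Mantid Python modules importable (use the commented path to test the stable install instead)
sys.path.append("/opt/mantidunstable/bin")
#sys.path.append("/opt/Mantid/bin")

# Old-API framework initialisation
from MantidFramework import mtd
mtd.initialise()
# Swap these two imports to test the old API (mantidsimple) instead of the new one (mantid.simpleapi)
#from mantidsimple import *
from mantid.simpleapi import *

file_name = "/SNS/TOPAZ/IPTS-4822/0/3857/NeXus/TOPAZ_3857_event.nxs"

# Load the TOPAZ event data, keeping only events with time-of-flight between 500 and 16000
LoadEventNexus( Filename=file_name, OutputWorkspace='TOPAZ_events', FilterByTofMin='500', FilterByTofMax='16000' )

# Convert the events to an MD event workspace in Q (lab frame)
ConvertToDiffractionMDWorkspace( InputWorkspace='TOPAZ_events', OutputWorkspace='TOPAZ_MDEW',
    LorentzCorrection='1', OutputDimensions='Q (lab frame)', SplitInto='2,2,2', SplitThreshold='50', MaxRecursionDepth='12')

sys.exit()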

comment:10 Changed 8 years ago by Vickie Lynch

  • Status changed from verify to reopened
  • Resolution fixed deleted

comment:11 Changed 8 years ago by Vickie Lynch

  • Status changed from reopened to accepted

comment:12 Changed 8 years ago by Vickie Lynch

Refs #6385 only set variable if using nx

Changeset: fc231d6915a73c74b2c715126316cfe93ac621e5

comment:13 Changed 8 years ago by Vickie Lynch

  • Status changed from accepted to verify
  • Resolution set to fixed

Martyn requested that this variable be changed only when using NX. See the script above for testing.

comment:14 Changed 8 years ago by Vickie Lynch

  • Status changed from verify to reopened
  • Resolution fixed deleted

Saw this error again, so reopening.

comment:15 Changed 8 years ago by Vickie Lynch

  • Status changed from reopened to accepted

comment:16 Changed 8 years ago by Vickie Lynch

Refs #6385 Reverting changes that do not work

Changeset: 638205c0f4ec50f93dd79af39741cbb402be1d5a

comment:17 Changed 8 years ago by Vickie Lynch

  • Milestone changed from Release 2.4 to Release 2.5

This only affects stand-alone Python scripts run through NX on the analysis computers. The topaz computer does not have this problem. I will check whether this can be fixed by changing the NX setup on the analysis computers.

comment:18 Changed 8 years ago by Vickie Lynch

Refs #6385 only set variable if using nx

Changeset: fc231d6915a73c74b2c715126316cfe93ac621e5

comment:19 Changed 8 years ago by Vickie Lynch

Refs #6385 Reverting changes that do not work

Changeset: 638205c0f4ec50f93dd79af39741cbb402be1d5a

comment:20 Changed 8 years ago by Vickie Lynch

Refs #6385 set environment variable needed for SNS analysis computers

Changeset: 4be36a5245218aeb0799368468158873dbc8f791

comment:21 Changed 8 years ago by Vickie Lynch

Refs #6385 LD_PRELOAD in mantid.csh

Changeset: 696aab3a4dd3f73ad107605121b20a662c44264c

comment:22 Changed 7 years ago by Nick Draper

  • Milestone changed from Release 2.5 to Release 2.6

Moved to r2.6 at the end of r2.5

comment:23 Changed 7 years ago by Nick Draper

  • Status changed from accepted to assigned

comment:24 Changed 7 years ago by Nick Draper

  • Status changed from assigned to new

comment:25 Changed 7 years ago by Nick Draper

  • Component changed from Mantid to Framework

comment:26 Changed 7 years ago by Nick Draper

  • Milestone changed from Release 2.6 to Backlog

Moved to backlog at the code freeze for R2.6

comment:27 Changed 7 years ago by Nick Draper

  • Status changed from new to assigned

Bulk move to assigned at the introduction of the triage step

comment:28 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to GitHub issue 7231.
