Ticket #5805 (inprogress)

Opened 8 years ago

Last modified 5 years ago

Trying to load a bad file causes a segfault

Reported by: Stuart Campbell Owned by: Federico M Pouzols
Priority: minor Milestone: Release 3.5
Component: Framework Keywords:
Cc: Blocked By:
Blocking: Tester: Lottie Greenwood

Description (last modified by Dan Nixon) (diff)

During the testing of the new streaming project at the SNS, I discovered that some of the files that were produced caused mantid to segfault and crash. Although these files are probably created incorrectly and may contain junk, we still should fail gracefully rather than segfault.

These files are supposed to be event nexus files.

Attachments

HYSA_274_causes_crash.nxs.h5 (1.4 MB) - added by Stuart Campbell 8 years ago.
HYSA_275_causes_crash.nxs.h5 (1.4 MB) - added by Stuart Campbell 8 years ago.

Change History

Changed 8 years ago by Stuart Campbell

Changed 8 years ago by Stuart Campbell

comment:1 Changed 8 years ago by Stuart Campbell

The attached files were originally name HYSA_274.nxs.h5 and HYSA_275.nxs.h5. They can be renamed to the old format of HYSA_274_event.nxs and they still fail.

comment:2 Changed 8 years ago by Nick Draper

  • Milestone changed from Release 2.3 to Release 2.4

Moved to milestone 2.4

comment:3 Changed 8 years ago by Andrei Savici

  • Status changed from new to accepted
  • Owner set to Andrei Savici

comment:4 Changed 8 years ago by Andrei Savici

  • Status changed from accepted to verify
  • Resolution set to worksforme

Log values in DASLogs have problems. Seems to be working for me now (Ubuntu 12.10 and RHEL6)

comment:5 Changed 8 years ago by Russell Taylor

  • Status changed from verify to verifying
  • Tester set to Russell Taylor

comment:6 Changed 8 years ago by Russell Taylor

  • Status changed from verifying to reopened
  • Resolution worksforme deleted

The first file does just fail the algorithm, but the second one still crashes MantidPlot (both on RHEL6 & OSX). The failure looks to be fairly deep inside hdf so I don't know how easy it will be to fix (or how much we're bothered).

comment:7 Changed 8 years ago by Andrei Savici

  • Owner changed from Andrei Savici to Anyone
  • Status changed from reopened to assigned

comment:8 Changed 8 years ago by Nick Draper

  • Priority changed from major to minor
  • Milestone changed from Release 2.4 to Release 2.5

comment:9 Changed 7 years ago by Nick Draper

  • Milestone changed from Release 2.5 to Release 2.6

Moved to r2.6 at the end of r2.5

comment:10 Changed 7 years ago by Nick Draper

  • Status changed from assigned to new

comment:11 Changed 7 years ago by Nick Draper

  • Component changed from Mantid to Framework

comment:12 Changed 7 years ago by Nick Draper

  • Milestone changed from Release 2.6 to Backlog

Moved to the Backlog after the code freeze for R2.6

comment:13 Changed 7 years ago by Nick Draper

  • Status changed from new to assigned

bulk move to assigned at the into of the triage step

comment:14 Changed 6 years ago by Dan Nixon

  • Description modified (diff)

I'm not sure if this is actually the cause of the fault, but in the Nexus code there is a comment which implies getNextEntry is broken somehow.

In the stack trace from GDB I see a call to getNextEntry from getEntries()

#20 0x00007fffece93d15 in ?? () from /usr/lib/libNeXus.so.0
#21 0x00007fffeeb7ff45 in NeXus::File::getNextEntry() ()
   from /usr/lib/libNeXusCPP.so.0
#22 0x00007fffeeb83d8a in NeXus::File::getEntries(std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) () from /usr/lib/libNeXusCPP.so.0
#23 0x00007fffeeb841f7 in NeXus::File::getEntries() ()
   from /usr/lib/libNeXusCPP.so.0
#24 0x00007ffff78eb95d in Mantid::Kernel::NexusDescriptor::walkFile (
    this=0x7fffb7a57870, file=..., rootPath=..., className=..., pmap=..., 
    level=3)
    at /home/dan/mantid/Code/Mantid/Framework/Kernel/src/NexusDescriptor.cpp:242

comment:15 Changed 6 years ago by Federico M Pouzols

  • Status changed from assigned to inprogress
  • Owner changed from Anyone to Federico M Pouzols
  • Milestone changed from Backlog to Release 3.4

comment:16 Changed 6 years ago by Federico Montesino Pouzols

handle exceptions in input props when opening alg dialogs, re #5805

Changeset: 9fae7e89036114ed811e7a223696f579b2c01a6c

comment:17 Changed 6 years ago by Federico Montesino Pouzols

more informative exception messages when files badly broken, re #5805

Changeset: 1cf5483fdc3447e3477c84c1b7d38ea43c20829b

comment:18 Changed 6 years ago by Federico M Pouzols

  • Status changed from inprogress to verify
  • Resolution set to fixed
  • Tester Russell Taylor deleted

The first bad file (274) has some errors in the logs and that's reported gracefully by LoadNexusLogs. All is well here.

The second one (275) has more obscure errors, and as this can cause a nasty crash in the generic Load algorithm. Independently of whether this could be better handled in libNeXus (see comment above), I've added error handling for this kind of unexpectedly, badly broken file that makes the chosen loader throw a wild exception.

The current behavior for 275 is:

  • if you try load the broken file from Algorithms->Execute->Load or Workspaces->Load, it will be identified as a wrong input file, the file field will be marked with a red asterisk, and you should not get a crash. If you hover the red asterisk you should see an informative message.
  • if you try to load the broken file from File->Load->File, and error message will be printed in the results log.

Suggestion to test: try to make mantid crash with the two example files or similar nasty files.

comment:19 Changed 6 years ago by Lottie Greenwood

  • Status changed from verify to verifying
  • Tester set to Lottie Greenwood

comment:20 Changed 6 years ago by Lottie Greenwood

  • Status changed from verifying to reopened
  • Resolution fixed deleted

File 274 fails gracefully with informative error messages, but file 275 just crashed mantid, without any identifying it as a wrong input file, or giving an error message.

comment:21 Changed 6 years ago by Federico M Pouzols

Yes, 275 still makes Mantid crash really badly on windows. I'm trying to figure out if this can be handled properly in Mantid. Linux seems ok.

In view of this, and assuming that it could be fixed on windows, it might be a good idea to test this ticket also on a mac.

comment:22 Changed 6 years ago by Federico Montesino Pouzols

  • Status changed from reopened to inprogress

more verbose excepts, handle sigsegv (happens on windows), re #5805

Changeset: 3f00eb29183b6d40a886fab56dbc66bc88afdef2

comment:23 Changed 6 years ago by Federico M Pouzols

On windows libNeXus is segfaulting in NeXus::File::getNextEntry() -> NXgetnextentry

It should be possible to avoid this crash, as for example HDFView just doesn't open the entry that is making libNeXus crash.

The last changes that I've committed handle this sigsegv. That doesn't look like an ideal solution to me, but it's probably the only thing that can be done. This anyway raises the issue of what do we do in Mantid when our dependencies/libraries misbehave -memory violation, etc.? At present, Mantid would crash instead of trying to show the exception critical error message.

This ticket might remain open until the issue is fixed in libNeXus.

comment:24 Changed 5 years ago by Nick Draper

  • Milestone changed from Release 3.4 to Release 3.5

Moved to R3.5 at the R3.4 code freeze

comment:25 Changed 5 years ago by Stuart Campbell

This ticket has been transferred to github issue 6651

Note: See TracTickets for help on using tickets.