Multithreading

I am trying to use the new version (6.3.0) in a multithreading environment. Essentially, I have to intersect a series of lines with a surface, and to speed up the computation I would like to use two threads (on a dual-core processor). Unfortunately, the class IntCurvesFace_Intersector does not seem to be thread-safe. For example, there are some static variables in the file Intf_InterferencePolygonPolyhedron.gxx.
Has anybody faced a similar problem?

Roman Lygin's picture

OCC has initial preparations for multi-threading, introduced mainly in 6.3. However, there are still a lot of static variables that make it thread-unsafe "as is", i.e. unless you protect it properly yourself.
I have just run into this with the *_Type_() functions (e.g. Geom_Surface_Type_()) located in the private *.ixx files for Handle types. They contain static variables, and simultaneous unprotected access from several threads leads to a crash later on.

However, the memory manager itself is protected and seems thread-safe, so if you carefully separate your threads and prevent them from using the same data, you may well achieve parallelism. Don't forget to set the MMGT_REENTRANT environment variable or call Standard::SetReentrant().

Good luck

Bearloga's picture

Dear Roman,
I do not agree that "simultaneous unprotected access from threads" to *_Type_() functions "lead to further crash". These functions return handled objects by reference rather than by value, and their values are used for read only access. I do believe that reading the same memory from different threads simultaneously is safe.
Regards

gianni's picture

The functions *_Type_() are not clear to me.
However, at the moment I do not have any crash.
I solved a few problems by declaring some static variables as __declspec(thread). I also set MMGT_OPT=0 just to avoid memory manager issues.
Unfortunately, when I use two threads I obtain wrong results.
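
For readers unfamiliar with the technique: __declspec(thread) is the MSVC-specific way to give each thread its own copy of a static variable. Below is a minimal sketch using the standard C++11 equivalent, thread_local. The names next_call_index and run_workers are hypothetical and do not come from OCC.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: returns how many times the calling thread has
// called it. The counter is thread_local, so every thread gets its own
// copy and no locking is needed.
int next_call_index()
{
    thread_local int calls = 0;   // one instance per thread, not per process
    return ++calls;
}

// Runs `threads` workers that each call next_call_index()
// `calls_per_thread` times and returns the last value each worker saw.
std::vector<int> run_workers(int threads, int calls_per_thread)
{
    assert(threads > 0 && calls_per_thread > 0);
    std::vector<int> last(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&last, t, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i)
                last[t] = next_call_index();   // each thread writes its own slot
        });
    }
    for (auto& th : pool) th.join();
    return last;
}
```

Note that this only helps when the static variable is per-call scratch state. If the variable is meant to be shared between threads, making it thread-local changes the behaviour and it needs a lock instead.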

Roman Lygin's picture

Hi,

The issue appears when two or more threads simultaneously enter the _Type() function and begin to initialize its static variables for the first time. Note that the Handle() methods are not thread-safe. As a result, it seems, the _aType variable ends up in a bad state.
I am now working on rewriting the CDL extractor to generate a different function body but have not finished yet (it pulls in several other changes that have to be made). Will share results when available.
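
To illustrate the kind of race described above and one common way to guard first-time initialization, here is a minimal sketch based on std::call_once. It is not the code generated by the CDL extractor; TypeInfo, surface_type and g_init_count are hypothetical names used only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a type descriptor; not the OCC Standard_Type.
struct TypeInfo {
    const char* name;
};

std::atomic<int> g_init_count{0};   // counts how often the initializer ran

// Returns the single shared descriptor. std::call_once guarantees that the
// initializer runs exactly once, even if several threads arrive together.
const TypeInfo& surface_type()
{
    static std::once_flag flag;
    static TypeInfo* instance = nullptr;
    std::call_once(flag, [] {
        instance = new TypeInfo{"Geom_Surface"};   // intentionally never freed
        ++g_init_count;
    });
    return *instance;
}

// Calls surface_type() from `threads` threads at once and checks that they
// all received the same object.
bool all_threads_see_same_instance(int threads)
{
    assert(threads > 0);
    std::vector<const TypeInfo*> seen(threads, nullptr);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&seen, t] { seen[t] = &surface_type(); });
    for (auto& th : pool) th.join();
    for (const TypeInfo* p : seen)
        if (p != seen[0]) return false;
    return true;
}
```

Since C++11 the initialization of a function-local static is itself required to be thread-safe, but older compilers did not guarantee this, so an explicit guard of this kind (or a mutex) was needed.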
Good luck

gianni's picture

This would be great.
Thank you very much.

Roman Lygin's picture

OK, so here it is after several weeks of programming in my spare time.

I chose the import of IGES groups (type 402) as a test scenario, translating them in parallel.
I implemented an infrastructure to process loops. It creates a set of pipelines running in parallel, each processing a subrange of the loop. The idea is similar to Intel Threading Building Blocks (see its parallel_for). The infrastructure is managed by a global ThreadScheduler (a singleton, i.e. one global instance). The number of pipelines created equals the number of logical CPUs minus 1 (one is reserved for the master thread), so on a single-processor system no pipelines are created and everything is done within the master thread. The method OSD_ThreadScheduler::PerformParallelLoop() checks how many pipelines are currently idle and dispatches the loop among them (in the best case it splits the loop into n parts, in the worst case into 1, i.e. everything runs within the master thread). IGES groups can be recursive, and so can the translation.
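
The splitting idea can be sketched roughly as follows. This is not the OSD_ThreadScheduler code from the patch: parallel_loop and default_worker_count are hypothetical names, and for simplicity the sketch starts fresh threads on every call instead of keeping persistent pipelines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Number of worker threads: logical CPUs minus one reserved for the master.
unsigned default_worker_count()
{
    const unsigned cpus = std::thread::hardware_concurrency();  // 0 if unknown
    return cpus > 1 ? cpus - 1 : 0;
}

// Splits [first, last) into at most `workers + 1` contiguous subranges.
// The calling (master) thread processes the first subrange itself and each
// remaining subrange gets its own thread.
void parallel_loop(std::size_t first, std::size_t last, unsigned workers,
                   const std::function<void(std::size_t, std::size_t)>& body)
{
    assert(first <= last);
    const std::size_t n = last - first;
    if (n == 0) return;
    const std::size_t parts = std::min<std::size_t>(workers + 1, n);
    const std::size_t chunk = (n + parts - 1) / parts;   // ceiling division

    std::vector<std::thread> pool;
    for (std::size_t p = 1; p < parts; ++p) {
        const std::size_t b = first + p * chunk;
        if (b >= last) break;                    // nothing left to hand out
        const std::size_t e = std::min(last, b + chunk);
        pool.emplace_back(body, b, e);
    }
    body(first, std::min(last, first + chunk));  // master thread's share
    for (auto& th : pool) th.join();
}
```

With workers equal to 0 the whole range is processed by the calling thread, which matches the single-processor case described above.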

OSD_ThreadScheduler::PerformParallelLoop() is my reinvention of the bicycle; it could be written on top of Intel TBB's parallel_for() instead. But that would make OCC depend on TBB, and the OCC team may not want that. So one extension would be to make this method virtual and re-implement it in your own descendant class.

While writing the infrastructure, I had to write a few synchronization objects, OSD_WaitCondition and OSD_InverseSemaphore, which are used by OSD_ThreadScheduler and internally by OSD_Pipeline.
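
The post does not show these classes. As an illustration only, and assuming that an "inverse semaphore" is an object whose waiters are released when its counter drops to zero (the opposite of a regular semaphore), such an object might be sketched with a mutex and a condition variable. The class and function names below are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical class; not the OSD_InverseSemaphore from the patch.
class InverseSemaphore {
public:
    // Registers one more unit of outstanding work.
    void acquire()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        ++count_;
    }

    // Marks one unit of work as finished; wakes waiters when none remain.
    void release()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(count_ > 0);
        if (--count_ == 0)
            zero_.notify_all();
    }

    // Blocks until the count is zero.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        zero_.wait(lock, [this] { return count_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable zero_;
    int count_ = 0;
};

// Starts `tasks` threads, each adding its index + 1 to a shared sum, and
// uses the inverse semaphore to wait until all of them have finished.
int sum_with_workers(int tasks)
{
    InverseSemaphore pending;
    std::mutex sum_mutex;
    int sum = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < tasks; ++t) {
        pending.acquire();                 // count the task before it starts
        pool.emplace_back([&, t] {
            { std::lock_guard<std::mutex> lock(sum_mutex); sum += t + 1; }
            pending.release();
        });
    }
    pending.wait();                        // returns once every task released
    int result;
    { std::lock_guard<std::mutex> lock(sum_mutex); result = sum; }
    for (auto& th : pool) th.join();
    return result;
}
```

A scheduler could use such an object to let the master thread sleep until every dispatched subrange has been processed.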

Once the very first draft was ready, I started running it on a multi-core system (4 cores) and multi-threading issues started to appear. First, the *_Type() function (that is, the STANDARD_TYPE macro) turned out not to be thread-safe, and this forced me to rewrite the CDL extractor (the one used by WOK to generate .hxx files and drv/*/* files). At the same time, in an attempt to remove code duplication, I updated the CDL extractor to generate code using the DEFINE_/IMPLEMENT_STANDARD_HANDLE etc. macros. So hxx and drv/*.cxx are now much more concise. Some refactoring was also done in Standard_DefineHandle.hxx.

Of course, IGESToBRep_CurveAndSurface was modified to use this new infrastructure and protect access to its Transfer_TransferProcess field with mutexes.

Once this was done, I had a good basis for testing. Measurements showed almost no performance increase on the translation part, so I used Intel Thread Profiler to understand why. Its timeline showed two major things:
- the timeline of running threads was “chess”-like (i.e. while 1 of 4 threads was running three others were waiting)
- there were gaps in the timeline when no threads were running
The latter was the most puzzling, so I dug in and found that it is caused by Standard_Mutex::Lock(), which is TryEnterCriticalSection() + Sleep(). Wow! Why is that? The Open CASCADE folks state in a comment that it is due to a performance decrease on single-core processors. Maybe; I did not re-check. But I replaced it with a normal EnterCriticalSection(). The second issue disappeared and the threads became more concurrent (they overlapped more on the timeline).
Then I dug into the first issue and found that it was due to high contention in Standard::Allocate() and Free(). That is, threads compete for a single mutex while creating and destroying the many objects that appear and die during IGES translation. Well, this seems to be an unavoidable price for using the OCC memory manager. But if you want to use system calls directly, set the environment variable MMGT_OPT=0 (read the Foundation Classes User’s Guide on the memory manager for details). With MMGT_OPT=0 all the threads became almost concurrent! Hooray!
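
The difference between the two locking strategies in Standard_Mutex::Lock() can be sketched in portable C++. This is only an analogy built on std::mutex, not the Win32 code from OCC; lock_by_polling, lock_by_blocking and count_with are hypothetical names, and the 1 ms sleep interval is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Polling acquire, analogous to TryEnterCriticalSection() + Sleep():
// a thread that fails to get the lock sleeps even if the lock is freed
// a moment later, which can leave gaps where no thread runs.
void lock_by_polling(std::mutex& m)
{
    while (!m.try_lock())
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Blocking acquire, analogous to EnterCriticalSection(): the call returns
// as soon as the lock can be taken, without a fixed sleep interval.
void lock_by_blocking(std::mutex& m)
{
    m.lock();
}

// Increments a shared counter from several threads using the given
// acquire function and returns the final value.
long count_with(void (*acquire)(std::mutex&), int threads, int iterations)
{
    assert(threads > 0 && iterations > 0);
    std::mutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, iterations] {
            for (int i = 0; i < iterations; ++i) {
                acquire(m);
                ++counter;      // protected by m
                m.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Both variants produce the same count; they differ only in how a waiting thread spends its time, which is what showed up as gaps in the profiler timeline.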

There were several other side modifications (e.g. fighting windows.h spoiling the code, ugly “optimizations” in Transfer_TransferProcess that made it totally thread-unsafe and non-reentrant, static variables in Geom and BSpl*Lib, etc.) that are not worth explaining in detail.

The results (measured on code compiled with /Od flag):
- with MMGT_OPT=1 the speed-up was 16%-19% on the translation part (tested on a couple of IGES files: one with multiple nested groups and one just with many objects). Note that concurrency is low due to the OCC memory manager.
- with MMGT_OPT=0: 56% (vs MMGT_OPT=1 in OCC).
Not bad for a first experiment, I would say ;-). This suggests that OCC can be sped up in some places (maybe visualization, shape healing, etc).
The toughest part will likely be data sharing (e.g. Transfer_TransferProcess, ShapeBuild_ReShape, something in meshing, etc). But this is a common problem when making apps parallel.

In conclusion, I would like to acknowledge the great assistance of the OCC team, who provided leading clues, advice and suggestions. The code comments (in the places related to the kernel stuff: mutex, memory manager) were also very informative and helped.

I sent the fixes to the OCC team for review and possible inclusion in future releases (maybe except the IGES part itself, which can be improved). I can send them to anyone else who is interested as well; just send me an email. Note, however, that due to the CDL extractor modifications you will have to regenerate the drv/ and inc/ files yourself. Otherwise I can upload the generated files somewhere, if someone can tell me how.

Good luck in your own experiments !

Roman

gianni's picture

Dear Roman, thank you for your reply.
I did some experiments too. As I said in my first message, my interest was to use the class IntCurvesFace_Intersector in parallel. I managed to get something that does not crash, but unfortunately I randomly get wrong results (of course everything works fine with only one thread).
I suspect that somewhere there is a static variable that is shared among the threads, but I have no idea how to find it.
Concerning the TryEnterCriticalSection() + Sleep() issue, I changed it to EnterCriticalSection(), but I also replaced InitializeCriticalSection() in the constructor of Standard_Mutex with InitializeCriticalSectionAndSpinCount( &myMutex, 4000 ), which seems to work much better.

I am very interested in your code and your experiments. Is it difficult to use WOK to generate the drv/ and inc/ directories?

My e-mail is lugica2001 at gmail dot com

Thank you
Gianni

Roman Lygin's picture

Hi Gianni,

Yes, I also suggested that the OCC team consider InitializeCriticalSectionAndSpinCount(). Perhaps the Standard_Mutex constructor can be extended to accept an integer parameter and call Init...SpinCount with it. 4000, if I recall correctly, comes from MSDN as the value used internally for heap management, so it should be very reasonable.

Yes, your random results surely indicate that something is thread-unsafe. It can be a "static" or something else. For instance, Transfer_TransferProcess stored the last element found and its index in the map after the last call to the Find...() methods, and other methods relied on them. So obviously this broke thread safety.
You can try Intel Thread Checker to diagnose threading errors. Intel will soon be unveiling its new Parallel Studio, which will include Parallel Inspector, and that will help with this as well.

Regarding WOK: you can try the documentation supplied with OCC to get it running. To generate code from CDL (and not compile it, which you can do in MSVS) you just need to launch WOK from the Start menu. That should initialize an environment for you. Then you will have to copy the *.edl files from my patch to %WOKHOME%/lib.
Then you will have to create a factory, workshop and workbench, put the OCC sources + my patch into it and then invoke
wprocess -DGroups=Src,Xcpp

This will generate files for you, which you will have to copy manually to your OCC installation. You will also have to edit the TKernel project manually to add the new OSD C++ files.
Will send you a patch in a few minutes.