Saturday 1 December 2018

Lessons Learnt from Developing a C++ Library

Work in PROGRESS... I am working on a C++ library with a MATLAB interface, i.e., MEX files, for large-scale machine learning problems. In this blog post, I am sharing the lessons (I mean the situations I faced during development) that I have learnt from C++ development in the MEX context. I will keep updating the post as I find anything new or receive inputs from you. I will try to add code snippets and error messages as I get time. Any suggestions are welcome....


Lesson #1. To delete a derived object, we need a virtual base destructor
For an efficient library, efficient memory management is required. Every dynamic memory allocation needs a matching deallocation, otherwise there will be a memory leak, and for large-scale problems such leaks quickly add up.
In some situations, we need a base class pointer to hold objects of different derived classes. When such an object is deleted through the base pointer and the base class has no virtual destructor, the behaviour is undefined; typically the derived destructor is never run. We need the virtual destructor even when the base class is an abstract class. Example:

class Base {
public:
    // If this is commented out, deleting a Derived through a Base*
    // is undefined behaviour.
    virtual ~Base() {}
};
class Derived : public Base {};

int main() {
    Base* obj = new Derived();
    delete obj;  // runs ~Derived() and then ~Base()
    return 0;
}


Lesson #2. Delete for new and free() for malloc()
While managing memory dynamically, use delete to free memory allocated with new (and delete[] for new[]), and use free() for memory allocated with malloc() (the C-style functions). Mixing them is undefined behaviour and can crash the program, because the two mechanisms may manage memory differently, and delete runs destructors while free() does not.
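A minimal sketch of the matching pairs is given below. The function name paired_allocation_demo and the stored values are only illustrative and are not taken from the library.

```cpp
#include <cstdlib>   // std::malloc, std::free

// Hypothetical helper: allocates with each mechanism and releases the
// memory with the matching counterpart. Returns the sum of the stored
// values, or -1 if malloc fails.
int paired_allocation_demo() {
    int* single = new int(1);            // new    -> delete
    int* array  = new int[3]{2, 3, 4};   // new[]  -> delete[]
    int* raw    = static_cast<int*>(std::malloc(sizeof(int)));  // malloc -> free
    if (raw == nullptr) {                // malloc signals failure with a null pointer
        delete single;
        delete[] array;
        return -1;
    }
    *raw = 5;

    int sum = *single + array[0] + array[1] + array[2] + *raw;

    delete single;     // matches new
    delete[] array;    // matches new[]
    std::free(raw);    // matches malloc
    return sum;
}
```

Each pointer is released exactly once and with the counterpart of the call that created it.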

Lesson #3. Before returning midway from a function, be sure to free the memory
When you use dynamic memory allocation inside a function, be sure to deallocate the memory before leaving the function through any of its exit points. Generally, we allocate memory at the beginning of a function and free it at the end. But this leads to memory leaks if there are conditional exits. A function can have multiple exit points, i.e., several return statements inside conditions, so we should ensure that the memory is deallocated whichever exit path the program takes.
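One possible way to cover every exit path, assuming C++11 is available, is to let a std::unique_ptr own the buffer so that it is released automatically on each return. This swaps manual delete calls for a smart pointer, and the function scaled_sum below is a hypothetical example, not code from the library.

```cpp
#include <cstddef>
#include <memory>   // std::unique_ptr

// Hypothetical function with several exit points. The temporary buffer is
// owned by a std::unique_ptr, so it is freed on every return path.
double scaled_sum(const double* data, std::size_t n, double scale) {
    if (data == nullptr || n == 0) {
        return 0.0;                       // exit 1: nothing allocated yet
    }
    std::unique_ptr<double[]> buffer(new double[n]);
    for (std::size_t i = 0; i < n; ++i) {
        buffer[i] = data[i] * scale;
    }
    if (scale == 0.0) {
        return 0.0;                       // exit 2: buffer is freed here
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += buffer[i];
    }
    return sum;                           // exit 3: buffer is freed here too
}
```

With raw new/delete, the same function would need a delete[] before exit 2 and before exit 3, and it is easy to forget one of them.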

Lesson #4. Safe include
Error: Safe includes help to fix the 'multiple definitions' compile-time error. When a header file (say A) is included in multiple files (say B and C), and these files (B and C) are then included in another file (say D), the compiler sees the code in A twice, once through B and once through C, so D ends up with two copies of A.
Solution: Use safe includes (commonly called include guards) with A, i.e., wrap the contents of header file A in the preprocessor directives #ifndef, #define and #endif, as given below:
           #ifndef FILE_A
           #define FILE_A

           // ... here come the contents of header file A ...

           #endif

Lesson #5. Comment and print
In order to debug the code, especially with MEX files, one should use debugging tools like Visual Studio Code etc., but if you don't have any such tool then you can use the comment-and-print strategy. Suppose you have a huge amount of code that crashes and you are not sure where the crashing point is. First comment out the complete logic part and run the empty flow. Then add back a few lines of code and run again; if it works, keep adding a few lines at a time until you find the erroneous code. You can also print some variables along the way to help you track the execution.

Lesson #6. Verify loop sizes, indexes
This is one of the usual culprits when we copy-paste code or reuse existing code, because we forget to update the loop sizes and indexes for the destination code.
A segmentation fault (segfault) is one of the common run-time errors we see during development, and it occurs whenever there is an unauthorised memory access. One unfortunate thing about this error is that it does not necessarily occur at the time of the invalid access but can show up at a later stage, so it is difficult to locate. One possible reason for segfaults is an out-of-bounds array access, which happens when the array indexes or the loop sizes are not correct.
So whenever there is a segfault, first check the loop sizes and indexes, and every time you copy-paste code be sure to have a look at the indexes and loop sizes.
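While hunting for such an error, one option (if the data lives in a std::vector rather than a raw array) is to access elements with at(), which checks the index and throws std::out_of_range instead of silently reading past the end. The helper sum_first below is a hypothetical sketch of that idea.

```cpp
#include <cstddef>
#include <stdexcept>  // std::out_of_range
#include <vector>

// Hypothetical helper: sums src[0..count-1] into sum and returns true,
// or returns false if count runs past the end of the vector.
// at() checks the index; operator[] would not.
bool sum_first(const std::vector<int>& src, std::size_t count, int& sum) {
    sum = 0;
    try {
        for (std::size_t i = 0; i < count; ++i) {
            sum += src.at(i);
        }
    } catch (const std::out_of_range&) {
        return false;   // the loop size was larger than the array
    }
    return true;
}
```

The bounds check costs some time on every access, so it is more suited to debugging than to the innermost loops of the final code.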

Lesson #7. You are unlucky if your code runs the first time and shows results
This is not really about coding errors but about our learning. Every error in the code is an opportunity to learn something new and adds to your experience. It rarely happens that we see no errors at all, but when it does, we have missed an opportunity to learn something.

Lesson #8. Test new code/functionality separately and then add to library
Whenever you want to add a new idea/logic/code (probably a small piece of code) to a large code base, don't add it directly. Whenever possible, write the code separately, test it and then add it to the larger code, e.g., you can develop the logic in NetBeans (obviously for C++). This helps to reduce errors, especially logical ones. Remember, it is more difficult to locate an error in a larger code base than in a smaller one.

Lesson #9. Verify the allocation, deletion and initialisation of all the pointers
Once you complete any code/function/class, or whenever you see errors, verify that you have allocated, initialised and deallocated all the pointer variables. We generally miss one of these steps for some of the variables when we copy-paste or modify an existing method. This happens quite frequently when developing libraries, because methods share some of the logic, so code gets copied from one place to another.

Lesson #10. Name collision for counters of nested for loops
Whenever we have multiple nested loops spread over a large amount of code, we often repeat the loop counters, i.e., we use the same variable name for the counter of the inner loop as for the outer loop. The inner loop then overwrites (or hides) the outer counter, so the outer loop may stop too early, never stop, or work on the wrong index.
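The effect can be sketched with two small functions that only count how many times the inner body runs. Both function names are hypothetical, and the buggy version assumes the counter was declared once outside the loops, which is one common way this collision happens.

```cpp
// Buggy: the inner loop reuses the outer counter i.
// Note: if cols + 1 < rows this version never terminates.
int count_iterations_buggy(int rows, int cols) {
    int count = 0;
    int i;
    for (i = 0; i < rows; ++i) {
        for (i = 0; i < cols; ++i) {   // overwrites the outer i
            ++count;
        }
    }
    return count;
}

// Fixed: each loop declares its own, differently named counter.
int count_iterations_fixed(int rows, int cols) {
    int count = 0;
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            ++count;
        }
    }
    return count;
}
```

For rows = 3 and cols = 5 the fixed version runs the body 15 times, while the buggy one stops after 5, because the inner loop leaves i at 5 and the outer condition fails immediately.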

Lesson #11. If a function is called very frequently then do not do allocation and deallocation in that function
This is an interesting and very critical point. If a function is called frequently then do not do any memory allocation/deallocation inside it, because allocation and deallocation take time and, repeated on every call, make your algorithm slow. In such situations, you can do it once at a wider scope, e.g. in the class constructor/destructor.
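A minimal sketch of this idea, assuming the work happens in a class: the scratch memory is allocated once in the constructor and reused by every call. The class Worker and its method dot_scaled are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class: the scratch buffer is allocated once in the
// constructor and reused by every call to dot_scaled().
class Worker {
public:
    explicit Worker(std::size_t dim) : scratch_(dim, 0.0) {}

    // Called very frequently: performs no allocation or deallocation.
    // a and b must each point to at least dim values.
    double dot_scaled(const double* a, const double* b, double scale) {
        for (std::size_t i = 0; i < scratch_.size(); ++i) {
            scratch_[i] = a[i] * scale;     // reuse the preallocated memory
        }
        double result = 0.0;
        for (std::size_t i = 0; i < scratch_.size(); ++i) {
            result += scratch_[i] * b[i];
        }
        return result;
    }

private:
    std::vector<double> scratch_;  // released automatically by the destructor
};
```

The buffer lives as long as the Worker object, so the cost of allocating it is paid once rather than on each call.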

Lesson #12. If a function is called very frequently with a large number of parameters then reduce the number of parameters
When a function is called, its parameters are passed on the stack or in registers along with the return address, and this is repeated for every function call. So if a function takes a large number of parameters and is called many times, we should think of reducing the number of parameters of the function. This can save running time.
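One possible way to reduce the parameter count, shown here only as an assumption and not as what the library does, is to group related values in a struct and pass it by const reference. The struct StepParams, its fields and the function step_size are all hypothetical.

```cpp
// Hypothetical parameter bundle; all field names are illustrative.
struct StepParams {
    double learning_rate;
    double regulariser;
    int    batch_size;
};

// One reference is passed per call instead of three separate values.
double step_size(const StepParams& p, double gradient) {
    return p.learning_rate * (gradient + p.regulariser) / p.batch_size;
}
```

Whether this is measurably faster depends on the compiler and the platform, so it is worth timing both versions before committing to the change.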

Lesson #13. Before you think of beating other algorithms, verify that the algorithms implemented by you give the results reported earlier
This point is not really about C++ but about library development and implementing our ideas. In a paper, we compare our method with existing techniques. We are hardly ever able to do this in one go; rather we make mistakes, get wrong results and only then get the method right. So I suggest that you implement your method and the existing methods, run them on benchmark datasets and compare the outcome with the results reported in the literature. Only if the results are on par should you move on to comparing the methods with each other.

Lesson #14. If 'unsigned' is used where 'signed' is needed then the code occasionally goes down
When you have a huge amount of data but limited RAM, even the indexes cost a lot of memory. So to be more efficient, I generally use unsigned values for indexes, which give a higher positive range for the same memory, and I also use them for loop control. But this might put you in an infinite loop in certain situations; I was once stuck on this for at least 30 minutes.
When decremented below 0, an unsigned index can't go to -1; it wraps around to the largest unsigned value instead. So we need to use signed indexes in such situations. For example, I once got stuck with the following infinite loop:
    unsigned memory_size = 5;
    // i >= 0 is always true for an unsigned i, so this loop never ends
    for(unsigned i=memory_size; i>=0; i=i-1) {
        printf(" %u", i);
    }
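
If the counter has to stay unsigned, one common workaround is to test the counter before decrementing it, so it never needs to go below zero. The function count_down below is a hypothetical sketch; note that it visits size-1 down to 0, which is the usual range for array indexes, rather than starting at size itself.

```cpp
#include <vector>

// Hypothetical helper: returns the indexes visited when counting down
// from size-1 to 0 with an unsigned counter.
// "i-- > 0" compares the old value of i with 0 and then decrements it,
// so the body last runs with i == 0 and the loop then stops.
std::vector<unsigned> count_down(unsigned size) {
    std::vector<unsigned> visited;
    for (unsigned i = size; i-- > 0; ) {
        visited.push_back(i);
    }
    return visited;
}
```

The other option, as mentioned above, is simply to use a signed counter so that the condition i >= 0 can become false.
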
Lesson #15. Use Debuggers for MEX Files
We should use debuggers to locate errors, as we can in NetBeans. But when we are using MEX files, we might not have many options; to the best of my knowledge, MATLAB does not offer anything to debug C++ code. It is very unfortunate that MATLAB does not provide any error message for run-time errors in MEX files but crashes directly. Perhaps this is because MATLAB passes execution control to a different program for MEX files, so on a run-time error it crashes directly.
As a first step, one can try 'Lesson #5. Comment and print', but that's not a very good way to deal with errors. So we should look for debugging tools, and I want to suggest Visual Studio Code, which works on Mac as well as Windows. For a tutorial on VS Code, you can follow the link

Some suggestions:
1. For some tips on debugging MEX files, you can follow this document from Caltech: mex_debugging.
2. And this is a very nice post, a little complex but very detailed and helpful, which talks about header and base class issues: link.


Dedication: to my Gurus....