Nowadays, security of the computer systems has become a major concern of security experts. In spite of many antivirus and malware detection systems, the number of malware incidents are increasing day by day. Many static and dynamic techniques have been proposed to detect the malware and classify them into malware families accurately. The dynamic malware detection has potential benefits over the static ones to detect malware effectively. Because, it is difficult to mask behavior of malware while executing than its underlying code in static malware detection. Recently, machine learning techniques have been the main focus of the security experts to detect malware and predict their families dynamically. But, to the best of our knowledge, there exists no comprehensive work that compares and evaluates a sufficient number of machine learning techniques for classifying malware and benign samples. In this work, we conducted a set of experiments to evaluate machine learning techniques for detecting malware and their classification into respective families dynamically. A set of real malware samples and benign programs have been received from VirusTotal, and executed in a controlled & isolated environment to record malware behavior for evaluation of machine learning techniques in terms of commonly used performance metrics. From the execution reports saved in the form of JSON reports, we extract a promising set of features representing behavior of a malware sample. The identified set of features is further employed to classify malware and benign samples. The Major motivation of this work is that different techniques have been designed to optimize different criteria. So, they behave differently, even in similar conditions. In addition to classification of malware and benign samples dynamically, we reveal guidelines for researchers to apply machine learning techniques for detecting malware dynamically, and directions for further research in the field.