Facial emotion recognition is the process of identifying human emotions from faces. It is one of the most widely applied concepts in computer vision and artificial intelligence. Emotion recognition from classroom video is particularly valuable for academic institutions and online teaching platforms: it helps ascertain the mood of students, which can in turn inform the design of lecture content and delivery mechanisms.
A facial emotion recognition pipeline for this setting involves three main components: face detection, face recognition, and facial emotion recognition.
Face detection is a critical first step in all facial analysis algorithms. Given an arbitrary image, the goal is to determine whether it contains any faces and, if so, to return the location and extent of each face. Although humans are extremely good at recognizing faces, it is very challenging for a machine to detect them under variations in pose, exposure, and resolution, changes in faces over time, and so on. Hence, face detection has been studied extensively in an effort to attain human-level performance. We adopted a state-of-the-art model from CMU, the tiny-faces model developed by Peiyun Hu et al.
This model is trained on a massive dataset of 3.9 million images to be robust to variations in scale and resolution, and it achieves an accuracy of about 81%.
A face recognition system verifies the identity of a person from a given digital image. Such systems are used extensively in applications like commercial verification, video surveillance, image indexing, and so on. The FaceNet model directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space is produced, tasks such as face recognition, verification, and clustering are easily implemented using standard techniques with FaceNet embeddings as feature vectors.
This model achieves around 99% accuracy on the LFW (Labeled Faces in the Wild) dataset.
Facial emotion recognition is the process of identifying human emotion from facial expressions. Humans do this with high accuracy, but for machines, understanding facial expressions and classifying the corresponding emotions is no easy task. To approach human-level performance, various computational methodologies have been developed, leveraging techniques from areas such as computer vision, machine learning, and AI.
One such model developed for facial emotion recognition, the mini-Xception model, achieves around 66% accuracy on the well-known FER2013 dataset. We chose this model to infer and classify the emotions of recognized faces.
The three components above are combined into an inference framework that predicts the facial emotions of students in a classroom video. We captured classroom video clips at fixed intervals and converted them into image frames. While observing the images of students in classrooms, however, we found that the students were not expressive enough to be tagged as happy or sad. So we tried tagging them as ‘Listening’ or ‘Distracted’ instead, and realized that the expressions people show while absorbing knowledge differ a lot from person to person. As a result, we faced challenges in tagging the data and, eventually, in training models on it. We therefore applied the state-of-the-art models described above to our classroom images to see how they perform.
The face detection algorithm returns bounding boxes for the detected faces, which we crop to obtain images of individual students.
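A minimal sketch of the cropping step is shown below. It assumes each bounding box is an (x1, y1, x2, y2) tuple in pixel coordinates and uses a nested list of pixel rows as a stand-in for a real image array; the name crop_faces and the box layout are hypothetical and may differ from what a given detector returns.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels


def crop_faces(image: Sequence[Sequence[int]],
               boxes: Sequence[Box]) -> List[List[List[int]]]:
    """Return one cropped sub-image per usable bounding box."""
    height = len(image)
    width = len(image[0]) if height else 0
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the image so faces at the frame edge still crop.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x1 >= x2 or y1 >= y2:
            continue  # skip empty or fully out-of-frame boxes
        crops.append([list(row[x1:x2]) for row in image[y1:y2]])
    return crops


# Toy 4x5 "image" whose pixel value encodes its (row, column) position.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
crops = crop_faces(image, [(1, 1, 3, 3), (3, 2, 10, 10), (7, 7, 9, 9)])
```

Clamping keeps partially visible faces, while boxes that fall entirely outside the frame are dropped rather than producing empty crops.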
The recognition algorithm takes the passport-size photograph provided by each student at admission, computes the corresponding facial embedding, and saves it as a baseline reference. Each cropped image produced by the detection algorithm, which shows a student in class, is sent through a forward pass to obtain the corresponding in-class embedding. The baseline embedding nearest to this in-class embedding in Euclidean distance determines the match, i.e., the student whose passport photo is closest is taken as the recognized student.
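The nearest-baseline lookup can be sketched as follows, assuming the embeddings have already been computed and are stored as plain lists of floats keyed by student ID. The function name recognize and the toy 3-dimensional vectors are illustrative only; real embeddings have many more dimensions.

```python
import math
from typing import Dict, Sequence, Tuple


def recognize(in_class: Sequence[float],
              baseline: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the student ID with the nearest baseline embedding, and the distance."""
    if not baseline:
        raise ValueError("baseline must contain at least one embedding")
    # Euclidean distance between the in-class embedding and each reference.
    best_id = min(baseline, key=lambda sid: math.dist(in_class, baseline[sid]))
    return best_id, math.dist(in_class, baseline[best_id])


# Hypothetical baseline built from passport photos (toy vectors).
baseline = {"s1": [0.0, 0.0, 1.0], "s2": [1.0, 0.0, 0.0]}
student_id, distance = recognize([0.9, 0.1, 0.0], baseline)
```

Returning the distance alongside the ID leaves room for rejecting matches that are too far from every baseline, should faces of people outside the enrolled class appear in a frame.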
The cropped images produced by the face detection algorithm are then fed, one after another, into the emotion recognition module, a deep CNN that predicts the emotion shown in each face.
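As an illustration of the final classification step, the sketch below turns one face's vector of class scores into an emotion label. It assumes the network outputs one score per FER2013 class; the label order in EMOTIONS is an assumption that has to match the encoding of whichever trained model is used, and top_emotion is a hypothetical helper name.

```python
from typing import Sequence, Tuple

# Seven FER2013 classes. The index order is an assumption and must match
# the label encoding of the trained model actually in use.
EMOTIONS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")


def top_emotion(scores: Sequence[float],
                labels: Sequence[str] = EMOTIONS) -> Tuple[str, float]:
    """Pick the highest-scoring label from one face's class scores."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    index = max(range(len(scores)), key=scores.__getitem__)
    return labels[index], scores[index]


# Made-up scores for a single cropped face.
label, score = top_emotion([0.05, 0.0, 0.05, 0.6, 0.1, 0.1, 0.1])
```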
Image frames are sampled from a given video at a selected frame rate. Each frame is then passed through the face detection, face recognition, and emotion recognition modules. The aggregated mood for each frame is calculated as the mode of the emotions of the students in that frame, and an overall mood timeline is generated from these per-frame values.
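The sampling and aggregation logic can be sketched as below. The per-frame emotion lists are assumed to come from the three modules described above; the function names, and the choice to report None for a frame with no detected faces, are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def sample_frame_indices(total_frames: int, video_fps: float,
                         samples_per_second: float) -> List[int]:
    """Indices of the frames to keep when sampling below the video frame rate."""
    if video_fps <= 0 or samples_per_second <= 0:
        raise ValueError("frame rates must be positive")
    step = max(1, round(video_fps / samples_per_second))
    return list(range(0, total_frames, step))


def frame_mood(emotions: List[str]) -> Optional[str]:
    """Mode of the student emotions in one frame, or None if no faces were found."""
    if not emotions:
        return None
    return Counter(emotions).most_common(1)[0][0]


def mood_timeline(per_frame: Dict[int, List[str]]) -> Dict[int, Optional[str]]:
    """Aggregated mood per sampled frame, ordered by frame index."""
    return {idx: frame_mood(per_frame[idx]) for idx in sorted(per_frame)}


# A 3-second clip at 30 fps, sampled once per second, with made-up predictions.
indices = sample_frame_indices(total_frames=90, video_fps=30.0, samples_per_second=1.0)
timeline = mood_timeline({30: ["happy", "neutral", "neutral"], 0: ["sad"], 60: []})
```

Because the mode is taken over categorical labels, a frame in which two emotions are equally frequent needs a tie-breaking rule; this sketch simply relies on the ordering that Counter provides.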
On visual inspection, the results appear convincing for such a modest attempt in terms of both time and resources.
In addition to the overall mood timeline for the entire class, personalized student diagnostic reports can be developed to show the average mood of each student across the course timeline. The tagging scheme also needs to be refined to improve the emotion recognition module.
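One possible shape for such a per-student report is sketched below. Since emotions are categorical, the sketch interprets "average mood" as the share of sampled frames a student spends in each emotion; that interpretation, the record layout, and the name student_reports are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, str]  # assumed layout: (frame_index, student_id, emotion)


def student_reports(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Share of sampled frames each student spent in each emotion."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for _frame_index, student_id, emotion in records:
        counts[student_id][emotion] += 1
    reports = {}
    for student_id, counter in counts.items():
        total = sum(counter.values())
        reports[student_id] = {emotion: n / total for emotion, n in counter.items()}
    return reports


# Made-up recognition and emotion results for two students.
records = [
    (0, "s1", "happy"), (30, "s1", "neutral"), (60, "s1", "neutral"),
    (0, "s2", "sad"),
]
reports = student_reports(records)
```

A distribution of this kind keeps more information than a single dominant label, which may matter when a student's expressions are mostly neutral.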
The resolution and exposure of the images and videos are poor because the data was captured with mobile phones. While tagging the data, we also found that listening expressions are very different from the universal facial expressions of emotion, which made tagging subjective and difficult.