Across billions of faces shaped by thousands of different cultures and ethnicities, one thing remains universal: the way emotions are expressed. To take the next step in human-machine interaction, a machine must be able to interpret facial emotions. Micro-expressions are involuntary, transient facial expressions that can reveal genuine emotions. Allowing machines to recognize micro-expressions gives them deeper insight into a person’s true feelings at a given instant, which allows designers to create more empathetic machines that take human emotion into account when making decisions; e.g., such machines could potentially detect dangerous situations, alert caregivers to challenges, and provide appropriate responses. We propose to design and train a set of neural network (NN) models capable of micro-expression recognition in real-time applications. Different NN models are explored and compared in this study to design a hybrid deep learning model that combines a convolutional neural network (CNN), a recurrent neural network (RNN, e.g., long short-term memory [LSTM]), and a vision transformer. The CNN extracts spatial features (from a neighborhood within an image), whereas the LSTM summarizes temporal features. In addition, a transformer with an attention mechanism can capture sparse spatial relations within an image or between frames in a video clip. The inputs of the model are short facial videos, and the outputs are the micro-expressions recognized in those videos. The deep learning models are trained and tested on publicly available facial micro-expression datasets to recognize different micro-expressions (e.g., happiness, fear, anger, surprise, disgust, sadness). The results of our proposed models are compared with those of literature-reported methods tested on the same datasets. In these comparisons, the proposed hybrid models perform the best.
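To make the data flow of such a hybrid concrete, the following is a minimal, untrained toy sketch in plain Python, not the models evaluated in this study. Everything in it is an assumption made for illustration: two fixed 3x3 edge kernels stand in for learned CNN filters, unparameterized scaled dot-product self-attention across frames stands in for a transformer, a plain tanh recurrence (much simpler than an LSTM) stands in for the temporal summarizer, and the order in which the stages are chained, the names `spatial_features`, `self_attention`, `recurrent_summary`, and `classify_clip`, and the randomly initialized linear head are all hypothetical.

```python
import math
import random

EMOTIONS = ["happiness", "fear", "anger", "surprise", "disgust", "sadness"]

# Assumption: two fixed 3x3 edge kernels stand in for learned CNN filters.
KERNELS = [
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # responds to vertical edges
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # responds to horizontal edges
]


def spatial_features(frame):
    """Convolve one grayscale frame with each kernel, apply ReLU, average-pool."""
    h, w = len(frame), len(frame[0])
    feats = []
    for k in KERNELS:
        total = 0.0
        for i in range(h - 2):
            for j in range(w - 2):
                s = sum(k[a][b] * frame[i + a][j + b]
                        for a in range(3) for b in range(3))
                total += max(0.0, s)  # ReLU
        feats.append(total / ((h - 2) * (w - 2)))
    return feats


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) for c in range(d)])
    return out


def recurrent_summary(seq, decay=0.5):
    """Plain tanh recurrence (not an LSTM): h = tanh(decay * h + x)."""
    h = [0.0] * len(seq[0])
    for x in seq:
        h = [math.tanh(decay * hi + xi) for hi, xi in zip(h, x)]
    return h


def classify_clip(frames, seed=0):
    """Map a clip (list of 2D frames, each at least 3x3) to label probabilities."""
    per_frame = [spatial_features(f) for f in frames]   # spatial stage
    attended = self_attention(per_frame)                # relations between frames
    summary = recurrent_summary(attended)               # temporal summary
    rng = random.Random(seed)  # untrained, randomly initialized linear head
    logits = [sum(rng.uniform(-1, 1) * s for s in summary) for _ in EMOTIONS]
    return dict(zip(EMOTIONS, softmax(logits)))


# Usage: a synthetic 4-frame clip of 5x5 grayscale frames with values in [0, 1].
clip = [[[(i * j + t) % 4 / 3 for j in range(5)] for i in range(5)]
        for t in range(4)]
probs = classify_clip(clip)
```

Because the head is untrained, the probabilities in `probs` carry no meaning; the sketch only shows how a short clip can pass through a spatial stage, an attention stage, and a temporal stage before a per-emotion output is produced.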