Alphabet Hand Gestures Recognition Using Media Pipe - OptiSol


MediaPipe is a cross-platform (Android, iOS, web) framework used to build machine learning pipelines for audio, video, time-series data, etc. MediaPipe is used by many internal Google products and teams, including Nest, Gmail, Lens, Maps, Android Auto, Photos, Google Home, and YouTube.

From wired to wireless, from keyboards to touch screens, from offline to online, we have come a long way. The way we communicate with computing devices has changed drastically with face recognition, speech recognition, touch screens, and more, all thanks to rapid developments in technology. Today, we use many AI and machine learning technologies in our daily lives.

Likewise, hand gestures are another way to communicate with computers. Gesture recognition can be applied in fields such as augmented reality, assistive technology for people with disabilities, PlayStation games, car dashboards, and smart TVs, many of which can now be operated with gestures.


MediaPipe is a framework used to build machine learning pipelines. It offers many ready-made solutions, such as Face Detection, Hands, Object Detection, Holistic, Face Mesh, and Pose.

MediaPipe Hands is a high-fidelity hand and finger tracking solution based on machine learning. It detects 21 landmark points on a hand in a single frame, as shown in the figure below, with the help of multiple models working together.

Hand Coordinates

MediaPipe Hands consists of two models working together: a Palm Detection Model, which scans the full image and draws a bounding box around the hand, and a Hand Landmark Model, which operates on the cropped region produced by the palm detector and returns high-fidelity 2D hand keypoint coordinates (as shown in the figure above).

The following code snippet defines the parameters to be set in the MediaPipe Hands model.
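As a minimal sketch of such a configuration, the block below collects the four constructor arguments that `mp.solutions.hands.Hands` accepts in mediapipe 0.8.1. The values shown and the helper name `create_hands` are assumptions for illustration; the article does not state the values it used.

```python
# Hypothetical parameter values; the article does not state the ones it used.
HANDS_PARAMS = {
    "static_image_mode": False,       # treat frames as a video stream so tracking is reused
    "max_num_hands": 1,               # assume one signing hand in view
    "min_detection_confidence": 0.7,  # threshold for the palm detection model
    "min_tracking_confidence": 0.5,   # threshold for the hand landmark model
}


def create_hands(params=HANDS_PARAMS):
    """Build a MediaPipe Hands instance from the parameter dictionary."""
    # Imported lazily so the dictionary can be inspected without mediapipe installed.
    import mediapipe as mp

    return mp.solutions.hands.Hands(**params)
```

Lowering `min_detection_confidence` tends to pick up hands more readily at the cost of more false detections, so it is worth tuning against your own webcam and lighting.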

Code Requirements

  • mediapipe 0.8.1
  • OpenCV 3.4.2 or later
  • TensorFlow 2.3.0 or later
  • tf-nightly or later
  • scikit-learn 0.23.2 or later
  • matplotlib 3.3.2 or later

Let’s jump into building the model.

The goal is to recognize all 26 letters of the alphabet from hand gestures captured through a webcam.

Data Description

We trained the model on all 26 letters of the alphabet using the corresponding hand gestures shown in the figure. For each gesture label, around 500 to 1000 samples were recorded with the webcam for training.

Model Workflow

First, the Palm Detection Model locates the palm in each webcam frame and draws a bounding box around the hand.

Next, the Hand Landmark Model locates the 21 2D hand keypoints inside that box.

Then, these hand landmarks are preprocessed and sent to the keypoint classifier model, which classifies the hand gesture.
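The preprocessing step can be sketched as follows, assuming it follows the four stages this article lists under Data Collection (landmark coordinates, relative conversion, flattening, normalization). The function names are hypothetical, and two details are assumptions rather than facts from the article: the wrist (landmark 0 in MediaPipe Hands) is used as the origin, and values are scaled by the largest absolute coordinate.

```python
import itertools


def to_pixel_coords(normalized_landmarks, image_width, image_height):
    """Stage 1: convert normalized (x, y) pairs in [0, 1] to pixel coordinates."""
    return [
        [
            min(int(x * image_width), image_width - 1),
            min(int(y * image_height), image_height - 1),
        ]
        for x, y in normalized_landmarks
    ]


def preprocess_landmarks(pixel_landmarks):
    """Stages 2-4: make coordinates relative, flatten them, and scale to [-1, 1]."""
    # Assumption: the first landmark (the wrist) is the reference point.
    base_x, base_y = pixel_landmarks[0]
    relative = [[x - base_x, y - base_y] for x, y in pixel_landmarks]

    # Flatten [[x0, y0], [x1, y1], ...] into [x0, y0, x1, y1, ...].
    flat = list(itertools.chain.from_iterable(relative))

    # Divide by the largest magnitude; fall back to 1 if every value is zero.
    max_abs = max(map(abs, flat)) or 1
    return [value / max_abs for value in flat]


# Three landmarks are used here for brevity; a real hand has 21.
pixels = to_pixel_coords([(0.5, 0.5), (0.75, 0.25), (0.25, 0.75)], 640, 480)
features = preprocess_landmarks(pixels)
```

Making the coordinates relative and scaling them means the classifier sees the hand shape rather than where the hand happens to be in the frame or how close it is to the camera.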


Hand Gestures Recognition Training

Data Collection

To run the project model, use the following command: python

  1. To collect data, press the “k” key while the script is running. This switches to listening mode, as displayed in the figure below.
  2. Then, by pressing a key from 0 to 9, we record samples for the corresponding hand gesture label.
  3. After the coordinates are found, the keypoint classifier input goes through four preprocessing steps: extracting landmark coordinates, converting them to relative coordinates, flattening them to a 1-D array, and normalizing the values.
  4. The key points are then appended to “model/keypoint_classifier/keypoint.csv” as shown below.

Sample Keypoint Classifier

5. Dataset columns: the 1st column holds the pressed number (used as the class ID); the 2nd and subsequent columns hold the keypoint coordinates.

6. After these steps, up to 10 labels can be created per recording session by pressing keys 0 to 9.

7. Since we are training 26 letters, we need to record and save all 26 labels. We do this in three sessions of 10, 10, and 6 labels respectively.

8. Finally, we merge these separate files into a single file.
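One way the merge could be done is sketched below. It rests on an assumption the article does not spell out: each recording session reuses class IDs 0 to 9, so the IDs of later sessions have to be shifted (by 10, then 20) to give 26 distinct classes. The function names and the letter mapping are hypothetical.

```python
import csv


def offset_class_ids(sessions, ids_per_session=10):
    """Combine per-session rows, shifting class IDs so each letter gets a unique ID."""
    merged = []
    for session_index, rows in enumerate(sessions):
        offset = session_index * ids_per_session
        for row in rows:
            if not row:
                continue  # skip blank lines
            # Column 0 is the class ID; the rest are keypoint coordinates.
            merged.append([int(row[0]) + offset] + list(row[1:]))
    return merged


def merge_session_files(session_paths, output_path, ids_per_session=10):
    """Read each session CSV, apply the ID offsets, and write one combined CSV."""
    sessions = []
    for path in session_paths:
        with open(path, newline="") as in_file:
            sessions.append(list(csv.reader(in_file)))
    merged = offset_class_ids(sessions, ids_per_session)
    with open(output_path, "w", newline="") as out_file:
        csv.writer(out_file).writerows(merged)
    return len(merged)


def class_id_to_letter(class_id):
    """Map class ID 0..25 to 'A'..'Z' (assumes letters were recorded in order)."""
    return chr(ord("A") + class_id)
```

If the sessions were instead recorded with already distinct IDs, passing `ids_per_session=0` leaves the class column unchanged and the function reduces to plain concatenation.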

Model training

The model structure for training on the key points can be found in “alpha_train.ipynb”. Open it in Jupyter Notebook and execute the cells from top to bottom. We used 75% of the data for training and the remaining 25% for testing.
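The 75/25 split can be illustrated with a small standard-library sketch. This is not the notebook's code: the function name and the seed are hypothetical, and scikit-learn's `train_test_split` with `train_size=0.75` would do the same job in practice.

```python
import random


def split_dataset(rows, train_fraction=0.75, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = split_dataset(range(100))
```

Note that this simple shuffle is not stratified, so with few samples per letter the class balance of the two parts can differ slightly.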

The image of the model prepared in “alpha_train.ipynb” is as follows.

Model Architecture

Model Initiation

The TensorFlow model trained with the architecture above (fig. Model Architecture) is saved as an HDF5 file and then converted to a TensorFlow Lite model. This TensorFlow Lite model, which stores the model architecture and weights, is used to classify hand gestures when the keypoint classifier function is called.

After training, we achieved a training accuracy of 80.33% with a loss of 0.5323, and a validation accuracy of 97.9% with a loss of 0.1646.
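A minimal sketch of the HDF5 to TensorFlow Lite conversion is shown below, using the standard `tf.lite.TFLiteConverter` API. The function name and both file paths are hypothetical, and enabling the default optimization is an assumption rather than something the article states.

```python
def convert_hdf5_to_tflite(hdf5_path, tflite_path):
    """Load a Keras model saved as HDF5 and write it out as a TensorFlow Lite file."""
    # Imported lazily: TensorFlow is only needed when the conversion actually runs.
    import tensorflow as tf

    model = tf.keras.models.load_model(hdf5_path)
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Assumption: apply default post-training optimization to shrink the model.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()  # serialized model as bytes

    with open(tflite_path, "wb") as out_file:
        out_file.write(tflite_model)
    return len(tflite_model)
```

A call such as `convert_hdf5_to_tflite("keypoint_classifier.hdf5", "keypoint_classifier.tflite")` (placeholder file names) would return the size of the written model in bytes.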

Model Testing

We use for testing the model. From the classification report, it is observed that 98% accuracy is achieved for the test dataset.

i. Confusion Matrix

Confusion Matrix

ii. Classification Report

Classification Report
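To make the two figures above easier to read, here is a small sketch of how a confusion matrix, overall accuracy, and per-class precision and recall are derived from true and predicted labels. It uses only the standard library and toy labels. The function names are hypothetical, and scikit-learn's `confusion_matrix` and `classification_report` produce the same quantities.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_report(matrix):
    """Return {class_id: (precision, recall)} computed from the confusion matrix."""
    report = {}
    for c in range(len(matrix)):
        true_positives = matrix[c][c]
        predicted_as_c = sum(row[c] for row in matrix)  # column sum
        actually_c = sum(matrix[c])                     # row sum
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        report[c] = (precision, recall)
    return report


# Toy labels for three classes, not the article's data.
toy_matrix = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
toy_report = per_class_report(toy_matrix)
```

Off-diagonal cells show which letters are confused with each other, which is often more useful than the single accuracy figure when gestures look alike.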


Hand gesture recognition is an important part of Human-Computer Interaction (HCI). Its importance lies in building efficient human-machine translation. Its applications range from sign language recognition through medical rehabilitation to virtual reality. Gesture recognition systems thus promise wide-ranging applications in fields from photojournalism through medical technology to biometrics. MediaPipe Hands plays an effective role in designing this real-time alphabet hand gesture recognition system. We would like to extend our system further by building a dynamic hand gesture recognition system.


