Speech Analytics That Swiggy Could Use

Recently I ordered lunch online from a nearby restaurant on Swiggy, and the food was not delivered on time. Because of the delay, I got a call from Swiggy's support office; they explained the reason for the late delivery and apologized for the inconvenience. At the end of the call, I received a feedback form asking whether I was satisfied with their service. I was in a hurry at the time and forgot all about the form.

In this case, Swiggy sends the feedback form to measure the agent's performance, how well they respond to customer queries, and to analyze customer feedback. But customers do not always fill out the form after the call: most either hesitate to do so or simply forget about it. That leaves Swiggy completely dependent on the feedback form for its analysis of the call. What if we could analyze the calls directly to track agent performance, instead of relying on a form that most customers skip? These vast collections of audio offer unique opportunities for improving customer service.

The objective is to classify calls as satisfied or not satisfied based on the customer's conversation with the agent.

The other major aspects that can be analyzed from the calls are,

  1. Customer voice or tone analytics
  2. Agent performance tracking
  3. Measuring customer satisfaction
  4. Identifying opportunities for cross-selling and up-selling

The Data:

The data used for this project was obtained from a call center operated by a private company, so it cannot be publicly disclosed. Audio datasets of call center recordings are hard to find, as most are privately owned and subject to various privacy laws. My dataset consists of 322 calls: 162 labeled as calls in which the customer was satisfied and 160 labeled as calls in which the customer was not satisfied.

Customer voice or tone analytics:

Librosa is a Python package for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems.

Here we classify calls as satisfied or not satisfied from the tone of the customer's voice. First, the mp3 audio files are converted to wav. Next, important features such as MFCC, chroma, and mel spectrograms are extracted from the audio files using the Librosa library and passed on for data loading and segmentation. In the data loading step, the dataset is split into train and test sets, which are fed into a Multi-Layer Perceptron architecture for model building. Finally, predictions are made and the performance is checked using K-Fold Cross Validation and a Confusion Matrix.
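
The mp3-to-wav conversion step can be done in several ways; below is a minimal sketch using pydub (which needs ffmpeg installed). The folder names are hypothetical, not the ones used in this project.

import os
from pydub import AudioSegment  # assumes ffmpeg is available on the system

SRC_DIR = "calls"        # hypothetical folder holding the original mp3 recordings
DST_DIR = "calls_wav"    # output folder for the converted wav files
os.makedirs(DST_DIR, exist_ok=True)

for fname in os.listdir(SRC_DIR):
    if fname.endswith(".mp3"):
        audio = AudioSegment.from_mp3(os.path.join(SRC_DIR, fname))
        audio.export(os.path.join(DST_DIR, fname.replace(".mp3", ".wav")), format="wav")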

[Figure: Feature extraction from audio files]
[Figure: Model building using multi-layer perceptron]
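
The figures above summarize these two steps. As a rough sketch (not the exact code behind the figures), feature extraction with Librosa followed by a small Multi-Layer Perceptron classifier could look like this; the folder layout, the label scheme derived from file names, and the MLP hyperparameters are assumptions for illustration.

import os
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def extract_features(path):
    # Load the recording and summarize MFCC, chroma and mel-spectrogram
    # features into a fixed-length vector by averaging over time.
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([mfcc, chroma, mel])

# Hypothetical layout: wav files in calls_wav/, label encoded in the file name
features, labels = [], []
for fname in os.listdir("calls_wav"):
    if fname.endswith(".wav"):
        features.append(extract_features(os.path.join("calls_wav", fname)))
        labels.append(1 if "satisfied" in fname else 0)  # assumed naming scheme

X_tr, X_te, y_tr, y_te = train_test_split(
    np.array(features), np.array(labels), test_size=0.25, random_state=0)

# A small multi-layer perceptron for the tone-based classification step
mlp = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
print("Test accuracy:", mlp.score(X_te, y_te))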

Speech analytics of customer calls:

Speech to text conversion:

For converting speech in audio format to a textual format, I used two approaches.

1. SpeechRecognition Library

2. AWS Transcribe

SpeechRecognition Library: This is an open-source Python library for performing speech recognition, with support for several engines and APIs such as Google Speech Recognition, Google Cloud Speech API, IBM Speech to Text, etc.

The transcription results of this library were not accurate for this particular dataset. You could try it on your own dataset and see whether the results are good.
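
For reference, a minimal sketch of using the SpeechRecognition library with the free Google Speech Recognition engine, assuming the call has already been converted to wav (the library's AudioFile reader does not accept mp3; the file path here is hypothetical):

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("calls_wav/sample_call.wav") as source:  # hypothetical file
    audio = recognizer.record(source)  # read the entire file into memory

try:
    text = recognizer.recognize_google(audio)  # free Google Speech Recognition API
    print(text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print(f"Could not reach the recognition service: {e}")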

AWS Transcribe: Amazon Transcribe provides high-quality and accurate speech-to-text transcription for a wide range of use cases. AWS Transcribe can also operate on streaming audio, providing a stream of transcribed text in real-time. AWS Transcribe supports 16 languages, including 4 English variants.

The steps for transcribing speech to text using AWS Transcribe are as follows.

To transcribe audio files with AWS Transcribe, the files must first be stored in Amazon S3. AWS Transcribe can only operate on files stored in Amazon S3.

Creating S3 bucket instance:

import boto3
import os

## AWS access credentials
AWS_S3_CREDS = {
    "aws_access_key_id": "YOUR_ACCESS_ID",
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY"
}

# Creating an S3 bucket instance
client = boto3.client('s3', **AWS_S3_CREDS)
s3 = boto3.resource('s3', **AWS_S3_CREDS)
bucket = s3.Bucket('customer-calls')

Uploading files to Amazon S3:

def upload_files(file_name, bucket, object_name=None, args=None):
    if object_name is None:
        object_name = file_name
    response = client.upload_file(file_name, bucket, object_name,
                                  ExtraArgs=args)

for x in os.listdir('/content/drive/MyDrive/calls'):
    if x.endswith(".mp3"):
        upload_files('/content/drive/MyDrive/calls/'+x, 'customer-calls')

# Printing files uploaded in S3 bucket
objs = list(bucket.objects.filter(Prefix=''))
print(len(objs))
for i in range(0, len(objs)):
    print(objs[i].key)

Creating AWS Transcribe instance:

transcribe_client = boto3.client('transcribe', region_name='your-aws-region', **AWS_S3_CREDS)

Transcribing files present in Amazon S3:

Whenever a call is transcribed, a unique job is created for each file, which can be used at any time to retrieve the transcription result of that call.

import json
import time
import urllib.request

def transcribe_file(job_name, file_uri, transcribe_client):
    # Start a transcription job for one audio file stored in S3
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': file_uri},
        MediaFormat='mp3',
        LanguageCode='en-US'
    )
    max_tries = 60
    while max_tries > 0:
        max_tries -= 1
        job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
        job_status = job['TranscriptionJob']['TranscriptionJobStatus']
        if job_status in ['COMPLETED', 'FAILED']:
            print(f"Job {job_name} is {job_status}.")
            if job_status == 'COMPLETED':
                # Download the transcript JSON and return the transcribed text
                response = urllib.request.urlopen(job['TranscriptionJob']['Transcript']['TranscriptFileUri'])
                data = json.loads(response.read())
                text = data['results']['transcripts'][0]['transcript']
                return text
            break
        else:
            print(f"Waiting for {job_name}. Current status is {job_status}.")
            time.sleep(10)

file_uri = 's3://customer-calls/'  # S3 bucket link
text_list = []
name_list = []
id_list = []
result_list = []
for i in range(0, len(objs)):
    name = objs[i].key.split("/")[1]
    text = transcribe_file('Call-'+str(i+1)+'-'+name, file_uri+objs[i].key, transcribe_client)
    text_list.append(text)
    name_list.append(name.replace("mp3", ""))
    id_list.append(name.split("_")[1])
    result_list.append(name.split("_")[2])
    print('completed- '+str(i+1)+' '+name)
    print(text)

Sample output:

Preparing the dataset for Text classification:

Pre-processing the transcribed data:

The necessary steps include the following:

  1. Tokenizing sentences to break the text down into sentences, words, or other units
  2. Removing stop words like “if,” “but,” “or,” and so on
  3. Normalizing words by condensing all forms of a word into a single form (Stemming or lemmatization).

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

Le = WordNetLemmatizer()
corpus = []
for i in range(0, len(audiotext_df)):
    # Keep only letters, lowercase, remove stop words and lemmatize each word
    processed_text = re.sub('[^a-zA-Z]', ' ', audiotext_df['AWS_Translated_Text'][i])
    processed_text = processed_text.lower()
    processed_text = processed_text.split()
    processed_text = [Le.lemmatize(word) for word in processed_text if not word in stopwords.words('english')]
    processed_text = ' '.join(processed_text)
    corpus.append(processed_text)

The text needs to be transformed into vectors so that the algorithms can make predictions. Here we use the Term Frequency-Inverse Document Frequency (TF-IDF) weight to evaluate how important a word is to a document in a collection of documents.

Creating a TF-IDF vectorizer for the pre-processed text corpus, with the maximum number of features set to 2000.

Note: The number of features depends upon the size of the dataset. Feel free to try out different values for the number of features and select the one giving the best accuracy.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=2000)
vectorizer.fit(corpus)
X = vectorizer.transform(corpus).toarray()
y = audiotext_df['result']

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Three benefits of performing feature selection before modeling your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: less data means that algorithms train faster.

Here, feature optimization is done using Chi-squared feature selection.

Scikit-learn offers several feature selection methods, such as Chi-squared, Variance threshold, and Mutual information.

The Chi-squared test is used to test the independence of two events. More specifically in feature selection, we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent.

Applying Chi-square feature selection on the vectorized data by limiting the features to 300.

from sklearn.feature_selection import SelectKBest, chi2

chi2_features = SelectKBest(chi2, k=300)
X_kbest_features = chi2_features.fit_transform(X, y)

We are going to use the SVM algorithm to build the classification model. A support vector machine determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not. Here we train the model on the optimized vectors obtained from the feature selection step for better accuracy.

from sklearn.model_selection import train_test_split
from sklearn import svm

X_train, X_test, y_train, y_test = train_test_split(X_kbest_features, y, test_size=0.25, random_state=0)
clf = svm.SVC(kernel='sigmoid')
clf.fit(X_train, y_train)
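
A quick way to check the trained classifier on the held-out test split, using standard scikit-learn metrics (a minimal sketch):

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))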

Based on this NLP modeling and analysis, we achieved around 97% accuracy in classifying the audio calls as satisfied or not satisfied. Two points stand out: AWS Transcribe gave us a decent transcription output, and the combination of the TF-IDF vectorizer followed by Chi-Square feature selection, which picks the most important features from the full feature list, helped us reach an optimized result. This project can be developed further by implementing topic modeling to relate customer sentiment to the topics of conversation and to identify concepts and specific subjects in the transcripts. Analyzing these calls directly could be very effective, since most customers answer phone calls rather than fill out the feedback form.
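
As a pointer for that future work, a rough sketch of topic modeling with LDA on the pre-processed transcript corpus might look like this; the number of topics and vectorizer settings are assumptions, not results from this project.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Fit a small LDA model on the pre-processed call transcripts
count_vec = CountVectorizer(max_features=2000)
counts = count_vec.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Show the top words for each discovered topic
terms = count_vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-10:]]
    print(f"Topic {idx}: {', '.join(top_words)}")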
