How to Integrate Audio and Video Calls into Mobile Apps?

Published On October 22nd, 2024 Tech Talks

WebRTC allows real-time, peer-to-peer media exchange between two devices. A connection is established through a discovery and negotiation process called signaling. This tutorial will guide you through building a two-way video call.

WebRTC is a fully peer-to-peer technology for the real-time exchange of audio, video, and data, with one central caveat: a form of discovery and media-format negotiation must take place for two devices on different networks to locate one another.

This process is called signaling and involves both devices connecting to a third, mutually agreed-upon server. Through this third server, the two devices can locate one another, and exchange negotiation messages.

Things You Need to Know

  1. Signaling
  2. SDP – Session Description Protocol
  3. ICE Candidates
  4. STUN & TURN
  5. Peer Connection

1. Signaling Using Signal Server

Establishing a WebRTC connection between two devices requires the use of a signaling server to resolve how to connect them over the internet. A signaling server’s job is to serve as an intermediary to let two peers find and establish a connection while minimizing exposure of potentially private information as much as possible.


How Do We Create This Server, and How Does the Signaling Process Actually Work?

First, we need the signaling server itself. WebRTC doesn’t specify a transport mechanism for the signaling information. 

You can use anything you like, from WebSocket to XMLHttpRequest to carrier pigeons to exchange the signaling information between the two peers.

It’s important to note that the server doesn’t need to understand or interpret the signaling data content. Although the content is SDP, even this doesn’t matter much: the message going through the signaling server is, in effect, a black box.

What does matter is when the ICE subsystem instructs you to send signaling data to the other peer, you do so, and the other peer knows how to receive this information and deliver it to its own ICE subsystem. All you have to do is channel the information back and forth. The contents don’t matter at all to the signaling server.
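As a rough illustration only, here is a minimal relay sketch assuming Node.js and the ws package; the "register" message and the peers map are conventions invented for this example, and the message types mirror the ones used later in this tutorial.

const { WebSocketServer } = require("ws");

const wss = new WebSocketServer({ port: 8080 });
const peers = new Map(); // username -> socket (bookkeeping for this example only)

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    const msg = JSON.parse(data);

    if (msg.type === "register") {
      peers.set(msg.name, socket); // remember which user this socket belongs to
      return;
    }

    // Relay "video-offer", "video-answer", "new-ice-candidate", and so on
    // to the addressed peer without inspecting the payload.
    const target = peers.get(msg.target);
    if (target) {
      target.send(JSON.stringify(msg));
    }
  });
});

Notice that the server only forwards messages; it never parses the SDP or ICE payloads, which is exactly the black-box behavior described above.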

2. Session Description Protocol

The signaling mechanism (methods, protocols, etc.) is not specified by WebRTC; we need to build it ourselves. Although this sounds like a complicated task, believe us, it is not. In this series we will use Socket.IO for signaling, but there are many alternatives.

WebRTC only requires the exchange of the media metadata mentioned above between peers, as offers and answers. Offers and answers are communicated in Session Description Protocol (SDP) format, which looks like the following:

v=0
o=- 7614219274584779017 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS
m=audio 1 RTP/SAVPF 111 103 104 0 8 107 106 105 13 126
c=IN IP4 0.0.0.0
...

If you are wondering what each line in the above format means, don’t worry. WebRTC generates this automatically based on the audio/video devices present on your laptop or PC.
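If you’d like to see the SDP your own browser generates, here is a quick sketch you can run in the browser console; addTransceiver("audio") simply requests an audio section without opening the microphone.

const pc = new RTCPeerConnection();
pc.addTransceiver("audio"); // ask for an audio m= section without capturing a device

pc.createOffer()
  .then(offer => console.log(offer.sdp)) // prints text similar to the sample above
  .catch(err => console.error(err));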

3. ICE Candidates

As well as exchanging information about the media (discussed above under Offer/Answer and SDP), peers must exchange information about the network connection. This is known as an ICE candidate, and it details the methods the peer is able to use to communicate (directly or through a TURN server).

Typically, each peer proposes its best candidates first, working its way down toward its less-preferred candidates.

Ideally, candidates are UDP (since it’s faster, and media streams are able to recover from interruptions relatively easily), but the ICE standard does allow TCP candidates as well.
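To get a feel for what a candidate actually contains, you can log the ones your browser gathers; pc here stands for any RTCPeerConnection you have created, and the exact strings vary from network to network.

pc.onicecandidate = (event) => {
  if (event.candidate) {
    // Each candidate string encodes the transport (UDP/TCP), a priority,
    // an IP address and port, and the candidate type (host, srflx, relay).
    console.log(event.candidate.candidate);
  } else {
    console.log("ICE gathering finished"); // a null candidate marks the end
  }
};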

4. STUN & TURN

STUN (Session Traversal Utilities for NAT, where NAT stands for Network Address Translator) server: returns the IP address, port, and connectivity status of a networked device behind a NAT.

The STUN server simply provides a meeting space for computers to exchange contact information. Once the information is exchanged, a connection is established between the peer computers, and the STUN server is left out of the rest of the conversation.

Connections established via STUN servers are the most ideal and cost-effective type of WebRTC communication. There’s hardly any running cost incurred by the users. 

Unfortunately, the connection may fail to establish for some users due to the type of NAT device each peer is using.

In such a situation, the ICE protocol requires you to provide a fallback, which is a relay server known as a TURN server.

TURN (Traversal Using Relays around NAT) server: a relay server that enables devices to send and receive data from behind a NAT or firewall.

A TURN server is an extension of the STUN server. Where it differs from its predecessor is that it handles the entire communication session.

Basically, after a connection is established between the peers, it receives streams from one peer and relays them to the other, and vice versa.

This type of communication is more expensive and the host has to pay for the processing and bandwidth load required to operate a TURN server.
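In code, the choice between STUN and the TURN fallback comes down to the iceServers list you hand to RTCPeerConnection. A minimal sketch, with placeholder URLs and credentials that you would replace with your own servers:

const iceConfiguration = {
  iceServers: [
    // Preferred, low-cost path: STUN only discovers the public address.
    { urls: "stun:stun.example.com:3478" },
    // Fallback: TURN relays all media when no direct path can be found.
    {
      urls: "turn:turn.example.com:3478",
      username: "webrtc-user",       // placeholder credential
      credential: "webrtc-password"  // placeholder credential
    }
  ]
};

const pc = new RTCPeerConnection(iceConfiguration);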

5. Peer Connection

The RTCPeerConnection interface represents a WebRTC connection between the local computer and a remote peer. 

It provides methods to connect to a remote peer, maintain and monitor the connection, and close the connection once it’s no longer needed.

Inheritance: EventTarget ← RTCPeerConnection
 

Integration Workflow

Step 1: Exchanging session descriptions

When starting the signaling process, an offer is created by the user initiating the call. This offer includes a session description, in SDP format, and needs to be delivered to the receiving user, which we’ll call the callee. The callee responds to the offer with an answer message, also containing an SDP description.

Our signaling server will use WebSocket to transmit offer messages with the type “video-offer”, and answer messages with the type “video-answer”.
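For reference, the two message shapes look roughly like this; the field names match the handlers shown later, while the shape itself is just this tutorial’s own convention (myPeerConnection is the connection object created further below).

// Sent by the caller once createOffer() and setLocalDescription() complete.
const offerMessage = {
  name: "alice",                           // caller's username (example value)
  target: "bob",                           // callee's username (example value)
  type: "video-offer",
  sdp: myPeerConnection.localDescription   // the caller's session description
};

// Sent back by the callee once createAnswer() and setLocalDescription() complete.
const answerMessage = {
  name: "bob",
  target: "alice",
  type: "video-answer",
  sdp: myPeerConnection.localDescription   // the callee's session description
};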

(Figure: signaling session)

Step 2: Exchanging ICE candidates

Two peers need to exchange ICE candidates to negotiate the actual connection between them. Every ICE candidate describes a method that the sending peer is able to use to communicate. Each peer sends candidates in the order they’re discovered, and keeps sending candidates until it runs out of suggestions, even if media has already started streaming.

Once a local description has been set with pc.setLocalDescription(offer), the browser fires icecandidate events on the RTCPeerConnection as candidates are gathered; each of these candidates needs to be forwarded to the remote peer.

Once the two peers agree upon a mutually-compatible candidate, that candidate’s SDP is used by each peer to construct and open a connection, through which media then begins to flow. If they later agree on a better (usually higher-performance) candidate, the stream may change formats as needed.

Though not currently supported, a candidate received after media is already flowing could theoretically also be used to downgrade to a lower-bandwidth connection if needed.

Each ICE candidate is sent to the other peer by sending a JSON message of type “new-ice-candidate” over the signaling server to the remote peer.

(Figure: STUN server)

Making a call

The invite() function is invoked when the caller clicks, in their list of users, on the user they want to connect to.

var mediaConstraints = {
  audio: true, // We want an audio track
  video: true // ...and we want a video track
};

function invite(evt) {
  if (myPeerConnection) {
    alert("You are already in another call.");
  } else {
    var clickedUsername = evt.target.textContent;

    targetUsername = clickedUsername;
    createPeerConnection();

    // Capture the local camera and microphone, then preview the stream
    // and add its tracks to the connection so they get negotiated.
    navigator.mediaDevices.getUserMedia(mediaConstraints)
    .then(function(localStream) {
      document.getElementById("local_video").srcObject = localStream;
      localStream.getTracks().forEach(track => myPeerConnection.addTrack(track, localStream));
    })
    .catch(handleGetUserMediaError);
  }
}

Creating the peer connection

The createPeerConnection() function is used by both the caller and the callee to construct their RTCPeerConnection objects, their respective ends of the WebRTC connection. It’s invoked by invite() when the caller tries to start a call, and by handleVideoOfferMsg() when the callee receives an offer message from the caller.

function createPeerConnection() {
  myPeerConnection = new RTCPeerConnection({
      iceServers: [     // Information about ICE servers - Use your own!
        {
          urls: "stun:stun.stunprotocol.org"
        }
      ]
  });

  // Wire up the handlers that drive signaling and media setup.
  myPeerConnection.onicecandidate = handleICECandidateEvent;            // forward gathered ICE candidates to the peer
  myPeerConnection.ontrack = handleTrackEvent;                          // attach incoming remote media to the page
  myPeerConnection.onnegotiationneeded = handleNegotiationNeededEvent;  // create and send the SDP offer
  myPeerConnection.oniceconnectionstatechange = handleICEConnectionStateChangeEvent;
  myPeerConnection.onicegatheringstatechange = handleICEGatheringStateChangeEvent;
  myPeerConnection.onsignalingstatechange = handleSignalingStateChangeEvent;
}

When using the RTCPeerConnection() constructor, we will specify an object providing configuration parameters for the connection. We use only one of these in this example: iceServers. This is an array of objects describing STUN and/or TURN servers for the ICE layer to use when attempting to establish a route between the caller and the callee. These servers are used to determine the best route and protocols to use when communicating between the peers, even if they’re behind a firewall or using NAT.

Starting negotiation

Once the caller has created its RTCPeerConnection, created a media stream, and added its tracks to the connection as shown in Making a call, the browser will deliver a negotiationneeded event to the RTCPeerConnection to indicate that it’s ready to begin negotiation with the other peer.

Here’s our code for handling the negotiation needed event:

function handleNegotiationNeededEvent() {
  myPeerConnection.createOffer().then(function(offer) {
    return myPeerConnection.setLocalDescription(offer);
  })
  .then(function() {
    sendToServer({
      name: myUsername,
      target: targetUsername,
      type: "video-offer",
      sdp: myPeerConnection.localDescription
    });
  })
  .catch(reportError);
}

To start the negotiation process, we need to create and send an SDP offer to the peer we want to connect to. This offer includes a list of supported configurations for the connection, including information about the media stream we’ve added to the connection locally (that is, the video we want to send to the other end of the call), and any ICE candidates gathered by the ICE layer already.

Session Negotiation

Now that we’ve started negotiation with the other peer and have transmitted an offer, let’s look at what happens on the callee’s side of the connection. The callee receives the offer and calls the handleVideoOfferMsg() function to process it. Let’s see how the callee handles the “video-offer” message.

Receiving an incoming call

When the offer arrives, the callee’s handleVideoOfferMsg() function is called with the “video-offer” message that was received. This function needs to do two things. First, it needs to create its own RTCPeerConnection and add the tracks containing the audio and video from its microphone and webcam to that. Second, it needs to process the received offer, constructing and sending its answer.

function handleVideoOfferMsg(msg) {
  var localStream = null;

  targetUsername = msg.name;
  createPeerConnection();

  var desc = new RTCSessionDescription(msg.sdp);

  myPeerConnection.setRemoteDescription(desc).then(function () {
    return navigator.mediaDevices.getUserMedia(mediaConstraints);
  })
  .then(function(stream) {
    localStream = stream;
    document.getElementById("local_video").srcObject = localStream;

    localStream.getTracks().forEach(track => myPeerConnection.addTrack(track, localStream));
  })
  .then(function() {
    return myPeerConnection.createAnswer();
  })
  .then(function(answer) {
    return myPeerConnection.setLocalDescription(answer);
  })
  .then(function() {
    var msg = {
      name: myUsername,
      target: targetUsername,
      type: "video-answer",
      sdp: myPeerConnection.localDescription
    };

    sendToServer(msg);
  })
  .catch(handleGetUserMediaError);
}

This code is very similar to what we did in the invite() function back in Making a call. It starts by creating and configuring an RTCPeerConnection using our createPeerConnection() function. Then it takes the SDP offer from the received “video-offer” message and uses it to create a new RTCSessionDescription object representing the caller’s session description.

ICE candidates exchange

Sending ICE Candidates

The ICE negotiation process involves each peer sending candidates to the other, repeatedly, until it runs out of potential ways it can support the RTCPeerConnection’s media transport needs. Since ICE doesn’t know about your signaling server, your code handles transmission of each candidate in your handler for the icecandidate event.

Your onicecandidate handler receives an event whose candidate property is the SDP describing the candidate (or is null to indicate that the ICE layer has run out of potential configurations to suggest). The contents of candidate are what you need to transmit using your signaling server.

Here’s our example’s implementation:

function handleICECandidateEvent(event) {
  if (event.candidate) {
    sendToServer({
      type: "new-ice-candidate",
      target: targetUsername,
      candidate: event.candidate
    });
  }
}

This builds an object containing the candidate, then sends it to the other peer over the signaling server using the sendToServer() function.

Receiving ICE candidates

The signaling server delivers each ICE candidate to the destination peer using whatever method it chooses; in our example this is as JSON objects, with a type property containing the string “new-ice-candidate”. Our handleNewICECandidateMsg() function is called by our main WebSocket incoming message code to handle these messages:

function handleNewICECandidateMsg(msg) {
  var candidate = new RTCIceCandidate(msg.candidate);

  myPeerConnection.addIceCandidate(candidate)
    .catch(reportError);
}

This function constructs an RTCIceCandidate object by passing the received SDP into its constructor, then delivers the candidate to the ICE layer by passing it into myPeerConnection.addIceCandidate(). This hands the fresh ICE candidate to the local ICE layer, and finally, our role in the process of handling this candidate is complete.

Conclusion

To conclude, I hope this post has given you a solid understanding of how WebRTC works in audio/video calling apps. WebRTC is booming, and adopting the technology in any business app you build will be a wise investment.

MirrorFly is an industry-leading video call API provider that offers 150+ video, voice, and chat features built on WebRTC for real-time communication apps. It offers both cloud and self-hosted video call solutions, so developers can build modern voice/video calling apps quickly and deploy them flexibly on-premises or in the cloud.

Sounds interesting? Get to know more about us from our team of experts now! Contact Sales

 

Frequently Asked Questions (FAQ)

What is WebRTC audio processing?

Audio processing for WebRTC (Web Real-Time Communication) is a technique and set of tools used to manage and manipulate audio streams in real-time for web-based communication applications. WebRTC is widely used in web browsers and mobile apps for applications such as phone and video chats, as well as screen sharing and other real-time communication features.
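In the browser, most of this processing is switched on through standard getUserMedia audio constraints; a minimal sketch:

navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,  // acoustic echo cancellation
    noiseSuppression: true,  // background noise suppression
    autoGainControl: true    // automatic gain control
  }
})
.then((stream) => {
  // Inspect which processing settings the browser actually applied.
  stream.getAudioTracks().forEach((track) => console.log(track.getSettings()));
})
.catch((err) => console.error(err));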

Does WebRTC support audio?

Yes, one of the key functionalities of WebRTC (Web Real-Time Communication) is audio communication. It enables peer-to-peer audio communication and is often used for phone calls, video calls, and other real-time audio-based communication capabilities.

Is WebRTC a VoIP?

WebRTC (Web Real-Time Communication) is not an extension of VoIP (Voice over Internet Protocol) but rather a technology and set of standards that can be used to implement VoIP and real-time communication applications within web browsers and web-based applications.

What is the difference between WebRTC and VoIP?

The key differences between WebRTC and VoIP include:

  • WebRTC is built for browser-based communication, whereas VoIP covers voice and multimedia over IP networks in general.
  • WebRTC typically operates over the public internet, whereas VoIP is frequently deployed within a company’s own network infrastructure.

Can I use WebRTC for an Android app?

Yes, you can use WebRTC in your Android apps to facilitate real-time communication among devices. It is an open-source project maintained by Google on their GitHub repository. They also provide a pre-built library for Android-specific features like rendering UI, cameras, codecs, and so on.



Krishi Shivasangaran

Krishi Shivasangaran is a digital nomad and a veteran of Digital Marketing strategies. She is passionate about learning the newest trends in Live Video Call. And, when she's off-role, she loves to sketch and make people realize the true color of nature.

10 Comments "How to Integrate Audio and Video Calls into Mobile Apps?"

  1. Dinesh says:

    I’m building a dating website that will be upgraded to an app later. I want pricing options video calling apis using webrtc javascript tech stack. Please revert back as soon as possible

    1. Hello Dinesh, it is really amazing news that you are building a webrtc javascript video call solution. We offer SaaS and self-hosted pricing models for users based on their needs. To know more about our pricing, please visit our pricing page.

  2. Sristavi says:

    Hi, Looking to start a video chat app using Webrtc & Javascript. I am looking to host on azure, with web, android, and ios option. Please email me with what options do I have? How you can be of assistance, etc.

    1. Hi Sristavi! Nice to know that you are building a javascript video messaging app and appreciate your interest in MirrorFly’s in-app video chat service. Sure we will email you the full details.

  3. Abilasha says:

    I am looking to build a audio & video call solution using Webrtc and JavaScript that supports in android, ios and website. Could you please tell me the license pricing. Thanks

    1. Hi Abilasha, thanks for your interest in MirrorFly’s javascript video call solution. If you are interested to know about the pricing details, please visit our pricing page.

  4. Dayana says:

    Let me know for One to one audio & video call solution using Webrtc & javascript. how much the license cost and what time will be take

    1. Hi Dayana, thanks for your interest in MirrorFly’s webrtc video call solution. We offer 2 flexible pricing models – SaaS and SaaP (self-hosted solution) at users’ convenience. You may choose our SaaS solution if you would like to use our features for a monthly subscription fee. On the other hand, our self-hosted video calling solution renders you 100% customizable video, voice, and chat features for a one-time license cost. To know more about our pricing details, please visit our pricing page.

  5. Nichesh says:

    I have a team already and i Want to explore more about audio and video call integration using webrtc

    1. Hi Nichesh, thanks for your interest in MirrorFly’s video call solution. We will share all the features details to your email id.
