Video calls and real-time streaming with WebRTC and SDKs

  • WebRTC offers real-time audio, video, and data with very low latency using getUserMedia, RTCPeerConnection, and RTCDataChannel.
  • To function in the real world it needs signaling, STUN/TURN and ICE, and scaling usually requires SFUs or media servers.
  • SDKs like Agora, Twilio, or ZEGOCLOUD simplify infrastructure in exchange for recurring costs and vendor dependence.
  • A side project can start with an SDK and evolve into its own WebRTC infrastructure as the product matures.

If you're building a JavaScript side project and you need video calls, it's normal to have doubts: should I use pure WebRTC, an SDK like Agora, Twilio, Mux, or ZEGOCLOUD, or go all in with RN-WebRTC in React Native? The bad news is that there's no single solution. The good news is that if you already understand real-time JavaScript, you're in an ideal position to make an informed decision and avoid messing up the architecture.

In the following sections you'll see, step by step, how WebRTC works inside, what role Agora (and other similar providers) plays, what it means to set up your own infrastructure (STUN/TURN, signaling, SFU, media servers…), and what the real trade-offs are between cost, complexity, and scalability for video calls and real-time streaming.

What is WebRTC and why is it the foundation of everything?

WebRTC (Web Real-Time Communication) is an open-source set of standards, APIs, and protocols that enables real-time audio, video, and data exchange directly from a browser or native app, without plugins or external applications. It's standardized by the W3C and IETF and supported by all modern browsers: Chrome, Firefox, Safari, Edge, Opera, and many mobile browsers.

Its philosophy is clear: enable peer-to-peer (P2P) communication between users with very low latency, handling the awkward networking details (codecs, jitter, echo, packet loss, encryption, and so on) behind the scenes. This covers everything from a one-to-one video call to an interactive streaming system with hundreds or thousands of viewers, if you combine it with the right infrastructure.

Key WebRTC APIs: getUserMedia, RTCPeerConnection and RTCDataChannel

WebRTC relies on three main browser-side APIs that you will definitely use, whether you build your own solution or use an SDK like Agora:

  • MediaStream / getUserMedia: to capture video and audio (camera, microphone, and even screen or tabs).
  • RTCPeerConnection: to negotiate and transport audio and video streams between peers.
  • RTCDataChannel: to send arbitrary data (text, binary, files) with low latency between clients.

With getUserMedia you can ask the browser for access to the camera and microphone and receive a MediaStream, which you then attach to a <video> element with video.srcObject = stream. You can apply constraints (resolution, frame rate, front/rear camera, etc.) and, if required constraints cannot be met, you get errors such as OverconstrainedError, which you must handle by offering alternatives (for example, dropping from 1080p to 720p and applying adjustments to improve microphone audio).
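
A minimal sketch of that fallback, assuming a hypothetical helper called openCamera and an illustrative resolution ladder (1080p, then 720p, then any camera). The mediaDevices argument is meant to be navigator.mediaDevices in a browser; it is passed in as a parameter so the logic can also be exercised with a stub.

```javascript
// Resolution ladder, from most to least demanding (values are illustrative).
// `exact` makes a constraint mandatory; bare values are treated as preferences.
const VIDEO_LADDER = [
  { width: { exact: 1920 }, height: { exact: 1080 } },
  { width: { exact: 1280 }, height: { exact: 720 } },
  true, // last resort: any video the device can provide
];

// Tries each constraint set in order and returns the first stream obtained.
// Only OverconstrainedError triggers a retry; other errors (for example
// NotAllowedError when the user denies permission) are rethrown as-is.
async function openCamera(mediaDevices, ladder = VIDEO_LADDER) {
  let lastError;
  for (const video of ladder) {
    try {
      return await mediaDevices.getUserMedia({ audio: true, video });
    } catch (err) {
      if (err.name !== "OverconstrainedError") throw err;
      lastError = err;
    }
  }
  throw lastError;
}

// Browser usage:
//   const stream = await openCamera(navigator.mediaDevices);
//   document.querySelector("video").srcObject = stream;
```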

The RTCPeerConnection API is the heart of a call: it handles SDP negotiation (offer/answer), ICE candidate gathering (with the help of STUN/TURN servers), connection establishment, and secure transmission via SRTP. From your code, you simply create the connection, add media tracks, and react to events such as onicecandidate and ontrack; the signaling is up to you.
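
The caller side of that flow could look roughly like this. It is a sketch under assumptions: startCall, sendSignal, and onRemoteStream are hypothetical names, and sendSignal stands for whatever signaling channel you end up using. The pc argument is an RTCPeerConnection created by the caller.

```javascript
// Caller side of a call. `pc` is an RTCPeerConnection, `stream` a MediaStream,
// and `sendSignal` a hypothetical function that delivers a message to the
// remote peer through your signaling channel.
async function startCall(pc, stream, sendSignal, onRemoteStream) {
  // Send each local audio/video track to the remote peer.
  for (const track of stream.getTracks()) pc.addTrack(track, stream);

  // Forward ICE candidates as the browser discovers them.
  // A null candidate only marks the end of gathering, so it is skipped.
  pc.onicecandidate = (event) => {
    if (event.candidate) sendSignal({ type: "candidate", candidate: event.candidate });
  };

  // Remote media arrives here once negotiation succeeds.
  pc.ontrack = (event) => onRemoteStream(event.streams[0]);

  // Create the SDP offer and hand it to the signaling layer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendSignal({ type: "offer", sdp: pc.localDescription.sdp });
}

// When the answer comes back through signaling:
//   await pc.setRemoteDescription({ type: "answer", sdp: message.sdp });
// and for each remote candidate:
//   await pc.addIceCandidate(message.candidate);
```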

Lastly, RTCDataChannel lets you set up data channels similar to a WebSocket, but peer-to-peer and with fine-grained control over reliability and ordering. It's useful for in-call chat, file sharing, game state synchronization, or real-time collaboration. The syntax is familiar: dataChannel.send() on the sender and onmessage on the receiver.
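
To make the reliability and ordering control concrete, here is a small sketch. The helper names channelOptions and chunkBuffer and the two profile names are hypothetical; ordered and maxRetransmits are standard data channel options. The 16 KiB chunk size is a conservative assumption, not a hard limit.

```javascript
// Returns data channel options for two common profiles.
// "reliable": ordered delivery with retransmissions (chat, file transfer).
// "realtime": unordered, no retransmissions (game state, where a stale
// update is worth less than the next one).
function channelOptions(profile) {
  if (profile === "realtime") return { ordered: false, maxRetransmits: 0 };
  return { ordered: true };
}

// Splits an ArrayBuffer into chunks so a large file can be sent piece by piece.
function chunkBuffer(buffer, chunkSize = 16 * 1024) {
  const chunks = [];
  for (let offset = 0; offset < buffer.byteLength; offset += chunkSize) {
    chunks.push(buffer.slice(offset, offset + chunkSize));
  }
  return chunks;
}

// Browser usage:
//   const chat = pc.createDataChannel("chat", channelOptions("reliable"));
//   chat.onopen = () => chat.send("hello");
//   chat.onmessage = (event) => console.log(event.data);
// The remote peer receives the channel through pc.ondatachannel.
```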

Signaling: the “glue” that WebRTC does not define

A common misunderstanding: WebRTC does not include signaling. RTCPeerConnection needs to exchange information with the other peer, but it doesn't dictate how. You have to define that yourself, or let a third-party SDK abstract it for you.

Through signaling, peers exchange:

  • Session control messages: start call, hang up, errors.
  • Network information: ICE candidates (discovered IP addresses/ports).
  • Media metadata: SDP offers and answers with codecs, resolutions, etc.
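
Since WebRTC leaves the message format to you, one possible shape is a JSON envelope with a type field. The type names and the parseSignal helper below are assumptions made for illustration, not part of any standard.

```javascript
// Message types covering the three categories above (names are illustrative).
const SIGNAL_TYPES = new Set(["join", "hangup", "error", "offer", "answer", "candidate"]);

// Parses a raw JSON signaling message and checks the fields each type needs.
// Returns the message object, or throws on malformed input.
function parseSignal(raw) {
  const msg = JSON.parse(raw);
  if (!SIGNAL_TYPES.has(msg.type)) {
    throw new Error(`unknown signal type: ${msg.type}`);
  }
  if ((msg.type === "offer" || msg.type === "answer") && typeof msg.sdp !== "string") {
    throw new Error(`${msg.type} requires an sdp string`);
  }
  if (msg.type === "candidate" && (typeof msg.candidate !== "object" || msg.candidate === null)) {
    throw new Error("candidate requires a candidate object");
  }
  return msg;
}
```

Validating on receipt keeps a malformed or malicious message from reaching setRemoteDescription or addIceCandidate.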

Signaling is usually implemented with WebSockets, Socket.IO, HTTP (polling/long-polling), MQTT, or other bidirectional mechanisms. A very typical pattern is a Node.js server with Socket.IO that manages “rooms” and forwards text/JSON messages between clients:

Server: receives create or join, creates the room if it doesn't exist, admits up to two clients (for a basic video call), and forwards message events to the other sockets in the room. You are responsible for enforcing the maximum number of users or for designing your own room logic.

Client: when the page loads, it asks for a room name (or infers it from the URL), emits create or join, listens for events such as created, joined, full, and ready, and agrees with the other party whether to start or reject the call.

This pattern is perfect for a prototype or side project. It gives you a lightweight signaling server that you can scale with clusters and load balancers if needed.
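
The room bookkeeping behind that server fits in a few framework-independent functions. This is a sketch: joinRoom, leaveRoom, and recipients are hypothetical helpers, and the Socket.IO wiring appears only in comments as one way you might connect them.

```javascript
const MAX_PEERS = 2; // basic 1:1 call, as in the pattern described above

// rooms is a Map from room name to a Set of socket ids.
// Returns the event to emit back to the client: "created", "joined" or "full".
function joinRoom(rooms, roomName, socketId) {
  const members = rooms.get(roomName) ?? new Set();
  if (members.size >= MAX_PEERS) return "full";
  const event = members.size === 0 ? "created" : "joined";
  members.add(socketId);
  rooms.set(roomName, members);
  return event;
}

// Removes a socket from a room and drops the room when it becomes empty.
function leaveRoom(rooms, roomName, socketId) {
  const members = rooms.get(roomName);
  if (!members) return;
  members.delete(socketId);
  if (members.size === 0) rooms.delete(roomName);
}

// Everyone in the room except the sender: the sockets a message is relayed to.
function recipients(rooms, roomName, senderId) {
  return [...(rooms.get(roomName) ?? [])].filter((id) => id !== senderId);
}

// Possible Socket.IO wiring (not executed here):
//   io.on("connection", (socket) => {
//     socket.on("create or join", (room) => {
//       const event = joinRoom(rooms, room, socket.id);
//       if (event !== "full") socket.join(room);
//       socket.emit(event, room);
//     });
//     socket.on("message", (room, msg) => socket.to(room).emit("message", msg));
//   });
```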

STUN, TURN, ICE: Getting through NATs and firewalls without going crazy

In an ideal world, two users would always be on accessible networks and connect directly. In the real world, there are NATs, firewalls, CGNAT from ISPs and paranoid corporate networks. This is where ICE comes in, combining STUN and TURN.

  • STUN (Session Traversal Utilities for NAT) lets a client discover its public IP and port. The STUN server only responds with that information.
  • TURN (Traversal Using Relays around NAT) acts as a media relay server when there is no way to open a direct P2P channel. Audio/video traffic passes through it, so it consumes server bandwidth and costs money.
  • ICE (Interactive Connectivity Establishment) tests all possible candidates (local addresses, server-reflexive addresses discovered via STUN, TURN relays) until it finds a viable route.
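
Each candidate carries its kind in the typ field of its candidate line, which is handy for logging how a connection was established. A small sketch, with candidateType as a hypothetical helper and example lines that use documentation-only IP ranges:

```javascript
// Extracts the ICE candidate type from a candidate line.
//   host  = local address
//   srflx = public address learned via STUN (server reflexive)
//   prflx = peer reflexive, discovered during connectivity checks
//   relay = address allocated on a TURN server
function candidateType(candidateLine) {
  const match = /\btyp\s+(host|srflx|prflx|relay)\b/.exec(candidateLine);
  return match ? match[1] : null;
}

// Browser usage inside onicecandidate:
//   if (event.candidate) console.log(candidateType(event.candidate.candidate));
```

If your logs show mostly relay candidates being selected, expect TURN bandwidth to be a real cost.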

In practice, you add an iceServers array with STUN/TURN URIs to the RTCPeerConnection configuration object, and the browser does the rest. If you set up your own infrastructure, you'll have to deploy and maintain your STUN/TURN servers; if you use an SDK like Agora, Twilio, or ZEGOCLOUD, they already have this sorted and ready for production.
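
A sketch of that configuration object follows. The buildRtcConfig helper is hypothetical, and the TURN host and credentials are placeholders you would replace with your own. The STUN URL points at a public Google server commonly used in demos, which is fine for a prototype but not something to depend on in production.

```javascript
// Builds the configuration object for `new RTCPeerConnection(config)`.
// `turn` is optional: { host, username, credential } for your own TURN server.
// `forceRelay` restricts ICE to TURN candidates, useful to test the relay path.
function buildRtcConfig({ turn = null, forceRelay = false } = {}) {
  const iceServers = [{ urls: "stun:stun.l.google.com:19302" }];
  if (turn) {
    iceServers.push({
      urls: [
        `turn:${turn.host}:3478?transport=udp`,
        `turn:${turn.host}:3478?transport=tcp`, // fallback for networks that block UDP
      ],
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers, iceTransportPolicy: forceRelay ? "relay" : "all" };
}

// Browser usage (turn.example.com is a placeholder host):
//   const pc = new RTCPeerConnection(buildRtcConfig({
//     turn: { host: "turn.example.com", username: "demo", credential: "secret" },
//   }));
```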

Low-latency real-time streaming: WebRTC vs HLS/DASH

When we talk about live streaming, there are two distinct worlds: HTTP-based protocols (HLS, DASH) and WebRTC. With HLS/DASH the client downloads and plays video segments; this is perfect for scaling via CDN, but it introduces latencies of several seconds (easily 5-30 seconds).
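
A back-of-the-envelope model shows where those seconds come from: a player usually buffers a few whole segments before it starts, so latency grows with segment length. The function and the numbers below are illustrative assumptions, not measurements of any particular player or CDN.

```javascript
// Rough latency estimate for segmented delivery (HLS/DASH).
//   segmentSeconds   - duration of each media segment
//   bufferedSegments - segments the player holds before starting playback
//   pipelineSeconds  - encoding, packaging and CDN propagation overhead
function estimateSegmentedLatency({ segmentSeconds, bufferedSegments, pipelineSeconds = 0 }) {
  return segmentSeconds * bufferedSegments + pipelineSeconds;
}

// 6 s segments, 3 buffered, 2 s of pipeline overhead: about 20 s behind live.
const classic = estimateSegmentedLatency({ segmentSeconds: 6, bufferedSegments: 3, pipelineSeconds: 2 });

// Shrinking segments to 2 s helps, but still leaves several seconds.
const tuned = estimateSegmentedLatency({ segmentSeconds: 2, bufferedSegments: 3, pipelineSeconds: 1 });
```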

WebRTC, on the other hand, uses UDP + RTP and delivers the video in "push" mode from the source to the player, with very short startup times and typical latencies below 500 ms (often ~250 ms) if the network is good. It achieves this thanks to:

  • Built-in congestion control, which adjusts bitrate and resolution in real time according to packet loss, jitter, or RTT.
  • Efficient codecs (VP8, VP9, H.264; increasingly AV1) with hardware acceleration when available.
  • Optional SVC (Scalable Video Coding), so that each receiver only gets the layers its network and device can handle.
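
The browser and the media server handle this adaptation for you, but the selection logic behind layered video is easy to picture. The sketch below is a simplified illustration: pickLayer, the layer table, and the 80% headroom factor are all assumptions, not values taken from any real implementation.

```javascript
// Illustrative quality layers, lowest to highest.
const LAYERS = [
  { name: "low", height: 180, kbps: 150 },
  { name: "medium", height: 360, kbps: 500 },
  { name: "high", height: 720, kbps: 1500 },
];

// Picks the highest layer that fits within the estimated bandwidth,
// keeping some headroom for audio and bursts. Falls back to the lowest
// layer when nothing fits, so the viewer always receives something.
function pickLayer(availableKbps, layers = LAYERS, headroom = 0.8) {
  const budget = availableKbps * headroom;
  const sorted = [...layers].sort((a, b) => a.kbps - b.kbps);
  let chosen = sorted[0];
  for (const layer of sorted) {
    if (layer.kbps <= budget) chosen = layer;
  }
  return chosen;
}
```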

That's why WebRTC is the natural choice for real-time auctions, live sports betting, trading, interactive gaming, remote support, telemedicine, participatory virtual classrooms or financial dashboards that cannot afford several seconds of delay.

The problem is that pure P2P WebRTC doesn't scale well to thousands of viewers; for that you need SFUs, media servers, or hybrid platforms, which is precisely where solutions like Flussonic, Agora, or similar come in.

Scaling beyond P2P: SFUs, media servers, and hybrid architectures

In a one-on-one video call, WebRTC performs very well. But if you start adding 10, 20, or 100 users, things change: each client has to send and receive multiple streams, its CPU load climbs, and its uplink saturates. Three classic patterns emerge here:

  • MCU (Multipoint Control Unit): the server receives all the streams, mixes them, and sends a single stream to each client. Advantage: low resource consumption on the client. Disadvantages: heavy server load and less per-stream quality control.
  • SFU (Selective Forwarding Unit): the server receives streams and selectively forwards them without mixing. Each viewer receives the streams they need, possibly at different qualities. This is the most commonly used pattern today for multi-user videoconferencing and scalable interactive streaming.
  • Hybrid WebRTC + HLS/DASH architectures: WebRTC is used for ingest and interaction, while HLS/DASH distributes to large audiences that don't need real-time interaction. It's a balance between ultra-low latency for the “actors” and massive scalability for the “spectators”.
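
Counting streams per client makes the difference tangible. The sketch below compares a full P2P mesh with SFU and MCU topologies, assuming every participant publishes one stream and watches everyone else; streamsPerClient is a hypothetical helper.

```javascript
// Streams each client must upload and download in a call with
// `participants` people, for each topology.
function streamsPerClient(topology, participants) {
  const others = Math.max(participants - 1, 0);
  switch (topology) {
    case "mesh": // pure P2P: one connection per remote peer
      return { up: others, down: others };
    case "sfu": // upload once, the server forwards the rest
      return { up: 1, down: others };
    case "mcu": // upload once, receive one mixed stream
      return { up: 1, down: 1 };
    default:
      throw new Error(`unknown topology: ${topology}`);
  }
}

// With 10 participants:
//   mesh -> { up: 9, down: 9 }
//   sfu  -> { up: 1, down: 9 }
//   mcu  -> { up: 1, down: 1 }
```

The upload column is what kills a mesh first: home connections usually have far less upstream than downstream bandwidth.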

Media servers like Flussonic and others provide the necessary backend: they receive the WebRTC stream, transcode it if needed, forward it via WebRTC to other clients, or convert it to HLS-type protocols for mass distribution. This type of infrastructure is what, in practice, makes it viable to go beyond one-to-one calls without having to reinvent the wheel.

Typical use cases: video calls, streaming, IoT, and much more

WebRTC has become ubiquitous, and you probably use it every day without realizing it. Some examples where it fits particularly well:

  • Video calls and video conferences: Google Meet, Jitsi, Slack, Microsoft Teams, and many other tools rely on WebRTC (in part or in full) for video, audio, and screen sharing.
  • Real-time streaming services: platforms such as Twitch, Meta Live, Vimeo Livestream, or tools like StreamYard combine WebRTC for ingest with other technologies for mass distribution.
  • Chat and messaging with file sharing: thanks to RTCDataChannel you can have real-time chat, file sharing, state synchronization, etc., without central media servers.
  • Cloud gaming and multiplayer: services like GeForce NOW or Xbox Cloud Gaming leverage similar technologies for interactive video; many P2P games use WebRTC to synchronize gameplay.
  • IoT and surveillance: smart cameras, baby monitors, video doorbells, or drones can send real-time video to mobile devices and browsers using WebRTC.
  • Education and telemedicine: virtual classrooms with whiteboards, quizzes and two-way video, or online medical consultations where latency and security are crucial.

WebRTC security: encryption, permissions, and best practices

Security in WebRTC is not an extra: it's built in by design. All media is encrypted and the APIs only work from secure origins (HTTPS or localhost), although it's still wise to stay alert to scams carried out over video calls.

  • DTLS (Datagram Transport Layer Security) encrypts data in transit.
  • SRTP (Secure Real-time Transport Protocol) protects audio and video so that they cannot be easily manipulated or intercepted.
  • Access to the camera and microphone requires explicit user permission, with visible indicators (icons, colored dots, etc.).
  • Since there are no plugins to install, there is less risk of malicious software disguised as third-party extensions or binaries.

Even so, you have to take care of your own layer: use HTTPS throughout, review the permissions you request, keep browsers and libraries updated, and don't neglect the security of your signaling server or your REST APIs.
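
Because the capture APIs are unavailable on insecure origins, it can help to fail early with a clear message instead of a confusing undefined error. The check below is a simplified approximation written for illustration (looksLikeSecureOrigin is a hypothetical helper); inside a browser, window.isSecureContext is the authoritative answer.

```javascript
// Simplified check: HTTPS anywhere, or plain HTTP on the local machine.
// Real browsers apply more detailed rules; prefer window.isSecureContext
// when running in a page.
function looksLikeSecureOrigin(href) {
  const { protocol, hostname } = new URL(href);
  if (protocol === "https:") return true;
  return protocol === "http:" && (hostname === "localhost" || hostname === "127.0.0.1");
}

// Browser usage:
//   if (!window.isSecureContext) {
//     showError("Camera access needs HTTPS (or localhost during development).");
//   }
```

A common surprise is testing from a phone against http://192.168.x.x on your laptop: it works on localhost but the camera is blocked on the LAN address.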

WebRTC vs other technologies: VoIP, WebSockets and proprietary platforms

If you're coming from the world of traditional VoIP, you'll be familiar with SIP, PBXs, softphones, and expensive servers. WebRTC changes the paradigm: you don't need to make users install a desktop client or buy specific hardware; a browser and a relatively simple signaling server are sufficient.

Compared with traditional VoIP, WebRTC reduces the burden on core infrastructure and opens the door to applications directly integrated into the web. In many cases, you can reuse your SIP backend through gateways that translate signaling to WebRTC.

As for WebSockets, they are best seen as complementary: they're ideal for notifications, light chat, or status updates, but not for intensive media. WebRTC is optimized for real-time audio/video, with congestion control, codecs, jitter buffers, etc. In practice, many projects use WebSockets for signaling and WebRTC for media transport.

If you compare it with platforms like Zoom, GoToMeeting, or Webex, the difference lies in the model: those tools are closed solutions, often with mandatory desktop apps and a proprietary backend. WebRTC, on the other hand, is a foundational technology; you can build your own "mini-Meet" on top of it or integrate with services that already use it (like Google Meet or Microsoft Teams).

Developing with WebRTC: real complexity and common pitfalls

Although the APIs seem simple on paper, implementing WebRTC from scratch is more complex than it looks. You'll have to deal with:

  • Custom signaling: designing messages and rooms, managing reconnections, retries, and errors.
  • ICE/STUN/TURN management: deploying servers, monitoring TURN usage (which consumes bandwidth), tuning timeouts.
  • Quality of service (QoS): adapting bitrates, handling unstable networks, negotiating codecs, detecting when a connection degrades and reacting.
  • Scaling: moving from simple P2P to groups, then to hundreds of users, and introducing SFUs or media servers without breaking the original design.
  • Cross-browser compatibility: although the situation is good, you will still find nuances. Using adapter.js is still highly recommended.
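
Reconnections are a good example of the hidden work. A common approach, shown here as a sketch rather than a prescription, is exponential backoff with random jitter so that clients dropped at the same moment do not all reconnect in lockstep. The backoffDelay helper and its default timings are assumptions.

```javascript
// Delay in milliseconds before reconnection attempt number `attempt`
// (0-based). The ceiling doubles on each attempt up to `maxMs`, and the
// actual delay is a random value below that ceiling.
// `jitter` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, { baseMs = 500, maxMs = 30000, jitter = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(jitter() * ceiling);
}

// Usage with a hypothetical connect() function for your signaling socket:
//   let attempt = 0;
//   function scheduleReconnect() {
//     setTimeout(connect, backoffDelay(attempt++));
//   }
//   // reset `attempt = 0` once the connection is established again
```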

In a small side project, setting up a Node server with Socket.IO and a public STUN server might be enough for 1:1 calls or very small groups. But if your idea grows and you need large audiences, fine-grained quality control, recordings, analytics, transcriptions, or monetization, you'll soon have to consider adding your own media server or switching to a specialist provider.

Real-time CDN with SDKs: Agora, Twilio, Mux, ZEGOCLOUD…

Services like Agora, Twilio, Mux, ZEGOCLOUD or similar technologies build a value layer on top of WebRTC that saves you months of work and countless headaches:

  • They offer you a global media network with SFUs distributed around the world, optimized for low latency.
  • They abstract away STUN/TURN, signaling, retries, reconnections, and complex network management.
  • They include well-maintained SDKs for web, iOS, Android, React Native and other frameworks.
  • They provide extras such as recording, broadcasting to RTMP/HLS, moderation, real-time statistics, quality controls, user roles (host, audience, speaker), etc.

The cost, as you probably suspect, is the main problem: as soon as you rack up many minutes of video or a significant number of concurrent users, the bill skyrockets. Furthermore, you become dependent on the platform and on its pricing or API changes.
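
It is worth running the numbers before committing. The sketch below assumes billing per participant-minute, a model many providers use in some form. The price in the example is a made-up placeholder, not a quote from any provider, so plug in the figures from the pricing page you are actually evaluating.

```javascript
// Estimates monthly usage and cost under a per-participant-minute model.
// All inputs are yours to supply; pricePer1000Minutes is in your currency.
function estimateMonthlyCost({
  callsPerDay,
  minutesPerCall,
  participantsPerCall,
  pricePer1000Minutes,
  daysPerMonth = 30,
}) {
  const participantMinutes = callsPerDay * minutesPerCall * participantsPerCall * daysPerMonth;
  const cost = (participantMinutes / 1000) * pricePer1000Minutes;
  return { participantMinutes, cost };
}

// Example with a hypothetical price of 4 per 1,000 minutes:
// 200 calls a day, 20 minutes each, 2 participants.
const estimate = estimateMonthlyCost({
  callsPerDay: 200,
  minutesPerCall: 20,
  participantsPerCall: 2,
  pricePer1000Minutes: 4,
});
// estimate.participantMinutes is 240000 and estimate.cost is 960
```

Note how group calls multiply the bill: the same 200 daily calls with 10 participants each cost five times as much.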

In your specific situation, with strong experience in real-time JavaScript, a sensible option is to start with an SDK to accelerate development, validate the product, and learn from its room model, roles, stream lifecycle, and state management. Later, if the project takes off and cost becomes an issue, you can gradually migrate parts of the solution to your own WebRTC infrastructure, or rely on a Flussonic-type media server to control the distribution layer.

Best practices and tools for debugging WebRTC

To avoid getting lost in the WebRTC black box, it's advisable to rely on the tools that already exist in browsers and the ecosystem:

  • chrome://webrtc-internals (or about:webrtc in Firefox): panel with detailed statistics on connections, bitrates, packet loss, active codecs, etc.
  • adapter.js: community-maintained shim that smooths differences between browsers and versions.
  • test.webrtc.org: to check camera, microphone, network and general compatibility on a machine.
  • Official Samples at webrtc.github.io/samples: examples of constraints, peer connections, data channels, screen sharing… very useful for copying patterns.
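
The same numbers shown in those panels are available from code through pc.getStats(). A sketch of a summary you might log periodically follows; summarizeInbound is a hypothetical helper, and it reads the packetsReceived, packetsLost, and jitter fields of inbound-rtp reports.

```javascript
// Summarizes incoming media quality from WebRTC stats reports.
// `reports` is anything iterable over stats objects, such as
// `(await pc.getStats()).values()` in a browser.
function summarizeInbound(reports) {
  const summary = [];
  for (const report of reports) {
    if (report.type !== "inbound-rtp") continue;
    const received = report.packetsReceived ?? 0;
    const lost = report.packetsLost ?? 0;
    const total = received + lost;
    summary.push({
      kind: report.kind, // "audio" or "video"
      lossPercent: total > 0 ? (lost / total) * 100 : 0,
      jitterMs: (report.jitter ?? 0) * 1000, // jitter is reported in seconds
    });
  }
  return summary;
}

// Browser usage:
//   const stats = await pc.getStats();
//   console.table(summarizeInbound(stats.values()));
```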

It's also a good idea to structure the code so that the signaling layer (sockets, rooms, messages) is clearly separated from the pure WebRTC layer (connection creation, stream management, event handlers). This lets you replace a signaling backend or media server without rewriting all the client logic.
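
One way to enforce that separation is to have the WebRTC code depend only on a tiny interface with send and onMessage. The sketch below is an in-memory implementation of such an interface, useful for tests and demos; the interface shape and the createLoopbackSignaling name are assumptions, and a WebSocket or Socket.IO adapter would expose the same two methods.

```javascript
// In-memory stand-in for a signaling backend. Returns two connected
// endpoints: whatever one sends, the other receives.
function createLoopbackSignaling() {
  const handlers = { a: [], b: [] };
  const endpoint = (self, other) => ({
    // Deliver the message to every handler registered by the other side.
    send: (message) => handlers[other].forEach((handler) => handler(message)),
    // Register a handler for messages addressed to this side.
    onMessage: (handler) => handlers[self].push(handler),
  });
  return [endpoint("a", "b"), endpoint("b", "a")];
}

// Usage: the call logic only ever sees `signaling.send` and
// `signaling.onMessage`, so swapping the transport does not touch it.
//   const [alice, bob] = createLoopbackSignaling();
//   bob.onMessage((msg) => console.log("bob got", msg.type));
//   alice.send({ type: "offer", sdp: "..." });
```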

With all of the above on the table, for a side project that is just starting out and where you value development time as much as medium-term cost, the most balanced strategy is usually to start with a WebRTC-based real-time SDK that lets you iterate quickly in React/React Native and internalize how it handles roles, sessions, stream lifecycle, and live state. In parallel, dig deeper into raw WebRTC (getUserMedia, RTCPeerConnection, RTCDataChannel, signaling with Node + Socket.IO, STUN/TURN, SFU) so that you are not tied forever to a single platform and can make the leap to a more custom solution when the product justifies it.