Augmented Reality Lounge is a Eureka project whose objective is to develop and integrate an innovative solution that brings Augmented Reality capabilities to live sports in interactive environments, with direct access to review key moments on a timeline.
On-site and OTT spectators today cannot experience events through interactive, multiscreen, multiview approaches enriched with up-to-date information, because current audio-visual streaming latencies are far too large (in the range of minutes) to stay synchronized with what they see live.
The goal is to bring OTT streaming latency and the generation of rich AR content down to the order of one second for live-streamed content, so that live events can be synchronized with interactive Augmented Reality content.
The project started in September 2023 and ends in August 2025.
Partners
Associated Partners
Project Structure
- Video and data contribution
- Audio, video, AR data distribution (OTT streams and Storage)
- AR app, Player & highlights
- Use cases & end to end solutions
Audio, video, AR data gathering, synchronization and contribution
This work package implements a protocol with guaranteed performance and supported features close to those of PTP, but adapted to private 5G network applications that need to synchronize all the cameras covered by the same private 5G cell network. It will specify and implement the following key technologies (a minimal timestamping sketch follows the list):
- An absolute 5G time baseline protocol
- An accurate timestamping strategy
- 5G cell configurations minimizing latency and jitter, with guaranteed bit rates for contribution and resource slicing enabled for local distribution using 5G-MBS protocols
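To make the timestamping strategy concrete, the sketch below estimates a camera's clock offset with the classic two-way exchange used by PTP-style protocols and converts a local capture time to the shared 5G time baseline. The exchange and all names are illustrative assumptions, not the protocol the project will specify.

```python
# Minimal sketch, assuming a PTP-style request/response exchange with a
# grandmaster clock exposed by the private 5G network (illustrative only).
import time
from dataclasses import dataclass

@dataclass
class SyncSample:
    t1: float  # camera send time (camera clock)
    t2: float  # grandmaster receive time (grandmaster clock)
    t3: float  # grandmaster reply time (grandmaster clock)
    t4: float  # camera receive time (camera clock)

def clock_offset(sample: SyncSample) -> float:
    """Classic two-way offset estimate used by PTP/NTP-style protocols."""
    return ((sample.t2 - sample.t1) + (sample.t3 - sample.t4)) / 2.0

def to_baseline(capture_time_local: float, offset: float) -> float:
    """Convert a local capture timestamp into the shared 5G time baseline."""
    return capture_time_local + offset

# Example exchange measured over the 5G cell (seconds): offset is about -5 µs.
sample = SyncSample(t1=10.000000, t2=10.000450, t3=10.000460, t4=10.000920)
offset = clock_offset(sample)
frame_ts = to_baseline(time.time(), offset)
print(f"offset = {offset * 1e6:.1f} µs, frame timestamp = {frame_ts:.6f}")
```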
Audio, video, AR data distribution (OTT streams and Storage)
The goal of this work package is to create the key technologies of a video distribution headend. A video headend is composed of an ingest to the cloud, a transcoder, a packager, storage, key-moment highlighting and a CDN. ARLounge is based on a modular approach for defining these components (a minimal sketch of this modularity follows the list below). This work package will define the key technologies the project is going to develop:
- Ingest
- Transcoding modules
- Packager / Storage modules
- CDN
- Multicast ABR
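The modular approach can be pictured as a chain of components sharing one narrow interface, so any of them can be replaced without touching the rest. The sketch below is a hypothetical illustration of that idea; the class and method names are not the project's actual APIs.

```python
# Minimal sketch of the modular headend idea: every component exposes the same
# narrow interface, so one implementation can be swapped for another without
# changing the chain. Class and method names are illustrative assumptions, not
# the project's actual APIs; the bodies are placeholders.
from abc import ABC, abstractmethod
from typing import Iterable

class HeadendModule(ABC):
    @abstractmethod
    def process(self, chunks: Iterable[bytes]) -> Iterable[bytes]:
        """Consume media chunks and yield (possibly transformed) chunks."""

class Ingest(HeadendModule):
    def process(self, chunks):
        yield from chunks   # placeholder: a real ingest would receive and normalize streams

class Transcoder(HeadendModule):
    def process(self, chunks):
        yield from chunks   # placeholder: a real transcoder would produce the ABR ladder

class Packager(HeadendModule):
    def process(self, chunks):
        yield from chunks   # placeholder: a real packager would emit CMAF fragments

def run_chain(source: Iterable[bytes], modules: list[HeadendModule]) -> Iterable[bytes]:
    stream = source
    for module in modules:  # each module wraps the output of the previous one
        stream = module.process(stream)
    return stream

# The chain can be extended (e.g. with a CDN push or multicast ABR module)
# without touching the other components.
print(list(run_chain([b"chunk-0", b"chunk-1"], [Ingest(), Transcoder(), Packager()])))
```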
AR application, player & highlights
This work package includes the development of an Augmented Reality solution and a Highlights detection solution.
The software application for the Augmented Reality solution will display interactive Augmented Reality content to end users.
The Highlights detection solution applies modern AI paradigms (machine learning, computer vision, natural language processing) to automatically detect key moments and generate highlights and timelines for the several targeted use cases.
Both solutions will be composed of a frontend and a backend system.
For the Augmented Reality solution
- The frontend represents the application running on the end user’s device (smartphones or AR glasses).
- The backend represents the server infrastructure responsible for ingesting all the video content and metadata required by the frontend application.
For the Highlights detection solution
- The frontend represents the server where the APIs exposing the metadata are cached, so that they scale towards the AR applications described above.
- The backend represents the server infrastructure responsible for ingesting the several videos, detecting the key moments and generating the highlights and timelines, providing APIs that are updated in real time towards the frontend server described above. A minimal sketch of this metadata exchange follows.
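As a hedged illustration of that real-time metadata exchange, the sketch below models a highlight record and a frontend-side cache that AR clients could poll. Field names and the polling interface are assumptions made for the example only.

```python
# Minimal sketch of the highlight metadata and a frontend-side cache that AR
# clients could poll; field names and the interface are assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Highlight:
    event_type: str                 # e.g. "goal", "foul", "timeout"
    timecode: float                 # media time of the key moment, in seconds
    duration: float                 # suggested clip length, in seconds
    camera_ids: list[str] = field(default_factory=list)

class HighlightCache:
    """Caches the latest highlights pushed by the backend for the AR clients."""
    def __init__(self) -> None:
        self._highlights: list[Highlight] = []

    def push(self, highlight: Highlight) -> None:
        self._highlights.append(highlight)      # called by the backend in real time

    def since(self, timecode: float) -> str:
        """Return the highlights newer than `timecode` as a JSON payload."""
        recent = [asdict(h) for h in self._highlights if h.timecode > timecode]
        return json.dumps(recent)

cache = HighlightCache()
cache.push(Highlight("goal", timecode=1324.5, duration=12.0, camera_ids=["cam3"]))
print(cache.since(1200.0))   # the AR frontend would serve this to its clients
```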
Use cases & end to end solutions
This work package groups all aspects that will lead to the final demonstrator, from use-case specification to integration.
Use cases
AR Lounge covers several elementary use cases that can be combined with each other. The combination of all use cases leads to an overarching solution, which is depicted in Figure 1.
The targeted devices are either Head-Mounted Displays (HMDs) or mobile phones, with mobile phones rather meant to be used for on-site augmented reality. Other devices are not excluded but are not specifically considered here.
Additional info graphics displayed next to the live video
This use case extends a virtual lounge application with an additional data feed that is rendered as graphics. At home, graphics are shown next to the single (live) video in the virtual lounge environment to enrich the user experience; they may be user-triggered and interactive, so that the user can navigate through them. On site, only the graphics are shown.
Three sub-use cases are envisaged:
- No stringent synchronisation: these graphics may be advertisement banners, or any information related to the game or sport (e.g. live statistics).
- Synchronisation with a tolerance margin of about one second: for example, if a goal occurs and the current score is shown next to the video (which may be delayed compared to real time), the score should be updated in step with the video.
- Synchronisation of real-time data with the video in the order of a hundred milliseconds: users shall not be disturbed by a perceived mismatch between any real-time graphics and the video. The real-time data is not extracted from the video and does not undergo the same processing. A minimal synchronisation sketch follows this list.
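The sketch below illustrates, under assumed names and values, how a client could hold back data events until the video playhead reaches their timestamp, with the tolerance set to roughly one second for scores or a hundred milliseconds for real-time graphics.

```python
# Minimal sketch, assuming data events are stamped on the shared time baseline:
# the client holds them back until the video playhead reaches their timestamp.
import heapq

class DataEventBuffer:
    """Holds incoming data events until the (possibly delayed) video catches up."""
    def __init__(self, tolerance_s: float):
        self.tolerance_s = tolerance_s          # ~1.0 for scores, ~0.1 for real-time graphics
        self._queue: list[tuple[float, str]] = []

    def push(self, event_ts: float, payload: str) -> None:
        heapq.heappush(self._queue, (event_ts, payload))

    def due(self, video_ts: float) -> list[str]:
        """Release events whose timestamp lies within the tolerance of the playhead."""
        ready = []
        while self._queue and self._queue[0][0] <= video_ts + self.tolerance_s:
            ready.append(heapq.heappop(self._queue)[1])
        return ready

buffer = DataEventBuffer(tolerance_s=0.1)
buffer.push(431.20, "score: 1-0")
print(buffer.due(video_ts=430.00))   # [] -> too early, showing it would spoil the goal
print(buffer.due(video_ts=431.15))   # ['score: 1-0'] -> within the 100 ms tolerance
```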
The graphics may be placed above, next to and/or below the main screen. The video is captured with ultra-low latency over a private 5G network, and is encoded and distributed with low latency.
The whole chain shall have the lowest possible latency, achieved by tuning the different parameters of the chain. For example, the encoder settings and the segment sizes can be adjusted to reduce latency, and the packaging may use low-latency modes and parallel transfer between the origin and the CDN.
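A rough latency budget shows which of these parameters dominate. The sketch below compares classic segment-based delivery with chunked low-latency delivery; all durations are assumed example values, not project measurements.

```python
# Back-of-the-envelope latency budget (all values are illustrative assumptions).
def segment_latency(encode_s: float, segment_s: float, transfer_s: float,
                    buffered_segments: int) -> float:
    # Classic delivery: a segment must be complete before it leaves the packager,
    # and the player typically buffers a few full segments.
    return encode_s + segment_s + transfer_s + buffered_segments * segment_s

def chunked_latency(encode_s: float, chunk_s: float, transfer_s: float,
                    buffered_chunks: int) -> float:
    # Low-latency delivery: chunks are forwarded as soon as they are produced,
    # so the player buffer scales with the much shorter chunk duration.
    return encode_s + chunk_s + transfer_s + buffered_chunks * chunk_s

classic = segment_latency(encode_s=0.5, segment_s=6.0, transfer_s=0.2, buffered_segments=2)
low_lat = chunked_latency(encode_s=0.2, chunk_s=0.2, transfer_s=0.1, buffered_chunks=3)
print(f"classic ≈ {classic:.1f} s, low latency ≈ {low_lat:.1f} s")   # 18.7 s vs 1.1 s
```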
Follow Action / Follow Player
In this use case, we leverage the use of several cameras, so that the user can add to the main view a specific view that follows the action or a particular player of their choice. In a typical sports event setup, capture cameras are associated with data describing the capture angle and the position of the camera. By combining this with the real-time positions of the ball or the players, it is possible to select among the different captured views. Either the data is processed directly on the application server connected to the AR/VR devices, associating a view with players, action or ball, so that the application on the AR/VR device only needs to request the corresponding segments; or the AR/VR device itself processes the data to select the segments corresponding to the appropriate views. In practice, up to 8 cameras will be connected simultaneously via a private 5G network. The default fallback is the panorama capture view, used when no appropriate view can be selected.
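Server-side view selection could, for instance, compare each camera's pan direction with the bearing towards the tracked position coming from the real-time data. The sketch below uses a simplified, assumed geometry; a real selection would also consider camera metadata such as field of view, and fall back to the panorama view when no camera matches.

```python
# Minimal sketch of server-side view selection: pick the camera whose pan
# direction points most directly at the tracked position. The camera layout,
# field coordinates and metadata format are illustrative assumptions.
import math

def best_camera(cameras: dict[str, tuple[float, float, float]],
                target_xy: tuple[float, float]) -> str:
    """cameras maps id -> (x, y, pan_deg); returns the id whose pan direction
    has the smallest angular error towards the target position."""
    def angular_error(cam: tuple[float, float, float]) -> float:
        x, y, pan_deg = cam
        bearing = math.degrees(math.atan2(target_xy[1] - y, target_xy[0] - x))
        return abs((bearing - pan_deg + 180.0) % 360.0 - 180.0)
    return min(cameras, key=lambda cam_id: angular_error(cameras[cam_id]))

cameras = {
    "cam1": (0.0, 0.0, 45.0),         # corner camera panned towards midfield
    "cam2": (105.0, 0.0, 135.0),      # opposite corner
    "panorama": (52.5, -20.0, 90.0),  # wide fallback view at the halfway line
}
player_position = (30.0, 25.0)        # taken from the real-time tracking data
print(best_camera(cameras, player_position))   # "cam1" for this example layout
```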
Highlights detection with a single camera or multiple cameras
A processor analyses the video contained in a buffer and extracts highlights, each identified by a type and a time code. These are transmitted to the server communicating in real time with the application on the rendering device. The application can either display a notification to the user or show a video highlight in full view or as a side video.
Two approaches are considered:
- Using a single camera: given that only one raw video capture is available in this use case, the detection of highlights may be much more limited than if several cameras were involved or a fully produced signal with layout were available. Real-time data from an external source may be combined to improve the types of highlights detected, and the timecode, together with the event type, is delivered to the server permanently connected to the AR/VR device.
- Using multiple cameras: the single-camera approach is easy to integrate and set up, but it is very challenging to achieve a fair detection and recognition of the full list of highlights of a competition or game with it. This approach instead uses several camera views at once, allowing more types of highlights to be detected and playback from multiple angles. In the simplest case, one camera steadily captures the score and digital display on site, or provides a steady bird's-eye view. Combining real-time data from external sources, or additionally processing a fully produced signal with layout, will increase the list of supported highlights. A minimal fusion sketch follows this list.
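When several cameras (or an external data source) report the same moment, the detections need to be fused into a single highlight. The sketch below groups detections of the same type whose timecodes fall within a short window; the window and the tuple layout are illustrative assumptions.

```python
# Minimal sketch of fusing per-camera detections: detections of the same type
# whose timecodes fall within a short window count as one highlight. The window
# and the tuple layout (type, timecode, camera id) are illustrative assumptions.
from collections import defaultdict

def fuse_detections(detections: list[tuple[str, float, str]],
                    window_s: float = 2.0) -> list[tuple[str, float, list[str]]]:
    """Group (event_type, timecode, camera_id) tuples into fused highlights."""
    by_type: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for event_type, timecode, camera_id in sorted(detections, key=lambda d: d[1]):
        by_type[event_type].append((timecode, camera_id))

    fused = []
    for event_type, items in by_type.items():
        group = [items[0]]
        for timecode, camera_id in items[1:]:
            if timecode - group[-1][0] <= window_s:
                group.append((timecode, camera_id))   # same event seen by another camera
            else:
                fused.append((event_type, group[0][0], [c for _, c in group]))
                group = [(timecode, camera_id)]
        fused.append((event_type, group[0][0], [c for _, c in group]))
    return fused

raw = [("goal", 1324.4, "cam1"), ("goal", 1325.1, "cam4"), ("foul", 2210.0, "cam2")]
print(fuse_detections(raw))
# [('goal', 1324.4, ['cam1', 'cam4']), ('foul', 2210.0, ['cam2'])]
```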
Architecture
The overall system, depicted in Figure 2, is based on an all-IP architecture that does not depend on specific interfaces and in which components can be added and/or replaced by others without changing the architecture. It is composed of four main functional parts: contribution, detection, transcoding, and AR/VR distribution. It grabs video/audio/data streams from 5G user equipment in a 5G standalone network that is completely independent of the public telecom network. Raw video signals are compressed into streams using the HEVC or AVC standards. Additional rich information can be collected according to the type of event, such as the score, the playing time and various relevant statistics. Then, from all streams, highlights can be detected and sent to the distribution and AR application to specify the clips that can be made available for user delivery. The communication protocols are selected as follows: RTP/RTSP for contribution at the output of the 5G base station and CMAF for distribution. As the Common Media Application Format (CMAF) is a versatile standard designed for the encoding and packaging of segmented media content, it facilitates the delivery and decoding of media segments on end-user devices as part of adaptive multimedia streams.
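As a purely illustrative bridge between those protocol choices, the sketch below drives ffmpeg to pull an RTSP/RTP contribution feed, encode it to HEVC with low-delay settings, and package it as CMAF fragments for low-latency DASH. The project does not prescribe ffmpeg, and the URL, paths and option values are assumptions.

```python
# Hedged illustration only: ffmpeg as a stand-in encoder/packager for the
# RTSP/RTP contribution -> HEVC -> CMAF/DASH path (URL and values are assumed).
import subprocess

command = [
    "ffmpeg",
    "-rtsp_transport", "tcp",                    # RTP/RTSP contribution feed
    "-i", "rtsp://contribution.example/cam1",    # hypothetical camera endpoint
    "-c:v", "libx265", "-preset", "ultrafast", "-tune", "zerolatency",  # HEVC, low delay
    "-c:a", "aac",
    "-f", "dash",                                # CMAF fragments, DASH manifest
    "-seg_duration", "1", "-streaming", "1", "-ldash", "1",
    "-use_template", "1", "-use_timeline", "0",
    "out/manifest.mpd",
]
subprocess.run(command, check=True)
```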
The streams will be distributed by encoding in multiple bitrates at low latency, chunked transfer between the encoder and the packager, low-latency packaging, and chunked transfer of the fragments to the CDN. These fragments are standard low-latency HLS or DASH fragments that can be parsed and decoded by a player on the rendering client.
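On the player side, adaptive streaming means choosing a rendition from the encoded bitrate ladder according to the measured throughput. The ladder and the selection rule below are illustrative assumptions, not the project's configuration.

```python
# Minimal sketch of rendition selection over an assumed bitrate ladder; the
# ladder values and the safety margin are illustrative, not project settings.
ABR_LADDER = [
    {"name": "1080p", "bitrate_kbps": 6000},
    {"name": "720p",  "bitrate_kbps": 3000},
    {"name": "480p",  "bitrate_kbps": 1500},
    {"name": "360p",  "bitrate_kbps": 800},
]

def pick_rendition(measured_throughput_kbps: float, safety: float = 0.8) -> dict:
    """Choose the highest rendition that fits within a safety margin of the
    measured throughput; fall back to the lowest rung otherwise."""
    budget = measured_throughput_kbps * safety
    for rendition in ABR_LADDER:          # ladder is ordered from highest to lowest
        if rendition["bitrate_kbps"] <= budget:
            return rendition
    return ABR_LADDER[-1]

print(pick_rendition(4200))   # {'name': '720p', 'bitrate_kbps': 3000}
```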