Showing word clouds in AR

What I made

I made an AR camera application that detects an object through the camera and displays its name and related words in augmented reality.

Code here:

https://github.com/shibuiwilliam/ARWithWord

Why I made it

Many AR applications display objects or images in space, like Google’s AR search.

Although there are plenty of AR applications that show visual effects through the camera, I see few that deal with natural language. Those visual applications are brilliant, but I also believe in the power of words: people recognize the world through language, and AR is one representation of the world. So there may be something words can do in the AR world.

So I decided to make a mock application that shows a word cloud in AR.

What is a word cloud

A word cloud is a form of data visualization often used in natural language processing. It summarizes a text by drawing its important words larger on a plane, like below.

https://boostlabs.com/wp-content/uploads/2014/09/word-cloud-V3-1.png

You can understand the context of a text just by looking at its word cloud.

While most word clouds are made as 2-dimensional images, I thought of making one in 3-dimensional AR space.

How I made it

Overview

The AR application is built with Unity and AR Foundation. I use a tiny YOLO v3 model to run object detection on the camera image and retrieve the name of an object. The related words for the object are obtained from a fastText API deployed on Kubernetes in GCP. So I made a Unity-based AR application and a container-based REST API on Kubernetes with fastText. It is a combination of AR, Edge AI, and server-side AI.

AR

I used Unity with AR Foundation to develop the AR application.

Object detection

Object detection runs on Unity Barracuda with tiny YOLO v3 loaded. I first tried Unity Perception for the runtime, but I couldn’t get it installed.

The model is a pretrained ONNX tiny YOLO v3 trained on the COCO dataset. So the only things the model can detect are the 80 COCO categories in the list.

Natural language processing

I used fastText to find related words for the detected object. fastText is a word vector model by Facebook that can be used to search for words at a close vector distance. The model I used is pretrained on Wikipedia.
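To make "close vector distance" concrete, here is a minimal sketch of a similar word search using cosine similarity, which is a common way to compare word vectors. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`, `TOY_VECTORS`) are made up for illustration; real fastText vectors have hundreds of dimensions, and this is not the code from the repository.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=3):
    """Rank every other word in `vectors` by similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for this example (assumption, not real fastText data).
TOY_VECTORS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "kitten": [0.95, 0.7, 0.05],
    "car": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    for neighbor, score in most_similar("cat", TOY_VECTORS, top_n=2):
        print(f"{neighbor}: {score:.3f}")
```

With these toy values, "kitten" and "dog" rank above "car" for the query "cat", which is the kind of ranking the word cloud is built from.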

Since the vector model is huge, several GB, and too large to load on a smartphone, I made a REST API server to serve the fastText similar word search.

In more detail

Unity

The code I developed in Unity can be found in the repository below.

https://github.com/shibuiwilliam/ARWithWord/tree/main/ARWithWord

The Unity application is based on a 3D project.

I added some packages to make AR and Edge AI work in Unity.

- AR Foundation

- Barracuda: See here. If you cannot find the package in the Package Manager, you can install Barracuda with `Add package from git URL` and `com.unity.barracuda`.

- Android Logcat: Logging tool for Android via Unity.

After installing the packages, I added some AR-related components: `AR Session Origin`, `AR Raycast Manager`, `AR Anchor Manager`, and `AR Plane Manager`. In addition, I made `ObjectDetector` for object detection and `SimilarWordClient` for sending requests to the fastText API.

The spawning code for the AR Session Origin is in `SpawnManager.cs`.

The workflow in the code is like this:

The processes are asynchronous and connected with queues. I chose asynchronous queues so that object detection and similar word requests do not affect UI latency.
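The queue-connected design can be sketched as two workers linked by queues. The actual application is written in C# for Unity, so this Python version only illustrates the pattern; `detect_object` and `fetch_similar_words` are hypothetical stand-ins for the on-device detector and the REST call, and the frame dictionaries are invented sample data.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def detect_object(frame):
    """Hypothetical stand-in for the on-device object detector."""
    return {"label": frame["label"], "position": frame["position"]}


def fetch_similar_words(label):
    """Hypothetical stand-in for the similar word REST request."""
    return [f"{label}-related-{i}" for i in range(2)]


def detection_worker(frames, detections):
    """Consume camera frames and enqueue detection results."""
    while True:
        frame = frames.get()
        if frame is STOP:
            detections.put(STOP)  # propagate shutdown downstream
            break
        detections.put(detect_object(frame))


def similar_word_worker(detections, results):
    """Consume detections and enqueue them together with related words."""
    while True:
        detection = detections.get()
        if detection is STOP:
            results.put(STOP)
            break
        words = fetch_similar_words(detection["label"])
        results.put({**detection, "words": words})


def run_pipeline(frames_in):
    """Push frames through both workers and collect the final results."""
    frames, detections, results = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=detection_worker, args=(frames, detections)),
        threading.Thread(target=similar_word_worker, args=(detections, results)),
    ]
    for worker in workers:
        worker.start()
    for frame in frames_in:
        frames.put(frame)
    frames.put(STOP)

    collected = []
    while True:
        item = results.get()
        if item is STOP:
            break
        collected.append(item)
    for worker in workers:
        worker.join()
    return collected


if __name__ == "__main__":
    print(run_pipeline([{"label": "cup", "position": (0.0, 0.1, 0.5)}]))
```

Because each stage only blocks on its own queue, a slow detection or network call delays later results but never the thread that feeds frames in, which is the property the UI needs.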

The image from the AR camera can be obtained in the `OnCameraFrameReceived` method. You can refer to the official document for how it works.

Once you touch the smartphone display, the object detector runs prediction on the latest camera image. The application retrieves the position and the label of the object, and enqueues the data to be used in the similar word search.

In the object detection, the `TransformInput` method converts the image into a Tensor and resizes it. `PeekOut` obtains the prediction from the object detection. You can inspect the network layers of the object detection ONNX model in Unity.

The `SimilarWordClient` sends a request to the fastText API in the Kubernetes cluster. I used UnityWebRequest to implement the client. Its input and output are both JSON.
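As an illustration of a JSON-in, JSON-out exchange, the sketch below builds a request body and parses a response body. The field names (`word`, `topn`, `similar_words`, `similarity`) are hypothetical and chosen only for this example; the real schema is defined in the repository, and the real client is C# using UnityWebRequest rather than Python.

```python
import json


def build_request_body(label, top_n=10):
    """Serialize a similar word query (hypothetical field names)."""
    return json.dumps({"word": label, "topn": top_n})


def parse_response_body(body):
    """Turn a JSON response into a list of (word, similarity) pairs."""
    payload = json.loads(body)
    return [
        (item["word"], float(item["similarity"]))
        for item in payload["similar_words"]
    ]


if __name__ == "__main__":
    print(build_request_body("cup", top_n=2))
    # Sample response invented for this example.
    sample = '{"similar_words": [{"word": "mug", "similarity": 0.8}]}'
    print(parse_response_body(sample))
```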

Kubernetes and similar word search REST API

You can find the source code for the backend in the repository.

https://github.com/shibuiwilliam/ARWithWord/tree/main/backend

I used a pretrained fastText model. The model file is 4GB zipped and 7GB extracted, and needs 15GB of RAM to be loaded into memory. It is clearly too big, so I reduced the vector dimension from 300 to 100. The new model is 2GB in size and needs 4GB of memory. You can find the dimension reduction procedure in the document.

For the REST API server, I chose FastAPI. I have tried many Python web frameworks, and found that FastAPI strikes the best balance between ease of development and performance.

The REST API server is made to be deployed on GKE, the managed Kubernetes service on GCP. The fastText model is stored in my GCP storage and is downloaded each time the server starts.
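A rough back-of-the-envelope check shows why cutting the dimension to a third cuts the file to about a third. The sketch assumes the model is dominated by 32-bit float matrices whose size grows linearly with the dimension; the row count below is an assumption picked so that the 300-dimension estimate lands near the 7GB reported above, not a measured value.

```python
def matrix_size_gb(rows, dim, bytes_per_value=4):
    """Size in GB of a dense rows x dim matrix of fixed-width values."""
    return rows * dim * bytes_per_value / 1e9


# Assumed total number of vector rows across the model's matrices
# (hypothetical figure, chosen only to match the reported 7GB).
ASSUMED_ROWS = 6_000_000

full_gb = matrix_size_gb(ASSUMED_ROWS, 300)
reduced_gb = matrix_size_gb(ASSUMED_ROWS, 100)

if __name__ == "__main__":
    print(f"300 dimensions: {full_gb:.1f} GB")
    print(f"100 dimensions: {reduced_gb:.1f} GB")
    print(f"ratio: {full_gb / reduced_gb:.1f}x")
```

Under these assumptions the estimate goes from about 7.2GB to about 2.4GB, in line with the 7GB to 2GB change described above.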

Finally

It looks like this.

So why I made it

The application is not made to solve a specific problem. Rather, I aimed to make an example that combines AR, Edge AI, and natural language processing, bringing AI for images and for natural language into the AR world.

I believe AI for images is a good fit for AR, and since human beings recognize the world through words, I am sure there is great potential in bringing natural language into AR to expand reality.

Thank you!

https://github.com/shibuiwilliam/ARWithWord

technical engineer of cloud, container, Kubernetes, ML, and AR.
