About
- TLDR;
- Motivation
- 3D simulation
- Modeling 3D environment in Blender
- Adding game controls in Unity
- Bridging simulation to browser (simulation-side)
- Client-side integration
- Bridging simulation to browser (browser-side)
- Loading Unity script
- Answer simulation question
- Server-side API
- Request handling
- Machine Learning
- Data
- Training
- Results
- References
- Authors
- Frederick Roman
- Homero Roman
TLDR;
Navlead is an AI-powered assistant chatbot for navigation in 3D environments. The simulation, the chatbot, and their deployment online are all part of this project.
The simulation has custom 3D assets, a chat dialog box, motion controls, and game logic. The chatbot is driven by a custom NLP model with a seq2seq architecture, trained on the CVDN dataset. The browser client loads the simulation and mediates its interactions with the API server. The API server authenticates the request and runs the model.
Architecture overview: Simulation → Front-end (browser client) → Back-end (API server) → Machine learning (navigation guide)
Motivation
Inspired by the rise of VR and AI, this project aims to combine both into a navigation assistance system that runs in the browser.
3D simulation
Modeling 3D environment in Blender
The 3D virtual environment used in the demo is based on Frank Lloyd Wright's final design: the Circular Sun House.
It is also known as the Norman Lykes House and it is located in Phoenix, Arizona. This is not a typical house. Despite its name, we think the Circular Sun House resembles a snail more than anything else: the head is the long corridor that links the suites of the private area, and the shell is the public area, with a kitchen and an office above it. Here is a tour of the house by Architectural Digest (you be the judge):
So why did we choose this odd house for the demo? Simply put, we needed data to train the navigation AI and the CVDN dataset includes this house.
Now to the actual modeling.
First we had to decide how much of the house we needed (and wanted) to model. To that end, we did a quick 3D sketch of the floor plan with Voxeldesk. Voxeldesk is an online voxel-art editor that makes 3D art (such as this sketch) easy to create, and it was also developed by us. Here is our first draft of the Circular Sun House layout.
Granted, this is a very rough sketch, but it made us realize that we didn't need the exterior of the house or even the ceiling. It also made us realize that building such a curvy house would be too hard with voxels. Since we needed a bigger boat, we got Blender.
Blender is a free and open-source, professional-level 3D editor with a strong community of developers. For these reasons, Blender is the main tool for modeling the 3D house in this demo. The walls, floor, and objects throughout the house are modeled in detail, and we carefully placed the 3D objects to match the locations shown in the CVDN dataset. We modeled with two goals in mind:
For navigation
• The doorways are marked and there are no doors
• The floor has a distinctive color so users can learn to distinguish walkable surfaces.
• The placement of the furniture avoids leaving tight spaces where you could get stuck.
For efficiency
• The house and objects are low-poly to optimize for space, download, and rendering time over the web.
• The underlying structure of repeated objects is linked, rather than copied, across those objects.
This is what creating a bedroom in Blender looks like:
And this is how the house looks from above:
Adding game controls in Unity
Our simulation wouldn't be very useful if we couldn't navigate the scene. For this we used Unity.
Unity is a cross-platform game engine that allows us not only to interact with the environment but also to add physical properties such as gravity and rigid-body collision. Below is a shot of what the starting room of the house looked like in Unity:
With Unity, we wanted to achieve two goals: add 3D motion controls and a 2D UI. The 2D UI would allow us to select a target and ask for guidance along the journey.
Motion Controls
For the player's movement, the player object has a script which updates the player's position based on the WASD keys. This movement is smoothed out over the frame rate at the Update
runtime hook. The code for player movement is as follows:
PlayerMove.cs
Here _charController
is the CharacterController component attached to the player object. This component allows the player's movement to be constrained by object collisions and to react to gravity like a rigid body.
Unity 2D UI
Navlead makes use of several 2D panels overlaid on top of the display canvas. Each of these panels provides a visual interface to functionality needed by the simulation. In total, there are 5 main simulation states, as illustrated by the diagram below:
Before the simulation starts, the Splash Screen is active and shows the camera view of the living room as the camera looks around the room. Then, when the user clicks on Start, the Target Popup becomes active for selecting a target. Clicking on Ready starts the simulation and places the player at the start position of the Main Scene. From there, the player may open the Dialog Popup to ask questions, or click on the reset button, which prompts the user to select a new target. Targets have triggers attached to them, so when the user reaches the desired target, the Congratulations Panel congratulates the player and gives the option to restart the simulation.
In order to manage the state of the simulation, a central controller of type UIController
has a reference to the UI panels and is able to open or close them according to different actions in the simulation. However, there is no explicit notion of a simulation state, meaning that the controller does not explicitly store which state the simulation is currently in. As shown in the diagram above, except for the splash screen, the panels can be opened from the main scene and lead back to it.
The Splash Screen
When the simulation loads, the first screen that the user can interact with is the splash screen. This panel, branded with the Navlead logo, has a single button for starting the simulation. The background of the splash screen shows what the 3D environment will look like even before the user decides to click on Start. This rotating background is achieved by having a camera in the simulation rotate inside the living room.
As shown here, the splash screen camera has a fixed position but a constant rate of rotation, showing what it looks like to look around inside the living room. The splash screen panel is a 2D UI, so the camera view is loaded in the background through a special kind of image called a “Render Texture” which, as the name suggests, is rendered at runtime.
The Target Popup
Once the user clicks on Start, the next panel that the user interacts with is the Target Popup. It is at this point that the user can choose the target for the simulation. Each target is shown as a rotating 3D object using the same “Render Texture” technique described above for the splash screen. The key piece of functionality of the Target Popup is the dropdown with a list of all the targets.
The Target Popup saves a private variable with the name of the current target selected. By default, the selected target is “Bed 1.” In order for this variable to be updated when the user clicks on an item in the dropdown, we attach a listener to the dropdown for the onValueChanged
event. This way, whenever the user changes the selected value of the dropdown, the corresponding variable for the current target is also updated. Clicking on Ready at this point closes the Target Popup and opens the Main Scene.
The Main Scene
Most of the action happens in the Main Scene. Here the user can use the mouse (or drag on a mobile device) to look around the place. To move, the player can use the WASD keys or the arrow buttons labelled with their corresponding keys.
The arrow movement keys are semi-transparent so as to not obstruct the view while still being big enough for a user to tap with their fingers on a mobile device. The Main Scene also contains a button on the upper left for resetting the simulation. This also has the effect of opening up the Target Popup, since a target needs to be selected in order to start the simulation. The Main Scene also contains a button on the upper right for opening up the Dialog Popup. It is in the Dialog Popup that the user can ask questions to NavLeadNet at any point during the simulation. The Dialog Popup provides a close button for returning to the Main Scene so the user can continue the search for the target. When the player is close enough to the target, a corresponding trigger activates, opening up the Congratulations Panel.
The Dialog Popup
The user is able to ask questions to NavLeadNet through the Dialog Popup.
This panel contains a text input field for entering questions. On a computer with a keyboard, the user can type their question and then send it by clicking on the airplane-shaped send button. On a mobile device, the user can tap on the keyboard icon to open an on-screen keyboard and tap one letter at a time. Pressing enter on the keyboard has the same effect as clicking on the send button. Once the question is submitted, Unity sends the GameState to the front-end, which requests an answer from the Django backend. The area below the text input field shows the conversation history with the latest question and answer at the top. While Unity is still waiting for the response from NavLeadNet, the answer has the placeholder “… waiting for response …” As more questions and answers are added to the conversation history, a sidebar appears to allow the user to scroll down to older questions. The Dialog Popup can be closed by clicking on the X button in the upper right corner. Closing the dialog still preserves the conversation history in case the user wishes to go back to it.
The Congratulations Panel
Finally, in order to notify the user that they have found the target, the Congratulations Panel is a semi-transparent screen that displays the message “Congratulations! You found the target!” and enables the user to go back to the Target Popup by exposing the Play Again button.
As mentioned above, this Congratulations Panel is activated whenever the player enters the trigger surrounding the target object. Each target has a trigger that can check whether the player has entered it. The code for doing precisely this is as follows:
Trigger.cs
Triggers expose the OnTriggerEnter
runtime hook that runs when the trigger detects that the player has entered it. In this code snippet, the trigger checks whether it is in charge of the current target. If so, it calls the UIController
to display the Congratulations Panel.
Unity Keyboard
In order to enable mobile devices to input text in the Dialog Popup, we implemented our own custom keyboard. This keyboard works with clicks and taps, and scales to different screen sizes.
The keyboard is implemented by having an input button for each key. A UI panel called “Keyboard Panel” acts as the parent of all the keys and holds a list with references to them. Each key can have one of three states and saves, as a private string, the value it takes for each state. The Keyboard Panel takes care of managing the state of the entire keyboard. The following is the script attached to each key:
Key.cs
As mentioned before, each key can have one of three states: the default, when the shift key is pressed, and when the numbers key is pressed. On Awake, each key saves a reference to its parent keyboard and its child text. When a key is pressed, the KeyPressed
function sends the name of the key to the Keyboard parent for handling. The key also exposes functions for setting its text to one of the 3 states mentioned above. At the parent Keyboard, the ProcessKey
function takes care of updating the state of the keyboard, adding or removing characters from the input field according to the key that was pressed.
ProcessKey.cs
The key <~ stands for delete and calls the RemoveInputChar
function to delete the last character in the input field. The “shift” key prompts the keyboard to loop through the list of keys and call the function SetToShiftText
to update each key to the shift state. Similarly, when the key ?12 is pressed, the letters in the keyboard are replaced with numbers and the ?12 key becomes the abc key for restoring the keyboard to its default state. The following image shows the keyboard in the state when the numbers key is pressed.
Finally, pressing enter has the same effect as submitting the question and closing the keyboard.
Bridging simulation to browser (simulation-side)
Now that we have a simulation, we need to publish it, and one way to do that is to serve it through the browser.
Fortunately, Unity can compile to run on web browsers. It does so using WebGL and WebAssembly. Virtual and augmented reality are two emerging technologies that greatly benefit from WebGL and WebAssembly for running applications on the web.
The compiled simulation sends and receives messages to and from the browser as follows:
When a user clicks the send button to submit a question, the OnSubmitQuestion
callback is called. This function adds the question to the current list of questions. Then, it collects the state of the game in a data class called GameState
. The GameState
consists of the current question, the target, the location, and the direction vector of the player.
OnSubmitQuestion.cs
In order to send the game state to the JavaScript front-end for processing, the function sendGameStateToJS
transforms the GameState into a JSON string representation and calls on the SendToJS bridge between Unity and the front-end.
sendGameStateToJS.cs
The bridge is written in “jslib”, which stands for JavaScript Library and is a subset of the JavaScript language that Unity can expose to the rest of the C# code. Functions in jslib can access the browser window of a WebGL-enabled application and make calls to functions in the webpage. It is through this ability that “SendToJS” is able to call the processGameStateFromUnity
function in the webpage front-end.
SendToJS.jslib
Once the game state along with the question has been sent to the Next.js front-end, the request needs to be processed, sent to the Django backend, and then the text answer needs to be sent from the webpage front-end back into Unity. In order for Unity to receive the answer, the game controller exposes the callback “onAnswerFromJS(string answer)”, which, as the name suggests, takes care of receiving the answer and instructing the Dialog Popup to update the list of questions and answers to include the new answer.
receiveAnswerFromJS.cs
Client-side integration
Once the simulation is finished, we must bring it to the browser. Since we decided to embed it in a full-fledged web site, we needed a front-end UI framework, and we chose React. React is a free and open-source front-end JavaScript library for building user interfaces based on UI components.
Bridging simulation to browser (browser-side)
After we built the Unity simulation (with the official WebGL template), we established communication between it and the browser by passing and receiving messages through the window object. We accomplished that by adding the following lines to script.js (Unity's bootstrap loader).
script.js (inserted lines)
Loading Unity script
To run the Unity simulation in the browser, we must first load the Unity bootstrap loader script (script.js). We accomplish this by appending the script asynchronously to the document's body when the Simulation component is mounted (as suggested by this article).
Since the Unity bootstrap loader script appends the simulation build scripts, we should remove them when the Simulation component is unmounted. We can implement all that in the following custom React hook:
useLoadUnityScript.tsx
Answer simulation question
When the user (traveler) asks for navigation guidance, the inquiry itself is sent to a remote ML API service through the NavGuideService.
useNavGuide.tsx
Server-side API
Request handling
The NavGuideService acts as a proxy that sends a request like this to the ML REST API server:
POST /navigationGuidance
Body: {
  Traveler: {
    Question: string;
    Target: string;
    Location: { x: number, y: number, z: number }
  },
  ...,
  API_key: string
}
After validation and authentication, the server runs NavLeadNet to infer the answer to the traveler's question given their location and target. The response looks as follows:
Response: {
  ...
  answer: string
}
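The Django view that backs this endpoint is not shown here; the following is a minimal sketch of how such a handler could look. The view name, the settings key, and the _StubNavLeadNet wrapper are illustrative assumptions, not the project's actual server code.

```python
# A minimal sketch of an authenticated inference endpoint in Django.
import json

from django.conf import settings
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST


class _StubNavLeadNet:
    """Stand-in for the trained model; the real inference code is not shown."""

    def answer(self, question, target, location):
        return "go through the doorway and the target is on your left"


navlead_net = _StubNavLeadNet()


@csrf_exempt
@require_POST
def navigation_guidance(request):
    body = json.loads(request.body)

    # Authenticate the caller before doing any inference.
    if body.get("API_key") != settings.NAVLEAD_API_KEY:
        return JsonResponse({"error": "invalid API key"}, status=401)

    # Validate the traveler payload forwarded by the front-end proxy.
    traveler = body.get("Traveler", {})
    question = traveler.get("Question")
    target = traveler.get("Target")
    location = traveler.get("Location")
    if not (question and target and location):
        return JsonResponse({"error": "missing fields"}, status=400)

    # Run the model to infer an answer for this question and game state.
    answer = navlead_net.answer(question, target, location)
    return JsonResponse({"answer": answer})
```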
Machine Learning
Data
Base dataset (CVDN)
NavLead is trained using the text dialogs from the CVDN dataset. CVDN is a dataset of more than 2,000 human-human dialogs situated in simulated, photorealistic home environments. In each dialog, two people work together to reach a target. The first person is the Navigator, who can move around the 3D environment as well as ask questions to the Guide. The Guide is also a person and observes the 3D environment through the screen of the Navigator, but unlike the Navigator, the Guide does not interact with the environment. Instead, the Guide has access to pictures of the environment along the next steps of the shortest path towards the target. The Guide's role, then, is to provide good answers to the Navigator's questions so the Navigator can reach the target as quickly as possible. The video below shows a full demo of this interaction:
The following is an example of a single question-answer exchange with “picture” as a target:
Augmenting the dataset (CVDN)
To generate more data for training, the CVDN dataset is augmented during training with paths having random start and end locations. On each training iteration, we train once with an augmented data point and then with the original data point. The augmented data point is generated by taking the original data point at that iteration, randomly selecting a location with a target in the same environment, and then randomly selecting a starting location. Non-connected locations are excluded. Once start and end locations are selected, a set of text instructions is generated from the shortest path by mapping the movements from location to location to the closest cardinal direction. The possible set of generated instructions per step is illustrated in the following diagram:
Diagram showing the mapping from vector direction to word instructions (relative to the traveler)
Synthetic dialog creation for random initial location and target
Continue forward
Continue forward
Continue forward
go to your left
go to your right
go to your left
go to your left
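To make the mapping above concrete, here is a minimal sketch of how a path of locations could be turned into word instructions relative to the traveler's heading. The function names, angle thresholds, and the 2D simplification are our own illustration, not the project's actual augmentation code.

```python
# Turn a path of 2D locations into instructions relative to the traveler.
import math


def instruction_for_step(heading, move):
    """Classify a movement vector relative to the current heading."""
    # Signed angle (degrees) between the heading and the movement direction.
    angle = math.degrees(
        math.atan2(move[1], move[0]) - math.atan2(heading[1], heading[0])
    )
    angle = (angle + 180) % 360 - 180  # normalize to [-180, 180)
    if abs(angle) <= 45:
        return "Continue forward"
    if 45 < angle < 135:
        return "go to your left"
    if -135 < angle < -45:
        return "go to your right"
    return "turn around"


def path_to_instructions(points, initial_heading=(0.0, 1.0)):
    """Map each hop of a path (list of (x, y) points) to an instruction."""
    heading, instructions = initial_heading, []
    for start, end in zip(points, points[1:]):
        move = (end[0] - start[0], end[1] - start[1])
        instructions.append(instruction_for_step(heading, move))
        heading = move  # the traveler now faces the direction just walked
    return instructions


print(path_to_instructions([(0, 0), (0, 1), (0, 2), (-1, 2)]))
# ['Continue forward', 'Continue forward', 'go to your left']
```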
Training
Architecture (seq2seq with attention using LSTMs)
For training, NavLead takes as input the question as a string of characters along with the coordinates and the direction the Navigator is facing when asking the question. NavLead then outputs a sequence of words that corresponds to the inferred answer. NavLead achieves this by training a sequence-to-sequence (Seq2Seq) machine learning model. A Seq2Seq model takes in a sequence and outputs another sequence. It is typically used for language processing, and some applications include language translation and conversation generation. NavLead in particular implements a Seq2Seq model with attention using LSTMs. LSTM stands for Long Short-Term Memory, and LSTMs are especially powerful for processing long sequences of data like the words in a conversation. This is achieved by outputting a context vector at each step of the inference; the context vector from the previous step is then used to infer the next word.
Understanding the encoder-decoder sequence to sequence architecture
In particular, NavLead implements a Seq2Seq model in two main parts. The first part is called the Encoder and has the task of representing the environmental context. For NavLead, the environmental context is the question asked, the position and direction of the Navigator, and the features describing the next 5 steps along the shortest path. The encoder generates a representation by mapping these values into a context vector. This is achieved with an attention layer between the features describing the Navigator (along with the question) and the features describing the next 5 steps, followed by a bidirectional LSTM. In short, the encoder maps the question and the environment features into a context vector that is later used to initialize the decoder. The following diagram represents the Encoder:
With the context vector, the decoder now has information about the question, location, target, and the next 5 steps, which it can use to generate the answer. To accomplish this, the decoder uses another cycle of LSTMs with attention, where the attention layer captures the relationship between the context vector and the last word in order to generate the next word. The cycle starts with the <BOS> tag, which stands for “Beginning of Sentence,” to indicate that we are starting a new inference. The inferred word is then used as input to predict the next word, and so on until the decoder predicts <EOS>, “End of Sentence.” The following diagram represents the decoder:
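To ground the description above, here is a stripped-down PyTorch sketch of an encoder-decoder with attention: the encoder runs a bidirectional LSTM over the question (with the environment features prepended as a single pseudo-token), and the decoder generates the answer one word at a time while attending over the encoder states. All sizes, class names, and the feature handling are illustrative simplifications; the actual Speaker model handles the next-step features, batching, and masking differently.

```python
# A minimal encoder-decoder (seq2seq) with dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, vocab_size, ctx_size, embed_size=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.ctx_proj = nn.Linear(ctx_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True,
                            bidirectional=True)
        self.reduce = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, question_ids, ctx_features):
        # Prepend the projected environment features as a pseudo-token.
        tokens = self.embed(question_ids)                        # (B, T, E)
        ctx = self.ctx_proj(ctx_features).unsqueeze(1)           # (B, 1, E)
        outputs, _ = self.lstm(torch.cat([ctx, tokens], dim=1))  # (B, T+1, 2H)
        memory = self.reduce(outputs)                            # (B, T+1, H)
        h0 = memory.mean(dim=1)                                  # init decoder state
        return memory, h0


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.cell = nn.LSTMCell(embed_size, hidden_size)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def step(self, word_id, state, memory):
        h, c = self.cell(self.embed(word_id), state)
        # Dot-product attention over the encoder states.
        scores = torch.bmm(memory, h.unsqueeze(2)).squeeze(2)    # (B, T+1)
        weights = F.softmax(scores, dim=1)
        attended = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        logits = self.out(torch.cat([h, attended], dim=1))       # (B, vocab)
        return logits, (h, c)


# Tiny smoke test on random data; token id 1 plays the role of <BOS>.
vocab_size, ctx_size, B, T = 100, 10, 4, 7
encoder, decoder = Encoder(vocab_size, ctx_size), Decoder(vocab_size)
memory, h0 = encoder(torch.randint(3, vocab_size, (B, T)),
                     torch.randn(B, ctx_size))
state = (h0, torch.zeros_like(h0))
word = torch.full((B,), 1, dtype=torch.long)
for _ in range(5):                      # greedy decoding for a few steps
    logits, state = decoder.step(word, state, memory)
    word = logits.argmax(dim=1)
```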
Training Configuration
NavLeadNet is written in PyTorch, an open-source machine learning framework.
NavLeadNet has a sequence-to-sequence architecture and is trained with the following configuration:
| Hyperparameter | Value |
| --- | --- |
| Optimizer | RMSprop |
| Learning Rate | 0.0001 |
| Dropout | 0.5 |
| Loss | Cross entropy with teacher forcing |
The loss function used is cross entropy with teacher forcing, where a good text answer is one that is as close as possible to the text answer provided by the human Guide. For example, for a vocabulary of 4 words [left, right, up, down], an untrained model would predict uniform probabilities of [1/4, 1/4, 1/4, 1/4]. With teacher forcing, a single word is the correct answer, say the word “right.” This means that the correct distribution is [0, 1, 0, 0]. The cross-entropy loss for this toy example is therefore -(0·log(1/4) + 1·log(1/4) + 0·log(1/4) + 0·log(1/4)) = -log(1/4) = log(4) ≈ 1.39. Since the goal is to minimize the loss, this has the effect of favoring predictions where the model puts more probability on the right answer.
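As a quick sanity check, the toy example can be computed directly (a small illustration, not project code):

```python
# Cross entropy is the negative log probability the model assigns to the word
# the human Guide actually said (the teacher-forced target).
import math

predicted = [0.25, 0.25, 0.25, 0.25]  # untrained model: uniform over [left, right, up, down]
true_dist = [0, 1, 0, 0]              # teacher forcing: "right" is the correct word

loss = -sum(t * math.log(p) for t, p in zip(true_dist, predicted))
print(loss)            # -log(0.25) = log(4) ≈ 1.386

# A confident, correct model drives the loss toward zero.
print(-math.log(0.9))  # ≈ 0.105
```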
Training loop
The CVDN dataset is split into 1299 training, 94 seen-validation, and 260 unseen-validation dialogs. As mentioned above, data augmentation is performed at runtime, interleaving training on augmented data with training on the original data at each iteration. The function trainVal
contains all the necessary steps for training and evaluation as follows:
trainVal.py
Here WorldCollection
refers to a data structure for storing information about each house in the CVDN dataset. The basic steps performed in trainVal are those of initializing the training environment, loading the houses into the WorldCollection
, loading the dialogs with help from the DialogBatcher
, reading the vocabulary for the tokenizer, and finally calling on trainSpeaker
with all this information to run the training loop for NavLead. In more detail, the function setup
sets the random seed for all random number generators used. Also, the WorldCollection
is a class that contains a dictionary that maps from world id to an instance of a World
. Here houses are referred to as instances of the World
class. Each World
is simply a collection of information about a particular house:
World.py
The two most important pieces of information that the World holds about a house are its name and the viewpoints in that house. While the 3D space in the house is continuous, the CVDN dataset divides the areas in the house into discrete locations called viewpoints. For instance, the entrance to a room can be a viewpoint. Each room can potentially have multiple viewpoints, so it is common for a house with many rooms to have hundreds of these viewpoints. In order to organize the information about each viewpoint, they are saved as instances of the Viewpoint
data class.
Viewpoint.py
As shown here, each Viewpoint
contains information about its location in 3D space and a list of neighbors as well as targets at that location. Of these, the list of neighbors is critical for being able to determine the path between two viewpoints. For example, when doing data augmentation, a function called getShortestPath
runs Dijkstra's algorithm and returns the list of viewpoints to traverse in order to reach the end viewpoint given a start viewpoint. Moreover, each viewpoint instance records the names of the targets at that location. Targets are recognizable 3D objects like a bed or a plant. A room may contain multiple targets or none at all.
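As an illustration of the structures just described, here is a minimal sketch of a viewpoint graph and a shortest-path search over it. The field names, the Euclidean edge weights, and the function signature are assumptions made for the example; the project's actual Viewpoint class and getShortestPath are not reproduced here.

```python
# A toy viewpoint graph with Dijkstra's algorithm over it.
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Viewpoint:
    viewpoint_id: str
    location: tuple                                   # (x, y, z) position in the house
    neighbors: list = field(default_factory=list)     # ids of directly reachable viewpoints
    targets: list = field(default_factory=list)       # e.g. ["bed 1", "plant"]


def get_shortest_path(viewpoints, start_id, end_id):
    """Dijkstra over the viewpoint graph; assumes the two viewpoints are connected."""
    dist, prev = {start_id: 0.0}, {}
    queue = [(0.0, start_id)]
    while queue:
        d, vid = heapq.heappop(queue)
        if vid == end_id:
            break
        if d > dist.get(vid, math.inf):
            continue  # stale queue entry
        for nid in viewpoints[vid].neighbors:
            step = math.dist(viewpoints[vid].location, viewpoints[nid].location)
            if d + step < dist.get(nid, math.inf):
                dist[nid] = d + step
                prev[nid] = vid
                heapq.heappush(queue, (d + step, nid))
    # Walk back from the end viewpoint to recover the path.
    path, vid = [end_id], end_id
    while vid != start_id:
        vid = prev[vid]
        path.append(vid)
    return list(reversed(path))


house = {
    "hall": Viewpoint("hall", (0, 0, 0), ["bedroom"]),
    "bedroom": Viewpoint("bedroom", (3, 0, 0), ["hall", "closet"], ["bed 1"]),
    "closet": Viewpoint("closet", (3, 2, 0), ["bedroom"]),
}
print(get_shortest_path(house, "hall", "closet"))  # ['hall', 'bedroom', 'closet']
```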
Looking back at the trainVal
function, as was mentioned before, the DialogBatcher
is a data structure that stores a list of dialogs for a particular set of houses. Since the training and validation sets are two separate sets of dialogs, an instance of DialogBatcher
is created for each. The dialogs themselves are organized by the Dialog
data structure as follows:
Dialog.py
In a few words, a Dialog
consists of a question, an answer, the world id and viewpoint where the question was asked, the target, and optionally the next steps taken towards the end viewpoint containing the target. The Dialogs are extracted from the question-answer exchanges in the CVDN dataset when the DialogBatcher is instantiated.
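For reference, a dialog record along these lines could be sketched as follows (field names are illustrative, not the project's actual Dialog class):

```python
# A sketch of the dialog record described above.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dialog:
    question: str
    answer: str
    world_id: str                            # which house the exchange happened in
    viewpoint_id: str                        # where the Navigator asked the question
    target: str                              # e.g. "picture"
    next_steps: Optional[List[str]] = None   # viewpoints on the path to the target
```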
Then, when the trainVal
has loaded the training data and validation data, trainSpeaker
takes care of actually running the training loop to train the NavLead neural network to produce answers from questions.
The full set of steps for training can be found in the trainSpeaker
function as follows:
trainSpeaker.py
The training of NavLead can be divided into 3 main parts.
1. Instantiation and setup of the model.
2. The training loop at each interval where we interleave training of the CVDN data with data augmented online at each iteration.
3. Validation at the end of each interval.
In the first part, we create an instance of Speaker
and assign it the worldsCollection
environment.
trainSpeaker.py (part 1)
As mentioned above, the Speaker
class is a sequence-to-sequence model that takes a text question as input and outputs a text answer. Under the hood, it is implemented as an encoder-decoder network. At each step, when walking inside the house, the encoder encodes the 3D location, the direction, the mobility, the nearby targets, the requested target, and finally the question into a context vector. This context vector is then used by the decoder to produce the text answer one word at a time. The direction is represented as a 2D unit vector whose origin is the current location. The mobility refers to 8 Booleans that indicate, for each cardinal direction, whether or not the player can move in that direction (North, South, West, East, Northwest, Southwest, Southeast, and Northeast). For instance, if all Booleans are true, then the player can move in all directions. In order to determine whether it is possible to move along a cardinal direction, we loop through the neighboring viewpoints, calculating the direction vector from the current location to each neighbor. With this vector, we can determine which cardinal direction is the closest match by comparing the angles between the vectors. The class method getCardinalDirectionAlongVector
takes care of this task as follows:
getCardinalDirectionAlongVector.py
The idea is that using the dot product we can extract the angle between the direction vector and a cardinal vector. By comparing against all cardinal vectors, we can decide which cardinal vector is closest and hence in which directions the player is able to move from a location.
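Here is a small self-contained sketch of that idea: pick the cardinal vector with the largest cosine similarity (i.e., the smallest angle) to the movement vector. The names and the 2D simplification are ours, not the actual class method.

```python
# Closest cardinal direction via dot products.
import math

CARDINALS = {
    "North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0),
    "Northeast": (1, 1), "Northwest": (-1, 1),
    "Southeast": (1, -1), "Southwest": (-1, -1),
}


def closest_cardinal(direction):
    """Return the cardinal whose angle to `direction` is smallest."""
    dx, dy = direction
    norm = math.hypot(dx, dy)
    best, best_cos = None, -2.0
    for name, (cx, cy) in CARDINALS.items():
        # cos(angle) = (a . b) / (|a| |b|); a larger cosine means a smaller angle.
        cos = (dx * cx + dy * cy) / (norm * math.hypot(cx, cy))
        if cos > best_cos:
            best, best_cos = name, cos
    return best


# A neighbor that is mostly "ahead and slightly to the right" on the map:
print(closest_cardinal((0.3, 1.0)))  # North

# Mobility: for each cardinal, is there a neighbor roughly in that direction?
neighbor_dirs = [(1, 0), (0.2, -1)]
mobility = {c: any(closest_cardinal(d) == c for d in neighbor_dirs)
            for c in CARDINALS}
```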
Going back to trainSpeaker, the second part contains the bulk of the training. At each iteration, a random batch of dialogs is selected for training using the getRandomBatch function provided by the DialogBatcher. As mentioned before, on each training step, convertToStepByStepInstructions takes the current dialogs and generates new step-by-step instructions between two random locations in the same house as the training dialogs. These augmented dialogs are used for training before training on the real dialogs at each iteration.
trainSpeaker.py (part 2)
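Since the original listing is not reproduced here, the following sketch only outlines the interleaving described above; the helper names follow the text, but their signatures are assumptions, not the actual code.

```python
# Interleave synthetic and real dialogs at every iteration.
def train_interval(train_step, get_random_batch, make_augmented, n_iters):
    """Train on synthetic dialogs first, then on the real CVDN dialogs."""
    for _ in range(n_iters):
        real_dialogs = get_random_batch()         # random batch of CVDN dialogs
        augmented = make_augmented(real_dialogs)  # synthetic dialogs, same houses
        train_step(augmented)                     # augmented data first ...
        train_step(real_dialogs)                  # ... then the original dialogs
```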
Then, in part 3, at the end of each interval, the model is evaluated on a set of random dialogs from the validation set. The BLEU score, loss, and accuracy are calculated using these validation dialogs. The BLEU score is a metric typically used in machine translation for comparing two text sequences. In our case, it is used to judge how close the generated text is to the reference human text answer. The BLEU score computes a modified precision metric using n-grams instead of matching word by word between the two sentences.
trainSpeaker.py (part 3)
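For reference, here is a small self-contained sketch of BLEU's modified n-gram precision (without brevity penalty or smoothing), which is the core of the metric described above; it is an illustration, not the project's evaluation code.

```python
# Modified n-gram precision, the building block of the BLEU score.
from collections import Counter
import math


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())


def bleu(reference, candidate, max_n=2):
    """Geometric mean of modified precisions up to max_n."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


ref = "go to your left and the bed is in the next room".split()
hyp = "go to your left the bed is in the room".split()
print(round(bleu(ref, hyp), 3))  # ≈ 0.88 with unigrams and bigrams
```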