Semi-Supervised Learning for Bilingual Lexicon Induction

This blog post contains an introduction to Unsupervised Bilingual Alignment and Multilingual Alignment. We also go through the theoretical framework behind Learning to Rank and discuss how it might help produce better alignments in a Semi-Supervised fashion.

This post was written with Gauthier Guinet and was initially supposed to explore the impact of learning to rank to Unsupervised Translation. We both hope that this post will serve as a good introduction to anyone interested in this topic.

I now work at Flaneer, feel free to reach out if you are interested!

Introduction


Example of the embedding of "Etudiant" in 6 dimensions.


Word vectors are conceived to synthesize and quantify semantic nuances, using a few hundred coordinates. These are generally used in downstream tasks to improve generalization when the amount of data is scarce. The widespread use and successes of these “word embeddings” in monolingual tasks have inspired further research on the induction of multilingual word embeddings for two or more languages in the same vector space.


Words embeddings in 2 different languages but in the same vector space.


The starting point was the discovery [4] that word embedding spaces have similar structures across languages, even when considering distant language pairs like English and Vietnamese. More precisely, two sets of pre-trained vectors in different languages can be aligned to some extent: good word translations can be produced through a simple linear mapping between the two sets of embeddings. For example, learning a direct mapping between Italian and Portuguese leads to a word translation accuracy of 78.1% with a nearest neighbor (NN) criterion.


Example of a direct mapping and translation with a NN criterion for French and English.


Embeddings of translations and words with similar meaning are close (geometrically) in the shared cross-lingual vector space. This property makes them very effective for cross-lingual Natural Language Processing (NLP) tasks. The simplest way to evaluate the result is the Bilingual Lexicon Induction (BLI) criterion, which designs the dictionary’s percentage that can be correctly induced. Thus, BLI is often the first step towards several downstream tasks such as Part-Of-Speech (POS) tagging, parsing, document classification, language genealogy analysis, or (unsupervised) machine translation.

These common representations are frequently learned through a two-step process, whether in a bilingual or multilingual setting. First, monolingual word representations are learned over large portions of text; these pre-formed representations are actually available for several languages and are widely used, such as the Fasttext Wikipedia. Second, a correspondence between languages is learned in three ways: in a supervised manner, if parallel dictionaries or data are available to be used for supervisory purposes, with minimal supervision, for example, by using only identical strings, or in a completely unsupervised manner.

It is common practice in the literature on the subject to separate these two steps and not to address them simultaneously in a paper. Indeed, measuring the algorithm’s efficiency would lose its meaning if the corpus of vectors is not identical at the beginning.

Concerning the second point, although three different approaches exist, they are broadly based on the same ideas: the goal is to identify a subset of points that are then used as anchors points to achieve alignment. In the supervised approach, these are the words for which the translation is available. In the semi-supervised approach, we will gradually try to enrich the small initial corpus to have more and more anchor points. The non-supervised approach differs because there is no parallel corpus or dictionary between the two languages. The subtlety of the algorithms will be to release a potential dictionary and then to enrich it progressively.

We will focus on this third approach. Although it is a less frequent scenario, it is of great interest for several reasons. First of all, from a theoretical point of view, it provides a practical answer to a fascinating problem of information theory: given a set of texts in a totally unknown language, what information can we retrieve? The algorithms we chose to implement contrast neatly with the classical approach used until now. Finally, for very distinct languages or languages that are no longer used, the common corpus can indeed be very thin.

Many developments have therefore taken place in recent years in this field of unsupervised bilingual lexicon induction. One of the recent discoveries is the idea that using information from other languages during the training process helps improve translating language pairs.

This discovery led us to formulate the problem as follows: is it possible to gain experience in the progressive learning of several languages? In other words, how can we make fair use of the learning of several acquired languages to learn a new one? This new formulation can lead one to consider the lexicon induction as a ranking problem.

We will proceed as follows: First, we will outline state of the art and the different techniques used for unsupervised learning in this context. In particular, we will explain the Wasserstein Procrustes approach for bilingual and multi alignment. We then emphasize the lexicon induction given the alignment. We then present the Learning to Rank key concepts. We then discuss a method using learning to rank for lexicon induction, and then present some experimental results.

Unsupervised Bilingual Alignment

This section provides a brief overview of unsupervised bilingual alignment methods to learn a mapping between two sets of embeddings. The majority are divided into two stages: the actual alignment and lexicon induction, given the alignment. Even if the lexicon induction is often taken into account when aligning (directly or indirectly, through the loss function), this distinction is useful from a theoretical perspective.


Word embeddings alignment (in dimension 2).


Historically, the problem of word vector alignment has been formulated as as a quadratic problem. This approach, resulting from the supervised literature, is then allowed to presume the absence of lexicon without modifying the framework. That is why we will deal with it first in what follows.

Orthogonal Procrustes Problem


Set of n words (embeddings of dimensions d) for 2 languages.


Procustes is a method that aligns points if given the correspondences between them (supervised scenario). $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y} \in \mathbb{R}^{n \times d}$ are the two sets of word embeddings or points and we suppose, as previously said, that we know which point X corresponds to which point Y. This leads us to solve the following least-square problem of optimization, looking for the W matrix performing the alignment [5]:


Operation applied for each word in the first language. The goal is to minimize the distance between the orange and the blue vectors.


We have access to a closed form solution with a cubic complexity. Restraining W to the set of orthogonal matrices $\mathcal{O}_{d}$, improves the alignments for two reasons: it limits overfitting by reducing the size of the minimization space and allows to translate the idea of keeping distances and angles, resulting from the similarity in the space structure. The resulting problem is known as Orthogonal Procrustes. It also admits a closed-form solution through a singular value decomposition (cubic complexity).

Thus, if their correspondences are known, the translation matrix between two sets of points can be inferred without too many difficulties. The next step leading to unsupervised learning is to discover these point correspondences using Wasserstein distance.

Wasserstein Distance

In a similar fashion, finding the correct mapping between two sets of word can be done by solving the following minimization problem:

$\mathcal{P}_{n}$ containing all the permutation matrices, the solution of the minimization, $P_t$ will be an alignment matrix giving away the pair of words. This 1 to 1 mapping can be achieved thanks to the Hungarian algorithm. It is equivalent to solve the following linear program:

The combination of the Procustes- Wasserstein minimization problem is the following:

To solve this problem, the approach of [5] was to use a stochastic optimization algorithm. As solving separately those 2 problems led to bad local optima, their choice was to select a smaller batch of size b, and perform their minimization algorithm on these sub-samples. The batch is playing the role of anchors points. Combining this with a convex relaxation for an optimal initialization, it leads to the following algorithm:

Other unsupervised approaches

Other approaches exist, but they are currently less efficient than the one described above for various reasons: complexity, efficiency… We will briefly describe the two main ones below.

Optimal transport:

Optimal transport [6] formalizes the problem of finding a minimum cost mapping between two word embedding sets, viewed as discrete distributions. More precisely, they assume the following distributions:

and look for a transportation map realizing:

where the cost $c(\mathbf{x}, T(\mathbf{x}))$ is typically just $| T(\mathbf{x}) - \mathbf{x} |$ and $T_{\sharp} \mu = \nu$ implies that the source points must exactly map to the targets.


Push forward operator


Yet, this transportation not always exist and a relaxation is used. Thus, the discrete optimal transport (DOT) problem consists of finding a plan $\Gamma$ that solves

where $\mathbf{C} \in \mathbb{R}^{n \times m}$, e.g., $C_{i j}=\left|\mathbf{x}^{(i)}-\mathbf{y}^{(j)}\right|$ is the cost matrix and the total cost induced by $\Gamma$ is:

where $\Gamma$ belongs to the polytope:

A regularization is usually added, mostly through the form of an entropy penalization:

Some works [7] are based on these observations and then propose algorithms, effective in our case, because they adapt to the particularities of word embeddings. However, we notice that even if the efficiency is higher, the complexity is redibitive and does not allow large vocabularies. Moreover, the research is more oriented towards improving the algorithm described above with optimal transport tools rather than towards a different path. This is why we will not focus in particular on this track. We should, however, note the use of Gromov-Wasserstein distance [8], which allows calculating the distance between languages innovatively, although they are in distinct spaces, enabling to compare the metric spaces directly instead of samples across the spaces.

Adversarial Training: Another popular alternative approach derived from the literature on generative adversarial network [9] is to align point clouds without cross-lingual supervision by training a discriminator and a generator [10]. The discriminator aims at maximizing its ability to identify the origin of an embedding. The generator of W aims at preventing the discriminator from doing so by making WX and Y as similar as possible.

They note $\theta_{D}$ the discriminator parameters and consider the probability $P_{\theta_{D}}(\text { source }=1 | z)$ that a vector $z$ is the mapping of a source embedding (as opposed to a target embedding) according to the discriminator. The discriminator loss can then be written as:

On the other hand, the loss of the generator is:

For every input sample, the discriminator and the mapping matrix $W$ are trained successively with stochastic gradient updates to minimize $\mathcal{L}_W$ and $\mathcal{L}_D$.

Yet, papers [10] on the subject show that, although innovative, this framework is more useful as a pre-training for the classical model than as a full-fledged algorithm.

Multilingual alignment

A natural way to improve the efficiency of these algorithms is to consider more than 2 languages. Thus, when it comes to aligning multiple languages together, two principle approaches quickly come to mind and correspond to two optimization problems:

  • Align all languages to one pivot language, often English, without considering the loss function other alignments. This leads to low complexity but also to low efficiency between the very distinct language, forced to transit through English.
  • Align all language pairs by putting them all in the loss function, without giving importance to any one in particular. If this improves the efficiency of the algorithm, the counterpart is in the complexity, which is very important because it is quadratic in the number of languages.


Example of both approaches to align multiple languages.


A trade-off must therefore be found between these two approaches.

Let us consider $\mathbf{X}_i$ word embeddings for each language $i$, $i=0$ can be considered as the reference language, $\mathbf{W}_i$ is the mapping matrix we want to learn and $\mathbf{P}_i$ the permutation matrix. The alignment of multiple languages using a reference language as pivot can be resumed by the following problem:

As said above, although this method gives satisfying results concerning the translations towards the reference language, it provides poor alignment for the secondary languages between themselves.

Therefore an interesting way of jointly aligning multiple languages to a common space has been brought through by Alaux et al. [11].

The idea is to consider each interaction between two given languages, therefore the previous sum becomes a double sum with two indexes i and j. To prevent the complexity from being to high and to keep track and control over the different translations, each translation between two languages is given a weight $\alpha_{i j}$:

The choice of these weights depends on the importance we want to give to the translation from language i to language j.

The previous knowledge we have on the similarities between two languages can come at hand here, for they will have a direct influence on the choice of the weight. However, choosing the appropriate weights can be uneasy. For instance, giving a high weight to a pair of close languages can be unnecessary, and doing the same for two distant languages can be a waste of computation. To effectively reach this minimization, they use an algorithm very similar to the stochastic optimization algorithm described above.

At the beginning, we wanted to use this algorithm to incorporate exogenous knowledge about languages to propose constants $\alpha_{i j}$ more relevant and leading to greater efficiency. Different techniques could result in these parameters: from the mathematical literature such as the Gromov-Wasserstein distance evoked above or from the linguistic literature, using the etymological tree of languages to approximate their degree of proximity or even from both. In the article implementing this algorithm, it is specified that the final $\alpha_{i j}$ is actually very simple: N if we consider a link to the pivot or 1 otherwise. Practical simulations have also led us to doubt the efficiency of this idea. This is why we decided to focus on the idea below that seemed more promising rather than on multialignments.

Word translation as a retrieval task: Post-alignment Lexicon Induction

The core idea of the least-square problem of optimization in Wasserstein Procrustes is to minimize the distance between a word and its translation. Hence, given the alignment, the inference part first consisted of finding the nearest neighbors (NN). Yet, this criterion had a major issue: Nearest neighbors are by nature asymmetric: y being a K-NN of x does not imply that x is a K-NN of y. In high-dimensional spaces, this leads to a phenomenon that is detrimental to matching pairs based on a nearest neighbor rule: some vectors, called hubs, are with high probability nearest neighbors of many other points, while others (anti-hubs) are not nearest neighbors of any point. [12]

Two solutions to this problem have been brought through new criteria, aiming to give similarity measure between two embeddings, thus matching them appropriately. Among them, the most popular is Cross-Domain Similarity Local Scaling (CSLS) [10]. Others exist, such as Inverted Softmax (ISF)[13]. Yet, they usually require to estimate noisy parameter in an unsupervised setting where we do not have a direct cross-validation criterion.

The idea behind CSLS is quite simple: it is a matter of calculating a cosine similarity between the two vectors, subtracting a penalty if one or both of the vectors is also similar at many other points.

More formally, we denote by $\mathcal{N}_{\mathrm{T}} (\mathbf{W} x_s)$ the neighboors of $\boldsymbol{x}_S$ for the target language, after the alignment (hence the presence of $\mathbf{W}$).

Similarly we denote by $\mathcal{N}_{\mathrm{S}}(y_t)$ the neighborhood associated with a word $t$ of the target language. The penalty term we consider is the mean similarity of a source embedding $x_s$ to its target neighborhood:

where cos(…) is the cosine similarity.

Likewise we denote by $r_{\mathrm{S}}\left(y_{t}\right)$ the mean similarity of a target word $y_{t}$ to its neighborhood. Finally, the CSLS is defined as:

However, it may seem irrelevant to align the embedding words with the NN criterion metric and to use the CSLS criterion in the inference phase. Indeed, it creates a discrepancy between the learning of the translation model and the inference: the global minimum on the set of vectors of one does not necessarily correspond to one of the other. This naturally led to modify the least-square optimization problem to propose a loss function associated with CSLS.

By assuming that word vectors are $\ell_{2}-$ normalized, we have:

Similarly, we have:

Therefore, finding the $k$ nearest neighbors of $\mathbf{W} \mathbf{x}_i$ among the elements of $\mathbf{Y}$ is equivalent to finding the $k$ elements of $\mathbf{Y}$ which have the largest dot product with $\mathbf{W} \mathbf{x}_i$. This equivalent formulation is adopted because it leads to a convex formulation when relaxing the orthogonality constraint on $\mathbf{W}$. This optimization problem with the Relaxed CSLS loss (RCSLS) is written as:

A convex relaxation can then be computed by considering the convex hull of $\mathcal{O}_d$, i.e., the unit ball of the spectral norm. The results of the papers [10] point out that RCSLS outperforms state of the art by, on average, 3 to 4% in accuracy compared to benchmark. This shows the importance of using the same criterion during training and inference.

Such an improvement using a relatively simple deterministic function led us to wonder whether we could go even further in improving performance. More precisely, considering Word translation as a retrieval task, the framework implemented was that of a ranking problem. To find the right translation, it was essential to optimally rank potential candidates. This naturally led us to want to clearly define this ranking problem and use the state of the art research on raking to tackle it. In this framework, the use of simple deterministic criteria such as NN, CSLS, or ISF was a low-tech answer and left a large field of potential improvement to be explored.

However, we wanted to keep the unsupervised framework, hence the idea of training the learning to rank algorithms on the learning of the translation of a language pair, English-Spanish, for instance, assuming the existence of a dictionary. This would then allow us to apply the learning to rank algorithm for another language pair without a dictionary, English-Italian, for instance. Similarly to the case of CSLS, the criterion can be tested first at the end of the alignment carried out thanks to the Procrustes-Wasserstein method. Then, in a second step, it can be integrated directly through the loss function in the alignment step. The following will quickly present the learning to rank framework to understand our implementation in more detail.

Learning to Rank

A ranking problem is defined as the task of ordering a set of items to maximize the utility of the entire set. Such a question is widely studied in several domains, such as Information Retrieval or Natural Language Processing. For example, on any e-commerce website, when given a query “iPhone black case” and the list of available products, the return list should be ordered by the probability of getting purchased. One can start understanding why a ranking problem is different than a classification or a regression task. While their goal is to predict a class or a value, the ranking task needs to order an entire list, such that the higher you are, the more relevant you should be.

Theoretical framework

Let’s start by diving into the theoretical framework of Learning to Rank. Let $\psi := { (X,Y) \in \mathcal{X}^n \times \mathbb{R}^{n}_{+}}$ be a training set, where:

  • $X \in \mathcal{X}^n$ is a vector, also defined as $(x_1,…,x_n)$ where $x_i$ is an item.
  • $Y \in \mathbb{R}^{n}_{+}$ is a vector, also defined as $(y_1,…,y_n)$ where $y_i$ is a relevance labels.
  • $\mathcal{X}$ is the space of all items.

Furthermore, we define an item $x \in \mathcal{X}$ as a query-documents pair $(q,d)$.

The goal is to find a scoring function $f : \mathcal{X}^n \rightarrow \mathbb{R}^{n}_{+}$ that would minimizes the following loss :

where $l : (\mathbb{R}^{n}_{+})^2 \rightarrow \mathbb{R}$ is a local loss function.

One first et very important note is how $f$ is defined. This could be done in two ways:

  • We consider $f$ as a univariate scoring function, meaning that it can be decomposed into a per-item scoring function with $u : x \mapsto \mathbb{R}_+$. We will have $f(X) = [u(x_0), \cdots , u(x_n)]$.
  • We consider $f$ as a multivariate scoring function, meaning that each item is scored relatively to every other item in the set, with $f$ in $\mathbb{R}^{n}_{+}$. This means that changing one item could change the score of the rest of the set.

While the first option is simpler to implement, the second one is much closer to reality, as an item’s relevance often depends on the distribution its in. For example, an article’s relevance to an e-commerce query will always depend on what the website offers you next to it.

We now have to define some metrics in order to judge how good a ranking is. We start by defining the Discounted Cumulativ Gain (DCG) of a list:

where:

  • $Y = (y_1,…,y_n)$ are the ground truth labels
  • $\pi(j)$ is the rank of the j-th item in X
  • $\frac{1}{ln_{2}(1+\pi(j))}$ is the discount factor
  • $k$ is how much we want to go deep into the list. A low value of $k$ means that we want to focus on how well ranked the start of our list is.

Most of the time however we want to compare this metric to the DCG obtained from the ground truth labels. We then define:

where $\pi^*$ is the item permutations induced by Y.

Loss functions

In

we defined $l$ as a loss function between two ordered sets of items. One approach could be to use the metrics defined above, but as they are non-differentiable, this is not a feasible choice. Therefore, we have to develop some sort of surrogate loss function, differentiable, and with the same goal as our metric. Before diving into the possible approaches, one must define what pointwise, pairwise, and listwise loss functions are.

A pointwise loss will only compare one predicted label to the real one. Therefore, each item’s label is not compared to any other piece of information. A pairwise loss will compare 2 scores from 2 items at the same time. With the pairwise loss, a model will minimize the number of pairs in the wrong order relative to the proper labels. A listwise loss can capture differences in scores throughout the entire list of items. While this is a more complex approach, it allows us to compare each score against the others.

Let’s give an example for each sort of loss function, starting with a pointwise one. The sigmoid cross entropy for binary relevance labels can be defined as:

where $p_j = \frac{1}{1 + e^{-\hat y_j}}$ and $\hat Y$ is the predicted labels for each item from one model.

The pairwise logistic loss [1] is a pairwise loss function that compares if pair of items are ordered in the right order. We can define it as:

where $\mathbb{I}_{x}$ is the indicator function.

Finally, a listwise loss function like the Softmax cross-entropy [2] can be defined as:

All of these loss functions are surrogates that try to capture the goal of our metrics. Another approach would be to define a listwise loss function, as close as our metrics as possible. For example, one could try to build a differentiable version of the NDCG [3]

ApproxNDCG: a differentiable NDCG

Let’s take a query $q$, and define several useful functions:

the relevance of an item $x$ regarding of the fixed query $q$, and

the position of $x$ in the ranked list $\pi$ and finally, $\mathbb{I}_{{\pi(x) \leq k}}$ still represents the indicator function of the set ${\pi(x) \leq k}$. The idea is to use an differentiable approximation of the indicator function. One can show that we have the following approximation:

where $\alpha$ is a hyperparameter.

Example of approximation of the Indicator function for various values of alpha.


For a fixed query, we can re-define the DCG metric with the following equality :

We now have to get an approximation of the $\pi$ function. The idea here is to get back to an indicator function since it is possible to compute them. We will be using the following equality :

where $s_x$ is defined as the score given to $x$ according to $f$, and define an approximation of $\pi$ with:

We can now define our differentiable version of the DCG metric by using these approximations.

RUBI: Ranked Unsupervised Bilingual Induction

Motivations: Let’s describe more precisely the functioning of our algorithm. Two points guided our approach:

  • From a linguistic point of view, there is obviously a learning to learn phenomenon for languages. We observe that by assimilating the structure of the new language, its grammar, and vocabulary to one of the already known languages, it is easier for us to create links that help learning. It is the search for these links that motivate us. We are convinced that they can be useful when inferring vocabulary.
  • Improvement induced by using the CSLS criterion suggests that there are complex geometrical phenomena (going beyond the existence as mentioned above of hubs) within the representations of languages, both ante, and post-alignment. Understanding these phenomena can lead to significantly increased efficiency.

Framework: Our goal is the same as for unsupervised bilingual alignment: we have a source language A and a target language B with no parallel data between the two. We want to derive an A-B dictionary, a classic BLI task. Our study’s specificity is to assume that we also have a C language and an A-C dictionary at our disposal. To set up the learning to learn procedure, we proceed in 2 steps:

  • Learning: Using the Procrustes-Wasserstein algorithm, we align languages A and C in an unsupervised way. We then build a corpus of queries between the words from language A known from our dictionary and their potential translation into language C. Classical methods proposed the translation that maximized the NN or CSLS criteria. We use deep learning as part of our learning to rank framework to find a more complex criterion. One of the innovative features of our work is, therefore, to allow access to a much larger class of functions for the vocabulary induction stage. A sub-part of the dictionary is used for cross-validation. The way of rating the relevance of the potential translations, the inputs of the algorithm, the loss functions are all parameters that we studied and that are described in the next section.
  • Prediction: We thus have at the end of the training an algorithm taking as input a vocabulary word, in the form of an embedding and a list of potential translations. Our algorithm’s output is the list sorted according to the learned criteria of these possible translations, the first word corresponding to the most probable translation, and so on. We first perform the alignment of languages A and B using the Procrustes-Wasserstein algorithm again. In a second step, thanks to the learning to rank, we perform the lexicon induction step.

Finally, a final conceptual point is important to raise. In the context of the CSLS criterion, we have seen in the above that its use after alignment has improved. However, actually incorporating it in the alignment phase by modifying the loss function has allowed for greater consistency and a second improvement. However, these two changes were separated. Yet, the learning to rank framework is quite different. The main reason is the non-linearity resulting from deep-learning, unlike CSLS. Therefore, the global optimization is much more complex and does not allow relaxation to get back to a convex case. However, it is an area for improvement to be considered very seriously for future work.

Results

We can split our results into two very distinct parts. They both depend on how the Learning to Rank item sets are built. Given a word, you can build the list of potential traduction from a CSLS criterion and then force or not the right translation presence. This choice needs to be discussed thoroughly. First, let’s quickly present some results with and without the correct prediction forced in the query.

With the forced prediction

Method EN-ES ES-EN EN-FR FR-EN
Wass. Proc. - NN 77.2 75.6 75.0 72.1
Wass. Proc. - CSLS 79.8 81.8 79.8 78.0
Wass. Proc. - ISF 80.2 80.3 79.6 77.2
Adv. - NN 69.8 71.3 70.4 61.9
Adv. -CSLS 75.7 79.7 77.8 71.2
RCSLS+spectral 83.5 85.7 82.3 84.1
RCSLS 84.1 86.3 83.3 84.1
RUBI 93.3 (DE) 91.6 (FR) 93.8 (NL) 91.9 (IT)
  EN-DE DE-EN EN-RU RU-EN
Wass. Proc. - NN 66.0 62.9 32.6 48.6
Wass. Proc. - CSLS 69.4 66.4 37.5 50.3
Wass. Proc. - ISF 66.9 64.2 36.9 50.3
Adv. - NN 63.1 59.6 29.1 41.5
Adv. -CSLS 70.1 66.4 37.2 48.1
RCSLS+spectral 78.2 75.8 56.1 66.5
RCSLS 79.1 76.3 57.9 67.2
RUBI 93.6 (HU) 89.8 (FR) 83.7 (HU) -

1: Loss function impact - Loss used for the Learning to Rank model. In other experiments, we found that ApproxNDCG and List MLE continue to perform similarly, hence our default choice of Approx NDCG.

2: Group size impact - The group size measures how many items the Learning to Rank model takes as input simultaneously (multivariate vs. univariate). However, the dilemma is to optimize the computation time because increasing the group size exponentially increases the number of calculations.

3: CSLS feature impact - The features for each potential translation in a query can incorporate several elements:

  • the word embedding of the potential translation (size 300)
  • the word embedding of the query (size 300)
  • pre-computed features such as distance to query word in the aligned vector space, CSLS distance, ISF…

Those features are crucial for learning as it will entirely rely on it. At first, we decided to only use the word embedding of the potential translation and the query. That gave us a 600 feature list. However, after several experiments, we noticed that the learning to rank algorithm, despite the variation of the parameters, could not learn relevant information from these 600 features, the performance was poor. The function learned through deep learning was less efficient than a simple Euclidean distance between the potential translation and the query (NN criterion). In fact, after consulting the literature, we realized that using such a number of features is not very common. Most algorithms were only using pre-computed features (often less than a hundred). Although this information is already interesting in itself, we, therefore, turned to the second approach. We chose to restrict ourselves to certain well-specified types of pre-computed features to evaluate their full impact. More precisely, for a fixed k parameter, we provided as features the euclidean distance to the query and the CSLS(i) “distance” for i ranging from 1 to k. In other words, we provided information about the neighborhood through the penalties described in the section on CSLS. Training with the full embeddings was also performed but did not lead to any improvements.

4: Query size impact - Number of potential translations given with a query (number of items). There is a low incidence of the number of queries on the results, a very slight but perceptible decrease. Therefore, the algorithm is able, despite a large number of candidates, to discern the correct information.

Plot of the BLI criterion in the training step according to the CSLS criterion, i.e., the quality of the language’s alignment for learning with English. The trend that emerges is that of a very clear positive correlation (linear trend plotted in red, (R ^2 = 0.82)). We have also shown the averages per language family (Romance, Germanic and Uralic). In conclusion, it seems easier to learn using a language that is well-aligned with English. Although this seems logical, it is not that obvious. Three clusters seem to appear in conjunction with the different families. Romance languages are associated with a high rate of alignment with English and, therefore, with high performance in the learning stage. The Germanic language cluster has a lower performance combined with a slightly lower quality alignment. Knowing that English belongs to the Germanic language type, it is interesting to note this slight underperformance in alignment compared to Romance. Finally, the Slave cluster shows the worst performance in terms of alignment with English and, therefore, also the worst for the learning step.

Without forced prediction

Ablation study for the same pipeline as above, but without the right prediction forced into the query.


Without going into too many details, the same trends as above appear here and hold even when the proper translation is not artificially added to the query.

The main difference comes from the value of the BLI results: we only achieve the same results as the state of the art, or slightly better. We discuss in the conclusion the implication of these results, and especially the importance of forcing the right translation to appear in each query.

Conclusion

If in the Learning to rank framework, a query without at least one relevant item doesn’t have a lot of sense, this is not something one can ensure when working with unsupervised translation. If it is clear that given a query where the correct translation appears, a Learning to Rank model surpasses existing methods, it is still unclear on how to achieve such a query every time. While one way could be to extend the query size drastically, one still has to keep in mind the memory capability of such a model. Another bias might come from every query where the correct translation does not appear naturally (thus always giving poor results to every model except “forced translation” model).

We believe that leveraging the knowledge of previous idioms acquisition can keep leading to many improvements over existing models. This could be achieved thanks to Learning to Rank.

References

1: CHEN, WEI, YAN LIU, TIE, LAN, YANYAN, MING MA, ZHI & LI, HANG 2009 Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems 22 (ed. Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams & A. Culotta), pp. 315–323. Curran Associates, Inc

2: CAO, ZHE, QIN, TAO, LIU, TIE-YAN, TSAI, MING-FENG & LI, HANG 2007 Learn- ing to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136. New York, NY, USA: ACM.

3: QIN et al. A general approximation framework for direct optimization of information retrieval measures.

4: Ilya Sutskever, Thomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean 2013. Distributed Representations of Words and Phrases and their Compositionality

5: Quentin Berthet, Edouard Grave and Armand Joulin: Unsupervised Alignment of Embeddings with Wasserstein Procrustes, 2018

6: Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.

7: David Alvarez-Melis and Tommi Jaakkola. Gromov-wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.

8: Mikhail Gromov. Metric structures for Riemannian and non-Riemannian spaces. Springer Science & Business Media, 2007.

9: Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, pp. 2672–2680, 2014.

10: Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H: Word translation without parallel data, 2017

11: Jean Alaux, Edouard Grave, Marco Cuturi and Armand Joulin: Unsupervised hyperaligment for multilingual word embeddings, 2019

12: Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. Improving zero-shot learning by mitigating the hubness problem. International Conference on Learning Representations, Workshop Track, 2015.

13: Smith, S. L., Turban, D. H., Hamblin, S., and Hammerla, N. Y.: Bilingual word vectors, orthogonal transformations and the inverted softmax., 2017

Github Repository

Football Data Analysis - Liverpool FC attacking system

This post was made possible by @lastrowview and @Soccermatics which shared the tracking data of 19 goals scored by LFC during 2018–2019 and 2019-2020 seasons. Our code is available on this GitHub repository.

This article was co-written with Théophane, a classmate, while we worked on this project for a few days.

I now work at Flaneer, feel free to reach out if you are interested!

Sadio Mané, Mohamed Salah and Firmino celebrating.


This post illustrates how data analysis and machine learning can be applied to football players’ tracking data in order to reveal key insights. Our analysis will be articulated in two parts :

  • Pitch control for opponent analysis
  • Deep Reinforcement Learning (DRL) for accessing action value

Pitch control for opponent analysis

A pitch control model basically tells you which team is in control of every pitch part at a given time. Here, “in control” means being the quicker to arrive at a specific location in case the ball is shot there.

Starting from position data, we got access to speed data thanks to discrete derivation and Savitzky–Golay filter. Then, our approach was to use a time-based Voronoi tessellation operating with ball and players’ tracking data (positions and speeds) as inputs and physical characteristics (player maximum speed, reaction time…) as parameters. This model is an adaptation from the one developed by Laurie Shaw (Harvard) for Metrica’s tracking data.

Example of our pitch control output for a given frame extracted from Mohamed Salah’s goal against Manchester City on November 10th, 2019. Liverpool players are in red (of course), Manchester City players are in blue and the ball is the black dot. Arrows represent players’ speeds in meters per second.


After calculating pitch control map frame per frame, we were then able to produce video analysis like this one :

Pitch control analysis of Mohamed Salah’s goal against Manchester City on November 10th, 2019 (assist from Andrew Robertson)


Wing-backs : the two lungs of Liverpool’s attack

Football analysts often emphasize on LFC amazing attacking trio : Mané, Salah and Firmino. In fact, the two wing-backs, Andrew Robertson and Trent Alexander-Arnold, play an offensive role almost as significant as the famous trio.

Analyzing the Reds goals, their attacking system seems to focus on the wings in order to get around the defensive line. Indeed, most of the goals are coming from either curved passes (and crosses) or diving runs from outside to inside.

Our pitch control model enabled us to illustrate space creation by wing-backs such as this triggering run from Alexander-Arnold against Porto FC.

Trent Alexander-Arnold triggers a key diving run to deliver a precise assist against Porto FC in Champions League on April 13th, 2019


This attacking system is based on speed and quick decisions as opposed to more centered playing styles creating a nexus of short passes such as Guardiola’s FC Bayern Munich from 2015–2016 season (or even more in 2010 during Barcelona FC golden age). Therefore, defenders’ response time must be extremely short to counteract the Reds and reach a clean sheet.

Defenders’ response time : every second is precious against LFC

Comparative analysis of LFC attacking player’s speed evolution versus their direct defenders. Tracking data include 20 frames per second (i.e. 0.05s between each frame)


At first sight, these graphs just seem to show that attacking players are faster than defending players which is not surprising. However, the important point we would like to highlight is the time taken by defenders to reach the attacker’s initial speed. Defenders from Bayern and Everton seem to take a lot more time to react whereas the ones from Newcastle were very reactive (but victim of a huge space between their two center-backs).

The three goals are available here as the support of this analysis.

Same order as the one used for the graph : Bayern Munich, Everton and Newcastle.


This pitch control model enables to produce some insights on the game. Now, let’s see how we can value actions thanks to Deep Reinforcement Learning.

Deep Reinforcement Learning Agent

An important aspect of football analysis is the concept of expected goals (xG). An overview can be found here, with several references as well. We adapt this concept here to the Reinforcement Learning framework, considering a football game as a Markov Decision Process. At its core, reinforcement learning models an agent interacting with an environment and receiving observations and rewards from it. The goal of an agent is to determine the best action in any given state. In the case of football, we could consider :

  • An episode as one game.
  • A state as the position and velocity of each player and the ball.
  • Possible actions as a player moving or making a pass.
  • The reward as +1 if we score a goal, -1 if we conceded one.

One adaptation of the Reinforcement Learning framework to football can be found here.

A few mathematics considerations

We define a trajectory $\tau$ as a sequence of states and actions experienced by the agent :

\[\tau = \big( s_0, a_0, s_1, a_1, ... \big)\]

and we define the discounted cumulative reward as the gain along a trajectory :

\[R(\tau) = \displaystyle \sum_{t=0}^{T} \gamma^{t} r_{t}\]

where $\gamma$ is a discount factor, and $r_t$ the reward at time $t$. 

Finally, we define the state value function according to a policy $\pi$ (the way the agent behave) as : 

\[V^\pi (s) = \underset{\tau \sim \pi}{\mathbb{E}} \big[ R(\tau) \vert s \big]\]

This value function can represent the expected discounted cumulative reward starting from a state $s$ and following a given policy $\pi$. Since, in our framework, we only get a reward when the agent scores a goal, we can interpret $V(s)$ as the expected discounted goals from a given state.

Granted, you would need a policy or an agent that can play the same way the players you are targetting, and that is a complicated issue to tackle. We used here the Google Research Football environment to train our agents and used their hyper-parameters (except for the reward, we used the scoring reward):

--level 11_vs_11_easy_stochastic \
  --reward_experiment scoring \
  --policy impala_cnn \
  --cliprange 0.08 \
  --gamma 0.993 \
  --ent_coef 0.003 \
  --num_timesteps 50000000 \
  --max_grad_norm 0.64 \
  --lr 0.000343 \
  --num_envs 16 \
  --noptepochs 2 \
  --nminibatches 8 \
  --nsteps 512 \

We also kept training our agent against middle and hard opponents overtime and finished the training with self-play. We release the checkpoint used for the value function evaluation in our repository.

Finally, we wanted to fine-tune our agent with some Imitation Learning, by using real games data. However, due to the lack of time, this did not happen. It will be the next step for this particular project. We will also need to conduct a more thorough analysis of the differences between this approach and the more usual Expected Goals models. A multi-agent approach might also bring some value to this application, with a more precise prediction of the Expected Discounted Goals.

Results

We convert the tracking data from Last Row into a readable format for our agent :

“The observation is composed of 4 planes of size ‘channel_dimensions’. Its size is then ‘channel_dimensions’x4 (or ‘channel_dimensions’x16 when stacked is True). The first plane P holds the position of players on the left team, P(y,x) is 255 if there is a player at position (x,y), otherwise, its value is 0. The second plane holds in the same way the position of players on the right team. The third plane holds the position of the ball. The last plane holds the active player.” extracted from Google Research Football GitHub

that are stacked for four frames.

We then compute the value function for each time step and smooth it using a Savitzky-Golay filter. We apply our value function to a goal against Leicester. The results are available below :

Analyses of a goal against Leicester. We present the value function, and the tracking of the players. A voronoi diagram and the player's velocity vectors are presented as well.


Without over-analyzing the value function, a few things are still interesting to notice :

  • The pass between time frames 20–30 increases a lot the value function, as well as the assist from Alexander-Arnold.
  • Overall, the entire action on the right side brings a lot of value.
  • The value function decreases before the shot (as the agent believes that the action is compromised) and after the goal (since such a configuration is never seen within the simulation)

Analyses of a goal against Leicester. We present the value function, and the real time live goal.


Conclusion

Through this post, we tried to find interesting approaches to Liverpool FC tracking data and we are fully aware our analysis is not complete. We still had a lot of fun working on the dataset provided by @lastrowview and we would also like to thank Friends of Tracking youtube channel for sharing useful repositories and gathering knowledge on Football Data Analysis.

The PI-Mobile, an autonomous car made with Lego, RaspberryPi and Deep Learning

The Pi-Mobile V2, driving on its own

I now work at Flaneer, feel free to reach out if you are interested!

When we started talking about this project with Matthias and then Arthur, we knew that both building a Lego motorized car, and learning to drive with real-life Deep Reinforcement Learning was possible. However, we wanted to do both at the same time. We also wanted to try another approach: Imitation Learning. We recently acquired a Raspberry Pi 4b and knew that the power limitations set by the previous model vanished. However, we didn’t know exactly how to fit everything into a lego car, and how to train it properly. We came up with 2 vehicles and used 2 methods of learning :

  • a first model (that looks terrible), driving with a Deep Reinforcement Learning Agent
  • a much better-looking car, driving with an Imitation Learning algorithm

The first car we built, not the best looking...


The Pi-Mobile V2, based on a Lego Technic Porsche Model

The first one drove on an inside track built in my home, while the second drove outside.

PI-Mobile V1

Assembling the car

The first step of building a lego car is building its structure and deciding the location of each component in the car to make it as compact as possible. We started by building each block and then assembled them. It resulted in a working car, but not in the fastest nor best-looking one.

The structure

We decided to choose a 3-engines solution: two at the back of the car that power each wheel separately, and one at the front powering the steering system that we built.

From top left to bottom right : Lego Mindstorm cables, RaspberyPi 4b + BrickPi, 8 AA batteries, RasbperryPi 3b + Lithium battery + fan, 2 Large Lego Mindstorm motors, 1 Medium Lego Mindstorm rotor


We chose to separate the car’s electronic system in two halves: one half is the core of the vehicle and will be linked to the motors, powering them, and will take care of the DRL part of the project. The second half is down the first part, as all the operations listed above are energy-consuming.

The core part consists of a Raspberry Pi 4b (we chose this device to be the center of all the operations because of its vast specifications compared to the Raspberry Pi 3b) and a card from Dexter Industries that manages the link with the lego technics. In contrast, the second part consists of a battery powering a Raspberry Pi 3 that powers a small fan, installed in between the two cards of the car’s core. While using a Raspberry Pi 3 to power a fan seems a bit of an overkill; it also allows us convenient a flexibility in programming, were we to decide to use a Master/slave architecture to compute the DRL part if the RP4 capped its power due to overheating.

We then used a raspbian image, unaltered, with the convenient package VNC Viewer, which allows us to access the raspbian of the PI via SSH protocol. From there, we can launch our script to make the car move.

We also used a raspberry camera on top of the car. This camera was the only input our car would use to drive around.

The track

The PI-Mobile V1 trained on a track we built inside my house. It is made of paper and is 8 meters long. However, it has a bridge in the middle, a bridge the car could never learn to pass properly (it struggled to go only forward because of a faulty steering system). Because of this, we split the track into 2 parts: a training part and a testing part :

The learning algorithm - DQL

We used a Deep Q-learning agent to train our car to drive, with some minor modifications: human control. Since the agent is not interacting with a simulation but with the real world, we are the one telling it when to stop, or when it’s good or bad. We set up several keys: one to pause the current episode, one to finish it when the car is doing something bad, and finally, one to end the episode when the car reaches the end of the track. The algorithm is available below :

A more detailed explanation of Deep Reinforcement Learning is available on a previous blog post and paper.

Regarding the reward, we used a simple function: 1 at every step, except when the car fails (-1) or when it reaches the end of the trach (time spent on the track). The goal is to drive correctly and as fast as it could.

public double reward(String key, double time_since_start) {
    double reward = 0;
    
    if (key == stop) {
        reward = -1;
    } else if (key == end) {
        reward = (10. - time_since_start)/100.;
    } else {
        reward = 1;
    }
    
    return reward;
}

Overall, our car learned to drive on the testing track in less than 50 episodes (around an hour and a half). We tracked the distance made autonomously on the testing track every 5 training episodes :

Distance autonomously driven by the car on the testing track

An example of a training episode is available below :

PI-Mobile V2

Now that we knew that our car could learn to drive on its own, we wanted to make some modifications:

Car V1 Modifications
Car's structure was not robust enough Started from an existing Lego Technic car
The fan system was taking a lot of place Remove the fan and the Raspberry 3
The DRL agent was not necessarily the best solution Use of an Imitation Learning algorithm
The track was too small Train the car on a bigger track outside

Structure

We kept the same main pieces (Raspberry Pi 4, BrickPi, and 3 motors) but changed the rest of the car. We used a Lego Technic Porsche 911 model as a base for the rest of the car.

Our goal was to use the following blueprint :

However, we had to make some modifications, mainly because of 2 things:

  • The suspension system at the back of the car was not compatible with our Large motors
  • The case used for the Raspberry Pi + BrickPi was not made for a Raspberry Pi 4, and therefore was not as robust as it should have been.

We removed the V6 engine and replaced it by the Raspberry Pi and the batteries. We also made some modifications to insert the Lego Rotor at the front, and the 2 Lego motors at the back. Using this as a baseline allows the car to be much more stable and with a more precise steering system. The first version also tended to break in the middle of the chassis; this doesn’t happen anymore.

Chassis of the PI-Mobile V2, with the 2 motors and the Raspberry Pi/BrickPi in the middle.

The track

We wanted to be a bit more ambitious and step aside from the small inside track we used for the first car. We used a short pathway around the house as a circuit. This pathway lies inside a large garden, making it easy to define where the vehicle should drive and make a loop: precisely what we needed.

Track used to gather data for the PI-Mobile V2. Track is around 400 meters long.

We would then use the rest of the pathway for testing our model. An example of such a pathway is available below :

Example of the road used for the second track

Now that we have a new car and a new track, it is time to talk about the models we used to make our car autonomous.

Imitation Learning and Conditional Imitation Learning

Imitation Learning

Let $\mathcal{O}$ be the set of possible observations. In our case, it is the input from our camera. Let $\mathcal{A}$ the set of possible actions, here it could be turn left or right or go straight. Let’s suppose we have a dataset $\mathcal{D}$ made of observation-action pairs $(o_i,a_i)$, collected from our driving, for example.

The idea behind Imitation Learning is to learn a policy $\pi$ mapping observations to actions, $\pi _{\theta} : \mathcal{O} \rightarrow \mathcal{A}$. The policy would allow the car to know which actions to perform depending on the observation: the camera input, according to the driving we’ve done before.

The policy can then be trained by minimizing the distance between the policy $\pi _{\theta} (o_t)$ and the expert’s actions $a_t$. However, in some cases, the expert’s action is not enough. For example, our car might have the possibility to turn left or right in some cases, and we need to tell it what it should do: this is Conditional Imitation Learning.

Conditional Imitation Learning

The framework is the same as above, but we add a command $c_t$, expressing the expert’s intent when the policy is predicting an action to do. The mapping now takes the following form : $\pi _ {\theta} (o_t, c_t)$. The minimization process changes to

$\min\limits_{\theta} \displaystyle \sum_{t} \mathcal{L} (\pi _ {\theta} (o_t, c_t),a_t)$

Let’s now dive into the model we used and the dataset we collected :

Model & Dataset

Model

We kept the same neural network base, with a MobileNetV2 pretrained on ImageNet:

Dataset

We drove the car for around 15 minutes on the track, gathering around 9000 frames with inputs and controls (the controls were added by hand afterward when 2 decisions were possible). We used 5000 frames for the training, 2000 for the validation, and 2000 for testing. However, because of another steering issue, the car was naturally going to the left when we were inputting a forward command. Therefore, most of the driving consisted of inputting forward and right to make sure the right would stay on track. Because of this, only a few left inputs were issued, mostly for significant turns. This can be seen below. We also split the dataset into smaller episodes to make sure that the Raspberry stoping itself would not delete a file too large.

Distribution of the input and timestamp on the dataset

We trained the model twice: once without data augmentation and once by including it. The data augmentation consisted of adding some random modifications regarding:

  • Brightness
  • Small rotation
  • Small height and width translations

The model without data augmentation gave poor results, especially during our first test session, where the time of day and weather were completely different. This emphasizes the fact that the training dataset must generalize to real-world testing: different weather, different lighting and different road style. This data augmentations made the car going from a complete failure to an acceptable robot.

Overall, we should have made a better dataset, mainly making sure we had no steering issues. This made the testing difficult, forcing us to add artificial left steering and to make the car starts with the same proper angle as in the dataset.

Results

Nevertheless, we achieved good results with the augmented dataset. The predictions made by the model on the testing dataset are available below :

We also tested the car on another portion of the track, although with similar weather conditions from the training dataset :

Self driving car V2, some outside modifications were made to make an easier access to the RaspberryPi and the batteries.

Conclusion :

Overall, it was an extremely fun project to do, and more importantly, did not require a lot of sophisticated knowledge. The car can be built with Legos, the motor, and the BrickPi from Dexter Industry is easy to use with Python. Furthermore, the Deep Learning algorithms we used can be found anywhere. It was actually exciting to see how we could put every “tech brick” one after another, and end up with a working self-driving car. Granted, following a track and making a few turns is far away from the demanding needs of a real autonomous car, interacting with others, but the basics of driving can still be learned with simple components.

References :

  1. Learning to Drive in a day - Wayve
  2. Urban Driving with Conditional Imitation Learning - Wayve
  3. End-to-end Driving via Conditional Imitation Learning - Codevilla et al.

Github Repository

A review on Deep Reinforcement Learning for Fluid Mechanics

I now work at Flaneer, feel free to reach out if you are interested!

When I started an Internship at the CEMEF, I’ve already worked with both Deep Reinforcement Learning (DRL) and Fluid Mechanics, but never used one with the other. I knew that several people already went down that lane, some where even currently working at the CEMEF when I arrived. However, I (and my tutor Elie Hachem) still had several questions/goals on my mind :

  • How was DRL applied to Fluid Mechanics?
  • To improve in both subject, I wanted to try a test case on my own;
  • How could we code something as general as possible?

The results of these questions are available here, with the help of a fantastic team.

Applications of DRL to Fluid mechanics

Several Fluid Mechanics problems have already been tackled with the help of DRL. They always (mostly) follow the same pattern, using DRL tools on one side (such as Gym, Tensorforce or stables-baselines, etc) and Computational Fluid Dynamics (CFD) on the other (Fenics, Openfoam, etc).

Currently, most of the reviewed cases always consider objects (a cylinder, a square, a fish, etc.) in a 2D/3D fluid. The DRL agent will then be able to perform several tasks :

  • Move the object
  • Change the geometry or size of the object
  • Change the fluid directly

A small step into the world of fluid mechanics

The Navier-Stokes equation looks pretty (and complicated) but is in fact traduction of Newton's equation for a fluid. Rather than describing the movement of an object, it is representing the movement of a fluid (more precisely, of a field (of points, vectors, etc).).

This equation is quite a paradox itself. Its application are enormous, but we still don't know if solutions always exists (it is even one of the seven Millenium Prize Problems).

Solving the Navier-Stokes now doesn't mean we are trying to solve it. We use the finite element method to 'slice' our domain into small parts, and starting from initial conditions, we run our simulation, making sure our equations are 'approximatively verified' every (little) timestep.

This agent can perform these tasks at two key moments : (i) when the experiment is done, and you want to start a new one, (ii) during the experiment (i.e., during the CFD time). One is about direct optimization; the other one is about continuous control. A few examples can be seen below :

fish_article

Leader and Follower swimmer reproduced from Novati et al. (2017), with the displacement and the orientation between the leader and the follower.

flow_control_jet_article

Comparison of the velocity magnitude without (top) and with (bottom) active flow control, reproduced from Rabault et al. (2019). This active flow control being linked to a DRL agent.

fluid_solid_article

Fluid jets to control rigid body reproduced from Ma et al. (2018). The position and orientation of the jet are DRL controlled.

and even this video. (which is pretty amazing)

My own test case

In order to dive deeper into the subject, I wanted to try a test case on my own (with the help of Jonathan Viquerat at the beginning though). We took inspiration from this paper and considered a simple case: a laminar flow past a square. Now, it is convenient to compute the drag of this square (in this very set-up).

However, the question asked by the paper was: if we add a tiny cylinder, somewhere close to the square, could we be able to reduce the total drag of both the square and the small cylinder?

The answer is yes, and the result is shown below:

meliga_100

Variation of drag induced by a small control cylinder at Reynolds = 100, reproduced from Meliga et al. (2014).

Now, of course solving this case with DRL was interesting, but I wanted to see if more was possible (since the case was already settled technically). Several ideas were possible (thanks to discussions with peoples from the CEMEF and the review I did previously) :

  1. Was there any difference between direct optimization and continuous control?
  2. Was the use of an autoencoder easy?
  3. Was Transfer Learning possible, and performant?
  4. Could we make a code as clear as possible?

Regarding the first point, I found no particular difference between direct optimization and continuous control. Both agents converge and find almost the same optimal final positions.

Continuous control over 17 time-steps at Reynolds = 40 and 100, from two random positions.

I used an autoencoder but had no time no compare results with and without it. I do have to explain why an autoencoder could be of any help here. The agent needs observations, and when dealing with fluid mechanics, the fluid characteristics are significant observations.

However, they can be of very high dimensions (more than 10,000). The Neural Network (NN) we would deal with would then be incredibly vast. In order to surpass this issue, an autoencoder could be used to extract simple features from high-dimensional fluid fields. It does seem like using as many fluid features as possible (which is possible with the autoencoder) is the best thing to do:

test fish8

Value of the drag coefficient with active flow control reproduced from Rabault et al. (2019). It is important to see that the more pieces of information we give about the fluid (i.e., the more probes we give the agent), the more performant it will be. Being able to give everything without exploding the size of the neural network is therefore super valuable.

In total, I trained six agents, as shown below :

Agent Reynolds Number - Re Goal Details
Agent 3.2 Re=100 Direct optimization Transfer learning from Agent 1.2
Agent 1.1 Re=10 Continuous control Trained from scratch
Agent 1.2 Re=10 Direct optimization Trained from scratch
Agent 2.1 Re=40 Continuous control Transfer learning from Agent 1.1
Agent 2.2 Re=40 Direct optimization Transfer learning from Agent 1.2
Agent 3.1 Re=100 Continuous control Transfer learning from Agent 1.1

As you can see, I used transfer learning and obtained excellent results from it: the agent was learning from 5 to 10 times faster than the agent trained from scratch. This is an exciting feature in general, but especially for Fluid Dynamics, where CFD is computationally expensive.

A DRL-fluid mechanics library?

At the beginning, my goal was to find out if such a library could be built (in a reasonable amount of time). During the creation of my test case, I realized the biggest issue with such a library was coming from the CFD solver: not only no one is using the same, but they all are pretty different.

For my test case, I based my code on Fenics, an easy-to-use solver (and super convenient since you can use it directly in Python). The code is available here. It is supposed to be as clear and re-usable as possible. It is based on Gym and stable-baselines.

I used Gym to build custom environment, always following the same pattern :

class FluidMechanicsEnv_(gym.Env):
    metadata = {'render.modes': ['human']}
    def __init__(self,
                    **kwargs):
        ...
        self.problem = self._build_problem()
        self.reward_range = (-1,1)
        self.observation_space = spaces.Box(low=np.array([]), high=np.array([]), dtype=np.float16)
        self.action_space = spaces.Box(low=np.array([]), high=np.array([]), dtype=np.float16)     
    def _build_problem(self,main_drag):
        ...
        return problem                
    def _next_observation(self):
        ...      
    def step(self, action):
        ...
        return obs, reward, done, {}   
    def reset(self):
        ...

With stable-baselines, I used their DRL algorithms implementations, and finally, as I said earlier, I used Fenics to build my test case using custom class that I hope will be re-used for other test cases. They use three main class: Channel, Obstacles, and Problem :

Channel :

Basically, this is the box you want your experiment to take place in. You only define it by giving the lower and upper corner of the box.

class Channel:

    """
    Object computing where our problem will take place.
    At the moment, we restraint ourself to the easier domain possible : a rectangle.
    :min_bounds:    (tuple (x_min,y_min))   The lower corner of the domain
    :max_bounds:    (tuple (x_max,y_max))   The upper corner of the domain
    :Channel:       (Channel)               Where the simulation will take place
    """
    def __init__(self,
                min_bounds,
                max_bounds,
                type='Rectangle'):

Obstacles :

This is the objects you want to add to the previously seen box. If you only want to use Fenics, you can only easily add squares, circles, or polygons. However, if you want to add much more complex shapes and mesh, you can use the Mesh function from Fenics.

class Obstacles:

    """
    These are the obstacles we will add to our domain.
    You can add as much as you want.
    At the moment, we only take care of squares and circles and polygons
    :size_obstacles:    (int)           The amount of obstacle you wanna add
    :coords:            (python list)   List of tuples, according to 'fenics_utils.py : add_obstacle'
    :Obstacles:         (Obstacles)     An Obstacles object containing every shape we wanna add to our domain
    """


    def __init__(self,
                size_obstacles=0,
                coords = []):

Problem :

This is where everything is taking place: setting up the box, placing the several objects, defining the boundary conditions, etc.

class Problem:

    def __init__(self,
                min_bounds,
                max_bounds,
                import_mesh_path = None,
                types='Rectangle',
                size_obstacles=0,
                coords_obstacle = [],
                size_mesh = 64,
                cfl=5,
                final_time=20,
                meshname='mesh_',
                filename='img_',
                **kwargs):

        """
        We build here a problem Object, that is to say our simulation.
        :min_bounds:                (tuple      (x_min,y_min)) The lower corner of the domain
        :max_bounds:                (tuple      (x_max,y_max)) The upper corner of the domain
        :import:import_mesh_path:   (String)    If you want to import a problem from a mesh, and not built it with mshr
        :types:                     (String)    Forced to Rectangle at the moment
        :size_obstacle:             (int)       The amount of shape you wanna add to your domain
        :coords_obstacle:           (tuple)     the parameters for the shape creation
        :size_mesh:                 (int)       Size of the mesh you wanna create
        :cfl:                       (int)
        :final_time:                (int)       Length of your simulation
        :mesh_name:                 (String)    Name of the file you wanna save your mesh in
        :filename:                  (String)    Name of the file you wanna save the imgs in
        """

However, this is not getting even close to a true DRL-fluid mechanics library, the issue being Fenics. While being very easy to use, it is a slow solver, and it would be impracticable to use it for a challenging problem. This is why, with other people working at the CEMEF, the goal is to build this library, linking DRL with other CFD libraries, most of them being C++ based.

In the end, we now have a better overview of how DRL can be used to help Fluid Mechanics problems. Moreover, several important features can be kept in mind: transfer learning, autoencoder, LSTM, and so on. If the goal of building a real DRL-CFD library was not achieved, the way it could be possible is now much clearer.

References

  1. Synchronisation through learning for two self-propelled swimmers - Novati et al. (2017)

  2. Artificial Neural Networks trained through Deep Reinforcement Learning discover control strategies for active flow control - Rabault et al. (2019)

  3. fluid directed rigid body control using deep reinforcement learning - Ma et al. (2018)

  4. Sensitivity of aerodynamic forces in laminar and turbulent flow past a square cylinder - Meliga et al. (2014)

  5. A review on Deep Reinforcement Learning for Fluid Mechanics - Garnier et al. (2019)

Github Repository

Legends of Code & Magic - a Codingame contest

Codingame

I now work at Flaneer, feel free to reach out if you are interested!

Codingame is a website where you can solve puzzles, work on optimization problems, and build bots for games. You have access to a vast range of codding languages, and I personally mainly used Java, Python & C++. What I enjoy the most is to build AI for multiplayer games: a card game, a quidditch match, a Tron battle, etc.

quidditch

A quidditch game from Codingame where your AI controls 2 players. This is a great game that includes great AI and physics.

csb

A game inspired from Star Wars where you control 2 pods. You can learn a lot about several algorithms: genetic algorithms, MCTS, min-max, etc.

Legends of Code & Magic (LOCAM)

The game

This is a game created by Jakub Kowalski and Radosław Miernik, in cooperation with Codingame. This is a card game, opposing two players. The game is turn-based, zero-sum, and without a possibility to tie. It is important to note that you only have 100ms to play each turn. The game is separated into two parts :

  • First, you have to draw 30 cards from a pool of 160 cards. During 30 turns, the game will propose you three cards, and you have to choose the one you want to pick.
  • After that, each player will play. I can do several actions: summon a creature, attack the opponent, use an item, etc.
  • The game will end whenever someone’s hp drop to zero.

locam-1

Overview of the game, with both players HP, mana, cards in hand and cards on the board.

The rules

We can split the game into several components: Drawings, Mana, Creatures, and Items.

  • Drawings: The first player starts with four cards, while the opponent begins with five. Each turn you will draw a new one (some of them might even allow you to draw several cards at the same time).

  • Mana: You can only use a card if you have enough mana. Both players start with one Mana and gain one for every turn they play.

  • Creatures: Creatures have a mana cost, attack points, and defense points. Some creatures start with specific abilities. By default, creatures can’t attack the turn they are summoned, and they can attack only once per turn. When a creature attacks another one, they both deal damage equals to their attack to the defense of the other creature. When a creature attacks the opponent, it deals damage equals to its attack to the HP of the opponent.

  • Items: When played, items have an immediate and permanent effect on the board or the players. They can modify the attack or defense of a creature, add or remove abilities. They can also damage or heal the opponent or yourself. You can find three types of items: green items (positive effect on your creatures), red items (harmful effect on the opponent’s creatures) and blue items (effects on yourself or the opponent directly).

As for the six special abilities :

One of the strongest card in Legends of Code & Magic, with the six different abilities.
  • Breakthrough: Creatures with Breakthrough can deal extra damage to the opponent when they attack enemy creatures. If their attack is higher than the defending creature’s defense, the excess damage is dealt to the opponent.

  • Charge: Creatures with Charge can attack the turn they are summoned.

  • Ward: Creatures with Ward ignore once the next damage they would receive from any source. The “shield” given by the Ward ability is then lost.

  • Lethal: Creatures with Lethal kill the creatures they deal damage to.

  • Guard: Enemy creatures must attack creatures with Guard first.

  • Drain: Creatures with Drain heal the player of the amount of the damage they deal (when attacking only).

My bot

My bot was split into several parts. The first one was dealing with the Drafting phase. It was a simple heuristic with a score for each card. The second part was dealing with the summoning phase, while the last one dealt with the attacking phase. I built both of them with variants of Monte-Carlo methods and Monte-Carlo Tree Search.

You only have a limited amount of time when you play :

Actions Allocated time
Response time per turn 100ms
Response time for the first draft turn 1000ms
Response time for the first battle turn 1000ms

Monte Carlo

Intuitively, a Monte-Carlo method means that we will use several random draws to approximate the value of something. It could mean: shooting random arrows in a square with a circle in it. The proportion of arrows into the circle will allow us to approximate $\pi$. In a game, we could simulate random moves and evaluate them to find the best sequence of move possible.

On a more mathematical point of view, let $x_1, x_2, …, x_N$ be i.i.d. laws, following the same law as a random variable $X$. For any borelian fuction $f$ (with $f(X) \in L^1$), we can calculate the following approximation :

\[\mathbb{E} (f(X)) = \frac{1}{N} \displaystyle \sum_{i=1}^{N} f(x_i)\]

In fact, if we denote $I = \mathbb{E}(f(X))$ and $I_N = \frac{1}{N} \displaystyle \sum_{i=1}^{N} f(x_i)$, we have thanks to the strong law of large number :

\[I_N \rightarrow I \; \; a.s.\]

For example, let’s say we are playing a game of tic-tac-toe. Every time we have to play, one solution could be to perform several random moves, evaluate them (which one is the most promising?) and play the best. Even more, we could also play several sequences of random moves (e.g.,: one move for me - one for my opponent - one for me) and apply the same strategy.

MCTS

Now, a Monte-Carlo method has several flaws, but one of them is the fact the moves are drawn randomly. We could spend a lot of time simulating moves that are not interesting at all. Therefore, an idea would be to simulate and evaluate the more promising moves. This is called a Monte-Carlo Tree Search.

The first idea is to view a game as a tree: each node being one possible state of the game.

Trees

A tree is a mathematical/CS object with powerful properties. The best way to see it is to consider how one is built :

  • Each tree starts with a particular node: the root
  • Each node has several sons
  • Each nodes (except the root) has unique father

A tree is a great and understandable way to store informations. A very vast theory also exists around it: how to run trough one, etc.

ttt-mcts

An overview of the beginning of a tic-tac-toe game, reproduced from Medium

An MCTS always follows the same four steps:

Selection

Starting from the root, we select childrens, often following the UCT formula (the node maximizing the following):

\[\frac{w}{n} + c\sqrt{\frac{\ln{N}}{n}}\]

where $w$ is the amount of win from this node, $n$ the amount of time we visited this node, $N$ the amount of time we visited this node’s father and $c$ an exploration parameter. This formula is supposed to give a good comprise between exploration and exploitation.

Expansion

If the selected node isn’t terminal, we simulate one child from it.

Simulation

We simulate a game, starting from this child until we reach a terminal state.

Backpropagation

Since we reached a terminal state, we can update every visited node with the new information: we lost, or we won.

ttt-mcts-full

Explanation of an MCTS, reproduced from Chaslot et al. (2006).

Evaluation function

An MCTS supposes that it is possible (in a reasonable amount of time) to reach a terminal state. Now, this might not be possible: only partial observations (like in this game, where you can’t predict the hand of the opponent), computational time that is too heavy, etc. Therefore, it is interesting to define an evaluation function (like in Evolutianory Algorithm): the goal of such a function is to characterize how good (or bad) a state of the game is.

For example in a game of tic-tac-toe, it could be the number of opportunity (square that would make you win if you played here)

Draft phase

When the contest started, I believed that the draft phase would not be the crucial part of the contest. As it will turn out, this was a mistake. My first strategy was to follow a mana curve (a sort of gaussian, with a mean at 6 of mana). If this strategy worked fine at first, this was not the best choice.

After some talks with other coders from the contest, the idea of hardcoding a score for each card emerged. The plan would then be to draw the card with the higher score. The real question was: how to compute such scores? The strategy I used was to build genetic algorithms; each individual of the population being a set of scores for each card.

Genetic Algorithms

To keep things as simple as possible, a genetic algorithm starts with a population made of a fixed number of individuals. At every generation, you compute the score of each individual, and then keep some of the best for the next generation.

To build the rest of this next generation, several options are available: mutations, melting two existing solutions, tournament to keep some of them, etc.

This is an optimization algorithm I love, and I used it a lot on Codingame.

Overview of a Genetic Algorithm. The part with different implementations can use tournament, random selections, full crossover or only mutations of the best individuals.

In order to evaluate each individual, I used several methods. All of them are based on the same strategy: playing many games with this draft against another fixed one, and keeping the win rate as the score of the individuals.

In the end, this method gave me a score for each card, score I kept until the end of the contest.

It seemed like a good idea, and it turned out it was. However, some other players went further, with varying scores depending on the deck you already had, and achieved even better results.

Playing phase

A key point here, since as I said earlier, it is impossible to simulate the game until a terminal state, is an evaluation function.

I used the same one for both the summoning and the attacking phase (you could even unsplit these two parts and evaluate them at the same time). During the contest, I built several functions:

Evaluation function 1 - Complex

The first I built (and the one I used in the end) is a function that takes the following form :

\[HP_{me} - HP_{opponent} + \displaystyle \sum_{\forall \; card \; \in \; B_{me}} f(card) - \displaystyle \sum_{\forall \; card \; \in \; B_{opponent}} f(card)\]

where $B_{me}$ (respectively $B_{opponent}$) is my board (respectively the opponent’s board), and $f$ is an evaluation function for each card, that takes the following (complex form) :

public double f(int n,int mineHp,int opponentHp){

  double score = 2*(double)this.defense + 5*(double)this.attack;

  if (opponentHp <= this.hpChangeOpponent && n == 0){
    score += 1000;
  }

  if (this.guard && mineHp < 20 && n == 0){
    score += 1.5*(double)this.defense;
  }
  if (this.CardDraw > 0 && mineHp < 15 && n == 0){
    score += 4;
  }
  if (this.breaktrough){
    score += ((double)this.attack - 1.0)*0.9;
  }
  if (this.drain){
    score += 0.7*(double)this.attack;
  }
  if (this.lethal){
    score += 20.0 - (double)this.attack;
  }
  if (this.ward){
    score += (double)this.attack;
    score += 4;
  }
  if (this.ward && this.lethal){
    score += 20.0 - (double)this.attack;
  }

  if (mineHp < 10 && this.guard && n == 0){
    score += 3*(double)this.defense;
  }
  if (opponentHp < 10 && this.breaktrough && n == 0){
    score += (double)this.attack;
  }
  if (opponentHp < 10 && this.drain && n == 1){
    score += (double)this.attack;
  }
  return score;  
}

Evaluation function 2 - Simple

However, I also tried (unsuccessfully) other evaluation function during the contest. A simpler one was talking the same shape as above, but where $f$ was the score of each card from the drafting phase.

Evaluation function 3 - Tempo

The last one was based on the tempo of the player. Here is a definition from an Hearthstone Wiki :

While tempo is often considered to be a common term and fairly simple to understand, it is also often considered to be among the more difficult things in Hearthstone to describe by definition rather than by examples or through a practice run. Many suggestions have been made for verbally describing this use of tempo, some of which are below:

  • The momentum of the match. Probably the original definition and among many unofficially accepted “right” ones. However, its imprecision can lead to misunderstandings and odd interpretations causing confusion in discussion.
  • The change of the board state in a given time (usually either a play or a turn, as in “a good/bad tempo play/turn”). What this definition has in its favor is its simplicity, but this very simplicity means that it does not cover all the uses of this term.
  • The speed at which a player is reaching his/her win condition. An alternate interpretation of the original definition that is especially useful when discussing decks that don’t mainly win by gaining board control or that don’t aim at reaching that goal at the start of the game.
  • The value of the effect on the board state divided by the mana used. The main upside of this definition is its being in the form of a mathematical formula because in the future it could be applied to calculations.

I tried to evaluate both my and the opponent’s tempo, but this wasn’t successful either. This evaluation was, in fact, pretty complex too, using both players boards and life.

Summoning phase

The summoning phase was kept as simple as possible to waste as little time as possible.

First, I would check if I had to summon some cards. For example, a Guard card could save me if I was about to die, or a Charge card could finish the opponent immediately. Therefore, I would always check first if certain situations occurred and apply them directly.

Once this was done, I would generate a random sequence of possible cards to summon, simulate it, and evaluate the board. If the score was better than the current best, I would keep it.

ttt-mcts-full

Representation of the process going on for the summoning phase.

Attacking phase

Regarding the attacking phase, my strategy was to implement a depth-3 MCTS, as shown in the figure below.

ttt-mcts-full

View of the tree I built during the attacking phase.

However, some minor changes had to be done. First, this scenario simulates attacks from the opponent and the MCTS selection has to be modified: you now have to take into account your loose (or your worst evaluation function) since you want the opponent to be as good as possible. Then, you can’t simulate the whole game until the end, and this is why I decided to stop the expansion phase at depth-3, with an evaluation of the board (rather than a boolean you win, you lose).

Finally, my code took the following form (with the time limit in mind):

Actions Allocated time
Sending outputs 5ms
Response time for the summoning phase 25ms
Response time for the attacking phase 70ms

Conclusion

In the end, 2200 people participated in the contest. It was leagues based, and we were around 90 in the final league. To go to a higher league, you need to have a better score than the boss. The score is based on your wins and losses, against other players. You can find below my ranking throughout the contest. I ended up in the 69th position and the 11th best student.

Overall, it was an enjoyable contest. There was a lot of work to do, a lot of different aspects to take into account; several algorithms could be used: a great game!

ttt-mcts-full

My ranking (out of 2,200) some time in the contest.

References

  1. Medium - MCTS

  2. Monte-Carlo Tree Search: A New Framework for Game AI - Chaslot et al. (2006)

  3. Codingame

  4. Legens of Code & Magic

  5. Stream from reCurse during the contest

  6. Basic RL approach from eiisolver