Approach

Simulate First

The complexity of this system was one of the primary drivers behind our decision to implement the concept in simulation first. In addition to being much easier to debug, simulations provide reliable, reproducible results that are not subject to noise from the physical world. This allowed us to iterate quickly and frequently on design decisions and parameters until we found a model we were confident worked. Since our project was split into multiple phases, it also minimized downtime between phases, as the output of one phase could be fed directly into the next without rewriting a large amount of code.


That said, we viewed the simulations only as a proof of concept and not as a substitute for an actual physical testbed. Building one, however, required waiting on parts, reading and understanding component specifications, writing appropriate drivers, and dealing with real physical phenomena, so we anticipated the physical system would be a significant source of delay, with several aspects outside our direct control. In addition, debugging would be considerably more difficult due to the coupling of hardware and software components.

Introduction of Reinforcement Learning

At its core, reinforcement learning is concerned with agents taking actions in a particular state in order to maximize a reward function [1]. Applying this technique to our problem seemed especially promising because the pendulum, at any point in time, can be modeled as a function of four primary variables: distance along the track (x), linear track velocity (x'), angle with respect to upright neutral (θ), and angular velocity (θ') [2]. We represent a state from these four quantities and treat changes in them as transitions to the next state. One thing to note, however, is that every state-action pair must be stored in a table with a corresponding reward value (which can be updated), so we needed to discretize these variables to keep the table from growing too large. By maximizing the eventual reward, this approach can teach the pendulum to learn to balance itself.
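
As a concrete illustration, the sketch below (Python) shows one way the continuous state could be discretized and stored in a Q-table; the bin edges, bin counts, and action set here are placeholders rather than the values used in our experiments.

    import numpy as np

    # Placeholder discretization; ranges and bin counts are illustrative only.
    X_BINS     = np.linspace(-2.4, 2.4, 7)     # track position x (m)
    XDOT_BINS  = np.linspace(-3.0, 3.0, 7)     # track velocity x' (m/s)
    TH_BINS    = np.linspace(-0.21, 0.21, 9)   # angle theta from upright (rad)
    THDOT_BINS = np.linspace(-2.0, 2.0, 9)     # angular velocity theta' (rad/s)

    N_ACTIONS = 3  # e.g., push left, do nothing, push right

    def discretize(x, x_dot, theta, theta_dot):
        """Map a continuous (x, x', theta, theta') state to discrete bin indices."""
        return (np.digitize(x, X_BINS),
                np.digitize(x_dot, XDOT_BINS),
                np.digitize(theta, TH_BINS),
                np.digitize(theta_dot, THDOT_BINS))

    # Q-table: one cell per discretized state-action pair.
    q_table = np.zeros((len(X_BINS) + 1, len(XDOT_BINS) + 1,
                        len(TH_BINS) + 1, len(THDOT_BINS) + 1, N_ACTIONS))

Coarser bins keep the table small at the cost of control resolution; this trade-off is exactly why the discretization mentioned above matters.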

The real advantage of employing this, or any deep learning approach for that matter, is that the state-action table can be iterated upon and finalized offline [2]. When deployed to an online system, the trained network can then be referenced in real time, since all of its necessary parameters are fixed. This significant reduction in computational complexity, we believe, would allow the trained network to serve as a real-time guard, a layer of security on top of cyber-physical systems, to potentially identify compromised hardware components. The trained network has prior knowledge of good and bad behavior encoded in its Q-values and can therefore, we believe, detect an attack by classifying it as irregular behavior that would substantially drop the Q-value in comparison to all permissible transitions.
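
For reference, the offline training step we rely on is the standard Q-learning update; a minimal sketch follows, in which the learning rate, discount factor, and exploration rate are illustrative rather than tuned values.

    import random

    ALPHA   = 0.1   # learning rate (illustrative)
    GAMMA   = 0.99  # discount factor (illustrative)
    EPSILON = 0.1   # exploration rate (illustrative)

    def choose_action(q_table, state):
        """Epsilon-greedy action selection used during offline training."""
        if random.random() < EPSILON:
            return random.randrange(q_table.shape[-1])
        return int(q_table[state].argmax())

    def q_update(q_table, state, action, reward_value, next_state):
        """One Q-learning update for a (state, action, reward, next state) step."""
        td_target = reward_value + GAMMA * q_table[next_state].max()
        q_table[state + (action,)] += ALPHA * (td_target - q_table[state + (action,)])

Once training converges, only the greedy lookup (the argmax above) is needed online, which is what makes the real-time guard feasible.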

Reward Function Selection

The convergence of reinforcement learning training is dictated largely by the reward function by which values in the Q-table are updated [1]. In other words, it was crucial to select a function that, when maximized over the set of discretized states across many trials, would produce an end behavior that leaves the pendulum standing upright. Fortunately, prior work in this area indicated that the four variables of importance are exactly the four used in our state model [2], so there is intuition behind the selection. In our case, however, we decided to zero in on only two of these attributes, x and θ, since we reasoned that the linear and angular velocities the system passes through along the way are somewhat inconsequential. Our reward function is thus:

R(x, x', θ, θ') = -(x² + θ²)

There is, however, an important point worth addressing here: the reader may wonder why there is such a strong, let alone any, dependence on x at all, considering that the angle ultimately decides the success or failure of the balance. While not as critical in simulation, in reality the track on which an inverted pendulum balances has a finite length. We found that with this term removed, the slider moved quite far to the left or right in its effort to keep θ near upright. To impose a tighter constraint on linear movement, we penalize the reward for far-swinging motions along the track. However, to preserve the greater importance of θ, the value of x is clamped to a much smaller range.
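
A sketch of this reward with the clamp written out explicitly is shown below; the clamp limit is an illustrative placeholder, not our tuned value.

    def reward(x, x_dot, theta, theta_dot, x_limit=0.5):
        """R(x, x', theta, theta') = -(x^2 + theta^2), with x clamped so that
        theta dominates the penalty.

        x_limit is an illustrative placeholder; x' and theta' are unused by design.
        """
        x_clamped = max(-x_limit, min(x_limit, x))
        return -(x_clamped ** 2 + theta ** 2)

Clamping x caps its contribution to the penalty, so the agent is still discouraged from drifting toward the ends of the track without ever letting the position term outweigh the angle term.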

Attack Detection

Once the network has been trained, it serves as a guard layer situated on top of the standard controller (to keep things simple, we focus primarily on a compromised controller). The network is trained offline to assign appropriate Q-values to possible state-action combinations, with the expectation that the controlling agent maximizes its reward at any current state. Our intention is for the network to serve as a baseline to cross-reference, so that any action that does not fall within a reasonable error of this expectation can be classified as an anomaly, which in this case we attribute to an attack.


The current algorithm for attack detection is built on the premise of the RL agent assigning a measure of "goodness" to a particular action. The action taken by the controller at any particular state is assigned a scaled score from 0 to 100, with 100 corresponding to the controller taking the best (maximal Q-value) action for that state. We compare this Q-value to those of all the discretized actions for the given state and are ultimately interested in the percentage of actions whose Q-values the controller's decision matches or exceeds. If this percentage is abnormally low, an attack is flagged. The attack flag remains high for the duration of the attack and drops only when the controller demonstrates signs of recovery.
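
A simplified sketch of this scoring follows; the detection threshold is an illustrative placeholder, and the sketch omits the hysteresis that keeps the flag high until the controller shows recovery.

    ATTACK_THRESHOLD = 40.0  # illustrative threshold; tuned empirically in practice

    def action_score(q_table, state, controller_action):
        """Score the controller's action from 0 to 100 as the percentage of
        discretized actions whose Q-values it matches or exceeds in this state."""
        q_values = q_table[state]
        chosen_q = q_values[controller_action]
        return 100.0 * float((q_values <= chosen_q).mean())

    def is_attack(q_table, state, controller_action):
        """Flag an attack when the controller's action scores abnormally low."""
        return action_score(q_table, state, controller_action) < ATTACK_THRESHOLD

A healthy controller keeps the score near 100 because it consistently picks actions at or near the network's expectation; a compromised one falls below the threshold.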

Attack Correction

The final step after detecting an attack is to correct the behavior in the event that the controller is compromised. In other words, when the attack flag goes high, the trained neural network wrests control away from the PID controller and continues to operate the system while the controller either weathers the attack or is deemed compromised beyond repair. If the controller is able to return to a healthy state, the neural network relinquishes control back to the PID controller.
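
The handoff itself amounts to a simple arbitration step each control cycle; the sketch below assumes a hypothetical pid_controller object with an act method.

    def control_step(q_table, state, pid_controller, attack_flag):
        """Choose which controller drives the system on this cycle.

        While the attack flag is high, the trained network's greedy action is
        applied; otherwise the PID controller (hypothetical .act interface)
        stays in charge.
        """
        if attack_flag:
            return int(q_table[state].argmax())  # network takes over
        return pid_controller.act(state)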


While this may naturally raise the question of why the PID controller should be used at all, the important point is that even a trained neural network is, at best, only approximating the expected behavior of the system. As such, it does not maintain as high a quality of control as a standard controller and should only be used as a fallback in the event of an emergency.

References

[1] Pierre Yves Glorennec, “Reinforcement Learning: an Overview” [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.4135&rep=rep1&type=pdf. [Accessed: 19-Nov-2017].

[2] Charles W. Anderson, “Learning to Control an Inverted Pendulum Using Neural Networks” [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=24809. [Accessed: 19-Nov-2017].