Analysing cyclic tag systems with machine learning, part 1
This post was meant to come after part 2 of my last post, which would have detailed the development process for my cyclic tag system simulator, since moved to its own GitHub repository and named cyts.
That post ended up taking far longer than expected, so this is coming first.
Why I’m using neural networks #
As I explain in the project page here, I want to find heuristics to estimate whether a cyclic tag system halts or not. This is essentially approximating the halting problem for a finite set of programs. Using neural networks specifically is largely a way for me to learn neural networks, but I also realized that nonlinearity may be required to tackle this problem. To see why, it’s helpful to graph the data:
Notice how on production 8 (corresponding to 000), almost every tag system halts. Similarly, every tag system halts when both productions are 6 or less, which corresponds to the following production rules (see the sketch after this list for how the productions are numbered):
- \(1 \rightarrow \epsilon\) (empty string)
- \(2 \rightarrow 0\)
- \(3 \rightarrow 1\)
- \(4 \rightarrow 00\)
- \(5 \rightarrow 01\)
- \(6 \rightarrow 10\)
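For reference, here’s a minimal sketch of how that numbering can be generated, assuming the productions are enumerated in shortlex order (the empty string first, then all strings of length 1, then length 2, and so on). That assumption matches production 8 being 000 and gives 15 productions of length at most 3, consistent with the 3,375 = 15³ rows mentioned below.

```python
def production_string(n: int) -> str:
    """Binary string for production number n (1-based), assuming shortlex order.

    1 -> "" (empty), 2 -> "0", 3 -> "1", 4 -> "00", ..., 8 -> "000", ...
    """
    length = n.bit_length() - 1   # strings of length L start at index 2**L
    offset = n - (1 << length)    # position among the strings of that length
    return format(offset, "b").zfill(length) if length else ""


# Quick check against the production rules listed above:
assert [production_string(i) for i in range(1, 9)] == ["", "0", "1", "00", "01", "10", "11", "000"]
```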
My goal was to have a single model pick up both of these traits and any others in the underlying data. For that reason, I assumed a neural network would be the best option (spoiler: it wasn’t).
Designing and training the neural network #
I used PyTorch on my CPU to do the training, with a dataset of 3,375 rows and 3 features each. The neural network had an input layer of 3 neurons, two hidden layers of 5 neurons each, and a single output neuron. The output layer had a sigmoid activation to restrict the output to between 0 and 1, while the other layers used ReLU activations. I used binary cross-entropy for the loss function and PyTorch’s built-in Adam optimizer with a learning rate of 1e-3. PyTorch makes implementing all of this incredibly easy.
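Concretely, that setup looks roughly like this in PyTorch (a minimal sketch; the actual code is linked below):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# 3 input features -> two hidden layers of 5 neurons -> 1 sigmoid output
model = nn.Sequential(
    nn.Linear(3, 5),
    nn.ReLU(),
    nn.Linear(5, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
    nn.Sigmoid(),
).to(device)

loss_fn = nn.BCELoss()  # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```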
The training loop is standardized in PyTorch: take a batch, calculate the predicted outcome, calculate the loss, zero the optimizer’s gradients, do backpropagation, and step the optimizer. Each of these steps is one line, and all that’s necessary to do manually (if not using dataloaders) is to batch the data and write two nested loops: one over epochs and one over the batches within each epoch. Because my data was so small, the model trained in very little time, even with 5 epochs and 36 batches per epoch. The code used is here, with the training loop here.
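A minimal version of that loop might look like the following, where `X_train` and `y_train` are hypothetical names for the feature and label tensors (both floats, with labels shaped `(N, 1)` to match the single sigmoid output):

```python
n_epochs, batch_size = 5, 75

for epoch in range(n_epochs):
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]

        y_pred = model(X_batch)          # forward pass on the batch
        loss = loss_fn(y_pred, y_batch)  # binary cross-entropy

        optimizer.zero_grad()            # zero the gradients
        loss.backward()                  # backpropagation
        optimizer.step()                 # update the weights

    print(f"Finished epoch {epoch}, latest loss {loss.item()}")
```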
The results #
Running the neural network and piping the diagnostic output to a file, I got initial results that seemed promising:
```
Using device cpu
Sequential(
  (0): Linear(in_features=3, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=5, bias=True)
  (3): ReLU()
  (4): Linear(in_features=5, out_features=1, bias=True)
  (5): Sigmoid()
)
5 epochs, batch size 75 on training set of 2700 datapoints
Finished epoch 0, latest loss 0.6375002861022949
Finished epoch 1, latest loss 0.6172397136688232
Finished epoch 2, latest loss 0.6010852456092834
Finished epoch 3, latest loss 0.5878480076789856
Finished epoch 4, latest loss 0.5746302604675293
Accuracy on training dataset: 71.8148%, 1939 correct labels out of 2700
Accuracy on test dataset: 70.5185%, 476 correct labels out of 675
756 false positives and 5 false negatives on training dataset
197 false positives and 2 false negatives on test dataset
Type I error (α): 28.0000% training, 29.1852% test
Type II error (β): 0.1852% training, 0.2963% test
```
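Counts like these can be computed by thresholding the sigmoid output at 0.5 and comparing the resulting predictions to the labels. Here’s a minimal sketch of that kind of evaluation, with `X_test` and `y_test` as placeholder names (the actual evaluation code is in the linked repository):

```python
with torch.no_grad():
    predictions = (model(X_test) >= 0.5).float()  # threshold the sigmoid output

correct = (predictions == y_test).sum().item()
false_positives = ((predictions == 1) & (y_test == 0)).sum().item()
false_negatives = ((predictions == 0) & (y_test == 1)).sum().item()

print(f"Accuracy on test dataset: {100 * correct / len(y_test):.4f}%, "
      f"{correct} correct labels out of {len(y_test)}")
print(f"{false_positives} false positives and {false_negatives} false negatives on test dataset")
```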
Conclusion #
My biggest mistake was assuming that because a neural network is a complicated model with a ton of parameters to tune, it would be the best at capturing the structure of the data. I’ll need to go back to the drawing board and try to better understand what the data is actually telling me. I’ll test whether perceptrons or decision trees perform better, and if they do, why, and whether they tell me anything interesting about the data.