
Analysing cyclic tag systems with machine learning, part 1


This post was meant to come after part 2 of my last post, which would have detailed the development process for my cyclic tag system simulator (now moved to its own GitHub repository and named cyts). That post ended up taking far longer than expected, so this one is coming first.

Why I’m using neural networks #

As I explain on the project page here, I want to find heuristics to estimate whether a cyclic tag system halts or not. This is essentially approximating the halting problem for a finite set of programs. Using neural networks specifically is largely an excuse for me to learn them, but I also realized that nonlinearity may be required to tackle this problem. To see why, it’s helpful to graph the data:

[Figure: plot of which systems did not halt, organized by production and color-coded]
The darker the dot, the more inputs on which the cyclic tag system specified by the two productions halted.

Notice how on production 8 (corresponding to 000), almost every tag system halts. Similarly, every tag system halts when both productions are 6 or less, which corresponds to the following production rules (a short decoding sketch follows the list):

  • \(1 \rightarrow \epsilon\) (empty string)
  • \(2 \rightarrow 0\)
  • \(3 \rightarrow 1\)
  • \(4 \rightarrow 00\)
  • \(5 \rightarrow 01\)
  • \(6 \rightarrow 10\)

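For reference, this numbering appears to be the standard enumeration of binary strings: write the index in binary and drop the leading 1, so 7 would correspond to 11 and 8 to 000. A minimal sketch in Python, assuming that convention holds (decode_production is a name I’m using here for illustration, not one from cyts):

```python
def decode_production(n: int) -> str:
    """Decode a production index into the binary string it appends.

    Writing n in binary and dropping the leading 1 reproduces the table
    above: 1 -> "" (epsilon), 2 -> "0", 3 -> "1", ..., 8 -> "000".
    """
    return bin(n)[3:]  # e.g. bin(8) == "0b1000" -> "000"


# sanity check against the production rules listed above
assert [decode_production(n) for n in range(1, 9)] == \
    ["", "0", "1", "00", "01", "10", "11", "000"]
```
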
My goal was to have a single model pick up both of these traits and any others in the underlying data. For that reason, I assumed a neural network would be the best option (spoiler: it wasn’t).

Designing and training the neural network #

I used PyTorch on my CPU to do the training with a dataset of 3,375 rows of 3 features each. The neural network had an input layer of 3 neurons, two hidden layers of 5 neurons each, and a single-neuron output layer. The output layer had a sigmoid activation to restrict the output to between 0 and 1, while the other layers used ReLU activations. I used binary cross-entropy for the loss function and the Adam optimizer built into PyTorch with a learning rate of 1e-3. PyTorch makes implementing all of this incredibly easy.
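
A minimal sketch of that setup, based on the description above rather than the exact code in the repo:

```python
import torch
from torch import nn

# 3 input features -> two hidden layers of 5 neurons -> single sigmoid output
model = nn.Sequential(
    nn.Linear(3, 5),
    nn.ReLU(),
    nn.Linear(5, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
    nn.Sigmoid(),
)

loss_fn = nn.BCELoss()  # binary cross-entropy on the sigmoid output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```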

The training loop is standardized in PyTorch: take a batch, calculate the predicted outcome, calculate the loss, then zero the gradients of the optimizer, do backpropagation, and step the optimizer. Each of these steps is one line, and all that’s left to do manually (if not using dataloaders) is to batch the data and write two nested loops: one over epochs and one over the batches within each epoch. Because my data was so small, the model trained in very little time, even with 5 epochs and 36 batches per epoch. The code used is here, with the training loop here.
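
In outline, with hypothetical tensors X_train and y_train standing in for the batched data (the repo’s loop differs in the details), it looks roughly like this:

```python
n_epochs, batch_size = 5, 75  # 2,700 training rows -> 36 batches per epoch

for epoch in range(n_epochs):
    for start in range(0, len(X_train), batch_size):
        X_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]

        y_pred = model(X_batch)          # predicted halting probabilities
        loss = loss_fn(y_pred, y_batch)  # binary cross-entropy

        optimizer.zero_grad()            # zero the gradients
        loss.backward()                  # backpropagation
        optimizer.step()                 # step the optimizer

    print(f"Finished epoch {epoch}, latest loss {loss.item()}")
```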

The results #

After running the neural network and piping the diagnostic output to a file, the initial results seemed promising:

```
Using device cpu
Sequential(
  (0): Linear(in_features=3, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=5, bias=True)
  (3): ReLU()
  (4): Linear(in_features=5, out_features=1, bias=True)
  (5): Sigmoid()
)
5 epochs, batch size 75 on training set of 2700 datapoints
Finished epoch 0, latest loss 0.6375002861022949
Finished epoch 1, latest loss 0.6172397136688232
Finished epoch 2, latest loss 0.6010852456092834
Finished epoch 3, latest loss 0.5878480076789856
Finished epoch 4, latest loss 0.5746302604675293

Accuracy on training dataset: 71.8148%, 1939 correct labels out of 2700
Accuracy on test dataset: 70.5185%, 476 correct labels out of 675
```

70% accuracy is great! Definitely improvable, but still great. However, experimenting didn’t lead to any improvements; no matter how I changed the epochs or learning rate, that accuracy wouldn’t budge. Next, I checked the predicted outputs manually with print statements and found something worrying: almost every datapoint I chose was predicted to halt, even when my data showed that the corresponding cyclic tag system hadn’t halted. Since I’d set “this cyclic tag system does not halt” as the null hypothesis, this meant I was getting excessive false positives. I never found a false negative either, which was even more concerning. With this in mind, I added some print statements to calculate the Type I and Type II errors:
```
756 false positives and 5 false negatives on training dataset
197 false positives and 2 false negatives on test dataset
Type I error (α): 28.0000% training, 29.1852% test
Type II error (β): 0.1852% training, 0.2963% test
```

This was exactly why I couldn’t improve the accuracy: my neural network just predicted “positive” for almost everything! A good statistical estimator should have roughly equal Type I and Type II error, or at least have both bounded; mine had massive Type I error and almost no Type II error. It turns out that around 70% of the cyclic tag systems in my dataset did halt, so a model that simply guessed that everything halted would already score around 70% - this baseline is called the chance accuracy. So while my model’s accuracy looked respectable, it hadn’t really learned much.
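
For completeness, here is roughly how those counts can be computed; the tensor names are hypothetical rather than the ones in the repo, and the rates are taken over the whole dataset to match the output above:

```python
# Hypothetical tensors, both 1-D and the same length:
#   probs - the model's sigmoid outputs
#   y     - the true labels (1 = halts, 0 = does not halt)
preds = (probs >= 0.5).float()

false_pos = ((preds == 1) & (y == 0)).sum().item()  # predicted halt, didn't
false_neg = ((preds == 0) & (y == 1)).sum().item()  # predicted no halt, halted

type_1 = false_pos / len(y)  # Type I error rate over the whole dataset
type_2 = false_neg / len(y)  # Type II error rate over the whole dataset

# Chance accuracy: what you get by always guessing the majority class
halt_rate = y.float().mean().item()
chance_accuracy = max(halt_rate, 1 - halt_rate)
```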

Conclusion #

My biggest mistake was assuming that because a neural network is a complicated model with a ton of tunable parameters, it would be the best at capturing the structure of the data. I’ll need to go back to the drawing board and try to better understand what the data is actually telling me. I’ll test whether perceptrons or decision trees perform better, and if they do, figure out why and whether they tell me anything interesting about the data.