Neural Network Example

Our goal in implementing ML-related features was to provide maximum flexibility. Although GAMSPy's primary purpose is not neural network training, we wanted to demonstrate the process for those who are curious or need it for research. Implementing a full mini-batch gradient descent training loop would be time-consuming and would not make for an introductory example, so we instead train on a single mini-batch of data and stop there.

We will train a neural network to classify handwritten digits from the MNIST dataset. For this example we use a simple feed-forward network, since it is the easiest to implement and demonstrate. The input layer takes the flattened images, so it has 28x28 = 784 dimensions. We use a single hidden layer with 20 neurons, and the output layer has 10 neurons, one per digit.
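To make the layer shapes concrete before we express them in GAMSPy, here is a purely illustrative PyTorch description of the same architecture. We will not train it this way; the module and its name are our own sketch, not part of the example:

import torch.nn as nn

# Illustrative PyTorch equivalent of the network described above (not used for training)
reference_net = nn.Sequential(
    nn.Flatten(),                     # 28x28 image -> 784 inputs
    nn.Linear(784, 20, bias=False),   # first linear layer (weights w1)
    nn.Tanh(),                        # hidden activation
    nn.Linear(20, 10, bias=False),    # second linear layer (weights w2)
    nn.LogSoftmax(dim=1),             # log probabilities over the 10 digits
)
# Training such a module would minimize nn.NLLLoss() on the log probabilities,
# which is exactly the loss we formulate in GAMSPy below.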

We start with the imports:

import sys

from collections import Counter

import gamspy as gp
import gamspy.math
import numpy as np
import pandas as pd
import torch

from gamspy.math.matrix import dim
from torchvision import datasets, transforms

Even though we won't use PyTorch to train the neural network, it is still useful for loading the sample training data.

Next, we define the datasets and data loaders:

hidden_layer_neurons = 20

torch.manual_seed(7)
np.random.seed(7)

train_kwargs = {'batch_size': 250} # should be multiple of 10
test_kwargs = {'batch_size': 128}


mean = (0.1307,)
std = (0.3081,)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
    ])


train_dataset = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('../data', train=False, transform=transform)

train_loader = torch.utils.data.DataLoader(train_dataset, **train_kwargs)
test_loader = torch.utils.data.DataLoader(test_dataset, **test_kwargs)

def get_balanced_batch(train_loader, sample_size):
    # collect exactly `sample_size` examples of each of the 10 digits
    counts = [0] * 10
    good_data = []
    good_target = []
    finished = 0

    for data, target in train_loader:
        for row, row_label in zip(data, target):
            if counts[row_label.item()] < sample_size:
                good_data.append(row)
                good_target.append(row_label)
                counts[row_label.item()] += 1
                if counts[row_label.item()] == sample_size:
                    finished += 1  # this digit class is complete

            if finished == 10:
                break

        if finished == 10:
            break

    return torch.stack(good_data), torch.stack(good_target)

We use the MNIST dataset since its input size is not too large. It consists of grayscale images of handwritten digits, each 28x28 pixels. Here are some examples:

[Figure: sample MNIST digits]

MNIST from: LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

In this example, we load only one batch of data, ensuring that each digit is equally represented, with batch_size / 10 = 25 instances of each. If we were not limiting training to a single batch, this equal representation would not be necessary. The balancing is done by the get_balanced_batch function.
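As a quick sanity check, and the reason for the Counter import above, you can verify that the returned batch is balanced. This extra call is our addition and is not needed for the rest of the walkthrough:

data, target = get_balanced_batch(train_loader, train_kwargs["batch_size"] // 10)
print(Counter(target.tolist()))  # expect 25 occurrences of each digit, e.g. Counter({0: 25, 1: 25, ...})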

Properly initializing weights is crucial for gradient descent, and this remains true when using a non-linear solver to optimize the neural network. Without good initial weights, the solver may take much longer to optimize or might even fail to improve. Therefore, we define our method for weight initialization, using Xavier initialization.

def uniform_xavier_init(n_input: int, n_output: int, gain: float) -> np.ndarray:
    # also https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_uniform_
    a = gain * np.sqrt(6 / (n_input + n_output))
    return np.random.uniform(-a, a, (n_input, n_output))

w1_data = uniform_xavier_init(784, hidden_layer_neurons, 5/3)  # 5/3 is the recommended gain for tanh
w2_data = uniform_xavier_init(hidden_layer_neurons, 10, 1)     # gain 1 for the plain linear output layer

w1_data contains the weights for the first linear layer, and w2_data contains the weights for the second linear layer. With these weights available, we can begin describing the neural network in GAMSPy.

Let’s start by defining our variables:

batch = train_kwargs["batch_size"]

# Create a container
m = gp.Container()

w1 = gp.Variable(m, name="w1", domain=dim(w1_data.shape))
w1.setRecords(w1_data)
w1.lo[...] = -5
w1.up[...] = 5

w2 = gp.Variable(m, name="w2", domain=dim(w2_data.shape))
w2.setRecords(w2_data)
w2.lo[...] = -5
w2.up[...] = 5

The w1 and w2 variables hold the weights of the neural network. We set upper and lower bounds on them to speed up the learning process, as proper bounds can be beneficial. Additionally, excessively large weight values are often indicative of overfitting.

Then we define the rest of the variables:

a1 = gp.Parameter(m, name="a1", domain=dim((batch, 784))) # input

target_set = gp.Set(
    m,
    name="targets",
    domain=dim([batch, 10]),
    uels_on_axes=True
)

z2 = gp.Variable(m, name="z2", domain=dim((batch, hidden_layer_neurons)))
z2.up[...] = 10
z2.lo[...] = -10

z3 = gp.Variable(m, name="z3", domain=dim((batch, 10)))
z3.up[...] = 10
z3.lo[...] = -10

a2 = gp.Variable(m, name="a2", domain=dim((batch, hidden_layer_neurons)))
a2.up[...] = 1
a2.lo[...] = -1

loss = gp.Variable(m, name="loss")

a1 represents our input layer, containing a batch of images. target_set is the set that holds the expected labels for these images. z2 is the output after the first linear layer, and a2 is the tanh-activated version of z2. z3 is the output after the second and final linear layer. a3, which we will define later, will be the log_softmax-activated version of z3, representing our log probabilities. Finally, the loss variable is used to calculate the negative log-likelihood loss.

Now, let’s define the relationships between our variables, also known as the forward pass.

calc_mm_1 = gp.Equation(m, name="calc_mm_1", domain=dim((batch, hidden_layer_neurons)))
calc_mm_1[...] = z2 == a1 @ w1

calc_activation = gp.Equation(m, name="calc_activation", domain=dim((batch, hidden_layer_neurons)))
calc_activation[...] = a2 == gp.math.tanh(z2)

calc_mm_2 = gp.Equation(m, name="calc_mm_2", domain=dim((batch, 10)))
calc_mm_2[...] = z3 == a2 @ w2

a3, _ = gp.math.activation.log_softmax(z3)

set_loss = gp.Equation(m, name="calc_loss")
set_loss[...] = loss == gp.Sum(target_set[a3.domain[0], a3.domain[1]], -a3)

train_nn = gp.Model(
     m,
     name="train",
     equations=m.getEquations(),
     problem="NLP",
     sense="min",
     objective=loss,
)

z2 is defined as the input multiplied by the weights of the first linear layer, using the @ operator for matrix multiplication. a2 is the tanh activation of z2. Similarly, z3 is a2 multiplied by the weights of the second linear layer. Finally, we obtain our log probabilities using the log_softmax function. If you are curious why we needed equations for tanh but not for log_softmax, see our Activation Functions section. We then calculate the negative log-likelihood loss by summing the negative log probabilities of the expected output neurons for each image in the batch.
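For intuition, the same forward pass and loss can be written in a few lines of plain NumPy. This is only a reference sketch; the function name forward_loss and the one-hot target argument are our own, not part of the GAMSPy model:

def forward_loss(a1_data, w1_data, w2_data, target_onehot):
    # NumPy mirror of the GAMSPy equations above
    z2 = a1_data @ w1_data                # first linear layer
    a2 = np.tanh(z2)                      # tanh activation
    z3 = a2 @ w2_data                     # second linear layer
    # log_softmax with the usual max-shift for numerical stability
    shifted = z3 - z3.max(axis=1, keepdims=True)
    a3 = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood: sum of -log p of the expected digit for each image
    return -(a3 * target_onehot).sum()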

We define train_nn with the objective of minimizing the loss function based on the equations we provided. This problem is classified as an NLP (Nonlinear Programming) problem.

So far, we have defined our model in GAMSPy and created data loaders. However, our model is not yet loaded with data. We will address that next.

data, target = get_balanced_batch(train_loader, batch // 10)
data = data.reshape(batch, -1)
init_data = data.detach().numpy()

# reshape the target, labels, so that we can provide them to GAMSPy
target_df = pd.DataFrame(target)
target_df["val"] = 1
target_df = target_df.pivot(columns=[0], values="val").fillna(0).astype(bool)


a1.setRecords(init_data)
target_set.setRecords(target_df, uels_on_axes=True)

z2_data = init_data @ w1.toDense()
z2.setRecords(z2_data)

a2_data = np.tanh(z2_data)
a2.setRecords(a2_data)

z3_data = a2_data @ w2.toDense()
z3.setRecords(z3_data)

We obtain a single batch of data and format it to be compatible with GAMSPy. Next, we set our input a1 and labels target_set. While we could stop here, our experiments show that providing the solver with good initial values for every variable significantly speeds up training.
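As an optional check, the loss at this starting point can be evaluated outside the solver with the forward_loss sketch from earlier (again, our addition rather than part of the original example):

# loss at the initial point, computed independently of the solver
print(forward_loss(init_data, w1.toDense(), w2.toDense(), target_df.to_numpy()))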

Finally, we start training:

train_nn.solve(
    solver='knitro',
    output=sys.stdout,
    solver_options={
        "opttol": 1,
        "infeastol": 1e-3,
        "feastol": 1e-3,
        "ftol": 1e-2,
        "convex": 0,
        "algorithm": 1,
        "outlev": "4"
    }
)

For this example, we picked Knitro. You can choose any nonlinear solver; however, a local nonlinear solver is usually the wiser choice, since finding a globally optimal solution is extremely hard and would lead to severe overfitting anyway. We won't delve deeply into solver settings, but we loosen the tolerances for feasibility and optimality: this problem cannot actually be infeasible, and for our purposes a loss of 0.1 versus 1e-7 makes no practical difference. We also increase the verbosity of the output with outlev so we can monitor how the loss decreases with each iteration.
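Switching to another local NLP solver only requires changing the solver argument; for instance, a hypothetical Ipopt configuration could look like the snippet below (the option names follow Ipopt's documentation, and the values are illustrative). The log that follows is from the Knitro run configured above.

# Hypothetical alternative run with Ipopt instead of Knitro
train_nn.solve(
    solver="ipopt",
    output=sys.stdout,
    solver_options={"tol": 1e-2, "max_iter": 100},  # loose tolerance, capped iterations
)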

...

Problem Characteristics                                 (   Presolved)
-----------------------
Objective goal:  Minimize
Objective type:  linear
Number of variables:                              30880 (       28380)
    bounded below only:                               0 (           0)
    bounded above only:                               0 (           0)
    bounded below and above:                      15880 (       15880)
    fixed:                                            0 (           0)
    free:                                         15000 (       12500)
Number of constraints:                            15000 (       12500)
    linear equalities:                             5000 (        5000)
    quadratic equalities:                          2500 (        2500)
    gen. nonlinear equalities:                     7500 (        5000)
    linear one-sided inequalities:                    0 (           0)
    quadratic one-sided inequalities:                 0 (           0)
    gen. nonlinear one-sided inequalities:            0 (           0)
    linear two-sided inequalities:                    0 (           0)
    quadratic two-sided inequalities:                 0 (           0)
    gen. nonlinear two-sided inequalities:            0 (           0)
Number of nonzeros in Jacobian:                 4065000 (     4037500)
Number of nonzeros in Hessian:                    68750 (       68750)

Knitro using the Interior-Point/Barrier Direct algorithm.

  Iter     fCount     Objective      FeasError   OptError    ||Step||    CGits
--------  --------  --------------  ----------  ----------  ----------  -------
       0         1    6.383119e+02   2.132e-14
       1         2    5.465513e+02   3.266e-02   6.926e-01   9.395e+00        0
       2         3    3.761933e+02   3.320e-01   5.785e-01   2.198e+01        0
       3         4    3.218762e+02   1.976e-02   7.466e-01   6.493e+00        0
       4         5    2.398709e+02   1.718e-01   5.493e-01   1.260e+01        0
       5         6    2.125516e+02   1.083e-02   6.199e-01   3.860e+00        0
       6         7    1.673361e+02   1.140e-01   3.840e-01   8.160e+00        0
       7         8    9.477115e+01   8.160e-01   8.610e-01   1.753e+01        0
       8         9    7.587466e+01   2.150e-01   6.958e-01   4.887e+00        0
       9        10    7.187535e+01   1.091e-02   3.610e-01   1.215e+00        0
      10        11    6.352846e+01   1.302e-02   2.917e-01   2.742e+00        0
      11        12    4.874760e+01   7.194e-02   2.459e-01   5.665e+00        0
      12        13    3.032177e+01   2.822e-01   2.139e-01   9.621e+00        0
      13        14    2.610510e+01   5.943e-03   1.496e-01   2.653e+00        0
      14        15    1.992969e+01   3.045e-02   8.858e-02   5.008e+00        0
      15        16    1.271561e+01   1.138e-01   6.670e-02   8.157e+00        0
      16        17    1.109149e+01   4.393e-03   5.115e-02   2.295e+00        0
      17        18    8.595899e+00   1.388e-02   3.827e-02   4.437e+00        0
      18        19    5.647239e+00   6.269e-02   3.042e-02   7.228e+00        0
      19        20    3.187763e+00   1.674e-01   2.099e-02   9.708e+00        0
      20        22    2.245543e+00   6.623e-01   4.637e-02   4.723e+01        0
      21        23    1.694679e+00   1.251e-01   4.568e-02   5.186e+00        0
      22        24    1.559829e+00   4.263e-03   7.371e-03   1.332e+00        0
      23        25    1.312447e+00   2.537e-03   6.167e-03   2.779e+00        0
      24        26    9.601326e-01   7.806e-03   4.128e-03   5.018e+00        0
      25        27    6.007453e-01   1.523e-02   3.172e-03   7.492e+00        0
      26        28    3.300432e-01   1.055e-01   2.175e-03   9.524e+00        0
      27        29    1.656870e-01   7.236e-02   1.080e-03   1.087e+01        0
      28        30    7.842546e-02   5.342e-02   6.124e-04   1.170e+01        0
      29        31    3.567968e-02   3.243e-02   2.044e-04   1.226e+01        0
      30        32    2.455381e-02   2.295e-02   1.323e-04   5.799e+00        0
      31        34    1.143481e-02   3.821e-03   4.778e-05   1.177e+01        1
      32        39    1.015334e-02   3.693e-03   4.158e-05   2.045e+00        0
      33        43    8.654184e-03   2.837e-03   3.551e-05   2.490e+00        0
      34        50    6.261056e-03   2.036e-04   2.715e-05   5.006e+00        1

EXIT: Locally optimal solution found.

Final Statistics
----------------
Final objective value               =   6.26105572895597e-03
Final feasibility error (abs / rel) =   2.04e-04 / 2.04e-04
Final optimality error  (abs / rel) =   2.72e-05 / 2.72e-05
# of iterations                     =         34
# of CG iterations                  =          2
# of function evaluations           =         50
# of gradient evaluations           =         35
# of Hessian evaluations            =         34
Total program time (secs)           =     108.61874 (   371.535 CPU time)
Time spent in evaluations (secs)    =       0.20345

We can visualize the loss per iteration:

[Figure: training loss per Knitro iteration]
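If you want to reproduce such a plot, a minimal matplotlib sketch (assuming the objective values have been copied from the Knitro log into a list) could look like this:

import matplotlib.pyplot as plt

# first few objective values taken from the Knitro iteration log above
loss_per_iter = [638.3119, 546.5513, 376.1933, 321.8762, 239.8709]  # ... continue through iteration 34

plt.plot(range(len(loss_per_iter)), loss_per_iter, marker="o")
plt.xlabel("iteration")
plt.ylabel("loss")
plt.yscale("log")
plt.title("Training loss per Knitro iteration")
plt.show()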

Let’s calculate the accuracy on the batch we trained:

output = np.tanh(init_data @ w1.toDense()) @ w2.toDense()

pred = output.argmax(axis=1)
acc = 100 * (sum([1 if pl == rl.item() else 0 for pl, rl in zip(pred, target)]) / len(pred))
print("Training batch accuracy: ({:.0f}%)".format(acc))
Training batch accuracy: (100%)

As you can see, we achieved 100% accuracy on the batch we trained on. Overfitting a small batch is a common way to verify that training works as expected. Here we did it because training on further batches and stopping the solver before it overfits would be too advanced for an introductory example. Now, let's test our network on the test set:

def test(w1_data, w2_data, test_loader):
    correct = 0
    for data, target in test_loader:
        # flatten the images and run the forward pass in NumPy
        data = data.reshape(data.shape[0], -1).numpy()
        output = np.tanh(data @ w1_data) @ w2_data
        pred = output.argmax(axis=1)  # index of the max logit
        correct += (pred == target.numpy()).sum().item()

    print('\nTest set accuracy: {}/{} ({:.0f}%)\n'.format(
        correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


test(w1.toDense(), w2.toDense(), test_loader)
Test set accuracy: 5201/10000 (52%)

We reached 52% accuracy on the test set. By setting a lower bound on the loss to stop training earlier and reduce overfitting, you can reach up to 60% accuracy, although the limited representativeness of just 250 samples remains a constraint. A lower bound on the loss also reduces the solve time, but guessing a good bound is not trivial.

# if we knew a good lower bound, training time drops to 30 seconds from 100
stop_early = gp.Equation(m, name="stop_early")
stop_early[...] = loss >= 100
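An equivalent way to impose the same bound, using the variable-bound syntax shown earlier for the weights, would be the following (our suggestion, not part of the original example):

# bounds the objective variable from below; same effect as the stop_early equation
loss.lo[...] = 100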

We demonstrated the flexibility of GAMSPy by training a simple neural network. If your primary goal is to train a neural network, using frameworks like PyTorch or TensorFlow would be easier and faster. However, for research purposes and curious users, it is interesting to show how black-box solvers can handle neural network training.

Here are some points that can help with your research:

  • Avoid initializing your weights to zero; instead, use a common initialization function. This is crucial for gradient descent and equally important for your nonlinear solver.

  • Set upper and lower bounds whenever possible, as these help the solver.

  • Provide initial values for your variables; otherwise the solver spends effort just reproducing what a forward pass would compute.

  • Unlike other optimization problems, you don’t need very strict tolerances for feasibility and optimality.

  • By default, any solver will attempt to find the best local or global solution for a batch, which usually means overfitting. Stopping early saves time and helps test accuracy.

Possible next steps:

  • Using Convolutional Neural Networks (CNNs) can reduce the number of weights, allowing for larger batch sizes.

  • You can train on multiple batches, but make sure the solver does not overfit to the batch you are currently training on. Also remember to update the initial values of the intermediate layers (e.g., z2, a2, etc.) when you change the input; a rough sketch follows below.
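A rough sketch of what moving to a second batch could look like, reusing the helpers defined above. Note that with the unshuffled train_loader the balanced-batch helper would return the same images again, so a shuffled loader or continued iteration would be needed for genuinely new data:

# prepare a new batch and warm-start the intermediate layers from the current weights
new_data, new_target = get_balanced_batch(train_loader, batch // 10)
new_init = new_data.reshape(batch, -1).detach().numpy()

new_target_df = pd.DataFrame(new_target)
new_target_df["val"] = 1
new_target_df = new_target_df.pivot(columns=[0], values="val").fillna(0).astype(bool)

a1.setRecords(new_init)
target_set.setRecords(new_target_df, uels_on_axes=True)

z2_new = new_init @ w1.toDense()
z2.setRecords(z2_new)
a2.setRecords(np.tanh(z2_new))
z3.setRecords(np.tanh(z2_new) @ w2.toDense())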