Matrix Multiplication: Why is it a big deal?

Charchithowitzer
12 min read · Mar 22, 2021


When I took my first school course on matrices, the teacher wasn’t big on explanation. He showed us how to multiply matrices but didn’t say why.

My first reaction was, “You’re kidding us, right?”

The rule seemed so bizarre and arbitrary that it must have come to some theoretical mathematician in a drug-induced nightmare. Surely there had to be a more rational approach.

But guess what? There isn’t. The rule makes perfect sense, when you see where it came from. My goal here is to give you an understandable rationale for why we do it the way we do. It’ll still take you a little while to get used to the idea, and even longer to be comfortable with it. But I hope that, after my explanation, you’ll at least see that it’s not an arbitrary convention. To explain the rule, let me begin with a pair of linear changes of variables.

Matrix multiplication is a symbolic way of substituting one linear change of variables into another one. If

x′ = ax + by and y′ = cx + dy,

and

x″ = a′x′ + b′y′ and y″ = c′x′ + d′y′,

then we can plug the first pair of formulas into the second to express x″ and y″ in terms of x and y:

x″ = a′x′ + b′y′ = a′(ax + by) + b′(cx + dy) = (a′a + b′c)x + (a′b + b′d)y

and

y″ = c′x′ + d′y′ = c′(ax + by) + d′(cx + dy) = (c′a + d′c)x + (c′b + d′d)y.

It can be tedious to keep writing the variables, so we use arrays to track the coefficients, with the formulas for x′ and x″ on the first row and those for y′ and y″ on the second row. The above two linear substitutions coincide with the matrix product

( a′ b′ ) ( a b )   ( a′a + b′c   a′b + b′d )
( c′ d′ ) ( c d ) = ( c′a + d′c   c′b + d′d ).

So matrix multiplication is just a bookkeeping device for systems of linear substitutions plugged into one another (order matters). The formulas are not intuitive, but it’s nothing other than the simple idea of combining two linear changes of variables in succession.
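To make the bookkeeping concrete, here is a minimal Python sketch (the particular numbers in the two coefficient matrices are made up for illustration) that applies the two changes of variables one after the other and checks that the result matches the single matrix product:

```python
import numpy as np

# First change of variables:  x' = a*x + b*y,  y' = c*x + d*y
M1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])   # rows hold the coefficients (a, b) and (c, d)

# Second change of variables: x'' = a'*x' + b'*y',  y'' = c'*x' + d'*y'
M2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])   # rows hold (a', b') and (c', d')

v = np.array([1.5, -2.0])     # some point (x, y)

# Substitute step by step: first M1, then M2
step_by_step = M2 @ (M1 @ v)

# Do the same thing in one shot with the matrix product M2 * M1
combined = (M2 @ M1) @ v

print(step_by_step)                          # same two numbers either way
print(np.allclose(step_by_step, combined))   # True: composition = matrix product
```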

Matrix multiplication was first defined explicitly in print by Cayley in 1858, in order to reflect the effect of composition of linear transformations.

However, the idea of tracking what happens to coefficients when one linear change of variables is substituted into another (which we view as matrix multiplication) goes back further. For instance, the work of number theorists in the early 19th century on binary quadratic forms ax² + bxy + cy² was full of linear changes of variables plugged into each other (especially linear changes of variable that we would recognize as coming from SL₂(ℤ)). For more on the background, see the paper by Thomas Hawkins on matrix theory in the 1974 ICM. Google “ICM 1974 Thomas Hawkins” and you’ll find his paper among the top 3 hits.

Matrix Multiplication in Computer Science

Among the most common tools in electrical engineering and computer science are rectangular grids of numbers known as matrices. The numbers in a matrix can represent data, and they can also represent mathematical equations. In many time-sensitive engineering applications, multiplying matrices can give quick but good approximations of much more complicated calculations.

Matrices arose originally as a way to describe systems of linear equations, a type of problem familiar to anyone who took grade-school algebra. “Linear” just means that the variables in the equations don’t have any exponents, so their graphs will always be straight lines.

The equation x − 2y = 0, for instance, has an infinite number of solutions for both x and y, which can be depicted as a straight line that passes through the points (0,0), (2,1), (4,2), and so on. But if you combine it with the equation x − y = 1, then there’s only one solution: x = 2 and y = 1. The point (2,1) is also where the graphs of the two equations intersect.

The matrix that depicts those two equations would be a two-by-two grid of numbers: The top row would be [1 -2], and the bottom row would be [1 -1], to correspond to the coefficients of the variables in the two equations.
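As a quick sketch of how a computer would use that matrix (here with NumPy, and with the same two equations), you can store the coefficients and let a linear solver find the intersection point:

```python
import numpy as np

# Coefficients of x - 2y = 0 and x - y = 1, one equation per row
A = np.array([[1.0, -2.0],
              [1.0, -1.0]])
b = np.array([0.0, 1.0])    # right-hand sides of the two equations

solution = np.linalg.solve(A, b)
print(solution)             # [2. 1.]  -> x = 2, y = 1, the intersection point
```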

In a range of applications from image processing to genetic analysis, computers are often called upon to solve systems of linear equations — usually with many more than two variables. Even more frequently, they’re called upon to multiply matrices.

Matrix multiplication can be thought of as solving linear equations for particular variables. Suppose, for instance, that the expressions t + 2p + 3h; 4t + 5p + 6h; and 7t + 8p + 9h describe three different mathematical operations involving temperature, pressure, and humidity measurements. They could be represented as a matrix with three rows: [1 2 3], [4 5 6], and [7 8 9].

Now suppose that, at two different times, you take temperature, pressure, and humidity readings outside your home. Those readings could be represented as a matrix as well, with the first set of readings in one column and the second in the other. Multiplying these matrices together means matching up rows from the first matrix — the one describing the equations — and columns from the second — the one representing the measurements — multiplying the corresponding terms, adding them all up, and entering the results in a new matrix. The numbers in the final matrix might, for instance, predict the trajectory of a low-pressure system.
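Here is a small sketch of that weather example in NumPy (the specific readings are made up for illustration): the 3 × 3 matrix holds the three expressions, and each column of the second matrix holds one set of readings.

```python
import numpy as np

# Rows are the three expressions: t + 2p + 3h, 4t + 5p + 6h, 7t + 8p + 9h
operations = np.array([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])

# Each column is one set of (temperature, pressure, humidity) readings
readings = np.array([[20.0, 22.0],   # temperature at time 1, time 2
                     [1.01, 0.99],   # pressure
                     [0.60, 0.75]])  # humidity

# Row i, column j of the result is expression i evaluated on reading set j
result = operations @ readings
print(result)   # a 3 x 2 matrix: three outputs for each of the two times
```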

Of course, reducing the complex dynamics of weather-system models to a system of linear equations is itself a difficult task. But that points to one of the reasons that matrices are so common in computer science: They allow computers to, in effect, do a lot of the computational heavy lifting in advance. Creating a matrix that yields useful computational results may be difficult, but performing matrix multiplication generally isn’t.

One of the areas of computer science in which matrix multiplication is particularly useful is graphics, since a digital image is basically a matrix to begin with: The rows and columns of the matrix correspond to rows and columns of pixels, and the numerical entries correspond to the pixels’ color values. Decoding digital video, for instance, requires matrix multiplication; MIT researchers were able to build one of the first chips to implement the high-efficiency video-coding standard for ultrahigh-definition TVs in part because of patterns they discerned in the matrices it employs.

In the same way that matrix multiplication can help process digital video, it can help process digital sound. A digital audio signal is basically a sequence of numbers, representing the variation over time of the air pressure of an acoustic audio signal. Many techniques for filtering or compressing digital audio signals, such as the Fourier transform, rely on matrix multiplication.
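As a rough sketch of that idea (not how production codecs are actually implemented), the discrete Fourier transform of a short audio frame can be written as a matrix–vector product, which NumPy lets us check directly against its built-in FFT:

```python
import numpy as np

n = 8                                   # length of a tiny audio frame
signal = np.random.randn(n)             # stand-in for 8 audio samples

# Build the n x n DFT matrix: entry (j, k) is exp(-2*pi*i*j*k/n)
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
dft_matrix = np.exp(-2j * np.pi * j * k / n)

spectrum_via_matrix = dft_matrix @ signal   # Fourier transform as a matrix product
spectrum_via_fft = np.fft.fft(signal)       # NumPy's fast implementation

print(np.allclose(spectrum_via_matrix, spectrum_via_fft))  # True
```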

Another reason that matrices are so useful in computer science is that graphs are. In this context, a graph is a mathematical construct consisting of nodes, usually depicted as circles, and edges, usually depicted as lines between them. Network diagrams and family trees are familiar examples of graphs, but in computer science they’re used to represent everything from operations performed during the execution of a computer program to the relationships characteristic of logistics problems.

Every graph can be represented as a matrix, however, where each column and each row represents a node, and the value at their intersection represents the strength of the connection between them (which might frequently be zero). Often, the most efficient way to analyze graphs is to convert them to matrices first, and the solutions to problems involving graphs are frequently solutions to systems of linear equations.
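A minimal sketch of that correspondence: the adjacency matrix of a small graph, where powers of the matrix count walks between nodes (the particular 4-node graph is just an illustration).

```python
import numpy as np

# Adjacency matrix of a 4-node graph: entry (i, j) is 1 if nodes i and j share an edge
#   edges: 0 -- 1, 1 -- 2, 2 -- 3, and 0 -- 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Entry (i, j) of A @ A counts the walks of length 2 from node i to node j
walks_of_length_2 = A @ A
print(walks_of_length_2[0, 3])   # 1: the single two-step walk 0 -> 2 -> 3
```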

The Problem and its history

In 1968, Winograd discovered that, by calculating the inner product in a different way, one could find the product of two n × n matrices using a similar number of operations overall, but with the balance shifted from multiplications towards additions.

This was important as addition was computationally less demanding than multiplication. The same year, Strassen provided an explicit algorithm which could multiply two 2^n × 2^n matrices in fewer than 6 · 7^n operations, where with Winograd’s or the trivial algorithm we would have had approximately 8^n operations. Using this, Strassen showed that ω ≤ log2(7) < 2.81.

In 1978, Pan found explicit algorithms to further reduce ω by means of the technique of trilinear aggregation. This technique uses the fact that computing the trace of the product of three n × n matrices is equivalent to the problem of multiplying two n × n matrices (in terms of the total number of multiplications). By defining a function on the indices of the entries in the matrices A, B and C to be multiplied, we may define an aggregate to do all the required multiplications, plus some extra terms. We then unite terms to remove these extra terms in as few calculations as possible. Using this technique, Pan showed that we can perform a 70 × 70 matrix multiplication in 143640 operations. This gives ω ≤ log70 143640 < 2.79512, and further, that we can perform a 46 × 46 matrix multiplication in 41952 operations, giving ω ≤ 2.78017.

In 1980, Bini et al. showed that the number of operations required to perform a matrix multiplication could be reduced by considering approximate algorithms. If we extend our underlying field k to polynomials in a variable λ (allowing negative powers of λ) with coefficients in k, where λ can be thought of, when k is ℝ, as just a small number, we may obtain, using fewer operations, an approximation of the required matrix multiplication, in the sense that each entry will be “off” by a power of λ. Using this method (which is similar to trilinear aggregation), they obtained ω ≤ 2.7799.

In 1981, Schönhage showed that an algorithm which could approximately compute multiple independent matrix multiplications could be used to further reduce ω. This is his asymptotic sum inequality; using it, he showed that ω ≤ 2.5479. Using similar techniques, Coppersmith and Winograd showed that one can take an algorithm (of a certain type) that performs multiple disjoint matrix multiplications and square it. The resulting algorithm is capable of multiplying larger matrices than expected. This method gives ω ≤ 2.4955480.

In 1986, Strassen showed that one could start with an algorithm that is not itself a matrix product: we have a series of blocks, where the blocks can themselves be seen as elements of a matrix multiplication, and the blocks themselves are matrix multiplications. Raising this original algorithm to a large power, we may set some blocks to zero to obtain a large number of independent matrix products; we then apply Schönhage’s asymptotic sum inequality to obtain a bound on ω. This method yields ω ≤ 2.4785.

In 1987, Coppersmith and Winograd used this method to great effect, obtaining a bound that stood as the record for over two decades. They start with an algorithm, raise it to the Nth power and show that setting certain variables to zero leads the algorithm to calculate a large number of independent matrix products of a certain size; using Schönhage’s inequality, we get that ω ≤ 2.376. (Later refinements of this approach, by Stothers, Vassilevska Williams, Le Gall and others, have since lowered the bound to roughly 2.373.)

In 2005, Cohn and Umans placed matrix multiplication in a group-theoretic context: while they were unable to find new bounds for ω, the group-theoretic setting provides new conjectures which, if proved, would show that ω = 2. A related problem is determining the rank of matrix multiplication: the rank is the total number of non-scalar multiplications required to evaluate a matrix product (counting scalar multiplications as well gives the multiplicative complexity).

The Main Problem

Matrices have long been the subject of much study by many mathematicians. However, the rise of computers in the late 20th century has led to new problems, the main one being the problem of matrix multiplication. Computers are required to do many matrix multiplications at a time, and hence it is desirable to find algorithms that reduce the number of steps required to multiply two matrices together. Until 1968, we had only the trivial algorithm to multiply matrices. This is as follows:

Algorithm 1. If we have two n × n matrices A and B, n ∈ ℕ, with entries in a field k, then their product C = AB is given entry by entry by

c_ij = a_i1 b_1j + a_i2 b_2j + … + a_in b_nj, for 1 ≤ i, j ≤ n,

where multiplication is defined as in the field k.
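A minimal Python sketch of this trivial algorithm (over ordinary floating-point numbers rather than a general field k):

```python
def trivial_matmul(A, B):
    """Multiply two n x n matrices given as lists of lists, entry by entry."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # c_ij is the inner product of row i of A with column j of B
            for l in range(n):
                C[i][j] += A[i][l] * B[l][j]
    return C

# Example: a 2 x 2 product
print(trivial_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```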

We see that this algorithm requires 2n³ − n² operations in k to multiply two n × n matrices, of which n³ are multiplications and n³ − n² are additions. However, in 1968 Winograd showed that one could take the inner product of two vectors using fewer multiplications, but with more additions. Consider finding the inner product of two vectors (x1, …, xn) and (y1, …, yn), with n even. Set

x·y = Σⱼ (x_{2j−1} + y_{2j})(x_{2j} + y_{2j−1}) − Σⱼ x_{2j−1} x_{2j} − Σⱼ y_{2j−1} y_{2j},

where each sum runs over j = 1, …, n/2. The last two sums depend only on x and only on y respectively, so when many inner products share the same x or the same y (as rows and columns do in a matrix product) they can be computed once and reused.
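A small Python sketch of this trick, checked against the ordinary dot product (it assumes the vector length n is even):

```python
def winograd_inner_product(x, y):
    """Inner product with the multiplications rearranged so that the x-only and
    y-only parts can be precomputed and reused across many products (n even)."""
    n = len(x)
    x_part = sum(x[2 * j] * x[2 * j + 1] for j in range(n // 2))
    y_part = sum(y[2 * j] * y[2 * j + 1] for j in range(n // 2))
    cross = sum((x[2 * j] + y[2 * j + 1]) * (x[2 * j + 1] + y[2 * j])
                for j in range(n // 2))
    return cross - x_part - y_part

x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
print(winograd_inner_product(x, y))        # 70.0
print(sum(a * b for a, b in zip(x, y)))    # 70.0, the ordinary dot product
```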

Since matrix multiplication can be regarded as taking multiple inner products, the total number of multiplications required to multiply an n × n matrix by another matrix of the same size becomes about n³/2, while the number of additions required grows to about (3/2)n³, improving slightly on the trivial algorithm. Denote by M_k(n) the total number of operations required by a bilinear algorithm to multiply two n × n matrices over k, and define the exponent of matrix multiplication ω(k) = inf{ τ : M_k(n) = O(n^τ) }.

We see from the trivial algorithm that ω(k) is at most 3. Since the output has n² entries, no algorithm can use fewer than n² operations in total, and hence ω(k) ∈ [2, 3]. Winograd’s algorithm still uses on the order of n³ operations in total, so it does not improve the bound ω(k) ≤ 3. However, in the same year, Strassen managed to find an improvement for ω.

Strassen’s Algorithm

For a long time it was assumed that Ω(N³) operations were required for matrix multiplication. However, in the late sixties, Strassen showed how to break the Ω(N³) barrier. The basic idea of Strassen’s algorithm is to divide each N × N matrix into four N/2 × N/2 quadrants; this strategy is called divide and conquer. Writing the product in terms of the quadrants gives eight multiplications of N/2 × N/2 matrices plus some additions, and we then use these results to compute C’s submatrices.

The resulting recurrence is T(N) = 8T(N/2) + O(N²), and we see that T(N) = O(N³), so we do not have an improvement. As with the divide-and-conquer algorithm for integer multiplication, we must reduce the number of subproblems below 8. Strassen used a similar strategy and showed how to get away with only seven recursive calls by carefully arranging the computations. Writing A, B and C in terms of their quadrants A11, A12, A21, A22 and so on, the seven multiplications are

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22),

and the quadrants of the product are recovered as

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6.

This gives the recurrence T(N) = 7T(N/2) + O(N²), which solves to T(N) = O(N^(log₂ 7)) ≈ O(N^2.81).

For detailed explanation refer to Introduction to Algorithms by Thomas H. Cormen Charles E. Leiserson Ronald L. Rivest Clifford Stein.
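For concreteness, here is a NumPy sketch of the recursion. It assumes N is a power of 2 and switches to ordinary multiplication below a small cutoff; a production version would handle padding and tune the cutoff.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B (N a power of 2) with Strassen's recursion."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B          # fall back to the classical product for small blocks

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # The seven recursive products
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Reassemble the four quadrants of the product
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Quick check against NumPy on a 256 x 256 example
A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
print(np.allclose(strassen(A, B), A @ B))   # True (up to floating-point error)
```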

There’s a catch

As usual, there are details to consider, such as the case when N is not a power of 2, but these are basically minor nuisances. Strassen’s algorithm is worse than the straightforward algorithm until N is fairly large. It does not generalize to the case where the matrices are sparse (contain many zero entries), and it does not parallelize easily. When run with floating-point entries, it is less stable numerically than the classic algorithm. Thus, until recently it had only limited applicability. Nevertheless, it has always represented an important theoretical milestone and certainly shows that in computer science, as in many other fields, even though a problem seems to have an intrinsic complexity, nothing is certain until proven.

Strassen’s method is “asymptotically” faster than the straightforward SQUARE-MATRIX-MULTIPLY procedure.

Divide-and-conquer as a technique for designing algorithms dates back to at least 1962 in an article by Karatsuba and Ofman. It might have been used well before then, however; according to Heideman, Johnson, and Burrus, C. F. Gauss devised the first fast Fourier transform algorithm in 1805, and Gauss’s formulation breaks the problem into smaller subproblems whose solutions are combined.

Strassen’s algorithm caused much excitement when it was published in 1969. Before then, few imagined the possibility of an algorithm asymptotically faster than the basic SQUARE-MATRIX-MULTIPLY procedure. The asymptotic upper bound for matrix multiplication has been improved since then; the algorithm due to Coppersmith and Winograd has a running time of O(n^(2.376)). The best lower bound known is just the obvious Ω(n²) bound (obvious because we must fill in n² elements of the product matrix).

From a practical point of view, Strassen’s algorithm is often not the method of choice for matrix multiplication, for four reasons:

  1. The constant factor hidden in the Θ(n^(log 7)) running time of Strassen’s algorithm is larger than the constant factor in the Θ(n³)-time SQUARE-MATRIX-MULTIPLY procedure.

So if you take the constant factor as three and solve 3·n^(log₂ 7) = n³, the crossover value of n comes out to be roughly 300 (see the quick calculation after this list), and real implementations often need even larger matrices before Strassen wins; yes, I heard you saying HARDWARE ACCELERATION.

  2. When the matrices are sparse, methods tailored for sparse matrices are faster.
  3. Strassen’s algorithm is not quite as numerically stable as SQUARE-MATRIX-MULTIPLY. In other words, because of the limited precision of computer arithmetic on non-integer values, larger errors accumulate in Strassen’s algorithm than in SQUARE-MATRIX-MULTIPLY.
  4. The submatrices formed at the levels of recursion consume space.
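The crossover figure quoted above comes from the following back-of-the-envelope calculation (the constant 3 is the illustrative value used in the text, not a measured one):

```python
import math

# Solve 3 * n**log2(7) = n**3 for n, i.e. n = 3**(1 / (3 - log2(7)))
exponent_gap = 3 - math.log2(7)          # about 0.1926
crossover = 3 ** (1 / exponent_gap)
print(round(crossover))                  # roughly 300
```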

So if you can find a better matrix multiplication algorithm, it will be better for HUMANITY. Well, unless you use hardware acceleration.
