Echo Chess: The Quest for Solvability


This story has stirred some great conversations on Hacker News. Many have reached out with ideas for a v2. If you still want to chime in, just drop me a note.

Prologue

Let’s make Chess more fun in Single-Player. How hard could it be? ☠️

This is the story of venturing too deep, head-first, into the unknown. It all started with a doodle on a piece of paper. This doodle to be exact.

Some pics are only worth log(1k) words.

I first conceived of this game on a whim, as one of many strategy and puzzle games I’d been designing for fun. I was toying with the idea of a chess-inspired Turn-Based Strategy (TBS) game that no one would recognize as chess-based. My first attempts purposefully kept steering the theme away from chess. Why go vanilla when there are so many wilder thematic flavors and bolder mechanics to explore?

Asymmetric rewards, alternating movement rules, stochastic obstacles, stealth, morphing, etc. A wise friend and fellow strategy game nerd then said to me:

The game’s dope. But what’s wrong with people associating it with Chess? No need to innovate that kernel away - their mental load would be taken up by re-learning how to move. Chess pieces are a universal language. They help them overcome the activation energy and figure out what’s going on.

So I re-designed the rules, mechanics, and first few levels with a sharpie and a whiteboard… Figma and a LLaMA… okay, sandwich paper and a pencil. Then I started testing it with friends by setting up this old wooden board I found in storage. Soon “the game” looked more like this.

Fallen pieces = obstacles. Inverted ones = me running out of pieces (4 black rooks above).
Ballpoint pens are for board resizing. No, there’s no “game engine”. Yes, I personally update the game state for you every time you move. No, Hollywood hasn’t texted me yet.

Something surprising started happening quickly. Anyone who tried the game got hooked. People would keep coming back trying to beat levels they couldn’t solve. They’d ask me to manually reset the board to specific checkpoints in a puzzle so they could try again.

Chess masters and n00bs alike would describe feeling a rush of excitement every time they’d reach the ‘Aha’ moment of a maze. Then people wanted to play “the game” at home and show it to their friends. I’d send them photos of my scribbled level designs so they could reproduce them on their own with a physical chess board or something like lichess.

One day I just figured this was getting comically unsustainable. So I decided to build “the game” into a proper thing online and call it Echo Chess.

And thus began a deep, deep rabbit hole…🐇


Echo Chess “Classic”

Gameplay Design

Three simple rules of Echo Chess — still relevant from v0 to this day:

  1. You are playing White, there is no opponent. You must capture all pieces to win.

  2. You become the "echo" of any piece you capture. Captured a bishop? Become that bishop.

  3. You can’t pass through red obstacles. Find the best move order to clear the board.

And so it was that Echo Chess graduated from a sandwich paper to a Mechanical Turk model to a minimalist web app, where you just hit a link and solve the maze. No downloads, no installs, no load screens, no tutorials, no handholding. Just tap the link, figure out the puzzle. An extreme experiment in ‘show, don’t tell’ design.

A new meaning for ‘eat what you kill’.

What’s the catch? Because you’re transformed with every capture, the maze is effectively changing as you play.

Level 6 of “Classic” v1. The higher ones get significantly more challenging.

In later levels, you unlock squad-like gameplay where you get to move and coordinate multiple pieces — or even cannibalize one of your own to clear the way, if it comes down to it.

The enemy of my enemy is another person blocking the way.

Game Mechanics Implications

I think one of the reasons Echo Chess is intriguing to so many diverse player personas is that it’s (laughably) easy to learn, yet (frustratingly) hard to master.

To give you but a glimpse as to why that’s the case:

  • The moment you enjoy any capture whatsoever, you lose all your existing abilities. If you were a knight gallivanting around obstacles, you just lost all that by capturing a rook.

  • Landed yourself a powerful queen? Great, you can only use it once before it’s gone. Think you should’ve saved capturing it for later when you’re in a pinch? Who says the option would still be there if you take another path? Capturing order matters. A lot.

  • Need to clear the passage to cross to the other side of the board? Don’t forget you’ll become whatever blocker you’re clearing. Pawns = fewer degrees of freedom. Keep them till the end, or better to get them over with quickly?

Oh you think it’d be much easier if you only had more white pieces on your side? Why don’t you try solving, say, Level 13 below? (jk pls don’t actually try it cold turkey. It’s pretty much impossible.)

Level 13, or the “puzzle from hell” as one of the early playtesters called it.

Very quickly you’ll realize Echo Chess is a puzzle game that’s deceptively immune to brute force. Click-spamming won’t get you far here.

I actually had to add a ‘Give Up’ button as an early termination state for those who get stumped but want to save their high score (in some ways, this makes puzzle games different from the arcade genre because you can never just… die).


Balancing, Levels and Game Design

Traditional game design tells us we should aim to reproduce, both as strictly and as loosely as possible, this kind of sentiment when setting the difficulty of a game:

The ‘Difficulty Saw’ curve most designers swear their game has. It doesn’t.

A good game feels like it’s getting easier as you start mastering its mechanics, until a new mechanic/challenge/twist is introduced, and the difficulty shoots up again. You try applying the old strategies you’ve perfected but it’s no use. Then you get it: you need to snap out of the comfort zone and pick up the new SKilLz the game is pushing you to learn. So you actually #gitgud and the game feels easier again, up until you hit the next bump. Rinse and repeat.

In theory, this is the recipe for a truly fun game where players enter the coveted ‘Flow’ state of gaming.

In practice, this kinda works a tiny bit, until it doesn’t. Players come with all sorts of prior skills (having played many games of the same genre, or just being wired a certain way), a wide spectrum of (im)patience, and all kinds of expectations as to what qualifies as a difficulty spike, or where they would’ve drawn the line between offensively easy and unfairly brutal. You somehow end up both scaring the n00bs and boring the veterans.

More people will rate your game as top-left-red than bottom-right-red.
Since the dumbing down of modern society, the Mindless zone has never been so fertile.

What’s the answer, then? Well, if you’re thinking of making a game that’s ‘easy to learn, hard to master’, I think it goes one of two ways:

  • If you’re reading this after the year 2020… 2010… 2000, don’t make this game. Seriously, don’t. No one will play it. Find some other thing instead. You’ve been warned.

  • If you’re someone who’s 100% immune to good advice and you still insist on making this game, then the gods be with you. At the very least, though, please do these simple things:

    • (1) let people skip to any level they want, whenever they want. Let them gauge their own spice tolerance.

    • (2) if you really need to sprinkle new mechanics here and there, add them in some micro-saw-tooth way.

    • (3) make your macro difficulty ramp a slow-building exponential curve instead, closer to this next one. Newcomers will Dunning–Kruger their way to the early fun part, and ambitious players will power through to reach the really fun stuff.

      • (Optional) offer enough replayability and variability for each difficulty tier so that newbies can thrive in cuteness land forever, while GMs and PhDs play a math meta of their own.

A more desirable curve for easy-to-learn-hard-to-master games.

Cool. Let’s bring it back to Echo Chess. Friends who know me well will attest that, when it comes to any form of creative expression, I subscribe to the (more-controversial-than-it-has-to-be) No Spoilers school of thought.

As a service to fellow nospoilerists out there, I’ll only break down one (fictional) example in detail here. Then I’ll let you enjoy the actual levels yourself.

If your levels have nothing to say, your game’s language is too primitive.

Consider the 4x4 board above. You’re playing King, and need to capture two pawns, a knight, and a rook. There are 2 impassable obstacles. Which piece do you capture first? And is this a good tutorial level?

If you’re new to Echo Chess, it means you fall in one of two camps:

  • (1) You see a cute small level. Only 4 pieces. You brute-force your way into capturing every piece to see what happens. All hail the #yolo king.

  • (2) You’ve binge-watched The Queen's Gambit and have been slogging through Chess.com, because Covid. Or maybe you’re an actual Chess player. So you know better. You plan your 1st, 2nd, Nth move in your head and assign values to certain squares.

The problem if you’re a type-2 person is that in Chess, you already have some built-in intuition as to what generally makes a move good vs bad. You’ve spent so many years practicing advanced tactics and openings that using your intuition as a heuristic in your ‘forward pass’ exploration makes it less brute-forcy than for type-1s. But in Echo Chess, your heuristics can be easily deceived. Is r > n or r < n?

Here’s how an Echo Chess player would typically solve the board above using a ‘backward pass’ instead (obviously there’s more than one “right” way to do these, but FWIW):

Step 1. Identify the top pawn as an ‘end state’ piece that has 0 valid moves.
Step 2. Identify the bottom pawn as only ever able to capture one piece: the top pawn.
Conclusion: LAST 2 captures MUST be bottom p, then top p.
Step 3. Identify the bottom pawn as non-capturable by a knight (no L’s can reach it).
Step 4. Bottom p is non-capturable by p or n. But it’s our penultimate capture, so K can’t get it.
Conclusion: the rook MUST capture bottom p. LAST 3 captures MUST be: r, p, p.
Step 5: r is non-capturable by the p’s. But it’s our 3rd-to-last capture, so K can’t get it.
Step 6: n must capture r. It needs a few L-jumps but it can do it.
Conclusion: LAST 4 captures MUST be: n, r, bottom p, then top p.
Step 7: All that’s left is for K to capture n. Easy.
Step 8: Recap of the ‘forward pass’ capturing order to clear the board.
Conclusion: backprop does the smoothest moonwalk.

In reality, an ‘Echo Chess player’ (wtv that is) would instinctively have spotted the 2 pawns as finishers from the first glance, and likely computed Steps 3-5 pretty quickly. I’d say the L-jumps of Step 6 are the only cryptic bit in this maze, assuming the right intuition has already been developed. Is it a good tutorial level? Nope.

The onus of nurturing this type of game-mechanic-specific intuition is on the game designer, not the player. It should be (and is) experienced in a maze prior to this one. There are at least a dozen heuristics similar to the one we just saw that you’re bound to pick up while playing Echo Chess. I’ve designed each level to hopefully convey the importance of a new one. If you guess some of them, please drop me a note.

If you’re curious about the more challenging ‘Aha’s, try levels 11-15 on echochess.com. You can switch to any of them anytime by clicking the Level dropdown at the top.


Scoring System

Let’s talk about scores. I’ve always believed that a well-designed scoring system should be optimized to properly incentivize better gameplay, not cheesing for high scores or APM. So naturally I baked that into the design of Echo Chess.

  • You get points for each capture (higher per piece type, higher overall for harder levels).

  • Gameplay-wise, there’s (almost) no difference between a Q and a K. Why? Because single-player. If it’s always your turn, space-time gets bent. K * moves = Q.

function updateScoreForCapture(pieceType) {
  var pieceScores = {
    'p': 10,
    'n': 30,
    'b': 30,
    'r': 50,
    'q': 80,
    'k': 100
  }
  runningScore += currentLevel * pieceScores[pieceType]
}
  • You get a “level bonus” every time you solve a maze. This increases quadratically as levels go up.

  • The level bonus is penalized (in a compounding way) for every move you make. So move efficiency matters a ton —especially in higher levels where the compound penalty hurts a bigger base. You’d still always get some positive bonus, nbd.

function updateScoreForLevelBonus(numMoves) {
  var levelBase = 200 * Math.pow(currentLevel, 2)
  // penalize by compounding -2% for every move used in this level
  var levelBonus = Math.floor(levelBase * Math.pow(0.98, numMoves))
  levelBonus = Math.round(levelBonus / 5) * 5
  runningScore += levelBonus
}
  • Jumping levels using the dropdown is allowed at any time, but the score resets automatically to avoid cheating the live leaderboard.

  • You get a “time bonus” when finishing the game based on how long it took you to beat the whole thing. The most you could ever score (even at hyperspace speed) is capped at a multiple of base, then it starts dropping as a reciprocal of time spent.

function updateScoreForGameCompletion() {
  // time bonus drops reciprocally but is always in [0, +30%]
  var cappedMultiplier = 0.3 / (1 + 0.005 * secondsElapsed)
  timeBonus = Math.floor(cappedMultiplier * runningScore)
  runningScore += timeBonus
}

Basically the scoring function rewards maze-solving first, move efficiency second, and speed of completion last.

This incentivizes players to embrace the joy of adventuring and actually play the maze until they truly ‘get it’ without premature optimization. Then they can go back and search for more creative solution paths (which brings a joy of its own).

Players always behave in ways the game reinforces them to, whether they realize it or not. And a meta will emerge from any scoring system (LeetCode, anyone?). This follows directly from evolutionary psychology —or from RL if you prefer robo-speak. Oh and speedrunners gonna speedrun no matter what you do.

“Make sure the winning strategy is also the funnest one to play.” -Mark Rosewater, strategy game designer and puzzlemaker.


Game Full Stack Development

So how did Echo Chess go from wooden boards to pixels?

The frontend is, believe it or not, good old JavaScript, jQuery, HTML, CSS. The backend is a Flask server hooked into a ReplitDB. Async calls are made with good old AJAX. Echo Chess is retro through and through.

Two open source libraries helped quite a bit with the earliest prototype’s client-side plumbing: chess.js and chessboard.js. Of course, an enormous amount of work went in to go from a barebones chess-move validator to a fully-fledged chess-variant puzzle game.

In fact, a ton of code in these libraries had to be overwritten and extended to incorporate chess-variant mechanics. Here are but a few examples:

  • King movements, checks, checkmate

  • Relative pins, absolute pins, immobilized pieces

  • Restrictions on kings per player (< or > 1)

  • Piece promotions

  • Castling rights

  • Player turns, or lack thereof

  • Half-moves, full moves, draws

  • Pawn double-step disabling

  • En-passant dynamics

  • Squares having obstacles or boundaries

  • Movement pathing with obstacles

  • Capture-based transformation

  • Preventing transformation reversion

  • Self-sacrifice dynamics

  • (you get the point)

The client side keeps track of the game state, level state, moves efficiency, scoring, timing and consistency across level jumps, retries, restarts, switching game modes, etc.

The server handles saving and retrieving scores, live leaderboards, level data validation and prediction (more on that later on). The latest saved high scores get cached on the client and only get updated by the server when relevant.

Let’s take a look at a simple implementation example from Echo Chess to illustrate how a traditional chess engine can be used as a building block for chess variants.

Here we’d like to highlight in green all the squares a piece can move to, taking into account the concept of obstacles we’ve defined for this game. We start by formalizing how an obstacle can block stuff.

Left: all 3 on same line, AND obstacle between them.
Right: all 3 on same line, but obstacle NOT between them.

Let’s write some simple code to capture this.

function obstacleOnPath(direction, source, target, obstacle) {
  let [i1, j1] = getRankAndFile(source)
  let [i2, j2] = getRankAndFile(target)
  let [ik, jk] = getRankAndFile(obstacle)
  if (direction == 'h') {
    // horizontal block
    if (i1 == i2 && i1 == ik && isBetween(jk, j1, j2)) {
      return true
    }
  }
  else if (direction == 'v') {
    // vertical block
    if (j1 == j2 && j1 == jk && isBetween(ik, i1, i2)) {
      return true
    }
  }
  else if (direction == 'd') {
    // diagonal block
    if (Math.abs(i2 - i1) == Math.abs(j2 - j1) &&
        Math.abs(ik - i1) == Math.abs(jk - j1) &&
        Math.abs(i2 - ik) == Math.abs(j2 - jk) &&
        isBetween(ik, i1, i2) && isBetween(jk, j1, j2)) {
      return true
    }
  }
  return false
}

Now we can combine these concepts with the usage of the chess engine to write a simple highlighting function for possible moves.

import { Chess } from 'chess.js'
const chess = new Chess()

function blockedTarget(source, target) {
  var obstacles = levels[currentLevel].obstacles
  if (obstacles.includes(target)) {
    return true
  }
  var pieceType = chess.get(source).type
  if (pieceType == 'b' || pieceType == 'q') {
    for (let obstacle of obstacles) {
      if (obstacleOnPath('d', source, target, obstacle)) {
        return true;
      }
    }
  }
  if (pieceType == 'r' || pieceType == 'q') {
    for (let obstacle of obstacles) {
      if (obstacleOnPath('h', source, target, obstacle) ||
          obstacleOnPath('v', source, target, obstacle)) {
        return true;
      }
    }
  }
  return false;
}

function colorPossibleMoves(source) {
  var moves = chess.moves({
    square: source,
    verbose: true
  })
  if (moves.length == 0) return
  greenSquare(source)
  for (var i = 0; i < moves.length; i++) {
    if (!blockedTarget(source, moves[i].to)) {
      greenSquare(moves[i].to)
    }
  }
}

From then on, we can call the colorPossibleMoves function every time we pick up a white piece from a given square, and it will color all the corresponding squares in green.

function onDragStart(source, piece) {
  // player can only pick up white pieces
  if (piece.search(/^b/) != -1) {
    $("#wrong-piece")[0].play();
    return false
  }
  colorPossibleMoves(source)
  ...
}

var board = new Chessboard('board', {
  position: levels[currentLevel].fen,
  draggable: true,
  onDragStart: onDragStart
  ...
})
Sometimes it’s easy to forget how constrained a cornered queen can be.

Fast forward several thousand lines of code, and soon enough, Echo Chess ‘Classic’ was live! Friends were able to enjoy it on the go.

Since I had already meticulously crafted 15 puzzle levels, there was enough in there to keep everyone busy. For a while, the hardest puzzles remained unsolved. Dozens would try, day in, day out, but very few would be able to reach the very end. Weeks passed. And then it happened. People started finishing the game.

That’s legit street cred right here.

Like a rush of emotions after binge-watching the last season of a favorite show, they celebrated, sighed, stared into the abyss, then cracked their knuckles, and came back asking for more. MOAR LEVELS. ASAP.


Echo Chess “ENDLESS”

Once again, I was faced with the bittersweet realization that the users’ appetite for Echo Chess had exceeded my ability to manually keep up with “DM-ing” it. I knew I needed to automate level creation to make it, once and for all, entirely independent from my involvement as a mammal.

Procedural Generation

And so was born the new ‘Endless’ mode for Echo Chess! A true “Tetris”-like puzzle game, completely separate from ‘Classic Mode’, toggleable via URL params.

Endless Mode (right-side) even has its own brand, compliments of Angela

Infinite levels, all randomly generated, all in real time. Anytime you finish a level, you carry over your winning piece to the next, and the board of the upcoming puzzle pivots around your current piece’s position.

Levels automatically swivel around you.

So what do we really mean by ‘Endless’?

We randomly generate each level from parametrized distributions of pieces, obstacles and boundaries (1). This gives us full control over adjusting the variety and difficulty of generations (2), and lets us tailor the beginning of each level to coincide with the ending of the prior one (3).
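
To give a feel for what ‘parametrized’ means here, below is a minimal sketch of such a generator. Every knob and distribution in it is hypothetical (the real generator’s parameters aren’t shown in this post); it just illustrates points (1)-(3): carve out a sized board, pivot it around the carry-over piece, and sample obstacles and pieces. The board_to_fen helper we’ll build in the encoding section below could then serialize the result.

import random

# all parameter names and distributions below are made up for illustration
PIECES = ['p', 'n', 'b', 'r', 'q', 'k']

def generate_level(core_size=6, obstacle_rate=0.2, num_pieces=5,
                   carry_piece='N', carry_square=(3, 3)):
    board = [['X'] * 8 for _ in range(8)]      # start as all boundary squares
    for i in range(1, 1 + core_size):          # carve out the playable core
        for j in range(1, 1 + core_size):
            board[i][j] = ' '
    ci, cj = carry_square
    board[ci][cj] = carry_piece                # pivot around the carry-over piece
    empty = [(i, j) for i in range(8) for j in range(8) if board[i][j] == ' ']
    random.shuffle(empty)
    for _ in range(int(obstacle_rate * len(empty))):
        i, j = empty.pop()
        board[i][j] = 'x'                      # sprinkle obstacles
    for _ in range(num_pieces):
        i, j = empty.pop()
        board[i][j] = random.choice(PIECES)    # sprinkle black pieces
    return board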

That brings us to the time mechanic. How do we converge an infinite level progression? We make it a countdown race:

100 seconds. Time's running out! (4) Wanna stay alive? Keep solving new levels! You get more time back for solving bigger ones. (5)

With this new fast-paced playing style comes a shift from pure strategy/puzzle-solving to a full-on arcade mode. The scoring function gets adjusted accordingly, with bonuses scaling up or down with level sizes and difficulty —and we get a new leaderboard that’s kept fully separate.


Encoding, Decoding

Now that we’re dealing with so many levels, we’ll need an efficient and lossless way to conveniently read, write, encode, decode, send, receive, store, retrieve, transform, analyze, augment, and render any level configuration.

Unfortunately my wooden system falls a bit short of Shannon entropy.

Chess players reading this are probably already thinking of multiple such viable systems, like PGN, SAN, or UCI. These are all great for chess moves encoding and decoding, and could possibly be adjusted for our needs. But there’s actually a better system we could reuse for the purpose of game states in particular: the elegant Forsyth–Edwards Notation (FEN).

One row of a chess board converted to a string using FEN code.
Source: chessgames.com

The easiest way to understand a FEN string is as follows:

  • Read it left to right, one row at a time, starting from the top row

  • Lowercase letters = black chess pieces

  • Uppercase letters = white chess pieces

  • Digits = number of consecutive empty squares (ignore their color)

  • King♚, Queen♛, Rook♜, Bishop♝, Pawn♟, kNight♞

    • The King already called dibs on ‘K’ (royal greed has no limits) so use ‘N’ for the peasants knights instead🐴🌝

All rows of the board converted and combined to generate the board’s FEN string.
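
To make those bullets concrete, here’s a tiny sketch that decodes a single FEN rank (we’ll build the full board decoder for the extended notation further below):

def expand_rank(rank_str):
    # e.g. '2qpbkr1' -> [' ', ' ', 'q', 'p', 'b', 'k', 'r', ' ']
    squares = []
    for char in rank_str:
        if char.isdigit():
            squares.extend(' ' * int(char))    # digits = that many empty squares
        else:
            squares.append(char)               # letters = pieces (case = color)
    return squares

print(expand_rank('2qpbkr1'))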

First order of business: repurpose and expand the FEN encoding system for Echo Chess. In addition to everything else FEN does, I needed a version that takes into account things like the locations of obstacles, the size and shape of a level, where the boundary squares are, and so on.

Developing a New Formal Notation

So I came up with a new encoding that expands on FEN, which I’m offering up here for anyone who might find it valuable for their work. I call it the “compoundFEN”.

One row of an Echo Chess board converted to a string using compoundFEN code.

Let’s consider the following example from Echo Chess to see how the compoundFEN would be derived. To better visualize this, we’ll overlay a grid over the level to make the canonical 8x8 chess board pop.

Level 7 of the original Classic mode. Looks like all in all, it was all just bricks in the wall.

We go through each row, top to bottom, starting from the 0th to the 7th row — sure, we could instead count from 1 to 8 like sapiens commoners, but what would the robots say?

  • (0) Nothing but a board boundary in this row showing the confines of this particular level size. Can’t really think of these squares as ‘empty’ because no movement is allowed on them. Let’s call each of these boundary squares an ‘X’. The royal K hasn’t expropriated the X letter yet so we should be good to go.

    • We’ll encode this row in compoundFEN as ‘XXXXXXXX’.

  • (1) Okay so we start with an ‘X’ for a boundary. Then we have an actual empty square, so that’s a 1. Now a bunch of obstacles. They’re technically similar to boundaries in that they block the player’s motion, but they have a slightly different use. Let’s call these obstacle squares lowercase ‘x’ just in case. Okay so we have 5 of those ‘x’, followed by another ‘X’.

    • Great, we can encode this as ‘X1xxxxxX’.

  • (2) This should be easier. We have ‘X’ followed by an ‘x’, followed by a bunch of black pieces (so all lowercase: q, p, b, k, and r), then a final ‘X’, with 0 proper ‘empty’ squares to be noted anywhere.

    • Boom. ‘XxqpbkrX’.

  • (3) Same thing here: we start with an ‘X’ followed by an ‘x’, then we get a white piece (so it’s uppercase) and it’s an ‘N’, not a ‘K’, because knights can’t even keep their initials in an absolute monarchy. Then we continue as before with the alternating black pawns ‘p’ and obstacles ‘x’ until hitting the last boundary ‘X’.

    • Easy. ‘XxNpxpxX’.

  • (4) Straightforward row: ‘X’ then ‘x’, then a black king ‘k’, then 4 ‘x’, and a final ‘X’.

    • Super chill. ‘XxkxxxxX’.

  • (5) Interesting, a slightly different one. ‘X’ and 2 ‘x’, sure. Then 3 actual empty squares. Okay so we’ll represent these as ‘3’ according to the base FEN convention. Then a black knight ‘n’ and a final ‘X’.

    • Cool. ‘Xxx3nX’.

  • (6) Back to familiar ones, we’ve got this. ‘X1xbnrxX’.

  • (7) We already know this one, nothing but boundaries. ‘XXXXXXXX’.

Now we put all these rows together, separated by ‘/’, to get the full ‘piece placement’ portion of the compoundFEN: XXXXXXXX/X1xxxxxX/XxqpbkrX/XxNpxpxX/XxkxxxxX/Xxx3nX/X1xbnrxX/XXXXXXXX

Let’s test this to see if our compoundFEN for Level 7 looks right. Here’s a quick decoding function you can use to convert from compoundFEN to a 2D board.

import numpy as np

def fen_to_board(compound_fen):
    board = np.empty((8, 8), dtype=str)
    board.fill(' ')
    # grab 'piece placement' rows, ignore the rest
    fen_parts = compound_fen.split(' ')
    ranks = fen_parts[0].split('/')
    rank_index = 0
    file_index = 0
    for i in range(len(ranks)):
        rank = ranks[i]
        for j in range(len(rank)):
            char = rank[j]
            if char.isdigit():
                # consecutive empty squares
                file_index += int(char)
            else:
                board[rank_index][file_index] = char
                file_index += 1
        rank_index += 1
        file_index = 0
    return board
compound_fen = 'XXXXXXXX/X1xxxxxX/XxqpbkrX/XxNpxpxX/XxkxxxxX/Xxx3nX/X1xbnrxX/XXXXXXXX w - - 0 1'
board = fen_to_board(compound_fen)
print(board)

[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' ' ' 'x' 'x' 'x' 'x' 'x' 'X']
 ['X' 'x' 'q' 'p' 'b' 'k' 'r' 'X']
 ['X' 'x' 'N' 'p' 'x' 'p' 'x' 'X']
 ['X' 'x' 'k' 'x' 'x' 'x' 'x' 'X']
 ['X' 'x' 'x' ' ' ' ' ' ' 'n' 'X']
 ['X' ' ' 'x' 'b' 'n' 'r' 'x' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

Looks great. Go ahead and compare every cell of this board decoded from our compoundFEN to the Level 7 screenshot above. Remember: lowercase ‘x’ refers to obstacles, and uppercase ‘X’ refers to boundary squares.

And now here’s an encoding function to automate the conversion of boards to compoundFENs moving forward.

def board_to_fen(board):
    compound_fen = ''
    for i in range(len(board)):
        row = board[i]
        empty_count = 0
        for j in range(len(row)):
            piece = row[j]
            # count consecutive empty squares
            if piece == ' ':
                empty_count += 1
            else:
                # add preceding empty squares
                if empty_count > 0:
                    compound_fen += str(empty_count)
                    empty_count = 0
                # add this square's content
                compound_fen += piece
        # add row's trailing empty squares
        if empty_count > 0:
            compound_fen += str(empty_count)
        if i < len(board) - 1:
            compound_fen += '/'
    compound_fen += ' w - - 0 1'
    return compound_fen
print(board_to_fen(board))
print(compound_fen == board_to_fen(board))

XXXXXXXX/X1xxxxxX/XxqpbkrX/XxNpxpxX/XxkxxxxX/Xxx3nX/X1xbnrxX/XXXXXXXX w - - 0 1
True

If instead you wanted to check the basic ‘standard’ FEN encoding for this same Level 7 example, you could simply ignore all ‘x’ or ‘X’ codes in the compoundFEN and assume they’re part of the consecutive empty spaces. That’s because standard FENs consider anything that’s not a chess piece to be empty. This is what you’d get: 8/8/2qpbkr1/2Np1p2/2k5/6n1/3bnr2/8 w - - 0 1, which you can conveniently verify with this online chess board (just keep in mind that it won’t look as recognizable without the Echo Chess parts).
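
Here’s a small sketch of that stripping step (a hypothetical helper, not from the game’s codebase). Run against the Level 7 compoundFEN from above, it reproduces exactly the standard FEN string we just derived by hand:

def compound_to_standard_fen(compound_fen):
    placement, rest = compound_fen.split(' ', 1)
    standard_ranks = []
    for rank in placement.split('/'):
        fen_rank, empty = '', 0
        for char in rank:
            if char.isdigit():
                empty += int(char)         # already-empty squares
            elif char in ('x', 'X'):
                empty += 1                 # obstacles and boundaries become empty
            else:
                if empty:
                    fen_rank += str(empty)
                    empty = 0
                fen_rank += char           # actual chess pieces pass through
        if empty:
            fen_rank += str(empty)
        standard_ranks.append(fen_rank)
    return '/'.join(standard_ranks) + ' ' + rest

print(compound_to_standard_fen(compound_fen))
# 8/8/2qpbkr1/2Np1p2/2k5/6n1/3bnr2/8 w - - 0 1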


The Problem of Solvability

So now we have an efficient, reliable way to procedurally generate random levels, manipulate them, and serve them to our players. But one thing’s for sure: we don’t want to be serving unsolvable mazes. Tough puzzles can be thrilling, broken puzzles are just trash.

How do we check that for a given generated maze? Is it even theoretically possible to force randomly generated Echo Chess mazes to be solvable? To understand what we’re up against, let’s start by taking a look at why solving an Echo Chess maze is so tricky to start with.

Each of these trivial examples illustrates a simple configuration where the maze is completely unsolvable.

Top 2 configurations have no valid first moves due to obstacles or boundaries.
Bottom 2 have no valid first capture — effectively forming disconnected subgraphs.

Now let’s look at some less-trivial examples of unsolvable configurations. See if you can spot the issue before looking at the answers.

Top-left: the rook is impossible to reach. The capturable bishop acts as a blocker.
Top-right: no piece exists that could reach the cornered queen.
Bottom-left: if q is captured, can’t get b; but if b is captured, can’t get q.
Bottom-right: Two bishops have no valid moves; capture one and you’re stuck.

At this stage, you may be starting to develop some intuition for what made these mazes unsolvable. And yet, here’s an example of a level that feels unsolvable the first dozen times you try it, but that does actually have known solutions.

If you can’t see the solution, head to echochess.com and try it out yourself.
It’s Level 8 in Classic mode, from back when I was still designing every puzzle manually.

You might be asking yourself: okay, so solving a maze on the fly is not that straightforward, but given enough prep time, what’s the big deal? Instead of generating completely random levels in real time, why doesn’t this guy just serve a random level from a pre-generated list of solvable levels that he’s already curated beforehand?

Good question —that was my initial attempt. The reason why that turns out to be not such a good idea is that we have no control over which carry-over piece and on which carry-over square the player would be starting any given level, seeing that it depends entirely on their prior solution path. Still, you might argue, you could either:

  • (1) make sure each level has only one possible end point
    => Nope, that means we’d be back to manual designs, beating the whole point of the Endless mode

  • (2) pre-generate and curate levels for every combination of piece type and square the player might end up in, just in case

    => Really? You want me to manually test and curate 8 x 8 x 6 = 384 combinations for every level?

  • or (3) abandon the whole concept of a carry-over piece and simply serve levels from a curated pregen list regardless of how the prior level ended

    => That would actually ruin the core vibe of Endless that all users currently love: it really feels like one continuous, coherent, infinite level being played. To be fair, manually editing a pregen list is still much faster than manually designing from scratch (MidJourney, anyone?) so this could still come in handy for extra help on future Classic mode levels design, but it won’t give us what we need for Endless.

In any case, this still does not solve the primary matter at hand, which is finding an automated method for generating mazes that have solutions.

One way to approach this would be to try using Pathfinding Algorithms like Depth-First Search (DFS) with Backtracking, or variants of A* and heuristic-based approaches.

An example of a maze generator using DFS with recursive backtracking.
Source: something I built for fun as I was iterating on Echo Chess.

But before we hit another iceberg of delusion, let’s pause and think a bit about what we’re really attempting here:

  1. To solve an Echo Chess level, we have to clear the full board.

  2. Clearing the board means capturing every piece.

  3. To capture a piece, we have to move to its location.

  4. Once a piece is captured, we can’t capture it again.

  5. Therefore, to solve an Echo Chess level, we have to ‘visit’ each piece exactly once.

Does this ring a bell? Bad news, you guys. It looks like the problem of Echo Chess Solvability (ECS) could be mappable to the Hamiltonian Path Problem (HPP) of visiting each city exactly once in a graph.

If only solving a Hamiltonian Path Problem was as easy as tracing a known solution.
Source: Combinatorica

In case you remember your Theoretical Computer Science class from back in the day, the HPP is an NP-complete problem.

And if you also want to account for Echo Chess’s move efficiency mechanic, like finding the shortest Hamiltonian path visiting all cities, then you’re even risking NP-hard territory.

If you don’t recall the HPP, don’t worry about it - you may be more familiar with its close cousin TSP since the Traveling Salesman Problem is such a notorious one. Any way you look at it, you should probably be un-pumped that we’re dealing with these beasts.

The Travelly Boi. My money’s on the Travelly Boi.
Source: reddit, obviously.

Granted, with enough compute, we could likely attempt some approaches like Monte Carlo Tree Search, alpha-beta pruning, or others, but there are other aspects of the Echo Chess solvability problem that can be non-trivial to deal with — even if we developed a reproducible mapping from a level’s starting compoundFEN+movement rules to a graph theory representation.

Things like disappearing ‘cities’ (due to captures), ‘roads’ changing upon every city visit (due to the echoing mechanism), a morphing topology as the graph is being explored: we would need to account for each of these meticulously.
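
For a sense of the baseline we’d be fighting from, here’s a bare-bones DFS-with-backtracking Hamiltonian path search over a static capture graph. It’s only a sketch: it assumes we’ve already flattened a level into a fixed graph mapping each piece to the pieces it could capture next, which is precisely the part the echoing mechanic keeps morphing under our feet.

def hamiltonian_path_exists(graph, start):
    # DFS with backtracking: look for a path visiting every node exactly once
    # `graph` maps each node to the set of nodes reachable from it
    n = len(graph)
    def dfs(node, visited):
        if len(visited) == n:
            return True
        return any(nxt not in visited and dfs(nxt, visited | {nxt})
                   for nxt in graph[node])
    return dfs(start, {start})

# toy capture graph where only one capture order clears the board
toy = {'K': {'n'}, 'n': {'r'}, 'r': {'p1'}, 'p1': set()}
print(hamiltonian_path_exists(toy, 'K'))  # True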

Yes, some heuristic-based backtracking algorithm with clever pruning may put up a decent fight with all of this. But at the end of the day, let’s not lose track of what we actually care about.

Ultimately, the main goal we’re after here is answering this simple question: is this level we’re serving to the player likely to be “solvable” or “not solvable”?

Remember: we’re not really looking for any actual solutions to a maze, that would be our players’ job —they can keep all the fun to themselves. We just need a simple boolean to tell us YES or NO for solvability. So at the very core of it, we are dealing with a Classification problem!

And that’s something we should be able to, in theory, deal much better with, assuming we set things up correctly. Why don’t we give that a shot? 🌚


Data mining

The thing that comes to mind right away when we think of any supervised learning approach is data. Do we have any, can we get any, what shape does it come in, will we get it labeled, will it be clean, will we have enough, you know the drill.

Thankfully, the initial Classic mode of Echo Chess has already achieved a very humble level of traction and engagement, entirely thanks to word-of-mouth from players enjoying the game around the world.

Okay so we have some real users and we can serve them some fresh levels data, that’s great. Let’s assume for a second that we can keep getting similar traction and engagement for Endless mode as we got so far from Classic. How will we get labeled data out of this?

Well if someone gets to the next level, that means they must’ve found a correct solution. Corollary: that level they were playing must be solvable. Just by winning at the game, they’re labeling it for us! Now we just need to get our Genki Dama donors to keep playing and solving while having fun.

Wait —isn’t that a catch-22? How are we going to:

  • (a) convince enough people to play that game mode so we can

  • (b) crowdsource labeled data, which we’ll use to

  • (c) train a model for predicting solvability, which will allow us to

  • (d) guess which generated levels are solvable, so that we can

  • (e) make something that’s fun to play, which is key in order to

  • (f) convince enough people to play that game mode?

I wonder where Aquinas stands on scissors packaging.

The subtle trick is in (e). If we can make something that’s just enough fun to play to kickstart the first cycle, we’ll have our prime mover. We’ll need it to be playable without the user reaching a dead end anytime an unsolvable level shows up. That would kill the whole vibe.

Say hello to the SHUFFLE button: your infinite Get-Out-of-Jail-Free card to shuffle those pesky dead ends away!

Okay, I guess we could acknowledge upfront there are imperfections and dead ends and hope someone will still give the janky prototype a try? Nah, we can do insanely better than that. We make it a feature, not a bug! Remember Minesweeper from the 90s?

A pothole is a tragic accident. A mine is a strategic trap.

Introducing ENDLESS Mode. Feeling stuck at any level? Try again, or simply 'SHUFFLE' it away and move on to the next one 💁🏻‍♀️

Watch out! Some levels are purposefully sprinkled in there as unsolvable traps to be SHUFFLED away 👻

Hone the skill of directly spotting dead-ends or you'll get stuck in a treacherous maze 🧨

Don't be shy with SHUFFLING! You have unlimited shuffles - just keep an eye on the countdown or you'll be racing to Game Over⏱️😉

Unsolvables are a feature, not a bug. Temporarily, until the ML’s ready.

To offer that same adrenaline rush old-school arcade games are famous for, we also cap the extra time that’s winnable from successfully solving levels back at the countdown’s starting 100sec. This lets our players keep their guard up as they progress — and keeps our levels being diligently solved labeled.

And the best part, no data whatsoever about any user is needed. Echo Chess simply tracks its own generated level configs, and tags whether they were solved or not. All data collection is 100% anonymous, not even 'anonymized'. Proud to be supporting Plausible.io, a true privacy-obsessed, cookie-less, GDPR-native, open source analytics platform.

Next, I needed to GTM with this crowdsourcing Trojan Horse. A savvy marketing friend recommended I post a promo video on Instagram (or, God forbid, Tiktok) to reach newcomers from the casual gaming scene.

Not too proud of what happened next but, hey, data ain’t cheap ngl.

I won’t say I had to use the robot voice, but I won’t deny it either.
No clue who the face is at the end btw - is that a Gen Z thing?

Believe it or not, that lame promo + some trending song really hit it off with the social gods. Soon enough, Echo Chess started attracting real fans entirely through word of mouth. I started getting tagged in Twitter conversations from strangers in other languages.

Spannend back at you, brothers. Thank Godheid for Google Translate.

So that was great — it meant I now had real users (happily) playtesting the randomly generated mazes for free, trying their best to beat each one, and automatically tagging a new maze as ‘solvable’ in the background every time they leveled up. Win-win-win. Boom.

Obviously though, the challenge remained with identifying all the ‘unsolvable’ levels. Trying to crowdsource the labeling of these ones is extremely likely to lead to tons of false negative labels that dirty the data, so it’s a bad idea to delegate it. Just because someone out there couldn’t solve a level doesn’t mean it’s unsolvable. Only the opposite is true.

What’s the best way to guarantee your data is clean? Collect it yourself.

To get over the hump of initial bootstrapping, I went ahead and manually tested+tagged 500 different level generations. Yep, 500. Manually.

Was it time-consuming? Agree to disagree. Did my head hurt? Ice cream helped. Did I enjoy it? You bet. I mean I’m literally solving a puzzle game, it's not like I'm labeling dry paint.

Pro tip: gamify your ML data collection. Games are a good drug.

And I’ll let you in on a little secret, but please don’t try this in prod. I was too lazy… I mean, concerned about getting carpal tunnel from clicking around 100s of levels on my local dev env with a desktop mouse. So I pushed an easter-egg div to the live mobile version of Echo Chess and I used it to tag all ‘unsolvables’ by tapping an invisible pixel on the screen while on the go —the ‘solvables’ continued being tracked automatically upon leveling up.

Which brings us to today. Here’s the data mining pipeline as it stands: client app -> Flask server -> ReplitDB -> export to csv -> Jupyter nb for analysis + model training.
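
For flavor, here’s roughly what the labeling hop of that pipeline could look like server-side. The route names and payload shape are invented for illustration; only the idea reflects what’s described above: store each generated compoundFEN in ReplitDB along with its crowdsourced (or manually tagged) solvability label.

from flask import Flask, request
from replit import db  # ReplitDB's key-value client

app = Flask(__name__)

# hypothetical route and payload names, for illustration only
@app.route('/api/level-solved', methods=['POST'])
def level_solved():
    # a player finishing a level proves it's solvable
    db[request.json['compoundFen']] = {'solvability': True}
    return {'ok': True}

@app.route('/api/tag-unsolvable', methods=['POST'])
def tag_unsolvable():
    # trusted manual tags only (e.g. the easter-egg pixel)
    db[request.json['compoundFen']] = {'solvability': False}
    return {'ok': True}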

From here on out, once we reach some models we feel ‘good enough’ about, we can export them, import them into our Flask server, and handle the Inference stage from there through regular client-server communication.

But let’s not get ahead of ourselves yet. We continue our post-data-mining journey on the Jupyter side.


Finally playing with some data

We start by pulling in the labeled levels data into a Jupyter notebook. I’m using Kaggle here, but we can easily switch it up with Colab, our own box, or literally anything.

import pandas as pd
df = pd.read_csv('levels-dataset.csv')
df
5548 rows × 7 columns

That’s 5,548 labeled, unique, procedurally generated levels. Every single one of them manually played and certified by a human player as a solvable or unsolvable maze. Now we’re talking🤘

We seem to have ~40 levels (0.7%) that got mined with an undefined compoundFEN. Let’s just get rid of those to keep things clean. No tears shed.

df = df[~(df['compoundFen'] == 'undefined')]
df.reset_index(drop=True, inplace=True)
print(len(df))

5508

Done. 5.5k. That’s a nice round number.

Train-Test Splits

Before we do anything else, we should randomly split the data into three separate sets: Training, Validation and Test.

Remember taking the SAT? Our friend Bill here does and he’ll take us down memory lane.

Sorry Bill, we’re about to dissect your past trauma.

It only took Bill a few tries to nail those practice sheets that come with answers in the back. Sure, his first two attempts were abysmal. But that’s because the test format is weird. Once Bill saw the correct answers he directly got what the fuss is about. Soon he was acing every single question. Bill’s mom was proud. Her Billy was always such a fast learner.

Then Bill tried a ‘realistic’ test. Downloaded a fresh test bank, sight-unseen —even timed himself and all. But Bill didn’t get so lucky this time around. No sweat, he’d been there before. Better flunk these than the actual exam. At least Bill was now used to the ‘real’ conditions and the importance of guessing when getting unfamiliar questions.

By his tenth test, Bill had really perfected his guessing game. He taught his friends tips and tricks, preaching about multiple-choice patterns and Bayesian probabilities of B’s and C’s in sections following alternating A’s and D’s. He showed them charts he’d been plotting of test bank Q&As. Bill had finally conquered the SAT.

Real life can be a slap in the face. Tough lessons can really cook you through.

When Bill finally sat down for the real thing on D-Day, he couldn’t shake that weird feeling that the questions he was getting seemed a bit… different. Still, Bill had an unfair advantage: his genius ‘probabilistic guessing’ technique. He could’ve sworn it was infallible —even thought about patenting it. Yet Bill had never bombed so hard as he did that day.

Could it be that Bill was studying for the wrong version all along?

You, my friend, were majorly overfitting. You got good at gaming ‘a’ system, just not the one that mattered.

Instead of focusing his learning on universal concepts that generalize widely, Bill just wasted his time coming up with two-bit parlor tricks to impress a handful of practice tests. And the saddest part: if he hadn’t gone through the three sobering phases, Bill would’ve never even known he sucked.

Let’s not repeat Bill’s mistake. We want our models to learn actually useful stuff that’s applicable to data they’ve never seen before. The whole point is to test how well we can predict solvability for new unknown levels. So let’s be sure to set some aside that we’ll keep 100% untouched.

from sklearn.model_selection import train_test_split

# Split 10% into Test, preserving the `solvability` distribution
df, tst_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df['solvability'])

# Split the non-test portion (90%) into 70% Training, 20% Validation
trn_df, val_df = train_test_split(df, test_size=0.2/0.9, random_state=42, stratify=df['solvability'])

print(len(trn_df), len(val_df), len(tst_df))

3855 1102 551

Great. We’ve now randomly split our levels into:

  • 70% Training Set in trn_df (3,855/5,508)

  • 20% Validation Set in val_df (1,102/5,508)

  • 10% Test Set in tst_df (551/5,508)

Data Preprocessing and Feature Engineering

Given our knowledge of Echo Chess mechanics, we already know there are many features of a level that could provide some predictive power for solvability: things like walls, starting pieces, number of queens, and so on. All this information can be extracted from a level's compoundFen, which means new features can be easily engineered and populated for every row.

Without boring you with the deets, I wrote a preprocessing function that takes in a pandas df, and returns it with all the extra features that may be of interest. This function is then called like this.

proc_data(trn_df)
proc_data(val_df)
proc_data(tst_df)

Here are some of the new columns added by proc_data:

  • levelSize

  • levelGeoShape

  • levelOrientation

  • numObstacleSquares

  • numConnectedWalls

  • lengthOfLongestWall

  • startingWhitePiece

  • isStartingPieceBlocked

  • numEndStates

  • numBlackKnights

  • numBlackBishops

  • numBlackRooks

  • numBlackQueens

  • numBlackKings

  • numEmptySpaces

  • emptySquaresProportion

  • blackPiecesProportion

  • obstacleSquaresProportion

  • plus, to encode a certain understanding of the relative spatial positioning of every board state, we generate the following 64 features, one for each square of the 8x8 board (see the sketch right after this list):

    • [ content of cell square_xy for every xy position on the board ]
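
Roughly how those per-square columns might be populated inside proc_data, reusing the fen_to_board decoder from earlier (the exact column naming here is made up):

for i in range(8):
    for j in range(8):
        col = f'square_{chr(ord("a") + j)}{8 - i}'   # e.g. 'square_a8' ... 'square_h1'
        df[col] = df['compoundFen'].apply(lambda fen, i=i, j=j: fen_to_board(fen)[i][j])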

This is a sample of what our Training set trn_df looks like right now.

3855 rows × 67 columns

I’ll only call out the implementation of a few interesting features we added that I think showcase well the power and flexibility of the compoundFen format.

def get_starting_white_piece(fen):
    return next((c for c in fen if c.isupper() and c != 'X'), '')

def count_black_pieces(fen):
    excluded_chars = ['x', 'w']
    return sum(1 for c in fen if c.islower() and c not in excluded_chars)

def count_empty_spaces(fen):
    piece_placement = fen.split(' ')[0]
    digits = [int(char) for char in piece_placement if char.isdigit()]
    return sum(digits)

def hasValidMove(rank, file, board):
    piece = board[rank][file].lower()
    if piece not in ['r', 'b', 'n', 'k', 'q']:
        return 0
    def outOfBounds(i, j):
        return (i < 0) or (i >= 8) or (j < 0) or (j >= 8)
    def blockedMove(dx, dy):
        if outOfBounds(rank + dx, file + dy):
            return 1
        # check if it hits an obstacle or boundary
        return (board[rank + dx][file + dy].lower() == 'x')
    blockedCross = all(blockedMove(dx, dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)])
    blockedDiagonal = all(blockedMove(dx, dy) for dx, dy in [(1, 1), (-1, 1), (1, -1), (-1, -1)])
    l_jumps = [(2, 1), (2, -1), (-2, 1), (-2, -1), (1, 2), (1, -2), (-1, 2), (-1, -2)]
    blockedLJump = all(blockedMove(dx, dy) for dx, dy in l_jumps)
    if piece == 'r' and blockedCross:
        return 0
    elif piece == 'b' and blockedDiagonal:
        return 0
    elif piece == 'n' and blockedLJump:
        return 0
    elif (piece == 'k' or piece == 'q') and blockedCross and blockedDiagonal:
        return 0
    return 1

def blocked_starting_piece(fen):
    board = fen_to_board(fen)
    white_pieces = ['R', 'B', 'N', 'K', 'Q']
    for rank in range(8):
        for file in range(8):
            if board[rank][file] in white_pieces and not hasValidMove(rank, file, board):
                return 1
    return 0

def numEndStates(fen):
    board = fen_to_board(fen)
    black_pieces = ['r', 'b', 'n', 'k', 'q']
    blocked_count = 0
    for rank in range(8):
        for file in range(8):
            if board[rank][file] in black_pieces and not hasValidMove(rank, file, board):
                blocked_count += 1
    return blocked_count

Each one of the numerous functions like these is then easily applied to the individual rows of the dataframe to create new features. For instance:

df['numEndStates'] = df['compoundFen'].apply(numEndStates)
df['blackPieces'] = df['compoundFen'].apply(getBlackPieces)
df['numRooks'] = df['blackPieces'].apply(lambda x: countSpecificPiece(x, 'r'))

And so on and so forth.

In order to simplify our analysis down the line, let’s also make a distinction, within each of the three 70/20/10 datasets trn_df, val_df, and tst_df, between the independent variables (the X’s) and the dependent variable (the Y). Y is the column we’re trying to predict, i.e. solvability. X’s are all the columns we’re using to predict Y.

Note that this is a column split, not a row split like train-val-test. So the amount of data in each set won’t change.

def xs_y(df):
    ...
    # `cats`, `conts` are names of categorical & continuous independent variables
    xs = df[cats + conts].copy()
    return xs, df['solvability'] if 'solvability' in df else None

trn_xs, trn_y = xs_y(trn_df)
val_xs, val_y = xs_y(val_df)
tst_xs, tst_y = xs_y(tst_df)

Cool, that was easy. From here, anytime we want to analyze, model or predict the relationship between the X’s and the Y for any of the three train/val/test sets, we can simply use the corresponding pair.

Exploratory Data Analysis

We can now perform some preliminary EDA on the training set. As you already guessed, we’ll be using trn_xs and trn_y for these plots.

Give a bishop a huge open field, and they’ll still miss out on (exactly) half its glory.
If life gives you bishops, trade them for anything else.
If there are enough queens to morph into out there somewhere, you’ll probably be fine.
The smaller the level, the more constrained the exploration paths become.
More pieces to capture -> more degrees of freedom -> more paths to victory.
GMs should really try sprinkling some obstacles in their Chess matches.
6x6 ‘core’ level size for Endless mode (excluding boundaries on files a & h or ranks 1 & 8).
Can you spot the 4 ‘quad’ centers where knights have 0 moves? (c3, f3, c6, f6)

Machine Learning

Based on these preliminary EDA observations, let's set up a dummy classifier to get a super quick baseline. We’re just eye-balling the above trn_df plots for now.

Remember, we train a classifier on trn_df —or more precisely in this case, we ‘eye-ball’ a dummy classifier on trn_df, then we validate our classifier’s predictive power on val_df. The test set tst_df is purposefully never touched! We’ll only come back to it after we’ve selected the winning model using val_df and we’re done with absolutely everything.

def naive_predict(df):
    predictions = []
    for _, row in df.iterrows():
        # having queens on the board helps
        if row['numQueens'] > 0:
            predictions.append(True)
        # having fewer obstacles helps
        elif row['obstaclesPortion'] < 0.25:
            predictions.append(True)
        # assume everything else unsolvable
        else:
            predictions.append(False)
    return predictions

from sklearn.metrics import mean_absolute_error

naive_preds = naive_predict(val_xs)
print(mean_absolute_error(val_y, naive_preds))

0.14065335753176045
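
And pulling up the confusion matrix for these naive predictions (sklearn lays it out as [[TN, FP], [FN, TP]], matching the four numbers broken down below):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(val_y, naive_preds))

[[ 27  95]
 [ 60 920]]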

Here’s how to interpret this information:

  • ‘Mean Absolute Error’ (MAE) = how far off are we, on average, from correctly predicting the solvability (val_y) of each of the 1,102 Echo Chess levels in val_xs. We got MAE=0.14 here.

  • ‘Confusion Matrix’ = fancy term to mean this 2x2 table above which gives a sense of how ‘confused’ our classifier was in its predictions. For every group of solvability actuals (vertical axis), it shows us what our predictions were (horizontal axis).

    • We correctly predicted ‘unsolvable’ for 27 true unsolvables (True Negatives, TN). Nice.

    • We incorrectly predicted ‘unsolvable’ for 60 levels that were actually solvable (False Negatives, FN). Not so nice.

    • We correctly predicted ‘solvable’ for 920 true solvables (True Positives, TP). Good.

    • We incorrectly predicted ‘solvable’ for 95 levels that were actually unsolvable (False Positives, FP). Not good.

Our goal now is to hopefully improve on this naive predictor, which will serve as a minimum baseline for performance.

Let’s first compare these results to a proper, but simple, Decision Tree to check whether our EDA-eye-balling predictor is that far off. Instead of us splitting hairs on where the nice-looking groups should be separated on visual plots, a tree model will recursively partition our levels based on the features that best separate them into distinct groups —maximizing both heterogeneity across the groups and homogeneity within each.

from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
import re

m = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_xs, trn_y);

def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                        special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))

draw_tree(m, trn_xs, size=10)

Okay, so it’s first picking up on the proportion of obstacles to board size, just like we did. It has a more precise cut-off —sure, we kinda winged that one. Then it’s looking at the number of capturable pieces on the board, and finally the type of the starting piece. That makes a lot of sense, the EDA plots for these were pretty interesting. Not too bad.

print(mean_absolute_error(val_y, m.predict(val_xs)))

0.11070780399274047

MAE at 0.11. Some improvement already. Maybe we'll see even better results with a bigger tree that has more leaves. Feels seasonal enough.

m = DecisionTreeClassifier(max_leaf_nodes=16)
m.fit(trn_xs, trn_y)
draw_tree(m, trn_xs, size=16)

print(mean_absolute_error(val_y, m.predict(val_xs)))

0.1161524500907441

Nope, our average error actually increased a tiny bit. Let’s see if we get improvements with a full-on Random Forest instead.

Think of an RF as a team of individual Decision Trees working together to make better decisions. Each one takes a random chunk of our levels (rows) and a random subset of their features (columns) and tries to come up with some clever prediction. Then they get together and combine their efforts. It’s like traveling in big groups, only actually enjoyable, not entirely useless, and where decisions do get made.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=5, random_state=42)
rf.fit(trn_xs, trn_y)

rf_preds = rf.predict(val_xs)
print(mean_absolute_error(val_y, rf_preds))

0.10435571687840291

confusion_matrix(val_y, rf_preds)

Interesting. MAE looks better at 0.10, but we now seem to almost always be predicting the 'solvable' class, leading to a ridiculous amount of False Positives: 111 unsolvables incorrectly classified as solvable 🤦🏻‍♂️.

Remember when we data mined all those labeled levels, and how crowdsourcing through gaming was only letting us confidently gather solvables?

Well this was bound to happen, and we saw it coming, but now we’ve ended up with an imbalanced level dataset where unsolvables are much less common (~1:8 ratio).

Might as well mindlessly predict ‘solvable’ across the board.

He’s not wrong, you know.

Maybe we can try selecting a threshold higher than 50% for the model to predict True. That’d be, loosely speaking, like asking it to increase its confidence in a level's solvability before saying yes.

Hmm. What do we choose as a threshold: 60%, 75%, 90%… even higher? They say ML is an iterative science. Let’s throw things at the wall… I mean, start with an educated guess to get a feel for the impact.

threshold = 0.75
raw_rf_preds = rf.predict_proba(val_xs)[:, 1]
rf_preds = (raw_rf_preds >= threshold).astype(int)

print(mean_absolute_error(val_y, rf_preds))

0.14791288566243194

confusion_matrix(val_y, rf_preds)

Look at us ping-ponging. We now have a slightly worse MAE, but the amount of False Positives has significantly improved (almost halved) because we are now demanding a higher confidence threshold.

I feel like there’s a missing part.


Eval Metrics

Before we go any further with ML model iterations to see what works best, we need a crystal clear definition of what it means to be ‘best’ for our specific use case, or we might end up optimizing the wrong thing and blowing right past the right one.

“If you aim at nothing, you’ll hit it every time.” —someone who aimed at sounding deep.

It’s clear there are trade-offs here for improving some of the metrics at the expense of others. But is there a smarter and more systematic way to tune this threshold hyperparameter? Sure thing, we’ll need to choose a clever point on the Precision-Recall and AUC-ROC curves, as we’ll see below. Defining that point is trickier than it feels.

Precision vs Recall

Given that the ultimate goal of all this is to minimize unsolvable levels being served to the user, and given that the procedural generation of new levels is fairly efficient and 'cheap' in some sense, one argument could be to maximize ‘Precision’ (i.e. the proportion of levels classified as 'solvable' when they are, indeed, solvable in reality).

No, I won’t include LaTeX formulas for Precision and Recall.
They ++affectation and --intuition.

You can think of aiming for high precision as requiring a high confidence in the levels we are judging to be 'solvable', even if we happen to also be overconservatively discarding other levels that were borderline solvable but that we preferred classifying as 'unsolvable' just in case.

In other words, we’d be willing to optimize for high precision even at the expense of a potential drop in ‘Recall’ (i.e. even if many potentially valid levels get missed out on through misclassification in our overzealous quest for purity of 'solvables').

Recall and True Positive Rate (TPR) are literally the same thing. Gandalf-level renaming.

FPR vs F1

Similarly, we also certainly want to minimize the False Positive Rate (FPR), i.e. the proportion of 'unsolvable' levels that mistakenly get classified as 'solvable', thus ruining our purity of 'solvables'. We can either plot FPR directly for every threshold of our binary classifier, or use the Receiver Operating Characteristic (ROC) curve to analyze the FPR vs TPR (True Positive Rate) trade-off. We’ll do that shortly below.

However, given the class imbalance, with the majority of our dataset carrying positive labels ('solvable' levels), we can expect Precision here to be slightly less sensitive than FPR to an 'unsolvable' level being misclassified as 'solvable'. This is because Precision (and Recall) mostly reflect how well we predict the positive class (solvables), not the negative class (unsolvables), which is naturally harder to detect given its smaller number of samples. To make it concrete: with, say, 800 solvables and 100 unsolvables in a validation set, 10 extra False Positives add a full 10 percentage points to FPR, yet only drag Precision from 100% down to ~98.8% (800/810).

In other words, with the negative class being the minority one in the dataset, FPR will be the more sensitive gauge of the model's actual performance.

Therefore, we will look for the optimal threshold range that minimizes FPR, and maximize the F1 Score within that range.

Why F1, for that secondary goal? Trying to maximize Precision even further within an already minimized FPR range could lead to undesirable extremes like incredibly low Recall and/or TPR values. Conversely, it could be tempting to search within that range for a 'reasonable-recall' point along the Precision-Recall curve, so that fewer procedurally generated solvables get unnecessarily tossed out. But that's pretty much what F1 is doing for us under the hood either way, so F1 gets us two birds, one stone.

The trick is to keep the grip loose enough, or you’ll end up letting go of every last level.

We then define a max % FPR we are willing to accept for our use case, say, 5%. Remember, this would represent how many of the generated levels that we predict to be solvable - all else equal - end up being, unfortunately, unsolvable. We’ll still have some leeway in how to make use of these generated levels and their predictions in prod, but for now this seems like a reasonable starting point.

To recap our approach:

  • For evaluating models:

    • focus on FPR (driven by the top-right FP cell in our confusion matrix)

  • To select prediction thresholds:

    • sweep along the thresholds that are within 5pp of min FPR

    • select the one that maximizes F1 among these

    • set that min_fpr_max_f1_threshold as a hyperparameter

Here’s the code we’ll need to implement it —minus minutiae for plots and the like. We’ll be plotting a bunch of cool curves for different models but it’s all based on the kernel below.

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

max_fpr_tolerance = 0.05  # the max FPR we just agreed to accept

def threshold_tune_binary_classifier(val_xs, val_y, model, max_fpr_tolerance=0.05, val_raw_preds=None):

    # Use precomputed raw predictions if provided (handy for non-sklearn models and ensembles)
    raw_preds = val_raw_preds if val_raw_preds is not None else model.predict_proba(val_xs)[:, 1]
    fpr, tpr, thresholds_roc = roc_curve(val_y, raw_preds, drop_intermediate=False)
    precision, recall, thresholds_pr = precision_recall_curve(val_y, raw_preds)
    f1_scores = 2 * (precision * recall) / (precision + recall)
    min_fpr_thresholds_range = thresholds_roc[fpr <= (min(fpr) + max_fpr_tolerance)]

    # Find thresholds, precisions, recalls within min-FPR range
    sorted_idx = np.argsort(thresholds_pr)
    idx_range = np.where((thresholds_pr[sorted_idx] >= min(min_fpr_thresholds_range)) & (thresholds_pr[sorted_idx] <= max(min_fpr_thresholds_range)))
    pr_idx_in_range = sorted(sorted_idx[idx_range])
    pr_thresholds_in_range = thresholds_pr[pr_idx_in_range]
    precision_in_range = precision[pr_idx_in_range]
    recall_in_range = recall[pr_idx_in_range]

    # Maximize F1 within min-FPR range
    f1_in_range = 2 * (precision_in_range * recall_in_range) / (precision_in_range + recall_in_range)
    min_fpr_max_f1_threshold = pr_thresholds_in_range[np.argmax(f1_in_range)]
    min_fpr_max_f1_index = int(np.where(thresholds_pr == min_fpr_max_f1_threshold)[0])
    roc_target_index = int(np.where(thresholds_roc == min_fpr_max_f1_threshold)[0])

    return {
        'threshold': min_fpr_max_f1_threshold,
        'precision': precision[min_fpr_max_f1_index],
        'recall': recall[min_fpr_max_f1_index],
        'f1': f1_scores[min_fpr_max_f1_index],
        'fpr': fpr[roc_target_index],
        'tpr': tpr[roc_target_index]
    }

Model Selection and Tuning

Great. So now we have data, features, metrics, and a proper way to select suitable models. Let’s spin up some proper tabular models and see what’s up.

Tabular models

Random Forest

Remember our rf? Let’s retune this model to improve the metrics we actually care about.

rf_targets = threshold_tune_binary_classifier(val_xs, val_y, rf)

Threshold: 0.9505685425685425
Precision: 98.63%
Recall: 43.98%
F1 Score: 60.83%
True Positive Rate: 43.98%
False Positive Rate: 4.92%

And I’ve got some snazzy plots for this rf for us to look at. Don’t worry if things look a bit confusing, we’ll go through them together in a sec.

Left: FPR for every prediction threshold we could set. We’re almost hitting our 5% max tolerance.
Right: F1 for every possible threshold. That’s the max we could get given our FPR constraint.
(X-axis the same for both)
Left: Precision for every possible threshold. As expected, FPR gets us a good value here ‘for free’.
Right: Corresponding Recall at each Precision value. Again, F1 lands us on a solid point ‘for free’.
(Y-axis the same for both)
Bonus: Corresponding TPR (i.e. Recall) for every FPR value. This one specifically illustrates the trade-off of making overconservative predictions. But it’s a little redundant for what we need.
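(For the curious, the plotting itself boils down to something like this, minus the styling minutiae. A sketch, reusing the same curves the tuning kernel above computes:)

import matplotlib.pyplot as plt

raw_preds = rf.predict_proba(val_xs)[:, 1]
fpr, tpr, thresholds_roc = roc_curve(val_y, raw_preds, drop_intermediate=False)
precision, recall, thresholds_pr = precision_recall_curve(val_y, raw_preds)
f1_scores = 2 * (precision * recall) / (precision + recall)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharex=True)
ax1.plot(thresholds_roc[1:], fpr[1:])              # skip the sentinel threshold sklearn prepends
ax1.axhline(y=max_fpr_tolerance, linestyle='--')   # our 5% max FPR tolerance
ax1.set_xlabel('threshold'); ax1.set_ylabel('FPR')
ax2.plot(thresholds_pr, f1_scores[:-1])            # F1 at every candidate threshold
ax2.set_xlabel('threshold'); ax2.set_ylabel('F1')
plt.show()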

What does this all mean?

It means that for this model and dataset, if we set an overly conservative prediction threshold (i.e. has to be >0.95 to be considered solvable, everything else is assumed not), we can drop the False Positive rate down to 4.9% and get a pretty high Precision of 98.6%. In other words, we’d be pretty confident that the solvable levels we choose are actually solvable.

Evidently this comes at a cost. We’d only be getting a True Positive rate of 44%, so we’d only be keeping roughly 1 of every ~2.3 actually solvable levels and throwing out the rest (~56%, a bit over half of them). But that’s the best that this model can give us for our data given the metrics we care to optimize. Not too shabby tbh.

Left: original RF before tuning a threshold. Right: after tuning for min-FPR, max-F1.
We completely reversed almost all the FPs at the cost of only ~half the TPs.
This TP loss would look scary if we hadn’t purposefully defined our goals.

All in all, this looks promising. We’ll probably end up using multiple models that can pick up on different things and ensemble them together. Let’s add this model to our models_predictions ensembling dict.

models_predictions = {}  # our ensembling dict

raw_preds = rf.predict_proba(val_xs)[:, 1]
rf_threshold_target = rf_targets['threshold']
predictions = (raw_preds >= rf_threshold_target).astype(int)
models_predictions['Random-Forest-tuned'] = {
    'tuned_threshold': rf_threshold_target,
    'raw_predictions': raw_preds,
    'class_predictions': predictions
}

Balanced Random Forest

Another approach is to use a Balanced Random Forest (BRF) which could also help reduce the rate of false positives caused by our imbalanced dataset.

from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
brf.fit(trn_xs, trn_y)

This has far fewer FPs than the PRE-tuning unbalanced Random Forest, but it does not beat the threshold-tuned version of our unbalanced rf, given our metrics goals. Let's see if we can combine both approaches by threshold-tuning the BRF.

brf_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, brf, max_fpr_tolerance)['threshold']

Threshold: 0.6988891533303299
Precision: 98.67%
Recall: 45.51%
F1 Score: 62.29%
True Positive Rate: 45.51%
False Positive Rate: 4.92%

This tuned-BRF has very comparable FPR and Precision to the unbalanced tuned-RF, while also having slightly higher TPR and F1. We’ll add it to the ensemble as well —same code with brf instead of rf.

XGBoost (eXtreme Gradient Boosting)

I’m thinking it’s worth trying to train an XGBoost model instead —XGBoost tends to handle unbalanced datasets slightly better, so it could be a good fit in this case.

Remember the team analogy we touched on for random forests? XGBoost is like that smart kid who added some structure around the decision-making team. You’re still creating a bunch of trees, but not entirely randomly —instead you try to fix the mistakes of the previous trees as you go, paying extra attention to the errors and adjusting each new tree accordingly. Hence the boosting name.

import xgboost as xgb

xgb_model = xgb.XGBClassifier(n_estimators=100, min_child_weight=5, random_state=42,
                              scale_pos_weight=sum(trn_y==0)/sum(trn_y==1))
xgb_model.fit(trn_xs, trn_y)

This first attempt at XGBoost seems to have a fairly high FPR. Let's try threshold-tuning it to see if it can fit our needs better.

xgb_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, xgb_model, max_fpr_tolerance)['threshold']

Threshold: 0.9594321846961975
Precision: 98.92%
Recall: 46.94%
F1 Score: 63.67%
True Positive Rate: 46.94%
False Positive Rate: 4.10%

Look at that gorgeous baby. Better vitals across the board: lower FPR, higher Precision, higher Recall/TPR, higher F1. Sold to the ensemble in the back.

raw_preds = xgb_model.predict_proba(val_xs)[:, 1]
predictions = (raw_preds >= xgb_threshold_target).astype(int)
models_predictions['XGBoost-tuned'] = {
    'tuned_threshold': xgb_threshold_target,
    'raw_predictions': raw_preds,
    'class_predictions': predictions
}

Feature Importance

Another interesting approach we can try is to only select the most important features according to the random forest, and then train our model only on these features, disregarding all others. The reason this works in many cases is that it can help reduce noise and multicollinearity from less relevant features, and make the model less prone to overfitting.

Random Forests are usually good at identifying feature importance in a dataset. Let’s take a look using our rf model.

def get_top_features_importance(model, trn_xs, min_imp=0, num_top_features=40):
    plt.figure(figsize=(10, 8))
    feature_importances = model.feature_importances_
    df = pd.DataFrame(dict(cols=trn_xs.columns, imp=feature_importances))
    df = df[df['imp'] > min_imp].sort_values(by='imp').tail(num_top_features)
    ax = df.plot('cols', 'imp', kind='barh', legend=False, xlabel='', ylabel='')
    plt.show()
    return df.sort_values(by='imp', ascending=False)

rf_top_features = get_top_features_importance(rf, trn_xs, min_imp=0.02)

Now let’s retrain our random forest only on these top features. We’ll make sure to threshold-tune it as well.

imp_trn_xs = trn_xs[rf_top_features.cols]
imp_val_xs = val_xs[rf_top_features.cols]

rf_top_feats = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
rf_top_feats.fit(imp_trn_xs, trn_y)

top_feats_rf_threshold_target = threshold_tune_binary_classifier(imp_val_xs, val_y, rf_top_feats, max_fpr_tolerance)['threshold']

Threshold: 0.9614293206793206
Precision: 98.72%
Recall: 47.14%
F1 Score: 63.81%
True Positive Rate: 47.14%
False Positive Rate: 4.92%

Another model giving great results. Throw it in the ensemble bag. You know the drill.

Alternatively, we could check if there’s a better feature selection subset according to the feature importance interpretation of an XGBoost model as opposed to the Random Forest. In general, these should be close enough but it doesn’t hurt to check for our data.

def xgb_top_features_importance(xgb_model, importance_type, min_imp=0, num_top_features=40):
    xgb.plot_importance(xgb_model, importance_type=importance_type, max_num_features=num_top_features,
                        title=f"Feature {importance_type} importance", xlabel=f"{importance_type}s")
    xgb_imps_dict = xgb_model.get_booster().get_score(importance_type=importance_type)
    xgb_imps_dict = dict(sorted(xgb_imps_dict.items(), key=lambda item: item[1], reverse=True))
    df = pd.DataFrame(list(xgb_imps_dict.items()), columns=['features', f"{importance_type}_importance"]).head(num_top_features)
    df = df[df[f"{importance_type}_importance"] >= min_imp]
    return df

xgb_new = xgb.XGBClassifier(n_estimators=100, min_child_weight=5, random_state=42,
                            scale_pos_weight=sum(trn_y==0)/sum(trn_y==1))
xgb_new.fit(trn_xs, trn_y)
xgb_top_features = xgb_top_features_importance(xgb_new, importance_type="gain", num_top_features=15)
x_imp_trn_xs = trn_xs[xgb_top_features.features]
x_imp_val_xs = val_xs[xgb_top_features.features]
XGB assigns a somewhat different ranking here and there, but the overall collection is similar.

Interestingly, when we retrain the XGB model on xgb_top_features instead of rf_top_features, we get slightly worse results. Even varying the importance_type XGB uses to select the top features (across gain, cover, or weight) doesn’t improve much over the good old rf_top_features approach. We’ll just keep it simple and stick to the original approach we had.

I’d say we have a decent start so far with these tabular models. There’s something much more intriguing we might be missing out on, though. I’m sure some of you reading this might have been waiting for it from the moment we uttered the word ‘classification’.

CNNs and image classification

CNN classifier using chess board arrays

Taking inspiration from image recognition approaches: since 2D chess boards are very structured and always come in the same form, with each cell containing one of a predefined set of classes, we can train a Convolutional Neural Network (CNN) on the 2D chess board states extracted from every level's FEN.

  • (1) We’d need 13 categories for the content of a cell given that pawns are excluded here (K, Q, B, N, R, k, q, b, n, r, _, x, X).

  • (2) We then one-hot-encode each of these 13 to avoid any implicit relative ranking between them.

  • (3) Every one-hot-encoding of the 13 can now be considered a separate channel for the CNN.

  • (4) So our input tensor becomes (numSamples, 13, 8, 8) for every 8x8 chess board. Note that since the biggest levels in Endless mode are of size 6x6, we could also preconvert all 2D boards to 6x6 and make the input tensor (numSamples, 13, 6, 6) if necessary.

A chess board is literally this. But instead of RGB, it’s KQRBN. Same stuff, more channels.
It’s all tensors all the way down regardless.

This approach might even capture the essence of the spatial data and relative positioning of pieces in a level in a better way than img classification of the rendered level from the Echo Chess app, because only the 'substance' of a level's config is being trained on, without all the noise and variability linked to unnecessary visual elements.

On the flip side, though, this may also make it challenging to simply fine-tune a pretrained SOTA CNN or img classifier from the web since the input format could become substantially different from what is expected.

In any case, we can see how well a basic neural net performs in this approach, and then take it from there. We’ll need to prepare the input data accordingly.
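Quick note: fen_to_board and one_hot_encode_2D_array live elsewhere in the codebase, so here’s a minimal sketch of what they could look like (a reconstruction, assuming the ' '/'x'/'X' cell conventions you’ll see in the printed boards further down, with '_' from the 13 classes decoded as a space):

import numpy as np

# 13 cell classes: the 10 piece types, empty (' '), obstacle ('x'), out-of-bounds ('X')
CLASSES = ['K', 'Q', 'B', 'N', 'R', 'k', 'q', 'b', 'n', 'r', ' ', 'x', 'X']
char_to_index = {c: i for i, c in enumerate(CLASSES)}

def fen_to_board(compound_fen):
    # Expand the board part of the FEN; digits encode runs of empty squares
    rows = compound_fen.split()[0].split('/')
    return np.array([[c for ch in row for c in (' ' * int(ch) if ch.isdigit() else ch)]
                     for row in rows])

def one_hot_encode_2D_array(board, char_to_index):
    # One binary channel per class -> shape (13, 8, 8)
    one_hot = np.zeros((len(char_to_index), *board.shape), dtype=np.float32)
    for cls, idx in char_to_index.items():
        one_hot[idx][board == cls] = 1.0
    return one_hot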

import torch

trn_X = np.array([fen_to_board(x) for x in trn_df['compoundFen'].values])
one_hot_trn_X = np.array([one_hot_encode_2D_array(x, char_to_index) for x in trn_X])
trn_X_tensors = torch.tensor(one_hot_trn_X)

We’ll also do the same for val_X, trn_y, and val_y.
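Something like this, assuming the validation levels sit in a val_df just like trn_df (trn_y and val_y can be reused as-is from the tabular setup):

val_X = np.array([fen_to_board(x) for x in val_df['compoundFen'].values])
one_hot_val_X = np.array([one_hot_encode_2D_array(x, char_to_index) for x in val_X])
val_X_tensors = torch.tensor(one_hot_val_X)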

To choose a CNN architecture, let’s first start with a simple baseline and we can iterate from there. In general, most CNNs can be thought of as pretty much some conv layers, normalization, and activation functions, rinse and repeat. We’ll use the following simple arch as a start.

import torch.nn as nn
import torch.nn.functional as F

class ChessCNN(nn.Module):
    def __init__(self):
        super(ChessCNN, self).__init__()
        # Convolutional Layer 1
        self.conv1 = nn.Conv2d(in_channels=13, out_channels=16, kernel_size=3, stride=1)
        # Max Pooling Layer 1
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Convolutional Layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1)
        # Fully Connected Layer 1
        self.fc1 = nn.Linear(in_features=32, out_features=64)
        # Fully Connected Layer 2 (Output Layer)
        self.fc2 = nn.Linear(in_features=64, out_features=1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.maxpool1(x)
        x = F.relu(self.conv2(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

This is what our basic cnn looks like:

ChessCNN(
  (conv1): Conv2d(13, 16, kernel_size=(3, 3), stride=(1, 1))
  (maxpool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=32, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=1, bias=True)
)

We’ll train this cnn model for 8 epochs with the Binary Cross-Entropy (BCE) as the Loss Function, Adam as the optimizer, and a Learning Rate of 0.002. Why these choices? That’s what worked best in my experiments on this data.

import torch.optim as optim

cnn = ChessCNN()
criterion = nn.BCELoss()
optimizer = optim.Adam(cnn.parameters(), lr=0.002)
num_epochs = 8
batch_size = 8
for epoch in range(num_epochs):
    cnn.train()
    for i in range(0, len(trn_X_tensors), batch_size):
        inputs = trn_X_tensors[i:i+batch_size].float()
        labels = torch.tensor(trn_y[i:i+batch_size], dtype=torch.float32).view(-1, 1)
        optimizer.zero_grad()
        outputs = cnn(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Now that we have a neural net classifier spun up, let’s tune its prediction threshold like we did for the other classifier models.

def get_cnn_raw_preds(cnn, X_train, X_val):
    cnn.eval()
    with torch.no_grad():
        trn_raw_predictions = cnn(X_train.float()).squeeze().numpy()
        val_raw_predictions = cnn(X_val.float()).squeeze().numpy()
    return {
        "trn_raw_preds": trn_raw_predictions,
        "val_raw_preds": val_raw_predictions
    }

raw_preds_dict = get_cnn_raw_preds(cnn, trn_X_tensors, val_X_tensors)
trn_raw_predictions = raw_preds_dict['trn_raw_preds']
val_raw_predictions = raw_preds_dict['val_raw_preds']

cnn_threshold_target = threshold_tune_binary_classifier(val_X_tensors, val_y, cnn, max_fpr_tolerance, val_raw_preds=val_raw_predictions)['threshold']

Okay, now we’re talking. But before we rush and add this model to our ensemble, it’s always good to check our curves to see if anything looks off.

Hmm. This model's curves for FPR, Precision and (especially) F1 are incredibly steep in the vicinity of the threshold target. I don’t feel so great about this news.

It means that the model could be extremely sensitive to noise or slight variations in the data. And it’s especially the case for the FPR value that could fluctuate easily between 0 and 40% if the test data has enough variation compared to our training+validation sets. It will be important to keep this in mind when deciding whether/how to incorporate this model's predictions in the ensemble.

With all this experimentation we’ve been doing to deal with our imbalanced data, we still haven’t tried adding the secret sauce. Let’s rectify this next.


Resampling to balance class distribution

Given that only a small proportion of levels in the data are of the ‘unsolvable’ class, what if we just tried to… oversample from this minority class to get a more balanced dataset?

If we believe the unsolvable levels we currently have fairly represent the underlying distribution of unsolvables, let’s sample more of them! This has the potential to improve model performance by tackling the imbalanced data issue at the root.

We proceed to try various resampling methods like:

  • RandomOverSampler

  • SMOTE (Synthetic Minority Oversampling Technique)

  • BorderlineSMOTE

  • ADASYN

So now we go back to the very beginning, oversample, retrain, retune, and re-evaluate. For each resampling method and each model. Fun fun fun 🙃 Don’t worry I’ll fast-forward you to the interesting stuff.

From all the experimentation I did, the resampling method that led to the best performance was the regular random oversampling without interpolation using the RandomOverSampler. SMOTE and its variant resampling methods were disappointingly not helpful in improving results on this dataset. In fact, performance actually suffered when trying out fancier resampling methods on this data.
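For reference, resample_to_balance is essentially a thin wrapper around imblearn's samplers. A sketch of what it could look like:

from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

def resample_to_balance(method, trn_xs, trn_y, random_state=42):
    samplers = {
        'RandomOverSampler': RandomOverSampler(random_state=random_state),
        'SMOTE': SMOTE(random_state=random_state),
        'BorderlineSMOTE': BorderlineSMOTE(random_state=random_state),
        'ADASYN': ADASYN(random_state=random_state),
    }
    # Oversample the minority class ('unsolvable') up to a 50-50 split
    return samplers[method].fit_resample(trn_xs, trn_y)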

trn_xs_rnd_resampled, trn_y_rnd_resampled = resample_to_balance('RandomOverSampler', trn_xs, trn_y)
Nice. Our training set is now at 6,856 levels with 50-50 solvables.
It’s crazy to think that each of these is an actual Echo Chess level you can play!

Oversampling with tuned-RF

Now that we’ve updated our training set with oversampling, we can retrain (and re-tune) our promising random forest model. Notice we’re using the new versions here: trn_xs_rnd_resampled and trn_y_rnd_resampled.

rf_rnd_rs = RandomForestClassifier(100, min_samples_leaf=5, random_state=42)
rf_rnd_rs.fit(trn_xs_rnd_resampled, trn_y_rnd_resampled)
rf_rnd_rs_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, rf_rnd_rs, max_fpr_tolerance)['threshold']

Threshold: 0.8903113063407183
Precision: 98.60%
Recall: 43.06%
F1 Score: 59.94%
True Positive Rate: 43.06%
False Positive Rate: 4.92%

Sure, not bad, I’ll take it. Dump it in our ensemble.

raw_preds = rf_rnd_rs.predict_proba(val_xs)[:, 1]
models_predictions['RandomOversampling-Random-Forest-tuned'] = {
    'tuned_threshold': rf_rnd_rs_threshold_target,
    'raw_predictions': raw_preds,
    'class_predictions': (raw_preds >= rf_rnd_rs_threshold_target).astype(int)
}

Oversampling with tuned-BalancedRF

brf_rnd_rs = BalancedRandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
brf_rnd_rs.fit(trn_xs_rnd_resampled, trn_y_rnd_resampled)
brf_rnd_rs_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, brf_rnd_rs, max_fpr_tolerance)['threshold']

Threshold: 0.8999404166095343
Precision: 98.50%
Recall: 40.10%
F1 Score: 57.00%
True Positive Rate: 40.10%
False Positive Rate: 4.92%

Okay, maybe I’m getting a little too picky given the stellar options we have, but this one might be pushing it a bit.

I mean we’re already oversampling, and using a balanced RF to handle imbalanced data, and tuning a threshold to be super conservative (all at the risk of generalizability issues). For all that, you’d at least expect some crazy performance in return, and this isn’t it. I think we pass on this one.

Oversampling with tuned-XGBoost

I’ll spare you the same lines of code. This one also got better with oversampling. Guess where it went? Into the ensemble.
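(Fine, if you insist: the shape of it is the same pattern as before. A sketch, with xgb_rnd_rs as my illustrative name for it:)

xgb_rnd_rs = xgb.XGBClassifier(n_estimators=100, min_child_weight=5, random_state=42)
xgb_rnd_rs.fit(trn_xs_rnd_resampled, trn_y_rnd_resampled)  # 50-50 now, so no scale_pos_weight needed
xgb_rnd_rs_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, xgb_rnd_rs, max_fpr_tolerance)['threshold']

raw_preds = xgb_rnd_rs.predict_proba(val_xs)[:, 1]
models_predictions['RandomOversampling-XGBoost-tuned'] = {
    'tuned_threshold': xgb_rnd_rs_threshold_target,
    'raw_predictions': raw_preds,
    'class_predictions': (raw_preds >= xgb_rnd_rs_threshold_target).astype(int)
}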

I think this was actually our best model so far —hard to keep up with all these nice goodies. Real-talk, though, next time I’m going down such a deep rabbit hole, I’m obviously using an MLOps platform like Weights & Biases to keep track of all these model iterations. IYKYK. You live and you learn.


Ensembling all the promising models

So far we’ve been adding each promising model to a models_predictions dict of dicts. Let’s convert it to a dataframe to work with it easily.

models_predictions_df = pd.DataFrame.from_dict(models_predictions, orient='index')
print(models_predictions_df.index.values)

['Random-Forest-tuned' 'Balanced-Random-Forest-tuned' 'XGBoost-tuned' 'Top-features-RF-tuned' 'CNN-tuned' 'RandomOversampling-Random-Forest-tuned' 'RandomOversampling-XGBoost-tuned']

We can now experiment with different approaches of ensembling using the Validation set to pick the winning one.

For instance, we could get the mean raw predictions across the ensemble and then use our existing threshold_tune_binary_classifier function to select a good ensemble_raw_threshold_target prediction threshold for the avg raw preds that minimizes FPR and maximizes F1 scores within our 5% tolerance.

avg_models_raw_preds = models_predictions_df['raw_predictions'].mean()

ensemble_raw_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, None, max_fpr_tolerance=max_fpr_tolerance, val_raw_preds=avg_models_raw_preds)['threshold']

avg_raw_to_tuned_preds = (avg_models_raw_preds >= ensemble_raw_threshold_target).astype(int)

Similarly, we could ensemble by averaging the class predictions themselves (instead of raw preds), then selecting a good ensemble_classes_threshold_target for the average of those class preds.

avg_models_class_preds = models_predictions_df['class_predictions'].mean()

ensemble_classes_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, None, max_fpr_tolerance=max_fpr_tolerance, val_raw_preds=avg_models_class_preds)['threshold']

avg_class_to_tuned_preds = (avg_models_class_preds >= ensemble_classes_threshold_target).astype(int)

Or we could require that both these approaches agree (relative to each one’s tuned threshold respectively), if we wanted to be even more overconservative on solvability.

comb_preds = (avg_raw_to_tuned_preds * avg_class_to_tuned_preds > 0).astype(int)

But in general, each of these approaches risks getting us into dangerous overfitting territory as we greedily squeeze more of the tuning juice out of the validation set. Interestingly, the simpler, non-tuned approach ends up being just as performant: ensembling through majority voting of the class predictions.

import math

num_positive_class_preds = np.vstack(models_predictions_df['class_predictions'].values).sum(axis=0)
majority_needed = math.ceil(len(models_predictions_df) / 2)
majority_class_preds = (num_positive_class_preds >= majority_needed).astype(int)

Here are the final results, side-by-side, for each of these 4 ensembling techniques, post-threshold-tuning (if applicable) on the validation set:

When choosing among gemstones, pick the color you like.

Looking at the TPs and FPs of each, no matter which ensembling method is chosen, when judging a generated level as solvable, we’d still guess right ~99% of the time. For this and the reasons above, we’ll go with the simple majority approach as our ensembling technique.

More specifically, our goal when productizing this will be to pick the “most-promisingly solvable” level from a candidate list of randomly generated levels. So we’ll sort first by class majority vote predictions, then by avg raw preds within each class —this way even if no solvables are found, at least we’d be serving the most promising level among the unsolvable ones.

# Example usage for inference
ensemble_preds = predict_solvability(level_df, 'models')
most_promising_levels = ensemble_preds.sort_values(by=['ensemble_preds', 'avg_raw_preds'], ascending=[False, False])

Inference on the External Test

Remember that 10% of data we completely stashed away in tst_df ages ago? We can now finally check how our ensemble is doing on it. If it sucks, we’re kind of back to square one like our friend Bill. Moment of truth.

FPR of 1.2% on the untouched test set. I’m not crying, you’re crying.

Boom! And that's it! Extremely promising results on both the validation and test sets. When we’re ready to push this to prod, we can load these best-performing models into the server and cache them, grab their pre-tuned thresholds, generate individual predictions at the inference stage, and ensemble them together through majority voting.

Great stuff. 6 classical ML + 1 DL model. All straightforward architectures. Preprocessing, feature engineering, hyperparameter tuning and some usage of oversampling here and there to handle the imbalanced dataset. A few other minor things but all in all just a vintage ML modeling routine.

Genies were the first AGIs.

But wait! We still have two final tricks up our sleeve to improve this further. Chess players will like these.


Trick #1: Data Augmentation

Taking inspiration from other domain areas like image processing, we can quickly notice an opportunity to significantly augment our dataset by creating new valid, labeled samples from our existing labeled ones:

  • Every rectangle level can be mirrored either horizontally or vertically on the opposing side (depending on its shape) to create one new, effectively identical, level for classification.

  • Similarly, every quad level can be mirrored across the 3 opposing corners (either horizontally, vertically, or both horizontally and vertically) to create 3 new, effectively identical, levels.

Top-left: original level. Top-right: augmentation through vertical mirroring.
Bottom-left: aug. through horizontal mirroring. Bottom-right: aug. through v+h mirroring.

Each of these 4 levels above is pretty much the same and should command the same solvability label, yet they each have a different compoundFEN input. Importantly, they also accurately represent the type of levels that get generated by our ground truth distribution —which we can be sure of, given that it’s a parametrized distribution we invented ourselves for the random level generator.

A good data augmentation opportunity and easy low-hanging fruit. Let’s implement them quickly.

def mirror(compoundFen, direction):
    table = fenToTable(compoundFen)
    if direction == 'h':
        return tableToFen(table[[0, 4, 5, 6, 1, 2, 3, 7]])
    elif direction == 'v':
        return tableToFen(table[:, [0, 4, 5, 6, 1, 2, 3, 7]])

def augment(rows, mirroring_direction):
    augmented_rows = rows.copy()
    augmented_rows['compoundFen'] = augmented_rows['compoundFen'].apply(mirror, args=(mirroring_direction,))
    augmented_rows['levelOrientation'] = augmented_rows['levelOrientation'].apply(switch_orientation, args=(mirroring_direction,))
    return augmented_rows

def mirror_augmentation(df):
    rectangle_rows = df[df['levelGeoShape'] == 'rectangle']
    quad_rows = df[df['levelGeoShape'] == 'quad']
    h_rectangle_rows = rectangle_rows[rectangle_rows['levelOrientation'].isin(['h-top', 'h-bottom'])]
    v_rectangle_rows = rectangle_rows[rectangle_rows['levelOrientation'].isin(['v-left', 'v-right'])]
    augmented_rows = pd.concat([
        augment(h_rectangle_rows, 'h'),
        augment(v_rectangle_rows, 'v'),
        augment(quad_rows, 'h'),
        augment(quad_rows, 'v'),
        # mirror quad diagonally across
        augment(augment(quad_rows, 'v'), 'h')
    ], ignore_index=True)
    augmented_df = pd.concat([df, augmented_rows], ignore_index=True)
    return augmented_df

And here is a simple example in action.

fen_test = "XXXXXXXX/XqbxBb1X/Xx1xnxrX/X1n1kq1X/XXXXXXXX/XXXXXXXX/XXXXXXXX/XXXXXXXX w - - 0 1"
print(fen_to_board(fen_test))

[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'q' 'b' 'x' 'B' 'b' ' ' 'X']
 ['X' 'x' ' ' 'x' 'n' 'x' 'r' 'X']
 ['X' ' ' 'n' ' ' 'k' 'q' ' ' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

mirrored_fen = mirror(fen_test, 'h')
print(fen_to_board(mirrored_fen))

[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']
 ['X' 'q' 'b' 'x' 'B' 'b' ' ' 'X']
 ['X' 'x' ' ' 'x' 'n' 'x' 'r' 'X']
 ['X' ' ' 'n' ' ' 'k' 'q' ' ' 'X']
 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

It may look silly, but this stuff actually allows us to 2X our 'rectangle' row samples and 4X our 'quad's, significantly increasing the size of our dataset.

augmented_trn_df = mirror_augmentation(trn_df)
print(len(trn_df), len(augmented_trn_df))

3855 7877

That’s more than a 2X increase in the size of our data! 7,877 Echo Chess levels. I wonder how long it’d take for someone to play them all.

It’s worth mentioning too that we could similarly implement further data augmentations through simple 90-degree rotations, as well as internal reflections, of any level to generate new FENs out of every quad, rectangle, or even full level. It’s not shown in the current version but I’ll just leave you with this for inspiration:

A board is a board is a board.
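(If you want to try it, a 90-degree rotation of the decoded full board is nearly a one-liner with np.rot90. A sketch for full square boards only, since rotating a rectangle level would also require remapping its levelOrientation like mirror_augmentation does:)

def rotate90(compoundFen, k=1):
    # Rotate the decoded 8x8 table by k quarter-turns, then re-encode to FEN
    return tableToFen(np.rot90(fenToTable(compoundFen), k))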

Trick #2: Pre-predicting ‘guaranteed unsolvables’

We can again use our knowledge of the Echo Chess game mechanics to directly filter out any ‘guaranteed’ unsolvable levels before even passing the baton to the ensemble model - for example any level with:

  • No valid first move

  • No black pieces on board

  • More ‘end-state-pieces’, or black pieces with no valid moves, than starting white pieces

  • ‘Untethered’ starting bishops —i.e. starting with a bishop on a white square when all black pieces are on a black square, or vice versa

  • And so on…

We already saw how to check for many of these in the Feature Engineering section, so I won’t bore you with the implementation again. The key thing here is that when we add all the little tweaks like these, as well as the data augmentation part, the oversampling, feature engineering and feature selection, and retrain our ensemble model on the augmented set, the improvements quickly add up even further.
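Composed into a quick pre-filter, it could look something like this (the predicate names are illustrative stand-ins for those earlier feature-engineering checks):

def is_guaranteed_unsolvable(level):
    return (
        not has_valid_first_move(level)
        or num_black_pieces(level) == 0
        or num_end_state_pieces(level) > num_white_pieces(level)
        or has_untethered_starting_bishop(level)
    )

# Only levels that survive the pre-filter ever reach the ensemble
candidates = [lvl for lvl in candidates if not is_guaranteed_unsolvable(lvl)]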

We’ll see next how this is all faring today in prod on the ⭐️LIVE⭐️ app on echochess.com.


The FINAL Outcome on the LIVE APP

I can’t believe I’m saying this… But we are FINALLY at the stage where we can put all this in prod. It’s here. It’s live. People are using it. With ML.

I know you’re dying to find out how this all ends, so I’ll spare you the details. This is the gist of it.

The live Echo Chess Endless mode in prod generates a solvable maze 99%+ of the time. You can test it yourself 🫳🎤

How? Every single time we need to serve a new level to the player, 50 (yes, FIFTY) different candidate levels get procedurally generated using parametrized random distributions.

On the server, we generate predictions for each of these 50 random gens using each of the tuned ML/DL models we’ve selected. We then ensemble them through majority voting, plus keep track of each level’s average raw predictions from every model.

We then sort the 50 candidate gens by majority class first, and avg raw preds second. The “most-promisingly solvable” level from these is then sent to the client, rendered and served to be played. The whole thing happens in a split-second while the level transition sound effect is playing.
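Put together, the server-side flow looks roughly like this (a sketch; generate_random_level and levels_to_features are illustrative names for the generator and feature-engineering steps):

def serve_next_level(num_candidates=50):
    candidates = [generate_random_level() for _ in range(num_candidates)]
    candidates = [lvl for lvl in candidates if not is_guaranteed_unsolvable(lvl)]
    level_df = levels_to_features(candidates)        # feature engineering on each gen
    preds = predict_solvability(level_df, 'models')  # cached models + pre-tuned thresholds
    ranked = preds.sort_values(by=['ensemble_preds', 'avg_raw_preds'], ascending=[False, False])
    return candidates[ranked.index[0]]               # the "most-promisingly solvable" one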

IT WORKS, YOU GUYS! We made a fun Single-Player Chess 🥲

Future work, what didn’t make the cut

I would be remiss if I didn’t mention some of the crazy ideas I wanted to try with Echo Chess before I realized this rabbit hole was getting a tad bit too deep.

If any of these resonates with you (or especially if you have better ideas than these), drop me a note and maybe I’ll try it out in the upcoming version.

RE:Enhancing Predictions

Transformer-based NLP classification of compoundFENs

  • The encoded compoundFEN is just a string. NLP it. Fine-tune a pre-trained Transformer model from HuggingFace like UCL’s ChessGPT (a 2.8B language model pretrained on Chess).

LLM k-shot in-context learning

  • Instead of using a task-specific classifier, feed in all the labeled FENs from the training set as few-shot examples in a prompt to an LLM, and provide it with the unlabeled validation+test sets to make solvability predictions on (can do the inference one FEN at a time or in bulk through the LLM provider’s API or a wrapper like LangChain).

  • Given the size of the training sets, we’d certainly need to use a large-context-window LLM like Anthropic's Claude2 100k – and possibly even some creative version of Retrieval Augmented Generation (RAG) and vector dbs.

JUMBO data augmentation

After EVERY player move (which leads to a new board state), save each interim compoundFEN as its own unique ‘new level’.

  • When the actual level is solved (or is tagged as unsolvable), assign the same solvability for all these unique interim FENs

  • That’s because if a path exists from start to end, then definitionally all the interim states of that path can reach the end, if they follow that same path.

Synthetic generation of Unsolvables

  • For every level without a valid first move, generate N similar levels with the same startingWhitePiece on that same starting square, but with ALL other NON-obstacle squares replaced with random combinations of {black piece, obstacle square, empty square}. They all won’t have valid first moves either.

  • Do the same for every level with more “end-state” pieces than white pieces. Even shuffle the white pieces themselves in type and location, but keep their same count. All these levels will be unsolvable too.

Actual Image classification of levels

  • Generate jpg screenshots of every level using its compoundFEN, then train an image classifier on the labeled images

  • Can fine-tune a pretrained SOTA classifier on this img dataset to get better performance than training from scratch. And given that chess boards don’t look much like the ‘cats’ vs ‘dogs’ fare ImageNet usually deals with, we’d likely pick one of the architectures from the PyTorch Image Models (timm) that’s more suitable for datasets that are not very ‘ImageNet-like’.

  • fast.ai has done some great analysis on such models and datasets.

Letting AI actually play using Reinforcement Learning

  • Instead of framing the problem as classification, approach it from an RL lens à la AlphaZero. Tons of fun could be had exploring this direction.

I wonder how they’ll drink those beers.

From a ‘business’ use case and user-centered design standpoint, we’re solid. We’ve also pretty much saturated the returns from improvements in solvability prediction tbh. So let’s turn our eyes instead to overall Product and Game Design.

RE:Enhancing the Product experience

  • Adjust difficulty curve as player solves more procedurally-generated levels (via varying distribution parametrization, classification thresholds, selected percentile from the sorted preds, etc.)

  • Let players decide what gens look like using an LLM interface: ‘I want more knights, 4-square walls, pawns with promotion, but keep it easy’

  • Player can unlock power-ups like:

    • Call reinforcements to add extra random white pieces on the board (start with 0 available, win extra call every X levels)

    • Upgrade your white piece to Queen at any time (start with 2 instant promotions available, win extra one every Y levels)

    • Trigger regional explosions by landing on a target highlighted square that shows up every X levels 

    • Portal squares to teleport the player’s current piece to another part of the board

  • A community-driven level maker for Echo Chess. Anyone could design their own crazy level and upload it to the community to try out.

I’m sure there are tons of other interesting ideas as well. Curious what others will find.


Epilogue

The game is live, it’s being played by 1000s, both Classic and Endless modes are a hit in their unique ways. ML really saved the day for Endless. And I’ll always enjoy manually crafting hardcore puzzles for Classic.

So just go ahead and try it out already 🙂

For all you chess experts looking for a challenge, try starting with levels 8+ in Classic. And for especially ambitious chess veterans, see how many you can solve starting 11+ (heads-up: it gets really difficult, really quickly!)

Hopefully this little rabbit hole of a story inspires other builders to create more with chess, games, ML, or all of the above. If you come across anything along those lines that piques your interest, please drop a link below. Would love to try it out.

And if you made it to the end and actually enjoyed any of this (or even if you didn’t), please let me know by sharing your thoughts or angry rants. Feedback is always welcome anytime.

Happy puzzling 🕹️

-Sami
