Aelu was born from a simple frustration: the tools available for learning Chinese did not work the way an adult learner's brain needs them to. What started as a personal project became a full adaptive learning system. Along the way, the research on memory and language acquisition reshaped every design decision — and plenty of mistakes were made.
How this started
About two years ago, the goal was to learn Mandarin. Not "learn a few phrases for a trip" — actually learn it. Read a newspaper. Have a conversation. Understand what was being said in Chinese dramas playing in the background.
Duolingo came first, because it always does. It was fine for about a week. The gamification kept the app getting opened, but it quickly became clear the system was optimizing for streaks, not learning. Matching pictures to words, tapping the right answer in a multiple-choice lineup, and watching a green progress bar fill up felt productive — but when recall was tested without the app's scaffolding, everything was gone. The game mechanics were working exactly as designed. They just were not designed for retention.
Anki came next. Anki is powerful — genuinely powerful — but it asks you to build your own curriculum. You import decks, or make your own cards, and the spaced repetition engine handles the scheduling. The problem is that good Anki decks require good content design, and content design is a separate skill from language learning. More time went into formatting cards than studying them. Every decision — should there be pinyin? Audio? Example sentences? Both directions or just one? — was a rabbit hole.
HelloChinese and a few others came after that. They were better. The content was sequenced. The exercises made sense. But they still felt like textbooks with touchscreens. And none of them solved the problem that kept coming up: studying vocabulary in one app, then trying to read something real — a WeChat message, a news headline, a restaurant sign — and the gap between "studying Chinese" and "using Chinese" felt enormous.
Nothing connected what a learner was reading and hearing in the real world to what they were drilling in an app. That gap is where Aelu started.
What the research actually says
Before a line of code was written, weeks went into reading cognitive science papers on memory and learning. Not from an academic perspective — from a builder's perspective. If a tool was going to be built, it should be grounded in an understanding of why existing tools worked or did not.
A few findings shaped everything:
Spaced repetition works, but the intervals matter. This goes back to Hermann Ebbinghaus in the 1880s, who first mapped the forgetting curve — how quickly we forget new information without review. Pimsleur refined this into specific intervals for language learning. Modern algorithms like FSRS (Free Spaced Repetition Scheduler) take it further, adjusting intervals based on your personal forgetting patterns. The core insight is simple: review something just as you're about to forget it, and the memory strengthens. Review too early and you're wasting time. Review too late and you're re-learning from scratch.
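The math behind this idea is small. Here is a minimal sketch of the exponential forgetting-curve model — not FSRS itself, whose formulas are more involved: recall probability decays with time, and the next review lands at the moment recall would drop to a target like 90%. The growth factor at the end is purely illustrative.

```python
import math

def retention(elapsed_days: float, stability: float) -> float:
    """Ebbinghaus-style forgetting curve: predicted recall probability
    after `elapsed_days`, for a memory with the given stability (in days)."""
    return math.exp(-elapsed_days / stability)

def next_interval(stability: float, target: float = 0.9) -> float:
    """Days until predicted recall decays to `target`.
    Solving exp(-t / S) = R for t gives t = -S * ln(R)."""
    return -stability * math.log(target)

# Each successful review raises stability, so intervals expand:
stability = 1.0
for n in range(1, 5):
    print(f"review {n}: next review in {next_interval(stability):.1f} days")
    stability *= 2.5  # illustrative growth factor; real schedulers fit this per item
```

Review too early and the computed interval barely moves; review too late and retention has already fallen well below the target — exactly the trade-off described above.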
Interleaving beats blocked practice. This one was surprising. If you are learning vocabulary for food, clothing, and transportation, the instinct is to study all the food words, then all the clothing words, then transportation. That feels organized. It also does not work as well. Research by Rohrer and Taylor (2007) and others shows that mixing topics and drill types in a single session — even though it feels harder and messier — produces significantly better long-term retention. The difficulty is the point.
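In scheduling terms, the difference between blocked and interleaved practice is just item order. A toy round-robin sketch (the topic lists are made-up examples, not Aelu's curriculum):

```python
from itertools import chain, zip_longest

def interleave(*topic_lists):
    """Round-robin across topics: A1, B1, C1, A2, B2, C2, ...
    Blocked practice would instead run A1, A2, A3, then all of B, then C."""
    mixed = chain.from_iterable(zip_longest(*topic_lists))
    return [item for item in mixed if item is not None]

food = ["苹果", "米饭", "咖啡"]
clothing = ["衣服", "鞋子"]
transport = ["公交车", "地铁"]

session = interleave(food, clothing, transport)
# The first three drills come from three different topics,
# instead of three food words in a row.
```

Real schedulers mix drill types as well as topics, but the principle is the same: consecutive items should force a context switch.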
Desirable difficulty is a real thing. Robert Bjork's work on this concept changed how the team thinks about practice design. Making something harder — within reason — makes the learning stick better. If a drill is easy, you are not growing. If it is impossible, you are just frustrated. The sweet spot is that slightly-too-hard zone where you have to work for it.
The testing effect: retrieval beats review. Actively trying to recall something is more effective than passively re-reading it. This is why flashcards work better than highlighting a textbook. And it's why varied, active drills work better than flashcards alone.
Why Aelu has 27 drill types
When people hear the app has 27 different drill types, they sometimes laugh. It sounds like over-engineering. But each one exists for a reason grounded in how memory works.
Not all practice is equal. Recognizing a character when you see it uses different neural pathways than producing it from memory. Hearing a tone and identifying it is different from producing the correct tone yourself. Reading a sentence with a missing word (cloze deletion) forces you to understand the grammar and context, not just the vocabulary.
Here is a sample of what that means:
Tone pair drills force you to distinguish between similar sounds. The difference between 买 (mǎi, to buy, third tone) and 卖 (mài, to sell, fourth tone) is a single tone — and if you get it wrong, you've said the opposite of what you meant. You don't build that discrimination by reading; you build it by listening and choosing, over and over, with pairs that are specifically designed to be confusing.
Cloze deletion gives you a sentence with a blank: 我想___一杯咖啡 (I want to ___ a cup of coffee). You have to produce 喝 (hē, drink) from context. This is harder than recognizing 喝 on a flashcard, and that's exactly why it works better.
Audio-to-hanzi matching forces you to connect what you hear to what you read — bridging the listening-reading gap that trips up so many learners.
Register-aware drills teach you the difference between formal and casual Chinese. Textbook Chinese and street Chinese diverge significantly, and most apps only teach you one.
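As a concrete illustration, a cloze item is easy to represent. A sketch with a hypothetical item shape — `make_cloze` and its dict layout are illustrative, not Aelu's actual schema:

```python
def make_cloze(sentence: str, answer: str) -> dict:
    """Blank the first occurrence of `answer` in `sentence`."""
    if answer not in sentence:
        raise ValueError("answer must appear in the sentence")
    return {"prompt": sentence.replace(answer, "___", 1), "answer": answer}

drill = make_cloze("我想喝一杯咖啡", "喝")
# drill["prompt"] == "我想___一杯咖啡"; the learner must produce 喝 from context.
```

The hard part is not the data structure — it is choosing sentences where the blank is recoverable from context but not trivially so.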
The science of varied practice says that switching between these drill types in a single session — even though it feels disorienting — strengthens memory traces by forcing your brain to re-contextualize the same information in different ways. It's like training a muscle from multiple angles instead of doing the same exercise on repeat.
The cleanup loop
The feature the team is most proud of is not flashy. It is called the cleanup loop, and it works like this:
You read something in Chinese — a graded reader, a news snippet, whatever's at your level. When you hit a word you don't know, you tap it and get an inline gloss: pinyin, meaning, example. That unknown word automatically becomes a drill item. The next time you practice, it shows up in your spaced repetition queue, mixed in with your other items, across multiple drill types.
Read real Chinese. Look up what you don't know. Drill those specific words. Repeat.
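In data terms the loop is tiny. A sketch with hypothetical types — `DrillItem`, `ReviewQueue`, and `add_lookup` are illustrative names, not Aelu's internals: a tapped word becomes an item that enters the review queue.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class DrillItem:
    word: str
    gloss: str
    due: date

@dataclass
class ReviewQueue:
    items: list = field(default_factory=list)

    def add_lookup(self, word: str, gloss: str) -> None:
        """A word tapped during reading becomes a drill item due tomorrow."""
        self.items.append(DrillItem(word, gloss, date.today() + timedelta(days=1)))

    def due_items(self, today: date) -> list:
        return [i for i in self.items if i.due <= today]

queue = ReviewQueue()
queue.add_lookup("咖啡", "kāfēi - coffee")
# Tomorrow, 咖啡 shows up in the drill queue alongside existing items.
```

Once an item is in the queue, the spaced repetition scheduler treats it like any other — the reading experience and the drilling discipline share one data model.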
This sounds simple, and it is. But it closes the gap that made every other tool frustrating: it connects the experience of reading Chinese to the discipline of drilling vocabulary, without requiring anyone to manually create cards or maintain a word list.
It is basically how a good tutor works. They notice what you struggle with, make a mental note, and circle back to it later. Except the app never forgets and never gets tired.
What went wrong
It would be nice to say there was a clear vision executed perfectly. There was not. Mistakes were made that cost months.
An audio recording and tone grading system was built before anyone needed it. It was technically interesting — record yourself speaking, compare your tone contours to reference audio, get a score. But it was built early, when the core drilling and content were not mature enough. A cool engineering problem was being solved instead of the most important user problem. The lesson: build what is needed next, not what is interesting to build.
The scheduling algorithm was over-engineered before there was data. Weeks went into tweaking spaced repetition parameters — optimal intervals, difficulty weights, interleaving ratios — before there was enough usage data to know if the tweaks were improvements. Optimization was happening in the dark. Eventually the right call was made: pick sensible defaults from the research literature and commit to tuning later with real data. That decision should have come sooner.
Too long was spent on features and not enough on content quality. This is the trap every developer falls into when building an educational product. The features are in your wheelhouse. The content is the hard part. But users do not care how elegant the scheduling algorithm is if the example sentences are awkward or the difficulty progression is wrong. The content is the product. The features are just delivery infrastructure.
Curriculum design was underestimated. Deciding which 300 words to teach first, which grammar points to introduce at each level, which example sentences best illustrate usage — these decisions have more impact on learning outcomes than any technical feature. More time should have been spent on curriculum and less on code in the first six months.
What actually moves the needle
After building this system and using it daily for over a year, and after watching how it tracks real learner progress, here is what actually matters:
Consistency beats intensity. Fifteen minutes every day produces better results than two hours once a week. This isn't motivational advice — it's a direct consequence of how the forgetting curve works. Spaced repetition requires regular contact. A two-hour session can't compensate for six days of silence because by day three, you've already forgotten most of what you reviewed.
Active recall beats passive review. Drilling — actively trying to produce or identify something — is more effective than re-reading notes or passively listening. This is the testing effect in action. It feels harder because it is harder, and that's why it works.
Context beats isolation. Learning a character inside a sentence is more effective than learning it on a flashcard by itself. Sentences give you grammar, usage patterns, collocations, and register cues that isolated vocabulary can't. This is why cloze deletion and sentence-level drills are worth the extra complexity.
Honest diagnostics beat encouragement. This is the one that feels counterintuitive. Most language apps lean heavily on positive reinforcement — confetti animations, streak celebrations, "Great job!" messages. That feels good in the moment, but it doesn't help you improve. Knowing that your listening comprehension is two levels behind your reading ability is uncomfortable but actionable. You can fix a specific weakness. You can't fix "keep up the great work!"
The app shows you exactly where you are: which tones you mix up, which grammar patterns you get wrong, where your listening lags your reading. The numbers are real and sometimes unflattering. That's the point.
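A diagnostic like the tone report reduces to counting errors by category. A sketch — the log format here is hypothetical, just (expected, chosen) tone pairs per answer:

```python
from collections import Counter

def tone_confusions(answers):
    """Tally (expected, chosen) tone pairs from incorrect answers.
    `answers` is a list of (expected_tone, chosen_tone) pairs, tones 1-4."""
    return Counter((exp, got) for exp, got in answers if exp != got)

log = [(3, 3), (3, 2), (3, 2), (4, 4), (2, 3), (4, 2)]
report = tone_confusions(log)
# report.most_common(1) surfaces the worst mix-up: third tone
# answered as second, twice.
```

The output is deliberately unflattering — a ranked list of specific confusions a learner can drill, rather than a generic encouragement message.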
Why deterministic?
Every drill, every score, every recommendation in the app is deterministic — computed from your data using explicit algorithms. This was not an ideological choice. It was a practical one.
Reliability. Language drills need to be instant. The app needed to work every time, immediately, with no dependencies on external services.
Correctness. Auto-generated example sentences sometimes contain errors — wrong tones, unnatural phrasing, hallucinated words. In a language learning context, an error in the training material is worse than no material at all, because you'll memorize the error. Every sentence, every audio clip, every drill in the app has been verified.
Privacy. Your learning data — what you get wrong, how often, which patterns you struggle with — is sensitive in a low-stakes but personal way. It stays on your device and doesn't get sent to third parties for processing.
Offline capability. The core drilling works without an internet connection. On a train, on a plane, in a part of China with spotty wifi — it works.
Where this is going
The app is used daily by its creators and a growing community of learners. The graded reading library is growing. The drill types are being refined based on what the data shows is working. The curriculum is getting tighter. Every week, something that could be better is found and fixed.
If you are learning Chinese, we would genuinely like you to try it. Not for a growth metric — but because the tool gets better when more people use it and share what is missing.
HSK 1-2 content is free, no time limit. Full access to all levels, all drill types, all diagnostics is $14.99/month. No annual upsell, no "premium tier," no in-app purchases.
If you are curious: aeluapp.com
And if you just read this whole thing and have thoughts — about the approach, the research, the mistakes — reach out at hello@aeluapp.com. Every email gets read.