Character-Level Mistake Marking: Precision Error Tracking for Arabic Text
بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
Tracking recitation mistakes at the word level isn't enough. A student might pronounce the letter correctly but miss the diacritical mark above it. In this post, I'll walk through how we built character-level mistake tracking for Arabic Quranic text — from Unicode decomposition to the tap-based selection UI.
Why Character-Level Precision?
Consider the Arabic word بِسْمِ (bismi). It contains three letters (ب، س، م) and three diacritical marks (kasra, sukun, kasra). A student might pronounce the س correctly but miss the sukun above it, turning a stopped consonant into an open syllable. Word-level tracking would flag the entire word. Character-level tracking pinpoints the exact harakat.
This distinction matters for teachers. When reviewing a student's recitation, they need to see patterns: does this student consistently drop sukun marks? Do they confuse fatha and damma? Character-level data makes these patterns visible.
Understanding Arabic Character Structure
Arabic text is composed of base letters with combining diacritical marks (tashkeel) that modify pronunciation. These marks appear above or below the letter:
Diacritic Positions
Above the letter:
- Fatha (◌َ) U+064E — short "a" vowel
- Damma (◌ُ) U+064F — short "u" vowel
- Sukun (◌ْ) U+0652 — no vowel (consonant stop)
- Shadda (◌ّ) U+0651 — doubled consonant
- Fathatan (◌ً) U+064B — nunation with "a"
- Dammatan (◌ٌ) U+064C — nunation with "u"
- Superscript Alef (◌ٰ) U+0670 — long "a" (dagger alef)
Below the letter:
- Kasra (◌ِ) U+0650 — short "i" vowel
- Kasratan (◌ٍ) U+064D — nunation with "i"
- Subscript Alef (◌ٖ) U+0656
A single Arabic character column can therefore have three layers: the base letter, marks above, and marks below. Multiple marks can stack in the same position (e.g., shadda + fatha: بَّ).
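To make the layering concrete, here is a minimal sketch that lists the code points of بِسْمِ in storage order. The word is spelled with escapes so the sequence is unambiguous; the `code_points` helper name is made up for this illustration.

```ruby
# Hypothetical helper for illustration: format each character's code point.
def code_points(text)
  text.each_char.map { |char| format("U+%04X", char.ord) }
end

# bismi: ba + kasra, seen + sukun, meem + kasra
BISMI = "\u0628\u0650\u0633\u0652\u0645\u0650"

puts code_points(BISMI).join(" ")
# U+0628 U+0650 U+0633 U+0652 U+0645 U+0650
```

Each base letter is followed by its mark in storage order, even though the mark renders above or below it. `each_char` walks code points rather than grapheme clusters, which is what a per-character index needs.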
The Character Column Model
We represent each visual "column" of an Arabic word as a structured object with three slots:
# app/helpers/arabic_text_helper.rb
ARABIC_DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
BELOW_DIACRITICS = /[\u0650\u064D\u0656]/
def arabic_character_columns(word_text)
  columns = []
  current_index = 1 # 1-based indexing

  word_text.each_char do |char|
    if char.match?(ARABIC_DIACRITICS) && columns.any?
      # Diacritical mark: attach to the current column
      position = char.match?(BELOW_DIACRITICS) ? :below : :above
      if columns.last[position]
        # Stack multiple marks (shadda + fatha); the slot keeps the first mark's index
        columns.last[position][:text] += char
      else
        columns.last[position] = { text: char, index: current_index }
      end
    else
      # Base letter: start a new column. A stray leading mark with no letter
      # before it also lands here instead of raising on columns.last.
      columns << {
        letter: { text: char, index: current_index },
        above: nil,
        below: nil
      }
    end
    current_index += 1 # every character consumes one index
  end

  columns
end
For the word بِسْمِ, this produces:
[
{ "letter": {"text": "ب", "index": 1}, "above": null, "below": {"text": "ِ", "index": 2} },
{ "letter": {"text": "س", "index": 3}, "above": {"text": "ْ", "index": 4}, "below": null },
{ "letter": {"text": "م", "index": 5}, "above": null, "below": {"text": "ِ", "index": 6} }
]
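Every character, letter or diacritical mark, gets a unique 1-based index. This flat index space is what we store in the database. One caveat: when marks stack in the same slot, the slot records only the first mark's index, although the stacked mark still consumes the next index.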
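The flat index list can be recovered from the columns with a small helper. The name `char_indices_from_column` is used later in this post, but its body is not shown, so the version below is a guess that assumes the column hash shape from the example output.

```ruby
# Guessed body for char_indices_from_column, assuming the column shape above.
def char_indices_from_column(column)
  [column[:letter], column[:above], column[:below]]
    .compact                      # drop empty slots
    .map { |slot| slot[:index] }  # keep only the stored index
end

columns = [
  { letter: { text: "\u0628", index: 1 }, above: nil, below: { text: "\u0650", index: 2 } },
  { letter: { text: "\u0633", index: 3 }, above: { text: "\u0652", index: 4 }, below: nil },
  { letter: { text: "\u0645", index: 5 }, above: nil, below: { text: "\u0650", index: 6 } }
]

all_indices = columns.flat_map { |col| char_indices_from_column(col) }.sort
# all_indices is [1, 2, 3, 4, 5, 6]
```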
Unicode Range Detection
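The regex [\u064B-\u065F\u0670\u06D6-\u06ED] covers three ranges, all inside the Arabic Unicode block (U+0600 to U+06FF):
U+064B - U+065F Standard Arabic tashkeel (fathatan through wavy hamza below)
U+0670 Superscript Alef (dagger alef)
U+06D6 - U+06ED Small high/low marks used in Quranic orthography
The below-diacritics subset [\u0650\u064D\u0656] identifies the three common marks that render below the baseline: kasra, kasratan, and subscript alef. Everything else in the diacritics range is treated as rendering above. That is a simplification: a few rarer code points in the range, such as hamza below (U+0655) and small low seen (U+06E3), also sit below the letter.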
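If the rarer below-baseline marks ever matter, the subset can be widened. The sketch below classifies a single character; `EXTENDED_BELOW` and `mark_position` are hypothetical names, and the extra code points (U+0655, U+065C, U+065F, U+06E3, U+06EA, U+06ED) are the marks in the range whose Unicode names place them below the letter.

```ruby
# Sketch only: EXTENDED_BELOW is a hypothetical set, not the helper's constant.
DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
EXTENDED_BELOW = /[\u0650\u064D\u0655\u0656\u065C\u065F\u06E3\u06EA\u06ED]/

# Returns :below or :above for a mark, nil for anything else (e.g. a base letter).
def mark_position(char)
  return nil unless char.match?(DIACRITICS)

  char.match?(EXTENDED_BELOW) ? :below : :above
end

mark_position("\u0650") # kasra          => :below
mark_position("\u0652") # sukun          => :above
mark_position("\u06E3") # small low seen => :below
mark_position("\u0628") # ba             => nil
```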
Data Model
The Mistakes Table
CREATE TABLE mistakes (
  id                 bigint PRIMARY KEY,
  account_id         bigint NOT NULL REFERENCES accounts(id),
  word_id            bigint NOT NULL REFERENCES words(id),
  personal_mushaf_id bigint NOT NULL REFERENCES personal_mushafs(id),
  mistake_type_id    bigint NOT NULL REFERENCES mistake_types(id),
  assignment_id      bigint REFERENCES assignments(id),
  teacher_id         bigint REFERENCES users(id),
  start_index        integer,  -- 1-based, NULL = whole word
  end_index          integer,  -- 1-based, NULL = whole word
  notes              text,
  created_at         timestamp NOT NULL,
  updated_at         timestamp NOT NULL
);
The key insight: start_index and end_index are both NULL for whole-word mistakes, and both populated for character-level mistakes. The range is inclusive and must be contiguous.
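The both-or-neither rule is easy to get wrong, so it helps to state it as a predicate. This is a plain-Ruby sketch rather than the model's actual validation; `valid_index_pair?` is a hypothetical name.

```ruby
# Sketch of the NULL/range invariant; valid_index_pair? is a hypothetical name.
def valid_index_pair?(start_index, end_index)
  # Whole-word mistake: both bounds absent.
  return true if start_index.nil? && end_index.nil?
  # Exactly one bound present is never valid.
  return false if start_index.nil? || end_index.nil?

  # Character-level mistake: 1-based, inclusive, start not after end.
  start_index >= 1 && start_index <= end_index
end

valid_index_pair?(nil, nil) # whole word           => true
valid_index_pair?(3, 4)     # seen + sukun         => true
valid_index_pair?(3, nil)   # half-populated       => false
valid_index_pair?(4, 3)     # reversed range       => false
```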
Selection Types
def selection_type
  if start_index.nil? && end_index.nil?
    :whole_word
  elsif harakat? && !letter?
    :harakat_only # Only diacritical marks selected
  elsif letter?
    :letter       # At least one base letter selected
  else
    :unknown
  end
end
Harakat-only selections display with a dotted circle prefix (◌َ) to show the mark in isolation.
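One way to decide whether a range is harakat-only, and to build the dotted-circle display string, is to slice the word by its 1-based indices. The helper names below are hypothetical and the sketch assumes the range has already been validated against the word length.

```ruby
# Sketch only: helper names here are hypothetical, not the model's real methods.
DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
DOTTED_CIRCLE = "\u25CC"

# Characters covered by a 1-based, inclusive index range.
def selected_chars(text, start_index, end_index)
  text.chars[(start_index - 1)..(end_index - 1)]
end

# True when the range contains marks only, no base letter.
def harakat_only?(text, start_index, end_index)
  selected_chars(text, start_index, end_index).all? { |char| char.match?(DIACRITICS) }
end

# Prefix isolated marks with a dotted circle so they have a base to sit on.
def display_text(text, start_index, end_index)
  selection = selected_chars(text, start_index, end_index).join
  harakat_only?(text, start_index, end_index) ? DOTTED_CIRCLE + selection : selection
end

bismi = "\u0628\u0650\u0633\u0652\u0645\u0650"
display_text(bismi, 4, 4) # dotted circle + sukun
display_text(bismi, 3, 4) # seen + sukun, no prefix
```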
Mistake Type Hierarchy
We organize mistake types into five categories with ascending severity:
Generic (Severity 10)
General pronunciation errors that don't fit a specific category.
Fluency (Severity 20-21)
- Hesitation — pause or stumble during recitation
- Skipped — omitted words or sections
Harakat (Severity 30-35)
Vowel and diacritical mark errors:
- Fatha, Damma, Kasra, Sukun, Shadda, and a generic Harakat type
Tajweed (Severity 40-55)
Recitation rule violations covering 14 distinct rules:
- Ghunnah (غنّة) — nasal resonance
- Madd (مدّ) — letter elongation
- Qalqalah (قلقلة) — bouncing echo on certain letters
- Ikhfa (إخفاء) — hidden pronunciation of noon
- Idgham (إدغام) — letter merging
- Iqlab (إقلاب) — noon transforms to meem before ba
- Izhar (إظهار) — clear pronunciation
- Waqf (وقف) — stopping rules
- Ibtida (ابتداء) — starting rules
- Heavy/light letter articulation, throat letters, mushaddad rules, silent letters
Huroof (Severity 70-98)
All 28 Arabic letters as individual mistake types — for students who consistently mispronounce specific letters (e.g., confusing ص and س).
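The severity bands above can be expressed as a range lookup. Whether the application stores severities this way is an assumption; `SEVERITY_RANGES` and `category_for` are illustrative names only.

```ruby
# Sketch only: the bands come from the list above, the names are hypothetical.
SEVERITY_RANGES = {
  generic: 10..10,
  fluency: 20..21,
  harakat: 30..35,
  tajweed: 40..55,
  huroof:  70..98
}.freeze

# Returns the category symbol for a severity, or nil if it falls in a gap.
def category_for(severity)
  SEVERITY_RANGES.find { |_name, range| range.cover?(severity) }&.first
end

category_for(33) # => :harakat
category_for(60) # => nil (no band covers 56 to 69)
```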
Color Coding
const INDICATOR_COLORS = {
  // Generic & Fluency
  "mistake":    "239 68 68",   // red-500
  "hesitation": "234 179 8",   // yellow-500
  "skipped":    "234 179 8",   // yellow-500
  // Harakat (all rose)
  "fatha":      "244 63 94",   // rose-500
  "dhamma":     "244 63 94",
  "kasrah":     "244 63 94",
  "sukoon":     "244 63 94",
  "shadda":     "244 63 94",
  // Tajweed
  "ghunnah":    "168 85 247",  // purple-500
  "madd":       "59 130 246",  // blue-500
  "qalqalah":   "20 184 166",  // teal-500
  // Other tajweed: "249 115 22" (orange-500)
  // Huroof: "16 185 129" (emerald-500)
};
Each category has a distinct color so teachers can scan a mushaf page and immediately see the distribution of error types.
The Character Selection UI
When a teacher taps a word on the mushaf, a dialog opens with each character rendered as a selectable element:
<div class="character-selector" dir="rtl">
  <!-- Row 1: Above harakat -->
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="4">ْ</span>
  <!-- Row 2: Base letters -->
  <span class="char-element char-letter" data-char-type="letter" data-char-index="1">ب</span>
  <span class="char-element char-letter" data-char-type="letter" data-char-index="3">س</span>
  <span class="char-element char-letter" data-char-type="letter" data-char-index="5">م</span>
  <!-- Row 3: Below harakat -->
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="2">ِ</span>
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="6">ِ</span>
</div>
Tap Behavior
// First tap: select all characters (whole word)
// Enters toggle mode for subsequent taps
handleCharacterTap(event) {
  const element = event.currentTarget;
  const type = element.dataset.charType;
  // Always pass the radix so the index is parsed as decimal
  const index = parseInt(element.dataset.charIndex, 10);

  if (!this.toggleMode) {
    // First tap: select everything, enter toggle mode
    this.selectAll();
    this.toggleMode = true;
    return;
  }

  if (type === "letter") {
    // Tapping a letter toggles the letter AND its harakat
    this.toggleLetterWithHarakat(index);
  } else {
    // Tapping a harakat toggles just that mark
    this.toggleSingleChar(index);
  }
  this.updateHiddenInputs();
}
This two-phase interaction gives teachers a fast path (one tap for whole word) and a precise path (tap again to narrow down to specific characters).
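The two-phase behavior is easier to reason about as a small state machine. The real logic lives in the JavaScript controller; the Ruby model below is only a sketch for checking the rules, and it assumes a letter tap toggles the letter and its marks together based on the letter's current state. `SelectionState` and `handle_tap` are hypothetical names.

```ruby
require "set"

# Hypothetical model of the dialog state, not the actual controller.
class SelectionState
  attr_reader :selected

  # columns: array of { letter: index, marks: [indices] }
  def initialize(columns)
    @columns = columns
    @selected = Set.new
    @toggle_mode = false
  end

  def handle_tap(index)
    unless @toggle_mode
      # First tap: select everything and enter toggle mode.
      @selected = Set.new(all_indices)
      @toggle_mode = true
      return self
    end

    column = @columns.find { |col| col[:letter] == index }
    # A letter toggles together with its marks; a mark toggles alone.
    targets = column ? [column[:letter], *column[:marks]] : [index]
    if @selected.include?(index)
      @selected.subtract(targets)
    else
      @selected.merge(targets)
    end
    self
  end

  private

  def all_indices
    @columns.flat_map { |col| [col[:letter], *col[:marks]] }
  end
end

bismi = [
  { letter: 1, marks: [2] },
  { letter: 3, marks: [4] },
  { letter: 5, marks: [6] }
]
state = SelectionState.new(bismi)
state.handle_tap(3)               # first tap selects all six indices
state.handle_tap(1).handle_tap(5) # deselect ba and meem with their kasras
state.selected.to_a.sort          # => [3, 4]
```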
Visual Feedback
/* Unselected: grayed out */
.char-letter:not(.selected) {
  background-color: rgb(243 244 246); /* gray-100 */
  color: rgb(156 163 175);            /* gray-400 */
}

/* Selected: blue highlight */
.char-letter.selected {
  background-color: rgb(219 234 254); /* blue-100 */
  color: rgb(37 99 235);              /* blue-600 */
  box-shadow: 0 0 0 2px rgba(59, 130, 246, 0.3);
}

/* Touch feedback */
.char-element:active {
  transform: scale(0.95);
}
Building the Character Mistakes Map
When rendering a word with mistakes, we need to map each character index to its mistake (if any). The first mistake per index wins:
def self.character_mistakes_map(mistakes, word)
  char_columns = ArabicTextHelper.character_columns(word.splittable_text)
  all_indices = char_columns.flat_map { |col| char_indices_from_column(col) }
  map = {}

  mistakes.each do |mistake|
    if mistake.whole_word?
      all_indices.each { |idx| map[idx] ||= mistake }
    else
      (mistake.start_index..mistake.end_index).each do |idx|
        map[idx] ||= mistake
      end
    end
  end

  map
end
This map drives the rendering: each character column checks its letter, above, and below indices against the map to determine the visual treatment.
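The first-wins rule means the order of the mistakes list decides precedence. The self-contained sketch below shows that with a stand-in `Mistake` struct and a hypothetical `build_map` helper; neither is the application's real code.

```ruby
# Stand-in for the Mistake model; the Struct and build_map are hypothetical.
Mistake = Struct.new(:name, :start_index, :end_index) do
  def whole_word?
    start_index.nil? && end_index.nil?
  end
end

# Map each character index to the first mistake that covers it.
def build_map(mistakes, all_indices)
  mistakes.each_with_object({}) do |mistake, map|
    indices = mistake.whole_word? ? all_indices : (mistake.start_index..mistake.end_index)
    indices.each { |idx| map[idx] ||= mistake }
  end
end

sukun   = Mistake.new("sukun", 4, 4)
generic = Mistake.new("generic", nil, nil)

map = build_map([sukun, generic], (1..6).to_a)
map.transform_values(&:name)
# index 4 maps to "sukun"; every other index maps to "generic"
```

Reversing the input order would let the whole-word mistake claim index 4 as well, so callers that care about precedence should sort the list first.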
Rendering Mistake Indicators
We use two visual treatments depending on whether the mistake covers the whole word or specific characters:
Whole-Word Mistakes: Colored Background
When all mistakes on a word are whole-word selections, the entire word gets a colored background:
def mistake_class_for(personal_word)
  mistake = personal_word.primary_mistake
  case mistake.mistake_type.category
  when "generic" then "bg-red-100 dark:bg-red-900/30 text-red-900 rounded"
  when "fluency" then "bg-yellow-100 dark:bg-yellow-900/30 text-yellow-900 rounded"
  when "tajweed" then "bg-purple-100 dark:bg-purple-900/30 text-purple-900 rounded"
  when "harakat" then "bg-rose-100 dark:bg-rose-900/30 text-rose-900 rounded"
  when "huroof"  then "bg-emerald-100 dark:bg-emerald-900/30 text-emerald-900 rounded"
  end
end
Character-Level Mistakes: Colored Underline
When a mistake targets specific characters (not the whole word), the word gets a colored bottom border instead. This is simpler and more reliable than trying to position overlays on individual combining characters:
.partial-mistake--bottom-border {
  background: transparent !important;
  box-shadow: inset 0 -3px 0 0 var(--border-color);
  border-radius: 0 !important;
}

/* Color variants */
.partial-mistake--red    { --border-color: rgb(239 68 68); }  /* generic */
.partial-mistake--yellow { --border-color: rgb(234 179 8); }  /* hesitation */
.partial-mistake--rose   { --border-color: rgb(244 63 94); }  /* harakat */
.partial-mistake--purple { --border-color: rgb(168 85 247); } /* ghunnah */
.partial-mistake--blue   { --border-color: rgb(59 130 246); } /* madd */
.partial-mistake--teal   { --border-color: rgb(20 184 166); } /* qalqalah */
.partial-mistake--orange { --border-color: rgb(249 115 22); } /* other tajweed */
The underline approach avoids the complexity of measuring combining characters with zero advance width. The color tells the teacher the mistake category at a glance, and tapping the word reveals the full details in the mistake dialog.
Handling Edge Cases
Multiple Mistakes on One Word
A word can have several mistakes (e.g., a tajweed error on one letter and a harakat error on another). The character_mistakes_map assigns the first mistake per index. Multiple mistakes never overlap visually — each character is colored by at most one mistake.
Safari and Arabic Shaping
Wrapping individual Arabic letters in <span> tags can break contextual shaping in some browsers. When a shadda is present and the browser is Safari, we fall back to whole-word highlighting:
def use_whole_word?(safari: false, lacks_arabic_span_shaping: false)
  all_whole_word? || lacks_arabic_span_shaping || (safari && has_shadda?)
end
QPC Font Limitations
QPC V1 and V2 mushaf fonts use ligature glyphs that can't be split into individual characters. For character-level operations, we use the QPC Nastaleeq text representation which stores individual letters with separate diacritics:
def splittable_text
  qpc_nastaleeq # Individual letters + diacritics, not ligatures
end
The Complete Flow
- Teacher taps a word on the mushaf — Stimulus controller opens the mistake dialog
- Dialog loads character selector via Turbo Frame — backend splits the word into character columns
- Teacher selects characters — first tap selects all, subsequent taps toggle individual characters
- Teacher picks mistake type from the categorized list and submits
- Backend validates and saves — converts selected indices to start_index/end_index range, validates contiguity
- Word re-renders via Turbo Stream — the word on the mushaf page updates with the new mistake indicator
# Controller creates the mistake
@mistake = @personal_mushaf.mistakes.build(
  word_id: params[:word_id],
  mistake_type_id: params[:mistake_type_id],
  selected_indices: params[:selected_indices], # e.g., [3, 4]
  teacher: current_user,
  assignment: @assignment,
  account: Current.account
)

# Model derives start/end from selection
def derive_indices_from_selection
  indices = selected_indices.map(&:to_i).sort
  self.start_index = indices.first # 3
  self.end_index = indices.last    # 4
end
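The flow mentions a contiguity check that the snippet above does not show. A minimal sketch of one way to do it follows; `contiguous_range` is a hypothetical helper, not the model's actual validation.

```ruby
# Sketch: returns [start, end] for a gap-free selection, nil otherwise.
def contiguous_range(selected_indices)
  indices = selected_indices.map(&:to_i).uniq.sort
  return nil if indices.empty?

  # Every neighbor pair must differ by exactly one.
  contiguous = indices.each_cons(2).all? { |a, b| b == a + 1 }
  contiguous ? [indices.first, indices.last] : nil
end

contiguous_range(%w[4 3])   # => [3, 4]
contiguous_range([1, 2, 5]) # => nil (gap between 2 and 5)
```

Without this check, a selection such as 1, 2, 5 would be stored as the range 1 to 5 and silently include characters the teacher never tapped.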
What This Enables
With character-level mistake data, teachers can:
- See exactly which diacritical marks a student struggles with
- Track tajweed rule violations at the letter level
- Identify patterns across assignments (e.g., consistently missing sukun)
- Provide precise feedback referencing specific characters, not just words
The system covers 15+ diacritical marks, 14 tajweed rules, and all 28 Arabic letters as individual huroof mistake types, all indexed at character granularity across 83,668 words in the Quran.