Character-Level Mistake Marking: Precision Error Tracking for Arabic Text

Jibran Kalia 12 min read

بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

Tracking recitation mistakes at the word level isn't enough. A student might pronounce the letter correctly but miss the diacritical mark above it. In this post, I'll walk through how we built character-level mistake tracking for Arabic Quranic text — from Unicode decomposition to the tap-based selection UI.

Why Character-Level Precision?

Consider the Arabic word بِسْمِ (bismi). It contains three letters (ب، س، م) and three diacritical marks (kasra, sukun, kasra). A student might pronounce the س correctly but miss the sukun above it, turning a stopped consonant into an open syllable. Word-level tracking would flag the entire word. Character-level tracking pinpoints the exact harakat.

This distinction matters for teachers. When reviewing a student's recitation, they need to see patterns: does this student consistently drop sukun marks? Do they confuse fatha and damma? Character-level data makes these patterns visible.

Understanding Arabic Character Structure

Arabic text is composed of base letters with combining diacritical marks (tashkeel) that modify pronunciation. These marks appear above or below the letter:

Diacritic Positions

Above the letter:

  • Fatha (◌َ) U+064E — short "a" vowel
  • Damma (◌ُ) U+064F — short "u" vowel
  • Sukun (◌ْ) U+0652 — no vowel (consonant stop)
  • Shadda (◌ّ) U+0651 — doubled consonant
  • Fathatan (◌ً) U+064B — nunation with "a"
  • Dammatan (◌ٌ) U+064C — nunation with "u"
  • Superscript Alef (◌ٰ) U+0670 — long "a" (dagger alef)

Below the letter:

  • Kasra (◌ِ) U+0650 — short "i" vowel
  • Kasratan (◌ٍ) U+064D — nunation with "i"
  • Subscript Alef (◌ٖ) U+0656

A single Arabic character column can therefore have three layers: the base letter, marks above, and marks below. Multiple marks can stack in the same position (e.g., shadda + fatha: بَّ).

The Character Column Model

We represent each visual "column" of an Arabic word as a structured object with three slots:

# app/helpers/arabic_text_helper.rb

ARABIC_DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
BELOW_DIACRITICS  = /[\u0650\u064D\u0656]/

def arabic_character_columns(word_text)
  columns = []
  current_index = 1  # 1-based indexing

  word_text.each_char do |char|
    if char.match?(ARABIC_DIACRITICS)
      # Diacritical mark: attach to the current column. If the text starts
      # with a mark, open a letterless column so columns.last is never nil.
      columns << { letter: nil, above: nil, below: nil } if columns.empty?
      position = char.match?(BELOW_DIACRITICS) ? :below : :above

      if columns.last[position]
        # Stack multiple marks (shadda + fatha). The slot keeps the index
        # of its first mark; later marks still consume an index.
        columns.last[position][:text] += char
      else
        columns.last[position] = { text: char, index: current_index }
      end
    else
      # Base letter: start a new column
      columns << {
        letter: { text: char, index: current_index },
        above: nil,
        below: nil
      }
    end
    current_index += 1  # every character, letter or mark, takes one index
  end

  columns
end

For the word بِسْمِ, this produces:

[
  { "letter": {"text": "ب", "index": 1}, "above": null,                    "below": {"text": "ِ", "index": 2} },
  { "letter": {"text": "س", "index": 3}, "above": {"text": "ْ", "index": 4}, "below": null },
  { "letter": {"text": "م", "index": 5}, "above": null,                    "below": {"text": "ِ", "index": 6} }
]

Every character — letter or diacritical mark — gets a unique 1-based index. This flat index space is what we store in the database.
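
Because the index is just a 1-based position in the word's character sequence, a stored range can be turned back into text with plain array slicing. Here is a minimal sketch of that idea; `characters_in_range` is an illustrative name, not a method from the helper above:

```ruby
# Illustrative sketch: recover the characters covered by a 1-based,
# inclusive index range. `characters_in_range` is a hypothetical helper.
def characters_in_range(word_text, start_index, end_index)
  return word_text if start_index.nil? && end_index.nil? # whole word

  chars = word_text.each_char.to_a
  # Out-of-range slices return nil, so fall back to an empty array
  (chars[(start_index - 1)..(end_index - 1)] || []).join
end

bismi = "\u0628\u0650\u0633\u0652\u0645\u0650" # بِسْمِ
puts characters_in_range(bismi, 3, 4)          # seen + sukun
```

The same slicing works for a single mark (start and end equal), which is how a harakat-only selection can be pulled out of the word.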

Unicode Range Detection

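The regex [\u064B-\u065F\u0670\u06D6-\u06ED] covers three ranges, all inside the Unicode Arabic block (U+0600 to U+06FF):

U+064B - U+065F  Standard Arabic tashkeel (fathatan through wavy hamza below)
U+0670           Superscript Alef (dagger alef)
U+06D6 - U+06ED  Small high/low marks used in Quranic orthography

The below-diacritics subset [\u0650\u064D\u0656] identifies the three common marks that render below the baseline: kasra, kasratan, and subscript alef. Everything else in the diacritics range is treated as above. That is an approximation: a few rarer marks in the range, such as hamza below (U+0655) and small low seen (U+06E3), also sit below the letter and would need to be added to the subset if the text uses them.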
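
One way to sanity-check a hand-written range like this is to compare it against Unicode's own general categories. The sketch below is only a check, not part of the helper: it assumes Ruby's `\p{Mn}` (nonspacing mark) property and lists code points that the regex matches but Unicode does not classify as combining marks. The constant and method names are made up for the example.

```ruby
# Sketch: which code points matched by the diacritics regex are NOT
# nonspacing marks (general category Mn)? Results depend on the Unicode
# version bundled with your Ruby.
ARABIC_DIACRITICS_CHECK = /[\u064B-\u065F\u0670\u06D6-\u06ED]/

def non_mark_codepoints(range_regex)
  (0x0600..0x06FF).select do |cp|
    char = [cp].pack("U") # build a one-character UTF-8 string
    char.match?(range_regex) && !char.match?(/\p{Mn}/)
  end
end

non_mark_codepoints(ARABIC_DIACRITICS_CHECK).each do |cp|
  puts format("U+%04X", cp)
end
```

The list should include entries such as U+06DD (end of ayah), which is a format character rather than a mark. Whether that matters depends on whether such symbols ever appear inside the word text being split.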

Data Model

The Mistakes Table

CREATE TABLE mistakes (
  id            bigint PRIMARY KEY,
  account_id    bigint NOT NULL REFERENCES accounts(id),
  word_id       bigint NOT NULL REFERENCES words(id),
  personal_mushaf_id bigint NOT NULL REFERENCES personal_mushafs(id),
  mistake_type_id    bigint NOT NULL REFERENCES mistake_types(id),
  assignment_id bigint REFERENCES assignments(id),
  teacher_id    bigint REFERENCES users(id),
  start_index   integer,  -- 1-based, NULL = whole word
  end_index     integer,  -- 1-based, NULL = whole word
  notes         text,
  created_at    timestamp NOT NULL,
  updated_at    timestamp NOT NULL
);

The key insight: start_index and end_index are both NULL for whole-word mistakes, and both populated for character-level mistakes. The range is inclusive and must be contiguous.
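
The pairing rule is easy to state and easy to violate, so it is worth checking before a record is saved. A minimal sketch of such a check in plain Ruby follows; `index_range_errors` and its `char_count` parameter are hypothetical, and in a Rails model this logic would more likely sit in a custom validation:

```ruby
# Sketch of the pairing rule for start_index / end_index.
# `index_range_errors` is a hypothetical helper, not code from the app.
def index_range_errors(start_index, end_index, char_count)
  return [] if start_index.nil? && end_index.nil? # whole-word mistake

  if start_index.nil? || end_index.nil?
    return ["start_index and end_index must both be set or both be NULL"]
  end

  errors = []
  errors << "indices are 1-based" if start_index < 1
  errors << "start_index must not exceed end_index" if start_index > end_index
  errors << "end_index is past the last character" if end_index > char_count
  errors
end

p index_range_errors(nil, nil, 6) # whole word, no errors
p index_range_errors(3, 4, 6)     # seen + sukun in بِسْمِ, no errors
p index_range_errors(3, nil, 6)   # half-populated range is rejected
```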

Selection Types

def selection_type
  if start_index.nil? && end_index.nil?
    :whole_word
  elsif harakat? && !letter?
    :harakat_only   # Only diacritical marks selected
  elsif letter?
    :letter         # At least one base letter selected
  else
    :unknown
  end
end

Harakat-only selections display with a dotted circle prefix (◌َ) to show the mark in isolation.
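
The dotted circle is U+25CC, the conventional placeholder base for showing a combining mark by itself. A small sketch of the display rule, with `display_selection` as an illustrative name rather than the real helper:

```ruby
# Sketch: prefix a harakat-only selection with U+25CC (dotted circle) so
# the combining mark has a base to attach to when shown in isolation.
# `display_selection` is a hypothetical name.
DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
DOTTED_CIRCLE = "\u25CC"

def display_selection(selected_text)
  harakat_only = !selected_text.empty? &&
                 selected_text.each_char.all? { |c| c.match?(DIACRITICS) }
  harakat_only ? DOTTED_CIRCLE + selected_text : selected_text
end

puts display_selection("\u064E")       # fatha on a dotted circle
puts display_selection("\u0633\u0652") # seen + sukun, shown unchanged
```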

Mistake Type Hierarchy

We organize mistake types into five categories with ascending severity:

Generic (Severity 10)

General pronunciation errors that don't fit a specific category.

Fluency (Severity 20-21)

  • Hesitation — pause or stumble during recitation
  • Skipped — omitted words or sections

Harakat (Severity 30-35)

Vowel and diacritical mark errors:

  • Fatha, Damma, Kasra, Sukun, Shadda, and a generic Harakat type

Tajweed (Severity 40-55)

Recitation rule violations covering 14 distinct rules:

  • Ghunnah (غنّة) — nasal resonance
  • Madd (مدّ) — letter elongation
  • Qalqalah (قلقلة) — bouncing echo on certain letters
  • Ikhfa (إخفاء) — hidden pronunciation of noon
  • Idgham (إدغام) — letter merging
  • Iqlab (إقلاب) — noon transforms to meem before ba
  • Izhar (إظهار) — clear pronunciation
  • Waqf (وقف) — stopping rules
  • Ibtida (ابتداء) — starting rules
  • Heavy/light letter articulation, throat letters, mushaddad rules, silent letters

Huroof (Severity 70-98)

All 28 Arabic letters as individual mistake types — for students who consistently mispronounce specific letters (e.g., confusing ص and س).
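
Since the severity bands do not overlap, a severity value is enough to recover the category. A sketch of that lookup, using only the ranges listed above (the constant and method names are illustrative):

```ruby
# Sketch: map a severity value to its category using the ranges listed
# above. The constant and method names are illustrative.
SEVERITY_RANGES = {
  generic: 10..10,
  fluency: 20..21,
  harakat: 30..35,
  tajweed: 40..55,
  huroof:  70..98
}.freeze

def category_for(severity)
  SEVERITY_RANGES.find { |_category, range| range.cover?(severity) }&.first
end

p category_for(33) # => :harakat
p category_for(60) # => nil (falls in a gap between categories)
```

The gaps between bands leave room to add types to a category later without renumbering the others.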

Color Coding

const INDICATOR_COLORS = {
  // Generic & Fluency
  "mistake":    "239 68 68",   // red-500
  "hesitation": "234 179 8",   // yellow-500
  "skipped":    "234 179 8",   // yellow-500

  // Harakat (all rose)
  "fatha":   "244 63 94",     // rose-500
  "dhamma":  "244 63 94",
  "kasrah":  "244 63 94",
  "sukoon":  "244 63 94",
  "shadda":  "244 63 94",

  // Tajweed
  "ghunnah":  "168 85 247",   // purple-500
  "madd":     "59 130 246",   // blue-500
  "qalqalah": "20 184 166",   // teal-500
  // Other tajweed: "249 115 22" (orange-500)

  // Huroof: "16 185 129" (emerald-500)
};

Each category has its own color, and the most common tajweed rules get individual ones, so teachers can scan a mushaf page and immediately see the distribution of error types.
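
The commented-out entries in the map imply a two-step lookup: a per-type color first, then a per-category fallback. Here is a sketch of that resolution order, written in Ruby to match the backend examples even though the real map lives in JavaScript. The method name, the split into two hashes, and the "iqlab" key are assumptions for illustration:

```ruby
# Sketch of the lookup implied by the comments in INDICATOR_COLORS:
# a per-type color first, then a per-category fallback.
TYPE_COLORS = {
  "mistake"    => "239 68 68",
  "hesitation" => "234 179 8",
  "skipped"    => "234 179 8",
  "ghunnah"    => "168 85 247",
  "madd"       => "59 130 246",
  "qalqalah"   => "20 184 166"
}.freeze

CATEGORY_FALLBACKS = {
  "harakat" => "244 63 94",  # rose-500
  "tajweed" => "249 115 22", # orange-500
  "huroof"  => "16 185 129"  # emerald-500
}.freeze

def indicator_color(type_key, category)
  TYPE_COLORS[type_key] || CATEGORY_FALLBACKS[category]
end

puts indicator_color("madd", "tajweed")  # 59 130 246
puts indicator_color("iqlab", "tajweed") # 249 115 22
```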

The Character Selection UI

When a teacher taps a word on the mushaf, a dialog opens with each character rendered as a selectable element:

<div class="character-selector" dir="rtl">
  <!-- Row 1: Above harakat -->
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="4">ْ</span>

  <!-- Row 2: Base letters -->
  <span class="char-element char-letter" data-char-type="letter" data-char-index="1">ب</span>
  <span class="char-element char-letter" data-char-type="letter" data-char-index="3">س</span>
  <span class="char-element char-letter" data-char-type="letter" data-char-index="5">م</span>

  <!-- Row 3: Below harakat -->
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="2">ِ</span>
  <span class="char-element char-harakat" data-char-type="harakat" data-char-index="6">ِ</span>
</div>

Tap Behavior

// First tap: select all characters (whole word)
// Enters toggle mode for subsequent taps

handleCharacterTap(event) {
  const element = event.currentTarget;
  const type = element.dataset.charType;
  const index = parseInt(element.dataset.charIndex, 10);

  if (!this.toggleMode) {
    // First tap: select everything, enter toggle mode
    this.selectAll();
    this.toggleMode = true;
    return;
  }

  if (type === "letter") {
    // Tapping a letter toggles the letter AND its harakat
    this.toggleLetterWithHarakat(index);
  } else {
    // Tapping a harakat toggles just that mark
    this.toggleSingleChar(index);
  }

  this.updateHiddenInputs();
}

This two-phase interaction gives teachers a fast path (one tap for whole word) and a precise path (tap again to narrow down to specific characters).
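
The tap rules are simple enough to model away from the DOM, which makes them easy to unit-test. The sketch below is a plain Ruby stand-in for the controller's state, not the controller itself. All names are hypothetical, and it assumes a letter tap follows the letter's own selected state when deciding whether to add or remove its marks:

```ruby
require "set"

# Sketch of the two-phase tap logic as a plain Ruby object.
class TapSelection
  attr_reader :selected

  # columns: array of { letter: Integer, marks: [Integer, ...] } index groups
  def initialize(columns)
    @columns = columns
    @selected = Set.new
    @toggle_mode = false
  end

  def tap_at(index)
    unless @toggle_mode
      # First tap: select every index and switch to toggle mode
      @selected = Set.new(all_indices)
      @toggle_mode = true
      return
    end

    column = @columns.find { |col| col[:letter] == index }
    # A letter tap toggles the letter with its marks; a mark tap toggles itself
    targets = column ? [column[:letter], *column[:marks]] : [index]
    if @selected.include?(index)
      @selected.subtract(targets)
    else
      @selected.merge(targets)
    end
  end

  private

  def all_indices
    @columns.flat_map { |col| [col[:letter], *col[:marks]] }
  end
end

# بِسْمِ: letters at 1, 3, 5 with their marks at 2, 4, 6
selection = TapSelection.new([
  { letter: 1, marks: [2] }, { letter: 3, marks: [4] }, { letter: 5, marks: [6] }
])
selection.tap_at(3) # first tap selects the whole word
selection.tap_at(1) # deselects ba + kasra
selection.tap_at(5) # deselects meem + kasra
p selection.selected.to_a.sort # => [3, 4]
```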

Visual Feedback

/* Unselected: grayed out */
.char-letter:not(.selected) {
  background-color: rgb(243 244 246);  /* gray-100 */
  color: rgb(156 163 175);             /* gray-400 */
}

/* Selected: blue highlight */
.char-letter.selected {
  background-color: rgb(219 234 254);  /* blue-100 */
  color: rgb(37 99 235);               /* blue-600 */
  box-shadow: 0 0 0 2px rgba(59, 130, 246, 0.3);
}

/* Touch feedback */
.char-element:active {
  transform: scale(0.95);
}

Building the Character Mistakes Map

When rendering a word with mistakes, we need to map each character index to its mistake (if any). The first mistake per index wins:

def self.character_mistakes_map(mistakes, word)
  char_columns = ArabicTextHelper.arabic_character_columns(word.splittable_text)
  all_indices = char_columns.flat_map { |col| char_indices_from_column(col) }
  map = {}

  mistakes.each do |mistake|
    if mistake.whole_word?
      all_indices.each { |idx| map[idx] ||= mistake }
    else
      (mistake.start_index..mistake.end_index).each do |idx|
        map[idx] ||= mistake
      end
    end
  end

  map
end

This map drives the rendering: each character column checks its letter, above, and below indices against the map to determine the visual treatment.
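
To see the "first mistake wins" rule in isolation, here is a self-contained sketch that swaps the real models for a Struct. The names `Mistake` (as a Struct), `first_wins_map`, and the labels are stand-ins for illustration:

```ruby
# Self-contained sketch of the "first mistake wins" rule, using a Struct
# in place of the real model.
Mistake = Struct.new(:label, :start_index, :end_index) do
  def whole_word?
    start_index.nil? && end_index.nil?
  end
end

def first_wins_map(mistakes, all_indices)
  mistakes.each_with_object({}) do |mistake, map|
    indices = mistake.whole_word? ? all_indices : (mistake.start_index..mistake.end_index)
    indices.each { |idx| map[idx] ||= mistake }
  end
end

mistakes = [
  Mistake.new("sukoon", 4, 4),     # harakat error on the sukun
  Mistake.new("mistake", nil, nil) # whole-word generic error
]
map = first_wins_map(mistakes, (1..6).to_a)
p map[4].label # => "sukoon"
p map[3].label # => "mistake"
```

Order matters here: with the two mistakes reversed, the whole-word mistake would claim every index and the sukun error would not be visible in the map at all.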

Rendering Mistake Indicators

We use two visual treatments depending on whether the mistake covers the whole word or specific characters:

Whole-Word Mistakes: Colored Background

When all mistakes on a word are whole-word selections, the entire word gets a colored background:

def mistake_class_for(personal_word)
  mistake = personal_word.primary_mistake
  case mistake.mistake_type.category
  when "generic"  then "bg-red-100 dark:bg-red-900/30 text-red-900 rounded"
  when "fluency"  then "bg-yellow-100 dark:bg-yellow-900/30 text-yellow-900 rounded"
  when "tajweed"  then "bg-purple-100 dark:bg-purple-900/30 text-purple-900 rounded"
  when "harakat"  then "bg-rose-100 dark:bg-rose-900/30 text-rose-900 rounded"
  when "huroof"   then "bg-emerald-100 dark:bg-emerald-900/30 text-emerald-900 rounded"
  end
end

Character-Level Mistakes: Colored Underline

When a mistake targets specific characters (not the whole word), the word gets a colored bottom border instead. This is simpler and more reliable than trying to position overlays on individual combining characters:

.partial-mistake--bottom-border {
  background: transparent !important;
  box-shadow: inset 0 -3px 0 0 var(--border-color);
  border-radius: 0 !important;
}

/* Color variants */
.partial-mistake--red    { --border-color: rgb(239 68 68); }   /* generic */
.partial-mistake--yellow { --border-color: rgb(234 179 8); }   /* hesitation */
.partial-mistake--rose   { --border-color: rgb(244 63 94); }   /* harakat */
.partial-mistake--purple { --border-color: rgb(168 85 247); }  /* ghunnah */
.partial-mistake--blue   { --border-color: rgb(59 130 246); }  /* madd */
.partial-mistake--teal   { --border-color: rgb(20 184 166); }  /* qalqalah */
.partial-mistake--orange { --border-color: rgb(249 115 22); }  /* other tajweed */

The underline approach avoids the complexity of measuring combining characters with zero advance width. The color tells the teacher the mistake category at a glance, and tapping the word reveals the full details in the mistake dialog.

Handling Edge Cases

Multiple Mistakes on One Word

A word can have several mistakes (e.g., a tajweed error on one letter and a harakat error on another). The character_mistakes_map assigns each index to the first mistake that covers it, so even when two mistakes overlap in the data, they never overlap visually: each character is colored by at most one mistake.

Safari and Arabic Shaping

Wrapping individual Arabic letters in <span> tags can break contextual shaping in some browsers. When a shadda is present and the browser is Safari, we fall back to whole-word highlighting:

def use_whole_word?(safari: false, lacks_arabic_span_shaping: false)
  all_whole_word? || lacks_arabic_span_shaping || (safari && has_shadda?)
end

QPC Font Limitations

QPC V1 and V2 mushaf fonts use ligature glyphs that can't be split into individual characters. For character-level operations, we use the QPC Nastaleeq text representation which stores individual letters with separate diacritics:

def splittable_text
  qpc_nastaleeq  # Individual letters + diacritics, not ligatures
end

The Complete Flow

  1. Teacher taps a word on the mushaf — Stimulus controller opens the mistake dialog
  2. Dialog loads character selector via Turbo Frame — backend splits the word into character columns
  3. Teacher selects characters — first tap selects all, subsequent taps toggle individual characters
  4. Teacher picks mistake type from the categorized list and submits
  5. Backend validates and saves — converts selected indices to start_index/end_index range, validates contiguity
  6. Word re-renders via Turbo Stream — the word on the mushaf page updates with the new mistake indicator

# Controller creates the mistake
@mistake = @personal_mushaf.mistakes.build(
  word_id: params[:word_id],
  mistake_type_id: params[:mistake_type_id],
  selected_indices: params[:selected_indices],  # e.g., [3, 4]
  teacher: current_user,
  assignment: @assignment,
  account: Current.account
)

# Model derives start/end from selection
def derive_indices_from_selection
  indices = selected_indices.map(&:to_i).sort
  self.start_index = indices.first  # 3
  self.end_index = indices.last     # 4
end
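
Taking only the first and last index would silently turn a selection like [1, 3] into the range 1..3, which is why step 5 validates contiguity. A minimal sketch of such a check follows; the method name is hypothetical, and a Rails model would likely wrap it in a validation:

```ruby
# Sketch of the contiguity check mentioned in step 5.
def contiguous_selection?(selected_indices)
  indices = selected_indices.map(&:to_i).uniq.sort
  return true if indices.empty? # nothing selected = whole word

  indices.each_cons(2).all? { |a, b| b == a + 1 }
end

p contiguous_selection?(%w[3 4]) # => true
p contiguous_selection?([1, 3])  # => false (index 2 is missing)
```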

What This Enables

With character-level mistake data, teachers can:

  • See exactly which diacritical marks a student struggles with
  • Track tajweed rule violations at the letter level
  • Identify patterns across assignments (e.g., consistently missing sukun)
  • Provide precise feedback referencing specific characters, not just words
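
As a rough illustration of the pattern-spotting this enables, the sketch below counts mistakes by type. Plain hashes stand in for the real mistake records, and the method name and `:type` key are assumptions for the example:

```ruby
# Sketch: count a student's mistakes by type to surface recurring
# patterns. Requires Ruby 2.7+ for Enumerable#tally.
def mistake_counts_by_type(mistakes)
  mistakes
    .map { |m| m[:type] }
    .tally
    .sort_by { |type, count| [-count, type] } # most frequent first
end

sample = [
  { type: "sukoon", word_id: 1 },
  { type: "madd",   word_id: 7 },
  { type: "sukoon", word_id: 12 }
]
p mistake_counts_by_type(sample) # => [["sukoon", 2], ["madd", 1]]
```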

The system covers 15+ diacritical marks, 14 tajweed rules, and a huroof mistake type for each of the 28 Arabic letters, all indexed at character granularity across 83,668 words in the Quran.
