Home · About · Projects · Contact
AVAILABLE — Summer 2026 · Usually replies in <24h · UTC+3

Projects Showcase

Featured / Research

Turkish Morphological Tokenizer

A context-aware analyzer using Finite State Transducers and Viterbi-based POS disambiguation. It models Turkish phonological rules algorithmically and resolves ambiguity across a 65k+ root lexicon.

View Repository · See also: the data pipeline
Finite State Transducer

Project Overview

This research project addresses the complexity of agglutinative morphology in Turkish. By combining Finite State Transducers (FST) with probabilistic disambiguation, we achieve high-accuracy segmentation without the massive compute requirements of LLMs.

  • Core Tech: FST for phonology, Hidden Markov Models for disambiguation.
  • Dataset: Custom 65k+ root lexicon derived from Kaikki, Zemberek, and TDK.
  • Impact: Provides a lightweight, interpretable alternative to neural tokenizers.

Why is this hard?

Turkish is an agglutinative language in which a single word can correspond to an entire English sentence (e.g., "Çekoslovakyalılaştıramadıklarımızdanmışsınızcasına", roughly "as if you were one of those we could not turn into a Czechoslovak").

  • Ambiguity: A single surface form can have dozens of valid parses.
  • Phonetic Harmony: Vowel harmony and consonant changes create a massive search space.
  • OOV Issue: Dictionary-based lookups fail on productive derivations.
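To make the phonetic-harmony point concrete, here is a minimal sketch of two-way vowel harmony: the plural suffix surfaces as -ler after front vowels and -lar after back vowels, decided by the word's last vowel. This toy function ignores loanword exceptions (e.g., "saat") and is illustrative only, not part of the project's FST.

```python
FRONT = set("eiöü")  # front vowels select -ler
BACK = set("aıou")   # back vowels select -lar

def pluralize(word: str) -> str:
    """Pick the plural allomorph by the frontness of the last vowel.
    Toy example of two-way vowel harmony; real morphology also handles
    consonant changes, loanword exceptions, and further suffix chains."""
    for ch in reversed(word.lower()):
        if ch in FRONT:
            return word + "ler"
        if ch in BACK:
            return word + "lar"
    raise ValueError("no vowel found")
```

Every additional suffix repeats a choice like this one, which is why the valid-surface-form space explodes combinatorially.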

Approach

We built a two-stage pipeline:

  • 1. FST Generation: Using OpenFst principles, we model phonological rules (vowel harmony, drops) as state transitions. This generates all possible parses for a word.
  • 2. Disambiguation: A Viterbi decoder scores path probabilities based on bigram POS statistics trained on the BOUN corpus.
  • 3. Rule Engine: A final rule-based layer handles edge cases like proper nouns and date formats.
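The disambiguation stage above can be sketched as a tiny Viterbi decoder over the FST's candidate parses. The transition log-probabilities and the parse alternatives below are made-up stand-ins for statistics trained on a corpus such as BOUN; only the decoding logic reflects the approach.

```python
# Toy bigram POS transition log-probabilities (hypothetical numbers).
TRANS = {
    ("<s>", "Noun"): -0.5, ("<s>", "Verb"): -1.5,
    ("Noun", "Verb"): -0.4, ("Noun", "Noun"): -1.2,
    ("Verb", "Noun"): -0.9, ("Verb", "Verb"): -2.0,
}

def viterbi(candidates):
    """candidates: one list per word position; each item is a list of
    (parse_string, pos_tag) alternatives produced by the FST stage.
    Returns the parse sequence with the highest bigram score."""
    paths = {"<s>": (0.0, [])}  # pos_tag -> (log_prob, parse_sequence)
    for options in candidates:
        new_paths = {}
        for parse, tag in options:
            # Best predecessor for this tag; unseen bigrams get a floor score.
            score, seq = max(
                (lp + TRANS.get((prev, tag), -5.0), seq)
                for prev, (lp, seq) in paths.items()
            )
            if tag not in new_paths or score > new_paths[tag][0]:
                new_paths[tag] = (score, seq + [parse])
        paths = new_paths
    return max(paths.values())[1]
```

The real decoder additionally tracks backpointers per tag and smooths unseen transitions, but the dynamic-programming core is the same.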

Key Metrics

  • Status: Research in progress.
  • Focus: Phonological rule coverage and disambiguation logic.

Ablation Studies

To validate components, we disabled parts of the pipeline:

  • Without Viterbi: Accuracy drops to ~70% (equivalent to picking randomly among the valid parses).
  • Without Phonological Rules: Coverage drops significantly on complex derivations.

Lessons Learned

  • Data Quality > Model Complexity: Cleaning the lexicon gave higher gains than tweaking the HMM.
  • Hybrid is Robust: Combining the exact FST layer with probabilistic scoring handles both seen and unseen words best.
  • Interpretability matters: Unlike BERT tokenizers, we can debug exactly why a word was split a certain way.
Project Timeline

  • Data Collection: Kaikki & Zemberek Parsing
  • FST Design: Phonological Rule Modeling
  • Disambiguation: HMM & Viterbi Implementation
  • Evaluation: Benchmarks & Demo
İpekGPT Dashboard

İpekGPT

LLM-based chatbot for İpek Yolu Entrepreneur Incubation Center. Features RAG architecture for accurate, context-aware responses.

<200ms Latency
90% Cost Reduction
Hybrid Retrieval
Try Live Demo · Source Code

Projects

FitTurkAI
Teknofest / AI Assistant

FitTurkAI

Personalized nutrition assistant combining RAG and fine-tuned CosmosGemma for task-focused dietetic advice.

TECH STACK
PYTHON COSMOSGEMMA 2B RAG FLASK
KEY FEATURES
  • Fine-tuned CosmosGemma 2B model for dietetic expertise.
  • RAG pipeline processing 40+ nutrition PDFs.
  • Personalized meal planning based on biometrics.
RAG ARCHITECTURE
PDFs → Chunks → Vectors → LLM
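The ingestion flow above can be sketched end to end. To stay self-contained this sketch uses a bag-of-words similarity as a stand-in for the real embedding model; function names and chunk sizes are illustrative, not the project's actual configuration.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 200, overlap: int = 40):
    """Split extracted PDF text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words vector. A real pipeline would
    call a sentence-embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks, k: int = 2):
    """Rank chunks by similarity; the top-k are prepended to the LLM prompt."""
    return sorted(chunks, key=lambda c: cosine(embed(query), embed(c)),
                  reverse=True)[:k]
```

Swapping `embed` for a real embedding model and the list scan for a vector store gives the production shape of the pipeline.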
SOURCE CODE
Personal OS
Productivity / System

Personal OS

LLM-based chatbot for İpek Yolu Entrepreneur Incubation Center. Features RAG architecture for accurate, context-aware responses.

TECH STACK
OPENAI API QDRANT FASTAPI REACT
KEY FEATURES
  • Hybrid retrieval (Vector + Keyword) for high accuracy.
  • Sub-200ms latency on production deployment.
  • Context-aware conversation history management.
HYBRID SEARCH FLOW
User → Keyword + Vector → Reranker → LLM
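One common way to merge the keyword and vector result lists before reranking is Reciprocal Rank Fusion; this sketch shows that technique as an illustration, not necessarily the fusion step the deployed system uses.

```python
def rrf(rankings, k: int = 60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc ids
    (best first). Documents ranked highly in either list float to the top;
    k dampens the influence of any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder reranker can then rescore just the fused top results, which keeps latency low while preserving accuracy.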
Source Code
KEGOMODORO
Productivity / Open Source

KEGOMODORO

Customizable Pomodoro timer with Pixela integration for productivity tracking. Features custom themes, stopwatch mode, and full personalization.

TECH STACK
ELECTRON JAVASCRIPT PIXELA API CSS3
KEY FEATURES
  • Real-time Pixela graph integration for habit tracking.
  • Customizable themes and focus/break intervals.
  • Electron-based cross-platform architecture.
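The Pixela integration boils down to posting one pixel per day, with the quantity set to focused minutes, to Pixela's pixel endpoint. This is a minimal Python sketch of that call (the app itself is Electron/JavaScript); the username, graph id, and token below are placeholders.

```python
import json
import urllib.request

PIXELA_BASE = "https://pixe.la/v1/users"

def build_pixel_request(username: str, graph_id: str, token: str,
                        date: str, minutes: int):
    """Prepare Pixela's 'post a pixel' call: POST to
    /v1/users/<username>/graphs/<graphID> with an X-USER-TOKEN header.
    Dates use Pixela's yyyyMMdd format."""
    url = f"{PIXELA_BASE}/{username}/graphs/{graph_id}"
    body = json.dumps({"date": date, "quantity": str(minutes)}).encode()
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"X-USER-TOKEN": token, "Content-Type": "application/json"},
    )

def log_session(req):
    """Send the prepared request (network call; not executed here)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Posting a pixel after each completed Pomodoro keeps the heatmap in sync without any manual tracking.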
LIVE ACTIVITY
KEGOMODORO Live Activity


SOURCE CODE

İpekGPT Case Study

Problem

Entrepreneurs needed instant answers to incubation regulations, but the 500+ page docs were unsearchable. Traditional keyword search failed on semantic queries.

Approach

Implemented a RAG pipeline using Qdrant for vector storage and OpenAI for generation. Used a hybrid search (sparse + dense) to capture both exact terminology and semantic meaning.

Metrics

Reduced average query time to < 200ms. Answer accuracy improved by 40% compared to baseline keyword search.

Links

GitHub Repository

FitTurkAI Case Study

Problem

Generic LLMs give generic diet advice. Users needed personalized plans based on specific health data (BMI, allergies) and verified nutritional guidelines.

Approach

Fine-tuned CosmosGemma 2B on a curated dataset of dietetic Q&A. Integrated RAG to fetch context from approved nutrition textbooks before generation.

Architecture

[Data Ingestion] -> [Chunking] -> [Vector Store] -> [Fine-tuned LLM] -> [Personalized Plan]
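The last two stages of the pipeline above come together in prompt assembly: retrieved textbook passages plus the user's biometrics are packed into the prompt for the fine-tuned model. A minimal sketch, with hypothetical function and field names:

```python
def build_prompt(biometrics: dict, context_chunks: list[str], question: str) -> str:
    """Assemble the final LLM prompt: user biometrics, retrieved
    nutrition-textbook passages, then the user's question."""
    profile = ", ".join(f"{k}: {v}" for k, v in biometrics.items())
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        f"User profile: {profile}\n"
        f"Verified nutrition context:\n{context}\n"
        f"Question: {question}\n"
        "Answer using only the context above and the user's constraints."
    )
```

Grounding the model in retrieved passages this way is what keeps the advice tied to verified guidelines rather than the model's generic priors.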

KEGOMODORO Case Study

Problem

Most Pomodoro timers are rigid. I needed a tool that could track my actual focus hours to Pixela graphs automatically.

Approach

Built an Electron app for cross-platform support. Integrated Pixela API for heatmaps. Added a 'Stopwatch Mode' for flexible work sessions.

Results

Used personally for 500+ hours. Open-sourced with 13 GitHub stars.

Wanna see more?
Check out my GitHub account

© 2026 Kağan Arıbaş. All rights reserved.