36 Hours of Algorithms and Sleep Deprivation: Third Place at the AlphaBit AI Datathon
AlphaBit AI Datathon 2025 - ESI SBA
Links
Connect with me: LinkedIn - Aymen Guerrouf
The AlphaBit Club at ESI SBA organized a 36-hour AI Datathon with 9 distinct challenges. Not tutorials, not guided notebooks — real problems with minimal documentation, conflicting data, and evaluation metrics designed to punish lazy solutions.
We formed a team of 5, entered with the intention of learning, and left with third place overall.
Here's what actually happened.
The Environment: 36 Hours of Controlled Chaos
The format was straightforward: 9 challenges, all running simultaneously, submit as many solutions as you want before the deadline. Final ranking based on aggregate performance across all tasks.
Most teams specialized. Pick 2-3 challenges, go deep, ignore the rest.
We decided to attack everything.
In hindsight, this was the right call. The scoring rewarded breadth. A mediocre submission on a neglected challenge often outperformed missing it entirely.
Citation Network: Graph Neural Networks from Scratch
This was one of my primary challenges. Each node represented a research paper, edges represented citations, and the goal was to predict paper topics using both node features and graph structure.
The dataset:
- `edges.csv` — citation links between papers
- `features.csv` — node feature vectors
- `labels_train.csv` — topic labels for training nodes
- `splits.csv` — explicit train/val/test masks
The task was clear: build a GNN (GCN, GraphSAGE, GAT) using PyTorch Geometric or DGL, train on masked nodes, predict on the hidden test set.
My approach:
- Graph construction — Built the adjacency structure using PyG's Data object
- Architecture — Tested GCN and GraphSAGE variants with different hidden dimensions
- Training discipline — Strict adherence to the provided masks, CrossEntropyLoss, validation-based hyperparameter tuning
The challenge was beginner-friendly by design, but execution still mattered: proper message passing, avoiding overfitting on the small labeled set, and respecting the split masks.
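The propagation step at the heart of a GCN is simple enough to show without any framework. Here's a plain-Python sketch of the symmetrically normalized neighborhood aggregation a GCN layer performs — the actual models were built with PyTorch Geometric, so this strips out the learned weights and nonlinearity and keeps only the message-passing arithmetic:

```python
import math

def gcn_aggregate(features, edges):
    """One round of GCN-style message passing in plain Python:
    each node sums its neighbours' features (self-loop included),
    scaled by 1/sqrt(deg(u) * deg(v)) as in Kipf & Welling's GCN."""
    n = len(features)
    # Add self-loops so every node retains part of its own signal.
    adj = [{i} for i in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(nbrs) for nbrs in adj]
    dim = len(features[0])
    out = []
    for u in range(n):
        acc = [0.0] * dim
        for v in adj[u]:
            norm = 1.0 / math.sqrt(deg[u] * deg[v])
            for k in range(dim):
                acc[k] += norm * features[v][k]
        out.append(acc)
    return out
```

Stacking two of these rounds (with a learned linear map and ReLU between them) is essentially what a two-layer GCN does.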
Final position: 9th place.
Solid fundamentals, nothing fancy.
AITSP: When Python Isn't Fast Enough
The scheduling challenge was where things got interesting.
PFSP (Permutation Flow Shop Scheduling Problem) is NP-hard. The search space explodes exponentially. Naive approaches stalled in local optima around a makespan of 2100 when competitive solutions needed roughly 1600.
The problem demanded algorithmic engineering, not just model selection.
The Technical Stack
Phase 1: Speed
Python's default performance was unacceptable. I rewrote the makespan computation using Numba with Taillard's acceleration technique.
Result: 1,000x speedup. From 5,000 evaluations per second to 5 million.
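For context, the quantity being evaluated millions of times per second is just a short dynamic program. Here's a plain-Python version of the makespan recurrence — in the competition this logic was compiled with Numba's `@njit` (plus Taillard's incremental evaluation for insertion moves), which is where the speedup came from:

```python
def makespan(perm, p):
    """Makespan of a job permutation in a permutation flow shop.
    p[j][m] is the processing time of job j on machine m.
    C[m] tracks the completion time on each machine as jobs are appended:
    a job starts on machine m only when it has finished on m-1 AND
    machine m has finished its previous job."""
    n_machines = len(p[0])
    C = [0] * n_machines
    for job in perm:
        C[0] += p[job][0]
        for m in range(1, n_machines):
            C[m] = max(C[m], C[m - 1]) + p[job][m]
    return C[-1]
```

Decorating this exact function with `@numba.njit` and feeding it NumPy arrays removes the interpreter overhead without changing a line of the algorithm.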
Phase 2: Search Architecture
Small instances (N ≤ 20) went to OR-Tools CP-SAT for exact solutions. Large instances got an Iterated Greedy algorithm with NEH-D initialization and path relinking to escape cycles.
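A minimal sketch of one destroy/rebuild cycle of Iterated Greedy, with a generic `eval_fn` standing in for the compiled makespan kernel (the destruction size `d` and this exact structure are illustrative — the real solver added NEH-D initialization, acceptance criteria, and path relinking on top):

```python
import random

def ig_destroy_rebuild(perm, eval_fn, d=2, seed=0):
    """One destroy/rebuild pass of Iterated Greedy:
    remove d randomly chosen jobs, then greedily re-insert each one
    at the position that minimises eval_fn over the partial sequence."""
    rng = random.Random(seed)
    perm = list(perm)
    removed = [perm.pop(rng.randrange(len(perm))) for _ in range(d)]
    for job in removed:
        # Try every insertion position and keep the best.
        candidates = [perm[:i] + [job] + perm[i:] for i in range(len(perm) + 1)]
        perm = min(candidates, key=eval_fn)
    return perm
```

Iterating this loop, accepting improvements (and occasionally worse solutions, simulated-annealing style), is what lets the search escape the local optima that trap a pure greedy construction.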
Phase 3: Parallelization
I deployed multiple search agents simultaneously:
- Forward solver
- Backward solver (reversed job order)
- Aggressive destroy/rebuild cycles
- Conservative local refinement
Four CPUs, four independent search trajectories, unified best-solution tracking.
The final engine could explore the solution space with surgical precision while competitors were still debugging their greedy implementations.
Final position: 7th place.
The algorithm punched above its weight.
RAG Challenge: Teaching Models to Detect Contradictions
This challenge split into two sub-tasks:
Legal Clerk Agent — Parse zoning laws that intentionally contain contradictions. The evaluation specifically tested whether your model could identify conflicting rules instead of hallucinating a resolution.
Fact-Checking Agent — Classify claims as True, False, or Partially True based on a knowledge base. Negation handling was critical.
Most LLM-based systems fail at contradiction detection. They're trained to be helpful, which means they try to reconcile conflicting information instead of flagging it.
My approach:
- Structured retrieval — Retrieved relevant law sections with explicit conflict markers
- Prompt engineering — Forced the model to enumerate contradictions before attempting resolution
- Confidence calibration — Tuned the system to output "conflict detected" rather than fabricating agreement
The fact-checking component required similar discipline. I built retrieval logic that surfaced negations explicitly and structured prompts that separated evidence extraction from classification.
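The "enumerate contradictions before resolving" discipline can be enforced directly in the prompt template. A sketch of the idea — the wording, section tags, and `CONFLICT DETECTED` sentinel here are illustrative, not the exact prompts used:

```python
def build_clerk_prompt(question, sections):
    """Assemble a prompt that forces the model to enumerate conflicts
    between retrieved law sections BEFORE attempting any answer.
    Sections are tagged [S1], [S2], ... so conflicts can be cited by id."""
    numbered = "\n".join(f"[S{i}] {text}" for i, text in enumerate(sections, 1))
    return (
        "You are a legal clerk. Use ONLY the sections below.\n\n"
        f"{numbered}\n\n"
        "Step 1 - List every pair of sections that contradict each other, "
        "citing their [S#] ids. If any exist, your final answer MUST be "
        "'CONFLICT DETECTED' followed by the conflicting ids.\n"
        "Step 2 - Only if Step 1 found no contradictions, answer the question.\n\n"
        f"Question: {question}"
    )
```

Making "conflict detected" a legal final answer, rather than something the model has to volunteer, is what stops a helpfulness-tuned LLM from papering over the contradiction.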
Final position: 3rd place.
The system worked because I designed it to be correct, not confident.
Find the Water: Late Discovery, Strong Finish
We noticed this segmentation challenge embarrassingly late. Hours of potential optimization time, gone.
The pragmatic response: ship something functional, fast.
I assembled a minimal pipeline:
- Lightweight U-Net architecture
- Aggressive data augmentation
- Clean RLE encoding for submission
- Short but efficient training loop
No hyperparameter tuning. No ensemble methods. Just solid fundamentals executed quickly.
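The RLE step is the part most teams get wrong under time pressure. A minimal encoder over a flattened binary mask, emitting 1-indexed "start length" pairs — the convention many segmentation leaderboards use, though the exact pixel ordering (row- vs column-major) is competition-specific and assumed here:

```python
def rle_encode(mask):
    """Run-length encode a flat binary mask as space-separated
    'start length' pairs with 1-based starts."""
    runs = []
    run_start, run_len = None, 0
    for i, v in enumerate(mask):
        if v:
            if run_start is None:
                run_start, run_len = i + 1, 1   # open a new run
            else:
                run_len += 1                    # extend the current run
        elif run_start is not None:
            runs.extend([run_start, run_len])   # close the run
            run_start = None
    if run_start is not None:
        runs.extend([run_start, run_len])       # mask ended mid-run
    return " ".join(map(str, runs))
```

Getting the off-by-one and the trailing-run cases right once, and unit-testing them, is cheaper than debugging a silently-wrong leaderboard score.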
Final position: 2nd place.
Sometimes good engineering under pressure beats elaborate solutions with more time.
Masked X-Ray Challenge: Pneumonia Detection with Minimal Labels
This challenge tested model resilience under extreme data constraints. The task: classify masked chest X-rays for pneumonia detection, evaluated on AUC.
The dataset structure was intentionally restrictive:
- `train/` — unlabeled images only, no annotations whatsoever
- `val/` — a small labeled subset for local evaluation
- `test/` — hidden labels, used for final scoring
The catch: you had to figure out how to leverage unlabeled training data with only a tiny validation set for supervision. Classic semi-supervised learning territory.
My approach:
- Pseudo-labeling — Used the validation-trained model to generate soft labels for unlabeled training data
- Progressive training — Started with high-confidence pseudo-labels, gradually included harder samples
- Augmentation strategy — Heavy augmentations to prevent overfitting on the small labeled set
- AUC optimization — Focused on ranking quality rather than raw accuracy
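The selection step in that pseudo-labeling loop is the part worth spelling out. A sketch, assuming the model outputs a pneumonia probability per unlabeled image — the threshold schedule is illustrative:

```python
def select_pseudo_labels(probs, threshold):
    """Pick high-confidence pseudo-labels from model probabilities.
    probs: list of (sample_id, p_pneumonia) pairs. A sample is kept
    only when the model is confident either way; lowering the
    threshold across rounds progressively admits harder samples."""
    selected = []
    for sample_id, p in probs:
        if p >= threshold:
            selected.append((sample_id, 1))          # confident positive
        elif p <= 1.0 - threshold:
            selected.append((sample_id, 0))          # confident negative
    return selected
```

Each round then retrains on the validation labels plus the selected pseudo-labels, and the threshold drops (say 0.95 → 0.85 → 0.75) so the labeled pool grows as the model improves.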
The evaluation metric (AUC) rewarded models that could separate classes reliably, not just predict correctly. This meant calibration mattered as much as accuracy.
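To see why ranking quality is the whole game, it helps to look at what AUC actually computes. Its probabilistic definition in a few lines:

```python
def auc_score(labels, scores):
    """AUC via its probabilistic definition: the fraction of
    (positive, negative) pairs the model ranks correctly, with ties
    counted as half — the Wilcoxon-Mann-Whitney statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return correct / (len(pos) * len(neg))
```

Note that the absolute probability values never appear — only their ordering. A model whose scores are all compressed near 0.5 but correctly ordered gets a perfect AUC, which is exactly why calibration of the *ranking* mattered more than hitting a 0.5 decision boundary.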
Final position: 7th place.
Limited data, limited labels, still competitive.
Sentence Meaning Similarity: Understanding Beyond Words
The AISM challenge asked a deceptively simple question: do these two sentences mean the same thing?
The constraint that made it interesting: no pretrained LLMs allowed. Notebook submission mandatory.
This forced us back to fundamentals. No BERT, no sentence transformers, no fine-tuning massive models. You had to build something from scratch that could capture semantic similarity.
My approach:
- Feature engineering — Word overlap metrics, TF-IDF similarity, syntactic pattern matching
- Embedding methods — Trained lightweight embeddings on the provided corpus
- Ensemble logic — Combined multiple similarity signals into a unified classifier
- F1 optimization — Balanced precision and recall for the binary classification
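Two of those hand-built signals fit in a few lines of plain Python. This is an illustrative subset — the full feature set also included TF-IDF similarity and syntactic pattern matches feeding a downstream classifier:

```python
import math
from collections import Counter

def similarity_features(s1, s2):
    """Hand-built similarity signals for a sentence pair, no pretrained
    models: token-set Jaccard overlap and bag-of-words cosine similarity."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    set1, set2 = set(t1), set(t2)
    jaccard = len(set1 & set2) / len(set1 | set2)
    c1, c2 = Counter(t1), Counter(t2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    cosine = dot / norm if norm else 0.0
    return {"jaccard": jaccard, "cosine": cosine}
```

Individually these signals are weak — they can't see that "purchase" means "buy" — which is exactly why the corpus-trained embeddings and the ensemble on top were needed.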
The 40/60 public/private split meant leaderboard positions could shift dramatically at final evaluation. Overfitting to the public test set was a trap.
Final position: 2nd place.
No pretrained models, pure engineering.
The Molecular Classification Mess
Mass spectrometry data is notoriously ugly. This dataset was worse.
Fragment m/z values, intensity spectra, SMILES strings, precursor metadata — all with inconsistent formatting, missing values, and severe class imbalance.
Data cleaning consumed hours that should have gone to modeling.
The final pipeline:
- Binned spectral features
- Molecular fingerprint generation
- Multi-class classifier with probability calibration
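The spectral binning step is the one that turns the ugly variable-length data into something a standard classifier can eat. A minimal sketch, with an illustrative m/z range and bin count:

```python
def bin_spectrum(mz_values, intensities, mz_max=1000.0, n_bins=100):
    """Convert a variable-length mass spectrum into a fixed-length
    feature vector by summing fragment intensity into uniform m/z bins.
    Peaks outside [0, mz_max) are dropped."""
    vec = [0.0] * n_bins
    width = mz_max / n_bins
    for mz, inten in zip(mz_values, intensities):
        if 0 <= mz < mz_max:
            vec[int(mz // width)] += inten
    return vec
```

Every spectrum, whether it has 10 peaks or 10,000, comes out as the same fixed-length vector, which the fingerprint features are then concatenated onto.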
Final position: 4th place.
A solid finish on a notoriously messy domain.
Histopathology Classification: Medical Imaging at Scale
The histopathology challenge was a 7-class tissue classification problem. Each image patch came from microscopy slides, and the task was to distinguish tissue states ranging from normal and benign through grades of atypical hyperplasia up to carcinoma in situ and invasive carcinoma.
The classes:
- 0: Normal (N)
- 1: Papillary Benign (PB)
- 2: Usual Ductal Hyperplasia (UDH)
- 3: Flat Epithelial Atypia (FEA)
- 4: Atypical Ductal Hyperplasia (ADH)
- 5: Ductal Carcinoma In Situ (DCIS)
- 6: Invasive Carcinoma (IC)
The evaluation metric was Quadratic Weighted Cohen's Kappa — designed for ordinal classification where the distance between classes matters. Predicting IC when the true class is Normal carries a much heavier penalty than being off by one category.
This made the problem clinically realistic. In real diagnostics, confusing normal tissue with invasive carcinoma is catastrophic. Confusing ADH with DCIS is a much smaller error.
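The metric itself is compact enough to write out. A plain-Python version of Quadratic Weighted Kappa, which makes the squared-distance penalty explicit:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic Weighted Cohen's Kappa: disagreement between classes
    i and j is penalised by (i - j)^2, so confusing Normal (0) with
    Invasive Carcinoma (6) costs 36x more than confusing ADH (4) with
    DCIS (5). Returns 1.0 for perfect agreement, ~0.0 for chance."""
    n = len(y_true)
    hist_t = [y_true.count(c) for c in range(n_classes)]
    hist_p = [y_pred.count(c) for c in range(n_classes)]
    # Observed weighted disagreement.
    observed = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    # Expected weighted disagreement under chance (outer product of marginals).
    expected = sum(
        hist_t[i] * hist_p[j] * (i - j) ** 2
        for i in range(n_classes) for j in range(n_classes)
    ) / n
    return 1.0 - observed / expected
```

The `(i - j)**2` term is why ordinal-aware losses paid off: a model trained with plain cross-entropy treats all mistakes equally, while this metric does not.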
The approach:
- Transfer learning — Started with pretrained CNN backbones, fine-tuned on the training set
- Data augmentation — Heavy geometric and color augmentations to handle the limited dataset
- Ordinal awareness — Loss function modifications to respect the class ordering
- Ensemble methods — Combined multiple model predictions for stability
Final position: 2nd place.
Medical imaging with proper ordinal scoring.
The Reality of Hour 30
By the final stretch:
- Someone on the team started talking to notebook cells that weren't there
- Jupyter crashed and took 2 hours of unsaved work with it
- One teammate fell asleep on the keyboard and executed random cells
- I debugged a failing submission for 30 minutes before realizing the bug was a missing comma
The scoreboard kept updating. We kept climbing.
Resilience matters more than people admit. Most competitors build solutions when conditions are optimal. The teams that win build solutions when everything is falling apart.
Final Results
| Challenge | Position | My Contribution |
|---|---|---|
| Citation Network | 9th | Lead |
| AITSP Optimization | 7th | Lead |
| RAG Legal/Fact-Check | 3rd | Lead |
| Find the Water | 2nd | Team |
| Masked X-Ray | 7th | Team |
| Sentence Similarity (AISM) | 2nd | Team |
| Histopathology | 2nd | Team |
| Molecular Classification | 4th | Team |
Overall: 3rd Place
What I Learned
Speed is a feature. The scheduling challenge proved that algorithmic performance isn't academic. A 1000x speedup translated directly to better solutions.
Correctness beats confidence. The RAG challenge rewarded systems that admitted uncertainty over systems that hallucinated answers.
Breadth matters in multi-task competitions. Submitting something on every challenge outperformed specializing in a few. We attacked all 9 problems.
Exhaustion reveals fundamentals. At hour 30, you don't have the cognitive resources for clever tricks. You fall back on solid engineering habits. Build those habits before you need them.
Competition clarifies priorities. When you have 36 hours and 9 problems, you learn very quickly what actually matters versus what feels important.
Looking Forward
This datathon reinforced something I already suspected: I enjoy problems that require both theoretical understanding and practical engineering. The scheduling challenge wasn't about knowing optimization theory — it was about implementing it efficiently. The RAG challenge wasn't about understanding LLMs — it was about constraining them to behave correctly.
Third place with a team that attacked every challenge, survived the chaos, and learned more in 36 hours than most workshops teach in a semester.
I'll be back for the next one.
Let's Connect
If you're considering entering an AI competition: Do it. The time pressure forces you to prioritize ruthlessly, the variety of challenges exposes gaps in your knowledge, and the competition reveals where your skills actually stand.
What's your approach to competition strategy? Specialize deeply or attack broadly?
Connect with me: LinkedIn
36 hours of algorithms, caffeine, and determination. Third place at AlphaBit AI Datathon 2025, ESI SBA. The optimization never stops.