Projects

Check out our projects!

A community research initiative working towards democratizing AI research for Bengali by crowdsourcing datasets and launching research competitions.

bornil: An open-source, sign language data crowdsourcing platform for AI enabled dialect-agnostic communication and domain study
ongoing · August 29, 2023

The first open-source, publicly available sign language data collection tool: the Common Voice for sign languages.

BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset
completed · March 09, 2023

BaDLAD is the first large multi-domain Bengali document layout analysis dataset. It contains 33,695 human-annotated document samples from six domains: i) books and magazines, ii) public domain government documents, iii) liberation war documents, iv) new newspapers, v) historical newspapers, and vi) property deeds. It includes 700K polygon annotations on documents captured as images in the wild.

The first large scale Multi-Domain Bengali Handwritten Digit Recognition Dataset
completed · June 06, 2018

The first large-scale Bengali handwritten digit recognition dataset.

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
ongoing · August 21, 2023

The first open-source, complete OCR pipeline for Bengali. We provide two synthetic datasets for word recognition and one hand-annotated complete document deconstruction dataset, BCD3, with 228 domain-diversified samples. We benchmark our models on the BCD3 dataset and open-source the datasets, the models, and the system for further research. We will keep updating the dataset.

Bengali Grammatical Error Correction Project
ongoing · March 05, 2023

A large Bengali grammatical error detection and correction project. It involves a novel linguist-validated dataset of 100k+ sentence samples with word-level annotation. We hosted a Kaggle competition in 2023 to crowdsource solutions.

OOD-Speech: Bengali.AI Massively Crowdsourced Bengali Speech Recognition Project
completed · May 15, 2023

Jointly the largest open-source Bengali ASR dataset and the first Bengali out-of-distribution speech recognition benchmarking dataset. More than 25,000 people contributed to the development of this dataset.

A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes
completed · October 01, 2020

A benchmark dataset for multi-target classification of handwritten Bengali graphemes, with novel implications for all alpha-syllabary languages, e.g., Hindi, Gujarati, and Thai.