Bengali.AI

Projects

Check out our projects!

A community research initiative working towards democratizing AI research for Bengali by crowdsourcing datasets and launching research competitions.

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

ongoing•August 21, 2023

The first complete open-source OCR pipeline for Bengali. We provide two synthetic datasets for word recognition and one hand-annotated complete document deconstruction dataset, BCD3, with 228 domain-diversified samples. We benchmark our models on the BCD3 dataset and open-source the datasets, models, and system for further research. We will keep updating the dataset.

Bengali Grammatical Error Correction Project

ongoing•March 05, 2023

A large Bengali grammatical error detection and correction project. It involves a novel linguist-validated dataset of 100k+ sentence samples with word-level annotation. We hosted a Kaggle competition in 2023 to crowdsource solutions.

OOD-Speech: Bengali.AI Massively Crowdsourced Bengali Speech Recognition Project

completed•May 15, 2023

Jointly the largest open-source Bengali ASR dataset, as well as the first Bengali out-of-distribution speech recognition benchmarking dataset. 25,000+ people contributed to the development of this dataset.

A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes

completed•October 01, 2020

A benchmark dataset for multi-target classification of handwritten Bengali graphemes, with novel implications for all alpha-syllabary languages, e.g., Hindi, Gujarati, and Thai.

BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset

completed•March 09, 2023

BaDLAD is the first large multi-domain Bengali document layout analysis dataset. It contains 33,695 human-annotated document samples from six domains: (i) books and magazines, (ii) public domain govt. documents, (iii) liberation war documents, (iv) new newspapers, (v) historical newspapers, and (vi) property deeds. It includes 700K polygon annotations from image-captured documents in the wild.

The first large-scale Multi-Domain Bengali Handwritten Digit Recognition Dataset

completed•June 06, 2018

The first large-scale Bengali handwritten digit recognition dataset.