Projects
Check out our projects!
A community research initiative working towards democratizing AI research for Bengali by crowdsourcing datasets and launching research competitions.

The first opensourced and publicly available sign language data collection tool. The commonvoice for sign languages.

The first multidomain large Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human annotated document samples from six domains - i) books and magazines ii) public domain govt. documents iii) liberation war documents iv) new newspapers v) historical newspapers and vi) property deeds. 700K polygon annotations from image captured documents in the wild.

The first Opensourced complete OCR pipeline for Bengali. We provide 2 synthetic datasets for word recognition and one hand annotated complete document deconstruction dataset BCD3 with 228 domain diversified samples. We benchmark our models on the BCD3 dataset and opensource the datasets and the models+system for further research. We will keep updating the dataset.

Jointly largest open-sourced Bengali ASR dataset as well as first Bengali Out-of-Distribution Speech Recognition benchmarking dataset. 25,000+ people contributed in the development of this dataset.