Datasets
Below are the datasets I have created and published for research purposes. All datasets are freely available for academic use.
ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition
Published in 2025
ANCHOLIK-NER is a Bangla regional named entity recognition dataset focusing on the Sylhet, Chittagong, Barishal, Mymensingh, and Noakhali dialects. It comprises 17,405 sentences, evenly distributed across the five regions, with entities categorized into 10 types. The raw sentences were collected from two publicly available datasets and by web scraping various online newspapers and articles.
Size: 579 KB; 17,405 sentences; 90,755 samples