India is rich in languages. According to a 2018 survey, 9 out of 10 new internet users will be Indian language users. Moreover, close to 70% of people consume content in local languages. With demand rising, enthusiasts, freelancers and organisations have started developing NLP tools for Indian languages.
This post shares the existing set of NLP tools and shows how to use them.
(i) iNLTK
NLTK is a popular package for English language analysis. iNLTK is its Indian cousin for analysing regional languages.
Installation
pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install inltk
Setup
from inltk.inltk import setup
setup('<code-of-language>')  # for Telugu: setup('te')
Setup only needs to be run the first time you use a language; it downloads the models needed for analysis.
Supported Languages
Hindi, Punjabi, Gujarati, Kannada, Malayalam, Oriya, Marathi, Bengali, Tamil, Urdu, Nepali, Sanskrit, English, and Telugu
It also supports a few code-mixed varieties, viz., Hinglish (Hindi + English), Tanglish (Tamil + English), and Manglish (Malayalam + English).
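Each language name above corresponds to a short code that `setup()` expects. The lookup below is a minimal sketch: the codes are the ones listed in the iNLTK documentation as far as I can tell and are worth verifying against your installed version, and `INLTK_CODES`, `language_code` and `setup_language` are hypothetical helper names, not part of iNLTK.

```python
# Language codes assumed from the iNLTK documentation; verify for your version.
INLTK_CODES = {
    "hindi": "hi", "punjabi": "pa", "gujarati": "gu", "kannada": "kn",
    "malayalam": "ml", "oriya": "or", "marathi": "mr", "bengali": "bn",
    "tamil": "ta", "urdu": "ur", "nepali": "ne", "sanskrit": "sa",
    "english": "en", "telugu": "te",
    # Code-mixed varieties
    "hinglish": "hi-en", "tanglish": "ta-en", "manglish": "ml-en",
}


def language_code(name):
    """Return the iNLTK code for a language name (hypothetical helper)."""
    try:
        return INLTK_CODES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None


def setup_language(name):
    """Download the models for one language, given its name (hypothetical helper)."""
    # Imported here so the lookup above works even before iNLTK is installed.
    from inltk.inltk import setup
    setup(language_code(name))
```

With this in place, `setup_language("Telugu")` would be equivalent to `setup('te')`.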
Supported Features
Tokenize
Get embedding vectors
Predict next 'n' words
Identify language
Remove foreign languages
Get Sentence Encoding
Get Sentence Similarity
Get Similar sentences
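A few of these features can be tried together once `setup()` has been run for the language. The sketch below uses `tokenize`, `identify_language` and `predict_next_words` with the argument order given in the iNLTK documentation; the wrapper `run_inltk_features` and its `api` parameter are hypothetical, added only so the example can be exercised without downloading the models.

```python
def run_inltk_features(text, language_code, api=None):
    """Run a few iNLTK features on `text` and collect the results in a dict."""
    if api is None:
        # Imported lazily; requires `pip install inltk` and a prior setup() call.
        from inltk import inltk as api
    return {
        # Split the text into sub-word tokens for the given language
        "tokens": api.tokenize(text, language_code),
        # Guess which supported language the text is written in
        "language": api.identify_language(text),
        # Ask the language model to continue the text with 3 more words
        "next_words": api.predict_next_words(text, 3, language_code),
    }
```

For a Telugu sentence, this would be called as `run_inltk_features(sentence, 'te')`.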
For more details, see the iNLTK documentation.
(ii) IndicNLP
IndicNLP stems from the efforts of the open-source community AI4Bharat. See this [page](https://ai4bharat.org/) to learn more about the initiative and the projects they are working on.
Installation
pip install indic-nlp-library
# download the repo
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
#download the resources
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
Setup
import os
import sys
from indicnlp import common, loader
# The path to the local git repo for Indic NLP library
INDIC_NLP_LIB_HOME = r"indic_nlp_library"
# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES = r"indic_nlp_resources"
# Add library to Python path (os.path.join keeps this portable across operating systems)
sys.path.append(os.path.join(INDIC_NLP_LIB_HOME, "src"))
# Tell the library where the resources folder is
common.set_resources_path(INDIC_NLP_RESOURCES)
# Load the library
loader.load()
This setup tells the library where its resources are and loads them; run it before using any of the features below.
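Once the setup has run, sentence splitting and word tokenization are a good first check that everything is wired up. The sketch below uses `sentence_tokenize.sentence_split` and `indic_tokenize.trivial_tokenize` from the library's `indicnlp.tokenize` package; the wrapper name `split_and_tokenize` is hypothetical.

```python
def split_and_tokenize(text, lang="hi"):
    """Split `text` into sentences, then each sentence into word tokens."""
    # Imported lazily; requires `pip install indic-nlp-library`.
    from indicnlp.tokenize import indic_tokenize, sentence_tokenize

    sentences = sentence_tokenize.sentence_split(text, lang=lang)
    # One list of tokens per sentence
    return [indic_tokenize.trivial_tokenize(s, lang=lang) for s in sentences]
```

Calling `split_and_tokenize(paragraph, lang='te')` would do the same for a Telugu paragraph.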
Supported Languages
Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu
It also supports a few code-mixed varieties, viz., Hinglish (Hindi + English), Tanglish (Tamil + English), and Manglish (Malayalam + English).
Supported Features
Text Normalisation
Script Information
Word Tokenization and Detokenization
Sentence Splitting
Word Segmentation
Script Conversion
Transliteration
Translation
Syllabification
Indicization
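Script conversion is worth a closer look, because it rests on a handy property of Unicode: the major Indic scripts occupy 128-code-point blocks with a largely parallel layout, so a character can often be moved to another script by shifting its code point. The following is an illustrative sketch of that idea only, not the library's implementation; `SCRIPT_BASE` and `convert_script` are hypothetical names, and characters that have no counterpart in the target script (common with Tamil) are not handled.

```python
# Start of each script's Unicode block, keyed by language code
SCRIPT_BASE = {
    "hi": 0x0900, "bn": 0x0980, "pa": 0x0A00, "gu": 0x0A80, "or": 0x0B00,
    "ta": 0x0B80, "te": 0x0C00, "kn": 0x0C80, "ml": 0x0D00,
}
BLOCK_SIZE = 0x80
# The danda and double danda live in the Devanagari block but are shared by all scripts
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def convert_script(text, source, target):
    """Shift every character of the source script into the target script's block."""
    src_base, tgt_base = SCRIPT_BASE[source], SCRIPT_BASE[target]
    converted = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE and ch not in SHARED_PUNCTUATION:
            converted.append(chr(tgt_base + offset))
        else:
            # Leave spaces, Latin text, digits and shared punctuation untouched
            converted.append(ch)
    return "".join(converted)
```

For example, `convert_script("नमस्ते", "hi", "te")` maps the Hindi greeting into Telugu script. With the library set up, the corresponding call is `UnicodeIndicTransliterator.transliterate(text, "hi", "te")` from `indicnlp.transliterate.unicode_transliterate`, which also deals with the exceptions this sketch ignores.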
For more details, see the Indic NLP Library documentation.
(iii) StanfordNLP
Installation
pip install stanfordnlp
Setup
import stanfordnlp
stanfordnlp.download('te') #Downloading Telugu lang model
Supported Languages
Hindi, Tamil, Telugu and Urdu are supported
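After the model is downloaded, text is analysed by building a pipeline and calling it on a string. The sketch below assumes the Telugu model from the setup step is already on disk; `stanfordnlp.Pipeline`, `doc.sentences`, `sentence.words`, `word.text` and `word.upos` come from the StanfordNLP API, while `build_pipeline` and `tag_words` are hypothetical helper names.

```python
def build_pipeline(lang="te"):
    """Create a StanfordNLP pipeline for a language whose model is already downloaded."""
    # Imported lazily so tag_words() below has no hard dependency on the package.
    import stanfordnlp
    return stanfordnlp.Pipeline(lang=lang)


def tag_words(text, pipeline):
    """Return (word, universal POS tag) pairs for every word in `text`."""
    doc = pipeline(text)
    return [(word.text, word.upos)
            for sentence in doc.sentences
            for word in sentence.words]
```

Building the pipeline is the slow part, so it makes sense to create it once with `nlp = build_pipeline('te')` and reuse it across calls to `tag_words(sentence, nlp)`.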
For more details, see the StanfordNLP documentation.