Easily Set Up and Use MeCab With Docker and NodeJS
MeCab is a powerful morphological analyzer for the Japanese language. If your application processes Japanese text, you will most likely need a way to tokenize sentences in order to identify parts of speech, key terms, and so on. MeCab takes text as input and efficiently tokenizes it for you, providing useful information about each token such as its part of speech, pronunciation, and reading.
So how do we get started using MeCab? The official documentation is in Japanese, but it does include installation instructions. You need to install MeCab itself and a dictionary for it to work its magic. Using Docker reduces the installation and setup process to a couple of lines in a Dockerfile. Running MeCab in a Docker container also means you do not need to install and configure MeCab directly on your local machine.
Assuming Docker is already installed and your Dockerfile starts from a Debian- or Ubuntu-based Linux image (the command below relies on apt-get), add:
RUN apt-get update && apt-get install -y mecab libmecab-dev mecab-ipadic-utf8
Adding this line downloads and installs the packages needed to run MeCab when the image is built: MeCab itself, its development library, and the UTF-8 IPA dictionary. Build the image and start a container to get a working MeCab installation!
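With the IPA dictionary installed above, the mecab command prints one line per token: the surface form, a tab, and then a comma-separated list of features (part of speech, up to three part-of-speech subcategories, conjugation type, conjugation form, base form, reading, pronunciation), followed by an EOS line at the end of each sentence. To make that format concrete, here is a minimal sketch of how one such line could be split into named fields. The helper name parseMecabLine and the property names are hypothetical labels chosen for illustration, not part of MeCab.

```javascript
// Hypothetical helper: split one line of IPADIC-style MeCab output into fields.
// The property names are illustrative labels, not an official API.
function parseMecabLine(line) {
  // MeCab ends each sentence with an "EOS" marker line; skip it and blank lines.
  if (line === 'EOS' || line.trim() === '') return null;

  // The surface form and the feature list are separated by a single tab.
  const [surface, featureString = ''] = line.split('\t');

  // Naive split: a token whose features contain a literal comma needs extra care.
  const features = featureString.split(',');

  return {
    surface,
    partOfSpeech: features[0],
    // Up to three subcategories; "*" means "not applicable".
    posDetail: features.slice(1, 4).filter((f) => f && f !== '*'),
    baseForm: features[6],
    reading: features[7],       // may be undefined for unknown words
    pronunciation: features[8], // may be undefined for unknown words
  };
}

const sample = 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ';
console.log(parseMecabLine(sample));
```

In practice you rarely need to write this parsing yourself, because wrapper packages do it for you.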
Several programming languages have packages that let you call MeCab and capture the tokenization results in your code. For example, Python has the mecab-python3 module. For this article, we’ll use mecab-async, a MeCab wrapper for NodeJS.
If you are using NPM, install mecab-async with:
npm install mecab-async
Now in your JavaScript file:
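A minimal sketch of what that file might contain is shown here. It assumes mecab-async has been installed as above and that the mecab binary is on the PATH (for example, inside the container). It also assumes, based on the package's README, that parse() calls back with an array of token arrays whose first element is the surface form. The helper names createMecab, surfaceForms, and main are hypothetical.

```javascript
// Load the wrapper lazily so the helpers below can be used without MeCab present.
function createMecab() {
  const MeCab = require('mecab-async');
  return new MeCab();
}

// Hypothetical helper: parse `text` and call back with just the surface forms.
// `mecab` is any object with a parse(text, callback) method.
function surfaceForms(mecab, text, callback) {
  mecab.parse(text, (err, tokens) => {
    if (err) return callback(err);
    // Assumption: each token is an array like [surface, partOfSpeech, ...].
    callback(null, tokens.map((token) => token[0]));
  });
}

// Example entry point; requires a working MeCab installation.
function main() {
  surfaceForms(createMecab(), 'すもももももももものうち', (err, words) => {
    if (err) throw err;
    console.log(words);
  });
}
```

Call main() at the bottom of the file once MeCab and mecab-async are available. Passing the mecab instance in as an argument is a design choice that makes the helper easy to exercise with a stub.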
You can now use MeCab to analyze strings of text and process the results in your NodeJS application! You can even implement an Express server that lets clients call MeCab over a REST API, so they do not have to go through the trouble of installing it themselves!
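As a rough sketch of that last idea, the handler below exposes MeCab through a single GET route. It assumes express and mecab-async are installed and MeCab is available. The /parse route, the text query parameter, port 3000, and the names makeParseHandler and startServer are arbitrary choices made for illustration.

```javascript
// Hypothetical Express-style handler factory. `mecab` is any object with a
// parse(text, callback) method, such as an instance of mecab-async.
function makeParseHandler(mecab) {
  return (req, res) => {
    const text = req.query.text;

    // Reject requests that do not supply any text to analyze.
    if (typeof text !== 'string' || text.length === 0) {
      return res.status(400).json({ error: 'Missing "text" query parameter' });
    }

    mecab.parse(text, (err, tokens) => {
      if (err) {
        return res.status(500).json({ error: 'MeCab failed to parse the text' });
      }
      res.json({ tokens });
    });
  };
}

// Wire the handler into an Express app. The modules are loaded here so that
// makeParseHandler can be used on its own without them installed.
function startServer(port = 3000) {
  const express = require('express');
  const MeCab = require('mecab-async');

  const app = express();
  app.get('/parse', makeParseHandler(new MeCab()));
  return app.listen(port, () => console.log(`MeCab API listening on port ${port}`));
}
```

After calling startServer(), a client could request /parse?text=... (with the text URL-encoded) and receive the tokens as JSON. A production service would also want input size limits and proper error logging.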