Have you ever heard of the Tanaka Corpus? If you’ve been using Jim Breen’s fantastic free online Japanese-English dictionary WWWJDIC, you may have unwittingly encountered it. The Tanaka Corpus is a collection of over 150,000 Japanese-English sentence pairs ideal for students learning Japanese. Perfect for repetitions in your favourite SRS software!
Here’s a brief introduction. If you just want to get started searching for example sentences, click the link below!
Search Tatoeba.org using a variety of different languages!
The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)
At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus on a CD distributed to participants, and suggested that it might serve as a source of examples in a dictionary. Jim Breen realised it had the potential to provide example sentences for the WWWJDIC server. He edited, reformatted and indexed the corpus and linked it at the word level to the dictionary function in the server.
The inclusion of the Corpus in the WWWJDIC server exposed it to a wide audience, and a number of other systems incorporated the corpus into their operation. It also began to be used in some research projects in natural language processing.
In 2006 the Corpus was incorporated into the Tatoeba Project being developed by Trang Ho to provide a sentence-based multi-lingual resource. That project is now the “home” of the Corpus.
There are, however, several caveats that the avid student of Japanese must be aware of in order to use this resource safely and effectively. Although the original Tanaka Corpus has come a long way and has been cut from 212,000 sentences to 150,000 (mostly by removing duplicates and correcting errors), there are still areas where it is unreliable.
・ There are many English sentences that do not sound natural. They may be the result of old-fashioned English or contrived sentences originally taken from textbooks.
・ There are also sentences that sound unnatural in Japanese, such as stiff, literal translations from English into Japanese, or sentences which sound odd out of context.
Most people reading this will no doubt be native English speakers, so dodgy English sentences shouldn’t worry us much as we can work around them. Unnatural Japanese sentences, however, are a problem. This means the resource is best suited to intermediate students who already have a firm grasp of the language and have a good chance of being able to distinguish between a natural and unnatural Japanese sentence.
Beginners can benefit too, but be sure to check as much as you can with native or fluent speakers to make sure you are not memorising an odd expression. Also be aware of gender differences in language so that you don’t end up sounding like the opposite sex when you speak!
There are many other websites and applications that use this public domain database as their primary data source but do not tell you about its problems, so double check your favourite application to see where it gets its example sentences from. It takes a lot of work to compile original Japanese-English sentence pairs, so there is a good chance that the developers took the easy route and used the Tanaka Corpus.
You can download the latest version on Jim Breen’s site here. Alternatively, you can browse the corpus on ManyThings or search for example sentences containing a particular word on Tatoeba using a variety of different languages! If all else fails, here is a mirror on Gakuu (file mirrored 21st November 2011, so please be aware that it will quickly become outdated).
The corpus is available free to all under a Creative Commons CC BY licence.
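If you would rather load sentences into your SRS software yourself, here is a minimal sketch of how that might look in Python. It assumes a hypothetical plain-text export with one sentence pair per line, Japanese first, then a tab, then English. That layout is an assumption for illustration only, not the documented format of the corpus download, so check the file you actually have before relying on it. The function names `parse_pairs` and `to_csv` are made up for this example.

```python
import csv
import io


def parse_pairs(lines):
    """Yield unique (japanese, english) pairs from tab-separated lines.

    Assumed layout (hypothetical): "<Japanese>\t<English>" on each line.
    Lines that do not match, or that have an empty side, are skipped.
    """
    seen = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # not a simple two-column line
        japanese, english = (part.strip() for part in parts)
        if not japanese or not english:
            continue  # one side of the pair is missing
        pair = (japanese, english)
        if pair in seen:
            continue  # drop exact duplicates
        seen.add(pair)
        yield pair


def to_csv(pairs):
    """Return the pairs as two-column CSV text (front, back)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for japanese, english in pairs:
        writer.writerow([japanese, english])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        "猫が好きです。\tI like cats.\n",
        "猫が好きです。\tI like cats.\n",  # duplicate, dropped
        "a line with no tab\n",            # malformed, skipped
    ]
    print(to_csv(parse_pairs(sample)), end="")
```

Most SRS programs can import a two-column CSV like this as the front and back of a card. Removing exact duplicates does nothing about unnatural sentences, so the caveats above still apply to whatever you import.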