Corpora

Textual corpora that document language use are invaluable for research in many areas of linguistics, and for collecting the statistical information needed to build natural language processing applications. MILA has collected or acquired a number of Hebrew corpora from various domains. All are available in plain text format, and most also have tokenized, morphologically analyzed, and morphologically disambiguated versions.

All corpora follow the standards developed by MILA.
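
As an illustration of the kind of statistical information a plain text corpus can supply, the following minimal sketch counts surface-token frequencies. It assumes the file is UTF-8 encoded and uses a simple regular expression as the tokenizer; both are assumptions made for this example, not a description of MILA's own tools or formats, and the function names are hypothetical.

```python
import re
from collections import Counter

# A run of Hebrew letters (U+05D0-U+05EA), optionally joined by a quote mark,
# geresh (U+05F3) or gershayim (U+05F4), as found in abbreviations.
# This is a naive tokenizer chosen for illustration only.
TOKEN = re.compile(r"[\u05D0-\u05EA]+(?:[\"'\u05F3\u05F4][\u05D0-\u05EA]+)*")


def token_frequencies(text):
    """Count surface tokens in a string of undotted Hebrew text."""
    return Counter(TOKEN.findall(text))


def corpus_frequencies(path, encoding="utf-8"):
    """Count surface tokens in a plain text file, reading line by line.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    counts = Counter()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            counts.update(TOKEN.findall(line))
    return counts


if __name__ == "__main__":
    sample = "הילד קרא ספר. הילד אהב את הספר."
    for token, count in token_frequencies(sample).most_common(3):
        print(token, count)
```

Because Hebrew attaches prefixes such as the definite article to the word, this sketch counts ספר and הספר as different tokens. Counts over lexemes would need the morphologically disambiguated versions of the corpora.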

Corpus statistics can be found here:
Corpus | Description
HaAretz | News and articles from the HaAretz daily newspaper, 1990-1991.
Arutz 7 | News and articles from the Arutz 7 news website, 2001-2006.
TheMarker | Articles from the financial newspaper TheMarker, May - October 2002.
HaKnesset | Session protocols of the Knesset (Israeli Parliament), January 2004 - November 2005.
Wikipedia 2013 | Articles from the Hebrew Wikipedia online encyclopedia, 2013.
Doctors | Articles from the Doctors medical website.
Infomed | Question-and-answer discussions from the Infomed website's medical forum, January 2006 - September 2007.
Nature of Healing | Articles and forum discussions from the Nature of Healing naturopathy medical website.
To Be Healthy | Articles and forum discussions from the To Be Healthy (L'Hiyot Bari, 2b-bari) medical website.
Tapuz People Forum | Forum discussions on a variety of subjects from the Tapuz People website.
Hebrew CHILDES | Spoken Hebrew conversations among children and between children and adults.
Spoken Israeli Hebrew | Spoken Hebrew conversations and parts of the Corpus of Spoken Israeli Hebrew (CoSIH).
Hebrew Dotted Text | Articles from the beginner-Hebrew newspapers Shaar LaMatchil and Yanshuf. The text includes vowel dots (niqqud/vocalization).
Dependency parsed corpora | A dependency-parsed corpus. It is part of the Hebrew Wikipedia corpus; the dependencies were produced by Yoav Goldberg’s automatic dependency parser.
Walla Food Corpus | Articles from the Walla Food website, 2014-2015.
Foodpage Corpus | Articles from the Foodpage.co.il website, 2014-2015.
Walla Sport Corpus | Articles from the Walla Sport website, 2014-2015.
Sport5 Corpus | Articles from the Sport5 website, 2014-2015.
Learning Man | Articles from the "Learning man in the technological era" conference.
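
The Hebrew Dotted Text corpus listed above contains niqqud, while the other corpora are mostly undotted. To compare dotted text with undotted text, the vowel points can be stripped. The sketch below assumes the text is Unicode and removes the combining marks in the Hebrew block (U+0591 to U+05C7); the function name is hypothetical and this is not part of MILA's tools.

```python
import unicodedata


def strip_niqqud(text):
    """Remove Hebrew vowel points and cantillation marks from text.

    The text is first decomposed (NFD) so that precomposed presentation
    forms, such as shin with shin dot, split into a base letter plus marks.
    Only nonspacing marks (category "Mn") in U+0591-U+05C7 are dropped, so
    punctuation in that range, such as maqaf (U+05BE), is kept.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


if __name__ == "__main__":
    # "shalom" written with niqqud, given as escapes to avoid ambiguity
    dotted = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
    print(strip_niqqud(dotted))
```

Filtering by Unicode category rather than by a fixed list of code points keeps the sketch short, at the cost of also removing cantillation marks, which is usually what is wanted for modern text.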