jWeb1T is a Java tool for efficiently searching the Web 1T 5-gram corpus. It uses a binary search algorithm to find n-grams and return their frequency counts in logarithmic time. Because the corpus is stored in many files, a simple index is used to retrieve the files containing the n-grams. The corpus must be installed and uncompressed on a hard drive (approx. 60 GB).
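The lookup described above can be illustrated with a minimal sketch. This is not the jWeb1T API: the class NgramLookupSketch and the methods pickFile and lookup are hypothetical names chosen for illustration, the data is held in memory rather than read from the corpus files, and the sketch assumes each line has the form "n-gram<TAB>count" and that lines are sorted in the same order as Java's String.compareTo.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and data layout are assumptions,
// not taken from the jWeb1T source code.
public class NgramLookupSketch {

    // Index step: given the first n-gram of each file (in sorted order),
    // return the position of the only file that could contain the query,
    // or -1 if the query sorts before the first file.
    static int pickFile(String[] firstKeys, String query) {
        int pos = Arrays.binarySearch(firstKeys, query);
        // On a miss, binarySearch returns -(insertionPoint) - 1,
        // so the candidate file is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Search step: binary search over the sorted lines of one file.
    // Returns the frequency count, or 0 if the n-gram is absent.
    static long lookup(List<String> sortedLines, String query) {
        int lo = 0;
        int hi = sortedLines.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String line = sortedLines.get(mid);
            int tab = line.indexOf('\t');
            int cmp = line.substring(0, tab).compareTo(query);
            if (cmp == 0) {
                return Long.parseLong(line.substring(tab + 1));
            }
            if (cmp < 0) {
                lo = mid + 1;   // query sorts after the middle line
            } else {
                hi = mid - 1;   // query sorts before the middle line
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        // Made-up sample data, not real corpus counts.
        String[] firstKeys = {"a", "house", "the"};
        List<String> file = Arrays.asList(
                "a big house\t120",
                "a small house\t75",
                "the big house\t310");

        System.out.println(pickFile(firstKeys, "in the"));
        System.out.println(lookup(file, "a small house"));
        System.out.println(lookup(file, "a tiny house"));
    }
}
```

A real implementation would more plausibly seek within the files on disk instead of loading lines into a list, but the two steps stay the same: the index narrows the search to one file, and the binary search needs only a logarithmic number of comparisons within it.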
jWeb1T is released as free software with full source code, provided under the terms of the Apache License, Version 2.0.
jWeb1T has been used in the FBK-irst system for the English Lexical Substitution Task at SemEval 2007.
Claudio Giuliano, Alfio Gliozzo and Carlo Strapparava. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, 23-24 June 2007.
jWeb1T development has been funded by the X-Media Project.