HLT Course

Human Language Technology: Applications to Information Access

This course introduces recent applications of human language technology (HLT), presenting the basic knowledge required to implement them, along with an overview of possible alternatives, evaluation methods, and the challenges and limits of the state of the art. The course focuses on the problem of accessing text-based information across three main types of barriers: the quantity barrier (accessing information in very large repositories), the crosslingual barrier (accessing information across languages through machine translation), and the subjective barrier (accessing information embedded in complex human interactions). For each barrier, the following applications will be studied, with emphasis on the transition from statistical learning to neural networks.

  1. The quantity barrier: document classification and retrieval.
  2. The crosslingual barrier: machine translation with statistical vs. neural methods, language models, MT evaluation.
  3. The subjective barrier: sentiment analysis, human-human and human-computer dialogue.

The course includes weekly lectures in English (2h) followed by practical work (2h) using freely available software and language resources (on each student’s laptop) to perform some of the tasks introduced in the course. Labs can serve as starting points for individual projects, on a topic to be chosen in agreement with the lecturer. Projects will be graded based on a report and an oral defense at the end of January 2019. Once during the semester, students will present a scientific article, and one lab assignment will be graded. Students should have completed at least one prior course in statistics, machine learning, computational linguistics, or artificial intelligence, and should be proficient in a programming language such as Python or Java.

Schedule

#1 (19.09)
Morning: Introduction. Objectives of the course; plan, references, course page, organization, grading, final projects. Basics of data-driven HLT: machine learning, classifiers, features, training/testing data, evaluation. Basics of language analysis methods.
Afternoon (#1′): Text classification. Document classification using lexical features and a Naive Bayes model. Lab: getting the work protocol into place; running a classifier (Weka) with lexical features on newswire data (Reuters); varying the features and observing their influence on the scores.
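The lab uses Weka on Reuters data; as an illustration of the underlying technique, here is a minimal multinomial Naive Bayes classifier in pure Python, trained on a few hypothetical newswire-style snippets (not the actual Reuters corpus):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    """Return the label maximizing log P(c) + sum log P(w|c), add-one smoothed."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
train = [
    ("oil prices rise sharply".split(), "commodities"),
    ("crude oil exports fall".split(), "commodities"),
    ("shares rally on earnings".split(), "markets"),
    ("stock shares close higher".split(), "markets"),
]
model = train_nb(train)
print(classify("oil exports rise".split(), *model))  # commodities
```

Varying which tokens are kept as features (as in the lab) directly changes the counts above, and hence the scores.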
#2 (26.09)
Morning: Low-dimensional representations of words and documents; vector spaces for information retrieval.
Afternoon: Introduction to neural networks for NLP: learning word representations.

#3 (03.10)
Morning: Lab: training and using word2vec for word similarity.
Afternoon: Lab: applying word2vec to document similarity and word sense disambiguation.
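The word-similarity computation in these labs reduces to cosine similarity between embedding vectors. A minimal sketch, using small hand-made vectors in place of real trained word2vec embeddings (which would come from training on a corpus, e.g. with gensim):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 4-dimensional embeddings, for illustration only.
vectors = {
    "king":  [0.90, 0.80, 0.10, 0.20],
    "queen": [0.85, 0.75, 0.20, 0.30],
    "apple": [0.10, 0.20, 0.90, 0.80],
}

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine with `word`."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("king", vectors))  # queen
```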
#4 (10.10)
Morning: no course.
Afternoon: Beyond information retrieval. Relevance feedback, pseudo-relevance feedback, query expansion; learning-to-rank; recommender systems: content-based vs. collaborative filtering.
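Rocchio's method is the classic formulation of relevance feedback: the query vector is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. A minimal sketch (the weights alpha, beta, gamma below are conventional textbook defaults, not values prescribed by the course):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    dim = len(query)
    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    mr, mn = centroid(relevant), centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, mr, mn)]

# Toy example: one relevant and one non-relevant document vector.
updated = rocchio([1.0, 0.0, 0.0],
                  relevant=[[0.0, 1.0, 0.0]],
                  nonrelevant=[[0.0, 0.0, 1.0]])
print(updated)  # [1.0, 0.75, -0.15]
```

Pseudo-relevance feedback applies the same update, simply assuming the top-ranked documents are relevant.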
#5 (17.10)
Morning: Introduction to machine translation. History of MT, typology of systems, introduction to statistical and neural systems, MT evaluation.
Afternoon: Translation models learned from parallel data; sentence and word alignment.
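Word alignment from parallel data is classically bootstrapped with IBM Model 1, which estimates word translation probabilities t(f|e) by expectation-maximization. A minimal sketch on a hypothetical toy parallel corpus:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(f|e) from (source_tokens, target_tokens) pairs via EM."""
    e_vocab = {e for es, _ in pairs for e in es}
    f_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialization.
    t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalize over alignments
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in count}
    return t

# Toy English-French parallel corpus (hypothetical).
pairs = [(["the", "house"], ["la", "maison"]),
         (["the", "book"], ["le", "livre"]),
         (["a", "house"], ["une", "maison"])]
t = ibm_model1(pairs)
# After EM, "maison" aligns more strongly with "house" than with "the".
```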
#6 (24.10)
Morning: N-gram language models.
Afternoon: Lab: using a language model for word prediction.
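Word prediction with an n-gram model reduces to picking the most frequent continuation of the preceding context. A minimal bigram sketch on a toy corpus (the lab itself uses real tools and data):

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count bigrams over sentences padded with <s> and </s> markers."""
    bigrams = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    return bigrams

def predict_next(word, bigrams):
    """Most likely next word under the raw bigram counts."""
    return bigrams[word].most_common(1)[0][0]

corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]]
lm = train_bigram_lm(corpus)
print(predict_next("the", lm))  # cat
```

A real LM would add smoothing and back-off (or a neural parameterization, as in session #9) rather than use raw counts.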
#7 (31.10)
Morning: The Moses decoder; tuning an MT system.
Afternoon: Lab: training a phrase-based statistical MT system.

#8 (07.11)
Morning: Lab: phrase-based statistical MT (PBSMT) with Moses.
Afternoon: Lab: PBSMT with Moses (continued).
#9 (14.11)
Morning: Sequence modeling with neural networks.
Afternoon: Neural language models and neural MT.

#10 (21.11)
Morning: Recent developments in neural MT.
Afternoon: Evaluation issues in MT.
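BLEU is the standard automatic MT evaluation metric: a geometric mean of n-gram precisions scaled by a brevity penalty. A simplified single-reference, sentence-level sketch (real toolkits add smoothing for short sentences, multiple references, and corpus-level aggregation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with uniform n-gram weights."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision gives BLEU = 0
        log_prec += math.log(overlap / total) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_prec)

reference = "the cat is on the mat".split()
print(bleu(reference, reference))  # 1.0 for a perfect match
```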
#11 (28.11)
Morning: Sentiment analysis: lexical and neural models.
Afternoon: Lab: polarity detection in movie reviews.
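Lexicon-based polarity detection scores a text by summing word polarities from a sentiment lexicon, often with simple negation handling. A minimal sketch with a hypothetical mini-lexicon (real systems use resources such as SentiWordNet):

```python
# Hypothetical mini-lexicon for illustration only.
LEXICON = {"great": 1, "excellent": 1, "enjoyable": 1,
           "boring": -1, "awful": -1, "terrible": -1}
NEGATIONS = {"not", "never", "no"}

def polarity(tokens):
    """Sum lexicon scores, flipping the sign of the word after a negation."""
    score, flip = 0, 1
    for w in tokens:
        if w in NEGATIONS:
            flip = -1
            continue
        score += flip * LEXICON.get(w, 0)
        flip = 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("a great and enjoyable film".split()))  # positive
```

Neural models (session #11, morning) replace the fixed lexicon with learned representations, handling context beyond single-word negation.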
#12 (05.12)
Morning: Human-human and human-computer interactions.
Afternoon: Dialogue systems and chatbots.

#13 (12.12)
Morning: Question answering.
Afternoon: Advising on individual projects.

#14 (19.12)
Morning: Conclusion: a synthesis of HLT research.
Afternoon: Advising on individual projects.