Department of Computer Science and Engineering

Aalborg University Esbjerg

 


F8S-2/CIS-2 Information Retrieval

Spring 2006

Daniel Ortiz-Arroyo

 

Overview

This is an introductory graduate-level course on Information Retrieval (IR). We will study the main components, models, and architecture of a modern IR system. The course covers information organization, text operations, and the metrics used to compare the performance of different systems. We will also study the main models proposed for representing documents and queries and for measuring the similarity between them. Finally, advanced techniques recently employed in IR will be presented.

Exam and Questions:

The 30-minute exam will consist of 3 parts:
1) A 10-minute presentation of your assigned topic. A projector and a laptop will be provided. Bring your presentation on a memory stick.
2) Questions related to your topic.
3) Questions randomly selected from the 11 exercises posted in the schedule table shown below (i.e., Exercise 1 through Exercise 11).


Time/Room/Schedule

We'll start at 9 AM each class. Our assigned classroom is B202. Please check the schedule regularly for new or updated information. The class notes posted below will be finalized the day before each class. The tentative schedule for the course is:

Week Day

Class #

Subject

Exercises/Assignments

Class Notes*/Extra Reading Assignments

Week6 Tue

1

Introduction I: Historical perspective, information structure, information vs. data retrieval systems, digital libraries and IR systems organization. The WWW. Overview of the course.


Read Chapter 1 from Textbook
Exercise 1
To be solved in class


Notes
Week6 Thur

2

Introduction II: Text operations: tokenization, stemming, stop words, lemmatization, compression. Words, terms and concepts, thesauri. Markup languages and the semantic web.

Read Chapter 7 and Appendix from Textbook
Exercise 2
To be solved in class

Notes
Read the description of Porter's stemmer
Week7 Tue

3

Query types. Modeling in IR: Boolean and vector space models. Similarity measures.

Read Chapter 2.1-2.5.3 from Textbook
Exercise 3
To be solved in class
Notes
Week7 Thur CANCELLED Look at Exercise 3 and try to solve it
Week 8 Tue

4

Review of probability concepts. Probabilistic model in IR.

Read Chapter 2.5.4, 2.6.2 from Textbook
Exercise 4
To be solved in class
Notes
Read this paper (at least up to Section 2.6)
Week8 Thur

5

Review of concepts in fuzzy logic. Fuzzy logic-based model in IR. Extended Boolean model.

Bayesian Networks in IR: the inference network model.

Read Chapter 2.6, 2.6.1 and 2.8.1-2.8.5 from Textbook
Exercise 5
To be solved in class
Notes
Read this paper
Week 9 Tue

6

Retrieval evaluation: recall and precision, alternative measures. Reference collections. Query languages. Query operations: pseudo-feedback, local and global analysis.

Read Chapters 3, 4, and 5 from Textbook
Exercise 6
To be solved in class
Notes
Week 9 Thur CANCELLED Finish reading Chapters 3, 4, and 5
Look at Exercise 6 and try to solve it
Week10 Tue

7

Ranking algorithms: HITS, PageRank.
Indexing, searching and storage mechanisms, 1st part: flat, bitmap and signature files, PAT trees.

Read Sections 13.4.4 and 8.3 from Textbook
Exercise 7
Notes
Read this paper and
this one

Week10 Thur

8

Indexing, searching and storage mechanisms, 2nd part: Inverted files. Dictionaries: tries and B-trees. Anatomy of search engines and crawlers. Libraries and toolkits for IR: Lucene.

Read Sections 8.1, 8.2, and 13.1-13.4.3 from Textbook
Exercise 8
Notes
Read this article
and this one
Week11 Tue
14th March

9

Efficient IR part I: Review of parallel processing. Flynn's classification. Speedup, efficiency, Amdahl's law and Amdahl's effect.

Parallel and distributed mechanisms in search engines: data parallelism for logical and physical documents, data parallelism for terms.

Read Sections 9.1-9.2.2 and 9.3 from Textbook
Exercise 9
Notes

Week11 Thur
16th March

10

Efficient IR part II: A parallel crawler. Static and dynamic partitioning of search graphs. Searching models.

Multimedia IR: data modeling, queries and features. Searching and indexing multimedia objects using features: R-trees, GEMINI.

Read Chapter 11, sections 11.1-11.2.1 and 11.3-11.3.1, and Chapter 12, sections 12.1-12.2, from Textbook
Exercise 10
Notes
Read this paper
Week12 Tue
21st March
CANCELLED


Week12 Thur
23rd March

11

Introduction to Artificial Intelligence and its application in IR systems. QA systems. Course overview. Final exercise session.
Notes
Week13 Tue
28th March

12

Jens Rúni Poulsen
Topic: Multiagent systems in IR
Discussion

Notes
Week13 Thur 30th March

13

CANCELLED

Week14 Tue
4th April

14

Ole Buus
Topic: Advanced techniques in IR: genetic algorithms, simulated annealing, etc.
Daniel Jacob Poulsen
Topic: Text classification (categorization)

Discussion


Thur
6th April
Week15 Tue
11th April
Thur
13th April
   
Week16 Tue
18th April
Thur
20th April
15

Kim Beck
Topic: Natural Language Processing in IR

Jia Ma
Topic: Machine Learning methods in IR: unsupervised/semi-supervised learning.
Bing Pen
Topic: Machine Learning methods in IR: supervised learning
Discussion

Presentation schedule and recommendations
Notes on NLP in IR systems by Kim Beck
Notes on unsupervised ML in IR by Jia Ma
Notes on supervised ML in IR by Bing Pen
 

*Disclaimer: We will use slides and notes from the textbook, my own notes, and other sources on the web.

References

Textbook

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Publisher: Addison Wesley; 1st edition (May 15, 1999)
ISBN: 020139829X
 

Reference books

Mining the Web: Analysis of Hypertext and Semi Structured Data by Soumen Chakrabarti
Publisher: Morgan Kaufmann; 1st edition (August 15, 2002)
ISBN: 1558607544
 

Resources

Porter's stemmer algorithm

How Google works?