CMU QA
Text
NLP
License: CC BY-SA 3.0

Overview

We describe a competitive question generation and answering project used in our undergraduate
natural language processing courses. This semester-long project challenges teams of three
or four students to use available NLP components (or develop their own) to construct systems
that ask and answer questions about an arbitrary Wikipedia article. We describe how the project
and competition were structured, the outcomes, and lessons learned. This Question/Answer dataset
was generated by students in undergraduate natural language processing courses taught by
Noah Smith at Carnegie Mellon University and Rebecca Hwa at the University of Pittsburgh during
Spring 2008, Spring 2009, and Spring 2010.

Data Collection

The project proceeded in four phases over a 15-week semester: data preparation (weeks 1–4),
during which the first few course lectures introduced the most important concepts for getting
started in NLP, along with motivating applications; system development (weeks 5–12), during which
teams worked on their systems as they learned more about problems and solutions in NLP;
evaluation/competition (weeks 13–14); and live demonstrations (hosted by the local Google office)
at the end. The first and third phases are the most relevant to this dataset.

Data Annotation

There are three directories, one for each year of students: S08, S09, and S10.

The file "question_answer_pairs.txt" contains the questions and answers. The first line of
the file gives the column names for the tab-separated data fields:

ArticleTitle	Question	Answer	DifficultyFromQuestioner	DifficultyFromAnswerer	ArticleFile

Field 1 is the name of the Wikipedia article from which questions and answers initially came.
Field 2 is the question.
Field 3 is the answer.
Field 4 is the prescribed difficulty rating for the question, as given to the question-writer.
Field 5 is a difficulty rating assigned by the individual who evaluated and answered the
question; it may differ from the rating in field 4.
Field 6 is the relative path to the prefix of the article files. Both HTML (.htm) and cleaned
text (.txt) files are provided.

Questions that were judged to be poor were discarded from this data set.
The same question frequently appears on multiple lines; this happens when the question was
answered by more than one individual. This particular release was prepared by Kevin Gimpel,
but the data collection process was carried out by Noah Smith, Mike Heilman, Rebecca Hwa, Shay
Cohen, and many CMU and Pitt students.
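
The file can be loaded with standard tools. The following is a minimal Python sketch; the
relative path and the file encoding are assumptions, so adjust them for your local copy:

import csv

# Minimal loading sketch: path and encoding are assumptions based on
# the directory layout and field description above.
with open("S08/question_answer_pairs.txt", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        # Column names come from the header line described above.
        print(row["ArticleTitle"], "|", row["Question"], "->", row["Answer"])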

Instruction

The project requirements provided considerable flexibility. Students could develop their systems
in any programming language, and they were allowed to use existing NLP components available
on the Web. The command-line interface for the question generation program was

./ask art.txt N

where art.txt is a file containing the text of a Wikipedia article, and N is a positive integer
specifying how many questions to generate. The program is expected to print to standard output
a sequence of N newline-separated questions about the article that a human could answer, given
the article. Students were instructed to aim for questions that are "fluent and reasonable."
The answering program has a similar interface:

./answer art.txt q.txt

where q.txt lists questions in the same format as ask’s output.
Answers are to be written to standard output, one per line. Students were instructed to aim
for answers that are fluent, correct, and intelligent. Note that there is no document retrieval
component to this project; questions and answers always pertain to a specific, known document.
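
To make the expected contract concrete, here is a hypothetical Python skeleton for the question
generator; the generation heuristic is a placeholder, not the method used by any student team:

# ask.py -- illustrative skeleton for the ./ask interface described above.
# Usage: python ask.py art.txt N
import sys

def generate_questions(text, n):
    # Placeholder heuristic only: a real system would apply NLP components
    # (taggers, parsers, etc.) to produce fluent, reasonable questions.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [f"What does the article say about {s[:60]}?" for s in sentences[:n]]

def main():
    article_path, n = sys.argv[1], int(sys.argv[2])
    with open(article_path, encoding="utf-8") as f:
        text = f.read()
    for question in generate_questions(text, n):
        print(question)  # one question per line on standard output

if __name__ == "__main__":
    main()

An answering program would follow the same pattern: read art.txt and q.txt, then write one
answer per line to standard output.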

Citation

@inproceedings{smith2008question,
  author    = {Noah A. Smith and Michael Heilman and Rebecca Hwa},
  title     = {Question Generation as a Competitive Undergraduate Course Project},
  booktitle = {Proceedings of the NSF Workshop on the Question Generation Shared Task and Evaluation Challenge},
  address   = {Arlington, VA},
  month     = {September},
  year      = {2008},
  url       = {http://www.cs.cmu.edu/~nasmith/papers/smith+heilman+hwa.nsf08.pdf}
}

License

CC BY-SA 3.0

Data Summary

Type: Text
Amount: --
Size: 7.87MB
Provided by: Carnegie Mellon University
Carnegie Mellon University (CMU) is a private research university based in Pittsburgh, Pennsylvania.
