The goal of this project is to developthrough experienceyour ability to:
We all have too much to read: We are overwhelmed by material, both relevant and irrelevant. One version of the problem is the difficulty of even sorting articles into appropriate categories. We want you to develop software that will assist with this problem.
In its simplest form, you can imagine a training set of news articles, each of which is identified by its category. You are to devise a program that uses the training set to learn how to identify the category of subsequent articles.
International Conglomerates, Inc., is a huge multinational corporation that has virtual monopoly on a wide range of industries.
Presently, International Conglomerates has a central news center that receives news wires from all over the world. A human editor reviews each incoming article and classifies it, so that it can be routed to the appropriate division. Naturally, this is a very labor-intensive and error-prone task. CEO Ima $avage wonders if it is it possible to build a computer system to do the classification job automatically.
You have been hired as a high-powered consultant to look into the problem. You soon determine that there are four divisions within International Conglomerates, each of which is interested in a particular category of news:
|Division||Category of interest||Typical subjects|
|Conglomo-Organic Division||organic||livestock, sugar, wheat, grain, tea, vegetable oil|
|Conglomo-Petrochemical Division||petrochemicals||crude oil, oil, natural gas, plastics, etc.|
|Conglomo-Metals Division||metals||gold, silver, copper, tin, steel, etc.|
|Conglomo-Finance Division||financial||bond/interest rates, stock market, money supply, etc.|
In order to help with your task, the Ms. $avage has provided you with a corpus of old news articles and the divisions to which they were routed.
The corpus contains articles in the form suggested by the following samples. Evidently, the raw text has been prepared for digestion by a Scheme program:
(0000 (financial) (BANK OF JAPAN INTERVENES IN EARLY TOKYO AFTERNOON) (TOKYO *comma* April 1) (The Bank of Japan intervened in the market in the early afternoon *comma* buying dollars around 147 *period* 30 yen and continuing to buy them as high as 147 *period* 50 yen *comma* dealers said *period* The Bank intervened just after the dollar started rising on buying by securities houses at around 147 *period* 05 yen *comma* and hoped to accelerate the dollar *apostrophe* s advance *comma* they said *period* The dollar rose as high as 147 *period* 50 yen *period* REUTER))
(0001 (organic) (MALAYSIA CUTS GAZETTED RUBBER PRICE) (KUALA LUMPUR *comma* April 1) (Malaysian government said it cut the gazetted price of rubber to 202 *dash* 7/8 cents per kg from 213 *dash* 1/2 cents in March *comma* effective immediately *period* No export duty is applicable at this level *comma* against 3/8 cent per kg last month *comma* because the government raised the export duty threshold price to 210 cents per kg in early 1985 *period* The cess for rubber research and replanting remained unchanged at 3 *period* 85 and 9 *period* 92 cents per kg respectively *period* REUTER))
After discussion with Ms. $avage, you agree to write a program that learns to classify such articles on the basis of their content. That is, you are to write a program that uses the given corpus of human-categorized articles to learn to recognize articles that are not categorized.
In particular, given a test file containing articles with empty category
expressions, your program will write a file in which each articles is
represented as a two-element list containing an identification number and
one of four categories,
FINANCIAL. Thus, if the articles in the sample set were
provided as a test set, with empty category expressions, your program is to
write a file containing the following:
(0000 FINANCIAL) (0001 ORGANIC)
More precisely, you agree to:
Your approach and implementation should handle categories, training data, and test sets in general, not just those of interest to International Conglomerates. If we were to change the categories, training data, and test sets, your approach and implementation should still work. Thus, you are not to manually tune your program to the particular categories provided.
We provide you with a set of useful utility functions.