Using linguistic information as features for text categorization
This video was recorded at the NATO Advanced Study Institute on Mining Massive Data Sets for Security. We report on some experiences using linguistic information as additional features in a classical Vector Space Model [10]. Information extracted for every word, such as its part of speech, stem, and lexical root, has been combined in different ways to test for a possible improvement in classification performance, using several algorithms such as SVM [3], BBR [] and PLAUM [6]. Automatic Text Classification, or Automatic Text Categorization as it is also known, tries to relate documents to a predefined set of classes. Extensive research has been carried out on this subject [11], and a wide range of techniques are applicable to the task: feature extraction [5], feature weighting, dimensionality reduction [4], machine learning algorithms, and more. In addition, the classification task can be binary (one of two possible classes to select), multi-class (one out of a set of possible classes), or multi-label (a set of classes from a larger set of potential candidates). In most cases, the latter two can be reduced to binary decisions [1], as the algorithm used in our experiments does [8]. In order to verify the contribution of the new features, we combined them into the vector space model by preprocessing the Reuters-21578 collection, a set of data well known in the research community devoted to text categorization problems [2].
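The idea of folding per-word linguistic information into a vector space model can be sketched roughly as follows. This is a minimal illustration, not the authors' exact scheme: the token tuples, POS tags, and feature-naming convention below are all assumptions for the sake of the example.

```python
from collections import Counter

# Hypothetical preprocessed document: each token carries its surface form,
# a POS tag, and a stem (values are illustrative, not from the paper).
doc = [
    ("banks", "NNS", "bank"),
    ("lend", "VBP", "lend"),
    ("money", "NN", "money"),
]

def combined_features(tokens):
    """Build a bag-of-features representation that mixes the plain word
    with word+POS and stem features: one possible way of combining
    linguistic information into a single vector space."""
    feats = Counter()
    for word, pos, stem in tokens:
        feats[word] += 1             # classical bag-of-words feature
        feats[f"{word}/{pos}"] += 1  # word disambiguated by its POS tag
        feats[f"STEM:{stem}"] += 1   # stem conflates inflected forms
    return feats

vec = combined_features(doc)
```

The resulting counter can then be weighted (e.g. with tf-idf) and fed to any of the binary classifiers mentioned above; which feature combinations actually help is precisely the empirical question the experiments address.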