Encyclopedia of Machine Learning

Claude Sammut, Geoffrey I. Webb (Eds.)

Encyclopedia of Machine Learning With Figures and Tables


Editors Claude Sammut School of Computer Science and Engineering University of New South Wales Sydney Australia [email protected] Geoffrey I. Webb Faculty of Information Technology Clayton School of Information Technology Monash University P.O. Box Victoria Australia [email protected]

ISBN ---- e-ISBN ---- Print and electronic bundle ISBN ---- DOI ./---- Springer New York Library of Congress Control Number: © Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, Spring Street, New York, NY , USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The term “Machine Learning” came into widespread use following the first workshop by that name, held at Carnegie-Mellon University in 1980. The papers from that workshop were published as Machine Learning: An Artificial Intelligence Approach, edited by Ryszard Michalski, Jaime Carbonell and Tom Mitchell. Machine Learning came to be identified as a research field in its own right as the workshops evolved into international conferences and journals of machine learning appeared.

Although the field coalesced in the 1980s, research on what we now call machine learning has a long history. In his 1950 paper on “Computing Machinery and Intelligence”, Alan Turing introduced his imitation game as a means of determining if a machine could be considered intelligent. In the same paper he speculates that programming the computer to have adult-level intelligence would be too difficult: “Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain”. Investigations into induction, a fundamental operation in learning, go back much further to Francis Bacon and David Hume in the 17th and 18th centuries.

Early approaches followed the classical AI tradition of symbolic representation and logical inference. As machine learning began to be used in a wide variety of areas, the range of techniques expanded to incorporate ideas from psychology, information theory, statistics, neuroscience, genetics, operations research and more. Because of this diversity, it is not always easy for a new researcher to find his or her way around the machine learning landscape. The purpose of this encyclopedia is to guide enquiries into the field as a whole and to serve as an entry point to specific topics, providing overviews and, most importantly, references to source material.
All the entries have been written by experts in their field and have been refereed and revised by an international editorial board consisting of leading machine learning researchers. Putting together an encyclopedia for such a diverse field has been a major undertaking. We thank all the authors, without whom this would not have been possible. They have devoted their expertise and patience to the project because of their desire to contribute to this dynamic and still growing field. A project as large as this could only succeed with the help of the area editors, whose specialised knowledge was essential in defining the range and structure of the entries. The encyclopedia was started by the enthusiasm of Springer editors Jennifer Evans and Oona Schmidt and continued with the support of Melissa Fearon. Special thanks to Andrew Spencer, who oversaw production and kept everyone, including the editors, on track.

Claude Sammut and Geoffrey I. Webb

Editors-in-Chief

Claude Sammut School of Computer Science and Engineering University of New South Wales Sydney, Australia [email protected]

Geoffrey I. Webb Faculty of Information Technology Clayton School of Information Technology Monash University P.O. Box Victoria, Australia [email protected]

Area Editors

Charu Aggarwal IBM T. J. Watson Research Center Skyline Drive Hawthorne NY USA [email protected]

Wray Buntine NICTA Locked Bag Canberra ACT Australia [email protected]

James Cussens Department of Biology (Area ) York Centre for Complex Systems Analysis University of York PO Box York YO YW UK [email protected]

Luc De Raedt Dept. of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected]

Peter A. Flach Department of Computer Science University of Bristol Woodland Road Bristol BS UB UK [email protected]

Russ Greiner Department of Computing Science University of Alberta Athabasca Hall Edmonton Alberta TG E Canada [email protected]

Eamonn Keogh Computer Science & Engineering Department University of California Riverside California CA USA [email protected]

Michael L. Littman Department of Computer Science Rutgers, the State University of New Jersey Frelinghuysen Road Piscataway New Jersey - USA [email protected]

Sridhar Mahadevan Department of Computer Science University of Massachusetts Governor’s Drive Amherst MA USA [email protected]

Stan Matwin School of Information Technology and Engineering University of Ottawa King Edward Ave., P.O. Box Stn A Ottawa Ontario KN N Canada [email protected]

Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin University Station C Austin Texas TX - USA [email protected]

Dunja Mladenic Department for Intelligent Systems J. Stefan Institute Jamova Ljubljana Slovenia [email protected]

C. David Page Department of Biostatistics and Medical Informatics University of Wisconsin Medical School University Avenue Wisconsin Madison WI USA [email protected]

Bernhard Pfahringer Department of Computer Science University of Waikato Private Bag Hamilton New Zealand [email protected]

Michail Prokopenko CSIRO Macquarie University Building EB, Campus Herring Road North Ryde NSW Australia

Frank Stephan Department of Mathematics National University of Singapore Science Drive S, Singapore Singapore [email protected]

Peter Stone Department of Computer Sciences The University of Texas at Austin University Station C Austin Texas TX - USA [email protected]

Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis Oregon OR - USA [email protected]

Takashi Washio The Institute of Scientific and Industrial Research Osaka University - Mihogaoka Osaka Ibaraki Japan [email protected]

List of Contributors

Pieter Abbeel Department of Electrical Engineering and Computer Sciences University of California Sutardja Dai Hall # CA -, Berkeley California USA [email protected]

Charu C. Aggarwal IBM T. J. Watson Research Center Skyline Drive Hawthorne NY USA [email protected]

Biliana Alexandrova-Kabadjova General Directorate of Central Bank Operations Central Banking Operations Division Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F [email protected]

J. Andrew Bagnell Robotics Institute Carnegie Mellon University Forbes Avenue Pittsburgh, PA USA [email protected] Michael Bain University of New South Wales Sydney Australia [email protected] Arindam Banerjee Department of Computer Science and Engineering University of Minnesota Minneapolis, MN USA [email protected] Andrew G. Barto Department of Computer Science University of Massachusetts Amherst Computer Science Building Amherst, MA USA [email protected]

Periklis Andritsos Thoora Inc. Toronto, ON Canada [email protected]

Rohan A. Baxter Analytics, Intelligence and Risk Australian Taxation Office PO Box Civic Square, ACT Australia [email protected]

Peter Auer Institute of Computer Science University of Leoben Franz-Josef-Strasse Leoben Austria [email protected]

Bettina Berendt Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan A Heverlee Belgium [email protected]


Indrajit Bhattacharya IBM India Research Laboratory New Delhi India

Mustafa Bilgic University of Maryland AV Williams Bldg Rm College Park, MD USA

Mauro Birattari IRIDIA Université Libre de Bruxelles Brussels Belgium [email protected]

Hendrik Blockeel Department of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected]

Shawn Bohn Pacific Northwest National Laboratory

Antal van den Bosch Tilburg centre for Creative Computing Tilburg University P.O. Box LE, Tilburg The Netherlands [email protected]

Janez Brank Department for Intelligent Systems Jožef Stefan Institute Jamova Ljubljana Slovenia [email protected]

Jürgen Branke Institut für Angewandte Informatik und Formale Beschreibungsverfahren Universität Karlsruhe (TH) Karlsruhe Germany [email protected] Pavel Brazdil LIAAD-INESC Porto L.A./Faculdade de Economia Laboratory of Artificial Intelligence and Computer Science University of Porto Rua de Ceuta n. .piso Porto - Portugal [email protected] Gavin Brown The University of Manchester School of Computer Science Kilburn Building Oxford Road Manchester, M PL UK [email protected] Ivan Bruha Department of Computing & Software McMaster University Hamilton, ON Canada [email protected] M.D. Buhmann Numerische Mathematik Justus-Liebig University Mathematisches Institut Heinrich-Buff-Ring Giessen Germany [email protected] Wray L. Buntine NICTA Locked Bag Canberra ACT Australia [email protected]


Tibério Caetano Research School of Information Sciences and Engineering Australian National University Canberra ACT Australia [email protected]

Nicola Cancedda Xerox Research Centre Europe , chemin de Maupertuis Meylan France [email protected]

Gail A. Carpenter Department of Cognitive and Neural Systems Center for Adaptive Systems Boston University Boston, MA USA

John Case Department of Computer and Information Sciences University of Delaware Newark DE - USA [email protected]

Tonatiuh Peña Centeno Economic Research Division Bank of Mexico Av. de Mayo # Col. Centro, C.P. Mexico, D.F.

Deepayan Chakrabarti Yahoo! Research st Avenue Sunnyvale, CA USA [email protected]

Philip K. Chan Department of Computer Sciences Florida Institute of Technology Melbourne, FL USA [email protected]

Massimiliano Ciaramita Yahoo! Research Barcelona Ocata Barcelona Spain [email protected] Adam Coates Department of Computer Science Stanford University Stanford, CA USA David Cohn Google, Inc. Amphitheatre Parkway Mountain View, CA USA [email protected] David Corne Heriot-Watt University Earl Mountbatten Building Edinburgh EH AS UK [email protected] Susan Craw IDEAS Research Institute School of Computing The Robert Gordon University St. Andrew Street Aberdeen AB HG Scotland UK [email protected] Artur Czumaj Department of Computer Science University of Warwick Coventry CV AL UK [email protected] Walter Daelemans Department of Linguistics CLIPS University of Antwerp Prinsstraat Antwerpen Belgium [email protected]


Sanjoy Dasgupta Department of Computer Science and Engineering University of California San Diego Gilman Drive Mail Code La Jolla, California - USA [email protected] Gerald DeJong Department of Computer Science University of Illinois at Urbana Urbana, IL USA [email protected] Marco Dorigo IRIDIA Université Libre de Bruxelles Avenue Franklin Roosevelt Brussels Belgium [email protected] Kurt Driessens Departement Computerwetenschappen Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected] Christopher Drummond Integrated Reasoning National Research Council Institute for Information Technology Montreal Road Building M-, Room Ottawa, ON KA R Canada [email protected] Yaakov Engel AICML, Department of Computing Science University of Alberta - Athabasca Hall Edmonton Alberta TG E Canada [email protected]

Scott E. Fahlman Language Technologies Institute Carnegie Mellon University GHC Forbes Avenue Pittsburgh, PA USA [email protected] Alan Fern School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis, OR - USA [email protected] Peter A. Flach Department of Computer Science University of Bristol Woodland Road Bristol, BS UB UK [email protected] Pierre Flener Department of Information Technology Uppsala University Box SE- Uppsala Sweden [email protected] Johannes Fürnkranz TU Darmstadt Fachbereich Informatik Hochschulstraße Darmstadt Germany [email protected] Thomas Gärtner Knowledge Discovery Fraunhofer Institute for Intelligent Analysis and Information Systems Schloss Birlinghoven Sankt Augustin Germany [email protected]


João Gama Laboratory of Artificial Intelligence and Decision Support University of Porto Porto Portugal [email protected]

Alma Lilia García-Almanza General Directorate of Information Technology Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F. [email protected]

Gemma C. Garriga Laboratoire d’Informatique de Paris Universite Pierre et Marie Curie place Jussieu Paris France [email protected]

Wulfram Gerstner Laboratory of Computational Neuroscience Brain Mind Institute Ecole Polytechnique Fédérale de Lausanne Station Lausanne EPFL Switzerland [email protected]

Lise Getoor Department of Computer Science University of Maryland AV Williams Bldg, Rm College Park, MD USA [email protected]

Christophe Giraud-Carrier Department of Computer Science Brigham Young University TMCB Provo UT USA

Marko Grobelnik Department for Intelligent Systems Jožef Stefan Institute Jamova , Ljubljana Slovenia [email protected]

Stephen Grossberg Department of Cognitive Boston University Beacon Street Boston, MA USA [email protected]

Jiawei Han Department of Computer Science University of Illinois at Urbana Champaign N. Goodwin Avenue Urbana, IL USA [email protected]

Julia Handl Faculty of Life Sciences in Manchester University of Manchester UK [email protected]

Michael Harries Technology Strategy Division Advanced Products Group, Citrix Labs North Ryde NSW Australia

Jun He Department of Computer Science Aberystwyth University Aberystwyth SY DB Wales UK [email protected]


Bernhard Hengst School of Computer Science & Engineering University of New South Wales Sydney NSW Australia [email protected]

Phil Husbands Department of Informatics University of Sussex Brighton BNQH UK [email protected]

Tom Heskes Radboud University Nijmegen Toernooiveld ED Nijmegen The Netherlands [email protected]

Marcus Hutter Australian National University RSIS Room B Building Corner of North and Daley Road ACT Canberra Australia [email protected]

Geoffrey Hinton Department of Computer Science Office PT G University of Toronto King’s College Road MS G, Toronto Ontario Canada [email protected] Lawrence Holder School of Electrical Engineering and Computer Science Box Washington State University Pullman, WA USA [email protected] Tamás Horváth Department of Computer Science III University of Bonn and Fraunhofer IAIS Fraunhofer Institute for Intelligent Analysis and Information Systems Schloss Birlinghoven Sankt Augustin Germany [email protected] Eyke Hüllermeier Knowledge Engineering & Bioinformatics Head of the KEBI Lab Department of Mathematics and Computer Science Philipps-Universität Marburg Mehrzweckgebäude Hans-Meerwein-Straße Marburg Germany [email protected]

Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum Universitätsstr. Bochum Germany [email protected]

Sanjay Jain Department of Computer Science National University of Singapore Computing Drive Singapore Republic of Singapore [email protected]

Tommy R. Jensen Institut für Mathematik Alpen-Adria-Universität Klagenfurt Universitätsstr. - Klagenfurt Austria [email protected]

Xin Jin University of Illinois at Urbana-Champaign Toernooiveld ED Urbana, IL USA


Antonis C. Kakas Department of Computer Science University of Cyprus Kallipoleos Str., P.O. Box Nicosia Cyprus [email protected]

James Kennedy U.S. Bureau of Labor Statistics Postal Square Building Massachusetts Ave., NE Washington, DC - USA [email protected]

Subbarao Kambhampati Department of Computer Science and Engineering Arizona State University Tempe, AZ USA [email protected]

Eamonn Keogh Computer Science & Engineering Department University of California Riverside, CA USA [email protected]

Anne Kao The Boeing Company P.O. Box MC L- Seattle, WA - USA [email protected]

Kristian Kersting Knowledge Discovery Fraunhofer IAIS Schloß Birlinghoven Sankt Augustin Germany [email protected]

George Karypis Department of Computer Science and Engineering Digital Technology Center and Army HPC Research Center University of Minnesota Minneapolis, MN USA [email protected]

Joshua Knowles University of Manchester

Samuel Kaski Laboratory of Computer and Information Science Helsinki University of Technology P.O. Box TKK Finland [email protected]

Kevin B. Korb School of Information Technology Monash University Room , Bldg , Clayton, Victoria Australia [email protected]

Carlos Kavka Istituto Nazionale di Fisica Nucleare University of Trieste Trieste Italy [email protected]

Aleksander Kołcz Microsoft One Microsoft Way Redmond, WA USA [email protected]

Stefan Kramer Institut für Informatik/I Technische Universität München Boltzmannstr. Garching b. München Germany [email protected]


Krzysztof Krawiec Institute of Computing Science Poznan University of Technology Piotrowo - Poznan Poland [email protected]

Christina Leslie Computational Biology Program Sloan-Kettering Institute Memorial Sloan-Kettering Cancer Center York Ave Mail Box # New York, NY [email protected]

Nicolas Lachiche Image Sciences, Computer Sciences and Remote Sensing Laboratory , bld Brant Illkirch-Graffenstaden France [email protected]

Shiau Hong Lim University of Illinois IL USA [email protected]

Michail G. Lagoudakis Department of Electronic and Computer Engineering Technical University of Crete Chania Crete Greece [email protected]

John Langford Yahoo Research New York, NY USA [email protected]

Pier Luca Lanzi Dipartimento di Elettronica e Informazione Politecnico di Milano Milano Italy [email protected]

Nada Lavrač Department of Knowledge Technologies Jožef Stefan Institute Jamova Ljubljana Slovenia Faculty of Information Technology University of Nova Gorica Vipavska Nova Gorica Slovenia

Charles X. Ling The University of Western Ontario Canada [email protected] Huan Liu Computer Science and Engineering Ira Fulton School of Engineering Arizona State University Brickyard Suite South Mill Avenue Tempe, AZ - USA [email protected] Bin Liu Faculty of Information Technology Monash University Melbourne Australia [email protected] John Lloyd College of Engineering and Computer Science The Australian National University , Canberra ACT Australia [email protected] Shie Mannor Department of Electrical Engineering Israel Institute of Technology Technion Technion City Haifa Israel [email protected]


Eric Martin Department of Artificial Intelligence School of Computer Science and Engineering University of New South Wales NSW Sydney Australia [email protected] Serafín Martínez-Jaramillo General Directorate of Financial System Analysis Financial System Analysis Division Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F [email protected] Stan Matwin School of Information Technology and Engineering University of Ottawa Ottawa, ON Canada [email protected] Julian McAuley Statistical Machine Learning Program Department of Engineering and Computer Science National University of Australia NICTA, Locked Bag Canberra ACT Australia [email protected] Prem Melville Machine Learning IBM T. J. Watson Research Center Route /P.O. Box Kitchawan Rd Yorktown Heights, NY USA [email protected] Pietro Michelucci Strategic Analysis, Inc. Wilson Blvd Suite Arlington, VA USA [email protected]

Rada Mihalcea Department of Computer Science and Engineering University of North Texas Denton, TX - USA [email protected]

Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin University Station C Austin, TX - USA [email protected]

Dunja Mladenić Department of Knowledge Technologies Jožef Stefan Institute Jamova , Ljubljana Slovenia [email protected]

Katharina Morik Department of Computer Science Technische Universität Dortmund Dortmund Germany [email protected]

Jun Morimoto Advanced Telecommunication Research Institute International ATR Kyoto Japan

Abdullah Mueen Department of Computer Science and Engineering University California-Riverside Riverside, CA USA

Paul Munro School of Information Sciences University of Pittsburgh Pittsburgh, PA USA [email protected]


Ion Muslea Language Weaver, Inc. Admiralty Way, Suite Marina del Rey, CA USA [email protected] Galileo Namata Department of Computer Science University of Maryland College Park, MD USA Sriraam Natarajan Department of Computer Sciences University of Wisconsin Medical School University Avenue Madison, WI USA [email protected] Andrew Y. Ng Stanford AI Laboratory Stanford University Serra Mall, Gates Building A Stanford, CA - USA [email protected] Siegfried Nijssen Institut für Informatik Albert-Ludwigs-Universität Freiburg Georges-Köhler-Allee, Gebäude Freiburg i. Br. Germany [email protected] William Stafford Noble Department of Genome Sciences University of Washington Seattle, WA USA [email protected] Petra Kralj Novak Department of Knowledge Technologies Jožef Stefan Institute Jamova Ljubljana Slovenia [email protected]

Daniel Oblinger DARPA/IPTO Fairfax Drive Arlington, VA USA [email protected]

Peter Orbanz Department of Engineering Cambridge University Trumpington Street Cambridge, CB PZ UK

Miles Osborne Institute for Communicating and Collaborative Systems University of Edinburgh Buccleuch Place Edinburgh EH LW Scotland UK [email protected]

C. David Page Department of Biostatistics and Medical Informatics University of Wisconsin Medical School University Avenue Madison, WI USA [email protected]

Jonathan Patrick Telfer School of Management University of Ottawa Laurier avenue Ottawa, ON KN N Canada [email protected]

Claudia Perlich Data Analytics Research Group IBM T.J. Watson Research Center P.O. Box Yorktown Heights, NY USA [email protected]


Jan Peters Department of Empirical Inference and Machine Learning Max Planck Institute for Biological Cybernetics Spemannstr. Tuebingen Germany [email protected]

Bernhard Pfahringer Department of Computer Science University of Waikato Private Bag Hamilton New Zealand [email protected]

Steve Poteet Boeing Phantom Works P.O. Box MC L- Seattle, WA USA

Pascal Poupart School of Computer Science University of Waterloo University Avenue West Waterloo ON NL G Canada [email protected]

Rob Powers Computer Science Department Stanford University Serra Mall Stanford, CA USA [email protected]

Cecilia M. Procopiuc AT&T Labs Florham Park, NJ USA [email protected]

Martin L. Puterman Centre for Health Care Management Sauder School of Business University of British Columbia Main Mall Vancouver, BC VT Z Canada [email protected] Lesley Quach Boeing Phantom Works P.O. Box MC L- Seattle, WA USA Novi Quadrianto Department of Engineering and Computer Science Australian National University NICTA London Circuit Canberra ACT Australia [email protected] Luc De Raedt Department of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A BE - Heverlee Belgium [email protected] Dev Rajnarayan NASA Ames Research Center Mail Stop - Moffett Field, CA USA Adwait Ratnaparkhi Yahoo! Labs Santa Clara California USA [email protected] Soumya Ray School of EECS Oregon State University Kelley Engineering Center Corvallis, OR USA [email protected]


Mark Reid Research School of Information Sciences and Engineering The Australian National University Canberra, ACT Australia [email protected] Jean-Michel Renders Xerox Research Centre Europe , chemin de Maupertuis Meylan France John Risch Pacific Northwest National Laboratory Jorma Rissanen Complex Systems Computation Group Department of Computer Science Helsinki Institute of Information Technology Helsinki Finland [email protected] Nicholas Roy Massachusetts Institute of Technology Cambridge, MA USA Lorenza Saitta Università del Piemonte Orientale Alessandria Italy [email protected] Yasubumi Sakakibara Department of Biosciences and Informatics Keio University [email protected] Hiyoshi Kohoku-ku Japan Claude Sammut School of Computer Science and Engineering The University of New South Wales Sydney NSW Australia [email protected]

Joerg Sander Department of Computing Science University of Alberta Edmonton, AB Canada [email protected] Scott Sanner Statistical Machine Learning Group NICTA, London Circuit, Tower A ACT Canberra Australia [email protected] Stefan Schaal Department of Computer Science University of Southern California ATR Computational Neuroscience Labs Watt Way Los Angeles, CA - USA [email protected] Ute Schmid Department of Information Systems and Applied Computer Science University of Bamberg Feldkirchenstr. Bamberg Germany [email protected] Stephen Scott University of Nebraska Lincoln, NE USA Michele Sebag Laboratoire de Recherche en Informatique Université Paris-Sud Bât Orsay France [email protected] Prithviraj Sen University of Maryland AV Williams Bldg, Rm College Park, MD USA


Hanhuai Shan Department of Computer Science and Engineering University of Minnesota Minneapolis, MN USA [email protected]

Hossam Sharara Department of Computer Science University of Maryland College Park, MD Maryland USA

Victor S. Sheng The University of Western Ontario Canada

Jelber Sayyad Shirabad School of Information Technology and Engineering University of Ottawa King Edward P.O. Box Stn A, KN N Ottawa, Ontario Canada [email protected]

Yoav Shoham Computer Science Department Stanford University Serra Mall Stanford, CA USA [email protected]

Thomas R. Shultz Department of Psychology and School of Computer Science McGill University Dr. Penfield Avenue Montréal QC HA B Canada [email protected]

Ricardo Silva Gatsby Computational Neuroscience Unit University College London Alexandra House Queen Square London WCN AR UK [email protected] Vikas Sindhwani IBM T. J. Watson Research Center Route /P.O. Box Kitchawan Rd Yorktown Heights, NY USA Moshe Sipper Department of Computer Science Ben-Gurion University P.O. Box Beer-Sheva Israel [email protected] William D. Smart Associate Professor Department of Computer Science and Engineering Washington University in St. Louis Campus Box One Brookings Drive St. Louis, MO USA [email protected] Carlos Soares LIAAD-INESC Porto L.A./Faculdade de Economia Laboratory of Artificial Intelligence and Computer Science University of Porto Rua de Ceuta n. .piso, - Porto Portugal Christian Sohler Heinz Nixdorf Institute & Computer Science Department University of Paderborn Fuerstenallee Paderborn Germany [email protected]


Frank Stephan Department of Computer Science and Department of Mathematics National University of Singapore Singapore Republic of Singapore [email protected]

Jon Timmis Department of Computer Science and Department of Electronics University of York Heslington York DD UK [email protected]

Peter Stone Department of Computer Sciences The University of Texas at Austin Austin, TX USA [email protected]

Jo-Anne Ting University of Edinburgh

Alexander L. Strehl Department of Computer Science Rutgers University Frelinghuysen Road Piscataway, NJ USA [email protected]

Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis, OR - USA [email protected]

Russ Tedrake Department of Computer Science Massachusetts Institute of Technology Vassar Street Cambridge, MA USA [email protected]

Yee Whye Teh Gatsby Computational Neuroscience Unit University College London Queen Square London WCN AR UK [email protected]

Kai Ming Ting Gippsland School of Information Technology Monash University Gippsland Campus Churchill , Victoria Australia [email protected]

Ljupčo Todorovski Faculty of Administration University of Ljubljana Gosarjeva Ljubljana Slovenia [email protected]

Hannu Toivonen Department of Computer Science University of Helsinki P.O. Box (Gustaf Hällströmin katu b) Helsinki Finland [email protected]

Luís Torgo Department of Computer Science Faculty of Sciences University of Porto Rua Campo Alegre /, – Porto Portugal [email protected]

Panayiotis Tsaparas Microsoft Research Microsoft Mountain View, CA USA [email protected]


Paul E. Utgoff Department of Computer Science University of Massachusetts Governor’s Drive Amherst, MA – USA William Uther NICTA and the University of New South Wales [email protected] Sethu Vijayakumar University of Edinburgh University of Southern California

Eric Wiewiora University of California San Diego [email protected] Anthony Wirth Department of Computer Science and Software Engineering The University of Melbourne Victoria Australia [email protected]

Ricardo Vilalta Department of Computer Science University of Houston Calhoun Rd Houston, TX - USA

Michael Witbrock Cycorp, Inc. Executive Center Drive Austin, TX USA [email protected]

Michail Vlachos IBM Zürich Research Laboratory Säumerstrasse Rüschlikon Switzerland [email protected]

David Wolpert NASA Ames Research Center Moffett Field, CA USA [email protected]

Kiri L. Wagstaff Machine Learning Systems Jet Propulsion Laboratory California Institute of Technology Pasadena, CA USA [email protected] Geoffrey I. Webb Faculty of Information Technology Clayton School of Information Technology Monash University P.O. Box Victoria Australia [email protected] R. Paul Wiegand Institute for Simulation and Training University of Central Florida Orlando, FL USA [email protected] [email protected]

Stefan Wrobel Department of Computer Science University of Bonn, and Fraunhofer IAIS (Institute for Intelligent Analysis and Information Systems) Fraunhofer IAIS Schloss Birlinghoven Sankt Augustin Germany Jason Wu Boeing Phantom Works P.O. Box MC L- Seattle, WA USA Zhao Xu Knowledge Discovery Fraunhofer IAIS Schloß Birlinghoven Sankt Augustin Germany


Ying Yang Australian Taxation Office White Horse Road Box Hill VIC Australia [email protected]

Ying Zhao Department of Computer Science and Technology Tsinghua University Beijing China

Sungwook Yoon PARC Labs Coyote Hill Road Palo Alto, CA USA

Fei Zheng Faculty of Information Technology Monash University Clayton School of I.T. Room , Bldg Wellington Road Clayton Melbourne Victoria Australia [email protected]

Thomas Zeugmann Division of Computer Science Graduate School of Information Science and Technology Hokkaido University Sapporo Japan [email protected]

Xinhua Zhang School of Computer Science Australian National University NICTA London Circuit Canberra Australia [email protected]

Xiaojin Zhu Department of Computer Sciences University of Wisconsin-Madison West Dayton Street, Madison, WI USA [email protected]

- -Norm Distance →Manhattan Distance

Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, DOI ./----, © Springer Science+Business Media LLC

A

Abduction

Antonis C. Kakas
University of Cyprus, Nicosia, Cyprus

Definition

Abduction is a form of reasoning, sometimes described as “deduction in reverse,” whereby given a rule that “A follows from B” and the observed result “A,” we infer the condition “B” of the rule. More generally, given a theory, T, modeling a domain of interest and an observation, “A,” we infer a hypothesis “B” such that the observation follows deductively from T augmented with “B.” We think of “B” as a possible explanation for the observation according to the given theory that contains our rule. This new information and its consequences (or ramifications) according to the given theory can be considered the result of (or part of) a learning process that is based on the given theory and driven by the observations that abduction explains. Abduction can be combined with →induction in different ways to enhance this learning process.

Motivation and Background

Abduction is, along with induction, a synthetic form of reasoning: its explanations generate new information not hitherto contained in the current theory with which the reasoning is performed. As such, it has a natural relation to learning, and in particular to knowledge-intensive learning, where the new information generated aims to complete, at least partially, the current knowledge (or model) of the problem domain as described in the given theory.

Early uses of abduction in the context of machine learning concentrated on abduction as a theory revision operator that identifies where the current theory could be revised in order to accommodate the new learning data. This includes the work of Michalski (), Ourston and Mooney (), and Ade, Malfait, and Raedt (). Another early link of abduction to learning was given by the →explanation-based learning method (DeJong & Mooney, ), where the abductive explanations of the learning data (training examples) are generalized to all cases. Following this, it was realized (Flach & Kakas, ) that the role of abduction in learning could be strengthened by linking it to induction, culminating in a hybrid approach to learning where abduction and induction are tightly integrated to provide powerful learning frameworks such as Progol . (Muggleton & Bryant, ) and HAIL (Ray, Broda, & Russo, ). On the other hand, from the point of view of abduction as “inference to the best explanation” (Josephson & Josephson, ), the link with induction provides a way to distinguish between different explanations and to select those that give a better inductive generalization result.

A recent application of abduction, on its own or in combination with induction, is in systems biology, where we try to model biological processes and pathways at different levels. This challenging domain provides an important development test-bed for these methods of knowledge-intensive learning (see e.g., King et al., ; Papatheodorou, Kakas, & Sergot, ; Ray, Antoniades, Kakas, & Demetriades, ; Tamaddoni-Nezhad, Kakas, Muggleton, & Pazos, ; Zupan et al., ).


Structure of the Learning Task

Abduction contributes to the learning task by first explaining, and thus rationalizing, the training data according to a given and current model of the domain to be learned. These abductive explanations either form the result of learning on their own or feed into a subsequent phase that generates the final result of learning.

Abduction in Artificial Intelligence

Abduction, as studied in Artificial Intelligence from the perspective of learning, is mainly defined in a logic-based approach as follows. (Other approaches to abduction include the set covering approach, see, e.g., Reggia (), and case-based explanation, e.g., Leake ().) Given a set of sentences T (a theory or model) and a sentence O (an observation), the abductive task is the problem of finding a set of sentences H (an abductive explanation for O) such that:

1. T ∪ H ⊧ O,
2. T ∪ H is consistent,

where ⊧ denotes the deductive entailment relation of the formal logic used to represent our theory, and consistency refers to the corresponding notion in this logic. The choice of this underlying formal logic depends in general on the problem or phenomena that we are trying to model. In many cases, it is →first order predicate calculus, as, for example, in the approach of theory completion in Muggleton and Bryant (). But other logics can be used, e.g., the nonmonotonic logics of default logic or logic programming with negation as failure, when the modeling of our problem requires this level of expressivity.

This basic formalization, as it stands, does not fully capture the explanatory nature of the abductive explanation H, in the sense that H should convey some reason why the observations hold. It would, for example, allow an observation O to be explained by itself or in terms of some other observations rather than in terms of some “deeper” reason for which the observation must hold according to the theory T. Also, as the above specification stands, the observation can be abductively explained by generating in H some new (general) theory completely unrelated to the given theory T. In this case, H does not account for the observations O according to the given theory T, and in this sense it may not be considered an explanation for O relative to T. For these reasons, in order to specify a “level” at which the explanations are required and to understand these relative to the given general theory about the domain of interest, the members of an explanation are normally restricted to belong to a special preassigned, domain-specific class of sentences called abducible.

Hence abduction is typically applied on a model, T, in which we can separate two disjoint sets of predicates: the observable predicates and the abducible (or open) predicates. The basic assumption then is that our model T has reached a sufficient level of comprehension of the domain such that all the incompleteness of the model can be isolated (under some working hypotheses) in its abducible predicates. The observable predicates are assumed to be completely defined (in T) in terms of the abducible predicates and other background auxiliary predicates; any incompleteness in their representation comes from the incompleteness in the abducible predicates. In practice, the empirical observations that drive the learning task are described using the observable predicates. Observations are represented by formulae that refer only to the observable predicates (and possibly some background auxiliary predicates), typically by ground atomic facts on these observable predicates. The abducible predicates describe underlying (theoretical) relations in our model that are not observable directly but can, through the model T, bring about observable information.

The assumptions on the abducible predicates used for building up the explanations may be subject to restrictions that are expressed through integrity constraints. These represent additional knowledge that we have on our domain, expressing general properties of the domain that remain valid no matter how the theory is extended in the process of abduction and associated learning. Therefore, in general, an abductive theory is a triple, denoted by ⟨T, A, IC⟩, where T is the background theory, A is a set of abducible predicates, and IC is a set of integrity constraints. Then, in the definition of an abductive explanation given above, one more requirement is added:

3. T ∪ H satisfies IC.


The satisfaction of integrity constraints can be formally understood in several ways (see Kakas, Kowalski, & Toni, and references therein). Note that the integrity constraints reduce the number of explanations for a set of observations by filtering out those explanations that do not satisfy them. Based on this notion of abductive explanation, a credulous form of abductive entailment is defined. Given an abductive theory, T = ⟨T, A, IC⟩, and an observation O, then O is abductively entailed by T, denoted by T ⊧A O, if there exists an abductive explanation of O in T. This notion of abductive entailment can then form the basis of a coverage relation for learning in the face of incomplete information.
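The three requirements on an abductive explanation can be made concrete with a small sketch. It is only an illustration under simplifying assumptions that are not part of the entry: the theory is a set of ground definite clauses, entailment is computed by forward chaining, integrity constraints are denials (sets of atoms that may not all hold together), and the wet-grass atoms are hypothetical. For definite clauses, requirement 2 (consistency) holds trivially, so the denials do all the filtering.

```python
from itertools import combinations

# Hypothetical toy theory: (head, body) pairs are ground definite clauses.
RULES = [
    ("wet_grass", ["rained"]),
    ("wet_grass", ["sprinkler_on"]),
    ("wet_shoes", ["wet_grass"]),
]
FACTS = {"clear_sky"}
ABDUCIBLES = {"rained", "sprinkler_on"}
# Each denial lists atoms that must not all be true at once.
CONSTRAINTS = [("rained", "clear_sky")]


def closure(rules, facts):
    """Forward-chain the definite clauses to a fixpoint."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in derived and all(b in derived for b in body):
                derived.add(head)
                changed = True
    return derived


def explanations(rules, facts, abducibles, constraints, observation):
    """Return the subset-minimal H with T ∪ H ⊧ O and T ∪ H satisfying IC."""
    found = []
    candidates = sorted(abducibles)
    for size in range(len(candidates) + 1):
        for combo in combinations(candidates, size):
            h = set(combo)
            if any(prev <= h for prev in found):
                continue  # a smaller explanation is already contained in h
            model = closure(rules, set(facts) | h)
            violated = any(all(a in model for a in denial) for denial in constraints)
            if observation in model and not violated:
                found.append(h)
    return found


print(explanations(RULES, FACTS, ABDUCIBLES, CONSTRAINTS, "wet_shoes"))
```

With the denial in place the only explanation is {sprinkler_on}; dropping the denial also admits {rained}, which shows how integrity constraints filter explanations. The exhaustive subset search is exponential in the number of abducible atoms and is meant only to mirror the definition.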

Abductive Concept Learning

Abduction allows us to reason in the face of incomplete information. As such, when the background data on the training examples of a learning problem is incomplete, the use of abduction can enhance the learning capabilities. Abductive concept learning (ACL) (Kakas & Riguzzi, ) is a learning framework that allows us to learn from incomplete information and later to classify new cases that may again be incompletely specified. Under ACL, we learn abductive theories, ⟨T, A, IC⟩, with abduction playing a central role in the covering relation of the learning problem. The abductive theories learned in ACL contain both rules, in T, for the concept(s) to be learned and general clauses acting as integrity constraints in IC.

Practical problems that can be addressed with ACL are: (1) concept learning from incomplete background data where some of the background predicates are incompletely specified and (2) concept learning from incomplete background data together with given integrity constraints that provide some information on the incompleteness of the data. The treatment of incompleteness through abduction is integrated within the learning process. This allows the possibility of learning more compact theories that can alleviate the problem of overfitting due to the incompleteness in the data. A specific subcase of these two problems, and an important third application problem of ACL, is that of (3) multiple predicate learning, where each predicate is required to be learned from the incomplete data for the other predicates. Here abductive reasoning can be used to suitably connect and integrate the learning of the different predicates. This can help to overcome some of the nonlocality difficulties of multiple predicate learning, such as order-dependence and global consistency of the learned theory.

ACL is defined as an extension of →Inductive Logic Programming (ILP) where both the background knowledge and the learned theory are abductive theories. The central formal definition of ACL is given as follows, where examples are atomic ground facts on the target predicate(s) to be learned.

Definition (Abductive Concept Learning)

Given
● A set of positive examples E+
● A set of negative examples E−
● An abductive theory T = ⟨P, A, I⟩ as background theory
● A hypothesis space 𝒯 = ⟨𝒫, ℐ⟩ consisting of a space of possible programs 𝒫 and a space of possible constraints ℐ

Find a set of rules P′ ∈ 𝒫 and a set of constraints I′ ∈ ℐ such that the new abductive theory T′ = ⟨P ∪ P′, A, I ∪ I′⟩ satisfies the following conditions
● T′ ⊧A E+
● ∀e− ∈ E−, T′ ⊭A e−

where E+ stands for the conjunction of all positive examples. An individual example e is said to be covered by a theory T ′ if T ′ ⊧A e. In effect, this definition replaces the deductive entailment as the example coverage relation in the ILP problem with abductive entailment to define the ACL learning problem. The fact that the conjunction of positive examples must be covered means that, for every positive example, there must exist an abductive explanation and the explanations for all the positive examples must be consistent with each other. For negative examples, it is required that no abductive explanation exists for any of them. ACL can be illustrated as follows.
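The two coverage conditions can also be checked mechanically. The following is a minimal sketch, not the ACL system itself: it takes its data (background facts, the learned rule for father, and the learned constraint) from the father example worked through in the text that follows, hard-codes that rule and constraint as Python functions, and searches all sets of ground abducible atoms by brute force. The function names and the `use_ic` switch are assumptions made for this illustration.

```python
from itertools import combinations

# Background program P from the father example (ground facts as tuples).
FACTS = {
    ("parent", "john", "mary"), ("male", "john"),
    ("parent", "david", "steve"),
    ("parent", "kathy", "ellen"), ("female", "kathy"),
}
PEOPLE = ["john", "mary", "david", "steve", "kathy", "ellen"]
# Ground instances of the abducible predicates A = {male, female}.
ABDUCIBLE_ATOMS = [(pred, x) for pred in ("male", "female") for x in PEOPLE]

POSITIVES = [("father", "john", "mary"), ("father", "david", "steve")]
NEGATIVES = [("father", "kathy", "ellen"), ("father", "john", "steve")]


def derive(facts):
    """Apply the learned rule father(X, Y) <- parent(X, Y), male(X)."""
    fathers = {("father", a[1], a[2]) for a in facts
               if a[0] == "parent" and ("male", a[1]) in facts}
    return facts | fathers


def satisfies_ic(facts):
    """Learned constraint <- male(X), female(X): nobody is both."""
    return not any(("male", x) in facts and ("female", x) in facts
                   for x in PEOPLE)


def explain(goals, use_ic=True):
    """Return a smallest delta such that T' with delta covers all goals, else None."""
    for size in range(len(ABDUCIBLE_ATOMS) + 1):
        for delta in combinations(ABDUCIBLE_ATOMS, size):
            world = derive(FACTS | set(delta))
            covered = all(goal in world for goal in goals)
            if covered and (not use_ic or satisfies_ic(world)):
                return set(delta)
    return None


# Positive examples are explained jointly, negatives one at a time.
print(explain(POSITIVES))
print([explain([e]) for e in NEGATIVES])
```

Explaining the positive examples as one conjunction enforces that their explanations are mutually consistent, while each negative example is required to have no explanation at all. Calling `explain([("father", "kathy", "ellen")], use_ic=False)` returns an explanation, which illustrates why the learned constraint is needed.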

Example

Suppose we want to learn the concept father. Let the background theory be T = ⟨P, A, ∅⟩ where:

P = {parent(john, mary), male(john), parent(david, steve), parent(kathy, ellen), female(kathy)},
A = {male, female}.

Let the training examples be:

E+ = {father(john, mary), father(david, steve)},
E− = {father(kathy, ellen), father(john, steve)}.

In this case, a possible hypothesis T′ = ⟨P ∪ P′, A, I′⟩ learned by ACL would consist of

P′ = {father(X, Y) ← parent(X, Y), male(X)},
I′ = {← male(X), female(X)}.

This hypothesis satisfies the definition of ACL because:

1. T′ ⊧A father(john, mary), father(david, steve) with ∆ = {male(david)}.
2. T′ ⊭A father(kathy, ellen), as the only possible explanation for this goal, namely {male(kathy)}, is made inconsistent by the learned integrity constraint in I′.
3. T′ ⊭A father(john, steve), as this has no possible abductive explanations.

Hence, despite the fact that the background theory is incomplete (in its abducible predicates), ACL can find an appropriate solution to the learning problem by suitably extending the background theory with abducible assumptions. Note that the learned theory without the integrity constraint would not satisfy the definition of ACL, because there would exist an abductive explanation for the negative example father(kathy, ellen), namely ∆− = {male(kathy)}. This explanation is prohibited in the complete theory by the learned constraint together with the fact female(kathy).

The algorithm and learning system for ACL is based on a decomposition of this problem into two subproblems: (1) learning the rules in P′ together with appropriate explanations for the training examples and (2) learning integrity constraints driven by the explanations generated in the first part. This decomposition allows ACL to be developed by combining the two ILP settings of explanatory (predictive) learning and confirmatory (descriptive) learning. In fact, the first subproblem can be seen as a problem of learning from entailment, while the second subproblem can be seen as a problem of learning from interpretations.

Abduction and Induction

The utility of abduction in learning can be enhanced significantly when it is integrated with induction. Several approaches for synthesizing abduction and induction in learning have been developed, e.g., Ade and Denecker (), Muggleton and Bryant (), Yamamoto (), and Flach and Kakas (). These approaches aim to develop techniques for knowledge-intensive learning with complex background theories. One problem faced by purely inductive techniques is that the training data on which the inductive process operates often contain gaps and inconsistencies. The general idea is that abductive reasoning can feed information into the inductive process by using the background theory to insert new hypotheses and remove inconsistent data. Stated differently, abductive inference is used to complete the training data with hypotheses about missing or inconsistent data that explain the examples, using the background theory. This process gives alternative possibilities for assimilating and generalizing the data.

Induction is a form of synthetic reasoning that typically generates knowledge in the form of new general rules. These rules can provide, either directly or indirectly through the current theory T that they extend, new interrelationships between the predicates of our theory. Unlike abduction, these can involve the observable predicates and even, in some cases, new predicates. The inductive hypothesis thus introduces new, hitherto unknown, links between the relations that we are studying, allowing new predictions on the observable predicates that would not have been possible before from the original theory under any abductive explanation. An inductive hypothesis, H, extends, as in abduction, the existing theory T to a new theory T′ = T ∪ H, but now H provides new links between observables and nonobservables that were missing or incomplete in the original theory T. This is particularly evident from the fact that induction can be performed even with an empty given theory T, using just the set of observations. The observations specify incomplete (usually extensional) knowledge about the observable predicates, which we try to generalize into new knowledge.

In contrast, the generalizing effect of abduction, if present at all, is much more limited. Because abduction always needs to refer to the given current theory T, we implicitly restrict its generalizing power: we require that the basic model of our domain remains that of T. Induction has a stronger and genuinely new generalizing effect on the observable predicates than abduction. While the purpose of abduction is to extend the theory with an explanation and then reason with it, thus enabling the generalizing potential of the given theory T, the purpose of induction is to extend the given theory to a new theory that can provide new possible observable consequences.

This complementarity of abduction and induction (abduction providing explanations from the theory while induction generalizes to form new parts of the theory) suggests a basis for their integration within the context of theory formation and theory development. A cycle of integration of abduction and induction (Flach & Kakas, ) emerges that is suitable for the task of incremental modeling (see figure). Abduction is used to transform (and in some sense normalize) the observations to information on the abducible predicates. Then, induction takes this as input and tries to generalize this information to general rules for the abducible predicates, now treating these as observable predicates for its own purposes. The cycle can then be repeated by adding the learned information on the abducibles back into the model as new partial information on the incomplete abducible predicates. This will affect the abductive explanations of new observations to be used again in a subsequent phase of induction. Hence, through this cycle of integration, the abductive explanations of the observations are added to the theory, not in the (simple) form in which they have been generated, but in a generalized form given by a process of induction on these.

Figure. The cycle of abductive and inductive knowledge development. The cycle is governed by the “equation” T ∪ H ⊧ O, where T is the current theory, O the observations triggering theory development, and H the new knowledge generated. On the left-hand side we have induction, its output feeding into the theory T for later use by abduction on the right; the abductive output in turn feeds into the observational data O′ for later use by induction, and so on.

A simple example, adapted from Ray et al. (), that illustrates this cycle of integration of abduction and induction is as follows. Suppose that our current model, T, contains the following rule and background facts:

sad(X) ← tired(X), poor(X),
tired(oli), tired(ale), tired(kr),
academic(oli), academic(ale), academic(kr),
student(oli), lecturer(ale), lecturer(kr),

where the only observable predicate is sad. Given the observations O = {sad(ale), sad(kr), not sad(oli)}, can we improve our model? The incompleteness of our model resides in the predicate poor. This is the only abducible predicate in our model. Using abduction we can explain the observations O via the explanation:

E = {poor(ale), poor(kr), not poor(oli)}.

Subsequently, treating this explanation as training data for inductive generalization, we can generalize this to get the rule:

poor(X) ← lecturer(X)

thus (partially) defining the abducible predicate poor when we extend our theory with this rule.

This combination of abduction and induction has recently been studied and deployed in several ways within the context of ILP. In particular, inverse entailment (Muggleton and Bryant, ) can be seen as a particular case of integration of abductive inference for constructing a “bottom” clause and inductive inference to generalize it. This is realized in Progol . and applied to several problems, including the discovery of the function of genes in a network of metabolic pathways (King et al., ) and, more recently, the study of inhibition in metabolic networks (Tamaddoni-Nezhad, Chaleil, Kakas, & Muggleton, ; Tamaddoni-Nezhad et al., ). In Moyle (), an ILP system called ALECTO integrates a phase of extraction-case abduction, which transforms each case of a training example to an abductive hypothesis, with a phase of induction that generalizes these abductive hypotheses. It has been used to learn robot navigation control programs by completing the specific domain knowledge required, within a general theory of planning that the robot uses for its navigation (Moyle, ).

The development of these initial frameworks that realize the cycle of integration of abduction and induction prompted the study of the problem of completeness: whether a framework can find any hypothesis H that satisfies the basic task of being consistent and such that T ∪ H ⊧ O for a given theory T and observations O. Progol was found to be incomplete (Yamamoto, ), and several new frameworks of integration of abduction and induction have been proposed, such as SOLDR (Ito & Yamamoto, ), CF-induction (Inoue, ), and HAIL (Ray et al., ). In particular, HAIL has demonstrated that one of the main reasons for the incompleteness of Progol is that, in its cycle of integration of abduction and induction, it uses a very restricted form of abduction. Lifting some of these restrictions, through the employment of methods from abductive logic programming (Kakas et al., ), has allowed HAIL to solve a wider class of problems. HAIL has been extended to a framework, called XHAIL (Ray, ), for learning nonmonotonic ILP, allowing it to be applied to learn Event Calculus theories for action description (Alrajeh, Ray, Russo, & Uchitel, ) and complex scientific theories for systems biology (Ray & Bryant, ).

Applications of this integration of abduction and induction and the cycle of knowledge development can be found in the recent proceedings of the Abduction and Induction in Artificial Intelligence workshops in (Flach & Kakas, ) and (Ray, Flach, & Kakas, ).
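One pass through the cycle, as in the sad/poor example of this section, can be sketched in a few lines. This is a deliberately simplified illustration and not how Progol, HAIL, or ALECTO work: the abductive step is specialized to the single rule sad(X) ← tired(X), poor(X), and the inductive step only searches for rules of the assumed form poor(X) ← q(X) with one background predicate q. All function names are hypothetical.

```python
# Background facts of the model T, stored as predicate -> set of individuals.
BACKGROUND = {
    "tired": {"oli", "ale", "kr"},
    "academic": {"oli", "ale", "kr"},
    "student": {"oli"},
    "lecturer": {"ale", "kr"},
}
# Observations O: True encodes sad(x), False encodes not sad(x).
OBSERVATIONS = {"ale": True, "kr": True, "oli": False}


def abduce_poor(observations, background):
    """Explain the observations through sad(X) <- tired(X), poor(X)."""
    explanation = {}
    for person, is_sad in observations.items():
        if person in background["tired"]:
            # tired(person) holds, so sad(person) stands or falls with poor(person).
            explanation[person] = is_sad
        elif is_sad:
            raise ValueError(f"sad({person}) has no explanation: tired({person}) is missing")
        # Otherwise not sad(person) already follows and nothing is abduced.
    return explanation


def induce_rules(explanation, background):
    """Return every q such that poor(X) <- q(X) fits the abduced data."""
    positives = {p for p, value in explanation.items() if value}
    negatives = {p for p, value in explanation.items() if not value}
    return [q for q, extension in sorted(background.items())
            if positives <= extension and not (negatives & extension)]


explanation = abduce_poor(OBSERVATIONS, BACKGROUND)
print(explanation)                            # abduced facts on poor
print(induce_rules(explanation, BACKGROUND))  # candidate bodies for poor(X)
```

The abduced facts play the role of training data on the abducible predicate, and the only single-condition rule that covers poor(ale) and poor(kr) while excluding poor(oli) is the one with body lecturer(X), matching the rule derived in the text.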

Abduction in Systems Biology

Abduction has found a rich field of application in the domain of systems biology and the declarative modeling of computational biology. In a project called Robot Scientist (King et al., ), Progol . was used to generate abductive hypotheses about the function of genes. Similarly, learning the function of genes using abduction has been studied in GenePath (Zupan et al., ), where experimental genetic data is explained in order to facilitate the analysis of genetic networks. Also, in Papatheodorou et al. (), abduction is used to learn gene interactions and genetic pathways from microarray experimental data. Abduction and its integration with induction have been used in the study of the inhibitory effect of toxins in metabolic networks (Tamaddoni-Nezhad et al., , ), also taking into account the temporal variation that the inhibitory effect can have. Another bioinformatics application of abduction (Ray et al., ) concerns the modeling of human immunodeficiency virus (HIV) drug resistance and its use in assisting medical practitioners in the selection of antiretroviral drugs for patients infected with HIV. Also, the recently developed frameworks of XHAIL and CF-induction have been applied to several problems in systems biology; see, e.g., Ray (), Ray and Bryant (), and Doncescu, Inoue, and Yamamoto (), respectively.

Cross References

→Explanation-Based Learning
→Inductive Logic Programming

Recommended Reading

Ade, H., & Denecker, M. (). AILP: Abductive inductive logic programming. In C. S. Mellish (Ed.), IJCAI (pp. –). San Francisco: Morgan Kaufmann.
Ade, H., Malfait, B., & Raedt, L. D. (). Ruth: An ILP theory revision system. In ISMIS. Berlin: Springer.
Alrajeh, D., Ray, O., Russo, A., & Uchitel, S. (). Using abduction and induction for operational requirements elaboration. Journal of Applied Logic, (), –.
DeJong, G., & Mooney, R. (). Explanation-based learning: An alternate view. Machine Learning, , –.
Doncescu, A., Inoue, K., & Yamamoto, Y. (). Knowledge based discovery in systems biology using cf-induction. In H. G. Okuno & M. Ali (Eds.), IEA/AIE (pp. –). Heidelberg: Springer.
Flach, P., & Kakas, A. (). Abductive and inductive reasoning: Background and issues. In P. A. Flach & A. C. Kakas (Eds.), Abductive and inductive reasoning. Pure and applied logic. Dordrecht: Kluwer.
Flach, P. A., & Kakas, A. C. (Eds.). (). Abduction and induction in artificial intelligence [Special issue]. Journal of Applied Logic, ().
Inoue, K. (). Inverse entailment for full clausal theories. In LICS workshop on logic and learning.
Ito, K., & Yamamoto, A. (). Finding hypotheses from examples by computing the least generalisation of bottom clauses. In Proceedings of discovery science ’ (pp. –). Berlin: Springer.
Josephson, J., & Josephson, S. (Eds.). (). Abductive inference: Computation, philosophy, technology. New York: Cambridge University Press.
Kakas, A., Kowalski, R., & Toni, F. (). Abductive logic programming. Journal of Logic and Computation, (), –.
Kakas, A., & Riguzzi, F. (). Abductive concept learning. New Generation Computing, , –.
King, R., Whelan, K., Jones, F., Reiser, P., Bryant, C., Muggleton, S., et al. (). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, , –.
Leake, D. (). Abduction, experience and goals: A model for everyday abductive explanation. The Journal of Experimental and Theoretical Artificial Intelligence, , –.
Michalski, R. S. (). Inferential theory of learning as a conceptual basis for multistrategy learning. Machine Learning, , –.
Moyle, S. (). Using theory completion to learn a robot navigation control program. In Proceedings of the th international conference on inductive logic programming (pp. –). Berlin: Springer.
Moyle, S. A. (). An investigation into theory completion techniques in inductive logic programming. PhD thesis, Oxford University Computing Laboratory, University of Oxford.
Muggleton, S. (). Inverse entailment and Progol. New Generation Computing, , –.
Muggleton, S., & Bryant, C. (). Theory completion using inverse entailment. In Proceedings of the tenth international workshop on inductive logic programming (ILP-) (pp. –). Berlin: Springer.
Ourston, D., & Mooney, R. J. (). Theory refinement combining analytical and empirical methods. Artificial Intelligence, , –.
Papatheodorou, I., Kakas, A., & Sergot, M. (). Inference of gene relations from microarray data by abduction. In Proceedings of the eighth international conference on logic programming and non-monotonic reasoning (LPNMR’) (Vol. , pp. –). Berlin: Springer.
Ray, O. (). Nonmonotonic abductive inductive learning. Journal of Applied Logic, (), –.
Ray, O., Antoniades, A., Kakas, A., & Demetriades, I. (). Abductive logic programming in the clinical management of HIV/AIDS. In G. Brewka, S. Coradeschi, A. Perini, & P. Traverso (Eds.), Proceedings of the th European conference on artificial intelligence. Frontiers in artificial intelligence and applications (Vol. , pp. –). Amsterdam: IOS Press.
Ray, O., Broda, K., & Russo, A. (). Hybrid abductive inductive learning: A generalisation of Progol. In Proceedings of the th international conference on inductive logic programming. Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer.
Ray, O., & Bryant, C. (). Inferring the function of genes from synthetic lethal mutations. In Proceedings of the second international conference on complex, intelligent and software intensive systems (pp. –). Washington, DC: IEEE Computer Society.
Ray, O., Flach, P. A., & Kakas, A. C. (Eds.). (). Abduction and induction in artificial intelligence. Proceedings of IJCAI workshop.
Reggia, J. (). Diagnostic expert systems based on a set-covering model. International Journal of Man-Machine Studies, (), –.
Tamaddoni-Nezhad, A., Chaleil, R., Kakas, A., & Muggleton, S. (). Application of abductive ILP to learning metabolic network inhibition from temporal data. Machine Learning, (–), –.
Tamaddoni-Nezhad, A., Kakas, A., Muggleton, S., & Pazos, F. (). Modelling inhibition in metabolic pathways through abduction and induction. In Proceedings of the th international conference on inductive logic programming (pp. –). Berlin: Springer.
Yamamoto, A. (). Which hypotheses can be found with inverse entailment? In Proceedings of the seventh international workshop on inductive logic programming. Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer.
Zupan, B., Bratko, I., Demsar, J., Juvan, P., Halter, J., Kuspa, A., et al. (). Genepath: A system for automated construction of genetic networks from mutant data. Bioinformatics, (), –.

Absolute Error Loss

→Mean Absolute Error

Accuracy

Definition

Accuracy refers to a measure of the degree to which the predictions of a →model match the reality being modeled. The term accuracy is often applied in the context of →classification models. In this context, accuracy = P(λ(X) = Y), where XY is a →joint distribution and the classification model λ is a function X → Y. Sometimes, this quantity is expressed as a percentage rather than a value between 0 and 1.

The accuracy of a model is often assessed or estimated by applying it to test data for which the →labels (Y values) are known. The accuracy of a classifier on test data may be calculated as

accuracy = (number of correctly classified objects) / (total number of objects).

Alternatively, a smoothing function may be applied, such as a →Laplace estimate or an →m-estimate.

Accuracy is directly related to →error rate, such that accuracy = 1 − error rate (or, when expressed as a percentage, accuracy = 100 − error rate).
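The test-set calculation and the two smoothed alternatives can be written out as a short sketch. The function names are hypothetical, and the smoothing formulas are the commonly used forms, stated here as assumptions rather than taken from this entry: the Laplace estimate adds one pseudo-count per outcome (two outcomes, correct or incorrect, by default), and the m-estimate adds m pseudo-observations distributed according to a prior probability of a correct prediction.

```python
def accuracy(predicted, actual):
    """Fraction of test objects whose predicted label equals the known label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def laplace_accuracy(correct, total, outcomes=2):
    """Laplace-smoothed estimate: one pseudo-count for each possible outcome."""
    return (correct + 1) / (total + outcomes)


def m_estimate_accuracy(correct, total, prior, m):
    """m-estimate: m pseudo-observations, a fraction `prior` of them correct."""
    return (correct + m * prior) / (total + m)


predicted = ["a", "b", "a", "a"]
actual = ["a", "b", "b", "a"]
acc = accuracy(predicted, actual)
print(acc, 1 - acc)                        # accuracy and error rate
print(laplace_accuracy(3, 4))              # smoothed toward 1/2
print(m_estimate_accuracy(3, 4, 0.5, 2))   # same value for prior 1/2 and m = 2
```

Both smoothed estimates pull the raw proportion toward the prior, which matters mostly when the test set is small; with a prior of 1/2 and m = 2 the m-estimate coincides with the two-outcome Laplace estimate.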

Cross References

→Confusion Matrix
→Resubstitution Accuracy

ACO

→Ant Colony Optimization

Actions

In a →Markov decision process, actions are the available choices for the decision-maker at any given decision epoch, in any given state.

Active Learning

David Cohn
Mountain View, CA, USA

Definition

The term Active Learning is generally used to refer to a learning problem or system where the learner has some role in determining on what data it will be trained. This is in contrast to Passive Learning, where the learner is simply presented with a →training set over which it has no control. Active learning is often used in settings where obtaining →labeled data is expensive or time-consuming; by sequentially identifying which examples are most likely to be useful, an active learner can sometimes achieve good performance using far less →training data than would otherwise be required.

Structure of Learning System

In many machine learning problems, the training data are treated as a fixed and given part of the problem definition. In practice, however, the training data are often not fixed beforehand. Rather, the learner has an opportunity to play a role in deciding what data will be acquired for training. This process is usually referred to as “active learning,” recognizing that the learner is an active participant in the training process. The typical goal in active learning is to select training examples that best enable the learner to minimize its loss on future test cases. There are many theoretical and practical results demonstrating that, when applied properly, active learning can greatly reduce the number of training examples, and even the computational effort, required for a learner to achieve good generalization.

A toy example that is often used to illustrate the utility of active learning is that of learning a threshold function over a one-dimensional interval. Given +/− labels for N points drawn uniformly over the interval, the expected error between the true value of the threshold and any learner’s best guess is bounded by O(1/N). Given the opportunity to sequentially select the position of points to be labeled, however, a learner can pursue a binary search strategy, obtaining a best guess that is within O(1/2^N) of the true threshold value. This toy example illustrates the sequential nature of example selection that is a component of most (but not all) active learning strategies: the learner makes use of initial information to discard parts of the solution space, and to focus future data acquisition on distinguishing parts that are still viable.
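The contrast in the threshold example can be tried out with a small simulation. This is a sketch under assumptions made only for illustration: the interval is [0, 1], a point is labeled positive when it is at or above the threshold, both learners guess the midpoint of the region still consistent with the labels seen, and the threshold value and seed are arbitrary.

```python
import random


def label(x, threshold):
    """Oracle: positive label when x lies at or above the threshold."""
    return x >= threshold


def passive_estimate(threshold, n, rng):
    """Label n uniformly drawn points, then guess the middle of the gap
    between the largest negative and the smallest positive point."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = rng.random()
        if label(x, threshold):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return (lo + hi) / 2


def active_estimate(threshold, n):
    """Choose each query by binary search; every label halves the interval."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if label(mid, threshold):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


threshold, n = 0.3141, 20
rng = random.Random(0)
print(abs(passive_estimate(threshold, n, rng) - threshold))  # shrinks roughly like 1/N
print(abs(active_estimate(threshold, n) - threshold))        # at most 2**-(n + 1)
```

After n queries the binary search has narrowed the threshold to an interval of width 2^-n, so its midpoint guess is off by at most 2^-(n+1), whereas the passive learner's uncertainty is the gap between two neighbouring random points.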

Related Problems

The term “active learning” is usually applied in supervised learning settings, though there are many related problems in other branches of machine learning and beyond. The “exploration” component of the “exploration/exploitation” strategy in reinforcement learning is one such example. The learner must take actions to gain information, and must decide which actions will give it the information that will best minimize future loss. A number of fields of Operations Research predate and parallel machine learning work on active learning, including Decision Theory (North, ), Value of Information Computation, Bandit problems (Robbins, ), and Optimal Experiment Design (Fedorov, ; Box & Draper, ).

Active Learning Scenarios

When active learning is used for classification or regression, there are three common settings: constructive active learning, pool-based active learning, and stream-based active learning (also called selective sampling).

Constructive Active Learning

In constructive active learning, the learner is allowed to propose arbitrary points in the input space as examples to be labeled. While this in theory gives the learner the most power to explore, it is often not practical. One obstacle is that most learning systems train on only a reduced representation of the instances they are presented with: text classifiers on bags of words (rather than fully structured text) and speech recognizers on formants (rather than raw audio). While a learning system may be able to identify what pattern of formants would be most informative to label, there is no reliable way to generate audio that a human could recognize (and label) from the desired formants alone.

Pool-Based Active Learning

Pool-based active learning (McCallum & Nigam, ) is popular in domains such as text classification and speech recognition where unlabeled data are plentiful and cheap, but labels are expensive and slow to acquire. In pool-based active learning, the learner may not propose arbitrary points to label, but instead has access to a set of unlabeled examples, and is allowed to select which of them to request labels for. A special case of pool-based learning is transductive active learning, where the test distribution is exactly the set of unlabeled examples. The goal then is to sequentially select and label a small number of examples that will best allow predicting the labels of those points that remain unlabeled.

A theme that is common to both constructive and pool-based active learning is the principle of sequential experimentation. Information gained from early queries allows the learner to focus its search on portions of the domain that are most likely to give it additional information on subsequent queries.

Stream-Based Active Learning

Stream-based active learning resembles pool-based learning in many ways, except that the learner only has access to the unlabeled instances as a stream; when an instance arrives, the learner must decide whether to ask for its label or let it go.

Other Forms of Active Learning

By virtue of the broad definition of active learning, there is no real limit on the possible settings for framing it. Angluin’s early work on learning regular sets (Angluin, ) employed a “counterexample” oracle: the learner would propose a solution, and the oracle would either declare it correct, or divulge a counterexample – an instance on which the proposed and true solutions disagreed. Jin and Si () describe a Bayesian method for selecting informative items to recommend when learning a collaborative filtering model, and Steck and Jaakkola () describe a method best described as unsupervised active learning to build Bayesian networks in large domains. While most active learning work assumes that the cost of obtaining a label is independent of the instance to be labeled, there are many scenarios where this is not the case. A mobile robot taking surface measurements must first travel to the point it wishes to sample, making distant points more expensive than nearby ones. In some cases, the cost of a query (e.g., the difficulty of traveling to a remote point to sample it) may not even be known until it is made, requiring the learner to learn a model of that as well. In these situations, the sequential nature of active learning is greatly accentuated, and a learner faces the additional challenges of planning under uncertainty (see “Greedy vs. Batch Active Learning,” below).

Common Active Learning Strategies

1. Version space partitioning. The earliest practical active learning work (Ruff & Dietterich, ; Mitchell, ) explicitly relied on version space partitioning. These approaches tried to select examples on which there was maximal disagreement between hypotheses in the current version space. When such examples were labeled, they would invalidate as large a portion of the version space as possible. A limitation of explicit version space approaches is that, in underconstrained domains, a learner may waste its effort differentiating portions of the version space that have little effect on the classifier’s predictions, and thus on its error.

2. Query by Committee (Seung, Opper, & Sompolinsky, ). In query by committee, the experimenter trains an ensemble of models, either by selecting randomized starting points (e.g., in the case of a neural network) or by bootstrapping the training set. Candidate examples are scored based on disagreement among the ensemble models; examples with high disagreement indicate areas in the sample space that are underdetermined by the training data, and are therefore potentially valuable to label. Models in the ensemble may be looked at as samples from the version space; picking examples where these models disagree is a way of splitting the version space.

3. Uncertainty sampling (Lewis & Gale, ). Uncertainty sampling is a heuristic form of statistical active learning. Rather than sampling different points in the version space by training multiple learners, the learner itself maintains an explicit model of uncertainty over its input space. It then selects for labeling those examples on which it is least confident. In classification and regression problems, uncertainty contributes directly to expected loss (as the variance component of the “error² = bias² + variance” decomposition), so that gathering examples where the learner has greatest uncertainty is often an effective loss-minimization heuristic. This approach has also been found effective for non-probabilistic models, by simply selecting examples that lie near the current decision boundary. For some learners, such as support vector machines, this heuristic can be shown to be an approximate partitioning of the learner’s version space (Tong & Koller, ).

4. Loss minimization (Cohn, Ghahramani, & Jordan, ). Uncertainty sampling can stumble when parts of the learner’s domain are inherently noisy. It may be that, regardless of the number of samples labeled in some neighborhood, it will remain impossible to predict labels there accurately. In these cases, it would be desirable to not only model the learner’s uncertainty over arbitrary parts of its domain, but also to model what effect labeling any future example is expected to have on that uncertainty. For some learning algorithms it is feasible to explicitly compute such estimates (e.g., for locally-weighted regression and mixture models, these estimates may be computed in closed form). For such learners it is therefore practical to select examples that directly minimize the expected loss to the learner, as discussed below under “Statistical Active Learning.”
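The selection rules behind uncertainty sampling and query by committee can be sketched in a few lines for a binary, pool-based setting. This is an illustrative sketch only: the entry does not prescribe a particular disagreement score, so vote entropy is used here as one common choice, and all function names are hypothetical.

```python
import math
from collections import Counter


def least_confident(probabilities):
    """Uncertainty sampling: index of the pool example whose predicted
    probability of the positive class is closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


def vote_entropy(votes):
    """Disagreement of a committee on one example, measured as the
    entropy (in bits) of the labels its members voted for."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_disputed(committee_votes):
    """Query by committee: index of the pool example on which the
    committee's votes have the highest entropy."""
    return max(range(len(committee_votes)), key=lambda i: vote_entropy(committee_votes[i]))
```

For example, `least_confident([0.9, 0.55, 0.1])` picks the second example, and `most_disputed` picks whichever example splits the committee most evenly.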

Statistical Active Learning

Uncertainty sampling and direct loss minimization are two examples of statistical active learning. Both rely on the learner’s ability to statistically model its own uncertainty. When learning with a statistical model, such as a linear regressor or a mixture of Gaussians (Dasgupta, ), the objective is usually to find model parameters that minimize some form of expected loss. When active learning is applied to such models, it is natural to also select training data so as to minimize that same objective. As statistical models usually give us estimates on the probability of (as yet) unknown values, it is often straightforward to turn this machinery upon itself to assist in the active learning process (Cohn et al., ). The process is usually as follows:

1. Begin by requesting labels for a small random subsample of the examples $x_1, x_2, \ldots, x_n$ and fit our model to the labeled data.

2. For any $x$ in our domain, a statistical model lets us estimate both the conditional expectation $\hat{y}(x)$ and $\sigma^2_{\hat{y}(x)}$, the variance of that expectation. We estimate our current loss by drawing a new random sample of unlabeled data, and computing the averaged $\sigma^2_{\hat{y}(x)}$.

3. We now consider a candidate point $\tilde{x}$, and ask what reduction in loss we would obtain if we had labeled it $\tilde{y}$. If we knew its label with certainty, we could simply add the point to the training set, retrain, and compute the new expected loss. While we do not know the true $\tilde{y}$, we could, in theory, compute the new expected loss for every possible $\tilde{y}$ and average those losses, weighting them by our model’s estimate of $p(\tilde{y} \mid \tilde{x})$. In practice, this is normally infeasible; however, for some statistical models, such as locally-weighted regression and mixtures of Gaussians, we can compute the distribution of $p(\tilde{y} \mid \tilde{x})$ and its effect on $\sigma^2_{\hat{y}(x)}$ in closed form (Cohn et al., ).

4. Given the ability to estimate the expected effect of obtaining label $\tilde{y}$ for candidate $\tilde{x}$, we repeat this computation for a sample of M candidates, and then request a label for the candidate with the largest expected decrease in loss. We add the newly labeled example to our training set, retrain, and begin looking at candidate points to add on the next iteration.
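The four steps above can be sketched for the simplest statistical model the text mentions, a linear regressor. The sketch assumes ordinary least squares with a small ridge term (added here only for numerical stability) and a fixed noise variance, which is factored out. For this model the predictive variance $x^\top (X^\top X)^{-1} x$ does not depend on the unknown label $\tilde{y}$, so the averaging over $\tilde{y}$ in step 3 is trivial. All names are illustrative.

```python
import numpy as np


def mean_predictive_variance(X_labeled, X_reference, ridge=1e-6):
    """Average of x^T (X^T X + ridge*I)^{-1} x over a reference sample.
    This is the predictive variance of least squares up to the noise factor."""
    d = X_labeled.shape[1]
    A_inv = np.linalg.inv(X_labeled.T @ X_labeled + ridge * np.eye(d))
    # Quadratic form r^T A_inv r for every row r of the reference sample.
    per_point = np.einsum("ij,jk,ik->i", X_reference, A_inv, X_reference)
    return float(per_point.mean())


def select_query(X_labeled, X_candidates, X_reference):
    """Index of the candidate whose addition to the training inputs
    yields the lowest average predictive variance on the reference sample."""
    scores = [
        mean_predictive_variance(np.vstack([X_labeled, x[None, :]]), X_reference)
        for x in X_candidates
    ]
    return int(np.argmin(scores))
```

A typical use is to call `select_query` once per iteration, request the label of the returned candidate, append it to the labeled set, and repeat.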

The Need for Reference Distributions

Step (2) above illustrates a complication that is unique to active learning approaches. Traditional “passive” learning usually relies on the assumption that the distribution over which the learner will be tested is the same as the one from which the training data were drawn. When the learner is allowed to select its own training data, it still needs some form of access to the distribution of data on which it will be tested. A pool-based or stream-based learner can use the pool or stream as a proxy for that distribution, but if the learner is allowed (or required) to construct its own examples, it risks wasting all its effort on resolving portions of the solution space that are of no interest to the problem at hand.

A Detailed Example: Statistical Active Learning with LOESS

LOESS (Cleveland, Devlin, & Gross, ) is a simple form of locally-weighted regression using a kernel function. When asked to predict the unknown output $y$ corresponding to a given input $x$, LOESS computes a linear regression over known $(x, y)$ pairs, in which it gives pair $(x_i, y_i)$ weight according to the proximity of $x_i$ to $x$. We will write this weighting as a kernel function, $K(x_i, x)$, or simplify it to $k_i$ when there is no chance of confusion.

In the active learning setting, we will assume that we have a large supply of unlabeled examples drawn from the test distribution, along with labels for a small number of them. We wish to label a small number more so as to minimize the mean squared error (MSE) of our model. MSE can be decomposed into two terms: squared bias and variance. If we make the (inaccurate but simplifying) assumption that LOESS is approximately unbiased for the problem at hand, minimizing MSE reduces to minimizing the variance of our estimates.

Given $n$ labeled pairs, and a prediction to make for input $x$, LOESS computes the following covariance statistics around $x$:

$$\mu_x = \frac{\sum_i k_i x_i}{n}, \qquad \sigma_x^2 = \frac{\sum_i k_i (x_i - \mu_x)^2}{n}, \qquad \sigma_{xy} = \frac{\sum_i k_i (x_i - \mu_x)(y_i - \mu_y)}{n},$$

$$\mu_y = \frac{\sum_i k_i y_i}{n}, \qquad \sigma_y^2 = \frac{\sum_i k_i (y_i - \mu_y)^2}{n}, \qquad \sigma_{y|x}^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}.$$

We can combine these to express the conditional expectation of $y$ (our estimate) and its variance as:

$$\hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x), \qquad \sigma_{\hat{y}}^2 = \frac{\sigma_{y|x}^2}{n^2}\left(\sum_i k_i^2 + \frac{(x - \mu_x)^2}{\sigma_x^2} \sum_i k_i^2 \frac{(x_i - \mu_x)^2}{\sigma_x^2}\right).$$
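The LOESS statistics and the resulting prediction can be computed directly. The sketch below makes two assumptions that the text does not state: the kernel is Gaussian with a hand-picked bandwidth, and $n$ is taken to be the total kernel weight $\sum_i k_i$, so that $\mu_x$ and $\mu_y$ are proper weighted means. Function names are illustrative.

```python
import numpy as np


def gaussian_kernel(xi, x, bandwidth=0.5):
    """Assumed kernel K(xi, x); the entry does not fix a particular form."""
    return np.exp(-((xi - x) ** 2) / (2 * bandwidth ** 2))


def loess_statistics(xs, ys, x, bandwidth=0.5):
    """Kernel-weighted means, variances and covariance around the query x."""
    k = gaussian_kernel(xs, x, bandwidth)
    n = k.sum()  # assumption: n is the total kernel weight
    mu_x = (k * xs).sum() / n
    mu_y = (k * ys).sum() / n
    var_x = (k * (xs - mu_x) ** 2).sum() / n
    var_y = (k * (ys - mu_y) ** 2).sum() / n
    cov_xy = (k * (xs - mu_x) * (ys - mu_y)).sum() / n
    return {
        "k": k, "n": n, "mu_x": mu_x, "mu_y": mu_y,
        "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy,
        "var_y_given_x": var_y - cov_xy ** 2 / var_x,
    }


def loess_predict(xs, ys, x, bandwidth=0.5):
    """Return the estimate y_hat(x) and the variance of that estimate."""
    s = loess_statistics(xs, ys, x, bandwidth)
    y_hat = s["mu_y"] + s["cov_xy"] / s["var_x"] * (x - s["mu_x"])
    weighted_spread = (s["k"] ** 2 * (xs - s["mu_x"]) ** 2).sum() / s["var_x"]
    var_y_hat = s["var_y_given_x"] / s["n"] ** 2 * (
        (s["k"] ** 2).sum() + (x - s["mu_x"]) ** 2 / s["var_x"] * weighted_spread
    )
    return y_hat, var_y_hat
```

On noise-free linear data the local fit recovers the line exactly and the estimated variance is zero up to rounding, which is a quick sanity check on the formulas.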

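Our proxy for model error is the variance of our prediction, integrated over the test distribution, $\langle \sigma_{\hat{y}}^2 \rangle$. As we have assumed a pool-based setting in which we have a large number of unlabeled examples from that distribution, we can simply compute the above variance over a sample from the pool, and use the resulting average as our estimate.

To perform statistical active learning, we want to compute how our estimated variance will change if we add an (as yet unknown) label $\tilde{y}$ for an arbitrary $\tilde{x}$. We will write this new expected variance as $\langle \tilde{\sigma}_{\hat{y}}^2 \rangle$. While we do not know what value $\tilde{y}$ will take, our model gives us an estimated mean $\hat{y}(\tilde{x})$ and variance $\sigma^2_{\hat{y}(\tilde{x})}$ for the value, as above. We can add this “distributed” $y$ value to LOESS just as though it were a discrete one, and compute the resulting expectation $\langle \tilde{\sigma}_{\hat{y}}^2 \rangle$ in closed form. Defining $\tilde{k}$ as $K(\tilde{x}, x)$, we write:

$$\langle \tilde{\sigma}_{\hat{y}}^2 \rangle = \frac{\langle \tilde{\sigma}_{y|x}^2 \rangle}{(n + \tilde{k})^2}\left(\sum_i k_i^2 + \tilde{k}^2 + \frac{(x - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2}\left(\sum_i k_i^2 \frac{(x_i - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2} + \tilde{k}^2 \frac{(\tilde{x} - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2}\right)\right),$$

where the component expectations are computed as follows:

$$\langle \tilde{\sigma}_{y|x}^2 \rangle = \langle \tilde{\sigma}_y^2 \rangle - \frac{\langle \tilde{\sigma}_{xy}^2 \rangle}{\tilde{\sigma}_x^2},$$

$$\langle \tilde{\sigma}_y^2 \rangle = \frac{n \sigma_y^2}{n + \tilde{k}} + \frac{n \tilde{k} \left(\sigma_{y|\tilde{x}}^2 + (\hat{y}(\tilde{x}) - \mu_y)^2\right)}{(n + \tilde{k})^2},$$

$$\tilde{\mu}_x = \frac{n \mu_x + \tilde{k} \tilde{x}}{n + \tilde{k}},$$

$$\langle \tilde{\sigma}_{xy} \rangle = \frac{n \sigma_{xy}}{n + \tilde{k}} + \frac{n \tilde{k} (\tilde{x} - \mu_x)(\hat{y}(\tilde{x}) - \mu_y)}{(n + \tilde{k})^2},$$

$$\tilde{\sigma}_x^2 = \frac{n \sigma_x^2}{n + \tilde{k}} + \frac{n \tilde{k} (\tilde{x} - \mu_x)^2}{(n + \tilde{k})^2},$$

$$\langle \tilde{\sigma}_{xy}^2 \rangle = \langle \tilde{\sigma}_{xy} \rangle^2 + \frac{n^2 \tilde{k}^2 \sigma_{y|\tilde{x}}^2 (\tilde{x} - \mu_x)^2}{(n + \tilde{k})^4}.$$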
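The closed-form update can be transcribed almost term by term. The sketch below is illustrative rather than a reference implementation: it assumes a Gaussian kernel, takes $n$ to be the total kernel weight $\sum_i k_i$, and takes $\sigma^2_{y|\tilde{x}}$ to be the LOESS conditional variance computed around the candidate $\tilde{x}$. None of these choices is fixed by the text, and the names are hypothetical.

```python
import numpy as np


def kernel(a, b, bandwidth=0.5):
    """Assumed Gaussian kernel K(a, b)."""
    return np.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))


def local_stats(xs, ys, at, bandwidth):
    """Kernel-weighted statistics around the point `at`."""
    k = kernel(xs, at, bandwidth)
    n = k.sum()  # assumption: n is the total kernel weight
    mu_x, mu_y = (k * xs).sum() / n, (k * ys).sum() / n
    var_x = (k * (xs - mu_x) ** 2).sum() / n
    var_y = (k * (ys - mu_y) ** 2).sum() / n
    cov = (k * (xs - mu_x) * (ys - mu_y)).sum() / n
    return k, n, mu_x, mu_y, var_x, var_y, cov


def expected_variance_after_query(xs, ys, x, x_cand, bandwidth=0.5):
    """Expected variance of the prediction at x once x_cand is labeled."""
    # The model's belief about the unknown label at the candidate point.
    _, _, mx_c, my_c, vx_c, vy_c, cov_c = local_stats(xs, ys, x_cand, bandwidth)
    y_hat_c = my_c + cov_c / vx_c * (x_cand - mx_c)   # mean of the unknown label
    var_y_c = vy_c - cov_c ** 2 / vx_c                # its conditional variance

    # Current statistics around the reference point x.
    k, n, mu_x, mu_y, var_x, var_y, cov = local_stats(xs, ys, x, bandwidth)
    kc = kernel(x_cand, x, bandwidth)                 # weight of the candidate at x
    dx, dy = x_cand - mu_x, y_hat_c - mu_y
    total = n + kc

    # Updated statistics and expectations, following the formulas above.
    mu_x_new = (n * mu_x + kc * x_cand) / total
    var_x_new = n * var_x / total + n * kc * dx ** 2 / total ** 2
    e_var_y = n * var_y / total + n * kc * (var_y_c + dy ** 2) / total ** 2
    e_cov = n * cov / total + n * kc * dx * dy / total ** 2
    e_cov_sq = e_cov ** 2 + n ** 2 * kc ** 2 * var_y_c * dx ** 2 / total ** 4
    e_var_y_given_x = e_var_y - e_cov_sq / var_x_new

    spread = ((k ** 2 * (xs - mu_x_new) ** 2).sum()
              + kc ** 2 * (x_cand - mu_x_new) ** 2) / var_x_new
    return e_var_y_given_x / total ** 2 * (
        (k ** 2).sum() + kc ** 2 + (x - mu_x_new) ** 2 / var_x_new * spread
    )
```

To choose a query, one would average this quantity over a sample of reference points $x$ from the pool for each candidate, and request the label of the candidate with the lowest average. A candidate far from $x$ receives almost no kernel weight, so the result should then reduce to the current variance at $x$.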

Greedy Versus Batch Active Learning

It is also worth pointing out that virtually all active learning work relies on greedy strategies: the learner estimates what single example best achieves its objective, requests that one, retrains, and repeats. In theory, it is possible to plan some number of queries ahead, asking what point is best to label now, given that N − 1 more labeling opportunities remain. While such strategies have been explored in Operations Research for very small problem domains, their computational requirements make this approach infeasible for problems of the size typically encountered in machine learning.

There are cases where retraining the learner after every new label would be prohibitively expensive, or where access to labels is limited by the number of iterations as well as by the total number of labels (e.g., for a finite number of clinical trials). In this case, the learner may select a set of examples to be labeled on each iteration. This batch approach, however, is only useful if the learner is able to identify a set of examples whose expected contributions are non-redundant, which substantially complicates the process.

Cross References

Active Learning Theory

Recommended Reading

Angluin, D. (). Learning regular sets from queries and counterexamples. Information and Computation, (), –.
Angluin, D. (). Queries and concept learning. Machine Learning, , –.
Box, G. E. P., & Draper, N. (). Empirical model-building and response surfaces. New York: Wiley.
Cleveland, W., Devlin, S., & Gross, E. (). Regression by local fitting. Journal of Econometrics, , –.
Cohn, D., Atlas, L., & Ladner, R. (). Training connectionist networks with queries and selective sampling. In D. Touretzky (Ed.), Advances in neural information processing systems. Morgan Kaufmann.
Cohn, D., Ghahramani, Z., & Jordan, M. I. (). Active learning with statistical models. Journal of Artificial Intelligence Research, , –. http://citeseer.ist.psu.edu/ .html
Dasgupta, S. (). Learning mixtures of Gaussians. Foundations of Computer Science, –.
Fedorov, V. (). Theory of optimal experiments. New York: Academic Press.
Kearns, M., Li, M., Pitt, L., & Valiant, L. (). On the learnability of Boolean formulae. Proceedings of the th annual ACM conference on theory of computing (pp. –). New York: ACM Press.
Lewis, D. D., & Gale, W. A. (). A sequential algorithm for training text classifiers. Proceedings of the th annual international ACM SIGIR conference (pp. –). Dublin.
McCallum, A., & Nigam, K. (). Employing EM and pool-based active learning for text classification. In Machine learning: Proceedings of the fifteenth international conference (ICML’) (pp. –).
North, D. W. (). A tutorial introduction to decision theory. IEEE Transactions Systems Science and Cybernetics, ().
Pitt, L., & Valiant, L. G. (). Computational limitations on learning from examples. Journal of the ACM (JACM), (), –.
Robbins, H. (). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, , –.
Ruff, R., & Dietterich, T. (). What good are experiments? Proceedings of the sixth international workshop on machine learning. Ithaca, NY.
Seung, H. S., Opper, M., & Sompolinsky, H. (). Query by committee. In Proceedings of the fifth workshop on computational learning theory (pp. –). San Mateo, CA: Morgan Kaufmann.
Steck, H., & Jaakkola, T. (). Unsupervised active learning in large domains. In Proceedings of the conference on uncertainty in AI. http://citeseer.ist.psu.edu/ steckunsupervised.html

Active Learning Theory

Sanjoy Dasgupta
University of California, San Diego, La Jolla, CA, USA

Definition

The term active learning applies to a wide range of situations in which a learner is able to exert some control over its source of data. For instance, when fitting a regression function, the learner may itself supply a set of data points at which to measure response values, in the hope of reducing the variance of its estimate. Such problems have been studied for many decades under the rubric of experimental design (Chernoff, ; Fedorov, ). More recently, there has been substantial interest within the machine learning community in the specific task of actively learning binary classifiers. This task presents several fundamental statistical and algorithmic challenges, and an understanding of its mathematical underpinnings is only gradually emerging. This brief survey will describe some of the progress that has been made so far.

Learning from Labeled and Unlabeled Data

In the machine learning literature, the task of learning a classifier has traditionally been studied in the framework of supervised learning. This paradigm assumes that there is a training set consisting of data points x (from some set X) and their labels y (from some set Y), and the goal is to learn a function f : X → Y that will accurately predict the labels of data points arising in the future. Over the past years, tremendous progress has been made in resolving many of the basic questions surrounding this model, such as “how many training points are needed to learn an accurate classifier?”

Although this framework is now fairly well understood, it is a poor fit for many modern learning tasks because of its assumption that all training points automatically come labeled. In practice, it is frequently the case that the raw, abundant, easily obtained form of data is unlabeled, whereas labels must be explicitly procured and are expensive. In such situations, the reality is that the learner starts with a large pool of unlabeled points and must then strategically decide which ones it wants labeled: how best to spend its limited budget.

Example: Speech recognition. When building a speech recognizer, the unlabeled training data consists of raw speech samples, which are very easy to collect: just walk around with a microphone. For all practical purposes, an unlimited quantity of such samples can be obtained. On the other hand, the “label” for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.

Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis h perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis $h^*$. There is a significant body of beautiful theoretical work in this model (Angluin, ), but early experiments ran into some telling difficulties. One study (Baum & Lang, ) found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings (Hanneke, a).

We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points itself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, far fewer than would be required by regular supervised learning.

How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative.

Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning? On account of the enormous practical importance of active learning, there is a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). This survey will only discuss methods that have proofs of statistical well-foundedness, and whose label complexity can be explicitly analyzed.

Motivating Examples

We will start by looking at a few examples that illustrate the enormous potential of active learning and that also make it clear why analyses of this new model require concepts and intuitions that are fundamentally different from those that have already been developed for supervised learning.

Example: Thresholds on the Line

Suppose the data lie on the real line, and the available classifiers are simple thresholding functions, $H = \{h_w : w \in \mathbb{R}\}$:

$$h_w(x) = \begin{cases} +1 & \text{if } x \ge w \\ -1 & \text{if } x < w \end{cases}$$

To make things precise, let us denote the (unknown) underlying distribution on the data $(X, Y) \in \mathbb{R} \times \{+1, -1\}$ by P, and let us suppose that we want a hypothesis $h \in H$ whose error with respect to P, namely $\mathrm{err}_P(h) = P(h(X) \ne Y)$, is at most some $\epsilon$. How many labels do we need?

In supervised learning, such issues are well understood. The standard machinery of sample complexity (using VC theory) tells us that if the data are separable, that is, if they can be perfectly classified by some hypothesis in H, then we need approximately $1/\epsilon$ random labeled examples from P, and it is enough to return any classifier consistent with them.

Now suppose we instead draw $1/\epsilon$ unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of −1s followed by a sequence of +1s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just $\log 1/\epsilon$ labels: first ask for the label of the median point; if it is +1, move to the 25th percentile point, otherwise move to the 75th percentile point; and so on. Thus, for this hypothesis class, active learning gives an exponential improvement in the number of labels needed, from $1/\epsilon$ to just $\log 1/\epsilon$. For instance, if supervised learning requires a million labels, active learning requires just $\log 1{,}000{,}000 \approx 20$, literally!

It is a tantalizing possibility that even for more complicated hypothesis classes H, a sort of generalized binary search is possible. A natural next step is to consider linear separators in two dimensions.

Example: Linear Separators in $\mathbb{R}^2$

Let H be the hypothesis class of linear separators in $\mathbb{R}^2$, and suppose the data is distributed according to some density supported on the perimeter of the unit circle. It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which $\Omega(1/\epsilon)$ labels are needed to find a classifier with error rate less than $\epsilon$, no matter what active learning scheme is used. To see this, consider the following possible target hypotheses (Fig. 1):

● $h_0$: all points are positive.
● $h_i$ ($1 \le i \le 1/\epsilon$): all points are positive except for a small slice $B_i$ of probability mass $\epsilon$.

The slices $B_i$ are explicitly chosen to be disjoint, with the result that $\Omega(1/\epsilon)$ labels are needed to distinguish between these hypotheses. For instance, suppose nature chooses a target hypothesis at random from among the $h_i$, $1 \le i \le 1/\epsilon$. Then, to identify this target with probability at least /, it is necessary to query points in at least (about) half the $B_i$s.
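The counting behind this lower bound can be made explicit. The following worked bound is a sketch under simplifying assumptions that are not spelled out in the text: there are $m = 1/\epsilon$ slices, the target is uniform over $h_1, \ldots, h_m$, each query lands in a single slice and reveals only whether that slice is the negative one, and a learner that has not found the negative slice guesses uniformly among the slices it has not examined.

```latex
% q = number of distinct slices queried, m = 1/\epsilon slices in total
\Pr[\text{target identified}]
  \;=\; \underbrace{\frac{q}{m}}_{\text{negative slice was queried}}
  \;+\; \underbrace{\Bigl(1 - \frac{q}{m}\Bigr)\frac{1}{m - q}}_{\text{lucky guess among the rest}}
  \;=\; \frac{q + 1}{m}.
% Hence success probability p requires q \ge p\,m - 1 = \Omega(1/\epsilon) queries.
```

Under these assumptions, any constant success probability forces the learner to examine a constant fraction of the $1/\epsilon$ slices, which is where the $\Omega(1/\epsilon)$ label requirement comes from.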

Active Learning Theory. Figure 1. P is supported on the circumference of a circle. Each $B_i$ is an arc of probability mass $\epsilon$.

Thus for these particular target hypotheses, active learning offers little improvement in sample complexity over regular supervised learning. What about other target hypotheses in H, for instance those in which the positive and negative regions are more evenly balanced? It is quite easy (Dasgupta, ) to devise an active learning scheme which asks for $O(\min\{1/i(h), 1/\epsilon\}) + O(\log 1/\epsilon)$ labels, where $i(h) = \min\{\text{positive mass of } h, \text{negative mass of } h\}$. Thus even within this simple hypothesis class, the label complexity can run anywhere from $O(\log 1/\epsilon)$ to $\Omega(1/\epsilon)$, depending on the specific target hypothesis!

Example: An Overabundance of Unlabeled Data

In our two previous examples, the amount of unlabeled data needed was $O(1/\epsilon)$, exactly the usual sample complexity of supervised learning. But it is sometimes helpful to have significantly more unlabeled data than this. In Dasgupta (), a distribution P is described for which if the amount of unlabeled data is small (below any prespecified threshold), then the number of labels needed to learn the target linear separator is $\Omega(1/\epsilon)$; whereas if the amount of unlabeled data is much larger, then only $O(\log 1/\epsilon)$ labels are needed. This is a situation where most of the data distribution is fairly uninformative while a minuscule fraction is highly informative. A lot of unlabeled data is needed in order to get even a few of the informative points.

The Sample Complexity of Active Learning

We will think of the unlabeled points $x_1, \ldots, x_n$ as being drawn i.i.d. from an underlying distribution $P_X$ on X (namely, the marginal of the distribution P on X × Y), either all at once (a pool) or one at a time (a stream). The learner is only allowed to query the labels of points in the pool/stream; that is, it is restricted to “naturally occurring” data points rather than synthetic ones (Fig. 2). It returns a hypothesis $h \in H$ whose quality is measured by its error rate, $\mathrm{err}_P(h)$.

In regular supervised learning, it is well known that if the VC dimension of H is d, then the number of labels that will with high probability ensure $\mathrm{err}_P(h) \le \epsilon$ is roughly $O(d/\epsilon)$ if the data is separable and $O(d/\epsilon^2)$ otherwise (Haussler, ); various logarithmic terms are omitted here. For active learning, it is clear from the examples above that the VC dimension alone does not adequately characterize label complexity. Is there a different combinatorial parameter that does?

Generic Results for Separable Data

For separable data, it is possible to give upper and lower bounds on label complexity in terms of a special parameter known as the splitting index (Dasgupta, ). This is merely an existence result: the algorithm needed to realize the upper bound is intractable because it involves explicitly maintaining an $\epsilon$-cover (a coarse approximation) of the hypothesis class, and the size of this cover is in general exponential in the VC dimension. Nevertheless, it does give us an idea of the kinds of label complexity we can hope to achieve.

Example. Suppose the hypothesis class consists of intervals on the real line: $X = \mathbb{R}$ and $H = \{h_{a,b} : a, b \in \mathbb{R}\}$, where $h_{a,b}(x) = \mathbf{1}(a \le x \le b)$. Using the splitting index, the label complexity of active learning is seen to be $\tilde{\Theta}(\min\{1/P_X([a, b]), 1/\epsilon\} + \log 1/\epsilon)$ when the target hypothesis is $h_{a,b}$ (Dasgupta, ). Here the $\tilde{\Theta}$ notation is used to suppress logarithmic terms.

Example. Suppose $X = \mathbb{R}^d$ and H consists of linear separators through the origin. If $P_X$ is the uniform distribution on the unit sphere, the number of labels needed to learn a hypothesis of error $\le \epsilon$ is just $\tilde{\Theta}(d \log 1/\epsilon)$, exponentially smaller than the $\tilde{O}(d/\epsilon)$ label complexity of supervised learning. If $P_X$ is not the uniform distribution but is everywhere within a multiplicative factor $\lambda > 1$ of it, then the label complexity becomes $\tilde{O}((d \log 1/\epsilon) \log \lambda)$, provided the amount of unlabeled data is increased by a factor of $\lambda$ (Dasgupta, ).

These results are very encouraging, but the question of an efficient active learning algorithm remains open. We now consider two approaches.

Pool-based active learning
  Get a set of unlabeled points U ⊂ X
  Repeat until satisfied:
    Pick some x ∈ U to label
  Return a hypothesis h ∈ H

Stream-based active learning
  Repeat for t = 1, 2, 3, . . .:
    Choose a hypothesis $h_t$ ∈ H
    Receive an unlabeled point x ∈ X
    Decide whether to query its label

Active Learning Theory. Figure 2. Models of pool- and stream-based active learning. The data are draws from an underlying distribution $P_X$, and hypotheses h are evaluated by $\mathrm{err}_P(h)$. If we want to get this error below $\epsilon$, how many labels are needed, as a function of $\epsilon$?

Mildly Selective Sampling

The label complexity results mentioned above are based on querying maximally informative points. A less aggressive strategy is to be mildly selective, to query all points except those that are quite clearly uninformative. This is the idea behind one of the earliest generic active learning schemes (Cohn, Atlas, & Ladner, ). Data points x_1, x_2, . . . arrive in a stream, and for each one the learner makes a spot decision about whether or not to request a label. When x_t arrives, the learner behaves as follows.

● Determine whether both possible labelings, (x_t, +) and (x_t, −), are consistent with the labeled examples seen so far.
● If so, ask for the label y_t. Otherwise set y_t to be the unique consistent label.

Fortunately, the check required for the first step can be performed efficiently by making two calls to a supervised learner. Thus this is a very simple and elegant active learning scheme, although as one might expect, it is suboptimal in its label complexity (Balcan et al., ). Interestingly, there is a parameter called the disagreement coefficient that characterizes the label complexity of this scheme and also of some other mildly selective learners (Friedman, ; Hanneke, b). In practice, the biggest limitation of the algorithm above is that it assumes the data are separable. Recent results have shown how to remove this assumption (Balcan, Beygelzimer, & Langford, ; Dasgupta et al., ) and to accommodate classification loss functions other than 0−1 loss (Beygelzimer et al., ). Variants of the disagreement coefficient continue to characterize label complexity in the agnostic setting (Beygelzimer et al., ; Dasgupta et al., ).

A Bayesian Model

The query by committee algorithm (Seung, Opper, & Sompolinsky, ) is based on a Bayesian view of active learning. The learner starts with a prior distribution on the hypothesis space, and is then exposed to a stream of unlabeled data. Upon receiving x_t, the learner performs the following steps.

● Draw two hypotheses h, h′ at random from the posterior over H.
● If h(x_t) ≠ h′(x_t) then ask for the label of x_t and update the posterior accordingly.

This algorithm queries points that substantially shrink the posterior, while at the same time taking account of the data distribution. Various theoretical guarantees have been shown for it (Freund, Seung, Shamir, & Tishby, ); in particular, in the case of linear separators with a uniform data distribution, it achieves a label complexity of O(d log 1/є), the best possible. Sampling from the posterior over the hypothesis class is, in general, computationally prohibitive. However, for linear separators with a uniform prior, it can be implemented efficiently using random walks on convex bodies (Gilad-Bachrach, Navot, & Tishby, ).
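The two query-by-committee steps can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not part of the original description: the hypothesis class is a finite grid of one-dimensional threshold classifiers, the prior is uniform, and labels are noise-free, so the posterior is simply uniform over the hypotheses still consistent with the queried labels. The names `predict` and `query_by_committee` are hypothetical.

```python
import random

def predict(threshold, x):
    """Threshold classifier on the real line: label +1 iff x >= threshold."""
    return 1 if x >= threshold else -1

def query_by_committee(stream, oracle, hypotheses, rng):
    """Stream-based query by committee over a finite hypothesis class.

    Assumes a uniform prior and noise-free labels, so the posterior is
    uniform over the version space (hypotheses consistent so far).
    """
    version_space = list(hypotheses)
    queries = 0
    for x in stream:
        # Draw two hypotheses at random from the posterior.
        h1 = rng.choice(version_space)
        h2 = rng.choice(version_space)
        # Query only when the committee of two disagrees on x.
        if predict(h1, x) != predict(h2, x):
            y = oracle(x)
            queries += 1
            # Posterior update: discard hypotheses that contradict the label.
            version_space = [h for h in version_space if predict(h, x) == y]
    return version_space, queries
```

For example, with thresholds on a grid over [0, 1] and points drawn uniformly from [0, 1], the learner typically queries only a small fraction of the stream, because disagreement becomes rare once the version space has shrunk around the target.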


Other Work

In this survey, I have touched mostly on active learning results of the greatest generality, those that apply to arbitrary hypothesis classes. There is also a significant body of more specialized results.

● Efficient active learning algorithms for specific hypothesis classes. This includes an online learning algorithm for linear separators that only queries some of the points and yet achieves similar regret bounds to algorithms that query all the points (Cesa-Bianchi, Gentile, & Zaniboni, ). The label complexity of this method is yet to be characterized.
● Algorithms and label bounds for linear separators under the uniform data distribution. This particular setting has been amenable to mathematical analysis. For separable data, it turns out that a variant of the perceptron algorithm achieves the optimal O(d log 1/є) label complexity (Dasgupta, Kalai, & Monteleoni, ). A simple algorithm is also available for the agnostic setting (Balcan et al., ).

Conclusion

The theoretical frontier of active learning is mostly an unexplored wilderness. Except for a few specific cases, we do not have a clear sense of how much active learning can reduce label complexity: whether by just a constant factor, or polynomially, or exponentially. The fundamental statistical and algorithmic challenges involved, together with the huge practical importance of the field, make active learning a particularly rewarding terrain for investigation.

Cross References

Active Learning

Recommended Reading

Angluin, D. (). Queries revisited. In Proceedings of the th international conference on algorithmic learning theory (pp. –).
Balcan, M.-F., Beygelzimer, A., & Langford, J. (). Agnostic active learning. In International Conference on Machine Learning (pp. –). New York: ACM Press.
Balcan, M.-F., Broder, A., & Zhang, T. (). Margin based active learning. In Conference on Learning Theory. pp. –.
Baum, E. B., & Lang, K. (). Query learning can work poorly when a human oracle is used. In International Joint Conference on Neural Networks.
Beygelzimer, A., Dasgupta, S., & Langford, J. (). Importance weighted active learning. In International Conference on Machine Learning (pp. –). New York: ACM Press.
Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (). Worst-case analysis of selective sampling for linear-threshold algorithms. Advances in Neural Information Processing Systems.
Chernoff, H. (). Sequential analysis and optimal design. In CBMS-NSF Regional Conference Series in Applied Mathematics . SIAM.
Cohn, D., Atlas, L., & Ladner, R. (). Improving generalization with active learning. Machine Learning, (), –.
Dasgupta, S. (). Coarse sample complexity bounds for active learning. Advances in Neural Information Processing Systems.
Dasgupta, S., Kalai, A., & Monteleoni, C. (). Analysis of perceptron-based active learning. In th Annual Conference on Learning Theory. pp. –.
Dasgupta, S., Hsu, D. J., & Monteleoni, C. (). A general agnostic active learning algorithm. Advances in Neural Information Processing Systems.
Fedorov, V. V. (). Theory of optimal experiments. (W. J. Studden & E. M. Klimko, Trans.). New York: Academic Press.
Freund, Y., Seung, S., Shamir, E., & Tishby, N. (). Selective sampling using the query by committee algorithm. Machine Learning Journal, , –.
Friedman, E. (). Active learning for smooth problems. In Conference on Learning Theory. pp. –.
Gilad-Bachrach, R., Navot, A., & Tishby, N. (). Query by committee made real. Advances in Neural Information Processing Systems.
Hanneke, S. (a). Teaching dimension and the complexity of active learning. In Conference on Learning Theory. pp. –.
Hanneke, S. (b). A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning. pp. –.
Haussler, D. (). Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, (), –.
Seung, H. S., Opper, M., & Sompolinsky, H. (). Query by committee. In Conference on Computational Learning Theory, pp. –.

Adaboost

Adaboost is an ensemble learning technique, and the best known of the Boosting family of algorithms. The algorithm trains models sequentially, with a new model trained at each round. At the end of each round, misclassified examples are identified and have their emphasis increased in a new training set, which is then fed back into the start of the next round, where a new model is trained. The idea is that subsequent models should be able to compensate for errors made by earlier models. See ensemble learning for full details.
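As a rough illustration of the reweighting loop just described, the following Python sketch implements one common concrete variant. The choices here are assumptions, not details from this entry: labels are in {-1, +1}, the weak models are one-dimensional threshold stumps, and the weight update is the standard exponential rule of discrete AdaBoost. All function names are hypothetical.

```python
import math

def stump_predict(x, threshold, sign):
    """Weak model: predict `sign` if x >= threshold, otherwise -sign."""
    return sign if x >= threshold else -sign

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps sequentially, re-emphasizing mistakes."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform emphasis
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this model
        ensemble.append((alpha, threshold, sign))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps trained so far."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where the positive class is an interval, no single stump is correct everywhere, but a few rounds of reweighting are enough for the weighted vote to fit the training data.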

Adaptive Control Processes

See Bayesian Reinforcement Learning

Adaptive Real-Time Dynamic Programming

Andrew G. Barto
University of Massachusetts, Amherst, MA, USA

Synonyms

ARTDP

Definition

Adaptive Real-Time Dynamic Programming (ARTDP) is an algorithm that allows an agent to improve its behavior while interacting over time with an incompletely known dynamic environment. It can also be viewed as a heuristic search algorithm for finding shortest paths in incompletely known stochastic domains. ARTDP is based on Dynamic Programming (DP), but unlike conventional DP, which consists of off-line algorithms, ARTDP is an on-line algorithm because it uses agent behavior to guide its computation. ARTDP is adaptive because it does not need a complete and accurate model of the environment but learns a model from data collected during agent-environment interaction. When a good model is available, Real-Time Dynamic Programming (RTDP) is applicable, which is ARTDP without the model-learning component.

Motivation and Background

RTDP combines strengths of heuristic search and DP. Like heuristic search – and unlike conventional DP – it does not have to evaluate the entire state space in order to produce an optimal solution. Like DP – and unlike most heuristic search algorithms – it is applicable to nondeterministic problems. Additionally, RTDP’s performance as an anytime algorithm is better than that of conventional DP and heuristic search algorithms. ARTDP extends these strengths to problems for which a good model is not initially available.

In artificial intelligence, control engineering, and operations research, many problems require finding a policy (or control rule) that determines how an agent (or controller) should generate actions in response to the states of its environment (the controlled system). When a “cost” or a “reward” is associated with each step of the agent’s behavior, policies can be compared according to how much cost or reward they are expected to accumulate over time. The usual formulation for problems like this in the discrete-time case is the Markov Decision Process (MDP). The objective is to find a policy that minimizes (maximizes) a measure of the total cost (reward) over time, assuming that the agent–environment interaction can begin in any of the possible states. In other cases, there is a designated set of “start states” that is much smaller than the entire state set (e.g., the initial board configuration in a board game). In these cases, any given policy only has to be defined for the set of states that can be reached from the starting states when the agent is using that policy. The rest of the states will never arise when that policy is being followed, so the policy does not need to specify what the agent should do in those states.

ARTDP and RTDP exploit situations where the set of states reachable from the start states is a small subset of the entire state space. They can dramatically reduce the amount of computation needed to determine an optimal policy for the relevant states as compared with the amount of computation that a conventional DP algorithm would require to determine an optimal policy for all the states. These algorithms do this by focusing computation around simulated behavioral experiences (if there is a model available capable of simulating these experiences), or around real behavioral experiences (if no model is available).

RTDP and ARTDP were introduced by Barto, Bradtke, and Singh (). The starting point was the novel observation by Bradtke that Korf’s Learning Real-Time A* heuristic search algorithm (Korf, ) is closely related to DP. RTDP generalizes Learning Real-Time A* to stochastic problems. ARTDP is also closely related to Sutton’s Dyna system (Sutton, ) and Jalali and Ferguson’s () Transient DP. Theoretical analysis relies on the theory of asynchronous DP as described by Bertsekas and Tsitsiklis (). ARTDP and RTDP are model-based reinforcement learning algorithms, so called because they take advantage of an environment model, unlike model-free reinforcement learning algorithms such as Q-Learning and Sarsa.

Structure of Learning System

Backup Operations

A basic step of many DP and RL algorithms is a backup operation. This is an operation that updates a current estimate of the cost of an MDP’s state. (We use the cost formulation instead of reward to be consistent with the original presentation of the algorithms. In the case of rewards, this would be called the value of a state and we would maximize instead of minimize.) Suppose X is the set of MDP states. For each state x ∈ X, f(x), the cost of state x, gives a measure (which varies with different MDP formulations) of the total cost the agent is expected to incur over the future if it starts in x. If f_k(x) and f_{k+1}(x), respectively, denote the estimate of f(x) before and after a backup, a typical backup operation applied to x looks like this:

$$f_{k+1}(x) = \min_{a \in A} \Big[ c_x(a) + \sum_{y \in X} p_{xy}(a)\, f_k(y) \Big],$$

where A is the set of possible agent actions, c_x(a) is the immediate cost the agent incurs for performing action a in state x, and p_{xy}(a) is the probability that the environment makes a transition from state x to state y as a result of the agent’s action a. This backup operation is associated with the DP algorithm known as value iteration. It is also the backup operation used by RTDP and ARTDP.

Conventional DP algorithms consist of successive “sweeps” of the state set. Each sweep consists of applying a backup operation to each state. Sweeps continue until the algorithm converges to a solution. Asynchronous DP, which underlies RTDP and ARTDP, does not use systematic sweeps. States can be chosen in any way whatsoever, and as long as backups continue to be applied to all states (and some other conditions are satisfied), the algorithm will converge. RTDP is an instance of asynchronous DP in which the states chosen for backups are determined by the agent’s behavior.

The backup operation above is model-based because it uses known rewards and transition probabilities, and the values of all the states appear on the right-hand side of the equation. In contrast, a sample backup uses the value of just one sample successor state. RTDP and ARTDP are like RL algorithms in that they rely on real or simulated behavioral experience, but unlike many (but not all) RL algorithms, they use full backups like DP.

Off-Line Versus On-Line

A conventional DP algorithm typically executes off-line. When applied to finding an optimal policy for an MDP, this means that the DP algorithm executes to completion before its result (an optimal policy) is used to control the agent’s behavior. The sweeps of DP sequentially “visit” the states of the MDP, performing a backup operation on each state. But it is important not to confuse these visits with the behaving agent’s visits to states: the agent is not yet behaving while the off-line DP computation is being done. Hence, the agent’s behavior has no influence on the DP computation. The same is true for off-line asynchronous DP.

RTDP is an on-line, or “real-time,” algorithm. It is an asynchronous DP computation that executes concurrently with the agent’s behavior so that the agent’s behavior can influence the DP computation. Further, the concurrently executing DP computation can influence the agent’s behavior. The agent’s visits to states direct the “visits” to states made by the concurrent asynchronous DP computation. At the same time, the action performed by the agent is the action specified by the policy corresponding to the latest results of the DP computation: it is the “greedy” action with respect to the current estimate of the cost function.

[Figure: the asynchronous dynamic programming computation specifies actions for the behaving agent, and the behaving agent specifies the states to back up.]
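The full backup and the conventional sweep-based use of it can be sketched in Python as follows. The three-state problem, the dictionary layout of `MODEL`, and the function names are hypothetical choices made for illustration; only the backup formula itself comes from the text.

```python
# Hypothetical model: MODEL[x][a] = (c_x(a), {y: p_xy(a), ...}).
# State 3 is an absorbing, zero-cost goal and therefore has no actions.
MODEL = {
    0: {"a": (1.0, {1: 0.5, 0: 0.5}), "b": (2.0, {2: 1.0})},
    1: {"a": (1.0, {2: 1.0})},
    2: {"a": (1.0, {3: 1.0})},
}

def full_backup(f, x, model):
    """Return min over a of c_x(a) + sum over y of p_xy(a) * f[y]."""
    return min(cost + sum(p * f[y] for y, p in successors.items())
               for cost, successors in model[x].values())

def value_iteration(model, states, tol=1e-9):
    """Conventional DP: repeat full sweeps until the costs stop changing."""
    f = {x: 0.0 for x in states}
    while True:
        # One synchronous sweep: back up every state from the old estimate.
        new_f = {x: full_backup(f, x, model) if x in model else 0.0
                 for x in states}
        if max(abs(new_f[x] - f[x]) for x in states) <= tol:
            return new_f
        f = new_f
```

In this toy problem the sweeps settle on costs 3, 2, and 1 for states 0, 1, and 2: taking action "b" in state 0 (cost 2, then one more step to the goal) is cheaper than repeatedly attempting the unreliable action "a".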

In the simplest version of RTDP, when a state is visited by the agent, the DP computation performs the model-based backup operation given above on that same state. In general, for each step of the agent’s behavior, RTDP can apply the backup operation to each of an arbitrary set of states, provided that the agent’s current state is included. For example, at each step of behavior, a limited-horizon look-ahead search can be conducted from the agent’s current state, with the backup operation applied to each of the states generated in the search. Essentially, RTDP is an asynchronous DP computation with the computational effort focused along simulated or actual behavioral trajectories.

Learning A Model

ARTDP is the same as RTDP except that (1) an environment model is updated using any on-line model-learning, or system identification, method, (2) the current environment model is used in performing the RTDP backup operations, and (3) the agent has to perform exploratory actions occasionally instead of always greedy actions as in RTDP. This last step is essential to ensure that the environment model eventually converges to the correct model. If the state and action sets are finite, the simplest way to learn a model is to keep counts of the number of times each transition occurs for each action and convert these frequencies to probabilities, thus forming the maximum-likelihood model.

Summary of Theoretical Results

When RTDP and ARTDP are applied to stochastic optimal path problems, one can prove that under certain conditions they converge to optimal policies without the need to apply backup operations to all the states. Indeed, in some problems, only a small fraction of the states need to be visited. A stochastic optimal path problem is an MDP with a nonempty set of start states and a nonempty set of goal states. Each transition until a goal state is reached has a nonnegative immediate cost, and once the agent reaches a goal state, it stays there and thereafter incurs zero cost. Each episode of agent experience begins with a start state. An optimal policy is one that minimizes the cost of every state, i.e., minimizes f(x) for all states x. Under some relatively mild conditions, every optimal policy is guaranteed to eventually reach a goal state.

A state x is relevant if a start state s and an optimal policy exist such that x can be reached from s when the agent uses that policy. If we could somehow know which states are relevant, we could restrict DP to just these states and obtain an optimal policy. But this is not possible because knowing which states are relevant requires knowledge of optimal policies, which is what one is seeking. However, under certain conditions, without requiring repeated visits to all the irrelevant states, RTDP produces a policy that is optimal for all the relevant states. The conditions are that (1) the initial cost of every goal state is zero, (2) there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state, (3) all immediate costs for transitions from non-goal states are strictly positive, and (4) none of the initial costs are larger than the actual costs. This result is proved in Barto et al. () by combining aspects of Korf’s () proof for LRTA* with results for asynchronous DP.
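The simplest version of RTDP described above, applied to a stochastic optimal path problem, can be sketched as follows. The tiny problem, the `MODEL` layout, and the function names are hypothetical; the sketch assumes a known model (so it is RTDP rather than ARTDP), zero initial costs (which satisfy the requirement that no initial cost exceeds the actual cost), and ties between actions broken by dictionary order.

```python
import random

# Hypothetical problem: MODEL[x][a] = (c_x(a), [(p_xy(a), y), ...]).
MODEL = {
    0: {"a": (1.0, [(0.5, 1), (0.5, 0)]), "b": (2.0, [(1.0, 2)])},
    1: {"a": (1.0, [(1.0, 2)])},
    2: {"a": (1.0, [(1.0, 3)])},
}
GOAL = 3  # absorbing goal state with zero cost

def q_value(f, state, action):
    """Immediate cost plus expected cost of the successor state."""
    cost, transitions = MODEL[state][action]
    return cost + sum(p * f[y] for p, y in transitions)

def backup(f, state):
    """Full backup of `state`; returns the greedy action."""
    best = min(MODEL[state], key=lambda a: q_value(f, state, a))
    f[state] = q_value(f, state, best)
    return best

def rtdp(start, trials, rng):
    """Run trials from `start`, backing up only the states actually visited."""
    f = {0: 0.0, 1: 0.0, 2: 0.0, GOAL: 0.0}
    for _ in range(trials):
        state = start
        while state != GOAL:
            action = backup(f, state)           # back up, then act greedily
            _, transitions = MODEL[state][action]
            probs = [p for p, _ in transitions]
            successors = [y for _, y in transitions]
            state = rng.choices(successors, weights=probs)[0]
    return f
```

In this example the agent initially prefers action "a" in state 0, but once the cost estimates have grown it switches to "b" and stops visiting state 1, which is not relevant under the optimal policy.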

Special Cases and Extensions

A number of special cases and extensions of RTDP have been developed that improve performance over the basic version. Some examples are as follows. Bonet and Geffner’s () Labeled RTDP labels states that have already been “solved,” allowing faster convergence than RTDP. Feng, Hansen, and Zilberstein () proposed Symbolic RTDP, which selects a set of states to update at each step using symbolic model-checking techniques. The RTDP convergence theorem still applies because this is a special case of RTDP. Smith and Simmons () developed Focused RTDP, which maintains a priority value for each state to better direct search and produce faster convergence. Hansen and Zilberstein’s () LAO* uses some of the same ideas as RTDP to produce a heuristic search algorithm that can find solutions with loops to non-deterministic heuristic search problems. Many other variants are possible. Extending ARTDP instead of RTDP in all of these ways would produce analogous algorithms that could be used when a good model is not available.

Cross References

Anytime Algorithm
Approximate Dynamic Programming
Reinforcement Learning
System Identification

Recommended Reading

Barto, A., Bradtke, S., & Singh, S. (). Learning to act using real-time dynamic programming. Artificial Intelligence, (–), –.
Bertsekas, D., & Tsitsiklis, J. (). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice-Hall.
Bonet, B., & Geffner, H. (a). Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proceedings of the th international conference on automated planning and scheduling (ICAPS-). Trento, Italy.
Bonet, B., & Geffner, H. (b). Faster heuristic search algorithms for planning with uncertainty and full feedback. In Proceedings of the international joint conference on artificial intelligence (IJCAI-). Acapulco, Mexico.
Feng, Z., Hansen, E., & Zilberstein, S. (). Symbolic generalization for on-line planning. In Proceedings of the th conference on uncertainty in artificial intelligence. Acapulco, Mexico.
Hansen, E., & Zilberstein, S. (). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, , –.
Jalali, A., & Ferguson, M. (). Computationally efficient control algorithms for Markov chains. In Proceedings of the th conference on decision and control (pp. –), Tampa, FL.
Korf, R. (). Real-time heuristic search. Artificial Intelligence, (–), –.
Smith, T., & Simmons, R. (). Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proceedings of the national conference on artificial intelligence (AAAI). Boston, MA: AAAI Press.
Sutton, R. (). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the th international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.

Adaptive Resonance Theory

Gail A. Carpenter, Stephen Grossberg
Boston University, Boston, MA, USA

Synonyms

ART

Definition

Adaptive resonance theory, or ART, is both a cognitive and neural theory of how the brain quickly learns to categorize, recognize, and predict objects and events in a changing world, and a set of algorithms that computationally embody ART principles and that are used in large-scale engineering and technological applications wherein fast, stable, and incremental learning about a complex changing environment is needed. ART clarifies the brain processes from which conscious experiences emerge. It predicts a functional link between processes of consciousness, learning, expectation, attention, resonance, and synchrony (CLEARS), including the prediction that “all conscious states are resonant states.” This connection clarifies how brain dynamics enable a behaving individual to autonomously adapt in real time to a rapidly changing world. ART predicts how top-down attention works and regulates fast stable learning of recognition categories. In particular, ART articulates a critical role for “resonant” states in driving fast stable learning; and thus the name adaptive resonance. These resonant states are bound together, using top-down attentive feedback in the form of learned expectations, into coherent representations of the world. ART hereby clarifies one important sense in which the brain carries out predictive computation. ART has explained and successfully predicted a wide range of behavioral and neurobiological data, including data about human cognition and the dynamics of spiking laminar cortical networks. ART algorithms have been used in large-scale applications such as medical database prediction, remote sensing, airplane design, and the control of autonomous adaptive robots.

Motivation and Background

Many current learning algorithms do not emulate the way in which humans and other animals learn. The power of human and animal learning provides high motivation to discover computational principles whereby machines can learn with similar capabilities. Humans and animals experience the world on the fly, and carry out incremental learning of sequences of episodes in real time. Often such learning is unsupervised, with the world itself as the teacher. Learning can also proceed with an unpredictable mixture of unsupervised and supervised learning trials. Such learning goes on successfully in a world that is nonstationary; that is, the rules of which can change unpredictably through time. Moreover, humans and animals can learn quickly and stably through time. A single important experience can be remembered for a long time. ART proposes a solution of this stability–plasticity dilemma (Grossberg, ) by showing how brains learn quickly without forcing catastrophic forgetting of already learned, and still successful, memories. Thus, ART autonomously carries out fast, yet stable, incremental learning under both unsupervised and supervised learning conditions in response to a complex nonstationary world.

In contrast, many current learning algorithms use batch learning in which all the information about the world to be learned is available at a single time. Other algorithms are not defined unless all learning trials are supervised. Yet other algorithms become unstable in a nonstationary world, or become unstable if learning is fast; that is, if an event can be fully learned on a single learning trial. ART overcomes these problems. Some machine learning algorithms are feed-forward clustering algorithms that undergo catastrophic forgetting in a nonstationary world. The ART solution of the stability–plasticity dilemma depends upon feedback, or top-down, expectations that are matched against bottom-up data and thereby focus attention upon critical feature patterns. A good enough match leads to resonance and fast learning. A big enough mismatch leads to hypothesis testing or memory search that discovers and learns a more predictive category. Thus, ART is a self-organizing expert system that avoids the brittleness of traditional expert systems.

The world is filled with uncertainty, so probability concepts seem relevant to understanding how brains learn about uncertain data. This fact has led some machine learning practitioners to assume that brains obey Bayesian laws. However, the Bayes rule is so general that it can accommodate any system in nature. Additional computational principles and mechanisms must augment Bayes to distinguish a brain from, say, a hydrogen atom or a storm. Moreover, probabilistic models often use nonlocal computations. ART shows how the brain embodies a novel kind of real-time probability theory, hypothesis testing, prediction, and decision-making, the local computations of which adapt to a nonstationary world. These ART principles and mechanisms go beyond Bayesian analysis, and are embodied parsimoniously in the laminar circuits of cerebral cortex. Indeed, the cortex embodies a new kind of laminar computing that reconciles the best properties of feedforward and feedback processing, digital and analog processing, and data-driven bottom-up processing combined with hypothesis-driven top-down processing (Grossberg, ).

Structure of Learning System

How CLEARS Mechanisms Interact

Humans are intentional beings who learn expectations about the world and make predictions about what is about to happen. Humans are also attentional beings who focus processing resources upon a restricted amount of incoming information at any time. Why are we both intentional and attentional beings, and are these two types of processes related? The stability–plasticity dilemma and its solution using resonant states provide a unifying framework for understanding these issues.

To clarify the role of sensory or cognitive expectations, and of how a resonant state is activated, suppose you were asked to “find the yellow ball as quickly as possible, and you will win a $, prize.” Activating an expectation of a “yellow ball” enables its more rapid detection, and with a more energetic neural response. Sensory and cognitive top-down expectations hereby lead to excitatory matching with consistent bottom-up data. Mismatch between top-down expectations and bottom-up data can suppress the mismatched part of the bottom-up data, to focus attention upon the matched, or expected, part of the bottom-up data.

Excitatory matching and attentional focusing on bottom-up data using top-down expectations generate resonant brain states: When there is a good enough match between bottom-up and top-down signal patterns between two or more levels of processing, their positive feedback signals amplify and prolong their mutual activation, leading to a resonant state. Amplification and prolongation of activity triggers learning in the more slowly varying adaptive weights that control the signal flow along pathways from cell to cell. Resonance hereby provides a global context-sensitive indicator that the system is processing data worthy of learning, hence the name adaptive resonance theory.

In summary, ART predicts a link between the mechanisms which enable us to learn quickly and stably about a changing world, and the mechanisms that enable us to learn expectations about such a world, test hypotheses about it, and focus attention upon information that we find interesting. ART clarifies this link by asserting that to solve the stability–plasticity dilemma, only resonant states can drive rapid new learning. It is just a step from here to propose that those experiences which can attract our attention and guide our future lives by being learned are also among the ones that are conscious. Support for this additional assertion derives from the many modeling studies whose simulations of behavioral and brain data using resonant states map onto properties of conscious experiences in those experiments.

The type of learning within the sensory and cognitive domain that ART mechanizes is match learning: Match learning occurs only if a good enough match occurs between bottom-up information and a learned top-down expectation that is read out by an active recognition category, or code. When such an approximate match occurs, previously learned knowledge can be refined. Match learning raises the question of what happens if a match is not good enough. How does such a model escape perseveration on already learned representations? If novel information cannot form a good enough match with the expectations that are read out by previously learned recognition categories, then a memory search or hypothesis testing is triggered, which leads to selection and learning of a new recognition category, rather than catastrophic forgetting of an old one. Figure illustrates how this happens in an ART model; it is discussed in great detail below.

In contrast, learning within spatial and motor processes is proposed to be mismatch learning that continuously updates sensory-motor maps or the gains of sensory-motor commands. As a result, we can stably learn what is happening in a changing world, thereby solving the stability–plasticity dilemma, while adaptively updating our representations of where objects are and how to act upon them using bodies whose parameters change continuously through time. Brain systems that use inhibitory matching and mismatch learning cannot generate resonances; hence, their representations are not conscious.
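The match-learning and memory-search cycle described above can be illustrated with a heavily simplified Python sketch in the style of ART 1 for binary input patterns. The specific formulas are assumptions that do not appear in this section: categories are searched in order of the choice value |I ∧ w| / (β + |w|), a category resonates when the match ratio |I ∧ w| / |I| reaches the vigilance level, and fast learning replaces the prototype w by I ∧ w. The function names and the parameter `beta` are hypothetical, and every input pattern is assumed to contain at least one active feature.

```python
def overlap(pattern, prototype):
    """Number of features active in both the input and the prototype."""
    return sum(a & b for a, b in zip(pattern, prototype))

def art1_learn(patterns, vigilance, beta=0.5):
    """Assign each binary pattern to a category, creating new ones as needed."""
    categories = []    # learned prototypes (top-down expectations)
    assignments = []
    for pattern in patterns:
        # Memory search: try categories in order of decreasing choice value.
        order = sorted(
            range(len(categories)),
            key=lambda j: -overlap(pattern, categories[j])
            / (beta + sum(categories[j])),
        )
        chosen = None
        for j in order:
            match = overlap(pattern, categories[j]) / sum(pattern)
            if match >= vigilance:
                # Resonance: refine the prototype to the matched features.
                categories[j] = [a & b for a, b in zip(pattern, categories[j])]
                chosen = j
                break
            # Otherwise reset: this category is rejected, search continues.
        if chosen is None:
            # No category matched well enough: learn a new one.
            categories.append(list(pattern))
            chosen = len(categories) - 1
        assignments.append(chosen)
    return categories, assignments
```

Raising the vigilance parameter makes the match criterion stricter, so more and narrower categories are created, while a familiar pattern presented again still selects its old category instead of overwriting another one.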

Complementary Computing in the Brain: Resonance and Reset

It has been mathematically proved that match learning within an ART model leads to stable memories in response to arbitrary lists of events to be learned (e.g., Carpenter & Grossberg, ). However, match learning also has a serious potential weakness: If you can only learn when there is a good match between bottom-up data and learned top-down expectations, then how do you ever learn anything that you do not already know? ART proposes that this problem is solved by the brain by using an interaction between complementary processes of resonance and reset, which are predicted to control properties of attention and memory search, respectively. These complementary processes help our brains to balance between the complementary demands of processing the familiar and the unfamiliar, the expected and the unexpected.

Organization of the brain into complementary processes is predicted to be a general principle of brain design that is not just found in ART (Grossberg, ). A complementary process can individually compute some properties well, but cannot, by itself, process other complementary properties. In thinking intuitively about complementary properties, one can imagine puzzle pieces fitting together. Both pieces are needed to finish the puzzle. Complementary brain processes are more dynamic than any such analogy: Pairs of complementary processes interact to form emergent properties which overcome their complementary deficiencies to compute complete information with which to represent or control some aspect of intelligent behavior.

The resonance process in the complementary pair of resonance and reset is predicted to take place in the What cortical stream, notably in the inferotemporal and prefrontal cortex. Here top-down expectations are matched against bottom-up inputs. When a top-down expectation achieves a good enough match with bottom-up data, this match process focuses attention upon those feature clusters in the bottom-up input that are expected. If the expectation is close enough to the input pattern, then a state of resonance develops as the attentional focus takes hold.
Figure illustrates these ART ideas in a simple two-level example. Here, a bottom-up input pattern, or vector, I activates a pattern X of activity across the feature detectors of the first level F1. For example, a visual scene may be represented by the features comprising its boundary and surface representations. This feature pattern represents the relative importance of different features in the input pattern I. In Fig. a, the pattern peaks represent more activated feature detector cells, and the troughs, less-activated feature detectors. This feature pattern sends signals S through an adaptive filter to the second level F2 at which a compressed representation Y (also called a recognition category, or a symbol) is activated in response to the distributed input T. Input T is computed by multiplying the signal vector S by a matrix of adaptive weights that can be altered through learning. The representation Y is compressed by competitive interactions across F2 that allow only a small subset of its most strongly activated cells to remain active in response to T. The pattern Y in the figure indicates that a small number of category cells may be activated to different degrees. These category cells, in turn, send top-down signals U to F1. The vector U is converted into the top-down expectation V by being multiplied by another matrix of adaptive weights. When V is received by F1, a matching process takes place between the input vector I and V which selects that subset X* of F1 features that were “expected” by the active F2 category Y. The set of these selected features is the emerging “attentional focus.”

Adaptive Resonance Theory. Figure . Search for a recognition code within an ART learning circuit: (a) The input pattern I is instated across the feature detectors at level F1 as a short term memory (STM) activity pattern X. Input I also nonspecifically activates the orienting system with a gain that is called vigilance (ρ); that is, all the input pathways converge with gain ρ onto the orienting system and try to activate it. STM pattern X is represented by the hatched pattern across F1. Pattern X both inhibits the orienting system and generates the output pattern S. Pattern S is multiplied by learned adaptive weights, also called long term memory (LTM) traces. These LTM-gated signals are added at F2 cells, or nodes, to form the input pattern T, which activates the STM pattern Y across the recognition categories coded at level F2. (b) Pattern Y generates the top-down output pattern U which is multiplied by top-down LTM traces and added at F1 nodes to form a prototype pattern V that encodes the learned expectation of the active F2 nodes. Such a prototype represents the set of commonly shared features in all the input patterns capable of activating Y. If V mismatches I at F1, then a new STM activity pattern X∗ is selected at F1. X∗ is represented by the hatched pattern. It consists of the features of I that are confirmed by V. Mismatched features are inhibited. The inactivated nodes corresponding to unconfirmed features of X are unhatched. The reduction in total STM activity which occurs when X is transformed into X∗ causes a decrease in the total inhibition from F1 to the orienting system. (c) If inhibition decreases sufficiently, the orienting system releases a nonspecific arousal wave to F2; that is, a wave of activation that equally activates all F2 nodes. This wave instantiates the intuition that “novel events are arousing.” This arousal wave resets the STM pattern Y at F2 by inhibiting Y. (d) After Y is inhibited, its top-down prototype signal is eliminated, and X can be reinstated at F1. The prior reset event maintains inhibition of Y during the search cycle. As a result, X can activate a different STM pattern Y at F2. If the top-down prototype due to this new Y pattern also mismatches I at F1, then the search for an appropriate F2 code continues until a more appropriate F2 representation is selected. Such a search cycle represents a type of nonstationary hypothesis testing. When search ends, an attentive resonance develops and learning of the attended data is initiated (adapted with permission from Carpenter and Grossberg ()). The distributed ART architecture supports fast stable learning with arbitrarily distributed F2 codes (Carpenter, )
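The two-level circuit described above can be sketched in a few lines. This is only an illustrative simplification, not the model's actual dynamics: it assumes winner-take-all competition at F2 (the text allows a small subset of active cells), takes the signal vector S to equal the input I, and uses the component-wise minimum as the matching operation. The function names and the weight values are hypothetical.

```python
def bottom_up_input(S, W_bu):
    """T_j = sum_i S_i * W_bu[i][j]: signals S gated by bottom-up adaptive weights."""
    n_categories = len(W_bu[0])
    return [sum(S[i] * W_bu[i][j] for i in range(len(S)))
            for j in range(n_categories)]


def choose_category(T):
    """Assumed winner-take-all compression: index of the most active F2 node."""
    return max(range(len(T)), key=lambda j: T[j])


def attended_features(I, V):
    """X*: the features of I confirmed by the top-down expectation V."""
    return [min(i, v) for i, v in zip(I, V)]


# Hypothetical example: four F1 features, two F2 categories.
I = [1, 1, 0, 1]
W_bu = [[0.5, 0.2],   # rows index F1 features, columns index F2 categories
        [0.5, 0.2],
        [0.0, 0.2],
        [0.0, 0.2]]
W_td = [[1, 1, 0, 0],  # row j is the prototype V read out by category j
        [1, 1, 1, 1]]

T = bottom_up_input(I, W_bu)       # here S is taken to be I itself
J = choose_category(T)             # category with the largest input T_j
V = W_td[J]                        # top-down expectation of the chosen category
X_star = attended_features(I, V)   # emerging attentional focus
```

In this toy setting the fourth feature of I is not part of the chosen prototype, so it drops out of X*, which is the sense in which mismatched features are inhibited.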

Binding Distributed Feature Patterns and Symbols During Conscious Resonances

If the top-down expectation is close enough to the bottom-up input pattern, then the pattern X∗ of attended features reactivates the category Y which, in turn, reactivates X∗. The network hereby locks into a resonant state through a positive feedback loop that dynamically links, or binds, the attended features across X∗ with their category, or symbol, Y. Resonance itself embodies another type of complementary processing. Indeed, there seem to be complementary processes both within and between cortical processing streams (Grossberg, ). This particular complementary relation occurs between distributed feature patterns and the compressed categories, or symbols, that selectively code them: Individual features at F1 have no meaning on their own, just as the pixels in a picture are meaningless one-by-one. The category, or symbol, in F2 is sensitive to the global patterning of these features, and can selectively fire in response to this pattern. But it cannot represent the “contents” of the experience, including their conscious qualia, due to the very fact that a category is a compressed or “symbolic” representation. Practitioners of artificial intelligence have claimed that neural models can process distributed features, but not symbolic representations. This is not, of course, true in the brain. Nor is it true in ART. Resonance between these two types of information converts the pattern of attended features into a coherent context-sensitive state that is linked to its category through feedback. This coherent state, which binds together distributed features and symbolic categories, can enter consciousness while it binds together spatially distributed features into either a stable equilibrium or a synchronous oscillation. The original ART article (Grossberg, ) predicted the existence of such synchronous oscillations, which were there described in terms of their mathematical properties as “order-preserving limit cycles.” See Carpenter, Grossberg, Markuzon, Reynolds & Rosen () and Grossberg & Versace () for reviews of confirmed ART predictions, including predictions about synchronous oscillations.

Resonance Links Intentional and Attentional Information Processing to Learning

In ART, the resonant state, rather than bottom-up activation, is predicted to drive learning. This state persists long enough, and at a high enough activity level, to activate the slower learning processes in the adaptive weights that guide the flow of signals between bottom-up and top-down pathways between levels F1 and F2 in Fig. . This viewpoint helps to explain how adaptive weights that were changed through previous learning can regulate the brain’s present information processing, without learning about the signals that they are currently processing unless they can initiate a resonant state. Through resonance as a mediating event, one can understand from a deeper mechanistic view why humans are intentional beings who are continually predicting what may next occur, and why we tend to learn about the events to which we pay attention. More recent versions of ART, notably the synchronous matching ART (SMART) model (Grossberg & Versace, ), show how a match may lead to fast gamma oscillations that facilitate spike-timing dependent plasticity (STDP), whereas mismatch can lead to slower beta oscillations that lower the probability that mismatched events can be learned by an STDP learning law.

Complementary Attentional and Orienting Systems Control Resonance Versus Reset

A sufficiently bad mismatch between an active top-down expectation and a bottom-up input, say because the input represents an unfamiliar type of experience, can drive a memory search. Such a mismatch within the attentional system is proposed to activate a complementary orienting system, which is sensitive to unexpected and unfamiliar events. ART suggests that this orienting system includes the nonspecific thalamus and the hippocampal system. See Grossberg & Versace () for a summary of data supporting this prediction. Output signals from the orienting system rapidly reset the recognition category that has been reading out the poorly matching top-down expectation (Figs. b and c). The cause of the mismatch is hereby removed, thereby freeing the system to activate a different recognition category (Fig. d). The reset event hereby triggers memory search, or hypothesis testing, which automatically leads to the selection of a recognition category that can better match the input. If no such recognition category exists, say because the bottom-up input represents a truly novel experience, then the search process automatically activates an as yet uncommitted population of cells, with which to learn about the novel information. In order for a top-down expectation to match a newly discovered recognition category, its top-down adaptive weights initially have large values, which are pruned by the learning of a particular expectation. This learning process works well under both unsupervised and supervised conditions (Carpenter et al., ). Unsupervised learning means that the system can learn how to categorize novel input patterns without any external feedback. Supervised learning uses predictive errors to let the system know whether it has categorized the information correctly. Supervision can force a search for new categories that may be culturally determined, and are not based on feature similarity alone. For example, separating the letters E and F, which have similar features, into separate recognition categories is culturally determined. Such error-based feedback enables variants of E and F to learn their own category and top-down expectation, or prototype. The complementary, but interacting, processes of attentive-learning and orienting-search together realize a type of error correction through hypothesis testing that can build an ever-growing, self-refining internal model of a changing world.

Controlling the Content of Conscious Experiences: Exemplars and Prototypes

What combinations of features or other information are bound together into conscious object or event representations? One view is that exemplars or individual experiences are learned because humans can have very specific memories. For example, we can all recognize the particular faces of our friends. On the other hand, storing every remembered experience as exemplars can lead to a combinatorial explosion of memory, as well as to unmanageable problems of memory retrieval. A possible way out is suggested by the fact that humans can learn prototypes which represent general properties of the environment (Posner & Keele, ). For example, we can recognize that everyone has a face. But then how do we learn specific episodic memories? ART provides an answer that overcomes the problems faced by earlier models. ART prototypes are not merely averages of the exemplars that are classified by a category, as is typically assumed in classical prototype models. Rather, they are the actively selected critical feature patterns upon which the top-down expectations of the category focus attention. In addition, the generality of the information that is coded by these critical feature patterns is controlled by a gain control process, called vigilance control, which can be influenced by environmental feedback or internal volition (Carpenter & Grossberg, ). Low vigilance permits the learning of general categories with abstract prototypes. High vigilance forces a memory search to occur for a new category when even small mismatches exist between an exemplar and the category that it activates. As a result, in the limit of high vigilance, the category prototype may encode an individual exemplar.

Vigilance is computed within the orienting system of an ART model (Fig. b–d). It is here that bottom-up excitation from all the active features in an input pattern I is compared with inhibition from all the active features in a distributed feature representation across F1. If the ratio of the total activity across the active features in F1 (i.e., the “matched” features) to the total activity of all the features in I is less than a vigilance parameter ρ (Fig. b), then a reset wave is activated (Fig. c), which can drive the search for another category to classify the exemplar. In other words, the vigilance parameter controls how bad a match can be tolerated before search for a new category is initiated. If the vigilance parameter is low, then many exemplars can influence the learning of a shared prototype, by chipping away at the features that are not shared with all the exemplars. If the vigilance parameter is high, then even a small difference between a new exemplar and a known prototype (e.g., F vs. E) can drive the search for a new category with which to represent F.

One way to control vigilance is by a process of match tracking. Here, in response to a predictive error (e.g., D is predicted in response to F), the vigilance parameter increases until it is just higher than the ratio of active features in F1 to total features in I. In other words, vigilance “tracks” the degree of match between input exemplar and matched prototype. This is the minimal level of vigilance that can trigger a reset wave and thus a memory search for a new category. Match tracking realizes a minimax learning rule that conjointly maximizes category generality while it minimizes predictive error. In other words, match tracking uses the least memory resources that can prevent errors from being made. Because vigilance can vary across learning trials, recognition categories capable of encoding widely differing degrees of generalization or abstraction can be learned by a single ART system. Low vigilance leads to broad generalization and abstract prototypes. High vigilance leads to narrow generalization and to prototypes that represent fewer input exemplars, even a single exemplar. Thus a single ART system may be used, say, to learn abstract prototypes with which to recognize abstract categories of faces and dogs, as well as “exemplar prototypes” with which to recognize individual views of faces and dogs. ART models hereby try to learn the most general category that is consistent with the data.
This tendency can, for example, lead to the type of overgeneralization that is seen in young children until further learning leads to category refinement.
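The vigilance test and match tracking described above can be illustrated with a small sketch. It assumes that activities are nonnegative numbers, that the input has nonzero total activity, and that "just higher than" the match ratio can be modeled by adding a small constant; the function names and the `eps` parameter are hypothetical and do not come from the text.

```python
def match_ratio(I, X_star):
    """Total matched activity at F1 divided by total input activity."""
    return sum(X_star) / sum(I)


def needs_reset(I, X_star, rho):
    """A reset wave is triggered when the match ratio falls below vigilance rho."""
    return match_ratio(I, X_star) < rho


def match_track(I, X_star, eps=1e-6):
    """After a predictive error, raise vigilance just above the current match
    ratio: the smallest vigilance that forces a reset of the active category."""
    return match_ratio(I, X_star) + eps


# Hypothetical example: two of three active input features are matched.
I = [1, 1, 0, 1]
X_star = [1, 1, 0, 0]

low_vigilance_reset = needs_reset(I, X_star, rho=0.5)    # match 2/3 is tolerated
high_vigilance_reset = needs_reset(I, X_star, rho=0.9)   # match 2/3 is too poor
tracked_rho = match_track(I, X_star)                     # slightly above 2/3
reset_after_tracking = needs_reset(I, X_star, tracked_rho)
```

With low vigilance the partial match is accepted and the category can generalize; after a predictive error, the tracked vigilance is just large enough to reject the same match and start a search.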

Memory Consolidation and the Emergence of Rules: Direct Access to Globally Best Match

As sequences of inputs are practiced over learning trials, the search process eventually converges upon stable categories. It has been mathematically proved (Carpenter & Grossberg, ) that familiar inputs directly access the category whose prototype provides the best match globally, while unfamiliar inputs engage the orienting subsystem to trigger memory searches for better categories until they become familiar. This process continues until the memory capacity, which can be chosen arbitrarily large, is fully utilized. The process whereby search is automatically disengaged is a form of memory consolidation that emerges from network interactions. Emergent consolidation does not preclude structural consolidation at individual cells, since the amplified and prolonged activities that subserve a resonance may be a trigger for learning-dependent cellular processes, such as protein synthesis and transmitter production. It has also been shown that the adaptive weights which are learned by some ART models can, at any stage of learning, be translated into fuzzy IF-THEN rules (Carpenter et al., ). Thus the ART model is a self-organizing rule-discovering production system as well as a neural network. These examples show that the claims of some cognitive scientists and AI practitioners that neural network models cannot learn rule-based behaviors are as incorrect as the claims that neural models cannot learn symbols.

How the Laminar Circuits of Cerebral Cortex Embody ART Mechanisms

More recent versions of ART have shown how predicted ART mechanisms may be embodied within known laminar microcircuits of the cerebral cortex. These include the family of LAMINART models (Fig. ; see Raizada & Grossberg, ) and the synchronous matching ART, or SMART, model (Fig. ; see Grossberg & Versace, ). SMART, in particular, predicts how a top-down match may lead to fast gamma oscillations that facilitate spike-timing dependent plasticity (STDP), whereas a mismatch can lead to slower beta oscillations that prevent learning by a STDP learning law. At least three neurophysiological labs have recently reported data consistent with the SMART prediction.

Review of ART and ARTMAP Algorithms

From Winner-Take-All to Distributed Coding

As noted above, ART networks serve both as models of human cognitive information processing (Carpenter, ;

Grossberg, , ) and as neural systems for technology transfer (Caudell, Smith, Escobedo, & Anderson, ; Parsons & Carpenter, ). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP (Carpenter et al., ), ART-EMAP, ARTMAP-IC, and Gaussian ARTMAP. Early ARTMAP systems, including fuzzy ARTMAP, employ winner-take-all (WTA) coding, whereby each input activates a single category node during both training and testing. When a node is first activated during training, it is mapped to its designated output class. Starting with ART-EMAP, subsequent systems have used distributed coding during testing, which typically improves predictive accuracy, while avoiding the computational problems inherent in the use of distributed code representations during training. In order to address these problems, distributed ARTMAP (Carpenter, ; Carpenter, Milenova, & Noeske, ) introduced a new network configuration, in addition to new learning laws. Comparative analysis of the performance of ARTMAP systems on a variety of benchmark problems has led to the identification of a default ARTMAP network, which features simplicity of design and robust performance in many application domains. Default ARTMAP employs winner-take-all coding during training and distributed coding during testing within a distributed ARTMAP network architecture. With winner-take-all coding during testing, default ARTMAP reduces to a version of fuzzy ARTMAP.

Adaptive Resonance Theory. Figure . LAMINART circuit clarifies how known cortical connections within and across cortical layers join the layer 6 → 4 and layer 2/3 circuits to form a laminar circuit model for the interblobs and pale stripe regions of cortical areas V1 and V2. Inhibitory interneurons are shown filled-in black. (a) The LGN provides bottom-up activation to layer 4 via two routes. First, it makes a strong connection directly into layer 4. Second, LGN axons send collaterals into layer 6, and thereby also activate layer 4 via the 6 → 4 on-center off-surround path. The combined effect of the bottom-up LGN pathways is to stimulate layer 4 via an on-center off-surround, which provides divisive contrast normalization (Grossberg, ) of layer 4 cell responses. (b) Folded feedback carries attentional signals from higher cortex into layer 4 of V1, via the modulatory 6 → 4 path. Corticocortical feedback axons tend preferentially to originate in layer 6 of the higher area and to terminate in layer 1 of the lower cortex, where they can excite the apical dendrites of layer 5 pyramidal cells whose axons send collaterals into layer 6. The triangle in the figure represents such a layer 5 pyramidal cell. Several other routes through which feedback can pass into V1 layer 6 exist. Having arrived in layer 6, the feedback is then “folded” back up into the feedforward stream by passing through the 6 → 4 on-center off-surround path (Bullier, Hupé, James, & Girard, ). (c) Connecting the 6 → 4 on-center off-surround to the layer 2/3 grouping circuit: like-oriented layer 4 simple cells with opposite contrast polarities compete (not shown) before generating half-wave rectified outputs that converge onto layer 2/3 complex cells in the column above them. Just like attentional signals from higher cortex, as shown in (b), groupings that form within layer 2/3 also send activation into the folded feedback path, to enhance their own positions in layer 4 beneath them via the 6 → 4 on-center, and to suppress input to other groupings via the 6 → 4 off-surround. There exist direct layer 2/3 → 6 connections in macaque V1, as well as indirect routes via layer 5. (d) Top-down corticogeniculate feedback from V1 layer 6 to LGN also has an on-center off-surround anatomy, similar to the 6 → 4 path. The on-center feedback selectively enhances LGN cells that are consistent with the activation that they cause (Sillito, Jones, Gerstein, & West, ), and the off-surround contributes to length-sensitive (endstopped) responses that facilitate grouping perpendicular to line ends. (e) The entire V1/V2 circuit: V2 repeats the laminar pattern of V1 circuitry, but at a larger spatial scale. In particular, the horizontal layer 2/3 connections have a longer range in V2, allowing above-threshold perceptual groupings between more widely spaced inducing stimuli to form. V1 layer 2/3 projects up to V2 layers 6 and 4, just as LGN projects to layers 6 and 4 of V1. Higher cortical areas send feedback into V2 which ultimately reaches layer 6, just as V2 feedback acts on layer 6 of V1. Feedback paths from higher cortical areas straight into V1 (not shown) can complement and enhance feedback from V2 into V1. Top-down attention can also modulate layer 2/3 pyramidal cells directly by activating both the pyramidal cells and inhibitory interneurons in that layer. The inhibition tends to balance the excitation, leading to a modulatory effect. These top-down attentional pathways tend to synapse in layer 1, as shown in Fig. b. Their synapses on apical dendrites in layer 1 are not shown, for simplicity. (Reprinted with permission from Raizada & Grossberg ())

Complement Coding: Learning both Absent and Present Features

ART and ARTMAP employ a preprocessing step called complement coding (Fig. ), which models the nervous system’s ubiquitous use of the computational design known as opponent processing. Balancing an entity against its opponent, as in agonist–antagonist muscle pairs, allows a system to act upon relative quantities, even as absolute magnitudes may vary unpredictably. In ART systems, complement coding is analogous to retinal ON-cells and OFF-cells. When the learning system is presented with a set of input features $a \equiv (a_1 \ldots a_i \ldots a_M)$, complement coding doubles the number of input components, presenting to the network both the original feature vector and its complement. Complement coding allows an ART system to encode, within its critical feature patterns of memory, features that are consistently absent on an equal basis with features that are consistently present. Features that are sometimes absent and sometimes present when a given category is learning are regarded as uninformative with respect to that category. Since its introduction, complement coding has been a standard element of ART and ARTMAP networks, where it plays multiple computational roles, including input normalization. However, this device is not particular to ART, and could, in principle, be used to preprocess the inputs to any type of system. To implement complement coding, component activities $a_i$ of a feature vector $a$ are scaled so that $0 \le a_i \le 1$. For each feature i, the ON activity $a_i$ determines the complementary OFF activity $(1 - a_i)$. Both $a_i$ and $(1 - a_i)$ are represented in the 2M-dimensional system input vector $A = (a \mid a^c)$ (Fig. ). Subsequent network computations then operate in this 2M-dimensional input space. In particular, learned weight vectors $w_J$ are 2M-dimensional.

ARTMAP Search and Match Tracking in Fuzzy ARTMAP

As illustrated by Fig. , the ART matching process triggers either learning or a parallel memory search. If search ends at an established code, the memory


Adaptive Resonance Theory. Figure . SMART model overview. A first-order and higher-order cortical area are linked by corticocortical and corticothalamocortical connections. The thalamus is subdivided into specific first-order, secondorder, nonspecific, and thalamic reticular nucleus (TRN). The thalamic matrix (one cell population shown as an open ring) provides priming to layer , where layer pyramidal cell apical dendrites terminate. The specific thalamus relays sensory information (first-order thalamus) or lower-order cortical information (second-order thalamus) to the respective cortical areas via plastic connections. The nonspecific thalamic nucleus receives convergent BU input and inhibition from the TRN, and projects to layer of the laminar cortical circuit, where it regulates reset and search in the cortical circuit (see text). Corticocortical feedback connections link layer II of the higher cortical area to layer of the lower cortical area, whereas thalamocortical feedback originates in layer II and terminates in the specific thalamus after synapsing on the TRN. Layer II corticothalamic feedback matches the BU input in the specific thalamus. V receives two parallel BU thalamocortical pathways. The LGN→V layer pathway and the modulatory LGN→V layer I → pathway provide divisive contrast normalization of layer cell responses. The intracortical loop V layer →/→→I → pathway (folded feedback) enhances the activity of winning layer / cells at their own positions via the I → on-center, and suppresses input to other layer / cells via the I → off-surround. V also activates the BU V→V corticocortical pathways (V layer /→V layers I and ) and the BU corticothalamocortical pathways (V layer →PULV→V layers I and ), where the layer I → pathway provides divisive contrast normalization to V layer cells analogously to V. Corticocortical feedback from V layer II →V layer →I → also uses the same modulatory I → pathway. 
TRN cells of the two thalamic sectors are linked via gap junctions, which provide synchronization of the two thalamocortical sectors when processing BU stimuli (reprinted with permission from Grossberg & Versace ())

representation may either remain the same or incorporate new information from matched portions of the current input. While this dynamic applies to arbitrarily distributed activation patterns, the F2 search and code for fuzzy ARTMAP (Fig. ) describes a winner-take-all system. Before ARTMAP makes a class prediction, the bottom-up input A is matched against the top-down

Adaptive Resonance Theory. Figure . Complement coding transforms an M-dimensional feature vector $a = (a_1 \ldots a_i \ldots a_M)$ into a 2M-dimensional system input vector $A = (A_1 \ldots A_M \mid A_{M+1} \ldots A_{2M}) = (a \mid a^c)$, in which the ON channel carries $a$ and the OFF channel carries $a^c = ((1 - a_1) \ldots (1 - a_i) \ldots (1 - a_M))$. A complement-coded system input represents both the degree to which a feature i is present ($a_i$) and the degree to which that feature is absent ($1 - a_i$)
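A minimal sketch of complement coding follows, assuming the features have already been scaled to the interval [0, 1]. The function name `complement_code` is hypothetical. The example also illustrates the normalization property: the components of A always sum to M, whatever the feature values are.

```python
def complement_code(a):
    """Return A = (a | a^c): the ON activities followed by the OFF activities."""
    if any(not 0.0 <= ai <= 1.0 for ai in a):
        raise ValueError("features must be scaled to [0, 1] before coding")
    return list(a) + [1.0 - ai for ai in a]


a = [0.2, 0.7, 1.0]          # M = 3 scaled features
A = complement_code(a)       # 2M = 6 components
total_activity = sum(A)      # equals M up to floating-point rounding
```

Because each pair $a_i$ and $1 - a_i$ sums to one, the city-block norm of every complement-coded input is the same, which is the input normalization role mentioned in the text.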

Adaptive Resonance Theory. Figure . A fuzzy ART search cycle, with a distributed ART network configuration (Carpenter, ). The ART 1 search cycle (Carpenter and Grossberg, ) is the same, but allows only binary inputs and did not originally feature complement coding. The match field F1 represents the matched activation pattern $x = A \wedge w_J$, where $\wedge$ denotes the component-wise minimum, or fuzzy intersection, between the bottom-up input A and the top-down expectation $w_J$. If the matched pattern fails to meet the matching criterion, then the active code is reset at F2, and the system searches for another code y that better represents the input. The match/mismatch decision takes place in the ART orienting system. Each active feature in the input pattern A excites the orienting system with gain equal to the vigilance parameter ρ. Hence, with complement coding, the total excitatory input is $\rho|A| = \rho \sum_{i=1}^{2M} A_i = \rho M$. Active cells in the matched pattern x inhibit the orienting system, leading to a total inhibitory input equal to $-|x| = -\sum_{i=1}^{2M} x_i$. If $\rho|A| - |x| \le 0$, then the orienting system remains quiet, allowing resonance and learning to occur. If $\rho|A| - |x| > 0$, then the reset signal $r = 1$, initiating search for a better matching code
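The search cycle in this caption can be sketched as a loop over candidate nodes. Several assumptions are made here that are not taken from the text: nodes are tried in order of the standard fuzzy ART choice function $|A \wedge w_j| / (\alpha + |w_j|)$ rather than the choice-by-difference function, learning is fast ($w_J$ is replaced by $A \wedge w_J$), and an uncommitted node is added only when every committed node has been reset. The names `search` and `alpha` are hypothetical.

```python
def fuzzy_and(p, q):
    """Component-wise minimum (fuzzy intersection) of two vectors."""
    return [min(pi, qi) for pi, qi in zip(p, q)]


def norm(p):
    """City-block norm |p| for nonnegative vectors."""
    return sum(p)


def search(A, weights, rho, alpha=0.001):
    """Return the index of the resonating node, committing a new one if needed."""
    # Order committed nodes by an assumed choice function (largest first).
    order = sorted(
        range(len(weights)),
        key=lambda j: norm(fuzzy_and(A, weights[j])) / (alpha + norm(weights[j])),
        reverse=True,
    )
    for J in order:
        x = fuzzy_and(A, weights[J])          # matched pattern at F1
        if rho * norm(A) - norm(x) <= 0:      # orienting system stays quiet
            weights[J] = x                    # fast learning
            return J
        # otherwise r = 1: node J is reset and the next candidate is tried
    weights.append(list(A))                   # uncommitted node learns A
    return len(weights) - 1


A = [0.75, 0.25, 0.25, 0.75]                  # complement-coded (0.75, 0.25)
low = [[0.5, 0.25, 0.25, 0.5]]
J_low = search(A, low, rho=0.7)               # |x|/|A| = 0.75 passes

high = [[0.5, 0.25, 0.25, 0.5]]
J_high = search(A, high, rho=0.9)             # same match now triggers reset
```

With vigilance 0.7 the existing category resonates and keeps its weights; with vigilance 0.9 the same category is reset and a new node is committed to the input.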

learned expectation, or critical feature pattern, that is read out by the active node (Fig. b). The matching criterion is set by a vigilance parameter ρ. As noted above, low vigilance permits the learning of abstract, prototype-like patterns, while high vigilance requires the learning of specific, exemplar-like patterns. When ¯ a new input arrives, vigilance equals a baseline level ρ. Baseline vigilance is set equal to zero by default, in order to maximize generalization. Vigilance rises only after the system has made a predictive error. The internal control process that determines how far it must rise in order to correct the error is called match tracking. As vigilance rises, the network is required to pay more attention to how well top-down expectations match the current bottom-up input. Match tracking (Fig. ) forces an ARTMAP system not only to reset its mistakes, but to learn from them. With match tracking and fast learning, each ARTMAP network passes the next input test, which requires that,

match tracking dr = –(r – r– )+ΓRr c dt

J y

F2

wJ F1

x = A ∧ wJ

predictive error R=1

match r A – x ≤0 r c= 1 –x

if a training input were re-presented immediately after a learning trial, it would directly activate the correct output class, with no predictive errors or search. Match tracking thus simultaneously implements the design goals of maximizing generalization and minimizing predictive error, without requiring the choice of a fixed matching criterion. ARTMAP memories thereby include both broad and specific pattern classes, with the latter typically formed as exceptions to the more general “rules” defined by the former. ARTMAP learning typically produces a wide variety of such mixtures, whose exact composition depends upon the order of training exemplar presentation.

Unless they have already activated all their coding nodes, ARTMAP systems contain a reserve of nodes that have never been activated, with weights at their initial values. These uncommitted nodes compete with the previously active committed nodes, and an uncommitted node is chosen over poorly matched committed nodes. An ARTMAP design constraint specifies that an active uncommitted node should not reset itself. Weights initially begin with w_iJ = 1. Thus, when the active node J is uncommitted, x = A ∧ w_J = A at the match field. Then, ρ∣A∣ − ∣x∣ = ρ∣A∣ − ∣A∣ = (ρ − 1)∣A∣. Thus ρ∣A∣ − ∣x∣ ≤ 0, and an uncommitted node does not trigger a reset, provided ρ ≤ 1.

Adaptive Resonance Theory. Figure . ARTMAP match tracking. When an active node J meets the matching criterion (ρ∣A∣ − ∣x∣ ≤ 0), the reset signal r = 0 and the node makes a prediction. If the predicted output is incorrect, the feedback signal R = 1. While R = r^c = 1, ρ increases rapidly. As soon as ρ > ∣x∣/∣A∣, r switches to 1, which both halts the increase of ρ and resets the active F2 node. From one chosen node to the next, ρ decays to slightly below ∣x∣/∣A∣ (MT–). On the time scale of learning, ρ returns to ρ̄.

ART Geometry

Fuzzy ART long-term memories are visualized as hyper-rectangles, called category boxes. The weight vector w_J is interpreted geometrically as a box R_J whose ON-channel corner u_J and OFF-channel corner v_J are, in the format of the complement-coded input vector, defined by (u_J ∣ v_J^C) ≡ w_J (Fig. ). For fuzzy ART with the choice-by-difference F0 → F2 signal function T_J, an input a activates the node J of the closest category box R_J, according to the L1 (city-block) metric. In case of a tie, as when a lies in more than one box, the node with the smallest R_J is chosen, where ∣R_J∣ is defined as the sum of the edge lengths ∑_{i=1}^{M} ∣v_iJ − u_iJ∣. The chosen node J will reset if ∣R_J ⊕ a∣ > M(1 − ρ), where R_J ⊕ a is the smallest box enclosing both R_J and a. Otherwise, R_J expands toward R_J ⊕ a during learning. With fast learning, R_J^new = R_J^old ⊕ a.
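The equivalence between the algebraic reset rule (ρ∣A∣ − ∣x∣ > 0) and the geometric one (∣R_J ⊕ a∣ > M(1 − ρ)) can be checked numerically. The following is a minimal sketch, assuming inputs in [0, 1], complement coding A = (a ∣ a^C), and the fuzzy AND taken as the component-wise minimum; all function names are hypothetical and chosen only for illustration.

```python
def complement_code(a):
    """Return A = (a | a^C) for a feature vector a with components in [0, 1]."""
    return list(a) + [1.0 - ai for ai in a]


def fuzzy_and(p, q):
    """Component-wise minimum, the fuzzy intersection used at the match field."""
    return [min(pi, qi) for pi, qi in zip(p, q)]


def resets(a, w, rho):
    """Algebraic reset rule: reset when rho*|A| - |x| > 0, with x = A AND w."""
    A = complement_code(a)
    x = fuzzy_and(A, w)
    return rho * sum(A) - sum(x) > 0


def box_corners(w):
    """Recover the box corners (u, v) from a weight vector w = (u | v^C)."""
    m = len(w) // 2
    return w[:m], [1.0 - wi for wi in w[m:]]


def expanded_box_size(a, w):
    """City-block size of the smallest box enclosing both the box of w and a."""
    u, v = box_corners(w)
    return sum(max(vi, ai) - min(ui, ai) for ui, vi, ai in zip(u, v, a))


def fast_learn(a, w):
    """Fast learning: the new weight is A AND w_old, i.e. the box expands to enclose a."""
    return fuzzy_and(complement_code(a), w)


# Illustrative values (assumed): a box with corners u = (0.3, 0.4), v = (0.6, 0.7)
w = [0.3, 0.4, 0.4, 0.3]
a = [0.8, 0.5]
M = len(a)
for rho in (0.5, 0.7):
    geometric = expanded_box_size(a, w) > M * (1.0 - rho)
    print(rho, resets(a, w, rho), geometric)
```

An uncommitted node, whose weights all equal 1, passes the same test for any ρ ≤ 1, in line with the design constraint stated earlier.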


Adaptive Resonance Theory. Figure . Fuzzy ART geometry. The weight of a category node J is represented in complement-coding form as w_J = (u_J ∣ v_J^C), and the M-dimensional vectors u_J and v_J define the corners of the category box R_J. When M = 2, the size of R_J equals its width plus its height. During learning, R_J expands toward R_J ⊕ a, defined as the smallest box enclosing both R_J and a. Node J will reset before learning if ∣R_J ⊕ a∣ > M(1 − ρ).

Biasing Against Previously Active Category Nodes and Previously Attended Features During Attentive Memory Search

Activity x at the ART field F1 continuously computes the match between the field’s bottom-up and top-down input patterns. A reset signal r shuts off the active F2 node J when x fails to meet the matching criterion determined by the value of the vigilance parameter ρ. Reset alone does not, however, trigger a search for a different F2 node: unless the prior activation has left an enduring trace within the F0-to-F2 subsystem, the network will simply reactivate the same node as before. As modeled in ART 3, biasing the bottom-up input to the coding field F2 to favor the previously inactive nodes implements search by allowing the network to activate a new node in response to a reset signal. The ART 3 search mechanism defines a medium-term memory (MTM) in the F0-to-F2 adaptive filter, which biases the system against re-choosing a node that had just produced a reset. A presynaptic interpretation of this bias is transmitter depletion, or habituation (Fig. ). Medium-term memory in all ART models allows the network to shift attention among learned categories at the coding field F2 during search. The new biased ART network (Carpenter & Gaddam, ) introduces a second medium-term memory that shifts attention among input features, as well as categories, during search.

Adaptive Resonance Theory. Figure . ART 3 search implements a medium-term memory within the F0-to-F2 pathways, which biases the system against choosing a category node that had just produced a reset.

Self-Organizing Rule Discovery

This foundation of computational principles and mechanisms has enabled the development of an ART information fusion system that is capable of incrementally learning a cognitive hierarchy of rules in response to probabilistic, incomplete, and even contradictory data that are collected by multiple observers (Carpenter, Martens, & Ogas, ).
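The role of medium-term memory in the attentive memory search described earlier in this entry can be illustrated with a small sketch. It assumes a deliberately crude model: each bottom-up pathway carries a gain that starts at 1 and is multiplied by a hypothetical `depletion` factor when its node is reset, standing in for transmitter depletion. The function name, the `depletion` parameter, and the use of precomputed match ratios ∣x∣/∣A∣ are illustrative assumptions, not the ART 3 equations.

```python
def search(signals, matches, rho, depletion=0.0):
    """Return the index of the first chosen node that meets vigilance, or None.

    signals   -- positive bottom-up signals T_j to the coding nodes
    matches   -- match ratios |x|/|A| that each node would produce if chosen
    rho       -- vigilance; node j resets when matches[j] < rho
    depletion -- assumed MTM factor applied to a pathway after its node resets
    """
    n = len(signals)
    mtm = [1.0] * n  # 1.0 means the pathway is undepleted
    for _ in range(n):
        biased = [t * g for t, g in zip(signals, mtm)]
        j = max(range(n), key=lambda i: biased[i])
        if matches[j] >= rho:
            return j  # resonance: matching criterion met
        mtm[j] *= depletion  # bias against re-choosing the node just reset
    return None  # every committed node was reset


print(search([0.9, 0.8, 0.5], [0.4, 0.7, 0.95], rho=0.6))
```

Without the update to `mtm`, the highest-signal node would be chosen again after every reset, which is the failure mode the text attributes to reset without an enduring trace.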

Cross References 7Bayes Rule 7Bayesian Methods

Recommended Reading
Bullier, J., Hupé, J. M., James, A., & Girard, P. (). Functional interactions between areas V1 and V2 in the monkey. Journal of Physiology Paris, (–), –.
Carpenter, G. A. (). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, , –.
Carpenter, G. A., & Gaddam, S. C. (). Biased ART: A neural architecture that shifts attention towards previously disregarded features following an incorrect prediction. Neural Networks, .
Carpenter, G. A., & Grossberg, S. (). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, , –.
Carpenter, G. A., & Grossberg, S. (). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, , –.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, , –.
Carpenter, G. A., Martens, S., & Ogas, O. J. (). Self-organizing information fusion and hierarchical knowledge discovery: A new framework using ARTMAP neural networks. Neural Networks, , –.
Carpenter, G. A., Milenova, B. L., & Noeske, B. W. (). Distributed ARTMAP: A neural network for fast distributed supervised learning. Neural Networks, , –.
Caudell, T. P., Smith, S. D. G., Escobedo, R., & Anderson, M. (). NIRS: Large scale ART neural architectures for engineering design retrieval. Neural Networks, , –.
Grossberg, S. (). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, , –.
Grossberg, S. (). How does a brain build a cognitive code? Psychological Review, , –.
Grossberg, S. (). The link between brain, learning, attention, and consciousness. Consciousness and Cognition, , –.
Grossberg, S. (). The complementary brain: Unifying brain dynamics and modularity. Trends in Cognitive Sciences, , –.
Grossberg, S. (). How does the cerebral cortex work? Development, learning, attention, and 3D vision by laminar circuits of visual cortex. Behavioral and Cognitive Neuroscience Reviews, , –.
Grossberg, S. (). Consciousness CLEARS the mind. Neural Networks, , –.
Grossberg, S., & Versace, M. (). Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Research, , –.
Parsons, O., & Carpenter, G. A. (). ARTMAP neural networks for information fusion and data mining: Map production and target recognition methodologies. Neural Networks, (), –.
Posner, M. I., & Keele, S. W. (). On the genesis of abstract ideas. Journal of Experimental Psychology, , –.
Raizada, R., & Grossberg, S. (). Towards a theory of the laminar architecture of cerebral cortex: Computational clues from the visual system. Cerebral Cortex, , –.
Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (). Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex. Nature, , –.

Adaptive System 7Complexity in Adaptive Systems

Agent
In computer science, the term “agent” usually denotes a software abstraction of a real entity which is capable of acting with a certain degree of autonomy. For example, in artificial societies, agents are software abstractions of real people, interacting in an artificial, simulated environment. Various authors have proposed different definitions of agents. Most of them would agree on the following set of agent properties:
● Persistence: Code is not executed on demand but runs continuously and decides autonomously when it should perform some activity.
● Social ability: Agents are able to interact with other agents.
● Reactivity: Agents perceive the environment and are able to react.
● Proactivity: Agents exhibit goal-directed behavior and can take the initiative.

Agent-Based Computational Models 7Artificial Societies

Agent-Based Modeling and Simulation 7Artificial Societies

Agent-Based Simulation Models 7Artificial Societies

AIS 7Artificial Immune Systems

Algorithm Evaluation Geoffrey I. Webb Monash University, Victoria, Australia

Definition Algorithm evaluation is the process of assessing a property or properties of an algorithm.

Motivation and Background
It is often valuable to assess the efficacy of an algorithm. In many cases, such assessment is relative, that is, evaluating which of several alternative algorithms is best suited to a specific application.

Processes and Techniques
Many learning algorithms have been proposed. In order to understand the relative merits of these alternatives, it is necessary to evaluate them. The primary approaches to evaluation can be characterized as either theoretical or experimental. Theoretical evaluation uses formal methods to infer properties of the algorithm, such as its computational complexity (Papadimitriou, ), and also employs the tools of 7computational learning theory to assess learning theoretic properties. Experimental evaluation applies the algorithm to learning tasks to study its performance in practice.

Many different types of property may be relevant to assess, depending upon the intended application. These include algorithmic properties, such as time and space complexity. These algorithmic properties are often assessed separately with respect to performance when learning a 7model, that is, at 7training time, and performance when applying a learned model, that is, at 7test time.

Other types of property that are often studied are the properties of the models that are learned (see 7model evaluation). Strictly speaking, such properties should be assessed with respect to a specific application or class of applications. However, much machine learning research includes experimental studies in which algorithms are compared using a set of data sets with little or no consideration given to what class of applications those data sets might represent. It is dangerous to draw general conclusions about relative performance on any application from relative performance on this sample of some unknown class of applications. Such experimental evaluation has become known disparagingly as a bake-off.

An approach to experimental evaluation that may be less subject to the limitations of bake-offs is the use of experimental evaluation to assess a learning algorithm’s 7bias and variance profile. Bias and variance measure properties of an algorithm’s propensities in learning models rather than directly being properties of the models that are learned. Hence, they may provide more general insights into the relative characteristics of alternative algorithms than do assessments of the performance of learned models on a finite number of applications. One example of such use of bias–variance analysis is found in Webb ().

Techniques for experimental algorithm evaluation include 7bootstrap sampling, 7cross-validation, and 7holdout evaluation.
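As an illustration of experimental evaluation, the sketch below compares two toy learners by k-fold cross-validation on a single small data set. Everything here is assumed for illustration: the contiguous fold layout, the two learners, and the function names. In keeping with the caution above, an error estimate obtained this way describes performance on this data set only.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_val_error(learn, data, k):
    """Average 0-1 error of `learn` over k folds; data is a list of (x, label)."""
    errors = 0
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = learn([data[i] for i in train_idx])
        errors += sum(model(data[i][0]) != data[i][1] for i in test_idx)
    return errors / len(data)


def majority_learner(train):
    """Baseline: always predict the most common training label (ties: smallest label)."""
    labels = [y for _, y in train]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top


def nearest_neighbour_learner(train):
    """1-nearest-neighbour on a single numeric feature."""
    return lambda x: min(train, key=lambda xy: abs(xy[0] - x))[1]


data = [(0.1, "a"), (0.9, "b"), (0.2, "a"), (0.8, "b"), (0.3, "a"),
        (0.7, "b"), (0.15, "a"), (0.85, "b"), (0.25, "a"), (0.75, "b")]
for name, learner in [("majority", majority_learner),
                      ("1-NN", nearest_neighbour_learner)]:
    print(name, cross_val_error(learner, data, k=5))
```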

Cross References 7Computational Learning Theory 7Model Evaluation

Recommended Reading
Hastie, T., Tibshirani, R., & Friedman, J. H. (). The elements of statistical learning. New York: Springer.
Mitchell, T. M. (). Machine learning. New York: McGraw-Hill.
Papadimitriou, C. H. (). Computational complexity. Reading, MA: Addison-Wesley.
Webb, G. I. (). MultiBoosting: A technique for combining boosting and wagging. Machine Learning, (), –.
Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

Analogical Reasoning 7Instance-Based Learning

Analysis of Text 7Text Mining

Analytical Learning 7Deductive Learning 7Explanation-Based Learning

Ant Colony Optimization Marco Dorigo, Mauro Birattari Université Libre de Bruxelles, Brussels, Belgium

Synonyms ACO

Definition
Ant colony optimization (ACO) is a population-based metaheuristic for the solution of difficult combinatorial optimization problems. In ACO, each individual of the population is an artificial agent that builds incrementally and stochastically a solution to the considered problem. Agents build solutions by moving on a graph-based representation of the problem. At each step their moves define which solution components are added to the solution under construction. A probabilistic model is associated with the graph and is used to bias the agents’ choices. The probabilistic model is updated online by the agents so as to increase the probability that future agents will build good solutions.

Motivation and Background
Ant colony optimization is so called because of its original inspiration: the foraging behavior of some ant species. In particular, in Beckers, Deneubourg, and Goss () it was demonstrated experimentally that ants are able to find the shortest path between their nest and a food source by collectively exploiting the pheromone they deposit on the ground while walking. Similar to real ants, ACO’s artificial agents, also called artificial ants, deposit artificial pheromone on the graph of the problem they are solving. The amount of pheromone each artificial ant deposits is proportional to the quality of the solution the artificial ant has built. These artificial pheromones are used to implement a probabilistic model that is exploited by the artificial ants to make decisions during their solution construction activity.

Structure of the Optimization System
Let us consider a minimization problem (S, f), where S is the set of feasible solutions, and f is the objective function, which assigns to each solution s ∈ S a cost value f(s). The goal is to find an optimal solution s∗, that is, a feasible solution of minimum cost. The set of all optimal solutions is denoted by S∗. Ant colony optimization attempts to solve this minimization problem by repeating the following two steps:
● Candidate solutions are constructed using a parameterized probabilistic model, that is, a parameterized probability distribution over the solution space.
● The candidate solutions are used to modify the model in a way that is intended to bias future sampling toward low cost solutions.

The Ant Colony Optimization Probabilistic Model

We assume that the combinatorial optimization problem (S, f) is mapped on a problem that can be characterized by the following list of items:
● A finite set C = {c_1, c_2, . . . , c_{N_C}} of components, where N_C is the number of components.
● A finite set X of states of the problem, where a state is a sequence x = ⟨c_i, c_j, . . . , c_k, . . .⟩ over the elements of C. The length of a sequence x, that is, the number of components in the sequence, is expressed by ∣x∣. The maximum length of a sequence is bounded by a positive constant n < +∞.
● A set of (candidate) solutions S, which is a subset of X (i.e., S ⊆ X).
● A set of feasible states X̃, with X̃ ⊆ X, defined via a set of constraints Ω.
● A nonempty set S∗ of optimal solutions, with S∗ ⊆ X̃ and S∗ ⊆ S.

Given the above formulation, artificial ants build candidate solutions by performing randomized walks on the completely connected, weighted graph G = (C, L, T), where the vertices are the components C, the set L fully connects the components C, and T is a vector of so-called pheromone trails τ. (Because this formulation is always possible, ACO can in principle be applied to any combinatorial optimization problem.) Pheromone trails can be associated with components, connections, or both. Here we assume that the pheromone trails are associated with connections, so that τ(i, j) is the pheromone associated with the connection between components i and j. It is straightforward to extend the algorithm to the other cases. The graph G is called the construction graph.

To construct candidate solutions, each artificial ant is first put on a randomly chosen vertex of the graph. It then performs a randomized walk by moving at each step from vertex to vertex on the graph in such a way that the next vertex is chosen stochastically according to the strength of the pheromone currently on the arcs.

While moving from one node to another of the graph G, constraints Ω may be used to prevent ants from building infeasible solutions. Formally, the solution construction behavior of a generic ant can be described as follows:

ant_solution_construction
● For each ant:
  – Select a start node c_1 according to some problem dependent criterion.
  – Set k = 1 and x_k = ⟨c_1⟩.
● While x_k = ⟨c_1, c_2, . . . , c_k⟩ ∈ X̃, x_k ∉ S, and the set J_{x_k} of components that can be appended to x_k is not empty, select the next node (component) c_{k+1} randomly according to:

\[
P_T(c_{k+1} = c \mid x_k) =
\begin{cases}
\dfrac{F_{(c_k,c)}\bigl(\tau(c_k,c)\bigr)}{\sum_{(c_k,y)\in J_{x_k}} F_{(c_k,y)}\bigl(\tau(c_k,y)\bigr)} & \text{if } (c_k,c)\in J_{x_k},\\[2ex]
0 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where a connection (c_k, y) belongs to J_{x_k} if and only if the sequence x_{k+1} = ⟨c_1, c_2, . . . , c_k, y⟩ satisfies the constraints Ω (that is, x_{k+1} ∈ X̃) and F_{(i,j)}(z) is some monotonic function, a common choice being z^α η(i, j)^β, where α, β > 0, and the η(i, j) are heuristic values measuring the desirability of adding component j after i.

If at some stage x_k ∉ S and J_{x_k} = ∅, that is, the construction process has reached a dead-end, the current state x_k is discarded. However, this situation may be prevented by allowing artificial ants to build infeasible solutions as well. In such a case, an infeasibility penalty term is usually added to the cost function. Nevertheless, in most of the settings in which ACO has been applied, the dead-end situation does not occur.

For certain problems, one may find it useful to use a more general scheme, where F depends on the pheromone values of several “related” connections rather than just a single one. Moreover, instead of the random-proportional rule above, different selection schemes, such as the pseudo-random-proportional rule (Dorigo & Gambardella, ), may be used.

The Ant Colony Optimization Pheromone Update

Many different schemes for pheromone update have been proposed within the ACO framework. For an extensive overview, see Dorigo and Stützle (). Most pheromone updates can be described using the following generic scheme:

Generic_ACO_Update
● ∀s ∈ Ŝ_t, ∀(i, j) ∈ s : τ(i, j) ← τ(i, j) + Q_f(s ∣ S_1, . . . , S_t),
● ∀(i, j) : τ(i, j) ← (1 − ρ) ⋅ τ(i, j),

where S_i is the sample in the ith iteration, ρ, 0 ≤ ρ < 1, is the evaporation rate, and Q_f(s ∣ S_1, . . . , S_t) is some “quality function,” which is typically required to be nonincreasing with respect to f and is defined over the “reference set” Ŝ_t.

Different ACO algorithms may use different quality functions and reference sets. For example, in the very first ACO algorithm, Ant System (Dorigo, Maniezzo, & Colorni, , ), the quality function is simply 1/f(s) and the reference set is Ŝ_t = S_t. In a subsequently proposed scheme, called iteration-best update (Dorigo & Gambardella, ), the reference set is a singleton containing the best solution within S_t (if there are several iteration-best solutions, one of them is chosen randomly). For the global-best update (Dorigo et al., ; Stützle & Hoos, ), the reference set contains the best among all the iteration-best solutions (and if there is more than one global-best solution, the earliest one is chosen). In Dorigo et al. () an elitist strategy was introduced, in which the update is a combination of the previous two.

In case a good lower bound on the optimal solution cost is available, one may use the following quality function (Maniezzo, ):

\[
Q_f(s \mid S_1, \ldots, S_t) = \tau_0 \left( 1 - \frac{f(s) - LB}{\bar{f} - LB} \right) = \tau_0 \, \frac{\bar{f} - f(s)}{\bar{f} - LB},
\tag{2}
\]

where f̄ is the average of the costs of the last k solutions and LB is the lower bound on the optimal solution cost. With this quality function, the solutions are evaluated by comparing their cost to the average cost of the other recent solutions, rather than by using the absolute cost values. In addition, the quality function is automatically scaled based on the proximity of the average cost to the lower bound.


A pheromone update that slightly differs from the generic update described above was used in ant colony system (ACS) (Dorigo & Gambardella, ). There the pheromone is evaporated by the ants online during the solution construction, hence only the pheromone involved in the construction evaporates. Another modification of the generic update was introduced in MAX–MIN Ant System (Stützle & Hoos, , ), which uses maximum and minimum pheromone trail limits. With this modification, the probability of generating any particular solution is kept above some positive threshold. This helps to prevent search stagnation and premature convergence to suboptimal solutions.
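The construction rule and the generic update can be put together in a short sketch. The code below is an Ant System style illustration on a small symmetric traveling salesman instance, under several assumptions that are not part of the entry: the components are cities, the constraints Ω forbid revisiting a city (so dead-ends cannot occur), η(i, j) is taken as the inverse distance, pheromone is deposited symmetrically, and all parameter values and function names are hypothetical.

```python
import math
import random


def construct_tour(n, tau, eta, alpha, beta, rng):
    """Build one tour with the random-proportional rule F(z) = z^alpha * eta^beta."""
    tour = [rng.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        i = tour[-1]
        candidates = sorted(unvisited)  # feasible next components
        weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in candidates]
        nxt = rng.choices(candidates, weights=weights)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def tour_cost(tour, dist):
    """Cost f(s) of a closed tour."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def ant_system(dist, n_ants=10, n_iter=50, alpha=1.0, beta=2.0, rho=0.1, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]  # assumed initial pheromone
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    best, best_cost = None, float("inf")
    for _ in range(n_iter):
        sample = [construct_tour(n, tau, eta, alpha, beta, rng) for _ in range(n_ants)]
        # Deposit: reference set = whole sample, quality function = 1 / f(s)
        for tour in sample:
            cost = tour_cost(tour, dist)
            if cost < best_cost:
                best, best_cost = tour, cost
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += 1.0 / cost
                tau[j][i] += 1.0 / cost  # symmetric instance (assumption)
        # Evaporation on every connection
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
    return best, best_cost


# Four cities on the corners of a unit square; the shortest tour has cost 4.
points = [(0, 0), (1, 0), (1, 1), (0, 1)]
dist = [[math.hypot(px - qx, py - qy) for qx, qy in points] for px, py in points]
print(ant_system(dist))
```

Swapping the deposit loop for one that uses only the best tour of the iteration would give an iteration-best update in the sense described above.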

Cross References 7Swarm Intelligence

Recommended Reading
Beckers, R., Deneubourg, J. L., & Goss, S. (). Trails and U-turns in the selection of the shortest path by the ant Lasius niger. Journal of Theoretical Biology, , –.
Dorigo, M., & Gambardella, L. M. (). Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, (), –.
Dorigo, M., Maniezzo, V., & Colorni, A. (). Positive feedback as a search strategy. Technical Report -, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy.
Dorigo, M., Maniezzo, V., & Colorni, A. (). Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics – Part B, (), –.
Dorigo, M., & Stützle, T. (). Ant colony optimization. Cambridge, MA: MIT Press.
Maniezzo, V. (). Exact and approximate nondeterministic tree-search procedures for the quadratic assignment problem. INFORMS Journal on Computing, (), –.
Stützle, T., & Hoos, H. H. (). The MAX–MIN ant system and local search for the traveling salesman problem. In Proceedings of the Congress on Evolutionary Computation – CEC’ (pp. –). Piscataway, NJ: IEEE Press.
Stützle, T., & Hoos, H. H. (). MAX–MIN ant system. Future Generation Computer Systems, (), –.

Anytime Algorithm
An anytime algorithm is an algorithm whose output increases in quality gradually with increased running time. This is in contrast to algorithms that produce no output at all until they produce full-quality output after a sufficiently long execution time. An example of an algorithm with good anytime performance is 7Adaptive Real-Time Dynamic Programming (ARTDP).

AODE 7Averaged One-Dependence Estimators

Apprenticeship Learning 7Behavioral Cloning

Approximate Dynamic Programming 7Value Function Approximation

Apriori Algorithm Hannu Toivonen University of Helsinki, Helsinki, Finland

Definition
The Apriori algorithm (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, ) is a 7data mining method which outputs all 7frequent itemsets and 7association rules from given data.

Input: set I of items, multiset D of subsets of I, frequency threshold min_fr, and confidence threshold min_conf.
Output: all frequent itemsets and all valid association rules in D.
Method:
1: level := 1; frequent_sets := ∅;
2: candidate_sets := {{i} ∣ i ∈ I};
3: while candidate_sets ≠ ∅
  3.1: scan data D to compute frequencies of all sets in candidate_sets;
  3.2: frequent_sets := frequent_sets ∪ {C ∈ candidate_sets ∣ frequency(C) ≥ min_fr};
  3.3: level := level + 1;
  3.4: candidate_sets := {A ⊂ I ∣ ∣A∣ = level and B ∈ frequent_sets for all B ⊂ A, ∣B∣ = level − 1};
4: output frequent_sets;
5: for each F ∈ frequent_sets
  5.1: for each E ⊂ F, E ≠ ∅, E ≠ F
    5.1.1: if frequency(F)/frequency(E) ≥ min_conf then output association rule E → (F ∖ E)

The algorithm finds frequent itemsets (lines 1-3) by a breadth-first, general-to-specific search. It generates and tests candidate itemsets in batches, to reduce the overhead of database access. The search starts with the most general itemset patterns, the singletons, as candidate patterns (line 2). The algorithm then iteratively computes the frequencies of candidates (line 3.1) and saves those that are frequent (line 3.2). The crux of the algorithm is in the candidate generation (line 3.4): on the next level, those itemsets are pruned that have an infrequent subset, since such itemsets cannot be frequent. This allows Apriori to find all frequent itemsets without spending too much time on infrequent itemsets. See 7frequent pattern and 7constraint-based mining for more details and extensions. Finally, the algorithm tests all frequent association rules and outputs those that are also confident (lines 5-5.1.1).
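The method can be sketched in a few lines of Python. This is an illustrative transcription rather than a reference implementation, and it makes assumptions the entry leaves open: frequency is taken as the fraction of transactions containing an itemset, D is a list of Python sets, and candidates are generated by joining frequent sets of the previous level and then pruned, which yields the same candidate collection as enumerating all subsets of I of the given size.

```python
from itertools import combinations


def apriori(items, data, min_fr, min_conf):
    """Return (frequent, rules): frequent maps itemsets to their frequency,
    rules is a list of (antecedent, consequent, confidence) triples."""
    n = len(data)

    def frequency(itemset):
        # Relative frequency: fraction of transactions that contain the itemset
        return sum(1 for t in data if itemset <= t) / n

    level = 1
    frequent = {}
    candidates = {frozenset([i]) for i in items}
    while candidates:
        counted = {c: frequency(c) for c in candidates}  # one pass per level
        frequent.update({c: f for c, f in counted.items() if f >= min_fr})
        level += 1
        # Join frequent (level-1)-sets, then prune any candidate that has an
        # infrequent (level-1)-subset.
        prev = [s for s in frequent if len(s) == level - 1]
        joined = {a | b for a in prev for b in prev if len(a | b) == level}
        candidates = {c for c in joined
                      if all(frozenset(s) in frequent
                             for s in combinations(c, level - 1))}

    rules = []
    for F, f_F in frequent.items():
        for r in range(1, len(F)):  # non-empty proper subsets E of F
            for E in map(frozenset, combinations(F, r)):
                confidence = f_F / frequent[E]
                if confidence >= min_conf:
                    rules.append((E, F - E, confidence))
    return frequent, rules


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent, rules = apriori("abc", transactions, min_fr=0.5, min_conf=0.7)
print(sorted("".join(sorted(s)) for s in frequent))
print(rules)
```

Every subset E of a frequent set F is itself frequent, so the lookup `frequent[E]` in the rule loop always succeeds.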

Cross References 7Association Rule 7Basket Analysis 7Constraint-Based Mining 7Frequent Itemset 7Frequent Pattern

Recommended Reading
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. –). Menlo Park: AAAI Press.

Area Under Curve

Synonyms
AUC

Definition
The area under curve (AUC) statistic is an empirical measure of classification performance based on the area under an ROC curve. It evaluates the performance of a scoring classifier on a test set, but ignores the magnitude of the scores and only takes their rank order into account. AUC is expressed on a scale of 0 to 1, where 0 means that all negatives are ranked before all positives, and 1 means that all positives are ranked before all negatives. See 7ROC Analysis.

AQ 7Rule Learning

ARL 7Average-Reward Reinforcement Learning

ART 7Adaptive Resonance Theory

ARTDP 7Adaptive Real-Time Dynamic Programming

Artificial Immune Systems Jon Timmis University of York, Heslington, North Yorkshire, UK

Synonyms
AIS; Immune computing; Immune-inspired computing; Immunocomputing; Immunological computation

Definition
Artificial immune systems (AIS) have emerged as a computational intelligence approach that shows great promise. Inspired by the complexity of the immune system, computer scientists and engineers have created systems that in some way mimic or capture certain computationally appealing properties of the immune system, with the aim of building more robust and adaptable solutions. AIS have been defined by de Castro and Timmis () as:


▸ adaptive systems, inspired by theoretical immunology and observed immune functions, principles and models, which are applied to problem solving

AIS are not limited to machine learning systems: they are developed in a wide variety of other areas, such as optimization, scheduling, fault tolerance, and robotics (Hart & Timmis, ). Within the context of machine learning, both supervised and unsupervised approaches have been developed. Immune-inspired learning approaches typically develop a memory set of detectors that are capable of classifying unseen data items (in the case of supervised learning) or a memory set of detectors that represent clusters within the data (in the case of unsupervised learning). Both static and dynamic learning systems have been developed.

Motivation and Background
The immune system is a complex system that undertakes a myriad of tasks. The abilities of the immune system have helped to inspire computer scientists to build systems that mimic, in some way, various properties of the immune system. This field of research, AIS, has seen the application of immune-inspired algorithms to a wide variety of areas.

AIS has its roots in the early theoretical immunology work of Farmer, Perelson, and Varela (Farmer, Packard, & Perelson, ; Varela, Coutinho, Dupire, & Vaz, ). These works investigated a number of theoretical 7immune network models proposed to describe the maintenance of immune memory in the absence of antigen. While controversial from an immunological perspective, these models began to attract interest from the computing community. The researchers most influential in bridging computing and immunology in the early days were Bersini and Forrest. It is fair to say that some of the early work by Bersini () was very well rooted in immunology, and this is also true of the early work by Forrest (). These works formed a solid foundation for the area of AIS. Bersini concentrated on the immune network theory, examining how the immune system maintained its memory and how one might build models and algorithms mimicking that property. Forrest’s work focused on computer security (in particular, network intrusion detection) and formed the basis of a great deal of further research by the community on the application of immune-inspired techniques to computer security.

At about the same time as Forrest was undertaking her work, other researchers began to investigate the nature of learning in the immune system and how that might be used to create machine learning algorithms (Cook & Hunt, ). They had the idea that it might be possible to exploit the mechanisms of the immune system (in particular, the immune network) in learning systems, so they set about doing a proof of concept (Cook & Hunt, ). Initial results were very encouraging, and they built on their success by applying the immune ideas to the classification of DNA sequences as either promoter or nonpromoter classes; this work was generalized in Timmis and Neal (). Similar work was carried out by de Castro and Von Zuben (), who developed algorithms for use in function optimization and data clustering. Work in dynamic unsupervised machine learning algorithms was also undertaken, meeting with success in works such as Neal (). In the supervised learning domain, very little happened until Watkins () (later expanded in Watkins, ) developed an immune-based classifier known as AIRS; in the dynamic supervised domain, the work of Secker, Freitas, and Timmis () is one of a number of successes.

Structure of the Learning System
In an attempt to create a common basis for AIS, de Castro and Timmis () proposed the idea of a framework for engineering AIS. They argued the case for such a framework on the grounds that the existence of similar frameworks in other biologically inspired approaches, such as 7artificial neural networks (ANNs) and evolutionary algorithms (EAs), has helped considerably with the understanding and construction of such systems. For example, de Castro and Timmis () consider a set of artificial neurons, which can be arranged together to form an ANN. In order to acquire knowledge, these neural networks undergo an adaptive process, known as learning or training, which alters (some of) the parameters within the network. Therefore, they argued that in a simplified form, a framework to design an ANN is composed of a set of artificial neurons, a pattern of interconnection for these neurons, and a learning algorithm. Similarly, they argued that in evolutionary algorithms, there is a set of artificial chromosomes representing a population of individuals that iteratively undergo a process of reproduction, genetic variation, and selection. As a result of this process, a population of evolved artificial individuals arises. A framework, in this case, would correspond to the genetic representation of the individuals of the population, plus the procedures for reproduction, genetic variation, and selection. Therefore, they proposed that a framework to design a biologically inspired algorithm requires, at least, the following basic elements:
● A representation for the components of the system
● A set of mechanisms to evaluate the interaction of individuals with the environment and each other. The environment is usually simulated by a set of input stimuli, one or more fitness function(s), or other means
● Procedures of adaptation that govern the dynamics of the system, i.e., how its behavior varies over time

This framework can be thought of as a layered approach, such as the specific framework for engineering AIS of de Castro and Timmis () shown in Fig. . This framework follows the three basic elements for designing a biologically inspired algorithm just described, where the set of mechanisms for evaluation are the affinity measures and the procedures of adaptation are the immune algorithms. In order to build a system such as an AIS, one typically requires an application domain or target function. From this basis, the way in which the components of the system will be represented is considered. For example, the representation of network traffic may well be different from the representation of a real-time embedded system. In AIS, the way in which something is represented is known as shape space. There are many kinds of shape space, such as Hamming, real valued, and so on, each of which carries its own bias and should be selected with care (Freitas & Timmis, ). Once the representation has been chosen, one or more affinity measures are used to quantify the interactions of the elements of the system. There are many possible affinity measures (which are partially dependent upon the representation adopted), such as Hamming and Euclidean distance metrics. Again, each of these has its own bias, and the affinity function must be selected with great care, as it can affect the overall performance (and ultimately the result) of the system (Freitas & Timmis, ).
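To make the dependence of the affinity measure on the representation concrete, the sketch below computes a Hamming distance for a binary (Hamming) shape space and a Euclidean distance for a real-valued one. The conversion from distance to an affinity in [0, 1] is one hypothetical normalization chosen for illustration; the entry does not prescribe a particular mapping.

```python
import math


def hamming_distance(p, q):
    """Number of positions at which two equal-length sequences differ."""
    if len(p) != len(q):
        raise ValueError("sequences must have equal length")
    return sum(pi != qi for pi, qi in zip(p, q))


def euclidean_distance(p, q):
    """Euclidean distance between two real-valued vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("vectors must have equal length")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def affinity_from_distance(d, d_max):
    """Map a distance in [0, d_max] to an affinity in [0, 1] (assumed mapping):
    identical elements get affinity 1, maximally distant ones get 0."""
    return 1.0 - d / d_max


antibody, antigen = "10110", "10011"
d = hamming_distance(antibody, antigen)
print(d, affinity_from_distance(d, len(antibody)))
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
```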

Artificial Immune Systems. Figure . AIS layered framework adapted from de Castro and Timmis ()

Supervised Immune-Inspired Learning

The artificial immune recognition system (AIRS) algorithm was introduced as one of the first immune-inspired supervised learning algorithms and has subsequently gone through a period of study and refinement (Watkins, ). To use classifications from de Castro and Timmis (), for the procedures of adaptation, AIRS is a 7clonal selection type of immune-inspired algorithm. The representation and affinity layers of the system are standard in that any number of representations such as binary, real values, etc., can be used with the appropriate affinity function. AIRS has its origin in two other immune-inspired algorithms: CLONALG (CLONAL Selection alGorithm) and Artificial Immune NEtwork (AINE) (de Castro and Timmis, ). AIRS resembles CLONALG in the sense that both algorithms are concerned with developing a set of memory cells that give a representation of the learned environment. AIRS is concerned with the development of a set of memory cells that can encapsulate the training data. This is done in a two-stage process of first evolving a candidate memory cell and then determining if this candidate cell should be added to the overall pool of memory cells. The learning process can be outlined as follows:

1. For each pattern to be recognized, do
   (a) Compare a training instance with all memory cells of the same class and find the memory cell with the best affinity for the training instance. This is referred to as a memory cell mc_match.
   (b) Clone and mutate mc_match in proportion to its affinity to create a pool of abstract B-cells.
   (c) Calculate the affinity of each B-cell with the training instance.
   (d) Allocate resources to each B-cell based on its affinity.
   (e) Remove the weakest B-cells until the number of resources returns to a preset limit.
   (f) If the average affinity of the surviving B-cells is above a certain level, continue to step (g). Else, clone and mutate these surviving B-cells based on their affinity and return to step (c).
   (g) Choose the best B-cell as a candidate memory cell (mc_cand).
   (h) If the affinity of mc_cand for the training instance is better than the affinity of mc_match, then add mc_cand to the memory cell pool. If, in addition to this, the affinity between mc_cand and mc_match is within a certain threshold, then remove mc_match from the memory cell pool.
2. Repeat from step (a) until all training instances have been presented.
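The steps above can be sketched as follows. This is a simplified illustration and not Watkins' reference implementation: it assumes real-valued instances in the unit hypercube, uses 1 / (1 + Euclidean distance) as the affinity, lets every B-cell consume exactly one resource (so steps (d) and (e) reduce to keeping the strongest cells), and seeds each class with its first training instance. All parameter names and default values are hypothetical. A small k-nearest-neighbor vote over the memory cells is included so that the result can be exercised.

```python
import math
import random


def affinity(a, b):
    """Higher is better: 1 / (1 + Euclidean distance) (an assumption)."""
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))


def mutate(cell, rate, rng):
    """Perturb every component by at most `rate`, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.uniform(-rate, rate))) for v in cell]


def train_airs(data, n_clones=10, resource_limit=5, stim_threshold=0.9,
               replace_threshold=0.95, max_rounds=20, seed=0):
    rng = random.Random(seed)
    memory = {}  # class label -> list of memory cells
    for x, label in data:
        pool = memory.setdefault(label, [])
        if not pool:
            pool.append(list(x))  # assumption: seed the class with the instance
            continue
        # (a) best-matching memory cell of the same class
        mc_match = max(pool, key=lambda m: affinity(m, x))
        # (b) clone and mutate in proportion to affinity
        n = max(1, round(n_clones * affinity(mc_match, x)))
        b_cells = [mc_match] + [mutate(mc_match, 0.1, rng) for _ in range(n)]
        for _ in range(max_rounds):
            # (c)-(e) rank by affinity and keep the strongest cells;
            # one resource per cell is a simplification
            b_cells.sort(key=lambda b: affinity(b, x), reverse=True)
            b_cells = b_cells[:resource_limit]
            # (f) stop once the survivors are stimulated enough on average
            mean_aff = sum(affinity(b, x) for b in b_cells) / len(b_cells)
            if mean_aff >= stim_threshold:
                break
            b_cells += [mutate(b, 0.1, rng) for b in b_cells]
        # (g) candidate memory cell
        mc_cand = max(b_cells, key=lambda b: affinity(b, x))
        # (h) add the candidate, and drop mc_match if the two are very similar
        if affinity(mc_cand, x) > affinity(mc_match, x):
            pool.append(mc_cand)
            if affinity(mc_cand, mc_match) > replace_threshold:
                pool.remove(mc_match)
    return memory


def classify(memory, x, k=1):
    """k-nearest-neighbor vote over all memory cells."""
    scored = [(affinity(m, x), label)
              for label, pool in memory.items() for m in pool]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = {}
    for _, label in top:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


data = [([0.10, 0.10], "a"), ([0.15, 0.20], "a"),
        ([0.90, 0.90], "b"), ([0.80, 0.85], "b")]
memory = train_airs(data)
print(classify(memory, [0.12, 0.10]), classify(memory, [0.88, 0.90]))
```

The memory pool is usually much smaller than the training set, which is what makes the final nearest-neighbor step cheap.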

Once this training routine is complete, AIRS classifies the instances using k-nearest neighbor with the developed set of memory cells.

Unsupervised Immune-Inspired Learning

The artificial immune network (aiNET) algorithm was introduced as one of the first immune-inspired unsupervised learning algorithms and has subsequently gone through a period of study and refinement (de Castro & Von Zuben, ). To use classifications from de Castro and Timmis (), for the procedures of adaptation, aiNET is an immune network type of immune-inspired algorithm. The representation and affinity layers of the system are standard (the same as in AIRS). aiNET has its origin in another immune-inspired algorithm, CLONALG (the same forerunner as for AIRS), and resembles CLONALG in the sense that both algorithms (again) are concerned with developing a set of memory cells that give a representation of the learnt environment. However, within aiNET there is no error feedback into the learning process. The learning process can be outlined as follows:

1. Randomly initialize a population P
2. For each pattern to be recognized, do
   (a) Calculate the affinity of each B-cell (b) in the network for an instance of the pattern being learnt
   (b) Select a number of elements from P into a clonal pool C
   (c) Mutate each element of C proportional to affinity to the pattern being learnt (the higher the affinity, the less mutation applied)
   (d) Select the highest affinity members of C to remain in the set C and remove the remaining elements
   (e) Calculate the affinity between all members of C and remove elements in C that have an affinity below a certain threshold (user defined)
   (f) Combine the elements of C with the set P
   (g) Introduce a random number of randomly created elements into P to maintain diversity
3. Repeat from (a) until the stopping criterion is met

Once this training routine is complete, the minimum spanning tree algorithm is applied to the network to extract the clusters from within the network.
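A rough sketch of the aiNET training loop outlined above is given below, under several stated assumptions: instances are real-valued vectors in the unit hypercube, affinity to a pattern is taken as closeness in Euclidean distance, step (e) is read as suppression of clones that lie closer than a threshold to a clone already kept, a fixed number of newcomers replaces the random number in step (g), and a fixed iteration budget serves as the stopping criterion. The function and parameter names are hypothetical, and the final minimum spanning tree step is not shown.

```python
import math
import random


def dist(a, b):
    """Euclidean distance; a smaller distance is treated as higher affinity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ainet(data, population, n_select=3, n_keep=2, suppress=0.05,
          n_new=1, iterations=5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    P = [list(c) for c in population]          # step 1 is done by the caller
    for _ in range(iterations):                # step 3: fixed budget (assumption)
        for x in data:                         # step 2
            # (a)+(b) copy the cells closest to the pattern into the pool C
            nearest = sorted(P, key=lambda c: dist(c, x))[:n_select]
            C = [list(c) for c in nearest]
            # (c) mutation shrinks as the clone gets closer to the pattern
            C = [[v + rng.gauss(0.0, 0.5 * dist(c, x)) for v in c] for c in C]
            # (d) keep only the clones closest to the pattern
            C = sorted(C, key=lambda c: dist(c, x))[:n_keep]
            # (e) suppression, read here as dropping near-duplicate clones
            kept = []
            for c in C:
                if all(dist(c, k) >= suppress for k in kept):
                    kept.append(c)
            # (f) merge the surviving clones into the network
            P.extend(kept)
            # (g) add random newcomers to maintain diversity
            P.extend([rng.random() for _ in range(dim)] for _ in range(n_new))
    return P


rng = random.Random(1)
patterns = [[0.1, 0.1], [0.9, 0.9]]
initial = [[rng.random(), rng.random()] for _ in range(5)]
network = ainet(patterns, initial)
print(len(initial), len(network))
```

Because nothing is ever removed from P in this sketch, the network only grows; a practical implementation would also prune P itself.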


Recommended Reading

Bersini, H. (). Immune network and adaptive control. In Proceedings of the st European conference on artificial life (ECAL) (pp. –). Cambridge, MA: MIT Press.
Cooke, D., & Hunt, J. (). Recognising promoter sequences using an artificial immune system. In Proceedings of intelligent systems in molecular biology (pp. –). California: AAAI Press.
de Castro, L. N., & Timmis, J. (). Artificial immune systems: A new computational intelligence approach. New York: Springer.
de Castro, L. N., & Von Zuben, F. J. (). aiNet: An artificial immune network for data analysis (pp. –). Hershey, PA: Idea Group Publishing.
Farmer, J. D., Packard, N. H., & Perelson, A. S. (). The immune system, adaptation, and machine learning. Physica D, , –.
Forrest, S., Perelson, A. S., Allen, L., & Cherukuri, R. (). Self–nonself discrimination in a computer. In Proceedings of the IEEE symposium on research security and privacy (pp. –).
Freitas, A., & Timmis, J. (). Revisiting the foundations of artificial immune systems: A problem oriented perspective, LNCS (Vol. ) (pp. –). New York: Springer.
Hart, E., & Timmis, J. (). Application areas of AIS: The past, present and the future. Journal of Applied Soft Computing, (), pp. –.
Neal, M. (). An artificial immune system for continuous analysis of time-varying data. In J. Timmis & P. Bentley (Eds.), Proceedings of the st international conference on artificial immune system (ICARIS) (pp. –). Canterbury, UK: University of Kent Printing Unit.
Secker, A., Freitas, A., & Timmis, J. (). AISEC: An artificial immune system for email classification. In Proceedings of congress on evolutionary computation (CEC) (pp. –).
Timmis, J., & Bentley, P. (Eds.). (). Proceedings of the st international conference on artificial immune system (ICARIS). Canterbury, UK: University of Kent Printing Unit.
Timmis, J., & Neal, M. (). A resource limited artificial immune system for data analysis. Knowledge Based Systems, (–), –.
Varela, F., Coutinho, A., Dupire, B., & Vaz, N. (). Cognitive networks: Immune, neural and otherwise. Journal of Theoretical Immunology, , –.
Watkins, A. (). AIRS: A resource limited artificial immune classifier. Master’s thesis, Mississippi State University.
Watkins, A. (). Exploiting immunological metaphors in the development of serial, parallel and distributed learning algorithms. PhD thesis, University of Kent.

Artificial Life

Artificial Life is an interdisciplinary research area trying to reveal and understand the principles and organization of living systems. Its main goal is to artificially synthesize life-like behavior from scratch in computers or other artificial media. Important topics in artificial life include the origin of life, growth and development, evolutionary and ecological dynamics, adaptive autonomous robots, emergence and self-organization, social organization, and cultural evolution.

Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational models based on biological neural networks. An ANN consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.

Cross References
7Adaptive Resonance Theory 7Backpropagation 7Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity 7Boltzmann Machines 7Cascade Correlation 7Competitive Learning 7Deep Belief Networks 7Evolving Neural Networks 7Hypothesis Language 7Neural Network Topology 7Neuroevolution 7Radial Basis Function Networks 7Reservoir Computing 7Self-Organizing Maps 7Simple Recurrent Networks 7Weights

Artificial Societies

Jürgen Branke
University of Warwick, Coventry, UK

Synonyms
Agent-based computational models; Agent-based modeling and simulation; Agent-based simulation models


Definition

An artificial society is an agent-based, computer-implemented simulation model of a society or group of people, usually restricted to their interaction in a particular situation. Artificial societies are used in economics and social sciences to explain, understand, and analyze socioeconomic phenomena. They provide scientists with a fully controllable virtual laboratory to test hypotheses and observe complex system behavior emerging as a result of the 7agents’ interaction. They allow formalizing and testing social theories by using computer code, and make it possible to use experimental methods with social phenomena, or at least with their computer representations, on a large scale. Because the designer is free to choose any desired 7agent behavior as long as it can be implemented, research based on artificial societies is not restricted by assumptions typical in classical economics, such as homogeneity and full rationality of agents. Overall, artificial societies have added an entirely new dimension to research in economics and social sciences and have resulted in a new research field called “agent-based computational economics.”

Artificial societies should be distinguished from virtual worlds and 7artificial life. The term virtual world is usually used for virtual environments to interact with, as, e.g., in computer games. In artificial life, the goal is more to learn about biological principles, understand how life could emerge, and create life within a computer.

Motivation and Background

Classical economics can be roughly divided into analytical and empirical approaches. The former uses deduction to derive theorems from assumptions. Analytical models usually include a number of simplifying assumptions in order to keep the model tractable, the most typical being full rationality and homogeneity of agents. Also, analytical economics is often limited to equilibrium calculations. Classical empirical economics collects data from the real world, and derives patterns and regularities inductively. In recent years, the tremendous increase in available computational power gave rise to a new branch of economics and sociology which uses simulation of artificial societies as a tool to generate new insights.

Artificial societies are agent-based, computer-implemented simulation models of real societies or a group of people in a specific situation. They are built from the bottom up, by specifying the behavior of the agents in different situations. The simulation then reveals the emerging global behavior of the system, and thus provides a link between micro-level behavior of the agents and macro-level characteristics of the system. Using simulation, researchers can now carry out social experiments under fully controlled and reproducible laboratory conditions, trying out different configurations and observing the consequences. Like deduction, simulation models are based on a set of clearly specified assumptions as written down in a computer program. This is then used to generate data, from which regularities and patterns are derived inductively. As such, research based on artificial societies stands somewhere between the classical analytical and empirical social sciences.

One of the main advantages of artificial societies is that they make it possible to consider very complex scenarios where agents are heterogeneous, boundedly rational, or have the ability to learn. They also make it possible to observe evolution over time, instead of just the equilibrium.

Artificial societies can be used for many purposes, e.g.:

1. Verification: Test a hypothesis or theory by examining its validity in relevant, clearly defined scenarios.
2. Explanation: Construct an artificial society which shows the same behavior as the real society. Then analyze the model to explain the emergent behavior.
3. Prediction: Run a model of an existing society into the future. Also, feed the model with different input parameters and use the result as a prediction on how the society would react.
4. Optimization: Test different strategies in the simulation environment, trying to find a best possible strategy.
5. Existence proof: Demonstrate that a specific simulation model is able to generate a certain global behavior.
6. Discovery: Play around with parameter settings, discovering new interdependencies and gaining new insights.
7. Training and education: Use simulation as a demonstrator.


Structure of the Learning System

Using artificial societies requires the usual steps in model building and experimental science, including

1. Developing a conceptual model
2. Building the simulation model
3. Verification (making sure the model is correct)
4. Validation (making sure the model is suitable to answer the posed questions)
5. Simulation and analysis using an appropriate experimental design.

The study of artificial societies is an interdisciplinary research area involving, among others, computer science, psychology, economics, sociology, and biology.

Important Aspects

The modeling, simulation, and analysis process described in the previous section is rather complex and only remotely connected to machine learning. Thus, instead of a detailed description of all steps, the following focuses on aspects particularly interesting from a machine learning point of view.

Modeling Learning

One of the main advantages of artificial societies is that they can account for boundedly rational and learning agents. For that, one has to specify (in the form of a program) exactly how agents decide and learn. In principle, all the learning algorithms developed in machine learning could be used, and many have been used successfully, including 7reinforcement learning, 7artificial neural networks, and 7evolutionary algorithms. However, note that the choice of a learning algorithm is not determined by its learning speed and efficiency (as usual in machine learning), but by how well it reflects human learning in the considered scenario, at least if the goal is to construct an artificial society which allows conclusions to be transferred to the real world. As a consequence, many learning models used in artificial societies are motivated by psychology. The choice of the most suitable model depends on the simulation context, e.g., on whether the simulated learning process is conscious or nonconscious, or on the time and effort an individual may be expected to spend on a particular decision.


Besides individual learning (i.e., learning from own past experience), artificial societies usually feature social learning (where one agent learns by observing others), and cultural learning (e.g., the evolution of norms). While the latter simply emerges from the interaction of the agents, the former has to be modeled explicitly. Several different models for learning in artificial societies are discussed in Brenner ().

One popular learning paradigm which can be used as a model for individual as well as social learning is that of 7evolutionary algorithms (EAs). Several studies suggest that EAs are indeed an appropriate model for learning in artificial societies, either based on comparisons of simulations with human subject experiments or based on comparisons with other learning mechanisms such as reinforcement learning (Duffy, ). As EAs are successful search strategies, they seem particularly suitable if the space of possible actions or strategies is very large.

If used to model individual learning, each agent uses a separate EA to search for a better personal solution. In this case, the EA population represents the different alternative actions or strategies that an agent considers. The genetic operators crossover and mutation are clearly related to two major ingredients of human innovation: combination and variation. Crossover can be seen as deriving a new concept by combining two known concepts, and mutation corresponds to a small variation of an existing concept. So, the agent, in some sense, creatively tries out new possibilities. Selection, which favors the best solutions found so far, models the learning part. A solution’s quality is usually assessed by evaluating it in a simulation assuming all other agents keep their behavior.

For modeling social learning, EAs can be used in two different ways. In both cases, the population represents the actions or strategies of the different agents in the population.
From this it follows that the population size corresponds to the number of agents in the simulation. Fitness values are calculated by running the simulation and observing how the different agents perform. Crossover is now seen as a model for information exchange, or imitation, among agents. Mutation, as in the individual learning case, is seen as a small variation of an existing concept.

The first social learning model simply uses a standard EA, i.e., selection chooses agents to “reproduce,” and the resulting new agent strategy replaces an old strategy in the population. While this allows the use of standard EA libraries, this approach does not provide a direct link between agents in the simulation and individuals in the EA population.

In the second social learning model, each agent directly corresponds to an individual in the EA. In every iteration, each agent creates and tests a new strategy as follows. First, it selects a “donor” individual, with preference to successful individuals. Then it performs a crossover of its own strategy and the donor’s strategy, and mutates the result. This can be regarded as an agent observing other agents, and partially adopting the strategies of successful other agents. Then, the resulting new strategy is tested in a “thought experiment,” by testing whether the agent would be better off with the new strategy compared with its current strategy, assuming all other agents keep their behavior. If the new strategy performs better, it replaces the current strategy in the next iteration. Otherwise, the new strategy is discarded and the agent again uses its old strategy in the next iteration. The testing of new strategies against their parents has been termed election operator in Arifovic (), and makes sure that some very bad and obviously implausible agent strategies never enter the artificial society.

Examples
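As a first, purely illustrative example, one iteration of the second social learning model (donor selection, crossover, mutation, and the election operator) might be sketched as follows. The toy market payoff, the binary tournament used to prefer successful donors, the blend crossover, and all names and parameter values are assumptions made for this sketch; they are not taken from Arifovic ().

```python
import random


def payoff(i, strategies):
    """Toy Cournot-style payoff (an assumption): agent i sells the quantity
    strategies[i] at a price that falls as total output rises."""
    price = max(0.0, 1.0 - sum(strategies))
    return strategies[i] * price


def social_learning_step(strategies, rng, sigma=0.02):
    n = len(strategies)
    fitness = [payoff(i, strategies) for i in range(n)]
    updated = list(strategies)
    for i in range(n):
        # Donor selection with preference to successful agents
        # (binary tournament, an assumption).
        a, b = rng.randrange(n), rng.randrange(n)
        donor = a if fitness[a] >= fitness[b] else b
        # Crossover of own and donor strategy, followed by mutation.
        w = rng.random()
        child = w * strategies[i] + (1.0 - w) * strategies[donor]
        child = min(1.0, max(0.0, child + rng.gauss(0.0, sigma)))
        # Election operator: a thought experiment in which all other
        # agents keep their current behavior.
        trial = list(strategies)
        trial[i] = child
        if payoff(i, trial) > fitness[i]:
            updated[i] = child
    return updated


rng = random.Random(0)
population = [0.2 * rng.random() for _ in range(10)]
for _ in range(50):
    population = social_learning_step(population, rng)
print([round(q, 3) for q in population])
```

All agents update simultaneously from the same snapshot of strategies, so a strategy that passed its thought experiment can still turn out worse once the other agents have changed theirs.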

One of the first forerunners of artificial societies was Schelling’s segregation model, . In this study, Schelling placed some artificial agents of two different colors on a simple grid. Each agent follows a simple rule: if less than a given percentage of the agents in its neighborhood have the same color, the agent moves to a random free spot. Otherwise, it stays. As the simulation shows, in this model, segregation of agent colors could be observed even if every individual agent was satisfied to live in a neighborhood with only % of its neighbors being of the same color. Thus, with this simple model, Schelling demonstrated that segregation of races in suburbs can occur even if each individual would be happy to live in a diverse neighborhood. Note that the simulations were actually not implemented on a computer but carried out by moving coins on a grid by hand.

Other milestones in artificial societies are certainly the work by Epstein and Axtell on their “sugarscape” model (Epstein & Axtell, ), and the Santa Fe artificial stock market (Arthur, Holland, LeBaron, Palmer, & Taylor, ). In the former, agents populate a simple grid world, with sugar growing as the only resource. The agents need the sugar for survival, and can move around to collect it. Epstein and Axtell have shown that even with agents following some very simple rules, the emerging behavior of the overall system can be quite complex and similar in many aspects to observations in the real world, e.g., showing a similar wealth distribution or population trajectories. The latter is a simple model of a stock market with only a single stock and a risk-free fixed-interest alternative. This model has subsequently been refined and studied by many researchers. One remarkable result of the original model was to demonstrate that technical trading can actually be a viable strategy, something widely accepted in practice, but which classical analytical economics struggled to explain.

One of the most sophisticated artificial societies is perhaps the model of the Anasazi tribe, who left their dwellings in the Long House Valley in northeastern Arizona for so far unknown reasons around BC (Axtell et al., ). By building an artificial society of this tribe and the natural surroundings (climate etc.), it was possible to replicate macro behavior which is known to have occurred and provide a possible explanation for the sudden move. The NewTies project (Gilbert et al., ) has a different and quite ambitious focus: it constructs artificial societies with the hope of an emerging artificial language and culture, which then might be studied to help explain how language and culture formed in human societies.

Software Systems

Agent-based simulations can be facilitated by using specialized software libraries such as Ascape, NetLogo, Repast, StarLogo, MASON, and Swarm. A comparison of different libraries can be found in Railsback, Lytinen, and Jackson ().
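For very small models no dedicated library is needed. As an illustration, Schelling's segregation model described under Examples can be sketched in plain Python as follows. The grid size, the share of empty cells, the eight-cell neighborhood, and the tolerance threshold of 0.3 are hypothetical choices for this sketch and are not the values used by Schelling.

```python
import random


def make_grid(size, empty_share, rng):
    """Square grid whose cells hold 'A', 'B', or None (a free spot)."""
    cells = [None if rng.random() < empty_share else rng.choice("AB")
             for _ in range(size * size)]
    return [cells[r * size:(r + 1) * size] for r in range(size)]


def same_color_share(grid, r, c):
    """Share of occupied neighboring cells with the same color as (r, c)."""
    size = len(grid)
    neighbors = [grid[r + dr][c + dc]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr or dc) and 0 <= r + dr < size and 0 <= c + dc < size
                 and grid[r + dr][c + dc] is not None]
    if not neighbors:
        return 1.0  # an isolated agent is treated as content
    return sum(1 for v in neighbors if v == grid[r][c]) / len(neighbors)


def step(grid, threshold, rng):
    """One sweep: every discontent agent moves to a random free spot."""
    size = len(grid)
    agents = [(r, c) for r in range(size) for c in range(size)
              if grid[r][c] is not None]
    rng.shuffle(agents)
    moved = 0
    for r, c in agents:
        if same_color_share(grid, r, c) < threshold:
            free = [(i, j) for i in range(size) for j in range(size)
                    if grid[i][j] is None]
            if not free:
                continue
            i, j = rng.choice(free)
            grid[i][j], grid[r][c] = grid[r][c], None
            moved += 1
    return moved


def segregation_index(grid):
    """Mean same-color share over all agents (0.0 for an empty grid)."""
    size = len(grid)
    shares = [same_color_share(grid, r, c)
              for r in range(size) for c in range(size)
              if grid[r][c] is not None]
    return sum(shares) / len(shares) if shares else 0.0


rng = random.Random(0)
grid = make_grid(20, 0.1, rng)
print("before:", round(segregation_index(grid), 3))
for _ in range(30):
    if step(grid, 0.3, rng) == 0:
        break
print("after: ", round(segregation_index(grid), 3))
```

Comparing the index before and after the sweeps gives a rough impression of how far the grid has sorted itself; the toolkits listed above add scheduling, visualization, and data collection on top of such a core loop.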

Applications

Artificial societies have many practical applications, ranging from rather simple simulation models to very complex economic decision problems. Examples include traffic simulation, market design, evaluation of vaccination programs, evacuation plans, and supermarket layout optimization. See, e.g., Bonabeau () for a discussion of several such applications.

Future Directions, Challenges

The science of artificial societies is still in its infancy, but the field is burgeoning and has already produced some remarkable results. Major challenges lie in the model building, calibration, and validation of the artificial society simulation model. Although several agent-based modeling toolkits are available, there is a lot to be gained by making them more flexible, intuitive, and user-friendly, so that complex models can be constructed simply by selecting and combining provided building blocks of agent behavior. 7Behavioral Cloning may be a suitable machine learning approach to generate representative agent models.

Cross References
7Artificial Life 7Behavioral Cloning 7Co-Evolutionary Learning 7Multi-Agent Learning

Recommended Reading

Agent-based computational economics, website maintained by Tesfatsion ()
Axelrod: The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration (Axelrod, )
Bonabeau: Agent-based modeling (Bonabeau, )
Brenner: Agent learning representation: Advice on modeling economic learning (Brenner, )
Epstein: Generative social science (Epstein, )
Journal of Artificial Societies and Social Simulation ()
Tesfatsion and Judd (eds.): Handbook of computational economics (Tesfatsion & Judd, )

Arifovic, J. (). Genetic algorithm learning and the cobweb model. Journal of Economic Dynamics and Control, , –.
Arthur, B., Holland, J., LeBaron, B., Palmer, R., & Taylor, P. (). Asset pricing under endogenous expectations in an artificial stock market. In B. Arthur et al. (Eds.), The economy as an evolving complex system II (pp. –). Boston: Addison-Wesley.
Axelrod, R. (). The complexity of cooperation: Agent-based models of competition and collaboration. Princeton, NJ: Princeton University Press.
Axtell, R. L., Epstein, J. M., Dean, J. S., Gumerman, G. J., Swedlund, A. C., Harburger, J., et al. (). Population growth and collapse in a multiagent model of the Kayenta Anasazi in Long House Valley. Proceedings of the National Academy of Sciences, , –.
Bonabeau, E. (). Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences, , –.
Brenner, T. (). Agent learning representation: Advice on modelling economic learning. In L. Tesfatsion & K. L. Judd (Eds.), Handbook of computational economics (Vol. , pp. –). Amsterdam: North-Holland.
Duffy, J. (). Agent-based models and human subject experiments. In L. Tesfatsion & K. L. Judd (Eds.), Handbook of computational economics (Vol. , pp. –). Amsterdam: North-Holland.
Epstein, J. M. (). Generative social science: Studies in agent-based computational modeling. Princeton, NJ: Princeton University Press.
Epstein, J. M., & Axtell, R. (). Growing artificial societies. Washington, DC: Brookings Institution Press.
Gilbert, N., den Besten, M., Bontovics, A., Craenen, B. G. W., Divina, F., Eiben, A. E., et al. (). Emerging artificial societies through learning. Journal of Artificial Societies and Social Simulation, (). http://jasss.soc.surrey.ac.uk///. html.
Railsback, S. F., Lytinen, S. L., & Jackson, S. K. (). Agent-based simulation platforms: Review and development recommendations. Simulation, (), –.
Schelling, T. C. (). Dynamic models of segregation. Journal of Mathematical Sociology, , –.
Tesfatsion, L. (). Website on agent-based computational economics. http://www.econ.iastate.edu/tesfatsi/ace.htm.
Tesfatsion, L., & Judd, K. L. (Eds.) (). Handbook of computational economics – Vol : Agent-based computational economics. Amsterdam: Elsevier.
The journal of artificial societies and social simulation. http://jasss.soc.surrey.ac.uk/JASSS.html.

Assertion

In 7Minimum Message Length, the code or language shared between sender and receiver that is used to describe the model.

Association Rule

Hannu Toivonen
University of Helsinki, Helsinki, Finland

Definition

Association rules (Agrawal, Imieliński, & Swami, ) can be extracted from data sets where each example consists of a set of items. An association rule has the form X → Y, where X and Y are 7itemsets, and the interpretation is that if set X occurs in an example, then set Y is also likely to occur in the example. Each association rule is usually associated with two statistics measured from the given data set. The frequency or support of a rule X → Y, denoted fr(X → Y), is the number (or alternatively the relative frequency) of examples in which X ∪ Y occurs. Its confidence, in turn, is the observed conditional probability P(Y ∣ X) = fr(X ∪ Y)/fr(X). The 7Apriori algorithm (Agrawal, Mannila, Srikant, Toivonen & Verkamo, ) finds all association rules, between any sets X and Y, which exceed user-specified support and confidence thresholds. In association rule mining, unlike in most other learning tasks, the result thus is a set of rules concerning different subsets of the feature space. Association rules were originally motivated by supermarket 7basket analysis, but as a domain-independent technique they have found applications in numerous fields. Association rule mining is part of the larger field of 7frequent itemset or 7frequent pattern mining.
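The two statistics can be computed directly from their definitions. The sketch below uses relative frequencies and enumerates every candidate rule by brute force; it is not the Apriori algorithm, which prunes the search using the fact that every subset of a frequent itemset is itself frequent. The function names and the small basket data set are illustrative assumptions.

```python
from itertools import combinations


def fr(itemset, examples):
    """Relative frequency of the examples that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for e in examples if items <= set(e)) / len(examples)


def confidence(x, y, examples):
    """Observed conditional probability P(Y | X) = fr(X u Y) / fr(X)."""
    return fr(set(x) | set(y), examples) / fr(x, examples)


def rules(examples, min_support, min_confidence):
    """All rules X -> Y that meet both thresholds (exhaustive search)."""
    items = sorted(set().union(*examples))
    found = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            support = fr(itemset, examples)
            if support == 0 or support < min_support:
                continue
            # Split the itemset into every non-empty antecedent X
            # and the remaining consequent Y.
            for k in range(1, size):
                for x in combinations(itemset, k):
                    y = tuple(i for i in itemset if i not in x)
                    conf = support / fr(x, examples)
                    if conf >= min_confidence:
                        found.append((x, y, support, conf))
    return found


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
for x, y, support, conf in rules(baskets, 0.5, 0.6):
    print(x, "->", y, round(support, 2), round(conf, 2))
```

The exhaustive search examines a number of itemsets that is exponential in the number of distinct items, which is why pruning strategies such as Apriori's matter on real data.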

Cross References
7Apriori Algorithm 7Basket Analysis 7Frequent Itemset 7Frequent Pattern

Recommended Reading

Agrawal, R., Imieliński, T., & Swami, A. (). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD international conference on management of data, Washington, DC (pp. –). New York: ACM.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. –). Menlo Park: AAAI Press.

Associative Bandit Problem

7Associative Reinforcement Learning

Associative Reinforcement Learning

Alexander L. Strehl
Rutgers University, USA

Synonyms
Associative bandit problem; Bandit problem with side information; Bandit problem with side observations; One-step reinforcement learning

Definition

The associative reinforcement-learning problem is a specific instance of the 7reinforcement learning problem whose solution requires generalization and exploration but not temporal credit assignment. In associative reinforcement learning, an action (also called an arm) must be chosen from a fixed set of actions during successive timesteps and from this choice a real-valued reward or payoff results. On each timestep, an input vector is provided that along with the action determines, often probabilistically, the reward. The goal is to maximize the expected long-term reward over a finite or infinite horizon. It is typically assumed that the action choices do not affect the sequence of input vectors. However, even if this assumption is not asserted, learning algorithms are not required to infer or model the relationship between input vectors from one timestep to the next. Requiring a learning algorithm to discover and reason about this underlying process results in the full reinforcement learning problem.

Motivation and Background

The problem of associative reinforcement learning may be viewed as connecting the problems of 7supervised learning or 7classification, which is more specific, and reinforcement learning, which is more general. Its study is motivated by real-world applications such as choosing which internet advertisements to display based on information about the user or choosing which stock to buy based on current information related to the market. Both problems are distinguished from supervised learning by the absence of labeled training examples to learn from. For instance, in the advertisement problem, the learner is never told which ads would have resulted in the greatest expected reward (in this problem, reward is determined by whether an ad is clicked on or not). In the stock problem, the best choice is never revealed since the choice itself affects the future price of the stocks and therefore the payoff.

The Learning Setting

The learning problem consists of the following core objects:

● An input space $X$, which is a set of objects (often a subset of the n-dimensional Euclidean space $\mathbb{R}^n$).
● A set of actions or arms $A$, which is often a finite set of size $k$.
● A distribution $D$ over $X$. In some cases, $D$ is allowed to be time-dependent and may be denoted $D_t$ on timestep $t$ for $t = 1, 2, \ldots$.

A learning sequence proceeds as follows. During each timestep $t = 1, 2, \ldots$, an input vector $x_t \in X$ is drawn according to the distribution $D$ and is provided to the algorithm. The algorithm selects an arm $a_t \in A$. This choice may be stochastic and depend on all previous inputs and rewards observed by the algorithm as well as all previous action choices made by the algorithm on timesteps $1, 2, \ldots, t-1$. Then, the learner receives a payoff $r_t$ generated according to some unknown stochastic process that depends only on $x_t$ and $a_t$. The informal goal is to maximize the expected long-term payoff. Let $\pi : X \to A$ be any policy that maps input vectors to actions. Let

$$V^{\pi}(T) := E\left[\sum_{i=1}^{T} r_i \,\middle|\, a_i = \pi(x_i) \text{ for } i = 1, 2, \ldots, T\right]$$

denote the expected total reward over $T$ steps obtained by choosing arms according to policy $\pi$. The expectation is taken over any randomness in the generation of input vectors $x_i$ and rewards $r_i$. The expected regret of a learning algorithm with respect to policy $\pi$ is defined as $V^{\pi}(T) - E\left[\sum_{i=1}^{T} r_i\right]$, the expected difference between the return from following policy $\pi$ and the actual obtained return.

Power of Side Information
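To make the role of the input vectors concrete, the following sketch simulates the learning sequence and a single-run estimate of the regret defined in the learning setting. Everything specific in it is an assumption made for illustration: two arms, a binary input drawn uniformly at random, Bernoulli rewards that pay off with probability 0.9 when the arm equals the input and 0.1 otherwise, and an epsilon-greedy learner that keeps one running average per (input, arm) pair. The comparison policy is π(x) = x, for which V^π(T) = 0.9 T in this toy problem.

```python
import random

P_MATCH, P_MISMATCH = 0.9, 0.1  # hypothetical reward probabilities


def expected_reward(x, a):
    """Expected payoff of arm a given input x in the toy problem."""
    return P_MATCH if a == x else P_MISMATCH


def simulate(T=2000, epsilon=0.1, seed=0):
    """Run one learning sequence and return V^pi(T) minus the obtained reward."""
    rng = random.Random(seed)
    counts = {(x, a): 0 for x in (0, 1) for a in (0, 1)}
    means = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}
    total_reward = 0.0
    for _ in range(T):
        x = rng.randrange(2)                       # x_t drawn from D
        if rng.random() < epsilon:
            a = rng.randrange(2)                   # explore
        else:
            a = max((0, 1), key=lambda arm: means[(x, arm)])  # exploit
        r = 1.0 if rng.random() < expected_reward(x, a) else 0.0
        counts[(x, a)] += 1
        means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]
        total_reward += r
    v_pi = T * P_MATCH                             # V^pi(T) for pi(x) = x
    return v_pi - total_reward


T = 2000
print("regret of the input-aware learner:", simulate(T))
print("expected regret of any fixed arm: ", T * (P_MATCH - 0.5))
```

A learner that ignores the input can do no better than an expected reward of 0.5 per step here, so its expected regret against π grows as 0.4 T regardless of how long it learns.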

Wang, Kulkarni, and Poor () studied the associative reinforcement learning problem from a statistical viewpoint. They considered the setting with two actions and analyzed the expected inferior sampling time, which is the number of times that the lesser action, in terms of expected reward, is selected. The function mapping input vectors to conditional reward distributions belongs to a known parameterized class of functions, with the true parameters being unknown. They show that, under some mild conditions, an algorithm can achieve finite expected inferior sampling time. This demonstrates the power provided by the input vectors (also called side observations or side information), because such a result is not possible in the standard multi-armed bandit problem, which corresponds to the associative reinforcement-learning problem without input vectors $x_i$. Intuitively, this type of result is possible when the side information can be used to infer the payoff function of the optimal action.

Linear Payoff Functions

In its most general setting, the associative reinforcement learning problem is intractable. One way to rectify this problem is to assume that the payoff function is described by a linear system. For instance, Abe and Long () and Auer () consider a model where, during each timestep t, there is a vector $z_{t,i}$ associated with each arm i. The expected payoff of pulling arm i on this timestep is given by $\theta^{\top} z_{t,i}$, where θ is an unknown parameter vector and $\theta^{\top}$ denotes the transpose of θ. This framework maps to the framework described above by taking $x_t = (z_{t,1}, z_{t,2}, \ldots, z_{t,k})$. They assume a time-dependent distribution D and focus on obtaining bounds on the regret against the optimal policy. Assuming that all rewards lie in the interval [, ], the worst possible regret of any learning algorithm is linear in T. When considering only the number of timesteps T, Auer () shows that a regret (with respect to the optimal policy) of O(√T ln(T)) can be obtained.

PAC Associative Reinforcement Learning

The previously mentioned works analyze the growth rate of the regret of a learning algorithm with respect to the optimal policy. Another way to approach the problem is to allow the learner some number of timesteps of exploration. After the exploration trials, the algorithm is required to output a policy. More specifically, given inputs 0 < ε < 1 and 0 < δ < 1, the algorithm is required to output an ε-optimal policy with probability at least 1 − δ. This type of analysis is based on the work by Valiant (), and learning algorithms satisfying the above condition are termed probably approximately correct (PAC). Motivated by the work of Kaelbling (), Fiechter () developed a PAC algorithm for the case where the true payoff function can be described by a decision list over the action and input vector. Building on both works, Strehl, Mesterharm, Littman, and Hirsh () showed that a class of associative reinforcement learning problems can be solved efficiently, in a PAC sense, when given a learning algorithm for efficiently solving classification problems.
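The interaction protocol and the notion of regret used in this entry can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions, not an algorithm from any of the cited works: it assumes the linear payoff model (expected payoff $\theta^{\top} z_{t,i}$), Bernoulli payoffs, and a simple epsilon-greedy learner with a stochastic-gradient estimate of θ. All names (`THETA`, `draw_input`, `oracle`, `epsilon_greedy`, `run`) and all constants are hypothetical.

```python
import random

THETA = (0.6, 0.2)   # hypothetical true parameter vector, hidden from the learner
K = 3                # number of arms (assumed)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def draw_input(rng):
    """Draw x_t = (z_{t,1}, ..., z_{t,K}): one feature vector per arm."""
    return [(rng.random(), rng.random()) for _ in range(K)]

def oracle(x, estimate, rng):
    """Optimal policy: pick the arm with the highest true expected payoff."""
    return max(range(K), key=lambda i: dot(THETA, x[i]))

def epsilon_greedy(x, estimate, rng, epsilon=0.1):
    """Explore with probability epsilon, otherwise trust the current estimate."""
    if rng.random() < epsilon:
        return rng.randrange(K)
    return max(range(K), key=lambda i: dot(estimate, x[i]))

def run(choose, T, seed=0):
    """Play T timesteps; return the summed expected-payoff gap to the optimal arm.

    The sum of gaps has the same expectation as the regret against the
    optimal policy, but with less noise than a difference of sampled payoffs.
    """
    rng = random.Random(seed)
    estimate = [0.0, 0.0]                    # learner's running estimate of theta
    regret = 0.0
    for t in range(1, T + 1):
        x = draw_input(rng)
        means = [dot(THETA, z) for z in x]   # expected payoffs, each in [0, 0.8)
        a = choose(x, estimate, rng)
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli payoff r_t
        # Stochastic-gradient step on the squared prediction error.
        error = r - dot(estimate, x[a])
        step = 0.5 / t ** 0.5
        estimate = [w + step * error * z for w, z in zip(estimate, x[a])]
        regret += max(means) - means[a]
    return regret

if __name__ == "__main__":
    for name, policy in (("oracle", oracle), ("epsilon-greedy", epsilon_greedy)):
        print(name, round(run(policy, 2000), 2))
```

The oracle has zero regret by construction. The learner's regret depends on the exploration rate and step size chosen here, which are arbitrary and carry none of the guarantees discussed above.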

Recommended Reading

Section . of the survey by Kaelbling, Littman, and Moore () presents a nice overview of several techniques for the associative reinforcement-learning problem, such as CRBP (Ackley and Littman, ), ARC (Sutton, ), and REINFORCE (Williams, ).

Abe, N., & Long, P. M. (). Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the th international conference on machine learning (pp. –).
Ackley, D. H., & Littman, M. L. (). Generalization and scaling in reinforcement learning. In Advances in neural information processing systems (pp. –). San Mateo, CA: Morgan Kaufmann.
Auer, P. (). Using confidence bounds for exploitation–exploration trade-offs. Journal of Machine Learning Research, , –.
Fiechter, C.-N. (). PAC associative reinforcement learning. Unpublished manuscript.
Kaelbling, L. P. (). Associative reinforcement learning: Functions in k-DNF. Machine Learning, , –.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, , –.
Strehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh, H. (). Experience-efficient learning in associative bandit problems. In ICML-: Proceedings of the rd international conference on machine learning, Pittsburgh, Pennsylvania (pp. –).
Sutton, R. S. (). Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts, Amherst, MA.
Valiant, L. G. (). A theory of the learnable. Communications of the ACM, , –.
Wang, C.-C., Kulkarni, S. R., & Poor, H. V. (). Bandit problems with side observations. IEEE Transactions on Automatic Control, , –.
Williams, R. J. (). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, , –.

Attribute

Chris Drummond
National Research Council of Canada, Ottawa, ON, Canada

Synonyms

Characteristic; Feature; Property; Trait

Definition

Attributes are properties of things: ways that we, as humans, might describe them. If we were talking about the appearance of our friends, we might describe one of them as “sex female,” “hair brown,” “height ft in.” Linguistically, this is rather terse, but this very terseness has the advantage of limiting ambiguity. The attributes are sex, hair color, and height. For each friend, we could give the appropriate value for each attribute; some examples are shown in Table . Attribute-value pairs are a standard way of describing things within the machine learning community. Traditionally, values have come in one of three types: binary (sex has two values), nominal (hair color has many values), and real (height has an ordered set of values). Ideally, the attribute-value pairs are sufficient to describe some things accurately and to tell them apart from others. What might be described is very varied, so the attributes themselves will vary widely.

Attribute. Table . Some friends

Sex      Hair color   Height
Male     Black        ft in.
Female   Brown        ft in.
Female   Blond        ft in.
Male     Brown        ft in.

Motivation and Background

For machine learning to be successful, we need a language to describe everyday things that is sufficiently powerful to capture the similarities and differences between them and yet is computationally easy to manage. The idea that a sufficient number of attribute-value pairs would meet this requirement is an intuitive one. It has also been studied extensively in philosophy and psychology, as a way that humans represent things mentally. In the early days of artificial intelligence research, the frame (Minsky, ) became a common way of representing knowledge. We have, in many ways, inherited this representation: attribute-value pairs share much in common with the labeled slots for values used in frames. In addition, the data for many practical problems comes in this form. Popular methods of storing and manipulating data, such as relational databases, and less formal structures, such as spreadsheets, have columns as attributes and cells as values. So, attribute-value pairs are a ubiquitous way of representing data.

Future Directions

The notion of an attribute-value pair is so well entrenched in machine learning that it is difficult to perceive what might replace it. Because the data in many practical applications comes in this form, this representation will undoubtedly be around for some time. One change that is occurring is the growing complexity of attribute values. Traditionally, we have used the simple value types discussed earlier: binary, nominal, and real. But to describe many things effectively, we need to extend this simple language and use more complex values. For example, in ▶data mining applied to multimedia, newer and more complex representations abound. Sound and video streams, images, and various properties of them are just a few examples (Cord et al., ; Simoff & Djeraba, ). Perhaps the most significant change is away from attributes, albeit with complex values, to structural forms where the relationship between things is included. As Quinlan () states, “Data may concern objects or observations with arbitrarily complex structure that cannot be captured by the values of a predetermined set of attributes.” There is a large and growing community of researchers in ▶relational learning. This is evidenced by the number, and growing frequency, of recent workshops at the International Conference on Machine Learning (Cord et al., ; De Raedt & Kramer, ; Dietterich, Getoor, & Murphy, ; Fern, Getoor, & Milch, ).


Limitations

In philosophy there is the idea of essence, the properties an object must have to be what it is. In machine learning, particularly in practical applications, we get what we are given and have little control over the choice of attributes and their range of values. If domain experts have chosen the attributes, we might hope that they are properties that can be readily ascertained and are relevant to the task at hand. For example, when describing one of our friends, we would not say Fred is the one with the spleen. It is not only difficult to observe, it is also poor at discriminating between people. Data are collected for many reasons. In medical applications, all sorts of attribute values would be collected on patients. Most are unlikely to be important to the current task. An important part of learning is ▶feature extraction, determining which attributes are necessary for learning. Whether or not attribute-value pairs are an essential representation for the type of learning required in the development, and functioning, of intelligent agents remains to be seen. Attribute values readily capture symbolic information, typically at the level of words that humans naturally use. But if our agents need to move around in their environment, recognizing what they encounter, we might need a different, nonlinguistic representation. Certainly, other representations based on a much finer granularity of features, and more holistic in nature, have been central to areas such as ▶neural networks for some time. In research into ▶dynamic systems, attractors in a sensor space might be more realistic than attribute values (see the chapter on ▶Classification).

Recommended Reading

Cord, M., Dahyot, R., Cunningham, P., & Sziranyi, T. (Eds.). (). Workshop on machine learning techniques for processing multimedia content. In Proceedings of the twenty-second international conference on machine learning.
De Raedt, L., & Kramer, S. (Eds.). (). In Proceedings of the seventeenth international conference on machine learning. Workshop on attribute-value and relational learning: Crossing the boundaries, Stanford University, Palo Alto, CA.
Dietterich, T., Getoor, L., & Murphy, K. (Eds.). (). In Proceedings of the twenty-first international conference on machine learning. Workshop on statistical relational learning and its connections to other fields.
Fern, A., Getoor, L., & Milch, B. (Eds.). (). In Proceedings of the twenty-fourth international conference on machine learning. Workshop on open problems in statistical relational learning.
Minsky, M. (). A framework for representing knowledge. Technical report, Massachusetts Institute of Technology, Cambridge, MA.
Quinlan, J. R. (). Learning first-order definitions of functions. Journal of Artificial Intelligence Research, , –.
Simoff, S. J., & Djeraba, C. (Eds.). (). In Proceedings of the sixth international conference on knowledge discovery and data mining. Workshop on multimedia data mining.

Attribute Selection

▶Feature Selection

Attribute-Value Learning

Attribute-value learning refers to any learning task in which each ▶Instance is described by the values of some finite set of attributes (see ▶Attribute). Each instance is often represented as a vector of attribute values, with each position in the vector corresponding to a unique attribute.
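The vector representation described above can be illustrated with a short sketch. It assumes the three traditional value types (binary, nominal, real) and encodes binary and nominal values by their index in a fixed list of allowed values. This is one common choice among several; one-hot encoding of nominal values is another. The attribute names, the `Attribute` class, `to_vector`, and the example height are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str
    kind: str              # "binary", "nominal", or "real"
    values: tuple = ()     # allowed values for binary and nominal attributes

# A fixed, finite set of attributes; vector position i belongs to ATTRIBUTES[i].
ATTRIBUTES = (
    Attribute("sex", "binary", ("male", "female")),
    Attribute("hair color", "nominal", ("black", "brown", "blond")),
    Attribute("height_cm", "real"),
)

def to_vector(instance):
    """Map an attribute-value dict to a vector with one position per attribute."""
    vector = []
    for attr in ATTRIBUTES:
        value = instance[attr.name]
        if attr.kind == "real":
            vector.append(float(value))
        else:
            if value not in attr.values:
                raise ValueError(f"{value!r} is not a value of {attr.name}")
            vector.append(attr.values.index(value))
    return vector

# The height is a made-up number used only for illustration.
friend = {"sex": "female", "hair color": "brown", "height_cm": 170}
print(to_vector(friend))   # [1, 1, 170.0]
```

Index encoding imposes an arbitrary order on nominal values, so a learner that treats positions as numeric may need a different encoding.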

AUC

▶Area Under Curve

Autonomous Helicopter Flight Using Reinforcement Learning

Adam Coates, Pieter Abbeel, Andrew Y. Ng
Stanford University, Stanford, CA, USA
University of California, Berkeley, CA, USA
Stanford University, Stanford, CA, USA

Definition

Helicopter flight is a highly challenging control problem. While it is possible to obtain controllers for simple maneuvers (like hovering) by traditional manual design procedures, this approach is tedious and typically requires many hours of adjustments and flight testing, even for an experienced control engineer. For complex maneuvers, such as aerobatic routines, this approach is likely infeasible. In contrast, ▶reinforcement learning (RL) algorithms enable faster and more automated design of controllers. Model-based RL algorithms have been used successfully for autonomous helicopter flight for hovering, forward flight and, using apprenticeship learning methods, for expert-level aerobatics. In model-based RL, one first builds a model of the helicopter dynamics and specifies the task using a reward function. Then, given the model and the reward function, the RL algorithm finds a controller that maximizes the expected sum of rewards accumulated over time.

Motivation and Background

Autonomous helicopter flight represents a challenging control problem and is widely regarded as being significantly harder than control of fixed-wing aircraft (see, e.g., Leishman, (); Seddon, ()). At the same time, helicopters provide unique capabilities such as in-place hover, vertical takeoff and landing, and low-speed maneuvering. These capabilities make helicopter control an important research problem for many practical applications. Building autonomous flight controllers for helicopters, however, is far from trivial. When done by hand, it can require many hours of tuning by experts with extensive prior knowledge about helicopter dynamics. Meanwhile, the automated development of helicopter controllers has been a major success story for RL methods. Controllers built using RL algorithms have established state-of-the-art performance for basic flight maneuvers, such as hovering and forward flight (Bagnell & Schneider, ; Ng, Kim, Jordan, & Sastry, ), and are among the only successful methods for advanced aerobatic stunts. Autonomous helicopter aerobatics has been successfully tackled using the innovation of “apprenticeship learning,” where the algorithm learns by watching a human demonstrator (Abbeel & Ng, ). These methods have enabled autonomous helicopters to fly aerobatics as well as an expert human pilot, and often even better (Coates, Abbeel, & Ng, ).

Developing autonomous flight controllers for helicopters is challenging for a number of reasons:

1. Helicopters have unstable, high-dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynamics. As a consequence, all successful helicopter flight controllers (to date) have many parameters. Controllers with – gains are not atypical. Hand engineering the right setting for each of the parameters is difficult and time consuming, especially since their effects on performance are often highly coupled through the helicopter’s complicated dynamics. Moreover, the unstable dynamics, especially in the low-speed flight regime, complicate flight testing.
2. Helicopters are underactuated: their position and orientation are representable using six parameters, but they have only four control inputs. Thus helicopter control requires significant planning and making trade-offs between errors in orientation and errors in desired position.
3. Helicopters have highly complex dynamics: Even though we describe the helicopter as having a twelve-dimensional state (position, velocity, orientation, and angular velocity), the true dynamics are significantly more complicated. To determine the precise effects of the inputs, one would have to consider the airflow in a large volume around the helicopter, as well as the parasitic coupling between the different inputs, the engine performance, and the non-rigidity of the rotor blades. Highly accurate simulators are thus difficult to create, and controllers developed in simulation must be sufficiently robust that they generalize to the real helicopter in spite of the simulator’s imperfections.
4. Sensing capabilities are often poor: For small remotely controlled (RC) helicopters, sensing is limited because the on-board sensors must deal with a large amount of vibration caused by the helicopter blades rotating at about Hz, as well as higher frequency noise from the engine. Although noise at these frequencies (which are well above the roughly Hz at which the helicopter dynamics can be modeled reasonably) might be easily removed by low pass filtering, this introduces latency and damping effects that are detrimental to control performance. As a consequence, helicopter flight controllers have to be robust to noise and/or latency in the state estimates to work well in practice.
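The latency cost of low pass filtering mentioned in the last item can be seen in a small numerical sketch. It assumes a first-order (exponential) low-pass filter, which is only one simple filter choice and is not taken from the cited systems. The function names and the smoothing factor `alpha` are hypothetical. A smaller `alpha` removes more high-frequency noise, but the filtered signal then takes longer to follow a sudden change.

```python
def low_pass(samples, alpha):
    """First-order low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    y, out = 0.0, []
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def samples_to_reach(alpha, level=0.9, max_samples=10_000):
    """Samples needed for the filtered output to reach `level` after a unit step."""
    for n, y in enumerate(low_pass([1.0] * max_samples, alpha), start=1):
        if y >= level:
            return n
    raise ValueError("level not reached within max_samples")

print(samples_to_reach(0.5))   # 4
print(samples_to_reach(0.1))   # 22
```

In this sketch, the heavier filter (`alpha = 0.1`) needs more than five times as many samples to report 90% of a step change. A controller acting on that estimate would see a correspondingly delayed state.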

Typical Hardware Setup

A typical autonomous helicopter has several basic sensors on board. An Inertial Measurement Unit (IMU) measures angular rates and linear accelerations for each of the helicopter’s three axes. A -axis magnetometer senses the direction of the Earth’s magnetic field, similar to a magnetic compass (Fig. ). Attitude-only sensing, as provided by the inertial and magnetic sensors, is insufficient for precise, stable hovering and slow-speed maneuvers. These maneuvers require that the helicopter maintain relatively tight control over its position error, and hence high-quality position sensing is needed. GPS is often used to determine helicopter position (with carrier-phase GPS units achieving sub-decimeter accuracy), but vision-based solutions have also been employed (Abbeel, Coates, Quigley, & Ng, ; Coates et al., ; Saripalli, Montgomery, & Sukhatme, ). Vibration adds errors to the sensor measurements and may damage the sensors themselves, hence significant effort may be required to mount the sensors on the airframe (Dunbabin, Brosnan, Roberts, & Corke, ). Provided there is no aliasing, sensor errors added by vibration can be removed by using a digital filter on the measurements (though, again, one must be careful to avoid adding too much latency).

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . (a) Stanford University’s instrumented XCell Tempest autonomous helicopter. (b) A Bergen Industrial Twin autonomous helicopter with sensors and on-board computer

Sensor data from the aircraft sensors is used to estimate the state of the helicopter for use by the control algorithm. This is usually done with an extended Kalman filter (EKF). A unimodal distribution (as computed by the EKF) suffices to represent the uncertainty in the state estimates, and it is common practice to use the mode of the distribution as the state estimate for feedback control. In general, the accuracy obtained with this method is sufficiently high that one can treat the state as fully observed. Most autonomous helicopters have an on-board computer that runs the EKF and the control algorithm (Gavrilets, Martinos, Mettler, & Feron, a; La Civita, Papageorgiou, Messner, & Kanade, ; Ng et al., ). However, it is also possible to use ground-based computers by sending sensor data wirelessly to the ground, and then transmitting control signals back to the helicopter through the pilot’s RC transmitter (Abbeel et al., ; Coates et al., ).
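The idea behind the state estimator can be sketched with a scalar Kalman filter. This is a deliberately simplified, linear stand-in for the EKF mentioned above, which handles the nonlinear, multi-dimensional helicopter state. The sketch assumes a single, nearly constant quantity observed through Gaussian noise. The function names, noise variances, and the "altitude" example are hypothetical. Because the resulting belief is Gaussian, its mean is also its mode, which is the value that would be passed to a feedback controller.

```python
import random

def noisy_readings(true_value, n, std, seed=0):
    """Synthetic sensor readings: the true value plus Gaussian noise."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0.0, std) for _ in range(n)]

def kalman_1d(measurements, meas_var, process_var=0.0, x0=0.0, p0=100.0):
    """Scalar Kalman filter for a (nearly) constant quantity.

    Returns the (estimate, variance) pair obtained after each measurement.
    """
    x, p, history = x0, p0, []
    for z in measurements:
        p += process_var            # predict: x_{t+1} = x_t + process noise
        k = p / (p + meas_var)      # Kalman gain
        x += k * (z - x)            # correct the estimate with the residual
        p *= (1.0 - k)              # the variance shrinks after each update
        history.append((x, p))
    return history

if __name__ == "__main__":
    readings = noisy_readings(true_value=10.0, n=200, std=1.0)
    estimate, variance = kalman_1d(readings, meas_var=1.0)[-1]
    print(f"altitude estimate {estimate:.2f}, variance {variance:.4f}")
```

With zero process noise, the variance falls steadily as measurements accumulate. A real estimator would use a nonzero process noise so that it keeps tracking a moving vehicle.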

Helicopter State and Controls

The helicopter state s is defined by its position $(p_x, p_y, p_z)$, orientation (which could be expressed using a unit quaternion q), velocity $(v_x, v_y, v_z)$, and angular velocity $(\omega_x, \omega_y, \omega_z)$. The helicopter is controlled via a four-dimensional action space:

1. $u_1$ and $u_2$: The lateral (left-right) and longitudinal (front-back) cyclic pitch controls (together referred to as the “cyclic” controls) cause the helicopter to roll left or right, and pitch forward or backward, respectively.
2. $u_3$: The tail rotor pitch control affects tail rotor thrust, and can be used to yaw (turn) the helicopter about its vertical axis. In analogy to airplane control, the tail rotor control is commonly referred to as “rudder.”
3. $u_4$: The collective pitch control (often referred to simply as “collective”) increases and decreases the pitch of the main rotor blades, thus increasing or decreasing the vertical thrust produced as the blades sweep through the air.

By using the cyclic and rudder controls, the pilot can rotate the helicopter into any orientation. This allows the pilot to direct the thrust of the main rotor in any particular direction, and thus fly in any direction, by rotating the helicopter appropriately.
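The state and controls described above can be written down as a small data structure, together with the quaternion rotation that expresses how rotating the helicopter redirects the main rotor thrust. This is a sketch under stated assumptions: the class and field names are hypothetical, the quaternion is stored as (w, x, y, z), and the thrust is assumed to act along the body-frame +z axis, whereas real systems often use other axis conventions. Storing the orientation as a unit quaternion uses four numbers for three rotational degrees of freedom.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]   # unit quaternion (w, x, y, z)

@dataclass
class HelicopterState:
    position: Vec3
    orientation: Quat
    velocity: Vec3
    angular_velocity: Vec3

@dataclass
class Controls:
    cyclic_lateral: float        # roll left/right
    cyclic_longitudinal: float   # pitch forward/backward
    rudder: float                # tail rotor pitch (yaw)
    collective: float            # main rotor blade pitch (vertical thrust)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate the body-frame vector v into the world frame with unit quaternion q."""
    w, qv = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(qv, v))
    u = cross(qv, t)
    return tuple(v[i] + w * t[i] + u[i] for i in range(3))

def thrust_direction(state: HelicopterState) -> Vec3:
    """World-frame direction of main rotor thrust (assumed along body +z)."""
    return rotate(state.orientation, (0.0, 0.0, 1.0))

level = HelicopterState((0, 0, 0), (1.0, 0.0, 0.0, 0.0), (0, 0, 0), (0, 0, 0))
half = math.pi / 4   # half of a 90-degree roll about the body x axis
rolled = HelicopterState((0, 0, 0), (math.cos(half), math.sin(half), 0.0, 0.0),
                         (0, 0, 0), (0, 0, 0))
print(thrust_direction(level), thrust_direction(rolled))
```

In the level attitude the thrust points straight along +z. After a 90-degree roll it points sideways, which is how the four controls can move the helicopter in any direction.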

Helicopter Flight as an RL Problem

Formulation

An RL problem can be described by a tuple (S, A, T, H, s(0), R), which is referred to as a ▶Markov decision process (MDP). Here S is the set of states; A is the set of actions or inputs; T is the dynamics model, which is a set of probability distributions $\{P^t_{su}\}$ ($P^t_{su}(s' \mid s, u)$ is the probability of being in state s′ at time t + 1, given that the state and action at time t are s and u); H is the horizon or number of time steps of interest; s(0) ∈ S is the initial state; and $R : S \times A \to \mathbb{R}$ is the reward function. A policy $\pi = (\mu_0, \mu_1, \ldots, \mu_H)$ is a tuple of mappings from states S to actions A, one mapping for each time t = 0, . . . , H. The expected sum of rewards when acting according to a policy π is given by

$$U(\pi) = \mathbb{E}\left[\sum_{t=0}^{H} R(s(t), u(t)) \,\middle|\, \pi\right].$$

The optimal policy $\pi^*$ for an MDP (S, A, T, H, s(0), R) is the policy that maximizes the expected sum of rewards. In particular, the optimal policy is given by $\pi^* = \arg\max_{\pi} U(\pi)$.

The common approach to finding a good policy for autonomous helicopter flight proceeds in two steps. First, one collects data from manual helicopter flights to build a model. (One could also build a helicopter model by directly measuring physical parameters such as mass, rotor span, etc. However, even when this approach is pursued, one often resorts to collecting flight data to complete the model.) Then one solves the MDP comprising the model and some chosen reward function. Although the controller obtained is, in principle, only optimal for the learned simulator model, it has been shown in various settings that optimal controllers perform well even when the model has some inaccuracies (see, e.g., Anderson & Moore, ()).
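The objective U(π) can be estimated by simulation, as the following sketch illustrates. It assumes a toy MDP that has nothing to do with real helicopter dynamics: a scalar position error s, a scalar input u, additive Gaussian process noise, a quadratic penalty as the reward, and rewards summed over time indices 0 through H. All names, gains, and weights are hypothetical.

```python
import random

H = 20   # horizon; rewards are summed over t = 0, ..., H (assumed indexing)

def reward(s, u):
    """R(s, u): quadratic penalty on position error and control effort."""
    return -(s * s) - 0.1 * (u * u)

def step(s, u, rng, noise_std=0.05):
    """Toy dynamics model T: the next state is s + u plus Gaussian noise."""
    return s + u + rng.gauss(0.0, noise_std)

def make_policy(gain):
    """Linear feedback u = -gain * s, using the same mapping at every time t."""
    return lambda s, t: -gain * s

def estimate_return(policy, s0=1.0, n_rollouts=200, seed=0):
    """Monte Carlo estimate of U(pi) = E[sum_t R(s(t), u(t)) | pi]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        for t in range(H + 1):
            u = policy(s, t)
            total += reward(s, u)
            s = step(s, u, rng)
    return total / n_rollouts

if __name__ == "__main__":
    for gain in (0.0, 0.5, 1.0):
        print(gain, round(estimate_return(make_policy(gain)), 3))
```

Comparing the estimates for a few gains is the crudest form of searching for a policy that maximizes U(π). A policy that never corrects the error (gain 0) accumulates a much larger penalty than one that does.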

Modeling

One way to create a helicopter model is to use direct knowledge of aerodynamics to derive an explicit mathematical model. This model will depend on a number of parameters that are particular to the helicopter being flown. Many of the parameters may be measured directly (e.g., mass, rotational inertia), while others must be estimated from flight experiments. This approach has been used successfully on several systems (see, e.g., Gavrilets, Martinos, Mettler, & Feron, b; Gavrilets, Mettler, & Feron, ; La Civita, ). However, substantial expert aerodynamics knowledge is required for this modeling approach. Moreover, these models tend to cover only a limited fraction of the flight envelope.

Alternatively, one can learn a model of the dynamics directly from flight data, with only limited a priori knowledge of the helicopter’s dynamics. Data is usually collected from a series of manually controlled flights. These flights involve the human pilot sweeping the control sticks back and forth at varying frequencies to cover as much of the flight envelope as possible, while recording the helicopter’s state and the pilot inputs at each instant. Given a corpus of flight data, various learning algorithms can be used to learn the underlying model of the helicopter dynamics. If one is only interested in a single flight regime, one could learn a linear model that maps from the current state and action to the next state. Such a model can be easily estimated using ▶linear regression. (While the methods presented here emphasize time-domain estimation, frequency-domain estimation is also possible for the special case of estimating linear models (Tischler & Cauffman, ).) Linear models are restricted to small flight regimes (e.g., hover or inverted hover) and do not immediately generalize to full-envelope flight. To cover a broader flight regime, non parametric algorithms such as locally-weighted linear regression have been used (Bagnell & Schneider, ; Ng et al., ). Non parametric models that map from current state and action to next state can, in principle, cover the entire flight regime.
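The linear-regression step for a single flight regime can be sketched as follows. The example assumes a scalar state and a scalar input so that the least-squares normal equations can be solved by hand. A real helicopter model would regress a vector-valued next state on the full state and control vectors. The synthetic "flight log", the function names, and the coefficients 0.9 and 0.5 are hypothetical.

```python
import random

def simulate_log(a, b, n, noise_std=0.0, seed=0):
    """Synthetic flight log generated from s' = a*s + b*u + noise (toy system)."""
    rng = random.Random(seed)
    states, actions, next_states = [], [], []
    s = 0.0
    for _ in range(n):
        u = rng.uniform(-1.0, 1.0)   # stand-in for the pilot's stick sweeps
        s_next = a * s + b * u + rng.gauss(0.0, noise_std)
        states.append(s)
        actions.append(u)
        next_states.append(s_next)
        s = s_next
    return states, actions, next_states

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s' ~ a*s + b*u via the 2x2 normal equations."""
    sss = sum(s * s for s in states)
    suu = sum(u * u for u in actions)
    ssu = sum(s * u for s, u in zip(states, actions))
    ssn = sum(s * n for s, n in zip(states, next_states))
    sun = sum(u * n for u, n in zip(actions, next_states))
    det = sss * suu - ssu * ssu
    if abs(det) < 1e-12:
        raise ValueError("data does not excite both the state and the input")
    a = (ssn * suu - sun * ssu) / det
    b = (sun * sss - ssn * ssu) / det
    return a, b

if __name__ == "__main__":
    log = simulate_log(a=0.9, b=0.5, n=500, noise_std=0.05)
    print(fit_linear_dynamics(*log))
```

The determinant check reflects why pilots sweep the controls during data collection: without enough variation in both state and input, the parameters cannot be identified.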
Unfortunately, non parametric models require large amounts of data to be accurate, and they are often quite slow to evaluate. An alternative way to increase the expressiveness of the model, without resorting to non parametric methods, is to consider a time-varying model where the dynamics are explicitly allowed to depend on time. One can then proceed to compute simpler (say, linear) parametric models for each choice of the time parameter. This method is effective when learning a model specific to a trajectory whose dynamics are repeatable but vary as the aircraft travels along the trajectory. Since this method can also require a great deal of data (similar to non parametric methods), in practice it is helpful to begin with a non-time-varying parametric model fit from a large amount of data, and then augment it with a time-varying component that has fewer parameters (Abbeel, Quigley, & Ng, ; Coates et al., ).

One can also take advantage of symmetry in the helicopter dynamics to reduce the amount of data needed to fit a parametric model. Abbeel, Ganapathi, and Ng () observe that, in a coordinate frame attached to the helicopter, the helicopter dynamics are essentially the same for any orientation (or position) once the effect of gravity is removed. They learn a model that predicts (angular and linear) accelerations, except for the effects of gravity, in the helicopter frame as a function of the inputs and the (angular and linear) velocity in the helicopter frame. This leads to a lower-dimensional learning problem, which requires significantly less data. To simulate the helicopter dynamics over time, the predicted accelerations, augmented with the effects of gravity, are integrated over time to obtain velocity, angular rates, position, and orientation. Abbeel et al. () used this approach to learn a helicopter model that was later used for autonomous aerobatic helicopter flight maneuvers covering a large part of the flight envelope. Significantly less data is required to learn a model using the gravity-free parameterization than using a parameterization that directly predicts the next state as a function of current state and actions (as was used in Bagnell and Schneider (), Ng et al. ()). Abbeel et al. evaluate their model by checking its simulation accuracy over longer time scales than just a one-step acceleration prediction.
Such an evaluation criterion maps more directly to the reinforcement learning objective of maximizing the expected sum of rewards accumulated over time (see also Abbeel & Ng, (b)). The models considered above are deterministic. This would normally allow us to drop the expectation when evaluating a policy according to $\mathbb{E}\left[\sum_{t=0}^{H} R(s(t), u(t)) \mid \pi\right]$. However, it is common to add stochasticity to account for unmodeled effects. Abbeel et al. () and Ng et al. () include additive process noise in their models. Bagnell and Schneider () go further, learning a distribution over models. Their policy must then perform well, in expectation, for a (deterministic) model selected randomly from the distribution.

Control Problem Solution Methods

Given a model of the helicopter, we now seek a policy π that maximizes the expected sum of rewards U(π) = E[∑H t = R(s(t), u(t))∣π] achieved when acting according to the policy π. Policy Search General policy search algorithms can be

employed to search for optimal policies for the MDP based on the learned model. Given a policy π, we can directly try to optimize the objective U(π). Unfortunately, U(π) is an expectation over a complicated distribution making it impractical to evaluate the expectation exactly in general. One solution is to approximate the expectation U(π) by Monte Carlo sampling: under certain boundedness assumptions the empirical average of the sum of rewards accumulated over time will give a good ˆ estimate U(π) of the expectation U(π). Naively Applying Monte Carlo sampling to accurately compute, e.g., the local gradient from the difference in function value at nearby points, requires very large amounts of samples due to the stochasticity in the function evaluation. To get around this hurdle, the PEGASUS algorithm (Ng & Jordan, ) can be used to convert the stochastic optimization problem into a deterministic one. When evaluating by averaging over n simulations, PEGASUS initially fixes n random seeds. For each policy evaluation, the same n random seeds are used so that the simulator is now deterministic. In particular, multiple evaluations of the same policy will result in the same computed reward. A search algorithm can then be applied to the deterministic problem to find an optimum. The PEGASUS algorithm coupled with a simple local policy search was used by Ng et al. () to develop a policy for their autonomous helicopter that successfully sustains inverted hover. Bagnell and Schneider () proceed similarly, but use the “amoeba” search algorithm (Nelder & Mead, ) for policy search. Because of the searching involved, the policy class must generally have low dimension. Nonetheless, it is

often possible to find good policies within these policy classes. The policy class of Ng et al. (), for instance, is a decoupled, linear PD controller with a sparse dependence on the state variables (For instance, the linear controller for the pitch axis is parametrized as u = c (px −p∗x )+c (vx −v∗x )+c θ, which has just three parameters while the entire state is nine dimensional. Here, p⋅ , v⋅ , and p∗⋅ , v⋅∗ , respectively, are the actual and desired position and velocity. θ denotes the pitch angle.). The sparsity reduces the policy class to just nine parameters. In Bagnell and Schneider (), two-layer neural network structures are used with a similar sparse dependence on the state variables. Two neural networks with five parameters each are learned for the cyclic controls. Differential Dynamic Programming Abbeel et al. ()

use differential dynamic programming (DDP) for the task of aerobatic trajectory following. DDP (Jacobson & Mayne, ) works by iteratively approximating the MDP as linear quadratic regulator (LQR) problems. The LQR control problem is a special class of MDPs, for which the optimal policy can be computed efficiently. In LQR the set of states is given by S = Rn , the set of actions/inputs is given by A = Rp , and the dynamics model is given by: s(t + ) = A(t)s(t) + B(t)u(t) + w(t), where for all t = , . . . , H we have that A(t) ∈ Rn×n , B(t) ∈ Rn×p and w(t) is a mean zero random variable (with finite variance). The reward for being in state s(t) and taking action u(t) is given by: −s(t)⊺ Q(t)s(t) − u(t)⊺ R(t)u(t). Here Q(t), R(t) are positive semi-definite matrices which parameterize the reward function. It is wellknown that the optimal policy for the LQR control problem is a linear feedback controller which can be efficiently computed using dynamic programming (see, e.g., Anderson & Moore, (), for details on linear quadratic methods.) DDP approximately solves general continuous statespace MDPs by iterating the following two steps until convergence: . Compute a linear approximation to the nonlinear dynamics and a quadratic approximation to

Autonomous Helicopter Flight Using Reinforcement Learning

the reward function around the trajectory obtained when executing the current policy in simulation. . Compute the optimal policy for the LQR problem obtained in Step and set the current policy equal to the optimal policy for the LQR problem. During the first iteration, the linearizations are performed around the target trajectory for the maneuver, since an initial policy is not available. This method is used to perform autonomous flips, rolls, and “funnels” (high-speed sideways flight in a circle) in Abbeel et al. () and autonomous autorotation (Autorotation is an emergency maneuver that allows a skilled pilot to glide a helicopter to a safe landing in the event of an engine failure or tail-rotor failure.) in Abbeel, Coates, Hunter, and Ng (), Fig. . While DDP computes a solution to the non-linear optimization problem, it relies on the accuracy of the non-linear model to correctly predict the trajectory that will be flown by the helicopter. This prediction is used in Step above to linearize the dynamics. In practice, the helicopter will often not follow the predicted trajectory closely (due to stochasticity and modeling errors), and thus the linearization will become a highly inaccurate approximation of the non-linear model. A common solution to this, applied by Coates et al. (), is to compute the DDP solution online, linearizing around a trajectory that begins at the current helicopter state. This ensures that the model is always linearized around a trajectory near the helicopter’s actual flight path. Apprenticeship Learning and Inverse RL In computing a

policy for an MDP, simply finding a solution (using any method) that performs well in simulation may not be enough. One may need to adjust both the model and


reward function based on the results of flight testing. Modeling error may result in controllers that fly perfectly in simulation but perform poorly or fail entirely in reality. Because helicopter dynamics are difficult to model exactly, this problem is fairly common. Meanwhile, a poor reward function can result in a controller that is not robust to modeling errors or unpredicted perturbations (e.g., it may use large control inputs that are unsafe in practice). If a human “expert” is available to demonstrate the maneuver, this demonstration flight can be leveraged to obtain a better model and reward function. The reward function encodes both the trajectory that the helicopter should follow, as well as the trade-offs between different types of errors. If the desired trajectory is infeasible (either in the non-linear simulation or in reality), this results in a significantly more difficult control problem. Also, if the trade-offs are not specified correctly, the helicopter may be unable to compensate for significant deviations from the desired trajectory. For instance, a typical reward function for hovering implicitly specifies a trade-off between position error and orientation error (it is possible to reduce one error, but usually at the cost of increasing the other). If this trade-off is incorrectly chosen, the controller may be pushed off course by wind (if it tries too hard to keep the helicopter level) or, conversely, may tilt the helicopter to an unsafe attitude while trying to correct for a large position error. We can use demonstrations from an expert pilot to recover both a good choice for the desired trajectory as well as good choices of reward weights for errors relative to this trajectory. In apprenticeship learning, we are given a set of N recorded state and control sequences,

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Snapshots of an autonomous helicopter performing in-place flips and rolls


$\{s_k(t), u_k(t)\}_{t=0}^{H}$ for $k = 1, \ldots, N$, from demonstration flights by an expert pilot. Coates et al. () note that these demonstrations may be sub-optimal but are often sub-optimal in different ways. They suggest that a large number of expert demonstrations may implicitly encode the optimal trajectory and propose a generative model that explains the expert demonstrations as stochastic instantiations of an “ideal” trajectory. This is the desired trajectory that the expert has in mind but is unable to demonstrate exactly. Using an Expectation-Maximization (Dempster, Laird, & Rubin, ) algorithm, they infer the desired trajectory and use this as the target trajectory in their reward function.

A good choice of reward weights (for errors relative to the desired trajectory) can be recovered using inverse reinforcement learning (Abbeel & Ng, ; Ng & Russell, ). Suppose the reward function is written as a linear combination of features as follows: $R(s, u) = c_1 \phi_1(s, u) + c_2 \phi_2(s, u) + \cdots$. For a single recorded demonstration, $\{s(t), u(t)\}_{t=0}^{H}$, the pilot’s accumulated reward corresponding to each feature can be computed as $c_i \phi_i^* = c_i \sum_{t=0}^{H} \phi_i(s(t), u(t))$. If the pilot out-performs the autonomous flight controller with respect to a particular feature $\phi_i$, this indicates that the pilot’s own “reward function” places a higher value on that feature, and hence its weight $c_i$ should be increased. Using this procedure, a good choice of reward function that makes trade-offs similar to those of a human pilot can be recovered. This method has been used to guide the choice of reward for many maneuvers during flight testing (Abbeel et al., , ; Coates et al., ).

In addition to learning a better reward function from pilot demonstration, one can also use the pilot demonstration to improve the model directly and attempt to reduce modeling error. Coates et al. (), for instance, use errors observed in expert demonstrations to jointly infer an improved dynamics model along with the desired trajectory. Abbeel et al. (), however, have proposed the following alternating procedure that is broadly applicable (see also Abbeel and Ng (a) for details):

1. Collect data from a human pilot flying the desired maneuvers with the helicopter. Learn a model from the data.

2. Find a controller that works in simulation based on the current model.
3. Test the controller on the helicopter. If it works, we are done. Otherwise, use the data from the test flight to learn a new (improved) model and go back to Step 2.

This procedure has similarities with model-based RL and with the common approach in control to first perform system identification and then find a controller using the resulting model. However, the key insight from Abbeel and Ng (a) is that this procedure is guaranteed to converge to expert performance in a polynomial number of iterations. The authors report needing at most three iterations in practice. Importantly, unlike the E³ family of algorithms (Kearns & Singh, ), this procedure does not require explicit exploration policies. One only needs to test controllers that try to fly as well as possible (according to the current choice of dynamics model). (Indeed, the E³ family of algorithms (Kearns & Singh, ) and its extensions (Brafman & Tennenholtz, ; Kakade, Kearns, & Langford, ; Kearns & Koller, ) proceed by generating “exploration” policies, which try to visit inaccurately modeled parts of the state space. Unfortunately, such exploration policies do not even try to fly the helicopter well, and thus would almost invariably lead to crashes.)

The apprenticeship learning algorithms described above have been used to fly the most advanced autonomous maneuvers to date. The apprenticeship learning algorithm of Coates et al. (), for example, has been used to attain expert-level performance on challenging aerobatic maneuvers as well as entire airshows composed of many maneuvers in rapid sequence. These maneuvers include in-place flips and rolls, tic-tocs (a maneuver where the helicopter pitches forward and backward with its nose pointed toward the sky, resembling an inverted clock pendulum), and chaos (a maneuver where the helicopter flips in-place but does so while continuously pirouetting at a high rate; visually, the helicopter body appears to tumble chaotically while nevertheless remaining in roughly the same position) (see Fig. ). These maneuvers are considered among the most challenging possible and can only be performed


Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Snapshot sequence of an autonomous helicopter flying a “chaos” maneuver using apprenticeship learning methods. Beginning from top-left and proceeding left-to-right, top-to-bottom, the helicopter performs a flip while pirouetting counter-clockwise about its vertical axis. (This maneuver has been demonstrated continuously for as long as cycles like the one shown here)

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Super-imposed sequence of images of autonomous autorotation landings (from Abbeel et al. ())

by advanced human pilots. In fact, Coates et al. () show that their learned controller performance can even exceed the performance of the expert pilot providing the demonstrations, putting many of the maneuvers on par with professional pilots (Fig. ). A similar approach has been used in Abbeel et al. () to perform the first successful autonomous autorotations. Their aircraft has performed more than autonomous landings successfully without engine power. Not only do apprenticeship methods achieve state-of-the-art performance, but they are also among the fastest learning methods available, as they obviate the need for arduous hand tuning by engineers. Coates et al. (), for instance, report that entire airshows can be

created from scratch with just h of work. This is in stark contrast to previous approaches that may have required hours or even days of tuning for relatively simple maneuvers.
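The LQR subproblem that DDP solves at each iteration (described earlier in this entry) can be handled by a backward dynamic-programming recursion. The following is a minimal sketch, not the authors' implementation: it assumes a scalar, time-invariant system (n = p = 1) so that it stays in pure Python, and the names `lqr_gains` and `rollout` are hypothetical. Because the noise w(t) is zero-mean and additive, it does not affect the optimal feedback gains, so the sketch ignores it.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for a scalar, time-invariant LQR problem.

    Assumed dynamics: s(t+1) = a*s(t) + b*u(t) + w(t).
    Assumed reward:   -q*s(t)^2 - r*u(t)^2, with q >= 0 and r > 0.
    Returns gains k[t] for the linear feedback policy u(t) = -k[t] * s(t).
    """
    p = q  # quadratic cost-to-go coefficient at the final time step
    gains = [0.0] * horizon
    for t in reversed(range(horizon)):
        k = (a * b * p) / (r + b * b * p)   # minimizer of the one-step lookahead
        p = q + a * a * p - a * b * p * k   # cost-to-go coefficient one step earlier
        gains[t] = k
    return gains


def rollout(a, b, gains, s0):
    """Noise-free rollout of the closed-loop system under the computed gains."""
    states = [s0]
    for k in gains:
        u = -k * states[-1]
        states.append(a * states[-1] + b * u)
    return states
```

In a DDP-style loop, `a` and `b` would be replaced by time-varying linearizations around the current trajectory, and the recursion would be rerun after each new rollout; that outer loop is omitted here.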

Conclusion Helicopter control is a challenging control problem and has recently seen major successes with the application of learning algorithms. This chapter has shown how each step of the control design process can be automated using machine learning algorithms for system identification and reinforcement learning algorithms for control. It has also shown how apprenticeship learning algorithms can be employed to achieve


expert-level performance on challenging aerobatic maneuvers when an expert pilot can provide demonstrations. Autonomous helicopters with control systems developed using these methods are now capable of flying advanced aerobatic maneuvers (including flips, rolls, tic-tocs, chaos, and auto-rotation) at the level of expert human pilots.

Cross References ▶Apprenticeship Learning ▶Reinforcement Learning ▶Reward Shaping

Recommended Reading
Abbeel, P., Coates, A., Hunter, T., & Ng, A. Y. (). Autonomous autorotation of an rc helicopter. In ISER .
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (). An application of reinforcement learning to aerobatic helicopter flight. In NIPS (pp. –). Vancouver.
Abbeel, P., Ganapathi, V., & Ng, A. Y. (). Learning vehicular dynamics with application to modeling helicopters. In NIPS . Vancouver.
Abbeel, P., & Ng, A. Y. (). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the international conference on machine learning. New York: ACM.
Abbeel, P., & Ng, A. Y. (a). Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the international conference on machine learning. New York: ACM.
Abbeel, P., & Ng, A. Y. (b). Learning first order Markov models for control. In NIPS .
Abbeel, P., Quigley, M., & Ng, A. Y. (). Using inaccurate models in reinforcement learning. In ICML ’: Proceedings of the rd international conference on machine learning (pp. –). New York: ACM.
Anderson, B., & Moore, J. (). Optimal control: Linear quadratic methods. Princeton, NJ: Prentice-Hall.
Bagnell, J., & Schneider, J. (). Autonomous helicopter control using reinforcement learning policy search methods. In International conference on robotics and automation. Canada: IEEE.
Brafman, R. I., & Tennenholtz, M. (). R-max, a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, , –.
Coates, A., Abbeel, P., & Ng, A. Y. (). Learning for control from multiple demonstrations. In ICML ’: Proceedings of the th international conference on machine learning.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, , –.
Dunbabin, M., Brosnan, S., Roberts, J., & Corke, P. (). Vibration isolation for autonomous helicopter flight. In Proceedings of the IEEE international conference on robotics and automation (Vol. , pp. –).
Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (a). Control logic for automated aerobatic flight of miniature helicopter. In AIAA guidance, navigation and control conference. Cambridge, MA: Massachusetts Institute of Technology.
Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (b). Flight test and simulation results for an autonomous aerobatic helicopter. In AIAA/IEEE digital avionics systems conference.
Gavrilets, V., Mettler, B., & Feron, E. (). Nonlinear model for a small-size acrobatic helicopter. In AIAA guidance, navigation and control conference (pp. –).
Jacobson, D. H., & Mayne, D. Q. (). Differential dynamic programming. New York: Elsevier.
Kakade, S., Kearns, M., & Langford, J. (). Exploration in metric state spaces. In Proceedings of the international conference on machine learning.
Kearns, M., & Koller, D. (). Efficient reinforcement learning in factored MDPs. In Proceedings of the th international joint conference on artificial intelligence. San Francisco: Morgan Kaufmann.
Kearns, M., & Singh, S. (). Near-optimal reinforcement learning in polynomial time. Machine Learning Journal, (–), –.
La Civita, M. (). Integrated modeling and robust control for full-envelope flight of robotic helicopters. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.
La Civita, M., Papageorgiou, G., Messner, W. C., & Kanade, T. (). Design and flight testing of a high-bandwidth H∞ loop shaping controller for a robotic helicopter. Journal of Guidance, Control, and Dynamics, (), –.
Leishman, J. (). Principles of helicopter aerodynamics. Cambridge: Cambridge University Press.
Nelder, J. A., & Mead, R. (). A simplex method for function minimization. The Computer Journal, , –.
Ng, A. Y., & Jordan, M. (). Pegasus: A policy search method for large MDPs and POMDPs. In Proceedings of the uncertainty in artificial intelligence th conference. San Francisco: Morgan Kaufmann.
Ng, A. Y., & Russell, S. (). Algorithms for inverse reinforcement learning. In Proceedings of the th international conference on machine learning (pp. –). San Francisco: Morgan Kaufmann.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (). Autonomous inverted helicopter flight via reinforcement learning. In International symposium on experimental robotics. Berlin: Springer.
Ng, A. Y., Kim, H. J., Jordan, M., & Sastry, S. (). Autonomous helicopter flight via reinforcement learning. In NIPS .
Saripalli, S., Montgomery, J. F., & Sukhatme, G. S. (). Visually-guided landing of an unmanned aerial vehicle. IEEE Transactions on Robotics and Autonomous Systems, (), –.
Seddon, J. (). Basic helicopter aerodynamics. In AIAA education series. El Segundo, CA: American Institute of Aeronautics and Astronautics.
Tischler, M. B., & Cauffman, M. G. (). Frequency response method for rotorcraft system identification: Flight application to BO- coupled rotor/fuselage dynamics. Journal of the American Helicopter Society, .

Average-Cost Neuro-Dynamic Programming ▶Average-Reward Reinforcement Learning

Average-Cost Optimization ▶Average-Reward Reinforcement Learning

Averaged One-Dependence Estimators Fei Zheng, Geoffrey I. Webb Monash University

Synonyms AODE

Definition Averaged one-dependence estimators is a ▶semi-naive Bayesian learning method. It performs classification by aggregating the predictions of multiple one-dependence classifiers in which all attributes depend on the same single parent attribute as well as the class.

Classification with AODE An effective approach to accommodating violations of naive Bayes’ attribute independence assumption is to allow an attribute to depend on other non-class attributes. To maintain efficiency it can be desirable to utilize one-dependence classifiers, such as ▶Tree Augmented Naive Bayes (TAN), in which each attribute depends upon the class and at most one other attribute. However, most approaches to learning with one-dependence classifiers perform model selection, a process that usually imposes substantial computational overheads and substantially increases variance relative to naive Bayes.

AODE avoids model selection by averaging the predictions of multiple one-dependence classifiers. In each one-dependence classifier, an attribute is selected as the parent of all the other attributes. This attribute is called the SuperParent, and this type of one-dependence classifier is called a SuperParent one-dependence estimator (SPODE). Only those SPODEs with SuperParent $x_i$ where the value of $x_i$ occurs at least $m$ times are used for predicting a class label $y$ for the test instance $x = \langle x_1, \ldots, x_n \rangle$. For any attribute value $x_i$,

$$P(y, x) = P(y, x_i)P(x \mid y, x_i).$$

This equality holds for every $x_i$. Therefore,

$$P(y, x) = \frac{\sum_{1 \le i \le n \wedge F(x_i) \ge m} P(y, x_i)P(x \mid y, x_i)}{\left|\{1 \le i \le n \wedge F(x_i) \ge m\}\right|}, \qquad (1)$$

where $F(x_i)$ is the frequency of attribute value $x_i$ in the training sample. Utilizing (1) and the assumption that attributes are independent given the class and the SuperParent $x_i$, AODE predicts the class for $x$ by selecting

$$\arg\max_y \sum_{1 \le i \le n \wedge F(x_i) \ge m} \hat{P}(y, x_i) \prod_{1 \le j \le n,\, j \ne i} \hat{P}(x_j \mid y, x_i). \qquad (2)$$

It averages over estimates of the terms in (1), rather than the true values, which has the effect of reducing the variance of these estimates. Figure 1 shows a Markov network representation of an example AODE.

As AODE makes a weaker attribute conditional independence assumption than naive Bayes while still avoiding model selection, it has substantially lower ▶bias with a very small increase in ▶variance. A number of studies (Webb, Boughton, & Wang, ; Zheng & Webb, ) have demonstrated that it often has considerably lower zero-one loss than naive Bayes with moderate time complexity. For comparisons with other semi-naive techniques, see ▶semi-naive Bayesian learning. One study (Webb, Boughton, & Wang, ) found AODE to provide classification accuracy competitive with a state-of-the-art discriminative algorithm, boosted decision trees.

When a new instance is available, AODE, like naive Bayes, only needs to update the probability estimates. Therefore, it is also suited to incremental learning.
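The AODE decision rule can be sketched in a few lines for categorical attributes. This is an illustrative sketch only: the class name `AODE`, the use of Laplace smoothing for the probability estimates, and the omission of any fallback when no attribute value reaches the frequency threshold m are all assumptions made here, not details given in the entry.

```python
from collections import Counter


class AODE:
    """Minimal AODE sketch for categorical attributes (hypothetical helper)."""

    def __init__(self, m=1):
        self.m = m  # minimum frequency F(x_i) for a SuperParent to be used

    def fit(self, X, y):
        self.n = len(X[0])
        self.N = len(X)
        self.classes = sorted(set(y))
        self.values = [sorted({row[i] for row in X}) for i in range(self.n)]
        self.freq = Counter()    # (i, x_i) -> count, i.e. F(x_i)
        self.pair = Counter()    # (y, i, x_i) -> count
        self.triple = Counter()  # (y, i, x_i, j, x_j) -> count
        for row, c in zip(X, y):
            for i, xi in enumerate(row):
                self.freq[i, xi] += 1
                self.pair[c, i, xi] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.triple[c, i, xi, j, xj] += 1
        return self

    def scores(self, x):
        """Sum over qualifying SuperParents of P^(y, x_i) * prod_j P^(x_j | y, x_i)."""
        parents = [i for i in range(self.n) if self.freq[i, x[i]] >= self.m]
        out = {}
        for c in self.classes:
            total = 0.0
            for i in parents:
                # Laplace-smoothed estimate of P(y, x_i) (smoothing is an assumption)
                p = (self.pair[c, i, x[i]] + 1) / (
                    self.N + len(self.classes) * len(self.values[i]))
                for j in range(self.n):
                    if j != i:
                        # Laplace-smoothed estimate of P(x_j | y, x_i)
                        p *= (self.triple[c, i, x[i], j, x[j]] + 1) / (
                            self.pair[c, i, x[i]] + len(self.values[j]))
                total += p
            out[c] = total
        return out

    def predict(self, x):
        s = self.scores(x)
        return max(self.classes, key=lambda c: s[c])
```

Because the model consists only of counts, incorporating a new training instance amounts to incrementing the relevant counters, which is what makes the method suitable for incremental learning.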


Averaged One-Dependence Estimators. Figure . A Markov network representation of the SPODEs that comprise an example AODE

Cross References ▶Bayesian Network ▶Naive Bayes ▶Semi-Naive Bayesian Learning ▶Tree-Augmented Naive Bayes

Recommended Reading
Webb, G. I., Boughton, J., & Wang, Z. (). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, (), –.
Zheng, F., & Webb, G. I. (). A comparative study of semi-naive Bayes methods in classification learning. In Proceedings of the Fourth Australasian Data Mining Conference (pp. –).

Average-Payoff Reinforcement Learning ▶Average-Reward Reinforcement Learning

Average-Reward Reinforcement Learning Prasad Tadepalli Oregon State University, Corvallis, OR, USA

Synonyms ARL; Average-cost neuro-dynamic programming; Average-cost optimization; Average-payoff reinforcement learning

Definition Average-reward reinforcement learning (ARL) refers to learning policies that optimize the average reward per time step by continually taking actions and observing the outcomes, including the next state and the immediate reward.

Motivation and Background ▶Reinforcement learning (RL) is the study of programs that improve their performance at some task by receiving rewards and punishments from the environment (Sutton & Barto, ). RL has been quite successful in automatic learning of good procedures for complex tasks such as playing Backgammon and scheduling elevators (Crites & Barto, ; Tesauro, ). In episodic domains in which there is a natural termination condition, such as the end of the game in Backgammon, the obvious performance measure to optimize is the expected total reward per episode. But some domains, such as elevator scheduling, are recurrent, i.e., they do not have a natural termination condition. In such cases, total expected reward can be infinite, and we need a different optimization criterion. In the discounted optimization framework, in each time step, the value of the reward is multiplied by a discount factor γ < 1, so that the total discounted reward is always finite. However, in many domains, there is no natural interpretation for the discount factor γ. A natural performance measure to optimize in such domains is the average reward received per time step. Although one could use a discount factor which is close to 1 to approximate average-reward optimization, an approach that directly optimizes the average reward avoids this additional parameter and often leads to faster convergence in practice.

There is significant theory behind average-reward optimization based on ▶Markov decision processes (MDPs) (Puterman, ). An MDP is described by a 4-tuple ⟨S, A, P, r⟩, where S is a discrete set of states and A is a discrete set of actions. P is a conditional probability distribution over the next states, given the current state and action, and r gives the immediate reward for a given state and action. A policy π is a mapping from states to actions. Each policy π induces a Markov process over some set of states. In ergodic MDPs, every policy π forms a single closed set of states, and the average reward per time step of π in the limit of infinite horizon is independent of the starting state. We call it the “gain” of the policy π, denoted by ρ(π), and consider the problem of finding a “gain-optimal policy,” π∗, that maximizes ρ(π).

Even though the gain ρ(π) of a policy π is independent of the starting state s, the total expected reward in time t is not. It can be denoted by ρ(π)t + h(s), where h(s) is a state-dependent bias term. It is the bias values of states that determine which states and actions are preferred, and they need to be learned for optimal performance. The following theorem gives the Bellman equation for the bias values of states.

Theorem For ergodic MDPs, there exist a scalar ρ and a real-valued bias function h over S that satisfy the recurrence relation

$$\forall s \in S, \quad h(s) = \max_{a \in A} \Big\{ r(s, a) + \sum_{s' \in S} P(s' \mid s, a)h(s') \Big\} - \rho. \qquad (1)$$

Further, the gain-optimal policy $\mu^*$ attains the above maximum for each state $s$, and $\rho$ is its gain.

Note that any one solution to (1) yields an infinite number of solutions by adding the same constant to all h-values. However, all these sets of h-values will result in the same set of optimal policies µ∗, since the optimal action in a state is determined only by the relative differences between the values of h.
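The invariance to a constant shift can be checked directly. As a short worked step (the symbols $c$ and $h'$ are introduced here for illustration), let $h'(s) = h(s) + c$ for an arbitrary constant $c$. Since $\sum_{s' \in S} P(s' \mid s, a) = 1$,

```latex
\begin{aligned}
\max_{a \in A} \Big\{ r(s,a) + \sum_{s' \in S} P(s' \mid s,a)\, h'(s') \Big\} - \rho
  &= \max_{a \in A} \Big\{ r(s,a) + \sum_{s' \in S} P(s' \mid s,a)\, h(s') + c \Big\} - \rho \\
  &= \max_{a \in A} \Big\{ r(s,a) + \sum_{s' \in S} P(s' \mid s,a)\, h(s') \Big\} - \rho + c \\
  &= h(s) + c = h'(s).
\end{aligned}
```

So $h'$ satisfies the same recurrence with the same $\rho$, and because $c$ is added to every action's value inside the maximum, the maximizing actions are unchanged.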

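[Figure: diagram of a four-state MDP with states labeled h(0)=0, h(1)=0, h(2)=1, and h(3)=2, and the actions good-move and bad-move; see the figure caption below.]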

For example, in Fig. , the agent has to select between the actions good-move and bad-move in state . If it stays in state , it gets an average reward of . If it stays in state , it gets an average reward of −. For this domain, ρ = for the optimal policy of choosing good-move in state . If we arbitrarily set h(0) to 0, then h(1) = 0, h(2) = 1, and h(3) = 2 satisfy the recurrence relations in (1). For example, the difference between h() and h() is , which equals the difference between the immediate reward for the optimal action in state and the optimal average reward .

Given the probability model P and the immediate rewards r, the above equations can be solved by White’s relative value iteration method by setting the h-value of an arbitrarily chosen reference state to 0 and using synchronous successive approximation (Bertsekas, ). There is also a policy iteration approach that determines the optimal policy by starting with some arbitrary policy, solving for its values using value iteration, and updating the policy using one-step look-ahead search. This iteration is repeated until the policy converges (Puterman, ).
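Relative value iteration as described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `relative_value_iteration`, the data layout (`P[s][a]` as a dictionary of next-state probabilities, `r[s][a]` as a reward), and the stopping tolerance are choices made here, and the two-state MDP in the usage note is a made-up example, not the MDP from the figure.

```python
def relative_value_iteration(P, r, ref=0, tol=1e-10, max_iter=10000):
    """Synchronous relative value iteration for a known, tabular MDP.

    P[s][a] is a dict {next_state: probability}; r[s][a] is the immediate reward.
    The bias of the reference state `ref` is pinned at 0.
    Returns (rho, h): the gain estimate and the list of bias values.
    """
    n = len(P)
    h = [0.0] * n
    rho = 0.0
    for _ in range(max_iter):
        # Synchronous Bellman backup: every state uses the previous h.
        backup = [
            max(r[s][a] + sum(p * h[s2] for s2, p in P[s][a].items())
                for a in range(len(P[s])))
            for s in range(n)
        ]
        rho = backup[ref]                    # subtracting this keeps h[ref] at 0
        new_h = [b - rho for b in backup]
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return rho, new_h
        h = new_h
    return rho, h


# Hypothetical two-state ergodic MDP: action 0 mostly stays, action 1 mostly switches.
P = [
    [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}],
    [{1: 0.9, 0: 0.1}, {0: 0.9, 1: 0.1}],
]
r = [[1.0, 0.0], [2.0, 0.0]]
rho, h = relative_value_iteration(P, r)
```

For this made-up MDP the gain-optimal policy switches out of state 0 and stays in state 1, giving a gain of 0.1 · 0 + 0.9 · 2 = 1.8 and bias values h(0) = 0, h(1) = 2.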

Model-Based Learning If the probabilities and the immediate rewards are not known, the system needs to learn them before applying the above methods. A model-based approach called H-learning interleaves model learning with Bellman backups of the value function (Tadepalli & Ok, ). This is an average-reward version of ▶adaptive real-time dynamic programming (Barto, Bradtke, & Singh, ). The models are learned by collecting samples of state-action-next-state triples ⟨s, a, s′⟩ and computing P(s′∣s, a) using maximum likelihood estimation. H-learning then employs the “certainty equivalence principle,” using the current estimates as if they were the true values while updating the h-value of the current state s according to the following update equation derived from the Bellman equation.

$$h(s) \leftarrow \max_{a \in A} \Big\{ r(s, a) + \sum_{s' \in S} P(s' \mid s, a)h(s') \Big\} - \rho. \qquad (2)$$

Average-Reward Reinforcement Learning. Figure . A simple Markov decision process (MDP) that illustrates the Bellman equation

One complication in ARL is the estimation of the average reward ρ in the update equations during learning. One could use the current estimate of the long-term average reward, but it is distorted


by the exploratory actions that the agent needs to take to learn about the unexplored parts of the state space. Without the exploratory actions, ARL methods may converge to a suboptimal policy. To take this into account, we have from (1), in any state s and for a nonexploratory action a that maximizes the right-hand side, $\rho = r(s, a) - h(s) + \sum_{s' \in S} P(s' \mid s, a)h(s')$. Hence, ρ is estimated by cumulatively averaging r − h(s) + h(s′) whenever a greedy action a is executed in state s, resulting in state s′ and immediate reward r. ρ is updated using the following equation, where α is the learning rate:

$$\rho \leftarrow (1 - \alpha)\rho + \alpha(r - h(s) + h(s')). \qquad (3)$$

One issue with model-based learning is that the models require too much space and time to learn as tables. In many cases, actions can be represented much more compactly. For example, Tadepalli and Ok () use dynamic Bayesian networks to represent and learn action models, resulting in significant savings in space and time for learning the models.
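The pieces of H-learning described above (maximum-likelihood model estimates, a Bellman backup of the current state, and an average-reward estimate that is only adjusted on greedy steps) can be put together in a small tabular sketch. The class name `HLearner` and its data layout are hypothetical, and the running average of ρ is written in the convex-combination form (1 − α)ρ + α(r − h(s) + h(s′)), which is an assumption of this sketch.

```python
from collections import defaultdict


class HLearner:
    """Tabular H-learning-style update (illustrative sketch only)."""

    def __init__(self, n_states, n_actions, alpha=0.1):
        self.nS, self.nA, self.alpha = n_states, n_actions, alpha
        self.h = [0.0] * n_states
        self.rho = 0.0
        self.count = defaultdict(int)        # (s, a) -> number of visits
        self.next_count = defaultdict(int)   # (s, a, s2) -> number of transitions
        self.r_hat = defaultdict(float)      # (s, a) -> mean observed reward

    def backup(self, s, a):
        """Estimated r(s, a) + sum_s' P(s' | s, a) h(s'); untried pairs score 0."""
        n = self.count[s, a]
        if n == 0:
            return 0.0
        exp_h = sum(self.next_count[s, a, s2] / n * self.h[s2]
                    for s2 in range(self.nS))
        return self.r_hat[s, a] + exp_h

    def greedy_action(self, s):
        return max(range(self.nA), key=lambda a: self.backup(s, a))

    def update(self, s, a, r, s2, greedy):
        # Maximum-likelihood model update from the observed transition.
        self.count[s, a] += 1
        self.next_count[s, a, s2] += 1
        self.r_hat[s, a] += (r - self.r_hat[s, a]) / self.count[s, a]
        # Only greedy (nonexploratory) steps adjust the average-reward estimate.
        if greedy:
            self.rho = ((1 - self.alpha) * self.rho
                        + self.alpha * (r - self.h[s] + self.h[s2]))
        # Certainty-equivalent Bellman backup of the current state.
        self.h[s] = max(self.backup(s, b) for b in range(self.nA)) - self.rho
```

The caller decides whether a step was greedy, for example by comparing the executed action with `greedy_action(s)` before calling `update`; the exploration strategy itself is left out of the sketch.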

Model-Free Learning One of the disadvantages of the model-based methods is the need to explicitly represent and learn action models. This is completely avoided in model-free methods such as ▶Q-learning by learning value functions over state–action pairs. Schwartz’s R-learning is an adaptation of Q-learning, which is a discounted reinforcement learning method, to optimize average reward (Schwartz, ). The state–action value R(s, a) can be defined as the expected long-term advantage of executing action a in state s and from then on following the optimal average-reward policy. It can be defined using the bias values h and the optimal average reward ρ as follows.

$$R(s, a) = r(s, a) + \sum_{s' \in S} P(s' \mid s, a)h(s') - \rho. \qquad (4)$$

The main difference with Q-values is that instead of discounting the expected total reward from the next state, we subtract the average reward ρ in each time step, which is the constant penalty for using up a time step. The h-value of any state s′ can now be defined using the following equation.

$$h(s') = \max_{u} R(s', u). \qquad (5)$$

Initially all the R-values are set to 0. When action a is executed in state s, the value of R(s, a) is updated using the update equation

$$R(s, a) \leftarrow (1 - \beta)R(s, a) + \beta(r + h(s') - \rho), \qquad (6)$$

where β is the learning rate, r is the immediate reward received, s′ is the next state, and ρ is the estimate of the average reward of the current greedy policy. In any state s, the greedy action a maximizes the value R(s, a), so R-learning does not need to explicitly learn the immediate reward function r(s, a) or the action models P(s′∣s, a), since it does not use them either for action selection or for updating the R-values.

Both model-free and model-based ARL methods have been evaluated in several experimental domains (Mahadevan, ; Tadepalli & Ok, ). When the models have a compact representation and can be learned quickly, the model-based method seems to perform better. It also has the advantage of fewer tunable parameters. However, model-free methods are more convenient to implement, especially if the models are hard to learn or represent.
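A tabular version of the R-learning update can be sketched as below. Several details are assumptions of this sketch rather than statements from the entry: the function name `r_learning`, the ε-greedy exploration scheme, the environment interface `step(s, a) -> (reward, next_state)`, and the use of a running average of r − h(s) + h(s′) on greedy steps to estimate ρ.

```python
import random


def r_learning(step, n_states, n_actions, n_steps,
               beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning sketch with epsilon-greedy exploration.

    `step(s, a)` is a caller-supplied environment returning (reward, next_state).
    Returns the table R[s][a] and the average-reward estimate rho.
    """
    rng = random.Random(seed)
    R = [[0.0] * n_actions for _ in range(n_states)]  # all R-values start at 0
    rho = 0.0
    s = 0
    for _ in range(n_steps):
        greedy_a = max(range(n_actions), key=lambda a: R[s][a])
        explore = rng.random() < epsilon
        a = rng.randrange(n_actions) if explore else greedy_a
        r, s2 = step(s, a)
        h_s, h_s2 = max(R[s]), max(R[s2])             # h(s) = max_u R(s, u)
        R[s][a] = (1 - beta) * R[s][a] + beta * (r + h_s2 - rho)
        if not explore:
            # rho is only adjusted after nonexploratory (greedy) actions.
            rho = (1 - alpha) * rho + alpha * (r - h_s + h_s2)
        s = s2
    return R, rho


def two_state_env(s, a):
    """Hypothetical environment: action 0 stays, action 1 switches state.
    Staying pays 1 in state 0 and 2 in state 1; switching pays 0."""
    if a == 0:
        return (1.0 if s == 0 else 2.0), s
    return 0.0, 1 - s
```

With `epsilon=0.0` on `two_state_env`, the learner never leaves state 0 and its estimate of ρ settles near 1, even though an average reward of 2 is attainable in state 1. This illustrates why some exploration is needed, and why the estimate of ρ has to be protected from the exploratory steps themselves.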

Scaling Average-Reward Reinforcement Learning Just as for discounted reinforcement learning, scaling issues are paramount for ARL. Since the number of states is exponential in the number of relevant state variables, a table-based approach does not scale well. The problem is compounded in multi-agent domains where the number of joint actions is exponential in the number of agents. Several function approximation approaches, such as linear functions, multi-layer perceptrons (Marbach, Mihatsch, & Tsitsiklis, ), local ▶linear regression (Tadepalli & Ok, ), and tile coding (Proper & Tadepalli, ), were tried with varying degrees of success. ▶Hierarchical reinforcement learning based on the MAXQ framework was also explored in the average-reward setting and was shown to lead to significantly faster convergence. In the MAXQ framework, we have a directed acyclic graph, where each node represents a task and stores the value function for that task. Usually, the value function for subtasks depends on fewer state variables than the overall value function and hence can


be more compactly represented. The relevant variables for each subtask are fixed by the designer of the hierarchy, which makes it much easier to learn the value functions. One potential problem with the hierarchical approach is the loss due to the hierarchical constraint on the policy. Despite this limitation, both model-based (Seri & Tadepalli, ) and model-free approaches (Ghavamzadeh & Mahadevan, ) were shown to yield optimal policies in some domains that satisfy the assumptions of these methods.

Applications A temporal difference method for average reward based on TD(0) was used to solve a call admission control and routing problem (Marbach et al., ). On a modestly sized network of nodes, it was shown that the average-reward TD(0) outperforms the discounted version because the discounted version required more careful tuning of its parameters. Similar results were obtained in other domains such as automatic guided vehicle routing (Ghavamzadeh & Mahadevan, ) and transfer line optimization (Wang & Mahadevan, ).

Convergence Analysis Unlike their discounted counterparts, both R-learning and H-learning lack convergence guarantees. Without discounting, the updates can no longer be thought of as contraction mappings, and hence the standard theory of stochastic approximation does not apply. The simultaneous update of the average reward ρ and the value functions makes the analysis of these algorithms much more complicated. However, some ARL algorithms have been proved convergent in the limit using analysis based on ordinary differential equations (ODE) (Abounadi, Bertsekas, & Borkar, ). The main idea is to turn to ordinary differential equations that are closely tracked by the update equations and use two-time-scale analysis to show convergence. In addition to the standard assumptions of stochastic approximation theory, the two-time-scale analysis requires that ρ is updated at a much slower time scale than the value function. The previous convergence results are based on the limit of infinite exploration. One of the many challenges in reinforcement learning is that of efficient exploration


of the MDP to learn the dynamics and the rewards. There are model-based algorithms that guarantee learning an approximately optimal average-reward policy in time polynomial in the numbers of states and actions of the MDP and its mixing time. These algorithms work by alternating between learning the action models of the MDP by taking actions in the environment, and solving the learned MDP using offline value iteration. In the “Explicit Explore and Exploit” or E algorithm, the agent explicitly decides between exploiting the known part of the MDP and optimally trying to reach the unknown part of the MDP (exploration) (Kearns & Singh, ). During exploration, it uses the idea of “balanced wandering,” where the least executed action in the current state is preferred until all actions are executed a certain number of times. In contrast, the R-Max algorithm implicitly chooses between exploration and exploitation by using the principle of “optimism under uncertainty” (Brafman & Tennenholtz, ). The idea here is to initialize the model parameters optimistically so that all unexplored actions in all states are assumed to reach a fictitious state that yields maximum possible reward from then on regardless of which action is taken. The optimistic initialization of the model parameters automatically encourages the agent to execute unexplored actions, until the true models and values of more states and actions are gradually revealed to the agent. It has been shown that with a probability at least − δ, both E and R-MAX learn approximately correct models whose optimal policies have an average reward є-close to the true optimal in time polynomial in the numbers of states and actions, the mixing time of the MDP, є , and δ . Unfortunately the convergence results do not apply when there is function approximation involved. 
In the presence of linear function approximation, the averagereward version of temporal difference learning, which learns a state-based value function for a fixed policy, is shown to converge in the limit (Tsitsiklis & Van Roy, ). The transient behavior of this algorithm is similar to that of the corresponding discounted TD-learning with an appropriately scaled constant basis function (Van Roy & Tsitsiklis, ). As in the discounted case, development of provably convergent optimal policy learning algorithms with function approximation is a challenging open problem.
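The “balanced wandering” rule mentioned above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name, the `visit_counts` table, and the `known_threshold` parameter are hypothetical and are not taken from Kearns and Singh's presentation, which also specifies how the threshold is chosen.

```python
from collections import defaultdict

def balanced_wandering_action(state, actions, visit_counts, known_threshold):
    """Pick the least-executed action in `state`.

    Returns None once every action has been executed at least
    `known_threshold` times (a hypothetical parameter), i.e., once the
    state can be treated as known.
    """
    counts = {a: visit_counts[(state, a)] for a in actions}
    if all(c >= known_threshold for c in counts.values()):
        return None
    # min() returns the first least-executed action in list order
    return min(actions, key=lambda a: counts[a])

# Toy usage: one state, two actions, each to be tried twice.
visit_counts = defaultdict(int)
actions = ["left", "right"]
trace = []
while True:
    action = balanced_wandering_action("s0", actions, visit_counts, known_threshold=2)
    if action is None:
        break
    visit_counts[("s0", action)] += 1
    trace.append(action)
```

In this toy run the agent alternates between the two actions until both have been executed twice, after which the state no longer triggers exploration.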


Cross References 7Efficient Exploration in Reinforcement Learning 7Hierarchical Reinforcement Learning 7Model-Based Reinforcement Learning

Recommended Reading
Abounadi, J., Bertsekas, D. P., & Borkar, V. (). Stochastic approximation for non-expansive maps: Application to Q-learning algorithms. SIAM Journal of Control and Optimization, (), –.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (). Learning to act using real-time dynamic programming. Artificial Intelligence, (), –.
Bertsekas, D. P. (). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
Brafman, R. I., & Tennenholtz, M. (). R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, , –.
Crites, R. H., & Barto, A. G. (). Elevator group control using multiple reinforcement agents. Machine Learning, (/), –.
Ghavamzadeh, M., & Mahadevan, S. (). Hierarchical average reward reinforcement learning. Journal of Machine Learning Research, (), –.
Kearns, M., & Singh, S. (). Near-optimal reinforcement learning in polynomial time. Machine Learning, (/), –.
Mahadevan, S. (). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, (//), –.
Marbach, P., Mihatsch, O., & Tsitsiklis, J. N. (). Call admission control and routing in integrated service networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications, (), –.
Proper, S., & Tadepalli, P. (). Scaling model-based average-reward reinforcement learning for product delivery. In European conference on machine learning (pp. –). Springer.
Puterman, M. L. (). Markov decision processes: Discrete dynamic stochastic programming. New York: Wiley.
Schwartz, A. (). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.
Seri, S., & Tadepalli, P. (). Model-based hierarchical average-reward reinforcement learning. In Proceedings of international machine learning conference (pp. –). Sydney, Australia: Morgan Kaufmann.
Sutton, R., & Barto, A. (). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tadepalli, P., & Ok, D. (). Model-based average-reward reinforcement learning. Artificial Intelligence, , –.
Tesauro, G. (). Practical issues in temporal difference learning. Machine Learning, (–), –.
Tsitsiklis, J., & Van Roy, B. (). Average cost temporal-difference learning. Automatica, (), –.
Van Roy, B., & Tsitsiklis, J. (). On average versus discounted temporal-difference learning. Machine Learning, (/), –.
Wang, G., & Mahadevan, S. (). Hierarchical optimization of policy-coupled semi-Markov decision processes. In Proceedings of the th international conference on machine learning (pp. –). Bled, Slovenia.

B Backprop 7Backpropagation

Backpropagation Paul Munro University of Pittsburgh, Pittsburgh, PA, USA

Synonyms Backprop; BP; Generalized delta rule

Definition Backpropagation of error (henceforth BP) is a method for training feed-forward neural networks (see 7Artificial Neural Networks). A specific implementation of BP is an iterative procedure that adjusts network weight parameters according to the gradient of an error measure. The procedure is implemented by computing an error value for each output unit, and by backpropagating the error values through the network.

Characteristics

Feed-Forward Networks

A feed-forward neural network is a mathematical function that is composed of constituent “semi-linear” functions constrained by a feed-forward network architecture, wherein the constituent functions correspond to nodes (often called units or artificial neurons) in a graph, as in Fig. 1. A feed-forward network architecture has a connectivity structure that is an acyclic graph; that is, there are no closed loops. In most cases, the unit functions have a finite range such as [0, 1]. Thus, the network maps $\mathbb{R}^N$ to $[0, 1]^M$, where N is the number of input values and M is the number of output units. Let FanIn(k) refer to the set of units that provide input to unit k, and let FanOut(k) denote the set of units that receive input from unit k. In an acyclic graph, at least one unit has a FanIn that is the null set. These are the input units; the activity of an input unit is not computed; rather it is set to a value external to the network (i.e., from the training data). Similarly, at least one unit has a null FanOut set. Such units typically represent the output of the network; i.e., this set of values is the result of the network computation. Intermediate units (often called hidden units) receive input from other units and project outputs to other computational units. For the BP procedure, the activity of each unit is computed in two steps:

Linear step: the activities of the units in FanIn(k) are each multiplied by an independent “weight” parameter, and a “bias” parameter is added to the sum of these products; each computational unit has a single bias parameter, independent of the other units. Let this sum be denoted $x_k$ for unit k.

Nonlinear step: the activity $a_k$ of unit k is a differentiable nonlinear function of $x_k$. A favorite function is the logistic $a = 1/(1 + \exp(-x))$, because it maps the range [−∞, +∞] to [0, 1] and its derivative has properties conducive to the implementation of BP.

$$a_k = f_k(x_k), \quad \text{where} \quad x_k = b_k + \sum_{j \in \mathrm{FanIn}(k)} w_{kj}\, a_j$$

[Figure 1 panels: “Standard 3 layer classification net” (left, with input units, hidden units, and output units labeled) and “General feedforward net structure” (right, with FanIn(k) and FanOut(k) marked for a unit k).]

Backpropagation. Figure 1. Two networks are shown. Input units are shown as simple squares at the bottom of each figure. Other units are computational (designated by a horizontal line). Left: A standard 3-layer network. Four input units project to five hidden units, which in turn project to a single output unit. Not all connections are shown. Such a network is commonly used for classification tasks. Right: An example of a feed-forward network with four inputs, three hidden units, and two outputs

Gradient Descent

Derivation of BP is a direct application of the gradient descent approach to optimization and depends on a definition of network error, a function of the actual network response to a stimulus, r(s), and the target, T(s). The two most common error functions are the summed squared error (SSE) and the cross entropy error (CE). (CE error as defined here is based on the presumption that the output values are in the range [0, 1], and likewise for the target values; this is often used for classification tasks, wherein target values are set to the endpoints of the range, 0 and 1.)

$$E_{SSE} \equiv \sum_{s \in \mathrm{Train}} \sum_{i \in \mathrm{Output}} \left(T_i(s) - r_i(s)\right)^2$$

$$E_{CE} \equiv -\sum_{s \in \mathrm{Train}} \sum_{i \in \mathrm{Output}} \left[T_i(s) \ln r_i(s) + \left(1 - T_i(s)\right) \ln\left(1 - r_i(s)\right)\right]$$

Each weight parameter, $w_{ij}$ (the weight of the connection from j to i), is updated by an amount proportional to the negative gradient of the error measure with respect to that parameter:

$$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}},$$

where the step size, η, modulates the intrinsic tradeoff between smooth convergence of the weights and the speed of convergence; in the regime where η is small, the system is well-behaved and converges smoothly, but slowly, and for larger η, the system may learn some subsets of the training set faster at the expense of smooth convergence on all patterns in the set. Thus, η is also called the learning rate.
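As a hedged illustration of this update rule, the following Python sketch computes the gradient of the summed squared error for a tiny network (two inputs, two logistic hidden units, one logistic output) by the chain rule, checks one component against a finite-difference estimate, and applies a single gradient-descent step. The network size, initial weights, training pairs, and the value η = 0.1 are arbitrary choices made for the example, and the factor of 2 from differentiating the square is kept explicit here rather than absorbed into η.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, s):
    """Return the hidden activities and the output response for stimulus s."""
    W1, b1, W2, b2 = params
    hidden = [logistic(b1[k] + sum(W1[k][j] * s[j] for j in range(len(s))))
              for k in range(len(W1))]
    out = logistic(b2 + sum(W2[k] * hidden[k] for k in range(len(hidden))))
    return hidden, out

def sse(params, data):
    """Summed squared error over all stimulus-target pairs."""
    return sum((T - forward(params, s)[1]) ** 2 for s, T in data)

def gradient(params, data):
    """Gradient of the SSE with respect to every weight and bias (chain rule)."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(W1[0]) for _ in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for s, T in data:
        hidden, r = forward(params, s)
        # dE/dx at the output unit; the logistic derivative is r * (1 - r)
        d_out = -2.0 * (T - r) * r * (1.0 - r)
        gb2 += d_out
        for k, a_k in enumerate(hidden):
            gW2[k] += d_out * a_k
            # propagate the output term back through weight W2[k]
            d_hid = d_out * W2[k] * a_k * (1.0 - a_k)
            gb1[k] += d_hid
            for j, s_j in enumerate(s):
                gW1[k][j] += d_hid * s_j
    return gW1, gb1, gW2, gb2

def step(params, grads, eta):
    """One gradient-descent step: parameter <- parameter - eta * gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return ([[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, gW1)],
            [b - eta * g for b, g in zip(b1, gb1)],
            [w - eta * g for w, g in zip(W2, gW2)],
            b2 - eta * gb2)

# Arbitrary toy data (XOR targets) and arbitrary initial parameters.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)

grads = gradient(params, data)

# Finite-difference estimate of dE/dW1[0][0] as a sanity check.
eps = 1e-6
W1, b1, W2, b2 = params
bumped = ([[W1[0][0] + eps, W1[0][1]], W1[1]], b1, W2, b2)
numeric = (sse(bumped, data) - sse(params, data)) / eps

error_before = sse(params, data)
error_after = sse(step(params, grads, eta=0.1), data)
```

With a small step size the error after the update is lower than before; a single step says nothing about whether repeated steps reach a good minimum on this data.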

Implementation

Several aspects of the feed-forward network must be defined prior to running a BP program, such as the configuration of the hidden units, the initial values of the weights, the functions they will compute, and the numerical representation of the input and target data. There are also parameters of the learning algorithm that must be chosen, such as the value of η and the form of the error function. The weight and bias parameters are set to their initial values (these are usually random within specified limits). BP is implemented as an iterative process as follows:

1. A stimulus-target pair is drawn from the training set.
2. The activity values are computed for all the units in the network in a forward fashion from input to output (Fig. 2a).
3. The network output values are compared to the target and a delta (δ) value is computed for each output unit based on the difference between the target and the actual output response value.
4. The deltas are propagated backward through the network using the same weights that were used to compute the activity values (Fig. 2b).
5. Each weight is updated by an amount proportional to the product of the downstream delta value and the upstream activity (Fig. 2c).

[Figure 2 panels: (a) “Activity propagates forward”: inputs $a_j$ to unit k, $x_k = b_k + \sum_{j \in \mathrm{FanIn}(k)} w_{kj} a_j$, $a_k = f_k(x_k)$. (b) “Error propagates backward”: errors from FanOut(k), $e_k = \sum_{i \in \mathrm{FanOut}(k)} w_{ik} \delta_i$, $\delta_k = f_k'(x_k)\, e_k$. (c) “Weights are updated”: $\Delta w_{ij} = \eta\, \delta_i a_j$, $\Delta b_i = \eta\, \delta_i$.]

Backpropagation. Figure 2. With each iteration of the backprop algorithm, (a) An activity value is computed for every unit in the network from the input to the output. (b) The network output is compared with the target. The error $e_k$ for output unit k is defined as $(T_k - r_k)$. A value $\delta_k$ is computed for each output unit by multiplying $e_k$ by the derivative of the activity function. For hidden units, the error is propagated backward using the weights. (c) The weight parameters $w_{ij}$ are updated in proportion to the product of $\delta_i$ and $a_j$

The procedure can be run either in an online mode or batch mode. In the online mode, the network parameters are updated for each stimulus-target pair. In the batch mode, the weight changes are computed and accumulated over several iterations without updating the weights until a large number (B) of stimulus-target pairs have been processed (often, the entire training set), at which point the weights are updated by the accumulated amounts.

Online:

$$\Delta w_{ij}(t) = \eta\, \delta_i(t)\, a_j(t), \qquad \Delta b_i(t) = \eta\, \delta_i(t)$$

Batch:

$$\Delta w_{ij}(t + B) = \sum_{s=t+1}^{t+B} \eta\, \delta_i(s)\, a_j(s), \qquad \Delta b_i(t + B) = \sum_{s=t+1}^{t+B} \eta\, \delta_i(s)$$

Classification Tasks with BP

The simplest and most common classification function returns a binary value, indicating membership in a particular class. The most common network architecture for a task of this kind is the three-layer network of Fig. 1 (left), with training values of 0 and 1. For classification tasks, the cross entropy error function generally gives significantly faster convergence. After training, the network is in test mode or production mode, and the responses are in the continuous range [0, 1]; the response must thus be interpreted. The value of the response could be interpreted as a probability or fuzzy Boolean value. Often, however, a single threshold is applied to give a binary answer. A double threshold is sometimes used, with the midrange defined as “uncertain.”

Curve Fitting with BP

A feed-forward network can be trained to approximate any function, given sufficient hidden units. The output unit(s) must be capable of generating activity values in the required range. In order to accommodate an arbitrary range uniformly, a linear function is advisable for the output units, and the SSE function is the basis for gradient descent.

The Autoencoder Architecture

The autoencoder is a network design in which the target pattern is identical to the input pattern. The hidden units are configured such that there is a “bottleneck layer” of units that is smaller than the input layer, through which information flows; i.e., there are no connections bypassing the bottleneck. Thus, any information necessary to reconstruct the input pattern at the output layer must be represented at the bottleneck. This design has been successfully applied to nonlinear dimensionality reduction (e.g., Demers & Cottrell, ). It bears notable similarities and differences to linear techniques, such as 7principal components analysis (PCA).

Prediction with BP

The plain “vanilla” BP propagates input to output with no explicit representation of time. Several approaches to processing of temporal patterns have been put forward. Most prominent among these are:

Time delay neural network. In this approach, the input stimulus is simply a sample of a time varying signal. The input patterns are typically generated by a sliding window of samples over time or over a sequence.

7Simple recurrent network (Elman, ). A sequence of stimulus patterns is presented as input for the network, which has a single hidden layer design. With each iteration, the input is augmented by a secondary set of input units whose activity is a copy of the hidden layer activity from the previous iteration. Thus, the network is able to maintain a representation of the recent history of network stimuli.

Backpropagation through time (Rumelhart, Hinton, & Williams, ). A recurrent network (i.e., a cyclic network) is “unfolded in time” by forming a large multilayer network, in which each layer is a copy of the entire network shifted in time. Thus, the number of layers limits the temporal window available to the network.

Recurrent backpropagation (Pineda, ). A cyclic network is run with activity propagation and error propagation, until variables converge. Then the weights are updated.
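The sliding-window construction used by the time delay approach can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function name and the toy signal are hypothetical, and pairing each window with the next sample as the target is one natural choice for one-step-ahead prediction rather than something the text above prescribes.

```python
def sliding_windows(signal, width):
    """Return (window, next_sample) pairs: each input pattern is `width`
    consecutive samples, and the target is the sample that follows."""
    return [(signal[t:t + width], signal[t + width])
            for t in range(len(signal) - width)]

# Hypothetical toy signal, sampled at regular intervals.
signal = [0.0, 0.1, 0.4, 0.9, 1.6, 2.5]
pairs = sliding_windows(signal, width=3)
```

Each pair can then be presented to an ordinary feed-forward network as a stimulus-target pair, so no change to the BP procedure itself is needed.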

Cognitive Modeling with BP

Interest in BP as a training technique for classifiers has waned somewhat since the introduction of 7Support vector machines (SVMs) in the mid-1990s. However, the influence of BP as an approach to modeling cognitive processes, including perception, concept learning, spatial cognition, and language learning, remains strong. Analysis of hidden unit representations (e.g., using clustering techniques) has given insight into plausible intermediate processes that may underlie cognitive phenomena. Also, many cognitive models trained with BP have exhibited time courses consistent with stages of human learning.

Biological Inspiration and Plausibility

The “connectionist” approach to modeling cognition is based on “neural network” models, which have been touted as “biologically inspired” since their inception. The similarities and differences between connectionist architectures and living brains have been exhaustively debated. Like the brain, the models consist of elements that are computationally extremely limited. Computational power is derived from combining many such units in a network architecture. However, there are compelling differences as well. For example, the temporal dynamics of biological neurons are far more complex than the simple functions used in connectionist networks. It remains unclear what level of neurobiological detail is relevant to understanding cognitive functions.

Shortcomings of BP

The BP method is notorious for convergence problems. An inherent problem of gradient descent approaches to optimization is the issue of locally optimal values. Seeking a minimum value by heading downhill is like water running downhill. Not all water reaches the lowest point (sea level). Water that flows into a mountain lake has landed in a local minimum, a region that is bounded by higher ground. Even when BP converges to a global minimum (or a local minimum that is “good enough”), it is sometimes very slow. The convergence properties of BP depend on the learning rate and random factors, such as the initial weight and bias values. Another difficulty with BP is the selection of a network structure. The number of hidden units and the interconnectivity among them have a strong influence on both the generalization performance and the convergence time. Since the nature of this influence is poorly understood, the design of the network is left to guesswork. The standard approach is to use a single hidden layer (as in Fig. 1, left), which has the advantage of relatively fast convergence.

History

The idea of training a multilayered network using error propagation was originated by Frank Rosenblatt (, ). However, he was unable to apply gradient descent because the linear threshold functions he was using were not differentiable. He developed a technique known as the perceptron learning rule that is only applicable to two-layer networks (no hidden units). Without hidden units, the computational power of the network is severely reduced. Work in the field virtually stopped with the publication of Perceptrons (Minsky & Papert, ). The backpropagation procedure was first published by Werbos (), but did not receive significant recognition until it was put forward by Rumelhart et al. ().

Cross References
7Artificial Neural Networks

Recommended Reading
Demers, D., & Cottrell, G. (). Non-linear dimensionality reduction. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems (Vol. ). San Mateo, CA: Morgan Kaufmann.
Elman, J. (). Finding structure in time. Cognitive Science, , –.
Minsky, M. L., & Papert, S. A. (). Perceptrons. Cambridge, MA: MIT Press.
Pineda, F. J. (). Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Computation, , –.
Rosenblatt, F. (). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, , –.
Rosenblatt, F. (). Principles of statistical neurodynamics. Washington, DC: Spartan.
Werbos, P. (). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge.

Bagging

Bagging is an 7ensemble learning technique. The name “Bagging” is an acronym derived from Bootstrap AGGregatING. Each member of the ensemble is constructed from a different training dataset. Each dataset is a 7bootstrap sample from the original. The models are combined by a uniform average or vote. Bagging works best with 7unstable learners, that is, those that produce differing generalization patterns with small changes to the training data. Bagging therefore tends not to work well with linear models. See 7ensemble learning for more details.

Bake-Off

Definition
Bake-off is a disparaging term for experimental evaluation of multiple learning algorithms by a process of applying each algorithm to a limited set of benchmark problems.

Cross References
7Algorithm Evaluation

Bandit Problem with Side Information 7Associative Reinforcement Learning

Bandit Problem with Side Observations 7Associative Reinforcement Learning

Basic Lemma 7Symmetrization Lemma


Basket Analysis Hannu Toivonen University of Helsinki, Helsinki, Finland

Synonyms Market basket analysis

Definition The goal of basket analysis is to utilize large volumes of electronic receipts, stored at the checkout terminals of supermarkets, for better understanding of customer behavior. While many forms of learning and mining can be applied to market baskets, the term usually refers to some variant of 7association rule mining. In the basic setting, each market basket constitutes an example essentially defined by the set of purchased products. Association rules then identify sets of items that tend to be bought together. A classical, anecdotal discovery from supermarket data is that “if a basket contains diapers then it often also contains beer.” This example illustrates several potential benefits of market basket analysis by association rules: simplicity and understandability of the results, actionability of the results, and a form of nonsupervised approach where the consequent of the rule has not been fixed by the user. Association rules are often found with the 7Apriori algorithm, and are based on 7frequent itemsets.
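The idea that items “tend to be bought together” can be made concrete with a small Python sketch. The baskets below are invented for illustration, and the two measures computed, support and confidence, are the usual association-rule measures; they are not defined in this entry and are introduced here as assumptions of the example.

```python
# Hypothetical market baskets: each basket is the set of purchased products.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

Here two of the five baskets contain both diapers and beer, and two of the three diaper baskets also contain beer, so the rule “diapers, then beer” has support 2/5 and confidence 2/3 on this toy data.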

Cross References 7Apriori Algorithm 7Association Rule 7Frequent Itemset 7Frequent Pattern

Baum–Welch Algorithm

The Baum–Welch algorithm is used for computing maximum likelihood estimates and posterior mode estimates for the parameters (transition and emission probabilities) of a hidden Markov model (HMM), when given only output sequences (emissions) as training data. The Baum–Welch algorithm is a particular instantiation of the expectation-maximization algorithm, suited for HMMs.

Bayes Adaptive Markov Decision Processes 7Bayesian Reinforcement Learning

Bayes Net 7Bayesian Network

Batch Learning

Synonyms
Offline Learning

Definition
A batch learning algorithm accepts a single input that is a set or sequence of observations. The algorithm produces its 7model, and does no further learning. Batch learning stands in contrast to 7online learning.

Bayes Rule
Geoffrey I. Webb
Monash University

Definition
Bayes rule provides a decomposition of a conditional probability that is frequently used in a family of learning techniques collectively called Bayesian Learning. Bayes rule is the equality

$$P(w \mid z) = \frac{P(w)\, P(z \mid w)}{P(z)} \qquad (1)$$

P(w) is called the prior probability, P(w ∣ z) is called the posterior probability, and P(z ∣ w) is called the likelihood.

Discussion
Bayes rule is used for two purposes. The first is Bayesian update. In this context, z represents some new information that has become available since an estimate P(w) was formed of some hypothesis w. The application of Bayes’ rule enables a new estimate of the probability of w (the posterior probability) to be calculated from estimates of the prior probability, the likelihood and P(z). The second common application of Bayes’ rule is for estimating posterior probabilities in probabilistic learning, where it is the core of 7Bayesian networks, 7naïve Bayes, and 7semi-naïve Bayesian techniques. While Bayes’ rule may initially appear mysterious, it is readily derived from the basic principle of conditional probability that

$$P(w \mid z) = \frac{P(w, z)}{P(z)}. \qquad (2)$$

As

$$P(w, z) = \frac{P(w)\, P(w, z)}{P(w)} \qquad (3)$$

and

$$\frac{P(w, z)}{P(w)} = P(z \mid w), \qquad (4)$$

Bayes’ rule (Eq. 1) follows by simple substitution of Eq. (4) into Eq. (3) and then of the result into Eq. (2).

Cross References
7Bayesian Methods 7Bayesian Network 7Naïve Bayes 7Semi-Naïve Bayesian Learning

Bayesian Methods
Wray Buntine
NICTA, Canberra, Australia

Definition
The two most important concepts used in Bayesian modeling are probability and utility. Probabilities are used to model our belief about the state of the world and utilities are used to model the value to us of different outcomes, thus to model costs and benefits. Probabilities are represented in the form of p(x∣C), where C is the current known context and x is some event(s) of interest from a space χ. The left and right arguments of the probability function are in general propositions (in the logical sense). Probabilities are updated based on new evidence or outcomes y using Bayes rule, which takes the form

$$p(x \mid C, y) = \frac{p(x \mid C)\, p(y \mid x, C)}{p(y \mid C)},$$

where $p(y \mid C) = \sum_{x \in \chi} p(x \mid C)\, p(y \mid x, C)$ and χ is the discrete domain of x. More generally, any measurable set can be used for the domain χ. An integral or mixed sum and integral can replace the sum. For a utility function u(x) of some event x, for instance the benefit of a particular outcome, the expected value of u() is

$$E_{x \mid C}[u(x)] = \sum_{x \in \chi} p(x \mid C)\, u(x).$$

One then estimates the expected utility $E_{x \mid C, y}[u(x)]$ based on different evidence, actions or outcomes y. An action is taken to maximize this expected utility, appealing to the principle of maximum expected utility (MEU). A common application of this principle is recursive: one should take the action now that will maximize utility in the future, assuming all future actions are also taken to maximize utility.
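The update and the MEU choice described above can be illustrated with a small Python sketch over a discrete domain. Everything specific in it is an assumption made for the example: the two-state domain, the prior, the likelihood table, the utility values, and the function names are all hypothetical, and the utility is indexed by action as well as by event so that different actions can be compared.

```python
def posterior(prior, likelihood, y):
    """Bayes rule on a discrete domain: p(x|C,y) = p(x|C) p(y|x,C) / p(y|C)."""
    joint = {x: prior[x] * likelihood[x][y] for x in prior}
    evidence = sum(joint.values())  # p(y|C), the sum over the domain of x
    return {x: joint[x] / evidence for x in joint}

def expected_utility(belief, utility, action):
    """Expected utility of `action` under the current belief about x."""
    return sum(belief[x] * utility[action][x] for x in belief)

# Hypothetical diagnostic setting with made-up numbers.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": {"pos": 0.9, "neg": 0.1},
              "healthy": {"pos": 0.05, "neg": 0.95}}
utility = {"treat": {"disease": 10.0, "healthy": -1.0},
           "wait": {"disease": -20.0, "healthy": 0.0}}

belief = posterior(prior, likelihood, "pos")
best_action = max(utility, key=lambda a: expected_utility(belief, utility, a))
```

With these numbers a positive test raises the probability of disease from 0.01 to roughly 0.15, and treating has the higher expected utility. The sketch covers a single decision only; the recursive case, where future actions are also optimized, is not shown.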

Motivation and Background In modeling a problem, primarily, one considers an interrelated space of events or states, actions, and outcomes. Events describe the state of the world, outcomes are also sometimes considered events but they are special in that one directly obtains from them costs or benefits. Actions allow one to influence the world. Some actions may instigate tests and thus also help measure the state of the world to reduce uncertainty. Some problems may be dynamic in that a sequence of actions and outcomes are considered and the resulting changes in states modeled. The Bayesian approach is a modeling methodology that provides a principled approach of how to reason and act in the context of uncertainty and a dynamic environment. In the approach, probabilities are used to model all forms of belief or proportions about events and states, and then utilities are used to model the costs and benefits of any actions taken. An explicit assumption is that these probabilities and utilities can be adequately elicited and precisely modeled for the problem. An implicit assumption is that the computation required – recursive evaluation of


possibly nested integrals and sums (over domain variables) – can be done quickly enough so that the computation itself does not become a significant factor in the costs considered. The Bayesian approach is named after Rev. Thomas Bayes, whose work was contributed to the Royal Society in after his death, although it was independently more generally presented as a theory by Laplace in . The field was subsequently developed into a field of statistics, inference and decision theory by a stream of authors in the s including Jeffreys (Bernardo and Smith, ). The field of statistics was dominated by the frequentist school during the s, and for a time Bayesian methods were considered controversial. Like the different schools of theory in machine learning, these statistical approaches now coexist. The Bayesian approach can be justified by axiomatic prescriptions of how a rational agent should reason and act, and by appeal to principles of consistency. In the context of learning, probabilities are used to infer models of the problem of interest, and then utilities are used to recommend predictions or analysis based on the models.

Theory

Basic Theory

First, consider definitions, the different kinds of probability, the process of reasoning (about probabilities), and making decisions.

Basic definitions: Probabilities are represented in the form of p(x∣C), where C is the current known context and x is some event(s) of interest. It is sufficient to place in C only terms relevant to x and ignore terms assumed by default. Moreover, both x and C must be well-defined events. For instance, x = “John is tall” is not considered a well-defined event since the word “tall” is not precise. One would instead replace it with something like x = “John is greater than foot tall” or x = “Julie said John is tall.” An important functional used with probabilities is the expected value. For a function f(x) of some event x from a space χ, the expected value of f() is $E_{x \in \chi}[f(x)]$. Utility is used to measure value or relative satisfaction, and is usually represented as a function on outcomes. Costs are negative utility and benefits are positive. Utilities should be additive in worth, and are often practically interpreted in monetary units. Strictly speaking, the value of money is nonlinear (for most people, billion dollars is not significantly better than billion dollars), so it is not a correct utility measure. However, it is adequate when the range of financial transactions expected is reasonable. Expected utility, which is the expected value of the utility function, is the fundamental quantity assessed with Bayesian methods. Some scenarios are the following:

Prediction: For prediction problems, the outcome is the “true” value, and the utility is sometimes the mean square error or the absolute error. In data mining, the choices are much richer; see 7Model Evaluation.

Diagnosis: The outcome is the “true” diagnosis, and utility is made up of the differing costs of treatment, mistreatment, and delay or nontreatment, as well as any benefit from correct diagnosis.

Game playing: The utility comes from the eventual outcome of the game; each player has their own utility, and the state of the game constantly changes as plays are made.

In Bayesian machine learning, we usually take utilities as a given, and the majority of the work revolves around evaluating and estimating probabilities and maximizing expected utility. In some ranking tasks and generalized agent learning, the utilities themselves may be poorly understood.

Belief and proportions: Some probabilities correspond to proportions that exist in the real world, such as the proportion of school children in the general population of a given state. These real proportions can be measured by counting or sampling, and they are governed by Kolmogorov’s Axioms for probability, including that the probability of certainty is 1 and that the probability of a disjunction of mutually exclusive events is the sum of the probabilities of the individual events. This kind of probability is used in the Frequentist School that only considers long term average proportions obtained from a series of independent and identical experiments. These proportions can be model parameters one wishes to reason about. Probabilities can also represent beliefs. For instance, in , one could have had a belief about the event that George Bush would win the Presidential Election in the USA. This event is unique and has only one outcome, so the frequentist notion cannot be justified, i.e., there is no long-term sequence of different presidential elections with George Bush. Beliefs are usually considered to be subjective, in that they are specific to each agent, reflecting their sum of unique experiences, and the unique context in which the event in question occurs. To better understand the role beliefs play in Bayesian methods, also see 7Prior Probabilities.

Reasoning: A stylized version of probabilistic reasoning considers an event of interest one is reasoning about, x, and evidence, y, one may obtain. Typical scenarios are

Learning: x = (Θ, M) are parameters Θ of a model from family M, and y = D is a set of data $D = \{d_1, \ldots, d_N\}$. So one considers p(Θ, M∣D, C) versus p(Θ, M∣C).

Diagnosis: x is a disease or condition, and y is a set of observable symptoms or diagnostic tests. One might choose a test y that maximizes the expected utility.

Hypothesis testing: x is a hypothesis H and y is some sequence of evidence $E_1, E_2, \ldots, E_n$, so we consider $p(H \mid E_1, E_2, \ldots, E_n)$ and hope it is sufficiently high.

Different probabilities are then considered:

p(x∣C): The prior probability for event x, called the base rate in some contexts.

p(y∣C): The prior probability for evidence y. Once the evidence has been seen, this is also used as a proxy for the quality of the model.

p(x∣y, C): The posterior probability for event x given evidence y.

p(y∣x, C): The likelihood for the event x based on evidence y.

In the case of diagnostic reasoning, the prior p(x∣C) is usually the base rate for the disease or condition, and can be obtained from the population base rate. In the case of learning, however, the prior p(Θ, M∣C) represents a prior distribution on parameters about which we may well be largely ignorant, or at least may not be able to readily elicit from experts.
For instance, the proportion $\theta_D$ might be the probability of a new drug slowing the onset of AIDS-related diseases. At the moment of initial testing, $\theta_D$ is unknown so one places a probability distribution over $\theta_D$, which represents one’s belief about the proportion. These priors are second-order probabilities, beliefs about proportions, and they are the most challenging quantity modeled with the Bayesian approach. They can be a function of thousands of parameters, and can be critical in the success of applications. They are also challenging from the philosophical perspective.

Decision theory: The term Bayesian inference is usually reserved for the process of manipulating priors and posteriors, computing probabilities, and computing expected values. Bayesian decision theory describes the process of formulating utilities and then evaluating the (sometimes) recursive maximum expected utility formula, such as in game playing, or interactive advertising. In Bayesian theory one takes the action that maximizes expected utility (MEU) in the current context, sometimes referred to as the expected utility hypothesis. Decision theory places this in a dynamic context and says each action should be taken to maximize expected future utility. This is defined recursively, so taken to the limit this implies the optimal future actions need to be determined before the optimal current action can be obtained via MEU.

Justifications

This section covers basic mathematical justifications of the theory. The best general reference for this is Bernardo and Smith (). Additional discussion of prior probabilities appears in 7Prior Probabilities. Note that Bayesian theory, now a branch of mainstream statistics, is widely accepted for the following reasons:

Application: It has extensive support through practical success, often through a clever combination of prior knowledge and statistical and computational finesse.

Explanation: It provides a convenient common language in which a variety of other theoretical approaches can be represented. For instance, PAC, MDL methods, penalized likelihood methods, and the maximum margin approach all find good interpretations within the Bayesian framework.



Composition: It allows different reasoning tasks to be composed in a coherent way. With a probabilistic framework, the components can interoperate in a coherent manner, so that information may flow bidirectionally between components via probabilities.

Composition of processing steps in intelligent systems is a key application for Bayesian methods. For instance, natural language and vision recognition tasks can sometimes be broken down into a processing chain (for instance, doing a named entity recognition step before a dependency parsing step), but these components rarely work conclusively and unambiguously. By attaching probabilities to the output of components, and allowing probabilistic inputs, the uncertainty inherent in individual steps can be propagated and managed.

Theoretical justifications also exist to support each of the different components, probabilities, and utilities. These justifications are based on the concept of normative axioms, axioms that do not describe reasoning but rather prescribe basic principles it should follow. The axioms try to capture principles such as coherence and consistency in a quantitative manner. These various justifications have their reported shortcomings, and a rich literature exists arguing about the details and postulating new variants. These axiomatic justifications are supportive of the Bayesian approach, but they are not irrefutable.

Justifying probabilities: In the Bayesian approach, beliefs and proportions are given the same mathematical treatment. One set of arguably controversial justifications for this revolves around betting (Bernardo and Smith, , Sect. ..). Someone’s subjective beliefs about specific events, such as significant economic and political events (or horse races), are claimed to be measurable by offering them a series of options or bets. Moreover, if their beliefs do not behave like proportions, then a clever bookmaker can use a so-called Dutch book to consistently profit from them.
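The Dutch book argument mentioned above can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions: the agent is taken to be willing to buy, at a price equal to its stated belief, a ticket paying 1 unit if the event occurs, and the function name and prices are hypothetical.

```python
def agent_net_gain(price_a, price_not_a, a_occurs):
    """Net gain for an agent who buys a 1-unit ticket on A and one on not-A."""
    # Exactly one of the two tickets pays out, whichever way A turns out.
    payout = (1.0 if a_occurs else 0.0) + (0.0 if a_occurs else 1.0)
    return payout - (price_a + price_not_a)


# Incoherent beliefs: p(A) = 0.6 and p(not A) = 0.6 sum to more than 1.
incoherent = [agent_net_gain(0.6, 0.6, outcome) for outcome in (True, False)]
# Coherent beliefs: p(A) = 0.6 and p(not A) = 0.4 sum to exactly 1.
coherent = [agent_net_gain(0.6, 0.4, outcome) for outcome in (True, False)]
print(incoherent, coherent)
```

In the incoherent case the agent loses about 0.2 units whichever outcome occurs, so the bookmaker profits with certainty; with beliefs that sum to one, this particular pair of bets yields no sure loss.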
An alternative scheme for justifying probability, by Cox, is based on normative axioms that beliefs should follow. For instance, one controversial axiom by Cox is that belief about a single event should be represented by a single real number. These axioms are presented by Jaynes as rules for a robot (Jaynes, ), and as rules for intelligent systems by Horvitz et al. ().

Justifying decision theory: Another scheme, again using normative axioms, by von Neumann and Morgenstern, is used to justify the use of utilities. This scheme assumes probabilities are the basis of inference about uncertainty. A different set of normative axiomatic schemes has been developed that justifies the use of probabilities and utilities together under MEU; the best known is by Savage, but others exist (Bernardo and Smith, ).

Bayesian Computation

The first part of this article has been devoted to a brief overview of the Bayesian approach. Computation for Bayesian inference is an extensive field in itself. Here we review the basic aspects as a pointer to the literature. This is an active area of research in machine learning, statistics, and many applied artificial intelligence communities such as natural language processing, image analysis, and others. In general, in Bayesian reasoning one wants to estimate posterior average parameter values, or their average variance, or some other averaged quantity. The general formulas (in the case of continuous parameters) are

$$\bar{\Theta} = E_{\Theta \mid D,M,C}[\Theta] = \int_{\Theta} \Theta \, p(\Theta \mid D, M, C) \, d\Theta$$

$$\mathrm{var}(\Theta) = E_{\Theta \mid D,M,C}\left[(\Theta - \bar{\Theta})^2\right]$$

Marginal likelihood: A useful quantity to assist in evaluating results, and a worthy score in its own right, is the marginal likelihood, in the continuous parameter case found from the likelihood $p(D \mid \Theta, M, C)$ by taking an average

$$p(D \mid M, C) = \int_{\Theta} p(\Theta \mid M, C) \, p(D \mid \Theta, M, C) \, d\Theta.$$

This is also called the normalizing constant due to its occurrence in the posterior formula

$$p(\Theta \mid D, M, C) = \frac{p(\Theta \mid M, C) \, p(D \mid \Theta, M, C)}{p(D \mid M, C)}.$$

It is generally difficult to estimate because of the multidimensional integrals and sums.
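For a one-dimensional case these quantities can be computed directly, which may help fix ideas. The following is a minimal sketch, assuming a Bernoulli likelihood with a Beta(a, b) prior on a single proportion (a conjugate pair, so the posterior mean, the posterior variance, and the marginal likelihood all have closed forms); the function names and the data are hypothetical. A brute-force midpoint rule is included only to check the marginal likelihood integral numerically.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_bernoulli_summary(successes, failures, a=1.0, b=1.0):
    """Posterior mean, posterior variance, and log marginal likelihood."""
    a_post, b_post = a + successes, b + failures  # posterior is Beta(a_post, b_post)
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1.0))
    # p(D|M,C) for one particular ordered sequence of outcomes.
    log_marginal = log_beta(a_post, b_post) - log_beta(a, b)
    return mean, var, log_marginal


def numeric_marginal(successes, failures, a=1.0, b=1.0, grid=20000):
    """Midpoint-rule approximation of the integral of prior times likelihood."""
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta) - log_beta(a, b)
        likelihood = theta ** successes * (1.0 - theta) ** failures
        total += math.exp(log_prior) * likelihood / grid
    return total


mean, var, log_marginal = beta_bernoulli_summary(successes=7, failures=3)
print(mean, var, math.exp(log_marginal), numeric_marginal(7, 3))
```

The closed form is available here only because of conjugacy; in higher dimensions the grid-based integral becomes infeasible, which is the difficulty the text refers to.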


Exponential family distributions: Standard probability distributions covered in mathematical statistics, such as the 7Gaussian Distribution, the Poisson, Dirichlet, Gamma, and Wishart, have very convenient mathematical properties that make Bayesian estimation easier. With these distributions, one computes statistics, called sufficient statistics, such as a mean and sum of squares (for the Gaussian), and then parameter estimation follows with a function inverse on a concave function. This is the basis of 7linear regression, 7principal components analysis, and some 7decision tree learning methods, for instance. All good texts on mathematical statistics cover these in detail. Note that the marginal likelihood is often computable in closed form for exponential family distributions.

Graphical models: 7Graphical Models are a general family of probabilistic models formed by composing graphs over variables. They work particularly well with exponential family distributions, and allow a rich variety of popular machine learning and data mining methods to be represented and manipulated. Graphical models allow complex models to be composed from simpler components and provide a family of algorithm schemes for developing inference and learning methods that operate on them. They have become the de facto standard for presenting (suitably decomposed) models and algorithms in the machine learning community.

Maximum a posteriori estimation: Known as MAP, this is usually the simplest form of parameter estimation that could be called Bayesian. It also corresponds to a penalized or regularized maximum likelihood method. Given the posterior for a stylized learning problem of the previous section, one finds the parameters $\Theta$ that maximize the posterior $p(\Theta, M \mid D, C)$, which can be conveniently done without computing the marginal likelihood above, so

$$\hat{\Theta}_{MAP} = \operatorname*{argmax}_{\Theta} \log p(\Theta, D \mid M, C),$$

where the log probability can be broken down as a prior and a likelihood term

$$\log p(\Theta, D \mid M, C) = \log p(\Theta \mid M, C) + \log p(D \mid \Theta, M, C).$$

The Laplace approximation: When the posterior is well behaved, and there is a large amount of data, the posterior is focused around a vanishingly small region in parameter space of diameter $O(1/\sqrt{N})$. If this occurs away from the boundary of the parameter space, then one can make a second-order Taylor expansion of the log posterior at the MAP point, and the result is a Gaussian approximation to the posterior:

$$\log p(D, \Theta \mid M, C) \approx \log p(D, \hat{\Theta}_{MAP} \mid M, C) + \frac{1}{2} (\hat{\Theta}_{MAP} - \Theta)^T \left. \frac{d^2 \log p(D, \Theta \mid M, C)}{d\Theta \, d\Theta^T} \right|_{\Theta = \hat{\Theta}_{MAP}} (\hat{\Theta}_{MAP} - \Theta).$$
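The second-order expansion can be tried out on a one-parameter model. The following is a rough sketch, assuming a Bernoulli likelihood with a Beta(a, b) prior with a, b > 1, so that the MAP point lies in the interior and both the MAP estimate and the second derivative are available in closed form; the function names and data are hypothetical. Integrating the resulting Gaussian gives the standard one-dimensional Laplace estimate of the marginal likelihood, which is compared with the exact conjugate answer.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def laplace_log_marginal(successes, failures, a=2.0, b=2.0):
    """MAP estimate and Laplace estimate of log p(D|M,C); needs an interior mode."""
    A = a + successes - 1.0  # exponent of theta in the joint p(D, theta)
    B = b + failures - 1.0   # exponent of (1 - theta) in the joint
    theta_map = A / (A + B)  # maximizer of A*log(theta) + B*log(1 - theta)
    log_joint_at_map = A * math.log(theta_map) + B * math.log(1.0 - theta_map) - log_beta(a, b)
    # Second derivative of the log joint at the MAP point (negative at a maximum).
    hessian = -A / theta_map ** 2 - B / (1.0 - theta_map) ** 2
    # Gaussian integral: exp(log_joint_at_map) * sqrt(2 * pi / -hessian).
    return theta_map, log_joint_at_map + 0.5 * math.log(2.0 * math.pi / -hessian)


def exact_log_marginal(successes, failures, a=2.0, b=2.0):
    """Exact log marginal likelihood from Beta-Bernoulli conjugacy."""
    return log_beta(a + successes, b + failures) - log_beta(a, b)


theta_map, approx = laplace_log_marginal(successes=7, failures=3)
exact = exact_log_marginal(successes=7, failures=3)
print(theta_map, approx, exact)
```

With only ten observations the two log values already differ in the second decimal place, and the gap would typically be larger for skewed or multimodal posteriors.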

From this, one can approximate integrals such as the marginal likelihood $p(D \mid M, C)$. This is known as the Laplace approximation, the name of the corresponding general method used for the asymptotic expansion of integrals. In general, this is a poor approximation, but it serves to aid our understanding of parameter estimation (MacKay, Chaps. and ), and is the approximate basis for some model selection criteria.

Latent variable models: Latent variables are data that are hidden and thus never observed in the evidence. However, their existence is postulated as a significant component of the model. For instance, in 7Clustering (an unsupervised method) and finite mixture models generally, one assumes each data point has a hidden class label; thus the Bayesian model of clustering is a simple kind of latent variable model.

7Markov chain Monte Carlo methods: The most general form of reasoning and estimation available are the Markov chain Monte Carlo (MCMC) methods. The MCMC methods couple two processes: first, they use Monte Carlo or simulation methods to estimate the integral, and second, they use a Markov chain to sample, so sampling is sequential (Markovian) and samples are not independent. Simulation methods generally use the functional form of $p(\Theta, D \mid M, C)$, so we do not need to compute the marginal likelihood. Hence, given a set of $I$ samples $\{\Theta_1, \ldots, \Theta_I\}$, the expected value is approximated with a weighted average

$$\bar{\Theta} \approx \frac{1}{I} \sum_{i=1}^{I} w_i \Theta_i.$$

The simplest case is where the samples are made independently according to the posterior itself, and then the weights are $w_i = 1$. This is called the ordinary Monte Carlo (OMC) method, but it is not often usable in practice because efficient multidimensional posterior samplers rarely exist. Alternatively, one can sample according to a Markov chain, $\Theta_{i+1} \sim q(\Theta_{i+1} \mid \Theta_i)$, so each $\Theta_{i+1}$ is conditionally dependent on $\Theta_i$. So while samples are not independent, as long as the long-run distribution of the Markov chain is the same as the posterior, the same approximation formula holds. There is a rich variety of MCMC methods, and this forms one of the key areas of current research.

Gibbs sampling: The simplest kind of MCMC method samples each dimension (or sub-vector) in turn. Suppose the parameter vector has $K$ real components, $\Theta = (\theta_1, \ldots, \theta_K)$. Sampling a complete $\Theta$ in one go is not generally possible given just a functional form of the posterior $p(\Theta \mid D, M, C)$ with no computable form for the normalizing constant. Gibbs sampling works in the one-dimensional case, where normalizing bounds can be obtained and sampling tricks used. The conditional posterior of $\theta_k$ is given by $p(\theta_k \mid (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_K), D, M, C)$, and this is usually easier to sample from. The Gibbs (and MCMC) sample $\Theta_{i+1}$ can be drawn given the previous sample $\Theta_i$ by progressively resampling each dimension in turn and so slowly updating the full vector:

1. Sample $\theta_{i+1,1}$ according to $p(\theta_1 \mid \theta_{i,2}, \ldots, \theta_{i,K}, D, M, C)$.
...
k. Sample $\theta_{i+1,k}$ according to $p(\theta_k \mid \theta_{i+1,1}, \ldots, \theta_{i+1,k-1}, \theta_{i,k+1}, \ldots, \theta_{i,K}, D, M, C)$.
...
K. Sample $\theta_{i+1,K}$ according to $p(\theta_K \mid \theta_{i+1,1}, \ldots, \theta_{i+1,K-1}, D, M, C)$.

In sampling terms, this method is no more successful than coordinate-wise ascent is as a primitive greedy search method: it is supported by theoretical results but can be very slow to converge.

Variational approximations: When the function you seek to optimize or average over presents difficulty, for instance because it is highly multimodal, then one option is to change the function itself and replace it with a more readily approximated function. Variational methods provide a general principle for doing this safely. The general principle uses variational calculus, which is the calculus over functions, not just variables. Variational methods are a very general approach that can be used to develop a broad range of algorithms (Wainwright and Jordan, ).

Nonparametric models: The above discussion implicitly assumed the model has a fixed finite parameter vector $\Theta$. If one is attempting to model a regression function, or a language grammar, or an image model of unknown a priori structural complexity, then one cannot know the dimension ahead of time. Moreover, as in the case of functions, the dimension cannot always be finite. The 7Bayesian Nonparametric Models address this situation, and are perhaps the most important family of techniques for general machine learning.
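The Gibbs sampling steps described earlier in this section can be sketched for a case where the conditional posteriors are known exactly. This is only an illustration under a strong assumption: the "posterior" is taken to be a zero-mean bivariate Gaussian with unit variances and correlation rho, for which each conditional is a one-dimensional Gaussian. The function name, the burn-in length, and the seed are hypothetical choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # standard deviation of each conditional
    theta1, theta2 = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        # Step 1: resample theta1 given the previous theta2.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        # Step 2: resample theta2 given the freshly updated theta1.
        theta2 = rng.gauss(rho * theta1, cond_sd)
        if i >= burn_in:
            samples.append((theta1, theta2))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
# Unweighted averages over the chain (all weights w_i equal to 1).
mean1 = sum(s[0] for s in samples) / len(samples)
mean2 = sum(s[1] for s in samples) / len(samples)
cross = sum(s[0] * s[1] for s in samples) / len(samples)  # estimates rho
print(mean1, mean2, cross)
```

Successive samples are correlated, and the correlation between consecutive draws grows as rho approaches 1, which is one simple way to see the slow convergence noted in the text.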

Cross References 7Bayes Rule 7Bayesian Nonparametric Models 7Markov Chain Monte Carlo 7Prior Probability

Recommended Reading A good introduction to the problems of uncertainty and philosophical issues behind the Bayesian treatment of probability is in Lindley (). From the statistical machine learning perspective, a good introductory text is by MacKay (), who carefully covers information theory, probability, and inference but not so much statistical machine learning. An alternative introduction to probabilities is the posthumously completed and published work of Jaynes (). Discussions from the frequentist versus Bayesian battlefront can be found in works such as Rosenkrantz and Jaynes (), and from the approximate artificial intelligence versus probabilistic battlefront in discussion articles such as Cheeseman’s () and the many responses and rebuttals. It should be noted that it is the continued success in applications that has really led these methods into the mainstream, not the entertaining polemics. Good mathematical statistics textbooks, such as Casella and Berger (), cover the breadth of statistical methods and therefore handle basic Bayesian theory. A more comprehensive treatment is given in Bayesian texts such as Gelman et al. (). Most advanced statistical machine learning textbooks cover Bayesian methods, but to fully understand the subtleties of prior beliefs and Bayesian methodology one needs to view more advanced Bayesian literature. A detailed theoretical reference for Bayesian methods is Bernardo and Smith ().

Bernardo, J., & Smith, A. (). Bayesian theory. Chichester: Wiley.
Casella, G., & Berger, R. (). Statistical inference (2nd ed.). Pacific Grove: Duxbury.

Bayesian Nonparametric Models

Cheeseman, P. (). An inquiry into computer understanding. Computational Intelligence, (), –.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (). Bayesian data analysis (2nd ed.). Boca Raton: Chapman & Hall/CRC Press.
Horvitz, E., Heckerman, D., & Langlotz, C. (). A framework for comparing alternative formalisms for plausible reasoning. Fifth National Conference on Artificial Intelligence, Philadelphia, pp. –.
Jaynes, E. (). Probability theory: the logic of science. New York: Cambridge University Press.
Lindley, D. (). Understanding uncertainty. Hoboken: Wiley.
MacKay, D. (). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Rosenkrantz, R. (Ed.). (). E.T. Jaynes: papers on probability, statistics and statistical physics. Dordrecht: D. Reidel.
Wainwright, M. J., & Jordan, M. I. (). Graphical models, exponential families, and variational inference. Hanover: Now Publishers.

Bayesian Model Averaging 7Learning Graphical Models

Bayesian Network Synonyms Bayes net

Definition A Bayesian network is a form of directed 7graphical model for representing multivariate probability distributions. The nodes of the network represent a set of random variables, and the directed arcs represent causal relationships between variables. The Markov property is usually required: every direct dependency between a possible cause and a possible effect has to be shown with an arc. Bayesian networks with the Markov property are called I-maps (independence maps). If all arcs in the network correspond to a direct dependence on the system being modeled, then the network is said to be a D-map (dependence map). Each node is associated with a conditional probability distribution that quantifies the effects the parents of the node, if any, have on it. Bayesian networks support various forms of reasoning: diagnosis, to derive causes from symptoms; prediction, to derive effects from causes; and intercausal reasoning, to discover the mutual causes of a common effect.

Cross References 7Graphical Models

Bayesian Nonparametric Models Peter Orbanz, Yee Whye Teh Cambridge University, Cambridge, UK University College London, London, UK

Synonyms

Bayesian methods; Dirichlet process; Gaussian processes; Prior probabilities

Definition A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem. For example, in a regression problem, the parameter space can be the set of continuous functions, and in a density estimation problem, the space can consist of all densities. A Bayesian nonparametric model uses only a finite subset of the available parameter dimensions to explain a finite sample of observations, with the set of dimensions chosen depending on the sample such that the effective complexity of the model (as measured by the number of dimensions used) adapts to the data. Classical adaptive problems, such as nonparametric estimation and model selection, can thus be formulated as Bayesian inference problems. Popular examples of Bayesian nonparametric models include Gaussian process regression, in which the correlation structure is refined with growing sample size, and Dirichlet process mixture models for clustering, which adapt the number of clusters to the complexity of the data. Bayesian nonparametric models have recently been applied to a variety of machine learning problems, including regression, classification, clustering, latent variable modeling, sequential modeling, image segmentation, source separation, and grammar induction.


Motivation and Background Most of machine learning is concerned with learning an appropriate set of parameters within a model class from 7training data. The meta-level problems of determining appropriate model classes are referred to as model selection or model adaptation. These constitute important concerns for machine learning practitioners, not only for avoidance of over-fitting and under-fitting, but also for discovery of the causes and structures underlying data. Examples of model selection and adaptation include selecting the number of clusters in a clustering problem, the number of hidden states in a hidden Markov model, the number of latent variables in a latent variable model, or the complexity of features used in nonlinear regression. Nonparametric models constitute an approach to model selection and adaptation where the sizes of models are allowed to grow with data size. This is as opposed to parametric models, which use a fixed number of parameters. For example, a parametric approach to density estimation would be to fit a Gaussian or a mixture of a fixed number of Gaussians by maximum likelihood. A nonparametric approach would be a Parzen window estimator, which centers a Gaussian at each observation (and hence uses one mean parameter per observation). Another example is the support vector machine with a Gaussian kernel. The representer theorem shows that the decision function is a linear combination of Gaussian radial basis functions centered at every input vector, and thus has a complexity that grows with more observations. Nonparametric methods have long been popular in classical (non-Bayesian) statistics (Wasserman, ). They often perform impressively in applications and, though theoretical results for such models are typically harder to prove than for parametric models, appealing theoretical properties have been established for a wide range of models. 
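The Parzen window estimator mentioned above is perhaps the shortest way to see a model whose size grows with the data. The following is a minimal sketch, assuming one-dimensional observations and a Gaussian window of fixed width; the function name, the bandwidth value, and the data are hypothetical.

```python
import math


def parzen_density(x, observations, bandwidth=0.5):
    """Gaussian Parzen window estimate of the density at x."""
    # One Gaussian bump is centered at every observation, so the number of
    # mean parameters equals the number of data points.
    norm = 1.0 / (len(observations) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in observations)


observations = [-1.2, -0.8, 0.1, 0.9, 1.1, 1.3]
near_cluster = parzen_density(1.1, observations)
far_away = parzen_density(4.0, observations)
# A crude Riemann sum over [-6, 6] to check that the estimate is a density.
step = 0.01
area = sum(parzen_density(-6.0 + i * step, observations) * step for i in range(1200))
print(near_cluster, far_away, area)
```

Adding an observation adds a term to the sum, in contrast with a fixed mixture of Gaussians fitted by maximum likelihood, whose parameter count does not change with the sample size.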
Bayesian nonparametric methods provide a Bayesian framework for model selection and adaptation using nonparametric models. A Bayesian formulation of nonparametric problems is nontrivial, since a Bayesian model defines prior and posterior distributions on a single fixed parameter space, but the dimension of the parameter space in a nonparametric approach should change with sample size. The Bayesian nonparametric solution to this problem is to use an infinite-dimensional parameter space, and to invoke only a finite subset of the available parameters on any given finite data set. This subset generally grows with the data set. In the context of Bayesian nonparametric models, “infinite-dimensional” can therefore be interpreted as “of finite but unbounded dimension.” More precisely, a Bayesian nonparametric model is a model that (1) constitutes a Bayesian model on an infinite-dimensional parameter space and (2) can be evaluated on a finite sample in a manner that uses only a finite subset of the available parameters to explain the sample.

We make the above description more concrete in the next section when we describe a number of standard machine learning problems and the corresponding Bayesian nonparametric solutions. As we will see, the parameter space in (1) typically consists of functions or of measures, while (2) is usually achieved by marginalizing out surplus dimensions over the prior. Random functions and measures and, more generally, probability distributions on infinite-dimensional random objects are called stochastic processes; examples that we will encounter include Gaussian processes, Dirichlet processes, and beta processes. Bayesian nonparametric models are often named after the stochastic processes they contain. The examples are then followed by theoretical considerations, including formal constructions and representations of the stochastic processes used in Bayesian nonparametric models, exchangeability, and issues of consistency and convergence rate. We conclude this chapter with future directions and a list of literature available for reading.

Examples Clustering with mixture models. Bayesian nonparametric generalizations of finite mixture models provide an approach for estimating both the number of components in a mixture model and the parameters of the individual mixture components simultaneously from data. Finite mixture models define a density function over data items $x$ of the form $p(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k)$, where $\pi_k$ is the mixing proportion and $\theta_k$ are parameters associated with component $k$. The density can be written in a non-standard manner as an integral: $p(x) = \int p(x \mid \theta) \, G(\theta) \, d\theta$, where $G = \sum_{k=1}^{K} \pi_k \delta_{\theta_k}$ is a discrete mixing distribution encapsulating all the parameters of the mixture model and $\delta_{\theta}$ is a Dirac distribution (atom) centered at $\theta$. Bayesian nonparametric mixtures use mixing distributions consisting of a countably infinite number of atoms instead:

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}. \qquad (1)$$

This gives rise to mixture models with an infinite number of components. When applied to a finite training set, only a finite (but varying) number of components will be used to model the data, since each data item is associated with exactly one component but each component can be associated with multiple data items. Inference in the model then automatically recovers both the number of components to use and the parameters of the components. Being Bayesian, we need a prior over the mixing distribution $G$, and the most common prior to use is a Dirichlet process (DP). The resulting mixture model is called a DP mixture. Formally, a Dirichlet process $DP(\alpha, H)$ parametrized by a concentration parameter $\alpha > 0$ and a base distribution $H$ is a prior over distributions (probability measures) $G$ such that, for any finite partition $A_1, \ldots, A_m$ of the parameter space, the induced random vector $(G(A_1), \ldots, G(A_m))$ is Dirichlet distributed with parameters $(\alpha H(A_1), \ldots, \alpha H(A_m))$ (see the section entitled “Theory” for a discussion of subtleties involved in this definition). It can be shown that draws from a DP will be discrete distributions as given in (1). The DP also induces a distribution over partitions of integers called the Chinese restaurant process (CRP), which directly describes the prior over how data items are clustered under the DP mixture. For more details on the DP and the CRP, see 7Dirichlet Process.

Nonlinear regression. The aim of regression is to infer a continuous function from a training set consisting of input–output pairs $\{(t_i, x_i)\}_{i=1}^{n}$. Parametric approaches parametrize the function using a finite number of parameters and attempt to infer these parameters from data. The prototypical Bayesian nonparametric approach to this problem is to define a prior distribution over continuous functions directly by means of a Gaussian process (GP).
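The Chinese restaurant process mentioned above for the DP mixture can be simulated in a few lines, which shows how a finite data set ends up using only finitely many of the infinitely many components. This is a sketch under an assumption not spelled out in the text: it uses the usual sequential seating rule, in which item n + 1 joins an existing cluster with probability proportional to that cluster's current size and starts a new cluster with probability proportional to alpha. The function name and settings are hypothetical.

```python
import random


def sample_crp(n_items, alpha, rng):
    """Draw a random partition of n_items from a CRP with concentration alpha."""
    assignments, counts = [], []
    for _ in range(n_items):
        # Existing clusters are weighted by size; a new cluster gets weight alpha.
        weights = counts + [alpha]
        cluster = rng.choices(range(len(weights)), weights=weights)[0]
        if cluster == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[cluster] += 1  # join an existing cluster
        assignments.append(cluster)
    return assignments, counts


rng = random.Random(0)
assignments, counts = sample_crp(n_items=100, alpha=1.0, rng=rng)
print(len(counts), counts)
```

The number of occupied clusters is itself random and tends to increase slowly with the number of items, so the complexity of the partition adapts to the size of the data set.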
As explained in the chapter 7Gaussian Process, a GP is a distribution on an infinite collection of random variables $X_t$, such that the joint distribution of each finite subset $X_{t_1}, \ldots, X_{t_m}$ is a multivariate Gaussian. A value $x_t$ taken by the variable $X_t$ can be regarded as the value of a continuous function $f$ at $t$, that is, $f(t) = x_t$. Given the training set, the Gaussian process posterior is again a distribution on functions, conditional on these functions taking values $f(t_1) = x_1, \ldots, f(t_n) = x_n$.

Latent feature models. These models represent a set of objects in terms of a set of latent features, each of which represents an independent degree of variation exhibited by the data. Such a representation of data is sometimes referred to as a distributed representation. In analogy to nonparametric mixture models with an unknown number of clusters, a Bayesian nonparametric approach to latent feature modeling allows for an unknown number of latent features. The stochastic processes involved here are known as the Indian buffet process (IBP) and the beta process (BP). Draws from BPs are random discrete measures, where each of an infinite number of atoms has a mass in $(0, 1)$ but the masses of atoms need not sum to 1. Each atom corresponds to a feature, with the mass corresponding to the probability that the feature is present for an object. We can visualize the occurrences of features among objects using a binary matrix, where the $(i, k)$ entry is 1 if object $i$ has feature $k$ and 0 otherwise. The distribution over binary matrices induced by the BP is called the IBP.

7Hidden Markov models (HMMs). HMMs are popular models for sequential or temporal data, where each time step is associated with a state, with state transitions dependent on the previous state. An infinite HMM is a Bayesian nonparametric approach to HMMs, where the number of states is unbounded and allowed to grow with the sequence length. It is defined using one DP prior for the transition probabilities going out from each state. To ensure that the set of states reachable from each outgoing state is the same, the base distributions of the DPs are shared and given a DP prior recursively. The construction is called a hierarchical Dirichlet process (HDP); see below.

7Density estimation.
A nonparametric Bayesian approach to density estimation requires a prior on densities or distributions. However, the DP is not useful in this context, since it generates discrete distributions. A useful density estimator should smooth the empirical density (as a Parzen window estimator does), which requires a prior that can generate smooth distributions. Priors applicable in density estimation problems include DP mixture models and Pólya trees. If $p(x \mid \theta)$ is a smooth density function, the density $\sum_{k=1}^{\infty} \pi_k \, p(x \mid \theta_k)$ induced by a DP mixture model is a smooth random density, so that DP mixtures can be used as a prior in density estimation problems. Pólya trees are priors on probability distributions that can generate both discrete and piecewise continuous distributions, depending on the choice of parameters. Pólya trees are defined by a recursive, infinitely deep binary subdivision of the domain of the generated random measure. Each subdivision is associated with a beta random variable which describes the relative amount of mass on each side of the subdivision. The DP is a special case of a Pólya tree corresponding to a particular parametrization. For other parametrizations the resulting random distribution can be smooth, so it is suitable for density estimation.

Power-law phenomena. Many naturally occurring phenomena exhibit power-law behavior. Examples include natural languages, images, and social and genetic networks. An interesting generalization of the DP, called the Pitman-Yor process, $PYP(\alpha, d, H)$, has recently been successfully used to model power-law data. The Pitman-Yor process augments the DP by a third parameter $d \in [0, 1)$. When $d = 0$ the PYP is a $DP(\alpha, H)$, while when $\alpha = 0$ it is a so-called normalized stable process.

Sequential modeling. HMMs model sequential data using latent variables representing the underlying state of the system, and assuming that each state only depends on the previous state (the so-called Markov property). In some applications, for example language modeling and text compression, we are interested in directly modeling sequences without using latent variables, and without making any Markov assumptions, i.e., modeling each observation conditional on all previous observations in the sequence. Since the set of potential sequences of previous observations is unbounded, this calls for nonparametric models. A hierarchical Pitman-Yor process can be used to construct a Bayesian nonparametric solution whereby the conditional probabilities are coupled hierarchically.
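The effect of the extra discount parameter $d$ of the Pitman-Yor process described above can be seen in a small simulation of cluster sizes. This is a sketch under an assumption not stated in the text: it uses the standard two-parameter seating rule, in which an existing cluster of size $n_k$ is chosen with weight $n_k - d$ and a new cluster with weight $\alpha + dK$, where $K$ is the current number of clusters. The function name and parameter values are hypothetical.

```python
import random


def sample_pitman_yor_sizes(n_items, alpha, d, rng):
    """Cluster sizes from the two-parameter seating rule; d = 0 gives the DP case."""
    if not 0.0 <= d < 1.0 or alpha <= -d:
        raise ValueError("requires 0 <= d < 1 and alpha > -d")
    counts = []
    for _ in range(n_items):
        if not counts:
            counts.append(1)  # the first item always opens the first cluster
            continue
        # Each existing cluster is discounted by d; the new-cluster weight
        # grows with the number of clusters already opened.
        weights = [c - d for c in counts] + [alpha + d * len(counts)]
        cluster = rng.choices(range(len(weights)), weights=weights)[0]
        if cluster == len(counts):
            counts.append(1)
        else:
            counts[cluster] += 1
    return counts


rng = random.Random(0)
dp_sizes = sample_pitman_yor_sizes(n_items=2000, alpha=1.0, d=0.0, rng=rng)
pyp_sizes = sample_pitman_yor_sizes(n_items=2000, alpha=1.0, d=0.5, rng=rng)
print(len(dp_sizes), len(pyp_sizes))
```

With a positive discount the number of clusters grows much faster with the number of items, and many clusters stay very small, which is the heavy-tailed behavior that makes the process attractive for power-law data.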
Dependent and hierarchical models. Most of the Bayesian nonparametric models described so far are applied in settings where observations are homogeneous or exchangeable. In many real-world settings observations are not homogeneous, and in fact are often structured in interesting ways. For example, the data generating process might change over time, so that observations at different times are not exchangeable, or observations might come in distinct groups with those in the same group being more similar than across groups. Significant recent efforts in Bayesian nonparametrics research have been devoted to developing extensions that can handle these non-homogeneous settings. Dependent Dirichlet processes are stochastic processes, typically over a spatial or temporal domain, which define a Dirichlet process (or a related random measure) at each point with neighboring DPs being more dependent. These are used for spatial modeling, nonparametric regression, as well as for modeling temporal changes. Alternatively, hierarchical Bayesian nonparametric models like the hierarchical DP aim to couple multiple Bayesian nonparametric models within a hierarchical Bayesian framework. The idea is to allow sharing of statistical strength across multiple groups of observations. Among other applications, these have been used in the infinite HMM, topic modeling, language modeling, word segmentation, image segmentation, and grammar induction. For an overview of various dependent Bayesian nonparametric models and their applications in biostatistics, please refer to Dunson (). Teh and Jordan () give an overview of hierarchical Bayesian nonparametric models as well as a variety of applications in machine learning.

Theory As we saw in the preceding examples, Bayesian nonparametric models often make use of priors over functions and measures. Because these spaces typically have an uncountable number of dimensions, extra care has to be taken to define the priors properly and to study the asymptotic properties of estimation in the resulting models. In this section we give an overview of the basic concepts involved in the theory of Bayesian nonparametric models. We start with a discussion of the importance of exchangeability in Bayesian parametric and nonparametric statistics. This is followed by representations of the priors and issues of convergence.

Exchangeability

The underlying assumption of all Bayesian methods is that the parameter specifying the observation model is a random variable. This assumption is subject to much criticism, and is at the heart of the Bayesian versus non-Bayesian debate that has long divided the statistics community. However, there is a very general type of observation for which the existence of such a random variable can be derived mathematically: for so-called exchangeable observations, the Bayesian assumption that a randomly distributed parameter exists is not a modeling assumption, but a mathematical consequence of the data's properties.

Formally, a sequence of variables $X_1, X_2, \ldots, X_n$ over the same probability space $(\mathcal{X}, \Omega)$ is exchangeable if their joint distribution is invariant to permuting the variables. That is, if $P$ is the joint distribution and $\sigma$ any permutation of $\{1, \ldots, n\}$, then

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_{\sigma(1)}, X_2 = x_{\sigma(2)}, \ldots, X_n = x_{\sigma(n)}).$$

An infinite sequence $X_1, X_2, \ldots$ is infinitely exchangeable if $X_1, \ldots, X_n$ is exchangeable for every $n \geq 1$. In this chapter, we mean infinite exchangeability whenever we write exchangeability. Exchangeability reflects the assumption that the variables do not depend on their indices although they may be dependent among themselves. This is typically a reasonable assumption in machine learning and statistical applications, even if the variables are not themselves independently and identically distributed (iid). Exchangeability is a much weaker assumption than iid since iid variables are automatically exchangeable.

If $\theta$ parametrizes the underlying distribution, and one assumes a prior distribution over $\theta$, then the resulting marginal distribution over $X_1, X_2, \ldots$ with $\theta$ marginalized out will still be exchangeable. A fundamental result credited to de Finetti () states that the converse is also true. That is, if $X_1, X_2, \ldots$ is (infinitely) exchangeable, then there is a random $\theta$ such that

$$P(X_1, \ldots, X_n) = \int P(\theta) \prod_{i=1}^{n} P(X_i \mid \theta) \, d\theta$$

for every $n \geq 1$. In other words, the seemingly innocuous assumption of exchangeability automatically implies the existence of a hierarchical Bayesian model with $\theta$ being the random latent parameter. This is the crux of the fundamental importance of exchangeability to Bayesian statistics.

In de Finetti's theorem it is important to stress that $\theta$ can be infinite dimensional (it is typically a random measure), thus the hierarchical Bayesian model above is typically a nonparametric one. For example, the Blackwell–MacQueen urn scheme (related to the CRP) is exchangeable and thus implicitly defines a random measure, namely the DP (see →Dirichlet Process for more details). In this sense, we will see below that de Finetti's theorem is an alternative route to Kolmogorov's extension theorem, which implicitly defines the stochastic processes underlying Bayesian nonparametric models.

Model Representations

In finite dimensions, a probability model is usually defined by a density function or probability mass function. In infinite dimensional spaces, this approach is not generally feasible, for reasons explained below. To define or work with a Bayesian nonparametric model, we have to choose alternative mathematical representations.

Weak distributions. A weak distribution is a representation for the distribution of a stochastic process, that is, for a probability distribution on an infinite-dimensional sample space. If we assume that the dimensions of the space are indexed by $t \in T$, the stochastic process can be regarded as the joint distribution $P$ of an infinite set of random variables $\{X_t\}_{t \in T}$. For any finite subset $S \subset T$ of dimensions, the joint distribution $P_S$ of the corresponding subset $\{X_t\}_{t \in S}$ of random variables is a finite-dimensional marginal of $P$. The weak distribution of a stochastic process is the set of all its finite-dimensional marginals, that is, the set $\{P_S : S \subset T, |S| < \infty\}$. For example, the customary definition of the Gaussian process as an infinite collection of random variables, each finite subset of which has a joint Gaussian distribution, is a weak distribution representation. In contrast to the explicit representations to be described below, this representation is generally not generative, because it represents the distribution rather than a random draw, but it is more widely applicable.

Merely writing down a collection of finite-dimensional distributions does not, however, guarantee that it is a valid representation of a stochastic process. A given collection of finite-dimensional distributions represents a stochastic process only (1) if a process with these distributions as its marginals actually exists, and (2) if it is uniquely defined by the marginals. The mathematical result which guarantees that weak distribution representations are valid is the Kolmogorov extension theorem (also known as the Daniell–Kolmogorov theorem or the Kolmogorov consistency theorem). Suppose that a collection $\{P_S : S \subset T, |S| < \infty\}$ of distributions is given. If all distributions in the collection are marginals of each other, that is, if $P_{S_1}$ is a marginal of $P_{S_2}$ whenever $S_1 \subset S_2$, the set of distributions is called a projective family. The Kolmogorov extension theorem states that, if the set $T$ is countable, and if the distributions $P_S$ form a projective family, then there exists a uniquely defined stochastic process with the collection $\{P_S\}$ as its marginal distributions. In other words, any projective family for a countable set $T$ of dimensions is the weak distribution of a stochastic process. Conversely, any stochastic process can be represented in this manner, by computing its set of finite-dimensional marginals.

The weak distribution representation assumes that all individual random variables $X_t$ of the stochastic process take values in the same sample space $\Omega$. The stochastic process $P$ defined by the weak distribution is then a probability distribution on the sample space $\Omega^T$, which can be interpreted as the set of all functions $f : T \to \Omega$. For example, to construct a GP we might choose $T = \mathbb{Q}$ and $\Omega = \mathbb{R}$ to obtain real-valued functions on the countable space of rational numbers. Since $\mathbb{Q}$ is dense in $\mathbb{R}$, the function $f$ can then be extended to all of $\mathbb{R}$ by continuity. To define the DP as a distribution over probability measures on $\mathbb{R}$, we note that a probability measure is a set function that maps "random events," i.e., elements of the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ of $\mathbb{R}$, into probabilities in $[0, 1]$. We could therefore choose a weak distribution consisting of Dirichlet distributions, and set $T = \mathcal{B}(\mathbb{R})$ and $\Omega = [0, 1]$.
However, this approach raises a new problem because the set $\mathcal{B}(\mathbb{R})$ is not countable. As in the GP case, we can first define the DP on a countable "base" for $\mathcal{B}(\mathbb{R})$ and then extend to all random events by continuity of measures. More precise descriptions are unfortunately beyond the scope of this chapter.

Explicit representations. Explicit representations directly describe a random draw from a stochastic process, rather than its distribution. A prominent example of an explicit representation is the so-called stick-breaking representation of the Dirichlet process. The discrete random measure $G = \sum_{k} \pi_k \delta_{\theta_k}$ is completely determined by the two infinite sequences $\{\pi_k\}_{k \in \mathbb{N}}$ and $\{\theta_k\}_{k \in \mathbb{N}}$. The stick-breaking representation of the DP generates these two sequences by drawing $\theta_k \sim H$ iid and $v_k \sim \mathrm{Beta}(1, \alpha)$ for $k = 1, 2, \ldots$. The coefficients $\pi_k$ are then computed as $\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$. The measure $G$ so obtained can be shown to be distributed according to a $\mathrm{DP}(\alpha, G_0)$ with base measure $G_0 = H$. Similar representations can be derived for the Pitman–Yor process and the beta process as well. Explicit representations, if they exist for a given model, are typically of great practical importance for the derivation of algorithms.

Implicit representations. A third representation of infinite dimensional models is based on de Finetti's theorem. Any exchangeable sequence $X_1, X_2, \ldots$ uniquely defines a stochastic process $\theta$, called the de Finetti measure, conditionally on which the $X_i$'s are iid. If the $X_i$'s are sufficient to define the rest of the model and their conditional distributions are easily specified, then it is sufficient to work directly with the $X_i$'s and have the underlying stochastic process implicitly defined. Examples include the Chinese restaurant process (an exchangeable distribution over partitions) with the DP as the de Finetti measure, and the Indian buffet process (an exchangeable distribution over binary matrices) with the BP being the corresponding de Finetti measure. These implicit representations are useful in practice as they can lead to simple and efficient inference algorithms.

Finite representations. A fourth representation of Bayesian nonparametric models is as the infinite limit of finite (parametric) Bayesian models. For example, DP mixtures can be derived as the infinite limit of finite mixture models with particular Dirichlet priors on mixing proportions, GPs can be derived as the infinite limit of particular Bayesian regression models with Gaussian priors, while BPs can be derived as the limit of an infinite number of independent beta variables. These representations are sometimes more intuitive for practitioners familiar with parametric models. However, not all Bayesian nonparametric models can be expressed in this fashion, and they do not necessarily make clear the mathematical subtleties involved.

Consistency and Convergence Rates

A recent series of works in mathematical statistics examines the convergence properties of Bayesian nonparametric models, and in particular the questions of consistency and convergence rates. In this context, a Bayesian model is called consistent if, given that an infinite amount of data is available, the model posterior will concentrate in a neighborhood of the true solution (e.g., true function or density). A rate of convergence specifies, for a finite sample, how rapidly the posterior concentrates depending on the sample size. In their pioneering article, Diaconis and Freedman () showed, to the great surprise of much of the Bayesian community, that models such as the Dirichlet process can be inconsistent, and may converge to arbitrary solutions even for an infinite amount of data.

More recent results, notably by van der Vaart and Ghosal, apply modern methods of mathematical statistics to study the convergence properties of Bayesian nonparametric models (see, e.g., Gho, () and references therein). Consistency has been shown for a number of models, including Gaussian processes and Dirichlet process mixtures. A particularly interesting aspect of this line of work, however, is the set of results on convergence rates, which specify the rate of concentration of the posterior depending on sample size, on the complexity of the model, and on how much probability mass the prior places around the true solution. Making such results quantitative requires a measure for the complexity of a Bayesian nonparametric model. This is done by means of complexity measures developed in empirical process theory and statistical learning theory, such as metric entropies, covering numbers, and bracketing, some of which are well known in theoretical machine learning.

Inference

There are two aspects to inference from Bayesian nonparametric models: the analytic tractability of posteriors for the stochastic processes embedded in Bayesian nonparametric models, and practical inference algorithms for the overall models. Bayesian nonparametric models typically include stochastic processes such as the Gaussian process and the Dirichlet process. These processes have an infinite number of dimensions, hence naïve algorithmic approaches to computing posteriors are generally infeasible. Fortunately, these processes typically have analytically tractable posteriors, so all but finitely many of the dimensions can be analytically integrated out efficiently. The remaining dimensions, along with the parametric parts of the models, can then be handled by the usual inference techniques employed in parametric Bayesian modeling, including Markov chain Monte Carlo, sequential Monte Carlo, variational inference, and message-passing algorithms like expectation propagation. The precise choice of approximations to use will depend on the specific models under consideration, with speed/accuracy trade-offs between different techniques generally following those for parametric models. In the following, we will give two examples to illustrate the above points, and discuss a few theoretical issues associated with the analytic tractability of stochastic processes.

Examples

In Gaussian process regression, we model the relationship between an input $x$ and an output $y$ using a function $f$, so that $y = f(x) + \epsilon$, where $\epsilon$ is iid Gaussian noise. Given a GP prior over $f$ and a finite training data set $\{(x_i, y_i)\}_{i=1}^{n}$ we wish to compute the posterior over $f$. Here we can use the weak representation of $f$ and note that $\{f(x_i)\}_{i=1}^{n}$ is simply a finite-dimensional Gaussian with mean and covariance given by the mean and covariance functions of the GP. Inference for $\{f(x_i)\}_{i=1}^{n}$ is then straightforward. The approach can be thought of equivalently as marginalizing out the whole function except its values on the training inputs. Note that although we only have the posterior over $\{f(x_i)\}_{i=1}^{n}$, this is sufficient to reconstruct the function evaluated at any other point $x_0$ (say, the test input), since $f(x_0)$ is Gaussian and independent of the training data $\{(x_i, y_i)\}_{i=1}^{n}$ given $\{f(x_i)\}_{i=1}^{n}$. In GP regression the posterior over $\{f(x_i)\}_{i=1}^{n}$ can be computed exactly. In GP classification or other regression settings with nonlinear likelihood functions, the typical approach is to use sparse methods based on variational approximations or expectation propagation; see →Gaussian Process for details.

Our second example involves Dirichlet process mixture models. Recall that the DP induces a clustering structure on the data items. If our training set consists of $n$ data items, since each item can only belong to one cluster, there are at most $n$ clusters represented in the training set. Even though the DP mixture itself has an infinite number of potential clusters, all but finitely many of these are not associated with data, thus the associated variables need not be explicitly represented at all. This can be understood either as marginalizing out these variables, or as an implicit representation which can be made explicit whenever required by sampling from the prior. This idea is applicable for DP mixtures using both the Chinese restaurant process and the stick-breaking representations. In the CRP representation, each data item $x_i$ is associated with a cluster index $z_i$, and each cluster $k$ with a parameter $\theta_k^*$ (these parameters can be marginalized out if $H$ is conjugate to $F$), and these are the only latent variables that need be represented in memory. In the stick-breaking representation, clusters are ordered by decreasing prior expected size, with cluster $k$ associated with a parameter $\theta_k^*$ and a size $\pi_k$. Each data item is again associated with a cluster index $z_i$, and only the clusters up to $K = \max(z_1, \ldots, z_n)$ need to be represented. Clusters with index greater than $K$ need not be represented since their posterior, conditioned on $\{(x_i, z_i)\}_{i=1}^{n}$, is just the prior.

On Bayes Equations and Conjugacy

It is worth noting that the posterior of a Bayesian model is, in abstract terms, defined as the conditional distribution of the parameter given the data and the hyperparameters, and this definition does not require the existence of a Bayes equation. If a Bayes equation exists for the model, the posterior can equivalently be defined as the left-hand side of the Bayes equation. However, for some stochastic processes, notably the DP on an uncountable space such as $\mathbb{R}$, it is not possible to define a Bayes equation even though the posterior is still a well-defined mathematical object. Technically speaking, existence of a Bayes equation requires the family of all possible posteriors to be dominated by the prior, but this is not the case for the DP. That posteriors of these stochastic processes can be evaluated at all is solely due to the fact that they admit an analytic representation.

The particular form of tractability exhibited by many stochastic processes in the literature is that of a conjugate posterior, that is, the posterior belongs to the same model family as the prior, and the posterior parameters can be computed as a function of the prior hyperparameters and the observed data. For example, the posterior of a $\mathrm{DP}(\alpha, G_0)$ under observations $\theta_1, \ldots, \theta_n$ is again a Dirichlet process,

$$\mathrm{DP}\left(\alpha + n, \; \frac{1}{\alpha + n}\Big(\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}\Big)\right).$$

Similarly, the posterior of a GP under observations of $f(x_1), \ldots, f(x_n)$ is still a GP. It is this conjugacy that allows practical inference in the examples above. A Bayesian nonparametric model is conjugate if and only if the elements of its weak distribution, i.e., its finite-dimensional marginals, have a conjugate structure as well (Orbanz, ). In particular, this characterizes a class of conjugate Bayesian nonparametric models whose weak distributions consist of exponential family models. Note, however, that lack of conjugacy does not imply intractable posteriors. An example is given by the Pitman–Yor process, in which the posterior is given by a sum of a finite number of atoms and a Pitman–Yor process independent from the atoms.
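The conjugate DP update can be combined with the stick-breaking representation to simulate an approximate draw from the posterior. The following is a minimal sketch, not a reference implementation: the function names, the truncation level `num_sticks`, and the standard normal base measure in the usage lines are all illustrative assumptions. It draws stick weights with concentration $\alpha + n$ and atoms from the posterior base measure $(\alpha G_0 + \sum_i \delta_{\theta_i})/(\alpha + n)$.

```python
import random


def stick_breaking_weights(concentration, num_sticks, rng):
    """Truncated stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j),
    with v_k ~ Beta(1, concentration). Returns (weights, leftover tail mass)."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, concentration)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def sample_posterior_base(alpha, observations, base_sampler, rng):
    """Draw one atom from (alpha * G0 + sum_i delta_{theta_i}) / (alpha + n)."""
    n = len(observations)
    if rng.random() < alpha / (alpha + n):
        return base_sampler(rng)        # fresh atom from the prior base G0
    return rng.choice(observations)     # repeat one of the observed atoms


def sample_dp_posterior(alpha, observations, base_sampler, num_sticks, rng):
    """Truncated draw G ~ DP(alpha + n, posterior base measure).
    Returns a list of (weight, atom) pairs and the truncated tail mass."""
    weights, tail = stick_breaking_weights(alpha + len(observations), num_sticks, rng)
    atoms = [sample_posterior_base(alpha, observations, base_sampler, rng)
             for _ in weights]
    return list(zip(weights, atoms)), tail


# Usage (assumed setup): G0 = N(0, 1), alpha = 1, three observations.
rng = random.Random(0)
standard_normal = lambda r: r.gauss(0.0, 1.0)
draw, tail = sample_dp_posterior(1.0, [0.3, 0.3, 1.7], standard_normal, 200, rng)
```

The truncation is the only approximation here: the returned `tail` is the probability mass of the sticks that were not generated, so it indicates how coarse the truncated draw is.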

Future Directions

Since MCMC (see →Markov Chain Monte Carlo) sampling algorithms for Dirichlet process mixtures became available in the 1990s and made latent variable models with nonparametric Bayesian components applicable to practical problems, the development of Bayesian nonparametrics has experienced explosive growth (Escobar & West, ; Neal, ). Arguably, though, the results available so far have only scratched the surface. The repertoire of available models is still mostly limited to using the Gaussian process, the Dirichlet process, the beta process, and generalizations derived from those. In principle, Bayesian nonparametric models may be defined on any infinite-dimensional mathematical object of possible interest to machine learning and statistics. Possible examples are kernels, infinite graphs, special classes of functions (e.g., piecewise continuous or Sobolev functions), and permutations. Aside from the obvious modeling questions, two major future directions are to make Bayesian nonparametric methods available to a larger audience of researchers and practitioners through the development of software packages, and to understand and quantify the theoretical properties of available methods.

General-Purpose Software Package

There is currently significant growth in the application of Bayesian nonparametric models across a variety of application domains both in machine learning and in statistics. However, significant hurdles still exist, especially the expense and expertise needed to develop computer programs for inference in these complex models. One future direction is thus the development of software packages that can compile efficient inference algorithms automatically given model specifications, thus allowing a much wider range of modelers to make use of these models. Current developments include the R DPpackage (http://cran.r-project.org/web/packages/DPpackage), the hierarchical Bayesian compiler (http://www.cs.utah.edu/hal/HBC), adaptor grammars (http://www.cog.brown.edu/mj/Software.htm), the MIT-Church project (http://projects.csail.mit.edu/church/wiki/Church), as well as efforts to add Bayesian nonparametric models to the repertoire of current Bayesian modeling environments like OpenBugs (http://mathstat.helsinki.fi/openbugs) and infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet).

Statistical Properties of Models

Recent work in mathematical statistics provides some insight into the quantitative behavior of Bayesian nonparametric models (cf. the Theory section). The elegant, methodical approach underlying these results, which quantifies model complexity by means of empirical process theory and then derives convergence rates as a function of the complexity, should be applicable to a wide range of models. So far, however, only results for Gaussian processes and Dirichlet process mixtures have been proven, and it will be of great interest to establish properties for other priors. Some models developed in machine learning, such as the infinite HMM, may pose new challenges to theoretical methodology, since their study will probably have to draw on both the theory of algorithms and mathematical statistics. Once a wider range of results is available, they may in turn serve to guide the development of new models, if it is possible to establish how different methods of model construction affect the statistical properties of the constructed model.

In addition to the references embedded in the text above, we recommend the books Hjort, Holmes, Müller, and Walker () and Ghosh and Ramamoorthi (), and the review articles Walker, Damien, Laud, and Smith () and Müller and Quintana () on Bayesian nonparametrics. Further references can be found in the chapter by Teh and Jordan () of the book Hjort et al. ().

Cross References

→Bayesian Methods
→Dirichlet Processes
→Gaussian Processes
→Mixture Modelling
→Prior Probabilities

Recommended Reading

Diaconis, P., & Freedman, D. (). On the consistency of Bayes estimates (with discussion). Annals of Statistics, (), –.
Dunson, D. B. (). Nonparametric Bayes applications to biostatistics. In N. Hjort, C. Holmes, P. Müller, & S. Walker (Eds.), Bayesian nonparametrics. Cambridge: Cambridge University Press.
Escobar, M. D., & West, M. (). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, , –.
de Finetti, B. (). Funzione caratteristica di un fenomeno aleatorio. Atti della R. Accademia Nazionale dei Lincei, Serie . Memorie, Classe di Scienze Fisiche, Matematiche e Naturali, , –.
Ghosh, J. K., & Ramamoorthi, R. V. (). Bayesian nonparametrics. New York: Springer.
Hjort, N., Holmes, C., Müller, P., & Walker, S. (Eds.) (). Bayesian nonparametrics. In Cambridge series in statistical and probabilistic mathematics (No. ). Cambridge: Cambridge University Press.
Müller, P., & Quintana, F. A. (). Nonparametric Bayesian data analysis. Statistical Science, (), –.
Neal, R. M. (). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, , –.
Orbanz, P. (). Construction of nonparametric Bayesian models from parametric Bayes equations. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, , –.
Teh, Y. W., & Jordan, M. I. (). Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Müller, & S. Walker (Eds.), Bayesian nonparametrics. Cambridge: Cambridge University Press.
Walker, S. G., Damien, P., Laud, P. W., & Smith, A. F. M. (). Bayesian nonparametric inference for random distributions and related functions. Journal of the Royal Statistical Society, (), –.
Wasserman, L. (). All of nonparametric statistics. New York: Springer.


Bayesian Reinforcement Learning

Pascal Poupart
University of Waterloo, Waterloo, Ontario, Canada

Synonyms

Adaptive control processes; Bayes adaptive Markov decision processes; Dual control; Optimal learning

Definition

Bayesian reinforcement learning refers to →reinforcement learning modeled as a Bayesian learning problem (see →Bayesian Methods). More specifically, following Bayesian learning theory, reinforcement learning is performed by computing a posterior distribution on the unknowns (e.g., any combination of the transition probabilities, reward probabilities, value function, value gradient, or policy) based on the evidence received (e.g., history of past state–action pairs).

Motivation and Background

Bayesian reinforcement learning can be traced back to the 1950s and 1960s in the work of Bellman (), Fel'dbaum (), and several of Howard's students (Martin, ). Shortly after →Markov decision processes were formalized, the above researchers (and several others) in Operations Research considered the problem of controlling a Markov process with uncertain transition and reward probabilities, which is equivalent to reinforcement learning. They considered Bayesian techniques since Bayesian learning is performed by probabilistic inference, which naturally combines with decision theory. In general, Bayesian reinforcement learning distinguishes itself from other reinforcement learning approaches by the use of probability distributions (instead of point estimates) to fully capture the uncertainty. This enables the learner to make more informed decisions, with the potential of learning faster with less data. In particular, the exploration/exploitation tradeoff can be naturally optimized. The use of a prior distribution also facilitates the encoding of domain knowledge, which is exploited in a natural and principled way by the learning process.

Structure of Learning Approach

A Markov decision process (MDP) (Puterman, ) can be formalized by a tuple $\langle S, A, T \rangle$ where $S$ is the set of states $s$, $A$ is the set of actions $a$, and $T(s, a, s') = \Pr(s' \mid s, a)$ is the transition distribution indicating the probability of reaching $s'$ when executing $a$ in $s$. Let $s_r$ denote the reward feature of a state and $\Pr(s'_r \mid s, a)$ be the probability of earning $r$ when executing $a$ in $s$. A policy $\pi : S \to A$ consists of a mapping from states to actions. For a given discount factor $0 \leq \gamma \leq 1$ and horizon $h$, the value $V^\pi$ of a policy $\pi$ is the expected discounted total reward earned while executing this policy: $V^\pi(s) = \sum_{t=0}^{h} \gamma^t E_{s \mid \pi}[s_r^t]$. The value function $V^\pi$ can be written in a recursive form as the expected sum of the immediate reward $s'_r$ with the discounted future rewards: $V^\pi(s) = \sum_{s'} \Pr(s' \mid s, \pi(s)) [s'_r + \gamma V^\pi(s')]$. The goal is to find an optimal policy $\pi^*$, that is, a policy with the highest value $V^*$ in all states (i.e., $V^*(s) \geq V^\pi(s) \; \forall \pi, s$). Many algorithms exploit the fact that the optimal value function $V^*$ satisfies Bellman's equation:

$$V^*(s) = \max_a \sum_{s'} \Pr(s' \mid s, a) \left[ s'_r + \gamma V^*(s') \right].$$

Reinforcement learning (Sutton & Barto, ) is concerned with the problem of finding an optimal policy when the transition (and reward) probabilities $T$ are unknown (or uncertain). Bayesian learning is a learning approach in which unknowns are modeled as random variables $X$ over which distributions encode the uncertainty. The process of learning consists of updating the prior distribution $\Pr(X)$ based on some evidence $e$ to obtain a posterior distribution $\Pr(X \mid e)$ according to Bayes theorem: $\Pr(X \mid e) = k \Pr(X) \Pr(e \mid X)$. (Here $k = 1/\Pr(e)$ is a normalization constant.) Hence, Bayesian reinforcement learning consists of using Bayesian learning for reinforcement learning. The unknowns are the transition (and reward) probabilities $T$, the optimal value function $V^*$, and the optimal policy $\pi^*$. Techniques that maintain a distribution on $T$ are known as model-based Bayesian reinforcement learning since they explicitly learn the underlying model $T$. In contrast, techniques that maintain a distribution on $V^*$ or $\pi^*$ are known as model-free Bayesian reinforcement learning since they directly learn the optimal value function or policy without learning a model.
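When the model $T$ is known, Bellman's equation can be solved by repeated backups. The following is a minimal sketch of value iteration under assumptions not made in the text: an infinite horizon with $\gamma < 1$, a deterministic reward attached to the state reached, and hypothetical names (`value_iteration`, `q_value`) and data structures chosen for illustration.

```python
def q_value(s, a, values, transition, reward, gamma):
    """Expected return of action a in state s: sum_s' Pr(s'|s,a) [s'_r + gamma V(s')]."""
    return sum(p * (reward[s2] + gamma * values[s2])
               for s2, p in transition[s][a].items())


def value_iteration(states, actions, transition, reward,
                    gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate the Bellman backup until the largest change falls below tol.
    transition[s][a] maps s' -> Pr(s'|s,a); reward[s'] is the reward for reaching s'."""
    values = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_values = {s: max(q_value(s, a, values, transition, reward, gamma)
                             for a in actions)
                      for s in states}
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions,
                     key=lambda a: q_value(s, a, values, transition, reward, gamma))
              for s in states}
    return values, policy


# Usage: a two-state toy problem where only the "high" state pays a reward.
toy_states = ["low", "high"]
toy_actions = ["stay", "move"]
toy_reward = {"low": 0.0, "high": 1.0}
toy_transition = {
    "low": {"stay": {"low": 1.0}, "move": {"high": 0.8, "low": 0.2}},
    "high": {"stay": {"high": 1.0}, "move": {"low": 0.8, "high": 0.2}},
}
toy_values, toy_policy = value_iteration(toy_states, toy_actions,
                                         toy_transition, toy_reward)
```

In this toy problem the greedy policy stays in "high" and moves out of "low". Reinforcement learning addresses the harder case in which `transition` is not available and must be learned.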


Model-Based Bayesian Learning

In model-based Bayesian reinforcement learning, the learner starts with a prior distribution over the parameters of $T$, which we denote by $\theta$. For instance, let $\theta_a^{s,s'} = \Pr(s' \mid s, a, \theta)$ be the unknown probability of reaching $s'$ when executing $a$ in $s$. In general, we denote by $\theta$ the set of all $\theta_a^{s,s'}$. Then, the prior $b(\theta)$ represents the initial belief of the learner regarding the underlying model. The learner updates its belief after every $\langle s, a, s' \rangle$ triple observed by computing a posterior $b_a^{s,s'}(\theta) = b(\theta \mid s, a, s')$ according to Bayes theorem:

$$b_a^{s,s'}(\theta) = k \, b(\theta) \Pr(s' \mid s, a, \theta) = k \, b(\theta) \, \theta_a^{s,s'}.$$

In order to facilitate belief updates, it is convenient to pick the prior from a family of distributions that is closed under Bayes updates. This ensures that beliefs are always parameterized in the same way. Such families are called conjugate priors. In the case of a discrete model (i.e., $\Pr(s' \mid s, a, \theta)$ is a discrete distribution), Dirichlets are conjugate priors and form a family of distributions corresponding to monomials over the simplex of discrete distributions (DeGroot, ). They are parameterized as follows: $\mathrm{Dir}(\theta; n) = k \prod_i \theta_i^{n_i - 1}$. Here $\theta$ is an unknown discrete distribution such that $\sum_i \theta_i = 1$ and $n$ is a vector of strictly positive real numbers $n_i$ (known as the hyperparameters) such that $n_i - 1$ can be interpreted as the number of times that the $\theta_i$-probability event has been observed. Since the unknown transition model $\theta$ is made up of one unknown distribution $\theta_a^s$ per $\langle s, a \rangle$ pair, let the prior be $b(\theta) = \prod_{s,a} \mathrm{Dir}(\theta_a^s; n_a^s)$ such that $n_a^s$ is a vector of hyperparameters $n_a^{s,s'}$. The posterior obtained after transition $\langle \hat{s}, \hat{a}, \hat{s}' \rangle$ is

$$b_{\hat{a}}^{\hat{s},\hat{s}'}(\theta) = k \, \theta_{\hat{a}}^{\hat{s},\hat{s}'} \prod_{s,a} \mathrm{Dir}(\theta_a^s; n_a^s) = \prod_{s,a} \mathrm{Dir}\left(\theta_a^s; \; n_a^s + \delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s')\right)$$

where $\delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s')$ is a Kronecker delta that returns 1 when $s = \hat{s}$, $a = \hat{a}$, $s' = \hat{s}'$ and 0 otherwise. In practice, belief monitoring is as simple as incrementing the hyperparameter corresponding to the observed transition.
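The counting interpretation of the Dirichlet update can be sketched in a few lines. This is an illustrative sketch only: the class name `DirichletBelief`, its methods, and the uniform `prior_count` are assumptions introduced here. The `predictive` method uses the standard fact that the mean of a Dirichlet is its normalized hyperparameter vector.

```python
class DirichletBelief:
    """Product-of-Dirichlets belief over a discrete transition model.
    counts[(s, a)][s'] plays the role of the hyperparameter n_a^{s,s'}."""

    def __init__(self, states, actions, prior_count=1.0):
        self.states = list(states)
        self.counts = {
            (s, a): {s2: float(prior_count) for s2 in self.states}
            for s in self.states
            for a in actions
        }

    def update(self, s, a, s_next):
        """Bayes update after observing (s, a, s_next): increment one hyperparameter."""
        self.counts[(s, a)][s_next] += 1.0

    def predictive(self, s, a):
        """Posterior predictive Pr(s' | s, b, a), i.e., the mean of Dir(n_a^s)."""
        row = self.counts[(s, a)]
        total = sum(row.values())
        return {s2: n / total for s2, n in row.items()}


# Usage: a uniform prior, then two observed transitions x --go--> y.
belief = DirichletBelief(states=["x", "y"], actions=["go"])
belief.update("x", "go", "y")
belief.update("x", "go", "y")
prediction = belief.predictive("x", "go")  # {"x": 0.25, "y": 0.75}
```

Because each observation touches a single count, the belief after any history is determined by the prior counts plus the transition tallies.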

Belief MDP Equivalence

At any point in time, the belief $b$ provides an explicit representation of the uncertainty of the learner about the underlying model. This information is very useful to decide whether future actions should focus on exploring or exploiting. Hence, in Bayesian reinforcement learning, policies $\pi$ are mappings from state–belief pairs $\langle s, b \rangle$ to actions. Equivalently, the problem of Bayesian reinforcement learning can be thought of as one of planning with a belief MDP (or a partially observable MDP). More precisely, every Bayesian reinforcement learning problem has an equivalent belief MDP formulation $\langle S_{bel}, A_{bel}, T_{bel} \rangle$ where $S_{bel} = S \times B$ ($B$ is the space of beliefs $b$), $A_{bel} = A$, and

$$T_{bel}(s_{bel}, a_{bel}, s'_{bel}) = \Pr(s'_{bel} \mid s_{bel}, a_{bel}) = \Pr(s', b' \mid s, b, a) = \Pr(b' \mid s, b, a, s') \Pr(s' \mid s, b, a).$$

The decomposition of the transition dynamics is particularly interesting since $\Pr(b' \mid s, b, a, s')$ equals 1 when $b' = b_a^{s,s'}$ (as defined by the belief update above) and 0 otherwise. Furthermore, $\Pr(s' \mid s, b, a) = \int_\theta b(\theta) \Pr(s' \mid s, \theta, a) \, d\theta$, which can be computed in closed form when $b$ is a Dirichlet. As a result, the transition dynamics of the belief MDP are fully known. This is a remarkable fact since it means that Bayesian reinforcement learning problems, which by definition have unknown/uncertain transition dynamics, can be recast as belief MDPs with known transition dynamics. While this does not make the problem any easier, since the belief MDP has a hybrid state space (discrete $s$ with continuous $b$), it allows us to treat policy optimization as a problem of planning and in particular to adapt algorithms originally designed for belief MDPs (also known as partially observable MDPs).

Optimal Value Function Parameterization

Many planning techniques compute the optimal value function $V^*$, from which an optimal policy $\pi^*$ can easily be extracted. Despite the hybrid nature of the state space, the optimal value function (for a finite horizon) has a simple parameterization corresponding to the upper envelope of a set of polynomials (Poupart, Vlassis, Hoey, & Regan, ). Recall that the optimal value function satisfies Bellman's equation, which can be adapted as follows for a belief MDP:

$$V^*(s, b) = \max_a \sum_{s'} \Pr(s', b' \mid s, b, a) \left[ s'_r + \gamma V^*(s', b') \right].$$

Using the fact that $b'$ must be $b_a^{s,s'}$ (otherwise $\Pr(s', b' \mid s, b, a) = 0$) allows us to rewrite Bellman's equation as follows:

$$V^*(s, b) = \max_a \sum_{s'} \Pr(s' \mid s, b, a) \left[ s'_r + \gamma V^*(s', b_a^{s,s'}) \right].$$

Let $\Gamma_s^n$ be a set of polynomials in $\theta$ such that the optimal value function $V^n$ with $n$ steps to go is $V^n(s, b) = \int_\theta b(\theta) \, \mathrm{poly}_{s,b}(\theta) \, d\theta$ where $\mathrm{poly}_{s,b} = \arg\max_{\mathrm{poly} \in \Gamma_s^n} \int_\theta b(\theta) \, \mathrm{poly}(\theta) \, d\theta$. It suffices to replace $\Pr(s' \mid s, b, a)$, $b_a^{s,s'}$, and $V^n$ by their definitions in Bellman's equation

$$V^{n+1}(s, b) = \max_a \sum_{s'} \int_\theta b(\theta) \Pr(s' \mid s, \theta, a) \left[ s'_r + \gamma \, \mathrm{poly}_{s', b_a^{s,s'}}(\theta) \right] d\theta$$

$$= \max_a \int_\theta b(\theta) \sum_{s'} \theta_a^{s,s'} \left[ s'_r + \gamma \, \mathrm{poly}_{s', b_a^{s,s'}}(\theta) \right] d\theta$$

to obtain a similar set of polynomials $\Gamma_s^{n+1} = \left\{ \sum_{s'} \theta_a^{s,s'} \left[ s'_r + \gamma \, \mathrm{poly}_{s'}(\theta) \right] \mid a \in A, \; \mathrm{poly}_{s'} \in \Gamma_{s'}^n \right\}$ that represents $V^{n+1}$.

The fact that the optimal value function has a closed form with a simple parameterization is quite useful for planning algorithms based on value iteration. Instead of using an arbitrary function approximator to fit the value function, one can take advantage of the fact that the value function can be represented by a set of polynomials to choose a good representation. For instance, the Beetle algorithm (Poupart et al., ) performs point-based value iteration and approximates the value function with a bounded set of polynomials that each consists of a linear combination of monomial basis functions.
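For very small problems, the belief-MDP form of Bellman's equation can be evaluated directly by recursion over Dirichlet hyperparameters. The sketch below is not the Beetle algorithm and does not use the polynomial representation. It is a brute-force illustration under stated assumptions: integer-indexed states and actions, a deterministic reward for the state reached, $V^0 = 0$, and hypothetical names such as `make_bayes_adaptive_planner`.

```python
from functools import lru_cache


def make_bayes_adaptive_planner(num_states, num_actions, reward, gamma=0.95):
    """Return value(s, counts, steps_to_go), the optimal finite-horizon value in the
    belief MDP. counts[s][a][s2] are Dirichlet hyperparameters stored as nested
    tuples so that beliefs are hashable; reward[s2] is the reward for reaching s2."""

    def bump(counts, s, a, s2):
        # Belief update b -> b_a^{s,s'}: increment a single hyperparameter.
        row = list(counts[s][a])
        row[s2] += 1
        per_action = list(counts[s])
        per_action[a] = tuple(row)
        updated = list(counts)
        updated[s] = tuple(per_action)
        return tuple(updated)

    @lru_cache(maxsize=None)
    def value(s, counts, steps_to_go):
        if steps_to_go == 0:
            return 0.0
        best = float("-inf")
        for a in range(num_actions):
            row = counts[s][a]
            total = sum(row)
            q = 0.0
            for s2 in range(num_states):
                p = row[s2] / total  # Pr(s'|s,b,a): mean of the Dirichlet
                q += p * (reward[s2]
                          + gamma * value(s2, bump(counts, s, a, s2), steps_to_go - 1))
            best = max(best, q)
        return best

    return value


# Usage: two states, two actions; action 1 has been seen reaching state 1 more often.
planner = make_bayes_adaptive_planner(2, 2, reward=(0.0, 1.0))
prior_counts = (((1, 1), (1, 3)), ((1, 1), (1, 3)))
one_step_value = planner(0, prior_counts, 1)    # max(0.5, 0.75) = 0.75
three_step_value = planner(0, prior_counts, 3)
```

The number of reachable beliefs grows exponentially with the horizon, which is why compact representations of the value function such as the polynomial sets above matter in practice.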

Exploration/Exploitation Tradeoff

Since the underlying model is unknown in reinforcement learning, it is not clear whether actions should be chosen to explore (gain more information about the model) or exploit (maximize immediate rewards based on information gathered so far). Bayesian reinforcement learning provides a principled solution to the exploration/exploitation tradeoff. Despite the appearance of multiple objectives induced by exploration and exploitation, there is a single objective in reinforcement learning: maximize total discounted rewards. Hence, an optimal policy (which maximizes total discounted rewards) must naturally optimize the exploration/exploitation tradeoff. In order for a policy to be optimal, it must use all the information available. The information available to the learner consists of the history of past states and actions. One can show that state–belief pairs ⟨s, b⟩ are sufficient statistics of the history. Hence, by searching for the mapping from state–belief pairs to actions that maximizes total discounted rewards, Bayesian reinforcement learning implicitly seeks an optimal tradeoff between exploration and exploitation. In contrast, traditional reinforcement learning approaches search in the space of mappings from states to actions. As a result, such techniques typically focus on asymptotic convergence (i.e., convergence to a policy that is optimal in the limit), but do not effectively balance exploration and exploitation since they do not use histories or beliefs to quantify the uncertainty about the underlying model.

Related Work

Michael Duff’s PhD thesis (Duff, ) provides an excellent survey of Bayesian reinforcement learning up until . The above text pertains mostly to model-based Bayesian reinforcement learning applied to discrete, fully observable, single-agent domains. Bayesian learning has also been explored in model-free reinforcement learning (Dearden, Friedman, & Russell, ; Engel, Mannor, & Meir, ; Ghavamzadeh & Engel, ), continuous-valued state spaces (Ross, Chaib-Draa, & Pineau, ), partially observable domains (Poupart & Vlassis, ; Ross, Chaib-Draa, & Pineau, ), and multi-agent systems (Chalkiadakis & Boutilier, , ; Gmytrasiewicz & Doshi, ).

Cross References

Active Learning
Markov Decision Processes
Reinforcement Learning

Recommended Reading

Bellman, R. (). Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.
Chalkiadakis, G., & Boutilier, C. (). Coordination in multiagent reinforcement learning: A Bayesian approach. In International joint conference on autonomous agents and multiagent systems (AAMAS), Melbourne, Australia (pp. –).
Chalkiadakis, G., & Boutilier, C. (). Bayesian reinforcement learning for coalition formation under uncertainty. In International joint conference on autonomous agents and multiagent systems (AAMAS), New York (pp. –).
Dearden, R., Friedman, N., & Russell, S. J. (). Bayesian Q-learning. In National conference on artificial intelligence (AAAI), Madison, Wisconsin (pp. –).
DeGroot, M. H. (). Optimal statistical decisions. New York: McGraw-Hill.
Duff, M. (). Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts, Amherst.
Engel, Y., Mannor, S., & Meir, R. (). Reinforcement learning with Gaussian processes. In International conference on machine learning (ICML), Bonn, Germany.
Fel’dbaum, A. (). Optimal control systems. New York: Academic.
Ghavamzadeh, M., & Engel, Y. (). Bayesian policy gradient algorithms. In Advances in neural information processing systems (NIPS) (pp. –).
Gmytrasiewicz, P., & Doshi, P. (). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research (JAIR), , –.
Martin (). Bayesian decision problems and Markov chains. New York: Wiley.
Poupart, P., & Vlassis, N. (). Model-based Bayesian reinforcement learning in partially observable domains. In International symposium on artificial intelligence and mathematics (ISAIM).
Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (). An analytic solution to discrete Bayesian reinforcement learning. In International conference on machine learning (ICML), Pittsburgh, Pennsylvania (pp. –).
Puterman, M. L. (). Markov decision processes. New York: Wiley.
Ross, S., Chaib-Draa, B., & Pineau, J. (). Bayes-adaptive POMDPs. In Advances in neural information processing systems (NIPS).
Ross, S., Chaib-Draa, B., & Pineau, J. (). Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In IEEE International conference on robotics and automation (ICRA) (pp. –).
Sutton, R. S., & Barto, A. G. (). Reinforcement Learning. Cambridge, MA: MIT Press.

Beam Search

Claude Sammut
University of New South Wales, Sydney, Australia

A beam search is a heuristic search technique that combines elements of breadth-first and best-first searches. Like a breadth-first search, the beam search maintains a list of nodes that represent a frontier in the search space. Whereas breadth-first search adds all neighbors to the list, beam search orders the neighboring nodes according to some heuristic and keeps only the n best, where n is the beam size. This can significantly reduce the processing and storage requirements for the search. In machine learning, beam search has been used in algorithms such as AQ (Dietterich & Michalski, ).

Cross References

Learning as Search

Recommended Reading

Dietterich, T. G., & Michalski, R. S. (). Learning and generalization of characteristic descriptions: Evaluation criteria and comparative review of selected methods. In Fifth international joint conference on artificial intelligence (pp. –). Cambridge, MA: William Kaufmann.

Behavioral Cloning

Claude Sammut
The University of New South Wales, Sydney, Australia

Synonyms

Apprenticeship learning; Behavioral cloning; Learning by demonstration; Learning by imitation; Learning control rules

Definition

Behavioral cloning is a method by which human subcognitive skills can be captured and reproduced in a computer program. As the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to the action. A log of these records is used as input to a learning program. The learning program outputs a set of rules that reproduce the skilled behavior. This method can be used to construct automatic control systems for complex tasks for which classical control theory is inadequate. It can also be used for training.


Motivation and Background

Behavioral cloning (Michie, Bain, & Hayes-Michie, ) is a form of learning by imitation whose main motivation is to build a model of the behavior of a human when performing a complex skill. Preferably, the model should be in a readable form. It is related to other forms of learning by imitation, such as inverse reinforcement learning (Abbeel & Ng, ; Amit & Matarić, ; Hayes & Demiris, ; Kuniyoshi, Inaba, & Inoue, ; Pomerleau, ) and methods that use data from human performances to model the system being controlled (Atkeson & Schaal, ; Bagnell & Schneider, ).

Experts might be defined as people who know what they are doing, not what they are talking about. That is, once a person becomes highly skilled in some task, the skill becomes sub-cognitive and is no longer available to introspection. So when the person is asked to explain why certain decisions were made, the explanation is a post hoc justification rather than a true explanation.

Michie et al. () used an induction program to learn rules for balancing a pole (in simulation), and earlier work by Donaldson (), Widrow and Smith (), and Chambers and Michie () demonstrated the feasibility of learning by imitation, also for pole balancing.

Structure of the Learning System

Behavioral cloning assumes that there is a plant of some kind that is under the control of a human operator. The plant may be a physical system or a simulation. In either case, the plant must be instrumented so that it is possible to capture the state of the system, including all the control settings. Thus, whenever the operator performs an action, that is, changes a control setting, we can associate that action with a particular state.

Let us use a simple example of a system that has only one control action. A pole balancer has four state variables: the angle of the pole, θ, and its angular velocity, θ˙, and the position, x, and velocity, x˙, of the cart on the track. The only action available to the controller is to apply a fixed positive or negative force, F, to accelerate the cart left or right. We can create an experimental setup where a human can control a pole and cart system (either real or in simulation) by applying a left push or a right push at the appropriate time. Whenever a control action is performed, we record the action as well as the values of the four state variables at the time of the action. Each of these records can be viewed as an example of a mapping from state to action.

[Figure: Behavioral Cloning. Structure of the learning system. A human trainer controls the plant; as the trainer executes the task, all actions are recorded in a log file; a learning program uses the logged data to build a controller.]

Michie et al. () demonstrated that it is possible to construct a controller by learning from these examples. The learning task is to predict the appropriate action, given the state. They used a decision tree learning program to produce a classifier that, given the values of the four state variables, would output an action. A decision tree is easily converted into executable code as a nested if statement. The quality of the controller can be tested by inserting the decision tree into the simulator, replacing the human operator.

If the goal of learning is simply to produce an operational controller, then any program capable of building a classifier could be used. The reason that Michie et al. () chose a symbolic learner was their desire to produce a controller whose decision making was transparent as well as operational. That is, it should be possible to extract an explanation of the behavior that is meaningful to an expert in the task.

Learning Direct (Situation–Action) Controllers

A controller such as the one described above is referred to as a direct controller because it maps situations to actions. Other examples of learning a direct controller


are building an autopilot from behavioral traces of human pilots flying aircraft in a flight simulator (Sammut, Hurst, Kedzier, & Michie, ) and building a control system for a container crane (Urbančič & Bratko, ). These systems extended the earlier work by operating in domains in which there is more than one control variable and the task is sufficiently complex that it must be decomposed into several subtasks. An operator of a container crane can control the speed of the cart and the length of the rope. A pilot of a fixed-wing aircraft can control the ailerons, elevators, rudder, throttle, and flaps. To build an autopilot, the learner must build a system that can set each of the control variables. Sammut et al. () viewed this as a multitask learning problem. Each training example is a feature vector that includes the position, orientation, and velocities of the aircraft as well as the values of each of the control settings: ailerons, elevators, throttle, and flaps. The rudder is ignored. A separate decision tree is built for each control variable. For example, the aileron setting is treated as the dependent variable and all the other variables, including the other controls, are treated as the attributes of the training example. A decision tree is built for the ailerons, then the process is repeated for the elevators, and so on. The result is a decision tree for each control variable. The autopilot code executes each decision tree in each cycle of the control loop. This method treats the setting of each control as a separate task. It may be surprising that this method works, since it is often necessary to adjust more than one control simultaneously to achieve the desired result. For example, to turn, it is normal to use the ailerons to roll the aircraft while adjusting the elevators to pull it around. This kind of multivariable control does result from multiple decision trees.
When, say, the aileron decision tree initiates a roll, the elevator’s decision tree detects the roll and causes the aircraft to pitch up and execute a turn.

Limitations

Direct controllers work quite well for systems that have a relatively small state space. However, for complex systems, behavioral cloning of direct situation–action rules tends to produce very brittle controllers. That is, they cannot tolerate large disturbances. For example, when air turbulence is introduced into the flight simulator, the performance of the clone degrades very rapidly. This is because the examples provided by logging the performance of a human only cover a very small part of the state space of a complex system such as an aircraft in flight. Thus, the “expertise” of the controller is very limited. If the system strays outside the controller’s region of expertise, it has no method for recovering and failure is usually catastrophic. More robust control is possible but only with a significant change in approach. The more successful methods decompose the learning task into two stages: learning goals and learning the actions to achieve those goals.
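The direct (situation–action) procedure described in this section, in which state–action records are logged and a classifier is then induced from them, can be sketched in a few lines. This is a minimal illustration, not the program used by Michie et al.: the logged trace is synthetic, the "operator" rule that generates it is a hypothetical stand-in for a human, and the tree learner is a simple greedy threshold learner written for this example.

```python
import random

def demo_log(n=400, seed=0):
    """Synthetic stand-in for a human trace: (theta, theta_dot, x, x_dot) -> action.
    The rule below is hypothetical and only serves to generate example data."""
    rng = random.Random(seed)
    log = []
    for _ in range(n):
        state = tuple(rng.uniform(-1.0, 1.0) for _ in range(4))
        theta, theta_dot, _, _ = state
        action = "right" if theta + 0.5 * theta_dot > 0 else "left"
        log.append((state, action))
    return log

def majority(records):
    actions = [a for _, a in records]
    return max(set(actions), key=actions.count)

def gini(records):
    """Gini impurity of a set of records for the two actions."""
    p = sum(a == "right" for _, a in records) / len(records)
    return 2 * p * (1 - p)

def build_tree(records, depth=3):
    """Greedy threshold tree: leaves are actions, internal nodes are
    (feature index, threshold, low subtree, high subtree)."""
    if depth == 0 or gini(records) == 0.0:
        return majority(records)
    best = None
    for f in range(4):
        for t in sorted({s[f] for s, _ in records}):
            lo = [r for r in records if r[0][f] <= t]
            hi = [r for r in records if r[0][f] > t]
            if not lo or not hi:
                continue
            score = (len(lo) * gini(lo) + len(hi) * gini(hi)) / len(records)
            if best is None or score < best[0]:
                best = (score, f, t, lo, hi)
    if best is None:
        return majority(records)
    _, f, t, lo, hi = best
    return (f, t, build_tree(lo, depth - 1), build_tree(hi, depth - 1))

def controller(tree, state):
    """Execute the tree as the nested if-statement it encodes."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if state[f] <= t else hi
    return tree

log = demo_log()
tree = build_tree(log)
accuracy = sum(controller(tree, s) == a for s, a in log) / len(log)
```

For a plant with several controls, the same learner would be run once per control variable, with the other controls included among the attributes. The clone is only as good as the region of state space covered by the log, which is the brittleness discussed above.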

Learning Indirect (Goal-Directed) Controllers

The problem of learning in a large search space can partially be addressed by decomposing the learning into subtasks. A controller built in this way is said to be an indirect controller. A controller is “indirect” if it does not compute the next action directly from the system’s current state but uses, in addition, some intermediate information. An example of such intermediate information is a subgoal to be attained before achieving the final goal. Subgoals often feature in an operator’s control strategies and can be automatically detected from a trace of the operator’s behavior (Šuc & Bratko, ). The problem of subgoal identification can be treated as the inverse of the usual problem of controller design, that is, given the actions in an operator’s trace, find the goal that these actions achieve. The limitation of this approach is that it only works well for cases in which there are just a few subgoals, not when the operator’s trajectory contains many subgoals. In these cases, a better approach is to generalize the operator’s trajectory. The generalized trajectory can be viewed as defining a continuously changing subgoal (Bratko & Šuc, ; Šuc & Bratko, a) (see also the use of flow tubes in dynamic plan execution (Hofmann & Williams, )). Subgoals and generalized trajectories are not sufficient to define a controller. A model of the system’s dynamics is also required. Therefore, in addition to inducing subgoals or a generalized trajectory, this approach also requires learning approximate system dynamics, that is, a model of the controlled system. Bratko and Šuc () and Šuc and Bratko (b) use a combination of the Goldhorn (Križman & Džeroski,


) discovery program and locally weighted regression to build the model of the system’s dynamics. The next action is then computed “indirectly” by (1) computing the desired next state (e.g., next subgoal) and (2) determining an action that brings the system to the desired next state. Bratko and Šuc also investigated building qualitative control strategies from operator traces (Bratko & Šuc, ). An analog to this approach is inverse reinforcement learning (Abbeel & Ng, ; Atkeson & Schaal, ; Ng & Russell, ), where the reward function is learned. Here, learning the reward function corresponds to learning the human operator’s goals. Isaac and Sammut () use an approach that is similar in spirit to that of Šuc and Bratko but incorporates classical control theory. Learned skills are represented by a two-level hierarchical decomposition with an anticipatory goal level and a reactive control level. The goal level models how the operator chooses goal settings for the control strategy, and the control level models the operator’s reaction to any error between the goal setting and the actual state of the system. For example, in flying, the pilot can achieve goal values for the desired heading, altitude, and airspeed by choosing appropriate values of turn rate, climb rate, and acceleration. The controls can be set to correct errors between the current state and the desired state of these goal-directing quantities. Goal models map system states to a goal setting. Control actions are based on the error between the output of each of the goal models and the current system state. The control level is modeled as a set of proportional integral derivative (PID) controllers, one for each control variable. A PID controller determines a control value as a linear function of the error on a goal variable, the integral of the error, and the derivative of the error. Goal setting and control models are learned separately.
The process begins by deciding which variables are to be used for the goal settings. For example, trainee pilots will learn to execute a “constant-rate turn,” that is, their goal is to maintain a given turn rate. A separate goal rule is constructed for each goal variable using a model tree learner (Potts & Sammut, ). A goal rule gives the setting for a goal variable and, therefore, we can find the difference (error) between the current state value and the goal setting. The integral and derivative of the error can also be calculated. For example, if a turn rate has been set, then the error on the turn rate is calculated as the actual turn rate minus the set turn rate. The integral is then the running sum of the error multiplied by the time interval between time samples, starting from the first time sample of the behavioral trace, and the derivative is calculated as the difference between the error and the previous error, divided by the time interval. For each control available to the operator, a model tree learner is used to predict the appropriate control setting. Linear regression is used in the leaf nodes of the model tree to produce linear equations whose coefficients are the P, I, and D values of the PID controller. Thus the learner produces a collection of PID controllers that are selected according to the conditions in the internal nodes of the tree. In control theory, this is known as piecewise linear control.

Another indirect method is to learn a model of the dynamics of the system and use this to learn, in simulation, a controller for the system (Bagnell & Schneider, ; Ng, Jin Kim, Jordan, & Sastry, ). This approach does not seek to directly model the behavior of a human operator. A behavioral trace may be used to generate data for modeling the system, but then a reinforcement learning algorithm is used to generate a policy for controlling the simulated system. The learned policy can then be transferred to the physical system. Locally weighted regression is typically used for system modeling, although model trees can also be used.
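The error, integral, and derivative terms described above for the goal-directed approach can be computed directly from a behavioral trace. The following is a minimal sketch under stated assumptions: the trace is a list of sampled values of one goal variable, the time step `dt` is constant, and the derivative at the first sample (which has no previous error) is taken to be zero. The function names and the gains `kp`, `ki`, and `kd` are hypothetical; in the method described above, the coefficients would come from linear regression in the leaves of a model tree.

```python
def pid_features(actual, goal, dt):
    """Return one (error, integral, derivative) triple per sample of the trace."""
    rows = []
    integral = 0.0
    prev_error = None
    for value in actual:
        error = value - goal                 # actual value minus goal setting
        integral += error * dt               # running sum of error * time step
        if prev_error is None:
            derivative = 0.0                 # assumption: no earlier sample exists
        else:
            derivative = (error - prev_error) / dt
        rows.append((error, integral, derivative))
        prev_error = error
    return rows

def pid_output(features, kp, ki, kd):
    """Control value as a linear function of the three error terms."""
    error, integral, derivative = features
    return kp * error + ki * integral + kd * derivative

# Three samples of a goal variable with goal setting 3.0, sampled every 0.5 s.
rows = pid_features([2.0, 3.0, 5.0], goal=3.0, dt=0.5)
control = pid_output(rows[2], kp=1.0, ki=2.0, kd=0.5)
```

Because the error is defined as actual minus goal, gains fitted by regression may well be negative; the sign convention is absorbed by the learned coefficients.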

Cross References

Apprenticeship Learning
Inverse Reinforcement Learning
Learning by Imitation
Locally Weighted Regression
Model Trees
Reinforcement Learning
System Identification

Recommended Reading

Abbeel, P., & Ng, A. Y. (). Apprenticeship learning via inverse reinforcement learning. In International conference on machine learning, Banff, Alberta, Canada. New York: ACM.
Amit, R., & Matarić, M. (). Learning movement sequences from demonstration. In Proceedings of the second international conference on development and learning, Cambridge, MA, USA (pp. –). Washington, D.C.: IEEE.
Atkeson, C. G., & Schaal, S. (). Robot learning from demonstration. In D. H. Fisher (Ed.), Proceedings of the fourteenth international conference on machine learning, Nashville, TN, USA (pp. –). San Francisco: Morgan Kaufmann.
Bagnell, J. A., & Schneider, J. G. (). Autonomous helicopter control using reinforcement learning policy search methods. In International conference on robotics and automation, South Korea. IEEE Press, New York.
Bratko, I., & Šuc, D. (). Using machine learning to understand operator’s skill. In Proceedings of the th international conference on industrial and engineering applications of artificial intelligence and expert systems (pp. –). London: Springer. AAAI Press, Menlo Park, CA.
Bratko, I., & Šuc, D. (). Learning qualitative models. AI Magazine, (), –.
Chambers, R. A., & Michie, D. (). Man-machine co-operation on a learning task. In R. Parslow, R. Prowse, & R. Elliott-Green (Eds.), Computer graphics: techniques and applications. London: Plenum.
Donaldson, P. E. K. (). Error decorrelation: A technique for matching a class of functions. In Proceedings of the third international conference on medical electronics (pp. –).
Hayes, G., & Demiris, J. (). A robot controller using learning by imitation. In Proceedings of the international symposium on intelligent robotic systems, Grenoble, France (pp. –). Grenoble: LIFTA-IMAG.
Hofmann, A. G., & Williams, B. C. (). Exploiting spatial and temporal flexibility for plan execution of hybrid, underactuated systems. In Proceedings of the st national conference on artificial intelligence, July , Boston, MA (pp. –).
Isaac, A., & Sammut, C. (). Goal-directed learning to fly. In T. Fawcett & N. Mishra (Eds.), Proceedings of the twentieth international conference on machine learning, Washington, D.C. (pp. –). Menlo Park: AAAI.
Križman, V., & Džeroski, S. (). Discovering dynamics from measured data. Electrotechnical Review, (–), –.
Kuniyoshi, Y., Inaba, M., & Inoue, H. (). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, , –.
Michie, D., Bain, M., & Hayes-Michie, J. E. (). Cognitive models from subcognitive skills. In M. Grimble, S. McGhee, & P. Mowforth (Eds.), Knowledge-based systems in industrial control. Stevenage: Peter Peregrinus.
Ng, A. Y., Jin Kim, H., Jordan, M. I., & Sastry, S. (). Autonomous helicopter flight via reinforcement learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems . Cambridge: MIT Press.
Ng, A. Y., & Russell, S. (). Algorithms for inverse reinforcement learning. In Proceedings of th international conference on machine learning, Stanford, CA, USA (pp. –). San Francisco: Morgan Kaufmann.
Pomerleau, D. A. (). ALVINN: An autonomous land vehicle in a neural network. In D. S. Touretzky (Ed.), Advances in neural information processing systems. San Mateo: Morgan Kaufmann.
Potts, D., & Sammut, C. (November ). Incremental learning of linear model trees. Machine Learning, (–), –.
Sammut, C., Hurst, S., Kedzier, D., & Michie, D. (). Learning to fly. In D. Sleeman & P. Edwards (Eds.), Proceedings of the ninth international conference on machine learning, Aberdeen (pp. –). San Francisco: Morgan Kaufmann.
Šuc, D., & Bratko, I. (). Skill reconstruction as induction of LQ controllers with subgoals. In IJCAI-: Proceedings of the fifteenth international joint conference on artificial intelligence, Nagoya, Japan (Vol. , pp. –). San Francisco: Morgan Kaufmann.
Šuc, D., & Bratko, I. (a). Modelling of control skill by qualitative constraints. In Thirteenth international workshop on qualitative reasoning, – June , Loch Awe, Scotland (pp. –). Aberystwyth: University of Aberystwyth.
Šuc, D., & Bratko, I. (b). Symbolic and qualitative reconstruction of control skill. Electronic Transactions on Artificial Intelligence, (B), –.
Urbančič, T., & Bratko, I. (). Reconstructing human skill with machine learning. In A. Cohn (Ed.), Proceedings of the th European conference on artificial intelligence. Wiley. Amsterdam: New York.
Widrow, B., & Smith, F. W. (). Pattern recognising control systems. In J. T. Tou & R. H. Wilcox (Eds.), Computer and information sciences. London: Clever Hume.

Belief State Markov Decision Processes

See Partially Observable Markov Decision Processes.

Bellman Equation

The Bellman Equation is a recursive formula that forms the basis for dynamic programming. It computes the expected total reward of taking an action from a state in a Markov decision process by breaking it into the immediate reward and the total future expected reward. (See dynamic programming.)
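The decomposition into immediate reward plus expected future reward can be shown on a toy problem. The sketch below is illustrative only: the two-state Markov decision process, the reward table `R[s][a]`, the discount factor, and all function names are assumptions made for this example.

```python
def q_value(s, a, P, R, V, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

def value_iteration(P, R, gamma, sweeps=100):
    """Apply the Bellman backup repeatedly to every state."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: max(q_value(s, a, P, R, V, gamma) for a in P[s]) for s in P}
    return V

# Toy MDP: P[s][a] maps each next state to its probability.
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 1.0, "go": 0.0}}
V = value_iteration(P, R, gamma=0.5)
```

In this toy problem, staying in state 1 earns a reward of 1 at every step, so its value satisfies V(1) = 1 + 0.5 V(1), and the value of state 0 is the discounted value of moving to state 1.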

Bias

Bias has two meanings: inductive bias and statistical bias (see bias variance decomposition).


Bias Specification Language

Hendrik Blockeel
Katholieke Universiteit Leuven, Belgium
The Netherlands

Definition

A bias specification language is a language in which a user can specify a language bias. The language bias of a learner is the set of hypotheses (or hypothesis descriptions) that this learner may return. In contrast to the hypothesis language, the bias specification language allows us to describe not single hypotheses but sets (languages) of hypotheses.

Examples

In learning approaches based on graphical models or artificial neural networks, whenever the user provides the graph structure of the model, he or she is specifying a bias. The “language” used to specify this bias, in this case, consists of graphs. The figure shows examples of such graphs. Not every kind of bias can necessarily be expressed by some bias specification language; for instance, the bias defined by the Bayesian network structure in the figure cannot be expressed using a Markov network. Bayesian networks and Markov networks have a different expressiveness when viewed as bias specification languages.

Certain parameters of decision tree learners or rule set learners also effectively restrict the hypothesis language (for instance, an upper bound on the rule length or the size of the decision tree). A combination of parameter values can hardly be called a language, and even the “language” of graphs is a relatively simple kind of language. More elaborate types of bias specification languages are typically found in the field of inductive logic programming (ILP).

[Figure: two graphs over the variables A, B, and C. Left: a Bayesian network, with p(A,B,C) = p(A)p(B)p(C|A,B). Right: a Markov network, with p(A,B,C) = f1(A,C)f2(B,C).]

Bias Specification Language, Figure: Graphs defining a bias for learning joint distributions. The Bayesian network structure to the left constrains the form of the joint distribution in a particular way (shown as the equation below the graph). Notably, it guarantees that only distributions can be learned in which the variables A and B are (unconditionally) independent. The Markov network structure to the right constrains the form of the joint distribution in a different way: it states that it must be possible to write the distribution as a product of a function of A and C and a function of B and C. These two biases are different. In fact, no Markov network structure over the variables A, B, and C exists that expresses the bias specified by the Bayesian network structure.
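The difference between the two graph-based biases can be checked numerically. The sketch below uses binary variables and arbitrary, made-up parameter values (the tables `p_a`, `p_b`, `p_c1`, `f1`, and `f2` are assumptions for illustration). It confirms that any joint distribution of the form p(A)p(B)p(C|A,B) makes A and B marginally independent, whereas one particular distribution of the form f1(A,C)f2(B,C) does not. It illustrates the claim about different expressiveness; it does not prove it.

```python
from itertools import product

# Bayesian network factors (made-up numbers): p(A), p(B), p(C=1 | A, B).
p_a = {0: 0.3, 1: 0.7}
p_b = {0: 0.6, 1: 0.4}
p_c1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.9}

def bayes_joint(a, b, c):
    pc = p_c1[(a, b)] if c == 1 else 1.0 - p_c1[(a, b)]
    return p_a[a] * p_b[b] * pc

# Markov network factors (made-up numbers): f1(A, C) and f2(B, C).
f1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
f2 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
z = sum(f1[(a, c)] * f2[(b, c)] for a, b, c in product((0, 1), repeat=3))

def markov_joint(a, b, c):
    return f1[(a, c)] * f2[(b, c)] / z

def max_dependence(joint):
    """Largest |p(A,B) - p(A)p(B)|; zero means A and B are independent."""
    pab = {(a, b): sum(joint(a, b, c) for c in (0, 1))
           for a, b in product((0, 1), repeat=2)}
    pa = {a: pab[(a, 0)] + pab[(a, 1)] for a in (0, 1)}
    pb = {b: pab[(0, b)] + pab[(1, b)] for b in (0, 1)}
    return max(abs(pab[(a, b)] - pa[a] * pb[b]) for a, b in pab)

bayes_gap = max_dependence(bayes_joint)    # zero up to rounding error
markov_gap = max_dependence(markov_joint)  # clearly nonzero for these factors
```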

Bias Specification Languages in Inductive Logic Programming

In ILP, the hypotheses returned by the learning algorithm are typically written as first-order logic clauses. As the set of all possible clauses is too large to handle, a subset of these clauses is typically defined; this subset is called the language bias. Several formalisms (“bias specification languages”) have been proposed for specifying such subsets. We here focus on a few representative ones.

DLAB

In the DLAB bias specification language (Dehaspe & De Raedt, ), the language bias is defined in a declarative way, by defining a syntax that clauses must fulfill. In its simplest form, a DLAB specification simply gives a set of possible head and body literals out of which the system can build a clause.

Example. The actual syntax of the DLAB specification language is relatively complicated (see Dehaspe & De Raedt, ), but in essence, one can write down a specification such as:

    { parent({X,Y,Z},{X,Y,Z}), grandparent({X,Y,Z},{X,Y,Z}) } :-
    { parent({X,Y,Z},{X,Y,Z}), parent({X,Y,Z},{X,Y,Z}),
      grandparent({X,Y,Z},{X,Y,Z}), grandparent({X,Y,Z},{X,Y,Z}) }

which states that the hypothesis language consists of all clauses that have at most one parent and at most one grandparent literal in the head, and at most two of these literals in the body; the arguments of these literals may be variables X, Y, Z. Thus, the following clauses are in the hypothesis language:

    grandparent(X,Y) :- parent(X,Z), parent(Z,Y)
    :- parent(X,Y), parent(Y,X)
    :- parent(X,X)

These express the usual definition of grandparent as well as the fact that there can be no cycles in the parent relation. Note that for each argument of each literal, all the variables and constants that may occur have to be enumerated explicitly. This can make a DLAB specification quite complex. While DLAB contains advanced constructs to alleviate this problem, it remains the case that often very elaborate bias specifications are needed in practical situations.

Type- and Mode-Based Biases

A more flexible bias specification language is used by Progol (Muggleton, ) and many other ILP systems. It is based on the notions of types and modes. In Progol, arguments of a predicate can be typed, and a variable can never occur in two locations with different types. Similarly, arguments of a predicate have an input (+) or output (−) mode; each variable that occurs as an input argument of some literal must occur elsewhere as an output argument, or must occur as an input argument in the head literal of a clause.

Example. In Progol, the specifications

    type(parent(human,human)).
    type(grandparent(human,human)).
    modeh(grandparent(+,+)).   % modeh: specifies a head literal
    modeb(grandparent(+,-)).   % modeb: specifies a body literal
    modeb(parent(+,-)).

allow the system to construct a clause such as

    grandparent(X,Y) :- parent(X,Z), parent(Z,Y)

but not the following clause:

    grandparent(X,Y) :- parent(Z,Y)

because Z occurs as an input parameter for parent without occurring elsewhere as an output parameter (i.e., it is being used without having been given a value first).

FLIPPER’s Bias Specification Language

The FLIPPER system (Cohen, ) employs a powerful, but somewhat more procedural, bias specification formalism. The user does not specify a set of valid hypotheses directly, but rather specifies a refinement operator. The language bias is the set of all clauses that can be obtained from one or more starting clauses through repeated application of this refinement operator. The operator itself is defined by specifying under which conditions certain literals can be added to a clause. Rules defining the operator have one of the following forms:

    A ← B where Pre asserting Post
    L where Pre asserting Post

The first form defines a set of “starting clauses,” and the second form defines when a literal L can be added to a clause. Each rule can only be applied when its preconditions Pre are fulfilled, and upon application will assert a set of literals Post. As an example (Cohen, ), the rules

    illegal(A, B, C, D, E, F) ← where true asserting {linked(A), linked(B), . . ., linked(F)}
    R(X, Y) where rel(R), linked(X), linked(Y) asserting ∅

state that any clause of the form

    illegal(A, B, C, D, E, F) ←

can be used as a starting point for the refinement operator, and the variables in this clause are all linked. Further, any literal of the form R(X, Y) with R a relation symbol (as defined by the rel predicate) and X and Y linked variables can be added.

Other Approaches

Grammars or term rewriting systems have been proposed several times as a means of defining the hypothesis language. A relatively recent approach along these lines was given by Lloyd, who uses a rewriting system to define the tests that can occur in the nodes of a decision tree built by the Alkemy system (Lloyd, ). Boström & Idestam-Almquist () present an approach where the language bias is implicitly defined through the background knowledge given to the learner. Knobbe et al. () propose the use of UML as a “common” bias specification language, specifications in which could be translated automatically to languages specific to a particular learner.
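The input/output mode constraint described for type- and mode-based biases can be expressed as a small check on a candidate clause. The sketch below is not Progol's implementation. The clause representation, the `MODES` table, and the function name are hypothetical; type checking is omitted; and the test follows the rule as worded in this entry (an input variable must occur elsewhere as an output argument, or as an input argument of the head), without enforcing any left-to-right ordering of body literals.

```python
# Mode declarations mirroring the modeh/modeb example: "+" is input, "-" is output.
MODES = {
    ("head", "grandparent"): "++",
    ("body", "grandparent"): "+-",
    ("body", "parent"): "+-",
}

def satisfies_modes(head, body, modes=MODES):
    """head and each body literal are (predicate, args) pairs,
    for example ("parent", ("X", "Z"))."""
    head_pred, head_args = head
    head_inputs = {v for v, m in zip(head_args, modes[("head", head_pred)])
                   if m == "+"}

    def outputs(literal):
        pred, args = literal
        return {v for v, m in zip(args, modes[("body", pred)]) if m == "-"}

    for i, (pred, args) in enumerate(body):
        # Output variables of all the other body literals.
        elsewhere = set().union(*(outputs(l) for j, l in enumerate(body) if j != i))
        for v, m in zip(args, modes[("body", pred)]):
            if m == "+" and v not in head_inputs and v not in elsewhere:
                return False
    return True

head = ("grandparent", ("X", "Y"))
allowed = satisfies_modes(head, [("parent", ("X", "Z")), ("parent", ("Z", "Y"))])
rejected = satisfies_modes(head, [("parent", ("Z", "Y"))])
```

With these declarations, the two-literal clause for grandparent passes the check, while the clause whose body is only parent(Z,Y) fails because Z is used as an input without being bound anywhere.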

Further Reading

An overview of bias specification formalisms in ILP is given by Nédellec et al. (). The bias specification languages discussed above are discussed in more detail in Dehaspe and De Raedt (), Muggleton (), and Cohen (). De Raedt () discusses language bias and the concept of bias shift (learners weakening their bias, i.e., extending the set of hypotheses that can be represented, when a given language bias turns out to be too restrictive). A more recent approach to learning declarative bias is presented by Bridewell and Todorovski ().

Cross References

Hypothesis Language
Inductive Logic Programming

Recommended Reading

Boström, H., & Idestam-Almquist, P. (). Induction of logic programs by example-guided unfolding. Journal of Logic Programming, (–), –.
Bridewell, W., & Todorovski, L. (). Learning declarative bias. In Proceedings of the th international conference on inductive logic programming. Lecture notes in computer science (Vol. , pp. –). Berlin: Springer.
Cohen, W. (). Learning to classify English text with ILP methods. In L. De Raedt (Ed.), Advances in inductive logic programming (pp. –). Amsterdam: IOS Press.
De Raedt, L. (). Interactive theory revision: An inductive logic programming approach. New York: Academic Press.
Dehaspe, L., & De Raedt, L. (). DLAB: A declarative language bias formalism. In Proceedings of the international symposium on methodologies for intelligent systems. Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer.
Knobbe, A. J., Siebes, A., Blockeel, H., & van der Wallen, D. (). Multi-relational data mining, using UML for ILP. In Proceedings of PKDD- – The fourth European conference on principles and practice of knowledge discovery in databases. Lecture notes in artificial intelligence (Vol. , pp. –), Lyon, France. Berlin: Springer.
Lloyd, J. W. (). Logic for learning. Berlin: Springer.
Muggleton, S. (). Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, (–), –.
Nédellec, C., Adé, H., Bergadano, F., & Tausend, B. (). Declarative bias in ILP. In L. De Raedt (Ed.), Advances in inductive logic programming. Frontiers in artificial intelligence and applications (Vol. , pp. –). Amsterdam: IOS Press.

Bias Variance Decomposition

Definition

The bias-variance decomposition is a useful theoretical tool for understanding the performance characteristics of a learning algorithm. The following discussion is restricted to the use of squared loss as the performance measure, although similar analyses have been undertaken for other loss functions. The case receiving most attention is the zero-one loss (i.e., classification problems), in which case the decomposition is nonunique and a topic of active research. See Domingos () for details.

The decomposition allows us to see that the mean squared error of a model (generated by a particular learning algorithm) is in fact made up of two components. The bias component tells us how accurate the model is, on average across different possible training sets. The variance component tells us how sensitive the learning algorithm is to small changes in the training set (Fig. ).

Bias Variance Decomposition. Figure . The bias-variance decomposition is like trying to hit the bullseye on a dartboard. Each dart is thrown after training our “dart-throwing” model in a slightly different manner. If the darts vary wildly, the learner is high variance. If they are far from the bullseye, the learner is high bias. The ideal is clearly to have both low bias and low variance; however this is often difficult, giving an alternative terminology as the bias-variance “dilemma” (Dartboard analogy, Moore & McCabe ()). The four panels show high bias with high variance, low bias with high variance, high bias with low variance, and low bias with low variance.

Mathematically, this can be quantified as a decomposition of the mean squared error function. For a testing example {x, d}, the decomposition is:

$$E_D\{(f(x) - d)^2\} = (E_D\{f(x)\} - d)^2 + E_D\{(f(x) - E_D\{f(x)\})^2\},$$

$$\text{MSE} = \text{bias}^2 + \text{variance},$$

where the expectations are with respect to all possible training sets. In practice, this can be estimated by cross-validation over a single finite training set, enabling a deeper understanding of the algorithm characteristics. For example, efforts to reduce variance often cause increases in bias, and vice versa. A low bias and high variance is an indicator that a learning algorithm is prone to overfitting the model.
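The decomposition can be checked numerically. The following is a minimal sketch, not the procedure of the entry: instead of cross-validating a single training set, it assumes a synthetic data-generating process (a noisy sine curve) from which many training sets can be drawn, and it fits polynomials with NumPy. The function name `bias_variance_at_point`, the target function, and all default parameters are illustrative assumptions.

```python
import numpy as np

def bias_variance_at_point(degree, x_test, d_test,
                           n_sets=500, n_train=20, noise=0.3, seed=0):
    """Estimate MSE, squared bias and variance of a polynomial fit at one test point."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_sets)
    for k in range(n_sets):
        # Draw a fresh training set D from the assumed data-generating process.
        x = rng.uniform(-1.0, 1.0, n_train)
        y = np.sin(np.pi * x) + rng.normal(0.0, noise, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[k] = np.polyval(coeffs, x_test)   # f(x) for this training set
    mse = np.mean((preds - d_test) ** 2)          # E_D{(f(x) - d)^2}
    bias_sq = (preds.mean() - d_test) ** 2        # (E_D{f(x)} - d)^2
    variance = preds.var()                        # E_D{(f(x) - E_D{f(x)})^2}
    return mse, bias_sq, variance

# Compare a rigid model (degree 1) with a more flexible one (degree 5).
for degree in (1, 5):
    mse, bias_sq, variance = bias_variance_at_point(degree, x_test=0.5, d_test=1.0)
    print(degree, mse, bias_sq, variance)
```

Because the three quantities are computed from the same set of predictions, the identity MSE = bias² + variance holds up to floating-point rounding; the interesting part is how the split between the two terms changes with the model's flexibility.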

Cross References

Bias-Variance Trade-offs: Novel Applications

Recommended Reading

Domingos, P. (). A unified bias-variance decomposition for zero-one and squared loss. In Proceedings of national conference on artificial intelligence. Austin, TX: AAAI Press.
Geman, S. (). Neural networks and the bias/variance dilemma. Neural Computation, ()
Moore, D. S., & McCabe, G. P. (). Introduction to the practice of statistics. Michelle Julet

Bias-Variance Trade-offs: Novel Applications

Dev Rajnarayan, David Wolpert
NASA Ames Research Center, Moffett Field, CA, USA

Definition

Consider a given random variable $F$ and a random variable that we can modify, $\hat{F}$. We wish to use a sample of $\hat{F}$ as an estimate of a sample of $F$. The mean squared error (MSE) between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between $F$ and $\hat{F}$ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent noise in $F$ and is independent of the estimator $\hat{F}$. Accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on $\hat{F}$. The third term, called the bias, is independent of the precise samples of both $F$ and $\hat{F}$, and reflects the difference between the means of $F$ and $\hat{F}$. The fourth term, called the variance, is independent of the precise sample of $F$, and reflects the inherent noise in the estimator as one samples it. These last two terms can be modified by changing the choice of the estimator. In particular, on small sample sets, we can often decrease our mean squared error by, for instance, introducing a small bias that causes a large reduction in the variance. Although such bias-variance trade-offs are most commonly used in machine learning, this article shows that they are applicable in a much broader context and in a variety of situations. We also show, using experiments, how existing bias-variance trade-offs can be applied in novel circumstances to improve the performance of a class of optimization algorithms.
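The claim that a small bias can buy a large variance reduction on small samples can be made concrete with a textbook shrinkage estimator. The sketch below is an illustration under stated assumptions, not an example from the article: the target is a fixed number `mu` (so the noise term of $F$ is zero), the estimator is `c` times the mean of `m` i.i.d. draws with standard deviation `sigma`, and both function names are hypothetical. The best shrinkage factor depends on the unknown `mu`, so this only shows that the trade-off exists.

```python
def shrunk_mean_mse(c, mu, sigma, m):
    """Exact MSE of the estimator c * (mean of m i.i.d. draws) for a fixed target mu."""
    variance = (c ** 2) * sigma ** 2 / m   # variance of the scaled sample mean
    bias = (c - 1.0) * mu                  # E[estimator] - mu
    return variance + bias ** 2

def best_shrinkage(mu, sigma, m):
    """Shrinkage factor c that minimizes variance + bias^2 (obtained by setting the derivative to zero)."""
    return mu ** 2 / (mu ** 2 + sigma ** 2 / m)

mu, sigma, m = 1.0, 2.0, 4
c_star = best_shrinkage(mu, sigma, m)
print("unbiased MSE:", shrunk_mean_mse(1.0, mu, sigma, m))
print("shrunk MSE:  ", shrunk_mean_mse(c_star, mu, sigma, m), "with c =", c_star)
```

With these assumed values the unbiased estimator (`c = 1`) has a larger MSE than the deliberately biased one, and the gap closes as `m` grows because the variance term shrinks.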

Motivation and Background

In its simplest form, the bias-variance decomposition is based on the following idea. Say we have a random variable $F$ taking on values $F$ distributed according to a density function $p(F)$. We want to estimate the value of a sample from $p(F)$. To form our estimate, we sample a different random variable $\hat{F}$ taking on values $\hat{F}$ distributed according to $p(\hat{F})$. Assuming a quadratic loss function, the quality of our estimate is measured by its MSE:

$$\mathrm{MSE}(\hat{F}) \equiv \int p(\hat{F}, F)\,(\hat{F} - F)^2 \, d\hat{F}\, dF.$$

In many situations, $F$ and $\hat{F}$ are dependent variables. For example, in supervised machine learning, $F$ is a “target” conditional distribution, stochastically mapping elements of an input space X into a space Y of output variables. The associated distribution $p(F)$ is the “prior” of $F$. A random sample D of $F$, called “the training set,” is generated, and D is used in a “learning algorithm” to produce $\hat{F}$, which is our estimate of $F$. Clearly, this $F$ and $\hat{F}$ are statistically dependent, via D. Indeed, intuitively speaking, the goal in designing a learning algorithm is that the $\hat{F}$’s it produces are positively correlated with the $F$’s.

In practice this coupling is simply ignored in analyses of bias plus variance, without any justification (one such justification could be that the coupling has little effect on the value of the MSE). We shall follow that practice here. Accordingly, our equation for MSE reduces to

$$\mathrm{MSE}(\hat{F}) = \int p(\hat{F})\,p(F)\,(\hat{F} - F)^2 \, d\hat{F}\, dF.$$ ()

If we were to account for the coupling of $\hat{F}$ and $F$, an additive correction term would need to be added to the right-hand side. For instance, see Wolpert ().

Using simple algebra, the right-hand side of () can be written as the sum of three terms. The first is the variance of $F$. Since this is beyond our control in designing the estimator $\hat{F}$, we ignore it for the rest of this article. The second term involves a mean that describes the deterministic component of the error. This term depends on both the distribution of $F$ and that of $\hat{F}$, and quantifies how close the means of those distributions are. The third term is a variance that describes stochastic variations from one sample to the next. This term is independent of the random variable being estimated. Formally, up to an overall additive constant, we can write

$$\begin{aligned}
\mathrm{MSE}(\hat{F}) &= \int p(\hat{F})\,(\hat{F}^2 - 2F\hat{F} + F^2)\, d\hat{F} \\
&= \int p(\hat{F})\,\hat{F}^2\, d\hat{F} - 2F \int p(\hat{F})\,\hat{F}\, d\hat{F} + F^2 \\
&= V(\hat{F}) + [E(\hat{F})]^2 - 2F\,E(\hat{F}) + F^2 \\
&= V(\hat{F}) + [F - E(\hat{F})]^2 \\
&= \text{variance} + \text{bias}^2.
\end{aligned}$$

Here $F$ on the right-hand side stands for the mean of $p(F)$, and the dropped additive constant is the variance of $F$.
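The three-term form of the MSE under independence can be verified exactly on small discrete distributions, where the double integral becomes a finite double sum. The sketch below is illustrative only; the helper names `mse_independent` and `moments` and the example distributions are assumptions, not part of the article.

```python
import numpy as np

def moments(vals, probs):
    """Mean and variance of a discrete distribution."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    mean = float(np.sum(probs * vals))
    var = float(np.sum(probs * (vals - mean) ** 2))
    return mean, var

def mse_independent(f_vals, f_probs, fh_vals, fh_probs):
    """Sum over all pairs of p(F_hat) p(F) (F_hat - F)^2, i.e. the MSE with the coupling ignored."""
    f_vals, f_probs = np.asarray(f_vals, float), np.asarray(f_probs, float)
    fh_vals, fh_probs = np.asarray(fh_vals, float), np.asarray(fh_probs, float)
    diff_sq = (fh_vals[:, None] - f_vals[None, :]) ** 2
    joint = fh_probs[:, None] * f_probs[None, :]
    return float(np.sum(joint * diff_sq))

# Hypothetical target F and estimator F_hat.
f_vals, f_probs = [0.0, 1.0, 3.0], [0.2, 0.5, 0.3]
fh_vals, fh_probs = [0.5, 2.0], [0.6, 0.4]

mean_f, var_f = moments(f_vals, f_probs)
mean_fh, var_fh = moments(fh_vals, fh_probs)
mse = mse_independent(f_vals, f_probs, fh_vals, fh_probs)
print(mse, var_f + var_fh + (mean_f - mean_fh) ** 2)
```

The two printed numbers agree: the noise in $F$, the variance of $\hat{F}$, and the squared difference of the means add up to the MSE.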

In light of (), one way to try to reduce expected quadratic error is to modify an estimator to trade off bias and variance. Some of the most famous applications of such bias-variance trade-offs occur in parametric machine learning, where many techniques have been developed to exploit the trade-off. Nonetheless, the trade-off also arises in many other fields, including integral estimation and optimization. In the rest of this paper we present a few novel applications of the bias-variance trade-off, and describe some interesting features in each case. A recurring theme is the following: whenever a bias-variance trade-off arises in a particular field, we can use many techniques from parametric machine learning that have been developed for exploiting this trade-off. See Wolpert and Rajnarayan () for further details of many of these applications.

Applications

In this section, we describe some applications of the bias-variance trade-off. First, we describe Monte Carlo (MC) techniques for the estimation of integrals, and provide a brief analysis of bias-variance trade-offs in this context. Next, we introduce the field of Monte Carlo optimization (MCO), and illustrate that there are more subtleties involved than in simple MC. Then, we describe the field of parametric machine learning, which, as we will show, is formally identical to MCO. Finally, we describe the application of parametric learning (PL) techniques to improve the performance of MCO algorithms. We do this in the context of an MCO problem that addresses black-box optimization.

Monte Carlo Estimation of Integrals Using Importance Sampling

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. Consider a function $f: \mathcal{X} \to \mathbb{R}$, which we want to integrate over some region $X \subseteq \mathcal{X}$, yielding the value $F$, as given by

$$F = \int_X dx\, f(x).$$ ()

We can view this as a random variable $F$, with density function given by a Dirac delta function centered on $F$. Therefore, the variance of $F$ is 0, and () is exact.

A popular MC method to estimate this integral is importance sampling (see Robert & Casella, ). This exploits the law of large numbers as follows: i.i.d. samples $x^{(i)}$, $i = 1, \ldots, m$, are generated from a so-called importance distribution $h(x)$ that we control, and the associated values of the integrand, $f(x^{(i)})$, are computed. Denote these “data” by

$$D = \{(x^{(i)}, f(x^{(i)})),\; i = 1, \ldots, m\}.$$ ()

Now,

$$F = \int_X dx\, h(x)\,\frac{f(x)}{h(x)} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})} \quad \text{with probability 1.}$$

Denote by $\hat{F}$ the random variable with value given by the sample average for D:

$$\hat{F} = \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})}.$$

We use $\hat{F}$ as our statistical estimator for $F$, as we broadly described in the introductory section. Assuming a quadratic loss function, $L(\hat{F}, F) = (F - \hat{F})^2$, the bias-variance decomposition described in () applies exactly. It can be shown that the estimator $\hat{F}$ is unbiased, that is, $E(\hat{F}) = F$, where the mean is over samples of $h$. Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution $h$ that minimizes this variance is given by (see Robert & Casella, )

$$h^\star(x) = \frac{|f(x)|}{\int_X |f(x')|\, dx'}.$$

By itself, this result is not very helpful, since the equation for the optimal importance distribution contains a similar integral to the one we are trying to estimate. For non-negative integrands $f(x)$, the VEGAS algorithm (Lepage, ) describes an adaptive method to find successively better importance distributions, by iteratively estimating $F$, and then using that estimate to generate the next importance distribution $h$. In the case of these unbiased estimators, there is no trade-off between bias and variance, and minimizing MSE is achieved by minimizing variance.

Monte Carlo Optimization

Instead of a fixed integral to evaluate, consider a parametrized integral

$$F(\theta) = \int_X dx\, f_\theta(x).$$

Further, suppose we are interested in finding the value of the parameter $\theta \in \Theta$ that minimizes $F(\theta)$:

$$\theta^\star = \arg\min_{\theta \in \Theta} F(\theta).$$

In the case where the functional form of $f_\theta$ is not explicitly known, one approach to solve this problem is a technique called MCO (see Ermoliev & Norkin, ), involving repeated MC estimation of the integral in question with adaptive modification of the parameter θ. We proceed by analogy to the case with MC. First, we introduce the θ-indexed random variable $F(\theta)$, all of whose components have delta-function distributions about the associated values $F(\theta)$. Next, we introduce a θ-indexed vector random variable $\hat{F}$ with values

$$\hat{F} \equiv \{\hat{F}(\theta)\ \ \forall\, \theta \in \Theta\}.$$ ()

Each real-valued component $\hat{F}(\theta)$ can be sampled and viewed as an estimate of $F(\theta)$. For example, let D be a data set as described in (). Then for every θ, any sample of D provides an associated estimate

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$

That average serves as an estimate of $F(\theta)$. Formally, $\hat{F}$ is a function of the random variable D, and is given by such averaging over the elements of D. So, a sample of D provides a sample of $\hat{F}$. A priori, we make no restrictions on $\hat{F}$, and so, in general, its components may be statistically coupled with one another. Note that this coupling arises even though we are, for simplicity, treating each function $F(\theta)$ as having a delta-function distribution, rather than as having a non-zero variance that would reflect our lack of knowledge of the $f_\theta$ functions.
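The importance-sampling estimate of $F(\theta)$ for several values of θ from one shared data set can be sketched as follows. Everything specific here is an assumption chosen so that the true answer is known: the integrand is $f_\theta(x) = (x - \theta)^2 \varphi(x)$ with $\varphi$ the standard normal density, so that $F(\theta) = 1 + \theta^2$, and the importance distribution $h$ is a zero-mean normal with standard deviation 2. The function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))

def f_theta(x, theta):
    # Assumed integrand; its integral over the real line is 1 + theta^2.
    return (x - theta) ** 2 * normal_pdf(x)

def estimate_F(thetas, m=50_000, h_std=2.0, seed=0):
    """Importance-sampling estimates of F(theta), all computed from the same sample D."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, h_std, size=m)     # i.i.d. samples from h
    h = normal_pdf(x, 0.0, h_std)          # h(x^(i))
    return np.array([np.mean(f_theta(x, t) / h) for t in thetas])

thetas = np.array([-1.0, 0.0, 0.5, 1.0])
F_hat = estimate_F(thetas)
for t, est in zip(thetas, F_hat):
    print(t, est, 1.0 + t ** 2)
```

Because every component $\hat{F}(\theta)$ is computed from the same draws, the errors of the components are statistically coupled, which is the situation the paragraph above describes.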

However $\hat{F}$ is defined, given a sample of $\hat{F}$, one way to estimate $\theta^\star$ is

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \hat{F}(\theta).$$

We call this approach “natural” MCO. As an example, say that D is a set of m samples of $h$, and let

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})},$$

as above. Under this choice for $\hat{F}$,

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$ ()

We call this approach “naive” MCO.

Consider any algorithm that estimates $\theta^\star$ as a single-valued function of $\hat{F}$. The estimate of $\theta^\star$ produced by that algorithm is itself a random variable, since it is a function of the random variable $\hat{F}$. Call this random variable $\hat{\theta}^\star$, taking on values $\hat{\theta}^\star$. Any MCO algorithm is defined by $\hat{\theta}^\star$; that random variable encapsulates the output estimate made by the algorithm.

To analyze the error of such an algorithm, consider the associated random variable given by the true parametrized integral $F(\hat{\theta}^\star)$. The difference between a sample of $F(\hat{\theta}^\star)$ and the true minimal value of the integral, $F(\theta^\star) = \min_\theta F(\theta)$, is the error introduced by our estimating that optimal θ as a sample of $\hat{\theta}^\star$. Since our aim in MCO is to minimize $F(\theta)$, we adopt the loss function $L(\hat{\theta}^\star, \theta^\star) \triangleq F(\hat{\theta}^\star) - F(\theta^\star)$. This is in contrast to our discussion on MC integration, which involved quadratic loss. The current loss function just equals $F(\hat{\theta}^\star)$ up to an additive constant $F(\theta^\star)$ that is fixed by the MCO problem at hand and is beyond our control. Up to that additive constant, the associated expected loss is

$$E(L) = \int d\hat{\theta}^\star\, p(\hat{\theta}^\star)\, F(\hat{\theta}^\star).$$ ()

Now change coordinates in this integral from the values of the scalar random variable $\hat{\theta}^\star$ to the values of the underlying vector random variable $\hat{F}$. The expected loss now becomes

$$E(L) = \int d\hat{F}\, p(\hat{F})\, F(\hat{\theta}^\star(\hat{F})).$$

The natural MCO algorithm provides some insight into these results. For that algorithm,

$$\begin{aligned}
E(L) &= \int d\hat{F}\, p(\hat{F})\, F\big(\arg\min_\theta \hat{F}(\theta)\big) \\
&= \int d\hat{F}(\theta_1)\, d\hat{F}(\theta_2) \ldots\, p(\hat{F}(\theta_1), \hat{F}(\theta_2), \ldots)\, F\big(\arg\min_\theta \hat{F}(\theta)\big).
\end{aligned}$$ ()
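Naive MCO and its expected loss can be simulated on a toy problem. The sketch below reuses an assumed integrand $f_\theta(x) = (x - \theta)^2 \varphi(x)$, for which $F(\theta) = 1 + \theta^2$ is known, restricts θ to a finite grid, and estimates $E(L)$ by repeating the whole procedure with fresh data sets. The function names, grid, and sample sizes are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))

def true_F(theta):
    # Known value of the assumed parametrized integral.
    return 1.0 + np.asarray(theta) ** 2

def naive_mco(thetas, m, rng, h_std=2.0):
    """Return the grid value of theta that minimizes the sample-based estimate F_hat."""
    x = rng.normal(0.0, h_std, size=m)                   # data set D drawn from h
    w = normal_pdf(x) / normal_pdf(x, 0.0, h_std)        # phi(x) / h(x)
    F_hat = np.array([np.mean((x - t) ** 2 * w) for t in thetas])
    return thetas[int(np.argmin(F_hat))]

def expected_loss(thetas, m, repeats=200, seed=0):
    """Monte Carlo estimate of E[F(theta_hat) - min F] over independent data sets."""
    rng = np.random.default_rng(seed)
    picks = np.array([naive_mco(thetas, m, rng) for _ in range(repeats)])
    return float(np.mean(true_F(picks) - true_F(thetas).min()))

thetas = np.linspace(-2.0, 2.0, 41)
loss_small = expected_loss(thetas, m=10)
loss_large = expected_loss(thetas, m=5000)
print(loss_small, loss_large)
```

The loss is measured on the true $F$, not on the estimates, so it reflects how often the noisy estimates misorder the candidate θ values rather than how accurate each individual estimate is.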

For any fixed θ, there is an error between samples of $\hat{F}(\theta)$ and the true value $F(\theta)$. Bias-variance considerations apply to this error, exactly as in the discussion of MC above. We are not, however, concerned with $\hat{F}$ for a single component θ, but rather for a set Θ of θ’s.

The simplest such case is where the components of $\hat{F}(\Theta)$ are independent. Even so, $\arg\min_\theta \hat{F}(\theta)$ is distributed according to the laws for extrema of multiple independent random variables, and this distribution depends on higher-order moments of each random variable $\hat{F}(\theta)$. This means that E[L] also depends on such higher-order moments. Only the first two moments, however, arise in the bias and variance for any single θ. Thus, even in the simplest possible case, the bias-variance considerations for the individual θ do not provide a complete analysis.

In most cases, the components of $\hat{F}$ are not independent. Therefore, in order to analyze E[L], in addition to higher moments of the distribution for each θ, we must now also consider higher-order moments coupling the estimates $\hat{F}(\theta)$ for different θ.

Due to these effects, it may be quite acceptable for all the components $\hat{F}(\theta)$ to have both a large bias and a large variance, as long as they still order the θ’s correctly with respect to the true $F(\theta)$. In such a situation, large covariances could ensure that if some $\hat{F}(\theta)$ were incorrectly large, then $\hat{F}(\theta')$, $\theta' \neq \theta$, would also be incorrectly large. This coupling between the components of $\hat{F}$ would preserve the ordering of θ’s under $F$. So, even with large bias and variance for each θ, the estimator as a whole would still work well.

Nevertheless, it is sufficient to design estimators $\hat{F}(\theta)$ with sufficiently small bias plus variance for each single θ. More precisely, suppose that those terms are very small on the scale of differences $F(\theta) - F(\theta')$ for any θ and θ′. Then by Chebychev’s inequality, we know that the density functions of the random variables $\hat{F}(\theta)$ and $\hat{F}(\theta')$ have almost no overlap. Accordingly, the probability that a sample of $\hat{F}(\theta) - \hat{F}(\theta')$ has the opposite sign of $F(\theta) - F(\theta')$ is almost zero.

Evidently, E[L] is generally determined by a complicated relationship involving bias, variance, covariance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of these effects, and consequently, often perform quite poorly in practice. In the next section we discuss some ways of addressing this problem.

Parametric Machine Learning

There are many versions of the basic MCO problem described in the previous section. Some of the best-explored arise in parametric density estimation and parametric supervised learning, which together comprise the field of parametric machine learning (PL). In particular, parametric supervised learning attempts to solve

$$\arg\min_{\theta \in \Theta} \int dx\, p(x) \int dy\, p(y \mid x)\, f_\theta(x).$$

Here, the values x represent inputs, and the values y represent corresponding outputs, generated according to some stochastic process defined by a set of conditional distributions $\{p(y \mid x),\ x \in \mathcal{X}\}$. Typically, one tries to solve this problem by casting it as an MCO problem. For instance, say we adopt a quadratic loss between a predictor $z_\theta(x)$ and the true value of y. Using MCO notation, we can express the associated supervised learning problem as finding $\arg\min_\theta F(\theta)$, where

$$\begin{aligned}
l_\theta(x) &= \int dy\, p(y \mid x)\, (z_\theta(x) - y)^2, \\
f_\theta(x) &= p(x)\, l_\theta(x), \\
F(\theta) &= \int dx\, f_\theta(x).
\end{aligned}$$ ()

Next, the argmin is estimated by minimizing a sample-based estimate of the $F(\theta)$’s. More precisely, we are given a “training set” of samples of $p(y \mid x)\, p(x)$, $\{(x^{(i)}, y_i),\ i = 1, \ldots, m\}$. This training set provides a set of associated estimates of $F(\theta)$:

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} l_\theta(x^{(i)}).$$

These are used to estimate $\arg\min_\theta F(\theta)$, exactly as in MCO. In particular, one could estimate the minimizer of $F(\theta)$ by finding the minimum of $\hat{F}(\theta)$, just as in natural MCO. As mentioned above, this MCO algorithm can perform very poorly in practice. In PL, this poor performance is called “overfitting the data.”

There are several formal approaches that have been explored in PL to try to address this “overfitting the data.” Interestingly, none are based on direct consideration of the random variable $F(\hat{\theta}^\star(\hat{F}))$ and the ramifications of its distribution for expected loss (cf. ()). In particular, no work has applied the mathematics of extrema of multiple random variables to analyze the bias-variance-covariance trade-offs encapsulated in ().

The PL approach that perhaps comes closest to such direct consideration of the distribution of $F(\hat{\theta}^\star)$ is uniform convergence theory, which is a central part of computational learning theory (see Angluin, ). Uniform convergence theory starts by crudely encapsulating the quadratic loss formula for expected loss under natural MCO (). It does this by considering the worst-case bound, over possible $p(x)$ and $p(y \mid x)$, of the probability that $F(\hat{\theta}^\star)$ exceeds $\min_\theta F(\theta)$ by more than κ. It then examines how that bound varies with κ. In particular, it relates such variation to characteristics of the set of functions $\{f_\theta : \theta \in \Theta\}$, e.g., the “VC dimension” of that set (see Vapnik, , ).

Another, historically earlier approach, is to apply bias-plus-variance considerations to the entire PL algorithm $\hat{\theta}^\star$, rather than to each $\hat{F}(\theta)$ separately. This approach is applicable for algorithms that do not use natural MCO, and even for non-parametric supervised learning. As formulated for parametric supervised learning, this approach combines the formulas in () to write

$$F(\theta) = \int dx\, dy\, p(x)\, p(y \mid x)\, (z_\theta(x) - y)^2.$$

This is then substituted into (), giving

$$\begin{aligned}
E[L] &= \int d\hat{\theta}^\star\, dx\, dy\; p(x)\, p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 \\
&= \int dx\, p(x) \left[ \int d\hat{\theta}^\star\, dy\; p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 \right].
\end{aligned}$$ ()

The term in square brackets is an x-parameterized expected quadratic loss, which can be decomposed into


a bias, variance, etc., in the usual way. This formulation eliminates any direct concern for issues like the distribution of extrema of multiple random variables, covariances between $\hat{F}(\theta)$ and $\hat{F}(\theta')$ for different values of θ, and so on.

There are numerous other approaches for addressing the problems of natural MCO that have been explored in PL. Particularly important among these are Bayesian approaches, e.g., Buntine and Weigend (), Berger (), and Mackay (). Based on these approaches, as well as on intuition, many powerful techniques for addressing data-overfitting have been explored in PL, including regularization, cross-validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO problem, not just PL problems. Since many of these techniques can be justified using (), they provide a way to exploit the bias-variance trade-off in other domains besides PL.

PLMCO

In this section, we illustrate how PL techniques that exploit the bias-variance decomposition of () can be used to improve an MCO algorithm used in a domain outside of PL. This MCO algorithm is a version of adaptive importance sampling, somewhat similar to the CE method (Rubinstein & Kroese, ), and is related to function smoothing on continuous spaces. The PL techniques described are applicable to any other MCO problem, and this particular one is chosen just as an example.

MCO Problem Description The problem is to find the θ-parameterized distribution $q_\theta$ that minimizes the associated expected value of a function $G: \mathbb{R}^n \to \mathbb{R}$, i.e., find

$$\arg\min_\theta E_{q_\theta}[G].$$

We are interested in versions of this problem where we do not know the functional form of G, but can obtain its value G(x) at any x ∈ X . Similarly we cannot assume that G is smooth, nor can we evaluate its derivatives directly. This scenario arises in many fields, including blackbox optimization (see Wolpert, Strauss, & Rajnarayan, ), and risk minimization (see Ermoliev & Norkin, ).

We begin by expressing this minimization problem as an MCO problem. We know that

$$E_{q_\theta}[G] = \int_X dx\, q_\theta(x)\, G(x).$$

Using MCO terminology, $f_\theta(x) = q_\theta(x) G(x)$ and $F(\theta) = E_{q_\theta}[G]$. To apply MCO, we must define a vector-valued random variable $\hat{F}$ with components indexed by θ, and then use a sample of $\hat{F}$ to estimate $\arg\min_\theta E_{q_\theta}[G]$. In particular, to apply naive MCO to estimate $\arg\min_\theta E_{q_\theta}[G]$, we first draw i.i.d. samples from a density function $h(x)$. By evaluating the associated values of $G(x)$ we get a data set

$$D \equiv (D_X, D_G) = \big(\{x^{(i)} : i = 1, \ldots, m\},\ \{G(x^{(i)}) : i = 1, \ldots, m\}\big).$$

The associated estimates of $F(\theta)$ for each θ are

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})}.$$ ()

The associated naive MCO estimate of $\arg\min_\theta E_{q_\theta}[G]$ is

$$\hat{\theta}^\star \equiv \arg\min_\theta \hat{F}(\theta).$$

Suppose Θ includes all possible density functions over x’s. Then the $q_\theta$ minimizing our estimate is a delta function about the $x^{(i)} \in D_X$ with the lowest associated value of $G(x^{(i)})/h(x^{(i)})$. This is clearly a poor estimate in general; it suffers from “data-overfitting.” Proceeding as in PL, one way to address this data-overfitting is to use regularization. In particular, we can use the entropic regularizer, given by the negative of the Shannon entropy $S(q_\theta)$. So we now want to find the minimizer of $E_{q_\theta}[G(x)] - T S(q_\theta)$, where T is the regularization parameter. Equivalently, we can minimize $\beta E_{q_\theta}[G(x)] - S(q_\theta)$, where $\beta = 1/T$. This changes the definition of $\hat{F}$ from the function given in () to

$$\hat{F}(\theta) \triangleq \frac{\beta}{m} \sum_{i=1}^{m} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(q_\theta).$$

Solution Methodology Unfortunately, it can be difficult to find the θ globally minimizing this new $\hat{F}$ for an arbitrary D. An alternative is to find a close approximation to that optimal θ. One way to do this is as follows. First, we find the minimizer of

$$\frac{\beta}{m} \sum_{i=1}^{m} \frac{p(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(p)$$ ()

over the set of all possible distributions $p(x)$ with domain X. We then find the $q_\theta$ that has minimal Kullback–Leibler (KL) divergence from this p, evaluated over $D_X$. That serves as our approximation to $\arg\min_\theta \hat{F}(\theta)$, and therefore as our estimate of the θ that minimizes $E_{q_\theta}(G)$.

The minimizer p of () can be found in closed form; over $D_X$ it is the Boltzmann distribution $p^\beta(x^{(i)}) \propto \exp(-\beta G(x^{(i)}))$. The KL divergence in $D_X$ from this Boltzmann distribution to $q_\theta$ is

$$F(\theta) = \mathrm{KL}(p^\beta \,\|\, q_\theta) = \int_X dx\, p^\beta(x) \log\left( \frac{p^\beta(x)}{q_\theta(x)} \right).$$
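The closed-form Boltzmann minimizer can be checked on a discrete analogue of the regularized objective. The sketch below is a simplification under stated assumptions: it works with a probability vector over a handful of points, takes the objective to be $\beta \sum_i p_i G_i - S(p)$ (that is, it drops the $1/h$ weighting by assuming a uniform $h$), and compares the Boltzmann distribution with randomly drawn alternatives. The function names and the example values of `G` and `beta` are hypothetical.

```python
import numpy as np

def boltzmann(G, beta):
    """Discrete Boltzmann distribution p_i proportional to exp(-beta * G_i)."""
    logits = -beta * np.asarray(G, float)
    logits = logits - logits.max()          # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def free_energy(p, G, beta):
    """Entropy-regularized objective beta * E_p[G] - S(p) for a discrete distribution p."""
    p = np.asarray(p, float)
    G = np.asarray(G, float)
    nz = p > 0
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return float(beta * np.dot(p, G) - entropy)

G = np.array([0.3, 1.2, 0.1, 2.5])
beta = 2.0
p_star = boltzmann(G, beta)
best = free_energy(p_star, G, beta)

# No randomly drawn distribution should achieve a lower objective value.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(len(G)), size=1000)
worst_gap = min(free_energy(p, G, beta) - best for p in others)
print(p_star, best, worst_gap)
```

As β grows the Boltzmann distribution concentrates on the point with the lowest $G$, which recovers the delta-function behaviour of the unregularized estimate; small β keeps the distribution broad.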

The minimizer of this KL divergence is given by

$$\theta^\dagger = \arg\min_\theta\; -\sum_{i=1}^{m} \frac{\exp(-\beta G(x^{(i)}))}{h(x^{(i)})}\, \log(q_\theta(x^{(i)})).$$ ()

This approach is an approximation to a regularized version of the naive MCO estimate of the θ that minimizes $E_{q_\theta}(G)$. The application of the technique of regularization in this context has the same motivation as it does in PL: to reduce bias plus variance.

Log-Concave Densities If $q_\theta$ is log-concave in its parameters θ, then the minimization problem in () is a convex optimization problem, and the optimal parameters can be found in closed form. Denote the likelihood ratios by $s^{(i)} = \exp(-\beta G(x^{(i)}))/h(x^{(i)})$. Differentiating () with respect to the parameters µ and $\Sigma^{-1}$ and setting them to zero yields

$$\mu^\star = \frac{\sum_D s^{(i)} x^{(i)}}{\sum_D s^{(i)}}, \qquad
\Sigma^\star = \frac{\sum_D s^{(i)} (x^{(i)} - \mu^\star)(x^{(i)} - \mu^\star)^T}{\sum_D s^{(i)}}.$$

Mixture Models The single Gaussian is a fairly restrictive class of models. Mixture models (see Mixture Modeling) can significantly improve flexibility, but at the cost of convexity of the KL distance minimization problem. However, a plethora of techniques from supervised learning, in particular the expectation maximization (EM) algorithm, can be applied with minor modifications. Suppose $q_\theta$ is a mixture of M Gaussians, that is, θ = (µ, Σ, ϕ), where ϕ is the mixing p.m.f. We can view the problem as one where a hidden variable z decides which mixture component each sample is drawn from. We then have the optimization problem

$$\text{minimize}\; -\sum_D \frac{p(x^{(i)})}{h(x^{(i)})} \log\big(q_\theta(x^{(i)}, z^{(i)})\big).$$

Following the standard EM procedure, we get the algorithm described in (). Since this is a nonconvex problem, one typically runs the algorithm multiple times with random initializations of the parameters.

E-step: For each i, set $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)})$, that is,

$$w_j^{(i)} = q_{\mu,\Sigma,\phi}(z^{(i)} = j \mid x^{(i)}), \quad j = 1, \ldots, M.$$

M-step: Set

$$\mu_j = \frac{\sum_D w_j^{(i)} s^{(i)} x^{(i)}}{\sum_D w_j^{(i)} s^{(i)}}, \qquad
\Sigma_j = \frac{\sum_D w_j^{(i)} s^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_D w_j^{(i)} s^{(i)}}, \qquad
\phi_j = \frac{\sum_D w_j^{(i)} s^{(i)}}{\sum_D s^{(i)}}.$$
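The E-step and M-step above can be sketched in NumPy as follows. This is an illustrative implementation under assumptions, not the authors' code: the initialization (means at randomly chosen samples, covariances equal to the data covariance), the small `ridge` added to keep covariances invertible, the fixed number of iterations, and the function names are all choices made here. With a single component the updates reduce to the closed-form $\mu^\star$ and $\Sigma^\star$ of the log-concave case.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    sol = np.linalg.solve(cov, diff.T).T
    quad = np.sum(diff * sol, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def weighted_em(x, s, n_components, n_iter=50, seed=0, ridge=1e-6):
    """EM for a Gaussian mixture in which sample i carries likelihood ratio s[i]."""
    rng = np.random.default_rng(seed)
    m, d = x.shape
    means = x[rng.choice(m, n_components, replace=False)].copy()
    base_cov = np.atleast_2d(np.cov(x.T)) + ridge * np.eye(d)
    covs = np.array([base_cov.copy() for _ in range(n_components)])
    phi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = q(z = j | x_i) under the current parameters.
        dens = np.stack([phi[j] * gaussian_pdf(x, means[j], covs[j])
                         for j in range(n_components)], axis=1)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the updates for mu_j, Sigma_j and phi_j, with weights w[i, j] * s[i].
        ws = w * s[:, None]
        totals = ws.sum(axis=0)
        for j in range(n_components):
            means[j] = ws[:, j] @ x / totals[j]
            diff = x - means[j]
            covs[j] = (ws[:, j, None] * diff).T @ diff / totals[j] + ridge * np.eye(d)
        phi = totals / s.sum()
    return means, covs, phi
```

A call such as `weighted_em(x, s, n_components=2)` returns the component means, covariances, and mixing weights; as the text notes, several random restarts (different `seed` values here) are advisable because the problem is nonconvex.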

Test Problems To compare the performance of this algorithm with and without the use of PL techniques, we use a couple of very simple academic problems in two and four dimensions: the Rosenbrock function in two dimensions, given by

$$G_R(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2,$$

and the Woods function in four dimensions, given by

$$G_{\mathrm{Woods}}(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90(x_4 - x_3^2)^2 + (1 - x_3)^2 + 10.1[(1 - x_2)^2 + (1 - x_4)^2] + 19.8(1 - x_2)(1 - x_4).$$

For the Rosenbrock, the optimum value of 0 is achieved at x = (1, 1), and for the Woods problem, the optimum value of 0 is achieved at x = (1, 1, 1, 1).

Application of PL Techniques As mentioned above, there are many PL techniques beyond regularization that are designed to optimize the trade-off between bias and variance. So having cast the solution of $\arg\min_{q_\theta} E(G)$ as an MCO problem, we can apply those other PL techniques instead of (or in addition to) entropic regularization. This should improve the performance of our MCO algorithm, for the exact same reason that using those techniques to trade off bias and variance improves performance in PL. We briefly mention some of those alternative techniques here.

The overall MCO algorithm is broadly described in Algorithm . For the Woods problem, samples of x are drawn from the updated $q_\theta$ at each iteration, and for the Rosenbrock, samples. For comparing various methods and plotting purposes, , samples of G(x) are drawn to evaluate $E_{q_\theta}[G(x)]$. Note: in an actual optimization, we will not be drawing these test samples! All the performance results in Fig. are based on runs of the PC algorithm, randomly initialized each time. The sample mean performance across these runs is plotted along with % confidence intervals for this sample mean (shaded regions).

Algorithm Overview of pq minimization using Gaussian mixtures
1: Draw uniform random samples on X
2: Initialize regularization parameter β
3: Compute G(x) values for those samples
4: repeat
5:   Find a mixture distribution $q_\theta$ to minimize sampled pq KL distance
6:   Sample from $q_\theta$
7:   Compute G(x) for those samples
8:   Update β
9: until Termination
10: Sample final $q_\theta$ to get solution(s).

Cross-Validation for Regularization: We note that we are using regularization to reduce variance, but that regularization introduces bias. As is done in PL, we use standard k-fold cross-validation to trade off this bias and variance. We do this by partitioning the data into k disjoint sets. The held-out data for the ith fold is just the ith partition, and the held-in data is the union of all other partitions. First, we “train” the regularized algorithm on the held-in data $D_t$ to get an optimal set of parameters $\theta^\star$, then “test” this $\theta^\star$ by considering unregularized performance on the held-out data $D_v$. In our context, “training” refers to finding optimal parameters by KL distance minimization using the held-in data, and “testing” refers to estimating $E_{q_\theta}[G(x)]$ on the held-out data using the following formula (Robert & Casella, ).

$$\hat{g}(\theta) = \frac{\sum_{D_v} \dfrac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})}}{\sum_{D_v} \dfrac{q_\theta(x^{(i)})}{h(x^{(i)})}}.$$

We do this for several values of the regularization parameter β in the interval $k_1\beta < \beta < k_2\beta$, and choose the one that yields the best held-out performance, averaged over all folds. For our experiments, $k_1$ = ., $k_2$ = , and we use five equally-spaced values in this interval. Having found the best regularization parameter in this range, we then use all the data to minimize KL distance using this optimal value of β. Note that all cross-validation is done without any additional evaluations of G(x). Cross-validation for β in PC is similar to optimizing the annealing schedule in simulated annealing. This “auto-annealing” is seen in Fig. a, which shows the variation of β with iterations of the Rosenbrock problem. It can be seen that the value of β sometimes decreases from one iteration to the next. This can never happen in any kind of “geometric annealing schedule,” $\beta \leftarrow k_\beta \beta$, $k_\beta > 1$, of the sort that is often used in most algorithms in the literature. In fact, we ran trials of this algorithm on the Rosenbrock and then computed a best-fit geometric variation for β, that is, a nonlinear least squares fit to the variation of β, and a linear least squares fit to the variation of log(β). These are shown in Fig. c and d. As can be seen, neither is a very good fit. We then ran trials of the algorithm with the fixed update rule obtained by best-fit to log(β), and found that the adaptive setting of β using cross-validation performed an order of magnitude better, as shown in Fig. e.
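The cross-validation step for β can be sketched for the single-Gaussian case as follows. This is a simplified illustration under assumptions, not the experimental setup of the article: samples are drawn uniformly on $[-2, 2]^2$ (so $h$ is the constant 1/16), the candidate β values are an arbitrary small list rather than five values in the interval described above, “training” uses the closed-form weighted Gaussian fit, and a held-out score is set to infinity when the fitted Gaussian puts no numerical mass on the held-out points. All function names are hypothetical.

```python
import numpy as np

def rosenbrock(x):
    return 100.0 * (x[:, 1] - x[:, 0] ** 2) ** 2 + (1.0 - x[:, 0]) ** 2

def gaussian_pdf(x, mean, cov):
    d = x.shape[1]
    diff = x - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def fit_gaussian(x, G, h, beta, ridge=1e-6):
    """'Training': closed-form weighted fit with s_i = exp(-beta G_i) / h_i."""
    s = np.exp(-beta * (G - G.min())) / h      # constant factor cancels in mu and cov
    mu = s @ x / s.sum()
    diff = x - mu
    cov = (s[:, None] * diff).T @ diff / s.sum() + ridge * np.eye(x.shape[1])
    return mu, cov

def held_out_estimate(x, G, h, mu, cov):
    """'Testing': self-normalized importance-sampling estimate of E_q[G]."""
    ratio = gaussian_pdf(x, mu, cov) / h
    total = ratio.sum()
    return float(np.sum(ratio * G) / total) if total > 0 else float("inf")

def choose_beta(x, G, h, betas, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for beta in betas:
        fold_scores = []
        for v in folds:
            t = np.setdiff1d(np.arange(len(x)), v)          # held-in indices
            mu, cov = fit_gaussian(x[t], G[t], h[t], beta)
            fold_scores.append(held_out_estimate(x[v], G[v], h[v], mu, cov))
        scores.append(np.mean(fold_scores))
    scores = np.array(scores)
    return betas[int(np.argmin(scores))], scores

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=(400, 2))
G = rosenbrock(x)
h = np.full(len(x), 1.0 / 16.0)
betas = [1e-3, 1e-2, 1e-1, 1.0]
best_beta, scores = choose_beta(x, G, h, betas)
print(best_beta, scores)
```

No new evaluations of `G` are needed during cross-validation: every fold reuses values already in the data set, which matches the remark in the text.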

[Figure: eight panels (a–h), each plotted against iteration, with vertical axes log(β) or log[E(G)]. Panel titles: “Cross-validation for β: log(β) History”; “Cross-validation for β: log[E(G)] History”; “Least-squares Fit to β”; “Least-squares Fit to log(β)”; “Cross-validation for Regularization: Woods Problem”; “Cross-validation for Model-selection: 2-D Rosenbrock”; “Bagging: Noisy Rosenbrock”; “Model Selection Methods: Noisy Rosenbrock”. Legends: best-fit β vs. cross-validation for β; single Gaussian vs. mixture model; no bagging vs. bagging; single Gaussian vs. cross-validation vs. stacking. Annotated fit parameters: β0 = 1.809e+00, kβ = 1.548; β0 = 1.240e-03, kβ = 1.832.]

Bias-Variance Trade-offs: Novel Applications. Figure . Various PL techniques improve MCO performance


Cross-Validation for Model Selection: Given a set Θ (sometimes called a model class) to choose θ from, we can find an optimal θ ∈ Θ. But how do we choose the set Θ? In PL, this is done using cross-validation. We choose that set Θ such that $\arg\min_{\theta \in \Theta} \hat{F}(\theta)$ has the best held-out performance. As before, we use that model class Θ that yields the lowest estimate of $E_{q_\theta}[G(x)]$ on the held-out data. We demonstrate the use of this PL technique for minimizing the Rosenbrock problem, which has a long curved valley that is poorly approximated by a single Gaussian. We use cross-validation to choose among Gaussian mixtures with up to four components. The improvement in performance is shown in Fig. d.

Bagging: In bagging (Breiman, a), we generate multiple data sets by resampling the given data set with replacement. These new data sets will, in general, contain replicates. We “train” the learning algorithm on each of these resampled data sets, and average the results. In our case, we average the $q_\theta$ obtained by our KL divergence minimization on each data set. PC works even on stochastic objective functions, and on the noisy Rosenbrock, we implemented PC with bagging by resampling ten times, and obtained significant performance gains, as seen in Fig. g.

Stacking: In bagging, we combine estimates of the same learning algorithm on different data sets generated by resampling, whereas in stacking (Breiman, b; Smyth & Wolpert, ), we combine estimates of different learning algorithms on the same data set. These combined estimates are often better than any of the single estimates. In our case, we combine the $q_\theta$ obtained from our KL divergence minimization algorithm using multiple models Θ. Again, Fig. h shows that cross-validation for model selection performs better than a single model, and stacking performs slightly better than cross-validation.
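The bagging step described above (resample with replacement, fit on each resample, average the fitted densities) can be sketched as follows. This is only an illustration: it substitutes a plain maximum-likelihood Gaussian fit for the KL divergence minimization used in PC, and the function names, sample sizes, and data are assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def bagged_density(data, n_bags=10, seed=0):
    """Fit one Gaussian per bootstrap resample and average the densities."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_bags):
        # Resampling with replacement: the new data set generally has replicates.
        resample = rng.choice(data, size=len(data), replace=True)
        # Stand-in for the "training" step (here: maximum-likelihood Gaussian).
        fits.append((resample.mean(), resample.std()))

    def q_bagged(x):
        x = np.asarray(x, dtype=float)
        return np.mean([gaussian_pdf(x, mu, s) for mu, s in fits], axis=0)

    return q_bagged

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=200)
q = bagged_density(data, n_bags=10)   # ten resamples, as in the text

# The average of normalized densities is itself a normalized density.
grid = np.linspace(-10.0, 10.0, 2001)
mass = q(grid).sum() * (grid[1] - grid[0])
print(mass)
```

Stacking would differ only in what is averaged: fits of different model classes on the same data, combined with learned weights rather than a uniform average.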

Conclusions
The conventional goal of reducing bias plus variance has interesting applications in a variety of fields. In straightforward applications, bias-variance trade-offs can decrease the MSE of estimators, reduce the generalization error of learning algorithms, and so on. In this article, we described a novel application of bias-variance trade-offs: we placed bias-variance trade-offs in the context of MCO, and discussed the need for higher moments in the trade-off, such as a bias-variance-covariance trade-off. We also showed a way of applying just a bias-variance trade-off, as used in Parametric Learning, to improve the performance of MCO algorithms.

Recommended Reading
Angluin, D. (). Computational learning theory: Survey and selected bibliography. In Proceedings of the twenty-fourth annual ACM symposium on theory of computing. New York: ACM.
Berger, J. O. (). Statistical decision theory and Bayesian analysis. New York: Springer.
Breiman, L. (a). Bagging predictors. Machine Learning, (), –.
Breiman, L. (b). Stacked regression. Machine Learning, (), –.
Buntine, W., & Weigend, A. (). Bayesian back-propagation. Complex Systems, , –.
Ermoliev, Y. M., & Norkin, V. I. (). Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR--. International Institute for Applied Systems Analysis, Austria.
Lepage, G. P. (). A new algorithm for adaptive multidimensional integration. Journal of Computational Physics, , –.
Mackay, D. (). Information theory, inference, and learning algorithms. Cambridge, UK: Cambridge University Press.
Robert, C. P., & Casella, G. (). Monte Carlo statistical methods. New York: Springer.
Rubinstein, R., & Kroese, D. (). The cross-entropy method. New York: Springer.
Smyth, P., & Wolpert, D. (). Linearly combining density estimators via stacking. Machine Learning, (–), –.
Vapnik, V. N. (). Estimation of dependences based on empirical data. New York: Springer.
Vapnik, V. N. (). The nature of statistical learning theory. New York: Springer.
Wolpert, D. H. (). On bias plus variance. Neural Computation, , –.
Wolpert, D. H., & Rajnarayan, D. (). Parametric learning and Monte Carlo optimization. arXiv:.v [cs.LG].
Wolpert, D. H., Strauss, C. E. M., & Rajnarayan, D. (). Advances in distributed optimization using probability collectives. Advances in Complex Systems, (), –.

Bias-Variance Trade-offs
→Bias-Variance

Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity

Bias-Variance-Covariance Decomposition
The bias-variance-covariance decomposition is a theoretical result underlying →ensemble learning algorithms. It is an extension of the →bias-variance decomposition, for linear combinations of models. The expected squared error of the ensemble $\bar{f}(x)$ of $T$ models from a target $d$ is:

$$E_D\{(\bar{f}(x) - d)^2\} = \overline{\mathrm{bias}}^2 + \frac{1}{T}\,\overline{\mathrm{var}} + \left(1 - \frac{1}{T}\right)\overline{\mathrm{covar}}$$

The error is composed of the average bias of the models, plus a term involving their average variance, and a final term involving their average pairwise covariance. This shows that while a single model has a two-way bias-variance tradeoff, an ensemble is controlled by a three-way tradeoff. This ensemble tradeoff is often referred to as the accuracy-diversity dilemma for an ensemble. See →ensemble learning for more details.
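The decomposition can be checked numerically. The sketch below assumes a uniformly weighted ensemble of T models evaluated at a single input x, with the expectation over training sets D replaced by an average over simulated trials. The function name and the synthetic predictions are illustrative assumptions.

```python
import numpy as np

def bias_var_covar(preds, d):
    """Average bias, variance and pairwise covariance of T models.

    preds has shape (n_trials, T): row r holds the T model outputs at one
    fixed input x for the r-th simulated training set. d is the target.
    """
    n_models = preds.shape[1]
    avg_bias = (preds.mean(axis=0) - d).mean()
    cov = np.cov(preds, rowvar=False, bias=True)   # T x T, normalized by n_trials
    avg_var = np.trace(cov) / n_models
    avg_covar = (cov.sum() - np.trace(cov)) / (n_models * (n_models - 1))
    return avg_bias, avg_var, avg_covar

rng = np.random.default_rng(0)
T, n_trials, d = 5, 2000, 1.0
shared = rng.normal(size=(n_trials, 1))             # induces positive covariance
preds = d + 0.3 + 0.5 * shared + rng.normal(size=(n_trials, T))

avg_bias, avg_var, avg_covar = bias_var_covar(preds, d)
ensemble_mse = np.mean((preds.mean(axis=1) - d) ** 2)
decomposed = avg_bias ** 2 + avg_var / T + (1.0 - 1.0 / T) * avg_covar
print(ensemble_mse, decomposed)
```

With empirical moments computed from the same trials, the two printed values agree up to floating-point error. As T grows, the variance term shrinks while the covariance term does not, which is why diversity among the models matters.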

Bilingual Lexicon Extraction

Bilingual lexicon extraction is the task of automatically identifying terms in a first language and terms in a second language that are translations of one another. In this context, a term can be either a single word or an expression composed of several words, the full meaning of which cannot be derived compositionally from the meaning of the individual words. Bilingual lexicon extraction is itself a form of →cross-lingual text mining and is an essential preliminary step in many approaches for performing other →cross-lingual text mining tasks.

Binning
→Discretization

Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity Wulfram Gerstner Brain Mind Institute, Lausanne EPFL, Switzerland

Synonyms Correlation-based learning; Hebb rule; Hebbian learning

Definition
The brain of humans and animals consists of a large number of interconnected neurons. Learning in biological neural systems is thought to take place by changes in the connections between these neurons. Since the contact points between two neurons are called synapses, the change in the connection strength is called synaptic plasticity. The mathematical description of synaptic plasticity is called a (biological) learning rule. Most of these biological learning rules can be categorized in the context of machine learning as unsupervised learning rules, and the remaining ones as reward-based or reinforcement learning. The Hebb rule is an example of an unsupervised correlation-based learning rule formulated on the level of neuronal firing rates. Spike-timing-dependent plasticity (STDP) is an unsupervised learning rule formulated on the level of spikes. Modulation of learning rates in a Hebb rule or STDP rule by a diffusive signal carrying reward-related information yields a biologically plausible form of a reinforcement learning rule.

Motivation and Background
Humans and animals can adapt to environmental conditions and learn new tasks. Learning becomes measurable by changes in behavior: humans and animals get better at seeing and distinguishing visual objects with experience; animals can learn to go to a target location; humans can memorize a list of words and recall the items days later. How learning is implemented in the biological substrate is only partially known. The brain consists of billions of neurons. Each neuron has long wire-like extensions and makes contacts with thousands of other neurons. This network of neurons is not fixed but constantly changes. Connections can be formed or can disappear, and existing connections can be strengthened or weakened. Neuroscientists have shown in numerous experiments that changes can be induced by stimulating neuronal activity in an appropriate fashion. Moreover, changes in synaptic connections that have been induced in one or a few seconds can persist for hours or days, an effect called long-term potentiation (LTP) or long-term depression (LTD) of synapses.

The question arises of whether such long-lasting changes in connections are useful for learning. To answer this question, research in theoretical and computational neuroscience needs to solve two problems: first, develop a compact but realistic description of the phenomenon of synaptic plasticity observed in biology, i.e., extract learning rules from the biological data; and second, study the functional consequences of these learning rules. An important insight from experiments on LTP is that the activation of a synaptic connection alone does not lead to a long-lasting change; however, if the activation of the synapse by presynaptic signals is combined with some activation of the postsynaptic neuron, then a long-lasting change of the synapse may occur. The coactivation of presynaptic and postsynaptic neurons as a condition for learning is the key ingredient of Hebbian learning rules. Here, activation of the presynaptic neuron means that it fires one or several action potentials; activation of the postsynaptic neuron can be represented by high firing rates, a few well-timed action potentials, or input from other neurons that leads to an increase in the membrane voltage.

Structure of the Learning System The Hebb Rule

Hebbian learning rules are local, i.e., they depend only on the state of the presynaptic and postsynaptic neurons plus possibly the current value of the synaptic weight itself. Let $w_{ij}$ denote the weight between a presynaptic neuron $j$ and a postsynaptic neuron $i$, and let us describe the activity (e.g., the firing rate) of each neuron by a continuous variable $\nu_j$ and $\nu_i$, respectively. Mathematically, we may therefore write for a local learning rule

$$\frac{d}{dt} w_{ij} = F(w_{ij}; \nu_i, \nu_j)$$

where $F$ is an unknown function. In addition to locality, Hebbian learning requires some kind of cooperation or correlation between the activity of the presynaptic neuron and that of the postsynaptic neuron. At the moment we restrict ourselves to the requirement of simultaneous activity of presynaptic and postsynaptic neurons. Since $F$ is a function of the rates $\nu_i$ and $\nu_j$, we may expand $F$ about $\nu_i = \nu_j = 0$. An expansion to second order of the rates yields

$$\frac{d}{dt} w_{ij}(t) \approx c_0(w_{ij}) + c_1^{\mathrm{pre}}(w_{ij})\,\nu_j + c_1^{\mathrm{post}}(w_{ij})\,\nu_i + c_{\mathrm{corr}}(w_{ij})\,\nu_i \nu_j + c_2^{\mathrm{post}}(w_{ij})\,\nu_i^2 + c_2^{\mathrm{pre}}(w_{ij})\,\nu_j^2 + O(\nu^3).$$

Here, $\nu_i$ and $\nu_j$ are functions of time, i.e., $\nu_i(t)$ and $\nu_j(t)$, and so is the weight $w_{ij}$. The bilinear term $\nu_i(t)\,\nu_j(t)$ is sensitive to the instantaneous correlations between presynaptic and postsynaptic activities. It is this term that makes Hebbian learning a useful concept. The simplest implementation of Hebbian plasticity would be to require $c_{\mathrm{corr}} > 0$ and set all other parameters in the expansion above to zero:

$$\frac{d}{dt} w_{ij} = c_{\mathrm{corr}}(w_{ij})\,\nu_i \nu_j.$$

This rule, with a fixed parameter $c_{\mathrm{corr}} > 0$, is the prototype of Hebbian learning. However, since the activity variables $\nu_i$ and $\nu_j$ are always positive, such a rule will eventually lead to an increase of all weights in a network. Hence, some of the other terms (e.g., $c_0$ or $c_1^{\mathrm{pre}}$) need to have a negative coefficient to make Hebbian learning stable. In passing we note that a learning rule with $c_{\mathrm{corr}} < 0$ is usually called anti-Hebbian.

Oja’s rule. A particularly interesting case is a model with coefficients $c_{\mathrm{corr}} > 0$ and $c_2^{\mathrm{post}} < 0$, since it guarantees the normalization of the set of weights $w_{i1}, \ldots, w_{iN}$ converging onto the same postsynaptic neuron $i$.

BCM rule. The Bienenstock–Cooper–Munro learning rule (also called BCM rule) with

$$\frac{d}{dt} w_{ij} = a(w_{ij})\,\Phi(\nu_i - \vartheta)\,\nu_j$$

where $\Phi$ is some nonlinear function with $\Phi(0) = 0$ is a special case of the second-order expansion above. The parameter $\vartheta$ depends on the average firing rate.

Temporally asymmetric Hebbian learning. In the Taylor expansion above we focused on instantaneous correlations. More generally, we can use a Volterra expansion so as to also include temporal correlations with nonzero time lag. With the additional assumption that changes are instantaneous, a Volterra expansion generates terms of the form

$$\frac{d}{dt} w_{ij} \propto \int_0^\infty \left[ W_+(s)\,\nu_i(t)\,\nu_j(t - s) + W_-(s)\,\nu_j(t)\,\nu_i(t - s) \right] ds$$

with some functions $W_+$ and $W_-$. For reasons of causality, $W_+$ and $W_-$ must vanish for $s < 0$. Since $W_+(s) \neq W_-(s)$, learning is asymmetric in time, so that learning rules of this form are called temporally asymmetric Hebbian learning. In the special case $W_+(s) = -W_-(s)$, we have antisymmetric Hebbian learning. The functions $W_+$ and $W_-$ may depend on the present weight value.

STDP rule. STDP is a form of Hebbian learning with increased temporal resolution. In contrast to rate-based Hebb models, neuronal activity is described by the firing times of the neuron, i.e., the moments when the presynaptic and postsynaptic neurons emit action potentials. Let $t_j^f$ denote the $f$th spike of the presynaptic neuron $j$ and $t_i^n$ the $n$th spike of the postsynaptic neuron $i$. The weight change in an STDP rule depends on the exact timing of presynaptic and postsynaptic spikes

$$\frac{d}{dt} w_{ij} = \sum_n \sum_f \left[ A(w_{ij}; t - t_j^f)\,\delta(t - t_i^n) + B(w_{ij}; t - t_i^n)\,\delta(t - t_j^f) \right]$$

where $A$ and $B$ are some real-valued functions with $A(w_{ij}; x) = B(w_{ij}; x) = 0$ for $x < 0$. Thus, at the moment of a postsynaptic spike the synaptic weight is updated by an amount that depends on the time $t_i^n - t_j^f$ since a previous presynaptic spike $t_j^f$. Similarly, at the moment of a presynaptic spike the synaptic weight is updated by an amount that depends on the time $t_j^f - t_i^n$ since a previous postsynaptic spike $t_i^n$. The dependence on the present value $w_{ij}$ can be used to keep the weight in a desired range $0 < w_{ij} < w_{\max}$. A standard choice for the functions $A$ and $B$ is $A(w_{ij}; t - t_j^f) = A_+(w_{ij}) \exp[-(t - t_j^f)/\tau_+]$ for $t - t_j^f > 0$ and zero otherwise. Similarly, $B(w_{ij}; t - t_i^n) = B_-(w_{ij}) \exp[-(t - t_i^n)/\tau_-]$ for $t - t_i^n > 0$ and zero otherwise. Here, $\tau_+$ and $\tau_-$ are time constants in the range of – ms. The case $A_+(x) = (w_{\max} - x)\,c_+$ and $B_-(x) = -c_- x$ is called soft bounds. The choice $A_+(x) = c_+ \Theta(w_{\max} - x)$ and $B_-(x) = -c_- \Theta(x)$, where $\Theta$ denotes the Heaviside step function, is called hard bounds. Here, $c_+$ and $c_-$ are positive constants. The term proportional to $A_+$ causes potentiation (weight increase), the one proportional to $B_-$ causes depression (weight decrease) of synapses. Note that the STDP rule above can be interpreted as a spike-based form of temporally asymmetric Hebbian learning.

Functional Consequences of Hebbian Learning
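As a concrete illustration of how a Hebbian rule adapts a neuron to its input statistics, the sketch below simulates Oja's rule for a linear rate neuron, using the discrete update Δw_j = η ν_i (ν_j − ν_i w_j). This corresponds to c_corr = η > 0 together with a negative coefficient on the ν_i² term. It illustrates the standard result, discussed in the next paragraph, that with zero-mean input the weight vector aligns with the dominant principal component. The learning rate, input covariance, and function name are illustrative assumptions.

```python
import numpy as np

def oja_train(inputs, eta=0.005, epochs=10, seed=0):
    """Train the weights of one linear neuron with Oja's rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=inputs.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in inputs:
            y = w @ x                    # postsynaptic activity (linear neuron)
            w += eta * y * (x - y * w)   # Hebbian term minus normalizing term
    return w

# Zero-mean, correlated 2-D inputs (an idealization: real rates are positive).
rng = np.random.default_rng(1)
cov = np.array([[3.0, 1.5], [1.5, 1.0]])
inputs = rng.multivariate_normal([0.0, 0.0], cov, size=1000)
inputs -= inputs.mean(axis=0)

w = oja_train(inputs)

# Compare with the leading eigenvector of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(inputs, rowvar=False))
top_pc = eigvecs[:, -1]                  # eigh sorts eigenvalues in ascending order
alignment = abs(w @ top_pc) / np.linalg.norm(w)
print(alignment, np.linalg.norm(w))
```

The norm of `w` stays close to one without any explicit renormalization, which is the weight normalization property attributed to Oja's rule above.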

Sensitivity to correlations. All Hebbian learning rules are sensitive to the correlations between the activity of the presynaptic neuron $j$ and that of the postsynaptic neuron $i$. If the activity of the postsynaptic neuron is given by a linear sum of all input rates, i.e., $\nu_i = \gamma \sum_j w_{ij} \nu_j$, then correlations between presynaptic and postsynaptic activities can be traced back to correlations in the input. A particularly clear example of learning driven by correlations in the input is Oja’s learning rule applied to a statistical ensemble of inputs with zero mean. In this case, the postsynaptic neuron becomes sensitive to the dominant principal component of the input ensemble. If the neuron model is nonlinear, Hebbian learning extracts the independent components of the statistical input ensemble. These two examples show that learning by a Hebbian learning rule makes neurons adapt to the statistics of the input. While the condition of zero-mean input is biologically not realistic (because neuronal firing rates are always positive), this condition can be relaxed so that the same result is also applicable to biologically plausible learning rules.

Receptive fields and cortical maps. Neurons in the primary visual cortex of cats and monkeys respond to visual stimuli in a localized region of the visual field. This small sensitive zone is called the receptive field of the neuron. Neighboring neurons normally have very similar receptive fields. The exact location and properties of the receptive field are not fixed, but can be influenced by sensory stimulation. Models of unsupervised Hebbian learning can explain the development of receptive fields and the adaptation of cortical maps to the statistics of the ensemble of stimuli.

Beyond the Hebb rule. Standard models of Hebbian learning are formulated on the level of neuronal firing rates, a graded variable characterizing neuronal activity. However, real neurons communicate by spikes, short electrical pulses or “action potentials” with a rather stereotyped time course. Experiments have shown that the changes of synaptic efficacy depend not only on the mean firing rate of action potentials but also on the relative timing of presynaptic and postsynaptic spikes on the level of milliseconds. This Spike-Timing Dependent Synaptic Plasticity (STDP) can be considered a temporally more precise form of Hebbian learning. The STDP rule indicated above supposes that pairs of spikes (one presynaptic and one postsynaptic action potential) within some time window cause a weight change. However, experimentally it was shown that at least three spikes are necessary (one presynaptic and two postsynaptic spikes). Moreover, the voltage of the postsynaptic neuron matters even in the absence of spikes. In most models of Hebbian learning and STDP, the factors $c_0, c_1^{\mathrm{pre}}, \ldots$ are constant or depend only on the synaptic weight. However, in a biological context the speed of learning is often gated by neuromodulators. Since some of these neuromodulators contain reward-related information, one can think of learning as a three-factor rule where weight changes depend on presynaptic activity, postsynaptic activity, and the presence of a reward-related factor. A prominent neuromodulator linked to reward information is dopamine. Three-factor learning rules fall in the class of reinforcement learning algorithms.
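For reference, the pair-based STDP rule with exponential windows and soft bounds given earlier can be simulated event by event. The following is a minimal sketch. The parameter values (including the 20 ms time constants), the function name, and the spike times are illustrative assumptions, and the sketch deliberately omits the triplet, voltage, and neuromodulatory effects just described.

```python
import math

def stdp_soft_bounds(pre_spikes, post_spikes, w0=0.5, w_max=1.0,
                     c_plus=0.01, c_minus=0.01,
                     tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP with soft bounds; spike times in ms (assumed units)."""
    events = sorted([(t, "pre") for t in pre_spikes] +
                    [(t, "post") for t in post_spikes])
    w = w0
    seen_pre, seen_post = [], []
    for t, kind in events:
        if kind == "post":
            # Potentiation: every strictly earlier presynaptic spike contributes
            # A_+(w) * exp(-(t - t_pre) / tau_+), with A_+(w) = c_+ (w_max - w).
            trace = sum(math.exp(-(t - tp) / tau_plus)
                        for tp in seen_pre if tp < t)
            w += c_plus * (w_max - w) * trace
            seen_post.append(t)
        else:
            # Depression: every strictly earlier postsynaptic spike contributes
            # B_-(w) * exp(-(t - t_post) / tau_-), with B_-(w) = -c_- w.
            trace = sum(math.exp(-(t - tq) / tau_minus)
                        for tq in seen_post if tq < t)
            w -= c_minus * w * trace
            seen_pre.append(t)
    return w

w_pre_first = stdp_soft_bounds(pre_spikes=[10.0], post_spikes=[15.0])
w_post_first = stdp_soft_bounds(pre_spikes=[15.0], post_spikes=[10.0])
print(w_pre_first, w_post_first)
```

A presynaptic spike that precedes a postsynaptic spike raises the weight, while the reverse order lowers it. With soft bounds, the step size shrinks as the weight approaches either end of its range.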

Cross References
→Dimensionality Reduction
→Reinforcement Learning
→Self-Organizing Maps

Recommended Reading
Bliss, T., & Gardner-Medwin, A. (). Long-lasting potentiation of synaptic transmission in the dentate area of unanaesthetized rabbit following stimulation of the perforant path. The Journal of Physiology, , –.
Bliss, T., Collingridge, G., & Morris, R. (). Long-term potentiation: Enhancing neuroscience for years - introduction. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, , –.
Cooper, L., Intrator, N., Blais, B., & Shouval, H. Z. (). Theory of cortical plasticity. Singapore: World Scientific.
Dayan, P., & Abbott, L. F. (). Theoretical neuroscience. Cambridge, MA: MIT Press.
Gerstner, W., & Kistler, W. M. (). Spiking neuron models. Cambridge, UK: Cambridge University Press.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (). A neuronal learning rule for sub-millisecond temporal coding. Nature, , –.
Hebb, D. O. (). The organization of behavior. New York: Wiley.
Lisman, J. (). Long-term potentiation: Outstanding questions and attempted synthesis. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, , –.
Malenka, R. C., & Nicoll, R. A. (). Long-term potentiation–a decade of progress? Science, , –.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (). Regulation of synaptic efficacy by coincidence of postsynaptic AP and EPSP. Science, , –.
Schultz, W., Dayan, P., & Montague, R. (). A neural substrate for prediction and reward. Science, , –.

Biomedical Informatics C. David Page, Sriraam Natarajan University of Wisconsin Medical School, Madison, USA

Introduction
Recent years have witnessed a tremendous increase in the use of machine learning for biomedical applications. This surge in interest has several causes. One is the successful application of machine learning technologies in other fields such as web search, speech and handwriting recognition, agent design, spatial modeling, etc. Another is the development of technologies that enable the production of large amounts of data in the time it used to take to generate a single data point (run a single experiment). A third, more recent development is the advent of Electronic Medical/Health Records (EMRs/EHRs). The drastic increase in the amount of data generated has led biologists and clinical researchers to adopt algorithms that can construct predictive models from large amounts of data. Naturally, machine learning is emerging as a tool of choice.

In this article, we will present a few data types and tasks involving such large-scale biological data, where machine learning techniques have been applied. For each of these data types and tasks, we first present the required background, followed by the challenges involved in addressing the tasks. Then, we present the machine learning techniques that have been applied to these data sets. Finally and most importantly, we present the lessons learned in these tasks. We hope that these lessons will be helpful to researchers who aim to apply machine learning algorithms to biological applications and equip them with useful knowledge when they collaborate with biological scientists. Some of the data types that we present in this work are:

● Gene expression microarrays
● SNPs and genetic data
● Mass spectrometry and other proteomic data
● High-throughput screening data for drug design
● Electronic Medical Records (EMR) and personalized medicine

Some of the key lessons learned from all these data types include the following: (1) We can often do surprisingly well with far more features than data points if there are many highly predictive features (e.g., predicting cancer from microarray data) and if we use methods that are robust to overfitting such as Voted Decision Stumps (Hardin et al., ; Waddell et al., ) (→Ensemble Learning and →Decision Stumps), →Naive Bayes (Golub et al., ; Listgarten et al., ), or Linear Support Vector Machines (SVMs) (see →Support Vector Machine) (Furey et al., ; Hardin et al., ). (2) Bayes Net learning (Friedman, ) (see →Bayesian Methods) often does not give us causality, but →Active Learning and →Time-Series data help if available (Pe’er, Regev, Elidan, & Friedman, ; Ong, Glassner, & Page, ; Tucker, Vinciotti, Hoen, Liu, & Famili, ; Zou & Conzen, ). (3) Multi-relational methods are useful for EMRs or molecular data as the data in these cases are very highly relational (see →Multi-relational Data Mining). (4) There are more important issues than just increasing the accuracy of the learned model on these data sets. Such issues include how the data were created, the comprehensibility of the model (physicians typically want to understand the model that has been learned), and privacy (some data sets contain private information that cannot be posted on public web sites and cannot even be downloaded off site).

The rest of the paper is organized as follows: First we present gene expression microarrays, followed by SNPs and other genetic data. We then present mass spectrometry (MS) and related proteomic data. Next, we present high-throughput screening data for drug design, followed by EMR data and personalized medicine. For each of these data types, we motivate the problem and survey the different machine learning solutions. Finally, we conclude by outlining the lessons learned from all these data types and presenting some interesting and exciting directions for future research.

Gene Expression Microarrays

This data type was presented in detail in AI Magazine (Molla et al., ) and hence we describe it only briefly in this section. We encourage the reader to read Molla et al. () for more details on this data type. Genes are contained in the DNA of an organism. The mechanism by which proteins are produced from their corresponding genes is a two-step process. The first step is the transcription of a gene into a messenger RNA (mRNA), and in the second step, called translation, a protein is built using mRNA as a blueprint. One property that DNA and RNA have in common is that each is a chain of chemicals called bases. In the case of DNA, these bases are Adenine, Cytosine, Guanine, and Thymine, commonly referred to as A, C, G, and T, respectively. RNA has the same set of four bases, except that in place of Thymine, RNA has Uracil, commonly referred to as U.

An important characteristic of DNA and RNA is complementarity, that is, each base only binds well with its complement: A with T (or U) and G with C. As a result of complementarity, a strand of either DNA or RNA has a strong affinity toward what is known as its reverse complement, which is a strand of either DNA or RNA that has bases exactly complementary to the original strand. Complementarity is central to the processes of replication of the DNA and transcription. In addition, complementarity can be used to detect specific sequences of bases within strands of DNA and RNA. This is done by first synthesizing a probe, a piece of DNA that is the complement of a sequence that one wants to detect, and then introducing this probe to a solution containing the genetic material (DNA or RNA) to be searched. This solution of genetic material is called the sample. In theory, the probe will bind to the sample if and only if the probe finds its complement in the sample (in reality, this process is often imperfect). The act of binding between a sample and probe is called hybridization. Prior to the experiment, a biologist labels the probe using a fluorescent flag. After the hybridization experiment, one can easily scan to see if the probe has hybridized to its reverse complement in the sample. This allows the molecular biologist to determine the presence or absence of the sequence in the sample.

Gene Chips

DNA probe technology has been adapted for detection of tens of thousands of sequences simultaneously. This has become possible due to the device called a microarray or gene chip, the working of which is illustrated in Fig. . When using the chips it is more common to label (luminescently) the samples than the probe. Thousands of copies of this labeled sample are spread across the probe, followed by washing away any copies that do not remain bound. Since the probes are attached at specific locations on the chip, if a labeled sample is detected at any position in the chip, the probe that is hybridized to its complement can be easily determined. The most common use of these gene chips is to measure the expression levels of various genes in the organism. Probes are typically on the order of -bases long, whereas samples are usually about times as long, with a large variation due to the process that breaks up long sequences of RNA into small samples (Molla et al., ).

To understand the biology of an organism, say to understand human biology in order to design new drugs, lower blood pressure, or cure diabetes, it is necessary to understand the degree to which different genes get expressed as proteins under different conditions and in different cell types. It is much easier to estimate the amount of mRNA for a gene than the protein-production rate. Microarrays provide the measurement of RNAs corresponding to the given gene rather than the amounts of protein.

[Figure labels: Labeled sample (RNA); Hybridization; Probes (DNA); Gene chip surface]
Biomedical Informatics. Figure . Hybridization of sample to probe

In brief, experiments with the microarrays are performed as follows: As can be seen from the figure, probes are DNA strands attached to the gene chip surface. A typical probe length is bases (i.e., letters from A, C, G, T to represent a gene). There may be several different subsequences of these bases. Then the mRNA (which is the labeled sample) is passed over the microarrays, and the mRNA will bind to the complementary DNA corresponding to the gene better than to the other DNA strings. Then the fluorescence levels of the different gene chip segments are measured, which in turn measures the amount of mRNA on that surface. This mRNA measurement serves as a surrogate for the expression level of the gene.
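The complementarity principle behind probe design can be made concrete with a short sketch. It models only the idealized case described earlier, in which a DNA probe binds if and only if its exact reverse complement occurs in the sample. Real hybridization is noisy and tolerates mismatches. The function names and sequences are illustrative assumptions.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def probe_hybridizes(probe_dna, sample_rna):
    """Idealized test: does the RNA sample contain the probe's reverse complement?

    The RNA is first rewritten in the DNA alphabet (U -> T) so that the two
    strings can be compared directly. Perfect matching is assumed.
    """
    sample_as_dna = sample_rna.replace("U", "T")
    return reverse_complement(probe_dna) in sample_as_dna

probe = "GCTT"                        # hypothetical probe sequence
print(reverse_complement(probe))      # AAGC
print(probe_hybridizes(probe, "CCAAGCUU"))
print(probe_hybridizes(probe, "CCCCCCCC"))
```

On a gene chip, a test of this kind is in effect carried out in parallel for every probe location, with the measured fluorescence standing in for the yes/no answer.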

Machine Learning for Microarrays
The data from microarrays (gene chips) have been analyzed and used by machine learning researchers in two different ways:

1. Data points are genes. This is the case where the examples are genes while the features are the samples (measured expression levels of a single gene under a variety of conditions). The goal of this view is to categorize new genes based on the current set of examples.
2. Data points are samples (e.g., patients). This is the case where the examples are patients and the features are the measured expression levels of genes under one condition.

The problems have been approached in two different ways. In the →Unsupervised Learning approach, the goal is to cluster the genes according to their expression levels or to cluster the patients (samples) based on their gene expression levels, or both. Hierarchical clustering is especially widely applied. As one of many examples, see Perou et al. (). In the →Supervised Learning setting, the class labels are the category of the genes or the samples. The latter is the more common supervised task, with each sample being mRNA from a different patient (with the same cell type from each patient) or an organism under different conditions; the goal is to learn a model that accurately predicts the class based on the features. The features could be the patient’s expression values for each gene, while the class labels might be the patient’s disease state. We discuss this task further in the subsequent paragraphs.

Yet another widely studied supervised learning task is to predict cancer vs. normal for a wide variety of cancer types. One of the significant lessons learned is that it is easy to predict cancer vs. normal in patients based on the gene expression by several machine learning techniques, largely regardless of the type of cancer. The main reason for this is that if cancer is present, many genes in the cancer cells “go haywire” and hence are very predictive of the cancer. The primary challenge in this prediction problem is the noise in the data (impure RNA, cross-hybridization, etc.). Other related tasks that have been addressed include distinguishing related cancer types and distinguishing cancer from a related benign condition. An early success was the work of Golub et al. (), distinguishing acute myeloid leukemia and acute lymphoblastic leukemia (ALL). They used a weighted voting algorithm similar to Naive Bayes and achieved a very high accuracy. This result has been repeated on this data with many other machine learning (ML) approaches. Other work examined multiple myeloma vs. a benign condition. This task is challenging because the benign condition is very similar to the cancer, and hence the machine learning algorithms had a difficult time predicting accurately. We refer to Hardin et al. () for more details on the experiments.

Another important lesson for machine learning researchers from this data type is that biologists often do not want one predictive model, but a rank-ordered list of genes that a biologist can explore further with additional lab tests on certain genes. Hence, there is a need to present a small set of highly interesting genes to perform follow-up experiments on. Toward this end, statisticians have used mutual information or a t-test to rank the genes.
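The ranking of genes by a two-sample t-test can be sketched as follows. The sketch computes Welch's t-statistic per gene (a common variant that does not assume equal variances) and sorts genes by its absolute value. It stops short of converting the statistics to p-values. The function name, array layout, and synthetic data are illustrative assumptions.

```python
import numpy as np

def rank_genes_by_t(expr, is_cancer, top=10):
    """Rank genes by |t| between cancer and normal samples.

    expr has shape (n_samples, n_genes); is_cancer is a boolean array of
    length n_samples. Returns the indices of the top genes and their
    t-statistics, most extreme first.
    """
    cancer, normal = expr[is_cancer], expr[~is_cancer]
    std_err = np.sqrt(cancer.var(axis=0, ddof=1) / len(cancer) +
                      normal.var(axis=0, ddof=1) / len(normal))
    t_stat = (cancer.mean(axis=0) - normal.mean(axis=0)) / std_err
    order = np.argsort(-np.abs(t_stat))[:top]
    return order, t_stat[order]

# Synthetic example: 40 samples, 50 genes, gene 3 over-expressed in cancer.
rng = np.random.default_rng(0)
expr = rng.normal(size=(40, 50))
is_cancer = np.arange(40) < 20
expr[is_cancer, 3] += 5.0

order, t_values = rank_genes_by_t(expr, is_cancer, top=5)
print(order, t_values)
```

Only one gene is truly differential here, yet the remaining entries in the list still have nonzero scores purely by chance, which is the issue the following discussion of multiple comparisons addresses.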
When using a t-test, they check whether the mean expression levels differ under the two conditions (cancer vs. normal), yielding a p-value. The issue is that when working with a large number of genes (typically in the order of ,), some genes will have a low p-value purely by chance. This is known as the “multiple comparisons problem.” One solution is to apply a Bonferroni correction (multiply the p-values by the number of genes), but this can be a drastic step and may eliminate all the genes. There are other methods, such as the false discovery rate (Storey & Tibshirani, ), which uses the notion of q-values. We do not go into the details of this method, but our key recommendation is that such a method should be used along with the supervised learning method, as the biological collaborators might be interested in the ranking of genes.

One of the most important research directions for the use of microarray data lies in prognosis and treatment. The features are the same as those for diagnosis, but the class value becomes life expectancy for a given treatment (or a positive response vs. no response to a given treatment). The goal is to use the person’s genes to make these predictions. An example is the breast cancer prognosis study (Van’t Veer et al., ), where the goal is to predict good prognosis (no metastasis within years of initial diagnosis) vs. poor prognosis. They used an ensemble of voting algorithms and obtained very good results. Nevertheless, an important lesson learned from this experiment and others was that, when using cross-validation, parameters must be tuned and features selected independently on each fold of the cross-validation. There can be a large number of features, and it is natural to want to reduce the size of the data set before working with it. But reducing the number of features by some measure of correlation with the class, such as information gain, computed on the entire data set means that on each fold of cross-validation, information has leaked from the labeled test set into the training process: labels of test cases were used to eliminate many features from the training set. Hence, selecting features by looking at the entire data set can partially negate the effect of cross-validation, sometimes yielding accuracy estimates that are more than % points overly optimistic. 
Hence the entire training process of selecting features, tuning parameters, and learning a model must be repeated for every fold in cross-validation by looking only at the training data for that fold. An important use of microarrays for prognosis and therapy is in the area of predictive personalized medicine (PPM). While we present the idea of PPM later in the paper, it must be mentioned that combining gene expression data with clinical trials of the patients to recommend the best treatment for the patients is a very exciting problem with promising impact in the area of PPM.
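The leakage described above can be made concrete with a small sketch. It assumes pure-noise data (labels unrelated to every feature), a simple mean-difference feature score, and a nearest-centroid classifier. All of these, along with the function names, are illustrative choices and not the methods used in the studies cited. With selection done once on all the data, the cross-validated accuracy is typically well above chance even though there is nothing to learn; with selection repeated inside each fold, it typically stays near 0.5.

```python
import random
import statistics


def top_k_features(X, y, rows, k):
    """Indices of the k features whose class means differ most on the given rows."""
    def score(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(statistics.mean(pos) - statistics.mean(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_predict(X, y, train_rows, features, x):
    """Nearest-centroid prediction for x using only the selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [X[i] for i in train_rows if y[i] == label]
        centroid = [statistics.mean(m[j] for m in members) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validate(X, y, folds, k, select_per_fold):
    """Cross-validated accuracy with feature selection done per fold or once globally."""
    all_rows = [i for fold in folds for i in fold]
    # Global selection sees the labels of every test case: this is the leak.
    leaked = top_k_features(X, y, all_rows, k)
    correct = 0
    for fold in folds:
        train = [i for i in all_rows if i not in fold]
        features = top_k_features(X, y, train, k) if select_per_fold else leaked
        correct += sum(
            centroid_predict(X, y, train, features, X[i]) == y[i] for i in fold
        )
    return correct / len(all_rows)


# Pure-noise data (hypothetical): an honest estimate should hover around 0.5.
rng = random.Random(0)
n_samples, n_features = 40, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]
folds = [list(range(start, n_samples, 5)) for start in range(5)]

leaky = cross_validate(X, y, folds, k=10, select_per_fold=False)
honest = cross_validate(X, y, folds, k=10, select_per_fold=True)
print(f"selection on all data: {leaky:.2f}; selection inside each fold: {honest:.2f}")
```

The same discipline applies to parameter tuning: anything chosen by looking at labels belongs inside the fold loop.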


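[Figure: a Bayes net over Gene A, Gene B, Gene C, and Gene D, with arcs A → B, A → C, B → D, and C → D, and the conditional probability tables below.]

P(A) = 0.2

A    P(B)
T    0.9
F    0.1

A    P(C)
T    0.8
F    0.1

B    C    P(D)
T    T    0.9
T    F    0.2
F    T    0.3
F    F    0.1

[Panel text from the causality figure: “Problem: Not Causality. A is a good predictor of B. But is A regulating B? Ground truth might be:”]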

Biomedical Informatics. Figure . A simple Bayes net. The actual learning task typically involves thousands of variables

Bayesian Networks for Regulatory Pathways: Bayesian Networks have been one of the successful machine learning methods used for the analysis of microarray data. Recall that a Bayes net is a directed acyclic graph, such as the one shown in Fig. , that defines a joint distribution over the variables using a set of conditional distributions. Friedman and Halpern (Friedman & Halpern, ) were the first to use Bayes nets for the microarray data type. In particular, the problem they considered was finding regulatory pathways in genes. This problem can be posed as a supervised learning task as follows:

● Given: A set of microarray experiments for a single organism under different conditions.
● Do: Learn a graphical model that accurately predicts expression of some genes in terms of others.
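To make the phrase “defines a joint distribution using a set of conditional distributions” concrete, the following sketch multiplies out the conditional probability tables of the four-gene example net shown in the figure above. The structure (A is the parent of B and C, which are the parents of D) is read off those tables. The helper names are illustrative only.

```python
from itertools import product

# Conditional probability tables from the example figure.
# Each entry is the probability that the gene is "on" (True).
P_A = 0.2
P_B_GIVEN_A = {True: 0.9, False: 0.1}
P_C_GIVEN_A = {True: 0.8, False: 0.1}
P_D_GIVEN_BC = {
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.3,
    (False, False): 0.1,
}


def bernoulli(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) P(b|a) P(c|a) P(d|b,c)."""
    return (
        bernoulli(P_A, a)
        * bernoulli(P_B_GIVEN_A[a], b)
        * bernoulli(P_C_GIVEN_A[a], c)
        * bernoulli(P_D_GIVEN_BC[(b, c)], d)
    )


states = list(product([True, False], repeat=4))
total = sum(joint(*s) for s in states)                         # sums to 1
p_d = sum(joint(a, b, c, d) for a, b, c, d in states if d)     # marginal P(D)
p_b = sum(joint(a, b, c, d) for a, b, c, d in states if b)     # marginal P(B)
p_a_given_b = sum(joint(a, b, c, d) for a, b, c, d in states if a and b) / p_b

print(f"P(D) = {p_d:.3f}, P(B) = {p_b:.3f}, P(A | B) = {p_a_given_b:.3f}")
```

Note that B is informative about A (P(A | B) is about 0.69 against a prior of 0.2) even though the arc points from A to B, which is one way to see that an arc encodes predictiveness rather than a direction of regulation.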

Friedman and Halpern showed that, using statistical methods, a Bayes net representing the observations (expression levels of different genes) can be learned automatically. A main advantage of Bayes nets is that they can (potentially) provide insight into the interaction networks within cells that regulate the expression of genes. But one has to exercise caution in interpreting the arcs of a learned Bayes net as representing causality. For example, in Fig. , one might interpret the network to mean that gene A causes gene B and gene C to be expressed, in turn influencing gene D. Note, however, that the Bayes net in this case denotes only correlation and not causality; that is, the direction of an

[Figure: alternative ground-truth structures relating genes A, B, and C, followed by the note “Or a more complicated variant.”]

Biomedical Informatics. Figure . Why a learned Bayesian network may not be representing regulation of one gene by another

arc merely represents the fact that one variable is a good predictor of the other, as illustrated in Fig. . One possible method of learning causality is to use knock-out methods (Pe’er, Regev, Elidan, & Friedman, ), where for of the genes in S. cerevisiae (bakers’ yeast), biologists have created a knock-out mutant, that is, a genetic mutant lacking that gene. If the parent of a gene in the Bayes net is knocked out and the child’s status remains unchanged, then it is unlikely that the arc from the parent to the child captures causality. A key limitation is that such mutants are not available for many organisms. Other approaches such as RNAi have been proposed for doing knock-outs more efficiently, but a limitation is that RNAi typically reduces rather than eliminates expression of a gene. Ong, Glassner, and Page () used time-series data (data from the same organism at various time points) to partially address the issue of causality. They used these data to learn dynamic Bayesian networks (DBNs) in order to infer temporal direction for gene interactions, thereby getting a potentially better handle on causality. DBNs have been employed by other researchers for time-series gene expression data, and the approach has been extended to learn DBNs with continuous variables (Segal, Pe’er, Regev, Koller, & Friedman, ).

Single Nucleotide Polymorphisms

Single-Nucleotide Polymorphisms (SNPs) are individual base positions (i.e., single-nucleotide positions) in DNA where people (or the organisms of interest) vary. Most of the variation in human DNA is due to SNPs. (There are other variations, such as copy number variations, insertions, and deletions, that we do not consider in this article.) There are well over three million known SNPs in humans. Technologies such as Illumina or Affymetrix whole-genome scans can measure a million SNPs in a short time. The measurement of these variations is an order of magnitude faster, easier, and cheaper than sequencing all the genes of the person. It is believed that in the next decade, it will be possible to obtain the entire genome sequence of an individual human for under $, (Mardis, ). If we had every human’s entire sequence, it could be used to predict susceptibility to diseases or adverse reactions to drugs for a certain subset of patients. The idea is illustrated in Fig. . Suppose the red dots in the figure are two copies of nucleotide A, and the green dots denote a different nucleotide, say C. As can be seen from the figure, people who respond to a treatment T (top half of the figure) have two copies of A (these could be the positive examples), while the people who do not respond to the treatment (bottom half of the figure) have at most one copy of A (the negative examples). Now, we can imagine modeling the sequence to predict the susceptibility to a disease or responsiveness to a treatment. SNP data can serve as a surrogate for full sequence data in this problem, since SNPs allow us to detect the variations among humans. An example of SNP data is presented in Fig.

[Figure panels: top, susceptible to disease D or responds to treatment T; bottom, not susceptible or not responding.]

Biomedical Informatics. Figure . Example application of sequencing human genes. The top half is the case, where patients respond to a treatment and the bottom is the case, where three patients do not respond to the treatment


for the prediction of myeloma, a cancer that is common in older people (age > ) and very rare in younger people (age < ). This data set consists of people diagnosed with myeloma at a young age and people who were not diagnosed until they were , when the disease is more common. Most SNP positions represent a pair of nucleotides and are typically restricted in the combinations of values they may assume. For example, in the figure, SNP can take one of the three possible combinations <C T, C C, T T> for its two positions. The goal is to use the feature values of the different SNPs to predict the class label, which could be the susceptibility. That is, the goal is to determine the genetic difference between people who got the disease at a young age and people who did not until they were old.

It is also possible for two patients to have the same SNP pattern in the data but not identical DNA. Patients and may have CT for the SNP and GA for SNP, where both SNPs are on chromosome . But Patient has C on SNP in the same copy of chromosome as the G in SNP, whereas Patient has C on the same copy as an A. Hence, while they have the same SNP pattern of CT and GA, they do not have identical DNA. The process of converting the data from the form in the Figure below to the form above is called phasing. From a machine learning perspective, there is a choice between working with the unphased data and using an algorithm for phasing. It turns out that phasing is very difficult, especially when the patients are unrelated, and it is an active research area. Hence many machine learning researchers work mainly with unphased data. Admittedly, some information is lost with unphased data, but the loss is small and is offset by avoiding the difficulty of phasing.

Most biologists and statisticians using SNP data perform genome-wide association studies (GWAS). The goal in this work is to find individual SNPs that are significantly associated with disease, that is, such that one of the SNP values, or alleles, raises the risk of disease. This is typically measured by “relative risk” or by “odds ratio,” and significance is typically measured by statistical tests such as the Wald test, the score test, or LRLR (logistic regression log likelihood, where each SNP is used individually to predict disease, and the log likelihood of the predictive model is compared to that of guessing under the null hypothesis that the SNP is not associated).
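As a small worked example of the two association measures named above, the sketch below computes relative risk and odds ratio from a 2 × 2 table of allele carriers vs. disease status. The counts are hypothetical and the table is treated as if it came from a cohort, which is an assumption made only to keep the arithmetic simple.

```python
def relative_risk(carrier_cases, carrier_total, noncarrier_cases, noncarrier_total):
    """Risk of disease among carriers divided by risk among non-carriers."""
    return (carrier_cases / carrier_total) / (noncarrier_cases / noncarrier_total)


def odds_ratio(carrier_cases, carrier_controls, noncarrier_cases, noncarrier_controls):
    """Odds of disease among carriers divided by odds among non-carriers."""
    return (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)


# Hypothetical counts for one SNP allele:
#                 diseased   healthy
# carriers            30        70
# non-carriers        10        90
rr = relative_risk(30, 100, 10, 100)
oratio = odds_ratio(30, 70, 10, 90)
print(f"relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
```

In a GWAS this computation is repeated for every SNP, which is why the multiple comparisons problem arises again.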


                       SNP
Person      1        2        3        ...    Class
Person 1    C  T     A  G     T  T     ...    Old
Person 2    C  C     A  G     C  T     ...    Young
Person 3    T  T     A  A     C  C     ...    Old
Person 4    C  T     G  G     T  T     ...    Young
.           .  .     .  .     .  .     ...    .
.           .  .     .  .     .  .     ...    .
.           .  .     .  .     .  .     ...    .

Biomedical Informatics. Figure . Example of SNP data

One of many examples is the use of SNPs to predict susceptibility to breast cancer (Easton et al., ). The advantages of SNP data compared to microarray data are the following: (1) Because SNP analysis is typically performed on DNA from saliva or peripheral blood cells, a person’s SNP pattern does not change with time or disease. If the SNPs are collected from a blood sample of a person aged years, the SNP patterns are probably the same as when that person was born. This gives more insight into the person’s susceptibility to many diseases. Hence, we do not see the widespread changes in SNP pattern with cancer, for example, that we see in microarray data from tumor samples. (2) It is easier to collect the samples. These can be obtained from blood samples rather than from, say, a biopsy of other tissue types.

The challenges of SNP data are as follows: (1) As explained earlier, the data are unphased. Algorithms exist for phasing (haplotyping), but they are error prone and do not work well with samples from unrelated patients; they require the data to consist of related individuals in order to have dense coverage. (2) Missing Values are more common than in microarray data. The good news is that the amount of missing values is decreasing substantially (down from –% a few years ago to –%). (3) The sheer volume of measurements: currently, it is possible to measure a million SNPs out of over three million SNPs in the human genome. While this provides a tremendous amount of potential information, the resulting high dimensionality causes problems for machine learning. As with gene expression microarray data, we have a multiple comparisons problem, so approaches such as Bonferroni correction or q-values from False Discovery Rate can again be applied. But even when a significant SNP is found, it usually increases our accuracy at predicting disease by only % or % points, because a single SNP typically has either a small effect or small penetrance (the variation is fairly rare; one value of the SNP is strongly predominant). So GWAS are missing a major opportunity to build predictive models by combining multiple SNPs with small effects. This is an exciting opportunity for machine learning. The supervised learning task can be defined as follows:

● Given: A set of SNP profiles, each from a different patient.
  Phased: Nucleotides at each SNP position on each copy of each chromosome constitute the features, and the patient’s disease susceptibility or drug response constitutes the class.
  Unphased: The unordered pair of nucleotides at each SNP position constitutes the features, and the patient’s disease susceptibility or drug response constitutes the class.
● Do: Learn a model to predict the class based on the features.
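The difference between the phased and unphased representations can be sketched as follows. Two hypothetical patients carry different haplotypes (different nucleotides on the same chromosome copy) yet collapse to the same unphased profile, mirroring the CT/GA example earlier in this section. The allele-count encoding at the end is one common way to turn unordered pairs into numeric features; it is shown here as an assumption, not as the encoding used in the cited studies.

```python
def unphased_genotype(copy_1, copy_2):
    """Collapse two haplotypes (one per chromosome copy) into unordered pairs per SNP."""
    return [tuple(sorted(pair)) for pair in zip(copy_1, copy_2)]


def allele_count_features(genotype, counted_alleles):
    """Encode each SNP as the number of copies (0, 1, or 2) of a chosen allele."""
    return [pair.count(allele) for pair, allele in zip(genotype, counted_alleles)]


# Phased data for two hypothetical patients at two SNPs on the same chromosome.
# Patient X has C with G on one copy; patient Y has C with A on one copy.
patient_x = (["C", "G"], ["T", "A"])
patient_y = (["C", "A"], ["T", "G"])

geno_x = unphased_genotype(*patient_x)
geno_y = unphased_genotype(*patient_y)
print(geno_x, geno_y)  # identical unphased profiles, different DNA

# Person 2 from the example SNP table (CC, AG, CT), counting alleles C, A, T.
person_2 = unphased_genotype(["C", "A", "C"], ["C", "G", "T"])
features = allele_count_features(person_2, ["C", "A", "T"])
print(features)
```

A learner given only `geno_x` and `geno_y` cannot tell the two patients apart, which is exactly the information lost by not phasing.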

We now briefly present one example of supervised learning from SNP data. Waddell, Page, and Shaughnessy () found evidence of a genetic component in predicting the blood cancer multiple myeloma, as it was possible to distinguish the two cases significantly better than chance (% accuracy). The results from using Support Vector Machines (SVMs) are

Actual      Old    Young
Old         31     9
Young       14     26

Biomedical Informatics. Figure . Results on predicting multiple myeloma, young (susceptible) vs. old (less susceptible), , SNPs

presented in Fig. . Similar results were obtained using a Naive Bayes model as well. Listgarten et al. () also used SNP data, with the goal of predicting lung cancer. The accuracy of % that they obtained was remarkably similar to that for predicting multiple myeloma. The best models for predicting lung cancer were also Naive Bayes and SVMs. There is a striking similarity between the two experiments on unrelated tasks using SNPs. When only individual SNPs were considered, the accuracy for both experiments fell to %.

The lessons learned from SNP data are the following: (1) Supervised learning algorithms such as Naive Bayes and SVM that can handle a large number of features in the presence of a smaller number of training examples can predict disease susceptibility at rates better than chance and better than individual SNPs. (2) Accuracies are much lower than those with microarray data. This is mainly because we are predicting susceptibility to a disease (or response to a drug) rather than whether a person already has the disease (as with the microarray data). While we are predicting using the genetic component, there are also many environmental components that are responsible for diseases and drug responses. We are not considering such components in our model, and hence the accuracies are often not very high. In spite of the relatively lower accuracies, the models give a different and valuable insight into human genetics.

We now briefly outline a couple of exciting future directions for the use of SNP data. Pharmacogenetics is the problem of predicting drug response from a SNP profile and has been gaining momentum over the past few years. This includes predicting drug efficacy and adverse reactions to certain drugs, given a person’s SNP profile. A recent New England Journal of Medicine article showed that the analysis of SNPs can significantly improve the dosing model for the most widely used orally available blood thinner, Warfarin (IWPC, ). Another exciting direction is the combination of SNP data with other data types, such as microarray data and clinical data that include the history of the patient and the lab tests. The combination of these different data sets will not only improve the accuracy of the learned model but also provide deeper insight into the different kinds of interactions that occur within a human, such as gene interactions with other drugs.

It should be mentioned that other genetic data types are becoming available and may be useful for supervised learning as well. These data types can provide additional information about DNA sequence beyond SNPs but without the expense of full genome sequencing. They include copy-number variations and exon sequencing.
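As a worked check on the multiple myeloma results, the counts in the confusion matrix figure above (31, 9, 14, 26) can be turned into summary measures. Overall accuracy depends only on the diagonal, so it does not depend on how the table is oriented. The per-class rates below additionally assume that rows are the actual classes, which matches the “Actual” label in the figure but is otherwise an assumption.

```python
# Counts from the confusion matrix figure; outer key = actual class (assumed),
# inner key = predicted class (assumed).
confusion = {
    "Old": {"Old": 31, "Young": 9},
    "Young": {"Old": 14, "Young": 26},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Fraction of each actual class that was predicted correctly.
per_class_rate = {
    label: confusion[label][label] / sum(confusion[label].values())
    for label in confusion
}

print(f"accuracy = {accuracy:.4f}")
print(per_class_rate)
```

With 40 examples per class, a majority-class guesser would score 0.5, so the gap between that baseline and the computed accuracy is the evidence for a genetic component.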

Mass Spectrometry and Proteomics

Microarrays are useful primarily because mRNA concentrations can serve as surrogates for protein concentrations and are easier to measure. Though measuring protein concentrations directly is possible, it cannot be done in the same high-throughput manner as measuring mRNA. Recently, techniques such as Mass Spectrometry (MS or mass spec) have been successful in high-throughput measurement of proteins. Mass spec still does not give the complete coverage that microarrays provide, nor as good a quantitation. Mass spectrometry is improving on many fronts, using many technologies. As one example, we present Time-Of-Flight (TOF) Mass Spectrometry, illustrated in Fig. . This measures the time required for an ionized particle starting from the sample plate (bottom of the figure) to hit the detector. The key idea is to place some proteins (indicated as larger circles) into a matrix (smaller circles are the matrix molecules). Because of mass spec limitations, the proteins typically are digested (broken into smaller peptides), for example, by the enzyme trypsin. When struck by a laser, the matrix molecules release protons that attach themselves to the peptides or protein fragments (shown in (a)). Note that the plate where the peptides are present is positively charged. This causes the peptides to migrate toward the detector. As can be seen in (b) of the figure, the molecules with smaller mass move faster toward the detector. The idea is to detect the number of molecules that hit the


[Figure, panel (a): a laser strikes the sample plate, which is held at +10 kV below the detector. The protons from the matrix molecules get attached to the proteins.]
[Figure, panel (b): positively charged proteins are repelled towards the detector. Smaller mass molecules hit the detector first, while heavier ones are detected later.]

Biomedical Informatics. Figure . Time-Of-Flight mass spectrometry

detector at any given time. This makes it possible to use time as a surrogate for the mass of the protein. The experiment is repeated a number of times, counting frequencies of “flight-times.” Plotting time vs. the number of particles hitting the detector yields a spectrum as presented in Fig. . The figure shows three different fractions from the same sample. These kinds of spectra give us insight into the different types of proteins in a given sample. A technical detail is that sometimes molecules receive additional charge (additional protons) and hence fly faster. Therefore, the horizontal mass axis in a spectrum is actually a mass/charge ratio.

The main issues for machine learning researchers working with mass spectrometry data compared to microarray data are as follows: (1) There is a lot of Noise in the data. The noise is due to extra peaks from handling of the sample and from the machine and environment (e.g., electrical noise). Also, the mass/charge values may not align exactly across the spectra; the accuracy of the mass/charge values is the resolution of the mass spec. (2) Intensities (peak heights) are not calibrated across the spectra, making quantification difficult. That is, if one spectrum has more intensity than another at a particular mass/charge, it does not necessarily mean that the levels of the peptide at that mass/charge are higher in that spectrum. (3) Mass spectrometry data are not as comprehensive as microarray data, in that it is not possible to measure all peptides (typically only several hundred of them can be obtained). To get the best results, there is a need to fractionate the sample beforehand, getting different groups of proteins in different subsamples (fractions). (4) As already mentioned, the proteins themselves typically must be broken down (digested) into smaller peptides in order to get accurate readings from the mass spec. But this means processing is needed afterward not only to determine from a spectrum which peptides are present but also to determine from those peptides which proteins are present. It is worth noting that some of these challenges are being partially addressed by ongoing improvements in mass spectrometry technologies, including the use of “tandem mass spectrometry.”

This data type opens up a lot of possibilities for machine learning research. Some of the learning tasks include:

● Learn to predict proteins from spectra, when the organism’s proteome (full set of proteins) is known.
● Learn to identify isotopic distributions (combinations of multiple peaks for a given molecule


[Figure: three overlaid spectra (line 1, line 2, line 3); the vertical axis runs from 0 to 7000 and the horizontal axis from 0 to 160000.]

Biomedical Informatics. Figure . Example spectra from a competition by Lin et al.

arising from different isotopes of carbon, nitrogen, and oxygen).
● Learn to predict disease from either proteins, peaks, or isotopic distributions as features.
● Construct pathway models.

We will now present one case study that was successful and generated a lot of interest: Early Detection of Ovarian Cancer (Petricoin et al., ). Ovarian cancer is difficult to detect early, often leading to poor prognosis. The goal of this work was to predict ovarian cancer from blood samples. To this end, the researchers trained and tested on mass spectra from blood serum. They used training cases ( positive) and a held-out test set of cases ( positive). The results were extremely impressive (% sensitivity, % specificity). Although the machine learning methodology seemed very sound, it turns out that the preprocessing stage of the data may have introduced errors (Baggerly, Morris, & Combes, ). Mass spectrometry is very sensitive to external factors as well. For instance, if we run cancer samples on Monday and normal samples on Wednesday, we could get differences from variations in the machine or from nearby electrical equipment that is running on Monday but not Wednesday. Hence, one of the important lessons learned from this data type is the need for careful randomization of the data samples. That is, the positive and negative samples should be processed under identical conditions; it should not be the case that the positive examples are run through the machine on one day and the negatives on another. Any preprocessing of the data must likewise be performed identically for both.

While mass spectrometry is a widely used type of high-throughput proteomic data, other types of data are also important and are briefly covered next.
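The relation between flight time and mass/charge in the time-of-flight measurement described in this section can be sketched with an idealized model: an ion of charge z accelerated through a potential V gains kinetic energy zeV and then drifts a fixed length at constant speed. The 10 kV potential is taken from the figure label; the 1 m drift length, the neglect of the acceleration region and of initial velocities, and the function names are all simplifying assumptions.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mass_da, charge, voltage=10_000.0, drift_length=1.0):
    """Idealized flight time in seconds for an ion of given mass (Da) and charge."""
    mass_kg = mass_da * DALTON
    energy = charge * ELEMENTARY_CHARGE * voltage   # z * e * V, in joules
    velocity = math.sqrt(2.0 * energy / mass_kg)    # from (1/2) m v^2 = z e V
    return drift_length / velocity


def mass_to_charge(time_s, voltage=10_000.0, drift_length=1.0):
    """Invert the model: mass/charge (Da per unit charge) from a flight time."""
    return 2.0 * ELEMENTARY_CHARGE * voltage * (time_s / drift_length) ** 2 / DALTON


light = flight_time(1000.0, 1)
heavy = flight_time(2000.0, 1)
heavy_doubly_charged = flight_time(2000.0, 2)
print(f"1000 Da, 1+: {light * 1e6:.2f} microseconds")
print(f"2000 Da, 1+: {heavy * 1e6:.2f} microseconds")
print(f"2000 Da, 2+: {heavy_doubly_charged * 1e6:.2f} microseconds")
```

In this model a doubly charged 2000 Da ion arrives at the same time as a singly charged 1000 Da ion, which is why the horizontal axis of a spectrum is a mass/charge ratio rather than a mass.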

Protein Structures

X-ray crystallography and nuclear magnetic resonance are widely used to determine the three-dimensional structures of proteins. Predicting protein structures has been a very fertile field for machine learning research for several decades. While the amino acid sequence of a protein is called its primary structure, it is more difficult to determine secondary structure and tertiary (3D) structure. Secondary structure maps subsequences of the primary structure into the three classes of alpha helix (helical structures akin to a telephone cord, often denoted by A), beta strand (which comes together with other strand sections to form planar structures called beta sheets, often denoted by B), and less descript regions referred to as coil or loop regions, often denoted by C. Predicting secondary and tertiary structure has been a popular topic for machine learning for many years, because training data exist, yet it is difficult and expensive to determine structures experimentally. We will not attempt to survey all the work in this area. Waltz and colleagues (Zhang, Mesirov, & Waltz, ) showed the benefit of applying neural networks to the task of secondary structure prediction, and the best secondary structure predictors (e.g., Rost & Sander, ) have continued to be constructed by machine learning over the years. Approaches for predicting the tertiary structure have also relied heavily on machine learning and include ab initio prediction (e.g., Bonneau & Baker, ), prediction aided by crystallography data (e.g., DiMaio et al., ), and homology-based prediction (by finding similar proteins). For over a decade, there has been a regular competition in the prediction of protein structures (Critical Assessment of Structure Prediction [CASP]).

Protein–Protein Interactions

Another proteomics data type is protein–protein interactions. This is illustrated in Fig. . The idea is to identify proteins that interact with the current protein, say P. Generally, this is performed as follows: In the sample, there are some proteins of type X (shown in pink in the figure) and other types of proteins. Proteins that interact with X are bonded to X. Then antibodies (shown as Y-shaped green objects) are introduced into the sample. The role of the antibodies is to collect the proteins of type X. Once the antibodies have collected all the protein X’s in the sample, along with the proteins bound to them, these can be analyzed through mass spectrometry as presented earlier. A particularly high-throughput way of measuring protein–protein interactions is through “ChIP-chip” data. The supervised learning tasks for this data type include:

● Learn to predict protein–protein interactions: Protein three-dimensional structures may be critical.
● Use protein–protein interactions in construction of pathway models.
● Learn to predict protein function from interaction data.

[Figure, panel (a): the pink objects are protein X and they get attached to other proteins (2 in this figure). The green Y-shaped objects are the antibodies.]
[Figure, panel (b): the antibodies get attached only to protein X, and hence collecting the antibodies will result in collecting X’s and the proteins that interact with X.]

Biomedical Informatics. Figure . Schematic of antibody-based identification of protein–protein interactions

Related Data Types

● Metabolomics measures the concentration of each low-molecular-weight molecule in a sample. These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways. These reactions are typically catalyzed by proteins (specifically, enzymes). This data type typically uses mass spectrometry.
● ChIP-chip data measures protein–DNA interactions. For example, transcription factors are proteins that interact with DNA in specific locations to alter transcription of a nearby gene.
● Lipomics is analogous to metabolomics, but measures concentrations of lipids rather than metabolites. These potentially help induce biochemical pathway information or help disease diagnosis or treatment choice.

High-Throughput Screening Data for Drug Design

The typical steps in designing a drug are: (1) Identifying a target protein. For example, while developing an antibiotic, it is useful to find a protein that belongs to the bacteria of interest and to find a small molecule that will bind to that protein. To do this, we need knowledge of the proteome/genome and the relevant biological pathways. (2) Determining the target site structure once the protein has been identified. This is typically performed using crystallography. (3) Finding a molecule that will bind to the target site. These steps are presented in Fig. .

The molecules that bind to the target may have a number of other problems, and hence they cannot be used directly as a drug. Some common problems are as follows: (1) They may bind too tightly or not tightly enough. (2) They may be toxic. (3) They may have unanticipated side effects in the body. (4) They may break down as soon as they get into the body or may not leave the body soon enough. (5) They may not get to the right target in the body (e.g., cross the blood–brain barrier). (6) They may not diffuse from gut to bloodstream. Also, since organisms differ, even if a molecule works in the test tube and in animal studies, it may fail in clinical trials. And while a molecule may work for some people, it may not work for others. Conversely, while some molecules may cause harmful side effects in some people, they may not do so in others.

Often pharmaceutical companies will use robotic high-throughput screening assays to test many thousands of molecules to see if they bind to the target protein, and then computational chemists will work to determine the commonalities that allow them to bind to the target, as often the structure of the target protein cannot be determined. The process of discovering the commonalities across the different molecules presents a great opportunity for machine learning research. The first study of this task using machine learning was by Dietterich, Lathrop, and Lozano-Perez and led to the formulation of Multi-Instance Learning. Yet another machine learning task could be to predict the reactions of patients to the drugs.

High-Throughput Screening: When the target structure is unknown, it is common practice to test many molecules (,,) to find some that bind to the target. This is called High-Throughput Screening. It is then important to infer the shape of the target from three-dimensional structural similarities among the molecules that bind. The shared three-dimensional structure is called a pharmacophore. This is a perfect example of a machine learning task with a spatial target and is presented in Fig. .

Given: A set of molecules, each labeled by activity (binding affinity for a target protein), and a set of low-energy conformers for each molecule.
Do: Learn a model that accurately predicts the activity (may be Boolean or real valued).

[Figure: Identify target protein → Determine target site structure → Synthesize a molecule that will bind]

Biomedical Informatics. Figure . Steps involved in drug design

[Figure: example molecules labeled Active and Inactive]

Biomedical Informatics. Figure . An example of structure learning


The common machine learning approaches taken toward solving this problem are:

1. Representing a molecule by thousands to millions of features and using standard techniques (KDD, )
2. Representing each low-energy conformer by a feature vector and using multiple-instance learning (Jain et al., )
3. Relational learning, using either Inductive Logic Programming techniques (Finn, Muggleton, Page, & Srinivasan, ) or Graph Mining

Thermolysin Inhibitors: We present some results of relational learning algorithms on the thermolysin inhibitors data set (Davis, a). Thermolysin belongs to the family of metalloproteases and plays roles in physiological processes such as digestion and blood pressure regulation. The molecules in the data set are known inhibitors of thermolysin. Activity for these molecules is measured as pKi = −log Ki, where Ki is a dissociation constant, measuring the ratio of the concentrations of the unbound constituents to that of the bound product. A higher pKi value therefore indicates a stronger affinity for binding. The data set had the ten lowest-energy conformations (as computed by the SYBYL software package [www.tripos.com]) for each of the thermolysin inhibitors, along with their activity levels. The key results for this data set using the relational algorithm SAYU (Davis, b) were:

● Ten five-point pharmacophores identified, falling into two groups (/ molecules):
  ● Three “acceptors,” one hydrophobe, and one donor
  ● Four “acceptors,” and one donor
● Common core of Zn ligands, Arg, and Asn interactions identified
● Correct assignments of functional groups
● Correct geometry to Å tolerance
● Increasing tolerance to . Å finds common six-point pharmacophore including one extra interaction
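A small worked example of the activity scale used for this data set, assuming the logarithm is base 10 and that Ki is expressed in molar units (both are assumptions, and the Ki values below are made up for illustration):

```python
import math

def pki(ki_molar: float) -> float:
    """Return pKi = -log10(Ki) for a dissociation constant given in mol/L."""
    if ki_molar <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(ki_molar)

# Hypothetical inhibitors: a smaller Ki (tighter binding) gives a larger pKi.
weak, strong = 1e-4, 1e-9
print(pki(weak), pki(strong))
```

Because the scale is logarithmic, each unit of pKi corresponds to a tenfold change in Ki.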

Antibacterial Peptides: This is a data set of pentapeptides showing activity against Pseudomonas aeruginosa (Spatola, Page, Vogel, Blondell, & Crozet, ). There are six active pharmacophores with < µg/ml of IC and five inactives. The pharmacophore that has been identified is presented in Table .

Biomedical Informatics. Table Identified Pharmacophore
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that
  M has a hydrophobic group C
  M has a hydrogen acceptor D
  The distance between C and D in conformation B is . Å
  M has a positively charged atom E
  The distance between C and E in conformation B is Å
  The distance between D and E in conformation B is . Å
  M has a positively charged atom F
  The distance between C and F in conformation B is . Å
  The distance between D and F in conformation B is . Å
  The distance between E and F in conformation B is . Å
  Tolerance . Å

Dopamine Agonists: The last data set that we present here consists of dopamine agonists (Martin et al., ). Dopamine works as a neurotransmitter in the brain, where it plays a major role in movement control. Dopamine agonists are molecules that function like dopamine and produce dopamine-like effects; they can potentially be used to treat diseases such as Parkinson’s disease. The data set had dopamine agonists along with their activity levels. The pharmacophore identified using Inductive Logic Programming is presented in Table .
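A pharmacophore rule of the kind shown in the Identified Pharmacophore table can be read as a geometric test: a molecule is predicted active if some conformation places the named features at the required pairwise distances, within a tolerance. The sketch below illustrates that reading only. The target distances, the tolerance, and the coordinates are invented for the example and are not the learned values.

```python
import math
from itertools import combinations

# Hypothetical four-point pharmacophore: target distances (in angstroms) between
# labeled features C, D, E, F. These numbers are illustrative assumptions.
TARGET = {("C", "D"): 4.0, ("C", "E"): 3.0, ("C", "F"): 5.0,
          ("D", "E"): 5.0, ("D", "F"): 6.4, ("E", "F"): 5.8}
TOLERANCE = 0.5

def matches(conformation, target=TARGET, tol=TOLERANCE):
    """True if every pairwise feature distance is within tol of its target."""
    for a, b in combinations(sorted(conformation), 2):
        d = math.dist(conformation[a], conformation[b])
        if abs(d - target[(a, b)]) > tol:
            return False
    return True

def is_active(conformations):
    """A molecule is predicted active if any one of its conformations matches."""
    return any(matches(c) for c in conformations)

# Made-up 3D coordinates for the four features in two conformations.
good = {"C": (0, 0, 0), "D": (4, 0, 0), "E": (0, 3, 0), "F": (0, 0, 5)}
bad = {"C": (0, 0, 0), "D": (4, 0, 0), "E": (0, 3, 0), "F": (0, 0, 9)}

print(is_active([bad, good]))
print(is_active([bad]))
```

The existential test over conformations mirrors the phrase "if it has a conformation B such that" in the rule.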

Biomedical Informatics. Table Pharmacophore Identified for Dopamine Agonists
Molecule A has the desired activity if
● In conformation B molecule A contains a hydrogen acceptor at C
● In conformation B molecule A contains a basic nitrogen group at D
● The distance between C and D is . ± . Å
● In conformation B molecule A contains a hydrogen acceptor at E
● The distance between C and E is . ± . Å
● The distance between D and E is . ± . Å
● In conformation B molecule A contains a hydrophobic group at F
● The distance between C and F is . ± . Å
● The distance between D and F is . ± . Å
● The distance between E and F is . ± . Å

Electronic Medical Records (EMR) and Personalized Medicine

Predictive personalized medicine (PPM) is a vision of the future, whose parts are beginning to come into place now. Under this vision, physicians can construct safer and more effective prevention and treatment plans for each patient. This is made possible by predicting the impact of treatments on patients: their effectiveness for different classes of patients, the adverse reactions to certain drugs prescribed to the patients, and the susceptibility of different types of patients to diseases. PPM can become a reality for three reasons. The first is the widespread use by many clinics of Electronic Medical Records (EMR, also called Electronic Health Records, EHR). The second is that whole-genome scan technology makes it possible in one experiment, for well under $,, to measure for one patient a half million to one million SNPs, or individual positions in the DNA where humans vary. The third key reason is the advancement in the past decade of statistical modeling (machine learning) methods that can handle large relational longitudinal databases with a significant amount of noise.

The first two reasons make it possible for clinics to have a relational database of the form presented in Fig. . Given such a database, it is conceivable to use existing machine learning algorithms to achieve the goal of PPM. These algorithms could focus on predicting which patients are at risk (positive and negative examples). Another task is predicting which patients will respond to a specific treatment, using a set of patients who have undergone specific treatments to learn predictive models that can be extended to similar patients in the population. Similarly, it is possible to focus on certain drugs and their adverse reactions and use them to predict the adverse reactions of similar drugs that are released in the market. In this work, we focus on the machine learning solutions to predicting adverse drug reactions for different drugs. There are at least three different tasks for machine learning in predicting Adverse Drug Events (ADEs).

Biomedical Informatics. Figure . Electronic Health Records (dramatically simplified); most data currently do not include SNP information, but it is anticipated in the future. The figure shows relational tables linked by Patient ID: demographics (Gender, Birthdate), diagnoses (Date, Physician, Symptoms, Diagnosis), lab results (Date, Lab Test, Result), SNP genotypes (SNP1, SNP2, …, SNP500K), and prescriptions (Date Prescribed, Date Filled, Physician, Medication, Dose, Duration).

Task 1:
Given: Patient data (from claims databases and/or EMRs) and a drug D
Do: Construct a model to predict a minimum efficacious dose of drug D, because a minimum dose is less likely to induce an ADE.

An example of this task is predicting the “stable dose” of the blood thinner Warfarin (Coumadin) for a patient (McCarty, Wilke, Giampietro, Wesbrook, & Caldwell, ). A stable dose of Warfarin yields the desired degree of anticoagulation, whereas a higher dose can lead to bleeding ADEs; the stable dose for a patient is currently found by trial and error, modifying the dose and measuring the degree of anticoagulation. The cited study shows that a learned dosing model can predict a significantly better starting dose (significantly closer to the final “stable dose”) than the mg/day starting dose currently used in many clinics.

Task 2:
Given: Patient data (from claims databases and/or EMRs), a drug D, and an adverse event E
Do: Construct a model to predict which patients are likely to suffer the adverse event E if they take D.

In this second task, we assume that the association between D and E has already been hypothesized. We seek to construct models that can predict who will suffer a given event if they take the drug. Here, whether the patient will suffer adverse event E is the class variable to be predicted. This task is important for personalized medicine, as accurate models for this task can be used to identify patients who should not be given a particular drug. An earlier study has demonstrated the benefit of a Statistical Relational Learning (SRL) system called SAYU (Davis, b) over standard machine learning approaches with a feature-vector representation of the EHR, for the task of predicting which users of cox inhibitors would have a myocardial infarction (MI).

Task 3:
Given: Patient data (from claims databases and/or EMRs) and a drug D
Do: Determine if evidence exists that associates D with a previously unanticipated adverse event.
This third task is the most challenging because no associated event has been hypothesized, so the response variable to be predicted must itself be identified. In brief, the major approach for this task is to use machine learning “in reverse.” We seek a model that can predict which patients are on drug D, using only the data recorded after they start the drug (that is, left-censored data) and also censoring the indications of the drug. If a model can predict (with accuracy better than chance on held-aside data) which patients are taking the drug, there must be some combination of variable settings that is more common among patients on the drug. Because we have left censored, in theory, this commonality should not consist of common symptoms but of common effects, presumably of the drug. The model can then be examined by experts to see whether it might indicate a possible new adverse event for the drug. This use of machine learning “in reverse” can be viewed as Subgroup Discovery (Wrobel, ; Klösgen, ): finding a subgroup of patients on drug D who share some subsequent clinical events. The learned model, say an IF-THEN rule, need not correctly identify everyone on the drug but merely a subgroup of those on the drug, while not generating many false positives (individuals not on the drug).

This task poses several challenges that traditional ML methods find difficult to handle. First, the data is multi-relational. There are several kinds of objects, such as doctors, patients, drugs, diseases, and labs, that are connected through relations such as visits, prescriptions, and diagnoses. Traditional machine learning (ML) techniques require flattening the data into a single table. All known flattening techniques, such as computing a join or summary features, result in either (1) changes in the frequencies on which machine learning algorithms critically depend or (2) loss of information. They also typically result in the loss of some correlations between the objects and an explosion in database size. Second, the data is non-i.i.d., as there are relationships between the objects and between different rows within a table.
Third, there are arbitrary numbers of patient visits, diagnoses, and prescriptions for different patients. That is, there is no fixed pattern in the diagnoses and prescriptions of the patients. It is incorrect to assume that patients are diagnosed a fixed number of times or that only the last diagnosis is relevant. To predict the adverse reactions to a drug, it is important to consider the other drugs that the patient is prescribed or has been prescribed in the past, as well as past diagnoses and laboratory results. To capture these interactions, it is critical to explicitly model time, since the interactions are highly temporal. Some drugs cause side effects when taken at the same time, while others cause side effects when taken one after another. It is important to capture such interactions to be able to make useful predictions for physicians and the Food and Drug Administration (FDA). In this work, we focus on this hardest task and present the results on two data sets.

Cox Inhibitors: Recently, a study was performed to see if there were any unanticipated adverse events that occurred when subjects used cox inhibitors (Vioxx, Celebrex, and Bextra). Cox inhibitors are a nonsteroidal anti-inflammatory class of drugs that were used to reduce joint pain. Vioxx, Celebrex, and Bextra were approved for use in the late s and were ranked among the top therapeutic drugs in the USA. Several clinical trials were conducted, and the APPROVe trial (focused on Vioxx outcomes) showed an increase of adverse events from myocardial infarction, stroke, and vascular thrombosis. The manufacturer withdrew Vioxx from the market shortly after the results were published. The other cox inhibitor drugs were discontinued shortly thereafter. This study utilized the Marshfield Clinic’s Personalized Medicine Research Project (PMRP) (McCarty, Wilke, Giampietro, Wesbrook, & Caldwell, ) cohort consisting of approximately , + subjects. The PMRP cohort included adults aged years and older who reside in the Marshfield Epidemiology Study Area (MESA). Marshfield has one of the oldest internally developed Electronic Medical Records (Cattails MD) in the USA, with coded diagnoses dating back to the early s. Cattails MD has over , users throughout central and northern Wisconsin. Since the data is multi-relational, an Inductive Logic Programming (Muggleton & De Raedt, ) system, Aleph (Srinivasan, ), was used to learn the models.
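The effect of flattening multi-relational data, noted earlier as a difficulty for traditional methods, can be seen on a toy example. The sketch below uses an invented two-table database (the patient identifiers, drug names, and labels are hypothetical) and shows that a join changes the apparent frequency of the class label, while a summary feature keeps the frequency but discards which drugs were taken.

```python
# Hypothetical toy database: one row per patient, many prescription rows per patient.
patients = {"P1": {"adverse_event": True}, "P2": {"adverse_event": False}}
prescriptions = [("P1", "drug_a"), ("P1", "drug_b"), ("P1", "drug_c"), ("P2", "drug_a")]

def event_rate(patient_ids):
    """Fraction of the given rows whose patient had the adverse event."""
    return sum(patients[pid]["adverse_event"] for pid in patient_ids) / len(patient_ids)

# Per-patient view: one of the two patients had the event.
rate_by_patient = event_rate(list(patients))

# Flattening by join: each prescription row repeats its patient's label,
# so P1 is counted three times and the apparent frequency changes.
joined = [pid for pid, _drug in prescriptions]
rate_after_join = event_rate(joined)

# Flattening by summary feature: one row per patient again, but the identity
# of the prescribed drugs is lost.
summary = {pid: sum(1 for p, _ in prescriptions if p == pid) for pid in patients}

print(rate_by_patient, rate_after_join, summary)
```

Relational learners such as ILP systems avoid this step by working on the separate tables directly.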
Aleph learns rules in the form of Prolog clauses and scores rules by positive examples covered (P) minus negative examples covered (N). Seventy-five percent of the data was used for training and rule development, while the remaining 25% was used for testing. There were , subjects within the PMRP cohort that had medication records. Within this cohort, almost % of the subjects indicated use of a cox inhibitor, and more specifically, .% indicated the use of Vioxx. Approximately .% of this cohort had an indicated use of clopidogrel bisulfate (Plavix). Aleph generated thousands of rules and selected a subset of the “best” rules based on the scoring algorithm. The authors also developed specific hypotheses to test for known adverse events to validate the approach (indicated by # A). This rule was:

  cox(A):- diagnoses(A, _,‘’).

It states that subject A is a cox inhibitor user if A has a diagnosis coded as (myocardial infarction). Aleph also provided summary statistics on model performance for identifying subjects on cox inhibitors, as indicated in Table . If we assume that the probability of being on the cox inhibitor is greater than . (the common threshold), then the model has a predictive probability of % to predict cox inhibitor use.

Biomedical Informatics. Table Cox Inhibitor Test Data Results (rule prediction + or − versus actual + or −, with accuracy)

OMOP Challenge: The Observational Medical Outcomes Partnership (OMOP) designed and developed an automated procedure to construct simulated data sets for identifying adverse drug events. The simulated data sets are modeled after real observational data sources but are composed of hypothetical persons with fictional drug exposures and health outcome occurrences. The data sets are constructed such that the relationships between the fictional drugs and fictional outcomes are well characterized as true and false associations. That is, hypothetical persons are created and assigned fictional drug exposure periods and instances of health outcomes based on random sampling from probability distributions that define the relationships between the fictional drugs and outcomes. The relationships created within the simulated data sets are contrived but are representative of the types of relationships observed within real observational data sources. OMOP has made a simulated data set and the simulator itself publicly available as part of the OMOP Cup Data Mining Competition (http://omopcup.orwik.com). Aleph was used to learn rules from a subset of the data (about , patients). Each patient had a record of drugs and diagnoses (conditions) with dates attached. A few examples of the rules learned by Aleph in this data set are:

  on_drug(A):- condition_occurrence(B,C,A,D,E, ,F,G,H)
  on_drug(A):- condition_occurrence(B,C,A,D,E, ,F,G,H), condition_occurrence(I,J,A,K,L, ,M,N,O)

The first rule identifies drug as interesting, while the second rule identifies two other drugs as interesting when predicting the reaction for person A. With about rules, Aleph was able to achieve a % coverage. The results were compared against a Statistical Relational Learning (SRL) technique (Getoor & Taskar, ) that uses a probability distribution on the rules. The results are presented in Fig. . As expected, with a small number of rules, SRL performs better than Aleph, but as the number of rules increases, they converge on the same performance. The leading approaches in the first OMOP Cup include a machine learning approach based on random forests as well as several approaches based on techniques from epidemiology such as disproportionality analysis. At the time of this writing, further details, as well as plans for future competitions, are available at http://omopcup.orwik.com/.

Biomedical Informatics. Figure . Results of OMOP data (accuracy versus number of rules, for Aleph and SRL)

Identifying previously unanticipated ADEs, predicting who is most at risk for an ADE, and predicting safe and efficacious doses of drugs for particular patients are all important needs for society. With the recent advent of “paperless” medical record systems, the pieces are in place for machine learning to help meet these important needs.
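The rule scoring used in the studies above (positive examples covered minus negative examples covered), combined with the "in reverse" setup in which the class is whether a patient is on the drug, can be sketched on toy data. Everything here is a hypothetical simplification: the diagnosis codes, the eight patients, and the restriction to one-literal rules are assumptions for illustration, and Aleph's actual clause search is far richer.

```python
def covers(code, patient_codes):
    """A one-literal rule 'on_drug(A) :- diagnosis(A, code)' covers a patient with that code."""
    return code in patient_codes

def score(code, examples):
    """Coverage score P - N over (codes, on_drug) examples."""
    p = sum(1 for codes, on_drug in examples if on_drug and covers(code, codes))
    n = sum(1 for codes, on_drug in examples if not on_drug and covers(code, codes))
    return p - n

# Hypothetical patients: (diagnosis codes recorded after drug start, on_drug label).
examples = [
    ({"mi", "fever"}, True), ({"mi"}, True), ({"mi", "rash"}, True),
    ({"fever"}, False), ({"rash"}, False), ({"fever", "mi"}, False),
    ({"mi"}, True), ({"rash"}, False),
]

# 75% of the data for rule development, the remaining 25% held aside for testing.
split = int(0.75 * len(examples))
train, test = examples[:split], examples[split:]

# Pick the candidate code whose rule has the highest P - N on the training data.
candidates = sorted({c for codes, _ in train for c in codes})
best = max(candidates, key=lambda c: score(c, train))

# Accuracy of the chosen rule on the held-aside data.
accuracy = sum(covers(best, codes) == on_drug for codes, on_drug in test) / len(test)
print(best, score(best, train), accuracy)
```

A rule that predicts drug use better than chance on held-aside data would then be handed to experts, since the codes it mentions are candidate effects of the drug.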

Conclusion

In this work, we aimed to survey for machine learning researchers the abundant opportunities in biomedical applications by presenting several data types to which machine learning techniques have been applied successfully or show tremendous promise. One of the most important developments in biology and medicine over the last few years is the availability of technologies that can produce large volumes of data. This in turn has created the need to process large volumes of data in a reasonable amount of time, presenting the perfect setting for machine learning algorithms to have an impact. We outlined several data types including gene expression microarrays (measuring mRNA), mass spectrometry (measuring proteins), SNP chips (measuring genetic variation), and Electronic Medical/Health Records (EMR/EHRs). The key lessons learned from all these data types are as follows: (1) Even if the number of features is greater than the number of data points (e.g., predicting cancer from microarray data), we can do well provided the features are highly predictive. (2) Careful randomization of data samples is necessary. (3) It is very easy to overfit the data, and hence robust techniques such as voted decision stumps, naive Bayes, or linear SVMs are in general very useful tools for such data sets. (4) Bayes nets do not give us causality, and hence knock-out experiments (active learning) and DBNs with time-series data can help. (5) Multi-relational methods such as SRL and ILP are helpful for predictive personalized medicine due to the relational nature of the data. (6) Collaborators are mostly interested in measures other than just accuracy. Comprehensibility, privacy, and ranking are other criteria that are important to biologists.

This chapter is necessarily incomplete because so many exciting tasks and data types exist within biology and medicine. While we have touched on many of the leading data types, other related ones also exist. For example, there are many opportunities in analyzing genomic and protein sequences (Learning Models of Biological Sequences). Other opportunities exist within phylogenetics; for example, see work by Heckerman and colleagues on HIV (Carlson et al., ). New technologies such as optical mapping are constantly being developed and refined (Ananiev et al., ). Machine learning has great potential for developing models for computer-aided diagnosis (CAD), for example, for mammography (Burnside et al., ). Data types such as metabolomics and auxotrophic growth experiments raise opportunities for active learning and for automatic revision of biological network models, for example, as in the Robot Scientist projects (Jones et al., ; Oliver et al., ). Incorporation of multiple data types can further help in mapping out the regulatory entities and networks of an organism (Noto & Craven, ). It is our hope that this article will encourage some machine learning researchers to delve deeper into these and other related opportunities.

Acknowledgment

We would like to thank Elizabeth Burnside, Michael Caldwell, Mark Craven, Jesse Davis, Lingjun Li, David Madigan, Sean McIlwain, Michael Molla, Irene Ong, Peggy Peissig, Patrick Ryan, Jude Shavlik, Michael Sussman, Humberto Vidaillet, Michael Waddell, and Steve Wesbrook.

Cross References

Learning Models of Biological Sequences

Recommended Reading

Ananiev, G. E., Goldstein, S., Runnheim, R., Forrest, D. K., Zhou, S., Potamousis, K., Churas, C. P., Bergendah, V., Thomson, J. A., & Schwartz, D. C. (). Optical mapping discerns genome wide DNA methylation profiles. BMC Molecular Biology, , doi:./---.
Baggerly, K., Morris, J. S., & Combes, K. R. (). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics, , –.
Bonneau, R., & Baker, D. (). Ab initio protein structure prediction: Progress and prospects. Annual Review of Biophysics and Biomolecular Structure, , –.
Burnside, E. S., Davis, J., Chhatwal, J., Alagoz, O., Lindstrom, M. J., Geller, B. M., Littenberg, B., Kahn, C. E., Shaffer, K., & Page, D. (). Probabilistic computer model developed from clinical data in national mammography database format to classify mammographic findings. Radiology, , –.
Carlson, J., Valenzuela-Ponce, H., Blanco-Heredia, J., Garrido-Rodriguez, D., Garcia-Morales, C., Heckerman, D., et al. (). Unique features of HLA-mediated HIV evolution in a Mexican cohort: A comparative study. Retrovirology, (), .
Davis, J., Costa, V. S., Ray, S., & Page, D. (a). An integrated approach to feature construction and model building for drug activity prediction. In Proceedings of the th international conference on machine learning (ICML).
Davis, J., Ong, I., Struyf, J., Burnside, E., Page, D., & Costa, V. S. (b). Change of representation for statistical relational learning. In Proceedings of the th international joint conference on artificial intelligence (IJCAI).
DiMaio, F., Kondrashov, D., Bitto, E., Soni, A., Bingman, C., Phillips, G., & Shavlik, J. (). Creating protein models from electron-density maps using particle-filtering methods. Bioinformatics, , –.
Easton, D. F., Pooley, K. A., Dunning, A. M., Pharoah, P. D., et al. (). Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, , –.
Finn, P., Muggleton, S., Page, D., & Srinivasan, A. (). Discovery of pharmacophores using the inductive logic programming system Progol. Machine Learning, (, ), –.
Friedman, N. (). Being Bayesian about network structure. In Machine Learning, , –.
Friedman, N., & Halpern, J. (). Modeling beliefs in dynamic systems. Part II: Revision and update. Journal of AI Research, , –.
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, B. W., Schummer, M., & Haussler, D. (). Support vector classification and validation of cancer tissue samples using microarray expression. Bioinformatics, (), –.
Getoor, L., & Taskar, B. (). Introduction to statistical relational learning. Cambridge, MA: MIT Press.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, , –.
Hardin, J., Waddell, M., Page, C. D., Zhan, F., Barlogie, B., Shaughnessy, J., et al. (). Evaluation of multiple models to distinguish closely related forms of disease using DNA microarray data: An application to multiple myeloma. Statistical Applications in Genetics and Molecular Biology, ().
Jain, A. N., Dietterich, T. G., Lathrop, R. H., Chapman, D., Critchlow, R. E., Bauer, B. E., et al. (). Compass: A shape-based machine learning tool for drug design. Journal of Computer-Aided Molecular Design, (), –.
Jones, K. E., Reiser, F. M., Bryant, P. G. K., Muggleton, C. H., Kell, S., King, D. B., et al. (). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, , –.
KDD cup (). http://pages.cs.wisc.edu/ dpage/kddcup/.
Klösgen, W. (). Handbook of data mining and knowledge discovery, chapter .: Subgroup discovery. New York: Oxford University Press.
Listgarten, J., Damaraju, S., Poulin, B., Cook, L., Dufour, J., Driga, A., et al. (). Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms. Clinical Cancer Research, , –.
Mardis, E. R. (). Anticipating the , dollar genome. Genome Biology, (), .
Martin, Y. C., Bures, M. G., Danaher, E. A., DeLazzer, J., Lico, I. I., & Pavlik, P. A. (). A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists. Journal of Computer Aided Molecular Design, , –.
McCarty, C., Wilke, R. A., Giampietro, P. F., Wesbrook, S. D., & Caldwell, M. D. (). Personalized Medicine Research Project (PMRP): Design, methods and recruitment for a large population-based biobank. Personalized Medicine, , –.
Molla, M., Waddell, M., Page, D., & Shavlik, J. (). Using machine learning to design and interpret gene expression microarrays. AI Magazine, (), –.
Muggleton, S., & De Raedt, L. (). Inductive logic programming: Theory and methods. Journal of Logic Programming, (), –.
Noto, K., & Craven, M. (). A specialized learner for inferring structured cis-regulatory modules. BMC Bioinformatics, (), doi:./---.
Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., et al. (). The automation of science. Science, , –.
Ong, I., Glassner, J., & Page, D. (). Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, , S–S.
Pe’er, D., Regev, A., Elidan, G., & Friedman, N. (). Inferring subnetworks from perturbed expression profiles. Bioinformatics, , –.
Perou, C., Jeffrey, S., Van De Rijn, M., Rees, C. A., Eisen, M. B., Ross, D. T., et al. (). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Sciences, , –.
Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., et al. (). Use of proteomic patterns in serum to identify ovarian cancer. Lancet, , –.
Rost, B., & Sander, C. (). Prediction of protein secondary structure at better than accuracy. Journal of Molecular Biology, , –.
Segal, E., Pe’er, D., Regev, A., Koller, D., & Friedman, N. (April ). Learning module networks. Journal of Machine Learning Research, , –.
Spatola, A., Page, D., Vogel, D., Blondell, S., & Crozet, Y. (). Can machine learning and combinatorial chemistry co-exist? In Proceedings of the American Peptide Symposium. Kluwer Academic Publishers.
Srinivasan, A. (). The Aleph manual. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.
Storey, J. D., & Tibshirani, R. (). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, , –.
The International Warfarin Pharmacogenetics Consortium (IWPC) (). Estimation of the Warfarin dose with clinical and pharmacogenetic data. The New England Journal of Medicine, :–.
Tucker, A., Vinciotti, V., Hoen, P. A. C., Liu, X., & Famili, A. F. (). Bayesian network classifiers for time-series microarray data. Advances in Intelligent Data Analysis VI, , –.
Van’t Veer, L. L., Dai, H., van de Vijver, M. M., He, Y., Hart, A., Mao, M., et al. (). Gene expression profiling predicts clinical outcome of breast cancer. Nature, , –.
Waddell, M., Page, D., & Shaughnessy, J., Jr. (). Predicting cancer susceptibility from single-nucleotide polymorphism data: A case study in multiple myeloma. BIOKDD’: Proceedings of the fifth international workshop on bioinformatics, Chicago, IL.
Wrobel, S. (). An algorithm for multi-relational discovery of subgroups. In European symposium on principles of KDD (pp. –). Lecture notes in computer science, Springer, Norway.
Zhang, X., Mesirov, J. P., & Waltz, D. L. (). Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, , –.
Zou, M., & Conzen, S. D. (). A new dynamic Bayesian network approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, , –.

Blog Mining

Blog mining is the application of data mining (in particular, Web mining) techniques to blogs, adapted to the content, format, and language of the blog medium. A blog is a (more or less) frequently updated publication on the Web, sorted in (usually reverse) chronological order of the constituent blog posts. As in other areas of the Web, mining is applied to the content of blogs, to the various types of links between blogs, and to blog-related behavior. The latter comprises blog authoring (including link setting), blog reading and commenting, and querying (often in blog search engines). For more details on blogs and on mining them, see text mining for news and blogs analysis.

Boltzmann Machines

Geoffrey Hinton
University of Toronto, ON, Canada

Synonyms

Boltzmann machines

Definition

A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm (Hinton & Sejnowski, ) that allows them to discover interesting features that represent complex regularities in the training data. The learning algorithm is very slow in networks with many layers of feature detectors, but it is fast in “restricted Boltzmann machines” that have a single layer of feature detectors. Many hidden layers can be learned efficiently by composing restricted Boltzmann machines, using the feature activations of one as the training data for the next.

Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent a cost function. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that have low values of the cost function. For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must learn to generate these vectors with high probability. To do this, it must find weights on the connections so that, relative to other possible binary vectors, the data vectors have low values of the cost function. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems.

Motivation and Background

The brain is very good at settling on a sensible interpretation of its sensory input within a few hundred milliseconds, and it is also very good, over a much longer timescale, at learning the code that is used to express its interpretations. It achieves both the settling and the learning using spiking neurons which, over a period of a few milliseconds, have a state of 1 or 0. These neurons have intrinsic noise caused by the quantal release of vesicles of neurotransmitter at the synapses between the neurons. Boltzmann machines were designed to model both the settling and the learning, and were based on two seminal ideas that appeared in . Hopfield () showed that a neural network composed of binary units would settle to a minimum of a simple, quadratic energy function provided that the units were updated asynchronously and the pairwise connections between units were symmetrically weighted. Kirkpatrick et al. () showed that systems that were settling to energy minima could find deeper minima if noise was added to the update rule so that the system could occasionally increase its energy to escape from poor local minima. Adding noise to a Hopfield net allows it to find deeper minima that represent more probable interpretations of the sensory data. More significantly, by using the right kind of noise, it is possible to make the log probability of finding the system in a particular global configuration be a linear function of its energy. This makes it possible to manipulate log probabilities by manipulating energies, and since energies are simple local functions of the connection weights, this leads to a simple, local learning rule.

Structure of Learning System The learning procedure for updating the connection weights of a Boltzmann machine is very simple, but to understand why it works it is first necessary to understand how a Boltzmann machine models a probability distribution over a set of binary vectors and how it samples from this distribution.

The Stochastic Dynamics of a Boltzmann Machine

When unit i is given the opportunity to update its binary state, it first computes its total input, $x_i$, which is the sum of its own bias, $b_i$, and the weights on connections coming from other active units:

$$x_i = b_i + \sum_j s_j w_{ij} \qquad (1)$$

where $w_{ij}$ is the weight on the connection between i and j, and $s_j$ is 1 if unit j is on and 0 otherwise. Unit i then turns on with a probability given by the logistic function:

$$\mathrm{prob}(s_i = 1) = \frac{1}{1 + e^{-x_i}} \qquad (2)$$

If the units are updated sequentially in any order that does not depend on their total inputs, the network will eventually reach a Boltzmann distribution (also called its equilibrium or stationary distribution) in which the probability of a state vector, v, is determined solely by the “energy” of that state vector relative to the energies of all possible binary state vectors:

$$P(v) = \frac{e^{-E(v)}}{\sum_u e^{-E(u)}} \qquad (3)$$
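The update rule and the approach to equilibrium can be illustrated with a minimal Python sketch. It assumes a tiny two-unit network whose weights and biases are made up for illustration, updates the units in a fixed order, and counts how often each global state is visited. The function names are hypothetical and are not part of the original description.

```python
import math
import random


def total_input(i, state, weights, biases):
    """Total input of unit i: its bias plus the weights from all active units."""
    return biases[i] + sum(
        state[j] * weights[i][j] for j in range(len(state)) if j != i
    )


def prob_on(x):
    """Logistic probability that a unit with total input x turns on."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_states(weights, biases, sweeps, rng):
    """Update units one at a time in a fixed order; count the state after each sweep."""
    n = len(biases)
    state = [0] * n
    counts = {}
    for _ in range(sweeps):
        for i in range(n):  # the order does not depend on the total inputs
            x = total_input(i, state, weights, biases)
            state[i] = 1 if rng.random() < prob_on(x) else 0
        key = tuple(state)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical two-unit network with one symmetric, positive weight.
weights = [[0.0, 2.0], [2.0, 0.0]]
biases = [-1.0, -1.0]
counts = sample_states(weights, biases, sweeps=20000, rng=random.Random(0))
```

With these assumed parameters, a unit is more likely to be on when its neighbour is on and more likely to be off when its neighbour is off, so the counts for the states (0, 0) and (1, 1) should be noticeably larger than those for the two mixed states.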


As in Hopfield nets, the energy of state vector v is defined as

$$E(v) = -\sum_i s_i^v b_i - \sum_{i<j} s_i^v s_j^v w_{ij} \qquad (4)$$

where $s_i^v$ is the binary state assigned to unit i by state vector v.

… FPcost. Thus, given the values of FNcost and FPcost, a variety of cost-sensitive meta-learning methods can be, and have been, used to solve the class imbalance problem (Japkowicz & Stephen, ; Ling & Li, ). If the values of


FNcost and FPcost are not known explicitly, FNcost and FPcost can be assigned to be proportional to the number of positive and negative training cases (Japkowicz & Stephen, ). If the class distributions of the training and test datasets are different (e.g., if the training data is highly imbalanced but the test data is more balanced), an obvious approach is to sample the training data such that its class distribution is the same as that of the test data. This can be achieved by oversampling (creating multiple copies of examples of) the minority class and/or undersampling (selecting a subset of) the majority class (Provost, ). Note that sometimes the number of examples of the minority class is too small for classifiers to learn adequately. This is the problem of insufficient (small) training data and is different from that of imbalanced datasets.
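The two resampling strategies can be sketched in a few lines of Python. This is a minimal illustration only: the data, the helper names, and the choice of a balanced (50/50) target distribution are assumptions made for the example.

```python
import random


def oversample(minority, target_size, rng):
    """Add randomly chosen copies of minority examples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def undersample(majority, target_size, rng):
    """Keep a random subset of the majority class of the given size."""
    return rng.sample(majority, target_size)


rng = random.Random(0)
positives = [("pos", k) for k in range(5)]    # hypothetical minority class
negatives = [("neg", k) for k in range(95)]   # hypothetical majority class

# Assumed target: a test set in which both classes are equally frequent.
balanced_by_oversampling = oversample(positives, len(negatives), rng) + negatives
balanced_by_undersampling = positives + undersample(negatives, len(positives), rng)
```

Oversampling keeps every original example but repeats minority cases, whereas undersampling discards majority cases; which is preferable depends on how much data can be spared.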

Recommended Reading Drummond, C., & Holte, R. (). Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the seventeenth international conference on machine learning (pp. –). Drummond, C., & Holte, R. (). Severe class imbalance: Why better algorithms aren’t the answer. In Proceedings of the sixteenth European conference of machine learning, LNAI (Vol. , pp. –). Japkowicz, N., & Stephen, S. (). The class imbalance problem: A systematic study. Intelligent Data Analysis, (), –. Ling, C. X., & Li, C. (). Data mining for direct marketing – Specific problems and solutions. In Proceedings of fourth international conference on Knowledge Discovery and Data Mining (KDD-) (pp. –). Provost, F. (). Machine learning from imbalanced data sets . In Proceedings of the AAAI’ workshop on imbalanced data.

Classification Chris Drummond National Research Council of Canada

Synonyms Categorization; Generalization; Identification; Induction; Recognition

Definition In common usage, the word classification means putting things into categories, grouping them together in some useful way. If we are screening for a disease, we would group people into those with the disease and those without. We, as humans, usually do this because things in a group, called a 7class in machine learning, share common characteristics. If we know the class of something, we know a lot about it. In machine learning, the term classification is most commonly associated with a particular type of learning where examples of one or more 7classes, labeled with the name of the class, are given to the learning algorithm. The algorithm produces a classifier which maps the properties of these examples, normally expressed as 7attribute-value pairs, to the class labels. A new example whose class is unknown is classified when it is given a class label by the classifier based on its properties. In machine learning, we use the word classification because we call the grouping of things a class. We should note, however, that other fields use different terms. In philosophy and statistics, the term categorization is more commonly used. In many areas, in fact, classification often refers to what is called 7clustering in machine learning.

Motivation and Background Classification is a common, and important, human activity. Knowing something’s class allows us to predict many of its properties and so act appropriately. Telling other people its class allows them to do the same, making for efficient communication. This emphasizes two commonly held views of the objectives of learning. First, it is a means of 7generalization, to predict accurately the values for previously unseen examples. Second, it is a means of compression, to make transmission or communication more efficient. Classification is certainly not a new idea and has been studied for some considerable time. From the days of the early Greek philosophers such as Socrates, we had the idea of categorization. There are essential properties of things that make them what they are. It embodies the idea that there are natural kinds, ways of grouping things, that are inherent in the world. A major goal of learning, therefore, is recognizing natural kinds, establishing the necessary and sufficient conditions for belonging to a category. This “classical” view of categorization, most

often attributed to Aristotle, is now strongly disputed. The main competitor is prototype theory; things are categorized by their similarity to a prototypical example (Lakoff, ), either real or imagined. There is also much debate in psychology (Ashby & Maddox, ), where many argue that there is no single method of categorization used by humans. As much of the inspiration for machine learning originated in how humans learn, it is unsurprising that our algorithms reflect these distinctions. 7Nearest neighbor algorithms would seem to have much in common with prototype theory. These have been part of pattern recognition for some time (Cover & Hart, ) and have become popular in machine learning, more recently, as 7instance-based learners (Aha, Kibler, & Albert, ). In machine learning, we measure the distance to one or more members of a concept rather than to a specially constructed prototype. So, this type of learning is perhaps more a case of the exemplar learning found in the psychological literature, where multiple examples represent a category. The closest we have to prototype learning occurs in clustering, a type of 7unsupervised learning, rather than classification. For example, in 7k-means clustering group membership is determined by closeness to a central value. In the early days of machine learning, our algorithms (Mitchell, ; Winston, ) had much in common with the classical theory of categorization in philosophy and psychology. It was assumed that the data were consistent, that is, there were no examples with the same attribute values but belonging to different classes. It was quickly realized that, even if the properties were necessary and sufficient to capture the class, there was often noise in the attribute and perhaps the class values. So, complete consistency was seldom attainable in practice.
New 7classification algorithms were designed, which could tolerate some noise, such as 7decision trees (Breiman, Friedman, Olshen, & Stone, ; Quinlan, , ) and rule-based learners (see 7Rule Learning) (Clark & Niblett, ; Holte, ; Michalski, ).

Structure of the Learning System Whether one uses instance-based learning, rule-based learning, decision trees, or indeed any other classification algorithm, the end result is the division of the input space into regions belonging to a single class. The input space is defined by the Cartesian product of the attributes, all possible combinations of possible values. As a simple example, Fig. shows two classes + and −, each a random sample of a normal distribution. The attributes are X and Y of real type. The values for each attribute range from −∞ to +∞. The figure shows a couple of alternative ways that the space may be divided into regions. The bold dark lines construct regions using lines that are parallel to the axes. New examples that have Y less than and X less than . will be classified as +, all others classified as −. Decision trees and rules form this type of boundary. A 7linear discriminant function, such as the bold dashed line, would divide the space into half-spaces, with new examples below the line being classified as + and those above as −. Instance-based learning will also divide the space into regions but the boundary is implicit. Classification occurs by choosing the class of the majority of the nearest neighbors to a new example. To make the boundary explicit, we could mark the regions where an example would be classified as + and those classified as −. We would end up with regions bounded by polygons. What differs among the algorithms is the shape of the regions, and how and when they are chosen. Sometimes the regions are implicit as in lazy learners (see 7Lazy Learning) (Aha, ), where the boundaries are not decided until a new example is being classified.

Classification. Figure . Dividing the input space (scatter plot of + and − examples; axes X and Y, each shown from −4 to 4)


Sometimes the regions are determined by decision theory as in generative classifiers (see 7Generative Learners) (Rubinstein & Hastie, ), which model the full joint distribution of the classes. For all classifiers though, the input space is effectively partitioned into regions representing a single class.
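The three kinds of boundary described above can be made concrete with a small Python sketch. The thresholds, the discriminant weights, and the training points are all hypothetical values chosen for illustration; they are not the ones used in the figure.

```python
from collections import Counter


def axis_parallel_rule(x, y, x_max=1.0, y_max=0.0):
    """Tree- or rule-style region: '+' inside an axis-parallel corner, '-' elsewhere."""
    return "+" if (x < x_max and y < y_max) else "-"


def linear_discriminant(x, y, w=(1.0, 1.0), b=0.0):
    """Half-space split: '+' below the line w[0]*x + w[1]*y + b = 0, '-' above it."""
    return "+" if w[0] * x + w[1] * y + b < 0 else "-"


def nearest_neighbors(x, y, training, k=3):
    """Implicit boundary: majority class among the k closest training examples."""
    by_distance = sorted(
        training, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples (x, y, class).
training = [
    (-1.0, -1.0, "+"), (-0.5, -1.5, "+"), (-1.5, -0.5, "+"),
    (1.0, 1.0, "-"), (1.5, 0.5, "-"), (0.5, 1.5, "-"),
]

query = (-1.0, -0.8)
predictions = (
    axis_parallel_rule(*query),
    linear_discriminant(*query),
    nearest_neighbors(*query, training),
)
```

The first two functions fix their regions in advance, while the nearest-neighbor function only decides the region a point falls in when that point is queried, which mirrors the lazy-learning case.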

Applications One of the reasons that classification is an important part of machine learning is that it has proved to be a very useful technique for solving practical problems. Classification has been used to help scientists in the exploration, and comprehension, of their particular domains of interest. It has also been used to help solve significant industrial problems. Over the years a number of authors have stressed the importance of applications to machine learning and listed many successful examples (Brachman, Khabaza, Kloesgen, Piatetsky-Shapiro, & Simoudis, ; Langley & Simon, ; Michie, ). There have also been workshops on applications (Aha & Riddle, ; Engels, Evans, Herrmann, & Verdenius, ; Kodratoff, ) at major machine learning conferences and a special issue of Machine Learning (Kohavi & Provost, ), one of the main journals in the field. There are now conferences that are highly focused on applications. Collocated with major artificial intelligence conferences is the Innovative Applications of Artificial Intelligence conference. Since , this conference has highlighted practical applications of machine learning, including classification (Schorr & Rappaport, ). In addition, there are now at least two major knowledge discovery and 7data mining conferences (Fayyad & Uthurusamy, ; Komorowski & Zytkow, ) with a strong focus on applications.

Future Directions In machine learning, there are already a large number of different classification algorithms, yet new ones still appear. It seems unlikely that there is an end in sight. The “no free lunch theory” (Wolpert & Macready, ) indicates that there will never be a single best algorithm, better than all others in terms of predictive power. However, apart from their predictive performance, each classifier has its own attractive properties which are important to different groups of people. So,

new algorithms are still of value. Further, even if we are solely concerned about performance, it may be useful to have many different algorithms, all with their own biases (see 7Inductive Bias). They may be combined together to form an ensemble classifier (Caruana, Niculescu-Mizil, Crew, & Ksikes, ), which outperforms single classifiers of one type (see 7Ensemble Learning).
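As a rough illustration of why classifiers with different biases can be worth combining, the following Python sketch uses simple majority voting. Voting is only one of many ways to build an ensemble and is not the ensemble selection method cited above; the three component classifiers and the data are invented for the example.

```python
from collections import Counter


def majority_vote(classifiers, example):
    """Return the label predicted by most of the component classifiers."""
    votes = Counter(clf(example) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical task: the true class of a number x is (x > 0).
# Each component classifier makes its errors in a different part of the input space.
classifiers = [
    lambda x: x > 0.5,                     # wrong for small positive x
    lambda x: x > -0.5,                    # wrong for small negative x
    lambda x: (x > 0) != (abs(x) > 3),     # wrong for extreme x
]

examples = [-4.0, -1.0, -0.2, 0.3, 1.0, 4.0]
ensemble_predictions = [majority_vote(classifiers, x) for x in examples]
```

On these examples every component classifier makes at least one mistake, yet the vote is correct throughout because no two components err on the same example.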

Limitations Classification has been a critical part of machine learning research for some time. There is a concern that the emphasis on classification, and more generally on 7supervised learning, is too strong. Certainly much of human learning does not use, or require, labels supplied by an expert. Arguably, unsupervised learning should play a more central role in machine learning research. Although classification does require a label, it does not necessarily need an expert to provide labeled examples. Many successful applications rely on finding some easily identifiable property which stands in for the class.

Recommended Reading Aha, D. W. (). Editorial. Artificial Intelligence Review, (–), –. Aha, D. W., Kibler, D., & Albert, M. K. (). Instance-based learning algorithms. Machine Learning, (), –. Aha, D. W., & Riddle, P. J. (Eds.). (). Workshop on applying machine learning in practice. In Proceedings of the th international conference on machine learning. Ashby, F. G., & Maddox, W. T. (). Human category learning. Annual Review of Psychology, , –. Bishop, C. M. (). Pattern recognition and machine learning. New York: Springer. Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., & Simoudis, E. (). Mining business databases. Communications of the ACM, (), –. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (). Classification and regression trees. Belmont, CA: Wadsworth. Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (). Ensemble selection from libraries of models. In Proceedings of the st international conference on machine learning (pp. –). Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, , –. Cover, T., & Hart, P. (). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, , –. Dietterich, T., & Shavlik, J. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann. Engels, R., Evans, B., Herrmann, J., & Verdenius, F. (Eds.). (). Workshop on machine learning applications in the real world;


methodological aspects and implications. In Proceedings of the th international conference on machine learning. Fayyad, U. M., & Uthurusamy, R. (Eds.). (). Proceedings of the first international conference on knowledge discovery and data mining. Holte, R. C. (). Very simple classification rules perform well on most commonly used datasets. Machine Learning, (), –. Kodratoff, Y. (Ed.). (). Proceedings of MLNet workshop on industrial application of machine learning. Kodratoff, Y., & Michalski, R. S. (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Kohavi, R., & Provost, F. (). Glossary of terms. Editorial for the special issue on applications of machine learning and the knowledge discovery process. Machine Learning, (/). Komorowski, H. J., & Zytkow, J. M. (Eds.). (). Proceedings of the first European conference on principles of data mining and knowledge discovery. Lakoff, G. (). Women, fire and dangerous things. Chicago, IL: University of Chicago Press. Langley, P., & Simon, H. A. (). Applications of machine learning and rule induction. Communications of the ACM, (), –. Michalski, R. S. (). A theory and methodology of inductive learning. In R. S. Michalski, T. J. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. –). Palo Alto, CA: TIOGA Publishing. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga Publishing Company. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Michie, D. (). Machine intelligence and related topics. New York: Gordon and Breach Science Publishers. Mitchell, T. M. (). Version spaces: A candidate elimination approach to rule learning. In Proceedings of the fifth international joint conferences on artificial intelligence (pp. –). Mitchell, T. M. (). 
Machine learning. Boston, MA: McGraw-Hill. Quinlan, J. R. (). Induction of decision trees. Machine Learning, , –. Quinlan, J. R. (). C. programs for machine learning. San Mateo, CA: Morgan Kaufmann. Rubinstein, Y. D., & Hastie, T. (). Discriminative vs informative learning. In Proceedings of the third international conference on knowledge discovery and data mining (pp. –). Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice-Hall. Schorr, H., & Rappaport, A. (Eds.). (). Proceedings of the first conference on innovative applications of artificial intelligence. Winston, P. H. (). Learning structural descriptions from examples. In P. H. Winston (Ed.), The psychology of computer vision (pp. –). New York: McGraw-Hill. Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann. Wolpert, D. H., & Macready, W. G. (). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, (), –.


Classification Algorithms There is a very large number of classification algorithms, including 7decision trees, 7instance-based learners, 7support vector machines, 7rule-based learners, 7neural networks, and 7Bayesian networks. There are also ways of combining them into 7ensemble classifiers such as 7boosting, 7bagging, 7stacking, and 7forests of trees. To delve deeper into classifiers and their role in machine learning, a number of books are recommended covering machine learning (Bishop, ; Mitchell, ; Witten & Frank, ) and artificial intelligence (Russell & Norvig, ) in general. Seminal papers on classifiers can be found in collections of papers on machine learning (Dietterich & Shavlik, ; Kodratoff & Michalski, ; Michalski, Carbonell, & Mitchell, , ).

Recommended Reading Bishop, C. M. (). Pattern recognition and machine learning. New York: Springer. Dietterich, T., & Shavlik, J. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann. Kodratoff, Y., & Michalski, R. S. (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga Publishing Company. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Mitchell, T. M. (). Machine learning. Boston, MA: McGraw-Hill. Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice-Hall. Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann.

Classification Learning 7Concept Learning

Classification Tree 7Decision Tree


Classifier Systems Pier Luca Lanzi Politecnico di Milano, Milano, Italy

Synonyms Genetics-based machine learning; Learning classifier systems

Definition Classifier systems are rule-based systems that combine 7temporal difference learning or 7supervised learning with a genetic algorithm to solve classification and 7reinforcement learning problems. Classifier systems come in two flavors: Michigan classifier systems, which are designed for online learning, but can also tackle offline problems; and Pittsburgh classifier systems, which can only be applied to offline learning. In Michigan classifier systems (Holland, ), learning is viewed as an online adaptation process to an unknown environment that represents the problem and provides feedback in terms of a numerical reward. Michigan classifier systems maintain a single candidate solution consisting of a set of rules, or a population of classifiers. Michigan systems apply (1) temporal difference learning to distribute the incoming reward to the classifiers that are accountable for it; and (2) a genetic algorithm to select, recombine, and mutate individual classifiers so as to improve their contribution to the current solution. In contrast, in Pittsburgh classifier systems (Smith, ), learning is viewed as an offline optimization process in which a genetic algorithm alone is applied to search for the best solution to a given problem. In addition, Pittsburgh classifier systems maintain not one, but a set of candidate solutions. While in the Michigan classifier system each individual classifier represents a part of the overall solution, in the Pittsburgh system each individual is a complete candidate solution (itself consisting of a set of classifiers). The fitness of each Pittsburgh individual is computed offline by testing it on a representative sample of problem instances. The individuals compete among themselves through selection, while crossover and mutation recombine solutions to search for better solutions.

Motivation and Background Machine learning is usually viewed as a search process in which a solution space is explored until an appropriate solution to the target problem is found (Mitchell, ) (see 7Learning as Search). Machine learning methods are characterized by the way they represent solutions (e.g., using 7decision trees, rules), by the way they evaluate solutions (e.g., classification accuracy, information gain) and by the way they explore the solution space (e.g., using a 7general-to-specific strategy or a 7specific-to-general strategy). Classifier systems are methods of genetics-based machine learning introduced by Holland, the father of 7genetic algorithms. They made their first appearance in Holland () where the first diagram of a classifier system, labeled “cognitive system,” was shown. Subsequently, they were described in detail in the paper “Cognitive Systems based on Adaptive Algorithms” (Holland and Reitman, ). Classifier systems are characterized by a rule-based representation of solutions and a genetics-based exploration of the solution space. While other 7rule learning methods, such as CN (Clark & Niblett, ) and FOIL (Quinlan & Cameron-Jones, ), generate one rule at a time following a sequential covering strategy (see 7Covering Algorithm), classifier systems work on one or more solutions at once, and they explore the solution space by applying the principles of natural selection and genetics. In classifier systems (Holland, ; Holland and Reitman, ; Wilson, ), machine learning is modeled as an online adaptation process to an unknown environment, which provides feedback in terms of a numerical reward. A classifier system perceives the environment through its detectors and, based on its sensations, it selects an action to be performed in the environment through its effectors. Depending on the efficacy of its actions, the environment may eventually reward the system. 
A classifier system learns by trying to maximize the amount of reward it receives from the environment. To pursue such a goal, it maintains a set (a population) of condition-action-prediction rules, called classifiers, which represents the current solution. Each classifier’s condition identifies some part of the problem domain; the classifier’s action represents a decision on the subproblem identified by its condition; and the classifier’s prediction, or strength, estimates the value of the action in terms of future


rewards on that subproblem. Two separate components, credit assignment and rule discovery, act on the population with different goals. 7Credit assignment, implemented either by methods of temporal difference or supervised learning, exploits the incoming reward to estimate the action values in each subproblem so as to identify the best classifiers in the population. At the same time, rule discovery, usually implemented by a genetic algorithm, selects, recombines, and mutates the classifiers in the population to improve the current solution. Classifier systems were initially conceived as modeling tools. Given a real system with unknown underlying dynamics, for instance a financial market, a classifier system would be used to generate a behavior that matched the real system. The evolved rules would provide a plausible, human readable model of the unknown system – a way to look inside the box. Subsequently, with the developments in the area of machine learning and the rise of reinforcement learning, classifier systems have been more and more often studied and presented as alternatives to other machine learning methods. Wilson’s XCS (), the most successful classifier system to date, has proven to be both a valid alternative to other reinforcement learning approaches and an effective approach to classification and data mining (Bull, ; Bull & Kovacs, ; Lanzi, Stolzmann, & Wilson, ). Kenneth de Jong and his students (de Jong, ; Smith, , ) took a different perspective on genetics-based machine learning and modeled learning as an optimization process rather than an adaptation process as done in Holland (). In this case, the solution space is explored by applying a genetic algorithm to a population of individuals each representing a complete candidate solution – that is, a set of rules (or a production system, de Jong, ; Smith, ). 
At each cycle, a critic is applied to each individual (to each set of rules) to obtain a performance measure that is then used by the genetic algorithm to guide the exploration of the solution space. The individuals in the population compete among themselves through selection, while crossover and mutation recombine solutions to search for better ones. The approaches of Holland (Holland, ; Holland and Reitman, ) and de Jong (de Jong, ; Smith, , ) have been extended and improved


in several ways (see Lanzi et al. () for a review). The models of classifier systems that are inspired by the work of Holland () at the University of Michigan are usually called Michigan classifier systems; the ones that are inspired by Smith (, ) and de Jong () at the University of Pittsburgh are usually termed Pittsburgh classifier systems – or briefly, Pitt classifier systems. Pittsburgh classifier systems separate the evaluation of candidate solutions, performed by an external critic, from the genetic search. As they evaluate candidate solutions as a whole, Pittsburgh classifier systems can easily identify and emphasize sequentially cooperating classifiers, which is particularly helpful in problems involving partial observability. In contrast, in Michigan classifier systems the credit assignment is focused, due to identification of the actual classifiers that produce the reward, so learning is much faster but sequentially cooperating classifiers are more difficult to spot. As Pittsburgh classifier systems apply the genetic algorithm to a set of solutions, they only work offline, whereas Michigan classifier systems work online, although they can also tackle offline problems. Finally, the design of Pittsburgh classifier systems involves decisions as to how an entire solution should be represented and how solutions should be recombined – a task which can be daunting. In contrast, the design of Michigan classifier systems involves simpler decisions about how a rule should be represented and how two rules should be recombined. Accordingly, while the representation of solutions and its related issues play a key role in Pittsburgh models, Michigan models easily work with several types of representations (Lanzi, ; Lanzi & Perrucci, ; Mellor, ).
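The Pittsburgh view, in which every individual is a complete rule set scored as a whole by an external critic, can be sketched as follows. This is a deliberately simplified Python illustration: the task, the ternary rule encoding, truncation selection, and all parameter values are assumptions made for the example and do not reproduce any particular published system.

```python
import random


def matches(condition, inputs):
    """A condition matches when every non-'#' symbol equals the input bit."""
    return all(c == "#" or c == bit for c, bit in zip(condition, inputs))


def classify(rule_set, inputs, default="0"):
    """The first matching rule decides; otherwise fall back to a default action."""
    for condition, action in rule_set:
        if matches(condition, inputs):
            return action
    return default


def critic(rule_set, sample):
    """External critic: fraction of sample instances the whole rule set gets right."""
    return sum(classify(rule_set, x) == y for x, y in sample) / len(sample)


def random_rule(length, rng):
    return ("".join(rng.choice("01#") for _ in range(length)), rng.choice("01"))


def evolve(sample, length, pop_size, rules_per_set, generations, rng):
    population = [
        [random_rule(length, rng) for _ in range(rules_per_set)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=lambda rs: critic(rs, sample), reverse=True)
        parents = ranked[: pop_size // 2]           # truncation selection (assumed)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, rules_per_set)   # crossover at a rule boundary
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:                  # mutation: replace one rule
                child[rng.randrange(rules_per_set)] = random_rule(length, rng)
            children.append(child)
        population = parents + children
    return max(population, key=lambda rs: critic(rs, sample))


# Hypothetical task: the correct action is the first input bit.
sample = [(f"{n:03b}", f"{n:03b}"[0]) for n in range(8)]
best = evolve(sample, length=3, pop_size=20, rules_per_set=3,
              generations=30, rng=random.Random(0))
```

Because the critic only ever sees whole rule sets, credit is never assigned to individual rules here, which is the main structural difference from the Michigan approach.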

Structure of the Learning System Michigan and Pittsburgh classifier systems were both inspired by the work of Holland on the broadcast language (Holland, ). However, their structures reflect two different ways to model machine learning: as an adaptation process in the case of Michigan classifier systems; and as an optimization problem, in the case of Pittsburgh classifier systems. Thus, the two models, originating from the same idea (Holland’s broadcast language), have radically different structures.


Michigan Classifier Systems Holland’s classifier systems define a general paradigm for genetics-based machine learning. The description in Holland and Reitman () provides a list of principles for online learning through adaptation. Over the years, such principles have guided researchers who developed several models of Michigan classifier systems (Butz, ; Wilson, , , ) and applied them to a large variety of domains (Bull, ; Lanzi & Riolo, ; Lanzi et al., ). These models extended and improved Holland’s original ideas, but kept all the ingredients of the original recipe: a population of classifiers, which represents the current system knowledge; a performance component, which is responsible for the short-term behavior of the system; a credit assignment (or reinforcement) component, which distributes the incoming reward among the classifiers; and a rule discovery component, which applies a genetic algorithm to the classifiers to improve the current knowledge.

Knowledge Representation In Michigan classifier systems, knowledge is represented by a population of classifiers. Each classifier is usually defined by four main parameters: the condition, which identifies some part of the problem domain; the action, which represents a decision on the subproblem identified by its condition; the prediction or strength, which estimates the amount of reward that the system will receive if its action is performed; and finally, the fitness, which estimates how good the classifier is in terms of problem solution. The knowledge representation of Michigan classifier systems is extremely flexible. Each one of the four classifier components can be tailored to fit the need of a particular application, without modifying the main structure of the system. In problems involving binary inputs, classifier conditions can be simply represented using strings defined over the alphabet {0, 1, #}, as done in Holland and Reitman (), Goldberg (), and Wilson (). In problems involving real inputs, conditions can be represented as disjunctions of intervals, similar to the ones produced by other rule learning methods (Clark & Niblett, ). Conditions can also be represented as general-purpose symbolic expressions

(Lanzi, ; Lanzi & Perrucci, ) or first-order logic expressions (Mellor, ). Classifier actions are typically encoded by a set of symbols (either binary strings or simple labels), but continuous real-valued actions are also available (Wilson, ). Classifier prediction (or strength) is usually encoded by a parameter (Goldberg, ; Holland & Reitman, ; Wilson, ). However, classifier prediction can also be computed using a parameterized function (Wilson, ), which results in solutions represented as an ensemble of local approximators – similar to the ones produced in generalized reinforcement learning (Sutton & Barto, ).
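For the binary-input case, the four classifier parameters and the ternary condition can be written down directly. The following Python sketch is illustrative only; the class and function names are hypothetical, and the '#' symbol is treated as a wildcard that matches either input bit.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str     # string over {0, 1, #}; '#' matches either input bit
    action: str        # decision advocated for the matched subproblem
    prediction: float  # estimated reward if the action is performed
    fitness: float     # estimate of how good the classifier is


def matches(condition, inputs):
    """True when the condition covers the binary input string."""
    return len(condition) == len(inputs) and all(
        c == "#" or c == bit for c, bit in zip(condition, inputs)
    )


# Hypothetical classifier: "if the first bit is 1 and the last is 0, go right".
rule = Classifier(condition="1#0", action="right", prediction=50.0, fitness=0.8)
```

A condition consisting only of '#' symbols covers every input, while a condition without any '#' covers exactly one.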

Performance Component A simplified structure of Michigan classifier systems is shown in Fig. . We refer the reader to Goldberg () and Holland and Reitman () for a detailed description of the original model and to Butz () and Wilson (, , ) for descriptions of recent classifier system models. A classifier system learns through trial and error interactions with an unknown environment. The system and the environment interact continually. At each time step, the classifier system perceives the environment through its detectors; it builds a match set containing all the classifiers in the population whose condition matches the current sensory input. The match set typically contains classifiers that advocate contrasting actions; accordingly, the classifier system evaluates each action in the match set, and selects an action to be performed balancing exploration and exploitation. The selected action is sent to the effectors to be executed in the environment; depending on the effect that the action has in the environment, the system receives a scalar reward.
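One step of the performance component might look like the Python sketch below. The population, the fitness-weighted averaging used to evaluate actions, and the epsilon-greedy selection rule are assumptions chosen for illustration; the text above does not prescribe a particular evaluation or selection scheme.

```python
import random

# Each classifier is (condition, action, prediction, fitness); values are made up.
population = [
    ("1#", "left", 10.0, 0.9),
    ("1#", "right", 40.0, 0.5),
    ("#0", "right", 60.0, 0.5),
    ("01", "left", 99.0, 1.0),
]


def matches(condition, inputs):
    return all(c == "#" or c == bit for c, bit in zip(condition, inputs))


def build_match_set(population, inputs):
    """All classifiers whose condition matches the current sensory input."""
    return [cl for cl in population if matches(cl[0], inputs)]


def evaluate_actions(match_set):
    """Fitness-weighted average prediction for each advocated action (assumed scheme)."""
    totals = {}
    for _, action, prediction, fitness in match_set:
        num, den = totals.get(action, (0.0, 0.0))
        totals[action] = (num + prediction * fitness, den + fitness)
    return {action: num / den for action, (num, den) in totals.items()}


def select_action(values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-valued action."""
    if rng.random() < epsilon:
        return rng.choice(sorted(values))
    return max(values, key=values.get)


match_set = build_match_set(population, "10")
values = evaluate_actions(match_set)
action = select_action(values, epsilon=0.0, rng=random.Random(0))
```

Setting `epsilon` to zero makes the choice purely exploitative; a positive value lets the system occasionally try actions whose current estimate is lower.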

Credit Assignment The credit assignment component (also called reinforcement component, Wilson ) distributes the incoming reward to the classifiers that are accountable for it. In Holland and Reitman (), credit assignment is implemented by Holland’s bucket brigade algorithm (Holland, ), which was partially inspired by the credit allocation mechanism used by Samuel in his pioneering work on learning checkers-playing programs (Samuel, ). In the early years, classifier systems and the bucket brigade algorithm were confined to the evolutionary computation community. The rise of reinforcement learning increased the connection between classifier systems and temporal difference learning (Sutton, ; Sutton & Barto, ): in particular, Sutton () showed that the bucket brigade algorithm is a kind of temporal difference learning, and similar connections were also made in Watkins () and Dorigo and Bersini (). Later, the connection between classifier systems and reinforcement learning became tighter with the introduction of Wilson’s XCS (), in which credit assignment is implemented by a modification of Watkins’ Q-learning (Watkins, ). As a consequence, in recent years, classifier systems are often presented as methods of reinforcement learning with genetics-based generalization (Bull & Kovacs, ).

Classifier Systems. Figure . Simplified structure of a Michigan classifier system. The system perceives the environment through its detectors and (1) it builds the match set containing the classifiers in the population that match the current sensory inputs; then (2) all the actions in the match set are evaluated, and (3) an action is selected to be performed in the environment through the effectors

Rule Discovery Component The rule discovery component is usually implemented by a genetic algorithm that selects classifiers in the population with probability proportional to their fitness; it copies the selected classifiers and applies genetic operators (usually crossover and mutation) to the offspring classifiers; the new classifiers are inserted in the population, while other classifiers are deleted to keep the population size constant. Classifier selection plays a central role in rule discovery. It depends on the definition of classifier fitness and on the subset of classifiers considered during the selection process. In Holland and Reitman (), classifier fitness coincides with classifier prediction, while selection is applied to all the classifiers in the population. This approach results in a pressure toward classifiers predicting high returns, but typically tends to produce overly general solutions. To avoid such solutions, Wilson () introduced the XCS classifier system, in which accuracy-based fitness is coupled with a niched genetic algorithm. This approach results in a pressure toward accurate, maximally general classifiers, and has made XCS the most successful classifier system to date.
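One step of such a rule discovery component might look like the following Python sketch. It illustrates only the generic scheme (fitness-proportionate selection, crossover, mutation, insertion, deletion); it is not XCS, it has no niching, and several details are assumptions made for the example: classifiers are plain dictionaries, conditions are ternary strings, the offspring starts from the mean fitness of its parents, and the deleted classifier is chosen uniformly at random. The function name ga_step is hypothetical.

```python
import random


def ga_step(population, rng, mutation_rate=0.05, alphabet="01#"):
    """One rule-discovery step over dicts with 'condition', 'action', 'fitness'."""
    # Fitness-proportionate (roulette-wheel) selection of two parents.
    weights = [cl["fitness"] for cl in population]
    mother, father = rng.choices(population, weights=weights, k=2)

    # One-point crossover on the condition strings of the copies.
    cut = rng.randrange(1, len(mother["condition"]))
    condition = mother["condition"][:cut] + father["condition"][cut:]

    # Mutation: each symbol is resampled with a small probability.
    condition = "".join(
        rng.choice(alphabet) if rng.random() < mutation_rate else symbol
        for symbol in condition
    )
    child = {
        "condition": condition,
        "action": mother["action"],
        # Assumption: the offspring starts from the parents' mean fitness.
        "fitness": (mother["fitness"] + father["fitness"]) / 2,
    }

    # Delete one old classifier (uniformly at random, an assumption) and
    # insert the offspring, so the population size stays constant.
    survivors = list(population)
    del survivors[rng.randrange(len(survivors))]
    return survivors + [child]


rng = random.Random(42)
population = [
    {"condition": "1#0", "action": 0, "fitness": 0.9},
    {"condition": "0##", "action": 1, "fitness": 0.5},
    {"condition": "##1", "action": 1, "fitness": 0.1},
]
next_population = ga_step(population, rng)
```

Restricting the candidate parents to a subset of the population, rather than the whole population as here, is where a niched variant would differ.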

Pittsburgh Classifier Systems The idea underlying the development of Pittsburgh classifier systems was to show that interesting behaviors could be evolved using a simpler model than the one proposed by Holland with Michigan classifier systems (Holland, ; Holland & Reitman, ). In Pittsburgh classifier systems, each individual is a set of rules that encodes an entire candidate solution; each rule has a fixed length, but each rule set (each individual) usually contains a variable number of rules. The genetic operators, crossover and mutation, are tailored to the rule-based, variable-length representation. The individuals in the population compete among themselves, following the selection-recombination-mutation cycle that is typical of genetic algorithms (Goldberg, ; Holland, ). While in Michigan classifier systems individuals in the population (the single rules) cooperate, in Pittsburgh classifier systems there is no cooperation among individuals (the rule sets), so that the genetic algorithm operation is simpler for Pittsburgh models. However, as Pittsburgh classifier systems explore a much larger search space, they usually require more computational resources than Michigan classifier systems. The pseudo-code of a Pittsburgh classifier system is shown in Fig. . At first, the individuals in the population are randomly initialized (line 2). At time t, the individuals are evaluated by an external critic, which returns a performance measure that the genetic algorithm exploits to compute the fitness of individuals (lines 3 and 10). Following this, selection (line 6), recombination, and mutation (line 7) are applied to the individuals in the population, as done in a typical genetic algorithm. The process stops when a termination criterion is met (line 4), usually when an appropriate solution is found. The design of Pittsburgh classifier systems follows the typical steps of genetic algorithm design, which means deciding how a rule set should be represented, what genetic operators should be applied, and how the fitness of a set of rules should be calculated. In addition, Pittsburgh classifier systems need to address the bloat phenomenon (Tackett, ) that arises with any variable-sized representation, like the rule sets evolved by Pittsburgh classifier systems. Bloat can be defined as the growth of individuals without an actual fitness improvement. In Pittsburgh classifier systems, bloat increases the size of candidate solutions by adding useless rules to individuals, and it is typically limited by introducing a parsimony pressure that discourages large rule sets (Bassett & de Jong, ). Alternatively, Pittsburgh classifier systems can be combined with multi-objective optimization, so as to separate the maximization of the rule set performance and the minimization of the rule set size. Examples of Pittsburgh classifier systems include SAMUEL (Grefenstette, Ramsey, & Schultz, ), the Genetic Algorithm Batch-Incremental Concept Learner (GABIL) (de Jong & Spears, ), GIL (Janikow, ), GALE (Llorà, ), and GAssist (Bacardit, ).

 1. t := 0
 2. Initialize the population P(t)
 3. Evaluate the rule sets in P(t)
 4. While the termination condition is not satisfied
 5. Begin
 6.    Select the rule sets in P(t) and generate Ps(t)
 7.    Recombine and mutate the rule sets in Ps(t)
 8.    P(t+1) := Ps(t)
 9.    t := t+1
10.    Evaluate the rule sets in P(t)
11. End

Classifier Systems. Figure . Pseudo-code of a Pittsburgh classifier system
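The loop in the pseudo-code, together with a parsimony pressure against bloat, can be sketched in Python on a toy problem. Everything specific here is an assumption chosen for illustration and not taken from any of the systems named above: the target concept (x0 AND x1 over three bits), ternary rule conditions, first-match prediction with a default class, tournament selection, crossover at rule boundaries, mutation that replaces whole rules, and a linear size penalty of 0.01 per rule.

```python
import random

ALPHABET = "01#"
N_BITS = 3
# Toy training set (assumption): the class is 1 exactly when the first two bits are 1.
EXAMPLES = [(bits, int(bits[:2] == "11"))
            for bits in (format(i, "03b") for i in range(8))]


def random_rule(rng):
    return ("".join(rng.choice(ALPHABET) for _ in range(N_BITS)), rng.randrange(2))


def predict(rule_set, x, default=0):
    """The first matching rule decides; otherwise fall back to the default class."""
    for condition, action in rule_set:
        if all(c == "#" or c == b for c, b in zip(condition, x)):
            return action
    return default


def fitness(rule_set, parsimony=0.01):
    """Accuracy minus a size penalty that discourages useless rules (bloat)."""
    accuracy = sum(predict(rule_set, x) == y for x, y in EXAMPLES) / len(EXAMPLES)
    return accuracy - parsimony * len(rule_set)


def crossover(a, b, rng):
    """Cut both parents at rule boundaries, so offspring length can vary."""
    child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child or [random_rule(rng)]


def mutate(rule_set, rng, rate=0.1):
    return [random_rule(rng) if rng.random() < rate else rule for rule in rule_set]


def tournament(population, scores, rng, k=2):
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=scores.__getitem__)]


def evolve(pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[random_rule(rng) for _ in range(rng.randint(1, 4))]
                  for _ in range(pop_size)]                      # initialize P(t)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]            # evaluate P(t)
        parents = [tournament(population, scores, rng)           # select -> Ps(t)
                   for _ in range(2 * pop_size)]
        population = [mutate(crossover(parents[2 * i], parents[2 * i + 1], rng), rng)
                      for i in range(pop_size)]                  # recombine, mutate
    return max(population, key=fitness)


best_rule_set = evolve()
```

With this penalty, the single rule ("11#", 1) scores higher than the equally accurate two-rule set [("11#", 1), ("###", 0)], which is the effect a parsimony pressure is meant to have.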


Applications Classifier systems have been applied to a large variety of domains, including computational economics (e.g., Arthur, Holland, LeBaron, Palmer, & Tayler, ), autonomous robotics (e.g., Dorigo & Colombetti, ), classification (e.g., Barry, Holmes, & Llora, ), fighter aircraft maneuvering (Bull, ; Smith, Dike, Mehra, Ravichandran, & El-Fallah, ), and many others. Reviews of classifier system applications are available in Lanzi et al. (), Lanzi and Riolo (), and Bull ().

Programs and Data The major sources of information about classifier systems are the LCSWeb maintained by Alwyn Barry, which can be reached through, and www.learningclassifier-systems.org, maintained by Xavier Llorà. Several implementations of classifier systems are freely available online. The first standard implementation of Holland’s classifier system in Pascal was described in Goldberg (), and it is available at http://www.illigal.org/; a C version of the same implementation, developed by Robert E. Smith, is available at http://www.etsimo.uniovi.es/ftp/pub/EC/CFS/src/. Another implementation of an extension of Holland’s classifier system in C by Rick L. Riolo is available at http://www.cscs.umich.edu/Software/Contents.html. Implementations of Wilson’s XCS () are distributed by Alwyn Barry at the LCSWeb, by Martin V. Butz (at www.illigal.org), and by Pier Luca Lanzi (at xcslib.sf.net). Among the implementations of Pittsburgh classifier systems, the Samuel system is available from Alan C. Schultz at http://www.nrl.navy.mil/; Xavier Llorà distributes GALE (Genetic and Artificial Life Environment), a fine-grained parallel genetic algorithm for data mining, at www.illigal.org/xllora.

Cross References 7Credit Assignment 7Genetic Algorithms 7Reinforcement Learning 7Rule Learning

Recommended Reading
Arthur, B. W., Holland, J. H., LeBaron, B., Palmer, R., & Tayler, P. (). Asset pricing under endogenous expectations in an artificial stock market. Technical Report, Santa Fe Institute.
Bacardit i Peñarroya, J. (). Pittsburgh genetic-based machine learning in the data mining era: Representations, generalization, and run-time. PhD thesis, Computer Science Department, Enginyeria i Arquitectura La Salle Universitat Ramon Llull, Barcelona.
Barry, A. M., Holmes, J., & Llora, X. (). Data mining using learning classifier systems. In L. Bull (Ed.), Applications of learning classifier systems, studies in fuzziness and soft computing (Vol. , pp. –). Berlin: Springer.
Bassett, J. K., & de Jong, K. A. (). Evolving behaviors for cooperating agents. In Proceedings of the twelfth international symposium on methodologies for intelligent systems, LNAI (Vol. ). Berlin: Springer.
Booker, L. B. (). Triggered rule discovery in classifier systems. In J. D. Schaffer (Ed.), Proceedings of the rd international conference on genetic algorithms (ICGA). San Francisco: Morgan Kaufmann.
Bull, L. (Ed.). (). Applications of learning classifier systems, studies in fuzziness and soft computing (Vol. ). Berlin: Springer, ISBN ----.
Bull, L., & Kovacs, T. (Eds.). (). Foundations of learning classifier systems, studies in fuzziness and soft computing (Vol. ). Berlin: Springer, ISBN ----.
Butz, M. V. (). Anticipatory learning classifier systems. Genetic algorithms and evolutionary computation. Boston, MA: Kluwer Academic Publishers.
Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, (), –.
de Jong, K. (). Learning with genetic algorithms: An overview. Machine Learning, (–), –.
de Jong, K. A., & Spears, W. M. (). Learning concept classification rules using genetic algorithms. In Proceedings of the international joint conference on artificial intelligence (pp. –). San Francisco: Morgan Kaufmann.
Dorigo, M., & Bersini, H. (). A comparison of Q-learning and classifier systems. In D. Cliff, P. Husbands, J.-A. Meyer, & S. W. Wilson (Eds.), From animals to animats : Proceedings of the third international conference on simulation of adaptive behavior (pp. –). Cambridge, MA: MIT Press.
Dorigo, M., & Colombetti, M. (). Robot shaping: An experiment in behavior engineering. Cambridge, MA: MIT Press/Bradford Books.
Goldberg, D. E. (). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.
Grefenstette, J. J., Ramsey, C. L., & Schultz, A. (). Learning sequential decision rules using simulation models and competition. Machine Learning, (), –.
Holland, J. (). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning, an artificial intelligence approach (Vol. II, Chap. ) (pp. –). San Francisco: Morgan Kaufmann.
Holland, J. H. (). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press (Reprinted by the MIT Press in ).
Holland, J. H. (). Adaptation. Progress in Theoretical Biology, , –.
Holland, J. H., & Reitman, J. S. (). Cognitive systems based on adaptive algorithms. In D. A. Waterman & F. Hayes-Roth (Eds.), Pattern-directed inference systems. New York: Academic Press. (Reprinted from Evolutionary computation. The fossil record. D. B. Fogel (Ed.), IEEE Press ()).
Janikow, C. Z. (). A knowledge-intensive genetic algorithm for supervised learning. Machine Learning, (–), –.
Lanzi, P. L. (). Mining interesting knowledge from data with the XCS classifier system. In L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (GECCO) (pp. –). San Francisco: Morgan Kaufmann.
Lanzi, P. L. (). Learning classifier systems: A reinforcement learning perspective. In L. Bull & T. Kovacs (Eds.), Foundations of learning classifier systems, studies in fuzziness and soft computing (pp. –). Berlin: Springer.
Lanzi, P. L., & Perrucci, A. (). Extending the representation of classifier conditions part II: From messy coding to S-expressions. In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, & R. E. Smith (Eds.), Proceedings of the genetic and evolutionary computation conference (GECCO ) (pp. –). Orlando, FL: Morgan Kaufmann.
Lanzi, P. L., & Riolo, R. L. (). Recent trends in learning classifier systems research. In A. Ghosh & S. Tsutsui (Eds.), Advances in evolutionary computing: Theory and applications (pp. –). Berlin: Springer.
Lanzi, P. L., Stolzmann, W., & Wilson, S. W. (Eds.). (). Learning classifier systems: From foundations to applications. Lecture notes in computer science (Vol. ). Berlin: Springer.
Llorà, X. (). Genetics-based machine learning using fine-grained parallelism for data mining. PhD thesis, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona.
Mellor, D. (). A first order logic classifier system. In H. Beyer (Ed.), Proceedings of the conference on genetic and evolutionary computation (GECCO ’) (pp. –). New York: ACM Press.
Quinlan, J. R., & Cameron-Jones, R. M. (). Induction of logic programs: FOIL and related systems. New Generation Computing, (&), –.
Samuel, A. L. (). Some studies in machine learning using the game of checkers. In E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.
Smith, R. E., Dike, B. A., Mehra, R. K., Ravichandran, B., & El-Fallah, A. (). Classifier systems in combat: Two-sided learning of maneuvers for advanced fighter aircraft. Computer Methods in Applied Mechanics and Engineering, (–), –.
Smith, S. F. (). A learning system based on genetic adaptive algorithms. Doctoral dissertation, Department of Computer Science, University of Pittsburgh.
Smith, S. F. (). Flexible learning of problem solving heuristics through adaptive search. In Proceedings of the eighth international joint conference on artificial intelligence (pp. –). Los Altos, CA: Morgan Kaufmann.
Sutton, R. S. (). Learning to predict by the methods of temporal differences. Machine Learning, , –.
Sutton, R. S., & Barto, A. G. (). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tackett, W. A. (). Recombination, selection, and the genetic construction of computer programs. Unpublished doctoral dissertation, University of Southern California.
Watkins, C. (). Learning from delayed rewards. PhD thesis, King’s College.
Wilson, S. W. (). Classifier fitness based on accuracy. Evolutionary Computation, (), –.
Wilson, S. W. (). Classifiers that approximate functions. Natural Computing, (–), –.
Wilson, S. W. (). Three architectures for continuous action. In T. Kovacs, X. Llorà, K. Takadama, P. L. Lanzi, W. Stolzmann, & S. W. Wilson (Eds.), Learning classifier systems. International workshops, IWLCS –, revised selected papers. Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer.

Clause A clause is a logical rule in a 7logic program. Formally, a clause is a disjunction of (possibly negated) literals, such as grandfather(x, y) ∨ ¬father(x, z) ∨ ¬parent(z, y). In the logic programming language 7Prolog this clause is written as grandfather(X,Y) :- father(X,Z), parent(Z,Y). The part to the left of :- (“if ”) is the head of the clause, and the right part is its body. Informally, the clause asserts the truth of the head given the truth of the body. A clause with exactly one literal in the head is called a Horn clause or definite clause; logic programs mostly consist of definite clauses. A clause without a body is also called a fact; a clause without a head is also called a denial, or a query in a proof by refutation. The clause without head or body is called the empty clause: it signifies inconsistency or falsehood and is denoted ◻. Given a set of clauses, the resolution inference rule can be used to deduce logical consequences and answer queries (see 7First-Order Logic). In machine learning, clauses can be used to express classification rules for structured individuals. For example, the following definite clause classifies a molecular compound as carcinogenic if it contains a hydrogen atom with charge above a certain threshold. carcinogenic(M) :- atom(M,A1), element(A1,h), charge(A1,C1), geq(C1,0.168).


Cross References 7First-Order Logic 7Inductive Logic Programming 7Learning from Structured Data 7Logic Program 7Prolog

Clause Learning In 7speedup learning, clause learning is a 7deductive learning technique used for the purpose of 7intelligent backtracking in satisfiability solvers. The approach analyzes failures at backtracking points and derives clauses that must be satisfied by the solution. The clauses are added to the set of clauses from the original satisfiability problem and serve to prune new search nodes that violate them.

Click-Through Rate (CTR) CTR measures the success of a ranking of search results, or advertisement placing. Given the number of impressions, the number of times a web result or ad has been displayed, and the number of clicks, the number of users who clicked on the result/advertisement, CTR is the number of clicks divided by the number of impressions.
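As a small worked example of the definition above, CTR can be computed as follows. The function name click_through_rate and the choice to reject a non-positive impression count (where the ratio is undefined) are assumptions made for this sketch.

```python
def click_through_rate(clicks, impressions):
    """CTR = number of clicks divided by number of impressions."""
    if impressions <= 0:
        raise ValueError("CTR is undefined without impressions")
    if clicks < 0:
        raise ValueError("the number of clicks cannot be negative")
    return clicks / impressions


# An ad displayed 1,000 times and clicked 25 times has a CTR of 2.5%.
ctr = click_through_rate(clicks=25, impressions=1000)
```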

Clonal Selection The clonal selection theory (CST) is the theory used to explain the basic response of the adaptive immune system to an antigenic stimulus. It establishes the idea that only those cells capable of recognizing an antigenic stimulus will proliferate, thus being selected in preference to those that do not. Clonal selection operates on both T-cells and B-cells. When antibodies on a B-cell bind with an antigen, the B-cell becomes activated and begins to proliferate. New B-cell clones are produced that are exact copies of the parent B-cell, but they then undergo somatic hypermutation and produce antibodies that are specific to the invading antigen. The B-cells, in addition to proliferating or differentiating into plasma cells, can differentiate into long-lived B memory cells. Plasma cells produce large amounts of antibody, which attach themselves to the antigen and act as a type of tag for T-cells to pick up on and remove from the system. This whole process is known as affinity maturation. This process forms the basis of many artificial immune system algorithms such as AIRS and aiNET.

Closest Point 7Nearest Neighbor

Cluster Editing The Cluster Editing problem is almost equivalent to Correlation Clustering on complete instances. The idea is to obtain a graph that consists only of cliques. Whereas Cluster Deletion requires us to delete the smallest number of edges to obtain such a graph, in Cluster Editing we are permitted to add as well as remove edges. The final variant is Cluster Completion, in which edges can only be added. Each of these problems can be restricted to building a specified number of cliques.

Cluster Ensembles Cluster ensembles are an unsupervised 7ensemble learning method. The principle is to create multiple different clusterings of a dataset, possibly using different algorithms, then aggregate the opinions of the different clusterings into an ensemble result. The final ensemble clustering should be in theory more reliable than the individual clusterings.

Cluster Optimization 7Evolutionary Clustering


Clustering Clustering is a type of 7unsupervised learning in which the goal is to partition a set of 7examples into groups called clusters. Intuitively, the examples within a cluster are more similar to each other than to examples from other clusters. In order to measure the similarity between examples, clustering algorithms use various distortion or 7distance measures. There are two major types of clustering approaches: generative and discriminative. The former assumes a parametric form of the data and tries to find the model parameters that maximize the probability that the data was generated by the chosen model. The latter represents graph-theoretic approaches that compute a similarity matrix defined over the input data.

Cross References 7Categorical Data Clustering 7Cluster Editing 7Cluster Ensembles 7Clustering from Data Streams 7Constrained Clustering 7Consensus Clustering 7Correlation Clustering 7Cross-Language Document Clustering 7Density-Based Clustering 7Dirichlet Process 7Document Clustering 7Evolutionary Clustering 7Graph Clustering 7k-Means Clustering 7k-Medoids Clustering 7Model-Based Clustering 7Partitional Clustering 7Projective Clustering 7Sublinear Clustering

Clustering Aggregation 7Consensus Clustering

Clustering Ensembles 7Consensus Clustering

Clustering from Data Streams João Gama University of Porto, Porto, Portugal

Definition 7Clustering is the process of grouping objects into different groups, such that the similarity of the objects within each group is high and the similarity between objects in different groups is low. The data stream clustering problem is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The difficulties arise from the continuously arriving data points and the need to analyze them in real time. These characteristics require incremental clustering, maintaining cluster structures that evolve over time. Moreover, the data stream itself may evolve over time: new clusters might appear and others disappear, reflecting the dynamics of the stream.

Main Techniques Major clustering approaches in data stream cluster analysis include:

● Partitioning algorithms: construct a partition of a set of objects into k clusters that minimizes some objective function (e.g., the sum of squared distances to the centroid representative). Examples include k-means (Farnstrom, Lewis, & Elkan, ) and k-medoids (Guha, Meyerson, Mishra, Motwani, & O’Callaghan, ).
● Microclustering algorithms: divide the clustering process into two phases, where the first phase is online and summarizes the data stream in local models (microclusters) and the second phase generates a global cluster model from the microclusters. Examples of these algorithms include BIRCH (Zhang, Ramakrishnan, & Livny, ) and CluStream (Aggarwal, Han, Wang, & Yu, ).

Basic Concepts A powerful idea in clustering from data streams is the concept of cluster feature, CF. A cluster feature, or microcluster, is a compact representation of a set of points. A CF structure is a triple (N, LS, SS), used to store the sufficient statistics of a set of points:

● N is the number of data points
● LS is a vector, of the same dimension as the data points, that stores the linear sum of the N points
● SS is a vector, of the same dimension as the data points, that stores the square sum of the N points

The properties of cluster features are:

● Incrementality: If a point x is added to cluster A, the sufficient statistics are updated as follows: $LS_A \leftarrow LS_A + x$, $SS_A \leftarrow SS_A + x^2$, $N_A \leftarrow N_A + 1$.
● Additivity: If A and B are disjoint sets, merging them is equal to the sum of their parts. The additive property allows us to merge subclusters incrementally: $LS_C \leftarrow LS_A + LS_B$, $SS_C \leftarrow SS_A + SS_B$, $N_C \leftarrow N_A + N_B$.

A CF entry has sufficient information to calculate the norms

$$L_1 = \sum_{i=1}^{n} |x_{a_i} - x_{b_i}|, \qquad L_2 = \sqrt{\sum_{i=1}^{n} (x_{a_i} - x_{b_i})^2}$$

and basic measures to characterize a cluster:

● Centroid, defined as the gravity center of the cluster: $\vec{X} = \frac{LS}{N}$.
● Radius, defined as the average distance from member points to the centroid: $R = \sqrt{\frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{X})^2}{N}}$.
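The cluster feature and its two properties translate almost directly into code. The Python sketch below is an illustration, with the class name ClusterFeature and its method names chosen for this example. It stores LS and SS per dimension and computes the radius from the sufficient statistics alone, using the identity that the mean squared distance to the centroid equals the sum over dimensions of SS/N minus (LS/N) squared.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterFeature:
    n: int      # N: number of points summarized
    ls: list    # LS: per-dimension linear sum
    ss: list    # SS: per-dimension square sum

    @classmethod
    def empty(cls, dim):
        return cls(0, [0.0] * dim, [0.0] * dim)

    def add(self, x):
        """Incrementality: absorb one point without storing it."""
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, x)]
        self.ss = [s + v * v for s, v in zip(self.ss, x)]

    def merge(self, other):
        """Additivity: the CF of two disjoint sets is the sum of their CFs."""
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # sum_i ||x_i - X||^2 / N  ==  sum_d (SS_d / N - (LS_d / N)^2)
        variance = sum(ss / self.n - (ls / self.n) ** 2
                       for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


cf_a = ClusterFeature.empty(2)
cf_a.add((0.0, 0.0))
cf_a.add((2.0, 0.0))

cf_b = ClusterFeature.empty(2)
cf_b.add((1.0, 3.0))

cf_c = cf_a.merge(cf_b)
```

Here cf_a summarizes the points (0, 0) and (2, 0), so its centroid is (1, 0) and its radius is 1, even though the points themselves are no longer stored.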


Partitioning Clustering k-means is the most widely used clustering algorithm. It constructs a partition of a set of objects into k clusters that minimizes some objective function, usually a squared error function, which implies round-shaped clusters. The input parameter k is fixed and must be given in advance, which limits its applicability to streaming and evolving data. Farnstrom et al. () proposed a single-pass k-means algorithm. The main idea is to use a buffer where points of the dataset are kept in compressed form. The data stream is processed in blocks. All available space in the buffer is filled with points from the stream. Using these points, the algorithm finds k centers such that the sum of distances from data points to their closest center is minimized. Only the k centroids (representing the clustering results) are retained, with the corresponding k cluster features. In the following iterations, the buffer is initialized with the k centroids found in the previous iteration, weighted by the k cluster features, and with incoming data points from the stream. The Very Fast k-means (VFKM) algorithm (Domingos & Hulten, ) uses the Hoeffding bound to determine the number of examples needed in each step of a k-means algorithm. VFKM runs as a sequence of k-means runs, with an increasing number of examples until the Hoeffding bound is satisfied. Guha et al. () present an analytical study of k-median clustering of data streams. The proposed algorithm makes a single pass over the data stream and uses small space. It requires O(nk) time and O(n^ε) space, where k is the number of centers, n is the number of points, and ε < 1. They have proved that any k-median algorithm that achieves a constant factor approximation cannot achieve a better run time than O(nk).
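For readers unfamiliar with the Hoeffding bound mentioned above, the following Python sketch shows its general form: with probability at least 1 − δ, the mean of n independent observations of a variable with range R differs from its true mean by no more than ε = sqrt(R² ln(1/δ) / (2n)). This is the generic bound only; how VFKM applies it inside each k-means step is not reproduced here, and the function names are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Error epsilon guaranteed with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# How many examples keep the error below 0.05 with 95% confidence (range 1)?
n_required = examples_needed(value_range=1.0, delta=0.05, epsilon=0.05)
```

Because the bound shrinks only with the square root of n, halving the admissible error requires roughly four times as many examples.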

Micro Clustering The idea of dividing the clustering process into two layers, where the first layer generates local models (microclusters) and the second layer generates global models from the local ones, is a powerful idea that has been used elsewhere. The BIRCH system (Zhang et al., ) builds a hierarchical structure of data, the CF-tree, where each node contains a set of cluster features. These CFs contain the sufficient statistics describing a set of points in the data set, and all information of the cluster features below in the tree. The system requires two user-defined parameters: B, the branch factor or the maximum number of entries in each non-leaf node; and T, the maximum diameter (or radius) of any CF in a leaf node. The maximum diameter T defines the examples that can be absorbed by a CF. As T increases, more examples can be absorbed by a micro-cluster and smaller CF-trees are generated (Fig. ). When an example is available, it traverses down the current tree from the root until it finds the appropriate leaf. At each non-leaf node, the example follows the closest-CF path, with respect to the norm L1 or L2. If the closest CF in the leaf cannot absorb the example, a new CF entry is made. If there is no room for a new leaf, the parent node is split. A leaf node might be expanded due to the constraints imposed by B and T. The process consists of taking the two farthest CFs and creating two new leaf nodes. When traversing back up, the CFs are updated.
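The absorb-or-create decision made at a leaf can be illustrated with a deliberately simplified Python sketch. It is not BIRCH: there is no tree, no branch factor B, and no node splitting. A single flat leaf holds the CFs, the closest CF is chosen by the L2 distance to its centroid, and a point is absorbed only if the radius of the resulting CF stays within the threshold T. The function names and the dictionary layout of a CF are assumptions made for the example.

```python
import math


def new_cf(x):
    return {"n": 1, "ls": list(x), "ss": [v * v for v in x]}


def centroid(cf):
    return [s / cf["n"] for s in cf["ls"]]


def absorbed(cf, x):
    """The CF that would result from adding x (incrementality property)."""
    return {"n": cf["n"] + 1,
            "ls": [s + v for s, v in zip(cf["ls"], x)],
            "ss": [s + v * v for s, v in zip(cf["ss"], x)]}


def radius(cf):
    n = cf["n"]
    variance = sum(ss / n - (ls / n) ** 2 for ls, ss in zip(cf["ls"], cf["ss"]))
    return math.sqrt(max(variance, 0.0))


def insert_point(leaf, x, threshold):
    """Absorb x into the closest CF if its radius stays within T, else start a new CF."""
    if leaf:
        i = min(range(len(leaf)), key=lambda j: math.dist(centroid(leaf[j]), x))
        candidate = absorbed(leaf[i], x)
        if radius(candidate) <= threshold:
            leaf[i] = candidate
            return
    leaf.append(new_cf(x))


leaf = []
for point in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]:
    insert_point(leaf, point, threshold=0.5)
```

With T = 0.5, the four points end up in two CFs of two points each; a larger T would let a single CF absorb all of them, which is the trade-off between tree size and resolution described above.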

The CluStream Algorithm (Aggarwal et al., ) is an extension of the BIRCH system designed for data streams. Here, the CFs include temporal information: the time-stamp of an example is treated as a feature. CFs are initialized offline, using a standard k-means, with a large value for k. For each incoming data point, the distances to the centroids of the existing CFs are computed. The data point is absorbed by an existing CF if the distance to its centroid falls within the maximum boundary of the CF, which is defined as a factor t of the radius deviation of the CF; otherwise, the data point starts a new micro-cluster. CluStream can generate approximate clusters for any user-defined time granularity. This is achieved by storing the CFT at regular time intervals, referred to as snapshots. Suppose the user wants to find clusters in the stream based on a history of length h; the off-line component can analyze the snapshots stored at time t (the current time) and at time (t − h) by using the additive property of CFT. An important problem is when to store the snapshots of the current set of microclusters. For example, the natural time frame (Fig. ) stores snapshots each quarter, four quarters are aggregated in hours, 24 h are aggregated in days, etc. The aggregation level is domain-dependent and exploits the additive property of CFT.

Clustering from Data Streams. Figure . The clustering feature tree in BIRCH. B is the maximum number of CFs in a level of the tree. Levels shown: root node, non-root nodes, and leaf nodes, each holding entries CF1, CF2, ..., CFb.

Clustering from Data Streams. Figure . The figure presents a natural tilted time window. The most recent data is stored with high detail; older data is stored in a compressed way. The degree of detail decreases with time. Levels shown: 1 year = 12 months, 1 month = 31 days, 1 day = 24 hours, 1 hour = 4 quarters.

Tracking the Evolution of the Cluster Structure

A promising line of research is tracking change in clusters. Spiliopoulou, Ntoutsi, Theodoridis, and Schult () present the MONIC system for detecting and tracking change in clusters. MONIC assumes that a cluster is an object in a geometric space. It encompasses changes that involve more than one cluster, allowing for insights on cluster change in the whole clustering. The transition tracking mechanism is based on the degree of overlap between two clusters. The overlap between two clusters, X and Y, is defined as the normed number of common records weighted with the age of the records. Assume that cluster X was obtained at time $t_1$ and cluster Y at time $t_2$. The degree of overlap between the two clusters is given by: $\mathrm{overlap}(X, Y) = \frac{\sum_{a \in X \cap Y} \mathrm{age}(a, t_2)}{\sum_{x \in X} \mathrm{age}(x, t_2)}$. The degree of overlap allows inferring properties of the underlying data stream. A cluster transition at a given time point is a change in a cluster discovered at an earlier time point. MONIC distinguishes internal and external transitions, which reflect the dynamics of the stream. Examples of cluster transitions include: the cluster survives; the cluster is absorbed; a cluster disappears; a new cluster emerges (Fig. ).
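The overlap measure can be computed directly from its definition. The Python sketch below assumes that clusters are given as sets of record identifiers and that the age weights at the later time point are supplied by the caller as a function; the names overlap and age, and the example weights, are hypothetical and do not come from MONIC itself.

```python
def overlap(x_records, y_records, age):
    """Age-weighted share of the records of X that also appear in Y.

    x_records, y_records: sets of record identifiers.
    age: function mapping a record identifier to its weight at the later time point.
    """
    total = sum(age(r) for r in x_records)
    if total == 0:
        return 0.0
    return sum(age(r) for r in x_records & y_records) / total


old_cluster = {1, 2, 3, 4}
new_cluster = {3, 4, 5}

# Unweighted case: two of the four records of the old cluster survive.
plain = overlap(old_cluster, new_cluster, age=lambda r: 1.0)

# Weighted case (illustrative weights): recent records 3 and 4 count more.
weights = {1: 0.2, 2: 0.2, 3: 0.8, 4: 0.8}
weighted = overlap(old_cluster, new_cluster, age=weights.get)
```

The measure is not symmetric: it is normalized by the records of X only, so it answers how much of the earlier cluster is still found in the later one.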

Recommended Reading
Aggarwal, C., Han, J., Wang, J., & Yu, P. (). A framework for clustering evolving data streams. In Proceedings of the th international conference on very large data bases (pp. –). San Mateo, CA: Morgan Kaufmann.
Domingos, P., & Hulten, G. (). A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.
Farnstrom, F., Lewis, J., & Elkan, C. (). Scalability for clustering algorithms revisited. SIGKDD Explorations, (), –.
Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, (), –.
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., & Schult, R. (). Monic: Modeling and monitoring cluster transitions. In Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM Press.
Zhang, T., Ramakrishnan, R., & Livny, M. (). Birch: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD international conference on management of data (pp. –). New York: ACM Press.

Clustering of Nonnumerical Data 7Categorical Data Clustering

Clustering with Advice 7Correlation Clustering

Clustering with Constraints 7Correlation Clustering

Clustering with Qualitative Information 7Correlation Clustering

Clustering with Side Information 7Correlation Clustering

CN 7Rule Learning

Co-Training 7Semi-Supervised Learning

Coevolution 7Coevolutionary Learning


Coevolutionary Computation 7Coevolutionary Learning

Coevolutionary Learning R. Paul Wiegand University of Central Florida, Orlando, FL, USA

Synonyms Coevolution; Coevolutionary computation

Definition Coevolutionary learning is a form of evolutionary learning (see 7Evolutionary Algorithms) in which the fitness evaluation is based on interactions between individuals. Since the evaluation of an individual is dependent on interactions with other evolving entities, changes in the set of entities used for evaluation can affect an individual’s ranking in a population. In this sense, coevolutionary fitness is subjective, while fitness in traditional evolutionary learning systems typically uses an objective performance measure.
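The subjective nature of coevolutionary fitness can be made concrete with a toy Python sketch. The game (rock-paper-scissors), the scoring rule (number of wins against the current evaluators), and the function names are all assumptions chosen for illustration; they are not taken from any particular coevolutionary algorithm.

```python
# Which move each move defeats in the toy game.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def subjective_fitness(individual, evaluators):
    """Fitness is the number of wins against the individuals used for evaluation."""
    return sum(BEATS[individual] == opponent for opponent in evaluators)


def rank(population, evaluators):
    """Order the population from best to worst under the current evaluators."""
    return sorted(population,
                  key=lambda ind: subjective_fitness(ind, evaluators),
                  reverse=True)


population = ["rock", "paper", "scissors"]
ranking_a = rank(population, evaluators=["scissors", "scissors", "paper"])
ranking_b = rank(population, evaluators=["rock", "rock", "paper"])
```

The same three individuals are ranked rock, scissors, paper against the first set of evaluators and paper, scissors, rock against the second: nothing about the individuals changed, only the entities they were evaluated against.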

Motivation and Background

Ideally, coevolutionary learning systems focus on relevant areas of a search space by making adaptive changes between interacting, concurrently evolving parts. This can be particularly helpful when problem spaces are very large, and for infinite search spaces in particular. Coevolution is also useful for problems in which no intrinsic objective measure exists. The interactive nature of evaluation makes coevolutionary methods natural candidates for problems such as the search for game-playing strategies (Fogel, ). Finally, some coevolutionary systems appear natural for search spaces that contain certain kinds of complex structures (Potter, ; Stanley, ), since search on smaller components in a larger structure can be emphasized. In fact, there is reason to believe that coevolutionary systems may be well suited for uncovering complex structures within a problem (Bucci & Pollack, ).

Still, the dynamics of coevolutionary learning can be quite complex, and a number of pathologies often plague naïve users. Indeed, because of the subjective nature of coevolution, it can be easy to apply a particular coevolutionary learning system without a clear understanding of what kind of solution one expects a coevolutionary algorithm to produce. Recent theoretical analysis suggests that a clear concept of solution and a careful implementation of an evaluation process consistent with this concept can produce a coevolutionary system capable of addressing many problems (de Jong & Pollack, ; Ficici, ; Panait, ; Wiegand, ). Accordingly, a great deal of research in this area focuses on evaluation and progress measurement.

Structure of Learning System

Coevolutionary learning systems work in much the same way as evolutionary learning systems: individuals encode some aspect of potential solutions to a problem, those representatives are altered during search using genetic-like operators such as mutation and crossover, and the search is directed by selecting better individuals as determined by some kind of fitness assessment. These heuristic methods gradually refine solutions by repeatedly cycling through such steps, using the ideas of heredity and survival of the fittest to produce new generations of individuals with increased solution quality. Just as in traditional evolutionary computation, there are many choices available to the engineer in designing such systems. The reader is referred to the chapters relating to evolutionary learning for more details.

However, there are some fundamental differences between traditional evolution and coevolution. In coevolution, measuring fitness requires evaluating the interaction between multiple individuals. This has several consequences: interacting individuals may reside in the same population or in different populations; the interactive nature of coevolution evokes notions of cooperation and competition in entirely new ways; the choices regarding how best to conduct evaluation of these interactions for the purposes of selection are particularly important; and there are unique coevolutionary issues surrounding representation. In addition, because of its interactive nature, the dynamics of coevolution can lead to some well-known pathological behaviors, and particularly careful attention to implementation choices is generally necessary to avoid such conditions.

Multiple Versus Single Population Approaches

Coevolutionary systems can be broadly classified according to whether interacting individuals reside in different populations or in the same population.
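The distinction can be made concrete with a minimal sketch of the two evaluation arrangements. It assumes a toy interaction in which the larger number wins; the function names and the scoring rule are hypothetical and serve only to show where the evaluation partners come from in each case.

```python
def evaluate_two_populations(pop_a, pop_b, interact):
    """Multipopulation case: each individual is scored against the other population."""
    fit_a = [sum(interact(a, b) for b in pop_b) / len(pop_b) for a in pop_a]
    fit_b = [sum(interact(b, a) for a in pop_a) / len(pop_a) for b in pop_b]
    return fit_a, fit_b


def evaluate_single_population(pop, interact):
    """Single-population case: each individual is scored against its peers."""
    fitness = []
    for i, x in enumerate(pop):
        others = pop[:i] + pop[i + 1:]
        fitness.append(sum(interact(x, y) for y in others) / len(others))
    return fitness


def greater_than(x, y):
    """Hypothetical interaction: the larger number wins (1), ties score 0.5."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)


fit_a, fit_b = evaluate_two_populations([3, 5], [1, 4, 6], greater_than)
peer_fit = evaluate_single_population([3, 5, 1], greater_than)
print(fit_a, fit_b, peer_fit)
```

In both cases fitness is a function of the current evaluation partners, which is what makes it subjective; only the source of those partners differs.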


In the case of multipopulation coevolution, measuring fitness requires evaluating how individuals in one population interact with individuals in another. For example, individuals in each population may represent potential strategies for particular players of a game, they may represent roles in a larger ecosystem (e.g., predators and prey), or they may represent components that are fitted into a composite assembly with other components and then applied to a problem. Though individuals in different populations interact for the purposes of evaluation, they are typically otherwise independent of one another in the coevolutionary search process.

In single population coevolution, an individual in the population is evaluated based on its interactions with other individuals in the same population. Such individuals may again represent potential strategies in a game, but evaluation may require them to trade off roles as to which player they represent in that game. Here, individuals not only interact for evaluation but also implicitly compete with one another as resources used in the coevolutionary search process itself.

There is some controversy in the field as to whether this latter type qualifies as “coevolution.” Evolutionary biologists often define coevolution exclusively in terms of multiple populations; however, in biological systems, fitness is always subjective, while the vast majority of computational approaches to evolutionary learning involve objective fitness assessment, and this subjective/objective fitness distinction creates a useful classification. To be sure, there are fundamental differences between how single population and multipopulation learning systems behave (Ficici, ). Still, single population systems that employ subjective fitness assessment behave a lot more like multipopulation coevolutionary systems than like evolution based on objective fitness.
Moreover, historically, the field has used the term coevolution whenever fitness assessment is based on interactions between individuals, and a large amount of that research has involved systems with only one population.

Competition and Cooperation

The terms cooperative and competitive have been used to describe aspects of coevolutionary learning in at least three ways.


First and less commonly, these adjectives can describe qualitatively observed behaviors of potential solutions in coevolutionary systems, the results of some evolutionary process (e.g., “tit-for-tat” strategies; Axelrod, ). Second, problems are sometimes considered to be inherently competitive or cooperative. Indeed, game theory provides some guidance for making such distinctions. However, since in many kinds of problems little may be known about the actual structure of the payoff functions involved, we may not actually be able to classify the problem as definitively competitive or cooperative.

The final and by far most common use of the terms is to distinguish the algorithms themselves. Cooperative algorithms are those in which interacting individuals succeed or fail together, while competitive algorithms are those in which individuals succeed at the expense of other individuals. Because of the ambiguity of the terms, some researchers advocate abandoning them altogether and instead basing the distinguishing terminology on the form a potential solution takes. For example, the term → compositional coevolution describes an algorithm designed to return a solution composed of multiple individuals (e.g., a multiagent team), and the term → test-based coevolution describes an algorithm designed to return an individual that performs well against an adaptive set of tests (e.g., a sorting network). This latter pair of terms draws a slightly different, though probably more useful, distinction than the cooperative and competitive terms.

Still, it is instructive to survey the algorithms based on how they have been historically classified. Examples of competitive coevolutionary learning include simultaneously learning sorting networks and challenging data sets in a predator–prey type relationship (Hillis, ). Here, individuals in one population representing potential sorting networks are awarded a fitness score based on how well they sort opponent data sets from the other population.
Individuals in the second population represent potential data sets whose fitness is based on how well they distinguish opponent sorting networks. Competitive coevolution has also been applied to learning game-playing strategies (Fogel, ; Rosin & Belew, ). Additionally, competition has played a vital part in attempts to coevolve complex agent behaviors (Sims, ). Finally, competitive approaches have been applied to a variety of more traditional machine learning problems, for example, learning classifiers in one population and challenging subsets of exemplars in the other (Paredis, ).

Potter developed a relatively general framework for cooperative coevolutionary learning, applying it first to static function optimization and later to neural network learning (Potter, ). In the neural network case, each population contains individuals representing a portion of the network, and evolution of these components occurs almost independently, in tandem with one another, interacting only to be assembled into a complete network in order to obtain fitness. The decomposition of the network can be static and a priori, or dynamic in the sense that components may be added or removed during the learning process.

Moriarty and Miikkulainen take a different, somewhat more adaptive approach to cooperative coevolution of neural networks (Moriarty & Miikkulainen, ). In this case, one population represents potential network plans, while a second is used to acquire node information. Plans are evaluated based on how well they solve a problem with their collaborating nodes, and the nodes receive a share of this fitness. A node is thus rewarded for participating more often in successful plans and receives fitness only indirectly.
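A cooperative, compositional arrangement of the kind applied to static function optimization can be sketched as follows. This is a simplified illustration under stated assumptions, not Potter's published algorithm: the objective function, the truncation selection, the Gaussian mutation, and the use of the single best collaborator are all choices made here for brevity. Each population evolves one component, and a component obtains fitness only after being assembled with a collaborator from the other population.

```python
import random


def objective(x, y):
    """Hypothetical problem: maximize a smooth function with optimum at (1, -2)."""
    return -((x - 1.0) ** 2) - ((y + 2.0) ** 2)


def assemble(component, collaborator, index):
    """Put a component and its collaborator in the right argument order."""
    return (component, collaborator) if index == 0 else (collaborator, component)


def cooperative_coevolution(generations=50, pop_size=10, seed=0):
    rng = random.Random(seed)
    # One population per component of the composite solution (here: x and y).
    pops = [[rng.uniform(-5, 5) for _ in range(pop_size)] for _ in range(2)]
    best = [pops[0][0], pops[1][0]]  # current representative of each population
    history = [objective(*best)]
    for _ in range(generations):
        for k in (0, 1):  # the populations take turns
            collaborator = best[1 - k]

            def fitness(ind):
                # A component is scored only as part of a complete assembly.
                return objective(*assemble(ind, collaborator, k))

            # Elitist truncation selection followed by Gaussian mutation.
            parents = sorted(pops[k], key=fitness, reverse=True)[: pop_size // 2]
            children = [p + rng.gauss(0, 0.3) for p in parents]
            pops[k] = parents + children
            best[k] = max(pops[k], key=fitness)
        history.append(objective(*best))
    return tuple(best), history


solution, history = cooperative_coevolution()
print(solution, history[-1])
```

Because the representatives are kept in their populations and the populations are updated in turn, the quality of the assembled solution never decreases in this sketch. Other collaboration schemes, such as random or multiple collaborators, do not share that property.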

Evaluation

Choices surrounding how interacting individuals in coevolutionary systems are evaluated for the purposes of selection are perhaps the most important choices facing an engineer employing these methods. Designing the evaluation method involves a variety of practical choices, as well as a broader eye to the ultimate purpose of the algorithm itself.

Practical concerns in evaluation include determining the number of individuals with whom to interact, how those individuals will be chosen for the interaction, and how selection will operate on the results of multiple interactions (Wiegand, ). For example, one might determine the fitness of an individual by pairing it with all other individuals in the other populations (or in the same population for single population approaches) and taking the average or maximum value of such evaluations as the fitness assessment. Alternatively, one may simply use the single best individual as determined by a previous generation of the algorithm, or a combination of those approaches. Random pairings between individuals are also common. This idea can be extended to tournament evaluation, in which successful individuals from pairwise interactions are promoted and further paired, with fitness assigned based on how far an individual progresses in the tournament. Many of these methods have been evaluated empirically on a variety of types of problems (Angeline & Pollack, ; Bull, ; Wiegand, ).

However, the design of the evaluation method also speaks to the broader issue of how best to implement the desired → solution concept, a criterion specifying which locations in the search space are solutions and which are not (Ficici, ). The key to successful application of coevolutionary learning is to first elicit a clear and precise solution concept and then design an algorithm (an evaluation method in particular) that implements such a concept explicitly. A successful coevolutionary learner capable of achieving reliable progress toward a particular solution concept often makes use of an archive of individuals and an update rule for that archive that insists that the distance to a particular solution concept decrease with every change to the archive. For example, if one is interested in finding game strategies that satisfy Nash equilibrium constraints, one might consider comparing new individuals to an archive of potential individual strategies found so far that together represent a potential Nash mixed strategy (Ficici, ). Alternatively, if one is interested in maximizing the sum of an individual’s outcomes over all tests, one may likewise employ an archive of discovered tests that candidate solutions are able to solve (de Jong, ).

It is useful to note that many coevolutionary learning problems are multiobjective in nature.
That is, → underlying objectives may exist in such problems, each creating a different ranking for individuals depending on the set of tests being considered during evaluation (Bucci & Pollack, ). The set of all possible underlying objectives (were it known) is sufficient to determine the outcomes on all possible tests. A careful understanding of this can yield approaches that create ideal and minimal evaluation sets for such problems (de Jong & Pollack, ). By acknowledging the link between multiobjective optimization and coevolutionary learning, a variety of evaluation and selection methods based on notions of multiobjective optimization have been employed. For example, there are selection methods that use Pareto dominance between candidate solutions and their tests as their basis of comparison (Ficici, ). Additionally, such methods can be combined with archive-based approaches to ensure monotonicity of progress toward a Pareto dominance solution concept (de Jong & Pollack, ).
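Pareto dominance over test outcomes can be sketched in a few lines. The sketch treats each test as a separate objective and keeps the candidates that no other candidate dominates; the outcome table and the function names are hypothetical, and the sketch does not reproduce any specific published selection method.

```python
def dominates(outcomes_a, outcomes_b):
    """True if a is at least as good as b on every test and strictly better on one."""
    at_least_as_good = all(x >= y for x, y in zip(outcomes_a, outcomes_b))
    strictly_better = any(x > y for x, y in zip(outcomes_a, outcomes_b))
    return at_least_as_good and strictly_better


def pareto_front(outcome_table):
    """Names of candidates whose outcome vectors are not dominated by any other."""
    return [
        name
        for name, row in outcome_table.items()
        if not any(
            dominates(other, row)
            for other_name, other in outcome_table.items()
            if other_name != name
        )
    ]


# Hypothetical outcomes of four candidates on three tests (1 = pass, 0 = fail).
outcome_table = {
    "c1": (1, 1, 0),
    "c2": (1, 0, 0),
    "c3": (0, 0, 1),
    "c4": (0, 1, 0),
}
front = pareto_front(outcome_table)
print(front)
```

Here "c3" survives although it passes only one test, because it is the only candidate that passes the third test. A fitness based on the sum of outcomes would rank it no higher than "c2" and "c4".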

Representation

Perhaps the core representational question in coevolution is the role that an individual plays. In test-based coevolution, an individual typically represents a potential solution to the problem or a test for a potential solution, whereas in compositional coevolution individuals typically represent candidate components for a composite or ensemble solution. Even in test-based approaches, the true solution to the problem may be expressed as a population of individuals, rather than a single individual. For example, the population may represent a mixed strategy while individuals represent potential pure strategies for a game. Engineers using such approaches should be clear about the form of the final solution produced by the algorithm, and should ensure that this form is consistent with the prescribed solution concept.

In compositional approaches, the key issues tend to concern how the problem is decomposed. In some algorithms, this decomposition is performed a priori, with different populations representing explicit components of the problem (Potter, ). In other approaches, the decomposition is intended to be somewhat more dynamic (Moriarty & Miikkulainen, ; Potter, ). Still more recent approaches seek to harness the potential of compositional coevolutionary systems to search open-ended representational spaces by gradually complexifying the representational space during the search (Stanley, ).

In addition, a variety of coevolutionary systems have successfully dealt with some inherent pathologies by representing populations in spatial topologies and restricting selection and interaction using geometric constraints defined by those topologies (Pagie, ). Typically, these systems involve overlaying multiple grids of individuals, applying selection within some neighborhood in a given grid, and evaluating interactions between individuals in different grids using a similar type of cross-population neighborhood. The benefits of these systems are in part due to their ability to naturally regulate loss of diversity and spread of interaction information by explicit control over the size and shape of these neighborhoods.

Pathologies and Remedies

Perhaps the most commonly cited pathology is the so-called loss of gradient problem, in which one population comes to severely dominate the others, thus creating a situation in which individuals cannot be distinguished from one another. The populations become disengaged and evolutionary progress may stall or drift (Watson & Pollack, ). Disengagement most commonly occurs when distinguishing individuals are lost in the evolutionary process (forgetting), and the solution to this problem typically involves somehow retaining potentially informative, though possibly inferior quality, individuals (e.g., in archives).

Intransitivities in the reward system can cause some coevolutionary systems to exhibit cycling dynamics (Watson & Pollack, ), where reciprocal changes force the system to orbit some part of a potential search space. The remedy to this problem often involves creating coevolutionary systems that change in response to traits in several other populations. Mechanisms introduced to produce such effects include competitive fitness sharing (Rosin & Belew, ).

Another challenging problem occurs when individuals in a coevolutionary system overspecialize on one underlying objective at the expense of other necessary objectives (Watson & Pollack, ). In fact, overspecialization can be seen as a form of disengagement on some subset of underlying objectives, and likewise the repair to this problem often involves retaining individuals capable of making distinctions in as many underlying objectives as possible (Bucci & Pollack, ).
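Loss of gradient and its archive-style remedy can be illustrated with a toy sketch. It assumes a hypothetical numbers game in which a candidate passes a test if it is at least as large as the test; the names and values are illustrative only. When every test is far ahead of every candidate, all candidates receive the same fitness and selection has nothing to work with.

```python
def outcome(candidate, test):
    """Hypothetical interaction: a candidate passes a test if it is at least as large."""
    return 1 if candidate >= test else 0


def fitness_vector(candidates, tests):
    """Number of tests passed by each candidate."""
    return [sum(outcome(c, t) for t in tests) for c in candidates]


def is_disengaged(candidates, tests):
    """True when the tests no longer distinguish any two candidates."""
    return len(set(fitness_vector(candidates, tests))) <= 1


candidates = [2, 4, 6]
engaged_tests = [3, 5]                 # comparable strength: fitness 0, 1, 2
dominant_tests = [50, 60]              # far ahead of every candidate: fitness 0, 0, 0
archived_tests = dominant_tests + [3]  # retaining an older, weaker test

print(fitness_vector(candidates, engaged_tests))
print(fitness_vector(candidates, dominant_tests))
print(fitness_vector(candidates, archived_tests))
```

Retaining the weaker test restores some of the gradient: it is inferior as a test, yet it still separates the weakest candidate from the others.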


Finally, certain kinds of compositional coevolutionary learning algorithms can be prone to relative overgeneralization, a pathology in which components that perform reasonably well in a variety of composite solutions are favored over those that are part of an optimal solution (Wiegand, ). In this case, it is typically possible to bias the evaluation process toward optimal values by evaluating an individual in a variety of composite assemblies and assigning the best objective value found as the fitness (Panait, ).

In addition to pathological behaviors in coevolution, the subjective nature of these learning systems creates difficulty in measuring progress. Since fitness is subjective, it is impossible to determine whether these relative measures indicate progress or stagnation when the measurement values do not change much. Without engaging some kind of external or objective measure, it is difficult to understand what the system is really doing. Obviously, if an objective measure exists then it can be employed directly to measure progress (Watson & Pollack, ).

A variety of measurement methodologies have been employed when objective measurement is not possible. One method is to compare current individuals against all ancestral opponents (Cliff & Miller, ). Another method, based on predators and prey, holds master tournaments between all the best predators and all the best prey found during the search (Nolfi & Floreano, ). A similar approach suggests maintaining the best individuals from each generation in each population in a hall of fame for comparison purposes (Rosin & Belew, ). Still other approaches seek to record the points during the coevolutionary search at which a new dominant individual was found (Stanley, ). A more recent approach advises looking at the population differential, examining all the information from ancestral generations rather than simply selecting a biased subset (Bader-Natal & Pollack, ).
Conversely, an alternative idea is to consider how well the dynamics of the best individuals in different populations reflect the fundamental best response curves defined by the problem (Popovici, ). With a clear solution concept, an appropriate evaluation mechanism implementing that concept, and practical progress measures in place, coevolution can be an effective and versatile machine learning tool.
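Among the progress measures discussed in this entry, the hall of fame is perhaps the simplest to sketch. The code below assumes a toy game in which the larger number wins, and uses an objective stand-in to pick each generation's champion purely for illustration; the function names are hypothetical and the sketch is not the published method.

```python
def update_hall_of_fame(hall_of_fame, opponent_population, fitness):
    """Append the best opponent of the current generation to the archive."""
    hall_of_fame.append(max(opponent_population, key=fitness))
    return hall_of_fame


def hall_of_fame_score(individual, hall_of_fame, interact):
    """Fraction of archived former champions that `individual` defeats."""
    if not hall_of_fame:
        return 0.0
    wins = sum(1 for champion in hall_of_fame if interact(individual, champion) > 0)
    return wins / len(hall_of_fame)


def beats(a, b):
    """Hypothetical interaction: the larger number wins."""
    return 1 if a > b else 0


hall = []
# Three generations of a hypothetical opponent population.
for generation in ([1, 2], [3, 2], [5, 4]):
    update_hall_of_fame(hall, generation, fitness=lambda x: x)

score = hall_of_fame_score(4, hall, beats)
print(hall, score)
```

Because the archive grows with every generation, a score against it reflects performance against past as well as current opponents, which makes stagnation and forgetting easier to detect than a score against the current opponents alone.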

Cross References

→ Evolutionary Algorithms

Recommended Reading

Angeline, P., & Pollack, J. (). Competitive environments evolve better solutions for complex tasks. In S. Forrest (Ed.), Proceedings of the fifth international conference on genetic algorithms (pp. –). San Mateo, CA: Morgan Kaufmann.
Axelrod, R. (). The evolution of cooperation. New York: Basic Books.
Bader-Natal, A., & Pollack, J. (). Towards metrics and visualizations sensitive to coevolutionary failures. In AAAI technical report FS-- coevolutionary and coadaptive systems. AAAI Fall Symposium, Washington, DC.
Bucci, A., & Pollack, J. B. (). A mathematical framework for the study of coevolution. In R. Poli, et al. (Eds.), Foundations of genetic algorithms VII (pp. –). San Francisco: Morgan Kaufmann.
Bucci, A., & Pollack, J. B. (). Focusing versus intransitivity: Geometrical aspects of coevolution. In E. Cantú-Paz, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (pp. –). Berlin: Springer.
Bull, L. (). Evolutionary computing in multi-agent environments: Partners. In T. Bäck (Ed.), Proceedings of the seventh international conference on genetic algorithms (pp. –). San Mateo, CA: Morgan Kaufmann.
Cliff, D., & Miller, G. F. (). Tracking the red queen: Measurements of adaptive progress in co-evolutionary simulations. In Proceedings of the third European conference on artificial life (pp. –). Berlin: Springer.
de Jong, E. (). The maxsolve algorithm for coevolution. In H. Beyer, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (pp. –). New York, NY: ACM Press.
de Jong, E., & Pollack, J. (). Ideal evaluation from coevolution. Evolutionary Computation, , –.
Ficici, S. G. (). Solution concepts in coevolutionary algorithms. PhD thesis, Brandeis University, Boston, MA.
Fogel, D. (). Blondie: Playing at the edge of artificial intelligence. San Francisco: Morgan Kaufmann.
Hillis, D. (). Co-evolving parasites improve simulated evolution as an optimization procedure. Artificial life II, SFI studies in the sciences of complexity (Vol. , pp. –).
Moriarty, D., & Miikkulainen, R. (). Forming neural networks through efficient and adaptive coevolution. Evolutionary Computation, , –.
Nolfi, S., & Floreano, D. (). Co-evolving predator and prey robots: Do “arms races” arise in artificial evolution? Artificial Life, , –.
Pagie, L. (). Information integration in evolutionary processes. PhD thesis, Universiteit Utrecht, the Netherlands.
Panait, L. (). The analysis and design of concurrent learning algorithms for cooperative multiagent systems. PhD thesis, George Mason University, Fairfax, VA.
Paredis, J. (). Steps towards co-evolutionary classification networks. In R. A. Brooks & P. Maes (Eds.), Artificial life IV, proceedings of the fourth international workshop on the synthesis and simulation of living systems (pp. –). Cambridge, MA: MIT Press.
Popovici, E. (). An analysis of multi-population co-evolution. PhD thesis, George Mason University, Fairfax, VA.
Potter, M. (). The design and analysis of a computational model of cooperative co-evolution. PhD thesis, George Mason University, Fairfax, VA.
Rosin, C., & Belew, R. (). New methods for competitive coevolution. Evolutionary Computation, , –.
Sims, K. (). Evolving D morphology and behavior by competition. In R. A. Brooks & P. Maes (Eds.), Artificial life IV, proceedings of the fourth international workshop on the synthesis and simulation of living systems (pp. –). Cambridge, MA: MIT Press.
Stanley, K. (). Efficient evolution of neural networks through complexification. PhD thesis, The University of Texas at Austin, Austin, TX.
Watson, R., & Pollack, J. (). Coevolutionary dynamics in a minimal substrate. In L. Spector, et al. (Eds.), Proceedings from the genetic and evolutionary computation conference (pp. –). San Francisco: Morgan Kaufmann.
Wiegand, R. P. (). An analysis of cooperative coevolutionary algorithms. PhD thesis, George Mason University, Fairfax, VA.

Collaborative Filtering

Collaborative filtering (CF) refers to a class of techniques, used in recommender systems, that recommend to a user items that other users with similar tastes have liked in the past. CF methods are commonly subdivided into neighborhood-based and model-based approaches. In neighborhood-based approaches, a subset of users is chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for this user. In contrast, model-based approaches assume an underlying structure to users’ rating behavior, and induce predictive models based on the past ratings of all users.
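The neighborhood-based approach can be sketched as follows. The sketch assumes positive ratings stored as nested dictionaries, cosine similarity over co-rated items, and a neighborhood of the k most similar users; these are common choices rather than anything prescribed by the entry, and the user and item names are hypothetical.

```python
import math


def cosine_similarity(r1, r2):
    """Cosine similarity over the items both users have rated (ratings assumed > 0)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    norm2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (norm1 * norm2)


def predict(ratings, active_user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings of `item`."""
    neighbors = [
        (cosine_similarity(ratings[active_user], r), r[item])
        for user, r in ratings.items()
        if user != active_user and item in r
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbor has rated the item
    return sum(sim * rating for sim, rating in neighbors) / total


# Hypothetical rating data: user -> {item: rating}.
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
nearest_only = predict(ratings, "ann", "c", k=1)
two_neighbors = predict(ratings, "ann", "c", k=2)
print(nearest_only, two_neighbors)
```

With k=1 the prediction simply copies the rating of the most similar user; with k=2 the less similar user pulls the prediction down, but only in proportion to that user's similarity.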

Collection

→ Class


Collective Classification

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor
University of Maryland, MD, USA

Synonyms

Iterative classification; Link-based classification

Definition

Many real-world → classification problems can be best described as a set of objects interconnected via links to form a network structure. The links in the network denote relationships among the instances such that the class labels of the instances are often correlated. Thus, knowledge of the correct label for one instance improves our knowledge about the correct assignments to the other instances it connects to. The goal of collective classification is to jointly determine the correct label assignments of all the objects in the network.

Motivation and Background

Traditionally, a major focus of machine learning has been to solve classification problems: given a corpus of documents, classify each according to its topic label; given a collection of e-mails, determine which are spam; given a sentence, determine the part-of-speech tag for each word; given a hand-written document, determine the characters, etc. However, much of the work in machine learning makes an independent and identically distributed (IID) assumption and focuses on predicting the class label of each instance in isolation. In many cases, the prediction of a class label can benefit from knowing the correct assignments to related class labels. For example, it is easier to predict the topic of a webpage if we know the topics of the webpages that link to it; the chance of a particular word being a verb increases if we know that the previous word in the sentence is a noun; and knowing the rest of the characters in a word can make it easier to identify an unknown character. In the last decade, many researchers have proposed techniques that attempt to classify samples in a joint or collective manner instead of treating each sample in isolation, and have reported significant gains in classification accuracy.


Theory/Solution

Collective classification is a combinatorial optimization problem in which we are given a set of nodes, $V = \{v_1, \ldots, v_n\}$, and a neighborhood function $N$, where $N_i \subseteq V \setminus \{v_i\}$, which describes the underlying network structure. Each node in $V$ is a random variable that can take a value from an appropriate domain, $L = \{l_1, \ldots, l_q\}$. $V$ is further divided into two sets of nodes: $X$, the nodes for which we know the correct values (observed variables), and $Y$, the nodes whose values need to be determined. Our task is to label the nodes $Y_i \in Y$ with one of a small number of predefined labels in $L$.

Even though it is only in the last decade that collective classification has entered the collective consciousness of machine learning researchers, the general idea can be traced further back (Besag, ). As a result, a number of approaches have been proposed. The various approaches to collective classification differ in the kinds of information they aim to exploit to arrive at the correct classification, and in their mathematical underpinnings. We discuss each in turn.
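The ingredients of the problem statement can be collected in a small container type. This is only an illustrative sketch: the class name, field names, and example pages are hypothetical, and the sketch stores known labels rather than modeling the nodes as random variables.

```python
from dataclasses import dataclass, field


@dataclass
class CollectiveClassificationProblem:
    """Container for the sets V, N, L, X and Y of the problem statement."""

    nodes: set        # V
    neighbors: dict   # N: node -> set of neighboring nodes
    labels: tuple     # L
    observed: dict = field(default_factory=dict)  # X: node -> known label

    def unobserved(self):
        """Y: the nodes whose labels still have to be determined."""
        return self.nodes - set(self.observed)

    def validate(self):
        """Check that no node is its own neighbor and that known labels come from L."""
        for v, nbrs in self.neighbors.items():
            assert v in self.nodes and nbrs <= self.nodes - {v}
        assert all(label in self.labels for label in self.observed.values())
        return True


# Hypothetical three-page web graph with one labeled page.
problem = CollectiveClassificationProblem(
    nodes={"p1", "p2", "p3"},
    neighbors={"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2"}},
    labels=("sports", "politics"),
    observed={"p1": "sports"},
)
print(problem.unobserved())
```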

Relational Classification

Traditional classification concentrates on using the observed attributes of the instance to be classified. Relational classification (Slattery & Craven, ) attempts to go a step further by classifying the instance using not only the instance’s own attributes but also the instance’s neighbors’ attributes. For example, in a hypertext classification domain where we want to classify webpages, we would use not only the webpage’s own words but also the words of the webpages that link to it via hyperlinks to arrive at the correct class label. Results obtained using relational classification have been mixed. For example, even though there have been reports of classification accuracy gains using such techniques, in certain cases these techniques can harm classification accuracy (Chakrabarti, Dom, & Indyk, ).
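One simple way to build such relational features is sketched below. It assumes pages represented as lists of words and keeps the words of linking pages in a separate namespace from the page's own words; that separation, like the function and page names, is an assumption made for illustration and is not taken from the cited work.

```python
from collections import Counter


def relational_features(page, words, in_links):
    """Bag of words for `page`, extended with the words of the pages that link to it."""
    features = Counter({("own", w): c for w, c in Counter(words[page]).items()})
    for source in in_links.get(page, []):
        for w in words[source]:
            features[("neighbor", w)] += 1
    return features


# Hypothetical pages and hyperlinks (in_links maps a page to the pages linking to it).
words = {
    "home": ["team", "score"],
    "fan": ["team", "goal"],
    "news": ["election"],
}
in_links = {"home": ["fan", "news"]}

features = relational_features("home", words, in_links)
print(features)
```

The resulting feature vector can be passed to any conventional classifier. Note how the unrelated word "election" enters the representation through a link, which hints at why neighbor attributes can hurt as well as help.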

Iterative Collective Classification with Neighborhood Labels

A second approach to collective classification is to use the class labels assigned to the neighbors instead of the neighbors’ observed attributes. For example, going back to our hypertext classification example, instead of using the linking webpage’s words we would, in this case, use its assigned labels to classify the current webpage. Chakrabarti et al. () illustrated the use of this approach and reported impressive classification accuracy gains. Neville and Jensen () further developed the approach, referred to it as iterative classification, and studied the conditions under which it improved classification performance (Jensen, Neville, & Gallagher, ). Techniques for feature construction from the neighboring labels were developed and studied (Lu & Getoor, ), along with methods that make use of only the label information (Macskassy & Provost, ), as well as a variety of strategies for when to commit the class labels (McDowell, Gupta, & Aha, ).

The algorithm listing below gives pseudo-code for a simple version of the Iterative Classification Algorithm (ICA). The basic premise behind ICA is extremely simple. Consider a node $Y_i \in Y$ whose value we need to determine and suppose we know the values of all the other nodes in its neighborhood $N_i$ (note that $N_i$ can contain both observed and unobserved variables). Then, ICA assumes that we are given a local classifier $f$ that takes the values of $N_i$ as arguments and returns a label value for $Y_i$ from the class label set $L$. For local classifiers $f$ that do not return a class label but a goodness/likelihood value given a set of attribute values and a label, we simply choose the label that corresponds to the maximum goodness/likelihood value; in other words, we replace $f$ with $\arg\max_{l \in L} f$. This makes the local classifier $f$ extremely flexible, and we can use anything ranging from a decision tree to a → support vector machine (SVM). Unfortunately, it is rare in practice that we know all values in $N_i$, which is why we need to repeat the process iteratively, in each iteration labeling each $Y_i$ using the current best estimates of $N_i$ and the local classifier $f$, and continuing to do so until the assignments to the labels stabilize.

Algorithm: Iterative Classification Algorithm (ICA)
  for each node Y_i ∈ Y do  {bootstrapping}
    {compute label using only observed nodes in N_i}
    compute a_i using only X ∩ N_i
    y_i ← f(a_i)
  end for
  repeat  {iterative classification}
    generate ordering O over nodes in Y
    for each node Y_i ∈ O do
      {compute new estimate of y_i}
      compute a_i using current assignments to N_i
      y_i ← f(a_i)
    end for
  until all class labels have stabilized or a threshold number of iterations have elapsed

Most local classifiers are defined as functions whose argument consists of a fixed-length vector of attribute values, whereas the number of neighbors varies from node to node. A common approach to circumvent this mismatch is to use an aggregation operator such as count, mode, or prop, the last of which measures the proportion of neighbors with a given label. In the listing, we use $\vec{a}_i$ to denote the vector encoding the values in $N_i$ obtained after aggregation. Note that in the first ICA iteration, all labels $y_i$ are undefined; to initialize them we simply apply the local classifier to the observed attributes in the neighborhood of $Y_i$. This is referred to as “bootstrapping” in the listing.

Researchers in collective classification (Macskassy & Provost, ; McDowell et al., ; Neville & Jensen, ) have extended the simple algorithm described above and developed a version of Gibbs sampling that is easy to implement and faster than traditional Gibbs sampling approaches. The basic idea behind this algorithm is to assume, just as in the case of ICA, that we have access to a local classifier $f$ that can sample for the best label estimate for $Y_i$ given all the values for the nodes in $N_i$. We keep doing this repeatedly for a fixed number of iterations (a period known as “burn-in”). After that, not only do we sample labels for each $Y_i \in Y$ but we also maintain count statistics as to how many times we sampled label $l$ for node $Y_i$.
After collecting a predefined number of such samples, we output the best label assignment for node Yi by choosing the label that was assigned the maximum number of times to Yi while collecting samples. One of the benefits of both variants of ICA is that it is fairly simple to make use of any local classifier. Classifiers that have been used include naïve Bayes (Chakrabarti et al., ; Neville & Jensen, ), logistic regression (Lu & Getoor, ), decision trees (Jensen et al., ), and weighted-vote relational


neighbor (Macskassy & Provost, ). There is some evidence to indicate that discriminatively trained local classifiers such as logistic regression tend to produce higher accuracies than others; this is consistent with results in other areas. Another aspect of ICA that has been investigated is the ordering strategy, which determines the order in which nodes are visited and relabeled in each ICA iteration. There is some evidence to suggest that ICA is fairly robust to a number of simple ordering strategies, such as random ordering, visiting nodes in ascending order of the diversity of their neighborhood class labels, and labeling nodes in descending order of label confidence (Getoor, ). However, there is also some evidence that certain modifications to the basic ICA procedure tend to produce improved classification accuracies. For example, both Neville and Jensen () and McDowell et al. () propose a strategy where only a subset of the unobserved variables is utilized as input for feature construction. More specifically, in each iteration, they choose the top-k most confident predicted labels and use only those unobserved variables in the following iteration’s predictions, thus ignoring the less confident predicted labels. In each subsequent iteration they increase the value of k so that in the last iteration all nodes are used for prediction. McDowell et al. report that such a “cautious” approach leads to improved accuracies.
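The basic ICA loop described above (bootstrapping followed by repeated relabeling) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the text: observed nodes carry known labels rather than general attributes, the aggregation operator is prop, the local classifier is a hypothetical majority rule standing in for f, and the ordering O is a fixed alphabetical order. All function and variable names are invented for the example.

```python
from collections import Counter


def aggregate(node, neighbors, labels, label_set):
    """Build the fixed-length vector a_i with the 'prop' operator:
    the proportion of currently labeled neighbors carrying each label."""
    counts = Counter(labels[n] for n in neighbors[node] if n in labels)
    total = sum(counts.values())
    return [counts[l] / total if total else 0.0 for l in label_set]


def local_classifier(vec, label_set):
    """Hypothetical stand-in for f: pick the label with the largest
    proportion (ties go to the earlier label in label_set)."""
    best = max(range(len(label_set)), key=lambda k: vec[k])
    return label_set[best]


def ica(neighbors, observed, unobserved, label_set, max_iters=20):
    # Bootstrapping: label each unobserved node from observed neighbors only.
    boot = {
        node: local_classifier(
            aggregate(node, neighbors, observed, label_set), label_set)
        for node in unobserved
    }
    labels = {**observed, **boot}
    # Iterative classification: relabel using current neighbor assignments.
    for _ in range(max_iters):
        changed = False
        for node in sorted(unobserved):  # ordering O (assumed: alphabetical)
            new = local_classifier(
                aggregate(node, neighbors, labels, label_set), label_set)
            if new != labels[node]:
                labels[node], changed = new, True
        if not changed:  # all class labels have stabilized
            break
    return {node: labels[node] for node in unobserved}


# Toy graph: u sits next to two 'A' nodes, w next to two 'B' nodes, v between.
neighbors = {
    "a1": ["u"], "a2": ["u"], "b1": ["w"], "b2": ["w"],
    "u": ["a1", "a2", "v"], "v": ["u", "w"], "w": ["v", "b1", "b2"],
}
observed = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
result = ica(neighbors, observed, ["u", "v", "w"], ["A", "B"])
print(result)
```

Swapping `local_classifier` for a trained model (for example logistic regression over the aggregated vector) leaves the loop unchanged, which is the flexibility the text attributes to ICA.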

Collective Classification with Graphical Models

In addition to the approaches described above, which essentially focus on local representations and propagation methods, another approach to collective classification is to first represent the problem with a high-level global graphical model and then use learning and inference techniques for the graphical modeling approach to arrive at the correct classifications. These proposals include the use of both directed graphical models (Getoor, Segal, Taskar, & Koller, ) and undirected graphical models (Lafferty, McCallum, & Pereira, ; Taskar, Abbeel, & Koller, ). See statistical relational learning and Getoor and Taskar () for a survey of various graphical models that are suitable for collective classification. In general, these techniques can use both neighborhood labels and observed attributes


of neighbors. On the other hand, due to their generality, these techniques also tend to be less efficient than the iterative collective classification techniques. One common way of defining such a global model uses a pairwise Markov random field (pairwise MRF) (Taskar et al., ). Let G = (V, E) denote a graph of random variables as before, where V consists of two types of random variables: the unobserved variables Y, which need to be assigned domain values from the label set L, and the observed variables X, whose values we know (see Graphical Models). Let Ψ denote a set of clique potentials. Ψ contains three distinct types of functions:

● For each Yi ∈ Y, ψi ∈ Ψ is a mapping ψi : L → R≥0, where R≥0 is the set of nonnegative real numbers.
● For each (Yi, Xj) ∈ E, ψij ∈ Ψ is a mapping ψij : L → R≥0.
● For each (Yi, Yj) ∈ E, ψij ∈ Ψ is a mapping ψij : L × L → R≥0.

Let x denote the values assigned to all the observed variables in V and let xi denote the value assigned to Xi. Similarly, let y denote any assignment to all the unobserved variables in V and let yi denote a value assigned to Yi. For brevity of notation we will denote by ϕi the clique potential obtained by computing

$$\phi_i(y_i) = \psi_i(y_i) \prod_{(Y_i, X_j) \in E} \psi_{ij}(y_i).$$

We are now in a position to define a pairwise MRF.

Definition. A pairwise Markov random field (MRF) is given by a pair ⟨G, Ψ⟩ where G is a graph and Ψ is a set of clique potentials with ϕi and ψij as defined above. Given an assignment y to all the unobserved variables Y, the pairwise MRF is associated with the probability distribution

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{Y_i \in Y} \phi_i(y_i) \prod_{(Y_i, Y_j) \in E} \psi_{ij}(y_i, y_j),$$

where x denotes the observed values of X and

$$Z(x) = \sum_{y'} \prod_{Y_i \in Y} \phi_i(y'_i) \prod_{(Y_i, Y_j) \in E} \psi_{ij}(y'_i, y'_j).$$

Given a pairwise MRF, it is conceptually simple to extract the best assignments to each unobserved variable in the network. For example, we may adopt the criterion that the best label value for Yi is simply the one corresponding to the highest marginal probability obtained by summing over all other variables from the probability distribution associated with the pairwise MRF. Computationally, however, this is difficult to achieve since computing one marginal probability

requires summing over an exponentially large number of terms. Hence, approximate inference algorithms are typically employed, the two most common being loopy belief propagation (LBP) and mean-field relaxation labeling.
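To make the cost of exact inference concrete, the following sketch computes the marginals of a tiny pairwise MRF by brute-force enumeration, which visits all |L|^n joint assignments. It is an illustration only: the node names, the potential values, and the homophily-style edge potential are assumptions made up for the example, and the node potentials ϕi are taken as already combined with the observed evidence.

```python
from itertools import product


def exact_marginals(nodes, labels, phi, edges, psi):
    """Exact marginals P(Y_i = l | x) of a pairwise MRF by enumeration.

    phi[i][l] is the node potential phi_i(l); psi(a, b) is the edge
    potential shared by all (Y_i, Y_j) edges. Cost: |labels| ** len(nodes).
    """
    weights = {}
    for assignment in product(labels, repeat=len(nodes)):
        y = dict(zip(nodes, assignment))
        w = 1.0
        for i in nodes:
            w *= phi[i][y[i]]
        for i, j in edges:
            w *= psi(y[i], y[j])
        weights[assignment] = w
    z = sum(weights.values())  # partition function Z(x)
    marg = {i: {l: 0.0 for l in labels} for i in nodes}
    for assignment, w in weights.items():
        for i, l in zip(nodes, assignment):
            marg[i][l] += w / z
    return marg


# Hypothetical chain Y1 - Y2 - Y3; only Y1 has informative local evidence.
nodes = ["Y1", "Y2", "Y3"]
labels = ["a", "b"]
phi = {
    "Y1": {"a": 0.9, "b": 0.1},
    "Y2": {"a": 0.5, "b": 0.5},
    "Y3": {"a": 0.5, "b": 0.5},
}
edges = [("Y1", "Y2"), ("Y2", "Y3")]
psi = lambda a, b: 2.0 if a == b else 1.0  # neighbors prefer equal labels

marg = exact_marginals(nodes, labels, phi, edges, psi)
best = {i: max(labels, key=lambda l: marg[i][l]) for i in nodes}
print(marg, best)
```

With three binary nodes the loop runs over 8 assignments; with 50 binary nodes it would need 2^50, which is why approximate methods such as LBP are used in practice.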

Applications

Due to its general applicability, collective classification has been applied to a number of real-world problems. Foremost in this list is document classification. Chakrabarti et al. () were among the first to apply collective classification to corpora of patents linked via hyperlinks, and reported that considering attributes of neighboring documents actually hurts classification performance. Slattery and Craven () also considered the problem of document classification by constructing features from neighboring documents using an inductive logic programming rule learner. Yang, Slattery, and Ghani () conducted an in-depth investigation over multiple datasets commonly used for document classification experiments and identified different patterns. Other applications of collective classification include object labeling in images (Hummel & Zucker, ), analysis of spatial statistics (Besag, ), iterative decoding (Berrou, Glavieux, & Thitimajshima, ), part-of-speech tagging (Lafferty et al., ), classification of hypertext documents using hyperlinks (Taskar et al., ), link prediction (Getoor, Friedman, Koller, & Taskar, ; Taskar, Wong, Abbeel, & Koller, ), optical character recognition (Taskar, Guestrin, & Koller, ), entity resolution in sensor networks (Chen, Wainwright, Cetin, & Willsky, ), predicting disulphide bonds in protein molecules (Taskar, Chatalbashev, Koller, & Guestrin, ), segmentation of 3D scan data (Anguelov et al., ), and classification of e-mail speech acts (Carvalho & Cohen, ). Recently, there have also been attempts to extend collective classification techniques to the semi-supervised learning scenario (Lu & Getoor, b; Macskassy, ; Xu, Wilkinson, Southey, & Schuurmans, ).

Cross References

Decision Trees
Inductive Logic Programming
Learning From Structured Data
Relational Learning
Semi-Supervised Learning
Statistical Relational Learning

Recommended Reading

Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., et al. (). Discriminative learning of Markov random fields for segmentation of 3D scan data. In IEEE computer society conference on computer vision and pattern recognition. IEEE Computer Society, Washington D.C.
Berrou, C., Glavieux, A., & Thitimajshima, P. (). Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proceedings of IEEE international communications conference, Geneva, Switzerland, IEEE.
Besag, J. (). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, B-, –.
Carvalho, V., & Cohen, W. W. (). On the collective classification of email speech acts. In Special interest group on information retrieval, Salvador, Brazil, ACM.
Chakrabarti, S., Dom, B., & Indyk, P. (). Enhanced hypertext categorization using hyperlinks. In International conference on management of data, Seattle, Washington. New York: ACM.
Chen, L., Wainwright, M., Cetin, M., & Willsky, A. (). Multitarget-multisensor data association using the tree-reweighted max-product algorithm. In SPIE Aerosense conference. Orlando, Florida.
Getoor, L. (). Link-based classification. In Advanced methods for knowledge discovery from complex data. New York: Springer.
Getoor, L., & Taskar, B. (Eds.). (). Introduction to statistical relational learning. Cambridge, MA: MIT Press.
Getoor, L., Segal, E., Taskar, B., & Koller, D. (). Probabilistic models of text and link structure for hypertext classification. In Proceedings of the IJCAI workshop on text learning: Beyond supervision, Seattle, WA.
Getoor, L., Friedman, N., Koller, D., & Taskar, B. (). Learning probabilistic models of link structure. Journal of Machine Learning Research, , –.
Hummel, R., & Zucker, S. (). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, , –.
Jensen, D., Neville, J., & Gallagher, B. (). Why collective inference improves relational classification. In Proceedings of the th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA. ACM.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the international conference on machine learning, Washington DC. San Francisco, CA: Morgan Kaufmann.
Lu, Q., & Getoor, L. (a). Link based classification. In Proceedings of the international conference on machine learning. AAAI Press, Washington, D.C.
Lu, Q., & Getoor, L. (b). Link-based classification using labeled and unlabeled data. In ICML workshop on the continuum from labeled to unlabeled data in machine learning and data mining. Washington, D.C.
Macskassy, S., & Provost, F. (). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, , –.
Macskassy, S. A. (). Improving learning in networked data by combining explicit and mined links. In Proceedings of the twenty-second conference on artificial intelligence. AAAI Press, Vancouver, Canada.
McDowell, L. K., Gupta, K. M., & Aha, D. W. (). Cautious inference in collective classification. In Proceedings of AAAI. AAAI Press, Vancouver, Canada.
Neville, J., & Jensen, D. (). Relational dependency networks. Journal of Machine Learning Research, , –.
Neville, J., & Jensen, D. (). Iterative classification in relational data. In Workshop on statistical relational learning, AAAI.
Slattery, S., & Craven, M. (). Combining statistical and relational methods for learning in hypertext domains. In International conference on inductive logic programming. Springer-Verlag, London, UK.
Taskar, B., Abbeel, P., & Koller, D. (). Discriminative probabilistic models for relational data. In Proceedings of the annual conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, CA.
Taskar, B., Guestrin, C., & Koller, D. (a). Max-margin Markov networks. In Neural information processing systems. MIT Press, Cambridge, MA.
Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (b). Link prediction in relational data. In Neural information processing systems. MIT Press, Cambridge, MA.
Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (). Learning structured prediction models: A large margin approach. In Proceedings of the international conference on machine learning. ACM, New York, NY.
Xu, L., Wilkinson, D., Southey, F., & Schuurmans, D. (). Discriminative unsupervised learning of structured predictors. In Proceedings of the international conference on machine learning. ACM, New York, NY.
Yang, Y., Slattery, S., & Ghani, R. (). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, (–), –.

Commercial Email Filtering

See Text Mining for Spam Filtering

Committee Machines

See Ensemble Learning

Community Detection

See Group Detection


Comparable Corpus

A comparable corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that the documents in each subset are on the same topic as the documents in the others. The prototypical example of a comparable corpus is a collection of newspaper articles written in different languages and reporting on the same events: while they will not be, strictly speaking, translations of one another, they will share most of their semantic content. Some methods for cross-language text mining rely, totally or partially, on the statistical properties of comparable corpora.

Competitive Coevolution

See Test-Based Coevolution

Competitive Learning

Competitive learning is an artificial neural network learning process in which different neurons or processing elements compete for the right to learn to represent the current input. In its purest form, competitive learning takes place in so-called winner-take-all networks, where only the neuron that best represents the input is allowed to learn. Since all neurons learn to better represent the kinds of inputs they are already good at representing, they become specialized to represent different kinds of inputs. For vector-valued inputs and representations, the input becomes quantized to the unit having the closest representation (model), and the representations are adapted to minimize the representation error using stochastic gradient descent. Competitive learning networks have been studied as models of how receptive fields and feature detectors, such as orientation-selective visual neurons, develop in neural networks. The same process is at work in online K-means clustering, and variants of it in Self-Organizing Maps (SOM) and the EM algorithm of mixture models.
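The winner-take-all rule described in this entry can be illustrated with a short sketch, which amounts to online K-means. The learning rate, the number of epochs, the toy data, and the initial unit positions are all assumptions chosen for the example rather than values from the text.

```python
import random


def winner_take_all(data, units, lr=0.1, epochs=20, seed=0):
    """Online competitive learning for vector inputs.

    For each input, the unit with the closest representation wins and
    is the only one updated: a stochastic gradient step that reduces
    its squared representation error for that input.
    """
    rng = random.Random(seed)
    units = [list(u) for u in units]
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x in data:
            # Competition: find the unit closest to the input.
            win = min(
                range(len(units)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, units[k])),
            )
            # Only the winner learns: move it a fraction lr toward x.
            units[win] = [w + lr * (a - w) for w, a in zip(units[win], x)]
    return units


# Two well-separated groups of inputs and two units.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
units = winner_take_all(data, units=[(1, 1), (9, 9)])
print(units)
```

Each unit ends up near the inputs it keeps winning, which is the specialization effect the entry describes.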

Complex Adaptive System

See Complexity in Adaptive Systems

Complexity in Adaptive Systems

Jun He
Aberystwyth University, Wales, UK

Synonyms

Adaptive system; Complex adaptive system

Definition

An adaptive system, or complex adaptive system, is a special case of a complex system that is able to adapt its behavior according to changes in its environment or in parts of the system itself. In this way, the system can improve its performance through continuing interaction with its environment. The concept of complexity in an adaptive system is used to analyze the interactive relationship between the system and its environment, and can be classified into two types: internal complexity, which is model complexity, and external complexity, which is data complexity. The external complexity is defined by the amount of input, information, or energy that the system receives from its environment. The internal complexity refers to the complexity of how the system represents these inputs through its internal process.

Motivation and Background

Adaptive systems range from natural systems to artificial systems (Holland, , ; Waldrop, ). Examples of natural systems include ant colonies, ecosystems, the brain, neural networks and immune systems, cells, and developing embryos; examples of artificial systems include stock markets, social systems, manufacturing businesses, and human social group-based


endeavor in a cultural and social system, such as political parties or communities. All these systems have a common feature: they can adapt to their environment. An adaptive system is adaptive in that it has the capacity to change its internal structure in order to adapt to its environment. It is complex in the sense that it interacts with its environment. The interaction between an adaptive system and its environment is dynamic and nonlinear. Complexity emerges from the interaction between the system and its environment and among the elements of the system, where the emergent macroscopic patterns are more complex than the sum of the low-level (microscopic) elements encompassed in the system. Understanding the evolution and development of adaptive systems still faces many mathematical challenges (Levin, ). The concepts of external and internal complexities are used to analyze the relation between an adaptive system and its environment. The description given below is based on Jürgen Jost’s () work, which introduced these two concepts and applied the theoretical framework to the construction of learning models, e.g., to design neural network architectures. In the following, the concepts are mainly applied to analyze the interaction between the system and its environment. The interaction among individual elements of the system is discussed less; however, the concepts can be explored in that situation too.

Theory

Adaptive System Environment and Regularities

The environment of an adaptive system is more complex than the system itself, and its changes cannot be completely predicted by the system. However, the changes of the environment are not purely random and noisy; there exist regularities in the environment. An adaptive system can recognize these regularities and express them through its internal processes in order to adapt to the environment. The input that an adaptive system receives or extracts from its environment usually includes two parts: one is the part with regularities, and the other is the part that appears random to the system. The part with regularities is useful and meaningful. An adaptive system will represent these regularities by internal processes. But


the part of random input is useless, and at worst it is even detrimental to an adaptive system. However, which part of the input is meaningful and regular, and which part is random and devoid of meaning and structure, depends on the adaptive system’s internal model of the external environment. An adaptive system will translate the external regularities into its internal ones, and only the regularities are useful to the system. The system tries to extract as many regularities as possible, and to represent these regularities as efficiently as possible, in order to make optimal use of its capacity. The notions of external complexity and internal complexity are used to investigate these two complementary aspects conceptually and quantitatively. In terms of these notions, an adaptive system aims to increase its external complexity and reduce its internal complexity. The two processes operate on their own time scales but are intricately linked and mutually dependent. For example, the internal complexity can only be reduced if the external complexity is fixed. Under fixed inputs received from the external environment, an adaptive system can represent these inputs more efficiently and optimize its internal structure. If the external complexity is increased, e.g., if the system is required to handle additional new input, then it is necessary to increase its internal complexity. The increase of internal complexity may occur through the creation of redundancy in the existing adaptive system, e.g., by duplicating some internal structures, which then enables the system to handle more external input. Once the input is fixed, the adaptive system will then represent the input as efficiently as possible and reduce the internal complexity. The decrease of internal complexity can be achieved by discarding some input as meaningless and irrelevant, e.g., by leaving some regularities out for this purpose.

Since the inputs relevant to the system are those which can be reflected in the internal model, the external complexity is not equivalent to the amount of raw data received from the environment. In fact, it is only relevant to the inputs which can be processed in the internal model, or to the observations in some adaptive systems. Thus the external complexity is ultimately decided by the internal model constructed by the system.


External and Internal Complexities

External complexity means data complexity, which measures the amount of input received from the environment for the system to handle and process. Such a complexity can be measured by entropy in the sense of information theory. Internal complexity is model complexity, which measures the complexity of a model for representing the input or information received by the system. The aim of the adaptive system is to obtain an efficient model that is as simple as possible, with the capacity to handle as much input as possible. On the one hand, the adaptive system will try to maximize its external complexity and thereby adapt to its environment in a maximal way; on the other hand, it will try to minimize its internal complexity and thereby construct a model that processes the input in the most efficient way. These two aims sometimes seem to conflict, but such a conflict can be avoided when the two processes operate on different time scales. Given a model, the system will organize the input data and try to increase its ability to deal with the input from its environment, thus increasing its external complexity. Given the input, conversely, it tries to simplify the model that represents that input, thus decreasing the internal complexity. The meaning of the input depends on the time scale under investigation. On a short time scale, for example, the input may consist of individual signals, but on a long time scale, it will be a sequence of signals that follows a probability distribution. A good internal model tries to express regularities in the input sequence, rather than several individual signals, and the decrease of internal complexity happens on this long time scale.

A formal definition of the internal and external complexities is based on the concept of entropy from statistical mechanics and information theory. Given a model θ, the system models the data as $X(\theta) = (X_1, \dots, X_k)$, which is assumed to have an internal probability distribution $P(X(\theta))$ so that entropy can be computed. The external complexity is defined by

$$-\sum_{i=1}^{k} P(X_i(\theta)) \log P(X_i(\theta)).$$

An adaptive system tries to maximize the above external complexity. The probability distribution $P(X(\theta))$ quantifies the information value of the data $X(\theta)$. The value of information can also be described by other approaches, e.g., by the length of the representation of the data in the internal code of the system (Rissanen, ). In this case, the optimal coding is a consequence of the minimization of internal complexity, and the length of the representation of the data $X_i(\theta)$ then behaves like $-\log P(X_i(\theta))$ (Rissanen, ).

On a short time scale, for a given model θ, the system tries to increase the amount of meaningful input information $X(\theta)$. On a long time scale, when the input is given, e.g., when the system has gathered a set of inputs on a time scale with a stationary probability distribution of input patterns Ξ, the model should be improved to handle the input as efficiently as possible and to reduce the complexity of the model. This complexity, or internal complexity, is defined by

$$-\sum_{i=1}^{k} P(\Xi_i \mid \theta) \log P(\Xi_i \mid \theta) - \log P(\theta)$$

with respect to the model θ. If Rissanen’s () minimum description length principle is applied to the above formula, then the optimal model will satisfy the variational problem

$$\min_{\theta} \bigl( -\log P(\Xi \mid \theta) - \log P(\theta) \bigr).$$

In the above minimization problem, there are two terms to minimize. The first measures how efficiently the model represents or encodes the data, and the second measures how complicated the model is. In computer science, this latter term corresponds to the length of the program required to encode the model.

The concepts of external and internal complexities can also be applied to a system divided into subsystems. In this case, some internal part of the original whole system becomes external to a subsystem. Thus the input of a subsystem consists of the original external input and also input from the rest of the system, i.e., from other subsystems.
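As a small numerical illustration of the quantities in this section, the sketch below computes an entropy of the form −∑ p log p and the two-part description length −log P(Ξ ∣ θ) − log P(θ) used in the minimum description length criterion, and picks the model that minimizes it. The coin-flip data, the two candidate models, their prior probabilities, and the assumption that the data items are independent (so that log-likelihoods add) are all hypothetical choices made for the example.

```python
import math


def entropy(probs):
    """-sum p log p (natural logarithm), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def description_length(data, likelihood, prior):
    """Two-part code length: -log P(data | theta) - log P(theta).

    Assumes independent data items, so the log-likelihoods add up.
    """
    return -sum(math.log(likelihood(x)) for x in data) - math.log(prior)


def best_model(data, models):
    """models maps a name to (likelihood function, prior probability)."""
    return min(models, key=lambda name: description_length(data, *models[name]))


# Hypothetical candidate models for a sequence of coin flips.
models = {
    "fair": (lambda x: 0.5, 0.9),                           # simple, high prior
    "biased": (lambda x: 0.9 if x == "H" else 0.1, 0.1),    # lower prior
}

skewed = list("HHHHHHHHHT")   # nine heads, one tail
balanced = list("HT")

print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))
print(best_model(skewed, models), best_model(balanced, models))
```

On the skewed sequence the better data fit of the biased model outweighs its lower prior, while on the short balanced sequence the fair model gives the shorter total description; this is the trade-off between the two terms of the minimization problem.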

Application: Learning

The discussion of these two concepts, external and internal complexities, can be placed in the context of learning. In statistical learning theory (Vapnik, ), the criterion for evaluating a learning process is the expected prediction error of future data by the model


based on a training data set with partial and incomplete information. The task is to construct a probability distribution, drawn from a class specified a priori, that represents the distribution underlying the input data received. Usually, if a model produces a higher error on the training data, then a higher error is expected on future data. The error depends on two factors: one is the accuracy of the model on the training data set, and the other is the simplicity of the model itself. The description of the data set can be split into two parts: the regular part, which is useful in constructing the model, and the random part, which is noise to the model. The learning process fits very well into the theoretical framework of internal and external complexities. If the model is too complicated, it runs the risk of overfitting the training data. In this case, some spurious or putative regularity is incorporated into the model, which will not appear in future data. The model should therefore be constrained to some model class with bounded complexity. In the context of statistical learning theory, this complexity is measured by the Vapnik-Chervonenkis dimension (see VC Dimension) (Vapnik, ). Under the simplest form of statistical learning theory, the system aims at finding a representation with the smallest error in a class with given complexity constraints, so that the model minimizes both the expected error on future data and the overfitting error. The two concepts of overfitting and leaving out regularities can be distinguished in the following sense. The former is caused by the noise in the data, i.e., the random part of the data, and leads to putative regularities that will not appear in future data. The latter, leaving out regularities, means that the system can forgo some of the regularities in the data, i.e., that data compression is possible. Thus, leaving out regularities can be used to simplify the model and reduce the internal complexity. However, one question remains open here: which regularities in the data set are useful for data compression and also meaningful for future prediction, and which parts are random to the model? The internal complexity is the model complexity. If the internal complexity is chosen too small, then the model does not have enough capacity to represent all the important features of the data set. If the internal complexity is too large, on the other hand, then the


model does not represent the data efficiently. The internal complexity is preferably minimized under appropriate constraints on the adequacy of the representation of the data. This is consistent with Rissanen’s principle of Minimum Description Length (Rissanen, ), which seeks to represent a given data set in the most efficient way. Thus a good model is both simple in itself and efficient in representing the data. The external complexity is the data complexity, which should be large in order to represent the input accurately. This is related to Jaynes’ principle of maximizing the ignorance (Jaynes, ), where a model for representing data should have the maximal possible entropy under the constraint that all regularities can be reproduced. In this way, putative regularities can be eliminated from the model. However, as argued by Gell-Mann and Lloyd (), this principle should be applied under certain conditions: it must not eliminate the essential regularities in the data, and an overly complex model should be avoided. For some learning systems, only a selection of the data is gathered and observed by the system. Thus a middle term, observation, is added between model and data. The concept of observation refers to the extraction of the value of some specific quantity from given data or a data pool. What a system can observe depends on its internal structure and its general model of the environment. The system does not have direct access to the raw data; instead, it constructs a model of the environment solely on the basis of the values of its observations. For this kind of learning system, Jaynes’ principle (Jaynes, ) is still applicable for increasing the external complexity. For the given observations made on a data set, the maximum entropy representation should be selected. However, this principle is still subject to the modification of Gell-Mann and Lloyd (), under which the model should not lose the essential regularities observed in the data. By contrast, the observations should be selected to reduce the internal complexity. Given a model, if observations can be made on a given data set, then these observations should be selected so as to minimize the resulting entropy of the model, with the purpose of minimizing the uncertainty left about the data. This in turn reduces the complexity. In most cases the environment is dynamic, i.e., the data set itself can vary, and then the external


complexity should be maximized again. Thus the observations should be chosen for maximal information gain extracted from the data, so as to increase the external complexity. Jaynes’ principle (Jaynes, ) can be applied in the same way as in the previous discussion. But on a longer time scale, when the inputs reach some stationary distribution, the model should be simplified to reduce its internal complexity.

Recommended Reading

Gell-Mann, M., & Lloyd, S. (). Information measures, effective complexity, and total information. Complexity, (), –.
Holland, J. (). Adaptation in natural and artificial systems. Cambridge, MA: MIT Press.
Holland, J. (). Hidden order: How adaptation builds complexity. Reading, MA: Addison-Wesley.
Jaynes, E. (). Information theory and statistical mechanics. Physical Review, (), –.
Jost, J. (). External and internal complexity of complex adaptive systems. Theory in Biosciences, (), –.
Levin, S. (). Complex adaptive systems: Exploring the known, the unknown and the unknowable. Bulletin of the American Mathematical Society, (), –.
Rissanen, J. (). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Vapnik, V. (). Statistical learning theory. New York: John Wiley & Sons.
Waldrop, M. (). Complexity: The emerging science at the edge of order and chaos. New York: Simon & Schuster.

Complexity of Inductive Inference

Sanjay Jain, Frank Stephan
National University of Singapore, Singapore, Republic of Singapore

Definition

In inductive inference, the complexity of learning can be measured in various ways: by the number of hypotheses issued in the worst case until the correct hypothesis is found; by the number of data items to be consumed or memorized in order to learn in the worst case; by the Turing degree of oracles needed to learn the class under a certain criterion; or by the intrinsic complexity, which is, like the Turing degrees in recursion theory, a way to measure the complexity of classes by using reducibilities between them.

Detail

We refer the reader to the article Inductive Inference for basic definitions in inductive inference and the notations used below. Let N denote the set of natural numbers. Let $\varphi_0, \varphi_1, \ldots$ denote a fixed acceptable programming system (Rogers, ). Let $W_i = \mathrm{domain}(\varphi_i)$.

Mind Changes and Anomalies

The first measure of the complexity of learning is the number of mind changes needed before the learner converges to its final hypothesis in the TxtEx model of learning. The number of mind changes made by a learner M on a text T is $\mathrm{card}(\{m : {?} \neq M(T[m]) \neq M(T[m+1])\})$. A learner M $\mathrm{TxtEx}_n$-learns a class $\mathcal{L}$ of languages iff M TxtEx-learns $\mathcal{L}$ and, for all $L \in \mathcal{L}$ and all texts T for L, M makes at most n mind changes on T. $\mathrm{TxtEx}_n$ is defined as the collection of language classes that can be $\mathrm{TxtEx}_n$-identified (see Case & Smith () for details). Consider the class of languages $\mathcal{L}_n = \{L : \mathrm{card}(L) \leq n\}$. It can be shown that $\mathcal{L}_{n+1} \in \mathrm{TxtEx}_{n+1} - \mathrm{TxtEx}_n$.

Now consider anomalous learning. A class $\mathcal{C}$ is $\mathrm{TxtEx}^a_b$-learnable iff there is a learner that makes at most b mind changes (where b = ∗ denotes that the number of mind changes is finite on each text for a language in the class, but not necessarily bounded by a constant) and whose final hypothesis is allowed to make up to a errors (where a = ∗ denotes finitely many errors). For these learning criteria, we get a two-dimensional hierarchy of what can be learnt. Let $C_n = \{f : \varphi_{f(0)} =^n f\}$. For a total function f, let $L_f = \{\langle x, f(x) \rangle : x \in N\}$, where $\langle \cdot, \cdot \rangle$ denotes a computable pairing function: a bijective mapping from N × N to N. Let $L_C = \{L_f : f \in C\}$. Then, one can show that $L_{C_{n+1}} \in \mathrm{TxtEx}^{n+1}_0 - \mathrm{TxtEx}^n$. Similarly, if we consider the class $S_n = \{f : \mathrm{card}(\{m : f(m) \neq f(m+1)\}) \leq n\}$, then one can show that $L_{S_{n+1}} \in \mathrm{TxtEx}^0_{n+1} - \mathrm{TxtEx}^*_n$ (we refer the reader to Case and Smith () for a proof of the above).
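The mind-change count defined above is easy to compute on a finite prefix of a text. The sketch below does so for a simple learner for classes of finite languages. Several things here are assumptions for illustration: hypotheses are represented directly as finite sets rather than as program indices, texts may contain the pause symbol '#' of the standard text model (which this section does not spell out), and the learner shown is just one natural strategy.

```python
def learner(prefix):
    """Hypothetical learner M: conjecture exactly the elements seen so far.

    Returns '?' on the empty prefix. '#' is treated as a pause symbol
    that carries no data (an assumption of this sketch).
    """
    if not prefix:
        return "?"
    return frozenset(x for x in prefix if x != "#")


def mind_changes(M, text):
    """card({m : ? != M(T[m]) != M(T[m+1])}) over a finite text prefix."""
    hyps = [M(text[:m]) for m in range(len(text) + 1)]
    return sum(
        1
        for m in range(len(text))
        if hyps[m] != "?" and hyps[m] != hyps[m + 1]
    )


# A text prefix for the three-element language {3, 5, 7}.
text = ["#", 3, 3, 5, "#", 7]
changes = mind_changes(learner, text)
print(changes)
```

On this prefix the learner first conjectures the empty language and then revises its conjecture once per new element, which matches the intuition behind the bound of n + 1 mind changes for languages with at most n + 1 elements.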

Data and Time Complexity Wiehagen () considered the complexity of learning in terms of the number of data items needed. Regarding time complexity, one should note the result by Pitt () that any TxtEx-learnable class of languages can be TxtEx-learnt by a learner that has time complexity (with respect to the size of the input) bounded by a linear function. This result is achieved by a delaying trick, where the learner just repeats its old hypothesis unless it has enough time to compute its later hypothesis. This seriously affects what one can say about the time complexity of learning. One proposal, made by Daley and Smith (), is to consider the total time used by the learner until its sequence of hypotheses converges, resulting in a possibly more reasonable measure of time in the complexity of learning.

Iterative and Memory-Bounded Learning

Another measure of complexity of learning can be considered when one restricts how much past data a learner can remember. Wiehagen introduced the concept of iterative learning, in which the learner cannot remember any past data: its new hypothesis is based only on its previous conjecture and the new datum it receives. In other words, there exists a recursive function F such that M(T[n+1]) = F(M(T[n]), T(n)), for all texts T and for all n. Here, M(T[0]) is some fixed value, say the symbol ‘?’, which is used by the learner to denote the absence of a reasonable conjecture. It can be shown that being iterative restricts the learning capacity of learners. For example, let L_e = {2x : x ∈ N} and let ℒ = {L_e} ∪ {S ∪ {2n+1} : n ∈ N, S ⊆ L_e, and max(S) ≤ 2n}; then ℒ can be shown to be TxtEx-learnable but not iteratively learnable.

Memory-bounded learning (see Lange & Zeugmann, ) is an extension of memory-limited learning, where the learner is allowed to memorize up to some fixed number of elements seen in the past. Thus, M is an m-memory-bounded learner if there exists a function mem and two computable functions mF and F such that, for all texts T and all n:

– mem(T[0]) = ∅;
– M(T[n+1]) = F(M(T[n]), mem(T[n]), T(n+1));
– mem(T[n+1]) = mF(M(T[n]), mem(T[n]), T(n+1));
– mem(T[n+1]) − mem(T[n]) ⊆ {T(n+1)};
– card(mem(T[n])) ≤ m.

It can be shown that the criteria of inference based on TxtEx-learning by m-memory-bounded learners form a proper hierarchy.

Besides memorizing some past elements seen, another way to address this issue is by giving feedback to the learner (see Case, Jain, Lange, & Zeugmann, ) on whether some element has appeared in the past data. A feedback learner is an iterative learner which is additionally allowed to query whether certain elements appeared in earlier data. An n-feedback learner is allowed to make n such queries at each stage (when it receives the new input datum). Thus, M is an m-feedback learner if there exist computable functions Q and F such that, for all texts T and all n:

– Q(M(T[n]), T(n)) is defined and is a set of m elements;
– if Q(M(T[n]), T(n)) = (x_1, x_2, . . . , x_m), then M(T[n+1]) = F(M(T[n]), T(n), y_1, y_2, . . . , y_m), where y_i = 1 iff x_i ∈ ctnt(T[n]).

Again, it can be shown that allowing more feedback gives greater learning power, and thus one can get a hierarchy based on the amount of feedback allowed.
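The iterative update M(T[n+1]) = F(M(T[n]), T(n)) can be sketched on a simple class. The sketch below is an illustration under assumptions, not a construction from the entry: the class is the set of languages {x : x ≥ n}, a hypothesis is represented by the number n itself, `None` stands both for '?' and for a pause in the text, and `step` and `run_iterative` are hypothetical names.

```python
from typing import List, Optional, Sequence


def step(hypothesis: Optional[int], datum: Optional[int]) -> Optional[int]:
    """F(M(T[n]), T(n)): uses only the old conjecture and the new datum."""
    if datum is None:          # a pause carries no information
        return hypothesis
    if hypothesis is None:     # first real datum replaces '?'
        return datum
    return min(hypothesis, datum)


def run_iterative(text: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Return the conjectures M(T[1]), M(T[2]), ... on a finite text prefix."""
    hypothesis: Optional[int] = None  # M(T[0]) = '?'
    conjectures = []
    for datum in text:
        hypothesis = step(hypothesis, datum)
        conjectures.append(hypothesis)
    return conjectures


# Prefix of a text for {x : x >= 5}; conjecture n stands for {x : x >= n}.
print(run_iterative([7, 9, None, 5, 8]))
```

The learner never stores past data: the smallest element seen so far is already encoded in its current conjecture, which is why this class is iteratively learnable.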

Complexity of Final Hypothesis Another way to measure the complexity of learning is to consider the complexity or size of the final grammar output by the learner. Freivalds () considered the case when the final program/grammar output by the learner is minimal: that is, there is no smaller index that accepts/generates the same language. He showed that this severely restricts the learning capacity of learners. Not only that, the learning capacity depends on the acceptable programming system chosen, unlike the case for most other criteria of learning such as TxtEx or TxtBc, which are independent of the acceptable programming system chosen. In particular, there are acceptable programming systems in which only classes containing finitely many infinite languages can be learnt using minimal final grammars (see Freivalds, ; Jain and Sharma, ). Chen () considered a modification of such a paradigm where one considers convergence to nearly minimal grammars rather than minimal. That is, instead of requiring that the final grammars are minimal, one requires that they are within a recursive function h of minimal. Here h may depend on the class being learnt. Chen showed that this allows the criteria of minimal learnability to be independent of the acceptable programming system chosen. However, one can show that some simple classes are not minimally learnable. An example of such a class is the class ℒ_C which is derived from C = {f : ∀^∞ x [f(x) = 0]}, the class of all functions which are almost everywhere 0.

Intrinsic Complexity Another way to consider complexity of learning is to consider relative complexity in a way similar to how one considers Turing reductions in computability theory. Such a notion is called intrinsic complexity of the class. This was first considered by Freivalds et al. () for function learning. Jain and Sharma () considered it for language learning, and the following discussion is from there. An enumeration operator (see Rogers, ), Θ, is an algorithmic mapping from SEQ into SEQ such that the following two conditions are satisfied:

– for all σ, τ ∈ SEQ, if σ ⊆ τ, then Θ(σ) ⊆ Θ(τ);
– for all texts T, lim_{n→∞} |Θ(T[n])| = ∞.

By extension, we think of Θ as also mapping texts to texts such that Θ(T) = ⋃_n Θ(T[n]). Furthermore, we define Θ(L) = {ctnt(Θ(T)) : T is a text for L}. Intuitively, Θ(L) denotes the set of languages to whose texts Θ maps texts of L. The reader should note the overloading of this notation because the type of the argument to Θ could be a sequence, a text or a language. One says that a sequence of grammars g_0, g_1, . . . is an acceptable TxtEx-sequence for L if the sequence of grammars converges to a grammar for L.

ℒ_1 ≤_weak ℒ_2 iff there are two operators Θ and Ψ such that for all L ∈ ℒ_1, for all texts T for L, Θ(T) is a text for some L′ ∈ ℒ_2 such that if g_0, g_1, . . . is an acceptable TxtEx-sequence for L′ then Ψ(g_0, g_1, . . .) is an acceptable TxtEx-sequence for L. Note that different texts for the same language L may be mapped by Θ to texts for different languages in ℒ_2 above. If we require that different texts for L are mapped to texts for the same language L′ in ℒ_2, then we get a stronger notion of reduction called strong reduction: ℒ_1 ≤_strong ℒ_2 iff ℒ_1 ≤_weak ℒ_2 and for all L ∈ ℒ_1, Θ(L) contains only one language, where Θ is as in the definition for ≤_weak reduction.

It can be shown that FIN is a complete class for TxtEx-identification with respect to ≤_weak reduction (see Jain & Sharma, ). Interestingly, it was shown that the class of pattern languages (Angluin, ), the class SD = {L : W_{min(L)} = L} and the class COINIT = {{x : x ≥ n} : n ∈ N} are all equivalent under ≤_strong. Let code be a bijective mapping from non-negative rational numbers to natural numbers. Then, one can show that the class RINIT = {{code(x) : 0 ≤ x ≤ r, x is a rational number} : 0 ≤ r ≤ 1, r is a rational number} is ≤_strong complete for TxtEx (see Jain, Kinber, & Wiehagen, ). Interestingly, every finite directed acyclic graph can be embedded into the ≤_strong degree structure (Jain & Sharma, ). On the other hand, the degree structure is non-dense in the sense that there exist classes ℒ_1 and ℒ_2, one strictly below the other, with no class strictly between them.

Year <
    Attribute A = true : Market Rising
    Attribute A = false : Market Falling
Year ≥
    Attribute B = true : Market Rising
    Attribute B = false : Market Falling

This tree contains embedded knowledge about two intervals of time: in one of which, –, attribute A is predictive; in the other, onward, attribute B is predictive. As time (in this case, year) is a monotonically increasing attribute, future classification using this decision tree will only use attribute B. If this domain can be expected to have recurring hidden context, information about the prior interval of time could be valuable. The decision tree in the example above contains information about changes in context. We define context as:

▸ Context is any attribute whose values are largely independent but tend to be stable over contiguous intervals of another attribute known as the environmental attribute.

The ability of decision trees to capture context is associated with the fact that decision tree algorithms use a form of context-sensitive feature selection (CSFS). A number of machine learning algorithms can be regarded as using CSFS including decision tree algorithms (Quinlan, ), 7rule induction algorithms (Clark & Niblett, ), and 7ILP systems (Quinlan, ). All of these systems produce concepts containing local information about context. When contiguous intervals of time reflect a hidden attribute or context, we call time the environmental attribute. The environmental attribute is not restricted to time alone as it could be any ordinal attribute over which instances of a hidden context are liable to be contiguous. There is also no restriction, in principle, to one dimension. Some alternatives to time as environmental attributes are dimensions of space, and space–time combinations.

Given an environmental attribute, we can utilize a CSFS machine learning algorithm to gain information on likely hidden changes in context. The accuracy of the change points found will be dependent upon at least hidden context duration, the number of different contexts, the complexity of each local concept, and noise. The CSFS identified context change points can be expected to contain errors of the following types:

1. 7Noise or serial correlation errors. These would take the form of additional incorrect change points.
2. Errors due to the repetition of tests on time in different parts of the concept. These would take the form of a group of values clustered around the actual point where the context changed.
3. Errors of omission, change points that are missed altogether.

The initial set of identified context changes can be refined by contextual 7clustering. This process combines similar intervals of the dataset, where the similarity of two intervals is based upon the degree to which a partial model is accurate on both intervals.

Recent Advances With the increasing amount of data being generated by organizations, recent work on concept drift has focused on mining from high-volume 7data streams (Hulten, Spencer, & Domingos, ; Wang, Fan, Yu, & Han, ; Kolter & Maloof, ; Mierswa, Wurst, Klinkenberg, Scholz, & Euler, ; Chu & Zaniolo, ; Gaber, Zaslavsky, & Krishnaswamy, ). Methods such as that of Hulten et al. combine decision tree learning with incremental methods for efficient updates, thus avoiding relearning large decision trees. Kolter and Maloof also use incremental methods, combined in an 7ensemble.


Cross References 7Decision Trees 7Ensemble Methods 7Incremental Learning 7Inductive Logic Programming 7Lazy Learning

Recommended Reading Aha, D. W., Kibler, D., & Albert, M. K. (). Instance-based learning algorithms. Machine Learning, , –. Chu, F., & Zaniolo, C. (). Fast and light boosting for adaptive mining of data streams. In Advances in knowledge discovery and data mining. Lecture notes in computer science (Vol. , pp. –). Springer. Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, , –. Clearwater, S., Cheng, T.-P., & Hirsh, H. (). Incremental batch learning. In Proceedings of the sixth international workshop on machine learning (pp. –). Morgan Kaufmann. Domingos, P. (). Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, , –. [Aha, D. (Ed.). Special issue on lazy learning.] Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (). Mining data streams: A review. SIGMOD Rec., (), –. Harries, M., & Horn, K. (). Learning stable concepts in domains with hidden changes in context. In M. Kubat & G. Widmer (Eds.), Learning in context-sensitive domains (workshop notes). th international conference on machine learning, Bari, Italy. Harries, M. B., Sammut, C., & Horn, K. (). Extracting hidden context. Machine Learning, (), –. Hulten, G., Spencer, L., & Domingos, P. (). Mining timechanging data streams. In KDD ’: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM. Kilander, F., & Jansson, C. G. (). COBBIT – A control procedure for COBWEB in the presence of concept drift. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Kolter, J. Z., & Maloof, M. A. (). Dynamic weighted majority: A new ensemble method for tracking concept drift. In Third IEEE international conference on data mining ICDM- (pp. –). IEEE CS Press. Kubat, M. (). Floating approximation in time-varying knowledge bases. Pattern Recognition Letters, , –. Kubat, M. (). A machine learning based approach to load balancing in computer networks. 
Cybernetics and Systems Journal. Kubat, M. (). Second tier for decision trees. In Machine learning: Proceedings of the th international conference (pp. –). California: Morgan Kaufmann. Kubat, M., & Widmer, G. (). Adapting to drift in continuous domains. In Proceedings of the eighth European conference on machine learning (pp. –). Berlin: Springer. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T. (). Yale: Rapid prototyping for complex data mining tasks. In KDD ’: Proceedings of the th ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM.


Quinlan, J. R. (). Learning logical definitions from relations. Machine Learning, , –. Quinlan, J. R. (). C4.5: Programs for machine learning. Morgan Kaufmann: San Mateo. Salganicoff, M. (). Density adaptive learning and forgetting. In Machine learning: Proceedings of the tenth international conference (pp. –). San Mateo: Morgan Kaufmann. Schlimmer, J. C., & Granger, R. I., Jr. (a). Beyond incremental processing: Tracking concept drift. In Proceedings AAAI- (pp. –). Los Altos: Morgan Kaufmann. Schlimmer, J., & Granger, R., Jr. (b). Incremental learning from noisy data. Machine Learning, (), –. Turney, P. D. (a). Exploiting context when learning to classify. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Turney, P. D. (b). Robust classification with context sensitive features. Paper presented at the industrial and engineering applications of artificial intelligence and expert systems. Turney, P., & Halasz, M. (). Contextual normalization applied to aircraft gas turbine engine diagnosis. Journal of Applied Intelligence, , –. Wang, H., Fan, W., Yu, P. S., & Han, J. (). Mining concept-drifting data streams using ensemble classifiers. In KDD ’: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM. Widmer, G. (). Recognition and exploitation of contextual clues via incremental meta-learning. In L. Saitta (Ed.), Machine learning: Proceedings of the th international workshop (pp. –). San Francisco: Morgan Kaufmann. Widmer, G., & Kubat, M. (). Effective learning in dynamic environments by explicit concept tracking. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Widmer, G., & Kubat, M. (). Learning in the presence of concept drift and hidden contexts. Machine Learning, , –.

Concept Learning Claude Sammut The University of New South Wales, Sydney, NSW, Australia

Synonyms Categorization; Classification learning

Definition The term concept learning originated in psychology, where it refers to the human ability to learn categories for objects and to recognize new instances of those categories. In machine learning, concept learning is more formally defined as “inferring a boolean-valued function from training examples of its inputs and outputs” (Mitchell, ).

Background Bruner, Goodnow, and Austin () published their book A Study of Thinking, which became a landmark in psychology and would later have a major impact on machine learning. The experiments reported by Bruner, Goodnow, and Austin were directed toward understanding a human’s ability to categorize and how categories are learned. ▸ We begin with what seems a paradox. The world of experience of any normal man is composed of a tremendous array of discriminably different objects, events, people, impressions. . . But were we to utilize fully our capacity for registering the differences in things and to respond to each event encountered as unique, we would soon be overwhelmed by the complexity of our environment. . . The resolution of this seeming paradox. . . is achieved by man’s capacity to categorize. To categorize is to render discriminably different things equivalent, to group objects and events and people around us into classes. . . The process of categorizing involves. . . an act of invention. . . If we have learned the class “house” as a concept, new exemplars can be readily recognised. The category becomes a tool for further use. The learning and utilization of categories represents one of the most elementary and general forms of cognition by which man adjusts to his environment.

The first question that they had to deal with was that of representation: what is a concept? They assumed that objects and events could be described by a set of attributes and were concerned with how inferences could be drawn from attributes to class membership. Categories were considered to be of three types: conjunctive, disjunctive, and relational. ▸ . . .when one learns to categorize a subset of events in a certain way, one is doing more than simply learning to recognise instances encountered. One is also learning a rule that may be applied to new instances. The concept or category is basically, this “rule of grouping” and it is such rules that one constructs in forming and attaining concepts.

The notion of a rule as an abstract representation of a concept influenced research in machine learning. For example, 7decision tree learning was used as a means of creating a cognitive model of concept learning (Hunt, Marin, & Stone, ). This model later inspired Quinlan’s development of ID3 (Quinlan, ).

The learning experience may be in the form of examples from a trainer or the results of trial and error. In either case, the program must be able to represent its observations of the world, and it must also be able to represent hypotheses about the patterns it may find in those observations. Thus, we will often refer to the 7observation language and the 7hypothesis language. The observation language describes the inputs and outputs of the program and the hypothesis language describes the internal state of the learning program, which corresponds to its theory of the concepts or patterns that exist in the data.

The input to a learning program consists of descriptions of objects from the universe and, in the case of 7supervised learning, an output value associated with the example. The universe can be an abstract one, such as the set of all natural numbers, or the universe may be a subset of the real world. No matter which method of representation we choose, descriptions of objects in the real world must ultimately rely on measurements of some properties of those objects. These may be physical properties such as size, weight, and color or they may be defined for objects, for example, the length of time a person has been employed for the purpose of approving a loan. The accuracy and reliability of a learned concept depends on the accuracy and reliability of the measurements.

A program is limited in the concepts that it can learn by the representational capabilities of both observation and hypothesis languages. For example, if an attribute/value list is used to represent examples for an induction program, the measurement of certain attributes and not others clearly places bounds on the kinds of patterns that the learner can find. The learner is said to be biased by its observation language (see 7Language Bias). The hypothesis language also places constraints on what may and may not be learned. For example, in the language of attributes and values, relationships between objects are difficult to represent, whereas a more expressive language, such as first-order logic, can easily be used to describe relationships.

Unfortunately, representational power comes at a price. Learning can be viewed as a search through the space of all sentences in a language for a sentence that best describes the data. The richer the language, the larger is the search space. When the search space is small, it is possible to use “brute force” search methods. If the search space is very large, additional knowledge is required to reduce the search.

Rules, Relations, and Background Knowledge In the early 1960s, there was no discipline called “machine learning.” Instead, learning was considered to be part of “pattern recognition,” which had not yet split from AI. One of the main problems addressed at that time was how to represent patterns so that they could be recognized easily. Symbolic description languages were developed to be expressive and learnable. Banerji (, ) first devised a language, which he called a “description list,” which utilized an object’s attributes to perform pattern recognition. Pennypacker, a master’s student of Banerji at the Case Institute of Technology, implemented the recognition procedure and also used Bruner, Goodnow, and Austin’s Conservative Focussing Strategy to learn conjunctive concepts (Pennypacker, ). Bruner, Goodnow, and Austin describe the strategy as follows: ▸ . . . this strategy may be described as finding a positive instance to serve as a focus, then making a sequence of choices each of which alters but one attribute value [of the focus] and testing to see whether the change yields a positive or negative instance. Those attributes of the focus which, when changed, still yield positive instance are not part of the concept. Those attributes of the focus that yield negative instances when changed are features of the concept.
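The quoted strategy can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Pennypacker's implementation: the target is a noise-free conjunction of attribute = value tests, an oracle `is_positive` answers membership queries, and the names `conservative_focusing`, `domains`, and the toy attributes are all hypothetical.

```python
from typing import Callable, Dict, List

Instance = Dict[str, str]


def conservative_focusing(
    focus: Instance,
    domains: Dict[str, List[str]],
    is_positive: Callable[[Instance], bool],
) -> Instance:
    """Return the attribute tests of the positive focus that define the concept."""
    concept: Instance = {}
    for attribute, value in focus.items():
        # Alter exactly one attribute value of the focus.
        other = next(v for v in domains[attribute] if v != value)
        probe = dict(focus)
        probe[attribute] = other
        # A negative probe means the altered attribute is a feature of the concept.
        if not is_positive(probe):
            concept[attribute] = value
    return concept


domains = {
    "colour": ["red", "green"],
    "shape": ["circle", "square"],
    "size": ["small", "large"],
}


def oracle(x: Instance) -> bool:
    """Hypothetical target concept: colour = red AND shape = circle."""
    return x["colour"] == "red" and x["shape"] == "circle"


focus = {"colour": "red", "shape": "circle", "size": "small"}
print(conservative_focusing(focus, domains, oracle))
```

One probe per attribute suffices here only because the target is assumed to be a pure conjunction of equality tests; with disjunctive concepts or noisy answers the same procedure can return a wrong description.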

The strategy is only capable of learning conjunctive concepts, that is, the concept description can only consist of a simple conjunction of tests on attribute values. Recognizing the limitations of simple attribute/value representations, Banerji () introduced the use of predicate logic as a description language. Thus, Banerji was one of the earliest advocates of what would, many years later, become Inductive Logic Programming.

In the 1970s, a series of algorithms emerged that developed concept learning further. Winston’s ARCH program (Winston, ) was influential as one of the first widely known concept learning programs. Michalski (, ) devised the Aq family of learning algorithms that set some of the early benchmarks for learning programs. Early relational learning programs were developed by Hayes-Roth (), Hayes-Roth and McDermott (), and Vere (, ).

Banerji emphasized the importance of a description language that could “grow.” That is, its descriptive power should increase as new concepts are learned. These concepts become background knowledge for future learning. A simple example from Banerji () illustrates the use of background knowledge. There is a language for describing instances of a concept and another for describing concepts. Suppose we wish to represent the binary number, , by a left-recursive binary tree of digits “0” and “1”:

[head : [head : ; tail : nil]; tail : ]

“head” and “tail” are the names of attributes. Their values follow the colon. The concepts of binary digit and binary number are defined as

x ∈ digit ≡ x = 0 ∨ x = 1
x ∈ num ≡ (tail(x) ∈ digit ∧ head(x) = nil) ∨ (tail(x) ∈ digit ∧ head(x) ∈ num)

Thus, an object belongs to a particular class or concept if it satisfies the logical expression in the body of the description. Note that the concept above is disjunctive. Predicates in the expression may test the membership of an object in a previously learned concept and can express relations between objects. Cohen and Sammut () devised a learning system based on Banerji’s ideas of a growing concept description language and this was further extended by Sammut and Banerji ().
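The two concept definitions above translate directly into recursive membership tests. The sketch below assumes a representation that is not in the entry: an object is a Python dict with the keys "head" and "tail", nil is `None`, and the binary digits are the integers 0 and 1. The names `is_digit` and `is_num` are hypothetical.

```python
from typing import Any


def is_digit(x: Any) -> bool:
    """x ∈ digit ≡ x = 0 ∨ x = 1."""
    return isinstance(x, int) and x in (0, 1)


def is_num(x: Any) -> bool:
    """x ∈ num ≡ tail(x) ∈ digit ∧ (head(x) = nil ∨ head(x) ∈ num)."""
    if not isinstance(x, dict):
        return False
    head, tail = x.get("head"), x.get("tail")
    # 'num' reuses the previously defined concept 'digit' as background knowledge.
    return is_digit(tail) and (head is None or is_num(head))


# An object built to satisfy the definition: the last digit sits in 'tail'
# and the remaining digits form a smaller number in 'head'.
two_digits = {"head": {"head": None, "tail": 1}, "tail": 0}
print(is_num(two_digits))
print(is_num({"head": None, "tail": 2}))
```

The recursion in `is_num` mirrors the second disjunct of the definition, and the call to `is_digit` shows how an already learned concept becomes a building block for a new one.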

Concept Learning and Noise One of the most severe drawbacks of early concept learning systems was that they assumed that data sets were not noisy. That is, all attribute values and class labels in the training data were assumed to be correct. This is unrealistic in most real applications. Thus, concept learning systems began incorporating statistical measures to minimize the effects of noise and to estimate error rates (Breiman, Friedman, Olshen, & Stone, ; Cohen, ; Quinlan, , ). Learning to classify objects from training examples has gone on to become one of the central themes of machine learning research. As the robustness of classification systems has increased, they have found many applications, particularly in data mining but also in a broad range of other areas.

Cross References 7Data Mining 7Decision Tree Learning 7Inductive Logic Programming 7Learning as Search 7Relational Learning 7Rule Learning

Recommended Reading Banerji, R. B. (). An information processing program for object recognition. General Systems, , –. Banerji, R. B. (). The description list of concepts. Communications of the Association for Computing Machinery, (), –. Banerji, R. B. (). A Language for the Description of Concepts. General Systems, , –. Banerji, R. B. (). Artificial intelligence: A theoretical approach. New York: North Holland. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (). Classification and regression trees. Belmont, CA: Wadsworth. Bruner, J. S., Goodnow, J. J., & Austin, G. A. (). A study of thinking. New York: Wiley. Cohen, B. L., & Sammut, C. A. (). Object recognition and concept learning with CONFUCIUS. Pattern Recognition Journal, (), –. Cohen, W. W. (). In fast effective rule induction. In Proceedings of the twelfth international conference on machine learning, Lake Tahoe, California. Menlo Park: Morgan Kaufmann. Hayes-Roth, F. (). A structural approach to pattern learning and the acquisition of classificatory power. In First international joint conference on pattern recognition (pp. –). Washington, D.C. Hayes-Roth, F., & McDermott, J. (). Knowledge acquisition from structural descriptions. In Fifth international joint conference on artificial intelligence (pp. –). Cambridge, MA.

Hunt, E. B., Marin, J., & Stone, P. J. (). Experiments in induction. New York: Academic. Michalski, R. S. (). Discovering classification rules using variable valued logic system VL. In Third international joint conference on artificial intelligence (pp. –). Stanford, CA. Michalski, R. S. (). A theory and methodology of inductive learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto: Tioga. Mitchell, T. M. (). Machine learning. New York: McGraw-Hill. Pennypacker, J. C. (). An elementary information processor for object recognition. SRC No. -I--. Case Institute of Technology. Quinlan, J. R. (). Learning efficient classification procedures and their application to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto: Tioga. Quinlan, J. R. (). The effect of noise on concept learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. ). Los Altos: Morgan Kaufmann. Quinlan, J. R. (). C.: Programs for machine learning. San Mateo, CA: Morgan Kaufmann. Sammut, C. A., & Banerji, R. B. (). Learning concepts by asking questions. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. , pp. –). Los Altos, CA: Morgan-Kaufmann. Vere, S. (). Induction of concepts in the predicate calculus. In Fourth international joint conference on artificial intelligence (pp. –). Tbilisi, Georgia, USSR. Vere, S. A. (). Induction of relational productions in the presence of background information. In Fifth international joint conference on artificial intelligence. Cambridge, MA. Winston, P. H. (). Learning structural descriptions from examples. Unpublished PhD Thesis, MIT Artificial Intelligence Laboratory.

Conditional Random Field A Conditional Random Field is a form of 7Graphical Model for segmenting and 7classifying sequential data. It is the 7discriminative learning counterpart to the 7generative learning Markov Chain model.

Recommended Reading Lafferty, J., McCallum, A., & Pereira, F. (). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the th international conference on machine learning (pp. –). San Francisco, Morgan Kaufmann.

Confirmation Theory

The branch of philosophy concerned with how (and indeed whether) evidence can confirm a hypothesis, even though typically it does not entail it. A distinction is sometimes drawn between total confirmation (how well confirmed a hypothesis is, given all available evidence) and weight-of-evidence (the amount of extra confirmation added to the total confirmation of a hypothesis by a particular piece of evidence). Confirmation is often measured by the probability of a hypothesis conditional on evidence.

Confusion Matrix Kai Ming Ting Monash University, Australia

Definition A confusion matrix summarizes the classification performance of a 7classifier with respect to some 7test data. It is a two-dimensional matrix, indexed in one dimension by the true class of an object and in the other by the class that the classifier assigns. Table 1 presents an example of a confusion matrix for a three-class classification task, with the classes A, B, and C. The first row of the matrix describes the objects that belong to class A: it gives the number correctly classified as belonging to A, and shows that two are misclassified as belonging to B and one as belonging to C.

Confusion Matrix. Table 1 An example of three-class confusion matrix (rows: actual class A, B, C; columns: assigned class A, B, C)

A special case of the confusion matrix is often utilized with two classes, one designated the positive class and the other the negative class. In this context, the four cells of the matrix are designated as 7true positives (TP), 7false positives (FP), 7true negatives (TN), and 7false negatives (FN), as indicated in Table 2.

Confusion Matrix. Table 2 The outcomes of classification into positive and negative classes

                    Assigned Class
Actual Class        Positive    Negative
Positive            TP          FN
Negative            FP          TN

A number of measures of classification performance are defined in terms of these four classification outcomes.

7Specificity = 7True negative rate = TN/(TN + FP)
7Sensitivity = 7True positive rate = 7Recall = TP/(TP + FN)
7Positive predictive value = 7Precision = TP/(TP + FP)
7Negative predictive value = TN/(TN + FN)
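The four measures can be computed directly from the cell counts. The following sketch is illustrative only: the class `BinaryConfusion`, the helper `tally`, and the toy labels are assumptions introduced here, and the code does not guard against a zero denominator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryConfusion:
    tp: int  # actual positive, assigned positive
    fn: int  # actual positive, assigned negative
    fp: int  # actual negative, assigned positive
    tn: int  # actual negative, assigned negative

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def sensitivity(self) -> float:  # also called recall
        return self.tp / (self.tp + self.fn)

    def precision(self) -> float:  # positive predictive value
        return self.tp / (self.tp + self.fp)

    def negative_predictive_value(self) -> float:
        return self.tn / (self.tn + self.fn)


def tally(actual: Sequence[int], assigned: Sequence[int]) -> BinaryConfusion:
    """Count the four outcomes; label 1 is the positive class, 0 the negative."""
    pairs = list(zip(actual, assigned))
    return BinaryConfusion(
        tp=sum(1 for a, p in pairs if a == 1 and p == 1),
        fn=sum(1 for a, p in pairs if a == 1 and p == 0),
        fp=sum(1 for a, p in pairs if a == 0 and p == 1),
        tn=sum(1 for a, p in pairs if a == 0 and p == 0),
    )


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
assigned = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
matrix = tally(actual, assigned)
print(matrix, matrix.sensitivity(), matrix.specificity())
```

With these toy labels the counts are TP = 3, FN = 1, FP = 1, and TN = 5, so sensitivity and precision are both 0.75, while specificity and negative predictive value are both 5/6.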

Conjunctive Normal Form Bernhard Pfahringer University of Waikato, Hamilton, New Zealand

Conjunctive normal form (CNF) is an important normal form for propositional logic. A logic formula is in conjunctive normal form if it is a single conjunction of disjunctions of (possibly negated) literals. No further nesting and no other negations are allowed. Examples are:

a
¬b
a ∧ b
(a ∨ ¬b) ∧ (c ∨ d)
¬a ∧ (b ∨ ¬c ∨ d) ∧ (a ∨ ¬d)

Any arbitrary formula in propositional logic can be transformed into conjunctive normal form by application of the laws of distribution, De Morgan’s laws, and by removing double negations. It is important to note that this process can lead to exponentially larger formulas, which implies that the process in the worst case runs in exponential time. An example of this behavior is the following formula given in 7disjunctive normal form (DNF), which is linear in the number of propositional variables in this form. When transformed into conjunctive normal form (CNF), its size is exponentially larger.

DNF: (a_0 ∧ a_1) ∨ (a_2 ∧ a_3) ∨ . . . ∨ (a_{2n} ∧ a_{2n+1})
CNF: (a_0 ∨ a_2 ∨ . . . ∨ a_{2n}) ∧ (a_1 ∨ a_2 ∨ . . . ∨ a_{2n}) ∧ . . . ∧ (a_1 ∨ a_3 ∨ . . . ∨ a_{2n+1})
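The blow-up can be observed by applying the distribution law mechanically. The sketch below makes simplifying assumptions that are not in the entry: literals are plain strings, the input contains no negations, and no simplification of clauses is attempted. The function names `paired_dnf` and `dnf_to_cnf` are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Set, Tuple


def paired_dnf(pairs: int) -> List[Tuple[str, str]]:
    """Build (a0 ∧ a1) ∨ (a2 ∧ a3) ∨ ... with the given number of conjunctions."""
    return [(f"a{2 * i}", f"a{2 * i + 1}") for i in range(pairs)]


def dnf_to_cnf(terms: List[Tuple[str, ...]]) -> Set[FrozenSet[str]]:
    """Distribute ∨ over ∧: each clause picks one literal from every conjunction."""
    return {frozenset(choice) for choice in product(*terms)}


for pairs in (1, 2, 3, 10):
    dnf = paired_dnf(pairs)
    cnf = dnf_to_cnf(dnf)
    print(pairs, "conjunctions ->", len(cnf), "clauses")
```

A DNF with k two-literal conjunctions yields 2^k clauses under this transformation, so the ten-conjunction formula with 20 variables already produces 1,024 clauses.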

Recommended Reading Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach (p. ). Prentice Hall

Connection Strength
See Weight

Connections Between Inductive Inference and Machine Learning John Case , Sanjay Jain University of Delaware, Newark, USA National University of Singapore, Singapore, Republic of Singapore

Definition
Inductive inference is a theoretical framework for modeling learning in the limit. Here we discuss some results in inductive inference that are relevant to the machine learning community. The mathematical/theoretical area called Inductive Inference, also known as computability theoretic learning and learning in the limit (Jain, Osherson, Royer, & Sharma, ; Odifreddi, ), typically, but, as will be seen below, not always, involves a situation depicted in () just below.

Data d_0, d_1, d_2, . . .  --In-->  M  --Out-->  Programs e_0, e_1, e_2, . . .    ()

Let N = the set of nonnegative integers. Strings, including program strings, computer reals, and other data structures, inside computers, are finite bit strings and, hence, can be coded into N. Therefore, mathematically at least, it is without loss of mathematical generality that we sometimes use the data type N where standard practice would use a different type. In (), d_0, d_1, d_2, . . . can be, e.g., the successive values of a function f : N → N or the elements of a (formal) language L ⊆ N in some order; M is a machine; the e_i’s are from some hypothesis space of programs; and, for M’s successful learning, later e_i’s exactly or approximately compute the f or L. Such learning is off-line: in successful cases, one comes away with programs for past and future data. For the related problem of online extrapolation of next values for a function f, suitable e_i’s may be the values of f(i)’s based on having seen strictly prior values of f.

Detail
We will discuss the off-line case until we say otherwise. It is typical in applied machine learning to present to a learner whatever data one has and to obtain one corresponding program, hopefully for predicting these data and future data. In inductive inference the case where only one program is output is called one-shot learning. More typically, in inductive inference, one allows for mind-changes, i.e., for a succession of output programs, as one receives successively more input data, with the later programs hopefully eventually being useful for predictions. Typically, one does not get success on one’s first conjecture/output program, but rather, one may achieve success eventually, or, as it is said, in the limit after some sequence of trial and error. It is helpful at this juncture to present a problem for which this latter approach makes more sense than the one-shot approach.

We will consider some different criteria of successful learning of f or L by M. For example, Ex-style criteria will require that all but finitely many of the e_i’s are syntactically the same and do a reasonable job of computing the f or L. Bc-style criteria are more relaxed, more powerful, but less useful (Bārzdiņš, ; Case & Lynes, ; Case & Smith, ): they do not require that almost all e_i’s be the same syntactically.

Here is a well-known regression technique from, e.g., Hildebrand (), for exactly “curve-fitting” polynomials. It is the method involving calculating forward differences. We express it as a learning machine M and illustrate with its being fed an example data sequence generated by a cubic polynomial

x − x + x + .    ()

See Hildebrand () for how to recover the polynomials themselves. M, fed a finite data sequence of natural numbers, first looks for iterated forward differences to become (apparently) constant, then outputs a rule/program, which uses the (apparent) constant to extrapolate the data sequence for any desired prediction. For example, were M given the data sequence in the top row of Table , it would calculate to be the apparent constant after three differencings, so M then outputs the following informal rule/program.

▸ To generate the level sequence, at level , start with ; at level , start with ; at level , start with ; add the apparent constant from level to get successive level data items; add successive level items to get successive level data items; finally, add successive level items to get as many successive level data items as needed for prediction.

This program, eventually output by M when its input is the whole top row of Table , correctly predicts the elements of the cubic polynomial, on successive values in N – the whole sequence , , , , , , , . . . . Along the way, though, just after the first data point, M thinks the apparent constant is ; just after the second that it is ; just after the third that it is ; and only after more of the data points does it converge for this cubic polynomial to the apparent (and, on this example, actual) constant . In general, M, on a polynomial of degree m, changes its mind up to m times until converging to its final program (of course, on f(x) = 2^x, M never converges, and each level of forward differences is just the sequence f again). Hence, M above Ex-learns, e.g., the integer polynomials f : N → N, but it does not in general one-shot learn these polynomials, since the data alone do not disclose the degree of a generating polynomial.

In this entry we survey some results from inductive inference but with an eye to topics having something to say regarding or to applied machine learning. In some cases, the theoretical results lend mathematical support to preexisting empirical observations about the efficacy of known machine learning techniques. In other cases, the theoretical results provide some, typically abstract, suggestions for the machine learning practitioner. In some of these cases, some of the suggestions apparently pay off; in others, intriguingly, we do not know yet.

Connections Between Inductive Inference and Machine Learning. Table  Example Sequence and Its Iterated Forward Differences
Rows: Sequence; 1st Diffs; 2nd Diffs; 3rd Diffs
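The forward-difference learner M described in this section can be sketched in a few lines. This is an illustrative reading of the description, not the authors' formulation: the function names are assumptions, and the cubic x^3 + x + 1 is a made-up stand-in, since the entry's own example polynomial is not legible in this copy.

```python
def forward_differences(seq):
    """One level of forward differences: seq[i + 1] - seq[i]."""
    return [b - a for a, b in zip(seq, seq[1:])]


def learn(data):
    """Return M's current rule: (first item of each level, apparent constant).

    Differencing stops at the first level whose items are all equal; a level
    holding a single item counts as (apparently) constant.
    """
    levels = [list(data)]
    while len(set(levels[-1])) > 1:
        levels.append(forward_differences(levels[-1]))
    starts = [level[0] for level in levels[:-1]]
    return starts, levels[-1][0]


def predict(rule, n):
    """Run the rule to produce the first n items of the level-0 sequence."""
    starts, constant = rule
    seq = [constant] * n  # deepest level: the apparent constant, repeated
    for start in reversed(starts):
        out = [start]
        for diff in seq[: n - 1]:
            out.append(out[-1] + diff)
        seq = out
    return seq


def f(x):
    # Hypothetical cubic chosen only for this illustration.
    return x**3 + x + 1


data = [f(x) for x in range(8)]
rules = [learn(data[:k]) for k in range(1, len(data) + 1)]
mind_changes = sum(1 for a, b in zip(rules, rules[1:]) if a != b)
print("mind changes:", mind_changes)
print("prediction from final rule:", predict(rules[-1], 10))
```

On this cubic the rule changes after the second, third, and fourth data points and is stable afterwards, which matches the bound of m mind-changes for a polynomial of degree m. Feeding the sketch values of 2^x instead makes the rule change with every new data point.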

Multi-Task or Context Sensitive Learning
In empirical, applied machine learning, multitask or context sensitive learning involves trying to learn Y by first (de Garis, a, b; Fahlman, ; Thrun, ; Thrun & Sullivan, ; Tsung & Cottrell, ; Waibel, a, b) or simultaneously (Caruana, , ; Dietterich, Hild, & Bakiri, ; Matwin & Kubat, ; Mitchell, Caruana, Freitag, McDermott, & Zabowski, ; Pratt, Mostow, & Kamm, ; Sejnowski & Rosenberg, ; Bartlmae, Gutjahr, & Nakhaeizadeh, ) trying to learn also X, even in cases where there may be no inherent interest in learning X (see also Transfer Learning). There is, in many cases, an apparent empirical advantage in doing this for some X, Y. It can happen that Y is not apparently or easily learnable by itself, but is learnable if one learns X first or simultaneously. In some cases X itself can be a sequence of tasks X_1, . . . , X_n. Here the X_i’s may need to be learned sequentially or simultaneously to learn Y. For example, to teach a robot to drive a car, it is useful to train it also to predict the center of the road markings (see, e.g., Baluja & Pomerleau, ; Caruana, ). For another example: an experimental system to predict the value of German Daimler stock performed better when it was modified to track simultaneously the German stock-index DAX (Bartlmae et al., ). The value of the Daimler stock here was the primary or target concept and the value of the DAX, a related concept, provided useful auxiliary context.

Angluin, Gasarch, and Smith () show mathematically that, in effect, there are (mathematical) learning scenarios for which it is provable that Y cannot be learned without learning X first and, in other scenarios (Angluin et al., ; Kinber, Smith, Velauthapillai, & Wiehagen, ), that Y cannot be learned without simultaneously learning X. These mathematical results provide a kind of evidence that the empirical observations as to the apparent usefulness of multitask or context sensitive learning may not be illusory, luck, or a mere accident of happening to use some data sets but not others.

For illustration, here is a particularly simple theoretical example needing to be learned simultaneously and similar to examples in Angluin et al. (). Let R be the set of all computable functions mapping N to N. We use numerical names in N for programs. Let

S = {(f, g) ∈ R × R ∣ f(0) is a program for g ∧ g(0) is a program for f}.    ()

We say (p, q) is a program for (f, g) ∈ R × R iff p is a program for f and q is a program for g. Consider a machine M which, if, as in (), M is fed d_0, d_1, . . ., but where each d_i is (f(i), g(i)), then M outputs each e_i = (g(0), f(0)). Clearly, M one-shot learns S. It can be easily shown that the component f’s and g’s for (f, g) ∈ S are not separately even Bc-learnable. It is important to note that, perhaps quite unlike real-world problems, the definition of this example S employs a simple self-referential coding trick: useful programs are coded into values of the functions at argument zero. A number of inductive inference results have been proved by means of (sometimes more complicated) self-referential coding tricks (see, e.g., Case, ). Bārzdiņš indirectly (see Zeugmann, ) provided a kind of informal robustness idea in his attempt to be rid of such coding tricks in inductive inference.
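The coding trick behind S can be made concrete with a toy model in which "programs" are integer keys into a table of Python functions. Everything named here (the table `PROGRAMS`, the keys 7 and 8, the particular functions) is a hypothetical stand-in for a real programming system, chosen only to show why the paired learner succeeds after a single datum.

```python
from typing import Callable, Dict, Iterable, Tuple

# Toy "programming system": a numerical name maps to the function it computes.
PROGRAMS: Dict[int, Callable[[int], int]] = {
    7: lambda x: 8 if x == 0 else x * x,  # f: f(0) = 8 names a program for g
    8: lambda x: 7 if x == 0 else x + 1,  # g: g(0) = 7 names a program for f
}


def one_shot_learner(data: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Read the first datum (f(0), g(0)) and output the pair (g(0), f(0))."""
    f0, g0 = next(iter(data))
    return g0, f0


f, g = PROGRAMS[7], PROGRAMS[8]
stream = ((f(i), g(i)) for i in range(100))
p, q = one_shot_learner(stream)

# p should be a program for f and q a program for g (checked on a finite prefix).
print(all(PROGRAMS[p](x) == f(x) for x in range(50)))
print(all(PROGRAMS[q](x) == g(x) for x in range(50)))
```

A learner that sees only the values of f receives the name of a program for g, which by itself says nothing about how to compute f. That is the intuition behind the claim that the components are not separately learnable; the sketch does not prove it.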

More formally, Fulk () considered a learnability result involving a witnessing class C of (tuples of) functions to be robust iff each computable scrambling of C also witnesses the learnability result (the allowed computable scramblers are the general recursive operators of Rogers (), but we omit the formal details herein). Example: A simple shift scrambler converting each f to f′, where f′(x) = f(x + 1), would eliminate the coding tricks just above, since the values of f at argument zero would be lost in this scrambling. Some inductive inference results hold robustly and some do not (see, e.g., Fulk, ; Jain, ; Jain, Smith, & Wiehagen, ; Jain et al., ; Case, Jain, Ott, Sharma, & Stephan, ). Happily, the S ⊆ R × R above (which is learnable, although its components are not) can be replaced by a more complicated class S′ that robustly witnesses the same result. This is better theoretical evidence that the empirically noticed efficacy of multitask or context sensitive learning is not just an accident.

It is residually important to note that Jain et al. () show, though, that the computable scramblers cannot get rid of more sophisticated coding tricks they called topological. S′ mentioned just above turns out to employ this latter kind of coding trick. It is hypothesized in Case et al. () that nature likely employs some sophisticated coding tricks itself. For a separate informal argument about coding tricks of nature, see Case (). Ott and Stephan () introduce a finite invariance constraint on top of robustness. This so-called hyperrobustness does destroy all coding tricks, and the result about the theoretical efficacy of multitask or context sensitive learning is not hyperrobust. However, hyperrobustness, perhaps, leaves unrealistically sparse structure.

Final note: Machine learning is an engineering endeavor. However, philosophers of science as well as practitioners in classical scientific disciplines should likely be considering the relevance of multitask or context sensitive inductive inference to their endeavors.
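The effect of the shift scrambler on a value coded at argument zero can be seen directly. The sketch assumes a shift by one position, and the two example functions are hypothetical: they differ only in the value stored at argument zero.

```python
def shift(f):
    """Shift scrambler (assumed to shift by one): f'(x) = f(x + 1)."""
    return lambda x: f(x + 1)


def f1(x):
    return 8 if x == 0 else x * x   # carries the code 8 at argument zero


def f2(x):
    return 99 if x == 0 else x * x  # same function, different code at zero


s1, s2 = shift(f1), shift(f2)
print(f1(0), f2(0))                               # the coded values differ
print(all(s1(x) == s2(x) for x in range(1000)))   # after shifting, no difference is visible
```

Once shifted, the two functions agree on every argument tested, so whatever was coded at zero is unavailable to a learner that sees only the scrambled function.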

Special Cases of Inductive Logic Programming
In this section we discuss some learning in the limit results for elementary formal systems (EFSs) (Smullyan, ). Essentially, EFSs are programs in a string rewriting system. It is well known (Arikawa, Shinohara, & Yamamoto, ) that EFSs are essentially (pure) logic programs over strings. Hence, the results have possible relevance for inductive logic programming (ILP) (Bratko & Muggleton, ; Lavrač & Džeroski, ; Mitchell, ; Muggleton & De Raedt, ). First we will discuss some important special cases based on Angluin’s pattern languages (Angluin, ). A pattern language is (by definition) one generated by all the positive length substitution instances in a pattern, such as

abXYcbbZXa    ()

where the variables (for substitutions) are depicted in upper case and the constants/terminals in lower case and are from, say, the alphabet {a,b,c}. Just below is an EFS or logic program based on this example pattern.

abXYcbbZXa ← .    ()

It must be understood, though, that in () and in the next example EFS below, only positive length strings are allowed to be substituted for the variables. Angluin () showed the Ex-learnability of the class of pattern languages from positive data. For these results, in the paradigm of () above, d_0, d_1, d_2, . . . is a listing or presentation of some formal language L over a finite nonempty alphabet and the e_i’s are programs that generate languages. In particular, for Angluin’s M, for L a pattern language, the e_i’s are patterns, and, for each presentation of L, all but finitely many of the corresponding e_i’s are the same correct pattern for L. Much work has been done on the learnability of pattern languages, e.g., Salomaa (a, b); Case, Jain, Kaufmann, Sharma, and Stephan (), and on bounded finite unions thereof, e.g., Shinohara (); Wright (); Kilpeläinen, Mannila, and Ukkonen (); Brazma, Ukkonen, and Vilo (); Case, Jain, Lange, and Zeugmann ().

Regarding bounded finite unions of pattern languages: an n-pattern language is the union of the pattern languages for some n patterns P_1, . . . , P_n. Each n-pattern language is also Ex-learnable from positive data (see Wright ()). An EFS or logic program corresponding to the n-patterns P_1, . . . , P_n and generating the corresponding n-pattern language is just below.

P_1 ← .
⋮
P_n ← .
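Membership in a (non-erasing) pattern language can be tested mechanically, which may help in reading the example pattern abXYcbbZXa. The sketch below compiles a pattern into a regular expression with backreferences: the first occurrence of a variable becomes a named group matching a nonempty string, and later occurrences must repeat it. The function names are assumptions, and this is only a membership test, not any of the learning algorithms cited.

```python
import re


def pattern_to_regex(pattern, alphabet="abc"):
    """Compile a pattern (upper case = variable, lower case = constant)."""
    seen = set()
    parts = []
    for symbol in pattern:
        if symbol.isupper():
            if symbol in seen:
                parts.append(f"(?P={symbol})")  # must equal the earlier substitution
            else:
                seen.add(symbol)
                # positive length substitution over the terminal alphabet
                parts.append(f"(?P<{symbol}>[{re.escape(alphabet)}]+)")
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))


def in_pattern_language(pattern, word, alphabet="abc"):
    return pattern_to_regex(pattern, alphabet).fullmatch(word) is not None


def in_n_pattern_language(patterns, word, alphabet="abc"):
    """Union of the pattern languages of the given patterns."""
    return any(in_pattern_language(p, word, alphabet) for p in patterns)


# X = a, Y = b, Z = c gives ab.a.b.cbb.c.a.a
print(in_pattern_language("abXYcbbZXa", "ababcbbcaa"))
# Here the two occurrences of X would need different substitutions.
print(in_pattern_language("abXYcbbZXa", "ababcbbcba"))
```

The regular expression engine searches by backtracking, so this check can be slow on long words with many variables; it is meant for small illustrations.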

Pattern language learning algorithms have been successfully applied toward some problems in molecular biology; see, e.g., Shimozono et al. (), Shinohara and Arikawa (). Lange and Wiehagen () present an interesting iterative (Wiehagen, ) algorithm learning the class of pattern languages from positive data only and with polynomial time constraints. Iterative learners are Ex-learners for which each output depends only on its just prior output (if any) and the input data element currently seen. Their algorithm works in polynomial time (actually quadratic time) in the length of the latest data item and the previous hypothesis. Furthermore, the algorithm has a linear set of good examples, in the sense that if the input data contains these good examples, then the algorithm already converges to the correct hypothesis. The number of good examples needed is at most ∣P∣ + , where P is a pattern generating the data d_0, d_1, d_2, . . . for the language being learned. This algorithm may be useful in practice due to its fast run time and its ability to converge quickly if enough good data is available early. Furthermore, due to iterativeness, it does not need to store previous data!

Zeugmann () considers total learning time up to convergence of the algorithm just discussed in the just prior paragraph. Note that, for arbitrary presentations, d_0, d_1, d_2, . . ., of a pattern language, this time can be unbounded. In the best case it is polynomial in the length of a generating pattern P, where d_0, d_1, d_2, . . . is based on using P to get good examples early; in fact the time taken in the best case is Θ(∣P∣ log_s (s + k)), where P is the pattern, s is the alphabet size, and k is the number of variables in P. Much more interesting is the case of average time taken up to convergence. The probability distribution (called uniform by Zeugmann) considered is as follows. A variable X is replaced by a string w with probability 1/(2s)^∣w∣ (i.e., all strings of length r together have probability 2^−r, and the distribution is uniform among strings of length r). Different variables are replaced independently of each other. In this case the average total time up to convergence is O(k k s∣P∣ log_s (ks)). The main point is that in the average case on probabilistic data (as can be expected in real life, though not necessarily with this kind of uniform distribution), the algorithm converges quickly and its computations are done efficiently.
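The substitution distribution described here can be sampled in two steps: draw a length, then draw a string of that length uniformly. The sketch rests on one assumption that should be checked against Zeugmann's paper: that the total probability of the strings of length r is 2^(-r) for r ≥ 1, which is the only geometric choice whose masses sum to one over positive lengths and which gives each string w the probability (2s)^(-|w|). The function names are illustrative.

```python
import random
from itertools import product


def substitution_probability(word, s):
    """Probability of one particular nonempty word, under the assumed distribution."""
    return (2 * s) ** (-len(word))


def sample_substitution(alphabet, rng):
    """Draw length r with probability 2**-r (r >= 1), then a uniform string."""
    length = 1
    while rng.random() < 0.5:
        length += 1
    return "".join(rng.choice(alphabet) for _ in range(length))


alphabet = "abc"
for r in range(1, 5):
    mass = sum(
        substitution_probability("".join(w), len(alphabet))
        for w in product(alphabet, repeat=r)
    )
    print(f"length {r}: total probability {mass:.4f}")

rng = random.Random(0)
print([sample_substitution(alphabet, rng) for _ in range(5)])
```

Short substitutions dominate under this distribution, which is one informal reason the example strings a learner sees tend to be short.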


A number of papers consider Ex-learning of EFSs (Krishna Rao, ; Krishna Rao, , , ; Krishna Rao & Sattar, ), including with various bounds on the number of mind-changes until syntactic convergence to correct programs (Jain & Sharma, , ). The EFSs considered are patterns, n-patterns, those with a constant bound on the length of clauses, and some with constant bounds on search trees. The mind-change bounds are typically more dynamic than those given by constants: they involve counting down from finite representations (called notations) for infinite constructive ordinals. An example of this kind of bound: one can algorithmically, based on some input parameters, decide how many mind-changes will be allowed. In other examples, the decision as to how many mind-changes will be allowed can be algorithmically revised some constant number of times. It is possible that not-yet-created special cases of some of these algorithms could be made feasible enough for practice.

Learning Drifting Concepts
A drifting concept to be learned is one which is a moving target (see Concept Drift). In some machine learning applications, concept drift must be dealt with (Bartlett, Ben-David, & Kulkarni, ; Blum & Chalasani, ; Devaney & Ram, ; Freund & Mansour, ; Helmbold & Long, ; Kubat, ; Widmer & Kubat, ; Wrobel, ). An inductive inference contribution is Case et al. (), in which upper bounds are shown, for online extrapolation by computable martingale betting strategies, on the “speed” of the moving target that permit success at all. Here success is to make unbounded amounts of “money” betting on correctness of one’s extrapolations.

Here is an illustrative result from Case et al. (). For the pattern languages considered in the previous section, only positive length strings of terminals can be substituted for a variable in an associated pattern. The (difficult to learn) pattern languages with erasing are just the languages obtained by also allowing the substitution of the empty string for variables in a pattern. For our example, we restrict the terminal alphabet to be {0,1}. With each pattern language with erasing L (over this terminal alphabet) we associate its characteristic function χ_L, which is 1 on terminal strings in L and 0 on those not in L. For ε denoting the empty string, and for the terminal strings in length-lexicographical order, ε, 0, 1, 00, 01, 10, 11, 000, . . ., we would input a χ_L itself to a potential extrapolating machine as the bit string χ_L(ε), χ_L(0), χ_L(1), χ_L(00), χ_L(01), . . .. Let E be the class of these characteristic functions. Pick a positive integer constant p. To model drift with permanence p, we imagine that a potential extrapolator for E receives successive bits from a member of E but keeps switching to the next bits of another, etc., but it must see at least p bits in a row of each member of E it sees before it can see the next bits of another. p is, then, a speed limit on drift. The result is that some suitably clever computable martingale betting strategy is successful at extrapolating E with drift permanence (speed limit on drift) of p = .
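The input seen by an extrapolator in this setting can be generated explicitly. The sketch below enumerates binary strings in length-lexicographical order, evaluates the characteristic function of a pattern language with erasing, and builds a drifting bit stream. Several details are assumptions made for illustration: the terminal alphabet is taken to be {0,1}, the stream switches concept after exactly p bits in a fixed rotation, and "next bits" is read as continuing at the current position of the string enumeration.

```python
import re
from itertools import count, islice, product


def length_lex(alphabet="01"):
    """Yield all strings over the alphabet in length-lexicographical order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def erasing_membership(pattern, alphabet="01"):
    """Membership test for a pattern language with erasing (variables may be empty)."""
    seen, parts = set(), []
    for symbol in pattern:
        if symbol.isupper():
            if symbol in seen:
                parts.append(f"(?P={symbol})")
            else:
                seen.add(symbol)
                parts.append(f"(?P<{symbol}>[{re.escape(alphabet)}]*)")
        else:
            parts.append(re.escape(symbol))
    regex = re.compile("".join(parts))
    return lambda word: regex.fullmatch(word) is not None


def drifting_bits(patterns, p, n_bits):
    """Characteristic-function bits, switching concept every p positions."""
    tests = [erasing_membership(pattern) for pattern in patterns]
    words = islice(length_lex(), n_bits)
    return [int(tests[(i // p) % len(tests)](w)) for i, w in enumerate(words)]


print(list(islice(length_lex(), 8)))
print(drifting_bits(["0X", "X1"], p=2, n_bits=12))
```

A smaller p lets the target change more often, which is the sense in which p acts as a speed limit on drift.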

Behavioral Cloning
Kummer and Ott () and Case, Ott, Sharma, and Stephan () studied learning in the limit of winning control strategies for closed computable games. These games nicely model reactive process-control problems. Included are such example process-control games as regulating temperature of a room to be in a desired interval, forever after no more than some fixed number of moves between the thermostat and processes disturbing the temperature (Roughly, closed computable games are those so that one can tell algorithmically when one has lost. A temperature control game that requires stability forever after some undetermined finite number of moves is not a closed computable game. For a more formal treatment, see Cenzer and Remmel (); Maler, Pnueli, and Sifakis (); Thomas (); Kummer and Ott ()).

In machine learning, there are cases where one wants to teach a machine some motor skill possessed by human experts and where these human experts do not have access to verbalizable knowledge about how they perform expertly. Piloting an aircraft or expert operation of a swinging shipyard crane provide examples, and machine learning employs, in these cases, behavioral cloning, which uses direct performance data from the experts (Bain & Sammut, ; Bratko, Urbančič, & Sammut, ; Šuc, ). Case et al. () study the effects on learning in the limit of closed computable games where the learning procedures also had access to the behavioral performance (but not the algorithms) of masters/experts at the games. For example, it is shown that, in some cases, there is better performance cloning n + 1 disparate masters over cloning only n. For a while it was not known in machine learning how to clone multiple experts even after Case et al. () was known to some; however, independently of Case et al. (), and later, Dorian Šuc (Šuc, ) found a way to clone behaviorally more than one human expert simultaneously (for the free-swinging shipyard crane problem) by having more than one level of feedback control, and he got enhanced performance from cloning the multiple experts!

Learning To Coordinate
Montagna and Osherson () begin the study of learning in the limit to coordinate (digital) moves between at least two agents. The machines of Montagna and Osherson () are, in effect, general extrapolating devices (Montagna & Osherson, ; Case et al., ). Technically, and without loss of generality of the results, we restrict the moves of each coordinator to bits, i.e., zeros and ones. Coordination is achieved between two coordinators iff each, reacting to the bit sequence of the other, eventually (in the limit) matches it bit for bit. Montagna and Osherson () give an example of two people who show up in a park each day at one of noon (bit ) or pm (bit ); each silently watches the other’s past behavior; and each tries, based on the past behavior of the other, to show up eventually exactly when the other shows up. If they manage it, they have learned to coordinate. A blind coordinator is one that reacts only to the presence of a bit from another process, not to which bit the other process has played (Montagna & Osherson, ). Case et al. () develop and study the notion of probabilistically correct algorithmic coordinators. Next is a sample of theorems to the effect that just a few random bits can enhance learning to coordinate.

Theorem (Case et al., ) Suppose 0 ≤ p < 1. There exists a class of deterministic algorithmic coordinators C such that
(1) No deterministic algorithmic coordinator can coordinate with all of C; and
(2) For k chosen so that 1 − 2^−k ≥ p, there exists a blind, probabilistic algorithmic coordinator PM, such that: (i) for each member of C, PM can coordinate with it with probability 1 − 2^−k ≥ p; and (ii) PM is k-memory limited in the sense of (Osherson, Stob, & Weinstein, , P. ); more specifically, PM needs to remember whether it is outputting one of its first k bits, which are its only random bits (e.g., , a mere k = random bits for p = suffice.).

Regarding possible eventual applicability: Maye, Hsieh, Sugihara, and Brembs () cite finding deterministic chaos but not randomness in the behavior of animals. Hence, animals may not be exploiting random bits in learning anything, including to coordinate. However, one might build artifactual devices to exploit randomness, say, from radioactive decay, including, then, for enhancing learning to coordinate.
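The notion of two coordinators reacting to each other's bit sequences can be simulated for a finite number of rounds. This is only a sketch: coordination is defined in the limit, so a finite run gives evidence rather than proof, and the three example coordinators (`copycat`, `constant`, and the helper `agreement_point`) are hypothetical illustrations rather than constructions from the cited papers.

```python
def copycat(first_bit):
    """Plays first_bit once, then repeats the other's previous bit."""
    def coordinator(other_history):
        return other_history[-1] if other_history else first_bit
    return coordinator


def constant(bit):
    """Blind in the entry's sense: ignores which bits the other has played."""
    def coordinator(other_history):
        return bit
    return coordinator


def play(coord_a, coord_b, rounds):
    """Each round, both move simultaneously, seeing only the other's past bits."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        a = coord_a(hist_b)
        b = coord_b(hist_a)
        hist_a.append(a)
        hist_b.append(b)
    return hist_a, hist_b


def agreement_point(hist_a, hist_b):
    """First round from which the two histories agree to the end, else None."""
    point = None
    for i, (a, b) in enumerate(zip(hist_a, hist_b)):
        if a != b:
            point = None
        elif point is None:
            point = i
    return point


print(agreement_point(*play(copycat(0), constant(1), 20)))
print(agreement_point(*play(copycat(0), copycat(1), 20)))
```

The copycat quickly locks on to a constant partner, while two copycats that start with different bits swap bits forever and never match within any finite horizon.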

Learning Geometric Clustering
Case, Jain, Martin, Sharma, and Stephan () showed that learnability in the limit of clustering, with or without additional information, depends strongly on geometric constraints on the shape of the clusters. In this approach the hypothesis space of possible clusters is pre-given in each setting. It was hoped to obtain thereby insight into the difficulty of clustering when the clusters are restricted to preassigned geometrically defined classes. This is interestingly complementary to the conceptual clustering approach (see, e.g., Mishra, Ron, & Swaminathan, ; Pitt & Reinke, ) where one restricts the possible clusters to have good “verbal” descriptions in some language.

Clustering of many of the geometric classes investigated was shown to require information in addition to a presentation, d_0, d_1, d_2, . . ., of the set of points to be clustered. For example, for clusters as convex hulls of finitely many points in a rational vector space, clustering can be done, but with the number of clusters as additional information. Let S consist of all polygons including their interiors, in the rational two-dimensional plane, without intersections and degenerate angles (Attention was restricted to spaces of rationals since: (1) computer reals are rationals, (2) this avoids the uncountability of the set of reals, and (3) this avoids dealing with uncomputable real points.) The class S can be clustered, but with the number of vertices of the polygons of the clusters involved as additional information. Correspondingly, then, it was shown that the class S′ containing S together with all such polygons but with one hole (the nondegenerate differences of two members in S) cannot be clustered with the number of vertices as additional information, yet S′ can be clustered with area as additional information, and this even in higher dimensions and with any number of holes (Case et al., ). It remains to be seen if some forms of geometrically constrained clustering can be usefully complementary to, say, conceptually/verbally constrained clustering.

Insights for Limitations of Science
We briefly treat below some problems regarding parsimonious, refutable, and consistent hypotheses.

It is common wisdom in science that one should fit parsimonious explanations, hypotheses, or programs to data. In machine learning, this has been successfully applied, e.g., (Wallace, ; Wallace & Dowe, ).

Curiously, though, there are many results in inductive inference in which we see sometimes severe degradations of learning power caused by demanding parsimonious predictive programs (see, e.g., Freivalds (); Kinber (); Chen (); Case, Jain, and Sharma (); Ambainis, Case, Jain, and Suraj ()). It is an interesting problem to resolve the seeming, likely not actual, contradiction between the just prior two paragraphs.

Popper’s Refutability (Popper, ) asserts that hypotheses in science should be subject to refutation. Besides the well-known difficulties of Duhem–Quine (Harding, ) of knowing which component hypothesis to throw out when a compound hypothesis badly fails to make correct predictions, inductive inference theorems have provided very different difficulties. Case and Smith () outline cases of usefully incomplete (hence wrong) hypotheses that cannot be refuted, and Case and Suraj () (see also Case, ) provide cases of inductively inferable higher order hypotheses not totally subject to refutation in cases where ordinary hypotheses subject to full refutation cannot be inductively inferred.

While Duhem–Quine may impact machine learning eventually, it remains to be seen whether the inductive inference results of the just prior paragraph will.

Requiring inductive inference procedures always to output a hypothesis in various senses consistent with (e.g., not ignoring) the data on which that hypothesis is based seems like mere common sense. However, from Bārzdiņš (a), Blum and Blum (), Wiehagen (), and Case, Jain, Stephan, and Wiehagen () we see that strict adherence to various consistency principles can severely attenuate the learning power of inductive inference machines. Furthermore, interestingly, even when inductive inference is polytime constrained, we see similar counterintuitive results to the effect that a kind of consistency can strictly attenuate learning power (Wiehagen & Zeugmann, ). A machine learning analog might be Breiman’s bagging (Breiman, ) and random forests (Breiman, ), where data is purposely ignored. However, in these cases, the purpose of ignoring data is to avoid overfitting to noise. It remains to be seen whether, in applied machine learning involving cases of practically noiseless data, one can also obtain some advantage in ignoring some consistency principles. Again the potential lesson from inductive inference is abstract and provides only a hint of something to work out in real machine learning problems.

Cross References
Behavioural Cloning
Clustering
Concept Drift
Inductive Logic Programming
Transfer Learning

Recommended Reading Ambainis, A., Case, J., Jain, S., & Suraj, M. (). Parsimony hierarchies for inductive inference. Journal of Symbolic Logic, , –. Angluin, D., Gasarch, W., & Smith, C. (). Training sequences. Theoretical Computer Science, (), –. Angluin, D. (). Finding patterns common to a set of strings. Journal of Computer and System Sciences, , –. Arikawa, S., Shinohara, T., & Yamamoto, A. (). Learning elementary formal systems. Theoretical Computer Science, , –. Bain, M., & Sammut, C. (). A framework for behavioural cloning. In K. Furakawa, S. Muggleton, & D. Michie (Eds.), Machine intelligence . Oxford: Oxford University Press.



Connectivity

►Topology of a Neural Network

Consensus Clustering

Synonyms

Clustering aggregation; Clustering ensembles

Definition

In Consensus Clustering we are given a set of $n$ objects $V$ and a set of $m$ clusterings $\{C_1, C_2, \ldots, C_m\}$ of the objects in $V$. The aim is to find a single clustering $C$ that disagrees least with the input clusterings, that is, $C$ minimizes

$$D(C) = \sum_{i=1}^{m} d(C, C_i)$$

for some metric $d$ on clusterings of $V$. Meilă () proposed the principled variation of information metric on clusterings, but it has been difficult to analyze theoretically. The Mirkin metric is the most widely used; under it, $d(C, C')$ is the number of pairs of objects $(u, v)$ that are clustered together in $C$ and apart in $C'$, or vice versa; it can be calculated in time $O(mn)$.

We can interpret each of the clusterings $C_i$ in Consensus Clustering as evidence that pairs ought to be put together or separated. That is, $w^+_{uv}$ is the number of $C_i$ in which $C_i[u] = C_i[v]$, and $w^-_{uv}$ is the number of $C_i$ in which $C_i[u] \neq C_i[v]$. It is clear that $w^+_{uv} + w^-_{uv} = m$ and that Consensus clustering is an instance of Correlation clustering in which the $w^-_{uv}$ weights obey the triangle inequality.
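To make the objective and the pair weights concrete, the following minimal Python sketch computes the Mirkin distance, the cost D(C) of a candidate clustering, and the weights w+ and w− for a toy input. It assumes each clustering is stored as a dictionary from object to cluster label; the function names and the toy data are illustrative assumptions. The pairwise loop is the naive quadratic computation, not an optimized one.

```python
from itertools import combinations


def mirkin_distance(c1, c2):
    """Number of object pairs that are together in one clustering and apart in the other."""
    objects = sorted(c1)
    return sum(
        (c1[u] == c1[v]) != (c2[u] == c2[v])
        for u, v in combinations(objects, 2)
    )


def consensus_cost(candidate, clusterings):
    """D(C): total Mirkin distance from the candidate to all input clusterings."""
    return sum(mirkin_distance(candidate, ci) for ci in clusterings)


def pair_weights(clusterings):
    """w_plus[(u, v)] counts clusterings that join u and v; w_minus counts those that split them."""
    objects = sorted(clusterings[0])
    m = len(clusterings)
    w_plus, w_minus = {}, {}
    for u, v in combinations(objects, 2):
        together = sum(ci[u] == ci[v] for ci in clusterings)
        w_plus[(u, v)] = together
        w_minus[(u, v)] = m - together
    return w_plus, w_minus


# Toy input (assumed for illustration): three clusterings of four objects.
clusterings = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
]
w_plus, w_minus = pair_weights(clusterings)
costs = [consensus_cost(c, clusterings) for c in clusterings]
```

On this input the first clustering has the lowest cost among the three inputs, so it would be the best choice if the consensus were restricted to the input clusterings themselves.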

Constrained Clustering

Kiri L. Wagstaff
Pasadena, CA, USA

Definition

Constrained clustering is a semisupervised approach to ►clustering data while incorporating domain knowledge in the form of constraints. The constraints are usually expressed as pairwise statements indicating that two items must, or cannot, be placed into the same cluster. Constrained clustering algorithms may enforce every constraint in the solution, or they may use the constraints as guidance rather than hard requirements.

Motivation and Background

►Unsupervised learning operates without any domain-specific guidance or preexisting knowledge. Supervised learning requires that all training examples be associated with labels. Yet it is often the case that existing knowledge for a problem domain fits neither of these extremes. Semisupervised learning methods fill this gap by making use of both labeled and unlabeled data. Constrained clustering, a form of semisupervised learning, was developed to extend clustering algorithms to incorporate existing domain knowledge, when available. This knowledge may arise from labeled data or from more general rules about the concept to be learned.

One of the original motivating applications was noun phrase coreference resolution, in which noun phrases in a text must be clustered together to represent distinct entities (e.g., “Mr. Obama” and “the President” and “he”, separate from “Sarah Palin” and “she” and “the Alaska governor”). This problem domain contains several natural rules for when noun phrases should (such as appositive phrases) or should not (such as a mismatch on gender) be clustered together. These rules can be translated into a collection of pairwise constraints on the data to be clustered.

Constrained clustering algorithms have now been applied to a rich variety of domain areas, including hyperspectral image analysis, road lane divisions from GPS data, gene expression microarray analysis, video object identification, document clustering, and web search result grouping.

Structure of the Learning System

Constrained clustering arises out of existing work with unsupervised clustering algorithms. In this description, we focus on clustering algorithms that seek a partition of the data into disjoint clusters, using a distance or similarity measure to place similar items into the same cluster. Usually, the desired number of clusters, k, is specified as an input to the algorithm. The most common clustering algorithms are k-means (MacQueen, ) and expectation maximization or EM (Dempster, Laird, & Rubin, ) (Fig. ).

A constrained clustering algorithm takes the same inputs as a regular (unsupervised) clustering algorithm and also accepts a set of pairwise constraints. Each constraint is a ►must-link or ►cannot-link constraint. The must-link constraints form an equivalence relation, which permits the inference of additional transitively implied must-links as well as additional entailed cannot-link constraints between items from distinct must-link cliques. Specifying a significant number of pairwise constraints might be tedious for large data sets, so often they may be generated from a manually labeled subset of the data or from domain-specific rules.

Constrained Clustering. Figure . The constrained clustering algorithm takes in nine items and two pairwise constraints (one must-link and one cannot-link). The output clusters respect the specified constraints. (Diagram labels: Input data; Domain knowledge; Constraints; Constrained clustering; Output clusters.)

The algorithm may interpret the constraints as hard constraints that must be satisfied in the output or as soft preferences that can be violated, if necessary. The former approach was used in the first constrained clustering algorithms, COP-COBWEB (Wagstaff & Cardie, ) and COP-kmeans (Wagstaff, Cardie, Rogers, & Schroedl, ). COP-kmeans accommodates the constraints by restricting item assignments to exclude any constraint violations. If a solution that satisfies the constraints is not found, COP-kmeans terminates without a solution. Later, algorithms such as PCK-means and MPCK-means (Bilenko, Basu, & Mooney, ) permitted the violation of constraints when necessary by introducing a violation penalty. This is useful when the constraints may contain noise or internal inconsistencies, which are especially relevant in real-world domains. Constrained versions of other clustering algorithms such as EM (Shental, Bar-Hillel, Hertz, & Weinshall, ) and spectral clustering (Kamvar, Klein, & Manning, ) also exist. Penalized probabilistic clustering (PPC) is a modified version of EM that interprets the constraints as (soft) probabilistic priors on the relationships between items (Lu & Leen, ).

In addition to constraining the assignment of individual items, constraints can be used to learn a better distance metric for the problem at hand (Bar-Hillel, Hertz, Shental, & Weinshall, ; Klein, Kamvar, & Manning, ; Xing, Ng, Jordan, & Russell, ). Must-link constraints hint that the effective distance between those items should be low, while cannot-link constraints suggest that their pairwise distance should be high. Modifying the metric accordingly permits the subsequent application of a regular clustering algorithm, which need not explicitly work with the constraints at all. The MPCK-means algorithm fuses these approaches together, providing both constraint satisfaction and metric learning simultaneously (Basu, Bilenko, & Mooney, ; Bilenko et al., ).

More information about subsequent advances in constrained clustering algorithms, theory, and novel applications can be found in a compilation edited by Basu, Davidson, and Wagstaff ().

Programs and Data

The MPCK-means algorithm is available in a modified version of the Weka machine learning toolkit (Java) at http://www.cs.utexas.edu/users/ml/risc/code/.
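For readers who want to experiment without the toolkit, the hard-constraint behavior described for COP-kmeans can be sketched in a few lines of Python. This is a simplified illustration written for this entry, not the authors' implementation: the function names, the `seed` parameter, and the toy data are assumptions, and the sketch does not first compute the transitive closure of the constraints. Each item is assigned to the nearest center that violates no constraint with the items already assigned, and the procedure gives up (returns `None`) when no such center exists.

```python
import random


def violates(item, cluster, assignment, must_link, cannot_link):
    """True if placing `item` in `cluster` breaks a constraint with already-assigned items."""
    for a, b in must_link:
        other = b if a == item else a if b == item else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == item else a if b == item else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def cop_kmeans(points, k, must_link, cannot_link, n_iter=20, seed=0):
    """Constrained k-means sketch; returns {item index: cluster} or None if infeasible."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(n_iter):
        new_assignment = {}
        for i, p in enumerate(points):
            # Try centers from nearest to farthest; take the first feasible one.
            for c in sorted(range(k), key=lambda c: sq_dist(p, centers[c])):
                if not violates(i, c, new_assignment, must_link, cannot_link):
                    new_assignment[i] = c
                    break
            else:
                return None  # no feasible cluster for this item
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        for c in range(k):
            members = [points[i] for i, a in assignment.items() if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
result = cop_kmeans(points, k=2, must_link=[(0, 1)], cannot_link=[(0, 3)])
infeasible = cop_kmeans(points, k=2, must_link=[(0, 1)], cannot_link=[(0, 1)])
```

With the contradictory constraints in the last call, the sketch returns `None`, mirroring the behavior of terminating without a solution; a penalty-based method would instead accept the violation at a cost.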

Recommended Reading

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, , –.
Basu, S., Bilenko, M., & Mooney, R. J. (). A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. –). Seattle, WA.
Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL: CRC Press.
Bilenko, M., Basu, S., & Mooney, R. J. (). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-first International Conference on Machine Learning (pp. –). Banff, AB, Canada.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, (), –.
Kamvar, S., Klein, D., & Manning, C. D. (). Spectral learning. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. –). Acapulco, Mexico.
Klein, D., Kamvar, S. D., & Manning, C. D. (). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. –). Sydney, Australia.
Lu, Z., & Leen, T. (). Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press.
MacQueen, J. B. (). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Symposium on Math, Statistics, and Probability (Vol. , pp. –). California: University of California Press.
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (). Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press.
Wagstaff, K., & Cardie, C. (). Clustering with instance-level constraints. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. –). San Francisco: Morgan Kaufmann.
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. –). San Francisco: Morgan Kaufmann.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (). Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press.

Constraint-Based Mining

Siegfried Nijssen
Katholieke Universiteit Leuven, Leuven, Belgium

Definition

Constraint-based mining is the research area studying the development of data mining algorithms that search through a pattern or model space restricted by constraints. The term is usually used to refer to algorithms that search for patterns only. The most well-known instance of constraint-based mining is the mining of ►frequent patterns. Constraints are needed in pattern mining algorithms to increase the efficiency of the search and to reduce the number of patterns that are presented to the user, thus making knowledge discovery more effective and useful.

Motivation and Background

Constraint-based pattern mining is a generalization of frequent itemset mining. For an introduction to frequent itemset mining, see ►Frequent Patterns. A constraint-based mining problem is specified by providing the following elements:

● A database D, usually consisting of independent transactions (or instances)
● A ►hypothesis space L of patterns
● A constraint q(θ, D) expressing criteria that a pattern θ in the hypothesis space should fulfill on the database

The general constraint-based mining problem is to find the set Th(D, L, q) = {θ ∈ L ∣ q(θ, D) = true}. Alternative problem settings are obtained by making different choices for D, L and q. For instance,

● If the database and hypothesis space consist of itemsets, and the constraint checks if the support of a pattern exceeds a predefined threshold in data, the frequent itemset mining problem is obtained (see ►Frequent Patterns)
● If the database and the hypothesis space consist of graphs or trees instead of itemsets, a graph mining or a tree mining problem is obtained. For more information about these topics, see ►Graph Mining and ►Tree Mining
● Additional syntactic constraints can be imposed
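The definition of Th(D, L, q) can be read directly as a filter over the hypothesis space. The sketch below instantiates it for itemsets with a minimum support constraint. It is a brute-force illustration only: the toy database, the threshold of 2 (read here as "at least 2"), and the function names are assumptions, and enumerating the full space is exponential in the number of items.

```python
from itertools import combinations


def support(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    return sum(pattern <= transaction for transaction in database)


def hypothesis_space(items):
    """All non-empty itemsets over `items`."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def theory(database, space, q):
    """Th(D, L, q): every pattern in the space for which the constraint holds."""
    return {theta for theta in space if q(theta, database)}


# Toy database of four transactions (assumed for illustration).
D = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BD")]
L = list(hypothesis_space("ABCD"))


def min_support(theta, db):
    return support(theta, db) >= 2


frequent = theory(D, L, min_support)
```

Swapping `min_support` for another predicate, or `hypothesis_space` for a generator of graphs or trees together with a suitable coverage test, yields the alternative problem settings listed above.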

An overview of important types of constraints is given below.

One can generalize the constraint-based mining problem beyond pattern mining. Models, such as ►Decision Trees, can also be seen as languages of interest. In the broadest sense, topics such as ►Constrained Clustering, ►Cost-Sensitive Learning, and even learning ►Support Vector Machines (SVMs) may be seen as constraint-based mining problems. However, it is currently not common to categorize these topics as constraint-based mining; in practice, the term refers to constraint-based pattern mining.

From the perspective of constraint-based mining, the knowledge discovery process can be seen as a process in which a user repeatedly specifies constraints for data mining algorithms; the data mining system is a solver that finds patterns or models that satisfy the constraints. This approach to data mining is very similar to querying relational databases. Whereas relational databases are usually queried using operations such as projections, selections, and joins, in the constraint-based mining framework data is queried to find patterns or models that satisfy constraints that cannot be expressed in these primitives. A database that supports constraint-based mining queries, stores patterns and models, and allows later reuse of patterns and models is sometimes also called an inductive database (Imielinski & Mannila, ).

Structure of the Learning System

Constraints

Frequent pattern mining algorithms can be generalized along several dimensions. One way to generalize pattern mining algorithms is to allow them to deal with arbitrary ►coverage relations, which determine when a pattern matches a transaction in the data. In the example of mining itemsets, the subset relation determines the coverage relation. The coverage relation is at the basis of constraints such as minimum support; an alternative coverage relation would be the superset relation. From the coverage relation follows a generality relationship. A pattern $\theta_1$ is defined to be more specific than a pattern $\theta_2$ (denoted by $\theta_1 \succ \theta_2$) if any transaction that is covered by $\theta_1$ is also covered by $\theta_2$ (see ►Generalization). In frequent itemset mining, itemset $I_1$ is more general than itemset $I_2$ if and only if $I_1 \subseteq I_2$. Generalization and coverage relationships can be used to identify the following types of constraints.
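The link between coverage and generality can be checked mechanically on a small example. The sketch below assumes a toy universe of three items and the subset coverage relation. It tests the definition of "more specific" against every possible transaction over that universe. The names `covers`, `more_specific`, and `UNIVERSE` are illustrative assumptions.

```python
from itertools import combinations

ITEMS = "ABC"  # assumed toy item universe


def all_itemsets(items):
    """Every subset of `items`, including the empty itemset."""
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def covers(pattern, transaction):
    """Subset coverage: a pattern matches a transaction that contains all of its items."""
    return pattern <= transaction


def more_specific(theta1, theta2, transactions):
    """theta1 is more specific than theta2 if every transaction covered by theta1 is covered by theta2."""
    return all(covers(theta2, t) for t in transactions if covers(theta1, t))


UNIVERSE = all_itemsets(ITEMS)

# For itemsets, the coverage-based definition should coincide with set inclusion.
agrees_with_subset = all(
    more_specific(i2, i1, UNIVERSE) == (i1 <= i2)
    for i1 in UNIVERSE
    for i2 in UNIVERSE
)
```

Replacing `covers` with the superset relation would reverse the resulting generality order, which is why the coverage relation has to be fixed before constraints are classified.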


Monotonic and Anti-Monotonic Constraints

An essential property which is exploited in ►frequent pattern mining is that all subsets of a frequent pattern are also frequent. This is a property that can be generalized:

● A constraint is called monotonic if any generalization of a pattern that satisfies the constraint also satisfies the constraint
● A constraint is called anti-monotonic if any specialization of a pattern that satisfies the constraint also satisfies the constraint

In some publications, the definitions of monotonic and anti-monotonic are reversed. The following are examples of monotonic constraints:

● Minimum support
● Syntactic constraints, for instance:
  ○ a constraint that requires that patterns specializing a given pattern x are excluded
  ○ a constraint requiring patterns to be small given a definition of pattern size
● Disjunctions or conjunctions of monotonic constraints
● Negations of anti-monotonic constraints

The following are examples of anti-monotonic constraints:

● Maximum support
● Syntactic constraints, for instance, a constraint that requires that patterns generalizing a given pattern x are excluded
● Disjunctions or conjunctions of anti-monotonic constraints
● Negations of monotonic constraints
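As a small sanity check of these definitions, the sketch below tests both properties by brute force on a toy itemset database. The database `D`, the thresholds, and the helper names are assumptions made for illustration. Generalizations of an itemset are taken to be its subsets and specializations its supersets, as in frequent itemset mining.

```python
from itertools import combinations

# Toy database and item universe (assumed for illustration).
D = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BD")]
ITEMS = "ABCD"
L = [frozenset(c) for r in range(len(ITEMS) + 1) for c in combinations(ITEMS, r)]


def support(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    return sum(pattern <= t for t in database)


def is_monotonic(q, space):
    """Every generalization (subset) of a satisfying pattern also satisfies q."""
    return all(q(g) for p in space if q(p) for g in space if g <= p)


def is_anti_monotonic(q, space):
    """Every specialization (superset) of a satisfying pattern also satisfies q."""
    return all(q(s) for p in space if q(p) for s in space if p <= s)


def min_support(p):
    return support(p, D) >= 2


def max_support(p):
    return support(p, D) <= 2


def not_max_support(p):
    # Negation of an anti-monotonic constraint.
    return not max_support(p)


checks = {
    "min_support monotonic": is_monotonic(min_support, L),
    "min_support anti-monotonic": is_anti_monotonic(min_support, L),
    "max_support monotonic": is_monotonic(max_support, L),
    "max_support anti-monotonic": is_anti_monotonic(max_support, L),
    "negated max_support monotonic": is_monotonic(not_max_support, L),
}
```

A brute-force check of this kind only confirms the property on one database; the general statements follow from the fact that support can only decrease when a pattern is specialized.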

Succinct Constraints Constraints that can be pushed

in the mining process by adapting the pattern space or data, are called succinct constraints. An example of a succinct constraint is the monotonic constraint that an itemset should contain the item A. This constraint could be dealt with by deleting all transactions that do not contain A. For any frequent itemset found in the new dataset, it is now known that the item A can be added to it. Convertible Constraints Some constraints that are not

monotonic, can still be convertible monotonic (Pei &

C

Han, ). A constraint is convertible monotonic if for every pattern θ one least general generalization θ ′ can be identified such that if θ satisfies the constraint, then θ ′ also satisfies the constraint. An example of a convertible constraint is a maximum average cost constraint. Assume that every item in an itemset has a cost as defined by a function c(i). The constraint c(I) = ∑i∈I c(i)/∣I∣ ≤ maxcost is not monotonic. However, for every itemset I with c(I) ≤ maxcost, if an item i is removed with c(i) = maxi∈I c(i), an itemset with c(I − {i}) ≤ c(I) ≤ maxcost is obtained. Maximum average cost has the desirable property that no access to the data is needed to identify the generalization that should satisfy the constraints. If it is not possible to identify the necessary least general generalization before accessing the data, the convertible constraint is also sometimes called weak (anti-)monotone (Zhu, Yan, Han, & Yu, ). Boundable Constraints Constraints on non-monotonic

measures for which a monotonic bound exist, are called boundable. An example of such a constraint is a minimum accuracy constraint in a database with binary class labels. Assume that every itemset is interpreted as a rule if I then else (thus, class label is predicted if a transaction contains itemset I, or class label otherwise; see 7Supervised Descriptive Rule Discovery). A minimum accuracy constraint can be formalized by the formula (fr(I, D ) + ∣D ∣ − fr(I, D ))/∣D∣ ≥ minacc, where Dk is the database containing only the examples labeled with class label k. It can be derived from this that fr(I, D ) ≥ ∣D∣minacc−∣D ∣+fr(I, D ) ≥ ∣D∣minacc−∣D ∣. In other words, if a high accuracy is desirable, a minimum number of examples of class is required to be covered, and a minimum frequency constraint can thus be derived. Therefore, minimum support can be used as a bound for minimum accuracy. The principle of deriving bounds for non-monotonic measures can be applied widely (Bayardo, Agrawal, & Gunopulos, ; Morishita & Sese, ). Borders If constraints are not restrictive enough, the

number of patterns can be huge. Ignoring statistics about patterns such as their exact frequency, the set of patterns can be represented more compactly only by

C

C

Constraint-Based Mining

listing the patterns in the border(s) (Mannila & Toivonen, ), similar to the idea of 7version spaces. An example of a border is the set of maximal frequent itemsets (see 7Frequent Patterns). Borders can be computed for other types of both monotonic and antimonotonic constraints as well. There are several complications compared to the simple frequent pattern mining setting: If there is an anti-monotonic constraint, such as maximum support, not only is it needed to compute a border for the most specific elements in the set (SSet), but also a border for the least general elements in the set (G-Set) ● If the formula is a disjunction of conjunctions, the result of a query becomes a union of version spaces, which is called a multi-dimensional version space (see Fig. ) (De Raedt, Jaeger, Lee, & Mannila, ); the G-Set of one version space may be more general than the G-Set of another version space ●

Both the S-Set and the G-Set can be represented by listing elements just within the version space (the positive border), or elements just outside the version space (the negative border). For instance, the positive border of the G-Set consists of those patterns which are part of the version space, and for which no generalizations exist which are part of the version space. Similarly, there may exist several representations of multi-dimensional version spaces; optimizing the representation of multi-dimensional version spaces is analogous to optimizing queries in relational databases (De Raedt et al., ). Borders form a condensed representation, that is, they compactly represent the solution space; see 7Frequent Patterns. Algorithms For many of the constraints specified in

the previous section specialized algorithms have been developed in combination with specific hypothesis spaces. It is beyond the scope of this chapter to discuss all these algorithms; only the most common ideas are provided here. The main idea is that 7Apriori can easily be updated to deal with general monotonic constraints in arbitrary hypothesis spaces. The concept of a specialization 7refinement operator is essential to operate on

other hypothesis spaces than itemsets. A specialization operator ρ(θ) computes a set of specializations in the hypothesis space for a given input pattern. In pattern mining, this operator should have the following properties:

● Completeness: every pattern in the hypothesis space should be reachable by repeated application of the refinement operator starting from the most general pattern in the hypothesis space
● Nonredundancy: every pattern in the hypothesis space should be reachable in only one way starting from the most general pattern in the hypothesis space

In itemset mining, optimal refinement is usually obtained by first ordering the items (for instance, alphabetically, or by frequency), and then adding to a set only items that are higher in the chosen order than the items already in the set. For instance, for the itemset {A, C}, the specialization operator returns ρ({A, C}) = {{A, C, D}, {A, C, E}}, assuming that the domain of items {A, B, C, D, E} is considered. Other refinement operators are needed when dealing with other hypothesis spaces, such as in 7graph mining. The search in Apriori proceeds 7breadth-first. At each level, the specialization operator is applied to patterns satisfying the monotonic constraints to generate candidates for the next level. For every new candidate it is checked whether its generalizations satisfy the monotonic constraints. To create a set of generalizations, a generalization refinement operator can be used. In frequent itemset mining, usually single items are removed from the itemset to generate generalizations. More changes are required to deal with anti-monotonic constraints. A simple way of dealing with both monotonic and anti-monotonic constraints is to first compute all patterns that satisfy the monotonic constraints, and then to prune the patterns that fail to satisfy the anti-monotonic constraints. More challenging is to "push" anti-monotonic constraints into the mining process. An observation which is often exploited is that generalizations of patterns that do not satisfy the anti-monotonic constraint need not be considered. Well-known strategies are:

Constraint-Based Mining. Figure . Version spaces: (a) a 1-dimensional version space; (b) a 2-dimensional version space. (Diagram labels: top element of the partial order, G-Border, S-Border, version space; patterns are ordered from more general to more specific.)

● In a breadth-first setting: traverse the lattice in reverse order for monotonic constraints, after the patterns have been determined satisfying the anti-monotonic constraints (De Raedt et al., )
● In a depth-first setting: during the search for patterns, try to guess the largest pattern that can still be reached, and prune a branch in the search if the pattern does not satisfy the monotonic constraint on this pattern (Bucila, Gehrke, Kifer, & White, ; Kifer, Gehrke, Bucila, & White, )

It is beyond the scope of this chapter to discuss how to deal with other types of constraints; however, it should be pointed out that not all combinations of constraints and hypothesis spaces have been studied; it is not obvious whether all constraints can be pushed usefully in a pattern search for any hypothesis space, for instance, when boundable constraints in more complex hypothesis spaces (such as graphs) are involved. Research in this area is ongoing.
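The specialization operator, the generalization operator, and the level-wise search described in this section can be sketched for itemsets as follows. This is a minimal sketch rather than the Apriori implementation of any particular system: the function names, the tuple representation of itemsets, and the toy database are assumptions, and `satisfies` stands for any constraint that is monotonic in the sense used above (such as minimum frequency).

```python
def specialize(itemset, items):
    """Optimal refinement: add only items that come later in the chosen order."""
    last = max((items.index(i) for i in itemset), default=-1)
    return [itemset + (item,) for item in items[last + 1:]]


def generalizations(itemset):
    """Generalization operator for itemsets: remove one item at a time."""
    return [itemset[:k] + itemset[k + 1:] for k in range(len(itemset))]


def levelwise(items, satisfies):
    """Breadth-first search; a candidate is tested only if all of its
    generalizations were accepted at the previous level."""
    level = [p for p in specialize((), items) if satisfies(p)]
    solutions = []
    while level:
        solutions.extend(level)
        accepted = set(level)
        candidates = [c for p in level for c in specialize(p, items)]
        level = [c for c in candidates
                 if all(g in accepted for g in generalizations(c)) and satisfies(c)]
    return solutions


# Hypothetical toy database; the constraint is minimum frequency 3.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]


def support(pattern):
    return sum(1 for t in transactions if set(pattern) <= t)


frequent = levelwise(["A", "B", "C"], lambda p: support(p) >= 3)
```

Because each itemset is generated only from its prefix in the item order, no pattern is produced twice, which is the nonredundancy property listed above.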

Cross References 7Constrained Clustering 7Frequent Pattern Mining 7Graph Mining 7Tree Mining

Recommended Reading
Bayardo, R. J., Jr., Agrawal, R., & Gunopulos, D. (). Constraint-based rule mining in large, dense databases. In Proceedings of the th international conference on data engineering (ICDE) (pp. –). Sydney, Australia.
Bucila, C., Gehrke, J., Kifer, D., & White, W. M. (). DualMiner: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, (), –.
De Raedt, L., Jaeger, M., Lee, S. D., & Mannila, H. (). A theory of inductive query answering (extended abstract). In Proceedings of the second IEEE international conference on data mining (ICDM) (pp. –). Los Alamitos, CA: IEEE Press.
Imielinski, T., & Mannila, H. (). A database perspective on knowledge discovery. Communications of the ACM, , –.
Kifer, D., Gehrke, J., Bucila, C., & White, W. M. (). How to quickly find a witness. In Proceedings of the twenty-second ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (pp. –). San Diego, CA: ACM Press.
Mannila, H., & Toivonen, H. (). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, (), –.
Morishita, S., & Sese, J. (). Traversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGACT-SIGMOD-SIGART symposium on database systems (PODS) (pp. –). San Diego, CA: ACM Press.
Pei, J., & Han, J. (). Constrained frequent pattern mining: A pattern-growth view. SIGKDD Explorations, (), –.
Zhu, F., Yan, X., Han, J., & Yu, P. S. (). gPrune: A constraint pushing framework for graph pattern mining. In Proceedings of the sixth Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Lecture notes in computer science (Vol. , pp. –). Berlin: Springer.

Constructive Induction Constructive induction is any form of 7induction that generates new descriptors not present in the input data (Dietterich & Michalski, ).

Recommended Reading Dietterich, T. G., & Michalski, R. S. (). A comparative review of selected methods for learning from examples. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). Machine learning: An artificial intelligence approach, pp. –. Tioga.


Content Match 7Text Mining for Advertising

Content-Based Filtering Synonyms Content-based recommending

Definition Content-based filtering is prevalent in 7Information Retrieval, where the text and multimedia content of documents is used to select documents relevant to a user’s query. In this context, the term refers to content-based recommenders, which provide recommendations by comparing representations of content describing an item to representations of content that interests a user.

Content-Based Recommending 7Content-Based Filtering

Context-Sensitive Learning 7Concept Drift

Contextual Advertising 7Text Mining for Advertising

Continual Learning Synonyms Life-Long Learning

Definition A learning system that can continue adding new data without ever needing to stop or freeze the updating. Usually continual learning requires incremental and 7online learning as a component, but not every incremental learning system has the ability to achieve continual learning, i.e., the learning may deteriorate after some time.

Cross References 7Cumulative Learning

Continuous Attribute A continuous attribute can assume all values on the number line within the value range. See 7Attribute and 7Measurement Scales.

Contrast Set Mining Definition Contrast set mining is an area of 7supervised descriptive rule induction. The contrast set mining problem is defined as finding contrast sets, which are conjunctions of attributes and values that differ meaningfully in their distributions across groups (Bay & Pazzani, ). In this context, groups are the properties of interest.

Recommended Reading Bay, S. D., & Pazzani, M. J. (). Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, (), –.

Cooperative Coevolution 7Compositional Coevolution

Co-Reference Resolution 7Entity Resolution

Correlation Clustering Anthony Wirth The University of Melbourne, Victoria, Australia

Synonyms Clustering with advice; Clustering with constraints; Clustering with qualitative information; Clustering with side information

Definition In its rawest form, correlation clustering is a graph optimization problem. Consider a 7clustering C to be a mapping from the elements to be clustered, V, to the set {1, . . . , ∣V∣}, so that u and v are in the same cluster if and only if C[u] = C[v]. Given a collection of items in which each pair (u, v) has two weights $w^+_{uv}$ and $w^-_{uv}$, we must find a clustering C that minimizes

$$\sum_{C[u]=C[v]} w^-_{uv} + \sum_{C[u]\neq C[v]} w^+_{uv}, \tag{1}$$

or, equivalently, maximizes

$$\sum_{C[u]=C[v]} w^+_{uv} + \sum_{C[u]\neq C[v]} w^-_{uv}. \tag{2}$$

Note that although w+uv and w−uv may be thought of as positive and negative evidence towards coassociation, the actual weights are nonnegative.
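The two objectives can be evaluated directly from their definitions. The sketch below assumes a hypothetical representation in which `weights[(u, v)]` holds the pair of nonnegative weights for each unordered pair and `clustering` maps each item to a cluster label; the three-item example is invented for illustration.

```python
def objectives(clustering, weights):
    """Return (disagreements, agreements): the quantity to be minimized and
    the quantity to be maximized, for weights[(u, v)] = (w_plus, w_minus)."""
    disagreements = agreements = 0.0
    for (u, v), (w_plus, w_minus) in weights.items():
        if clustering[u] == clustering[v]:
            disagreements += w_minus  # together despite negative evidence
            agreements += w_plus
        else:
            disagreements += w_plus   # apart despite positive evidence
            agreements += w_minus
    return disagreements, agreements


# Hypothetical advice: u~v and v~y are similar, u and y are dissimilar.
weights = {("u", "v"): (1.0, 0.0), ("v", "y"): (1.0, 0.0), ("u", "y"): (0.0, 1.0)}
one_cluster = {"u": 0, "v": 0, "y": 0}
singletons = {"u": 0, "v": 1, "y": 2}
```

For any clustering the two returned values sum to the total weight of the instance, which is why minimizing the first is equivalent to maximizing the second.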

Motivation and Background The notion of clustering with advice, that is, non-metric-driven relations between items, had been studied in other communities (Ferligoj & Batagelj, ) prior to its appearance in theoretical computer science. Traditional clustering problems, such as k-median and k-center, assume that there is some type of distance measure (metric) on the data items, and often specify the number of clusters that should be formed. In the clustering with advice framework, however, the number of clusters to be built need not be specified in advance: it can be an outcome of the objective function. Furthermore, instead of, or in addition to, a distance function, we are given advice as to which pairs of


items are similar. The two weights $w^+_{uv}$ and $w^-_{uv}$ correspond to external advice about whether the pair should be clustered together or separately. Bansal, Blum, and Chawla () introduced the problem to the theoretical computer science and machine-learning communities. They were motivated by database consistency problems, in which the same entity appeared in different forms in various databases. Given a collection of such records from multiple databases, the aim is to cluster together the records that appear to correspond to the same entity. From this viewpoint, the log odds ratio from some classifier,

$$\log\left(\frac{\Pr(\text{same})}{\Pr(\text{different})}\right),$$

corresponds to a label $w_{uv}$ for the pair. In many applications only one of the + and − weights for the pair is nonzero, that is

$$(w^+_{uv}, w^-_{uv}) = \begin{cases} (w_{uv}, 0) & \text{for } w_{uv} \geq 0 \\ (0, -w_{uv}) & \text{for } w_{uv} \leq 0. \end{cases}$$

In addition, if every pair has weight ±1, then the instance is called complete; otherwise it is referred to as general. Demaine, Emanuel, Fiat, and Immorlica () suggest the following motivation. Suppose we have a set of guests at a party. Each guest has preferences for whom they would like to sit with, and for whom they would like to avoid. We must group the guests into tables in a way that enhances the amicability of the party. The notion of producing good clusterings when given inconsistent advice first appeared in the work of Ben-Dor, Shamir, and Yakhini (). A canonical example of inconsistent advice is this: items u and v are similar, items v and y are similar, but u and y are dissimilar. It is impossible to find a clustering that satisfies all the advice. Figure shows a very simple example of inconsistent advice. In addition, although correlation clustering is an NP-hard problem, recent algorithms for clustering with advice guarantee that their solutions are only a specified factor worse than the optimal: that is, they are approximation algorithms.
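The mapping from a classifier's output to the pair of advice weights can be sketched as below. The function name and the assumption that the classifier returns a probability strictly between 0 and 1 are illustrative choices, not part of the cited work.

```python
import math


def advice_weights(p_same):
    """Turn Pr(same) for a pair into (w_plus, w_minus) via the log odds ratio.

    Assumes 0 < p_same < 1, so that the log odds ratio is finite.
    """
    w = math.log(p_same / (1.0 - p_same))
    # Exactly one of the two weights is nonzero unless the classifier is undecided.
    return (w, 0.0) if w >= 0 else (0.0, -w)


confident_match = advice_weights(0.8)     # positive evidence only
confident_mismatch = advice_weights(0.2)  # negative evidence only
undecided = advice_weights(0.5)           # no evidence either way
```

Both returned weights are nonnegative, in line with the remark in the definition that the actual weights are nonnegative.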

Correlation Clustering. Figure . Top left is a toy clustering with advice example showing three similar pairs (solid edges) and three dissimilar pairs (dashed edges). Bottom left is a clustering solution for this example with four singleton clusters, while bottom right has one cluster. Top right is a partitioning into two clusters that appears to best respect the advice

Theory In setting out the correlation clustering framework, Bansal et al. () noted that the following algorithm produces a -approximation for the maximization problem:

▸ If the total of the positive weights exceeds the total of the negative weights, then place all the items in a single cluster; otherwise, make each item a singleton cluster.

They then showed that complete instances are NP-hard to optimize, and how to minimize the penalty (1) with a constant factor approximation. The constant for this combinatorial algorithm was rather large. The algorithm relied heavily on the completeness of the instance; it iteratively cleans clusters until every cluster is δ-clean. That is, for each item at most a fraction δ (0 < δ < 1) of the other items in its cluster have a negative relation with it, and at most δ outside its cluster a positive relation. Bansal et al. also demonstrated that the minimization problem on general instances is APX-hard: there is some constant, larger than 1, below which approximation is NP-hard. Finally, they provided a polynomial time approximation scheme (PTAS) for maximizing (2) in complete instances. The constant factor for minimizing (1) on complete instances was improved to by Charikar, Guruswami, and Wirth (). They employed a region-growing type procedure to round the solution of a linear programming relaxation of the problem:

$$
\begin{aligned}
\text{minimize}\quad & \sum_{ij} w^+_{ij} \cdot x_{ij} + w^-_{ij} \cdot (1 - x_{ij}) \\
\text{subject to}\quad & x_{ik} \leq x_{ij} + x_{jk} && \text{for all } i, j, k \\
& x_{ij} \in [0, 1] && \text{for all } i, j
\end{aligned}
\tag{3}
$$

In this setting, $x_{ij} = 1$ implies i and j’s separation, while $x_{ij} = 0$ implies coclustering, with values in between representing partial evidence. In practice solving this linear program is very slow and has huge memory demands (Bertolacci & Wirth, ). Charikar et al. also showed that this version of the problem is APX-hard. For the maximization problem (2), they showed that instances with general weights were APX-hard and provided a rounding of the following semidefinite program (SDP) that yields a . factor approximation algorithm.

$$
\begin{aligned}
\text{maximize}\quad & \sum_{+(ij)} w_{ij} (v_i \cdot v_j) + \sum_{-(ij)} w_{ij} (1 - v_i \cdot v_j) \\
\text{subject to}\quad & v_i \cdot v_i = 1 && \text{for all } i \\
& v_i \cdot v_j \geq 0 && \text{for all } i, j
\end{aligned}
\tag{4}
$$

In this case we interpret $v_i \cdot v_j = 1$ as evidence that i and j are in the same cluster, but $v_i \cdot v_j = 0$ as evidence toward separation. Emanuel and Fiat () extended the work of Bansal et al. by drawing a link between Correlation Clustering and the Minimum Multicut problem. This reduction to Multicut provided an O(log n) approximation algorithm for minimizing general instances of Correlation Clustering. Interestingly, Emanuel and Fiat also showed that there was a reduction in the opposite direction: an optimal solution to Correlation Clustering induced an optimal solution to Minimum Multicut. Demaine and Immorlica () also drew the link from Correlation Clustering to Minimum Multicut and its O(log n) approximation algorithm. In addition, they described an O(r )-approximation algorithm for graphs that exclude the complete bipartite graph Kr,r as a minor.
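The simple rule of Bansal et al. quoted at the start of this section (one cluster if the positive weight dominates, otherwise all singletons) can be sketched in a few lines. The weight representation `weights[(u, v)] = (w_plus, w_minus)` and the function name are assumptions made for illustration.

```python
def trivial_clustering(items, weights):
    """Single cluster if total positive weight exceeds total negative weight,
    otherwise every item in its own cluster."""
    total_plus = sum(w_plus for w_plus, _ in weights.values())
    total_minus = sum(w_minus for _, w_minus in weights.values())
    if total_plus > total_minus:
        return {item: 0 for item in items}
    return {item: label for label, item in enumerate(items)}


# Hypothetical instance with more positive than negative weight.
items = ["u", "v", "y"]
weights = {("u", "v"): (1.0, 0.0), ("v", "y"): (1.0, 0.0), ("u", "y"): (0.0, 1.0)}
mostly_similar = trivial_clustering(items, weights)

# Swapping the roles of the weights gives an instance with more negative weight.
flipped = {pair: (w_minus, w_plus) for pair, (w_plus, w_minus) in weights.items()}
mostly_dissimilar = trivial_clustering(items, flipped)
```

The single cluster agrees with all of the positive weight and the singletons agree with all of the negative weight, so the better of the two always captures at least half of the total weight.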


Swamy (), using the same SDP (4) as Charikar et al., but different rounding techniques, showed how to maximize (2) within factor . in general instances. The factor approximation for minimization (1) of complete instances was lowered to . by Ailon, Charikar, and Newman (). Using the distances obtained by solving the linear program (3), they repeat the following steps:

▸ Form a cluster around a random item i by including each (unclustered) j with probability $1 - x_{ij}$; set the cluster aside.
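The repeated step above can be sketched as follows, assuming the LP distances are already available. The dictionary `x` keyed by unordered pairs, the function name, and the reading of the inclusion probability as one minus the distance are assumptions of this sketch; solving the linear program itself is not shown.

```python
import random


def pivot_round(items, x, rng=None):
    """Randomized rounding: pick a random unclustered pivot, add each other
    unclustered item j with probability 1 - x[{pivot, j}], and repeat."""
    rng = rng or random.Random()
    unclustered = list(items)
    clustering, label = {}, 0
    while unclustered:
        pivot = rng.choice(unclustered)
        members = [pivot]
        for j in unclustered:
            if j != pivot and rng.random() < 1.0 - x[frozenset((pivot, j))]:
                members.append(j)
        for m in members:
            clustering[m] = label
        # Set the cluster aside and continue with the remaining items.
        unclustered = [j for j in unclustered if j not in members]
        label += 1
    return clustering


# Hypothetical integral distances: a and b together, c apart.
x = {frozenset(("a", "b")): 0.0, frozenset(("a", "c")): 1.0, frozenset(("b", "c")): 1.0}
result = pivot_round(["a", "b", "c"], x, random.Random(0))
```

With integral distances the outcome does not depend on the random choices; with fractional distances different runs may return different clusterings.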

Since solving the linear program is highly resource hungry, Ailon et al. provided a combinatorial alternative: add j to i’s cluster if $w^+_{ij} > w^-_{ij}$. Not only is this algorithm very fast, it is actually a factor approximation. Recently, Tan () has shown that the / + є inapproximability for maximizing (2) on general weighted graphs extends to general unweighted graphs. A further variant in the Correlation Clustering family of problems is the maximization of (2)–(1), known as maximizing correlation. Charikar and Wirth () proved an Ω(1/ log n) approximation for the general problem of maximizing

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j, \quad \text{s.t. } x_i \in \{-1, 1\} \text{ for all } i, \tag{5}$$

for a matrix A with null diagonal entries, by rounding the canonical SDP relaxation. This effectively maximized correlation with the requirement that two clusters be formed; it was not hard to extend this to general instances. The gap between the vector SDP solution and the integral solution to maximizing the quadratic program (5) was in fact shown to be Θ(1/ log n) in general (Alon, Makarychev, Makarychev, & Naor, ). However, in other instances such as those with a bounded number of nonzero weights for each item, a constant factor approximation was possible. Arora, Berger, Hazan, Kindler, and Safra () went further and showed that it is quasi-NP-hard to approximate the maximization to a factor better than Ω(1/ log^γ n) for some γ > 0. Shamir, Sharan, and Tsur () showed that 7Cluster Editing and p-Cluster Editing, in which p clusters must be formed, are NP-complete (for p ≥ 2). Gramm, Guo, Hüffner, and Niedermeier () took


an innovative approach to solving the Cluster Editing problem exactly. They had previously produced an O(.k + n ) time hand-made search tree algorithm, where k is the number of edges that need to be modified. This “awkward and error-prone work” was then replaced with a computer program that itself designed a search tree algorithm, involving automated case analysis, that ran in O(.k + n ) time. Kulis, Basu, Dhillon, and Mooney () unify various forms of clustering, correlation clustering, spectral clustering, and clustering with constraints in their kernel-based approach to k-means. In this approach, they have a general objective function that includes penalties for violating pairwise constraints and for having points spread far apart from their cluster centers, where the spread is measured in some high-dimensional space.

Applications The work of Demaine and Immorlica () on Correlation Clustering was closely linked with that of Bejerano et al. on Location Area Planning. This problem is concerned with the allocation of cells in a cellular network to clusters known as location areas. There are costs associated with traffic between the location areas (cuts between clusters) and with the size of clusters themselves (related to paging phones within individual cells). These costs drive the clustering solution in opposite directions, on top of which there are constraints on cells that must (or cannot) be in the same cluster. The authors show that the same O(log n) region-growing algorithm for minimizing Correlation Clustering and Multicut applies to Location Area Planning. Correlation clustering has been directly applied to the coreference problem in natural language processing and other instances in which there are multiple references to the same object (Daume, ; McCallum & Wellner, ). Assuming some sort of undirected graphical model, such as a Conditional Random Field, algorithms for correlation clustering are used to partition a graph whose edge weights correspond to log-potentials between node pairs. The machine learning community has applied some of the algorithms for Correlation Clustering to problems such as email clustering and image segmentation. With similar applications in mind, Finley and Joachims () explore the idea of adapting the pairwise input information to fit example


clusterings given by a user. Their objective function is the same as Correlation Clustering (), but their main tool is the 7Support Vector Machine. There has been considerable interest in the 7consensus clustering problem, which is an excellent application of Correlation Clustering techniques. Gionis, Mannila, and Tsaparas () note several sources of motivation for Consensus Clustering; these include identifying the correct number of clusters and improving clustering robustness. They adapt Charikar et al.’s region-growing algorithm to create a three-approximation that performs reasonably well in practice, though not as well as local search techniques. Gionis et al. also suggest using sampling as a tool for handling large data sets. Bertolacci and Wirth () extended this study by implementing Ailon et al.’s algorithms with sampling, together with a variety of ways of developing a full clustering from the clustering of the sample. They noted that LP-based methods performed best, but placed a significant strain on resources.

Applications of Clustering with Advice The 7k-means clustering algorithm is perhaps the most-used clustering technique: Wagstaff et al. incorporated constraints into a highly cited k-means variant called COP-KMEANS. They applied this algorithm to the task of identifying lanes of traffic based on input GPS data. In the constrained-clustering framework, the constraints are usually assumed to be consistent (noncontradictory) and hard. In addition to the usual must- and cannot-link constraints, Davidson and Ravi () added constraints enforcing various requirements on the distances between points in particular clusters. They analyzed the computational complexity of deciding the (in)feasibility of a set of constraints, for various constraint types. Their constrained k-means algorithms were used to help a robot discover objects in a scene.

Recommended Reading Ailon, N., Charikar, M., & Newman, A. (). Aggregating inconsistent information: Ranking and clustering. In Proceedings of the Thirty-Seventh ACM Symposium on the Theory of Computing (pp. –). New York: ACM Press.

Alon, N., Makarychev, K., Makarychev, Y., & Naor, A. (). Quadratic forms on graphs. Inventiones Mathematicae, (), –. Arora, S., Berger, E., Hazan, E., Kindler, G., & Safra, S. (). On non-approximability for quadratic programs. In Proceedings of Forty-Sixth Symposium on Foundations of Computer Science. (pp. –). Washington DC: IEEE Computer Society. Bansal, N., Blum, A., & Chawla, S. (). Correlation clustering. In Correlation clustering (pp. –). Washington, DC: IEEE Computer Society. Ben-Dor, A., Shamir, R., & Yakhini, Z. (). Clustering gene expression patterns. Journal of Computational Biology, , –. Bertolacci, M., & Wirth, A. (). Are approximation algorithms for consensus clustering worthwhile? In Proceedings of Seventh SIAM International Conference on Data Mining. (pp. –). Philadelphia: SIAM. Charikar, M., Guruswami, V., & Wirth, A. (). Clustering with qualitative information. In Proceedings of forty fourth FOCS (pp. –). Charikar, M., & Wirth, A. (). Maximizing quadratic programs: Extending Grothendieck’s inequality. In Proceedings of forty fifth FOCS (pp. –). Daume, H. (). Practical structured learning techniques for natural language processing. PhD thesis, University of Southern California. Davidson, I., & Ravi, S. (). Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of Fifth SIAM International Conference on Data Mining. Demaine, E., Emanuel, D., Fiat, A., & Immorlica, N. (). Correlation clustering in general weighted graphs. Theoretical Computer Science, (), –. Demaine, E., & Immorlica, N. (). Correlation clustering with partial information. In Proceedings of Sixth Workshop on Approximation Algorithms for Combinatorial Optimization Problems. (pp. –). Emanuel, D., & Fiat, A. (). Correlation clustering – minimizing disagreements on arbitrary weighted graphs. In Proceedings of Eleventh European Symposium on Algorithms (pp. –). Ferligoj, A., & Batagelj, V. (). Clustering with relational constraint. Psychometrika, (), –. 
Finley, T., & Joachims, T. (). Supervised clustering with support vector machines. In Proceedings of Twenty-Second International Conference on Machine Learning. Gionis, A., Mannila, H., & Tsaparas, P. (). Clustering aggregation. In Proceedings of Twenty-First International Conference on Data Engineering. To appear. Gramm, J., Guo, J., Hüffner, F., & Niedermeier, R. (). Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, (), –. Kulis, B., Basu, S., Dhillon, I., & Mooney, R. (). Semi-supervised graph clustering: A kernel approach. In Proceedings of Twenty-Second International Conference on Machine Learning (pp. –). McCallum, A., & Wellner, B. (). Conditional models of identity uncertainty with application to noun coreference. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems (pp. –). Cambridge, MA: MIT Press. Meilă, M. (). Comparing clusterings by the variation of information. In Proceedings of Sixteenth Conference on Learning Theory (pp. –). Shamir, R., Sharan, R., & Tsur, D. (). Cluster graph modification problems. Discrete Applied Mathematics, , –. Swamy, C. (). Correlation Clustering: Maximizing agreements via semidefinite programming. In Proceedings of Fifteenth ACM-SIAM Symposium on Discrete Algorithms (pp. –). Tan, J. (). A note on the inapproximability of correlation clustering. Technical Report ., eprint arXiv, .

Correlation-Based Learning 7Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity

Cost In 7Markov decision processes, negative rewards are often expressed as costs. A reward of −x is expressed as a cost of x. In 7supervised learning, cost is used as a synonym for 7loss.

Cross References 7Loss

Cost Function 7Loss Function

Cost-Sensitive Classification 7Cost-Sensitive Learning

Cost-Sensitive Learning Charles X. Ling, Victor S. Sheng The University of Western Ontario, Canada

Synonyms Cost-sensitive classification; Learning with different classification costs

Definition Cost-Sensitive Learning is a type of learning that takes the misclassification costs (and possibly other types of cost) into consideration. The goal of this type of learning is to minimize the total cost. The key difference between cost-sensitive learning and cost-insensitive learning is that cost-sensitive learning treats different misclassifications differently. That is, the cost for labeling a positive example as negative can be different from the cost for labeling a negative example as positive. Cost-insensitive learning does not take misclassification costs into consideration.

Motivation and Background Classification is an important task in inductive learning and machine learning. A classifier, trained from a set of training examples with class labels, can then be used to predict the class labels of new examples. The class label is usually discrete and finite. Many effective classification algorithms have been developed, such as 7naïve Bayes, 7decision trees, 7neural networks, and 7support vector machines. However, most classification algorithms seek to minimize the error rate: the percentage of incorrect class-label predictions. They ignore the difference between types of misclassification errors. In particular, they implicitly assume that all misclassification errors have equal cost. In many real-world applications, this assumption is not true. The differences between different misclassification errors can be quite large. For example, in medical diagnosis of a certain cancer (where having cancer is regarded as the positive class, and non-cancer (healthy) as negative), misdiagnosing a cancer patient as healthy (the patient is actually positive but is classified as negative; thus it is also called “false negative”) is much more serious (thus expensive) than a false-positive error. The patient could lose his/her life because of a delay in correct diagnosis and treatment. Similarly, if carrying a bomb is positive, then it is much more expensive to miss a terrorist who carries a bomb onto a flight than to search an innocent person. Cost-sensitive learning takes costs, such as the misclassification cost, into consideration. Turney () provides a comprehensive survey of a large variety of different types of costs in data mining and machine


learning, including misclassification costs, data acquisition cost (instance costs and attribute costs), 7active learning costs, computation cost, human–computer interaction cost, and so on. The misclassification cost is singled out as the most important cost, and it has received the most attention in recent years.

Theory

The theory of cost-sensitive learning (Elkan, ; Zadrozny and Elkan, ) describes how the misclassification cost plays its essential role in various cost-sensitive learning algorithms. Without loss of generality, binary classification is assumed (i.e., positive and negative class) in this paper. In cost-sensitive learning, the costs of false positive (actual negative but predicted as positive; denoted as FP), false negative (FN), true positive (TP), and true negative (TN) can be given in a cost matrix, as shown in Table 1. In the table, the notation $C(i, j)$ is also used to represent the misclassification cost of classifying an instance from its actual class $j$ into the predicted class $i$ (1 is used for positive, and 0 for negative). These misclassification cost values can be given by domain experts, or learned via other approaches. In cost-sensitive learning, it is usually assumed that such a cost matrix is given and known. For multiple classes, the cost matrix can be easily extended by adding more rows and more columns.

Cost-Sensitive Learning. Table 1 An Example of Cost Matrix for Binary Classification

                    Actual negative      Actual positive
Predict negative    C(0, 0), or TN       C(0, 1), or FN
Predict positive    C(1, 0), or FP       C(1, 1), or TP

Note that $C(i, i)$ (TP and TN) is usually regarded as the “benefit” (i.e., negated cost) when an instance is predicted correctly. In addition, cost-sensitive learning is often used to deal with datasets with very imbalanced class distributions (see Class Imbalance Problem) (Japkowicz & Stephen, ). Usually (and without loss of generality), the minority or rare class is regarded as the positive class, and it is often more expensive to misclassify an actual positive example into negative than an actual negative example into positive. That is, the value of FN = $C(0, 1)$ is usually larger than that of FP = $C(1, 0)$. This is true for the cancer example mentioned earlier (cancer patients are usually rare in the population, but predicting an actual cancer patient as negative is usually very costly) and the bomb example (terrorists are rare).

Given the cost matrix, an example should be classified into the class that has the minimum expected cost. This is the minimum expected cost principle. The expected cost $R(i \mid x)$ of classifying an instance $x$ into class $i$ (by a classifier) can be expressed as:

$$R(i \mid x) = \sum_j P(j \mid x)\, C(i, j), \tag{1}$$

where $P(j \mid x)$ is the estimated probability that the instance $x$ belongs to class $j$. That is, the classifier will classify an instance $x$ into positive class if and only if:

$$P(0 \mid x)\, C(1, 0) + P(1 \mid x)\, C(1, 1) \le P(0 \mid x)\, C(0, 0) + P(1 \mid x)\, C(0, 1).$$

This is equivalent to:

$$P(0 \mid x)\, (C(1, 0) - C(0, 0)) \le P(1 \mid x)\, (C(0, 1) - C(1, 1)).$$

Thus, the decision (of classifying an example into positive) will not be changed if a constant is added into a column of the original cost matrix. Thus, the original cost matrix can always be converted to a simpler one by subtracting $C(0, 0)$ from the first column, and $C(1, 1)$ from the second column. After such conversion, the simpler cost matrix is shown in Table 2. Thus, any given cost matrix can be converted to one with $C(0, 0) = C(1, 1) = 0$. (Here it is assumed that the misclassification cost is the same for all examples. This property is a special case of the one discussed in Elkan ().)

Cost-Sensitive Learning. Table 2 A Simpler Cost Matrix with an Equivalent Optimal Classification

                    True negative        True positive
Predict negative    0                    C(0, 1) – C(1, 1)
Predict positive    C(1, 0) – C(0, 0)    0

In the rest of the paper, it will be assumed that $C(0, 0) = C(1, 1) = 0$. Under this assumption, the classifier will classify an instance $x$ into positive class if and only if:

$$P(0 \mid x)\, C(1, 0) \le P(1 \mid x)\, C(0, 1).$$

As $P(0 \mid x) = 1 - P(1 \mid x)$, a threshold $p^*$ can be obtained for the classifier to classify an instance $x$ into positive if $P(1 \mid x) \ge p^*$, where

$$p^* = \frac{C(1, 0)}{C(1, 0) + C(0, 1)}. \tag{2}$$
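The minimum expected cost rule and the resulting threshold can be checked numerically. The following is a minimal sketch, not taken from the cited papers: the function names and the cost values are illustrative assumptions, and the array `cost[i][j]` is assumed to hold the cost of predicting class i when the actual class is j.

```python
import numpy as np

def expected_costs(p_pos, cost):
    """Return [R(0|x), R(1|x)] given P(1|x) = p_pos.

    cost[i][j] holds C(i, j): the cost of predicting class i when the
    actual class is j (1 = positive, 0 = negative).
    """
    posterior = np.array([1.0 - p_pos, p_pos])        # [P(0|x), P(1|x)]
    return np.asarray(cost, dtype=float) @ posterior  # sum_j P(j|x) C(i, j)

def threshold(cost):
    """Threshold p* on P(1|x), using net costs so that any cost matrix works."""
    c = np.asarray(cost, dtype=float)
    fp = c[1, 0] - c[0, 0]   # net cost of a false positive
    fn = c[0, 1] - c[1, 1]   # net cost of a false negative
    return fp / (fp + fn)

def classify(p_pos, cost):
    """Predict 1 iff its expected cost does not exceed that of predicting 0."""
    r = expected_costs(p_pos, cost)
    return 1 if r[1] <= r[0] else 0

# Illustrative costs only: FP = C(1,0) = 1, FN = C(0,1) = 10.
cost = [[0.0, 10.0],
        [1.0, 0.0]]
# Same matrix with 5 added to the first column and 3 to the second.
shifted = [[5.0, 13.0],
           [6.0, 3.0]]

print(threshold(cost), threshold(shifted))
print(classify(0.2, cost), classify(0.05, cost))
```

Both matrices give the same threshold, which illustrates why adding a constant to a column of the cost matrix does not change the decision.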

Thus, if a cost-insensitive classifier can produce a posterior probability estimation $P(1 \mid x)$ for each test example $x$, one can make the classifier cost-sensitive by simply choosing the classification threshold according to (2), and classify any example to be positive whenever $P(1 \mid x) \ge p^*$. This is what several cost-sensitive meta-learning algorithms, such as Relabeling, are based on (see later for details). However, some cost-insensitive classifiers, such as C4.5, may not be able to produce accurate probability estimation; they return a class label without a probability estimate. Empirical Thresholding (Sheng & Ling, ) does not require accurate estimation of probabilities – an accurate ranking is sufficient. It simply uses cross-validation to search for the best probability value $p^*$ to use as a threshold.

Traditional cost-insensitive classifiers are designed to predict the class in terms of a default, fixed threshold of 0.5. Elkan () shows that one can “rebalance” the original training examples by sampling, such that the classifiers with the 0.5 threshold are equivalent to the classifiers with the $p^*$ threshold as in (2), in order to achieve cost-sensitivity. The rebalance is done as follows. If all positive examples (as they are assumed as the rare class) are kept, then the number of negative examples should be multiplied by $C(1, 0)/C(0, 1)$ = FP/FN. Note that as usually FP < FN, the multiple is less than 1. This is, thus, often called “under-sampling the majority class.” This is also equivalent to “proportional sampling,” where positive and negative examples are sampled by the ratio of:

$$p(1)\, \mathrm{FN} : p(0)\, \mathrm{FP}, \tag{3}$$

where $p(1)$ and $p(0)$ are the prior probabilities of the positive and negative examples in the original training set. That is, the prior probabilities and the costs are interchangeable: doubling $p(1)$ has the same effect as doubling FN, or halving FP (Drummond & Holte, ). Most sampling meta-learning methods, such as costing (Zadrozny, Langford, & Abe, ), are based on (3) above (see later for details). Almost all meta-learning approaches are based on either (2) or (3), for the thresholding- and sampling-based meta-learning methods, respectively, to be discussed in the next section.
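The equivalence between rebalancing and thresholding can be illustrated with a small calculation. The sketch below assumes an idealized classifier whose output equals the true posterior; under that assumption, keeping each negative example with rate FP/FN changes the posterior from p to p / (p + rate · (1 − p)). The function names and cost values are illustrative assumptions.

```python
def negative_keep_rate(fp_cost, fn_cost):
    """Factor by which the number of negative examples is multiplied."""
    return fp_cost / fn_cost

def rebalanced_posterior(p_pos, keep_rate):
    """Posterior P(1|x) after negatives are kept with probability keep_rate.

    Follows from Bayes' rule when only the class priors change.
    """
    return p_pos / (p_pos + keep_rate * (1.0 - p_pos))

def proportional_sampling_ratio(prior_pos, fp_cost, fn_cost):
    """Unnormalized sampling weights (positive, negative) = (p(1) FN, p(0) FP)."""
    return prior_pos * fn_cost, (1.0 - prior_pos) * fp_cost

fp_cost, fn_cost = 1.0, 10.0                  # illustrative values only
p_star = fp_cost / (fp_cost + fn_cost)        # threshold on the original data
rate = negative_keep_rate(fp_cost, fn_cost)

# Thresholding the original posterior at p* and the rebalanced one at 0.5 agree.
for p in (0.01, 0.05, 0.2, 0.5, 0.9):
    print(p, p >= p_star, rebalanced_posterior(p, rate) >= 0.5)
```

In practice a learned classifier only approximates the posterior, so the two routes need not give identical predictions.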

Structure of Learning System

Broadly speaking, cost-sensitive learning can be categorized into two categories. The first one is to design classifiers that are cost-sensitive in themselves. They are called the direct method. Examples of direct cost-sensitive learning are ICET (Turney, ) and cost-sensitive decision tree (Drummond & Holte, ; Ling, Yang, Wang, & Zhang, ). The other category is to design a “wrapper” that converts any existing cost-insensitive (or cost-blind) classifiers into cost-sensitive ones. The wrapper method is also called cost-sensitive meta-learning method, and it can be further categorized into thresholding and sampling. Here is a hierarchy of the cost-sensitive learning and some typical methods. This paper will focus on cost-sensitive meta-learning that considers the misclassification cost only.

Cost-Sensitive learning
– Direct methods
  ● ICET (Turney, )
  ● Cost-sensitive decision trees (Drummond & Holte, ; Ling et al., )
– Meta-learning
  ● Thresholding
      MetaCost (Domingos, )
      CostSensitiveClassifier (CSC in short) (Witten & Frank, )
      Cost-sensitive naïve Bayes (Chai, Deng, Yang, & Ling, )
      Empirical Thresholding (ET in short) (Sheng & Ling, )
  ● Sampling
      Costing (Zadrozny et al., )
      Weighting (Ting, )

Direct Cost-Sensitive Learning

The main idea of building a direct cost-sensitive learning algorithm is to introduce and utilize misclassification costs directly in the learning algorithm. There are several works on direct cost-sensitive learning algorithms, such as ICET (Turney, ) and cost-sensitive decision trees (Ling et al., ). ICET (Turney, ) incorporates misclassification costs in the fitness function of genetic algorithms. On the other hand, cost-sensitive decision tree (Ling et al., ), called CSTree here, uses the misclassification costs directly in its tree building process. That is, instead of minimizing entropy in attribute selection as in C4.5, CSTree selects the best attribute by the expected total cost reduction. That is, an attribute is selected as a root of the (sub)tree if it minimizes the total misclassification cost. Note that as both ICET and CSTree directly take costs into model building, they can also easily take attribute costs (and perhaps other costs) directly into consideration, while meta cost-sensitive learning algorithms generally cannot.

Drummond and Holte () investigate the cost-sensitivity of the four commonly used attribute selection criteria of decision tree learning: accuracy, Gini, entropy, and DKM. They claim that the sensitivity to cost is highest with accuracy, followed by Gini, entropy, and DKM.

Cost-Sensitive Meta-Learning

Cost-sensitive meta-learning converts existing cost-insensitive classifiers into cost-sensitive ones without modifying them. Thus, it can be regarded as a middleware component that preprocesses the training data, or post-processes the output, from the cost-insensitive learning algorithms. Cost-sensitive meta-learning can be further classified into two main categories: thresholding and sampling, based on (2) and (3) respectively, as discussed in the theory section.

Thresholding uses (2) as a threshold to classify examples into positive or negative if the cost-insensitive classifiers can produce probability estimations. MetaCost (Domingos, ) is a thresholding method. It first uses bagging on decision trees to obtain reliable probability estimations of training examples, relabels the classes of training examples according to (), and then uses the relabeled training instances to build a cost-insensitive classifier. CSC (Witten & Frank, ) also uses () to predict the class of test instances. More specifically, CSC uses a cost-insensitive algorithm to obtain the probability estimations $P(j \mid x)$ of each test instance. (CSC is a meta-learning method and can be applied to any classifiers.) Then it uses () to predict the class label of the test examples. Cost-sensitive naïve Bayes (Chai et al., ) uses () to classify test examples based on the posterior probability produced by the naïve Bayes.

As seen, all thresholding-based meta-learning methods rely on accurate probability estimations of $P(1 \mid x)$ for the test example $x$. To achieve this, Zadrozny and Elkan () propose several methods to improve the calibration of probability estimates. ET (Empirical Thresholding) (Sheng and Ling, ) is a thresholding-based meta-learning method. It does not require accurate estimation of probabilities – an accurate ranking is sufficient. ET simply uses cross-validation to search for the best probability from the training instances as the threshold, and uses the searched threshold to predict the class label of test instances.

On the other hand, sampling first modifies the class distribution of the training data according to (3), and then applies cost-insensitive classifiers on the sampled data directly. There is no need for the classifiers to produce probability estimations, as long as they can classify positive or negative examples accurately. Zadrozny et al. () show that proportional sampling with replacement produces duplicated cases in the training, which in turn produces overfitting in model building. Instead, Zadrozny et al. () propose to use “rejection sampling” to avoid duplication.
More specifically, each instance in the original training set is drawn once, and accepted into the sample with the accepting probability $C(j, i)/Z$, where $C(j, i)$ is the misclassification cost of class $i$, and $Z$ is an arbitrary constant such that $Z \ge \max_{i,j} C(j, i)$. When $Z = \max_{i,j} C(j, i)$, this is equivalent to keeping all examples of the rare class, and sampling the majority class without replacement according to $C(1, 0)/C(0, 1)$ – in accordance with (3). Bagging is applied after rejection sampling to improve the results further. The resulting method is called Costing.

Weighting (Ting, ) can also be viewed as a sampling method. It assigns a normalized weight to each instance according to the misclassification costs specified in (3). That is, examples of the rare class (which carries a higher misclassification cost) are assigned proportionally high weights. Examples with high weights can be viewed as example duplication – thus oversampling. Weighting then induces cost-sensitivity by integrating the instances’ weights directly into C4.5, as C4.5 can take example weights directly in the entropy calculation. It works whenever the original cost-insensitive classifiers can accept example weights directly. (Thus, it can be said that Weighting is a semi meta-learning method.) In addition, Weighting does not rely on bagging as Costing does, as it “utilizes” all examples in the training set.
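The rejection sampling step described above can be sketched as follows. This is only an illustration of the acceptance rule (each instance drawn once and kept with probability cost/Z); it omits the bagging stage of Costing, and the function name, the `misclassification_cost` mapping from class label to cost, and the data are assumptions made for the example.

```python
import numpy as np

def rejection_sample(labels, misclassification_cost, rng, z=None):
    """Return indices of instances accepted with probability cost / z.

    misclassification_cost maps a class label to the cost of misclassifying
    an instance of that class; z defaults to the largest such cost.
    """
    costs = np.array([misclassification_cost[int(label)] for label in labels],
                     dtype=float)
    if z is None:
        z = max(misclassification_cost.values())
    if z < costs.max():
        raise ValueError("z must be at least the largest cost")
    accepted = rng.random(len(labels)) < costs / z   # one draw per instance
    return np.flatnonzero(accepted)

rng = np.random.default_rng(0)
labels = np.array([1] * 100 + [0] * 10_000)   # rare positive class
cost_by_class = {1: 10.0, 0: 1.0}             # illustrative FN and FP costs
kept = rejection_sample(labels, cost_by_class, rng)
print(np.bincount(labels[kept]))              # counts of kept negatives, positives
```

With z equal to the largest cost, every positive example is kept and each negative example is kept with probability FP/FN; because each instance is considered once, no duplicates are produced.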

Recommended Reading

Chai, X., Deng, L., Yang, Q., & Ling, C. X. (). Test-cost sensitive naïve Bayesian classification. In Proceedings of the fourth IEEE international conference on data mining. Brighton: IEEE Computer Society Press.
Domingos, P. (). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth international conference on knowledge discovery and data mining, San Diego (pp. –). New York: ACM.
Drummond, C., & Holte, R. (). Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the th international conference on machine learning (pp. –).
Elkan, C. (). The foundations of cost-sensitive learning. In Proceedings of the th international joint conference of artificial intelligence (pp. –). Seattle: Morgan Kaufmann.
Japkowicz, N., & Stephen, S. (). The class imbalance problem: A systematic study. Intelligent Data Analysis, (), –.
Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (). Decision trees with minimal costs. In Proceedings of international conference on machine learning (ICML’).
Sheng, V. S., & Ling, C. X. (). Thresholding for making classifiers cost-sensitive. In Proceedings of the st national conference on artificial intelligence (pp. –), – July , Boston, Massachusetts.
Ting, K. M. (). Inducing cost-sensitive trees via instance weighting. In Proceedings of the second European symposium on principles of data mining and knowledge discovery (pp. –). Heidelberg: Springer.
Turney, P. D. (). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, , –.
Turney, P. D. (). Types of cost in inductive concept learning. In Proceedings of the workshop on cost-sensitive learning at the th international conference on machine learning, Stanford University, California.
Witten, I. H., & Frank, E. (). Data mining – Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.
Zadrozny, B., & Elkan, C. (). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the seventh international conference on knowledge discovery and data mining (pp. –).
Zadrozny, B., Langford, J., & Abe, N. (). Cost-sensitive learning by cost-proportionate instance weighting. In Proceedings of the third International conference on data mining.

Cost-to-Go Function Approximation

See Value Function Approximation

Covariance Matrix Xinhua Zhang Australian National University, Canberra, Australia

Definition

It is convenient to define a covariance matrix by using multi-variate random variables (mrv): $X = (X_1, \ldots, X_d)^\top$. For univariate random variables $X_i$ and $X_j$, their covariance is defined as:

$$\mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)],$$

where $\mu_i$ is the mean of $X_i$: $\mu_i = E[X_i]$. As a special case, when $i = j$, we get the variance of $X_i$, $\mathrm{Var}(X_i) = \mathrm{Cov}(X_i, X_i)$. Now in the setting of mrv, assuming that each component random variable $X_i$ has finite variance under its marginal distribution, the covariance matrix $\mathrm{Cov}(X, X)$ can be defined as a $d$-by-$d$ matrix whose $(i, j)$-th entry is the covariance:

$$(\mathrm{Cov}(X, X))_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)].$$

Its inverse is also called the precision matrix.

Motivation and Background

The covariance between two univariate random variables measures how much they change together, and as a special case, the covariance of a random variable with itself is exactly its variance. It is important to note that covariance is an unnormalized measure of the correlation between the random variables. As a generalization to multi-variate random variables $X = (X_1, \ldots, X_d)^\top$, the covariance matrix is a $d$-by-$d$ matrix whose $(i, j)$-th component is the covariance between $X_i$ and $X_j$. In many applications, it is important to characterize the relations between a set of factors, hence the covariance matrix plays an important role in practice, especially in machine learning.

Theory

It is easy to rewrite the element-wise definition into the matrix form:

$$\mathrm{Cov}(X, X) = E[(X - E[X])(X - E[X])^\top], \tag{1}$$

which naturally generalizes the variance of univariate random variables: $\mathrm{Var}(X) = E[(X - E[X])^2]$. Moreover, it is also straightforward to extend the covariance of a single mrv $X$ to two mrvs $X$ ($d$ dimensional) and $y$ ($s$ dimensional), under the name cross-covariance. It quantifies how much the component random variables in $X$ and $y$ change together. The cross-covariance matrix is defined as a $d \times s$ matrix $\mathrm{Cov}(X, y)$ whose $(i, j)$-th entry is

$$(\mathrm{Cov}(X, y))_{ij} = \mathrm{Cov}(X_i, Y_j) = E[(X_i - E[X_i])(Y_j - E[Y_j])].$$

$\mathrm{Cov}(X, y)$ can also be written in the matrix form as

$$\mathrm{Cov}(X, y) = E[(X - E[X])(y - E[y])^\top],$$

where the expectation is with respect to the joint distribution of $(X, y)$. Obviously, $\mathrm{Cov}(X, y)$ becomes $\mathrm{Cov}(X, X)$ when $y = X$.

Properties

Covariance $\mathrm{Cov}(X, X)$ has the following properties:
1. Positive semi-definiteness. It follows from (1) that $\mathrm{Cov}(X, X)$ is positive semi-definite. $\mathrm{Cov}(X, X) = 0$ if, and only if, $X$ is a constant almost surely, i.e., there exists a constant $x$ such that $\Pr(X \ne x) = 0$. $\mathrm{Cov}(X, X)$ is not positive definite if, and only if, there exists a nonzero constant vector $\alpha$ such that $\langle \alpha, X \rangle$ is constant almost surely.
2. Relating cumulant to moments: $\mathrm{Cov}(X, X) = E[XX^\top] - E[X]E[X]^\top$.
3. Linear transform: If $y = AX + b$ where $A \in \mathbb{R}^{s \times d}$ and $b \in \mathbb{R}^s$, then $\mathrm{Cov}(y, y) = A\,\mathrm{Cov}(X, X)\,A^\top$.

Cross-covariance $\mathrm{Cov}(X, y)$ has the following properties.
1. Symmetry: $\mathrm{Cov}(X, y) = \mathrm{Cov}(y, X)^\top$.
2. Linearity: $\mathrm{Cov}(X_1 + X_2, y) = \mathrm{Cov}(X_1, y) + \mathrm{Cov}(X_2, y)$.
3. Relating to covariance: If $X$ and $y$ have the same dimension, then $\mathrm{Cov}(X + y, X + y) = \mathrm{Cov}(X, X) + \mathrm{Cov}(y, y) + \mathrm{Cov}(X, y) + \mathrm{Cov}(y, X)$.
4. Linear transform: $\mathrm{Cov}(AX, By) = A\,\mathrm{Cov}(X, y)\,B^\top$.

It is highly important to note that $\mathrm{Cov}(X, y) = 0$ is a necessary but not sufficient condition for $X$ and $y$ to be independent.

Correlation Coefficient

Entries in the covariance matrix are sometimes presented in a normalized form by dividing each entry by the product of its corresponding standard deviations. This quantity is called the correlation coefficient, represented as $\rho_{X_i, X_j}$, and defined as

$$\rho_{X_i, X_j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Cov}(X_i, X_i)^{1/2}\, \mathrm{Cov}(X_j, X_j)^{1/2}}.$$

The corresponding matrix is called the correlation matrix. With $\Gamma_X$ set to $\mathrm{Cov}(X, X)$ with all nondiagonal entries zeroed, and $\Gamma_Y$ likewise, the correlation matrix is given by

$$\mathrm{Corr}(X, y) = \Gamma_X^{-1/2}\, \mathrm{Cov}(X, y)\, \Gamma_Y^{-1/2}.$$

The correlation coefficient takes values in $[-1, 1]$.

Parameter Estimation

Given observations $x_1, \ldots, x_n$ of a mrv $X$, an unbiased estimator of $\mathrm{Cov}(X, X)$ is:

$$S = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. The denominator $n - 1$ reflects the fact that the mean is unknown and the sample mean is used in its place. Note the maximum likelihood estimator in this case replaces the denominator $n - 1$ by $n$.
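The estimator and several of the properties listed in this section can be checked numerically. The sketch below assumes NumPy and synthetic Gaussian data; the helper names `sample_covariance` and `correlation_from_covariance` are introduced only for this illustration.

```python
import numpy as np

def sample_covariance(x, unbiased=True):
    """Covariance estimate from an (n, d) array with one observation per row.

    unbiased=True uses the denominator n - 1; otherwise n is used.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    centered = x - x.mean(axis=0)
    return centered.T @ centered / (n - 1 if unbiased else n)

def correlation_from_covariance(cov):
    """Divide each entry by the product of the corresponding standard deviations."""
    inv_std = 1.0 / np.sqrt(np.diag(cov))
    return cov * np.outer(inv_std, inv_std)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))            # synthetic data, for illustration only
s = sample_covariance(x)

# Linear transform property: y = A x + b has covariance A S A^T.
a = rng.normal(size=(2, 3))
b = rng.normal(size=2)
y = x @ a.T + b
print(np.allclose(sample_covariance(y), a @ s @ a.T))

corr = correlation_from_covariance(s)
print(np.round(corr, 3))
```

The same quantities are available directly as `np.cov(x, rowvar=False)` and `np.corrcoef(x, rowvar=False)`; the explicit version is shown only to mirror the formulas.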


Conjugate Priors

A covariance matrix is used to define the Gaussian distribution. In this case, the inverse Wishart distribution is the conjugate prior for the covariance matrix. Since the Gamma distribution is a 1-D version of the Wishart distribution, in the 1-D case the Gamma is the conjugate prior for the precision matrix.

Applications

Several key uses of the covariance matrix are reviewed here.

Correlation and Kernel Methods

In many machine learning problems, we often need to quantify the correlation of two mrvs which may be from two different spaces. For example, we may want to study how much the image stream of a movie is correlated with the comments it receives. For simplicity, we consider a $d$-dimensional mrv $X$ and an $s$-dimensional mrv $y$. To study their correlation, suppose we have $n$ pairs of observations $\{(x_i, y_i)\}_{i=1}^{n}$ drawn iid from a certain underlying joint distribution of $(X, y)$. Let $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, and stack $\{x_i\}$ and $\{y_i\}$ into $\tilde{X} = (x_1, \ldots, x_n)^\top$ and $\tilde{Y} = (y_1, \ldots, y_n)^\top$ respectively. Then the cross-covariance matrix $\mathrm{Cov}(X, y)$ can be estimated by $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^\top$. To quantify the cross-correlation by a real number, we need to apply some norm of the cross-covariance matrix, and the simplest one is the Frobenius norm, $\|A\|_F^2 = \sum_{ij} A_{ij}^2$. Therefore, we obtain a measure of cross-correlation,

$$\left\| \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^\top \right\|_F^2 = \frac{1}{n^2} \operatorname{tr}\left( H \tilde{X} \tilde{X}^\top H \tilde{Y} \tilde{Y}^\top \right), \tag{2}$$

where $H_{ij} = \delta_{ij} - \frac{1}{n}$, and $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. It is important to notice two things. First, in this measure, the inner product is performed only in the space of $X$ and $y$ separately, i.e., no transformation between $X$ and $y$ is required. Second, the data points affect the measure only via inner products $x_i^\top x_j$ as the $(i, j)$-th entry of $\tilde{X} \tilde{X}^\top$ (and similarly for $y_i$). Hence we can endow new inner products on $X$ and $y$, which eventually allows us to apply kernels, e.g., Gretton, Herbrich, Smola, Bousquet, & Schölkopf (). In a nutshell, kernel methods (Schölkopf & Smola, ) redefine the inner product $x_i^\top x_j$ by mapping $x_i$ to a richer feature space via $\phi(x_i)$ and then compute the inner product there: $k(x_i, x_j) := \phi(x_i)^\top \phi(x_j)$. Since the measure in (2) only needs inner products, one can even directly define $k(\cdot, \cdot)$ without explicitly specifying $\phi$. This allows us to

● Implicitly use a rich feature space whose dimension can be infinitely high.
● Apply this measure of cross correlation to non-Euclidean spaces as long as a kernel $k(x_i, x_j)$ can be defined on it.

Correlation and Least Squares Approximation

The measure of (2) can be equivalently motivated by least square linear regression. That is, we look for a linear transform $T : \mathbb{R}^d \to \mathbb{R}^s$ which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \| (y_i - \bar{y}) - T(x_i - \bar{x}) \|^2.$$

One can show that its minimum objective value is exactly equal to (2) up to a constant, as long as all $y_i - \bar{y}$ and $x_i - \bar{x}$ have unit length. In practice, this can be achieved by normalization. Or, the measure in (2) itself can be normalized by replacing the covariance matrix with the correlation matrix.

Principal Component Analysis

The covariance matrix plays a key role in principal component analysis (PCA). Assume that we are given $n$ iid observations $x_1, \ldots, x_n$ of a mrv $X$, and let $\bar{x} = \frac{1}{n} \sum_i x_i$. PCA tries to find a set of orthogonal directions $w_1, w_2, \ldots$, such that the projection of $X$ to the direction $w_1$, $w_1^\top X$, has the highest variance among all possible directions in the $d$-dimensional space. After subtracting from $X$ the projection to $w_1$, $w_2$ is chosen as the highest variance projection direction for the remainder. This procedure goes on for the required number of components.

To find $w_1 := \operatorname{argmax}_w \mathrm{Var}(w^\top X)$, we need an empirical estimate of $\mathrm{Var}(w^\top X)$. Estimating $E[(w^\top X)^2]$ by $w^\top \left( \frac{1}{n} \sum_i x_i x_i^\top \right) w$, and $E[w^\top X]$ by $\frac{1}{n} \sum_i w^\top x_i$, we get $w_1 = \operatorname{argmax}_{w : \|w\| = 1} w^\top S w$, where

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,$$

i.e., $S$ is $\frac{n-1}{n}$ times the unbiased empirical estimate of the covariance of $X$, based on samples $x_1, \ldots, x_n$. $w_1$ turns out to be exactly the eigenvector of $S$ corresponding to the greatest eigenvalue. Note that PCA is independent of the distribution of $X$. More details on PCA can be found in Jolliffe ().

Gaussian Processes

Gaussian processes are another important framework in machine learning that relies on the covariance matrix. A Gaussian process is a distribution over functions $f(\cdot)$ from a certain space $\mathcal{X}$ to $\mathbb{R}$, such that for any $n \in \mathbb{N}$ and any $n$ points $\{x_i \in \mathcal{X}\}_{i=1}^{n}$, the set of values of $f$ evaluated at $\{x_i\}_i$, $\{f(x_1), \ldots, f(x_n)\}$, has an $n$-dimensional Gaussian distribution. Different choices of the covariance matrix of the multi-variate Gaussian lead to different stochastic processes such as Wiener process, Brownian motion, Ornstein–Uhlenbeck process, etc. In these cases, it makes more sense to define a covariance function $C : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$, such that given any set $\{x_i \in \mathcal{X}\}_{i=1}^{n}$ for any $n \in \mathbb{N}$, the $n$-by-$n$ matrix $(C(x_i, x_j))_{ij}$ is positive semi-definite and can be used as the covariance matrix. This further allows straightforward kernelization of a Gaussian process by using the kernel function as the covariance function.

Although the space of functions is infinite dimensional, the marginalization property of multi-variate Gaussian distributions guarantees that the user of the model only needs to consider the observed $x_i$, and ignore all the other possible $x \in \mathcal{X}$. This important property says that for a mrv $X = (X_1^\top, X_2^\top)^\top \sim \mathcal{N}(\mu, \Sigma)$, the marginal distribution of $X_1$ is $\mathcal{N}(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the submatrix of $\Sigma$ corresponding to $X_1$ (and similarly for $\mu_1$). So taking into account the random variable $X_2$ will not change the marginal distribution of $X_1$.

For a complete treatment of the covariance matrix from a statistical perspective, see Casella and Berger (); Mardia, Kent, and Bibby () provide details for the multi-variate case. PCA is comprehensively discussed in Jolliffe (), and kernel methods are introduced in Schölkopf and Smola (). Williams & Rasmussen () give the state of the art on how Gaussian processes can be utilized for machine learning.
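Two of the applications above lend themselves to a short numerical sketch: finding the first principal direction as the top eigenvector of S, and building a covariance matrix from a covariance function. The code assumes NumPy, synthetic data, and a squared-exponential covariance function chosen purely for illustration; the function names are not from the cited references.

```python
import numpy as np

def first_principal_component(x):
    """Top eigenvector and eigenvalue of S (denominator n) for an (n, d) array."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    s = centered.T @ centered / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(s)      # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]

def rbf_covariance(points, length_scale=1.0):
    """Matrix (C(x_i, x_j))_ij for a squared-exponential covariance function."""
    points = np.asarray(points, dtype=float)
    points = points.reshape(len(points), -1)
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])   # most variance on axis 0
w, lam = first_principal_component(x)
print(w, lam, np.var(x @ w))      # the projected variance equals the eigenvalue

pts = np.linspace(0.0, 1.0, 6)
k = rbf_covariance(pts)
print(np.linalg.eigvalsh(k).min())                      # non-negative up to rounding
# Marginalization: the covariance of a subset of points is the matching submatrix.
print(np.allclose(rbf_covariance(pts[:3]), k[:3, :3]))
```

The last check reflects the marginalization property: evaluating the covariance function on fewer points gives exactly the corresponding block of the larger matrix.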

Cross References

Gaussian Distribution
Gaussian Processes
Kernel Methods

Recommended Reading

Casella, G., & Berger, R. (). Statistical inference (nd ed.). Pacific Grove, CA: Duxbury.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Schölkopf, B. (). Kernel methods for measuring independence. Journal of Machine Learning Research, , –.
Jolliffe, I. T. (). Principal component analysis (nd ed.). Springer series in statistics. New York: Springer.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (). Multivariate analysis. London: Academic Press.
Schölkopf, B., & Smola, A. (). Learning with kernels. Cambridge, MA: MIT Press.
Williams, C. K. I., & Rasmussen, C. E. (). Gaussian processes for regression. Cambridge, MA: MIT Press.

Covering Algorithm

See Rule Learning

Credit Assignment Claude Sammut The University of New South Wales

Synonyms

Structural credit assignment; Temporal credit assignment

Definition When a learning system employs a complex decision process, it must assign credit or blame for the outcomes to each of its decisions. Where it is not possible to directly attribute an individual outcome to each decision, it is necessary to apportion credit and blame between each of the combinations of decisions that contributed to the outcome. We distinguish two cases in the credit assignment problem. Temporal credit assignment refers to the assignment of credit for outcomes to actions. Structural credit assignment refers to the assignment of credit for actions to internal decisions. The first subproblem involves determining when the actions that deserve credit were taken and the second involves assigning credit to the internal structure of actions (Sutton, ).

Motivation

Consider the problem of learning to balance a pole that is hinged on a cart (Michie & Chambers, ; Anderson & Miller, ). The cart is constrained to run along a track of finite length and a fixed force can be applied to push the cart left or right. A controller for the pole and cart system must make a decision whether to push left or right at frequent, regular time intervals, for example, times a second. Suppose that this controller is capable of learning from trial-and-error. If the pole falls over, then it must determine which of the actions it took helped or hurt its performance. Determining those actions is the problem of temporal credit assignment. Although the actions are directly responsible for the outcome of a trial, the internal process for choosing the action indirectly affects the outcome. Assigning credit or blame to those internal processes that lead to the choice of action is the structural credit assignment problem.

In the case of pole balancing, the learning system will typically keep statistics such as how long, on average, the pole remained balanced after taking a particular action in a particular state, or after a failure, it may count back and determine the average amount of time to failure after taking a particular action in a particular state. Using these statistics, the learner attempts to determine the best action for a given state.

The above example is typical of many problems in reinforcement learning (Sutton & Barto, ), where an agent interacts with its environment and through that interaction, learns to improve its performance in a task. Although Samuel () was the first to use a form of reinforcement learning in his checkers playing program, Minsky () first articulated the credit assignment problem, as follows:

▸ Using devices that also learn which events are associated with reinforcement, i.e., reward, we can build more autonomous “secondary reinforcement” systems. In applying such methods to complex problems, one encounters a serious difficulty – in distributing credit for success of a complex strategy among the many decisions that were involved.

The BOXES algorithm of Michie and Chambers () learned to control a pole balancer and performed credit assignment, but the problem of credit assignment later became central to reinforcement learning, particularly following the work of Sutton (). Although credit assignment has become most strongly identified with reinforcement learning, it may appear in any learning system that attempts to assess and revise its decision-making process.

Structural Credit Assignment

The setting for our learning system is that we have an agent that interacts with an environment. The environment may be a virtual one, as in game playing, or it may be physical, as in a robot performing some task. The agent receives input, possibly through sensing devices, that allows it to characterize the state of the world. Somehow, the agent must map these inputs to appropriate responses. These responses may change the state of the world. In reinforcement learning, we assume that the agent will receive some reward signal after an action or sequence of actions. Its job is to maximize these rewards over time.

Structural credit assignment is associated with generalization over the input space of the agent. For example, a game player may have to respond to a very large number of potential board positions or a robot may have to respond to a stream of camera images. It is infeasible to learn a complete mapping from every possible input to every possible output. Therefore, a learning agent will typically use some means of grouping input signals. In the case of the BOXES pole balancer, Michie and Chambers discretized the state space. The state is characterized by the cart’s position and velocity and the pole’s angle and angular velocity. These parameters create a four-dimensional space, which was broken into three regions (left, center, right) for the pole angle, five for the angular velocity, and three for the cart position and velocity. These choices were arbitrary and other combinations also worked. Having divided the input space into non-overlapping regions, Michie and Chambers associated a push-left and push-right action with each region, or box. The learning algorithm maintains a score for each action and chooses the next action based on that score.

BOXES was an early and simple example of creating an internal representation for mapping inputs to outputs. The problem with this method is that the structure of the decision-making system is fixed at the start and the learner is incapable of changing the representation. This may be needed if, for example, the subdivisions that were chosen do not correspond to a real decision boundary. A learning system that could adapt its representation has an advantage in this case.

The BOXES representation can be thought of as a lookup table that implements a function that maps an input to an output. The fixed lookup table can be replaced by a function approximator that, given examples from the desired function, generalizes from them to construct an approximation of that function. Different function approximation techniques can be used. For example, Moore’s () function approximator was a nearest-neighbor algorithm, implemented using kd-trees to improve efficiency. Other function approximation methods may also be used, e.g., Albus’ CMAC algorithm (), locally weighted regression (Atkeson, Schaal, & Moore, ), perceptrons (Rosenblatt, ), multi-layer networks (Hinton, Rumelhart, & Williams, ), radial basis functions, etc.

Structural credit assignment is also addressed in the creation of hierarchical representations. See hierarchical reinforcement learning. Other approaches to structural credit assignment include value function approximation (Bertsekas & Tsitsiklis, ) and automatic basis generation (Mahadevan, ). See the entry on Gaussian Processes for examples of recent Bayesian and kernel method based approaches to solving the credit assignment problem.
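The fixed lookup-table representation described for BOXES can be sketched as a mapping from a continuous state to a box index, with one score per action in each box. The region counts follow the text (three each for cart position, cart velocity, and pole angle; five for angular velocity), but the bin edges, the variable names, and the greedy action rule are hypothetical choices for illustration and are not the values or scoring scheme used by Michie and Chambers.

```python
from bisect import bisect_right

# Hypothetical bin edges: k edges split a variable into k + 1 regions.
EDGES = {
    "cart_position": [-0.8, 0.8],                      # 3 regions
    "cart_velocity": [-0.5, 0.5],                      # 3 regions
    "pole_angle": [-0.02, 0.02],                       # 3 regions
    "pole_angular_velocity": [-1.0, -0.3, 0.3, 1.0],   # 5 regions
}

N_BOXES = 1
for edges in EDGES.values():
    N_BOXES *= len(edges) + 1

def box_index(state):
    """Map a state (dict of four floats) to the index of its box."""
    index = 0
    for name, edges in EDGES.items():
        region = bisect_right(edges, state[name])   # region number within this variable
        index = index * (len(edges) + 1) + region
    return index

def choose_action(scores, state):
    """Return 0 (push left) or 1 (push right), whichever scores higher in the box."""
    left, right = scores[box_index(state)]
    return 0 if left >= right else 1

scores = [[0.0, 0.0] for _ in range(N_BOXES)]   # one [left, right] pair per box
state = {"cart_position": 0.1, "cart_velocity": -0.7,
         "pole_angle": 0.0, "pole_angular_velocity": 0.5}
scores[box_index(state)][1] = 1.0               # pretend pushing right did well here
print(N_BOXES, box_index(state), choose_action(scores, state))
```

Because the edges are fixed before learning starts, the learner can only adjust the scores, which is the limitation the text attributes to this representation.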

Temporal Credit Assignment In the pole balancing example described above, the learning system receives a signal when the pole has fallen over. How does it know which actions leading up to the failure contributed to the fall? The system will receive a high-level punishment in the event of a failure or a reward in tasks where there is a goal to be achieved. In either case, it makes sense to assign the greatest credit or blame to the most recent actions and assign progressively less to the preceding actions. Each time a learning trial is repeated, the value of an action is updated so that if it leads to another action of higher value, its weight is increased. Thus, the reward or punishment propagates back through the sequence of decisions taken by the system. The credit assignment problem was addressed by Michie and Chambers, in the BOXES, algorithm but many other solutions

have subsequently been proposed. See the entries on Q-learning (Watkins, ; Watkins & Dayan, ) and temporal difference learning (Barto, Sutton, & Anderson, ; Sutton, ).

Although temporal credit assignment is usually associated with reinforcement learning, it also appears in other forms of learning. In learning by imitation or behavioral cloning, an agent observes the actions of another agent and tries to learn from traces of behaviors. In this case, the learner must judge which actions of the other agent should receive credit or blame. Plan learning also encounters the same problem (Benson & Nilsson, ; Wang, Simon, & Lehman, ), as does explanation-based learning (Mitchell, Keller, & Kedar-Cabelli, ; Dejong & Mooney, ; Laird, Newell, & Rosenbloom, ).

To illustrate the connection with explanation-based learning, we use one of the earliest examples of this kind of learning, Mitchell and Utgoff's LEX program (Mitchell, Utgoff, & Banerji, ). The program was intended to learn heuristics for performing symbolic integration. Given a mathematical expression that included an integral sign, the program tried to transform the expression into one that did not. The standard symbolic integration operators were known to the program, but not when it is best to apply them. The task of the learning system was to learn the heuristics for when to apply the operators. This was done by experimentation. If no heuristics were available, the program attempted a brute force search. If the search was successful, all the operators applied in reaching the solution were assumed to be positive examples for a heuristic, whereas operators applied during a failed attempt became negative examples. Thus, LEX performed a simple form of credit assignment, which is typical of any system that learns how to improve sequences of decisions.

Genetic algorithms can also be used to evolve rules that perform sequences of actions (Holland, ). When situation-action rules are applied in a sequence, we have a credit assignment problem that is similar to the one we face in reinforcement learning. That is, how do we know which rules were responsible for success or failure and to what extent? Grefenstette () describes a bucket brigade algorithm in which rules are given strengths that are adjusted to reflect credit or blame. This is similar to temporal difference learning except that in the bucket brigade the strengths apply to rules rather than states. See Classifier Systems; for a more comprehensive survey of bucket brigade methods, see Goldberg ().
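The idea that the most recent actions receive the greatest share of credit or blame, and that values propagate back through the sequence over repeated trials, can be illustrated with a small sketch. This is not BOXES, Q-learning, or the bucket brigade; it is a simplified backward sweep written for illustration, and the names `update_values`, `alpha`, and the action labels are hypothetical.

```python
def update_values(values, trajectory, outcome, alpha=0.5):
    """Run one learning trial over a sequence of actions.

    The last action is pulled toward the terminal outcome (reward or
    punishment); each earlier action is pulled toward the freshly updated
    value of its successor, so blame fades with distance from the outcome.
    """
    target = outcome
    for action in reversed(trajectory):
        old = values.get(action, 0.0)
        values[action] = old + alpha * (target - old)
        target = values[action]  # the predecessor learns from this value
    return values


# Assumed toy trial: three actions followed by a failure signal of -1.
values = update_values({}, ["a1", "a2", "a3"], outcome=-1.0)
```

After this single trial the final action `a3` carries the largest blame and `a1` the smallest. Repeating the trial keeps moving all three values toward the outcome, which is the backward propagation described in the text.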

Transfer Learning
After a person has learned to perform some task, learning a new, but related, task is usually easier because knowledge of the first learning episode is transferred to the new task. Transfer learning is particularly useful for acquiring new concepts or behaviors when given only a small amount of training data. It can be viewed as a form of credit assignment because successes or failures in previous learning episodes bias future learning. Reid (, ) identifies three forms of inductive bias involved in transfer learning for rules: language bias, which determines what kinds of rules can be constructed by the learner; search bias, which determines the order in which rules will be searched; and evaluation bias, which determines how the quality of the rules will be assessed. Note that learning language bias is a form of structural credit assignment. Similarly, where rules are applied sequentially, learning evaluation bias becomes temporal credit assignment. Taylor and Stone () give a comprehensive survey of transfer in reinforcement learning, in which they describe a variety of techniques for transferring the structure of an RL task from one case to another. They also survey methods for transferring evaluation bias. Transfer learning can be applied in many different settings. Caruana () developed a system for transferring inductive bias in neural networks performing multitask learning, and more recent research has been directed toward transfer learning in Bayesian networks (Niculescu-Mizil & Caruana, ). See the entry on Transfer Learning, and Silver et al. () and Banerjee, Liu, and Youngblood () for recent work on transfer learning.

Cross References
Bayesian Network
Classifier Systems
Genetic Algorithms
Hierarchical Reinforcement Learning
Inductive Bias
kd-Trees
Locally Weighted Regression
Nearest-Neighbor
Perceptrons
Radial Basis Function
Reinforcement Learning
Temporal Difference Learning
Transfer Learning

Recommended Reading
Albus, J. S. (). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, Transactions ASME, (), –.
Anderson, C. W., & Miller, W. T. (). A set of challenging control problems. In W. Miller, R. S. Sutton, & P. J. Werbos (Eds.), Neural Networks for Control. Cambridge: MIT Press.
Atkeson, C., Schaal, S., & Moore, A. (). Locally weighted learning. AI Review, , –.
Banerjee, B., Liu, Y., & Youngblood, G. M. (Eds.), (). Proceedings of the ICML workshop on “Structural knowledge transfer for machine learning.” Pittsburgh, PA.
Barto, A., Sutton, R., & Anderson, C. (). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-, –.
Benson, S., & Nilsson, N. J. (). Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, & S. Muggleton (Eds.), Machine Intelligence . Oxford: Oxford University Press.
Bertsekas, D. P., & Tsitsiklis, J. (). Neuro-dynamic programming. Nashua, NH: Athena Scientific.
Caruana, R. (). Multitask learning. Machine Learning, , –.
Dejong, G., & Mooney, R. (). Explanation-based learning: An alternative view. Machine Learning, , –.
Goldberg, D. E. (). Genetic algorithms in search, optimization and machine learning. Boston: Addison-Wesley Longman Publishing.
Grefenstette, J. J. (). Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, (–), –.
Hinton, G., Rumelhart, D., & Williams, R. (). Learning internal representation by back-propagating errors. In D. Rumelhart, J. McClelland, & T. P. R. Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. ., pp. –). Cambridge: MIT Press.
Holland, J. (). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. ). Los Altos: Morgan Kaufmann.
Laird, J. E., Newell, A., & Rosenbloom, P. S. (). SOAR: An architecture for general intelligence. Artificial Intelligence, (), –.
Mahadevan, S. (). Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, (), –.
Michie, D., & Chambers, R. (). Boxes: An experiment in adaptive control. In E. Dale & D. Michie (Eds.), Machine Intelligence . Edinburgh: Oliver and Boyd.
Minsky, M. (). Steps towards artificial intelligence. Proceedings of the IRE, , –.
Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. (). Explanation based generalisation: A unifying view. Machine Learning, , –.
Mitchell, T. M., Utgoff, P. E., & Banerji, R. B. (). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. Michalski, J. Carbonell, & T. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto: Tioga.
Moore, A. W. (). Efficient memory-based learning for robot control. Ph.D. Thesis, UCAM-CL-TR-, Computer Laboratory, University of Cambridge, Cambridge.
Niculescu-Mizil, A., & Caruana, R. (). Inductive transfer for Bayesian network structure learning. In Proceedings of the th International Conference on AI and Statistics (AISTATS ). San Juan, Puerto Rico.
Reid, M. D. (). Improving rule evaluation using multitask learning. In Proceedings of the th International Conference on Inductive Logic Programming (pp. –). Porto, Portugal.
Reid, M. D. (). DEFT guessing: Using inductive transfer to improve rule evaluation from limited data. Ph.D. thesis, School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia.
Rosenblatt, F. (). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan Books.
Samuel, A. (). Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, (), –.
Silver, D., Bakir, G., Bennett, K., Caruana, R., Pontil, M., Russell, S., et al. (). NIPS workshop on “Inductive transfer: years later”. Whistler, Canada.
Sutton, R. (). Temporal credit assignment in reinforcement learning. Ph.D. thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA.
Sutton, R., & Barto, A. (). Reinforcement learning: An introduction. Cambridge: MIT Press.
Taylor, M. E., & Stone, P. (). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, , –.
Wang, X., Simon, H. A., Lehman, J. F., & Fisher, D. H. (). Learning planning operators by observation and practice. In Proceedings of the Second International Conference on AI Planning Systems, AIPS- (pp. –). Chicago, IL.
Watkins, C. (). Learning with delayed rewards. Ph.D. thesis, Psychology Department, University of Cambridge, Cambridge.
Watkins, C., & Dayan, P. (). Q-learning. Machine Learning, (–), –.

Cross-Language Document Categorization
Document categorization is the task of assigning a document to zero, one, or more categories in a predefined taxonomy. Cross-language document categorization is the specific case in which documents must be automatically categorized in the same taxonomy regardless of which of several languages they are written in. For more details on the methods used to perform this task, see cross-lingual text mining.

Cross-Language Information Retrieval
Cross-language information retrieval (CLIR) is the task of retrieving the subset of a document collection D relevant to a query q, in the special case in which D contains documents written in more than one language. Generally, it is additionally assumed that the subset of relevant documents must be returned as an ordered list, in decreasing order of relevance. For more details on methods and applications, see cross-lingual text mining.

Cross-Language Question Answering
Question answering is the task of finding the answer to a question in a document collection. Cross-language question answering (CLQA) is the specific case in which the question and the documents can be in different languages. For more details on the methods used to perform this task, see cross-lingual text mining.

Cross-Lingual Text Mining
Nicola Cancedda, Jean-Michel Renders
Xerox Research Centre Europe, Meylan, France

Definition
Cross-lingual text mining is a general category denoting tasks and methods for accessing the information in sets of documents written in several languages, or in cases where the language used to express an information need is different from the language of the documents. A distinguishing feature of cross-lingual text mining is the necessity to overcome some language translation barrier.

Motivation and Background
Advances in mass storage and network connectivity make enormous amounts of information easily accessible to an increasingly large fraction of the world population. Such information is mostly encoded in the form of running text which, in most cases, is written in a language different from the native language of the user. This state of affairs creates many situations in which the main barrier to the fulfillment of an information need is not technological but linguistic. For example, in some cases the user has some knowledge of the language in which the text containing a relevant piece of information is written, but does not have sufficient command of this language to express his/her information needs. In other cases, documents in many different languages must be categorized in the same categorization scheme, but manually categorized examples are available for only one language. While the automatic translation of text from one natural language into another (machine translation) is one of the oldest problems to which computers have been applied, a palette of other tasks has become relevant only more recently, due to the technological advances mentioned above. Most of them were originally motivated by the needs of government intelligence communities, but received a strong impulse from the diffusion of the World-Wide Web and of the Internet in general.

Tasks and Methods
A number of specific tasks fall under the term cross-lingual text mining (CLTM), including:
● Cross-language information retrieval
● Cross-language document categorization
● Cross-language document clustering
● Cross-language question answering

These tasks can in principle be performed using methods which do not involve any text mining, but as a matter of fact all of them have been successfully approached relying on the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they all require reliably measuring the similarity of two text spans written in different languages. There are essentially two families of approaches for doing this:
1. In translation-based approaches, one of the two text spans is first translated into the language of the other. Similarity is then computed based on any measure used in monolingual cases. As a variant, both text spans can be translated into a third pivot language.
2. In latent semantics approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in this latent semantic space, where any similarity measure for vector spaces can be used.

The rest of this entry is organized as follows: first translation-based approaches will be introduced, followed by latent semantic approaches. Finally, each of the specific CLTM tasks will be discussed in turn.

Translation-Based Approaches
The simplest approach consists in using a manually-written machine-readable bilingual dictionary: words from the first span are looked up and replaced with words in the second language (see e.g., Zhang & Vines, ). Since dictionaries typically contain entries for “citation forms” only (e.g., the singular for nouns, the infinitive for verbs, etc.), words in both spans are preliminarily lemmatized, i.e., replaced with the corresponding

citation form. In all cases when the lexica and morphological analyzers required to perform lemmatization are not available, a frequently adopted crude alternative consists in stemming (i.e., truncating by taking away a suffix) both the words in the span to be translated and in the corresponding side in the lexicon. Some languages (e.g., Germanic languages) are characterized by a very productive compounding: simpler words are connected together to form complex words. Compound words are rarely in dictionaries as such: in order to find them it is first necessary to break compounds into their elements. This can be done based on additional linguistic resources or by means of heuristics, but in all cases it is a challenging operation in itself. If the method used afterward to compare the two spans in the target language can take weights into account, translations are “normalized” in such a way that the cumulative weight of all translations of a word is the same regardless of the number of alternative translations. Most often, the weight is simply distributed uniformly among all alternative translations. Sometimes, only the first translation for each word is kept, or the first two or three. A second approach consists in extracting a bilingual lexicon from a parallel corpus instead of using a manually-written one. Methods for extracting probabilistic lexica look at the frequencies with which a word s in one language was translated with a word t to estimate the translation probability p(t∣s). In order to determine which word is the translation of which other word in the available examples, these examples are preliminarily aligned, first at the sentence level (to know what sentence is the translation of what other sentence) and then at the word level. Several methods for aligning sentences at the word level have been proposed, and this problem is a lively research topic in itself (see Brown, Della Pietra, Della Pietra, & Mercer, for a seminal paper). 
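The relative-frequency estimate of p(t∣s) and the way it distributes a word's weight across its translations can be sketched as follows. This is a minimal illustration that assumes word alignments are already given as (source, target) pairs; real systems obtain them with statistical alignment models, which are not shown. The function names and the toy French-English data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_lexicon(aligned_pairs):
    """Estimate p(t|s) by relative frequency from word-aligned (s, t) pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(s for s, _ in aligned_pairs)
    lexicon = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        lexicon[s][t] = count / source_counts[s]
    return dict(lexicon)


def translate_bag(bag, lexicon):
    """Translate a weighted bag of source words into the target language.

    Each source word's weight is split across its translations in
    proportion to p(t|s), so the total weight per word is preserved.
    Words missing from the lexicon are dropped.
    """
    translated = defaultdict(float)
    for s, weight in bag.items():
        for t, prob in lexicon.get(s, {}).items():
            translated[t] += weight * prob
    return dict(translated)


# Assumed toy alignments: "banque" aligned to "bank" three times and to
# "bench" once (an alignment error), "rive" aligned to "bank" once.
pairs = [("banque", "bank")] * 3 + [("banque", "bench"), ("rive", "bank")]
lexicon = estimate_lexicon(pairs)
translated = translate_bag({"banque": 2.0, "rive": 1.0}, lexicon)
```

With a manually written dictionary the same `translate_bag` function could be used by giving every alternative translation of a word the same probability, which corresponds to the uniform weighting mentioned above.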
Once a probabilistic bilingual dictionary is available, it can be used much in the same way as human-written dictionaries, with the notable difference that the estimated conditional probabilities provide a natural way to distribute weight across translations. When the example documents used for extracting the bilingual dictionaries are of the same style and domain as the text spans to be translated, this can result in a significant increase in accuracy for the final task, whatever this is. It is often the case that a parallel corpus sufficiently similar in topic and style to the spans to be translated is unavailable, or it is too small to be used for reliably

estimating translation probabilities. In such cases, it can be possible to replace or complement the parallel corpus with a “comparable” corpus. A comparable corpus is a pair of collections of documents, one in each of the languages of interest, which are known to be similar in content, although not translations of one another. A typical case might be two sets of articles from corresponding sections of different newspapers collected during the same period of time. If some additional bilingual seed dictionary (human-written or extracted from a parallel corpus) is also available, then the comparable corpus can be leveraged as well: a word t is likely to be the translation of a word s if it turns out that the words often appearing near s are translations of the words often appearing near t. Using this observation it is thus possible to estimate the probability that t is a valid translation of s even though they are not contained in the original dictionary. Most approaches proceed by associating with s a context vector. This vector, with one component for each word in the source language, can simply be formed by summing together the count histograms of the words occurring within a fixed window centered on all occurrences of s in the corpus, but is often constructed using statistically more robust association measures, such as mutual information. After a possible normalization step, the context vector CV(s) is translated using the seed dictionary D into the target language. A context vector is also extracted from the corpus for all target words t. Eventually, a translation score between s and t is computed as the inner product ⟨Tr(CV(s)), CV(t)⟩:

$$S(s, t) = \langle \mathrm{Tr}(CV(s)),\, CV(t) \rangle = \sum_{(s', t') \in D} a(s, s')\, a(t, t'),$$

where a is the association score used to construct the context vector. While effective in many cases, this approach can provide inaccurate similarity values when polysemous words and synonyms appear in the corpus. To deal with this problem, Gaussier, Renders, Matveeva, Goutte, and Déjean () propose the following extension:

$$S(s, t) = \sum_{(s', t') \in D} \Big( \sum_{s''} a(s', s'')\, a(s, s'') \Big) \Big( \sum_{t''} a(t', t'')\, a(t, t'') \Big),$$

which is more robust in cases when the entries in the seed bilingual dictionary do not cover all senses

actually present in the two sides of the comparable corpus. Although these methods for building bilingual dictionaries can be (and often are) used in isolation, it can be more effective to combine them.

Using a bilingual dictionary directly is not the only way of translating a span from one language into another. An alternative consists in using a machine translation (MT) system. While the MT system, in turn, relies on a bilingual dictionary of some sort, it is in general in a position to leverage contextual clues to select the correct words and put them in the right order in the translation. This can be more or less useful depending on the specific task. MT systems fall, broadly speaking, into two classes: rule-based and statistical. Systems in the first class rely on sets of hand-written rules describing how words and syntactic structures should be translated. Statistical machine translation (SMT) systems learn this mapping by performing a statistical analysis of a parallel corpus. Some authors (e.g., Savoy & Berger, ) also experimented with combining translations from multiple machine translation systems.
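The basic context-vector translation score for comparable corpora described earlier in this section, a sum over seed dictionary entries (s′, t′) of a(s, s′) a(t, t′), can be sketched as follows. The sketch assumes raw co-occurrence counts as the association score a (the text notes that measures such as mutual information are often preferred), and the function names, window size, and toy sentences are all hypothetical.

```python
from collections import Counter


def context_vector(word, sentences, window=3):
    """Count the words occurring within `window` positions of `word`."""
    cv = Counter()
    for sentence in sentences:
        for i, token in enumerate(sentence):
            if token != word:
                continue
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cv[sentence[j]] += 1
    return cv


def translation_score(s, t, source_sents, target_sents, seed, window=3):
    """Sum a(s, s') * a(t, t') over the seed dictionary pairs (s', t')."""
    cv_s = context_vector(s, source_sents, window)
    cv_t = context_vector(t, target_sents, window)
    return sum(cv_s[s1] * cv_t[t1] for s1, t1 in seed)


# Assumed toy comparable corpus (not translations of a shared original).
french = [["le", "chien", "mange", "la", "viande"],
          ["le", "chat", "mange", "le", "poisson"]]
english = [["the", "dog", "eats", "the", "meat"],
           ["the", "cat", "eats", "the", "fish"]]
seed = [("mange", "eats"), ("viande", "meat"), ("poisson", "fish")]

score_dog = translation_score("chien", "dog", french, english, seed)
score_cat = translation_score("chien", "cat", french, english, seed)
```

In this toy setting "dog" scores higher than "cat" as a candidate translation of "chien", because its context shares two seed-dictionary links with the context of "chien" rather than one.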

Latent Semantic Approaches
In CLTM, latent semantic approaches rely on some interlingua (language-independent) representation. Most of the time, this interlingua representation is obtained by linear or non-linear statistical analysis techniques, and more specifically by dimensionality reduction methods with ad hoc optimization criteria and constraints. Others adopt a more manual approach, exploiting multilingual thesauri or even multilingual ontologies in order to map textual objects to a (possibly weighted) list of interlingua concepts. For any textual object (typically a document or a section of a document), the interlingua concept representation is derived from a sequence of operations that encompasses:
1. Linguistic preprocessing (as explained in previous sections, this step amounts to extracting the relevant, normalized “terms” of the textual objects by tokenisation, word segmentation/decompounding, lemmatisation/stemming, part-of-speech tagging, stopword removal, corpus-based term filtering, noun-phrase extraction, etc.).
2. Semantic enrichment and/or monolingual dimensionality reduction.
3. Interlingua semantic projection.

A typical semantic enrichment method is the generalized vector space model, which adds related terms (or neighbour terms) to each term of the textual object, neighbour terms being defined by some co-occurrence measure (for instance, mutual information). Semantic enrichment can alternatively be achieved by using a (monolingual) thesaurus, exploiting relationships such as synonymy, hyperonymy and hyponymy. Monolingual dimensionality reduction typically consists in performing latent semantic analysis (LSA), a form of principal component analysis on the textual object/term matrix. Dimensionality reduction techniques such as LSA, or their discrete/probabilistic variants such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), offer to some extent a semantic robustness to deal with the effects of polysemy/synonymy, adopting a language-dependent concept representation in a space of dimension much smaller than the size of the vocabulary of a language.

Of course, steps (1) and (2) are highly language-dependent. Textual objects written in different languages will not follow the same linguistic processing or semantic enrichment/dimensionality reduction. The last step (3), however, aims at projecting textual objects into the same language-independent concept space, for any source language. This is done by first extracting these common concepts, typically from a parallel corpus that offers a natural multiple-view representation of the same objects. Starting from these multiple-view observations, common factors are extracted through the use of canonical correlation analysis (CCA), cross-language latent semantic analysis, their kernelized variants (e.g., Kernel-CCA) or their discrete, probabilistic extensions (cross-language latent Dirichlet allocation, multinomial CCA, …).
All these methods try to discover latent factors that simultaneously explain as much as possible of the “intra-language” variance and the “inter-language” correlation. They differ in the choice of the underlying distributions and in how precisely they define and combine these two criteria. The following subsections will describe them in more detail. As already emphasized, CLTM mainly relies on defining appropriate similarities between textual objects


expressed in different languages. Numerous categorization, clustering and retrieval algorithms focus on defining efficient and powerful measures of similarity between objects, as strengthened recently by the development of kernel methods for textual information access. We will see that the (linear) statistical algorithms used for performing steps () and () can most of the time be embedded into one valid (Mercer) kernel, so that we can very easily obtain non-linear variants of these algorithms, just by adopting some standard non-linear kernels.

Cross-Language Semantic Analysis
This amounts to concatenating the vectorial representations of each view of the objects of the parallel collection (typically, objects are aligned sentences), and then performing a standard singular value decomposition of the global object/term matrix. Equivalently, defining the kernel similarity matrix between all pairs of multi-view objects as the sum of the monolingual textual similarity matrices, this amounts to performing the eigenvalue decomposition of the corresponding kernel Gram matrix, if a dual formulation is adopted. The number of eigenvalues/eigenvectors that are retained to define the latent factors and the corresponding projections typically ranges from several hundred to several thousand components, still far fewer than the size of the vocabulary. Note that this process does not really control the formation of interlingua concepts: nothing prevents the method from extracting factors that are linear combinations of terms in one language only.
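The concatenate-then-decompose procedure can be sketched with NumPy as follows. This is an illustrative sketch on a tiny hand-made parallel collection, not a tuned implementation: the function names, the toy term counts, and the choice to fold in a monolingual object by treating the other language's terms as absent are assumptions made here.

```python
import numpy as np


def fit_cross_language_lsa(X1, X2, k):
    """Fit latent factors on a parallel collection.

    X1 is an (N, n1) object/term matrix in language 1 and X2 an (N, n2)
    matrix for the aligned objects in language 2. Returns one projection
    matrix per language, each with k columns.
    """
    X = np.hstack([X1, X2])  # concatenate the two views of each object
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T             # top-k right singular vectors, shape (n1 + n2, k)
    n1 = X1.shape[1]
    return V[:n1], V[n1:]


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Assumed toy data: two topics, four vocabulary terms per language.
X1 = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], float)
X2 = np.array([[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]], float)
P1, P2 = fit_cross_language_lsa(X1, X2, k=2)

doc_l1_topic_a = np.array([1.0, 1.0, 0.0, 0.0]) @ P1
doc_l2_topic_a = np.array([1.0, 1.0, 0.0, 0.0]) @ P2
doc_l2_topic_b = np.array([0.0, 0.0, 1.0, 1.0]) @ P2

same_topic = cosine(doc_l1_topic_a, doc_l2_topic_a)
other_topic = cosine(doc_l1_topic_a, doc_l2_topic_b)
```

On this toy collection the two documents about the same topic end up with cosine similarity close to 1 in the latent space even though they share no vocabulary, while documents about different topics are close to orthogonal.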

Cross-Language Latent Dirichlet Allocation
The extraction of interlingua components is realised by using LDA to model the set of parallel objects, imposing the same proportion of components (topics) for all views of the same object. This is represented in Fig. . LDA performs some form of clustering, with a predefined number of components (K) and with the constraint that the two views of the same object belong to the clusters with the same membership values. This results in 2K component profiles that are then used for “folding in” (projecting) new documents by launching some form of EM to derive their posterior probabilities of belonging to each of the language-independent components. The similarity between two documents written in different languages is obtained by comparing their posterior distributions over these latent classes. Note that this approach could easily integrate supervised topic information and provides a nice framework for semi-supervised interlingua concept extraction.

[Figure: graphical model with nodes α, θ, Z1, Z2, W1, W2, β1, β2 and plates N1, N2, Nseg]
Cross-Lingual Text Mining. Figure . Latent Dirichlet allocation of a parallel corpus

Cross-Language Canonical Correlation Analysis

The Primal Formulation
CCA is a standard statistical method to perform multi-block multivariate analysis, the goal being to find linear combinations of variables for each block (i.e., each language) that are maximally correlated. In other words, CCA is able to enforce the commonality of latent concept formations by extracting maximally correlated projections. Starting from a set of paired views of the same objects (typically, aligned sentences of a parallel corpus) in languages L1 and L2, the algebraic formulation of this optimization problem leads to a generalized eigenvalue problem of size (n1 + n2), where n1 and n2 are the sizes of the vocabularies in L1 and L2 respectively. For obvious scalability reasons, the dual (or kernel) formulation (of size N, the number of paired objects in the training set) is often preferred.

Kernel Canonical Correlation Analysis
Basically, kernel canonical correlation analysis (KCCA) amounts to doing CCA on some implicit, but more complex, feature space and to expressing the projection coefficients as linear combinations of the training paired objects. This results in the dual formulation, which is a generalized eigenvalue/vector problem of size N that involves only the monolingual kernel Gram matrices K1 and K2 (matrices of monolingual textual similarities between all pairs of objects in the training set in languages L1 and L2 respectively). Note that it is easy to show that the eigenvalues come in pairs: we always have two symmetrical eigenvalues +λ and −λ. This kernel formulation has the advantage of allowing any text-specific prior properties to be included in the kernel (e.g., use of N-gram kernels, word-sequence kernels, and any semantically-smoothed kernel). After extraction of the first k generalized eigenvalues/eigenvectors, the similarity between any pair of test objects in languages L1 and L2 can be computed by using projection matrices composed of the extracted eigenvectors as well as the (monolingual) kernels of the test objects with the training objects.

Regularization and Partial Least Squares Solution
When the number of training examples (N) is less than n1 and n2 (the dimensions of the monolingual feature spaces), the eigenvalue spectrum of the KCCA problem generally has two null eigenvalues (due to data centering), (N − 1) eigenvalues equal to +1 and (N − 1) eigenvalues equal to −1, so that, as such, the KCCA problem only results in trivial solutions and is useless. When using kernel methods, the case (N < n1, n2) is frequent, so that some regularization scheme is needed. One way of realizing this regularization is to resort to finding the directions of maximum covariance (instead of correlation): this can be considered as a partial least squares (PLS) problem, whose formulation is very similar to the CCA problem. Adopting a mixed CCA/PLS criterion (trying to maximize a combination of covariance and correlation between projections) turns out both to avoid overfitting (or spurious solutions) and to enhance numerical stability.

Approximate Solutions
Both CCA and KCCA suffer from a lack of scalability, due to the fact that the complexity of generalized eigenvalue/vector decomposition is O(N^3) for KCCA or O(min(n1, n2)^3) for CCA. As it can be shown that performing a complete KCCA (or KPLS) analysis amounts to first doing complete PCAs, and then a linear CCA (or PLS) on the resulting new projections, it is obvious that we could reduce the complexity by working on a reduced-rank approximation (incomplete KPCA) of the kernel matrices. However, the implicit projections derived from incomplete KPCA may not be optimal with respect to cross-correlation or covariance criteria. Another idea to decrease the complexity is to perform some incomplete Cholesky decomposition of the (monolingual) kernel matrices K1 and K2 (which is equivalent to partial Gram-Schmidt orthogonalisation in the feature space): K1 = G1 G1^T and K2 = G2 G2^T, with Gi of rank k ≪ N. Considering Gi as the new representation of the training data, KCCA now reduces to solving a generalized eigenvalue problem of size 2k.
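As a point of reference for the primal formulation and the regularization discussed above, here is a minimal NumPy sketch of linear CCA with a ridge term added to each within-view covariance. It is an assumption of this sketch, not a statement about the authors' method, that the regularization takes this ridge form; as the `reg` parameter grows, the criterion moves from correlation toward covariance. The problem is solved by whitening and a singular value decomposition rather than by an explicit generalized eigenvalue solver, and all names and the synthetic data are hypothetical.

```python
import numpy as np


def regularized_cca(X, Y, k, reg=1e-3):
    """Linear CCA on paired views X (N, n1) and Y (N, n2).

    Returns projection matrices A (n1, k) and B (n2, k) and the k leading
    (regularized) canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)    # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)    # Lx^{-1} Cxy
    M = np.linalg.solve(Ly, M.T).T  # Lx^{-1} Cxy Ly^{-T}
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])   # map back from whitened space
    B = np.linalg.solve(Ly.T, Vt[:k].T)
    return A, B, s[:k]


# Assumed synthetic data: one shared latent factor z hidden in each view.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
Y = np.hstack([rng.normal(size=(200, 1)), -z + 0.1 * rng.normal(size=(200, 1))])

A, B, corrs = regularized_cca(X, Y, k=1)
projected_corr = np.corrcoef(X @ A[:, 0], Y @ B[:, 0])[0, 1]
```

The first pair of projections recovers the shared factor, so the projected views are strongly correlated. A kernel version would replace X and Y by (possibly low-rank) factors of the kernel matrices, such as the Gi obtained from incomplete Cholesky decomposition.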

Specific Applications
The previous sections illustrated a number of different ways of solving the core problem of cross-language text mining: quantifying the similarity between two spans of text in different languages. In this section we turn to describing some actual applications relying on these methods.

Cross-Language Information Retrieval (CLIR)
Given a collection of documents in several languages and a single query, the CLIR problem consists in producing a single ranking of all documents according to their relevance to the query. CLIR is particularly useful whenever a user has some knowledge of the languages in which documents are written, but not enough to express his/her information needs in those languages by means of a precise query. Sometimes CLIR engines are coupled with translation tools to help the user access the content of relevant documents written in languages unknown to him/her. In this case document collections in an even larger number of languages can be effectively queried. It is probably fair to say that the vast majority of CLIR systems use a translation-based approach. In most cases it is the query which is translated into all languages before being sent to monolingual search engines. While this limits the amount of translation work that needs to be done, it requires doing it on-line at query time. Moreover, when queries are short it can be difficult to translate them correctly, since there is little context to help identify the correct sense in which words are used. For these reasons several groups also proposed translating all documents at indexing time instead. Regardless of whether queries or documents


are translated, whenever similarity scores between (possibly translated) queries and (possibly translated) documents are not directly comparable, all methods then face the problem of merging multiple monolingual rankings into a single multilingual ranking. Research in CLIR and cross-language question answering (see below) has been significantly stimulated by at least three government-sponsored evaluation campaigns:
● The NII Test Collection for IR Systems (NTCIR) (http://research.nii.ac.jp/ntcir/), running yearly since , focusing on Asian languages (Japanese, Chinese, Korean) and English.
● The Cross-Language Evaluation Forum (CLEF) (http://www.clef-campaign.org), running yearly since , focusing on European languages.
● A cross-language track at the Text Retrieval Conference (TREC) (http://trec.nist.gov/), which was run until , focused on querying documents in Arabic using queries in English.

The respective websites are ideal starting points for any further exploration on the subject.

Cross-Language Question Answering (CLQA)
Question answering is the task of automatically finding the answer to a specific question in a document collection. While in practice this vague description can be instantiated in many different ways, the sense in which the term is mostly understood is strongly influenced by the task specification formulated by the National Institute of Standards and Technology (NIST) of the United States for its TREC evaluation conferences (see above). In this sense, the task consists in identifying a text snippet, i.e., a substring, of a predefined maximal length (e.g., characters, or characters) within a document in the collection containing the answer. Different classes of questions are considered:
● Questions around facts and events.
● Questions requiring the definition of people, things and organizations.
● Questions requiring as answer lists of people, objects or data.

Most proposals for solving the QA problem proceed by first identifying promising documents (or document segments) by using information retrieval techniques treating the question as a query, and then performing some finer-grained analysis to converge to a sufficiently short snippet. Questions are classified in a hierarchy of possible “question types.” Also, documents are preliminarily indexed to identify elements (e.g., person names) that are potential answers to questions of relevant types (e.g., “Who” questions). Cross-language question answering (CLQA) is the extension of this task to the case where the collection contains documents in a language different from the language of the question. In this task a CLIR step replaces the monolingual IR step to shortlist promising documents. The classification of the question is generally done in the source language. Both CLEF and NTCIR (see above) organize cross-language question answering comparative evaluations on an annual basis.

Cross-Language Categorization (CLCat) and Clustering (CLCLu)

Cross-language categorization tackles the problem of categorizing documents in different languages in the same categorization scheme. The vast majority of document categorization systems rely on machine learning techniques to automatically acquire the necessary knowledge (often referred to as a model) from a possibly large collection of manually categorized documents. Most often the model is based on frequency counts of words, and is thus intrinsically language-dependent. The most direct way to perform categorization in different languages would consist in manually categorizing a sufficient number of documents in all languages of interest and then training a set of independent categorizers. In some cases, however, it is impractical to manually categorize a sufficient number of documents to ensure accurate categorization in all languages, while it can be easier to identify bilingual dictionaries or parallel (or comparable) corpora for the language pairs and in the application domain of interest. In such cases it is then preferable to obtain manually categorized documents only for a single language A and use them to train a monolingual categorizer. Any of the translation-based approaches described above can then be used to translate a document originally in language B – or most often its representation as a bag of words – into language A. Once the document is translated, it can be categorized using the monolingual A system.

As an alternative, latent-semantics approaches can be used as well. An existing parallel corpus can be used to identify an abstract vector space common to A and B. The manually categorized documents in A can then be represented in this space, and a model can be learned which operates directly on this latent-semantic representation. Whenever a document in B needs to be categorized, it is first projected into the common semantic space and then categorized using the same model.

All these considerations carry over unchanged to the cross-language clustering task, which consists in identifying subsets of documents in a multilingual document collection which are mutually similar to one another according to some criterion. Again, this task can be effectively solved by either translating all documents into a single language or by learning a common semantic space and performing the clustering task there.

While CLCat and Clustering are relevant tasks in many real-world situations, it is probably fair to say that less effort has been devoted to them by the research community than to CLIR and CLQA.
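The translation-based route to cross-language categorization can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary and a deliberately simple word-overlap categorizer trained only on language A; the dictionary entries, the category names, and the helper names (`translate_bag`, `train_centroids`, `categorize`) are hypothetical and stand in for whatever lexicon and monolingual model a real system would use.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (language B -> language A).
# A real system would derive it from a bilingual lexicon or a parallel corpus.
B_TO_A = {"calcio": "football", "partita": "match", "borsa": "stock", "mercato": "market"}

def translate_bag(bag_b, dictionary):
    """Map a language-B bag of words into language A, dropping unknown words."""
    bag_a = Counter()
    for word, count in bag_b.items():
        if word in dictionary:
            bag_a[dictionary[word]] += count
    return bag_a

def train_centroids(labelled_docs):
    """Monolingual 'model': summed word counts per category (language A only)."""
    centroids = {}
    for words, label in labelled_docs:
        centroids.setdefault(label, Counter()).update(words)
    return centroids

def categorize(bag, centroids):
    """Pick the category whose word counts overlap most with the document."""
    def overlap(label):
        return sum(count * centroids[label][word] for word, count in bag.items())
    return max(centroids, key=overlap)

# Manually categorized documents exist only in language A.
training_a = [
    (["football", "match", "goal"], "sports"),
    (["stock", "market", "shares"], "finance"),
]
model_a = train_centroids(training_a)

# A language-B document is translated as a bag of words, then categorized.
doc_b = Counter(["partita", "calcio", "calcio"])
predicted = categorize(translate_bag(doc_b, B_TO_A), model_a)
print(predicted)
```

Translating the bag of words rather than the full text sidesteps word order and syntax, which a frequency-count model ignores anyway; the price is that words missing from the dictionary are silently lost.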

Recommended Reading

Brown, P. E., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, (), –.
Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., & Déjean, H. (). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the nd annual meeting of the association for computational linguistics, Barcelona, Spain. Morristown, NJ: Association for Computational Linguistics.
Savoy, J., & Berger, P. Y. (). Report on CLEF- evaluation campaign: Monolingual, bilingual and GIRT information retrieval. In Proceedings of the cross-language evaluation forum (CLEF) (pp. –). Heidelberg: Springer.
Zhang, Y., & Vines, P. (). Using the web for translation disambiguation. In Proceedings of the NTCIR- workshop meeting, Tokyo, Japan.

Cross-Validation

Definition

Cross-validation is a process for creating a distribution of pairs of training and test sets out of a single data set. In cross-validation the data are partitioned into k subsets, $S_1, \ldots, S_k$, each called a fold. The folds are usually of approximately the same size. The learning algorithm is then applied k times, for i = 1 to k, each time using the union of all subsets other than $S_i$ as the training set and using $S_i$ as the test set.
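The partitioning described in this definition can be sketched in a few lines. The function name `k_fold_splits` is an assumption made for illustration, and the sketch assigns items to folds by position without shuffling, which a practical evaluation would usually add.

```python
def k_fold_splits(data, k):
    """Yield (training_set, test_set) pairs for k-fold cross-validation."""
    if not 1 < k <= len(data):
        raise ValueError("k must satisfy 1 < k <= len(data)")
    # Fold i takes every k-th item starting at i, so fold sizes differ by at most one.
    folds = [data[i::k] for i in range(k)]
    for i, test_set in enumerate(folds):
        # The training set is the union of all folds other than fold i.
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training_set, test_set

data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    print(sorted(train), test)
```

Every item appears in exactly one test set and in the training sets of the remaining k - 1 iterations.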

Cross References

Algorithm Evaluation
Leave-One-Out Cross-Validation

Cumulative Learning

Pietro Michelucci, Daniel Oblinger
Strategic Analysis, Inc., Arlington, VA, USA
DARPA/IPTO, Arlington, VA, USA

Synonyms

Continual learning; Lifelong learning; Sequential inductive transfer

Definition

Cumulative learning (CL) exploits knowledge acquired on prior tasks to improve learning performance on subsequent related tasks. Consider, for example, a CL system that is learning to play chess. Here, one might expect the system to learn from prior games concepts (e.g., favorable board positions, standard openings, end games, etc.) that can be used for future learning. This is in contrast to base learning (Vilalta & Drissi, ) in which a fixed learning algorithm is applied to a single task and performance tends to improve only with more exemplars. So, in CL there tends to be explicit reuse of learned knowledge to constrain new learning, whereas base learning depends entirely upon new external inputs. Relevant techniques for CL operate over multiple tasks, often at higher levels of abstraction, such as new problem space representations, task-based selection of learning algorithms, dynamic adjustment of learning parameters, and iterative analysis and modification of the learning algorithms themselves. Though actual usage of this term is varied and evolving, CL typically connotes sequential inductive transfer. It should be noted that the word “inductive” in this connotation qualifies the transfer of knowledge to new tasks, not the underlying learning algorithms.

Related Terminology

The terms “meta-learning” and “learning to learn” are sometimes used interchangeably with CL. However, each of these concepts has a specific relationship to CL. Meta-learning (Brazdil et al., ; Vilalta & Drissi, ) involves the application of learning algorithms to meta-data, which are abstracted representations of input data or learning system knowledge. In the case that abstractions of system knowledge are themselves learning algorithms, meta-learning involves assessing the suitability of these algorithms for previous tasks and, on that basis, selecting algorithms for new tasks (see entry on “meta-learning”). In general, the sharing of abstracted knowledge across tasks in a CL system implies the use of meta-learning techniques. However, the converse is not true: meta-learning can and does occur in learning systems that do not accumulate and transfer knowledge across tasks.

Learning to learn is a synonym for inductive transfer. Thus, learning to learn is more general than CL. Though it specifies the application of knowledge learned in one domain to another, it does not stipulate whether that knowledge is accumulated and applied sequentially or shared in a parallel learning context.

Motivation and Background

Traditional supervised learning approaches require large datasets and extensive training in order to generalize to new inputs in a single task. Furthermore, traditional (non-CL) reinforcement learning approaches require tightly constrained environments to ensure a tractable state space. In contrast, humans are able to generalize across tasks in dynamic environments from brief exposure to small datasets. The human advantage seems to derive from the ability to draw upon prior task and context knowledge to constrain hypothesis development for new tasks. Recognition of this disparity between human learning and traditional machine learning has led to the pursuit of methods that seek to emulate the accumulation and exploitation of task-based knowledge that is observed in humans. A coarse evolution of this work is depicted in Fig. .

Cumulative Learning. Figure . Evolution of cumulative learning (diagram labels: Supervised Learning; Reinforcement Learning; Inductive Transfer; Parallel: Inductive Bias; Multi-Task Learning; Sequential/Hybrid: Cumulative Learning)

History

Advancements in CL have resulted from two classes of innovation: the development of techniques for inductive transfer and the integration of those techniques into autonomous learning systems.

Alan Turing () was the first to propose a cumulative learning system. His paper is best remembered for the imitation game, later known as the Turing test. However, the final sections of the paper address the question of how a machine could be made sufficiently complex to be able to pass the test. He posited that programming it would be too difficult a task. Therefore, it should be instructed as one might teach a child, starting with simple concepts and working up to more complex ones.

Banerji () introduced the use of predicate logic as a description language for machine learning. Thus, Banerji was one of the earliest advocates of what would later become ILP. His concept description language allowed the use of background knowledge and therefore was an extensible language. The first implementation of a cumulative learning system based on Banerji’s ideas was Cohen’s CONFUCIUS (Cohen, ; Cohen & Sammut, ). In this work, an instructor teaches the system concepts that are stored in a long-term memory. When examples of a new concept are seen, their descriptions are matched against stored concepts, which allows the system to re-describe the examples in terms of the background knowledge. Thus, as more concepts are accumulated, the system is capable of describing complex objects more compactly than if it had not had the background knowledge. Compact representations generally allow complex concepts to be learned more efficiently. In many cases, learning would be intractable without the prior knowledge. See the entries on Inductive Logic Programming, which describe the use of background knowledge further.

Independent of the research in symbolic learning, much of the inductive transfer research that underlies CL took root in artificial neural network research, a traditional approach to supervised learning. For example, Abu-Mostafa () introduced the notion of reducing the hypothesis space of a neural network by introducing “hints” either as hard-wired additions to the network or via examples designed to teach a particular invariance. The task of a neural network can be thought of as the determination of a function that maps exemplars into a classification space. So, in this context, hints constitute an articulation of some aspect of the target mapping function. For example, if a neural network is tasked with mapping numbers into primes and composites, one “hint” would be that all even numbers (besides 2) are composite. Leveraging such a priori knowledge about the mapping function may facilitate convergence on a solution. An inherent limitation to neural networks, however, is their immutable architecture, which does not lend itself to the continual accumulation of knowledge.
Consequently, Ring () introduced a neural network that constructs new nodes on demand in a reinforcement learning context in order to support ongoing hierarchical knowledge acquisition and transfer. In this model, nodes called “bions” correspond simultaneously to the enactment and perception of a single behavior. If two bions are activated in sequence repeatedly, a new bion is created to join the coincident pair and represent their collective functionality.

Contemporaneously, Pratt, Mostow, and Kamm () investigated the hypothesis that knowledge acquired by one neural network could be used to help another neural network learn a related task. In the speech recognition domain, they trained three separate networks, each corresponding to speech segments of a different length, such that each network was optimized to learn certain types of phonemes. They then demonstrated that a direct transfer of information encoded as network weights from these three specialized networks to a single, combined speech recognition network resulted in a tenfold reduction in training epochs for the combined network compared with the number of training epochs required when no knowledge was transferred. This was one of the first empirical results in neural network-based transfer learning.

Caruana () extended this work to demonstrate the performance benefits associated with the simultaneous transfer of inductive bias in a “Multitask Learning” (MTL) methodology. In this work, Caruana hypothesized that training the same neural network simultaneously on related tasks would naturally induce additional constraints on learning for each individual task. The intuition was that converging on a mapping in support of multiple tasks with shared representations might best reveal aspects of the input that are invariant across tasks, thus obviating within-task regularities, which might be less relevant to classification. Those empirical results are supported by Baxter (), who proved that the number of examples required by a representation learner for learning a single task is an inverse linear function of the number of simultaneous tasks being learned.

Though the innovative underpinnings of inductive transfer that critically underlie CL evolved in a supervised learning context, it was the integration of those methods with classical reinforcement learning that has led to current models of CL.

Early integration of this type comes from Thrun and Mitchell (), who applied an extension of explanation-based learning (EBL), called explanation-based neural networks (EBNN) (Mitchell & Thrun, ), to an agent-based “lifelong learning framework.” This framework provides for the acquisition of different control policies for different environments and reward functions. Since the robot actuators, sensors, and the environment (largely) remain invariant, this framework allows knowledge acquired from one control problem to be applied to another. By using EBNN to allow learning from previous control problems to constrain learning on new control problems, learning is accelerated over the lifetime of the robot.

More recently, Silver and Mercer () introduced a hybrid model that involves a combination of parallel and sequential inductive transfer in an autonomous agent framework. The so-called task rehearsal method (TRM) uses MTL to combine new training inputs with relevant exemplars that are generated from prior task knowledge. Thus, inductive bias is achieved by training the neural networks on new tasks while simultaneously rehearsing learned task knowledge.

Structure of the Learning System

CL is characterized by systems that use prior knowledge to bias future learning. The canonical interpretation is that knowledge transfer occurs at the task level. Although this description encompasses a broad research space, it is not boundless. In particular, CL systems must be able to (1) retain knowledge and (2) use that knowledge to restrict the hypothesis space for new learning. Nonetheless, learning systems can vary widely across numerous orthogonal dimensions and still meet these criteria.

Toward a CL Specification

Recognizing the empirical utility of a more specific delineation of CL systems, Silver and Poirier () introduced a set of functional requirements, classification criteria, and performance specifications that characterize more precisely the scope of machines capable of lifelong learning. Any system that meets these requirements is considered a machine lifelong learning (ML) system. A general CL architecture that conforms to the ML standard is depicted in Fig. . Two basic memory constructs are typical of CL systems. Long term memory (LTM) is required for storing domain knowledge (DK) that can be used to bias new learning. Short term memory (STM) provides a working memory for building representations and testing hypotheses associated with new task learning. Most of the ML requirements specify the interplay of these constructs. LTM and STM are depicted in Fig. , along with a comparison process, an assessment process, and the learning environment. In this model, the comparison process evaluates the training input in the context of LTM to determine the most relevant domain knowledge that can be used to constrain short term learning. The comparison process also determines the weight assigned to domain knowledge that is used to bias short term learning. Once the rate of performance improvement on the primary task falls below a threshold the assessment process compares the state of STM to the environment to determine which domain knowledge to extract and store in LTM.

Classification of CL Systems

The simplicity of the architecture shown in Fig. belies the richness of the feature space for CL systems. The following classification dimensions are derived largely from the ML specification. This list includes both qualitative and quantitative dimensions. They are presented in three overlapping categories: architectural features, characteristics of the knowledge base, and learning capabilities.
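The interplay of LTM, STM, the comparison process, and the assessment process described under “Toward a CL Specification” can be made concrete with a deliberately small sketch. Everything in it is an illustrative assumption rather than part of the specification: tasks are one-dimensional regressions of the form y = slope · x, domain knowledge is a single stored slope, relevance is derived from the error of a stored slope on the new exemplars, and the assessment step uses a fixed number of training steps plus an accuracy bar instead of monitoring the rate of improvement. The class and method names (`CLSystem`, `compare`, `learn`) are hypothetical.

```python
from dataclasses import dataclass, field

def mse(slope, exemplars):
    """Mean squared error of the hypothesis y = slope * x on (x, y) exemplars."""
    return sum((slope * x - y) ** 2 for x, y in exemplars) / len(exemplars)

@dataclass
class CLSystem:
    ltm: dict = field(default_factory=dict)  # long term memory: task name -> stored slope (DK)
    tolerance: float = 1e-3                  # hypothetical accuracy bar for retention

    def compare(self, exemplars):
        """Comparison process: find the most relevant DK in LTM and a weight for it."""
        if not self.ltm:
            return None, 0.0, 0.0
        task = min(self.ltm, key=lambda t: mse(self.ltm[t], exemplars))
        weight = 1.0 / (1.0 + mse(self.ltm[task], exemplars))  # lies in (0, 1]
        return task, self.ltm[task], weight

    def learn(self, task, exemplars, steps=200, rate=0.01):
        source, prior, weight = self.compare(exemplars)
        stm = weight * prior  # short term memory: working hypothesis, biased by relevant DK
        for _ in range(steps):
            grad = sum(2 * (stm * x - y) * x for x, y in exemplars) / len(exemplars)
            stm -= rate * grad
        # Assessment process (simplified): retain only sufficiently accurate knowledge.
        if mse(stm, exemplars) < self.tolerance:
            self.ltm[task] = stm
        return source, stm

system = CLSystem()
task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
task_b = [(x, 2.1 * x) for x in (1.0, 2.0, 3.0)]
first_source, _ = system.learn("A", task_a)        # nothing in LTM yet
second_source, slope_b = system.learn("B", task_b) # biased by knowledge from task A
print(first_source, second_source, round(slope_b, 2))
```

The second call starts short term learning from the weighted slope retrieved for task A instead of from zero, which is the sense in which retained knowledge restricts the hypothesis space for new learning.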

Architecture

The following architectural dimensions for a CL system range from paradigm choices to low-level interface considerations.

Learning paradigm – The learning paradigm(s) may include supervised learning (e.g., neural network, SVM, ILP, etc.), unsupervised learning (e.g., clustering), reinforcement learning (e.g., automated agent), or some combination thereof. Figure depicts a general architecture with processes that are common across these learning paradigms, and which could be elaborated to reflect the details of each.

Cumulative Learning. Figure . Typical CL system (diagram components: comparison process, assessment process, LTM, STM, environment; labeled flows: state, relevant DK, extracted DK)

Task order – CL systems may learn tasks sequentially (Thrun & Mitchell, ), in parallel (e.g., MTL (Caruana, )), or via a hybrid methodology (e.g., TRM (Silver & Mercer, )). One hybrid approach is to engage in practice (i.e., revisiting prior learned tasks). Transferring knowledge between learned tasks through practice may serve to improve generalization accuracy. Task order would be reflected in the sequence of events within and among process arrows in the Fig. architecture. For example, a system may alternate between processing new exemplars and “practicing” with old, stored exemplars.

Transfer method – Knowledge transfer can also be representational or functional. Functional transfer provides implicit pressure from related training exemplars. For example, the environmental input in Fig. may take the form of training exemplars drawn randomly from data representing two related tasks, such that learning to classify exemplars from one task implicitly constrains learning on the other task. Representational knowledge transfer involves the direct or indirect (Pratt et al., ) assignment of a hypothesis representation. A direct inductive transfer entails the assignment of an original hypothesis representation, such as a vector of trained neural network activation weights. This might take the form of a direct injection to LTM in Fig. . Indirect transfer implies that some level of abstraction analysis has been applied to the hypothesis representation prior to assignment.

Learning stages – A learning system may implement learning in a single stage or in a series of stages. An example of a two-stage system is one that waits to initiate the long-term storage of domain knowledge until after primary task learning in short-term memory is complete. Like task order, learning stages would be reflected in the sequence of events within and among process arrows in the Fig. architecture. But in this case, ordering pertains to the manner in which learning is staged across encoding processes.

Interface cardinality – The interface cardinality can be fixed or variable. Fixing the number of inputs and outputs has the advantage of providing a consistent interface without posing restrictions on the growth of the internal representation.

Data type – The input and output data types can be fixed or variable. A type-flexible system can produce both categorical and scalar predictions.

Scalability – CL systems may or may not scale on a variety of dimensions including inputs, outputs, training examples, and tasks.
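Direct representational transfer, as described under “Transfer method,” amounts to starting new learning from a previously trained weight vector. The sketch below illustrates this under simplifying assumptions: a two-weight linear model trained by batch gradient descent stands in for a neural network, and the two tasks, the input grid, the learning rate, and the tolerance are made up for illustration.

```python
def mse(weights, data):
    """Mean squared error of the linear hypothesis y = w . x on (x, y) pairs."""
    return sum((sum(w * xi for w, xi in zip(weights, x)) - y) ** 2 for x, y in data) / len(data)

def train(weights, data, tolerance=1e-3, rate=0.5, max_epochs=1000):
    """Batch gradient descent; returns (weights, epochs needed to reach tolerance)."""
    weights = list(weights)  # copy, so the source hypothesis is left untouched
    epochs = 0
    while mse(weights, data) >= tolerance and epochs < max_epochs:
        grads = [0.0] * len(weights)
        for x, y in data:
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                grads[j] += 2 * error * xj / len(data)
        weights = [w - rate * g for w, g in zip(weights, grads)]
        epochs += 1
    return weights, epochs

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
inputs = [(a, b) for a in grid for b in grid]
task_a = [(x, 2.0 * x[0] + 1.0 * x[1]) for x in inputs]  # prior task
task_b = [(x, 2.0 * x[0] + 1.2 * x[1]) for x in inputs]  # related new task

weights_a, _ = train([0.0, 0.0], task_a)
_, epochs_scratch = train([0.0, 0.0], task_b)   # no transfer
_, epochs_transfer = train(weights_a, task_b)   # direct transfer of task A's weights
print(epochs_scratch, epochs_transfer)
```

Because the two tasks are closely related, the transferred weights already lie near the solution for the new task, so fewer epochs are needed than when starting from zero; for unrelated tasks the same initialization could just as well slow learning down.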

Knowledge

This category pertains to the long-term storage of learned knowledge. Thus, the following CL dimensions characterize knowledge representation, storage, and retrieval.

Knowledge representation – Stored knowledge can manifest as functional or representational. Functional knowledge retention involves the storage of specific exemplars or parameter values, which tends to be more accurate, whereas representational knowledge retention involves the storage of hypotheses derived from training on exemplars, which has the advantage of storage economy.

Retention efficacy – The efficacy of long term retention varies across CL systems. Effective retention implies that only domain knowledge with an acceptable level of accuracy is retained so that errors are not propagated to future hypotheses. A related consideration is whether or not the consolidation of new domain knowledge degrades the accuracy of current or prior hypotheses.

Retention efficiency – The retention efficiency of long term memory can vary according to both economy of representation and computational efficiency.

Indexing method – The input to the comparison process used to select appropriate knowledge for biasing new learning may simply be exemplars (as provided by LTM in Fig. ) or may take a representational form (e.g., a vector of neural network weights).

Indexing efficiency – CL systems vary in terms of the speed and accuracy with which they can identify related prior knowledge that is suitable for inductive transfer during short term learning. The input to this selection process is the indexing method.

Meta-knowledge – CL systems differentially exhibit the ability to abstract, store, and utilize meta-knowledge, such as characteristics of the input space, learning system parameter values, etc.

Cumulative Learning. Table . CL System Dimensions

| Category | Dimension | Values (ML guidance is indicated by ✓) |
|---|---|---|
| Architecture | Learning paradigm | Supervised learning; Reinforcement learning; Unsupervised learning; ✓ Hybrid |
| Architecture | Task order | Sequential; Parallel; ✓ Revisit (practice); Hybrid |
| Architecture | Transfer method | Functional; Representational – direct; Representational – indirect |
| Architecture | Learning stages | ✓ Single (computational retention efficiency); Multiple |
| Architecture | Interface cardinality | ✓ Fixed; Variable |
| Architecture | Data type | Fixed; Variable |
| Architecture | Scalability | ✓ Inputs; ✓ Outputs; ✓ Exemplars; ✓ Tasks |
| Knowledge | Representation | Functional; Representational – disjoint; ✓ Representational – continuous |
| Knowledge | Retention efficacy | ✓ Improves prior task performance; ✓ Improves new task performance |
| Knowledge | Retention efficiency | ✓ Space (memory usage); ✓ Time (computational processing) |
| Knowledge | Indexing method | ✓ Deliberative – functional; ✓ Deliberative – representational; Reflexive |
| Knowledge | Indexing efficiency | ✓ Time < O(n^c), c > (n = tasks) |
| Knowledge | Meta-knowledge | ✓ Probability distribution of input space; Learning curve; Error rate |
| Learning | Agency | Single learning method; Task-based selection of learning method |
| Learning | Utility | Single learning method; Task-based selection of learning method |
| Learning | Task awareness | Task boundary identification (begin/end) |
| Learning | Bias modulation | ✓ Estimated sample complexity; ✓ Number of task exemplars; ✓ Generalization accuracy of retained knowledge; ✓ Relatedness of retained knowledge |
| Learning | Learning efficacy | ✓ Generalization ∣ bias ≥ generalization ∣ no bias |
| Learning | Learning efficiency | ✓ Time ∣ bias ≤ time ∣ no bias |

Learning

While all of the dimensions listed herein impact learning, the following dimensions correspond to specific learning capabilities or learning performance metrics.

Agency – The agency of a learning system is the degree of sophistication exhibited by its top-level controller. For example, a learning system may be on the low end of the agency continuum if it always applies one predetermined learning method to one task or on the high end if it selects among many learning methods as a function of the learning task. One might imagine, for example, two process diagrams such as the one depicted in Fig. , that share the same LTM, but are otherwise distinct and differentially activated by a governing controller as a function of qualitative aspects of the input.

Utility – Domain knowledge acquisition can be deliberative in the sense that the learning system decides which hypotheses to incorporate based upon their estimated utility, or reflexive, in which case all hypotheses are stored irrespective of utility considerations.

Task awareness – Task awareness characterizes the system’s ability to identify the beginning and end of a new task.

Bias modulation – A CL system may have the ability to determine the extent to which short-term learning would benefit from inductive transfer and, on that basis, assign a relevant weight. The depth of this analysis can vary and might consider factors such as the estimated sample complexity, number of exemplars, the generalization accuracy of retained knowledge, and relatedness of retained knowledge.

Learning efficacy – A measure of learning efficacy is derived by comparing generalization performance in the presence and absence of an inductive bias. Learning is considered effective when the application of an inductive bias results in greater generalization performance on the primary task than when the bias is absent.

Learning efficiency – Similarly, learning efficiency is assessed by comparing the computational time needed to generate a hypothesis in the presence and absence of an inductive bias. Lower computational time in the presence of bias signifies greater learning efficiency.
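The two performance dimensions above reduce to paired comparisons of a run with inductive bias against a run without it. A minimal sketch of such a comparison follows; the `Run` record, the function names, and the measurements are hypothetical and serve only to show the direction of each inequality.

```python
from dataclasses import dataclass

@dataclass
class Run:
    generalization: float  # e.g., accuracy on held-out exemplars of the primary task
    seconds: float         # computational time needed to generate the hypothesis

def is_effective(with_bias: Run, without_bias: Run) -> bool:
    """Learning efficacy: the bias must yield greater generalization performance."""
    return with_bias.generalization > without_bias.generalization

def is_efficient(with_bias: Run, without_bias: Run) -> bool:
    """Learning efficiency: the bias must lower the time needed to reach a hypothesis."""
    return with_bias.seconds < without_bias.seconds

biased = Run(generalization=0.91, seconds=12.0)    # made-up measurements
unbiased = Run(generalization=0.86, seconds=30.0)  # made-up measurements
print(is_effective(biased, unbiased), is_efficient(biased, unbiased))
```

Both comparisons are only meaningful when the two runs use the same primary task, the same exemplars, and the same evaluation data, so that the inductive bias is the only difference between them.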

The Research Space

Table summarizes the classification dimensions, providing an overview of the research space, an evaluative framework for assessing and contrasting CL approaches, and a generative framework for identifying new areas of exploration. In addition, checked items in the Values column indicate ML guidance. Specifically, an ideal ML system would correspond functionally to the called-out items and performance criteria. However, Silver and Poirier () allude to the fact that it would be nigh impossible to generate a strictly compliant ML system since some of the recommended criteria do not coexist easily. For example, effective and efficient learning are mutually incompatible because they require different forms of knowledge transfer. Nonetheless, a CL system that falls within the scope of the majority of the ML criteria would be well-positioned to exhibit lifelong learning behavior.

Future Directions

Emergent work (Oblinger, ; Swarup, Lakkaraju, Ray, & Gasser, ) in instructable computing has given rise to a new CL paradigm that is largely ML compliant and involves high degrees of task awareness and agency sophistication. Swarup et al. () describe an approach in which domain knowledge is represented in the form of structured graphs. Short term (primary task) learning occurs via a genetic algorithm, after which domain knowledge is extracted by mining frequent subgraphs. The accumulated domain knowledge forms an ontology to which the learning system grounds symbols as a result of structured interactions with instructional agents. Subsequent interactions occur using the symbol system as a shared lexicon for communication between the instructor and the learning system. Knowledge acquired from these interactions bootstraps future learning.

The Bootstrapped Learning framework proposed by Oblinger () provides for hierarchical, domain-independent learning that, like the effort described above, is also premised on a model of building concepts from structured lessons. In this case, however, there is no a priori knowledge acquisition. Instead, some “common” knowledge about the world is provided explicitly to the learning system, and then lessons are taught by a human teacher using the same natural instruction methods that would be used to teach another human. Rather than requiring a specific learning algorithm, this framework provides a context for evaluating and comparing learning algorithms. It includes a knowledge representation language that supports syntactic, logical, procedural, and functional knowledge, an interaction language for communication among the learning system, instructor, and environment, and an integration architecture that evaluates, processes, and responds to interaction language communiqués in the context of existing knowledge and through the selective utilization of available learning algorithms.

The learning performance advantages anticipated by these proposals for instructional computing seem to stem from the economy of representation afforded by hierarchical knowledge combined with the tremendous learning bias imposed by explicit instruction.

Recommended Reading

Abu-Mostafa, Y. (). Learning from hints in neural networks (invited). Journal of Complexity, (), –.
Banerji, R. B. (). A Language for the Description of Concepts. General Systems, , –.
Baxter, J. (). Learning internal representations. In (COLT): Proceeding of the workshop on computational learning theory, Santa Cruz, California. Morgan Kaufmann.
Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (). Metalearning – Applications to Data Mining. Springer.
Caruana, R. (). Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the tenth international conference on machine learning, University of Massachusetts, Amherst (pp. –).
Caruana, R. (). Algorithms and applications for multitask learning. In Machine learning: Proceedings of the th international conference on machine learning (ICML ), Bari, Italy (pp. –). Morgan Kaufmann.
Cohen, B. L. (). A Theory of Structural Concept Formation and Pattern Recognition. Ph.D. Thesis, Department of Computer Science, The University of New South Wales.
Cohen, B. L., & Sammut, C. A. (). Object Recognition and Concept Learning with CONFUCIUS. Pattern Recognition Journal, (), –.
Mitchell, T. (). The need for biases in learning generalizations. Rutgers TR CBM-TR-.
Mitchell, T. M., & Thrun, S. B. (). Explanation-based neural network learning for robot control. In Hanson, Cowan, & Giles (Eds.), Advances in neural information processing systems (pp. –). San Francisco, CA: Morgan Kaufmann.
Nilsson, N. J. (). Introduction to machine learning: An early draft of a proposed textbook (p. ). Online at http://ai.stanford.edu/~nilsson/MLBOOK.pdf. Accessed on July , .
Oblinger, D. (). Bootstrapped learning proposer information pamphlet for broad agency announcement -. Online at http://fs.fbo.gov/EPSData/ODA/Synopses//BAA/BLPIPfinal.pdf.
Pratt, L. Y., Mostow, J., & Kamm, C. A. (). Direct transfer of learned information among neural networks. In Proceedings of the ninth national conference on artificial intelligence (AAAI-), Anaheim, CA (pp. –).
Ring, M. (). Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In Proceedings of the eighth international workshop (ML), San Mateo, California.
Silver, D., & Mercer, R. (). The task rehearsal method of lifelong learning: Overcoming impoverished data. In R. Cohen & B. Spencer (Eds.), Advances in artificial intelligence, th conference of the Canadian society for computational studies of intelligence (AI ), Calgary, Canada, May –, . Lecture notes in computer science (Vol. , pp. –). London: Springer.
Silver, D., & Poirier, R. (). Requirements for machine lifelong learning. JSOCS Technical Report TR--, Acadia University.
Swarup, S., Lakkaraju, K., Ray, S. R., & Gasser, L. (). Symbol grounding through cumulative learning. In P. Vogt et al. (Eds.), Symbol grounding and beyond: Proceedings of the third international workshop on the emergence and evolution of linguistic communication, Rome, Italy (pp. –). Berlin: Springer.
Swarup, S., Mahmud, M. M. H., Lakkaraju, K., & Ray, S. R. (). Cumulative learning: Towards designing cognitive architectures for artificial agents that have a lifetime. Tech. Rep. UIUCDCS-R--.
Thrun, S. (). Lifelong learning algorithms. In S. Thrun & L. Y. Pratt (Eds.), Learning to learn. Norwell, MA: Kluwer Academic.
Thrun, S., & Mitchell, T. (). Lifelong robot learning. Robotics and Autonomous Systems, , –.
Turing, A. M. (). Computing Machinery and Intelligence. Mind, (), –.
Vilalta, R., & Drissi, Y. (). A perspective view and survey of meta-learning. Artificial Intelligence Review, , –.

Curse of Dimensionality Eamonn Keogh, Abdullah Mueen University California-Riverside, Riverside, CA, USA

Definition The curse of dimensionality is a term introduced by Bellman to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to Euclidean space (Bellman, ). For example, evenly-spaced sample points suffice to sample a unit interval with no more than . distance between points; an equivalent sampling of a -dimensional unit hypercube with a grid with a spacing of . between adjacent points would require sample points: thus, in some sense, the D hypercube can be said to be a factor of “larger” than the unit interval. Informally, the phrase curse of dimensionality is often used to simply refer to the fact that one’s intuitions about how data structures, similarity measures, and algorithms behave in low dimensions do not typically generalize well to higher dimensions.
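The growth described in the example above can be checked with a few lines of arithmetic. The sketch below counts the points in a regular grid that keeps the same resolution along every axis. The function name `grid_points` and the values used (100 points per axis, 10 dimensions) are illustrative assumptions, not figures taken from the text.

```python
def grid_points(points_per_axis: int, dimensions: int) -> int:
    """Number of points in a regular grid with the same resolution on every axis."""
    return points_per_axis ** dimensions


# Illustrative assumptions (not taken from the text): 100 points per axis, 10 dimensions.
points_per_axis = 100
dimensions = 10

interval = grid_points(points_per_axis, 1)            # grid on the unit interval
hypercube = grid_points(points_per_axis, dimensions)  # same resolution in 10 dimensions

print(interval)               # 100
print(hypercube)              # 100000000000000000000, i.e. 10**20
print(hypercube // interval)  # 1000000000000000000, i.e. 10**18
```

Because the count is `points_per_axis ** dimensions`, each additional dimension multiplies the number of required samples by the per-axis resolution, which is the exponential increase the definition refers to.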

Background Another way to envisage the vastness of high-dimensional Euclidean space is to compare the size of the unit sphere with that of the unit cube as the dimension of the space increases. As we can see in the figure, the unit sphere becomes an insignificant volume relative to that of the unit cube. In other words, almost all of the high-dimensional space is far away from the center.

Curse of Dimensionality. Figure. The ratio of the volume of the hypersphere enclosed by the unit hypercube to the volume of the hypercube, plotted against the dimension (0 to 20). The most intuitive example, the unit square and unit circle, is shown as an inset. Note that the volume of the hypersphere quickly becomes irrelevant for higher dimensionality.

In research papers, the phrase curse of dimensionality is often used as shorthand for one of its many implications for machine learning algorithms. Examples of these implications include:
● Nearest neighbor searches can be made significantly faster for low-dimensional data by indexing the data with an R-tree, a KD-tree, or a similar spatial access method. However, for high-dimensional data all such methods degrade to the performance of a simple linear scan across the data.
● For machine learning problems, a small increase in dimensionality generally requires a large increase in the numerosity of the data, in order to keep the same level of performance for regression, clustering, etc.
● In high-dimensional spaces, the normally intuitive concept of proximity or similarity may not be qualitatively meaningful. This is because the ratio of the distance from an object to its nearest neighbor over the distance to its farthest neighbor approaches one for high-dimensional spaces (Aggarwal, Hinneburg, & Keim, ). In other words, all objects are approximately equidistant from each other.

There are many ways to attempt to mitigate the curse of dimensionality, including feature selection and dimensionality reduction. However, there is no single solution to the many difficulties caused by the effect.
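Two of the effects described in this entry can be checked numerically: the shrinking share of the unit hypercube occupied by its enclosed hypersphere, and the nearest-to-farthest distance ratio approaching one. The following is a minimal sketch, not a reproduction of the authors' experiments. It assumes the hypersphere has radius 1/2 (the largest one that fits in the unit hypercube), uniformly random points, and Euclidean distance; the function names, the sample size of 200, and the seed are hypothetical choices made for illustration.

```python
import math
import random


def inscribed_ball_fraction(d: int) -> float:
    """Fraction of the unit hypercube's volume occupied by the enclosed ball of radius 1/2."""
    # Ball volume: pi^(d/2) / Gamma(d/2 + 1) * r^d. The unit hypercube has volume 1.
    # Computed in log space so that large d does not overflow the gamma function.
    log_volume = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1) + d * math.log(0.5)
    return math.exp(log_volume)


def nearest_to_farthest_ratio(d: int, n_points: int = 200, seed: int = 0) -> float:
    """Nearest over farthest Euclidean distance from one query to uniform random points."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    query = [rng.random() for _ in range(d)]
    distances = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


for d in (2, 10, 100, 500):
    print(d, inscribed_ball_fraction(d), nearest_to_farthest_ratio(d))
```

In two dimensions the first column of results is pi/4, the familiar circle-in-square ratio, and it falls toward zero as the dimension grows. The second column rises toward one, which is the sense in which all objects become approximately equidistant.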

Recommended Reading The major database (SIGMOD, VLDB, PODS), data mining (SIGKDD, ICDM, SDM), and machine learning (ICML, NIPS) conferences typically feature several papers which explicitly address the curse of dimensionality each year.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (). On the surprising behavior of distance metrics in high dimensional spaces. In ICDT (pp. –). London, England.
Bellman, R. E. (). Dynamic programming. Princeton, NJ: Princeton University Press.
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., & Keogh, E. (). Querying and mining of time series data: Experimental comparison of representations and distance measures. In Proceedings of the VLDB endowment (Vol. , pp. –). Auckland, New Zealand.

D

Data Mining On Text
See Text Mining

Data Preparation Geoffrey I. Webb Monash University, Victoria, Australia

Synonyms Data preprocessing; Feature construction

Definition Before data can be analyzed, they must be organized into an appropriate form. Data preparation is the process of manipulating and organizing data prior to analysis.

Motivation and Background Data are collected for many purposes, not necessarily with machine learning in mind. Consequently, there is often a need to identify and extract relevant data for the given analytic purpose. Every learning system has specific requirements about how data must be presented for analysis and hence, data must be transformed to fulfill those requirements. Further, the selection of the specific data to be analyzed can greatly affect the models that are learned. For these reasons, data preparation is a critical part of any machine learning exercise. Data preparation is often the most time-consuming part of any nontrivial machine learning project.

Processes and Techniques The manner in which data are prepared varies greatly depending upon the analytic objectives for which they are required and the specific learning techniques and software by which they are to be analyzed. The following are a number of key processes and techniques.

Sourcing, Selecting, and Auditing Appropriate Data
It is necessary to review the data that are already available, assess their suitability to the task at hand, and investigate the feasibility of sourcing new data collected specifically for the desired task. Much of the theory on which learning systems are based assumes that the training data are a random sample of the population about which the user wishes to learn a model. However, much historical data represent biased samples, for example, data that have been easy to collect or that have been considered interesting for some other purpose. It is desirable to consider whether the available data are sufficiently representative of the future data to which a learned model is to be applied. It is important to assess whether there is sufficient data to realistically obtain the desired machine learning outcomes. Data quality should be investigated. Much data is


Preface The term “Machine Learning” came into widespread use following the first workshop by that name, held at Carnegie-Mellon University in . The papers from that workshop were published as Machine Learning: An Artificial Intelligence Approach, edited by Ryszard Michalski, Jaime Carbonell and Tom Mitchell. Machine Learning came to be identified as a research field in its own right as the workshops evolved into international conferences and journals of machine learning appeared.

Although the field coalesced in the s, research on what we now call machine learning has a long history. In his paper on “Computing Machinery and Intelligence”, Alan Turing introduced his imitation game as a means of determining if a machine could be considered intelligent. In the same paper he speculated that programming the computer to have adult-level intelligence would be too difficult. “Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain”. Investigations into induction, a fundamental operation in learning, go back much further to Francis Bacon and David Hume in the th and th centuries.

Early approaches followed the classical AI tradition of symbolic representation and logical inference. As machine learning began to be used in a wide variety of areas, the range of techniques expanded to incorporate ideas from psychology, information theory, statistics, neuroscience, genetics, operations research and more. Because of this diversity, it is not always easy for a new researcher to find his or her way around the machine learning landscape. The purpose of this encyclopedia is to guide enquiries into the field as a whole and to serve as an entry point to specific topics, providing overviews and, most importantly, references to source material.

All the entries have been written by experts in their field and have been refereed and revised by an international editorial board consisting of leading machine learning researchers. Putting together an encyclopedia for such a diverse field has been a major undertaking. We thank all the authors, without whom this would not have been possible. They have devoted their expertise and patience to the project because of their desire to contribute to this dynamic and still growing field. A project as large as this could only succeed with the help of the area editors, whose specialised knowledge was essential in defining the range and structure of the entries. The encyclopedia was started by the enthusiasm of Springer editors Jennifer Evans and Oona Schmidt and continued with the support of Melissa Fearon. Special thanks to Andrew Spencer, who oversaw production and kept everyone, including the editors, on track.

Claude Sammut and Geoffrey I. Webb

Editors-in-Chief
Claude Sammut School of Computer Science and Engineering University of New South Wales Sydney, Australia [email protected]
Geoffrey I. Webb Faculty of Information Technology Clayton School of Information Technology Monash University P.O. Box Victoria, Australia [email protected]

Area Editors Charu Aggarwal IBM T. J. Watson Research Center Skyline Drive Hawthorne NY USA [email protected] Wray Buntine NICTA Locked Bag Canberra ACT Australia [email protected] James Cussens Department of Biology (Area ) York Centre for Complex Systems Analysis University of York PO Box York YO YW UK [email protected] Luc De Raedt Dept. of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected] Peter A. Flach Department of Computer Science University of Bristol Woodland Road Bristol BS UB UK [email protected] Russ Greiner Department of Computing Science University of Alberta Athabasca Hall Edmonton

Alberta TG E Canada [email protected] Eamonn Keogh Computer Science & Engineering Department University of California Riverside California CA USA [email protected] Michael L. Littman Department of Computer Science Rutgers, the State University of New Jersey Frelinghuysen Road Piscataway New Jersey - USA [email protected] Sridhar Mahadevan Department of Computer Science University of Massachusetts Governor’s Drive Amherst MA USA [email protected] Stan Matwin School of Information Technology and Engineering University of Ottawa King Edward Ave., P.O. Box Stn A Ottawa Ontario KN N Canada [email protected] Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin University Station C


Austin Texas TX - USA [email protected] Dunja Mladenic Department for Intelligent Systems J. Stefan Institute Jamova Ljubljana Slovenia [email protected] C. David Page Department of Biostatistics and Medical Informatics University of Wisconsin Medical School University Avenue Wisconsin Madison WI USA [email protected] Bernhard Pfahringer Department of Computer Science University of Waikato Private Bag Hamilton New Zealand [email protected] Michail Prokopenko CSIRO Macquarie University Building EB, Campus Herring Road North Ryde NSW Australia

Frank Stephan Department of Mathematics National University of Singapore Science Drive S, Singapore Singapore [email protected] Peter Stone Department of Computer Sciences The University of Texas at Austin University Station C Austin Texas TX - USA [email protected] Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis Oregon OR - USA [email protected] Takashi Washio The Institute of Scientific and Industrial Research Osaka University - Mihogaoka Osaka Ibaraki Japan [email protected]

List of Contributors Pieter Abbeel Department of Electrical Engineering and Computer Sciences University of California Sutardja Dai Hall # CA -, Berkeley California USA [email protected]

Charu C. Aggarwal IBM T. J. Watson Research Center Skyline Drive Hawthorne NY USA [email protected]

Biliana Alexandrova-Kabadjova General Directorate of Central Bank Operations Central Banking Operations Division Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F [email protected]

J. Andrew Bagnell Robotics Institute Carnegie Mellon University Forbes Avenue Pittsburgh, PA USA [email protected] Michael Bain University of New South Wales Sydney Australia [email protected] Arindam Banerjee Department of Computer Science and Engineering University of Minnesota Minneapolis, MN USA [email protected] Andrew G. Barto Department of Computer Science University of Massachusetts Amherst Computer Science Building Amherst, MA USA [email protected]

Periklis Andritsos Thoora Inc. Toronto, ON Canada [email protected]

Rohan A. Baxter Analytics, Intelligence and Risk Australian Taxation Office PO Box Civic Square, ACT Australia [email protected]

Peter Auer Institute of Computer Science University of Leoben Franz-Josef-Strasse Leoben Austria [email protected]

Bettina Berendt Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan A Heverlee Belgium [email protected]


Indrajit Bhattacharya IBM India Research Laboratory New Delhi India

Mustafa Bilgic University of Maryland AV Williams Bldg Rm College Park, MD USA

Mauro Birattari IRIDIA Université Libre de Bruxelles Brussels Belgium [email protected]

Hendrik Blockeel Department of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected]

Shawn Bohn Pacific Northwest National Laboratory

Antal van den Bosch Tilburg centre for Creative Computing Tilburg University P.O. Box LE, Tilburg The Netherlands [email protected]

Janez Brank Department for Intelligent Systems Jožef Stefan Institute Jamova Ljubljana Slovenia [email protected]

Jürgen Branke Institut für Angewandte Informatik und Formale Beschreibungsverfahren Universität Karlsruhe (TH) Karlsruhe Germany [email protected] Pavel Brazdil LIAAD-INESC Porto L.A./Faculdade de Economia Laboratory of Artificial Intelligence and Computer Science University of Porto Rua de Ceuta n. .piso Porto - Portugal [email protected] Gavin Brown The University of Manchester School of Computer Science Kilburn Building Oxford Road Manchester, M PL UK [email protected] Ivan Bruha Department of Computing & Software McMaster University Hamilton, ON Canada [email protected] M.D. Buhmann Numerische Mathematik Justus-Liebig University Mathematisches Institut Heinrich-Buff-Ring Giessen Germany [email protected] Wray L. Buntine NICTA Locked Bag Canberra ACT Australia [email protected]


Tibério Caetano Research School of Information Sciences and Engineering Australian National University Canberra ACT Australia [email protected] Nicola Cancedda Xerox Research Centre Europe , chemin de Maupertuis Meylan France [email protected] Gail A. Carpenter Department of Cognitive and Neural Systems Center for Adaptive Systems Boston University Boston, MA USA John Case Department of Computer and Information Sciences University of Delaware Newark DE - USA [email protected] Tonatiuh Peña Centeno Economic Research Division Bank of Mexico Av. de Mayo # Col. Centro, C.P. Mexico, D.F. Deepayan Chakrabarti Yahoo! Research st Avenue Sunnyvale, CA USA [email protected] Philip K. Chan Department of Computer Sciences Florida Institute of Technology Melbourne, FL USA [email protected]

Massimiliano Ciaramita Yahoo! Research Barcelona Ocata Barcelona Spain [email protected] Adam Coates Department of Computer Science Stanford University Stanford, CA USA David Cohn Google, Inc. Amphitheatre Parkway Mountain View, CA USA [email protected] David Corne Heriot-Watt University Earl Mountbatten Building Edinburgh EH AS UK [email protected] Susan Craw IDEAS Research Institute School of Computing The Robert Gordon University St. Andrew Street Aberdeen AB HG Scotland UK [email protected] Artur Czumaj Department of Computer Science University of Warwick Coventry CV AL UK [email protected] Walter Daelemans Department of Linguistics CLIPS University of Antwerp Prinsstraat Antwerpen Belgium [email protected]


Sanjoy Dasgupta Department of Computer Science and Engineering University of California San Diego Gilman Drive Mail Code La Jolla, California - USA [email protected] Gerald DeJong Department of Computer Science University of Illinois at Urbana Urbana, IL USA [email protected] Marco Dorigo IRIDIA Université Libre de Bruxelles Avenue Franklin Roosevelt Brussels Belgium [email protected] Kurt Driessens Departement Computerwetenschappen Katholieke Universiteit Leuven Celestijnenlaan A Heverlee Belgium [email protected] Christopher Drummond Integrated Reasoning National Research Council Institute for Information Technology Montreal Road Building M-, Room Ottawa, ON KA R Canada [email protected] Yaakov Engel AICML, Department of Computing Science University of Alberta - Athabasca Hall Edmonton Alberta TG E Canada [email protected]

Scott E. Fahlman Language Technologies Institute Carnegie Mellon University GHC Forbes Avenue Pittsburgh, PA USA [email protected] Alan Fern School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis, OR - USA [email protected] Peter A. Flach Department of Computer Science University of Bristol Woodland Road Bristol, BS UB UK [email protected] Pierre Flener Department of Information Technology Uppsala University Box SE- Uppsala Sweden [email protected] Johannes Fürnkranz TU Darmstadt Fachbereich Informatik Hochschulstraße Darmstadt Germany [email protected] Thomas Gärtner Knowledge Discovery Fraunhofer Institute for Intelligent Analysis and Information Systems Schloss Birlinghoven Sankt Augustin Germany [email protected]


João Gama Laboratory of Artificial Intelligence and Decision Support University of Porto Porto Portugal [email protected]

Alma Lilia García-Almanza General Directorate of Information Technology Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F. [email protected]

Gemma C. Garriga Laboratoire d’Informatique de Paris Universite Pierre et Marie Curie place Jussieu Paris France [email protected]

Wulfram Gerstner Laboratory of Computational Neuroscience Brain Mind Institute Ecole Polytechnique Fédérale de Lausanne Station Lausanne EPFL Switzerland [email protected]

Lise Getoor Department of Computer Science University of Maryland AV Williams Bldg, Rm College Park, MD USA [email protected]

Christophe Giraud-Carrier Department of Computer Science Brigham Young University TMCB Provo UT USA

Marko Grobelnik Department for Intelligent Systems Jožef Stefan Institute Jamova , Ljubljana Slovenia [email protected]

Stephen Grossberg Department of Cognitive Boston University Beacon Street Boston, MA USA [email protected]

Jiawei Han Department of Computer Science University of Illinois at Urbana Champaign N. Goodwin Avenue Urbana, IL USA [email protected]

Julia Handl Faculty of Life Sciences in Manchester University of Manchester UK [email protected]

Michael Harries Technology Strategy Division Advanced Products Group, Citrix Labs North Ryde NSW Australia

Jun He Department of Computer Science Aberystwyth University Aberystwyth SY DB Wales UK [email protected]


Bernhard Hengst School of Computer Science & Engineering University of New South Wales Sydney NSW Australia [email protected]

Phil Husbands Department of Informatics University of Sussex Brighton BNQH UK [email protected]

Tom Heskes Radboud University Nijmegen Toernooiveld ED Nijmegen The Netherlands [email protected]

Marcus Hutter Australian National University RSIS Room B Building Corner of North and Daley Road ACT Canberra Australia [email protected]

Geoffrey Hinton Department of Computer Science Office PT G University of Toronto King’s College Road MS G, Toronto Ontario Canada [email protected] Lawrence Holder School of Electrical Engineering and Computer Science Box Washington State University Pullman, WA USA [email protected] Tamás Horváth Department of Computer Science III University of Bonn and Fraunhofer IAIS Fraunhofer Institute for Intelligent Analysis and Information Systems Schloss Birlinghoven Sankt Augustin Germany [email protected] Eyke Hüllermeier Knowledge Engineering & Bioinformatics Head of the KEBI Lab Department of Mathematics and Computer Science Philipps-Universität Marburg Mehrzweckgebäude Hans-Meerwein-Straße Marburg Germany [email protected]

Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum Universitätsstr. Bochum Germany [email protected]

Sanjay Jain Department of Computer Science National University of Singapore Computing Drive Singapore Republic of Singapore [email protected]

Tommy R. Jensen Institut für Mathematik Alpen-Adria-Universität Klagenfurt Universitätsstr. - Klagenfurt Austria [email protected]

Xin Jin University of Illinois at Urbana-Champaign Toernooiveld ED Urbana, IL USA


Antonis C. Kakas Department of Computer Science University of Cyprus Kallipoleos Str., P.O. Box Nicosia Cyprus [email protected]

James Kennedy U.S. Bureau of Labor Statistics Postal Square Building Massachusetts Ave., NE Washington, DC - USA [email protected]

Subbarao Kambhampati Department of Computer Science and Engineering Arizona State University Tempe, AZ USA [email protected]

Eamonn Keogh Computer Science & Engineering Department University of California Riverside, CA USA [email protected]

Anne Kao The Boeing Company P.O. Box MC L- Seattle, WA - USA [email protected]

Kristian Kersting Knowledge Discovery Fraunhofer IAIS Schloß Birlinghoven Sankt Augustin Germany [email protected]

George Karypis Department of Computer Science and Engineering Digital Technology Center and Army HPC Research Center University of Minnesota Minneapolis, MN USA [email protected]

Joshua Knowles University of Manchester

Samuel Kaski Laboratory of Computer and Information Science Helsinki University of Technology P.O. Box TKK Finland [email protected]

Kevin B. Korb School of Information Technology Monash University Room , Bldg , Clayton, Victoria Australia [email protected]

Carlos Kavka Istituto Nazionale di Fisica Nucleare University of Trieste Trieste Italy [email protected]

Aleksander Kołcz Microsoft One Microsoft Way Redmond, WA USA [email protected]

Stefan Kramer Institut für Informatik/I Technische Universität München Boltzmannstr. Garching b. München Germany [email protected]


Krzysztof Krawiec Institute of Computing Science Poznan University of Technology Piotrowo - Poznan Poland [email protected]

Christina Leslie Computational Biology Program Sloan-Kettering Institute Memorial Sloan-Kettering Cancer Center York Ave Mail Box # New York, NY [email protected]

Nicolas Lachiche Image Sciences, Computer Sciences and Remote Sensing Laboratory , bld Brant Illkirch-Graffenstaden France [email protected]

Shiau Hong Lim University of Illinois IL USA [email protected]

Michail G. Lagoudakis Department of Electronic and Computer Engineering Technical University of Crete Chania Crete Greece [email protected] John Langford Yahoo Research New York, NY USA [email protected] Pier Luca Lanzi Dipartimento di Elettronica e Informazione Politecnico di Milano Milano Italy [email protected] Nada Lavrač Department of Knowledge Technologies Jožef Stefan Institute Jamova Ljubljana Slovenia Faculty of Information Technology University of Nova Gorica Vipavska Nova Gorica Slovenia

Charles X. Ling The University of Western Ontario Canada [email protected] Huan Liu Computer Science and Engineering Ira Fulton School of Engineering Arizona State University Brickyard Suite South Mill Avenue Tempe, AZ - USA [email protected] Bin Liu Faculty of Information Technology Monash University Melbourne Australia [email protected] John Lloyd College of Engineering and Computer Science The Australian National University , Canberra ACT Australia [email protected] Shie Mannor Department of Electrical Engineering Israel Institute of Technology Technion Technion City Haifa Israel [email protected]


Eric Martin Department of Artificial Intelligence School of Computer Science and Engineering University of New South Wales NSW Sydney Australia [email protected] Serafín Martínez-Jaramillo General Directorate of Financial System Analysis Financial System Analysis Division Bank of Mexico Av. de Mayo No. Col. Centro, C.P. Mexico, D.F [email protected] Stan Matwin School of Information Technology and Engineering University of Ottawa Ottawa, ON Canada [email protected] Julian McAuley Statistical Machine Learning Program Department of Engineering and Computer Science National University of Australia NICTA, Locked Bag Canberra ACT Australia [email protected] Prem Melville Machine Learning IBM T. J. Watson Research Center Route /P.O. Box Kitchawan Rd Yorktown Heights, NY USA [email protected] Pietro Michelucci Strategic Analysis, Inc. Wilson Blvd Suite Arlington, VA USA [email protected]

Rada Mihalcea Department of Computer Science and Engineering University of North Texas Denton, TX - USA [email protected] Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin University Station C Austin, TX - USA [email protected] Dunja Mladenić Department of Knowledge Technologies Jožef Stefan Institute Jamova , Ljubljana Slovenia [email protected] Katharina Morik Department of Computer Science Technische Universität Dortmund Dortmund Germany [email protected] Jun Morimoto Advanced Telecommunication Research Institute International ATR Kyoto Japan Abdullah Mueen Department of Computer Science and Engineering University California-Riverside Riverside, CA USA Paul Munro School of Information Sciences University of Pittsburgh Pittsburgh, PA USA [email protected]


Ion Muslea Language Weaver, Inc. Admiralty Way, Suite Marina del Rey, CA USA [email protected] Galileo Namata Department of Computer Science University of Maryland College Park, MD USA Sriraam Natarajan Department of Computer Sciences University of Wisconsin Medical School University Avenue Madison, WI USA [email protected] Andrew Y. Ng Stanford AI Laboratory Stanford University Serra Mall, Gates Building A Stanford, CA - USA [email protected] Siegfried Nijssen Institut für Informatik Albert-Ludwigs-Universität Freiburg Georges-Köhler-Allee, Gebäude Freiburg i. Br. Germany [email protected] William Stafford Noble Department of Genome Sciences University of Washington Seattle, WA USA [email protected] Petra Kralj Novak Department of Knowledge Technologies Jožef Stefan Institute Jamova Ljubljana Slovenia [email protected]

Daniel Oblinger DARPA/IPTO Fairfax Drive Arlington, VA USA [email protected]

Peter Orbanz Department of Engineering Cambridge University Trumpington Street Cambridge, CB PZ UK

Miles Osborne Institute for Communicating and Collaborative Systems University of Edinburgh Buccleuch Place Edinburgh EH LW Scotland UK [email protected]

C. David Page Department of Biostatistics and Medical Informatics University of Wisconsin Medical School University Avenue Madison, WI USA [email protected]

Jonathan Patrick Telfer School of Management University of Ottawa Laurier avenue Ottawa, ON KN N Canada [email protected]

Claudia Perlich Data Analytics Research Group IBM T.J. Watson Research Center P.O. Box Yorktown Heights, NY USA [email protected]


Jan Peters Department of Empirical Inference and Machine Learning Max Planck Institute for Biological Cybernetics Spemannstr. Tuebingen Germany [email protected]

Bernhard Pfahringer Department of Computer Science University of Waikato Private Bag Hamilton New Zealand [email protected]

Steve Poteet Boeing Phantom Works P.O. Box MC L- Seattle, WA USA

Pascal Poupart School of Computer Science University of Waterloo University Avenue West Waterloo ON NL G Canada [email protected]

Rob Powers Computer Science Department Stanford University Serra Mall Stanford, CA USA [email protected]

Cecilia M. Procopiuc AT&T Labs Florham Park, NJ USA [email protected]

Martin L. Puterman Centre for Health Care Management Sauder School of Business University of British Columbia Main Mall Vancouver, BC VT Z Canada [email protected] Lesley Quach Boeing Phantom Works P.O. Box MC L- Seattle, WA USA Novi Quadrianto Department of Engineering and Computer Science Australian National University NICTA London Circuit Canberra ACT Australia [email protected] Luc De Raedt Department of Computer Science Katholieke Universiteit Leuven Celestijnenlaan A BE - Heverlee Belgium [email protected] Dev Rajnarayan NASA Ames Research Center Mail Stop - Moffett Field, CA USA Adwait Ratnaparkhi Yahoo! Labs Santa Clara California USA [email protected] Soumya Ray School of EECS Oregon State University Kelley Engineering Center Corvallis, OR USA [email protected]


Mark Reid Research School of Information Sciences and Engineering The Australian National University Canberra, ACT Australia [email protected] Jean-Michel Renders Xerox Research Centre Europe , chemin de Maupertuis Meylan France John Risch Pacific Northwest National Laboratory Jorma Rissanen Complex Systems Computation Group Department of Computer Science Helsinki Institute of Information Technology Helsinki Finland [email protected] Nicholas Roy Massachusetts Institute of Technology Cambridge, MA USA Lorenza Saitta Università del Piemonte Orientale Alessandria Italy [email protected] Yasubumi Sakakibara Department of Biosciences and Informatics Keio University [email protected] Hiyoshi Kohoku-ku Japan Claude Sammut School of Computer Science and Engineering The University of New South Wales Sydney NSW Australia [email protected]

Joerg Sander Department of Computing Science University of Alberta Edmonton, AB Canada [email protected] Scott Sanner Statistical Machine Learning Group NICTA, London Circuit, Tower A ACT Canberra Australia [email protected] Stefan Schaal Department of Computer Science University of Southern California ATR Computational Neuroscience Labs Watt Way Los Angeles, CA - USA [email protected] Ute Schmid Department of Information Systems and Applied Computer Science University of Bamberg Feldkirchenstr. Bamberg Germany [email protected] Stephen Scott University of Nebraska Lincoln, NE USA Michele Sebag Laboratoire de Recherche en Informatique Université Paris-Sud Bât Orsay France [email protected] Prithviraj Sen University of Maryland AV Williams Bldg, Rm College Park, MD USA


Hanhuai Shan Department of Computer Science and Engineering University of Minnesota Minneapolis, MN USA [email protected]

Hossam Sharara Department of Computer Science University of Maryland College Park, MD Maryland USA

Victor S. Sheng The University of Western Ontario Canada

Jelber Sayyad Shirabad School of Information Technology and Engineering University of Ottawa King Edward P.O. Box Stn A, KN N Ottawa, Ontario Canada [email protected]

Yoav Shoham Computer Science Department Stanford University Serra Mall Stanford, CA USA [email protected]

Thomas R. Shultz Department of Psychology and School of Computer Science McGill University Dr. Penfield Avenue Montréal QC HA B Canada [email protected]

Ricardo Silva Gatsby Computational Neuroscience Unit University College London Alexandra House Queen Square London WCN AR UK [email protected]

Vikas Sindhwani IBM T. J. Watson Research Center Route /P.O. Box Kitchawan Rd Yorktown Heights, NY USA

Moshe Sipper Department of Computer Science Ben-Gurion University P.O. Box Beer-Sheva Israel [email protected]

William D. Smart Associate Professor Department of Computer Science and Engineering Washington University in St. Louis Campus Box One Brookings Drive St. Louis, MO USA [email protected]

Carlos Soares LIAAD-INESC Porto L.A./Faculdade de Economia Laboratory of Artificial Intelligence and Computer Science University of Porto Rua de Ceuta n. .piso, - Porto Portugal

Christian Sohler Heinz Nixdorf Institute & Computer Science Department University of Paderborn Fuerstenallee Paderborn Germany [email protected]


Frank Stephan Department of Computer Science and Department of Mathematics National University of Singapore Singapore Republic of Singapore [email protected]

Jon Timmis Department of Computer Science and Department of Electronics University of York Heslington York DD UK [email protected]

Peter Stone Department of Computer Sciences The University of Texas at Austin Austin, TX USA [email protected]

Jo-Anne Ting University of Edinburgh

Alexander L. Strehl Department of Computer Science Rutgers University Frelinghuysen Road Piscataway, NJ USA [email protected]

Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University Kelley Engineering Center Corvallis, OR - USA [email protected]

Russ Tedrake Department of Computer Science Massachusetts Institute of Technology Vassar Street Cambridge, MA USA [email protected]

Yee Whye Teh Gatsby Computational Neuroscience Unit University College London Queen Square London WCN AR UK [email protected]

Kai Ming Ting Gippsland School of Information Technology Monash University Gippsland Campus Churchill , Victoria Australia [email protected]

Ljupčo Todorovski Faculty of Administration University of Ljubljana Gosarjeva Ljubljana Slovenia [email protected]

Hannu Toivonen Department of Computer Science University of Helsinki P.O. Box (Gustaf Hällströmin katu b) Helsinki Finland [email protected]

Luís Torgo Department of Computer Science Faculty of Sciences University of Porto Rua Campo Alegre /, – Porto Portugal [email protected]

Panayiotis Tsaparas Microsoft Research Microsoft Mountain View, CA USA [email protected]


Paul E. Utgoff Department of Computer Science University of Massachusetts Governor’s Drive Amherst, MA – USA

William Uther NICTA and the University of New South Wales [email protected]

Sethu Vijayakumar University of Edinburgh University of Southern California

Eric Wiewiora University of California San Diego [email protected]

Anthony Wirth Department of Computer Science and Software Engineering The University of Melbourne Victoria Australia [email protected]

Ricardo Vilalta Department of Computer Science University of Houston Calhoun Rd Houston, TX - USA

Michael Witbrock Cycorp, Inc. Executive Center Drive Austin, TX USA [email protected]

Michail Vlachos IBM Zürich Research Laboratory Säumerstrasse Rüschlikon Switzerland [email protected]

David Wolpert NASA Ames Research Center Moffett Field, CA USA [email protected]

Kiri L. Wagstaff Machine Learning Systems Jet Propulsion Laboratory California Institute of Technology Pasadena, CA USA [email protected]

Geoffrey I. Webb Faculty of Information Technology Clayton School of Information Technology Monash University P.O. Box Victoria Australia [email protected]

R. Paul Wiegand Institute for Simulation and Training University of Central Florida Orlando, FL USA [email protected] [email protected]

Stefan Wrobel Department of Computer Science University of Bonn, and Fraunhofer IAIS (Institute for Intelligent Analysis and Information Systems) Fraunhofer IAIS Schloss Birlinghoven Sankt Augustin Germany

Jason Wu Boeing Phantom Works P.O. Box MC L- Seattle, WA USA

Zhao Xu Knowledge Discovery Fraunhofer IAIS Schloß Birlinghoven Sankt Augustin Germany


Ying Yang Australian Taxation Office White Horse Road Box Hill VIC Australia [email protected]

Ying Zhao Department of Computer Science and Technology Tsinghua University Beijing China

Sungwook Yoon PARC Labs Coyote Hill Road Palo Alto, CA USA

Fei Zheng Faculty of Information Technology Monash University Clayton School of I.T. Room , Bldg Wellington Road Clayton Melbourne Victoria Australia [email protected]

Thomas Zeugmann Division of Computer Science Graduate School of Information Science and Technology Hokkaido University Sapporo Japan [email protected]

Xinhua Zhang School of Computer Science Australian National University NICTA London Circuit Canberra Australia [email protected]

Xiaojin Zhu Department of Computer Sciences University of Wisconsin-Madison West Dayton Street, Madison, WI USA [email protected]

- -Norm Distance →Manhattan Distance

Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, DOI ./----, © Springer Science+Business Media LLC

A

Abduction

Antonis C. Kakas
University of Cyprus, Nicosia, Cyprus

Definition

Abduction is a form of reasoning, sometimes described as “deduction in reverse,” whereby, given a rule that “A follows from B” and the observed result “A,” we infer the condition “B” of the rule. More generally, given a theory, T, modeling a domain of interest and an observation, “A,” we infer a hypothesis “B” such that the observation follows deductively from T augmented with “B.” We think of “B” as a possible explanation for the observation according to the given theory that contains our rule. This new information and its consequences (or ramifications) according to the given theory can be considered as the result of (or part of) a learning process based on the given theory and driven by the observations that are explained by abduction. Abduction can be combined with →induction in different ways to enhance this learning process.

Motivation and Background

Abduction is, along with induction, a synthetic form of reasoning: it generates, in its explanations, new information not hitherto contained in the current theory with which the reasoning is performed. As such, it has a natural relation to learning, and in particular to knowledge intensive learning, where the new information generated aims to complete, at least partially, the current knowledge (or model) of the problem domain as described in the given theory.

Early uses of abduction in the context of machine learning concentrated on how abduction can be used as a theory revision operator for identifying where the current theory could be revised in order to accommodate the new learning data. This includes the work of Michalski (), Ourston and Mooney (), and Ade, Malfait, and Raedt (). Another early link of abduction to learning was given by the →explanation based learning method (DeJong & Mooney, ), where the abductive explanations of the learning data (training examples) are generalized to all cases. Following this, it was realized (Flach & Kakas, ) that the role of abduction in learning could be strengthened by linking it to induction, culminating in a hybrid approach in which abduction and induction are tightly integrated to provide powerful learning frameworks such as those of Progol . (Muggleton & Bryant, ) and HAIL (Ray, Broda, & Russo, ). On the other hand, from the point of view of abduction as “inference to the best explanation” (Josephson & Josephson, ), the link with induction provides a way to distinguish between different explanations and to select those explanations that give a better inductive generalization result.

A recent application of abduction, on its own or in combination with induction, is in Systems Biology, where we try to model biological processes and pathways at different levels. This challenging domain provides an important development test-bed for these methods of knowledge intensive learning (see e.g., King et al., ; Papatheodorou, Kakas, & Sergot, ; Ray, Antoniades, Kakas, & Demetriades, ; Tamaddoni-Nezhad, Kakas, Muggleton, & Pazos, ; Zupan et al., ).


Structure of the Learning Task

Abduction contributes to the learning task by first explaining, and thus rationalizing, the training data according to a given and current model of the domain to be learned. These abductive explanations either form on their own the result of learning or they feed into a subsequent phase to generate the final result of learning.

Abduction in Artificial Intelligence

Abduction, as studied in the area of Artificial Intelligence and from the perspective of learning, is mainly defined in a logic-based approach (other approaches to abduction include the set covering approach, see, e.g., Reggia (), and case-based explanation, e.g., Leake ()) as follows. Given a set of sentences T (a theory or model) and a sentence O (an observation), the abductive task is the problem of finding a set of sentences H (an abductive explanation for O) such that:

1. T ∪ H ⊧ O,
2. T ∪ H is consistent,

where ⊧ denotes the deductive entailment relation of the formal logic used in the representation of our theory, and consistency refers to the corresponding notion in this logic. The particular choice of this underlying formal framework of logic is in general a matter that depends on the problem or phenomena that we are trying to model. In many cases, this is based on →first order predicate calculus, as, for example, in the approach of theory completion in Muggleton and Bryant (). But other logics can be used, e.g., the nonmonotonic logics of default logic or logic programming with negation as failure, when the modeling of our problem requires this level of expressivity.

This basic formalization, as it stands, does not fully capture the explanatory nature of the abductive explanation H in the sense that it necessarily conveys some reason why the observations hold. It would, for example, allow an observation O to be explained by itself or in terms of some other observations rather than in terms of some “deeper” reason for which the observation must hold according to the theory T. Also, as the above specification stands, the observation can be abductively explained by generating in H some new (general) theory completely unrelated to the given theory T. In this case, H does not account for the observations O according to the given theory T, and in this sense it may not be considered as an explanation for O relative to T. For these reasons, in order to specify a “level” at which the explanations are required and to understand these relative to the given general theory about the domain of interest, the members of an explanation are normally restricted to belong to a special preassigned, domain-specific class of sentences called abducible.

Hence, abduction is typically applied on a model, T, in which we can separate two disjoint sets of predicates: the observable predicates and the abducible (or open) predicates. The basic assumption then is that our model T has reached a sufficient level of comprehension of the domain such that all the incompleteness of the model can be isolated (under some working hypotheses) in its abducible predicates. The observable predicates are assumed to be completely defined (in T) in terms of the abducible predicates and other background auxiliary predicates; any incompleteness in their representation comes from the incompleteness in the abducible predicates. In practice, the empirical observations that drive the learning task are described using the observable predicates. Observations are represented by formulae that refer only to the observable predicates (and possibly some background auxiliary predicates), typically by ground atomic facts on these observable predicates. The abducible predicates describe underlying (theoretical) relations in our model that are not observable directly but can, through the model T, bring about observable information.

The assumptions on the abducible predicates used for building up the explanations may be subject to restrictions that are expressed through integrity constraints. These represent additional knowledge that we have on our domain, expressing general properties of the domain that remain valid no matter how the theory is to be extended in the process of abduction and associated learning. Therefore, in general, an abductive theory is a triple, denoted by ⟨T, A, IC⟩, where T is the background theory, A is a set of abducible predicates, and IC is a set of integrity constraints. Then, in the definition of an abductive explanation given above, one more requirement is added:

3. T ∪ H satisfies IC.


The satisfaction of integrity constraints can be formally understood in several ways (see Kakas, Kowalski, & Toni, and references therein). Note that the integrity constraints reduce the number of explanations for a set of observations, filtering out those explanations that do not satisfy them. Based on this notion of abductive explanation, a credulous form of abductive entailment is defined. Given an abductive theory, T = ⟨T, A, IC⟩, and an observation O, then O is abductively entailed by T, denoted by T ⊧A O, if there exists an abductive explanation of O in T. This notion of abductive entailment can then form the basis of a coverage relation for learning in the face of incomplete information.
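As a minimal sketch of these definitions, and not a method taken from the cited literature, the following Python fragment enumerates abductive explanations by brute force for a small ground theory of definite rules. The wet-grass theory and the names `closure`, `explanations`, and `abductively_entails` are illustrative assumptions. Integrity constraints are assumed to be denials (sets of atoms that must not hold together); since a theory of definite rules is always consistent, only these denials can rule a hypothesis out.

```python
from itertools import combinations

# Hypothetical ground theory T: definite rules written as (head, body).
RULES = [
    ("clear_sky", ()),                 # a known fact
    ("wet_grass", ("rained",)),
    ("wet_grass", ("sprinkler_on",)),
    ("wet_shoes", ("wet_grass",)),
]
ABDUCIBLES = {"rained", "sprinkler_on"}      # the set A
CONSTRAINTS = [{"rained", "clear_sky"}]      # IC as denials: these atoms must not all hold


def closure(rules, hypothesis):
    """Forward-chain the rules from the assumed atoms: all atoms entailed by T ∪ H."""
    known = set(hypothesis)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return known


def explanations(rules, abducibles, constraints, observation):
    """List every H ⊆ A with T ∪ H ⊧ O and T ∪ H satisfying IC, smallest first."""
    found = []
    candidates = sorted(abducibles)
    for size in range(len(candidates) + 1):
        for hypothesis in combinations(candidates, size):
            model = closure(rules, hypothesis)
            violates = any(denial <= model for denial in constraints)
            if observation in model and not violates:
                found.append(set(hypothesis))
    return found


def abductively_entails(rules, abducibles, constraints, observation):
    """Credulous abductive entailment: at least one explanation exists."""
    return bool(explanations(rules, abducibles, constraints, observation))


print(explanations(RULES, ABDUCIBLES, CONSTRAINTS, "wet_shoes"))
```

In this toy theory the hypothesis {rained} entails the observation but is filtered out by the constraint, so {sprinkler_on} is the only explanation of wet_shoes. The exhaustive search over subsets of A is exponential and is used here only to keep the sketch short.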

Abductive Concept Learning

Abduction allows us to reason in the face of incomplete information. As such, when we have learning problems where the background data on the training examples is incomplete, the use of abduction can enhance the learning capabilities. Abductive concept learning (ACL) (Kakas & Riguzzi, ) is a learning framework that allows us to learn from incomplete information and to later be able to classify new cases that again could be incompletely specified. Under ACL, we learn abductive theories, ⟨T, A, IC⟩, with abduction playing a central role in the covering relation of the learning problem. The abductive theories learned in ACL contain both rules, in T, for the concept(s) to be learned as well as general clauses acting as integrity constraints in IC.

Practical problems that can be addressed with ACL are: (1) concept learning from incomplete background data where some of the background predicates are incompletely specified and (2) concept learning from incomplete background data together with given integrity constraints that provide some information on the incompleteness of the data. The treatment of incompleteness through abduction is integrated within the learning process. This allows the possibility of learning more compact theories that can alleviate the problem of overfitting due to the incompleteness in the data. A specific subcase of these two problems, and an important third application of ACL, is (3) multiple predicate learning, where each predicate is required to be learned from the incomplete data for the other predicates. Here the abductive reasoning can be used to suitably connect and integrate the learning of the different predicates. This can help to overcome some of the nonlocality difficulties of multiple predicate learning, such as order-dependence and global consistency of the learned theory.

ACL is defined as an extension of →Inductive Logic Programming (ILP) where both the background knowledge and the learned theory are abductive theories. The central formal definition of ACL is given as follows, where examples are atomic ground facts on the target predicate(s) to be learned.

Definition (Abductive Concept Learning)

Given
● A set of positive examples E+
● A set of negative examples E−
● An abductive theory T = ⟨P, A, I⟩ as background theory
● A hypothesis space T = ⟨P, I⟩ consisting of a space of possible programs P and a space of possible constraints I

Find
A set of rules P′ ∈ P and a set of constraints I′ ∈ I such that the new abductive theory T′ = ⟨P ∪ P′, A, I ∪ I′⟩ satisfies the following conditions
● T′ ⊧A E+
● ∀e− ∈ E−, T′ ⊭A e−

where E+ stands for the conjunction of all positive examples. An individual example e is said to be covered by a theory T′ if T′ ⊧A e. In effect, this definition replaces the deductive entailment used as the example coverage relation in the ILP problem with abductive entailment to define the ACL learning problem.

The fact that the conjunction of positive examples must be covered means that, for every positive example, there must exist an abductive explanation, and the explanations for all the positive examples must be consistent with each other. For negative examples, it is required that no abductive explanation exists for any of them. ACL can be illustrated as follows.
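The coverage test in this definition can also be sketched in code. The fragment below is a hedged illustration only, not the ACL algorithm of Kakas and Riguzzi: it hard-codes a ground encoding of the father domain used in the worked example of this entry, together with one candidate rule and one candidate constraint, and searches by brute force for a set ∆ of abducible assumptions. The names `consequences`, `satisfies_constraint`, and `explanation` are assumptions made for this sketch.

```python
from itertools import combinations

CONSTANTS = ["john", "mary", "david", "steve", "kathy", "ellen"]

# Background program P of the father domain, as ground atoms.
FACTS = {
    ("parent", "john", "mary"), ("male", "john"),
    ("parent", "david", "steve"),
    ("parent", "kathy", "ellen"), ("female", "kathy"),
}
ABDUCIBLE_PREDICATES = ("male", "female")   # the set A


def consequences(assumed):
    """Apply the candidate rule father(X, Y) ← parent(X, Y), male(X) to P ∪ ∆."""
    atoms = FACTS | set(assumed)
    derived = {("father", a[1], a[2]) for a in atoms
               if a[0] == "parent" and ("male", a[1]) in atoms}
    return atoms | derived


def satisfies_constraint(atoms):
    """Candidate constraint ← male(X), female(X): nobody is both male and female."""
    return not any(("male", c) in atoms and ("female", c) in atoms for c in CONSTANTS)


def explanation(goals, use_constraint=True):
    """Return a smallest set ∆ of ground abducibles covering all goals, or None."""
    candidates = [(p, c) for p in ABDUCIBLE_PREDICATES for c in CONSTANTS]
    for size in range(len(candidates) + 1):
        for delta in combinations(candidates, size):
            atoms = consequences(delta)
            if all(g in atoms for g in goals) and (
                    not use_constraint or satisfies_constraint(atoms)):
                return set(delta)
    return None


positives = [("father", "john", "mary"), ("father", "david", "steve")]
print(explanation(positives))                        # joint explanation of E+
print(explanation([("father", "kathy", "ellen")]))   # negative example
print(explanation([("father", "john", "steve")]))    # negative example
```

The positive examples are explained jointly, as the definition requires, while each negative example is tested on its own. Calling `explanation([("father", "kathy", "ellen")], use_constraint=False)` shows what the constraint contributes: without it, the negative example acquires an explanation.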


Example

Suppose we want to learn the concept father. Let the background theory be T = ⟨P, A, ∅⟩ where:

P = {parent(john, mary), male(john), parent(david, steve), parent(kathy, ellen), female(kathy)},
A = {male, female}.

Let the training examples be:

E+ = {father(john, mary), father(david, steve)},
E− = {father(kathy, ellen), father(john, steve)}.

In this case, a possible hypothesis T′ = ⟨P ∪ P′, A, I′⟩ learned by ACL would consist of

P′ = {father(X, Y) ← parent(X, Y), male(X)},
I′ = {← male(X), female(X)}.

This hypothesis satisfies the definition of ACL because:

1. T′ ⊧A father(john, mary), father(david, steve) with ∆ = {male(david)}.
2. T′ ⊭A father(kathy, ellen), as the only possible explanation for this goal, namely {male(kathy)}, is made inconsistent by the learned integrity constraint in I′.
3. T′ ⊭A father(john, steve), as this has no possible abductive explanations.

Hence, despite the fact that the background theory is incomplete (in its abducible predicates), ACL can find an appropriate solution to the learning problem by suitably extending the background theory with abducible assumptions. Note that the learned theory without the integrity constraint would not satisfy the definition of ACL, because there would exist an abductive explanation for the negative example father(kathy, ellen), namely ∆− = {male(kathy)}. This explanation is prohibited in the complete theory by the learned constraint together with the fact female(kathy).

The algorithm and learning system for ACL is based on a decomposition of this problem into two subproblems: (1) learning the rules in P′ together with appropriate explanations for the training examples and (2) learning integrity constraints driven by the explanations generated in the first part. This decomposition allows ACL to be developed by combining the two ILP settings of explanatory (predictive) learning and confirmatory (descriptive) learning. In fact, the first subproblem can be seen as a problem of learning from entailment, while the second subproblem can be seen as a problem of learning from interpretations.

Abduction and Induction

The utility of abduction in learning can be enhanced significantly when it is integrated with induction. Several approaches for synthesizing abduction and induction in learning have been developed, e.g., Ade and Denecker (), Muggleton and Bryant (), Yamamoto (), and Flach and Kakas (). These approaches aim to develop techniques for knowledge intensive learning with complex background theories. One problem to be faced by purely inductive techniques is that the training data on which the inductive process operates often contain gaps and inconsistencies. The general idea is that abductive reasoning can feed information into the inductive process by using the background theory for inserting new hypotheses and removing inconsistent data. Stated differently, abductive inference is used to complete the training data with hypotheses about missing or inconsistent data that explain the example or training data, using the background theory. This process gives alternative possibilities for assimilating and generalizing this data. Induction is a form of synthetic reasoning that typically generates knowledge in the form of new general rules. These rules can provide, either directly or indirectly through the current theory T that they extend, new interrelationships between the predicates of our theory; unlike in abduction, these can involve the observable predicates and, in some cases, even new predicates. The inductive hypothesis thus introduces new, hitherto unknown, links between the relations that we are studying, thus allowing new predictions on the observable predicates that would not have been possible before from the original theory under any abductive explanation. An inductive hypothesis, H, extends, as in abduction, the existing theory T to a new theory T′ = T ∪ H, but now H provides new links between observables and nonobservables that were missing or incomplete in the original theory T.
This is particularly evident from the fact that induction can be performed even with an empty given theory T, using just the set of observations. The observations specify incomplete (usually extensional) knowledge about the observable predicates, which we try to generalize into new knowledge. In contrast, the generalizing effect of abduction, if at all present, is much more limited. With the given current theory T, which abduction always needs to refer to, we implicitly restrict the generalizing power of abduction, as we require that the basic model of our domain remains that of T. Induction has a stronger and genuinely new generalizing effect on the observable predicates than abduction. While the purpose of abduction is to extend the theory with an explanation and then reason with it, thus enabling the generalizing potential of the given theory T, in induction the purpose is to extend the given theory to a new theory, which can provide new possible observable consequences.

This complementarity of abduction and induction, with abduction providing explanations from the theory while induction generalizes to form new parts of the theory, suggests a basis for their integration within the context of theory formation and theory development. A cycle of integration of abduction and induction (Flach & Kakas, ) emerges that is suitable for the task of incremental modeling (see the figure below). Abduction is used to transform (and in some sense normalize) the observations to information on the abducible predicates. Then, induction takes this as input and tries to generalize this information to general rules for the abducible predicates, now treating these as observable predicates for its own purposes. The cycle can then be repeated by adding the learned information on the abducibles back in the model as new partial information on the incomplete abducible predicates. This will affect the abductive explanations of new observations to be used again in a subsequent phase of induction. Hence, through this cycle of integration the abductive explanations of the observations are added to the theory, not in the (simple) form in which they have been generated, but in a generalized form given by a process of induction on these.

Abduction. Figure. The cycle of abductive and inductive knowledge development. The cycle is governed by the “equation” T ∪ H ⊧ O, where T is the current theory, O the observations triggering theory development, and H the new knowledge generated. On the left-hand side we have induction, its output feeding into the theory T for later use by abduction on the right; the abductive output in turn feeds into the observational data O′ for later use by induction, and so on.

A simple example, adapted from Ray et al. (), that illustrates this cycle of integration of abduction and induction is as follows. Suppose that our current model, T, contains the following rule and background facts:

sad(X) ← tired(X), poor(X),
tired(oli), tired(ale), tired(kr),
academic(oli), academic(ale), academic(kr),
student(oli), lecturer(ale), lecturer(kr),

where the only observable predicate is sad/1. Given the observations O = {sad(ale), sad(kr), not sad(oli)}, can we improve our model? The incompleteness of our model resides in the predicate poor. This is the only abducible predicate in our model. Using abduction we can explain the observations O via the explanation:

E = {poor(ale), poor(kr), not poor(oli)}.

Subsequently, treating this explanation as training data for inductive generalization, we can generalize this to get the rule:

poor(X) ← lecturer(X)

thus (partially) defining the abducible predicate poor when we extend our theory with this rule.

This combination of abduction and induction has recently been studied and deployed in several ways within the context of ILP. In particular, inverse entailment (Muggleton and Bryant, ) can be seen as a particular case of integration of abductive inference for constructing a “bottom” clause and inductive inference to generalize it. This is realized in Progol . and applied to several problems including the discovery of the function of genes in a network of metabolic pathways (King et al., ), and more recently to the study of inhibition in metabolic networks (Tamaddoni-Nezhad, Chaleil, Kakas, & Muggleton, ; Tamaddoni-Nezhad et al., ). In Moyle (), an ILP system called ALECTO integrates a phase of extraction-case abduction, which transforms each case of a training example to an abductive hypothesis, with a phase of induction that generalizes these abductive hypotheses. It has been used to learn robot navigation control programs by completing the specific domain knowledge required, within a general theory of planning that the robot uses for its navigation (Moyle, ).

The development of these initial frameworks that realize the cycle of integration of abduction and induction prompted the study of the problem of completeness, that is, of whether a framework can find any consistent hypothesis H such that T ∪ H ⊧ O for a given theory T and observations O. Progol was found to be incomplete (Yamamoto, ) and several new frameworks of integration of abduction and induction have been proposed, such as SOLDR (Ito & Yamamoto, ), CF-induction (Inoue, ), and HAIL (Ray et al., ). In particular, HAIL has demonstrated that one of the main reasons for the incompleteness of Progol is that, in its cycle of integration of abduction and induction, it uses a very restricted form of abduction. Lifting some of these restrictions, through the employment of methods from abductive logic programming (Kakas et al., ), has allowed HAIL to solve a wider class of problems. HAIL has been extended to a framework, called XHAIL (Ray, ), for learning nonmonotonic ILP, allowing it to be applied to learn Event Calculus theories for action description (Alrajeh, Ray, Russo, & Uchitel, ) and complex scientific theories for systems biology (Ray & Bryant, ). Applications of this integration of abduction and induction and the cycle of knowledge development can be found in the recent proceedings of the Abduction and Induction in Artificial Intelligence workshops in (Flach & Kakas, ) and (Ray, Flach, & Kakas, ).
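One pass of the cycle of abduction and induction can be sketched on the sad/poor example from this section. The code below is a deliberately naive illustration and does not reproduce Progol, HAIL, or any other system named above: the abductive step is specialised by hand to the single rule sad(X) ← tired(X), poor(X), and the inductive step only searches for a single-literal rule body among the unary background predicates. The names `abduce_poor` and `induce_rules` are assumptions of this sketch.

```python
# Background facts of the sad/poor example, as extensions of unary predicates.
TIRED = {"oli", "ale", "kr"}
BACKGROUND = {
    "academic": {"oli", "ale", "kr"},
    "student": {"oli"},
    "lecturer": {"ale", "kr"},
}
# Observations O: sad(ale), sad(kr), not sad(oli).
OBSERVATIONS = {"ale": True, "kr": True, "oli": False}


def abduce_poor(observations):
    """Abductive step for sad(X) ← tired(X), poor(X).

    For a tired individual, sad(X) holds exactly when poor(X) is assumed,
    so each observation on sad becomes an assumption on the abducible poor.
    """
    poor, not_poor = set(), set()
    for person, is_sad in observations.items():
        if person not in TIRED:
            if is_sad:
                raise ValueError(f"sad({person}) cannot be explained: tired({person}) is missing")
            continue  # not sad(person) already follows; nothing needs to be assumed
        (poor if is_sad else not_poor).add(person)
    return poor, not_poor


def induce_rules(positives, negatives):
    """Inductive step: bodies p such that poor(X) ← p(X) covers all positives, no negatives."""
    return [name for name, members in sorted(BACKGROUND.items())
            if positives <= members and not (negatives & members)]


poor, not_poor = abduce_poor(OBSERVATIONS)
print(poor, not_poor)
print(induce_rules(poor, not_poor))
```

On this data only lecturer qualifies as a rule body: academic also covers the negative case oli, and student covers none of the positive cases. Adding the induced rule back to the theory and repeating both steps on new observations would give the incremental cycle described in the text.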

Abduction in Systems Biology

Abduction has found a rich field of application in the domain of systems biology and the declarative modeling of computational biology. In a project called Robot Scientist (King et al., ), Progol . was used to generate abductive hypotheses about the function of genes. Similarly, learning the function of genes using abduction has been studied in GenePath (Zupan et al., ), where experimental genetic data is explained in order to facilitate the analysis of genetic networks. Also, in Papatheodorou et al. (), abduction is used to learn gene interactions and genetic pathways from microarray experimental data. Abduction and its integration with induction have been used in the study of the inhibitory effect of toxins in metabolic networks (Tamaddoni-Nezhad et al., , ), taking into account also the temporal variation that the inhibitory effect can have. Another bioinformatics application of abduction (Ray et al., ) concerns the modeling of human immunodeficiency virus (HIV) drug resistance and its use to assist medical practitioners in the selection of antiretroviral drugs for patients infected with HIV. Also, the recently developed frameworks of XHAIL and CF-induction have been applied to several problems in systems biology; see, e.g., Ray () and Ray and Bryant () for XHAIL, and Doncescu, Inoue, and Yamamoto () for CF-induction.

Cross References
→Explanation-Based Learning
→Inductive Logic Programming

Recommended Reading Ade, H., & Denecker, M. (). AILP: Abductive inductive logic programming. In C. S. Mellish (Ed.), IJCAI (pp. –). San Francisco: Morgan Kaufmann. Ade, H., Malfait, B., & Raedt, L. D. (). Ruth: An ILP theory revision system. In ISMIS. Berlin: Springer. Alrajeh, D., Ray, O., Russo, A., & Uchitel, S. (). Using abduction and induction for operational requirements elaboration. Journal of Applied Logic, (), –. DeJong, G., & Mooney, R. (). Explanation-based learning: An alternate view. Machine Learning, , –. Doncescu, A., Inoue, K., & Yamamoto, Y. (). Knowledge based discovery in systems biology using cf-induction. In H. G. Okuno & M. Ali (Eds.), IEA/AIE (pp. –). Heidelberg: Springer. Flach, P., & Kakas, A. (). Abductive and inductive reasoning: Background and issues. In P. A. Flach & A. C. Kakas (Eds.), Abductive and inductive reasoning. Pure and applied logic. Dordrecht: Kluwer. Flach, P. A., & Kakas, A. C. (Eds.). (). Abduction and induction in artificial intelligence [Special issue]. Journal of Applied Logic, (). Inoue, K. (). Inverse entailment for full clausal theories. In LICS workshop on logic and learning.

Accuracy


Absolute Error Loss

▶Mean Absolute Error

Accuracy

Definition

Accuracy is a measure of the degree to which the predictions of a ▶model match the reality being modeled. The term is most often applied to ▶classification models. In this context, accuracy = P(λ(X) = Y), where XY is a ▶joint distribution and the classification model λ is a function X → Y. Sometimes this quantity is expressed as a percentage rather than a value between 0.0 and 1.0.

The accuracy of a model is often assessed or estimated by applying it to test data for which the ▶labels (Y values) are known. The accuracy of a classifier on test data may be calculated as (number of correctly classified objects)/(total number of objects). Alternatively, a smoothing function may be applied, such as a ▶Laplace estimate or an ▶m-estimate.


Accuracy is directly related to ▶error rate, such that accuracy = 1.0 − error rate (or, when expressed as a percentage, accuracy = 100 − error rate).
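The calculations described in this entry can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the Laplace variant assumes two outcomes (correct and incorrect), and the prior and weight used by the m-estimate are arbitrary choices.

```python
def accuracy(predicted, actual):
    """Fraction of test objects whose predicted label equals the known label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def laplace_accuracy(correct, total, outcomes=2):
    """Laplace-smoothed estimate: one pseudo-count per outcome.

    Assumption of this sketch: two outcomes, 'correct' and 'incorrect'.
    """
    return (correct + 1) / (total + outcomes)


def m_estimate_accuracy(correct, total, prior=0.5, m=2.0):
    """m-estimate: shrink the raw rate toward `prior` with weight `m`."""
    return (correct + m * prior) / (total + m)


# Hypothetical test data with known labels
predicted = ["spam", "ham", "spam", "ham", "ham"]
actual = ["spam", "ham", "ham", "ham", "ham"]

acc = accuracy(predicted, actual)
error_rate = 1.0 - acc
print(acc)  # 0.8
print(laplace_accuracy(4, 5), m_estimate_accuracy(4, 5))
```

With few test objects the smoothed estimates are pulled noticeably toward the prior; as the test set grows they converge on the raw ratio.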

Cross References

▶Confusion Matrix
▶Resubstitution Accuracy

ACO

▶Ant Colony Optimization

Actions

In a ▶Markov decision process, actions are the available choices for the decision-maker at any given decision epoch, in any given state.

Active Learning

David Cohn
Mountain View, CA, USA

Definition

The term Active Learning is generally used to refer to a learning problem or system where the learner has some role in determining the data on which it will be trained. This is in contrast to Passive Learning, where the learner is simply presented with a ▶training set over which it has no control. Active learning is often used in settings where obtaining ▶labeled data is expensive or time-consuming; by sequentially identifying which examples are most likely to be useful, an active learner can sometimes achieve good performance using far less ▶training data than would otherwise be required.

Structure of Learning System

In many machine learning problems, the training data are treated as a fixed and given part of the problem definition. In practice, however, the training data are often not fixed beforehand. Rather, the learner has an opportunity to play a role in deciding what data will be acquired for training. This process is usually referred to as “active learning,” recognizing that the learner is an active participant in the training process.

The typical goal in active learning is to select training examples that best enable the learner to minimize its loss on future test cases. There are many theoretical and practical results demonstrating that, when applied properly, active learning can greatly reduce the number of training examples, and even the computational effort, required for a learner to achieve good generalization.

A toy example that is often used to illustrate the utility of active learning is that of learning a threshold function over a one-dimensional interval. Given +/− labels for N points drawn uniformly over the interval, the expected error between the true value of the threshold and any learner’s best guess is bounded by O(1/N). Given the opportunity to sequentially select the position of points to be labeled, however, a learner can pursue a binary search strategy, obtaining a best guess that is within O(1/2^N) of the true threshold value. This toy example illustrates the sequential nature of example selection that is a component of most (but not all) active learning strategies: the learner makes use of initial information to discard parts of the solution space, and to focus future data acquisition on distinguishing parts that are still viable.
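The threshold example can be simulated directly. The sketch below is illustrative only: the interval is taken to be [0, 1], the labeling oracle and the threshold value are hypothetical, and the passive learner simply guesses the midpoint of the gap between the closest negative and positive points.

```python
import random


def passive_estimate(xs, labels):
    """Midpoint between the largest negative and the smallest positive point."""
    lo = max((x for x, y in zip(xs, labels) if y < 0), default=0.0)
    hi = min((x for x, y in zip(xs, labels) if y > 0), default=1.0)
    return (lo + hi) / 2


def active_estimate(oracle, n_queries):
    """Binary search on [0, 1]: each label halves the interval of viable thresholds."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        mid = (lo + hi) / 2
        if oracle(mid) > 0:  # mid lies at or above the threshold
            hi = mid
        else:                # mid lies below the threshold
            lo = mid
    return (lo + hi) / 2


true_threshold = 0.3141  # hypothetical target, unknown to the learner


def oracle(x):
    """Hypothetical labeling oracle: + at or above the threshold, - below."""
    return 1 if x >= true_threshold else -1


n = 20
rng = random.Random(0)
xs = [rng.random() for _ in range(n)]
passive = passive_estimate(xs, [oracle(x) for x in xs])
active = active_estimate(oracle, n)

print("passive error:", abs(passive - true_threshold))
print("active error: ", abs(active - true_threshold))
```

Both learners spend the same budget of N = 20 labels. The active learner's error is guaranteed to be at most 2^-(N+1), because the threshold always stays inside the shrinking interval, whereas the passive learner's error depends on how close the random points happen to fall to the threshold.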

Related Problems

The term “active learning” is usually applied in supervised learning settings, though there are many related problems in other branches of machine learning and beyond. The “exploration” component of the “exploration/exploitation” strategy in reinforcement learning is one such example: the learner must take actions to gain information, and must decide which actions will give it the information that best minimizes future loss. A number of fields of Operations Research predate and parallel machine learning work on active learning, including Decision Theory (North, ), Value of Information Computation, Bandit problems (Robbins, ), and Optimal Experiment Design (Fedorov, ; Box & Draper, ).


Active Learning Scenarios

When active learning is used for classification or regression, there are three common settings: constructive active learning, pool-based active learning, and stream-based active learning (also called selective sampling).

Constructive Active Learning

In constructive active learning, the learner is allowed to propose arbitrary points in the input space as examples to be labeled. While this in theory gives the learner the most power to explore, it is often not practical. One obstacle is the observation that most learning systems train on only a reduced representation of the instances they are presented with: text classifiers on bags of words (rather than fully-structured text) and speech recognizers on formants (rather than raw audio). While a learning system may be able to identify what pattern of formants would be most informative to label, there is no reliable way to generate audio that a human could recognize (and label) from the desired formants alone.

Pool-Based Active Learning

Pool-based active learning (McCallum & Nigam, ) is popular in domains such as text classification and speech recognition where unlabeled data are plentiful and cheap, but labels are expensive and slow to acquire. In pool-based active learning, the learner may not propose arbitrary points to label, but instead has access to a set of unlabeled examples, and is allowed to select which of them to request labels for. A special case of pool-based learning is transductive active learning, where the test distribution is exactly the set of unlabeled examples. The goal then is to sequentially select and label a small number of examples that will best allow predicting the labels of those points that remain unlabeled.

A theme that is common to both constructive and pool-based active learning is the principle of sequential experimentation. Information gained from early queries allows the learner to focus its search on portions of the domain that are most likely to give it additional information on subsequent queries.

Stream-Based Active Learning

Stream-based active learning resembles pool-based learning in many ways, except that the learner only has access to the unlabeled instances as a stream; when an instance arrives, the learner must decide whether to ask for its label or let it go.

Other Forms of Active Learning

By virtue of the broad definition of active learning, there is no real limit on the possible settings for framing it. Angluin’s early work on learning regular sets (Angluin, ) employed a “counterexample” oracle: the learner would propose a solution, and the oracle would either declare it correct or divulge a counterexample, that is, an instance on which the proposed and true solutions disagreed. Jin and Si () describe a Bayesian method for selecting informative items to recommend when learning a collaborative filtering model, and Steck and Jaakkola () describe a method best described as unsupervised active learning to build Bayesian networks in large domains.

While most active learning work assumes that the cost of obtaining a label is independent of the instance to be labeled, there are many scenarios where this is not the case. A mobile robot taking surface measurements must first travel to the point it wishes to sample, making distant points more expensive than nearby ones. In some cases, the cost of a query (e.g., the difficulty of traveling to a remote point to sample it) may not even be known until it is made, requiring the learner to learn a model of that as well. In these situations, the sequential nature of active learning is greatly accentuated, and a learner faces the additional challenges of planning under uncertainty (see “Greedy Versus Batch Active Learning,” below).
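The practical difference between the pool-based and stream-based scenarios described above is what the learner can see when it decides. The following sketch makes that concrete under simple assumptions: the `uncertainty` score, the decision boundary at 0.5, and the acceptance threshold are all hypothetical stand-ins for whatever the learner actually uses.

```python
def pool_query(pool, uncertainty):
    """Pool-based: inspect every unlabeled example and request the most uncertain one."""
    return max(pool, key=uncertainty)


def stream_queries(stream, uncertainty, threshold):
    """Stream-based (selective sampling): decide on each arrival, with no look-ahead."""
    for x in stream:
        if uncertainty(x) >= threshold:
            yield x  # ask for this label; everything else is let go


# Hypothetical 1-D learner whose current decision boundary sits at 0.5;
# examples closer to the boundary are treated as more uncertain.
boundary = 0.5


def uncertainty(x):
    return 1.0 - abs(x - boundary)


unlabeled = [0.05, 0.48, 0.9, 0.61, 0.3]
print(pool_query(unlabeled, uncertainty))                  # 0.48
print(list(stream_queries(unlabeled, uncertainty, 0.85)))  # [0.48, 0.61]
```

The pool-based learner can rank all candidates before spending a label, while the stream-based learner must commit to a threshold in advance and may therefore query more or fewer examples than it would have chosen with hindsight.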

Common Active Learning Strategies

1. Version space partitioning. The earliest practical active learning work (Ruff & Dietterich, ; Mitchell, ) explicitly relied on ▶version space partitioning. These approaches tried to select examples on which there was maximal disagreement between hypotheses in the current version space. When such examples were labeled, they would invalidate as large a portion of the version space as possible. A limitation of explicit version space approaches is that, in underconstrained domains, a learner may waste its effort differentiating portions of the version space that have little effect on the classifier’s predictions, and thus on its error.

2. Query by Committee (Seung, Opper, & Sompolinsky, ). In query by committee, the experimenter trains an ensemble of models, either by selecting randomized starting points (e.g., in the case of a neural network) or by bootstrapping the training set. Candidate examples are scored based on disagreement among the ensemble models: examples with high disagreement indicate areas in the sample space that are underdetermined by the training data, and therefore potentially valuable to label. Models in the ensemble may be looked at as samples from the version space; picking examples where these models disagree is a way of splitting the version space.

3. Uncertainty sampling (Lewis & Gale, ). Uncertainty sampling is a heuristic form of statistical active learning. Rather than sampling different points in the version space by training multiple learners, the learner itself maintains an explicit model of uncertainty over its input space. It then selects for labeling those examples on which it is least confident. In classification and regression problems, uncertainty contributes directly to expected loss (as the variance component of the “error = bias + variance” decomposition), so that gathering examples where the learner has greatest uncertainty is often an effective loss-minimization heuristic. This approach has also been found effective for non-probabilistic models, by simply selecting examples that lie near the current decision boundary. For some learners, such as support vector machines, this heuristic can be shown to be an approximate partitioning of the learner’s version space (Tong & Koller, ).

4. Loss minimization (Cohn, Ghahramani, & Jordan, ). Uncertainty sampling can stumble when parts of the learner’s domain are inherently noisy. It may be that, regardless of the number of samples labeled in some neighborhood, it will remain impossible to accurately predict these. In these cases, it would be desirable not only to model the learner’s uncertainty over arbitrary parts of its domain, but also to model what effect labeling any future example is expected to have on that uncertainty. For some learning algorithms it is feasible to explicitly compute such estimates (e.g., for locally-weighted regression and mixture models, these estimates may be computed in closed form). It is, therefore, practical to select examples that directly minimize the expected loss to the learner, as discussed below under “Statistical Active Learning.”
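As an illustration of the second strategy in the list above, the following sketch builds a committee by bootstrapping a small labeled set and scores pool examples by how evenly the committee splits on them. It is a toy under stated assumptions: the learner is a hypothetical one-dimensional threshold classifier on [0, 1], the data are separable, and the disagreement measure (size of the minority vote) is one simple choice among several.

```python
import random


def fit_threshold(points):
    """Fit a 1-D threshold: midpoint between the largest negative and the
    smallest positive example (assumes separable data on [0, 1])."""
    neg = [x for x, y in points if y < 0]
    pos = [x for x, y in points if y > 0]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2


def committee(labeled, size, rng):
    """Train an ensemble by bootstrapping the labeled set."""
    return [
        fit_threshold([rng.choice(labeled) for _ in labeled])
        for _ in range(size)
    ]


def disagreement(x, thresholds):
    """Fraction of committee members in the minority vote on x (0 = unanimous)."""
    positive_votes = sum(x >= t for t in thresholds)
    return min(positive_votes, len(thresholds) - positive_votes) / len(thresholds)


rng = random.Random(0)
labeled = [(0.1, -1), (0.2, -1), (0.45, -1), (0.7, 1), (0.8, 1), (0.95, 1)]
pool = [0.05, 0.3, 0.5, 0.6, 0.65, 0.9]

members = committee(labeled, size=25, rng=rng)
query = max(pool, key=lambda x: disagreement(x, members))
print("next example to label:", query)
```

Every bootstrapped threshold falls between the negative and positive examples, so pool points far from that gap receive unanimous votes and are never preferred; only points inside the gap can split the committee.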

Statistical Active Learning

Uncertainty sampling and direct loss minimization are two examples of statistical active learning. Both rely on the learner’s ability to statistically model its own uncertainty. When learning with a statistical model, such as a linear regressor or a mixture of Gaussians (Dasgupta, ), the objective is usually to find model parameters that minimize some form of expected loss. When active learning is applied to such models, it is natural to also select training data so as to minimize that same objective. As statistical models usually give us estimates on the probability of (as yet) unknown values, it is often straightforward to turn this machinery upon itself to assist in the active learning process (Cohn et al., ). The process is usually as follows:

1. Begin by requesting labels for a small random subsample of the examples $x_1, x_2, \ldots, x_n$ and fit our model to the labeled data.

2. For any $x$ in our domain, a statistical model lets us estimate both the conditional expectation $\hat{y}(x)$ and $\sigma^2_{\hat{y}(x)}$, the variance of that expectation. We estimate our current loss by drawing a new random sample of unlabeled data, and computing the averaged $\sigma^2_{\hat{y}(x)}$.

3. We now consider a candidate point $\tilde{x}$, and ask what reduction in loss we would obtain if we had labeled it $\tilde{y}$. If we knew its label with certainty, we could simply add the point to the training set, retrain, and compute the new expected loss. While we do not know the true $\tilde{y}$, we could, in theory, compute the new expected loss for every possible $\tilde{y}$ and average those losses, weighting them by our model’s estimate of $p(\tilde{y} \mid \tilde{x})$. In practice, this is normally infeasible; however, for some statistical models, such as locally-weighted regression and mixtures of Gaussians, we can compute the distribution of $p(\tilde{y} \mid \tilde{x})$ and its effect on $\sigma^2_{\hat{y}(x)}$ in closed form (Cohn et al., ).

4. Given the ability to estimate the expected effect of obtaining label $\tilde{y}$ for candidate $\tilde{x}$, we repeat this computation for a sample of M candidates, and then request a label for the candidate with the largest expected decrease in loss. We add the newly-labeled example to our training set, retrain, and begin looking at candidate points to add on the next iteration.

The Need for Reference Distributions

Step (2) above illustrates a complication that is unique to active learning approaches. Traditional “passive” learning usually relies on the assumption that the distribution over which the learner will be tested is the same as the one from which the training data were drawn. When the learner is allowed to select its own training data, it still needs some form of access to the distribution of data on which it will be tested. A pool-based or stream-based learner can use the pool or stream as a proxy for that distribution, but if the learner is allowed (or required) to construct its own examples, it risks wasting all its effort on resolving portions of the solution space that are of no interest to the problem at hand.

A Detailed Example: Statistical Active Learning with LOESS

LOESS (Cleveland, Devlin, & Gross, ) is a simple form of locally-weighted regression using a kernel function. When asked to predict the unknown output $y$ corresponding to a given input $x$, LOESS computes a ▶linear regression over known $(x, y)$ pairs, in which it gives pair $(x_i, y_i)$ weight according to the proximity of $x_i$ to $x$. We will write this weighting as a kernel function, $K(x_i, x)$, or simplify it to $k_i$ when there is no chance of confusion.

In the active learning setting, we will assume that we have a large supply of unlabeled examples drawn from the test distribution, along with labels for a small number of them. We wish to label a small number more so as to minimize the mean squared error (MSE) of our model. MSE can be decomposed into two terms: squared ▶bias and variance. If we make the (inaccurate but simplifying) assumption that LOESS is approximately unbiased for the problem at hand, minimizing MSE reduces to minimizing the variance of our estimates.

Given $n$ labeled pairs, and a prediction to make for input $x$, LOESS computes the following covariance statistics around $x$:

$$\mu_x = \frac{\sum_i k_i x_i}{n}, \qquad \sigma_x^2 = \frac{\sum_i k_i (x_i - \mu_x)^2}{n}, \qquad \sigma_{xy} = \frac{\sum_i k_i (x_i - \mu_x)(y_i - \mu_y)}{n},$$

$$\mu_y = \frac{\sum_i k_i y_i}{n}, \qquad \sigma_y^2 = \frac{\sum_i k_i (y_i - \mu_y)^2}{n}, \qquad \sigma_{y|x}^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}.$$

We can combine these to express the conditional expectation of $y$ (our estimate) and its variance as:

$$\hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x), \qquad \sigma_{\hat{y}}^2 = \frac{\sigma_{y|x}^2}{n^2}\left(\sum_i k_i^2 + \frac{(x - \mu_x)^2}{\sigma_x^2}\sum_i k_i^2\,\frac{(x_i - \mu_x)^2}{\sigma_x^2}\right).$$
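The LOESS statistics and the resulting prediction and variance can be computed as follows. This is a sketch under two assumptions that are not fixed by the text: the kernel is taken to be Gaussian with a hypothetical bandwidth parameter, and n is taken to be the total kernel weight (the sum of the k_i), so that the weighted means are properly normalized. Function names are illustrative.

```python
import math


def gaussian_kernel(xi, x, bandwidth):
    """K(x_i, x): weight decaying with distance. The Gaussian form is an assumption."""
    return math.exp(-((xi - x) ** 2) / (2 * bandwidth ** 2))


def loess_predict(xs, ys, x, bandwidth=0.5):
    """Return (y_hat, var_y_hat) at x from the weighted covariance statistics."""
    k = [gaussian_kernel(xi, x, bandwidth) for xi in xs]
    n = sum(k)  # assumption of this sketch: n is the total kernel weight

    mu_x = sum(ki * xi for ki, xi in zip(k, xs)) / n
    mu_y = sum(ki * yi for ki, yi in zip(k, ys)) / n
    var_x = sum(ki * (xi - mu_x) ** 2 for ki, xi in zip(k, xs)) / n
    var_y = sum(ki * (yi - mu_y) ** 2 for ki, yi in zip(k, ys)) / n
    cov_xy = sum(
        ki * (xi - mu_x) * (yi - mu_y) for ki, xi, yi in zip(k, xs, ys)
    ) / n
    var_y_given_x = var_y - cov_xy ** 2 / var_x

    # Conditional expectation of y at x
    y_hat = mu_y + cov_xy / var_x * (x - mu_x)

    # Variance of that expectation
    sum_k_sq = sum(ki ** 2 for ki in k)
    spread = sum(ki ** 2 * (xi - mu_x) ** 2 for ki, xi in zip(k, xs)) / var_x
    var_y_hat = var_y_given_x / n ** 2 * (
        sum_k_sq + (x - mu_x) ** 2 / var_x * spread
    )
    return y_hat, var_y_hat


# Hypothetical labeled data: roughly y = 2x + 1 with small perturbations
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2]
y_hat, var_y_hat = loess_predict(xs, ys, x=1.2)
print(f"prediction {y_hat:.3f}, variance of prediction {var_y_hat:.5f}")
```

On data that lie exactly on a line, the conditional variance vanishes and the prediction reproduces the line, which is a convenient sanity check for any implementation of these formulas.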

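Our proxy for model error is the variance of our prediction, integrated over the test distribution, $\langle \sigma_{\hat{y}}^2 \rangle$. As we have assumed a pool-based setting in which we have a large number of unlabeled examples from that distribution, we can simply compute the above variance over a sample from the pool, and use the resulting average as our estimate.

To perform statistical active learning, we want to compute how our estimated variance will change if we add an (as yet unknown) label $\tilde{y}$ for an arbitrary $\tilde{x}$. We will write this new expected variance as $\langle \tilde{\sigma}_{\hat{y}}^2 \rangle$. While we do not know what value $\tilde{y}$ will take, our model gives us an estimated mean $\hat{y}(\tilde{x})$ and variance $\sigma_{\hat{y}(\tilde{x})}^2$ for the value, as above. We can add this “distributed” $y$ value to LOESS just as though it were a discrete one, and compute the resulting expectation $\langle \tilde{\sigma}_{\hat{y}}^2 \rangle$ in closed form. Defining $\tilde{k}$ as $K(\tilde{x}, x)$, we write:

$$\langle \tilde{\sigma}_{\hat{y}}^2 \rangle = \frac{\langle \tilde{\sigma}_{y|x}^2 \rangle}{(n + \tilde{k})^2}\left(\sum_i k_i^2 + \tilde{k}^2 + \frac{(x - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2}\left(\sum_i k_i^2\,\frac{(x_i - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2} + \tilde{k}^2\,\frac{(\tilde{x} - \tilde{\mu}_x)^2}{\tilde{\sigma}_x^2}\right)\right),$$

where the component expectations are computed as follows:

$$\langle \tilde{\sigma}_{y|x}^2 \rangle = \langle \tilde{\sigma}_y^2 \rangle - \frac{\langle \tilde{\sigma}_{xy}^2 \rangle}{\tilde{\sigma}_x^2},$$

$$\langle \tilde{\sigma}_y^2 \rangle = \frac{n \sigma_y^2}{n + \tilde{k}} + \frac{n \tilde{k}\left(\sigma_{y|\tilde{x}}^2 + (\hat{y}(\tilde{x}) - \mu_y)^2\right)}{(n + \tilde{k})^2},$$

$$\tilde{\mu}_x = \frac{n \mu_x + \tilde{k}\tilde{x}}{n + \tilde{k}},$$

$$\langle \tilde{\sigma}_{xy} \rangle = \frac{n \sigma_{xy}}{n + \tilde{k}} + \frac{n \tilde{k}(\tilde{x} - \mu_x)(\hat{y}(\tilde{x}) - \mu_y)}{(n + \tilde{k})^2},$$

$$\tilde{\sigma}_x^2 = \frac{n \sigma_x^2}{n + \tilde{k}} + \frac{n \tilde{k}(\tilde{x} - \mu_x)^2}{(n + \tilde{k})^2},$$

$$\langle \tilde{\sigma}_{xy}^2 \rangle = \langle \tilde{\sigma}_{xy} \rangle^2 + \frac{n^2 \tilde{k}^2 \sigma_{y|\tilde{x}}^2 (\tilde{x} - \mu_x)^2}{(n + \tilde{k})^4}.$$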
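The closed-form update above can be turned into a candidate-scoring routine. The sketch below is one possible reading, not a definitive implementation: it assumes a Gaussian kernel with a hypothetical bandwidth, takes n to be the total kernel weight around the reference point, and obtains the candidate's estimated mean and conditional variance from the LOESS statistics computed around the candidate itself. All function and variable names are illustrative.

```python
import math


def kernel(a, b, h=0.5):
    """Gaussian weighting K(a, b); kernel form and bandwidth are assumptions."""
    return math.exp(-((a - b) ** 2) / (2 * h ** 2))


def stats(xs, ys, x):
    """Kernel weights and weighted statistics around x."""
    ks = [kernel(xi, x) for xi in xs]
    n = sum(ks)  # assumption: n is the total kernel weight
    mu_x = sum(k * a for k, a in zip(ks, xs)) / n
    mu_y = sum(k * b for k, b in zip(ks, ys)) / n
    var_x = sum(k * (a - mu_x) ** 2 for k, a in zip(ks, xs)) / n
    var_y = sum(k * (b - mu_y) ** 2 for k, b in zip(ks, ys)) / n
    cov = sum(k * (a - mu_x) * (b - mu_y) for k, a, b in zip(ks, xs, ys)) / n
    return ks, n, mu_x, mu_y, var_x, var_y, cov


def variance_term(cond_var, total, ks, k_new, xs, x_new, x, mu_x, var_x):
    """Shared form of the prediction variance; k_new = 0 gives the current value."""
    spread = sum(k ** 2 * (a - mu_x) ** 2 for k, a in zip(ks, xs)) / var_x
    spread += k_new ** 2 * (x_new - mu_x) ** 2 / var_x
    sum_sq = sum(k ** 2 for k in ks) + k_new ** 2
    return cond_var / total ** 2 * (sum_sq + (x - mu_x) ** 2 / var_x * spread)


def current_variance(xs, ys, x):
    """Variance of the prediction at x before any new label is added."""
    ks, n, mu_x, mu_y, var_x, var_y, cov = stats(xs, ys, x)
    cond = var_y - cov ** 2 / var_x
    return variance_term(cond, n, ks, 0.0, xs, x, x, mu_x, var_x)


def expected_variance(xs, ys, x, x_new):
    """Expected variance of the prediction at x if x_new were labeled."""
    ks, n, mu_x, mu_y, var_x, var_y, cov = stats(xs, ys, x)
    # The model's own estimate at the candidate: mean and conditional variance
    _, _, mx, my, vx, vy, c = stats(xs, ys, x_new)
    y_new = my + c / vx * (x_new - mx)
    var_new = vy - c ** 2 / vx

    k_new = kernel(x_new, x)
    m = n + k_new
    dx, dy = x_new - mu_x, y_new - mu_y
    mu_x_t = (n * mu_x + k_new * x_new) / m
    var_x_t = n * var_x / m + n * k_new * dx ** 2 / m ** 2
    e_var_y = n * var_y / m + n * k_new * (var_new + dy ** 2) / m ** 2
    e_cov = n * cov / m + n * k_new * dx * dy / m ** 2
    e_cov_sq = e_cov ** 2 + n ** 2 * k_new ** 2 * var_new * dx ** 2 / m ** 4
    e_cond = e_var_y - e_cov_sq / var_x_t
    return variance_term(e_cond, m, ks, k_new, xs, x_new, x, mu_x_t, var_x_t)


# Hypothetical labeled data, reference sample, and candidate queries
xs = [float(i) for i in range(31)]
ys = [math.sin(a) for a in xs]
reference = [0.5, 1.5, 2.5]    # stands in for a sample from the test distribution
candidates = [1.0, 2.2, 10.0]


def score(candidate):
    """Average expected variance over the reference sample."""
    return sum(expected_variance(xs, ys, r, candidate) for r in reference) / len(reference)


best = min(candidates, key=score)
print("candidate with lowest expected variance:", best)
```

A useful check on the algebra is that a candidate so far from the reference point that its kernel weight is zero leaves the variance unchanged, since every update term carries a factor of the new weight.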

Greedy Versus Batch Active Learning

It is also worth pointing out that virtually all active learning work relies on greedy strategies: the learner estimates what single example best achieves its objective, requests that one, retrains, and repeats. In theory, it is possible to plan some number of queries ahead, asking what point is best to label now, given that N − 1 more labeling opportunities remain. While such strategies have been explored in Operations Research for very small problem domains, their computational requirements make this approach infeasible for problems of the size typically encountered in machine learning.

There are cases where retraining the learner after every new label would be prohibitively expensive, or where access to labels is limited by the number of iterations as well as by the total number of labels (e.g., for a finite number of clinical trials). In this case, the learner may select a set of examples to be labeled on each iteration. This batch approach, however, is only useful if the learner is able to identify a set of examples whose expected contributions are non-redundant, which substantially complicates the process.

Cross References

▶Active Learning Theory

Recommended Reading

Angluin, D. (). Learning regular sets from queries and counterexamples. Information and Computation, (), –.
Angluin, D. (). Queries and concept learning. Machine Learning, , –.
Box, G. E. P., & Draper, N. (). Empirical model-building and response surfaces. New York: Wiley.
Cleveland, W., Devlin, S., & Gross, E. (). Regression by local fitting. Journal of Econometrics, , –.
Cohn, D., Atlas, L., & Ladner, R. (). Training connectionist networks with queries and selective sampling. In D. Touretzky (Ed.), Advances in neural information processing systems. Morgan Kaufmann.
Cohn, D., Ghahramani, Z., & Jordan, M. I. (). Active learning with statistical models. Journal of Artificial Intelligence Research, , –. http://citeseer.ist.psu.edu/ .html
Dasgupta, S. (). Learning mixtures of Gaussians. Foundations of Computer Science, –.
Fedorov, V. (). Theory of optimal experiments. New York: Academic Press.
Kearns, M., Li, M., Pitt, L., & Valiant, L. (). On the learnability of Boolean formulae. Proceedings of the th annual ACM conference on theory of computing (pp. –). New York: ACM Press.
Lewis, D. D., & Gale, W. A. (). A sequential algorithm for training text classifiers. Proceedings of the th annual international ACM SIGIR conference (pp. –). Dublin.
McCallum, A., & Nigam, K. (). Employing EM and pool-based active learning for text classification. In Machine learning: Proceedings of the fifteenth international conference (ICML’) (pp. –).
North, D. W. (). A tutorial introduction to decision theory. IEEE Transactions Systems Science and Cybernetics, ().
Pitt, L., & Valiant, L. G. (). Computational limitations on learning from examples. Journal of the ACM (JACM), (), –.
Robbins, H. (). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, , –.
Ruff, R., & Dietterich, T. (). What good are experiments? Proceedings of the sixth international workshop on machine learning. Ithaca, NY.
Seung, H. S., Opper, M., & Sompolinsky, H. (). Query by committee. In Proceedings of the fifth workshop on computational learning theory (pp. –). San Mateo, CA: Morgan Kaufmann.
Steck, H., & Jaakkola, T. (). Unsupervised active learning in large domains. In Proceedings of the conference on uncertainty in AI. http://citeseer.ist.psu.edu/steckunsupervised.html

Active Learning Theory

Sanjoy Dasgupta
University of California, San Diego, La Jolla, CA, USA

Definition

The term active learning applies to a wide range of situations in which a learner is able to exert some control over its source of data. For instance, when fitting a regression function, the learner may itself supply a set of data points at which to measure response values, in the hope of reducing the variance of its estimate. Such problems have been studied for many decades under the rubric of experimental design (Chernoff, ; Fedorov, ). More recently, there has been substantial interest within the machine learning community in the specific task of actively learning binary classifiers. This task presents several fundamental statistical and algorithmic challenges, and an understanding of its mathematical underpinnings is only gradually emerging. This brief survey will describe some of the progress that has been made so far.

Learning from Labeled and Unlabeled Data

In the machine learning literature, the task of learning a classifier has traditionally been studied in the framework of supervised learning. This paradigm assumes that there is a training set consisting of data points x (from some set X) and their labels y (from some set Y), and the goal is to learn a function f : X → Y that will accurately predict the labels of data points arising in the future. Over the years, tremendous progress has been made in resolving many of the basic questions surrounding this model, such as “how many training points are needed to learn an accurate classifier?”

Although this framework is now fairly well understood, it is a poor fit for many modern learning tasks because of its assumption that all training points automatically come labeled. In practice, it is frequently the case that the raw, abundant, easily obtained form of data is unlabeled, whereas labels must be explicitly procured and are expensive. In such situations, the reality is that the learner starts with a large pool of unlabeled points and must then strategically decide which ones it wants labeled: how best to spend its limited budget.

Example: Speech recognition. When building a speech recognizer, the unlabeled training data consists of raw speech samples, which are very easy to collect: just walk around with a microphone. For all practical purposes, an unlimited quantity of such samples can be obtained. On the other hand, the “label” for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.

Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis h perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis h∗. There is a significant body of beautiful theoretical work in this model (Angluin, ), but early experiments ran into some telling difficulties. One study (Baum & Lang, ) found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings (Hanneke, a).

We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points itself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, far fewer than would be required by regular supervised learning.

How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative. Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning?

On account of the enormous practical importance of active learning, there is a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). This survey will only discuss methods that have proofs of statistical well-foundedness, and whose label complexity can be explicitly analyzed.

Motivating Examples

We will start by looking at a few examples that illustrate the enormous potential of active learning and that also make it clear why analyses of this new model require concepts and intuitions that are fundamentally different from those that have already been developed for supervised learning.

Example: Thresholds on the Line

Suppose the data lie on the real line, and the available classifiers are simple thresholding functions, H = {h_w : w ∈ R}:

$$h_w(x) = \begin{cases} + & \text{if } x \ge w \\ - & \text{if } x < w \end{cases}$$

To make things precise, let us denote the (unknown) underlying distribution on the data (X, Y) ∈ R × {+, −} by P, and let us suppose that we want a hypothesis h ∈ H whose error with respect to P, namely err_P(h) = P(h(X) ≠ Y), is at most some ε. How many labels do we need?

In supervised learning, such issues are well understood. The standard machinery of sample complexity (using VC theory) tells us that if the data are separable (that is, if they can be perfectly classified by some hypothesis in H), then we need approximately 1/ε random labeled examples from P, and it is enough to return any classifier consistent with them.

Now suppose we instead draw 1/ε unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of −s followed by a sequence of +s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log 1/ε labels: first ask for the label of the median point; if it is +, move to the 25th percentile point, otherwise move to the 75th percentile point; and so on. Thus, for this hypothesis class, active learning gives an exponential improvement in the number of labels needed, from 1/ε to just log 1/ε. For instance, if supervised learning requires a million labels, active learning requires just log 1,000,000 ≈ 20 (taking logarithms base 2), literally!

It is a tantalizing possibility that even for more complicated hypothesis classes H, a sort of generalized binary search is possible. A natural next step is to consider linear separators in two dimensions.

Example: Linear Separators in R²

Let H be the hypothesis class of linear separators in R², and suppose the data is distributed according to some density supported on the perimeter of the unit circle. It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which Ω(1/ε) labels are needed to find a classifier with error rate less than ε, no matter what active learning scheme is used.

To see this, consider the following possible target hypotheses (Fig. 1):

● h_0: all points are positive.
● h_i (1 ≤ i ≤ 1/ε): all points are positive except for a small slice B_i of probability mass ε.

The slices B_i are explicitly chosen to be disjoint, with the result that Ω(1/ε) labels are needed to distinguish between these hypotheses. For instance, suppose nature chooses a target hypothesis at random from among the h_i, 1 ≤ i ≤ 1/ε. Then, to identify this target with probability at least 1/2, it is necessary to query points in at least (about) half the B_i's.
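The counting argument behind this lower bound can be checked with a few lines of arithmetic. The sketch below covers only the simplest case, a learner that probes one point per slice in a fixed order, and enumerates every possible target rather than sampling; the value of ε, the probing order, and the function name are all hypothetical choices made for illustration.

```python
def queries_to_find_slice(target, order):
    """Labels spent before the single negative slice is hit, probing one point
    per slice in the given order (every earlier probe returns '+')."""
    return order.index(target) + 1


epsilon = 0.01
m = round(1 / epsilon)         # number of disjoint slices B_1 .. B_m
order = list(range(1, m + 1))  # a fixed probing order; an assumption of this sketch

# Cost for each possible target h_i, with the target chosen uniformly at random
costs = [queries_to_find_slice(t, order) for t in order]
average = sum(costs) / m
print(average)  # 50.5

# Fraction of targets identified within the first m // 2 probes
found_by_half = sum(c <= m // 2 for c in costs) / m
print(found_by_half)  # 0.5
```

Because a positive answer from one slice says nothing about the others, probing q slices identifies the target with probability only q/m, so the label count grows linearly in 1/ε rather than logarithmically.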


The Sample Complexity of Active Learning

Active Learning Theory. Figure . P is supported on the circumference of a circle. Each Bi is an arc of probability mass є

Thus for these particular target hypotheses, active learning offers little improvement in sample complexity over regular supervised learning. What about other target hypotheses in H, for instance those in which the positive and negative regions are more evenly balanced? It is quite easy (Dasgupta, ) to devise an active learning scheme which asks for O(min{1/i(h), 1/є}) + O(log 1/є) labels, where i(h) = min{positive mass of h, negative mass of h}. Thus even within this simple hypothesis class, the label complexity can run anywhere from O(log 1/є) to Ω(1/є), depending on the specific target hypothesis!

Example: An Overabundance of Unlabeled Data

In our two previous examples, the amount of unlabeled data needed was O(1/є), exactly the usual sample complexity of supervised learning. But it is sometimes helpful to have significantly more unlabeled data than this. In Dasgupta (), a distribution P is described for which if the amount of unlabeled data is small (below any prespecified threshold), then the number of labels needed to learn the target linear separator is Ω(1/є); whereas if the amount of unlabeled data is much larger, then only O(log 1/є) labels are needed. This is a situation where most of the data distribution is fairly uninformative while a minuscule fraction is highly informative. A lot of unlabeled data is needed in order to get even a few of the informative points.

We will think of the unlabeled points x1, . . . , xn as being drawn i.i.d. from an underlying distribution PX on X (namely, the marginal of the distribution P on X × Y), either all at once (a pool) or one at a time (a stream). The learner is only allowed to query the labels of points in the pool/stream; that is, it is restricted to “naturally occurring” data points rather than synthetic ones (Fig. ). It returns a hypothesis h ∈ H whose quality is measured by its error rate, errP(h).

In regular supervised learning, it is well known that if the VC dimension of H is d, then the number of labels that will with high probability ensure errP(h) ≤ є is roughly O(d/є) if the data is separable and O(d/є²) otherwise (Haussler, ); various logarithmic terms are omitted here. For active learning, it is clear from the examples above that the VC dimension alone does not adequately characterize label complexity. Is there a different combinatorial parameter that does?

Generic Results for Separable Data

For separable data, it is possible to give upper and lower bounds on label complexity in terms of a special parameter known as the splitting index (Dasgupta, ). This is merely an existence result: the algorithm needed to realize the upper bound is intractable because it involves explicitly maintaining an є-cover (a coarse approximation) of the hypothesis class, and the size of this cover is in general exponential in the VC dimension. Nevertheless, it does give us an idea of the kinds of label complexity we can hope to achieve.

Example. Suppose the hypothesis class consists of intervals on the real line: X = R and H = {ha,b : a, b ∈ R}, where ha,b(x) = 1(a ≤ x ≤ b). Using the splitting index, the label complexity of active learning is seen to be Θ̃(min{1/PX([a, b]), 1/є} + log 1/є) when the target hypothesis is ha,b (Dasgupta, ). Here the Θ̃ notation is used to suppress logarithmic terms.

Example. Suppose X = Rd and H consists of linear separators through the origin. If PX is the uniform distribution on the unit sphere, the number of labels needed to learn a hypothesis of error ≤ є is just Θ̃(d log 1/є), exponentially smaller than the Õ(d/є) label complexity of supervised learning. If PX is not the uniform distribution but is everywhere within a multiplicative factor λ > 1 of it, then the label complexity becomes Õ((d log 1/є) log λ), provided the amount of unlabeled data is increased by a factor of λ (Dasgupta, ).

These results are very encouraging, but the question of an efficient active learning algorithm remains open. We now consider two approaches.

Pool-based active learning
    Get a set of unlabeled points U ⊂ X
    Repeat until satisfied:
        Pick some x ∈ U to label
    Return a hypothesis h ∈ H

Stream-based active learning
    Repeat for each round t:
        Choose a hypothesis ht ∈ H
        Receive an unlabeled point x ∈ X
        Decide whether to query its label

Active Learning Theory. Figure . Models of pool- and stream-based active learning. The data are draws from an underlying distribution PX, and hypotheses h are evaluated by errP(h). If we want to get this error below є, how many labels are needed, as a function of є?

Mildly Selective Sampling

The label complexity results mentioned above are based on querying maximally informative points. A less aggressive strategy is to be mildly selective, to query all points except those that are quite clearly uninformative. This is the idea behind one of the earliest generic active learning schemes (Cohn, Atlas, & Ladner, ). Data points x1, x2, . . . arrive in a stream, and for each one the learner makes a spot decision about whether or not to request a label. When xt arrives, the learner behaves as follows.

● Determine whether both possible labelings, (xt, +) and (xt, −), are consistent with the labeled examples seen so far.
● If so, ask for the label yt. Otherwise set yt to be the unique consistent label.

Fortunately, the check required for the first step can be performed efficiently by making two calls to a supervised learner. Thus this is a very simple and elegant active learning scheme, although as one might expect, it is suboptimal in its label complexity (Balcan et al., ). Interestingly, there is a parameter called the disagreement coefficient that characterizes the label complexity of this scheme and also of some other mildly selective learners (Friedman, ; Hanneke, b).

In practice, the biggest limitation of the algorithm above is that it assumes the data are separable. Recent results have shown how to remove this assumption (Balcan, Beygelzimer, & Langford, ; Dasgupta et al., ) and to accommodate classification loss functions other than 0−1 loss (Beygelzimer et al., ). Variants of the disagreement coefficient continue to characterize label complexity in the agnostic setting (Beygelzimer et al., ; Dasgupta et al., ).

A Bayesian Model

The query by committee algorithm (Seung, Opper, & Sompolinsky, ) is based on a Bayesian view of active learning. The learner starts with a prior distribution on the hypothesis space, and is then exposed to a stream of unlabeled data. Upon receiving xt, the learner performs the following steps.

● Draw two hypotheses h, h′ at random from the posterior over H.
● If h(xt) ≠ h′(xt) then ask for the label of xt and update the posterior accordingly.

This algorithm queries points that substantially shrink the posterior, while at the same time taking account of the data distribution. Various theoretical guarantees have been shown for it (Freund, Seung, Shamir, & Tishby, ); in particular, in the case of linear separators with a uniform data distribution, it achieves a label complexity of O(d log 1/є), the best possible. Sampling from the posterior over the hypothesis class is, in general, computationally prohibitive. However, for linear separators with a uniform prior, it can be implemented efficiently using random walks on convex bodies (Gilad-Bachrach, Navot, & Tishby, ).
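Both stream-based schemes above can be illustrated on the one-dimensional threshold class, where everything is available in closed form. The sketch below is only an illustration under stated assumptions: a threshold w labels x as + exactly when x ≥ w, the data are separable, the set of consistent thresholds is tracked directly as an interval (instead of through two calls to a supervised learner), and the query-by-committee prior is taken to be uniform on [0, 1]. All function names are hypothetical.

```python
import random


def label_of(w, x):
    """Label assigned to x by the threshold hypothesis w."""
    return '+' if x >= w else '-'


def selective_sampling(stream, oracle):
    """Mildly selective sampling: query x only if both labelings are consistent."""
    lo, hi = float('-inf'), float('inf')  # consistent thresholds: lo < w <= hi
    queries, labeled = 0, []
    for x in stream:
        if lo < x < hi:                   # both (x, +) and (x, -) are consistent
            y = oracle(x)
            queries += 1
        else:                             # only one labeling is consistent
            y = '+' if x >= hi else '-'
        if y == '+':
            hi = min(hi, x)
        else:
            lo = max(lo, x)
        labeled.append((x, y))
    return (lo, hi), queries, labeled


def query_by_committee(stream, oracle, rng, prior=(0.0, 1.0)):
    """Query by committee with a uniform prior over thresholds (an assumption)."""
    lo, hi = prior                        # posterior is uniform on [lo, hi]
    queries = 0
    for x in stream:
        w1, w2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if label_of(w1, x) != label_of(w2, x):   # the two draws disagree on x
            queries += 1
            if oracle(x) == '+':
                hi = min(hi, x)           # target threshold is at most x
            else:
                lo = max(lo, x)           # target threshold is above x
    return (lo, hi), queries


# Usage with an assumed hidden threshold of 0.42 and 2000 streamed points.
rng = random.Random(0)
hidden_w = 0.42
stream = [rng.random() for _ in range(2000)]
oracle = lambda x: label_of(hidden_w, x)
cal_space, cal_queries, cal_labeled = selective_sampling(stream, oracle)
qbc_space, qbc_queries = query_by_committee(stream, oracle, rng)
print(cal_queries, qbc_queries)
```

In this toy setting the first scheme labels every streamed point correctly, querying the oracle only while the point falls in the region where consistent hypotheses still disagree.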


Other Work

In this survey, I have touched mostly on active learning results of the greatest generality, those that apply to arbitrary hypothesis classes. There is also a significant body of more specialized results.

● Efficient active learning algorithms for specific hypothesis classes. This includes an online learning algorithm for linear separators that only queries some of the points and yet achieves similar regret bounds to algorithms that query all the points (Cesa-Bianchi, Gentile, & Zaniboni, ). The label complexity of this method is yet to be characterized.
● Algorithms and label bounds for linear separators under the uniform data distribution. This particular setting has been amenable to mathematical analysis. For separable data, it turns out that a variant of the perceptron algorithm achieves the optimal O(d log 1/є) label complexity (Dasgupta, Kalai, & Monteleoni, ). A simple algorithm is also available for the agnostic setting (Balcan et al., ).

Conclusion

The theoretical frontier of active learning is mostly an unexplored wilderness. Except for a few specific cases, we do not have a clear sense of how much active learning can reduce label complexity: whether by just a constant factor, or polynomially, or exponentially. The fundamental statistical and algorithmic challenges involved, together with the huge practical importance of the field, make active learning a particularly rewarding terrain for investigation.

Cross References

Active Learning

Recommended Reading

Angluin, D. (). Queries revisited. In Proceedings of the th international conference on algorithmic learning theory (pp. –).
Balcan, M.-F., Beygelzimer, A., & Langford, J. (). Agnostic active learning. In International Conference on Machine Learning (pp. –). New York: ACM Press.
Balcan, M.-F., Broder, A., & Zhang, T. (). Margin based active learning. In Conference on Learning Theory (pp. –).
Baum, E. B., & Lang, K. (). Query learning can work poorly when a human oracle is used. In International Joint Conference on Neural Networks.
Beygelzimer, A., Dasgupta, S., & Langford, J. (). Importance weighted active learning. In International Conference on Machine Learning (pp. –). New York: ACM Press.
Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (). Worst-case analysis of selective sampling for linear-threshold algorithms. Advances in Neural Information Processing Systems.
Chernoff, H. (). Sequential analysis and optimal design. In CBMS-NSF Regional Conference Series in Applied Mathematics . SIAM.
Cohn, D., Atlas, L., & Ladner, R. (). Improving generalization with active learning. Machine Learning, (), –.
Dasgupta, S. (). Coarse sample complexity bounds for active learning. Advances in Neural Information Processing Systems.
Dasgupta, S., Kalai, A., & Monteleoni, C. (). Analysis of perceptron-based active learning. In th Annual Conference on Learning Theory (pp. –).
Dasgupta, S., Hsu, D. J., & Monteleoni, C. (). A general agnostic active learning algorithm. Advances in Neural Information Processing Systems.
Fedorov, V. V. (). Theory of optimal experiments. (W. J. Studden & E. M. Klimko, Trans.). New York: Academic Press.
Freund, Y., Seung, S., Shamir, E., & Tishby, N. (). Selective sampling using the query by committee algorithm. Machine Learning Journal, , –.
Friedman, E. (). Active learning for smooth problems. In Conference on Learning Theory (pp. –).
Gilad-Bachrach, R., Navot, A., & Tishby, N. (). Query by committee made real. Advances in Neural Information Processing Systems.
Hanneke, S. (a). Teaching dimension and the complexity of active learning. In Conference on Learning Theory (pp. –).
Hanneke, S. (b). A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning (pp. –).
Haussler, D. (). Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, (), –.
Seung, H. S., Opper, M., & Sompolinsky, H. (). Query by committee. In Conference on Computational Learning Theory (pp. –).

Adaboost

Adaboost is an ensemble learning technique, and the best known of the Boosting family of algorithms. The algorithm trains models sequentially, with a new model trained at each round. At the end of each round, misclassified examples are identified and have their emphasis increased in a new training set, which is then fed back into the start of the next round, and a new model is trained. The idea is that subsequent models should be able to compensate for errors made by earlier models. See ensemble learning for full details.
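The round-by-round re-emphasis of misclassified examples can be sketched as follows. The entry does not spell out the update rule, so the code assumes the standard discrete AdaBoost weighting (model weight ½ ln((1 − err)/err), example weights multiplied by exp(−α·y·h(x)) and renormalized) and, purely for illustration, one-dimensional threshold "stumps" as the base models. The function names are hypothetical.

```python
import math


def stump_predict(threshold, sign, x):
    """A one-dimensional decision stump: `sign` if x >= threshold, else -sign."""
    return sign if x >= threshold else -sign


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best stump on weighted data."""
    best = None
    for t in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(t, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps sequentially, re-weighting examples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform emphasis
    ensemble = []
    for _ in range(rounds):
        err, t, sign = train_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this round's model
        ensemble.append((alpha, t, sign))
        # Increase the emphasis on misclassified examples, decrease it on the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(t, sign, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(t, sign, x) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1


# Usage: a labeling that no single stump can fit.
xs = [0, 1, 2, 3, 4, 5]
ys = [+1, +1, -1, -1, +1, +1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single stump separates this toy data set, but the later stumps are trained on weights that concentrate on the points the earlier ones got wrong, so the weighted vote can fit it.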

Adaptive Control Processes

Bayesian Reinforcement Learning

Adaptive Real-Time Dynamic Programming

Andrew G. Barto
University of Massachusetts, Amherst, MA, USA

Synonyms

ARTDP

Definition

Adaptive Real-Time Dynamic Programming (ARTDP) is an algorithm that allows an agent to improve its behavior while interacting over time with an incompletely known dynamic environment. It can also be viewed as a heuristic search algorithm for finding shortest paths in incompletely known stochastic domains. ARTDP is based on Dynamic Programming (DP), but unlike conventional DP, which consists of off-line algorithms, ARTDP is an on-line algorithm because it uses agent behavior to guide its computation. ARTDP is adaptive because it does not need a complete and accurate model of the environment but learns a model from data collected during agent-environment interaction. When a good model is available, Real-Time Dynamic Programming (RTDP) is applicable, which is ARTDP without the model-learning component.

Motivation and Background

RTDP combines strengths of heuristic search and DP. Like heuristic search, and unlike conventional DP, it does not have to evaluate the entire state space in order to produce an optimal solution. Like DP, and unlike most heuristic search algorithms, it is applicable to nondeterministic problems. Additionally, RTDP’s performance as an anytime algorithm is better than conventional DP and heuristic search algorithms. ARTDP extends these strengths to problems for which a good model is not initially available.

In artificial intelligence, control engineering, and operations research, many problems require finding a policy (or control rule) that determines how an agent (or controller) should generate actions in response to the states of its environment (the controlled system). When a “cost” or a “reward” is associated with each step of the agent’s behavior, policies can be compared according to how much cost or reward they are expected to accumulate over time. The usual formulation for problems like this in the discrete-time case is the Markov Decision Process (MDP). The objective is to find a policy that minimizes (maximizes) a measure of the total cost (reward) over time, assuming that the agent–environment interaction can begin in any of the possible states. In other cases, there is a designated set of “start states” that is much smaller than the entire state set (e.g., the initial board configuration in a board game). In these cases, any given policy only has to be defined for the set of states that can be reached from the starting states when the agent is using that policy. The rest of the states will never arise when that policy is being followed, so the policy does not need to specify what the agent should do in those states.

ARTDP and RTDP exploit situations where the set of states reachable from the start states is a small subset of the entire state space. They can dramatically reduce the amount of computation needed to determine an optimal policy for the relevant states as compared with the amount of computation that a conventional DP algorithm would require to determine an optimal policy for all the states. These algorithms do this by focusing computation around simulated behavioral experiences (if there is a model available capable of simulating these experiences), or around real behavioral experiences (if no model is available).

RTDP and ARTDP were introduced by Barto, Bradtke, and Singh (). The starting point was the novel observation by Bradtke that Korf’s Learning Real-Time A* heuristic search algorithm (Korf, ) is closely related to DP. RTDP generalizes Learning Real-Time A* to stochastic problems. ARTDP is also closely related to Sutton’s Dyna system (Sutton, ) and Jalali and Ferguson’s () Transient DP. Theoretical analysis relies on the theory of Asynchronous DP as described by Bertsekas and Tsitsiklis ().

ARTDP and RTDP are model-based reinforcement learning algorithms, so called because they take advantage of an environment model, unlike model-free reinforcement learning algorithms such as Q-Learning and Sarsa.

Structure of Learning System

Backup Operations

A basic step of many DP and RL algorithms is a backup operation. This is an operation that updates a current estimate of the cost of an MDP’s state. (We use the cost formulation instead of reward to be consistent with the original presentation of the algorithms. In the case of rewards, this would be called the value of a state and we would maximize instead of minimize.) Suppose $X$ is the set of MDP states. For each state $x \in X$, $f(x)$, the cost of state $x$, gives a measure (which varies with different MDP formulations) of the total cost the agent is expected to incur over the future if it starts in $x$. If $f_k(x)$ and $f_{k+1}(x)$, respectively, denote the estimate of $f(x)$ before and after a backup, a typical backup operation applied to $x$ looks like this:

$$f_{k+1}(x) = \min_{a \in A} \Big[ c_x(a) + \sum_{y \in X} p_{xy}(a)\, f_k(y) \Big],$$

where $A$ is the set of possible agent actions, $c_x(a)$ is the immediate cost the agent incurs for performing action $a$ in state $x$, and $p_{xy}(a)$ is the probability that the environment makes a transition from state $x$ to state $y$ as a result of the agent’s action $a$. This backup operation is associated with the DP algorithm known as value iteration. It is also the backup operation used by RTDP and ARTDP.

Conventional DP algorithms consist of successive “sweeps” of the state set. Each sweep consists of applying a backup operation to each state. Sweeps continue until the algorithm converges to a solution. Asynchronous DP, which underlies RTDP and ARTDP, does not use systematic sweeps. States can be chosen in any way whatsoever, and as long as backups continue to be applied to all states (and some other conditions are satisfied), the algorithm will converge. RTDP is an instance of asynchronous DP in which the states chosen for backups are determined by the agent’s behavior.

The backup operation above is model-based because it uses known rewards and transition probabilities, and the values of all the states appear on the right-hand side of the equation. In contrast, a sample backup uses the value of just one sample successor state. RTDP and ARTDP are like RL algorithms in that they rely on real or simulated behavioral experience, but unlike many (but not all) RL algorithms, they use full backups like DP.

Off-Line Versus On-Line

A conventional DP algorithm typically executes off-line. When applied to finding an optimal policy for an MDP, this means that the DP algorithm executes to completion before its result (an optimal policy) is used to control the agent’s behavior. The sweeps of DP sequentially “visit” the states of the MDP, performing a backup operation on each state. But it is important not to confuse these visits with the behaving agent’s visits to states: the agent is not yet behaving while the off-line DP computation is being done. Hence, the agent’s behavior has no influence on the DP computation. The same is true for off-line asynchronous DP.

RTDP is an on-line, or “real-time,” algorithm. It is an asynchronous DP computation that executes concurrently with the agent’s behavior so that the agent’s behavior can influence the DP computation. Further, the concurrently executing DP computation can influence the agent’s behavior. The agent’s visits to states direct the “visits” to states made by the concurrent asynchronous DP computation. At the same time, the action performed by the agent is the action specified by the policy corresponding to the latest results of the DP computation: it is the “greedy” action with respect to the current estimate of the cost function.

[Figure: the behaving agent specifies the states to back up for the asynchronous dynamic programming computation, which in turn specifies the agent’s actions.]

In the simplest version of RTDP, when a state is visited by the agent, the DP computation performs the model-based backup operation given above on that same state. In general, for each step of the agent’s behavior, RTDP can apply the backup operation to each of an arbitrary set of states, provided that the agent’s current state is included. For example, at each step of behavior, a limited-horizon look-ahead search can be conducted from the agent’s current state, with the backup operation applied to each of the states generated in the search. Essentially, RTDP is an asynchronous DP computation with the computational effort focused along simulated or actual behavioral trajectories.

Learning a Model

ARTDP is the same as RTDP except that (1) an environment model is updated using any on-line model-learning, or system identification, method, (2) the current environment model is used in performing the RTDP backup operations, and (3) the agent has to perform exploratory actions occasionally instead of always greedy actions as in RTDP. This last step is essential to ensure that the environment model eventually converges to the correct model. If the state and action sets are finite, the simplest way to learn a model is to keep counts of the number of times each transition occurs for each action and convert these frequencies to probabilities, thus forming the maximum-likelihood model.

Summary of Theoretical Results

When RTDP and ARTDP are applied to stochastic optimal path problems, one can prove that under certain conditions they converge to optimal policies without the need to apply backup operations to all the states. Indeed, in some problems, only a small fraction of the states need to be visited. A stochastic optimal path problem is an MDP with a nonempty set of start states and a nonempty set of goal states. Each transition until a goal state is reached has a nonnegative immediate cost, and once the agent reaches a goal state, it stays there and thereafter incurs zero cost. Each episode of agent experience begins with a start state. An optimal policy is one that minimizes the cost of every state, i.e., minimizes $f(x)$ for all states $x$. Under some relatively mild conditions, every optimal policy is guaranteed to eventually reach a goal state.

A state $x$ is relevant if a start state $s$ and an optimal policy exist such that $x$ can be reached from $s$ when the agent uses that policy. If we could somehow know which states are relevant, we could restrict DP to just these states and obtain an optimal policy. But this is not possible because knowing which states are relevant requires knowledge of optimal policies, which is what one is seeking. However, under certain conditions, without requiring repeated visits to all the irrelevant states, RTDP produces a policy that is optimal for all the relevant states. The conditions are that (1) the initial cost of every goal state is zero, (2) there exists at least one policy that guarantees that a goal state will be reached with probability one from any start state, (3) all immediate costs for transitions from non-goal states are strictly positive, and (4) none of the initial costs are larger than the actual costs. This result is proved in Barto et al. () by combining aspects of Korf’s () proof for LRTA* with results for asynchronous DP.
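To make the pieces described above concrete (the full backup, trajectory-driven choice of states, the count-based maximum-likelihood model, and occasional exploration), here is a small sketch of an ARTDP-style learner on a toy stochastic shortest-path problem. It is an illustration under assumptions, not the authors' implementation: the chain environment, the function names, the ε-style exploration rate `explore`, the step cap `max_steps`, and the choice to treat untried actions as having zero cost are all introduced here for the example.

```python
import random
from collections import defaultdict


def make_chain(n, p_move, rng):
    """Toy environment (assumed): states 0..n, action a in {-1, +1} succeeds
    with probability p_move, otherwise the agent stays put. Every step costs 1."""
    def step(x, a):
        y = min(max(x + a, 0), n) if rng.random() < p_move else x
        return y, 1.0
    return step


def artdp(step, states, actions, goal, start, episodes, rng,
          explore=0.1, max_steps=200):
    cost = {x: 0.0 for x in states}                 # f(x); zero never exceeds true cost
    counts = defaultdict(lambda: defaultdict(int))  # counts[(x, a)][y] transitions seen
    cost_sum = defaultdict(float)                   # summed immediate cost for (x, a)

    def q(x, a):
        """c_x(a) + sum_y p_xy(a) f(y) under the current maximum-likelihood model."""
        n = sum(counts[(x, a)].values())
        if n == 0:
            return 0.0                              # untried action: assumed optimistic
        expected_next = sum(k / n * cost[y] for y, k in counts[(x, a)].items())
        return cost_sum[(x, a)] / n + expected_next

    for _ in range(episodes):
        x = start
        for _ in range(max_steps):
            if x in goal:
                break
            cost[x] = min(q(x, a) for a in actions)  # full backup of the visited state
            if rng.random() < explore:
                a = rng.choice(actions)              # occasional exploratory action
            else:
                a = min(actions, key=lambda b: q(x, b))  # greedy action
            y, c = step(x, a)
            counts[(x, a)][y] += 1                   # update the learned model
            cost_sum[(x, a)] += c
            x = y
    return cost


# Usage: 5-state chain, goal at state 4, moves succeed with probability 0.8.
rng = random.Random(1)
step = make_chain(4, 0.8, rng)
f = artdp(step, range(5), (-1, +1), {4}, 0, 500, rng)
print(f)
```

With a move probability of 0.8 the true expected cost from state 0 in this toy chain is 4/0.8 = 5, so the learned estimate for state 0 should settle near that value as the counts accumulate. Dropping the counting and passing the true probabilities to the backup would give the corresponding RTDP sketch.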

Special Cases and Extensions

A number of special cases and extensions of RTDP have been developed that improve performance over the basic version. Some examples are as follows. Bonet and Geffner’s () Labeled RTDP labels states that have already been “solved,” allowing faster convergence than RTDP. Feng, Hansen, and Zilberstein () proposed Symbolic RTDP, which selects a set of states to update at each step using symbolic model-checking techniques. The RTDP convergence theorem still applies because this is a special case of RTDP. Smith and Simmons () developed Focused RTDP, which maintains a priority value for each state to better direct search and produce faster convergence. Hansen and Zilberstein’s () LAO* uses some of the same ideas as RTDP to produce a heuristic search algorithm that can find solutions with loops to non-deterministic heuristic search problems. Many other variants are possible. Extending ARTDP instead of RTDP in all of these ways would produce analogous algorithms that could be used when a good model is not available.

Cross References

Anytime Algorithm
Approximate Dynamic Programming
Reinforcement Learning
System Identification

Recommended Reading

Barto, A., Bradtke, S., & Singh, S. (). Learning to act using real-time dynamic programming. Artificial Intelligence, (–), –.
Bertsekas, D., & Tsitsiklis, J. (). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice-Hall.
Bonet, B., & Geffner, H. (a). Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proceedings of the th international conference on automated planning and scheduling (ICAPS-). Trento, Italy.
Bonet, B., & Geffner, H. (b). Faster heuristic search algorithms for planning with uncertainty and full feedback. In Proceedings of the international joint conference on artificial intelligence (IJCAI-). Acapulco, Mexico.
Feng, Z., Hansen, E., & Zilberstein, S. (). Symbolic generalization for on-line planning. In Proceedings of the th conference on uncertainty in artificial intelligence. Acapulco, Mexico.
Hansen, E., & Zilberstein, S. (). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, , –.
Jalali, A., & Ferguson, M. (). Computationally efficient control algorithms for Markov chains. In Proceedings of the th conference on decision and control (pp. –), Tampa, FL.
Korf, R. (). Real-time heuristic search. Artificial Intelligence, (–), –.
Smith, T., & Simmons, R. (). Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proceedings of the national conference on artificial intelligence (AAAI). Boston, MA: AAAI Press.
Sutton, R. (). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the th international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.

Adaptive Resonance Theory

Gail A. Carpenter, Stephen Grossberg
Boston University, Boston, MA, USA

Synonyms

ART

Definition

Adaptive resonance theory, or ART, is both a cognitive and neural theory of how the brain quickly learns to categorize, recognize, and predict objects and events in a changing world, and a set of algorithms that computationally embody ART principles and that are used in large-scale engineering and technological applications wherein fast, stable, and incremental learning about a complex changing environment is needed. ART clarifies the brain processes from which conscious experiences emerge. It predicts a functional link between processes of consciousness, learning, expectation, attention, resonance, and synchrony (CLEARS), including the prediction that “all conscious states are resonant states.” This connection clarifies how brain dynamics enable a behaving individual to autonomously adapt in real time to a rapidly changing world.

ART predicts how top-down attention works and regulates fast stable learning of recognition categories. In particular, ART articulates a critical role for “resonant” states in driving fast stable learning; and thus the name adaptive resonance. These resonant states are bound together, using top-down attentive feedback in the form of learned expectations, into coherent representations of the world. ART hereby clarifies one important sense in which the brain carries out predictive computation. ART has explained and successfully predicted a wide range of behavioral and neurobiological data, including data about human cognition and the dynamics of spiking laminar cortical networks. ART algorithms have been used in large-scale applications such as medical database prediction, remote sensing, airplane design, and the control of autonomous adaptive robots.

Motivation and Background

Many current learning algorithms do not emulate the way in which humans and other animals learn. The power of human and animal learning provides high motivation to discover computational principles whereby machines can learn with similar capabilities. Humans and animals experience the world on the fly, and carry out incremental learning of sequences of episodes in real time. Often such learning is unsupervised, with the world itself as the teacher. Learning can also proceed with an unpredictable mixture of unsupervised and supervised learning trials. Such learning goes on successfully in a world that is nonstationary; that is, the rules of which can change unpredictably through time. Moreover, humans and animals can learn quickly and stably through time. A single important experience can be remembered for a long time. ART proposes a solution of this stability–plasticity dilemma (Grossberg, ) by showing how brains learn quickly without forcing catastrophic forgetting of already learned, and still successful, memories. Thus, ART autonomously carries out fast, yet stable, incremental learning under both unsupervised and supervised learning conditions in response to a complex nonstationary world.

In contrast, many current learning algorithms use batch learning in which all the information about the world to be learned is available at a single time. Other algorithms are not defined unless all learning trials are supervised. Yet other algorithms become unstable in a nonstationary world, or become unstable if learning is fast; that is, if an event can be fully learned on a single learning trial. ART overcomes these problems. Some machine learning algorithms are feed-forward clustering algorithms that undergo catastrophic forgetting in a nonstationary world. The ART solution of the stability–plasticity dilemma depends upon feedback, or top-down, expectations that are matched against bottom-up data and thereby focus attention upon critical feature patterns. A good enough match leads to resonance and fast learning. A big enough mismatch leads to hypothesis testing or memory search that discovers and learns a more predictive category. Thus, ART is a self-organizing expert system that avoids the brittleness of traditional expert systems.

The world is filled with uncertainty, so probability concepts seem relevant to understanding how brains learn about uncertain data. This fact has led some machine learning practitioners to assume that brains obey Bayesian laws. However, the Bayes rule is so general that it can accommodate any system in nature. Additional computational principles and mechanisms must augment Bayes to distinguish a brain from, say, a hydrogen atom or storm. Moreover, probabilistic models often use nonlocal computations. ART shows how the brain embodies a novel kind of real-time probability theory, hypothesis testing, prediction, and decision-making, the local computations of which adapt to a nonstationary world. These ART principles and mechanisms go beyond Bayesian analysis, and are embodied parsimoniously in the laminar circuits of cerebral cortex. Indeed, the cortex embodies a new kind of laminar computing that reconciles the best properties of feedforward and feedback processing, digital and analog processing, and data-driven bottom-up processing combined with hypothesis-driven top-down processing (Grossberg, ).

Structure of Learning System How CLEARS Mechanisms Interact

Humans are intentional beings who learn expectations about the world and make predictions about what is about to happen. Humans are also attentional beings who focus processing resources upon a restricted amount of incoming information at any time. Why are we both intentional and attentional beings, and are these two types of processes related? The stability– plasticity dilemma and its solution using resonant states provide a unifying framework for understanding these issues. To clarify the role of sensory or cognitive expectations, and of how a resonant state is activated, suppose you were asked to “find the yellow ball as quickly as possible, and you will win a $, prize.” Activating an expectation of a “yellow ball” enables its more rapid detection, and with a more energetic neural response. Sensory and cognitive top-down expectations hereby lead to excitatory matching with consistent bottom-up data. Mismatch between top-down expectations and bottom-up data can suppress the mismatched part of the bottom-up data, to focus attention upon the matched, or expected, part of the bottom-up data. Excitatory matching and attentional focusing on bottom-up data using top-down expectations generates resonant brain states: When there is a good enough match between bottom-up and top-down signal patterns between two or more levels of processing, their positive feedback signals amplify and prolong their mutual activation, leading to a resonant state. Amplification and prolongation of activity triggers learning in the more slowly varying adaptive weights that control the signal flow along pathways from cell to cell. Resonance hereby provides a global context-sensitive indicator that the system is processing data worthy of learning, hence the name adaptive resonance theory. 
In summary, ART predicts a link between the mechanisms that enable us to learn quickly and stably about a changing world, and the mechanisms that enable us to learn expectations about such a world, test hypotheses about it, and focus attention upon information that we find interesting. ART clarifies this link by asserting that, to solve the stability–plasticity dilemma, only resonant states can drive rapid new learning. It is just a step from here to propose that those experiences which can attract our attention and guide our future lives by being learned are also among the ones that are conscious. Support for this additional assertion derives from the many modeling studies whose simulations of behavioral and brain data using resonant states map onto properties of conscious experiences in those experiments.

The type of learning within the sensory and cognitive domain that ART mechanizes is match learning: Match learning occurs only if a good enough match occurs between bottom-up information and a learned top-down expectation that is read out by an active recognition category, or code. When such an approximate match occurs, previously learned knowledge can be refined. Match learning raises the question of what happens if a match is not good enough. How does such a model escape perseveration on already learned representations? If novel information cannot form a good enough match with the expectations that are read out by previously learned recognition categories, then a memory search, or hypothesis testing, is triggered, which leads to selection and learning of a new recognition category, rather than catastrophic forgetting of an old one. Figure illustrates how this happens in an ART model; it is discussed in great detail below.

In contrast, learning within spatial and motor processes is proposed to be mismatch learning that continuously updates sensory-motor maps or the gains of sensory-motor commands. As a result, we can stably learn what is happening in a changing world, thereby solving the stability–plasticity dilemma, while adaptively updating our representations of where objects are and how to act upon them using bodies whose parameters change continuously through time. Brain systems that use inhibitory matching and mismatch learning cannot generate resonances; hence, their representations are not conscious.

Complementary Computing in the Brain: Resonance and Reset

It has been mathematically proved that match learning within an ART model leads to stable memories in response to arbitrary lists of events to be learned (e.g., Carpenter & Grossberg, ). However, match learning also has a serious potential weakness: If you can only learn when there is a good match between bottom-up data and learned top-down expectations, then how do you ever learn anything that you do not already know? ART proposes that this problem is solved by the brain by using an interaction between complementary processes of resonance and reset, which are predicted to control properties of attention and memory search, respectively. These complementary processes help our brains to balance between the complementary demands of processing the familiar and the unfamiliar, the expected and the unexpected.

Organization of the brain into complementary processes is predicted to be a general principle of brain design that is not just found in ART (Grossberg, ). A complementary process can individually compute some properties well, but cannot, by itself, process other complementary properties. In thinking intuitively about complementary properties, one can imagine puzzle pieces fitting together. Both pieces are needed to finish the puzzle. Complementary brain processes are more dynamic than any such analogy: Pairs of complementary processes interact to form emergent properties which overcome their complementary deficiencies to compute complete information with which to represent or control some aspect of intelligent behavior.

The resonance process in the complementary pair of resonance and reset is predicted to take place in the What cortical stream, notably in the inferotemporal and prefrontal cortex. Here top-down expectations are matched against bottom-up inputs. When a top-down expectation achieves a good enough match with bottom-up data, this match process focuses attention upon those feature clusters in the bottom-up input that are expected. If the expectation is close enough to the input pattern, then a state of resonance develops as the attentional focus takes hold.
Figure illustrates these ART ideas in a simple two-level example. Here, a bottom-up input pattern, or vector, I activates a pattern X of activity across the feature detectors of the first level F1. For example, a visual scene may be represented by the features comprising its boundary and surface representations. This feature pattern represents the relative importance of different features in the input pattern I. In Fig. a, the pattern peaks represent more activated feature detector cells, and the troughs, less-activated feature detectors. This feature pattern sends signals S through an adaptive filter to the second level F2 at which a compressed representation Y (also called a recognition category, or a symbol) is activated in response to the distributed input T. Input T is computed by multiplying the signal vector S by a matrix of adaptive weights that can be altered through learning. The representation Y is compressed by competitive interactions across F2 that allow only a small subset of its most strongly activated cells to remain active in response to T. The pattern Y in the figure indicates that a small number of category cells may be activated to different degrees. These category cells, in turn, send top-down signals U to F1. The vector U is converted into the top-down expectation V by being multiplied by another matrix of adaptive weights. When V is received by F1, a matching process takes place between the input vector I and V which selects that subset X* of F1 features that were “expected” by the active F2 category Y. The set of these selected features is the emerging “attentional focus.”

Adaptive Resonance Theory. Figure . Search for a recognition code within an ART learning circuit: (a) The input pattern I is instated across the feature detectors at level F1 as a short term memory (STM) activity pattern X. Input I also nonspecifically activates the orienting system with a gain that is called vigilance (ρ); that is, all the input pathways converge with gain ρ onto the orienting system and try to activate it. STM pattern X is represented by the hatched pattern across F1. Pattern X both inhibits the orienting system and generates the output pattern S. Pattern S is multiplied by learned adaptive weights, also called long term memory (LTM) traces. These LTM-gated signals are added at F2 cells, or nodes, to form the input pattern T, which activates the STM pattern Y across the recognition categories coded at level F2. (b) Pattern Y generates the top-down output pattern U which is multiplied by top-down LTM traces and added at F1 nodes to form a prototype pattern V that encodes the learned expectation of the active F2 nodes. Such a prototype represents the set of commonly shared features in all the input patterns capable of activating Y. If V mismatches I at F1, then a new STM activity pattern X∗ is selected at F1. X∗ is represented by the hatched pattern. It consists of the features of I that are confirmed by V. Mismatched features are inhibited. The inactivated nodes corresponding to unconfirmed features of X are unhatched. The reduction in total STM activity which occurs when X is transformed into X∗ causes a decrease in the total inhibition from F1 to the orienting system. (c) If inhibition decreases sufficiently, the orienting system releases a nonspecific arousal wave to F2; that is, a wave of activation that equally activates all F2 nodes. This wave instantiates the intuition that “novel events are arousing.” This arousal wave resets the STM pattern Y at F2 by inhibiting Y. (d) After Y is inhibited, its top-down prototype signal is eliminated, and X can be reinstated at F1. The prior reset event maintains inhibition of Y during the search cycle. As a result, X can activate a different STM pattern Y at F2. If the top-down prototype due to this new Y pattern also mismatches I at F1, then the search for an appropriate F2 code continues until a more appropriate F2 representation is selected. Such a search cycle represents a type of nonstationary hypothesis testing. When search ends, an attentive resonance develops and learning of the attended data is initiated (adapted with permission from Carpenter and Grossberg ()). The distributed ART architecture supports fast stable learning with arbitrarily distributed F2 codes (Carpenter, ).
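The two-level computation described above (adaptive filter, competition at the second level, top-down readout, and feature matching) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the ART model equations: the weight matrices `W_bu` and `W_td`, the reduction of competition to a single winner, the use of a component-wise minimum for the match, and all function and variable names are hypothetical choices made here for clarity.

```python
import numpy as np

def art_two_level_pass(I, W_bu, W_td):
    """One bottom-up / top-down pass through a toy two-level ART-like circuit.

    I    : input pattern at the feature level, shape (n_features,)
    W_bu : bottom-up adaptive weights, shape (n_features, n_categories)
    W_td : top-down adaptive weights, shape (n_categories, n_features)
    """
    S = I                            # feature-level output signal (assumed equal to pattern X)
    T = S @ W_bu                     # weight-gated input to the category level
    Y = np.zeros_like(T)
    Y[np.argmax(T)] = 1.0            # competition, simplified here to winner-take-all
    V = Y @ W_td                     # top-down expectation (prototype) read out by Y
    X_star = np.minimum(I, V)        # features of I confirmed by V (assumed min-match rule)
    return T, Y, V, X_star

# Toy example: two categories over four binary features.
W_td = np.array([[1.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
W_bu = W_td.T.copy()
I = np.array([1.0, 1.0, 1.0, 0.0])
T, Y, V, X_star = art_two_level_pass(I, W_bu, W_td)
```

In this toy case the third feature of `I` is not confirmed by the prototype of the winning category, so it is dropped from `X_star`, which plays the role of the attentional focus.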

Binding Distributed Feature Patterns and Symbols During Conscious Resonances

If the top-down expectation is close enough to the bottom-up input pattern, then the pattern X∗ of attended features reactivates the category Y which, in turn, reactivates X∗. The network hereby locks into a resonant state through a positive feedback loop that dynamically links, or binds, the attended features across X∗ with their category, or symbol, Y. Resonance itself embodies another type of complementary processing. Indeed, there seem to be complementary processes both within and between cortical processing streams (Grossberg, ). This particular complementary relation occurs between distributed feature patterns and the compressed categories, or symbols, that selectively code them: Individual features at F1 have no meaning on their own, just as the pixels in a picture are meaningless one-by-one. The category, or symbol, in F2 is sensitive to the global patterning of these features, and can selectively fire in response to this pattern. But it cannot represent the “contents” of the experience, including their conscious qualia, due to the very fact that a category is a compressed or “symbolic” representation. Practitioners of artificial intelligence have claimed that neural models can process distributed features, but not symbolic representations. This is not, of course, true in the brain. Nor is it true in ART. Resonance between these two types of information converts the pattern of attended features into a coherent context-sensitive state that is linked to its category through feedback. This coherent state, which binds together distributed features and symbolic categories, can enter consciousness while it binds together spatially distributed features into either a stable equilibrium or a synchronous oscillation. The original ART article (Grossberg, ) predicted the existence of such synchronous oscillations, which were there described in terms of their mathematical properties as “order-preserving limit cycles.” See Carpenter, Grossberg, Markuzon, Reynolds & Rosen () and Grossberg & Versace () for reviews of confirmed ART predictions, including predictions about synchronous oscillations.

Resonance Links Intentional and Attentional Information Processing to Learning

In ART, the resonant state, rather than bottom-up activation, is predicted to drive learning. This state persists long enough, and at a high enough activity level, to activate the slower learning processes in the adaptive weights that guide the flow of signals between bottom-up and top-down pathways between levels F1 and F2 in Fig. . This viewpoint helps to explain how adaptive weights that were changed through previous learning can regulate the brain’s present information processing, without learning about the signals that they are currently processing unless they can initiate a resonant state. Through resonance as a mediating event, one can understand from a deeper mechanistic view why humans are intentional beings who are continually predicting what may next occur, and why we tend to learn about the events to which we pay attention. More recent versions of ART, notably the synchronous matching ART (SMART) model (Grossberg & Versace, ), show how a match may lead to fast gamma oscillations that facilitate spike-timing dependent plasticity (STDP), whereas mismatch can lead to slower beta oscillations that lower the probability that mismatched events can be learned by an STDP learning law.

Complementary Attentional and Orienting Systems Control Resonance Versus Reset

A sufficiently bad mismatch between an active top-down expectation and a bottom-up input, say because the input represents an unfamiliar type of experience, can drive a memory search. Such a mismatch within the attentional system is proposed to activate a complementary orienting system, which is sensitive to unexpected and unfamiliar events. ART suggests that this orienting system includes the nonspecific thalamus and the hippocampal system. See Grossberg & Versace () for a summary of data supporting this prediction. Output signals from the orienting system rapidly reset the recognition category that has been reading out the poorly matching top-down expectation (Figs. b and c). The cause of the mismatch is hereby removed, thereby freeing the system to activate a different recognition category (Fig. d). The reset event hereby triggers memory search, or hypothesis testing, which automatically leads to the selection of a recognition category that can better match the input. If no such recognition category exists, say because the bottom-up input represents a truly novel experience, then the search process automatically activates an as yet uncommitted population of cells, with which to learn about the novel information. In order for a top-down expectation to match a newly discovered recognition category, its top-down adaptive weights initially have large values, which are pruned by the learning of a particular expectation.

This learning process works well under both unsupervised and supervised conditions (Carpenter et al., ). Unsupervised learning means that the system can learn how to categorize novel input patterns without any external feedback. Supervised learning uses predictive errors to let the system know whether it has categorized the information correctly. Supervision can force a search for new categories that may be culturally determined, and are not based on feature similarity alone. For example, separating the letters E and F, which share similar features, into separate recognition categories is culturally determined. Such error-based feedback enables variants of E and F to learn their own category and top-down expectation, or prototype. The complementary, but interacting, processes of attentive-learning and orienting-search together realize a type of error correction through hypothesis testing that can build an ever-growing, self-refining internal model of a changing world.

Controlling the Content of Conscious Experiences: Exemplars and Prototypes

What combinations of features or other information are bound together into conscious object or event representations? One view is that exemplars or individual experiences are learned because humans can have very specific memories. For example, we can all recognize the particular faces of our friends. On the other hand, storing every remembered experience as exemplars can lead to a combinatorial explosion of memory, as well as to unmanageable problems of memory retrieval. A possible way out is suggested by the fact that humans can learn prototypes which represent general properties of the environment (Posner & Keele, ). For example, we can recognize that everyone has a face. But then how do we learn specific episodic memories? ART provides an answer that overcomes the problems faced by earlier models. ART prototypes are not merely averages of the exemplars that are classified by a category, as is typically assumed in classical prototype models. Rather, they are the actively selected critical feature patterns upon which the top-down expectations of the category focus attention. In addition, the generality of the information that is coded by these critical feature patterns is controlled by a gain control process, called vigilance control, which can be influenced by environmental feedback or internal volition (Carpenter & Grossberg, ). Low vigilance permits the learning of general categories with abstract prototypes. High vigilance forces a memory search to occur for a new category when even small mismatches exist between an exemplar and the category that it activates. As a result, in the limit of high vigilance, the category prototype may encode an individual exemplar.

Vigilance is computed within the orienting system of an ART model (Fig. b–d). It is here that bottom-up excitation from all the active features in an input pattern I is compared with inhibition from all the active features in a distributed feature representation across F1. If the ratio of the total activity across the active features in F1 (i.e., the “matched” features) to the total activity of all the features in I is less than a vigilance parameter ρ (Fig. b), then a reset wave is activated (Fig. c), which can drive the search for another category to classify the exemplar. In other words, the vigilance parameter controls how bad a match can be tolerated before search for a new category is initiated. If the vigilance parameter is low, then many exemplars can influence the learning of a shared prototype, by chipping away at the features that are not shared with all the exemplars. If the vigilance parameter is high, then even a small difference between a new exemplar and a known prototype (e.g., F vs. E) can drive the search for a new category with which to represent F.

One way to control vigilance is by a process of match tracking. Here, in response to a predictive error (e.g., D is predicted in response to F), the vigilance parameter increases until it is just higher than the ratio of active features in F1 to total features in I. In other words, vigilance “tracks” the degree of match between input exemplar and matched prototype. This is the minimal level of vigilance that can trigger a reset wave and thus a memory search for a new category. Match tracking realizes a minimax learning rule that conjointly maximizes category generality while it minimizes predictive error. In other words, match tracking uses the least memory resources that can prevent errors from being made.

Because vigilance can vary across learning trials, recognition categories capable of encoding widely differing degrees of generalization or abstraction can be learned by a single ART system. Low vigilance leads to broad generalization and abstract prototypes. High vigilance leads to narrow generalization and to prototypes that represent fewer input exemplars, even a single exemplar. Thus a single ART system may be used, say, to learn abstract prototypes with which to recognize abstract categories of faces and dogs, as well as “exemplar prototypes” with which to recognize individual views of faces and dogs. ART models hereby try to learn the most general category that is consistent with the data. This tendency can, for example, lead to the type of overgeneralization that is seen in young children until further learning leads to category refinement.
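The vigilance test and match tracking described above can be sketched as follows. This is a minimal illustration, assuming non-negative feature vectors and a component-wise minimum as the match operation (as in fuzzy ART); the function names, the toy vectors, and the small margin `epsilon` are hypothetical choices made here and are not taken from the source.

```python
import numpy as np

def match_ratio(exemplar, prototype):
    """Ratio of matched-feature activity to total input activity."""
    matched = np.minimum(exemplar, prototype)   # assumed min-match between input and prototype
    return matched.sum() / exemplar.sum()

def needs_reset(exemplar, prototype, rho):
    """A reset wave is triggered when the match ratio falls below vigilance rho."""
    return match_ratio(exemplar, prototype) < rho

def match_track(exemplar, prototype, epsilon=1e-3):
    """After a predictive error, raise vigilance just above the current match ratio.

    epsilon is a hypothetical small margin used only for this sketch.
    """
    return match_ratio(exemplar, prototype) + epsilon

# Toy vectors: the exemplar has one feature that the stored prototype lacks.
exemplar = np.array([1.0, 1.0, 1.0, 1.0])
prototype = np.array([1.0, 1.0, 1.0, 0.0])

ratio = match_ratio(exemplar, prototype)             # 3 matched features out of 4
low_vigilance_reset = needs_reset(exemplar, prototype, rho=0.5)
high_vigilance_reset = needs_reset(exemplar, prototype, rho=0.9)
tracked_rho = match_track(exemplar, prototype)       # just above the match ratio
tracked_reset = needs_reset(exemplar, prototype, tracked_rho)
```

With low vigilance the exemplar is accepted by the existing category; with high vigilance, or after match tracking has raised vigilance just above the match ratio, the same pair triggers a reset and hence a search for another category.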

Memory Consolidation and the Emergence of Rules: Direct Access to Globally Best Match

As sequences of inputs are practiced over learning trials, the search process eventually converges upon stable categories. It has been mathematically proved (Carpenter & Grossberg, ) that familiar inputs directly access the category whose prototype provides the best match globally, while unfamiliar inputs engage the orienting subsystem to trigger memory searches for better categories until they become familiar. This process continues until the memory capacity, which can be chosen arbitrarily large, is fully utilized. The process whereby search is automatically disengaged is a form of memory consolidation that emerges from network interactions. Emergent consolidation does not preclude structural consolidation at individual cells, since the amplified and prolonged activities that subserve a resonance may be a trigger for learning-dependent cellular processes, such as protein synthesis and transmitter production. It has also been shown that the adaptive weights which are learned by some ART models can, at any stage of learning, be translated into fuzzy IF-THEN rules (Carpenter et al., ). Thus the ART model is a self-organizing rule-discovering production system as well as a neural network. These examples show that the claims of some cognitive scientists and AI practitioners that neural network models cannot learn rule-based behaviors are as incorrect as the claims that neural models cannot learn symbols.

How the Laminar Circuits of Cerebral Cortex Embody ART Mechanisms

More recent versions of ART have shown how predicted ART mechanisms may be embodied within known laminar microcircuits of the cerebral cortex. These include the family of LAMINART models (Fig. ; see Raizada & Grossberg, ) and the synchronous matching ART, or SMART, model (Fig. ; see Grossberg & Versace, ). SMART, in particular, predicts how a top-down match may lead to fast gamma oscillations that facilitate spike-timing dependent plasticity (STDP), whereas a mismatch can lead to slower beta oscillations that prevent learning by an STDP learning law. At least three neurophysiological labs have recently reported data consistent with the SMART prediction.

Adaptive Resonance Theory. Figure . LAMINART circuit clarifies how known cortical connections within and across cortical layers join the layer → and layer / circuits to form a laminar circuit model for the interblobs and pale stripe regions of cortical areas V and V. Inhibitory interneurons are shown filled-in black. (a) The LGN provides bottom-up activation to layer via two routes. First, it makes a strong connection directly into layer . Second, LGN axons send collaterals into layer , and thereby also activate layer via the → on-center off-surround path. The combined effect of the bottom-up LGN pathways is to stimulate layer via an on-center off-surround, which provides divisive contrast normalization (Grossberg, ) of layer cell responses. (b) Folded feedback carries attentional signals from higher cortex into layer of V, via the modulatory → path. Corticocortical feedback axons tend preferentially to originate in layer of the higher area and to terminate in layer of the lower cortex, where they can excite the apical dendrites of layer pyramidal cells whose axons send collaterals into layer . The triangle in the figure represents such a layer pyramidal cell. Several other routes through which feedback can pass into V layer exist. Having arrived in layer , the feedback is then “folded” back up into the feedforward stream by passing through the → on-center off-surround path (Bullier, Hupé, James, & Girard, ). (c) Connecting the → on-center off-surround to the layer / grouping circuit: like-oriented layer simple cells with opposite contrast polarities compete (not shown) before generating half-wave rectified outputs that converge onto layer / complex cells in the column above them. Just like attentional signals from higher cortex, as shown in (b), groupings that form within layer / also send activation into the folded feedback path, to enhance their own positions in layer beneath them via the → on-center, and to suppress input to other groupings via the → off-surround. There exist direct layer / → connections in macaque V, as well as indirect routes via layer . (d) Top-down corticogeniculate feedback from V layer to LGN also has an on-center off-surround anatomy, similar to the → path. The on-center feedback selectively enhances LGN cells that are consistent with the activation that they cause (Sillito, Jones, Gerstein, & West, ), and the off-surround contributes to length-sensitive (endstopped) responses that facilitate grouping perpendicular to line ends. (e) The entire V/V circuit: V repeats the laminar pattern of V circuitry, but at a larger spatial scale. In particular, the horizontal layer / connections have a longer range in V, allowing above-threshold perceptual groupings between more widely spaced inducing stimuli to form. V layer / projects up to V layers and , just as LGN projects to layers and of V. Higher cortical areas send feedback into V which ultimately reaches layer , just as V feedback acts on layer of V. Feedback paths from higher cortical areas straight into V (not shown) can complement and enhance feedback from V into V. Top-down attention can also modulate layer / pyramidal cells directly by activating both the pyramidal cells and inhibitory interneurons in that layer. The inhibition tends to balance the excitation, leading to a modulatory effect. These top-down attentional pathways tend to synapse in layer , as shown in Fig. b. Their synapses on apical dendrites in layer are not shown, for simplicity. (Reprinted with permission from Raizada & Grossberg ())

Adaptive Resonance Theory. Figure . SMART model overview. A first-order and higher-order cortical area are linked by corticocortical and corticothalamocortical connections. The thalamus is subdivided into specific first-order, second-order, nonspecific, and thalamic reticular nucleus (TRN). The thalamic matrix (one cell population shown as an open ring) provides priming to layer , where layer pyramidal cell apical dendrites terminate. The specific thalamus relays sensory information (first-order thalamus) or lower-order cortical information (second-order thalamus) to the respective cortical areas via plastic connections. The nonspecific thalamic nucleus receives convergent BU input and inhibition from the TRN, and projects to layer of the laminar cortical circuit, where it regulates reset and search in the cortical circuit (see text). Corticocortical feedback connections link layer II of the higher cortical area to layer of the lower cortical area, whereas thalamocortical feedback originates in layer II and terminates in the specific thalamus after synapsing on the TRN. Layer II corticothalamic feedback matches the BU input in the specific thalamus. V receives two parallel BU thalamocortical pathways. The LGN→V layer pathway and the modulatory LGN→V layer I → pathway provide divisive contrast normalization of layer cell responses. The intracortical loop V layer →/→→I → pathway (folded feedback) enhances the activity of winning layer / cells at their own positions via the I → on-center, and suppresses input to other layer / cells via the I → off-surround. V also activates the BU V→V corticocortical pathways (V layer /→V layers I and ) and the BU corticothalamocortical pathways (V layer →PULV→V layers I and ), where the layer I → pathway provides divisive contrast normalization to V layer cells analogously to V. Corticocortical feedback from V layer II →V layer →I → also uses the same modulatory I → pathway. TRN cells of the two thalamic sectors are linked via gap junctions, which provide synchronization of the two thalamocortical sectors when processing BU stimuli (reprinted with permission from Grossberg & Versace ()).

Review of ART and ARTMAP Algorithms

From Winner-Take-All to Distributed Coding

As noted above, ART networks serve both as models of human cognitive information processing (Carpenter, ; Grossberg, , ) and as neural systems for technology transfer (Caudell, Smith, Escobedo, & Anderson, ; Parsons & Carpenter, ). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP (Carpenter et al., ), ART-EMAP, ARTMAP-IC, and Gaussian ARTMAP. Early ARTMAP systems, including fuzzy ARTMAP, employ winner-take-all (WTA) coding, whereby each input activates a single category node during both training and testing. When a node is first activated during training, it is mapped to its designated output class. Starting with ART-EMAP, subsequent systems have used distributed coding during testing, which typically improves predictive accuracy, while avoiding the computational problems inherent in the use of distributed code representations during training. In order to address these problems, distributed ARTMAP (Carpenter, ; Carpenter, Milenova, & Noeske, ) introduced a new network configuration, in addition to new learning laws. Comparative analysis of the performance of ARTMAP systems on a variety of benchmark problems has led to the identification of a default ARTMAP network, which features simplicity of design and robust performance in many application domains. Default ARTMAP employs winner-take-all coding during training and distributed coding during testing within a distributed ARTMAP network architecture. With winner-take-all coding during testing, default ARTMAP reduces to a version of fuzzy ARTMAP.

Complement Coding: Learning both Absent and Present Features

ART and ARTMAP employ a preprocessing step called complement coding (Fig. ), which models the nervous system’s ubiquitous use of the computational design known as opponent processing. Balancing an entity against its opponent, as in agonist–antagonist muscle pairs, allows a system to act upon relative quantities, even as absolute magnitudes may vary unpredictably. In ART systems, complement coding is analogous to retinal ON-cells and OFF-cells. When the learning system is presented with a set of input features $a \equiv (a_1 \ldots a_i \ldots a_M)$, complement coding doubles the number of input components, presenting to the network both the original feature vector and its complement. Complement coding allows an ART system to encode within its critical feature patterns of memory features that are consistently absent on an equal basis with features that are consistently present. Features that are sometimes absent and sometimes present when a given category is learning are regarded as uninformative with respect to that category. Since its introduction, complement coding has been a standard element of ART and ARTMAP networks, where it plays multiple computational roles, including input normalization. However, this device is not particular to ART, and could, in principle, be used to preprocess the inputs to any type of system. To implement complement coding, component activities $a_i$ of a feature vector a are scaled; thus, $0 \le a_i \le 1$. For each feature i, the ON activity $a_i$ determines the complementary OFF activity $(1 - a_i)$. Both $a_i$ and $(1 - a_i)$ are represented in the 2M-dimensional system input vector $A = (a \mid a^c)$ (Fig. ). Subsequent network computations then operate in this 2M-dimensional input space. In particular, learned weight vectors $w_J$ are 2M-dimensional.

ARTMAP Search and Match Tracking in Fuzzy ARTMAP

As illustrated by Fig. , the ART matching process triggers either learning or a parallel memory search. If search ends at an established code, the memory representation may either remain the same or incorporate new information from matched portions of the current input. While this dynamic applies to arbitrarily distributed activation patterns, the F2 search and code for fuzzy ARTMAP (Fig. ) describes a winner-take-all system. Before ARTMAP makes a class prediction, the bottom-up input A is matched against the top-down

Adaptive Resonance Theory. Figure . Complement coding transforms an M-dimensional feature vector a = (a1 ... ai ... aM) into a 2M-dimensional system input vector A = (A1 ... AM ∣ AM+1 ... A2M) = (a ∣ a^c), where the ON channel carries a and the OFF channel carries a^c = ((1 − a1) ... (1 − ai) ... (1 − aM)). A complement-coded system input represents both the degree to which a feature i is present (ai) and the degree to which that feature is absent (1 − ai)
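The complement coding described in this caption can be sketched in a few lines. This is a minimal illustration, not code from the entry; the function name `complement_code` and the requirement that features are already scaled to [0, 1] are assumptions made for the example.

```python
def complement_code(a):
    """Map an M-dimensional feature vector a (values in [0, 1]) to the
    2M-dimensional complement-coded input A = (a | a^c)."""
    if any(not 0.0 <= ai <= 1.0 for ai in a):
        raise ValueError("features must be scaled to [0, 1]")
    # ON channel: a itself; OFF channel: the complement 1 - a_i.
    return list(a) + [1.0 - ai for ai in a]


a = [0.25, 0.5, 1.0]          # hypothetical feature vector, M = 3
A = complement_code(a)
# The city-block norm |A| equals M whatever the feature values are.
assert abs(sum(A) - len(a)) < 1e-12
```

Because each pair (ai, 1 − ai) sums to one, the norm of A is constant, which is what makes the total excitatory input to the orienting system equal to ρM.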

Adaptive Resonance Theory. Figure . A fuzzy ART search cycle, with a distributed ART network configuration (Carpenter, ). The ART search cycle (Carpenter and Grossberg, ) is the same, but allows only binary inputs and did not originally feature complement coding. The match field F1 represents the matched activation pattern x = A ∧ wJ, where ∧ denotes the component-wise minimum, or fuzzy intersection, between the bottom-up input A and the top-down expectation wJ. If the matched pattern fails to meet the matching criterion, then the active code is reset at F2, and the system searches for another code y that better represents the input. The match/mismatch decision takes place in the ART orienting system. Each active feature in the input pattern A excites the orienting system with gain equal to the vigilance parameter ρ. Hence, with complement coding, the total excitatory input is $\rho\,|A| = \rho \sum_{i=1}^{2M} A_i = \rho M$. Active cells in the matched pattern x inhibit the orienting system, leading to a total inhibitory input equal to $-|x| = -\sum_{i=1}^{2M} x_i$. If ρ∣A∣ − ∣x∣ ≤ 0, then the orienting system remains quiet, allowing resonance and learning to occur. If ρ∣A∣ − ∣x∣ > 0, then the reset signal r = 1, initiating search for a better matching code
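The search cycle in this caption can be sketched as a loop over candidate nodes. This is a sketch under stated assumptions, not the authors' implementation: it is winner-take-all, and it orders candidates by the Weber-law choice function T_j = |A ∧ w_j| / (α + |w_j|), which the caption itself does not specify. The names `search`, `fuzzy_and`, and `alpha` are illustrative.

```python
def fuzzy_and(A, w):
    """Component-wise minimum (fuzzy intersection) A ∧ w."""
    return [min(Ai, wi) for Ai, wi in zip(A, w)]


def norm(v):
    """City-block norm |v| of a non-negative vector."""
    return sum(v)


def search(A, weights, rho, alpha=0.001):
    """Return the index of the first node that passes the vigilance test,
    or None if every node in `weights` is reset.

    Assumption: nodes are tried in decreasing order of the Weber-law
    choice function T_j = |A ∧ w_j| / (alpha + |w_j|).
    """
    order = sorted(
        range(len(weights)),
        key=lambda j: norm(fuzzy_and(A, weights[j])) / (alpha + norm(weights[j])),
        reverse=True,
    )
    for j in order:
        x = fuzzy_and(A, weights[j])        # matched pattern at the match field
        if rho * norm(A) - norm(x) <= 0:    # orienting system stays quiet
            return j                        # resonance: node j may learn
        # otherwise r = 1: node j is reset and the search continues
    return None


# Hypothetical complement-coded weights: a broad category and a narrow one.
weights = [[0.1, 0.1, 0.1, 0.1], [0.35, 0.35, 0.6, 0.6]]
A = [0.3, 0.3, 0.7, 0.7]                    # a = (0.3, 0.3), complement coded
low = search(A, weights, rho=0.1)           # low vigilance accepts the broad node
high = search(A, weights, rho=0.85)         # higher vigilance resets it first
```

With these numbers the broad node wins the competition in both cases, but only low vigilance lets it resonate; at ρ = 0.85 it is reset and the narrower node is selected instead.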


learned expectation, or critical feature pattern, that is read out by the active node (Fig. b). The matching criterion is set by a vigilance parameter ρ. As noted above, low vigilance permits the learning of abstract, prototype-like patterns, while high vigilance requires the learning of specific, exemplar-like patterns. When a new input arrives, vigilance equals a baseline level ρ̄. Baseline vigilance is set equal to zero by default, in order to maximize generalization. Vigilance rises only after the system has made a predictive error. The internal control process that determines how far it must rise in order to correct the error is called match tracking. As vigilance rises, the network is required to pay more attention to how well top-down expectations match the current bottom-up input.

Match tracking (Fig. ) forces an ARTMAP system not only to reset its mistakes, but to learn from them. With match tracking and fast learning, each ARTMAP network passes the next input test, which requires that, if a training input were re-presented immediately after a learning trial, it would directly activate the correct output class, with no predictive errors or search. Match tracking thus simultaneously implements the design goals of maximizing generalization and minimizing predictive error, without requiring the choice of a fixed matching criterion. ARTMAP memories thereby include both broad and specific pattern classes, with the latter typically formed as exceptions to the more general “rules” defined by the former. ARTMAP learning typically produces a wide variety of such mixtures, whose exact composition depends upon the order of training exemplar presentation.

Unless they have already activated all their coding nodes, ARTMAP systems contain a reserve of nodes that have never been activated, with weights at their initial values. These uncommitted nodes compete with the previously active committed nodes, and an uncommitted node is chosen over poorly matched committed nodes. An ARTMAP design constraint specifies that an active uncommitted node should not reset itself. Weights initially begin with wiJ = 1. Thus, when the active node J is uncommitted, x = A ∧ wJ = A at the match field. Then ρ∣A∣ − ∣x∣ = ρ∣A∣ − ∣A∣ = (ρ − 1)∣A∣. Thus ρ∣A∣ − ∣x∣ ≤ 0 and an uncommitted node does not trigger a reset, provided ρ ≤ 1.

Adaptive Resonance Theory. Figure . ARTMAP match tracking, with vigilance dynamics $\frac{d\rho}{dt} = -(\rho - \bar{\rho}) + \Gamma R\, r^{c}$. When an active node J meets the matching criterion (ρ∣A∣ − ∣x∣ ≤ 0), the reset signal r = 0 and the node makes a prediction. If the predicted output is incorrect, the feedback signal R = 1. While R = r^c = 1, ρ increases rapidly. As soon as ρ > ∣x∣/∣A∣, r switches to 1, which both halts the increase of ρ and resets the active F2 node. From one chosen node to the next, ρ decays to slightly below ∣x∣/∣A∣ (MT–). On the time scale of learning ρ returns to ρ̄

ART Geometry
Fuzzy ART long-term memories are visualized as hyper-rectangles, called category boxes. The weight vector wJ is interpreted geometrically as a box RJ whose ON-channel corner uJ and OFF-channel corner vJ are, in the format of the complement-coded input vector, defined by (uJ ∣ vJ^c) ≡ wJ (Fig. ). For fuzzy ART with the choice-by-difference F → F signal function TJ, an input a activates the node J of the closest category box RJ, according to the L1 (city-block) metric. In case of a tie, as when a lies in more than one box, the node with the smallest RJ is chosen, where ∣RJ∣ is defined as the sum of the edge lengths $\sum_{i=1}^{M} |v_{iJ} - u_{iJ}|$. The chosen node J will reset if ∣RJ ⊕ a∣ > M(1 − ρ), where RJ ⊕ a is the smallest box enclosing both RJ and a. Otherwise, RJ expands toward RJ ⊕ a during learning. With fast learning, RJ^new = RJ^old ⊕ a.
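The category-box reading of the weights can be made concrete as follows. This is a minimal sketch with hypothetical helper names (`box`, `expand`, `size`, `resets`); it assumes complement-coded weights of the form w = (u ∣ v^c) and implements only the geometric reset test stated above.

```python
def box(w):
    """Split a complement-coded weight w = (u | v^c) into corners (u, v)."""
    M = len(w) // 2
    u = list(w[:M])
    v = [1.0 - wc for wc in w[M:]]
    return u, v


def expand(u, v, a):
    """Corners of R ⊕ a, the smallest box enclosing both R = [u, v] and a."""
    return ([min(ui, ai) for ui, ai in zip(u, a)],
            [max(vi, ai) for vi, ai in zip(v, a)])


def size(u, v):
    """|R|: the sum of the edge lengths of the box with corners u and v."""
    return sum(abs(vi - ui) for ui, vi in zip(u, v))


def resets(w, a, rho):
    """Geometric reset test: |R ⊕ a| > M (1 - rho)."""
    u, v = box(w)
    return size(*expand(u, v, a)) > len(a) * (1.0 - rho)


w = [0.2, 0.3, 0.6, 0.5]      # hypothetical box with u = (0.2, 0.3), v = (0.4, 0.5)
a = [0.5, 0.4]                # input just outside the box
grown = size(*expand(*box(w), a))   # width 0.3 plus height 0.2
```

For this example the expanded box has size 0.5, so with M = 2 the node resets only when 2(1 − ρ) < 0.5, that is, for ρ above 0.75. The same threshold follows from the algebraic test ρ∣A∣ − ∣x∣ > 0, since with complement coding ∣x∣ = M − ∣RJ ⊕ a∣.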

Adaptive Resonance Theory. Figure . Fuzzy ART geometry. The weight of a category node J is represented in complement-coding form as wJ = (uJ ∣ vJ^c), and the M-dimensional vectors uJ and vJ define the corners of the category box RJ. When M = 2, the size of RJ equals its width plus its height. During learning, RJ expands toward RJ ⊕ a, defined as the smallest box enclosing both RJ and a. Node J will reset before learning if ∣RJ ⊕ a∣ > M(1 − ρ)

Biasing Against Previously Active Category Nodes and Previously Attended Features During Attentive Memory Search
Activity x at the ART field F1 continuously computes the match between the field’s bottom-up and top-down input patterns. A reset signal r shuts off the active F2 node J when x fails to meet the matching criterion determined by the value of the vigilance parameter ρ. Reset alone does not, however, trigger a search for a different F2 node: unless the prior activation has left an enduring trace within the F -to-F subsystem, the network will simply reactivate the same node as before. As modeled in ART 3, biasing the bottom-up input to the coding field F2 to favor the previously inactive nodes implements search by allowing the network to activate a new node in response to a reset signal. The ART 3 search mechanism defines a medium-term memory (MTM) in the F -to-F adaptive filter which biases the system against re-choosing a node that had just produced a reset. A presynaptic interpretation of this bias is transmitter depletion, or habituation (Fig. ). Medium-term memory in all ART models allows the network to shift attention among learned categories at the coding field F2 during search. The new biased ART network (Carpenter & Gaddam, ) introduces a second medium-term memory that shifts attention among input features, as well as categories, during search.

Adaptive Resonance Theory. Figure . ART 3 search mechanism. ART 3 search implements a medium-term memory within the F -to-F pathways, which biases the system against choosing a category node that had just produced a reset

Self-Organizing Rule Discovery
This foundation of computational principles and mechanisms has enabled the development of an ART information fusion system that is capable of incrementally learning a cognitive hierarchy of rules in response to probabilistic, incomplete, and even contradictory data that are collected by multiple observers (Carpenter, Martens, & Ogas, ).

Cross References
→Bayes Rule
→Bayesian Methods

Recommended Reading
Bullier, J., Hupé, J. M., James, A., & Girard, P. (). Functional interactions between areas V and V in the monkey. Journal of Physiology Paris, (–), –.
Carpenter, G. A. (). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, , –.
Carpenter, G. A., & Gaddam, S. C. (). Biased ART: A neural architecture that shifts attention towards previously disregarded features following an incorrect prediction. Neural Networks, .
Carpenter, G. A., & Grossberg, S. (). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, , –.
Carpenter, G. A., & Grossberg, S. (). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, , –.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, , –.
Carpenter, G. A., Martens, S., & Ogas, O. J. (). Self-organizing information fusion and hierarchical knowledge discovery: A new framework using ARTMAP neural networks. Neural Networks, , –.
Carpenter, G. A., Milenova, B. L., & Noeske, B. W. (). Distributed ARTMAP: A neural network for fast distributed supervised learning. Neural Networks, , –.
Caudell, T. P., Smith, S. D. G., Escobedo, R., & Anderson, M. (). NIRS: Large scale ART neural architectures for engineering design retrieval. Neural Networks, , –.
Grossberg, S. (). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, , –.
Grossberg, S. (). How does a brain build a cognitive code? Psychological Review, , –.
Grossberg, S. (). The link between brain, learning, attention, and consciousness. Consciousness and Cognition, , –.
Grossberg, S. (). The complementary brain: Unifying brain dynamics and modularity. Trends in Cognitive Sciences, , –.
Grossberg, S. (). How does the cerebral cortex work? Development, learning, attention, and D vision by laminar circuits of visual cortex. Behavioral and Cognitive Neuroscience Reviews, , –.
Grossberg, S. (). Consciousness CLEARS the mind. Neural Networks, , –.
Grossberg, S., & Versace, M. (). Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Research, , –.
Parsons, O., & Carpenter, G. A. (). ARTMAP neural networks for information fusion and data mining: Map production and target recognition methodologies. Neural Networks, (), –.
Posner, M. I., & Keele, S. W. (). On the genesis of abstract ideas. Journal of Experimental Psychology, , –.
Raizada, R., & Grossberg, S. (). Towards a theory of the laminar architecture of cerebral cortex: Computational clues from the visual system. Cerebral Cortex, , –.
Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (). Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex. Nature, , –.

Adaptive System
→Complexity in Adaptive Systems

Agent
In computer science, the term “agent” usually denotes a software abstraction of a real entity which is capable of acting with a certain degree of autonomy. For example, in artificial societies, agents are software abstractions of real people, interacting in an artificial, simulated environment. Various authors have proposed different definitions of agents. Most of them would agree on the following set of agent properties:
● Persistence: Code is not executed on demand but runs continuously and decides autonomously when it should perform some activity.
● Social ability: Agents are able to interact with other agents.
● Reactivity: Agents perceive the environment and are able to react.
● Proactivity: Agents exhibit goal-directed behavior and can take the initiative.

Agent-Based Computational Models
→Artificial Societies

Agent-Based Modeling and Simulation
→Artificial Societies

Agent-Based Simulation Models
→Artificial Societies

AIS
→Artificial Immune Systems

Algorithm Evaluation Geoffrey I. Webb Monash University, Victoria, Australia

Definition Algorithm evaluation is the process of assessing a property or properties of an algorithm.

Motivation and Background
It is often valuable to assess the efficacy of an algorithm. In many cases, such assessment is relative, that is, evaluating which of several alternative algorithms is best suited to a specific application.

Processes and Techniques
Many learning algorithms have been proposed. In order to understand the relative merits of these alternatives, it is necessary to evaluate them. The primary approaches to evaluation can be characterized as either theoretical or experimental. Theoretical evaluation uses formal methods to infer properties of the algorithm, such as its computational complexity (Papadimitriou, ), and also employs the tools of →computational learning theory to assess learning theoretic properties. Experimental evaluation applies the algorithm to learning tasks to study its performance in practice.

There are many different types of property that may be relevant to assess depending upon the intended application. These include algorithmic properties, such as time and space complexity. These algorithmic properties are often assessed separately with respect to performance when learning a →model, that is, at →training time, and performance when applying a learned model, that is, at →test time. Other types of property that are often studied are the properties of the models that are learned (see →model evaluation).

Strictly speaking, such properties should be assessed with respect to a specific application or class of applications. However, much machine learning research includes experimental studies in which algorithms are compared using a set of data sets with little or no consideration given to what class of applications those data sets might represent. It is dangerous to draw general conclusions about relative performance on any application from relative performance on this sample of some unknown class of applications. Such experimental evaluation has become known disparagingly as a bake-off.

An approach to experimental evaluation that may be less subject to the limitations of bake-offs is the use of experimental evaluation to assess a learning algorithm’s →bias and variance profile. Bias and variance measure properties of an algorithm’s propensities in learning models rather than directly being properties of the models that are learned. Hence, they may provide more general insights into the relative characteristics of alternative algorithms than do assessments of the performance of learned models on a finite number of applications. One example of such use of bias–variance analysis is found in Webb ().

Techniques for experimental algorithm evaluation include →bootstrap sampling, →cross-validation, and →holdout evaluation.
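As an illustration of experimental evaluation, the sketch below compares two learners by k-fold cross-validation. It is a toy example under stated assumptions: the two learners (`majority_class`, `one_nearest_neighbour`), the one-dimensional data set, and the function name `k_fold_accuracy` are all hypothetical, and, as the discussion of bake-offs warns, a result on one data set says little about other applications.

```python
import random


def k_fold_accuracy(learner, data, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    `learner(train)` must return a function mapping x to a predicted label;
    `data` is a list of (x, label) pairs.
    """
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = learner(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / k


def majority_class(train):
    """Baseline learner: always predict the most common training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top


def one_nearest_neighbour(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


# Hypothetical one-dimensional task: the label is 1 when x > 0.5.
data = [(i / 40, int(i / 40 > 0.5)) for i in range(41)]
for name, learner in [("majority", majority_class), ("1-NN", one_nearest_neighbour)]:
    print(name, round(k_fold_accuracy(learner, data), 3))
```

Holdout evaluation corresponds to using a single train/test split instead of averaging over k folds.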

Cross References
→Computational Learning Theory
→Model Evaluation

Recommended Reading
Hastie, T., Tibshirani, R., & Friedman, J. H. (). The elements of statistical learning. New York: Springer.
Mitchell, T. M. (). Machine learning. New York: McGraw-Hill.
Papadimitriou, C. H. (). Computational complexity. Reading, MA: Addison-Wesley.
Webb, G. I. (). MultiBoosting: A technique for combining boosting and wagging. Machine Learning, (), –.
Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

Analogical Reasoning
→Instance-Based Learning

Analysis of Text
→Text Mining

Analytical Learning
→Deductive Learning
→Explanation-Based Learning

Ant Colony Optimization Marco Dorigo, Mauro Birattari Université Libre de Bruxelles, Brussels, Belgium

Synonyms ACO

Definition
Ant colony optimization (ACO) is a population-based metaheuristic for the solution of difficult combinatorial optimization problems. In ACO, each individual of the population is an artificial agent that builds incrementally and stochastically a solution to the considered problem. Agents build solutions by moving on a graph-based representation of the problem. At each step their moves define which solution components are added to the solution under construction. A probabilistic model is associated with the graph and is used to bias the agents’ choices. The probabilistic model is updated online by the agents so as to increase the probability that future agents will build good solutions.

Motivation and Background
Ant colony optimization is so called because of its original inspiration: the foraging behavior of some ant species. In particular, in Beckers, Deneubourg, and Goss () it was demonstrated experimentally that ants are able to find the shortest path between their nest and a food source by collectively exploiting the pheromone they deposit on the ground while walking. Similar to real ants, ACO’s artificial agents, also called artificial ants, deposit artificial pheromone on the graph of the problem they are solving. The amount of pheromone each artificial ant deposits is proportional to the quality of the solution the artificial ant has built. These artificial pheromones are used to implement a probabilistic model that is exploited by the artificial ants to make decisions during their solution construction activity.

Structure of the Optimization System
Let us consider a minimization problem (S, f), where S is the set of feasible solutions, and f is the objective function, which assigns to each solution s ∈ S a cost value f(s). The goal is to find an optimal solution s∗, that is, a feasible solution of minimum cost. The set of all optimal solutions is denoted by S∗. Ant colony optimization attempts to solve this minimization problem by repeating the following two steps:
● Candidate solutions are constructed using a parameterized probabilistic model, that is, a parameterized probability distribution over the solution space.
● The candidate solutions are used to modify the model in a way that is intended to bias future sampling toward low cost solutions.

The Ant Colony Optimization Probabilistic Model
We assume that the combinatorial optimization problem (S, f) is mapped on a problem that can be characterized by the following list of items:
● A finite set C = {c1, c2, . . . , cNC} of components, where NC is the number of components.
● A finite set X of states of the problem, where a state is a sequence x = ⟨ci, cj, . . . , ck, . . .⟩ over the elements of C. The length of a sequence x, that is, the number of components in the sequence, is expressed by ∣x∣. The maximum length of a sequence is bounded by a positive constant n < +∞.
● A set of (candidate) solutions S, which is a subset of X (i.e., S ⊆ X).
● A set of feasible states X̃, with X̃ ⊆ X, defined via a set of constraints Ω.
● A nonempty set S∗ of optimal solutions, with S∗ ⊆ X̃ and S∗ ⊆ S.

Given the above formulation, artificial ants build candidate solutions by performing randomized walks on the completely connected, weighted graph G = (C, L, T), where the vertices are the components C, the set L fully connects the components C, and T is a vector of so-called pheromone trails τ. (Note that, because this formulation is always possible, ACO can in principle be applied to any combinatorial optimization problem.) Pheromone trails can be associated with components, connections, or both. Here we assume that the pheromone trails are associated with connections, so that τ(i, j) is the pheromone associated with the connection between components i and j. It is straightforward to extend the algorithm to the other cases. The graph G is called the construction graph.

To construct candidate solutions, each artificial ant is first put on a randomly chosen vertex of the graph. It then performs a randomized walk by moving at each step from vertex to vertex on the graph in such a way that the next vertex is chosen stochastically according to the strength of the pheromone currently on the arcs.

While moving from one node to another of the graph G, constraints Ω may be used to prevent ants from building infeasible solutions. Formally, the solution construction behavior of a generic ant can be described as follows:

ant_solution_construction
● For each ant:
  – Select a start node c1 according to some problem dependent criterion.
  – Set k = 1 and xk = ⟨c1⟩.
● While xk = ⟨c1, c2, . . . , ck⟩ ∈ X̃, xk ∉ S, and the set Jxk of components that can be appended to xk is not empty, select the next node (component) ck+1 randomly according to:

\[
P_T(c_{k+1} = c \mid x_k) =
\begin{cases}
\dfrac{F_{(c_k,c)}\bigl(\tau(c_k,c)\bigr)}{\sum_{(c_k,y)\in J_{x_k}} F_{(c_k,y)}\bigl(\tau(c_k,y)\bigr)} & \text{if } (c_k,c)\in J_{x_k},\\[2ex]
0 & \text{otherwise,}
\end{cases}
\]

where a connection (ck, y) belongs to Jxk if and only if the sequence xk+1 = ⟨c1, c2, . . . , ck, y⟩ satisfies the constraints Ω (that is, xk+1 ∈ X̃) and F(i,j)(z) is some monotonic function, a common choice being z^α η(i, j)^β, where α, β > 0, and the η(i, j) are heuristic values measuring the desirability of adding component j after i. If at some stage xk ∉ S and Jxk = ∅, that is, the construction process has reached a dead-end, the current state xk is discarded. However, this situation may be prevented by allowing artificial ants to build infeasible solutions as well. In such a case, an infeasibility penalty term is usually added to the cost function. Nevertheless, in most of the settings in which ACO has been applied, the dead-end situation does not occur. For certain problems, one may find it useful to use a more general scheme, where F depends on the pheromone values of several “related” connections rather than just a single one. Moreover, instead of the random-proportional rule above, different selection schemes, such as the pseudo-random-proportional rule (Dorigo & Gambardella, ), may be used.

The Ant Colony Optimization Pheromone Update
Many different schemes for pheromone update have been proposed within the ACO framework. For an extensive overview, see Dorigo and Stützle (). Most pheromone updates can be described using the following generic scheme:

Generic_ACO_Update
● ∀s ∈ Ŝt, ∀(i, j) ∈ s : τ(i, j) ← τ(i, j) + Qf(s ∣ S1, . . . , St),
● ∀(i, j) : τ(i, j) ← (1 − ρ) ⋅ τ(i, j),

where Si is the sample in the ith iteration, ρ, 0 ≤ ρ < 1, is the evaporation rate, and Qf(s ∣ S1, . . . , St) is some “quality function,” which is typically required to be nonincreasing with respect to f and is defined over the “reference set” Ŝt. Different ACO algorithms may use different quality functions and reference sets. For example, in the very first ACO algorithm, Ant System (Dorigo, Maniezzo, & Colorni, , ), the quality function is simply 1/f(s) and the reference set Ŝt = St. In a subsequently proposed scheme, called iteration best update (Dorigo & Gambardella, ), the reference set is a singleton containing the best solution within St (if there are several iteration-best solutions, one of them is chosen randomly). For the global-best update (Dorigo et al., ; Stützle & Hoos, ), the reference set contains the best among all the iteration-best solutions (and if there is more than one global-best solution, the earliest one is chosen). In Dorigo et al. () an elitist strategy was introduced, in which the update is a combination of the previous two. In case a good lower bound on the optimal solution cost is available, one may use the following quality function (Maniezzo, ):

\[
Q_f(s \mid S_1, \dots, S_t) = \tau_0 \left( 1 - \frac{f(s) - LB}{\bar{f} - LB} \right) = \tau_0\, \frac{\bar{f} - f(s)}{\bar{f} - LB},
\]

where f̄ is the average of the costs of the last k solutions and LB is the lower bound on the optimal solution cost. With this quality function, the solutions are evaluated by comparing their cost to the average cost of the other recent solutions, rather than by using the absolute cost values. In addition, the quality function is automatically scaled based on the proximity of the average cost to the lower bound.
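To make the construction rule and the generic update concrete, here is a minimal sketch in the spirit of Ant System on a small symmetric traveling salesman instance. Everything specific is an assumption made for illustration: the heuristic η(i, j) = 1/d(i, j), the initial pheromone value 1.0, the parameter defaults, and the function names. The update applies the deposit step and then the evaporation step, in the order in which the generic scheme lists them.

```python
import math
import random


def construct_tour(n, tau, eta, alpha, beta, rng):
    """Build one tour with the random-proportional rule."""
    tour = [rng.randrange(n)]                        # random start vertex
    while len(tour) < n:
        i = tour[-1]
        feasible = [j for j in range(n) if j not in tour]   # constraint: no revisits
        weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in feasible]
        tour.append(rng.choices(feasible, weights=weights)[0])
    return tour


def tour_cost(tour, dist):
    """Cost f(s) of a closed tour."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))


def ant_system(dist, n_ants=10, n_iter=50, rho=0.5, alpha=1.0, beta=2.0, seed=0):
    """Return (best tour, best cost) found; dist must be positive off-diagonal."""
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]              # assumed initial pheromone
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    best, best_cost = None, float("inf")
    for _ in range(n_iter):
        sample = [construct_tour(n, tau, eta, alpha, beta, rng) for _ in range(n_ants)]
        # Deposit: reference set = whole sample, quality function 1 / f(s).
        for s in sample:
            cost = tour_cost(s, dist)
            if cost < best_cost:
                best, best_cost = s, cost
            for k in range(n):
                i, j = s[k], s[(k + 1) % n]
                tau[i][j] += 1.0 / cost
                tau[j][i] += 1.0 / cost              # symmetric instance
        # Evaporation on every connection.
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
    return best, best_cost


# Hypothetical instance: four cities on the corners of a unit square.
pts = [(0, 0), (0, 1), (1, 1), (1, 0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
best, cost = ant_system(dist)
```

Switching to an iteration-best or global-best update only changes which solutions are allowed to deposit pheromone in the deposit loop.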


A pheromone update that slightly differs from the generic update described above was used in ant colony system (ACS) (Dorigo & Gambardella, ). There the pheromone is evaporated by the ants online during the solution construction, hence only the pheromone involved in the construction evaporates. Another modification of the generic update was introduced in MAX–MIN Ant System (Stützle & Hoos, , ), which uses maximum and minimum pheromone trail limits. With this modification, the probability of generating any particular solution is kept above some positive threshold. This helps to prevent search stagnation and premature convergence to suboptimal solutions.

Cross References
→Swarm Intelligence

Recommended Reading
Beckers, R., Deneubourg, J. L., & Goss, S. (). Trails and U-turns in the selection of the shortest path by the ant Lasius Niger. Journal of Theoretical Biology, , –.
Dorigo, M., & Gambardella, L. M. (). Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, (), –.
Dorigo, M., Maniezzo, V., & Colorni, A. (). Positive feedback as a search strategy. Technical Report -, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy.
Dorigo, M., Maniezzo, V., & Colorni, A. (). Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics – Part B, (), –.
Dorigo, M., & Stützle, T. (). Ant colony optimization. Cambridge, MA: MIT Press.
Maniezzo, V. (). Exact and approximate nondeterministic tree-search procedures for the quadratic assignment problem. INFORMS Journal on Computing, (), –.
Stützle, T., & Hoos, H. H. (). The MAX–MIN ant system and local search for the traveling salesman problem. In Proceedings of the Congress on Evolutionary Computation – CEC’ (pp. –). Piscataway, NJ: IEEE Press.
Stützle, T., & Hoos, H. H. (). MAX–MIN ant system. Future Generation Computer Systems, (), –, .

Anytime Algorithm
An anytime algorithm is an algorithm whose output increases in quality gradually with increased running time. This is in contrast to algorithms that produce no output at all until they produce full-quality output after a sufficiently long execution time. An example of an algorithm with good anytime performance is →Adaptive Real-Time Dynamic Programming (ARTDP).

AODE
→Averaged One-Dependence Estimators

Apprenticeship Learning
→Behavioral Cloning

Approximate Dynamic Programming
→Value Function Approximation

Apriori Algorithm Hannu Toivonen University of Helsinki, Helsinki, Finland

Definition
The Apriori algorithm (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, ) is a →data mining method which outputs all →frequent itemsets and →association rules from given data.

Input: set I of items, multiset D of subsets of I, frequency threshold min_fr, and confidence threshold min_conf.
Output: all frequent itemsets and all valid association rules in D.
Method:
1: level := 1; frequent_sets := ∅;
2: candidate_sets := {{i} ∣ i ∈ I};
3: while candidate_sets ≠ ∅
  3.1: scan data D to compute frequencies of all sets in candidate_sets;
  3.2: frequent_sets := frequent_sets ∪ {C ∈ candidate_sets ∣ frequency(C) ≥ min_fr};
  3.3: level := level + 1;
  3.4: candidate_sets := {A ⊂ I ∣ ∣A∣ = level and B ∈ frequent_sets for all B ⊂ A, ∣B∣ = level − 1};
4: output frequent_sets;
5: for each F ∈ frequent_sets
  5.1: for each E ⊂ F, E ≠ ∅, E ≠ F
    5.1.1: if frequency(F)/frequency(E) ≥ min_conf then output association rule E → (F \ E)

The algorithm finds frequent itemsets (lines 1-4) by a breadth-first, general-to-specific search. It generates and tests candidate itemsets in batches, to reduce the overhead of database access. The search starts with the most general itemset patterns, the singletons, as candidate patterns (line 2). The algorithm then iteratively computes the frequencies of candidates (line 3.1) and saves those that are frequent (line 3.2). The crux of the algorithm is in the candidate generation (line 3.4): on the next level, those itemsets are pruned that have an infrequent subset. Obviously, such itemsets cannot be frequent. This allows Apriori to find all frequent itemsets without spending too much time on infrequent itemsets. See →frequent pattern and →constraint-based mining for more details and extensions. Finally, the algorithm tests all frequent association rules and outputs those that are also confident (lines 5-5.1.1).
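The method above translates fairly directly into Python. The following is a small sketch, not a tuned implementation: it assumes that frequency means relative frequency (the fraction of transactions containing the itemset), it builds level-k candidates by joining frequent (k − 1)-sets before pruning, and the function name `apriori` and the example transactions are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_fr, min_conf):
    """Return (frequent, rules): `frequent` maps each frequent itemset to its
    relative frequency; `rules` lists (antecedent, consequent, confidence)."""
    data = [frozenset(t) for t in transactions]
    n = len(data)
    items = set().union(*data)
    frequent = {}
    level = 1
    candidates = {frozenset([i]) for i in items}
    while candidates:
        # One pass over the data per level.
        counts = {c: sum(c <= t for t in data) for c in candidates}
        found = {c for c, k in counts.items() if k / n >= min_fr}
        frequent.update({c: counts[c] / n for c in found})
        level += 1
        # Join frequent sets, then keep a candidate only if every
        # (level - 1)-subset is frequent.
        candidates = {a | b for a in found for b in found if len(a | b) == level}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, level - 1))
        }
    rules = []
    for f in frequent:
        for r in range(1, len(f)):
            for e in map(frozenset, combinations(f, r)):
                conf = frequent[f] / frequent[e]
                if conf >= min_conf:
                    rules.append((e, f - e, conf))
    return frequent, rules


# Hypothetical market-basket data.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent, rules = apriori(baskets, min_fr=0.5, min_conf=0.9)
```

On these four baskets the pair {milk, butter} is infrequent, so the three-item candidate is pruned without being counted, and the only rule that reaches the confidence threshold is butter → bread.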

Cross References
→Association Rule
→Basket Analysis
→Constraint-Based Mining
→Frequent Itemset
→Frequent Pattern

Recommended Reading
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. –). Menlo Park: AAAI Press.

AQ
→Rule Learning

Area Under Curve
Synonyms
AUC

Definition
The area under curve (AUC) statistic is an empirical measure of classification performance based on the area under an ROC curve. It evaluates the performance of a scoring classifier on a test set, but ignores the magnitude of the scores and only takes their rank order into account. AUC is expressed on a scale of 0 to 1, where 0 means that all negatives are ranked before all positives, and 1 means that all positives are ranked before all negatives. See →ROC Analysis.

ARL
→Average-Reward Reinforcement Learning

ART
→Adaptive Resonance Theory

ARTDP
→Adaptive Real-Time Dynamic Programming

Artificial Immune Systems
Jon Timmis
University of York, Heslington, North Yorkshire, UK

Synonyms
AIS; Immune computing; Immune-inspired computing; Immunocomputing; Immunological computation

Definition

Artificial immune systems (AIS) have emerged as a computational intelligence approach that shows great promise. Inspired by the complexity of the immune system, computer scientists and engineers have created systems that in some way mimic or capture certain computationally appealing properties of the immune system, with the aim of building more robust and adaptable solutions. AIS have been defined by de Castro and Timmis () as:


▸ adaptive systems, inspired by theoretical immunology and observed immune functions, principle and models, which are applied to problem solving

AIS are not limited to machine learning systems: they are developed in a wide variety of other areas, such as optimization, scheduling, fault tolerance, and robotics (Hart & Timmis, ). Within the context of machine learning, both supervised and unsupervised approaches have been developed. Immune-inspired learning approaches typically develop a memory set of detectors that are capable of classifying unseen data items (in the case of supervised learning) or a memory set of detectors that represent clusters within the data (in the case of unsupervised learning). Both static and dynamic learning systems have been developed.

Motivation and Background
The immune system is a complex system that undertakes a myriad of tasks. The abilities of the immune system have helped to inspire computer scientists to build systems that mimic, in some way, various properties of the immune system. This field of research, AIS, has seen the application of immune-inspired algorithms to a wide variety of areas.

The origin of AIS has its roots in the early theoretical immunology work of Farmer, Perelson, and Varela (Farmer, Packard, & Perelson, ; Varela, Coutinho, Dupire, & Vaz, ). These works investigated a number of theoretical →immune network models proposed to describe the maintenance of immune memory in the absence of antigen. While controversial from an immunological perspective, these models began to give rise to an interest from the computing community. The most influential people at crossing the divide between computing and immunology in the early days were Bersini and Forrest. It is fair to say that some of the early work by Bersini () was very well rooted in immunology, and this is also true of the early work by Forrest (). It was these works that formed the basis of a solid foundation for the area of AIS. Bersini concentrated on the immune network theory, examining how the immune system maintained its memory and how one might build models and algorithms mimicking that property. Forrest's work focused on computer security (in particular, network intrusion detection) and formed the basis of a great deal of further research by the community on the application of immune-inspired techniques to computer security.

At about the same time as Forrest was undertaking her work, other researchers began to investigate the nature of learning in the immune system and how that might be used to create machine learning algorithms (Cook & Hunt, ). They had the idea that it might be possible to exploit the mechanisms of the immune system (in particular, the immune network) in learning systems, so they set about doing a proof of concept (Cook & Hunt, ). Initial results were very encouraging, and they built on their success by applying the immune ideas to the classification of DNA sequences as either promoter or nonpromoter classes: this work was generalized in Timmis and Neal (). Similar work was carried out by de Castro and Von Zuben (), who developed algorithms for use in function optimization and data clustering. Work in dynamic unsupervised machine learning algorithms was also undertaken, meeting with success in works such as Neal (). In the supervised learning domain, very little happened until Watkins () (later expanded in Watkins, ) developed an immune-based classifier known as AIRS. In the dynamic supervised domain, the work in Secker, Freitas, and Timmis () is one of a number of successes.

Structure of the Learning System In an attempt to create a common basis for AIS, de Castro and Timmis () proposed a framework for engineering AIS. They argued the case for such a framework on the grounds that similar frameworks in other biologically inspired approaches, such as artificial neural networks (ANNs) and evolutionary algorithms (EAs), have helped considerably with the understanding and construction of such systems. For example, de Castro and Timmis () consider a set of artificial neurons, which can be arranged together to form an ANN. In order to acquire knowledge, these neural networks undergo an adaptive process, known as learning or training, which alters (some of) the parameters within the network. Therefore, they argued that, in a simplified form, a framework to design an ANN is composed of a set of artificial neurons, a pattern of interconnection for these neurons, and a learning algorithm. Similarly, they argued that in evolutionary algorithms there is a set of artificial chromosomes representing a population of individuals that iteratively undergo a process of reproduction, genetic variation, and selection. As a result of this process, a population of evolved artificial individuals arises. A framework, in this case, would correspond to the genetic representation of the individuals of the population, plus the procedures for reproduction, genetic variation, and selection. Therefore, they proposed that a framework to design a biologically inspired algorithm requires, at least, the following basic elements:

● A representation for the components of the system
● A set of mechanisms to evaluate the interaction of individuals with the environment and each other. The environment is usually simulated by a set of input stimuli, one or more fitness function(s), or other means
● Procedures of adaptation that govern the dynamics of the system, i.e., how its behavior varies over time

This framework can be thought of as a layered approach, such as the specific framework for engineering AIS of de Castro and Timmis () shown in Fig. . This framework follows the three basic elements for designing a biologically inspired algorithm just described, where the set of mechanisms for evaluation are the affinity measures and the procedures of adaptation are the immune algorithms. In order to build a system such as an AIS, one typically requires an application domain or target function. From this basis, the way in which the components of the system will be represented is considered. For example, the representation of network traffic may well be different from the representation of a real-time embedded system. In AIS, the way in which something is represented is known as shape space. There are many kinds of shape space, such as Hamming, real valued, and so on, each of which carries its own bias and should be selected with care (Freitas & Timmis, ). Once the representation has been chosen, one or more affinity measures are used to quantify the interactions of the elements of the system. There are many possible affinity measures (which are partially dependent upon the representation adopted), such as Hamming and Euclidean distance metrics. Again, each of these has its own bias, and the affinity function must be selected with great care, as it can affect the overall performance (and ultimately the result) of the system (Freitas & Timmis, ).
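To make the link between shape space and affinity measure concrete, the following minimal Python sketch shows one possible affinity function for a Hamming shape space and one for a real-valued shape space. The function names and the choice of turning a Euclidean distance d into an affinity via 1/(1 + d) are illustrative assumptions, not definitions taken from the works cited above.

```python
import math

def hamming_affinity(a, b):
    """Affinity in a Hamming shape space: the number of matching positions."""
    if len(a) != len(b):
        raise ValueError("both strings must have the same length")
    return sum(x == y for x, y in zip(a, b))

def euclidean_affinity(a, b):
    """Affinity in a real-valued shape space (assumed mapping: 1 / (1 + distance))."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

# Binary antigen and detector: 4 of the 6 positions agree.
print(hamming_affinity("110010", "100110"))
# Real-valued antigen and detector at Euclidean distance 5, i.e., 1 / (1 + 5).
print(euclidean_affinity([0.0, 0.0], [3.0, 4.0]))
```

The two functions rank candidate detectors differently even on comparable data, which is one way to see why the choice of representation and affinity measure carries a bias.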

Supervised Immune-Inspired Learning

Artificial Immune Systems. Figure . AIS layered framework adapted from de Castro and Timmis ()

The artificial immune recognition system (AIRS) algorithm was introduced as one of the first immune-inspired supervised learning algorithms and has subsequently gone through a period of study and refinement (Watkins, ). To use classifications from de Castro and Timmis (), for the procedures of adaptation, AIRS is a clonal selection type of immune-inspired algorithm. The representation and affinity layers of the system are standard in that any number of representations such as binary, real values, etc., can be used with the appropriate affinity function. AIRS has its origin in two other immune-inspired algorithms: CLONALG (CLONAL Selection alGorithm) and Artificial Immune NEtwork (AINE) (de Castro and Timmis, ). AIRS resembles CLONALG in the sense that both algorithms are concerned with developing a set of memory cells that give a representation of the learned environment. AIRS is concerned with the development of a set of memory cells that can encapsulate the training data. This is done in a two-stage process of first evolving a candidate memory cell and then determining if this candidate cell should be added to the overall pool of memory cells. The learning process can be outlined as follows:

1. For each pattern to be recognized, do
(a) Compare a training instance with all memory cells of the same class and find the memory cell with the best affinity for the training instance. This is referred to as a memory cell mc_match.
(b) Clone and mutate mc_match in proportion to its affinity to create a pool of abstract B-cells.
(c) Calculate the affinity of each B-cell with the training instance.
(d) Allocate resources to each B-cell based on its affinity.
(e) Remove the weakest B-cells until the number of resources returns to a preset limit.
(f) If the average affinity of the surviving B-cells is above a certain level, continue to step (g). Else, clone and mutate these surviving B-cells based on their affinity and return to step (c).
(g) Choose the best B-cell as a candidate memory cell (mc_cand).
(h) If the affinity of mc_cand for the training instance is better than the affinity of mc_match, then add mc_cand to the memory cell pool. If, in addition to this, the affinity between mc_cand and mc_match is within a certain threshold, then remove mc_match from the memory cell pool.
2. Repeat from step (a) until all training instances have been presented.
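The outline above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the published AIRS implementation: instances are real-valued vectors in the unit hypercube, affinity is 1/(1 + Euclidean distance), each B-cell claims resources equal to its affinity, an empty class is seeded with the training instance itself, and all parameter names and default values are hypothetical. A k-nearest-neighbor classifier over the resulting memory cells is included for completeness.

```python
import math
import random
from collections import Counter

def affinity(a, b):
    # Assumed affinity for a real-valued shape space: closer vectors score higher.
    return 1.0 / (1.0 + math.dist(a, b))

def mutate(cell, step, rng):
    # Gaussian perturbation, clipped to the unit hypercube (an assumption).
    return [min(1.0, max(0.0, v + rng.gauss(0.0, step))) for v in cell]

def train_airs(data, labels, n_clones=10, resource_limit=5.0,
               mean_affinity_goal=0.9, replace_affinity=0.95,
               max_rounds=20, seed=0):
    rng = random.Random(seed)
    memory = []  # memory cells stored as (vector, label) pairs
    for x, y in zip(data, labels):
        same_class = [m for m in memory if m[1] == y]
        if not same_class:
            memory.append((list(x), y))  # assumption: seed an empty class
            continue
        # (a) best-matching memory cell of the same class
        mc_match = max(same_class, key=lambda m: affinity(m[0], x))
        match_aff = affinity(mc_match[0], x)
        # (b) higher affinity gives more clones and smaller mutation steps
        n = max(1, round(n_clones * match_aff))
        pool = [mc_match[0]] + [mutate(mc_match[0], 1.0 - match_aff, rng)
                                for _ in range(n)]
        for _ in range(max_rounds):
            # (c)+(d) rank B-cells; each claims resources equal to its affinity
            pool.sort(key=lambda c: affinity(c, x), reverse=True)
            # (e) drop the weakest B-cells once the resource limit is exceeded
            kept, used = [], 0.0
            for c in pool:
                used += affinity(c, x)
                if kept and used > resource_limit:
                    break
                kept.append(c)
            pool = kept
            # (f) stop when the survivors are good enough on average
            if sum(affinity(c, x) for c in pool) / len(pool) >= mean_affinity_goal:
                break
            pool = pool + [mutate(c, 1.0 - affinity(c, x), rng) for c in pool]
        # (g) candidate memory cell
        mc_cand = max(pool, key=lambda c: affinity(c, x))
        # (h) add the candidate if it beats mc_match; replace mc_match if close
        if affinity(mc_cand, x) > match_aff:
            memory.append((mc_cand, y))
            if affinity(mc_cand, mc_match[0]) > replace_affinity:
                memory.remove(mc_match)
    return memory

def classify(memory, x, k=1):
    # k-nearest-neighbor vote over the memory cells.
    nearest = sorted(memory, key=lambda m: affinity(m[0], x), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [[0.10, 0.10], [0.15, 0.10], [0.90, 0.90],
        [0.85, 0.95], [0.12, 0.08], [0.88, 0.90]]
labels = [0, 0, 1, 1, 0, 1]
memory = train_airs(data, labels)
print(len(memory), classify(memory, [0.10, 0.12]), classify(memory, [0.90, 0.88]))
```

The bound max_rounds is a safeguard added for this sketch so that the refinement loop always terminates; the outline itself only specifies the average-affinity test in step (f).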

Once this training routine is complete, AIRS classifies the instances using k-nearest neighbor with the developed set of memory cells.

Unsupervised Immune-Inspired Learning

The artificial immune network (aiNET) algorithm was introduced as one of the first immune-inspired unsupervised learning algorithms and has subsequently gone through a period of study and refinement (de Castro & Von Zuben, ). To use classifications from de Castro and Timmis (), for the procedures of adaptation, aiNET is an immune network type of immune-inspired algorithm. The representation and affinity layers of the system are standard (the same as in AIRS). aiNET has its origin in another immune-inspired algorithm, CLONALG (the same forerunner as for AIRS), and resembles CLONALG in the sense that both algorithms are (again) concerned with developing a set of memory cells that give a representation of the learned environment. However, within aiNET there is no error feedback into the learning process. The learning process can be outlined as follows:

1. Randomly initialize a population P
2. For each pattern to be recognized, do
(a) Calculate the affinity of each B-cell (b) in the network for an instance of the pattern being learned
(b) Select a number of elements from P into a clonal pool C
(c) Mutate each element of C proportional to affinity to the pattern being learned (the higher the affinity, the less mutation applied)
(d) Select the highest affinity members of C to remain in the set C and remove the remaining elements
(e) Calculate the affinity between all members of C and remove elements in C that have an affinity below a certain threshold (user defined)
(f) Combine the elements of C with the set P
(g) Introduce a random number of randomly created elements into P to maintain diversity
3. Repeat from (a) until a stopping criterion is met

Once this training routine is complete, the minimum-spanning tree algorithm is applied to the network to extract the clusters from within the network.
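As a rough illustration of the training loop outlined above, the following Python sketch works on real-valued vectors in the unit hypercube. Several choices are assumptions made for this sketch only: Euclidean distance stands in for affinity (small distance means high affinity to a pattern), mutation moves a clone toward the pattern with a step that shrinks as the clone gets closer, step (e) is interpreted as removing clones that lie closer than a suppression threshold to a clone already kept, the stopping criterion is a fixed number of epochs, and a pruning pass over the whole network is added at the end of each epoch to keep it compact. All names and default values are hypothetical, and the final minimum-spanning-tree clustering step is not shown.

```python
import math
import random

def suppress_similar(cells, threshold):
    # Keep a cell only if it is at least `threshold` away from every kept cell.
    kept = []
    for c in cells:
        if all(math.dist(c, k) >= threshold for k in kept):
            kept.append(c)
    return kept

def train_ainet(data, n_init=10, n_select=4, n_keep=2, n_new=2,
                suppression=0.1, epochs=5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])

    def random_cell():
        return [rng.random() for _ in range(dim)]

    P = [random_cell() for _ in range(n_init)]      # step 1
    for _ in range(epochs):                         # step 3: fixed epoch budget
        for x in data:                              # step 2
            # (a)+(b) rank cells by closeness to x and select the best ones
            C = sorted(P, key=lambda c: math.dist(c, x))[:n_select]
            # (c) move each clone toward x; closer clones take smaller steps
            mutated = []
            for c in C:
                alpha = rng.random() * min(1.0, math.dist(c, x))
                mutated.append([ci + alpha * (xi - ci) for ci, xi in zip(c, x)])
            # (d) keep only the clones closest to x
            C = sorted(mutated, key=lambda c: math.dist(c, x))[:n_keep]
            # (e) remove clones that are too similar to one another
            C = suppress_similar(C, suppression)
            # (f)+(g) merge into the network and add random newcomers
            P = P + C + [random_cell() for _ in range(n_new)]
        # Assumption beyond the outline: prune near-duplicates once per epoch,
        # preferring cells that lie close to some data point.
        P.sort(key=lambda c: min(math.dist(c, x) for x in data))
        P = suppress_similar(P, suppression)
    return P

data = [[0.10, 0.10], [0.15, 0.12], [0.12, 0.18],
        [0.85, 0.90], [0.90, 0.85], [0.88, 0.92]]
network = train_ainet(data)
print(len(network), "network cells")
```

Because random newcomers are added in step (g), the returned network also contains cells far from the data; a fuller implementation would typically remove such cells before extracting clusters.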


Recommended Reading
Bersini, H. (). Immune network and adaptive control. In Proceedings of the st European conference on artificial life (ECAL) (pp. –). Cambridge, MA: MIT Press.
Cooke, D., & Hunt, J. (). Recognising promoter sequences using an artificial immune system. In Proceedings of intelligent systems in molecular biology (pp. –). California: AAAI Press.
de Castro, L. N., & Timmis, J. (). Artificial immune systems: A new computational intelligence approach. New York: Springer.
de Castro, L. N., & Von Zuben, F. J. (). aiNet: An artificial immune network for data analysis (pp. –). Hershey, PA: Idea Group Publishing.
Farmer, J. D., Packard, N. H., & Perelson, A. S. (). The immune system, adaptation, and machine learning. Physica D, , –.
Forrest, S., Perelson, A. S., Allen, L., & Cherukuri, R. (). Self–nonself discrimination in a computer. In Proceedings of the IEEE symposium on research security and privacy (pp. –).
Freitas, A., & Timmis, J. (). Revisiting the foundations of artificial immune systems: A problem oriented perspective, LNCS (Vol. ) (pp. –). New York: Springer.
Hart, E., & Timmis, J. (). Application areas of AIS: The past, present and the future. Journal of Applied Soft Computing, (), –.
Neal, M. (). An artificial immune system for continuous analysis of time-varying data. In J. Timmis & P. Bentley (Eds.), Proceedings of the st international conference on artificial immune system (ICARIS) (pp. –). Canterbury, UK: University of Kent Printing Unit.
Secker, A., Freitas, A., & Timmis, J. (). AISEC: An artificial immune system for email classification. In Proceedings of congress on evolutionary computation (CEC) (pp. –).
Timmis, J., & Bentley, P. (Eds.). (). Proceedings of the st international conference on artificial immune system (ICARIS). Canterbury, UK: University of Kent Printing Unit.
Timmis, J., & Neal, M. (). A resource limited artificial immune system for data analysis. Knowledge Based Systems, (–), –.
Varela, F., Coutinho, A., Dupire, B., & Vaz, N. (). Cognitive networks: Immune, neural and otherwise. Journal of Theoretical Immunology, , –.
Watkins, A. (). AIRS: A resource limited artificial immune classifier. Master’s thesis, Mississippi State University.
Watkins, A. (). Exploiting immunological metaphors in the development of serial, parallel and distributed learning algorithms. PhD thesis, University of Kent.

Artificial Life
Artificial Life is an interdisciplinary research area trying to reveal and understand the principles and organization of living systems. Its main goal is to artificially synthesize life-like behavior from scratch in computers or other artificial media. Important topics in artificial life include the origin of life, growth and development, evolutionary and ecological dynamics, adaptive autonomous robots, emergence and self-organization, social organization, and cultural evolution.

Artificial Neural Networks
An artificial neural network (ANN) is a computational model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.

Cross References Adaptive Resonance Theory; Backpropagation; Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity; Boltzmann Machines; Cascade Correlation; Competitive Learning; Deep Belief Networks; Evolving Neural Networks; Hypothesis Language; Neural Network Topology; Neuroevolution; Radial Basis Function Networks; Reservoir Computing; Self-Organizing Maps; Simple Recurrent Networks; Weights

Artificial Societies

Jürgen Branke University of Warwick, Coventry, UK

Synonyms Agent-based computational models; Agent-based modeling and simulation; Agent-based simulation models


Definition An artificial society is an agent-based, computer-implemented simulation model of a society or group of people, usually restricted to their interaction in a particular situation. Artificial societies are used in economics and social sciences to explain, understand, and analyze socioeconomic phenomena. They provide scientists with a fully controllable virtual laboratory to test hypotheses and observe complex system behavior emerging as a result of the agents’ interaction. They allow social theories to be formalized and tested using computer code, and make it possible to apply experimental methods to social phenomena, or at least to their computer representations, on a large scale. Because the designer is free to choose any desired agent behavior as long as it can be implemented, research based on artificial societies is not restricted by assumptions typical in classical economics, such as homogeneity and full rationality of agents. Overall, artificial societies have added an entirely new dimension to research in economics and social sciences and have resulted in a new research field called “agent-based computational economics.” Artificial societies should be distinguished from virtual worlds and artificial life. The term virtual world is usually used for virtual environments to interact with, as, e.g., in computer games. In artificial life, the goal is more to learn about biological principles, understand how life could emerge, and create life within a computer.

Motivation and Background Classical economics can be roughly divided into analytical and empirical approaches. The former uses deduction to derive theorems from assumptions. Analytical models usually include a number of simplifying assumptions in order to keep the model tractable, the most typical being full rationality and homogeneity of agents. Also, analytical economics is often limited to equilibrium calculations. Classical empirical economics collects data from the real world, and derives patterns and regularities inductively. In recent years, the tremendous increase in available computational power has given rise to a new branch of economics and sociology which uses simulation of artificial societies as a tool to generate new insights.

Artificial societies are agent-based, computer-implemented simulation models of real societies or a group of people in a specific situation. They are built from the bottom up, by specifying the behavior of the agents in different situations. The simulation then reveals the emerging global behavior of the system, and thus provides a link between micro-level behavior of the agents and macro-level characteristics of the system. Using simulation, researchers can now carry out social experiments under fully controlled and reproducible laboratory conditions, trying out different configurations and observing the consequences. Like deduction, simulation models are based on a set of clearly specified assumptions as written down in a computer program. This is then used to generate data, from which regularities and patterns are derived inductively. As such, research based on artificial societies stands somewhere between the classical analytical and empirical social sciences. One of the main advantages of artificial societies is that they make it possible to consider very complex scenarios where agents are heterogeneous, boundedly rational, or have the ability to learn. Also, they make it possible to observe evolution over time, instead of just the equilibrium. Artificial societies can be used for many purposes, e.g.:

1. Verification: Test a hypothesis or theory by examining its validity in relevant, clearly defined scenarios.
2. Explanation: Construct an artificial society which shows the same behavior as the real society. Then analyze the model to explain the emergent behavior.
3. Prediction: Run a model of an existing society into the future. Also, feed the model with different input parameters and use the result as a prediction on how the society would react.
4. Optimization: Test different strategies in the simulation environment, trying to find a best possible strategy.
5. Existence proof: Demonstrate that a specific simulation model is able to generate a certain global behavior.
6. Discovery: Play around with parameter settings, discovering new interdependencies and gaining new insights.
7. Training and education: Use simulation as demonstrator.


Structure of the Learning System Using artificial societies requires the usual steps in model building and experimental science, including

1. Developing a conceptual model
2. Building the simulation model
3. Verification (making sure the model is correct)
4. Validation (making sure the model is suitable to answer the posed questions)
5. Simulation and analysis using an appropriate experimental design.

Research on artificial societies is an interdisciplinary area involving, among others, computer science, psychology, economics, sociology, and biology.

Important Aspects

The modeling, simulation, and analysis process described in the previous section is rather complex and only remotely connected to machine learning. Thus, instead of a detailed description of all steps, the following focuses on aspects particularly interesting from a machine learning point of view.

Modeling Learning

One of the main advantages of artificial societies is that they can account for boundedly rational and learning agents. For that, one has to specify (in the form of a program) exactly how agents decide and learn. In principle, all the learning algorithms developed in machine learning could be used, and many have been used successfully, including reinforcement learning, artificial neural networks, and evolutionary algorithms. However, note that the choice of a learning algorithm is not determined by its learning speed and efficiency (as usual in machine learning), but by how well it reflects human learning in the considered scenario, at least if the goal is to construct an artificial society which allows conclusions to be transferred to the real world. As a consequence, many learning models used in artificial societies are motivated by psychology. The most suitable model depends on the simulation context, e.g., on whether the simulated learning process is conscious or nonconscious, or on the time and effort an individual may be expected to spend on a particular decision.


Besides individual learning (i.e., learning from one's own past experience), artificial societies usually feature social learning (where one agent learns by observing others), and cultural learning (e.g., the evolution of norms). While the latter simply emerges from the interaction of the agents, the former has to be modeled explicitly. Several different models for learning in artificial societies are discussed in Brenner (). One popular learning paradigm that can be used as a model for individual as well as social learning is evolutionary algorithms (EAs). Several studies suggest that EAs are indeed an appropriate model for learning in artificial societies, either based on comparisons of simulations with human subject experiments or based on comparisons with other learning mechanisms such as reinforcement learning (Duffy, ). As EAs are successful search strategies, they seem particularly suitable if the space of possible actions or strategies is very large. If used to model individual learning, each agent uses a separate EA to search for a better personal solution. In this case, the EA population represents the different alternative actions or strategies that an agent considers. The genetic operators crossover and mutation are clearly related to two major ingredients of human innovation: combination and variation. Crossover can be seen as deriving a new concept by combining two known concepts, and mutation corresponds to a small variation of an existing concept. So, the agent, in some sense, creatively tries out new possibilities. Selection, which favors the best solutions found so far, models the learning part. A solution’s quality is usually assessed by evaluating it in a simulation assuming all other agents keep their behavior. For modeling social learning, EAs can be used in two different ways. In both cases, the population represents the actions or strategies of the different agents in the population.
From this it follows that the population size corresponds to the number of agents in the simulation. Fitness values are calculated by running the simulation and observing how the different agents perform. Crossover is now seen as a model for information exchange, or imitation, among agents. Mutation, as in the individual learning case, is seen as a small variation of an existing concept. The first social learning model simply uses a standard EA, i.e., selection chooses agents to “reproduce,” and the resulting new agent strategy replaces an old strategy in the population. While this allows the use of standard EA libraries, it does not provide a direct link between agents in the simulation and individuals in the EA population. In the second social learning model, each agent directly corresponds to an individual in the EA. In every iteration, each agent creates and tests a new strategy as follows. First, it selects a “donor” individual, with preference given to successful individuals. Then it performs a crossover of its own strategy and the donor’s strategy, and mutates the result. This can be regarded as an agent observing other agents, and partially adopting the strategies of successful other agents. Then, the resulting new strategy is tested in a “thought experiment,” by testing whether the agent would be better off with the new strategy compared with its current strategy, assuming all other agents keep their behavior. If the new strategy performs better, it replaces the current strategy in the next iteration. Otherwise, the new strategy is discarded and the agent again uses its old strategy in the next iteration. The testing of new strategies against their parents has been termed the election operator in Arifovic (), and makes sure that some very bad and obviously implausible agent strategies never enter the artificial society.

Examples
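As a small worked example of the second social learning model with the election operator, consider the following Python sketch. The setting is hypothetical and chosen only for illustration: each agent's strategy is a single production quantity, the price falls linearly with total output, donors are chosen by a binary tournament on current payoffs, crossover is a random blend of the two quantities, and mutation is Gaussian noise. None of these modeling choices is taken from Arifovic ().

```python
import random

def profit(q_own, q_others, a=10.0, b=1.0, unit_cost=1.0):
    # Hypothetical market: the price falls linearly with total output.
    price = max(0.0, a - b * (q_own + q_others))
    return (price - unit_cost) * q_own

def social_learning_step(strategies, rng, sigma=0.1):
    """One iteration in which every agent creates and tests a new strategy."""
    n = len(strategies)
    total = sum(strategies)
    payoffs = [profit(q, total - q) for q in strategies]
    updated = []
    for i, q in enumerate(strategies):
        # Donor selection: binary tournament, preferring the more successful agent.
        j, k = rng.randrange(n), rng.randrange(n)
        donor = strategies[j] if payoffs[j] >= payoffs[k] else strategies[k]
        # Crossover (blend of own and donor strategy) followed by mutation.
        w = rng.random()
        candidate = max(0.0, w * q + (1.0 - w) * donor + rng.gauss(0.0, sigma))
        # Election operator: adopt the candidate only if it would have done
        # better than the current strategy, all other agents kept fixed.
        if profit(candidate, total - q) > payoffs[i]:
            updated.append(candidate)
        else:
            updated.append(q)
    return updated

rng = random.Random(1)
strategies = [rng.uniform(0.0, 5.0) for _ in range(5)]
for _ in range(200):
    strategies = social_learning_step(strategies, rng)
print([round(q, 2) for q in strategies])
```

All agents update simultaneously from the same snapshot of the population, so the "thought experiment" of each agent is evaluated against the other agents' current behavior, as described in the text.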

One of the first forerunners of artificial societies was Schelling’s segregation model (Schelling, ). In this study, Schelling placed some artificial agents of two different colors on a simple grid. Each agent follows a simple rule: if fewer than a given percentage of agents in the neighborhood have the same color, the agent moves to a random free spot. Otherwise, it stays. As the simulation shows, in this model, segregation of agent colors could be observed even if every individual agent was satisfied to live in a neighborhood with only % of its neighbors being of the same color. Thus, with this simple model, Schelling demonstrated that segregation of races in suburbs can occur even if each individual would be happy to live in a diverse neighborhood. Note that the simulations were actually not implemented on a computer but carried out by moving coins on a grid by hand. Other milestones in artificial societies are certainly the work by Epstein and Axtell on their “sugarscape” model (Epstein & Axtell, ), and the Santa Fe artificial stock market (Arthur, Holland, LeBaron, Palmer, & Taylor, ). In the former, agents populate a simple grid world, with sugar growing as the only resource. The agents need the sugar for survival, and can move around to collect it. Axtell and Epstein have shown that even with agents following some very simple rules, the emerging behavior of the overall system can be quite complex and similar in many aspects to observations in the real world, e.g., showing a similar wealth distribution or population trajectories. The latter is a simple model of a stock market with only a single stock and a risk-free fixed-interest alternative. This model has subsequently been refined and studied by many researchers. One remarkable result of the first model was to demonstrate that technical trading can actually be a viable strategy, something widely accepted in practice, but which classical analytical economics struggled to explain. One of the most sophisticated artificial societies is perhaps the model of the Anasazi tribe, who left their dwellings in the Long House Valley in northeastern Arizona for so far unknown reasons around AD (Axtell et al., ). By building an artificial society of this tribe and the natural surroundings (climate etc.), it was possible to replicate macro behavior which is known to have occurred and provide a possible explanation for the sudden move. The NewTies project (Gilbert et al., ) has a different and quite ambitious focus: it constructs artificial societies with the hope of an emerging artificial language and culture, which then might be studied to help explain how language and culture formed in human societies.

Software Systems

Agent-based simulations can be facilitated by using specialized software libraries such as Ascape, NetLogo, Repast, StarLogo, MASON, and Swarm. A comparison of different libraries can be found in Railsback, Lytinen, and Jackson ().
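Very simple artificial societies can also be written without a dedicated library. As an illustration, the following self-contained Python sketch implements a segregation model in the spirit of Schelling's example described above. The grid size, the number of agents, the use of the eight surrounding cells as the neighborhood, and the 30% threshold are arbitrary assumptions for this sketch and are not taken from Schelling's study.

```python
import random

def make_grid(size, n_per_color, rng):
    # Two colors (0 and 1) plus empty cells (None), shuffled onto a square grid.
    cells = [0] * n_per_color + [1] * n_per_color
    cells += [None] * (size * size - 2 * n_per_color)
    rng.shuffle(cells)
    return [cells[r * size:(r + 1) * size] for r in range(size)]

def occupied_neighbors(grid, size, r, c):
    # Colors of the agents in the (up to eight) surrounding cells.
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and rr in range(size) and cc in range(size):
                if grid[rr][cc] is not None:
                    found.append(grid[rr][cc])
    return found

def same_color_share(grid, size, r, c):
    nbrs = occupied_neighbors(grid, size, r, c)
    if not nbrs:
        return 1.0  # assumption: an agent without neighbors is content
    return sum(n == grid[r][c] for n in nbrs) / len(nbrs)

def step(grid, size, threshold, rng):
    """Let every discontented agent move to a random free cell; return the move count."""
    agents = [(r, c) for r in range(size) for c in range(size)
              if grid[r][c] is not None]
    rng.shuffle(agents)
    moved = 0
    for r, c in agents:
        if same_color_share(grid, size, r, c) >= threshold:
            continue
        free = [(i, j) for i in range(size) for j in range(size)
                if grid[i][j] is None]
        if not free:
            break
        i, j = rng.choice(free)
        grid[i][j], grid[r][c] = grid[r][c], None
        moved += 1
    return moved

def mean_share(grid, size):
    shares = [same_color_share(grid, size, r, c)
              for r in range(size) for c in range(size) if grid[r][c] is not None]
    return sum(shares) / len(shares)

rng = random.Random(0)
size = 10
grid = make_grid(size, 40, rng)
print("mean same-color share before:", round(mean_share(grid, size), 2))
for _ in range(20):
    if step(grid, size, 0.3, rng) == 0:
        break
print("mean same-color share after: ", round(mean_share(grid, size), 2))
```

Comparing the two printed values for different thresholds is a simple way to explore how individual tolerance relates to the aggregate pattern.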

Applications Artificial societies have many practical applications, from rather simple simulation models to very complex economic decision problems; examples include traffic simulation, market design, evaluation of vaccination programs, evacuation plans, or supermarket layout optimization. See, e.g., Bonabeau () for a discussion of several such applications.

Future Directions, Challenges The science of artificial societies is still in its infancy, but the field is burgeoning and has already produced some remarkable results. Major challenges lie in the model building, calibration, and validation of the artificial society simulation model. Although several agent-based modeling toolkits are available, there is a lot to be gained by making them more flexible, intuitive, and user-friendly, allowing users to construct complex models simply by selecting and combining provided building blocks of agent behavior. Behavioral cloning may be a suitable machine learning approach to generate representative agent models.

Cross References Artificial Life; Behavioral Cloning; Co-Evolutionary Learning; Multi-Agent Learning

Recommended Reading
Agent-based computational economics, website maintained by Tesfatsion ()
Axelrod: The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration (Axelrod, )
Bonabeau: Agent-based modeling (Bonabeau, )
Brenner: Agent learning representation: Advice on modeling economic learning (Brenner, )
Epstein: Generative social science (Epstein, )
Journal of Artificial Societies and Social Simulation ()
Tesfatsion and Judd (eds.): Handbook of computational economics (Tesfatsion & Judd, )

Arifovic, J. (). Genetic algorithm learning and the cobweb model. Journal of Economic Dynamics and Control, , –.
Arthur, B., Holland, J., LeBaron, B., Palmer, R., & Taylor, P. (). Asset pricing under endogenous expectations in an artificial stock market. In B. Arthur et al. (Eds.), The economy as an evolving complex system II (pp. –). Boston: Addison-Wesley.
Axelrod, R. (). The complexity of cooperation: Agent-based models of competition and collaboration. Princeton, NJ: Princeton University Press.
Axtell, R. L., Epstein, J. M., Dean, J. S., Gumerman, G. J., Swedlund, A. C., Harburger, J., et al. (). Population growth and collapse in a multiagent model of the Kayenta Anasazi in Long House Valley. Proceedings of the National Academy of Sciences, , –.
Bonabeau, E. (). Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences, , –.
Brenner, T. (). Agent learning representation: Advice on modelling economic learning. In L. Tesfatsion & K. L. Judd (Eds.), Handbook of computational economics (Vol. , pp. –). Amsterdam: North-Holland.
Duffy, J. (). Agent-based models and human subject experiments. In L. Tesfatsion & K. L. Judd (Eds.), Handbook of computational economics (Vol. , pp. –). Amsterdam: North-Holland.
Epstein, J. M. (). Generative social science: Studies in agent-based computational modeling. Princeton, NJ: Princeton University Press.
Epstein, J. M., & Axtell, R. (). Growing artificial societies. Washington, DC: Brookings Institution Press.
Gilbert, N., den Besten, M., Bontovics, A., Craenen, B. G. W., Divina, F., Eiben, A. E., et al. (). Emerging artificial societies through learning. Journal of Artificial Societies and Social Simulation, (). http://jasss.soc.surrey.ac.uk///.html.
Railsback, S. F., Lytinen, S. L., & Jackson, S. K. (). Agent-based simulation platforms: Review and development recommendations. Simulation, (), –.
Schelling, T. C. (). Dynamic models of segregation. Journal of Mathematical Sociology, , –.
Tesfatsion, L. (). Website on agent-based computational economics. http://www.econ.iastate.edu/tesfatsi/ace.htm.
Tesfatsion, L., & Judd, K. L. (Eds.) (). Handbook of computational economics – Vol : Agent-based computational economics. Amsterdam: Elsevier.
The journal of artificial societies and social simulation. http://jasss.soc.surrey.ac.uk/JASSS.html.

Assertion
In Minimum Message Length, the code or language shared between sender and receiver that is used to describe the model.

Association Rule
Hannu Toivonen University of Helsinki, Helsinki, Finland

Definition Association rules (Agrawal, Imieliński, & Swami, ) can be extracted from data sets where each example consists of a set of items. An association rule has the form X → Y, where X and Y are itemsets, and the interpretation is that if set X occurs in an example, then set Y is also likely to occur in the example. Each association rule is usually associated with two statistics measured from the given data set. The frequency or support of a rule X → Y, denoted fr(X → Y), is the number (or alternatively the relative frequency) of examples in which X ∪ Y occurs. Its confidence, in turn, is the observed conditional probability P(Y ∣ X) = fr(X ∪ Y)/fr(X). The Apriori algorithm (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, ) finds all association rules, between any sets X and Y, which exceed user-specified support and confidence thresholds. In association rule mining, unlike in most other learning tasks, the result thus is a set of rules concerning different subsets of the feature space. Association rules were originally motivated by supermarket basket analysis, but as a domain-independent technique they have found applications in numerous fields. Association rule mining is part of the larger field of frequent itemset or frequent pattern mining.
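The two statistics defined above can be computed directly from a data set. The following Python sketch does so for a tiny, made-up collection of shopping baskets; the function names and the example data are illustrative assumptions, and the code simply counts examples rather than implementing the Apriori algorithm.

```python
def support(examples, itemset):
    """fr(itemset): the number of examples that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(items.issubset(example) for example in examples)

def confidence(examples, x, y):
    """Observed P(Y | X) = fr(X u Y) / fr(X) for the rule X -> Y."""
    fr_x = support(examples, x)
    if fr_x == 0:
        raise ValueError("confidence is undefined when X never occurs")
    return support(examples, frozenset(x) | frozenset(y)) / fr_x

# Made-up baskets, one set of items per example.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

# Rule {bread} -> {milk}: bread and milk occur together in 3 baskets,
# bread occurs in 4 baskets, so the confidence is 3/4.
print(support(baskets, {"bread", "milk"}))        # 3
print(confidence(baskets, {"bread"}, {"milk"}))   # 0.75
```

Dividing the support count by the number of examples gives the relative-frequency variant of support mentioned in the definition.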

Cross References 7Apriori Algorithm 7Basket Analysis 7Frequent Itemset 7Frequent Pattern

Recommended Reading Agrawal, R., Imieliński, T., & Swami, A. (). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD international conference on management of data, Washington, DC (pp. –). New York: ACM. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. –). Menlo Park: AAAI Press.

Associative Bandit Problem 7Associative Reinforcement Learning

Associative Reinforcement Learning Alexander L. Strehl Rutgers University, USA

Synonyms Associative bandit problem; Bandit problem with side information; Bandit problem with side observations; One-step reinforcement learning

Definition The associative reinforcement-learning problem is a specific instance of the 7reinforcement learning problem whose solution requires generalization and exploration but not temporal credit assignment. In associative reinforcement learning, an action (also called an arm) must be chosen from a fixed set of actions during successive timesteps and from this choice a real-valued reward or payoff results. On each timestep, an input vector is provided that along with the action determines, often probabilistically, the reward. The goal is to maximize the expected long-term reward over a finite or infinite horizon. It is typically assumed that the action choices do not affect the sequence of input vectors. However, even if this assumption is not asserted, learning algorithms are not required to infer or model the relationship between input vectors from one timestep to the next. Requiring a learning algorithm to discover and reason about this underlying process results in the full reinforcement learning problem.

Motivation and Background The problem of associative reinforcement learning may be viewed as connecting the problems of 7supervised learning or 7classification, which is more specific, and reinforcement learning, which is more general. Its study is motivated by real-world applications such as choosing which internet advertisements to display based on information about the user or choosing which stock to buy based on current information related to the market. Both problems are distinguished from supervised learning by the absence of labeled training examples to learn from. For instance, in the advertisement problem, the learner is never told which ads would have resulted in the greatest expected reward (in this problem, reward is determined by whether an ad is clicked on or not). In the stock problem, the best choice is never revealed since the choice itself affects the future price of the stocks and therefore the payoff.

The Learning Setting The learning problem consists of the following core objects:

● An input space $X$, which is a set of objects (often a subset of the $n$-dimensional Euclidean space $\mathbb{R}^n$).
● A set of actions or arms $A$, which is often a finite set of size $k$.
● A distribution $D$ over $X$. In some cases, $D$ is allowed to be time-dependent and may be denoted $D_t$ on timestep $t$ for $t = 1, 2, \ldots$.

A learning sequence proceeds as follows. During each timestep $t = 1, 2, \ldots$, an input vector $x_t \in X$ is drawn according to the distribution $D$ and is provided to the algorithm. The algorithm selects an arm $a_t \in A$. This choice may be stochastic and may depend on all previous inputs and rewards observed by the algorithm, as well as all action choices made by the algorithm on timesteps $1, 2, \ldots, t-1$. Then, the learner receives a payoff $r_t$ generated according to some unknown stochastic process that depends only on $x_t$ and $a_t$. The informal goal is to maximize the expected long-term payoff. Let $\pi : X \to A$ be any policy that maps input vectors to actions. Let

$$V^{\pi}(T) := E\left[\sum_{i=1}^{T} r_i \,\middle|\, a_i = \pi(x_i) \text{ for } i = 1, 2, \ldots, T\right]$$

denote the expected total reward over $T$ steps obtained by choosing arms according to policy $\pi$. The expectation is taken over any randomness in the generation of input vectors $x_i$ and rewards $r_i$. The expected regret of a learning algorithm with respect to policy $\pi$ is defined as $V^{\pi}(T) - E\left[\sum_{i=1}^{T} r_i\right]$, the expected difference between the return from following policy $\pi$ and the actual obtained return.

Power of Side Information

Wang, Kulkarni, and Poor () studied the associative reinforcement learning problem from a statistical viewpoint. They considered the setting with two actions and analyzed the expected inferior sampling time, which is the number of times that the lesser action, in terms of expected reward, is selected. The function mapping input vectors to conditional reward distributions belongs to a known parameterized class of functions, with the true parameters being unknown. They show that, under some mild conditions, an algorithm can achieve finite expected inferior sampling time. This demonstrates the power provided by the input vectors (also called side observations or side information), because such a result is not possible in the standard multi-armed bandit problem, which corresponds to the associative reinforcement-learning problem without input vectors $x_i$. Intuitively, this type of result is possible when the side information can be used to infer the payoff function of the optimal action.

Linear Payoff Functions

In its most general setting, the associative reinforcement learning problem is intractable. One way to rectify this problem is to assume that the payoff function is described by a linear system. For instance, Abe and Long () and Auer () consider a model where during each timestep $t$, there is a vector $z_{t,i}$ associated with each arm $i$. The expected payoff of pulling arm $i$ on this timestep is given by $\theta^T z_{t,i}$, where $\theta$ is an unknown parameter vector and $\theta^T$ denotes the transpose of $\theta$. This framework maps to the framework described above by taking $x_t = (z_{t,1}, z_{t,2}, \ldots, z_{t,k})$. They assume a time-dependent distribution $D$ and focus on obtaining bounds on the regret against the optimal policy. Assuming that all rewards lie in the interval $[0, 1]$, the worst possible regret of any learning algorithm is linear in $T$. When considering only the number of timesteps $T$, Auer () shows that a regret (with respect to the optimal policy) of $O(\sqrt{T} \ln(T))$ can be obtained.

PAC Associative Reinforcement Learning

The previously mentioned works analyze the growth rate of the regret of a learning algorithm with respect to the optimal policy. Another way to approach the problem is to allow the learner some number of timesteps of exploration. After the exploration trials, the algorithm is required to output a policy. More specifically, given inputs $0 < \epsilon < 1$ and $0 < \delta < 1$, the algorithm is required to output an $\epsilon$-optimal policy with probability at least $1 - \delta$. This type of analysis is based on the work by Valiant (), and learning algorithms satisfying the above condition are termed probably approximately correct (PAC). Motivated by the work of Kaelbling (), Fiechter () developed a PAC algorithm when the true payoff function can be described by a decision list over the action and input vector. Building on both works, Strehl, Mesterharm, Littman, and Hirsh () showed that a class of associative reinforcement learning problems can be solved efficiently, in a PAC sense, when given a learning algorithm for efficiently solving classification problems.
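As a rough illustration of the learning protocol and of regret in the linear-payoff model described earlier in this entry, the following sketch simulates a problem in which arm $i$ has expected payoff $\theta^T z_{t,i}$. The learner shown here, an ε-greedy rule on top of a ridge-regularized least-squares estimate of $\theta$, is an assumption chosen for brevity. It is not the algorithm of Abe and Long or of Auer, and it carries no PAC or regret guarantee. All names, dimensions, and constants are hypothetical.

```python
import numpy as np


def run_linear_bandit(horizon=2000, k=3, dim=4, epsilon=0.1, seed=0):
    """Simulate one learning sequence and return the accumulated expected regret
    against the optimal policy (which always pulls the arm maximizing theta^T z)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=dim)
    theta /= theta.sum()          # keeps every expected payoff theta^T z in [0, 1]

    gram = np.eye(dim)            # ridge-regularized sum of z z^T over pulled arms
    moment = np.zeros(dim)        # sum of reward * z over pulled arms
    regret = 0.0
    for _ in range(horizon):
        z = rng.uniform(0.0, 1.0, size=(k, dim))   # x_t = (z_{t,1}, ..., z_{t,k})
        expected = z @ theta                        # true expected payoff of each arm
        theta_hat = np.linalg.solve(gram, moment)   # current least-squares estimate
        if rng.random() < epsilon:
            arm = int(rng.integers(k))              # explore: uniformly random arm
        else:
            arm = int(np.argmax(z @ theta_hat))     # exploit the current estimate
        reward = float(rng.random() < expected[arm])  # Bernoulli payoff r_t
        gram += np.outer(z[arm], z[arm])
        moment += reward * z[arm]
        regret += expected.max() - expected[arm]
    return regret


learner_regret = run_linear_bandit(epsilon=0.1)
random_regret = run_linear_bandit(epsilon=1.0)   # epsilon = 1 ignores the side information
```

Setting `epsilon=1.0` gives a baseline that never uses the input vectors, so its regret grows linearly with the horizon. The gap between the two numbers gives a feel for what the side information is worth in this toy problem.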

Recommended Reading Section . of the survey by Kaelbling, Littman, and Moore () presents a nice overview of several techniques for the associative reinforcement-learning problem, such as CRBP (Ackley and Littman, ), ARC (Sutton, ), and REINFORCE (Williams, ). Abe, N., & Long, P. M. (). Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the th international conference on machine learning (pp. –). Ackley, D. H., & Littman, M. L. (). Generalization and scaling in reinforcement learning. In Advances in neural information processing systems (pp. –). San Mateo, CA: Morgan Kaufmann. Auer, P. (). Using confidence bounds for exploitation–exploration trade-offs. Journal of Machine Learning Research, , –. Fiechter, C.-N. (). PAC associative reinforcement learning. Unpublished manuscript. Kaelbling, L. P. (). Associative reinforcement learning: Functions in k-DNF. Machine Learning, , –. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, , –. Strehl, A. L., Mesterharm, C., Littman, M. L., & Hirsh, H. (). Experience-efficient learning in associative bandit problems. In ICML-: Proceedings of the rd international conference on machine learning, Pittsburgh, Pennsylvania (pp. –). Sutton, R. S. (). Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts, Amherst, MA. Valiant, L. G. (). A theory of the learnable. Communications of the ACM, , –. Wang, C.-C., Kulkarni, S. R., & Poor, H. V. (). Bandit problems with side observations. IEEE Transactions on Automatic Control, , –. Williams, R. J. (). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, , –.

Attribute Chris Drummond National Research Council of Canada, Ottawa, ON, Canada

Synonyms Characteristic; Feature; Property; Trait

Definition Attributes are properties of things, ways that we, as humans, might describe them. If we were talking about the appearance of our friends, we might describe one of them as “sex female,” “hair brown,” “height ft in.” Linguistically, this is rather terse, but this very terseness has the advantage of limiting ambiguity. The attributes are sex, hair color, and height. For each friend, we could give the appropriate values to go along with each attribute; some examples are shown in Table . Attribute-value pairs are a standard way of describing things within the machine learning community. Traditionally, values have come in one of three types: binary (sex has two values), nominal (hair color has many values), and real (height has an ordered set of values). Ideally, the attribute-value pairs are sufficient to describe some things accurately and to tell them apart from others. What might be described is very varied, so the attributes themselves will vary widely.

Motivation and Background For machine learning to be successful, we need a language to describe everyday things that is sufficiently powerful to capture the similarities and differences between them and yet is computationally easy to manage. The idea that a sufficient number of attribute-value pairs would meet this requirement is an intuitive one. It has also been studied extensively in philosophy and psychology, as a way that humans represent things mentally. In the early days of artificial intelligence research, the frame (Minsky, ) became a common way of representing knowledge. We have, in many ways, inherited this representation, attribute-value pairs sharing much in common with the labeled slots for values used in frames. In addition, the data for many practical problems comes in this form. Popular methods of storing and manipulating data such as relational databases, and less formal structures such as spreadsheets, have columns as attributes and cells as values. So, attribute-value pairs are a ubiquitous way of representing data.

Attribute. Table Some friends

Sex      Hair color   Height
Male     Black        ft in.
Female   Brown        ft in.
Female   Blond        ft in.
Male     Brown        ft in.

Future Directions The notion of an attribute-value pair is so well entrenched in machine learning that it is difficult to perceive what might replace it. Because the data in many practical applications comes in this form, this representation will undoubtedly be around for some time. One change that is occurring is the growing complexity of attribute-values. Traditionally, we have used the simple value types (binary, nominal, and real) discussed earlier. But to effectively describe many things, we need to extend this simple language and use more complex values. For example, in 7data mining applied to multimedia, new and more complex representations abound. Sound and video streams, images, and various properties of them are just a few examples (Cord et al., ; Simoff & Djeraba, ). Perhaps the most significant change is away from attributes, albeit with complex values, to structural forms where the relationship between things is included. As Quinlan () states, “Data may concern objects or observations with arbitrarily complex structure that cannot be captured by the values of a predetermined set of attributes.” There is a large and growing community of researchers in 7relational learning. This is evidenced by the number, and growing frequency, of recent workshops at the International Conference on Machine Learning (Cord et al., ; De Raedt & Kramer, ; Dietterich, Getoor, & Murphy, ; Fern, Getoor, & Milch, ).


Limitations In philosophy there is the idea of essence, the properties an object must have to be what it is. In machine learning, particularly in practical applications, we get what we are given and have little control in the choice of attributes and their range of values. If domain experts have chosen the attributes, we might hope that they are properties that can be readily ascertained and are relevant to the task at hand. For example, when describing one of our friends, we would not say Fred is the one with the spleen. It is not only difficult to observe, it is also poor at discriminating between people. Data are collected for many reasons. In medical applications, all sorts of attribute-values would be collected on patients. Most are unlikely to be important to the current task. An important part of learning is 7feature extraction, determining which attributes are necessary for learning. Whether or not attribute-value pairs are an essential representation for the type of learning required in the development, and functioning, of intelligent agents remains to be seen. Attribute-values readily capture symbolic information, typically at the level of words that humans naturally use. But if our agents need to move around in their environment, recognizing what they encounter, we might need a different nonlinguistic representation. Certainly, other representations based on a much finer granularity of features, and more holistic in nature, have been central to areas such as 7neural networks for some time. In research into 7dynamic systems, attractors in a sensor space might be more realistic than attribute-values (see chapter on 7Classification).

Recommended Reading Cord, M., Dahyot, R., Cunningham, P., & Sziranyi, T. (Eds.). (). Workshop on machine learning techniques for processing multimedia content. In Proceedings of the twenty-second international conference on machine learning. De Raedt, L., & Kramer, S. (Eds.). (). In Proceedings of the seventeenth international conference on machine learning. Workshop on attribute-value and relational learning: Crossing the boundaries, Stanford University, Palo Alto, CA. Dietterich, T., Getoor, L., & Murphy, K. (Eds.). (). In Proceedings of the twenty-first international conference on machine learning. Workshop on statistical relational learning and its connections to other fields. Fern, A., Getoor, L., & Milch, B. (Eds.). (). In Proceedings of the twenty-fourth international conference on machine learning. Workshop on open problems in statistical relational learning. Minsky, M. (). A framework for representing knowledge. Technical report, Massachusetts Institute of Technology, Cambridge, MA. Quinlan, J. R. (). Learning first-order definitions of functions. Journal of Artificial Intelligence Research, , –. Simoff, S. J., & Djeraba, C. (Eds.). (). In Proceedings of the sixth international conference on knowledge discovery and data mining. Workshop on multimedia data mining.

Attribute Selection 7Feature Selection

Attribute-Value Learning Attribute-value learning refers to any learning task in which each 7Instance is described by the values of some finite set of attributes (see 7Attribute). Each of these instances is often represented as a vector of attribute values, each position in the vector corresponding to a unique attribute.
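A minimal sketch of this representation follows. The attribute names, their order, and the instance values (including the heights in centimetres) are hypothetical and serve only to show how each vector position is tied to one attribute.

```python
# Fixed attribute order: position i of every vector holds the value of ATTRIBUTES[i].
ATTRIBUTES = ("sex", "hair_color", "height_cm")


def to_vector(instance):
    """Turn an attribute-value mapping into a vector with one position per attribute."""
    return tuple(instance[name] for name in ATTRIBUTES)


# Hypothetical instances; the values are made up for illustration.
friends = [
    {"sex": "male", "hair_color": "black", "height_cm": 180.0},
    {"hair_color": "brown", "height_cm": 165.0, "sex": "female"},
]

vectors = [to_vector(friend) for friend in friends]
```

Because the attribute order is fixed once, two instances can be compared position by position regardless of the order in which their attribute-value pairs were originally recorded.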

AUC 7Area Under Curve

Autonomous Helicopter Flight Using Reinforcement Learning Adam Coates, Pieter Abbeel, Andrew Y. Ng Stanford University, Stanford, CA, USA University of California, Berkeley, CA, USA Stanford University, Stanford, CA, USA

Definition Helicopter flight is a highly challenging control problem. While it is possible to obtain controllers for simple maneuvers (like hovering) by traditional manual design procedures, this approach is tedious and typically requires many hours of adjustments and flight testing, even for an experienced control engineer. For complex maneuvers, such as aerobatic routines, this approach is likely infeasible. In contrast, 7reinforcement learning (RL) algorithms enable faster and more automated design of controllers. Model-based RL algorithms have been used successfully for autonomous helicopter flight for hovering, forward flight, and, using apprenticeship learning methods, expert-level aerobatics. In model-based RL, one first builds a model of the helicopter dynamics and specifies the task using a reward function. Then, given the model and the reward function, the RL algorithm finds a controller that maximizes the expected sum of rewards accumulated over time.

Motivation and Background Autonomous helicopter flight represents a challenging control problem and is widely regarded as being significantly harder than control of fixed-wing aircraft (see, e.g., Leishman, (); Seddon, ()). At the same time, helicopters provide unique capabilities such as in-place hover, vertical takeoff and landing, and low-speed maneuvering. These capabilities make helicopter control an important research problem for many practical applications. Building autonomous flight controllers for helicopters, however, is far from trivial. When done by hand, it can require many hours of tuning by experts with extensive prior knowledge about helicopter dynamics. Meanwhile, the automated development of helicopter controllers has been a major success story for RL methods. Controllers built using RL algorithms have established state-of-the-art performance for basic flight maneuvers, such as hovering and forward flight (Bagnell & Schneider, ; Ng, Kim, Jordan, & Sastry, ), and are among the only successful methods for advanced aerobatic stunts. Autonomous helicopter aerobatics has been successfully tackled using the innovation of “apprenticeship learning,” where the algorithm learns by watching a human demonstrator (Abbeel & Ng, ). These methods have enabled autonomous helicopters to fly aerobatics as well as an expert human pilot, and often even better (Coates, Abbeel, & Ng, ). Developing autonomous flight controllers for helicopters is challenging for a number of reasons:

1. Helicopters have unstable, high-dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynamics. As a consequence, all successful helicopter flight controllers (to date) have many parameters. Controllers with – gains are not atypical. Hand engineering the right setting for each of the parameters is difficult and time consuming, especially since their effects on performance are often highly coupled through the helicopter’s complicated dynamics. Moreover, the unstable dynamics, especially in the low-speed flight regime, complicate flight testing.
2. Helicopters are underactuated: their position and orientation are representable using six parameters, but they have only four control inputs. Thus helicopter control requires significant planning and making trade-offs between errors in orientation and errors in desired position.
3. Helicopters have highly complex dynamics: Even though we describe the helicopter as having a twelve-dimensional state (position, velocity, orientation, and angular velocity), the true dynamics are significantly more complicated. To determine the precise effects of the inputs, one would have to consider the airflow in a large volume around the helicopter, as well as the parasitic coupling between the different inputs, the engine performance, and the non-rigidity of the rotor blades. Highly accurate simulators are thus difficult to create, and controllers developed in simulation must be sufficiently robust that they generalize to the real helicopter in spite of the simulator’s imperfections.
4. Sensing capabilities are often poor: For small remotely controlled (RC) helicopters, sensing is limited because the on-board sensors must deal with a large amount of vibration caused by the helicopter blades rotating at about Hz, as well as higher frequency noise from the engine. Although noise at these frequencies (which are well above the roughly Hz at which the helicopter dynamics can be modeled reasonably) might be easily removed by low pass filtering, this introduces latency and damping effects that are detrimental to control performance. As a consequence, helicopter flight controllers have to be robust to noise and/or latency in the state estimates to work well in practice.
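The latency cost of low-pass filtering noted in the last item can be made concrete with a small sketch. The first-order exponential filter and the smoothing constant `alpha = 0.2` are assumptions chosen for illustration. They are not the filter used on any of the helicopters discussed here.

```python
def exponential_low_pass(samples, alpha):
    """First-order low-pass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    filtered = []
    state = 0.0
    for x in samples:
        state = alpha * x + (1.0 - alpha) * state
        filtered.append(state)
    return filtered


# A unit step at sample 0 stands in for a sudden change in the measured quantity.
step_input = [1.0] * 50
smoothed = exponential_low_pass(step_input, alpha=0.2)

# Index of the first filtered sample that has reached 90% of the true value.
samples_to_90_percent = next(n for n, y in enumerate(smoothed) if y >= 0.9)
```

With this choice of `alpha`, the filtered signal needs ten further samples after the step before it reports 90% of the change. A smaller `alpha` suppresses more noise but lengthens this delay, which is the trade-off a controller working from filtered state estimates has to tolerate.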

Typical Hardware Setup A typical autonomous helicopter has several basic sensors on board. An Inertial Measurement Unit (IMU) measures angular rates and linear accelerations for each of the helicopter’s three axes. A 3-axis magnetometer senses the direction of the Earth’s magnetic field, similar to a magnetic compass (Fig. ). Attitude-only sensing, as provided by the inertial and magnetic sensors, is insufficient for precise, stable hovering and slow-speed maneuvers. These maneuvers require that the helicopter maintain relatively tight control over its position error, and hence high-quality position sensing is needed. GPS is often used to determine helicopter position (with carrier-phase GPS units achieving sub-decimeter accuracy), but vision-based solutions have also been employed (Abbeel, Coates, Quigley, & Ng, ; Coates et al., ; Saripalli, Montgomery, & Sukhatme, ). Vibration adds errors to the sensor measurements and may damage the sensors themselves, hence significant effort may be required to mount the sensors on the airframe (Dunbabin, Brosnan, Roberts, & Corke, ). Provided there is no aliasing, sensor errors added by vibration can be removed by using a digital filter on the measurements (though, again, one must be careful to avoid adding too much latency).

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . (a) Stanford University’s instrumented XCell Tempest autonomous helicopter. (b) A Bergen Industrial Twin autonomous helicopter with sensors and on-board computer

Sensor data from the aircraft sensors is used to estimate the state of the helicopter for use by the control algorithm. This is usually done with an extended Kalman filter (EKF). A unimodal distribution (as computed by the EKF) suffices to represent the uncertainty in the state estimates, and it is common practice to use the mode of the distribution as the state estimate for feedback control. In general, the accuracy obtained with this method is sufficiently high that one can treat the state as fully observed. Most autonomous helicopters have an on-board computer that runs the EKF and the control algorithm (Gavrilets, Martinos, Mettler, & Feron, a; La Civita, Papageorgiou, Messner, & Kanade, ; Ng et al., ). However, it is also possible to use ground-based computers by sending sensor data by wireless to the ground, and then transmitting control signals back to the helicopter through the pilot’s RC transmitter (Abbeel et al., ; Coates et al., ).

Helicopter State and Controls The helicopter state $s$ is defined by its position $(p_x, p_y, p_z)$, orientation (which could be expressed using a unit quaternion $q$), velocity $(v_x, v_y, v_z)$, and angular velocity $(\omega_x, \omega_y, \omega_z)$. The helicopter is controlled via a 4-dimensional action space:

1. $u_1$ and $u_2$: The lateral (left-right) and longitudinal (front-back) cyclic pitch controls (together referred to as the “cyclic” controls) cause the helicopter to roll left or right, and pitch forward or backward, respectively.
2. $u_3$: The tail rotor pitch control affects tail rotor thrust, and can be used to yaw (turn) the helicopter about its vertical axis. In analogy to airplane control, the tail rotor control is commonly referred to as “rudder.”
3. $u_4$: The collective pitch control (often referred to simply as “collective”) increases and decreases the pitch of the main rotor blades, thus increasing or decreasing the vertical thrust produced as the blades sweep through the air.

By using the cyclic and rudder controls, the pilot can rotate the helicopter into any orientation. This allows the pilot to direct the thrust of the main rotor in any particular direction, and thus fly in any direction, by rotating the helicopter appropriately.

Helicopter Flight as an RL Problem

Formulation

An RL problem can be described by a tuple $(S, A, T, H, s(0), R)$, which is referred to as a 7Markov decision process (MDP). Here $S$ is the set of states; $A$ is the set of actions or inputs; $T$ is the dynamics model, which is a set of probability distributions $\{P^t_{su}\}$ ($P^t_{su}(s' \mid s, u)$ is the probability of being in state $s'$ at time $t + 1$, given that the state and action at time $t$ are $s$ and $u$); $H$ is the horizon or number of time steps of interest; $s(0) \in S$ is the initial state; and $R : S \times A \to \mathbb{R}$ is the reward function. A policy $\pi = (\mu_0, \mu_1, \ldots, \mu_H)$ is a tuple of mappings from states $S$ to actions $A$, one mapping for each time $t = 0, \ldots, H$. The expected sum of rewards when acting according to a policy $\pi$ is given by $U(\pi) = E\left[\sum_{t=0}^{H} R(s(t), u(t)) \mid \pi\right]$. The optimal policy $\pi^*$ for an MDP $(S, A, T, H, s(0), R)$ is the policy that maximizes the expected sum of rewards. In particular, the optimal policy is given by $\pi^* = \arg\max_{\pi} U(\pi)$.

The common approach to finding a good policy for autonomous helicopter flight proceeds in two steps. First, one collects data from manual helicopter flights to build a model. (One could also build a helicopter model by directly measuring physical parameters such as mass, rotor span, etc. However, even when this approach is pursued, one often resorts to collecting flight data to complete the model.) Then one solves the MDP composed of the model and some chosen reward function. Although the controller obtained is, in principle, only optimal for the learned simulator model, it has been shown in various settings that optimal controllers perform well even when the model has some inaccuracies (see, e.g., Anderson & Moore, ()).
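To make the objective concrete, the sketch below estimates the expected sum of rewards of a policy by averaging simulated rollouts of a toy finite-horizon MDP. The one-dimensional "altitude error" dynamics, the quadratic reward, the noise level, and both policies are hypothetical stand-ins chosen for illustration, not a helicopter model. The policy is represented, as in the text, by one state-to-action mapping per time step.

```python
import random


def step(state, action, rng):
    """Toy stochastic dynamics: the action shifts the state, plus Gaussian noise."""
    return state + action + rng.gauss(0.0, 0.1)


def reward(state, action):
    """Toy reward R(s, u): penalize distance from zero and control effort."""
    return -(state ** 2) - 0.1 * action ** 2


def rollout_return(policy, initial_state, horizon, rng):
    """Sum of rewards R(s(t), u(t)) for t = 0, ..., horizon along one trajectory."""
    state, total = initial_state, 0.0
    for t in range(horizon + 1):
        action = policy[t](state)          # policy[t] plays the role of mu_t
        total += reward(state, action)
        state = step(state, action, rng)
    return total


def estimate_utility(policy, initial_state, horizon, n_rollouts, seed):
    """Monte Carlo estimate of the expected sum of rewards under the policy."""
    rng = random.Random(seed)
    returns = [rollout_return(policy, initial_state, horizon, rng)
               for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts


def push_toward_zero(state):
    return -1.0 if state > 0.5 else (1.0 if state < -0.5 else 0.0)


def do_nothing(state):
    return 0.0


H = 20
controller = [push_toward_zero] * (H + 1)
idle = [do_nothing] * (H + 1)
utility_controller = estimate_utility(controller, 5.0, H, n_rollouts=200, seed=0)
utility_idle = estimate_utility(idle, 5.0, H, n_rollouts=200, seed=0)
```

Comparing the two estimates shows how the objective ranks policies: the controller that drives the state toward zero earns a much higher (less negative) expected sum of rewards than the policy that never acts.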

Modeling

One way to create a helicopter model is to use direct knowledge of aerodynamics to derive an explicit mathematical model. This model will depend on a number of parameters that are particular to the helicopter being flown. Many of the parameters may be measured directly (e.g., mass, rotational inertia), while others must be estimated from flight experiments. This approach has been used successfully on several systems (see, e.g., Gavrilets, Martinos, Mettler, & Feron, b; Gavrilets, Mettler, & Feron, ; La Civita, ). However, substantial expert aerodynamics knowledge is required for this modeling approach. Moreover, these models tend to cover only a limited fraction of the flight envelope.

Alternatively, one can learn a model of the dynamics directly from flight data, with only limited a priori knowledge of the helicopter’s dynamics. Data is usually collected from a series of manually controlled flights. These flights involve the human sweeping the control sticks back and forth at varying frequencies to cover as much of the flight envelope as possible, while recording the helicopter’s state and the pilot inputs at each instant.

Given a corpus of flight data, various learning algorithms can be used to learn the underlying model of the helicopter dynamics. If one is only interested in a single flight regime, one could learn a linear model that maps from the current state and action to the next state. Such a model can be easily estimated using 7linear regression. (While the methods presented here emphasize time-domain estimation, frequency-domain estimation is also possible for the special case of estimating linear models (Tischler & Cauffman, ).) Linear models are restricted to small flight regimes (e.g., hover or inverted hover) and do not immediately generalize to full-envelope flight. To cover a broader flight regime, nonparametric algorithms such as locally weighted linear regression have been used (Bagnell & Schneider, ; Ng et al., ). Nonparametric models that map from current state and action to next state can, in principle, cover the entire flight regime.
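For the single-regime case, fitting a linear model from the current state and action to the next state reduces to ordinary least squares. The sketch below assumes a model of the form $s_{t+1} \approx A s_t + B u_t$ and uses synthetic data generated from made-up matrices. The function name, dimensions, and noise level are hypothetical, and real flight data would be logged trajectories rather than independent samples.

```python
import numpy as np


def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of next_state ≈ A @ state + B @ action.

    states: (N, n) array, actions: (N, m) array, next_states: (N, n) array.
    """
    features = np.hstack([states, actions])                  # (N, n + m)
    weights, *_ = np.linalg.lstsq(features, next_states, rcond=None)
    n = states.shape[1]
    return weights[:n].T, weights[n:].T                      # A is (n, n), B is (n, m)


# Synthetic stand-in for logged flight data (values are made up).
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1],
                   [0.0, 0.9]])
B_true = np.array([[0.0],
                   [0.5]])
states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 1))
noise = 0.01 * rng.normal(size=(500, 2))
next_states = states @ A_true.T + actions @ B_true.T + noise

A_hat, B_hat = fit_linear_dynamics(states, actions, next_states)
```

A locally weighted variant would repeat the same fit around each query point, with every sample weighted by its distance to the query. That is one reason such models need more data and are slower to evaluate than a single global fit.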
Unfortunately, nonparametric models require large amounts of data to be accurate, and they are often quite slow to evaluate. An alternative way to increase the expressiveness of the model, without resorting to nonparametric methods, is to consider a time-varying model where the dynamics are explicitly allowed to depend on time. One can then proceed to compute simpler (say, linear) parametric models for each choice of the time parameter. This method is effective when learning a model specific to a trajectory whose dynamics are repeatable but vary as the aircraft travels along the trajectory. Since this method can also require a great deal of data (similar to nonparametric methods), in practice it is helpful to begin with a non-time-varying parametric model fit from a large amount of data, and then augment it with a time-varying component that has fewer parameters (Abbeel, Quigley, & Ng, ; Coates et al., ).

One can also take advantage of symmetry in the helicopter dynamics to reduce the amount of data needed to fit a parametric model. Abbeel, Ganapathi, and Ng () observe that, in a coordinate frame attached to the helicopter, the helicopter dynamics are essentially the same for any orientation (or position) once the effect of gravity is removed. They learn a model that predicts (angular and linear) accelerations, except for the effects of gravity, in the helicopter frame as a function of the inputs and the (angular and linear) velocity in the helicopter frame. This leads to a lower-dimensional learning problem, which requires significantly less data. To simulate the helicopter dynamics over time, the predicted accelerations augmented with the effects of gravity are integrated over time to obtain velocity, angular rates, position, and orientation. Abbeel et al. () used this approach to learn a helicopter model that was later used for autonomous aerobatic helicopter flight maneuvers covering a large part of the flight envelope. Significantly less data is required to learn a model using the gravity-free parameterization compared to a parameterization that directly predicts the next state as a function of current state and actions (as was used in Bagnell and Schneider (), Ng et al. ()). Abbeel et al. evaluate their model by checking its simulation accuracy over longer time scales than just a one-step acceleration prediction.
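The integration step of the gravity-free parameterization can be sketched as follows. This is a deliberately simplified illustration, not the model of Abbeel et al.: the orientation is held fixed, angular dynamics are omitted, explicit Euler integration is used, and the world frame is assumed to have its z axis pointing up. The acceleration model `toy_accel_model` is a hypothetical placeholder for a learned predictor.

```python
import numpy as np

GRAVITY_WORLD = np.array([0.0, 0.0, -9.81])   # assumed z-up world frame, m/s^2


def simulate(accel_model, position, velocity, body_to_world, controls, dt):
    """Integrate body-frame, gravity-free acceleration predictions over time.

    accel_model(velocity_body, control) returns the predicted linear acceleration
    in the helicopter frame, excluding gravity. Orientation is held fixed here.
    """
    position = np.array(position, dtype=float)
    velocity = np.array(velocity, dtype=float)
    for control in controls:
        velocity_body = body_to_world.T @ velocity           # world -> helicopter frame
        accel_body = accel_model(velocity_body, control)     # gravity-free prediction
        accel_world = body_to_world @ accel_body + GRAVITY_WORLD   # add gravity back
        position = position + dt * velocity                  # explicit Euler step
        velocity = velocity + dt * accel_world
    return position, velocity


def toy_accel_model(velocity_body, control):
    """Hypothetical stand-in for a learned model: thrust along the body z axis."""
    return np.array([0.0, 0.0, control])


level = np.eye(3)   # helicopter frame aligned with the world frame
_, hover_velocity = simulate(toy_accel_model, [0, 0, 10], [0, 0, 0],
                             level, controls=[9.81] * 100, dt=0.01)
_, fall_velocity = simulate(toy_accel_model, [0, 0, 10], [0, 0, 0],
                            level, controls=[0.0] * 100, dt=0.01)
```

Because gravity is added back in the world frame after the prediction, the same acceleration model can be reused at any orientation by changing only `body_to_world`. That reuse is the source of the data savings described above.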
Evaluating a model by its simulation accuracy over longer time scales maps more directly to the reinforcement learning objective of maximizing the expected sum of rewards accumulated over time (see also Abbeel & Ng, (b)).

The models considered above are deterministic. This normally would allow us to drop the expectation when evaluating a policy according to $E\left[\sum_{t=0}^{H} R(s(t), u(t)) \mid \pi\right]$. However, it is common to add stochasticity to account for unmodeled effects. Abbeel et al. () and Ng et al. () include additive process noise in their models. Bagnell and Schneider () go further, learning a distribution over models. Their policy must then perform well, in expectation, for a (deterministic) model selected randomly from the distribution.

Control Problem Solution Methods

Given a model of the helicopter, we now seek a policy π that maximizes the expected sum of rewards $U(\pi) = E\left[\sum_{t=0}^{H} R(s(t), u(t)) \mid \pi\right]$ achieved when acting according to the policy π.

Policy Search

General policy search algorithms can be employed to search for optimal policies for the MDP based on the learned model. Given a policy π, we can directly try to optimize the objective U(π). Unfortunately, U(π) is an expectation over a complicated distribution, making it impractical to evaluate the expectation exactly in general. One solution is to approximate the expectation U(π) by Monte Carlo sampling: under certain boundedness assumptions, the empirical average of the sum of rewards accumulated over time will give a good estimate $\hat{U}(\pi)$ of the expectation U(π). Naively applying Monte Carlo sampling to accurately compute, e.g., the local gradient from the difference in function value at nearby points requires a very large number of samples due to the stochasticity in the function evaluation. To get around this hurdle, the PEGASUS algorithm (Ng & Jordan, ) can be used to convert the stochastic optimization problem into a deterministic one. When evaluating by averaging over n simulations, PEGASUS initially fixes n random seeds. For each policy evaluation, the same n random seeds are used so that the simulator is now deterministic. In particular, multiple evaluations of the same policy will result in the same computed reward. A search algorithm can then be applied to the deterministic problem to find an optimum. The PEGASUS algorithm coupled with a simple local policy search was used by Ng et al. () to develop a policy for their autonomous helicopter that successfully sustains inverted hover. Bagnell and Schneider () proceed similarly, but use the “amoeba” search algorithm (Nelder & Mead, ) for policy search. Because of the searching involved, the policy class must generally have low dimension. Nonetheless, it is often possible to find good policies within these policy classes. The policy class of Ng et al. (), for instance, is a decoupled, linear PD controller with a sparse dependence on the state variables. (For instance, the linear controller for the pitch axis is parametrized as $u = c_1 (p_x - p_x^*) + c_2 (v_x - v_x^*) + c_3 \theta$, which has just three parameters while the entire state is nine dimensional. Here, $p_\cdot, v_\cdot$ and $p_\cdot^*, v_\cdot^*$, respectively, are the actual and desired position and velocity, and θ denotes the pitch angle.) The sparsity reduces the policy class to just nine parameters. In Bagnell and Schneider (), two-layer neural network structures are used with a similar sparse dependence on the state variables. Two neural networks with five parameters each are learned for the cyclic controls.

Differential Dynamic Programming

Abbeel et al. () use differential dynamic programming (DDP) for the task of aerobatic trajectory following. DDP (Jacobson & Mayne, ) works by iteratively approximating the MDP as linear quadratic regulator (LQR) problems. The LQR control problem is a special class of MDPs, for which the optimal policy can be computed efficiently. In LQR the set of states is given by $S = \mathbb{R}^n$, the set of actions/inputs is given by $A = \mathbb{R}^p$, and the dynamics model is given by $s(t+1) = A(t)s(t) + B(t)u(t) + w(t)$, where for all $t = 0, \ldots, H$ we have that $A(t) \in \mathbb{R}^{n \times n}$, $B(t) \in \mathbb{R}^{n \times p}$, and w(t) is a mean zero random variable (with finite variance). The reward for being in state s(t) and taking action u(t) is given by $-s(t)^\top Q(t) s(t) - u(t)^\top R(t) u(t)$. Here Q(t), R(t) are positive semi-definite matrices which parameterize the reward function. It is well known that the optimal policy for the LQR control problem is a linear feedback controller which can be efficiently computed using dynamic programming (see, e.g., Anderson & Moore, (), for details on linear quadratic methods). DDP approximately solves general continuous state-space MDPs by iterating the following two steps until convergence:

1. Compute a linear approximation to the nonlinear dynamics and a quadratic approximation to the reward function around the trajectory obtained when executing the current policy in simulation.
2. Compute the optimal policy for the LQR problem obtained in Step 1 and set the current policy equal to the optimal policy for the LQR problem.

During the first iteration, the linearizations are performed around the target trajectory for the maneuver, since an initial policy is not available. This method is used to perform autonomous flips, rolls, and “funnels” (high-speed sideways flight in a circle) in Abbeel et al. () and autonomous autorotation (Autorotation is an emergency maneuver that allows a skilled pilot to glide a helicopter to a safe landing in the event of an engine failure or tail-rotor failure.) in Abbeel, Coates, Hunter, and Ng (), Fig. . While DDP computes a solution to the non-linear optimization problem, it relies on the accuracy of the non-linear model to correctly predict the trajectory that will be flown by the helicopter. This prediction is used in Step 1 above to linearize the dynamics. In practice, the helicopter will often not follow the predicted trajectory closely (due to stochasticity and modeling errors), and thus the linearization will become a highly inaccurate approximation of the non-linear model. A common solution to this, applied by Coates et al. (), is to compute the DDP solution online, linearizing around a trajectory that begins at the current helicopter state. This ensures that the model is always linearized around a trajectory near the helicopter’s actual flight path.

Apprenticeship Learning and Inverse RL

In computing a policy for an MDP, simply finding a solution (using any method) that performs well in simulation may not be enough. One may need to adjust both the model and
reward function based on the results of flight testing. Modeling error may result in controllers that fly perfectly in simulation but perform poorly or fail entirely in reality. Because helicopter dynamics are difficult to model exactly, this problem is fairly common. Meanwhile, a poor reward function can result in a controller that is not robust to modeling errors or unpredicted perturbations (e.g., it may use large control inputs that are unsafe in practice). If a human “expert” is available to demonstrate the maneuver, this demonstration flight can be leveraged to obtain a better model and reward function. The reward function encodes both the trajectory that the helicopter should follow, as well as the trade-offs between different types of errors. If the desired trajectory is infeasible (either in the non-linear simulation or in reality), this results in a significantly more difficult control problem. Also, if the trade-offs are not specified correctly, the helicopter may be unable to compensate for significant deviations from the desired trajectory. For instance, a typical reward function for hovering implicitly specifies a trade-off between position error and orientation error (it is possible to reduce one error, but usually at the cost of increasing the other). If this trade-off is incorrectly chosen, the controller may be pushed off course by wind (if it tries too hard to keep the helicopter level) or, conversely, may tilt the helicopter to an unsafe attitude while trying to correct for a large position error. We can use demonstrations from an expert pilot to recover both a good choice for the desired trajectory as well as good choices of reward weights for errors relative to this trajectory. In apprenticeship learning, we are given a set of N recorded state and control sequences,

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Snapshots of an autonomous helicopter performing in-place flips and rolls


$\{s_k(t), u_k(t)\}_{t=0}^{H}$ for $k = 1, \ldots, N$, from demonstration flights by an expert pilot. Coates et al. () note that these demonstrations may be sub-optimal but are often sub-optimal in different ways. They suggest that a large number of expert demonstrations may implicitly encode the optimal trajectory and propose a generative model that explains the expert demonstrations as stochastic instantiations of an “ideal” trajectory. This is the desired trajectory that the expert has in mind but is unable to demonstrate exactly. Using an Expectation-Maximization (Dempster, Laird, & Rubin, ) algorithm, they infer the desired trajectory and use this as the target trajectory in their reward function. A good choice of reward weights (for errors relative to the desired trajectory) can be recovered using inverse reinforcement learning (Abbeel & Ng, ; Ng & Russell, ). Suppose the reward function is written as a linear combination of features as follows: $R(s, u) = c_1 \phi_1(s, u) + c_2 \phi_2(s, u) + \cdots$. For a single recorded demonstration, $\{s(t), u(t)\}_{t=0}^{H}$, the pilot’s accumulated reward corresponding to each feature can be computed as $c_i \phi_i^* = c_i \sum_{t=0}^{H} \phi_i(s(t), u(t))$. If the pilot out-performs the autonomous flight controller with respect to a particular feature $\phi_i$, this indicates that the pilot’s own “reward function” places a higher value on that feature, and hence its weight $c_i$ should be increased. Using this procedure, a good choice of reward function that makes trade-offs similar to those of a human pilot can be recovered. This method has been used to guide the choice of reward for many maneuvers during flight testing (Abbeel et al., , ; Coates et al., ).

In addition to learning a better reward function from pilot demonstration, one can also use the pilot demonstration to improve the model directly and attempt to reduce modeling error. Coates et al. (), for instance, use errors observed in expert demonstrations to jointly infer an improved dynamics model along with the desired trajectory. Abbeel et al. (), however, have proposed the following alternating procedure that is broadly applicable (see also Abbeel and Ng (a) for details):

1. Collect data from a human pilot flying the desired maneuvers with the helicopter. Learn a model from the data.
2. Find a controller that works in simulation based on the current model.
3. Test the controller on the helicopter. If it works, we are done. Otherwise, use the data from the test flight to learn a new (improved) model and go back to Step 2.

This procedure has similarities with model-based RL and with the common approach in control to first perform system identification and then find a controller using the resulting model. However, the key insight from Abbeel and Ng (a) is that this procedure is guaranteed to converge to expert performance in a polynomial number of iterations. The authors report needing at most three iterations in practice. Importantly, unlike the E³ family of algorithms (Kearns & Singh, ), this procedure does not require explicit exploration policies. One only needs to test controllers that try to fly as well as possible (according to the current choice of dynamics model). (Indeed, the E³ family of algorithms (Kearns & Singh, ) and its extensions (Brafman & Tennenholtz, ; Kakade, Kearns, & Langford, ; Kearns & Koller, ) proceed by generating “exploration” policies, which try to visit inaccurately modeled parts of the state space. Unfortunately, such exploration policies do not even try to fly the helicopter well, and thus would almost invariably lead to crashes.)

The apprenticeship learning algorithms described above have been used to fly the most advanced autonomous maneuvers to date. The apprenticeship learning algorithm of Coates et al. (), for example, has been used to attain expert-level performance on challenging aerobatic maneuvers as well as entire airshows composed of many maneuvers in rapid sequence. These maneuvers include in-place flips and rolls, tic-tocs (“Tic-toc” is a maneuver where the helicopter pitches forward and backward with its nose pointed toward the sky (resembling an inverted clock pendulum).), and chaos (“Chaos” is a maneuver where the helicopter flips in-place but does so while continuously pirouetting at a high rate. Visually, the helicopter body appears to tumble chaotically while nevertheless remaining in roughly the same position.) (see Fig. ). These maneuvers are considered among the most challenging possible and can only be performed


Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Snapshot sequence of an autonomous helicopter flying a “chaos” maneuver using apprenticeship learning methods. Beginning from top-left and proceeding left-to-right, top-to-bottom, the helicopter performs a flip while pirouetting counter-clockwise about its vertical axis. (This maneuver has been demonstrated continuously for as long as cycles like the one shown here)

Autonomous Helicopter Flight Using Reinforcement Learning. Figure . Super-imposed sequence of images of autonomous autorotation landings (from Abbeel et al. ())

by advanced human pilots. In fact, Coates et al. () show that their learned controller performance can even exceed the performance of the expert pilot providing the demonstrations, putting many of the maneuvers on par with professional pilots (Fig. ). A similar approach has been used in Abbeel et al. () to perform the first successful autonomous autorotations. Their aircraft has performed more than autonomous landings successfully without engine power. Not only do apprenticeship methods achieve state-of-the-art performance, but they are among the fastest learning methods available, as they obviate the need for arduous hand tuning by engineers. Coates et al. (), for instance, report that entire airshows can be created from scratch with just h of work. This is in stark contrast to previous approaches that may have required hours or even days of tuning for relatively simple maneuvers.
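
The fixed-seed evaluation used by PEGASUS and the LQR backward recursion that DDP relies on, both discussed earlier in this entry, can be illustrated on a toy problem. The sketch below is not the implementation used in the cited work: the scalar dynamics, the reward weights, the noise level, and the function names (`lqr_gains`, `rollout`, `make_pegasus_evaluator`) are all assumptions chosen for illustration, and a terminal state penalty is assumed at the end of the horizon.

```python
import random

def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for the scalar system s' = a*s + b*u + w
    with reward -(q*s^2 + r*u^2). Returns gains K[t] for the policy u = -K[t]*s."""
    p = q                        # curvature of the cost-to-go at the final state
    gains = []
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()              # computed from the last step backwards
    return gains

def rollout(policy, seed, a=1.0, b=1.0, q=1.0, r=0.1, horizon=50, noise=0.1):
    """Simulate one noisy trajectory; all randomness comes from the given seed."""
    rng = random.Random(seed)
    s, total = 1.0, 0.0
    for t in range(horizon):
        u = policy(t, s)
        total -= q * s * s + r * u * u
        s = a * s + b * u + rng.gauss(0.0, noise)
    return total - q * s * s     # assumed terminal penalty, matching p = q above

def make_pegasus_evaluator(n_scenarios, master_seed=0):
    """Fix n seeds once, so every policy is scored on the same noise sequences."""
    master = random.Random(master_seed)
    seeds = [master.randrange(2 ** 31) for _ in range(n_scenarios)]

    def evaluate(policy):
        return sum(rollout(policy, sd) for sd in seeds) / len(seeds)

    return evaluate

gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=0.1, horizon=50)
evaluate = make_pegasus_evaluator(n_scenarios=20)

def lqr_policy(t, s):
    return -gains[t] * s

def weak_policy(t, s):           # a deliberately under-tuned constant gain
    return -0.3 * s
```

Because the seeds are fixed, `evaluate(lqr_policy)` returns the same number on every call, so an ordinary deterministic search (such as the Nelder-Mead simplex method mentioned earlier) could be run over policy parameters using `evaluate` as its objective.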

Conclusion

Helicopter control is a challenging control problem and has recently seen major successes with the application of learning algorithms. This chapter has shown how each step of the control design process can be automated using machine learning algorithms for system identification and reinforcement learning algorithms for control. It has also shown how apprenticeship learning algorithms can be employed to achieve expert-level performance on challenging aerobatic maneuvers when an expert pilot can provide demonstrations. Autonomous helicopters with control systems developed using these methods are now capable of flying advanced aerobatic maneuvers (including flips, rolls, tic-tocs, chaos, and autorotation) at the level of expert human pilots.

Cross References
▶Apprenticeship Learning ▶Reinforcement Learning ▶Reward Shaping

Recommended Reading Abbeel, P., Coates, A., Hunter, T., & Ng, A. Y. (). Autonomous autorotation of an rc helicopter. In ISER . Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (). An application of reinforcement learning to aerobatic helicopter flight. In NIPS (pp. –). Vancouver. Abbeel, P., Ganapathi, V., & Ng, A. Y. (). Learning vehicular dynamics with application to modeling helicopters. In NIPS . Vancouver. Abbeel, P., & Ng, A. Y. (). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the international conference on machine learning. New York: ACM. Abbeel, P., & Ng, A. Y. (a). Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the international conference on machine learning. New York: ACM Abbeel, P., & Ng, A. Y. (b). Learning first order Markov models for control. In NIPS . Abbeel, P., Quigley, M., & Ng, A. Y. (). Using inaccurate models in reinforcement learning. In ICML ’: Proceedings of the rd international conference on machine learning (pp. –). New York: ACM. Anderson, B., & Moore, J. (). Optimal control: linear quadratic methods. Princeton, NJ: Prentice-Hall. Bagnell, J., & Schneider, J. (). Autonomous helicopter control using reinforcement learning policy search methods. In International conference on robotics and automation. Canada: IEEE. Brafman, R. I., & Tennenholtz, M. (). R-max, a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, , –. Coates, A., Abbeel, P., & Ng, A. Y. (). Learning for control from multiple demonstrations. In ICML ’: Proceedings of the th international conference on machine learning. Dempster, A. P., Laird, N. M., & Rubin, D. B. (). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, , –. Dunbabin, M., Brosnan, S., Roberts, J., & Corke, P. (). Vibration isolation for autonomous helicopter flight. 
In Proceedings of the IEEE international conference on robotics and automation (Vol. , pp. –).

Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (a). Control logic for automated aerobatic flight of miniature helicopter. In AIAA guidance, navigation and control conference. Cambridge, MA: Massachusetts Institute of Technology. Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (b). Flight test and simulation results for an autonomous aerobatic helicopter. In AIAA/IEEE digital avionics systems conference. Gavrilets, V., Mettler, B., & Feron, E. (). Nonlinear model for a small-size acrobatic helicopter. In AIAA guidance, navigation and control conference (pp. –). Jacobson, D. H., & Mayne, D. Q. (). Differential dynamic programming. New York: Elsevier. Kakade, S., Kearns, M., & Langford, J. (). Exploration in metric state spaces. In Proceedings of the international conference on machine learning. Kearns, M., & Koller, D. (). Efficient reinforcement learning in factored MDPs. In Proceedings of the th international joint conference on artificial intelligence. San Francisco: Morgan Kaufmann. Kearns, M., & Singh, S. (). Near-optimal reinforcement learning in polynomial time. Machine Learning Journal, (–), – . La Civita, M. (). Integrated modeling and robust control for full-envelope flight of robotic helicopters. PhD thesis, Carnegie Mellon University, Pittsburgh, PA. La Civita, M., Papageorgiou, G., Messner, W. C., & Kanade, T. (). Design and flight testing of a high-bandwidth H∞ loop shaping controller for a robotic helicopter. Journal of Guidance, Control, and Dynamics, (), –. Leishman, J. (). Principles of helicopter aerodynamics. Cambridge: Cambridge University Press. Nelder, J. A., & Mead, R. (). A simplex method for function minimization. The Computer Journal, , –. Ng, A. Y., & Jordan, M. (). Pegasus: A policy search method for large MDPs and POMDPs. In Proceedings of the uncertainty in artificial intelligence th conference. San Francisco: Morgan Kaufmann. Ng, A. Y., & Russell, S. (). Algorithms for inverse reinforcement learning. 
In Proceedings of the th international conference on machine learning (pp. –). San Francisco: Morgan Kaufmann. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (). Autonomous inverted helicopter flight via reinforcement learning. In International symposium on experimental robotics. Berlin: Springer. Ng, A. Y., Kim, H. J., Jordan, M., & Sastry, S. (). Autonomous helicopter flight via reinforcement learning. In NIPS . Saripalli, S., Montgomery, J. F., & Sukhatme, G. S. (). Visually-guided landing of an unmanned aerial vehicle. IEEE Transactions on Robotics and Autonomous Systems, (), –. Seddon, J. (). Basic helicopter aerodynamics. In AIAA education series. El Segundo, CA: American Institute of Aeronautics and Astronautics. Tischler, M. B., & Cauffman, M. G. (). Frequency response method for rotorcraft system identification: Flight application to BO- coupled rotor/fuselage dynamics. Journal of the American Helicopter Society, .

Average-Cost Neuro-Dynamic Programming
▶Average-Reward Reinforcement Learning

Average-Cost Optimization
▶Average-Reward Reinforcement Learning

Averaged One-Dependence Estimators
Fei Zheng, Geoffrey I. Webb
Monash University

Synonyms
AODE

Definition
Averaged one-dependence estimators is a ▶semi-naive Bayesian learning method. It performs classification by aggregating the predictions of multiple one-dependence classifiers in which all attributes depend on the same single parent attribute as well as the class.

Classification with AODE
An effective approach to accommodating violations of naive Bayes’ attribute independence assumption is to allow an attribute to depend on other non-class attributes. To maintain efficiency it can be desirable to utilize one-dependence classifiers, such as ▶Tree Augmented Naive Bayes (TAN), in which each attribute depends upon the class and at most one other attribute. However, most approaches to learning with one-dependence classifiers perform model selection, a process that usually imposes substantial computational overheads and substantially increases variance relative to naive Bayes. AODE avoids model selection by averaging the predictions of multiple one-dependence classifiers. In each one-dependence classifier, an attribute is selected as the parent of all the other attributes. This attribute is called the SuperParent and this type of one-dependence classifier is called a SuperParent one-dependence estimator (SPODE). Only those SPODEs with SuperParent $x_i$ where the value of $x_i$ occurs at least m times are used for predicting a class label y for the test instance $x = \langle x_1, \ldots, x_n \rangle$. For any attribute value $x_i$,

$$P(y, x) = P(y, x_i)\,P(x \mid y, x_i).$$

This equality holds for every $x_i$. Therefore,

$$P(y, x) = \frac{\sum_{1 \le i \le n \,\wedge\, F(x_i) \ge m} P(y, x_i)\,P(x \mid y, x_i)}{\left|\{1 \le i \le n \,\wedge\, F(x_i) \ge m\}\right|}, \qquad ()$$

where $F(x_i)$ is the frequency of attribute value $x_i$ in the training sample. Utilizing () and the assumption that attributes are independent given the class and the SuperParent $x_i$, AODE predicts the class for x by selecting

$$\operatorname*{argmax}_{y} \sum_{1 \le i \le n \,\wedge\, F(x_i) \ge m} \hat{P}(y, x_i) \prod_{1 \le j \le n,\; j \ne i} \hat{P}(x_j \mid y, x_i). \qquad ()$$

It averages over estimates of the terms in (), rather than the true values, which has the effect of reducing the variance of these estimates. Figure shows a Markov network representation of an example AODE. As AODE makes a weaker attribute conditional independence assumption than naive Bayes while still avoiding model selection, it has substantially lower ▶bias with a very small increase in ▶variance. A number of studies (Webb, Boughton, & Wang, ; Zheng & Webb, ) have demonstrated that it often has considerably lower zero-one loss than naive Bayes with moderate time complexity. For comparisons with other semi-naive techniques, see ▶semi-naive Bayesian learning. One study (Webb, Boughton, & Wang, ) found AODE to provide classification accuracy competitive with a state-of-the-art discriminative algorithm, boosted decision trees. When a new instance is available, like naive Bayes, AODE only needs to update the probability estimates. Therefore, it is also suited to incremental learning.
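
The classification rule above can be sketched in a few lines for categorical attributes. This is a minimal illustration rather than the reference AODE implementation: the class name `AODE`, the Laplace-style smoothing of the probability estimates, and the fallback used when no attribute value reaches the frequency threshold m are all assumptions made here. The division by the number of qualifying SuperParents is omitted because it does not change the argmax.

```python
from collections import Counter

class AODE:
    """Minimal AODE for categorical attributes (illustrative sketch)."""

    def __init__(self, m=1):
        self.m = m  # minimum frequency F(x_i) for x_i to act as SuperParent

    def fit(self, X, y):
        self.n_attrs = len(X[0])
        self.n_rows = len(X)
        self.classes = sorted(set(y))
        # Number of distinct values per attribute, used only for smoothing.
        self.n_values = [len({row[i] for row in X}) for i in range(self.n_attrs)]
        self.freq = Counter()    # (i, x_i) -> F(x_i)
        self.joint = Counter()   # (y, i, x_i) -> count
        self.cond = Counter()    # (y, i, x_i, j, x_j) -> count
        for row, label in zip(X, y):
            for i, xi in enumerate(row):
                self.freq[(i, xi)] += 1
                self.joint[(label, i, xi)] += 1
                for j, xj in enumerate(row):
                    if j != i:
                        self.cond[(label, i, xi, j, xj)] += 1
        return self

    def scores(self, x):
        parents = [i for i in range(self.n_attrs)
                   if self.freq[(i, x[i])] >= self.m]
        if not parents:  # assumption: fall back to using every attribute
            parents = list(range(self.n_attrs))
        result = {}
        for label in self.classes:
            total = 0.0
            for i in parents:
                n_joint = self.joint[(label, i, x[i])]
                # Smoothed estimate of P(y, x_i).
                p = (n_joint + 1) / (
                    self.n_rows + len(self.classes) * self.n_values[i])
                for j in range(self.n_attrs):
                    if j != i:
                        # Smoothed estimate of P(x_j | y, x_i).
                        p *= (self.cond[(label, i, x[i], j, x[j])] + 1) / (
                            n_joint + self.n_values[j])
                total += p
            result[label] = total
        return result

    def predict(self, x):
        s = self.scores(x)
        return max(s, key=s.get)

# XOR-style data: the class depends on the interaction of two attributes,
# which a classifier assuming full attribute independence cannot represent.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in X]
model = AODE(m=1).fit(X, y)
```

Because the model is stored as frequency counts only, adding a new training instance amounts to incrementing the same counters, which is the incremental-learning property noted above.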


Averaged One-Dependence Estimators. Figure . A Markov network representation of the SPODEs that comprise an example AODE

Cross References
▶Bayesian Network ▶Naive Bayes ▶Semi-Naive Bayesian Learning ▶Tree-Augmented Naive Bayes

Recommended Reading
Webb, G. I., Boughton, J., & Wang, Z. (). Not so naive Bayes: aggregating one-dependence estimators. Machine Learning, (), –.
Zheng, F., & Webb, G. I. (). A comparative study of seminaive Bayes methods in classification learning. In Proceedings of the Fourth Australasian Data Mining Conference (pp. –).

Average-Payoff Reinforcement Learning
▶Average-Reward Reinforcement Learning

Average-Reward Reinforcement Learning
Prasad Tadepalli
Oregon State University, Corvallis, OR, USA

Synonyms
ARL; Average-cost neuro-dynamic programming; Average-cost optimization; Average-payoff reinforcement learning

Definition
Average-reward reinforcement learning (ARL) refers to learning policies that optimize the average reward per time step by continually taking actions and observing the outcomes including the next state and the immediate reward.

Motivation and Background
▶Reinforcement learning (RL) is the study of programs that improve their performance at some task by receiving rewards and punishments from the environment (Sutton & Barto, ). RL has been quite successful in automatic learning of good procedures for complex tasks such as playing Backgammon and scheduling elevators (Crites & Barto, ; Tesauro, ). In episodic domains in which there is a natural termination condition such as the end of the game in Backgammon, the obvious performance measure to optimize is the expected total reward per episode. But some domains such as elevator scheduling are recurrent, i.e., do not have a natural termination condition. In such cases, total expected reward can be infinite, and we need a different optimization criterion. In the discounted optimization framework, in each time step, the value of the reward is multiplied by a discount factor γ < 1, so that the total discounted reward is always finite. However, in many domains, there is no natural interpretation for the discount factor γ. A natural performance measure to optimize in such domains is the average reward received per time step. Although one could use a discount factor which is close to 1 to approximate average-reward optimization, an approach that directly optimizes the average reward avoids this additional parameter and often leads to faster convergence in practice. There is significant theory behind average-reward optimization based on ▶Markov decision processes (MDPs) (Puterman, ). An MDP is described by a 4-tuple ⟨S, A, P, r⟩, where S is a discrete set of states and A is a discrete set of actions. P is a conditional probability distribution over the next states, given the current state and action, and r gives the immediate reward for a given state and action. A policy π is a mapping from states to actions. Each policy π induces a Markov process over some set of states. In ergodic MDPs, every policy π forms a single closed set of states, and the average reward per time step of π in the limit of infinite horizon is independent of the starting state. We call it the “gain” of the policy π, denoted by ρ(π), and consider the problem of finding a “gain-optimal policy,” π∗, that maximizes ρ(π). Even though the gain ρ(π) of a policy π is independent of the starting state s, the total expected reward in time t is not. It can be denoted by ρ(π)t + h(s), where h(s) is a state-dependent bias term. It is the bias values of states that determine which states and actions are preferred, and need to be learned for optimal performance. The following theorem gives the Bellman equation for the bias values of states.

Theorem For ergodic MDPs, there exist a scalar ρ and a real-valued bias function h over S that satisfy the recurrence relation

$$\forall s \in S, \quad h(s) = \max_{a \in A} \Big\{ r(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, h(s') \Big\} - \rho. \qquad ()$$

Further, the gain-optimal policy $\mu^*$ attains the above maximum for each state s, and ρ is its gain.

Note that any one solution to () yields an infinite number of solutions by adding the same constant to all h-values. However, all these sets of h-values will result in the same set of optimal policies µ ∗ , since the optimal action in a state is determined only by the relative differences between the values of h.

[Figure: state-transition diagram of an MDP with actions good-move and bad-move; node labels include h(0)=0, h(1)=0, h(3)=2.]

For example, in Fig. , the agent has to select between the actions good-move and bad-move in state . If it stays in state , it gets an average reward of . If it stays in state , it gets an average reward of −. For this domain, ρ = for the optimal policy of choosing good-move in state . If we arbitrarily set h() to , then h() = , h() = , and h() = satisfy the recurrence relations in (). For example, the difference between h() and h() is , which equals the difference between the immediate reward for the optimal action in state and the optimal average reward . Given the probability model P and the immediate rewards r, the above equations can be solved by White’s relative value iteration method by setting the h-value of an arbitrarily chosen reference state to and using synchronous successive approximation (Bertsekas, ). There is also a policy iteration approach to determine the optimal policy starting with some arbitrary policy, solving for its values using the value iteration, and updating the policy using one step look-ahead search. The above iteration is repeated until the policy converges (Puterman, ).
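
Relative value iteration as described above can be sketched as follows. The two-state MDP used here is hypothetical (it is not the MDP shown in the figure), and the data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `r[s][a]` as the immediate reward) and the function name are assumptions made for illustration.

```python
def relative_value_iteration(P, r, ref, tol=1e-9, max_iter=10000):
    """Synchronous successive approximation with h(ref) pinned to 0.
    Returns (rho, h, greedy_policy)."""
    def q_value(s, a, h):
        return r[s][a] + sum(p * h[s2] for p, s2 in P[s][a])

    h = {s: 0.0 for s in P}
    rho = 0.0
    for _ in range(max_iter):
        backed_up = {s: max(q_value(s, a, h) for a in P[s]) for s in P}
        rho = backed_up[ref]          # since h(ref) = 0, this estimates the gain
        new_h = {s: backed_up[s] - rho for s in P}
        change = max(abs(new_h[s] - h[s]) for s in P)
        h = new_h
        if change < tol:
            break
    policy = {s: max(P[s], key=lambda a: q_value(s, a, h)) for s in P}
    return rho, h, policy

# Hypothetical MDP: staying in state 1 pays 2 per step, staying in state 0 pays 1,
# and moving between the two states pays nothing.
P = {
    0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
    1: {"stay": [(1.0, 1)], "back": [(1.0, 0)]},
}
r = {
    0: {"stay": 1.0, "go": 0.0},
    1: {"stay": 2.0, "back": 0.0},
}
rho, h, policy = relative_value_iteration(P, r, ref=0)
```

On this example the optimal gain is 2: the agent gives up one step of reward to reach state 1 and then stays there, and the bias difference h(1) − h(0) = 2 records how much better it is to start in state 1.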

Model-Based Learning

If the probabilities and the immediate rewards are not known, the system needs to learn them before applying the above methods. A model-based approach called H-learning interleaves model learning with Bellman backups of the value function (Tadepalli & Ok, ). This is an average-reward version of ▶adaptive real-time dynamic programming (Barto, Bradtke, & Singh, ). The models are learned by collecting samples of state-action-next-state triples ⟨s, a, s′⟩ and computing P(s′∣s, a) using maximum likelihood estimation. It then employs the “certainty equivalence principle” by using the current estimates as the true values while updating the h-value of the current state s according to the following update equation derived from the Bellman equation.

$$h(s) \leftarrow \max_{a \in A} \Big\{ r(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, h(s') \Big\} - \rho. \qquad ()$$

[Figure: node label h(2)=1]

Average-Reward Reinforcement Learning. Figure . A simple Markov decision process (MDP) that illustrates the Bellman equation

One complication in ARL is the estimation of the average reward ρ in the update equations during learning. One could use the current estimate of the long-term average reward, but it is distorted by the exploratory actions that the agent needs to take to learn about the unexplored parts of the state space. Without the exploratory actions, ARL methods converge to a suboptimal policy. To take this into account, note that from (), in any state s and for a nonexploratory action a that maximizes the right-hand side, $\rho = r(s, a) - h(s) + \sum_{s' \in S} P(s' \mid s, a)\, h(s')$. Hence, ρ is estimated by cumulatively averaging r − h(s) + h(s′) whenever a greedy action a is executed in state s, resulting in state s′ and immediate reward r. ρ is updated using the following equation, where α is the learning rate:

$$\rho \leftarrow \rho + \alpha\,\big(r - h(s) + h(s') - \rho\big). \qquad ()$$

One issue with model-based learning is that the models require too much space and time to learn as tables. In many cases, actions can be represented much more compactly. For example, Tadepalli and Ok () use dynamic Bayesian networks to represent and learn action models, resulting in significant savings in space and time for learning the models.
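
The ingredients of the model-based approach (maximum likelihood model estimation from observed triples, the certainty-equivalence backup of h, and the running average for ρ) can be sketched as below. This is an illustrative table-based fragment, not the H-learning implementation of the cited authors: the class and function names are hypothetical, actions that have never been tried are simply skipped in the backup, and the ρ update is written in its averaging form.

```python
from collections import defaultdict

class TabularModel:
    """Maximum likelihood estimates of P(s'|s,a) and r(s,a) from samples."""

    def __init__(self):
        self.visits = defaultdict(int)        # (s, a) -> number of samples
        self.next_counts = defaultdict(int)   # (s, a, s') -> number of samples
        self.reward_sum = defaultdict(float)  # (s, a) -> summed reward
        self.successors = defaultdict(set)    # (s, a) -> observed next states

    def observe(self, s, a, reward, s2):
        self.visits[(s, a)] += 1
        self.next_counts[(s, a, s2)] += 1
        self.reward_sum[(s, a)] += reward
        self.successors[(s, a)].add(s2)

    def prob(self, s, a, s2):
        return self.next_counts[(s, a, s2)] / self.visits[(s, a)]

    def reward(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

def h_backup(model, h, s, actions, rho):
    """Certainty-equivalence backup of h(s) using the current model estimates."""
    values = []
    for a in actions:
        if model.visits[(s, a)] == 0:
            continue                          # assumption: ignore untried actions
        expected_h = sum(model.prob(s, a, s2) * h[s2]
                         for s2 in model.successors[(s, a)])
        values.append(model.reward(s, a) + expected_h)
    return max(values) - rho

def update_rho(rho, reward, h_s, h_s2, alpha):
    """Running average of r - h(s) + h(s'), applied only after greedy actions."""
    return rho + alpha * (reward - h_s + h_s2 - rho)

model = TabularModel()
for reward, s2 in [(1.0, 1), (1.0, 1), (1.0, 1), (0.0, 0)]:
    model.observe(0, "a", reward, s2)
h = {0: 0.0, 1: 2.0}
new_h0 = h_backup(model, h, 0, ["a", "b"], rho=0.5)
```

In the usage lines, action "a" was observed to reach state 1 in three of four trials, so the estimated transition probability is 0.75 and the estimated reward is 0.75; action "b" was never tried and is skipped.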

Model-Free Learning

One of the disadvantages of the model-based methods is the need to explicitly represent and learn action models. This is completely avoided in model-free methods such as Q-learning by learning value functions over state–action pairs. Schwartz’s R-learning is an adaptation of Q-learning, which is a discounted reinforcement learning method, to optimize average reward (Schwartz, ). The state–action value R(s, a) can be defined as the expected long-term advantage of executing action a in state s and from then on following the optimal average-reward policy. It can be defined using the bias values h and the optimal average reward ρ as follows.

$$R(s, a) = r(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, h(s') - \rho$$

The main difference from Q-values is that instead of discounting the expected total reward from the next state, we subtract the average reward ρ in each time step, which is the constant penalty for using up a time step. The h value of any state s′ can now be defined using the following equation.

$$h(s') = \max_{u} R(s', u)$$

Initially all the R-values are set to 0. When action a is executed in state s, the value of R(s, a) is updated using the update equation

$$R(s, a) \leftarrow (1 - \beta)\, R(s, a) + \beta\,\bigl(r + h(s') - \rho\bigr),$$

where β is the learning rate, r is the immediate reward received, s′ is the next state, and ρ is the estimate of the average reward of the current greedy policy. In any state s, the greedy action a maximizes the value R(s, a); so R-learning does not need to explicitly learn the immediate reward function r(s, a) or the action models P(s′ ∣ s, a), since it uses them neither for action selection nor for updating the R-values.

Both model-free and model-based ARL methods have been evaluated in several experimental domains (Mahadevan, ; Tadepalli & Ok, ). When the models have a compact representation and can be learned quickly, the model-based method seems to perform better. It also has the advantage of having fewer tunable parameters. However, model-free methods are more convenient to implement, especially if the models are hard to learn or represent.
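The R-learning updates can be illustrated with a minimal tabular sketch. It is not taken from the cited papers: the ε-greedy exploration scheme, the parameter values, the function names, and the two-state environment `toy_step` are all assumptions made for illustration. The R-values follow the update rule with learning rate β, and ρ is averaged with learning rate α only after greedy (nonexploratory) steps.

```python
import random

def r_learning(step, n_states, n_actions, n_steps=20000,
               beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning sketch.

    `step(s, a)` is a hypothetical environment function that returns
    (immediate reward, next state).
    """
    rng = random.Random(seed)
    R = [[0.0] * n_actions for _ in range(n_states)]  # all R-values start at 0
    rho = 0.0                                         # average-reward estimate
    s = 0
    for _ in range(n_steps):
        explore = rng.random() < epsilon
        if explore:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda u: R[s][u])  # greedy action
        r, s_next = step(s, a)
        h_s = max(R[s])            # h(s)  = max_u R(s, u)
        h_next = max(R[s_next])    # h(s') = max_u R(s', u)
        # R(s, a) <- (1 - beta) R(s, a) + beta (r + h(s') - rho)
        R[s][a] = (1 - beta) * R[s][a] + beta * (r + h_next - rho)
        if not explore:
            # rho is averaged only over greedy (nonexploratory) steps
            rho += alpha * (r - h_s + h_next - rho)
        s = s_next
    return R, rho

def toy_step(s, a):
    """Two states; action 0 stays, action 1 switches. Staying in state 1 pays 1."""
    if a == 0:
        return (1.0 if s == 1 else 0.0), s
    return 0.0, 1 - s
```

In this toy problem the best policy moves to state 1 and stays there, so the estimate ρ should approach an average reward of 1 per step.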

Scaling Average-Reward Reinforcement Learning

Just as for discounted reinforcement learning, scaling issues are paramount for ARL. Since the number of states is exponential in the number of relevant state variables, a table-based approach does not scale well. The problem is compounded in multi-agent domains where the number of joint actions is exponential in the number of agents. Several function approximation approaches, such as linear functions, multi-layer perceptrons (Marbach, Mihatsch, & Tsitsiklis, ), local linear regression (Tadepalli & Ok, ), and tile coding (Proper & Tadepalli, ), were tried with varying degrees of success.

Hierarchical reinforcement learning based on the MAXQ framework was also explored in the average-reward setting and was shown to lead to significantly faster convergence. In the MAXQ framework, we have a directed acyclic graph, where each node represents a task and stores the value function for that task. Usually, the value function for subtasks depends on fewer state variables than the overall value function and hence can be more compactly represented. The relevant variables for each subtask are fixed by the designer of the hierarchy, which makes it much easier to learn the value functions. One potential problem with the hierarchical approach is the loss due to the hierarchical constraint on the policy. Despite this limitation, both model-based (Seri & Tadepalli, ) and model-free approaches (Ghavamzadeh & Mahadevan, ) were shown to yield optimal policies in some domains that satisfy the assumptions of these methods.

Applications

A temporal difference method for average reward based on TD() was used to solve a call admission control and routing problem (Marbach et al., ). On a modestly sized network of nodes, it was shown that the average-reward TD() outperforms the discounted version because the discounted version required more careful tuning of its parameters. Similar results were obtained in other domains such as automatic guided vehicle routing (Ghavamzadeh & Mahadevan, ) and transfer line optimization (Wang & Mahadevan, ).

Convergence Analysis

Unlike their discounted counterparts, both R-learning and H-learning lack convergence guarantees. Because there is no discounting, the updates can no longer be thought of as contraction mappings, and hence the standard theory of stochastic approximation does not apply. Simultaneous update of the average reward ρ and the value functions makes the analysis of these algorithms much more complicated. However, some ARL algorithms have been proved convergent in the limit using analysis based on ordinary differential equations (ODE) (Abounadi, Bertsekas, & Borkar, ). The main idea is to turn to ordinary differential equations that are closely tracked by the update equations and use two-time-scale analysis to show convergence. In addition to the standard assumptions of stochastic approximation theory, the two-time-scale analysis requires that ρ is updated at a much slower time scale than the value function.

The previous convergence results are based on the limit of infinite exploration. One of the many challenges in reinforcement learning is that of efficient exploration of the MDP to learn the dynamics and the rewards. There are model-based algorithms that guarantee learning an approximately optimal average-reward policy in time polynomial in the numbers of states and actions of the MDP and its mixing time. These algorithms work by alternating between learning the action models of the MDP by taking actions in the environment, and solving the learned MDP using offline value iteration. In the “Explicit Explore and Exploit” or E³ algorithm, the agent explicitly decides between exploiting the known part of the MDP and optimally trying to reach the unknown part of the MDP (exploration) (Kearns & Singh, ). During exploration, it uses the idea of “balanced wandering,” where the least executed action in the current state is preferred until all actions are executed a certain number of times. In contrast, the R-MAX algorithm implicitly chooses between exploration and exploitation by using the principle of “optimism under uncertainty” (Brafman & Tennenholtz, ). The idea here is to initialize the model parameters optimistically so that all unexplored actions in all states are assumed to reach a fictitious state that yields maximum possible reward from then on regardless of which action is taken. The optimistic initialization of the model parameters automatically encourages the agent to execute unexplored actions, until the true models and values of more states and actions are gradually revealed to the agent. It has been shown that with probability at least 1 − δ, both E³ and R-MAX learn approximately correct models whose optimal policies have an average reward є-close to the true optimal in time polynomial in the numbers of states and actions, the mixing time of the MDP, 1/є, and 1/δ. Unfortunately, the convergence results do not apply when there is function approximation involved.
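The action choice made during balanced wandering can be sketched in a few lines. This is only an illustration of the selection rule described above, not the full E³ algorithm; the function name, the `counts` dictionary, and the threshold name `m_known` are hypothetical.

```python
def balanced_wandering_action(counts, state, n_actions, m_known):
    """Pick the least-executed action in `state`.

    `counts` maps (state, action) to the number of times that action has been
    executed in that state. Returns None once every action in `state` has been
    executed at least `m_known` times, i.e., no more wandering is needed there.
    """
    tried = [counts.get((state, a), 0) for a in range(n_actions)]
    fewest = min(tried)
    if fewest >= m_known:
        return None
    return tried.index(fewest)
```

A return value of None signals that the state can be treated as known, at which point an agent would plan with its learned model instead of wandering.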
In the presence of linear function approximation, the average-reward version of temporal difference learning, which learns a state-based value function for a fixed policy, is shown to converge in the limit (Tsitsiklis & Van Roy, ). The transient behavior of this algorithm is similar to that of the corresponding discounted TD-learning with an appropriately scaled constant basis function (Van Roy & Tsitsiklis, ). As in the discounted case, development of provably convergent optimal policy learning algorithms with function approximation is a challenging open problem.


Cross References
→ Efficient Exploration in Reinforcement Learning
→ Hierarchical Reinforcement Learning
→ Model-Based Reinforcement Learning

Recommended Reading
Abounadi, J., Bertsekas, D. P., & Borkar, V. (). Stochastic approximation for non-expansive maps: Application to Q-learning algorithms. SIAM Journal of Control and Optimization, (), –.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (). Learning to act using real-time dynamic programming. Artificial Intelligence, (), –.
Bertsekas, D. P. (). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
Brafman, R. I., & Tennenholtz, M. (). R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, , –.
Crites, R. H., & Barto, A. G. (). Elevator group control using multiple reinforcement agents. Machine Learning, (/), –.
Ghavamzadeh, M., & Mahadevan, S. (). Hierarchical average reward reinforcement learning. Journal of Machine Learning Research, (), –.
Kearns, M., & Singh, S. (). Near-optimal reinforcement learning in polynomial time. Machine Learning, (/), –.
Mahadevan, S. (). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, (//), –.
Marbach, P., Mihatsch, O., & Tsitsiklis, J. N. (). Call admission control and routing in integrated service networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications, (), –.
Proper, S., & Tadepalli, P. (). Scaling model-based average-reward reinforcement learning for product delivery. In European conference on machine learning (pp. –). Springer.
Puterman, M. L. (). Markov decision processes: Discrete dynamic stochastic programming. New York: Wiley.
Schwartz, A. (). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.
Seri, S., & Tadepalli, P. (). Model-based hierarchical average-reward reinforcement learning. In Proceedings of international machine learning conference (pp. –). Sydney, Australia: Morgan Kaufmann.
Sutton, R., & Barto, A. (). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tadepalli, P., & Ok, D. (). Model-based average-reward reinforcement learning. Artificial Intelligence, , –.
Tesauro, G. (). Practical issues in temporal difference learning. Machine Learning, (–), –.
Tsitsiklis, J., & Van Roy, B. (). Average cost temporal-difference learning. Automatica, (), –.
Van Roy, B., & Tsitsiklis, J. (). On average versus discounted temporal-difference learning. Machine Learning, (/), –.
Wang, G., & Mahadevan, S. (). Hierarchical optimization of policy-coupled semi-Markov decision processes. In Proceedings of the th international conference on machine learning (pp. –). Bled, Slovenia.

B

Backprop
→ Backpropagation

Backpropagation Paul Munro University of Pittsburgh, Pittsburgh, PA, USA

Synonyms Backprop; BP; Generalized delta rule

Definition
Backpropagation of error (henceforth BP) is a method for training feed-forward neural networks (see Artificial Neural Networks). A specific implementation of BP is an iterative procedure that adjusts network weight parameters according to the gradient of an error measure. The procedure is implemented by computing an error value for each output unit, and by backpropagating the error values through the network.

Characteristics

Feed-Forward Networks

A feed-forward neural network is a mathematical function that is composed of constituent “semi-linear” functions constrained by a feed-forward network architecture, wherein the constituent functions correspond to nodes (often called units or artificial neurons) in a graph, as in Fig. 1. A feed-forward network architecture has a connectivity structure that is an acyclic graph; that is, there are no closed loops. In most cases, the unit functions have a finite range such as [0, 1]. Thus, the network maps R^N to [0, 1]^M, where N is the number of input values and M is the number of output units. Let FanIn(k) refer to the set of units that provide input to unit k, and let FanOut(k) denote the set of units that receive input from unit k. In an acyclic graph, at least one unit has a FanIn that is the null set. These are the input units; the activity of an input unit is not computed; rather, it is set to a value external to the network (i.e., from the training data). Similarly, at least one unit has a null FanOut set. Such units typically represent the output of the network; i.e., this set of values is the result of the network computation. Intermediate units (often called hidden units) receive input from other units and project outputs to other computational units.

For the BP procedure, the activity of each unit is computed in two steps:

Linear step: the activities of the FanIn are each multiplied by an independent “weight” parameter, and a “bias” parameter is added to the weighted sum; each computational unit has a single bias parameter, independent of the other units. Let this sum be denoted x_k for unit k.

Nonlinear step: the activity a_k of unit k is a differentiable nonlinear function of x_k. A favorite function is the logistic a = 1/(1 + exp(−x)), because it maps the range (−∞, +∞) to (0, 1) and its derivative has properties conducive to the implementation of BP.

$$a_k = f_k(x_k), \quad \text{where} \quad x_k = b_k + \sum_{j \in \mathrm{FanIn}(k)} w_{kj}\, a_j$$
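The two-step computation of unit activities can be sketched as follows. The data layout is an assumption made for illustration: `fan_in[k]` lists the units that feed unit k, `weights[(k, j)]` holds the weight from unit j to unit k, and `order` is any ordering of the computational units in which every unit appears after the units in its FanIn. All units are assumed to use the logistic function.

```python
import math

def logistic(x):
    """Logistic activity function a = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(fan_in, weights, biases, inputs, order):
    """Compute the activity of every unit of an acyclic network.

    `inputs` maps each input unit to its externally set activity.
    """
    a = dict(inputs)  # input activities are set, not computed
    for k in order:
        # linear step: x_k = b_k + sum over FanIn(k) of w_kj * a_j
        x = biases[k] + sum(weights[(k, j)] * a[j] for j in fan_in[k])
        # nonlinear step: a_k = f_k(x_k)
        a[k] = logistic(x)
    return a
```

For example, a hidden unit with zero weights and zero bias has activity 0.5 regardless of its inputs, since the logistic of 0 is 0.5.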

Backpropagation. Figure 1. Two networks are shown. Input units are shown as simple squares at the bottom of each figure. Other units are computational (designated by a horizontal line). Left (standard 3-layer classification net): a standard three-layer network. Four input units project to five hidden units, which in turn project to a single output unit. Not all connections are shown. Such a network is commonly used for classification tasks. Right (general feed-forward net structure): an example of a feed-forward network with four inputs, three hidden units, and two outputs; the sets FanIn(k) and FanOut(k) are marked for a unit k.

Gradient Descent

Derivation of BP is a direct application of the gradient descent approach to optimization and depends on a definition of network error, a function of the actual network response r(s) to a stimulus s and the target T(s). The two most common error functions are the summed squared error (SSE) and the cross entropy error (CE). (CE error as defined here is based on the presumption that the output values are in the range [0, 1], and likewise for the target values; this is often used for classification tasks, wherein target values are set to the endpoints of the range, 0 and 1.)

$$E_{SSE} \equiv \sum_{s \in \mathrm{Train}} \; \sum_{i \in \mathrm{Output}} \bigl(T_i(s) - r_i(s)\bigr)^2$$

$$E_{CE} \equiv -\sum_{s \in \mathrm{Train}} \; \sum_{i \in \mathrm{Output}} \bigl[\, T_i(s) \ln r_i(s) + (1 - T_i(s)) \ln (1 - r_i(s)) \,\bigr]$$
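The two error measures can be computed directly from lists of target and response vectors, one vector per training stimulus. The sketch below assumes the usual sign convention for cross entropy (a negative log-likelihood, so that smaller is better); the function names and the list-of-lists layout are illustrative choices.

```python
import math

def sse(targets, responses):
    """Summed squared error over all training patterns and output units."""
    return sum((t - r) ** 2
               for T, R in zip(targets, responses)
               for t, r in zip(T, R))

def cross_entropy(targets, responses):
    """Cross entropy error; every response must lie strictly between 0 and 1."""
    return -sum(t * math.log(r) + (1 - t) * math.log(1 - r)
                for T, R in zip(targets, responses)
                for t, r in zip(T, R))
```

Both measures shrink as the responses approach the targets, but the cross entropy grows without bound as a response approaches the wrong endpoint of the range.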

Each weight parameter w_ij (the weight of the connection from j to i) is updated by an amount proportional to the negative gradient of the error measure with respect to that parameter:

$$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}},$$

where the step size, η, modulates the intrinsic tradeoff between smooth convergence of the weights and the speed of convergence; in the regime where η is small, the system is well-behaved and converges smoothly, but slowly, and for larger η, the system may learn some subsets of the training set faster at the expense of smooth convergence on all patterns in the set. Thus, η is also called the learning rate.
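The gradient in this update rule is what backpropagation computes. As a minimal sketch, assume a network with one hidden layer, logistic units throughout, and the SSE error for a single stimulus; the function names and the random initialization range are illustrative assumptions. For a logistic unit the derivative of the activity function is a(1 − a), so each output unit gets δ = a(1 − a)(T − r), and each hidden unit gets its δ from the weighted sum of the deltas of the units it projects to. With this convention ∂E/∂w_ij = −2 δ_i a_j, and the constant factor 2 is absorbed into η.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_in, n_hidden, n_out, seed=0, limit=1.0):
    """Random initial weights and biases in [-limit, limit] (illustrative choice)."""
    rng = random.Random(seed)
    def u():
        return rng.uniform(-limit, limit)
    W1 = [[u() for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [u() for _ in range(n_hidden)]
    W2 = [[u() for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [u() for _ in range(n_out)]
    return W1, b1, W2, b2

def forward(net, s):
    """Return (hidden activities, output activities) for stimulus s."""
    W1, b1, W2, b2 = net
    hidden = [logistic(b + sum(w * x for w, x in zip(row, s)))
              for row, b in zip(W1, b1)]
    out = [logistic(b + sum(w * h for w, h in zip(row, hidden)))
           for row, b in zip(W2, b2)]
    return hidden, out

def deltas(net, hidden, out, target):
    """Delta values for output and hidden units."""
    W2 = net[2]
    # output units: delta = f'(x) * (T - r), with f'(x) = a (1 - a)
    d_out = [r * (1 - r) * (t - r) for r, t in zip(out, target)]
    # hidden units: error is the weighted sum of downstream deltas
    d_hid = [h * (1 - h) * sum(W2[i][k] * d_out[i] for i in range(len(d_out)))
             for k, h in enumerate(hidden)]
    return d_out, d_hid

def sse(net, s, target):
    _, out = forward(net, s)
    return sum((t - r) ** 2 for t, r in zip(target, out))

def train_step(net, s, target, eta=0.1):
    """One online update, in place: each weight changes by eta * delta_i * a_j."""
    W1, b1, W2, b2 = net
    hidden, out = forward(net, s)
    d_out, d_hid = deltas(net, hidden, out, target)  # before any weight changes
    for i, d in enumerate(d_out):
        b2[i] += eta * d
        for j, h in enumerate(hidden):
            W2[i][j] += eta * d * h
    for i, d in enumerate(d_hid):
        b1[i] += eta * d
        for j, x in enumerate(s):
            W1[i][j] += eta * d * x
```

A useful sanity check for any such implementation is to compare −2 δ_i a_j against a finite-difference estimate of ∂E/∂w_ij obtained by perturbing a single weight.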

Implementation

Several aspects of the feed-forward network must be defined prior to running a BP program, such as the configuration of the hidden units, the initial values of the weights, the functions they will compute, and the numerical representation of the input and target data. There are also parameters of the learning algorithm that must be chosen, such as the value of η and the form of the error function. The weight and bias parameters are set to their initial values (these are usually random within specified limits). BP is implemented as an iterative process as follows:

1. A stimulus-target pair is drawn from the training set.
2. The activity values are computed for all the units in the network in a forward fashion from input to output (Fig. 2a).
3. The network output values are compared to the target and a delta (δ) value is computed for each output unit based on the difference between the target and the actual output response value.
4. The deltas are propagated backward through the network using the same weights that were used to compute the activity values (Fig. 2b).
5. Each weight is updated by an amount proportional to the product of the downstream delta value and the upstream activity (Fig. 2c).

The procedure can be run either in an online mode or batch mode. In the online mode, the network parameters are updated for each stimulus-target pair. In the batch mode, the weight changes are computed and accumulated over several iterations without updating the weights until a large number (B) of stimulus-target pairs have been processed (often, the entire training set), at which point the weights are updated by the accumulated amounts.

Online:

$$\Delta w_{ij}(t) = \eta\, \delta_i(t)\, a_j(t), \qquad \Delta b_i(t) = \eta\, \delta_i(t)$$

Batch:

$$\Delta w_{ij}(t + B) = \sum_{s=t+1}^{t+B} \eta\, \delta_i(s)\, a_j(s), \qquad \Delta b_i(t + B) = \sum_{s=t+1}^{t+B} \eta\, \delta_i(s)$$

Backpropagation. Figure 2. With each iteration of the backprop algorithm, (a) an activity value is computed for every unit in the network from the input to the output: a_k = f_k(x_k), with x_k = b_k + Σ_{j∈FanIn(k)} w_kj a_j. (b) The network output is compared with the target. The error e_k for output unit k is defined as (T_k − r_k). A value δ_k is computed for each output unit by multiplying e_k by the derivative of the activity function. For hidden units, the error is propagated backward using the weights: e_k = Σ_{i∈FanOut(k)} w_ik δ_i and δ_k = f_k′(x_k) e_k. (c) The weight parameters w_ij are updated in proportion to the product of δ_i and a_j: Δw_ij = η δ_i a_j and Δb_i = η δ_i.

Classification Tasks with BP

The simplest and most common classification function returns a binary value, indicating membership in a particular class. The most common network architecture for a task of this kind is the three-layer network of Fig. 1 (left), with training values of 0 and 1. For classification tasks, the cross entropy error function generally gives significantly faster convergence. After training, the network is in test mode or production mode, and the responses are in the continuous range [0, 1]; the response must thus be interpreted. The value of the response could be interpreted as a probability or fuzzy Boolean value. Often, however, a single threshold is applied to give a binary answer. A double threshold is sometimes used, with the midrange defined as “uncertain.”

Curve Fitting with BP

A feed-forward network can be trained to approximate any function, given sufficient hidden units. The output unit(s) must be capable of generating activity values in the required range. In order to accommodate an arbitrary range uniformly, a linear function is advisable for the output units, and the SSE function is the basis for gradient descent.

The Autoencoder Architecture

The autoencoder is a network design in which the target pattern is identical to the input pattern. The hidden units are configured such that there is a “bottleneck layer” of units that is smaller than the input layer, through which information flows; i.e., there are no connections bypassing the bottleneck. Thus, any information necessary to reconstruct the input pattern at the output layer must be represented at the bottleneck. This design has been successfully applied to nonlinear dimensionality reduction (e.g., Demers & Cottrell, ). It has notable similarities to, and differences from, linear techniques such as principal components analysis (PCA).

Prediction with BP

The plain “vanilla” BP propagates input to output with no explicit representation of time. Several approaches to processing of temporal patterns have been put forward. Most prominent among these are:

Time delay neural network. In this approach, the input stimulus is simply a sample of a time-varying signal. The input patterns are typically generated by a sliding window of samples over time or over a sequence.

Simple recurrent network (Elman, ). A sequence of stimulus patterns is presented as input for the network, which has a single hidden layer design. With each iteration, the input is augmented by a secondary set of input units whose activity is a copy of the hidden layer activity from the previous iteration. Thus, the network is able to maintain a representation of the recent history of network stimuli.

Backpropagation through time (Rumelhart, Hinton, & Williams, ). A recurrent network (i.e., a cyclic network) is “unfolded in time” by forming a large multilayer network, in which each layer is a copy of the entire network shifted in time. Thus, the number of layers limits the temporal window available to the network.

Recurrent backpropagation (Pineda, ). A cyclic network is run with activity propagation and error propagation until the variables converge. Then the weights are updated.
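The sliding-window construction used for time delay networks can be sketched as a small helper that turns a sequence into stimulus-target pairs for one-step-ahead prediction. Pairing each window with the sample that follows it is an assumption about the prediction task; the function name is illustrative.

```python
def windowed_examples(signal, width):
    """Pair each window of `width` consecutive samples with the sample after it."""
    return [(signal[i:i + width], signal[i + width])
            for i in range(len(signal) - width)]
```

Each pair can then be presented to an ordinary feed-forward network with `width` input units, which is what lets plain BP handle a temporal signal without any explicit representation of time.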

Cognitive Modeling with BP

Interest in BP as a training technique for classifiers has waned somewhat since the introduction of Support vector machines (SVMs) in the mid-1990s. However, the influence of BP as an approach to modeling cognitive processes, including perception, concept learning, spatial cognition, and language learning, remains strong. Analysis of hidden unit representations (e.g., using clustering techniques) has given insight into plausible intermediate processes that may underlie cognitive phenomena. Also, many cognitive models trained with BP have exhibited time courses consistent with stages of human learning.

Biological Inspiration and Plausibility

The “connectionist” approach to modeling cognition is based on “neural network” models, which have been touted as “biologically inspired” since their inception. The similarities and differences between connectionist architectures and living brains have been exhaustively debated. Like the brain, the models consist of elements that are computationally extremely limited. Computational power is derived from combining many such units in a network architecture. However, there are compelling differences as well. For example, the temporal dynamics of biological neurons are far more complex than the simple functions used in connectionist networks. It remains unclear what level of neurobiological detail is relevant to understanding cognitive functions.

Shortcomings of BP

The BP method is notorious for convergence problems. An inherent problem of gradient descent approaches to optimization is the issue of locally optimal values. Seeking a minimum value by heading downhill is like water running downhill. Not all water reaches the lowest point (sea level). Water that flows into a mountain lake has landed in a local minimum, a region that is bounded by higher ground. Even when BP converges to a global minimum (or a local minimum that is “good enough”), it is sometimes very slow. The convergence properties of BP depend on the learning rate and random factors, such as the initial weight and bias values. Another difficulty with BP is the selection of a network structure. The number of hidden units and the interconnectivity among them have a strong influence on both the generalization performance and the convergence time. Since the nature of this influence is poorly understood, the design of the network is left to guesswork. The standard approach is to use a single hidden layer (as in Fig. 1, left), which has the advantage of relatively fast convergence.

History

The idea of training a multilayered network using error propagation was originated by Frank Rosenblatt (, ). However, he was unable to apply gradient descent because the linear threshold functions he used were not differentiable. He developed a technique known as the perceptron learning rule that is only applicable to two-layer networks (no hidden units). Without hidden units, the computational power of the network is severely reduced. Work in the field virtually stopped with the publication of Perceptrons (Minsky & Papert, ). The backpropagation procedure was first published by Werbos (), but did not receive significant recognition until it was put forward by Rumelhart et al. ().

Cross References
→ Artificial Neural Networks

Recommended Reading
Demers, D., & Cottrell, G. (). Non-linear dimensionality reduction. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems (Vol. ). San Mateo, CA: Morgan Kaufmann.
Elman, J. (). Finding structure in time. Cognitive Science, , –.
Minsky, M. L., & Papert, S. A. (). Perceptrons. Cambridge, MA: MIT Press.
Pineda, F. J. (). Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Computation, , –.
Rosenblatt, F. (). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, , –.
Rosenblatt, F. (). Principles of statistical neurodynamics. Washington, DC: Spartan.
Werbos, P. (). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge.

Bagging

Bagging is an ensemble learning technique. The name “Bagging” is an acronym derived from Bootstrap AGGregatING. Each member of the ensemble is constructed from a different training dataset. Each dataset is a bootstrap sample from the original. The models are combined by a uniform average or vote. Bagging works best with unstable learners, that is, those that produce differing generalization patterns with small changes to the training data. Bagging therefore tends not to work well with linear models. See ensemble learning for more details.

Bake-Off

Definition
Bake-off is a disparaging term for experimental evaluation of multiple learning algorithms by a process of applying each algorithm to a limited set of benchmark problems.

Cross References
→ Algorithm Evaluation

Bandit Problem with Side Information
→ Associative Reinforcement Learning

Bandit Problem with Side Observations
→ Associative Reinforcement Learning

Basic Lemma
→ Symmetrization Lemma

Basket Analysis Hannu Toivonen University of Helsinki, Helsinki, Finland

Synonyms Market basket analysis

Definition
The goal of basket analysis is to utilize large volumes of electronic receipts, stored at the checkout terminals of supermarkets, for better understanding of customer behavior. While many forms of learning and mining can be applied to market baskets, the term usually refers to some variant of association rule mining. In the basic setting, each market basket constitutes an example essentially defined by the set of purchased products. Association rules then identify sets of items that tend to be bought together. A classical, anecdotal discovery from supermarket data is that “if a basket contains diapers then it often also contains beer.” This example illustrates several potential benefits of market basket analysis by association rules: simplicity and understandability of the results, actionability of the results, and an unsupervised form of analysis in which the consequent of the rule has not been fixed by the user. Association rules are often found with the Apriori algorithm, and are based on frequent itemsets.
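How strongly a rule such as “diapers, then beer” holds in a set of baskets is commonly quantified by its support and confidence. These two measures are standard in association rule mining but are not defined in this entry, so the sketch below should be read as an illustration under that assumption; the function names and the example baskets are made up.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"milk"},
]
```

In these four baskets, diapers and beer occur together in two, and diapers occur in three, so the rule has support 0.5 and confidence 2/3.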

Cross References
→ Apriori Algorithm
→ Association Rule
→ Frequent Itemset
→ Frequent Pattern

Batch Learning

Synonyms
Offline Learning

Definition
A batch learning algorithm accepts a single input that is a set or sequence of observations. The algorithm produces its model, and does no further learning. Batch learning stands in contrast to online learning.

Baum–Welch Algorithm

The Baum–Welch algorithm is used for computing maximum likelihood estimates and posterior mode estimates for the parameters (transition and emission probabilities) of an HMM, when given only output sequences (emissions) as training data. The Baum–Welch algorithm is a particular instantiation of the expectation-maximization algorithm, suited for HMMs.

Bayes Adaptive Markov Decision Processes
→ Bayesian Reinforcement Learning

Bayes Net
→ Bayesian Network

Bayes Rule
Geoffrey I. Webb
Monash University

Definition
Bayes rule provides a decomposition of a conditional probability that is frequently used in a family of learning techniques collectively called Bayesian Learning. Bayes rule is the equality

$$P(w \mid z) = \frac{P(w)\, P(z \mid w)}{P(z)} \qquad (1)$$

P(w) is called the prior probability, P(w ∣ z) is called the posterior probability, and P(z ∣ w) is called the likelihood.

Discussion
Bayes rule is used for two purposes. The first is Bayesian update. In this context, z represents some new information that has become available since an estimate P(w) was formed of some hypothesis w. The application of Bayes’ rule enables a new estimate of the probability of w (the posterior probability) to be calculated from estimates of the prior probability, the likelihood and P(z). The second common application of Bayes’ rule is for estimating posterior probabilities in probabilistic learning, where it is the core of Bayesian networks, naïve Bayes, and semi-naïve Bayesian techniques. While Bayes’ rule may initially appear mysterious, it is readily derived from the basic principle of conditional probability that

$$P(w \mid z) = \frac{P(w, z)}{P(z)} \qquad (2)$$

As

$$P(w, z) = \frac{P(w)\, P(w, z)}{P(w)} \qquad (3)$$

and

$$\frac{P(w, z)}{P(w)} = P(z \mid w), \qquad (4)$$

Bayes’ rule (Eq. 1) follows by simple substitution of Eq. (4) into Eq. (3) and then of the result into Eq. (2).

Cross References
→ Bayesian Methods
→ Bayesian Network
→ Naïve Bayes
→ Semi-Naïve Bayesian Learning

Bayesian Methods
Wray Buntine
NICTA, Canberra, Australia

Definition
The two most important concepts used in Bayesian modeling are probability and utility. Probabilities are used to model our belief about the state of the world and utilities are used to model the value to us of different outcomes, thus to model costs and benefits. Probabilities are represented in the form of p(x∣C), where C is the current known context and x is some event(s) of interest from a space χ. The left and right arguments of the probability function are in general propositions (in the logical sense). Probabilities are updated based on new evidence or outcomes y using Bayes rule, which takes the form

$$p(x \mid C, y) = \frac{p(x \mid C)\, p(y \mid x, C)}{p(y \mid C)}, \qquad p(y \mid C) = \sum_{x \in \chi} p(x \mid C)\, p(y \mid x, C),$$

where χ is the discrete domain of x. More generally, any measurable set can be used for the domain χ. An integral or mixed sum and integral can replace the sum. For a utility function u(x) of some event x, for instance the benefit of a particular outcome, the expected value of u() is

$$E_{x \mid C}[u(x)] = \sum_{x \in \chi} p(x \mid C)\, u(x).$$

One then estimates the expected utility E_{x∣C,y}[u(x)] based on different evidence, actions or outcomes y. An action is taken to maximize this expected utility, appealing to the principle of maximum expected utility (MEU). A common application of this principle is recursive: one should take the action now that will maximize utility in the future, assuming all future actions are also taken to maximize utility.
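A small numerical sketch may help connect the Bayes rule update to the choice of an action by maximum expected utility. Everything concrete here is an assumption made for illustration: the discrete domain {rain, dry}, the evidence value "clouds", the probabilities, and a utility table indexed by action and event (the definition above writes the utility as a function of the event only).

```python
def posterior(prior, likelihood, y):
    """p(x | C, y) from the prior p(x | C) and likelihood p(y | x, C)."""
    joint = {x: prior[x] * likelihood[x][y] for x in prior}
    evidence = sum(joint.values())  # p(y | C), the sum over the discrete domain
    return {x: p / evidence for x, p in joint.items()}

def expected_utility(belief, utility, action):
    """Expected utility of `action` under the current belief over events."""
    return sum(p * utility[action][x] for x, p in belief.items())

def meu_action(belief, utility):
    """Action with maximum expected utility (the MEU principle)."""
    return max(utility, key=lambda a: expected_utility(belief, utility, a))

# Hypothetical numbers for illustration only.
prior = {"rain": 0.3, "dry": 0.7}
likelihood = {"rain": {"clouds": 0.9, "clear": 0.1},
              "dry": {"clouds": 0.2, "clear": 0.8}}
utility = {"umbrella": {"rain": -1.0, "dry": -1.0},
           "none": {"rain": -3.0, "dry": 0.0}}
```

With these numbers, observing clouds raises the probability of rain from 0.3 to 0.27/0.41, which is enough to change the preferred action from carrying nothing to carrying an umbrella.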

Motivation and Background In modeling a problem, primarily, one considers an interrelated space of events or states, actions, and outcomes. Events describe the state of the world, outcomes are also sometimes considered events but they are special in that one directly obtains from them costs or benefits. Actions allow one to influence the world. Some actions may instigate tests and thus also help measure the state of the world to reduce uncertainty. Some problems may be dynamic in that a sequence of actions and outcomes are considered and the resulting changes in states modeled. The Bayesian approach is a modeling methodology that provides a principled approach of how to reason and act in the context of uncertainty and a dynamic environment. In the approach, probabilities are used to model all forms of belief or proportions about events and states, and then utilities are used to model the costs and benefits of any actions taken. An explicit assumption is that these probabilities and utilities can be adequately elicited and precisely modeled for the problem. An implicit assumption is that the computation required – recursive evaluation of


possibly nested integrals and sums (over domain variables) – can be done quickly enough so that the computation itself does not become a significant factor in the costs considered. The Bayesian approach is named after Rev. Thomas Bayes, whose work was contributed to the Royal Society in after his death, although it was independently more generally presented as a theory by Laplace in . The field was subsequently developed into a field of statistics, inference and decision theory by a stream of authors in the s including Jeffreys (Bernardo and Smith, ). The field of statistics was dominated by the frequentist school during the s, and for a time Bayesian methods were considered controversial. Like the different schools of theory in machine learning, these statistical approaches now coexist. The Bayesian approach can be justified by axiomatic prescriptions of how a rational agent should reason and act, and by appeal to principles of consistency. In the context of learning, probabilities are used to infer models of the problem of interest, and then utilities are used to recommend predictions or analysis based on the models.

Theory

Basic Theory

First, consider definitions, the different kinds of probability, the process of reasoning (about probabilities), and making decisions.

Basic definitions: Probabilities are represented in the form p(x∣C), where C is the current known context and x is some event(s) of interest. It is sufficient to place in C only terms relevant to x and to ignore terms assumed by default. Moreover, both x and C must be well-defined events. For instance, x = “John is tall” is not considered a well-defined event, since the word “tall” is not precise. One would instead replace it with something like x = “John is more than 6 feet tall” or x = “Julie said John is tall.” An important functional used with probabilities is the expected value. For a function f(x) of some event x from a space χ, the expected value of f(·) is $E_{x \in \chi}[f(x)]$. Utility is used to measure value or relative satisfaction, and is usually represented as a function on outcomes. Costs are negative utility and benefits are positive. Utilities should be additive in worth, and are often practically interpreted in monetary units. Strictly speaking, the value of money is nonlinear (for most people, a few billion dollars is not significantly better than one billion dollars), so money is not a correct utility measure. However, it is adequate when the range of financial transactions expected is reasonable. Expected utility, which is the expected value of the utility function, is the fundamental quantity assessed with Bayesian methods. Some scenarios are the following:

Prediction: For prediction problems, the outcome is the “true” value, and the utility is sometimes taken to be the negative of the mean squared error or of the absolute error. In data mining, the choices are much richer; see ►Model Evaluation.

Diagnosis: The outcome is the “true” diagnosis, and utility is made up of the differing costs of treatment, mistreatment, and delay or nontreatment, as well as any benefit from correct diagnosis.

Game playing: The utility comes from the eventual outcome of the game. Each player has their own utility, and the state of the game constantly changes as plays are made.

In Bayesian machine learning, we usually take utilities as a given, and the majority of the work revolves around evaluating and estimating probabilities and maximizing expected utility. In some ranking tasks and generalized agent learning, the utilities themselves may be poorly understood.

Belief and proportions: Some probabilities correspond to proportions that exist in the real world, such as the proportion of school children in the general population of a given state. These real proportions can be measured by counting or sampling, and they are governed by Kolmogorov’s axioms for probability, including that the probability of certainty is 1 and that the probability of a disjunction of mutually exclusive events is the sum of the probabilities of the individual events. This kind of probability is used in the frequentist school, which considers only long-term average proportions obtained from a series of independent and identical experiments. These proportions can be model parameters one wishes to reason about. Probabilities can also represent beliefs. For instance, ahead of a presidential election in the USA, one could have had a belief about the event that George Bush would win it. This event is unique and has only one outcome, so the frequentist notion cannot be justified; that is, there is no long-term sequence of different presidential elections with George Bush. Beliefs are usually considered to be subjective, in that they are specific to each agent, reflecting the sum of that agent’s unique experiences and the unique context in which the event in question occurs. To better understand the role beliefs play in Bayesian methods, see also ►Prior Probabilities.

Reasoning: A stylized version of probabilistic reasoning considers an event of interest one is reasoning about, x, and evidence, y, one may obtain. Typical scenarios are:

Learning: x = (Θ, M) are parameters Θ of a model from family M, and y = D is a set of data $D = \{d_1, \ldots, d_N\}$. So one considers p(Θ, M∣D, C) versus p(Θ, M∣C).

Diagnosis: x is a disease or condition, and y is a set of observable symptoms or diagnostic tests. One might choose a test y that maximizes the expected utility.

Hypothesis testing: x is a hypothesis H and y is some sequence of evidence $E_1, E_2, \ldots, E_n$, so we consider $p(H \mid E_1, E_2, \ldots, E_n)$ and hope it is sufficiently high.

Different probabilities are then considered:

p(x∣C): The prior probability for event x, called the base rate in some contexts.

p(y∣C): The prior probability for evidence y. Once the evidence has been seen, this is also used as a proxy for the quality of the model.

p(x∣y, C): The posterior probability for event x given evidence y.

p(y∣x, C): The likelihood for the event x based on evidence y.

In the case of diagnostic reasoning, the prior p(x∣C) is usually the base rate for the disease or condition, and can be obtained from the population base rate. In the case of learning, however, the prior p(Θ, M∣C) represents a prior distribution on parameters about which we may well be largely ignorant, or which we at least may not be able to readily elicit from experts.
For instance, the proportion $\theta_D$ might be the probability of a new drug slowing the onset of AIDS-related diseases. At the moment of initial testing, $\theta_D$ is unknown, so one places a probability distribution over $\theta_D$, which represents one’s belief about the proportion. These priors are second-order probabilities, that is, beliefs about proportions, and they are the most challenging quantity modeled with the Bayesian approach. They can be a function of thousands of parameters, and can be critical to the success of applications. They are also challenging from the philosophical perspective.

Decision theory: The term Bayesian inference is usually reserved for the process of manipulating priors and posteriors, computing probabilities, and computing expected values. Bayesian decision theory describes the process of formulating utilities and then evaluating the (sometimes) recursive maximum expected utility formula, such as in game playing or interactive advertising. In Bayesian theory one takes the action that maximizes expected utility (MEU) in the current context, sometimes referred to as the expected utility hypothesis. Decision theory places this in a dynamic context and says each action should be taken to maximize expected future utility. This is defined recursively, so taken to the limit it implies that the optimal future actions need to be determined before the optimal current action can be obtained via MEU.

Justifications

This section covers basic mathematical justifications of the theory. The best general reference for this is Bernardo and Smith (1994). Additional discussion of prior probabilities appears in ►Prior Probabilities. Note that Bayesian theory, now a branch of mainstream statistics, is widely accepted for the following reasons:

Application: It has extensive support through practical success, often by a clever combination of prior knowledge and statistical and computational finesse.

Explanation: It provides a convenient common language in which a variety of other theoretical approaches can be represented. For instance, PAC, MDL methods, penalized likelihood methods, and the maximum margin approach all find good interpretations within the Bayesian framework.


Composition: It allows different reasoning tasks to be composed in a coherent way. With a probabilistic framework, the components can interoperate in a coherent manner, so that information may flow bidirectionally between components via probabilities. Composition of processing steps in intelligent systems is a key application for Bayesian methods. For instance, natural language and vision recognition tasks can sometimes be broken down into a processing chain (for instance, doing a named entity recognition step before a dependency parsing step), but these components rarely work conclusively and unambiguously. By attaching probabilities to the output of components, and allowing probabilistic inputs, the uncertainty inherent in individual steps can be propagated and managed.

Theoretical justifications also exist to support each of the different components, probabilities, and utilities. These justifications are based on the concept of normative axioms, axioms that do not describe reasoning but rather prescribe basic principles it should follow. The axioms try to capture principles such as coherence and consistency in a quantitative manner. These various justifications have their reported shortcomings, and a rich literature exists arguing about the details and postulating new variants. These axiomatic justifications are supportive of the Bayesian approach, but they are not irrefutable.

Justifying probabilities: In the Bayesian approach, beliefs and proportions are given the same mathematical treatment. One set of arguably controversial justifications for this revolves around betting (Bernardo and Smith, 1994). Someone’s subjective beliefs about specific events, such as significant economic and political events (or horse races), are claimed to be measurable by offering them a series of options or bets. Moreover, if their beliefs do not behave like proportions, then a clever bookmaker can use a so-called Dutch book to consistently profit from them.
An alternative scheme for justifying probability, due to Cox, is based on normative axioms that beliefs should follow. For instance, one controversial axiom by Cox is that belief about a single event should be represented by a single real number. These axioms are presented by Jaynes as rules for a robot (Jaynes, 2003), and as rules for intelligent systems by Horvitz et al. (1986).

Justifying decision theory: Another scheme, again using normative axioms, by von Neumann and Morgenstern, is used to justify the use of utilities. This scheme assumes probabilities are the basis of inference about uncertainty. A different set of normative axiomatic schemes has been developed that justifies the use of probabilities and utilities together under MEU; the best known is by Savage, but others exist (Bernardo and Smith, 1994).
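As a small illustration of the reasoning and decision steps described in the Basic Theory section, the following Python sketch computes a posterior p(x∣y, C) for a binary diagnosis by Bayes' rule and then selects the action with maximum expected utility. The base rate, the test likelihoods, the utility table, and all function names are hypothetical values chosen for illustration; none of them come from the text.

```python
def posterior(prior, lik_if_true, lik_if_false):
    """Posterior p(x | y, C) for a binary event x after observing evidence y.

    prior        -- p(x | C), the base rate
    lik_if_true  -- p(y | x, C)
    lik_if_false -- p(y | not x, C)
    """
    evidence = prior * lik_if_true + (1.0 - prior) * lik_if_false  # p(y | C)
    return prior * lik_if_true / evidence


def best_action(p_disease, utilities):
    """Return (action, expected utility) for the MEU action.

    utilities maps each action to (utility if diseased, utility if healthy).
    """
    expected = {
        action: p_disease * u_sick + (1.0 - p_disease) * u_healthy
        for action, (u_sick, u_healthy) in utilities.items()
    }
    action = max(expected, key=expected.get)
    return action, expected[action]


# Hypothetical numbers, for illustration only.
PRIOR = 0.01                      # assumed base rate of the condition
P_POSITIVE_IF_SICK = 0.95         # assumed test sensitivity
P_POSITIVE_IF_HEALTHY = 0.05      # assumed false positive rate
UTILITIES = {                     # costs are negative utilities
    "treat": (-10.0, -5.0),
    "wait": (-100.0, 0.0),
}

p_after_test = posterior(PRIOR, P_POSITIVE_IF_SICK, P_POSITIVE_IF_HEALTHY)
print("posterior after a positive test:", round(p_after_test, 4))
print("MEU action before the test:", best_action(PRIOR, UTILITIES))
print("MEU action after the test:", best_action(p_after_test, UTILITIES))
```

With these assumed numbers the positive test moves the posterior well above the base rate, and the preferred action changes from waiting to treating, which is the kind of interaction between probabilities and utilities that the MEU rule formalizes.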

Bayesian Computation

The first part of this article has been devoted to a brief overview of the Bayesian approach. Computation for Bayesian inference is an extensive field in itself. Here we review the basic aspects as a pointer to the literature. This is an active area of research in machine learning, statistics, and many applied artificial intelligence communities such as natural language processing, image analysis, and others. In general, in Bayesian reasoning one wants to estimate posterior average parameter values, or their average variance, or some other averaged quantity. In the case of continuous parameters, the general formulas are

$$\bar{\Theta} = E_{\Theta \mid D, M, C}[\Theta] = \int_{\Theta} \Theta \, p(\Theta \mid D, M, C) \, d\Theta$$

$$\mathrm{var}(\Theta) = E_{\Theta \mid D, M, C}\left[(\Theta - \bar{\Theta})^{2}\right]$$

Marginal likelihood: A useful quantity to assist in evaluating results, and a worthy score in its own right, is the marginal likelihood. In the continuous parameter case it is found from the likelihood p(D∣Θ, M, C) by taking an average over the prior:

$$p(D \mid M, C) = \int_{\Theta} p(\Theta \mid M, C) \, p(D \mid \Theta, M, C) \, d\Theta.$$

This is also called the normalizing constant due to its occurrence in the posterior formula

$$p(\Theta \mid D, M, C) = \frac{p(\Theta \mid M, C) \, p(D \mid \Theta, M, C)}{p(D \mid M, C)}.$$

It is generally difficult to estimate because of the multidimensional integrals and sums.
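To make the posterior mean, posterior variance, and marginal likelihood concrete, here is a minimal Python sketch for one case where all three have closed forms: a Beta(a, b) prior on a single proportion θ with k successes observed in a sequence of n Bernoulli trials. The choice of model, the data values, and the function names are assumptions made for illustration. The closed-form marginal likelihood is also checked against a direct numerical evaluation of the integral over the prior.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_bernoulli(a, b, k, n):
    """Posterior mean, posterior variance and log marginal likelihood.

    Prior: theta ~ Beta(a, b). Data: a sequence of n Bernoulli trials
    containing k successes. The posterior is Beta(a + k, b + n - k).
    """
    post_a, post_b = a + k, b + (n - k)
    total = post_a + post_b
    mean = post_a / total
    variance = post_a * post_b / (total ** 2 * (total + 1))
    log_marginal = log_beta(post_a, post_b) - log_beta(a, b)
    return mean, variance, log_marginal


def numeric_marginal(a, b, k, n, grid=20000):
    """Midpoint-rule estimate of the integral of prior times likelihood."""
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        log_prior = ((a - 1) * math.log(theta)
                     + (b - 1) * math.log(1.0 - theta)
                     - log_beta(a, b))
        likelihood = theta ** k * (1.0 - theta) ** (n - k)
        total += math.exp(log_prior) * likelihood / grid
    return total


# Illustrative data: uniform prior, 7 successes in 10 trials.
mean, variance, log_marginal = beta_bernoulli(1.0, 1.0, 7, 10)
print("posterior mean:", mean)
print("posterior variance:", variance)
print("marginal likelihood (closed form):", math.exp(log_marginal))
print("marginal likelihood (numerical):", numeric_marginal(1.0, 1.0, 7, 10))
```

A one-dimensional integral like this is easy to evaluate on a grid; the difficulty noted above arises because the same brute-force approach scales exponentially with the number of parameters.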


Exponential family distributions: Standard probability distributions covered in mathematical statistics, such as the ►Gaussian Distribution, the Poisson, Dirichlet, Gamma, and Wishart, have very convenient mathematical properties that make Bayesian estimation easier. With these distributions, one computes statistics, called sufficient statistics, such as a mean and sum of squares (for the Gaussian), and then parameter estimation follows with a function inverse on a concave function. This is the basis of ►linear regression, ►principal components analysis, and some ►decision tree learning methods, for instance. All good texts on mathematical statistics cover these in detail. Note that the marginal likelihood is often computable in closed form for exponential family distributions.

Graphical models: ►Graphical Models are a general family of probabilistic models formed by composing graphs over variables. They work particularly well with exponential family distributions, and allow a rich variety of popular machine learning and data mining methods to be represented and manipulated. Graphical models allow complex models to be composed from simpler components and provide a family of algorithm schemes for developing inference and learning methods that operate on them. They have become the de facto standard for presenting (suitably decomposed) models and algorithms in the machine learning community.

Maximum a posteriori estimation: Known as MAP, this is usually the simplest form of parameter estimation that could be called Bayesian. It also corresponds to a penalized or regularized maximum likelihood method. Given the posterior for a stylized learning problem of the previous section, one finds the parameters Θ that maximize the posterior p(Θ, M∣D, C), which can be conveniently done without computing the marginal likelihood above, so

$$\hat{\Theta}_{MAP} = \operatorname*{argmax}_{\Theta} \log p(\Theta, D \mid M, C),$$

where the log probability can be broken down into a prior term and a likelihood term:

$$\log p(\Theta, D \mid M, C) = \log p(\Theta \mid M, C) + \log p(D \mid \Theta, M, C).$$

The Laplace approximation: When the posterior is well behaved, and there is a large amount of data, the posterior is focused around a vanishingly small region in parameter space of diameter $O(1/\sqrt{N})$. If this occurs away from the boundary of the parameter space, then one can make a second-order Taylor expansion of the log posterior at the MAP point, and the result is a Gaussian approximation to the posterior:

$$\log p(D, \Theta \mid M, C) \approx \log p(D, \hat{\Theta}_{MAP} \mid M, C) + \frac{1}{2} (\hat{\Theta}_{MAP} - \Theta)^{T} \left. \frac{d^{2} \log p(D, \Theta \mid M, C)}{d\Theta \, d\Theta^{T}} \right|_{\Theta = \hat{\Theta}_{MAP}} (\hat{\Theta}_{MAP} - \Theta).$$
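The MAP estimate and the second-order expansion can be sketched in Python for a one-parameter case. The example below assumes a Beta(a, b) prior on a proportion θ with k successes in n Bernoulli trials, a model chosen only because its MAP point, curvature, and exact marginal likelihood are all available in closed form for comparison. The function names and the data values are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_joint(theta, a, b, k, n):
    """log p(D, theta) = log prior + log likelihood (Beta prior, Bernoulli data)."""
    return ((a - 1 + k) * math.log(theta)
            + (b - 1 + n - k) * math.log(1.0 - theta)
            - log_beta(a, b))


def laplace_approximation(a, b, k, n):
    """MAP point, Gaussian variance and Laplace estimate of log p(D).

    Assumes the mode lies strictly inside (0, 1), i.e. both exponents
    below are positive.
    """
    s = a + k - 1.0          # exponent of theta in the joint
    f = b + n - k - 1.0      # exponent of (1 - theta) in the joint
    theta_map = s / (s + f)
    # Negative second derivative of the log joint at the MAP point.
    curvature = s / theta_map ** 2 + f / (1.0 - theta_map) ** 2
    log_marginal = (log_joint(theta_map, a, b, k, n)
                    + 0.5 * math.log(2.0 * math.pi / curvature))
    return theta_map, 1.0 / curvature, log_marginal


def exact_log_marginal(a, b, k, n):
    """Closed-form log marginal likelihood for the same model."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)


# Illustrative data: Beta(2, 2) prior, 7 successes in 10 trials.
theta_map, gauss_var, log_ml_laplace = laplace_approximation(2.0, 2.0, 7, 10)
log_ml_exact = exact_log_marginal(2.0, 2.0, 7, 10)
print("MAP estimate:", theta_map)
print("variance of the Gaussian approximation:", gauss_var)
print("Laplace / exact marginal likelihood:",
      math.exp(log_ml_laplace - log_ml_exact))
```

With only ten observations the Gaussian approximation is already within several percent of the exact marginal likelihood for this well-behaved posterior; the gap would be larger near the boundary of the parameter space or for multimodal posteriors.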

From this, one can approximate integrals such as the marginal likelihood p(D∣M, C). This is known as the Laplace approximation, after the corresponding general method used for the asymptotic expansion of integrals. In general, this is a poor approximation, but it serves to aid our understanding of parameter estimation (MacKay, 2003), and is the approximate basis for some model selection criteria.

Latent variable models: Latent variables are data that are hidden and thus never observed in the evidence. However, their existence is postulated as a significant component of the model. For instance, in ►Clustering (an unsupervised method) and finite mixture models generally, one assumes each data point has a hidden class label; thus the Bayesian model of clustering is a simple kind of latent variable model.

►Markov chain Monte Carlo methods: The most general form of reasoning and estimation available is the family of Markov chain Monte Carlo (MCMC) methods. The MCMC methods couple two processes: first, they use Monte Carlo or simulation methods to estimate the integral, and second, they use a Markov chain to sample, so sampling is sequential (Markovian) and samples are not independent. Simulation methods generally use the functional form of p(Θ, D∣M, C), so we do not need to compute the marginal likelihood. Hence, given a set of I samples $\{\Theta_1, \ldots, \Theta_I\}$, the expected value is approximated with a weighted average

$$\bar{\Theta} \approx \frac{1}{I} \sum_{i=1}^{I} w_i \Theta_i.$$

The simplest case is where the samples are made independently according to the posterior itself, and then the weights are $w_i = 1$. This is called the ordinary Monte Carlo (OMC) method, but it is not often usable in practice because efficient multidimensional posterior samplers rarely exist. Alternatively, one can sample according to a Markov chain, $\Theta_{i+1} \sim q(\Theta_{i+1} \mid \Theta_i)$, so each $\Theta_{i+1}$ is conditionally dependent on $\Theta_i$. So while samples are not independent, as long as the long-run distribution of the Markov chain is the same as the posterior, the same approximation formula holds. There is a rich variety of MCMC methods, and this forms one of the key areas of current research.

Gibbs sampling: The simplest kind of MCMC method samples each dimension (or sub-vector) in turn. Suppose the parameter vector has K real components, $\Theta = (\theta_1, \ldots, \theta_K)$. Sampling a complete Θ in one go is not generally possible when one has only a functional form of the posterior p(Θ∣D, M, C) and no computable form for the normalizing constant. Gibbs sampling instead works with one-dimensional conditionals, for which normalizing bounds can be obtained and sampling tricks used. The conditional posterior of $\theta_k$ is given by $p(\theta_k \mid (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_K), D, M, C)$, and this is usually easier to sample from. The Gibbs (and MCMC) sample $\Theta_{i+1}$ can be drawn given the previous sample $\Theta_i$ by progressively resampling each dimension in turn and so slowly updating the full vector:

1. Sample $\theta_{i+1,1}$ according to $p(\theta_1 \mid \theta_{i,2}, \ldots, \theta_{i,K}, D, M, C)$.
...
k. Sample $\theta_{i+1,k}$ according to $p(\theta_k \mid \theta_{i+1,1}, \ldots, \theta_{i+1,k-1}, \theta_{i,k+1}, \ldots, \theta_{i,K}, D, M, C)$.
...
K. Sample $\theta_{i+1,K}$ according to $p(\theta_K \mid \theta_{i+1,1}, \ldots, \theta_{i+1,K-1}, D, M, C)$.

In sampling terms, this method is no more successful than coordinate-wise ascent is as a primitive greedy search method: it is supported by theoretical results but can be very slow to converge.

Variational approximations: When the function you seek to optimize or average over presents difficulty, perhaps because it is highly multimodal, then one option is to change the function itself, and replace it with a

more readily approximated function. Variational methods provide a general principle for doing this safely. The general principle uses variational calculus, which is the calculus over functions, not just variables. Variational methods are a very general approach that can be used to develop a broad range of algorithms (Wainwright and Jordan, 2008).

Nonparametric models: The above discussion implicitly assumed the model has a fixed finite parameter vector Θ. If one is attempting to model a regression function, a language grammar, or an image model of a priori unknown structural complexity, then one cannot know the dimension ahead of time. Moreover, as in the case of functions, the dimension cannot always be finite. The ►Bayesian Nonparametric Models address this situation, and are perhaps the most important family of techniques for general machine learning.
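The Gibbs sampling scheme described earlier in this section can be sketched in a few lines of Python. As a stand-in for a real posterior p(Θ∣D, M, C), the example below assumes a two-component target that is a standard bivariate Gaussian with correlation ρ, chosen only because each conditional is a one-dimensional Gaussian that is easy to sample exactly. The target, the function name, the burn-in length, and the seed are all illustrative assumptions.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each sweep resamples theta1 given the current theta2, then theta2
    given the new theta1, using the exact Gaussian conditionals
    N(rho * other, 1 - rho^2).
    """
    cond_std = math.sqrt(1.0 - rho * rho)
    theta1, theta2 = 0.0, 0.0          # arbitrary starting point
    samples = []
    for sweep in range(burn_in + n_samples):
        theta1 = rng.gauss(rho * theta2, cond_std)
        theta2 = rng.gauss(rho * theta1, cond_std)
        if sweep >= burn_in:           # discard the initial sweeps
            samples.append((theta1, theta2))
    return samples


rng = random.Random(0)                 # fixed seed for repeatability
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, burn_in=1000, rng=rng)

# Equal-weight Monte Carlo averages over the (dependent) samples.
mean_theta1 = sum(s[0] for s in samples) / len(samples)
mean_product = sum(s[0] * s[1] for s in samples) / len(samples)
print("estimated E[theta1] (true value 0):", mean_theta1)
print("estimated E[theta1 * theta2] (true value 0.8):", mean_product)
```

Because successive sweeps are correlated, the effective number of independent samples is smaller than the number drawn, and it shrinks further as ρ approaches 1. This is one simple way to see the slow convergence mentioned above.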

Cross References
►Bayes Rule
►Bayesian Nonparametric Models
►Markov Chain Monte Carlo
►Prior Probability

Recommended Reading
A good introduction to the problems of uncertainty and the philosophical issues behind the Bayesian treatment of probability is in Lindley (2006). From the statistical machine learning perspective, a good introductory text is by MacKay (2003), who carefully covers information theory, probability, and inference but not so much statistical machine learning. Another alternative introduction to probabilities is the posthumously completed and published work of Jaynes (2003). Discussions from the frequentist versus Bayesian battlefront can be found in works such as Rosenkrantz and Jaynes (1983), and from the approximate artificial intelligence versus probabilistic battlefront in discussion articles such as Cheeseman’s (1988) and the many responses and rebuttals. It should be noted that it is the continued success in applications that has really led these methods into the mainstream, not the entertaining polemics. Good mathematical statistics textbooks, such as Casella and Berger (2002), cover the breadth of statistical methods and therefore handle basic Bayesian theory. A more comprehensive treatment is given in Bayesian texts such as Gelman et al. (2004). Most advanced statistical machine learning textbooks cover Bayesian methods, but to fully understand the subtleties of prior beliefs and Bayesian methodology one needs to consult more advanced Bayesian literature. A detailed theoretical reference for Bayesian methods is Bernardo and Smith (1994).

Bernardo, J., & Smith, A. (1994). Bayesian theory. Chichester: Wiley.
Casella, G., & Berger, R. (2002). Statistical inference (2nd ed.). Pacific Grove: Duxbury.
Cheeseman, P. (1988). An inquiry into computer understanding. Computational Intelligence.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2004). Bayesian data analysis (2nd ed.). Boca Raton: Chapman & Hall/CRC Press.
Horvitz, E., Heckerman, D., & Langlotz, C. (1986). A framework for comparing alternative formalisms for plausible reasoning. Fifth National Conference on Artificial Intelligence, Philadelphia.
Jaynes, E. (2003). Probability theory: the logic of science. New York: Cambridge University Press.
Lindley, D. (2006). Understanding uncertainty. Hoboken: Wiley.
MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Rosenkrantz, R. (Ed.). (1983). E.T. Jaynes: papers on probability, statistics and statistical physics. Dordrecht: D. Reidel.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Hanover: Now Publishers.

Bayesian Model Averaging
►Learning Graphical Models

Bayesian Network

Synonyms
Bayes net

Definition
A Bayesian network is a form of directed ►graphical model for representing multivariate probability distributions. The nodes of the network represent a set of random variables, and the directed arcs represent causal relationships between variables. The Markov property is usually required: every direct dependency between a possible cause and a possible effect has to be shown with an arc. Bayesian networks with the Markov property are called I-maps (independence maps). If all arcs in the network correspond to a direct dependence in the system being modeled, then the network is said to be a D-map (dependence map). Each node is associated with a conditional probability distribution that quantifies the effects the parents of the node, if any, have on it. Bayesian networks support various forms of reasoning: diagnosis, to derive causes from symptoms; prediction, to derive effects from causes; and intercausal reasoning, to discover the mutual causes of a common effect.

Cross References
►Graphical Models

Bayesian Nonparametric Models
Peter Orbanz, Cambridge University, Cambridge, UK
Yee Whye Teh, University College London, London, UK

Synonyms
Bayesian methods; Dirichlet process; Gaussian processes; Prior probabilities

Definition A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem. For example, in a regression problem, the parameter space can be the set of continuous functions, and in a density estimation problem, the space can consist of all densities. A Bayesian nonparametric model uses only a finite subset of the available parameter dimensions to explain a finite sample of observations, with the set of dimensions chosen depending on the sample such that the effective complexity of the model (as measured by the number of dimensions used) adapts to the data. Classical adaptive problems, such as nonparametric estimation and model selection, can thus be formulated as Bayesian inference problems. Popular examples of Bayesian nonparametric models include Gaussian process regression, in which the correlation structure is refined with growing sample size, and Dirichlet process mixture models for clustering, which adapt the number of clusters to the complexity of the data. Bayesian nonparametric models have recently been applied to a variety of machine learning problems, including regression, classification, clustering, latent variable modeling, sequential modeling, image segmentation, source separation, and grammar induction.


Motivation and Background
Most of machine learning is concerned with learning an appropriate set of parameters within a model class from ►training data. The meta-level problems of determining appropriate model classes are referred to as model selection or model adaptation. These constitute important concerns for machine learning practitioners, not only for avoidance of over-fitting and under-fitting, but also for discovery of the causes and structures underlying data. Examples of model selection and adaptation include selecting the number of clusters in a clustering problem, the number of hidden states in a hidden Markov model, the number of latent variables in a latent variable model, or the complexity of features used in nonlinear regression.

Nonparametric models constitute an approach to model selection and adaptation in which the sizes of models are allowed to grow with data size. This is as opposed to parametric models, which use a fixed number of parameters. For example, a parametric approach to density estimation would be to fit a Gaussian or a mixture of a fixed number of Gaussians by maximum likelihood. A nonparametric approach would be a Parzen window estimator, which centers a Gaussian at each observation (and hence uses one mean parameter per observation). Another example is the support vector machine with a Gaussian kernel. The representer theorem shows that the decision function is a linear combination of Gaussian radial basis functions centered at every input vector, and thus has a complexity that grows with more observations. Nonparametric methods have long been popular in classical (non-Bayesian) statistics (Wasserman, 2006). They often perform impressively in applications and, though theoretical results for such models are typically harder to prove than for parametric models, appealing theoretical properties have been established for a wide range of models.
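The contrast between the two density estimators mentioned above can be sketched in Python. The example fits a single Gaussian by maximum likelihood (two parameters regardless of sample size) and builds a Parzen window estimator that places one Gaussian kernel on each observation. The toy data set, the bandwidth of 0.3, and the function names are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gaussian(data):
    """Maximum likelihood mean and standard deviation (parametric model)."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(variance)


def parzen_density(x, data, bandwidth):
    """Parzen window estimate: an equal-weight Gaussian on every observation."""
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)


# Hypothetical bimodal sample, for illustration only.
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
BANDWIDTH = 0.3   # assumed kernel width; in practice this must be chosen

mean, std = fit_gaussian(data)
for x in (-2.0, 0.0, 2.0):
    print(x,
          "single Gaussian:", round(gaussian_pdf(x, mean, std), 4),
          "Parzen window:", round(parzen_density(x, data, BANDWIDTH), 4))
```

The single Gaussian places its peak between the two groups of points, where no data were observed, while the Parzen estimate follows both modes. The price is that the Parzen estimator stores one kernel center per observation, so its size grows with the data.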
Bayesian nonparametric methods provide a Bayesian framework for model selection and adaptation using nonparametric models. A Bayesian formulation of nonparametric problems is nontrivial, since a Bayesian model defines prior and posterior distributions on a single fixed parameter space, but the dimension of the parameter space in a nonparametric approach should change with sample size. The Bayesian nonparametric solution to this problem is to use an infinite-dimensional parameter space, and to invoke only a finite subset of

the available parameters on any given finite data set. This subset generally grows with the data set. In the context of Bayesian nonparametric models, “infinite-dimensional” can therefore be interpreted as “of finite but unbounded dimension.” More precisely, a Bayesian nonparametric model is a model that (1) constitutes a Bayesian model on an infinite-dimensional parameter space and (2) can be evaluated on a finite sample in a manner that uses only a finite subset of the available parameters to explain the sample. We make the above description more concrete in the next section, where we describe a number of standard machine learning problems and the corresponding Bayesian nonparametric solutions. As we will see, the parameter space in (1) typically consists of functions or of measures, while (2) is usually achieved by marginalizing out surplus dimensions over the prior. Random functions and measures and, more generally, probability distributions on infinite-dimensional random objects are called stochastic processes; examples that we will encounter include Gaussian processes, Dirichlet processes, and beta processes. Bayesian nonparametric models are often named after the stochastic processes they contain. The examples are then followed by theoretical considerations, including formal constructions and representations of the stochastic processes used in Bayesian nonparametric models, exchangeability, and issues of consistency and convergence rate. We conclude this chapter with future directions and a list of literature available for reading.

Examples
Clustering with mixture models. Bayesian nonparametric generalizations of finite mixture models provide an approach for estimating both the number of components in a mixture model and the parameters of the individual mixture components simultaneously from data. Finite mixture models define a density function over data items x of the form $p(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k)$, where $\pi_k$ is the mixing proportion and $\theta_k$ are the parameters associated with component k. The density can be written in a non-standard manner as an integral: $p(x) = \int p(x \mid \theta) \, G(\theta) \, d\theta$, where $G = \sum_{k=1}^{K} \pi_k \delta_{\theta_k}$ is a discrete mixing distribution encapsulating all the parameters of the mixture model and $\delta_{\theta}$ is a Dirac distribution (atom) centered at θ. Bayesian nonparametric mixtures use mixing distributions consisting of a countably infinite number of atoms instead:

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}. \qquad (1)$$

This gives rise to mixture models with an infinite number of components. When applied to a finite training set, only a finite (but varying) number of components will be used to model the data, since each data item is associated with exactly one component but each component can be associated with multiple data items. Inference in the model then automatically recovers both the number of components to use and the parameters of the components. Being Bayesian, we need a prior over the mixing distribution G, and the most common prior to use is a Dirichlet process (DP). The resulting mixture model is called a DP mixture. Formally, a Dirichlet process DP(α, H) parametrized by a concentration parameter α > 0 and a base distribution H is a prior over distributions (probability measures) G such that, for any finite partition $A_1, \ldots, A_m$ of the parameter space, the induced random vector $(G(A_1), \ldots, G(A_m))$ is Dirichlet distributed with parameters $(\alpha H(A_1), \ldots, \alpha H(A_m))$ (see the section “Theory” for a discussion of subtleties involved in this definition). It can be shown that draws from a DP will be discrete distributions as given in (1). The DP also induces a distribution over partitions of integers called the Chinese restaurant process (CRP), which directly describes the prior over how data items are clustered under the DP mixture. For more details on the DP and the CRP, see ►Dirichlet Process.

Nonlinear regression. The aim of regression is to infer a continuous function from a training set consisting of input–output pairs $\{(t_i, x_i)\}_{i=1}^{n}$. Parametric approaches parametrize the function using a finite number of parameters and attempt to infer these parameters from data. The prototypical Bayesian nonparametric approach to this problem is to define a prior distribution over continuous functions directly by means of a Gaussian process (GP). As explained in the chapter ►Gaussian Process, a GP is a distribution on an infinite collection of random variables $X_t$, such that the joint distribution of each finite subset $X_{t_1}, \ldots, X_{t_m}$ is a multivariate Gaussian. A value $x_t$ taken by the variable $X_t$ can be regarded as the value of a continuous function f at t, that is, $f(t) = x_t$. Given the training set,


the Gaussian process posterior is again a distribution on functions, conditional on these functions taking values $f(t_1) = x_1, \ldots, f(t_n) = x_n$.

Latent feature models. These models represent a set of objects in terms of a set of latent features, each of which represents an independent degree of variation exhibited by the data. Such a representation of data is sometimes referred to as a distributed representation. In analogy to nonparametric mixture models with an unknown number of clusters, a Bayesian nonparametric approach to latent feature modeling allows for an unknown number of latent features. The stochastic processes involved here are known as the Indian buffet process (IBP) and the beta process (BP). Draws from BPs are random discrete measures, where each of an infinite number of atoms has a mass in (0, 1) but the masses of atoms need not sum to 1. Each atom corresponds to a feature, with the mass corresponding to the probability that the feature is present for an object. We can visualize the occurrences of features among objects using a binary matrix, where the (i, k) entry is 1 if object i has feature k and 0 otherwise. The distribution over binary matrices induced by the BP is called the IBP.

►Hidden Markov models (HMMs). HMMs are popular models for sequential or temporal data, where each time step is associated with a state, with state transitions dependent on the previous state. An infinite HMM is a Bayesian nonparametric approach to HMMs, where the number of states is unbounded and allowed to grow with the sequence length. It is defined using one DP prior for the transition probabilities going out from each state. To ensure that the set of states reachable from each outgoing state is the same, the base distributions of the DPs are shared and given a DP prior recursively. The construction is called a hierarchical Dirichlet process (HDP); see below.

►Density estimation.
A nonparametric Bayesian approach to density estimation requires a prior on densities or distributions. However, the DP is not useful in this context, since it generates discrete distributions. A useful density estimator, such as a Parzen window estimator, should smooth the empirical density, which requires a prior that can generate smooth distributions. Priors applicable in density estimation problems include DP mixture models and Pólya trees. If $p(x \mid \theta)$ is a smooth density function, the density $\sum_{k=1}^{\infty} \pi_k\, p(x \mid \theta_k)$ induced by a DP mixture model is a

smooth random density, so that DP mixtures can be used as priors in density estimation problems. Pólya trees are priors on probability distributions that can generate both discrete and piecewise continuous distributions, depending on the choice of parameters. Pólya trees are defined by a recursive, infinitely deep binary subdivision of the domain of the generated random measure. Each subdivision is associated with a beta random variable which describes the relative amount of mass on each side of the subdivision. The DP is a special case of a Pólya tree corresponding to a particular parametrization. For other parametrizations the resulting random distribution can be smooth, which makes it suitable for density estimation.

Power-law phenomena. Many naturally occurring phenomena exhibit power-law behavior. Examples include natural languages, images, and social and genetic networks. A generalization of the DP called the Pitman-Yor process, PYP(α, d, H), has recently been used successfully to model power-law data. The Pitman-Yor process augments the DP by a third parameter d ∈ [0, 1). When d = 0 the PYP is a DP(α, H), while when α = 0 it is a so-called normalized stable process.

Sequential modeling. HMMs model sequential data using latent variables representing the underlying state of the system, and assume that each state depends only on the previous state (the so-called Markov property). In some applications, for example language modeling and text compression, we are interested in directly modeling sequences without using latent variables and without making any Markov assumptions, that is, modeling each observation conditional on all previous observations in the sequence. Since the set of potential sequences of previous observations is unbounded, this calls for nonparametric models. A hierarchical Pitman-Yor process can be used to construct a Bayesian nonparametric solution in which the conditional probabilities are coupled hierarchically.
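The power-law behavior of the Pitman-Yor process can be illustrated through its generalized Chinese restaurant construction, which this entry does not spell out. Under that standard construction, with n customers seated at K tables of sizes n_k, the next customer joins table k with probability (n_k − d)/(n + α) and opens a new table with probability (α + dK)/(n + α). The following is a minimal Python sketch under that assumption; the function name `pitman_yor_table_sizes` and the parameter values are illustrative choices, not part of the entry.

```python
import random


def pitman_yor_table_sizes(n, alpha, d, seed=0):
    """Seat n customers with the Pitman-Yor Chinese restaurant rule.

    Returns the list of table (cluster) sizes. Setting d = 0 recovers
    the ordinary Chinese restaurant process associated with DP(alpha, H).
    """
    if not 0.0 <= d < 1.0:
        raise ValueError("discount d must lie in [0, 1)")
    if alpha <= -d:
        raise ValueError("concentration alpha must exceed -d")
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        if not sizes:
            # The first customer always opens the first table.
            sizes.append(1)
            continue
        # Unnormalized weights: one per existing table, then a new table.
        weights = [size - d for size in sizes]
        weights.append(alpha + d * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes


# Illustrative comparison: same alpha, with and without a discount.
dp_sizes = pitman_yor_table_sizes(2000, alpha=1.0, d=0.0)
pyp_sizes = pitman_yor_table_sizes(2000, alpha=1.0, d=0.5)
print("tables with d = 0.0:", len(dp_sizes))
print("tables with d = 0.5:", len(pyp_sizes))
```

With d = 0 the number of tables grows roughly logarithmically in n, whereas a positive discount produces many more small tables, which is the heavy-tailed pattern that makes the PYP attractive for power-law data.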
Dependent and hierarchical models. Most of the Bayesian nonparametric models described so far are applied in settings where observations are homogeneous or exchangeable. In many real-world settings observations are not homogeneous, and in fact are often structured in interesting ways. For example, the data-generating process might change over time, so that observations at different times are not exchangeable, or observations might come in distinct groups, with those in the same group being more similar than those across groups. Significant recent effort in Bayesian nonparametrics research has gone into developing extensions that can handle these non-homogeneous settings.

Dependent Dirichlet processes are stochastic processes, typically over a spatial or temporal domain, which define a Dirichlet process (or a related random measure) at each point, with neighboring DPs being more dependent. These are used for spatial modeling, nonparametric regression, and modeling temporal changes. Alternatively, hierarchical Bayesian nonparametric models like the hierarchical DP aim to couple multiple Bayesian nonparametric models within a hierarchical Bayesian framework. The idea is to allow sharing of statistical strength across multiple groups of observations. Among other applications, these have been used in the infinite HMM, topic modeling, language modeling, word segmentation, image segmentation, and grammar induction. For an overview of various dependent Bayesian nonparametric models and their applications in biostatistics, refer to Dunson (). Teh and Jordan () give an overview of hierarchical Bayesian nonparametric models as well as a variety of applications in machine learning.
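As a worked illustration of the coupling idea, the hierarchical DP for J groups of observations is commonly written as follows. The symbols γ (top-level concentration), $G_j$ (group-level measure), $\theta_{ji}$, and $x_{ji}$ are notation chosen here for illustration and do not appear in the entry; H is the base distribution and F the observation model.

$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0) \quad (j = 1, \ldots, J),$$

$$\theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji}).$$

Because $G_0$ is itself a draw from a DP, it is discrete, so every $G_j$ places its mass on the same countable set of atoms. This shared support is what lets groups share clusters (or, in the infinite HMM, lets all states share the same set of successor states).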

Theory

As we saw in the preceding examples, Bayesian nonparametric models often make use of priors over functions and measures. Because these spaces typically have an uncountable number of dimensions, extra care has to be taken to define the priors properly and to study the asymptotic properties of estimation in the resulting models. In this section we give an overview of the basic concepts involved in the theory of Bayesian nonparametric models. We start with a discussion of the importance of exchangeability in Bayesian parametric and nonparametric statistics. This is followed by representations of the priors and issues of convergence.

Exchangeability

The underlying assumption of all Bayesian methods is that the parameter specifying the observation model is a random variable. This assumption is subject to much criticism, and is at the heart of the Bayesian versus non-Bayesian debate that has long divided the statistics community. However, there is a very general type of observation for which the existence of such a random variable can be derived mathematically: for so-called exchangeable observations, the Bayesian assumption that a randomly distributed parameter exists is not a modeling assumption, but a mathematical consequence of the data's properties. Formally, a sequence of variables $X_1, X_2, \ldots, X_n$ over the same probability space $(\mathcal{X}, \Omega)$ is exchangeable if their joint distribution is invariant to permuting the variables. That is, if $P$ is the joint distribution and $\sigma$ any permutation of $\{1, \ldots, n\}$, then

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_{\sigma(1)}, X_2 = x_{\sigma(2)}, \ldots, X_n = x_{\sigma(n)}).$$

An infinite sequence $X_1, X_2, \ldots$ is infinitely exchangeable if $X_1, \ldots, X_n$ is exchangeable for every $n \ge 1$. In this chapter, we mean infinite exchangeability whenever we write exchangeability. Exchangeability reflects the assumption that the variables do not depend on their indices, although they may be dependent among themselves. This is typically a reasonable assumption in machine learning and statistical applications, even if the variables are not themselves independently and identically distributed (iid). Exchangeability is a much weaker assumption than iid, since iid variables are automatically exchangeable. If θ parametrizes the underlying distribution, and one assumes a prior distribution over θ, then the resulting marginal distribution over $X_1, X_2, \ldots$ with θ marginalized out will still be exchangeable. A fundamental result credited to de Finetti () states that the converse is also true. That is, if $X_1, X_2, \ldots$ is (infinitely) exchangeable, then there is a random θ such that

$$P(X_1, \ldots, X_n) = \int P(\theta) \prod_{i=1}^{n} P(X_i \mid \theta)\, d\theta$$

for every $n \ge 1$. In other words, the seemingly innocuous assumption of exchangeability automatically implies the existence of a hierarchical Bayesian model with θ being the random latent parameter. This is the crux of the fundamental importance of exchangeability to Bayesian statistics.

In de Finetti's theorem it is important to stress that θ can be infinite dimensional (it is typically a random measure), so the hierarchical Bayesian model given by the integral representation above is typically a nonparametric one. For example, the Blackwell–MacQueen urn scheme (related to the CRP) is exchangeable and thus implicitly defines a random measure, namely the DP (see the Dirichlet Processes entry for more details). In this sense, we will see below that de Finetti's theorem is an alternative route to Kolmogorov's extension theorem, which implicitly defines the stochastic processes underlying Bayesian nonparametric models.

Model Representations

In finite dimensions, a probability model is usually defined by a density function or probability mass function. In infinite-dimensional spaces, this approach is not generally feasible, for reasons explained below. To define or work with a Bayesian nonparametric model, we have to choose alternative mathematical representations.

Weak distributions. A weak distribution is a representation for the distribution of a stochastic process, that is, for a probability distribution on an infinite-dimensional sample space. If we assume that the dimensions of the space are indexed by $t \in T$, the stochastic process can be regarded as the joint distribution $P$ of an infinite set of random variables $\{X_t\}_{t \in T}$. For any finite subset $S \subset T$ of dimensions, the joint distribution $P_S$ of the corresponding subset $\{X_t\}_{t \in S}$ of random variables is a finite-dimensional marginal of $P$. The weak distribution of a stochastic process is the set of all its finite-dimensional marginals, that is, the set $\{P_S : S \subset T, |S| < \infty\}$. For example, the customary definition of the Gaussian process as an infinite collection of random variables, each finite subset of which has a joint Gaussian distribution, is a weak distribution representation. In contrast to the explicit representations described below, this representation is generally not generative, because it represents the distribution rather than a random draw, but it is more widely applicable. Clearly, merely specifying a collection of finite-dimensional distributions in this manner does not guarantee that it is a valid representation of a stochastic process. A given collection of finite-dimensional distributions represents a stochastic

process only (1) if a process with these distributions as its marginals actually exists, and (2) if it is uniquely defined by the marginals. The mathematical result which guarantees that weak distribution representations are valid is the Kolmogorov extension theorem (also known as the Daniell–Kolmogorov theorem or the Kolmogorov consistency theorem). Suppose that a collection $\{P_S : S \subset T, |S| < \infty\}$ of distributions is given. If all distributions in the collection are marginals of each other, that is, if $P_{S_1}$ is a marginal of $P_{S_2}$ whenever $S_1 \subset S_2$, the set of distributions is called a projective family. The Kolmogorov extension theorem states that, if the set $T$ is countable, and if the distributions $P_S$ form a projective family, then there exists a uniquely defined stochastic process with the collection $\{P_S\}$ as its marginal distributions. In other words, any projective family for a countable set $T$ of dimensions is the weak distribution of a stochastic process. Conversely, any stochastic process can be represented in this manner, by computing its set of finite-dimensional marginals.

The weak distribution representation assumes that all individual random variables $X_t$ of the stochastic process take values in the same sample space $\Omega$. The stochastic process $P$ defined by the weak distribution is then a probability distribution on the sample space $\Omega^T$, which can be interpreted as the set of all functions $f : T \to \Omega$. For example, to construct a GP we might choose $T = \mathbb{Q}$ and $\Omega = \mathbb{R}$ to obtain real-valued functions on the countable space of rational numbers. Since $\mathbb{Q}$ is dense in $\mathbb{R}$, the function $f$ can then be extended to all of $\mathbb{R}$ by continuity. To define the DP as a distribution over probability measures on $\mathbb{R}$, we note that a probability measure is a set function that maps "random events," i.e., elements of the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ of $\mathbb{R}$, into probabilities in $[0, 1]$. We could therefore choose a weak distribution consisting of Dirichlet distributions, and set $T = \mathcal{B}(\mathbb{R})$ and $\Omega = [0, 1]$.
However, this approach raises a new problem because the set $\mathcal{B}(\mathbb{R})$ is not countable. As in the GP, we can first define the DP on a countable "base" for $\mathcal{B}(\mathbb{R})$ and then extend to all random events by continuity of measures. More precise descriptions are unfortunately beyond the scope of this chapter.

Explicit representations. Explicit representations directly describe a random draw from a stochastic process, rather than its distribution. A prominent example of an explicit representation is the so-called stick-breaking representation of the Dirichlet process. The discrete random measure $G$ defined earlier is completely determined by the two infinite sequences $\{\pi_k\}_{k \in \mathbb{N}}$ and $\{\theta_k\}_{k \in \mathbb{N}}$. The stick-breaking representation of the DP generates these two sequences by drawing $\theta_k \sim H$ iid and $v_k \sim \mathrm{Beta}(1, \alpha)$ for $k = 1, 2, \ldots$. The coefficients $\pi_k$ are then computed as $\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$. The measure $G$ so obtained can be shown to be distributed according to a DP(α, H). Similar representations can be derived for the Pitman–Yor process and the beta process as well. Explicit representations, if they exist for a given model, are typically of great practical importance for the derivation of algorithms.

Implicit representations. A third representation of infinite-dimensional models is based on de Finetti's theorem. Any exchangeable sequence $X_1, \ldots, X_n$ uniquely defines a stochastic process θ, called the de Finetti measure, making the $X_i$'s iid. If the $X_i$'s are sufficient to define the rest of the model and their conditional distributions are easily specified, then it is sufficient to work directly with the $X_i$'s and have the underlying stochastic process implicitly defined. Examples include the Chinese restaurant process (an exchangeable distribution over partitions) with the DP as the de Finetti measure, and the Indian buffet process (an exchangeable distribution over binary matrices) with the BP being the corresponding de Finetti measure. These implicit representations are useful in practice as they can lead to simple and efficient inference algorithms.

Finite representations. A fourth representation of Bayesian nonparametric models is as the infinite limit of finite (parametric) Bayesian models.
For example, DP mixtures can be derived as the infinite limit of finite mixture models with particular Dirichlet priors on mixing proportions, GPs can be derived as the infinite limit of particular Bayesian regression models with Gaussian priors, and BPs can be derived from the limit of an infinite number of independent beta variables. These representations are sometimes more intuitive for practitioners familiar with parametric models. However, not all Bayesian nonparametric models can be expressed in this fashion, and these representations do not necessarily make clear the mathematical subtleties involved.

Consistency and Convergence Rates

A recent series of works in mathematical statistics examines the convergence properties of Bayesian nonparametric models, and in particular the questions of consistency and convergence rates. In this context, a Bayesian model is called consistent if, given that an infinite amount of data is available, the model posterior will concentrate in a neighborhood of the true solution (e.g., the true function or density). A rate of convergence specifies, for a finite sample, how rapidly the posterior concentrates as a function of the sample size. In their pioneering article, Diaconis and Freedman () showed, to the great surprise of much of the Bayesian community, that models such as the Dirichlet process can be inconsistent, and may converge to arbitrary solutions even for an infinite amount of data. More recent results, notably by van der Vaart and Ghosal, apply modern methods of mathematical statistics to study the convergence properties of Bayesian nonparametric models (see e.g., Gho, () and references therein). Consistency has been shown for a number of models, including Gaussian processes and Dirichlet process mixtures. A particularly interesting aspect of this line of work is the set of results on convergence rates, which specify the rate of concentration of the posterior depending on sample size, on the complexity of the model, and on how much probability mass the prior places around the true solution. Making such results quantitative requires a measure of the complexity of a Bayesian nonparametric model. This is provided by complexity measures developed in empirical process theory and statistical learning theory, such as metric entropies, covering numbers, and bracketing, some of which are well known in theoretical machine learning.
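To make the stick-breaking representation described under "Explicit representations" more concrete, the following Python sketch draws a truncated approximation to a DP(α, H) random measure. It follows the recipe $v_k \sim \mathrm{Beta}(1, \alpha)$, $\pi_k = v_k \prod_{j<k} (1 - v_j)$. The truncation level, the choice of a standard normal for H, and the function names are assumptions made here for illustration; the exact representation has infinitely many atoms.

```python
import random


def stick_breaking_weights(alpha, num_atoms, rng):
    """Return the first num_atoms stick-breaking weights and the leftover mass.

    Each v_k ~ Beta(1, alpha) breaks off a fraction of the remaining stick,
    so pi_k = v_k * prod_{j<k} (1 - v_j).
    """
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def truncated_dp_draw(alpha, num_atoms, base_sampler, seed=0):
    """Draw atoms theta_k ~ H and weights pi_k for a truncated DP(alpha, H).

    base_sampler(rng) is a hypothetical callback that returns one draw
    from the base distribution H.
    """
    rng = random.Random(seed)
    weights, leftover = stick_breaking_weights(alpha, num_atoms, rng)
    atoms = [base_sampler(rng) for _ in range(num_atoms)]
    return atoms, weights, leftover


# Illustrative use with H = N(0, 1), an assumption made for this sketch.
atoms, weights, leftover = truncated_dp_draw(
    alpha=2.0,
    num_atoms=100,
    base_sampler=lambda rng: rng.gauss(0.0, 1.0),
)
print(f"mass captured by 100 atoms: {sum(weights):.6f}")
```

The leftover mass shrinks geometrically in expectation, which is why a modest truncation level often suffices in practice; the choice of level is nevertheless an approximation that should be checked for the value of α in use.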

Inference

There are two aspects to inference from Bayesian nonparametric models: the analytic tractability of posteriors for the stochastic processes embedded in Bayesian nonparametric models, and practical inference algorithms for the overall models. Bayesian nonparametric models typically include stochastic processes such as the Gaussian process and the Dirichlet process. These processes have an infinite number of dimensions, hence naïve algorithmic approaches to computing posteriors are generally infeasible. Fortunately, these processes typically have analytically tractable posteriors, so all but finitely many of the dimensions can be analytically integrated out efficiently. The remaining dimensions, along with the parametric parts of the models, can then be handled by the usual inference techniques employed in parametric Bayesian modeling, including Markov chain Monte Carlo, sequential Monte Carlo, variational inference, and message-passing algorithms like expectation propagation. The precise choice of approximations to use will depend on the specific models under consideration, with speed/accuracy trade-offs between different techniques generally following those for parametric models. In the following, we will give two examples to illustrate the above points, and discuss a few theoretical issues associated with the analytic tractability of stochastic processes.

Examples

In Gaussian process regression, we model the relationship between an input $x$ and an output $y$ using a function $f$, so that $y = f(x) + \epsilon$, where $\epsilon$ is iid Gaussian noise. Given a GP prior over $f$ and a finite training data set $\{(x_i, y_i)\}_{i=1}^{n}$, we wish to compute the posterior over $f$. Here we can use the weak representation of $f$ and note that $\{f(x_i)\}_{i=1}^{n}$ is simply a finite-dimensional Gaussian with mean and covariance given by the mean and covariance functions of the GP. Inference for $\{f(x_i)\}_{i=1}^{n}$ is then straightforward. The approach can be thought of equivalently as marginalizing out the whole function except its values on the training inputs. Note that although we only have the posterior over $\{f(x_i)\}_{i=1}^{n}$, this is sufficient to reconstruct the function evaluated at any other point $x_0$ (say the test input), since, given $\{f(x_i)\}_{i=1}^{n}$, $f(x_0)$ is Gaussian and independent of the training data $\{(x_i, y_i)\}_{i=1}^{n}$. In GP regression the posterior over $\{f(x_i)\}_{i=1}^{n}$ can be computed exactly. In GP classification or other regression settings with nonlinear likelihood functions, the typical approach is to use sparse methods based on variational approximations or expectation propagation; see the Gaussian Processes entry for details.

Our second example involves Dirichlet process mixture models. Recall that the DP induces a clustering structure on the data items. If our training set consists of $n$ data items, then, since each item can only belong to one cluster, there are at most $n$ clusters represented in the training set. Even though the DP mixture itself has an infinite number of potential clusters, all but finitely many of these are not associated with data, so the associated variables need not be explicitly represented at all. This can be understood either as marginalizing out these variables, or as an implicit representation which can be made explicit whenever required by sampling from the prior. This idea is applicable for DP mixtures using both the Chinese restaurant process and the stick-breaking representations. In the CRP representation, each data item $x_i$ is associated with a cluster index $z_i$, and each cluster $k$ with a parameter $\theta^*_k$ (these parameters can be marginalized out if $H$ is conjugate to $F$), and these are the only latent variables that need be represented in memory. In the stick-breaking representation, clusters are ordered by decreasing prior expected size, with cluster $k$ associated with a parameter $\theta^*_k$ and a size $\pi_k$. Each data item is again associated with a cluster index $z_i$, and only the clusters up to $K = \max(z_1, \ldots, z_n)$ need to be represented. Clusters with index greater than $K$ need not be represented, since their posterior, conditioned on $\{(x_i, z_i)\}_{i=1}^{n}$, is just the prior.

On Bayes Equations and Conjugacy

It is worth noting that the posterior of a Bayesian model is, in abstract terms, defined as the conditional distribution of the parameter given the data and the hyperparameters, and this definition does not require the existence of a Bayes equation. If a Bayes equation exists for the model, the posterior can equivalently be defined as the left-hand side of the Bayes equation. However, for some stochastic processes, notably the DP on an uncountable space such as $\mathbb{R}$, it is not possible to define a Bayes equation even though the posterior is still a well-defined mathematical object. Technically speaking, existence of a Bayes equation requires the family of all possible posteriors to be dominated by the prior, but this is not the case for the DP. That posteriors of these stochastic processes can be evaluated at all is solely due to the fact that they admit an analytic representation.

The particular form of tractability exhibited by many stochastic processes in the literature is that of a conjugate posterior, that is, the posterior belongs to the same model family as the prior, and the posterior parameters can be computed as a function of the prior hyperparameters and the observed data. For example, the posterior of a $\mathrm{DP}(\alpha, G_0)$ under observations $\theta_1, \ldots, \theta_n$ is again a Dirichlet process, $\mathrm{DP}\left(\alpha + n, \frac{1}{\alpha + n}\left(\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}\right)\right)$. Similarly, the posterior of a GP under observations of $f(x_1), \ldots, f(x_n)$ is still a GP. It is this conjugacy that allows practical inference in the examples above. A Bayesian nonparametric model is conjugate if and only if the elements of its weak distribution, i.e., its finite-dimensional marginals, have a conjugate structure as well (Orbanz, ). In particular, this characterizes a class of conjugate Bayesian nonparametric models whose weak distributions consist of exponential family models. Note, however, that lack of conjugacy does not imply intractable posteriors. An example is given by the Pitman–Yor process, in which the posterior is given by a sum of a finite number of atoms and a Pitman–Yor process independent of the atoms.
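The Gaussian process regression example and the GP conjugacy statement above can be made concrete with a small sketch. The code below computes the exact posterior mean and variance of f at a test input from the finite-dimensional Gaussian over the training inputs. Several things are assumptions made here rather than taken from the entry: a zero mean function, a squared-exponential covariance, scalar inputs, a fixed noise variance, and all function names. The linear algebra is written out with the standard library only so that the sketch is self-contained.

```python
import math


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between scalar inputs a and b."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def solve_lower(lower, rhs):
    """Solve lower @ x = rhs by forward substitution."""
    x = []
    for i, row in enumerate(lower):
        x.append((rhs[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def solve_lower_transposed(lower, rhs):
    """Solve lower.T @ x = rhs by back substitution."""
    n = len(lower)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (rhs[i] - s) / lower[i][i]
    return x


def gp_posterior(train_x, train_y, test_x, noise_var=1e-2):
    """Posterior mean and variance of f(test_x) under a zero-mean GP prior."""
    n = len(train_x)
    gram = [[rbf_kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    lower = cholesky(gram)
    # weights = (K + noise_var * I)^{-1} y, via two triangular solves
    weights = solve_lower_transposed(lower, solve_lower(lower, train_y))
    k_star = [rbf_kernel(test_x, xi) for xi in train_x]
    mean = sum(k * w for k, w in zip(k_star, weights))
    v = solve_lower(lower, k_star)
    variance = rbf_kernel(test_x, test_x) - sum(vi * vi for vi in v)
    return mean, variance


# Illustrative data: noiseless samples of sin(x) at five inputs.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [math.sin(x) for x in xs]
mean_near, var_near = gp_posterior(xs, ys, 1.0, noise_var=1e-4)
mean_far, var_far = gp_posterior(xs, ys, 10.0, noise_var=1e-4)
print(mean_near, var_near)
print(mean_far, var_far)
```

Only an n-by-n system is ever solved, which reflects the point made in the text: all function values other than those at the training and test inputs have been marginalized out analytically. Far from the data the posterior reverts to the prior, with mean close to zero and variance close to the prior variance.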

Future Directions

Since MCMC (see Markov Chain Monte Carlo) sampling algorithms for Dirichlet process mixtures became available in the 1990s and made latent variable models with nonparametric Bayesian components applicable to practical problems, the development of Bayesian nonparametrics has experienced explosive growth (Escobar & West, ; Neal, ). Arguably, though, the results available so far have only scratched the surface. The repertoire of available models is still mostly limited to the Gaussian process, the Dirichlet process, the beta process, and generalizations derived from those. In principle, Bayesian nonparametric models may be defined on any infinite-dimensional mathematical object of possible interest to machine learning and statistics. Possible examples are kernels, infinite graphs, special classes of functions (e.g., piecewise continuous or Sobolev functions), and permutations. Aside from the obvious modeling questions, two major future directions are to make Bayesian nonparametric methods available to a larger audience of researchers and practitioners through the development of software packages, and to understand and quantify the theoretical properties of available methods.

General-Purpose Software Package

There is currently significant growth in the application of Bayesian nonparametric models across a variety of application domains, both in machine learning and in statistics. However, significant hurdles still exist, especially the expense and expertise needed to develop computer programs for inference in these complex models. One future direction is thus the development of software packages that can compile efficient inference algorithms automatically from model specifications, allowing a much wider range of modelers to make use of these models. Current developments include the R DPpackage (http://cran.r-project.org/web/packages/DPpackage), the hierarchical Bayesian compiler (http://www.cs.utah.edu/hal/HBC), adaptor grammars (http://www.cog.brown.edu/mj/Software.htm), the MIT-Church project (http://projects.csail.mit.edu/church/wiki/Church), as well as efforts to add Bayesian nonparametric models to the repertoire of current Bayesian modeling environments like OpenBugs (http://mathstat.helsinki.fi/openbugs) and infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet).

Statistical Properties of Models

Recent work in mathematical statistics provides some insight into the quantitative behavior of Bayesian nonparametric models (cf. the Theory section). The elegant, methodical approach underlying these results, which quantifies model complexity by means of empirical process theory and then derives convergence rates as a function of the complexity, should be applicable to a wide range of models. So far, however, only results for Gaussian processes and Dirichlet process mixtures have been proven, and it will be of great interest to establish properties for other priors. Some models developed in machine learning, such as the infinite HMM, may pose new challenges to theoretical methodology, since their study will probably have to draw on both the theory of algorithms and mathematical statistics. Once a wider range of results is available, they may in turn serve to guide the development of new models, if it is possible to establish how different methods of model construction affect the statistical properties of the constructed model.

In addition to the references embedded in the text above, we recommend the books Hjort, Holmes, Müller, and Walker (), Ghosh and Ramamoorthi (), and the review articles Walker, Damien, Laud, and Smith (), Müller and Quintana () on Bayesian nonparametrics. Further references can be found in the chapter by Teh and Jordan () of the book Hjort et al. ().

Cross References

Bayesian Methods
Dirichlet Processes
Gaussian Processes
Mixture Modelling
Prior Probabilities

Recommended Reading

Diaconis, P., & Freedman, D. (). On the consistency of Bayes estimates (with discussion). Annals of Statistics, (), –.
Dunson, D. B. (). Nonparametric Bayes applications to biostatistics. In N. Hjort, C. Holmes, P. Müller, & S. Walker (Eds.), Bayesian nonparametrics. Cambridge: Cambridge University Press.
Escobar, M. D., & West, M. (). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, , –.
de Finetti, B. (). Funzione caratteristica di un fenomeno aleatorio. Atti della R. Academia Nazionale dei Lincei, Serie . Memorie, Classe di Scienze Fisiche, Mathematice e Naturale, , –.
Ghosh, J. K., & Ramamoorthi, R. V. (). Bayesian nonparametrics. New York: Springer.
Hjort, N., Holmes, C., Müller, P., & Walker, S. (Eds.) (). Bayesian nonparametrics. In Cambridge series in statistical and probabilistic mathematics (No. ). Cambridge: Cambridge University Press.
Müller, P., & Quintana, F. A. (). Nonparametric Bayesian data analysis. Statistical Science, (), –.
Neal, R. M. (). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, , –.
Orbanz, P. (). Construction of nonparametric Bayesian models from parametric Bayes equations. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, , –.
Teh, Y. W., & Jordan, M. I. (). Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Müller, & S. Walker (Eds.), Bayesian nonparametrics. Cambridge: Cambridge University Press.
Walker, S. G., Damien, P., Laud, P. W., & Smith, A. F. M. (). Bayesian nonparametric inference for random distributions and related functions. Journal of the Royal Statistical Society, (), –.
Wasserman, L. (). All of nonparametric statistics. New York: Springer.

Bayesian Reinforcement Learning

Pascal Poupart
University of Waterloo, Waterloo, Ontario, Canada

Synonyms

Adaptive control processes; Bayes adaptive Markov decision processes; Dual control; Optimal learning

Definition

Bayesian reinforcement learning refers to reinforcement learning modeled as a Bayesian learning problem (see Bayesian Methods). More specifically, following Bayesian learning theory, reinforcement learning is performed by computing a posterior distribution on the unknowns (e.g., any combination of the transition probabilities, reward probabilities, value function, value gradient, or policy) based on the evidence received (e.g., history of past state–action pairs).

Motivation and Background

Bayesian reinforcement learning can be traced back to the 1950s and 1960s in the work of Bellman (), Fel'Dbaum (), and several of Howard's students (Martin, ). Shortly after Markov decision processes were formalized, these researchers (and several others) in Operations Research considered the problem of controlling a Markov process with uncertain transition and reward probabilities, which is equivalent to reinforcement learning. They considered Bayesian techniques since Bayesian learning is performed by probabilistic inference, which naturally combines with decision theory. In general, Bayesian reinforcement learning distinguishes itself from other reinforcement learning approaches by the use of probability distributions (instead of point estimates) to fully capture the uncertainty. This enables the learner to make more informed decisions, with the potential of learning faster with less data. In particular, the exploration/exploitation tradeoff can be naturally optimized. The use of a prior distribution also facilitates the encoding of domain knowledge, which is exploited in a natural and principled way by the learning process.

Structure of Learning Approach

A Markov decision process (MDP) (Puterman, ) can be formalized by a tuple $\langle S, A, T \rangle$ where $S$ is the set of states $s$, $A$ is the set of actions $a$, and $T(s, a, s') = \Pr(s' \mid s, a)$ is the transition distribution indicating the probability of reaching $s'$ when executing $a$ in $s$. Let $s_r$ denote the reward feature of a state and $\Pr(s'_r \mid s, a)$ be the probability of earning $r$ when executing $a$ in $s$. A policy $\pi : S \to A$ consists of a mapping from states to actions. For a given discount factor $0 \le \gamma \le 1$ and horizon $h$, the value $V^\pi$ of a policy $\pi$ is the expected discounted total reward earned while executing this policy: $V^\pi(s) = \sum_{t=0}^{h} \gamma^t E_{s \mid \pi}[s_r^t]$. The value function $V^\pi$ can be written in a recursive form as the expected sum of the immediate reward $s'_r$ with the discounted future rewards: $V^\pi(s) = \sum_{s'} \Pr(s' \mid s, \pi(s))\, [s'_r + \gamma V^\pi(s')]$. The goal is to find an optimal policy $\pi^*$, that is, a policy with the highest value $V^*$ in all states (i.e., $V^*(s) \ge V^\pi(s)\ \forall \pi, s$). Many algorithms exploit the fact that the optimal value function $V^*$ satisfies Bellman's equation:

$$V^*(s) = \max_a \sum_{s'} \Pr(s' \mid s, a)\, [s'_r + \gamma V^*(s')]$$
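Bellman's equation above can be solved by repeatedly applying its right-hand side as an update until the values stop changing (value iteration). The sketch below assumes an infinite horizon with γ < 1 and a known transition model, which is the planning setting rather than the learning setting of this entry. The two-state MDP, the function names, and the dictionary layout are illustrative assumptions made here.

```python
def value_iteration(transitions, state_reward, gamma=0.9, tol=1e-10):
    """Iterate Bellman's equation to an approximate fixed point.

    transitions[s][a] maps each successor s2 to Pr(s2 | s, a);
    state_reward[s2] plays the role of the reward feature of s2.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {}
        for s, actions in transitions.items():
            new_values[s] = max(
                sum(p * (state_reward[s2] + gamma * values[s2])
                    for s2, p in successors.items())
                for successors in actions.values()
            )
        gap = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if gap < tol:
            return values


def greedy_policy(transitions, state_reward, values, gamma=0.9):
    """Pick, in each state, the action that maximizes the Bellman backup."""
    policy = {}
    for s, actions in transitions.items():
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (state_reward[s2] + gamma * values[s2])
                              for s2, p in actions[a].items()),
        )
    return policy


# Hypothetical two-state MDP: reaching "high" earns reward 1.
state_reward = {"low": 0.0, "high": 1.0}
transitions = {
    "low": {"stay": {"low": 1.0}, "move": {"high": 0.8, "low": 0.2}},
    "high": {"stay": {"high": 1.0}, "move": {"low": 1.0}},
}
values = value_iteration(transitions, state_reward, gamma=0.9)
policy = greedy_policy(transitions, state_reward, values, gamma=0.9)
print(values)
print(policy)
```

For this toy problem the fixed point can be checked by hand: staying in "high" gives V*(high) = 1 + 0.9 V*(high) = 10, and moving from "low" gives V*(low) = 0.8 · 10 + 0.18 V*(low), so V*(low) = 8/0.82.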

Reinforcement learning (Sutton & Barto, ) is concerned with the problem of finding an optimal policy when the transition (and reward) probabilities $T$ are unknown (or uncertain). Bayesian learning is a learning approach in which unknowns are modeled as random variables $X$ over which distributions encode the uncertainty. The process of learning consists of updating the prior distribution $\Pr(X)$ based on some evidence $e$ to obtain a posterior distribution $\Pr(X \mid e)$ according to Bayes theorem: $\Pr(X \mid e) = k \Pr(X) \Pr(e \mid X)$. (Here $k = 1/\Pr(e)$ is a normalization constant.) Hence, Bayesian reinforcement learning consists of using Bayesian learning for reinforcement learning. The unknowns are the transition (and reward) probabilities $T$, the optimal value function $V^*$, and the optimal policy $\pi^*$. Techniques that maintain a distribution on $T$ are known as model-based Bayesian reinforcement learning since they explicitly learn the underlying model $T$. In contrast, techniques that maintain a distribution on $V^*$ or $\pi^*$ are known as model-free Bayesian reinforcement learning since they directly learn the optimal value function or policy without learning a model.


Model-Based Bayesian Learning

In model-based Bayesian reinforcement learning, the learner starts with a prior distribution over the parameters of T, which we denote by θ. For instance, let $\theta_a^{s,s'} = \Pr(s' \mid s, a, \theta)$ be the unknown probability of reaching s′ when executing a in s. In general, we denote by θ the set of all $\theta_a^{s,s'}$. Then, the prior b(θ) represents the initial belief of the learner regarding the underlying model. The learner updates its belief after every observed triple ⟨s, a, s′⟩ by computing a posterior $b_a^{s,s'}(\theta) = b(\theta \mid s, a, s')$ according to Bayes’ theorem:

$$b_a^{s,s'}(\theta) = k\, b(\theta) \Pr(s' \mid s, a, \theta) = k\, b(\theta)\, \theta_a^{s,s'}.$$

In order to facilitate belief updates, it is convenient to pick the prior from a family of distributions that is closed under Bayes updates. This ensures that beliefs are always parameterized in the same way. Such families are called conjugate priors. In the case of a discrete model (i.e., Pr(s′∣s, a, θ) is a discrete distribution), Dirichlets are conjugate priors and form a family of distributions corresponding to monomials over the simplex of discrete distributions (DeGroot, ). They are parameterized as follows:

$$\mathrm{Dir}(\theta; n) = k \prod_i \theta_i^{\,n_i - 1}.$$

Here θ is an unknown discrete distribution such that $\sum_i \theta_i = 1$ and n is a vector of strictly positive real numbers $n_i$ (known as the hyperparameters) such that $n_i - 1$ can be interpreted as the number of times that the $\theta_i$-probability event has been observed. Since the unknown transition model θ is made up of one unknown distribution $\theta_a^s$ per ⟨s, a⟩ pair, let the prior be $b(\theta) = \prod_{s,a} \mathrm{Dir}(\theta_a^s; n_a^s)$ such that $n_a^s$ is a vector of hyperparameters $n_a^{s,s'}$. The posterior obtained after transition ⟨ŝ, â, ŝ′⟩ is

$$b_{\hat a}^{\hat s,\hat s'}(\theta) = k\, \theta_{\hat a}^{\hat s,\hat s'} \prod_{s,a} \mathrm{Dir}(\theta_a^s; n_a^s) = \prod_{s,a} \mathrm{Dir}\bigl(\theta_a^s;\; n_a^s + \delta_{\hat s,\hat a,\hat s'}(s, a, s')\bigr),$$

where $\delta_{\hat s,\hat a,\hat s'}(s, a, s')$ is a Kronecker delta that returns 1 when s = ŝ, a = â, s′ = ŝ′ and 0 otherwise. In practice, belief monitoring is as simple as incrementing the hyperparameter corresponding to the observed transition.
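The counting view of belief monitoring can be illustrated with a minimal Python sketch. The class name `DirichletBelief`, the symmetric `prior_count`, and the toy state and action labels are assumptions made for illustration; they do not come from this entry. The `mean` method uses the standard fact that the mean of a Dirichlet is its normalized hyperparameter vector.

```python
from collections import defaultdict


class DirichletBelief:
    """Independent Dirichlet beliefs over next-state distributions, one per (s, a)."""

    def __init__(self, states, prior_count=1.0):
        # prior_count is a hypothetical symmetric prior; any strictly positive value works.
        self.states = list(states)
        self.counts = defaultdict(
            lambda: {s2: prior_count for s2 in self.states}
        )

    def update(self, s, a, s_next):
        # Bayes update for a Dirichlet: increment the matching hyperparameter.
        self.counts[(s, a)][s_next] += 1.0

    def mean(self, s, a, s_next):
        # Posterior mean of theta_a^{s,s'}: hyperparameter divided by the row total.
        row = self.counts[(s, a)]
        return row[s_next] / sum(row.values())


belief = DirichletBelief(states=["s0", "s1"])
for observed_next in ["s1", "s1", "s0"]:
    belief.update("s0", "go", observed_next)

p_s1 = belief.mean("s0", "go", "s1")
print(p_s1)
```

Storing only the hyperparameters keeps each update constant-time, which is the practical payoff of choosing a conjugate prior.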

Belief MDP Equivalence

At any point in time, the belief b provides an explicit representation of the uncertainty of the learner about the underlying model. This information is very useful to decide whether future actions should focus on exploring or exploiting. Hence, in Bayesian reinforcement learning, policies π are mappings from state-belief pairs ⟨s, b⟩ to actions. Equivalently, the problem of Bayesian reinforcement learning can be thought of as one of planning with a belief MDP (or a partially observable MDP). More precisely, every Bayesian reinforcement learning problem has an equivalent belief MDP formulation $\langle S_{bel}, A_{bel}, T_{bel} \rangle$ where $S_{bel} = S \times B$ (B is the space of beliefs b), $A_{bel} = A$, and

$$T_{bel}(s_{bel}, a_{bel}, s'_{bel}) = \Pr(s'_{bel} \mid s_{bel}, a_{bel}) = \Pr(s', b' \mid s, b, a) = \Pr(b' \mid s, b, a, s') \Pr(s' \mid s, b, a).$$

The decomposition of the transition dynamics is particularly interesting since $\Pr(b' \mid s, b, a, s')$ equals 1 when $b' = b_a^{s,s'}$ (the posterior obtained from b by the Bayes update for the transition ⟨s, a, s′⟩) and 0 otherwise. Furthermore, $\Pr(s' \mid s, b, a) = \int_\theta b(\theta) \Pr(s' \mid s, \theta, a)\, d\theta$, which can be computed in closed form when b is a Dirichlet. As a result, the transition dynamics of the belief MDP are fully known. This is a remarkable fact since it means that Bayesian reinforcement learning problems, which by definition have unknown/uncertain transition dynamics, can be recast as belief MDPs with known transition dynamics. While this does not make the problem any easier since the belief MDP has a hybrid state space (discrete s with continuous b), it allows us to treat policy optimization as a problem of planning and in particular to adapt algorithms originally designed for belief MDPs (also known as partially observable MDPs).

Optimal Value Function Parameterization

Many planning techniques compute the optimal value function V∗, from which an optimal policy π∗ can easily be extracted. Despite the hybrid nature of the state space, the optimal value function (for a finite horizon) has a simple parameterization corresponding to the upper envelope of a set of polynomials (Poupart, Vlassis, Hoey, & Regan, ). Recall that the optimal value function satisfies Bellman’s equation, which can be adapted as follows for a belief MDP:

$$V^*(s, b) = \max_a \sum_{s'} \Pr(s', b' \mid s, b, a) \left[ r_{s'} + \gamma V^*(s', b') \right].$$

Using the fact that b′ must be $b_a^{s,s'}$ (otherwise $\Pr(s', b' \mid s, b, a) = 0$) allows us to rewrite Bellman’s equation as follows:

$$V^*(s, b) = \max_a \sum_{s'} \Pr(s' \mid s, b, a) \left[ r_{s'} + \gamma V^*(s', b_a^{s,s'}) \right].$$

Let $\Gamma_s^n$ be a set of polynomials in θ such that the optimal value function $V^n$ with n steps to go is $V^n(s, b) = \int_\theta b(\theta)\, \mathrm{poly}_{s,b}(\theta)\, d\theta$, where $\mathrm{poly}_{s,b} = \arg\max_{\mathrm{poly} \in \Gamma_s^n} \int_\theta b(\theta)\, \mathrm{poly}(\theta)\, d\theta$. It suffices to replace $\Pr(s' \mid s, b, a)$, $b_a^{s,s'}$, and $V^n$ by their definitions in Bellman’s equation

$$\begin{aligned}
V^{n+1}(s, b) &= \max_a \sum_{s'} \int_\theta b(\theta) \Pr(s' \mid s, \theta, a) \left[ r_{s'} + \gamma\, \mathrm{poly}_{s', b_a^{s,s'}}(\theta) \right] d\theta \\
&= \max_a \int_\theta b(\theta) \sum_{s'} \theta_a^{s,s'} \left[ r_{s'} + \gamma\, \mathrm{poly}_{s', b_a^{s,s'}}(\theta) \right] d\theta
\end{aligned}$$

to obtain a similar set of polynomials $\Gamma_s^{n+1} = \left\{ \sum_{s'} \theta_a^{s,s'} \left[ r_{s'} + \gamma\, \mathrm{poly}_{s'}(\theta) \right] \mid a \in A,\ \mathrm{poly}_{s'} \in \Gamma_{s'}^n \right\}$ that represents $V^{n+1}$. The fact that the optimal value function has a closed form with a simple parameterization is quite useful for planning algorithms based on value iteration. Instead of using an arbitrary function approximator to fit the value function, one can take advantage of the fact that the value function can be represented by a set of polynomials to choose a good representation. For instance, the Beetle algorithm (Poupart et al., ) performs point-based value iteration and approximates the value function with a bounded set of polynomials, each of which consists of a linear combination of monomial basis functions.
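For very small problems, the belief-MDP form of Bellman’s equation can be evaluated directly by recursion, which may help make the role of the updated belief concrete. The sketch below is not the Beetle algorithm and does not use the polynomial representation. It assumes a Dirichlet belief stored as a nested list of hyperparameters `counts[s][a][s_next]`, known rewards that depend only on the next state, and a hypothetical two-state, two-action problem. Its cost grows exponentially with the horizon.

```python
def predictive(counts, s, a, s_next):
    """Pr(s' | s, b, a) for a Dirichlet belief: normalized hyperparameters."""
    row = counts[s][a]
    return row[s_next] / sum(row)


def updated(counts, s, a, s_next):
    """Return the posterior belief: a copy with one hyperparameter incremented."""
    new = [[list(row) for row in per_action] for per_action in counts]
    new[s][a][s_next] += 1
    return new


def value(s, counts, horizon, rewards, gamma):
    """Finite-horizon V(s, b) by direct recursion on the belief-MDP Bellman equation."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in range(len(counts[s])):
        q = 0.0
        for s_next in range(len(counts)):
            p = predictive(counts, s, a, s_next)
            future = value(s_next, updated(counts, s, a, s_next),
                           horizon - 1, rewards, gamma)
            q += p * (rewards[s_next] + gamma * future)
        best = max(best, q)
    return best


# Hypothetical problem: uniform Dirichlet prior, reward 1 for reaching state 1.
prior = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
rewards = [0.0, 1.0]
v1 = value(0, prior, 1, rewards, 0.9)
v2 = value(0, prior, 2, rewards, 0.9)
print(v1, v2)
```

Because each recursive call receives the belief that would result from the simulated transition, the value of information gathered by an action is accounted for automatically.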

Exploration/Exploitation Tradeoff

Since the underlying model is unknown in reinforcement learning, it is not clear whether actions should be chosen to explore (gain more information about the model) or exploit (maximize immediate rewards based on information gathered so far). Bayesian reinforcement learning provides a principled solution to the exploration/exploitation tradeoff. Despite the appearance of multiple objectives induced by exploration and exploitation, there is a single objective in reinforcement learning: maximize total discounted rewards. Hence, an optimal policy (which maximizes total discounted rewards) must naturally optimize the exploration/exploitation tradeoff. In order for a policy to be optimal, it must use all the information available. The information available to the learner consists of the history of past states and actions. One can show that state–belief pairs ⟨s, b⟩ are sufficient statistics of the history. Hence, by searching for the mapping from state–belief pairs to actions that maximizes total discounted rewards, Bayesian reinforcement learning implicitly seeks an optimal tradeoff between exploration and exploitation. In contrast, traditional reinforcement learning approaches search in the space of mappings from states to actions. As a result, such techniques typically focus on asymptotic convergence (i.e., convergence to a policy that is optimal in the limit), but do not effectively balance exploration and exploitation since they do not use histories or beliefs to quantify the uncertainty about the underlying model.

Related Work

Michael Duff’s PhD thesis (Duff, ) provides an excellent survey of Bayesian reinforcement learning up until . The above text pertains mostly to model-based Bayesian reinforcement learning applied to discrete, fully observable, single agent domains. Bayesian learning has also been explored in model-free reinforcement learning (Dearden, Friedman, & Russell, ; Engel, Mannor, & Meir, ; Ghavamzadeh & Engel, ), continuous-valued state spaces (Ross, Chaib-Draa, & Pineau, ), partially observable domains (Poupart & Vlassis, ; Ross, Chaib-Draa, & Pineau, ), and multi-agent systems (Chalkiadakis & Boutilier, , ; Gmytrasiewicz & Doshi, ).

Cross References
Active Learning
Markov Decision Processes
Reinforcement Learning

Recommended Reading Bellman, R. (). Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.


Chalkiadakis, G., & Boutilier, C. (). Coordination in multiagent reinforcement learning: A Bayesian approach. In International joint conference on autonomous agents and multiagent systems (AAMAS), Melbourne, Australia (pp. –). Chalkiadakis, G., & Boutilier, C. (). Bayesian reinforcement learning for coalition formation under uncertainty. In International joint conference on autonomous agents and multiagent systems (AAMAS), New York (pp. –). Dearden, R., Friedman, N., & Russell, S. J. (). Bayesian Q-learning. In National conference on artificial intelligence (AAAI), Madison, Wisconsin (pp. –). DeGroot, M. H. (). Optimal statistical decisions. New York: McGraw-Hill. Duff, M. (). Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts, Amherst. Engel, Y., Mannor, S., & Meir, R. (). Reinforcement learning with Gaussian processes. In International conference on machine learning (ICML), Bonn, Germany. Fel’Dbaum, A. (). Optimal control systems. New York: Academic. Ghavamzadeh, M., & Engel, Y. (). Bayesian policy gradient algorithms. In Advances in neural information processing systems (NIPS), (pp. –). Gmytrasiewicz, P., & Doshi, P. (). A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research (JAIR), , –. Martin (). Bayesian decision problems and Markov chains. New York: Wiley. Poupart, P., & Vlassis, N. (). Model-based Bayesian reinforcement learning in partially observable domains. In International symposium on artificial intelligence and mathematics (ISAIM). Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (). An analytic solution to discrete Bayesian reinforcement learning. In International conference on machine learning (ICML), Pittsburgh, Pennsylvania (pp. –). Puterman, M. L. (). Markov decision processes. New York: Wiley. Ross, S., Chaib-Draa, B., & Pineau, J. (). Bayes-adaptive POMDPs. In Advances in neural information processing systems (NIPS). 
Ross, S., Chaib-Draa, B., & Pineau, J. (). Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In IEEE International conference on robotics and automation (ICRA), (pp. –). Sutton, R. S., & Barto, A. G. (). Reinforcement Learning. Cambridge, MA: MIT Press.

Beam Search

Claude Sammut
University of New South Wales, Sydney, Australia

A beam search is a heuristic search technique that combines elements of breadth-first and best-first searches. Like a breadth-first search, the beam search maintains a list of nodes that represent a frontier in the search space. Whereas the breadth-first search adds all neighbors to the list, the beam search orders the neighboring nodes according to some heuristic and only keeps the n best, where n is the beam size. This can significantly reduce the processing and storage requirements for the search. In machine learning, the beam search has been used in algorithms such as AQ (Dietterich & Michalski, ).

Cross References
Learning as Search

Recommended Reading
Dietterich, T. G., & Michalski, R. S. (). Learning and generalization of characteristic descriptions: Evaluation criteria and comparative review of selected methods. In Fifth international joint conference on artificial intelligence (pp. –). Cambridge, MA: William Kaufmann.

Behavioral Cloning

Claude Sammut
The University of New South Wales, Sydney, Australia

Synonyms
Apprenticeship learning; Behavioral cloning; Learning by demonstration; Learning by imitation; Learning control rules

Definition

Behavioral cloning is a method by which human subcognitive skills can be captured and reproduced in a computer program. As the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to the action. A log of these records is used as input to a learning program. The learning program outputs a set of rules that reproduce the skilled behavior. This method can be used to construct automatic control systems for complex tasks for which classical control theory is inadequate. It can also be used for training.

Motivation and Background

Behavioral cloning (Michie, Bain, & Hayes-Michie, ) is a form of learning by imitation whose main motivation is to build a model of the behavior of a human when performing a complex skill. Preferably, the model should be in a readable form. It is related to other forms of learning by imitation, such as inverse reinforcement learning (Abbeel & Ng, ; Amit & Matarić, ; Hayes & Demiris, ; Kuniyoshi, Inaba, & Inoue, ; Pomerleau, ) and methods that use data from human performances to model the system being controlled (Atkeson & Schaal, ; Bagnell & Schneider, ).

Experts might be defined as people who know what they are doing, not what they are talking about. That is, once a person becomes highly skilled in some task, the skill becomes sub-cognitive and is no longer available to introspection. So when the person is asked to explain why certain decisions were made, the explanation is a post hoc justification rather than a true explanation.

Michie et al. () used an induction program to learn rules for balancing a pole (in simulation), and earlier work by Donaldson (), Widrow and Smith (), and Chambers and Michie () demonstrated the feasibility of learning by imitation, also for pole-balancing.

Structure of the Learning System

Behavioral cloning assumes that there is a plant of some kind that is under the control of a human operator. The plant may be a physical system or a simulation. In either case, the plant must be instrumented so that it is possible to capture the state of the system, including all the control settings. Thus, whenever the operator performs an action, that is, changes a control setting, we can associate that action with a particular state. Let us use a simple example of a system that has only one control action. A pole balancer has four state variables: the angle of the pole, θ, and its angular velocity, θ̇, and the position, x, and velocity, ẋ, of the cart on the track. The only action available to the controller is to apply a fixed positive or negative force, F, to accelerate the cart left or right. We can create an experimental setup where a human can control a pole and cart system (either real or in simulation) by applying a left push or a right push at the appropriate time. Whenever a control action is performed, we record the action as well as the values of the four state variables at the time of the action. Each of these records can be viewed as an example of a mapping from state to action.

Behavioral Cloning. Figure . Structure of learning system. [Diagram: a human trainer operates the plant; as the trainer executes the task, all actions are recorded in a log file; a learning program uses the logged data to build a controller.]

Michie et al. () demonstrated that it is possible to construct a controller by learning from these examples. The learning task is to predict the appropriate action, given the state. They used a decision tree learning program to produce a classifier that, given the values of the four state variables, would output an action. A decision tree is easily convertible into executable code as a nested if statement. The quality of the controller can be tested by inserting the decision tree into the simulator, replacing the human operator. If the goal of learning is simply to produce an operational controller, then any program capable of building a classifier could be used. The reason that Michie et al. () chose a symbolic learner was their desire to produce a controller whose decision making was transparent as well as operational. That is, it should be possible to extract an explanation of the behavior that is meaningful to an expert in the task.

Learning Direct (Situation–Action) Controllers

A controller such as the one described above is referred to as a direct controller because it maps situations to actions. Other examples of learning a direct controller are building an autopilot from behavioral traces of human pilots flying aircraft in a flight simulator (Sammut, Hurst, Kedzier, & Michie, ) and building a control system for a container crane (Urbančič & Bratko, ). These systems extended the earlier work by operating in domains in which there is more than one control variable and the task is sufficiently complex that it must be decomposed into several subtasks. An operator of a container crane can control the speed of the cart and the length of the rope. A pilot of a fixed-wing aircraft can control the ailerons, elevators, rudder, throttle, and flaps. To build an autopilot, the learner must build a system that can set each of the control variables. Sammut et al. () viewed this as a multitask learning problem. Each training example is a feature vector that includes the position, orientation, and velocities of the aircraft as well as the values of each of the control settings: ailerons, elevator, throttle, and flaps. The rudder is ignored. A separate decision tree is built for each control variable. For example, the aileron setting is treated as the dependent variable and all the other variables, including the other controls, are treated as the attributes of the training example. A decision tree is built for ailerons, then the process is repeated for the elevators, etc. The result is a decision tree for each control variable. The autopilot code executes each decision tree in each cycle of the control loop. This method treats the setting of each control as a separate task. It may be surprising that this method works since it is often necessary to adjust more than one control simultaneously to achieve the desired result. For example, to turn, it is normal to use the ailerons to roll the aircraft while adjusting the elevators to pull it around. This kind of multivariable control does result from multiple decision trees. When, say, the aileron decision tree initiates a roll, the elevator’s decision tree detects the roll and causes the aircraft to pitch up and execute a turn.

Limitations

Direct controllers work quite well for systems that have a relatively small state space. However, for complex systems, behavioral cloning of direct situation–action rules tends to produce very brittle controllers. That is, they cannot tolerate large disturbances. For example, when air turbulence is introduced into the flight simulator, the performance of the clone degrades very rapidly. This is because the examples provided by logging the performance of a human only cover a very small part of the state space of a complex system such as an aircraft in flight. Thus, the “expertise” of the controller is very limited. If the system strays outside the controller’s region of expertise, it has no method for recovering and failure is usually catastrophic. More robust control is possible but only with a significant change in approach. The more successful methods decompose the learning task into two stages: learning goals and learning the actions to achieve those goals.
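The direct (situation–action) approach described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny Gini-based tree learner is a generic stand-in and not the induction program used by Michie et al., the function `demo_operator` is a hypothetical rule standing in for the human trainer, and the logged states are synthetic random samples rather than a real behavioral trace.

```python
import random


def gini(labels):
    """Gini impurity of a list of action labels."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def build_tree(records, depth=0, max_depth=4):
    """records: list of (state_tuple, action). Returns a nested dict or a leaf action."""
    actions = [a for _, a in records]
    majority = max(set(actions), key=actions.count)
    if depth == max_depth or len(set(actions)) == 1:
        return majority
    best = None
    for feature in range(len(records[0][0])):
        for threshold in sorted({s[feature] for s, _ in records}):
            left = [r for r in records if r[0][feature] <= threshold]
            right = [r for r in records if r[0][feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([a for _, a in left])
                     + len(right) * gini([a for _, a in right])) / len(records)
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:
        return majority
    _, feature, threshold, left, right = best
    return {"feature": feature, "threshold": threshold,
            "low": build_tree(left, depth + 1, max_depth),
            "high": build_tree(right, depth + 1, max_depth)}


def control(tree, state):
    """Walk the tree like a nested if statement and return the cloned action."""
    while isinstance(tree, dict):
        branch = "low" if state[tree["feature"]] <= tree["threshold"] else "high"
        tree = tree[branch]
    return tree


def demo_operator(state):
    # Hypothetical stand-in for the human trainer, used only to generate a log.
    theta, theta_dot, x, x_dot = state
    return "right" if theta + 0.5 * theta_dot > 0 else "left"


rng = random.Random(0)
log = []
for _ in range(300):
    state = tuple(rng.uniform(-1, 1) for _ in range(4))
    log.append((state, demo_operator(state)))  # one (state, action) record per event

clone = build_tree(log)
agreement = sum(control(clone, s) == a for s, a in log) / len(log)
print(agreement)
```

Agreement with the log only measures how well the clone imitates recorded decisions. As the limitations above point out, it says nothing about behavior in parts of the state space that the log never visited.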

Learning Indirect (Goal-Directed) Controllers

The problem of learning in a large search space can partially be addressed by decomposing the learning into subtasks. A controller built in this way is said to be an indirect controller. A controller is “indirect” if it does not compute the next action directly from the system’s current state but uses, in addition, some intermediate information. An example of such intermediate information is a subgoal to be attained before achieving the final goal. Subgoals often feature in an operator’s control strategies and can be automatically detected from a trace of the operator’s behavior (Šuc & Bratko, ). The problem of subgoal identification can be treated as the inverse of the usual problem of controller design, that is, given the actions in an operator’s trace, find the goal that these actions achieve. The limitation of this approach is that it only works well for cases in which there are just a few subgoals, not when the operator’s trajectory contains many subgoals. In these cases, a better approach is to generalize the operator’s trajectory. The generalized trajectory can be viewed as defining a continuously changing subgoal (Bratko & Šuc, ; Šuc & Bratko, a) (see also the use of flow tubes in dynamic plan execution (Hofmann & Williams, )).

Subgoals and generalized trajectories are not sufficient to define a controller. A model of the system’s dynamics is also required. Therefore, in addition to inducing subgoals or a generalized trajectory, this approach also requires learning approximate system dynamics, that is, a model of the controlled system. Bratko and Šuc () and Šuc and Bratko (b) use a combination of the Goldhorn (Križman & Džeroski, ) discovery program and locally weighted regression to build the model of the system’s dynamics. The next action is then computed “indirectly” by (1) computing the desired next state (e.g., next subgoal) and (2) determining an action that brings the system to the desired next state. Bratko and Šuc also investigated building qualitative control strategies from operator traces (Bratko & Šuc, ). An analog to this approach is inverse reinforcement learning (Abbeel & Ng, ; Atkeson & Schaal, ; Ng & Russell, ), where the reward function is learned. Here, learning the reward function corresponds to learning the human operator’s goals.

Isaac and Sammut () use an approach that is similar in spirit to Šuc and Bratko but incorporates classical control theory. Learned skills are represented by a two-level hierarchical decomposition with an anticipatory goal level and a reactive control level. The goal level models how the operator chooses goal settings for the control strategy and the control level models the operator’s reaction to any error between the goal setting and actual state of the system. For example, in flying, the pilot can achieve goal values for the desired heading, altitude, and airspeed by choosing appropriate values of turn rate, climb rate, and acceleration. The controls can be set to correct errors between the current state and the desired state of these goal-directing quantities. Goal models map system states to a goal setting. Control actions are based on the error between the output of each of the goal models and the current system state. The control level is modeled as a set of proportional integral derivative (PID) controllers, one for each control variable. A PID controller determines a control value as a linear function of the error on a goal variable, the integral of the error, and the derivative of the error. Goal setting and control models are learned separately.
The process begins by deciding which variables are to be used for the goal settings. For example, trainee pilots will learn to execute a “constant-rate turn,” that is, their goal is to maintain a given turn rate. A separate goal rule is constructed for each goal variable using a model tree learner (Potts & Sammut, ). A goal rule gives the setting for a goal variable, and therefore we can find the difference (error) between the current state value and the goal setting. The integral and derivative of the error can also be calculated. For example, if the set turn rate is ○ min, then the error on the turn rate is calculated as the actual turn rate minus . The integral is then the running sum of the error multiplied by the time interval between time samples, starting from the first time sample of the behavioral trace, and the derivative is calculated as the difference between the error and the previous error, all divided by the time interval. For each control available to the operator, a model tree learner is used to predict the appropriate control setting. Linear regression is used in the leaf nodes of the model tree to produce linear equations whose coefficients are the P, I, and D values of the PID controller. Thus the learner produces a collection of PID controllers that are selected according to the conditions in the internal nodes of the tree. In control theory, this is known as piecewise linear control.

Another indirect method is to learn a model of the dynamics of the system and use this to learn, in simulation, a controller for the system (Bagnell & Schneider, ; Ng, Jin Kim, Jordan, & Sastry, ). This approach does not seek to directly model the behavior of a human operator. A behavioral trace may be used to generate data for modeling the system, but then a reinforcement learning algorithm is used to generate a policy for controlling the simulated system. The learned policy can then be transferred to the physical system. Locally weighted regression is typically used for system modeling, although model trees can also be used.
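The error, integral, and derivative quantities described above for the PID control level can be computed from a behavioral trace as in the following Python sketch. The function names, the sample values, and the gains are hypothetical; in the approach described here the P, I, and D coefficients would come from linear regression in the leaves of a model tree, which this sketch does not attempt. Setting the derivative of the first sample to zero is also an assumption.

```python
def pid_features(actual, goal, dt):
    """Return (error, integral, derivative) for each sample of one goal variable."""
    features = []
    integral = 0.0
    previous_error = None
    for value in actual:
        error = value - goal            # actual value minus the goal setting
        integral += error * dt          # running sum of error times the time step
        if previous_error is None:
            derivative = 0.0            # assumption: no earlier sample to difference
        else:
            derivative = (error - previous_error) / dt
        features.append((error, integral, derivative))
        previous_error = error
    return features


def pid_output(feature, kp, ki, kd):
    """Linear combination of the three error terms, as a PID controller computes."""
    error, integral, derivative = feature
    return kp * error + ki * integral + kd * derivative


trace = [2.0, 2.5, 3.0, 3.0]            # hypothetical turn-rate samples
feats = pid_features(trace, goal=3.0, dt=0.5)
command = pid_output(feats[1], kp=2.0, ki=1.0, kd=0.5)
print(feats, command)
```

Each row of `feats` could serve as the attribute vector for one training example, with the operator’s recorded control setting as the regression target.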

Cross References
Apprenticeship Learning
Inverse Reinforcement Learning
Learning by Imitation
Locally Weighted Regression
Model Trees
Reinforcement Learning
System Identification

Recommended Reading Abbeel, P., & Ng, A. Y. (). Apprenticeship learning via inverse reinforcement learning. In International conference on machine learning, Banff, Alberta, Canada. New York: ACM.


Amit, R., & Matari´c, M. (). Learning movement sequences from demonstration. In Proceedings of the second international conference on development and learning, Cambridge, MA, USA (pp. –). Washington, D.C.: IEEE. Atkeson, C. G., & Schaal, S. (). Robot learning from demonstration. In D. H. Fisher (Ed.), Proceedings of the fourteenth international conference on machine learning, Nashville, TN, USA (pp. –). San Francisco: Morgan Kaufmann. Bagnell, J. A., & Schneider, J. G. (). Autonomous helicopter control using reinforcement learning policy search methods. In International conference on robotics and automation, South Korea. IEEE Press, New York. Bratko, I., & Šuc, D. (). Using machine learning to understand operator’s skill. In Proceedings of the th international conference on industrial and engineering applications of artificial intelligence and expert systems (pp. –). London: Springer. AAAI Press, Menlo Park, CA. Bratko, I., & Šuc, D. (). Learning qualitative models. AI Magazine, (), –. Chambers, R. A., & Michie, D. (). Man-machine co-operation on a learning task. In R. Parslow, R. Prowse, & R. Elliott-Green (Eds.), Computer graphics: techniques and applications. London: Plenum. Donaldson, P. E. K. (). Error decorrelation: A technique for matching a class of functions. In Proceedings of the third international conference on medical electronics (pp. –). Hayes, G., & Demiris, J. (). A robot controller using learning by imitation. In Proceedings of the international symposium on intelligent robotic systems, Grenoble, France (pp. –). Grenoble: LIFTA-IMAG. Hofmann, A. G., & Williams, B. C. (). Exploiting spatial and temporal flexiblity for plan execution of hybrid, underactuated systems. In Proceedings of the st national conference on artficial intelligence, July , Boston, MA (pp. –). Isaac, A., & Sammut, C. (). Goal-directed learning to fly. In T. Fawcett & N. Mishra (Eds.), Proceedings of the twentieth international conference on machine learning, Washington, D.C. (pp. 
–). Menlo Park: AAAI. Križman, V., & Džeroski, S. (). Discovering dynamics from measured data. Electrotechnical Review, (–), –. Kuniyoshi, Y., Inaba, M., & Inoue, H. (). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, , –. Michie, D., Bain, M., & Hayes-Michie, J. E. (). Cognitive models from subcognitive skills. In M. Grimble, S. McGhee, & P. Mowforth (Eds.), Knowledge-based systems in industrial control. Stevenage: Peter Peregrinus. Ng, A. Y., Jin Kim, H., Jordan, M. I., & Sastry, S. (). Autonomous helicopter flight via reinforcement learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems . Cambridge: MIT Press. Ng, A. Y., & Russell, S. (). Algorithms for inverse reinforcement learning. In Proceedings of th international conference on machine learning, Stanford, CA, USA (pp. –). San Francisco: Morgan Kaufmann.


Pomerleau, D. A. (). ALVINN: An autonomous land vehicle in a neural network. In D. S. Touretzky (Ed.), Advances in neural information processing systems. San Mateo: Morgan Kaufmann. Potts, D., & Sammut, C. (November ). Incremental learning of linear model trees. Machine Learning, (–), –. Sammut, C., Hurst, S., Kedzier, D., & Michie, D. (). Learning to fly. In D. Sleeman & P. Edwards (Eds.), Proceedings of the ninth international conference on machine learning, Aberdeen (pp. –). San Francisco: Morgan Kaufmann. Šuc, D., & Bratko, I. (). Skill reconstruction as induction of LQ controllers with subgoals. In IJCAI-: Proceedings of the fiftheenth international joint conference on artificial intelligence, Nagoya, Japan (Vol. , pp. –). San Francisco: Morgan Kaufmann. Šuc, D., & Bratko, I. (a). Modelling of control skill by qualitative constraints. In Thirteenth international workshop on qualitative reasoning, – June , Lock Awe, Scotland (pp. –). Aberystwyth: University of Aberystwyth. Šuc, D., & Bratko, I. (b). Symbolic and qualitative reconstruction of control skill. Electronic Transactions on Artificial Intelligence, (B), –. Urbanˇciˇc, T., & Bratko, I. (). Reconstructing human skill with machine learning. In A. Cohn (Ed.), Proceedings of the th European conference on artificial intelligence. Wiley. Amsterdam: New York. Widrow, B., & Smith, F. W. (). Pattern recognising control systems. In J. T. Tou & R. H. Wilcox (Eds.), Computer and information sciences. London: Clever Hume.

Belief State Markov Decision Processes

See Partially Observable Markov Decision Processes.

Bellman Equation

The Bellman equation is a recursive formula that forms the basis for dynamic programming. It computes the expected total reward of taking an action from a state in a Markov decision process by breaking it into the immediate reward and the total future expected reward. (See dynamic programming.)
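One common way to write this recursion is given below as a sketch. The notation is assumed for illustration and is not taken from this entry: states s, actions a, immediate reward R, transition probabilities P, discount factor γ, and the optimal action value Q*.

```latex
Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')
```

The first term is the immediate reward, and the second is the discounted expected future reward when the best action is taken in every subsequent state.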

Bias

Bias has two meanings: inductive bias and statistical bias (see bias variance decomposition).

Bias Specification Language

Hendrik Blockeel
Katholieke Universiteit Leuven, Belgium
The Netherlands

Definition

A bias specification language is a language in which a user can specify a language bias. The language bias of a learner is the set of hypotheses (or hypothesis descriptions) that this learner may return. In contrast to the hypothesis language, the bias specification language allows us to describe not single hypotheses but sets (languages) of hypotheses.

Examples

In learning approaches based on graphical models or artificial neural networks, whenever the user provides the graph structure of the model, he or she is specifying a bias. The “language” used to specify this bias, in this case, consists of graphs. Figure shows examples of such graphs. Not every kind of bias can necessarily be expressed by a given bias specification language; for instance, the bias defined by the Bayesian network structure in Fig. cannot be expressed using a Markov network. Bayesian networks and Markov networks have a different expressiveness when viewed as bias specification languages.

Bias Specification Language. Figure . Graphs defining a bias for learning joint distributions. [Left panel: a Bayesian network over A, B, and C, with p(A,B,C) = p(A)p(B)p(C|A,B). Right panel: a Markov network over A, B, and C, with p(A,B,C) = f1(A,C)f2(B,C).] The Bayesian network structure to the left constrains the form of the joint distribution in a particular way (shown as the equation below the graph). Notably, it guarantees that only distributions can be learned in which the variables A and B are (unconditionally) independent. The Markov network structure to the right constrains the form of the joint distribution in a different way: it states that it must be possible to write the distribution as a product of a function of A and C and a function of B and C. These two biases are different. In fact, no Markov network structure over the variables A, B, and C exists that expresses the bias specified by the Bayesian network structure.

Also, certain parameters of decision tree learners or rule set learners effectively restrict the hypothesis language (for instance, an upper bound on the rule length or the size of the decision tree). A combination of parameter values can hardly be called a language, and even the “language” of graphs is a relatively simple kind of language. More elaborate types of bias specification languages are typically found in the field of inductive logic programming (ILP).

Bias Specification Languages in Inductive Logic Programming

In ILP, the hypotheses returned by the learning algorithm are typically written as first-order logic clauses. As the set of all possible clauses is too large to handle, a subset of these clauses is typically defined; this subset is called the language bias. Several formalisms (“bias specification languages”) have been proposed for specifying such subsets. We here focus on a few representative ones.

DLAB

In the DLAB bias specification language (Dehaspe & De Raedt, ), the language bias is defined in a declarative way, by defining a syntax that clauses must fulfill. In its simplest form, a DLAB specification simply gives a set of possible head and body literals out of which the system can build a clause.

Example

The actual syntax of the DLAB specification language is relatively complicated (see Dehaspe & De Raedt, ), but in essence, one can write down a specification such as:

    { parent({X,Y,Z},{X,Y,Z}), grandparent({X,Y,Z},{X,Y,Z}) }
    :- { parent({X,Y,Z},{X,Y,Z}), parent({X,Y,Z},{X,Y,Z}),
         grandparent({X,Y,Z},{X,Y,Z}), grandparent({X,Y,Z},{X,Y,Z}) }

which states that the hypothesis language consists of all clauses that have at most one parent and at most one grandparent literal in the head, and at most two of these literals in the body; the arguments of these literals may be variables X, Y, Z. Thus, the following clauses are in the hypothesis language:

    grandparent(X,Y) :- parent(X,Z), parent(Z,Y)
    :- parent(X,Y), parent(Y,X)
    :- parent(X,X)

These express the usual definition of grandparent as well as the fact that there can be no cycles in the parent relation. Note that for each argument of each literal, all the variables and constants that may occur have to be enumerated explicitly. This can make a DLAB specification quite complex. While DLAB contains advanced constructs to alleviate this problem, it remains the case that often very elaborate bias specifications are needed in practical situations.
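The enumeration implied by such a specification can be illustrated with a small Python sketch. This is not DLAB syntax or any DLAB implementation: the names `expand`, `ALLOWED`, and `in_language` are hypothetical, literals are written as plain strings without spaces, and the limit "at most two of these literals in the body" is read here as at most two per predicate, which is an assumption.

```python
from collections import Counter
from itertools import product

VARIABLES = ("X", "Y", "Z")
PREDICATES = ("parent", "grandparent")

def expand(predicate, variables=VARIABLES, arity=2):
    """Enumerate every literal that a template like parent({X,Y,Z},{X,Y,Z}) stands for."""
    return {f"{predicate}({','.join(args)})" for args in product(variables, repeat=arity)}

# Hypothetical lookup table: predicate name -> set of admissible literals.
ALLOWED = {p: expand(p) for p in PREDICATES}

def predicate_of(literal):
    return literal.split("(", 1)[0]

def within_limits(literals, limit):
    """True if every literal is admissible and no predicate occurs more than `limit` times."""
    if any(lit not in ALLOWED.get(predicate_of(lit), ()) for lit in literals):
        return False
    counts = Counter(predicate_of(lit) for lit in literals)
    return all(n <= limit for n in counts.values())

def in_language(head, body):
    """Assumed reading of the example bias: <= 1 literal per predicate in the head, <= 2 in the body."""
    return within_limits(head, 1) and within_limits(body, 2)

print(len(ALLOWED["parent"]))  # number of parent literals the template expands to
print(in_language(["grandparent(X,Y)"], ["parent(X,Z)", "parent(Z,Y)"]))
```

Even this toy template expands to nine literals per predicate, which hints at why explicit enumeration makes practical specifications elaborate.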

Type- and Mode-Based Biases

A more flexible bias specification language is used by Progol (Muggleton, ) and many other ILP systems. It is based on the notions of types and modes. In Progol, arguments of a predicate can be typed, and a variable can never occur in two locations with different types. Similarly, arguments of a predicate have an input (+) or output (−) mode; each variable that occurs as an input argument of some literal must occur elsewhere as an output argument, or must occur as input argument in the head literal of a clause.

Example

In Progol, the specifications

    type(parent(human,human)).
    type(grandparent(human,human)).
    modeh(grandparent(+,+)).    % modeh: specifies a head literal
    modeb(grandparent(+,-)).    % modeb: specifies a body literal
    modeb(parent(+,-)).

allow the system to construct a clause such as

    grandparent(X,Y) :- parent(X,Z), parent(Z,Y)

but not the following clause:

    grandparent(X,Y) :- parent(Z,Y)

because Z occurs as an input parameter for parent without occurring elsewhere as an output parameter (i.e., it is being used without having been given a value first).

FLIPPER’s Bias Specification Language

The FLIPPER system (Cohen, ) employs a powerful, but somewhat more procedural, bias specification formalism. The user does not specify a set of valid hypotheses directly, but rather, specifies a refinement operator. The language bias is the set of all clauses that can be obtained from one or more starting clauses through repeated application of this refinement operator. The operator itself is defined by specifying under which conditions certain literals can be added to a clause. Rules defining the operator have one of the following forms:

    A ← B where Pre asserting Post
    L where Pre asserting Post

The first form defines a set of “starting clauses,” and the second form defines when a literal L can be added to a clause. Each rule can only be applied when its preconditions Pre are fulfilled, and upon application will assert a set of literals Post. As an example (Cohen, ), the rules

    illegal(A, B, C, D, E, F) ← where true asserting {linked(A), linked(B), . . ., linked(F)}
    R(X, Y) where rel(R), linked(X), linked(Y) asserting ∅

state that any clause of the form illegal(A, B, C, D, E, F) ←

can be used as a starting point for the refinement operator, and the variables in this clause are all linked. Further, any literal of the form R(X, Y) with R a relation symbol (as defined by the rel predicate) and X and Y linked variables can be added.

Other Approaches

Grammars or term rewriting systems have been proposed several times as a means of defining the hypothesis language. A relatively recent approach along these lines was given by Lloyd, who uses a rewriting system to define the tests that can occur in the nodes of a decision tree built by the Alkemy system (Lloyd, ). Boström & Idestam-Almquist () present an approach where the language bias is implicitly defined through the background knowledge given to the learner. Knobbe et al. () propose the use of UML as a “common” bias specification language; specifications written in it could be translated automatically to languages specific to a particular learner.
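Returning to the type- and mode-based biases of this entry, the input/output rule for variables can be sketched as a small check. This is an illustrative Python sketch, not Progol code: the dictionaries `HEAD_MODES` and `BODY_MODES` and the function `mode_conformant` are hypothetical names, types are not checked, and the rule is applied left to right over the body, which is a stricter assumption than the "occurs elsewhere" wording.

```python
# Mode declarations mirroring the example: '+' marks an input argument, '-' an output argument.
HEAD_MODES = {"grandparent": ("+", "+")}
BODY_MODES = {"grandparent": ("+", "-"), "parent": ("+", "-")}

def mode_conformant(head, body):
    """Check that every input variable in the body is bound before it is used.

    `head` is a (predicate, args) pair and `body` a list of such pairs.
    A variable is bound if it is an input argument of the head or an
    output argument of an earlier body literal (assumed left-to-right order).
    """
    head_pred, head_args = head
    bound = {arg for arg, mode in zip(head_args, HEAD_MODES[head_pred]) if mode == "+"}
    for pred, args in body:
        modes = BODY_MODES[pred]
        if any(mode == "+" and arg not in bound for arg, mode in zip(args, modes)):
            return False
        bound.update(arg for arg, mode in zip(args, modes) if mode == "-")
    return True

good = (("grandparent", ("X", "Y")), [("parent", ("X", "Z")), ("parent", ("Z", "Y"))])
bad = (("grandparent", ("X", "Y")), [("parent", ("Z", "Y"))])

print(mode_conformant(*good))  # accepted: Z is produced by parent(X,Z) before it is consumed
print(mode_conformant(*bad))   # rejected: Z is used as an input without being given a value
```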

Further Reading

An overview of bias specification formalisms in ILP is given by Nédellec et al. (). The bias specification languages discussed above are discussed in more detail in Dehaspe and De Raedt (), Muggleton (), and Cohen (). De Raedt () discusses language bias and the concept of bias shift (learners weakening their bias, i.e., extending the set of hypotheses that can be represented, when a given language bias turns out to be too restrictive). A more recent approach to learning declarative bias is presented by Bridewell and Todorovski ().

Cross References

Hypothesis Language
Inductive Logic Programming

Recommended Reading

Boström, H., & Idestam-Almquist, P. (). Induction of logic programs by example-guided unfolding. Journal of Logic Programming, (–), –.
Bridewell, W., & Todorovski, L. (). Learning declarative bias. In Proceedings of the th international conference on inductive logic programming. Lecture notes in computer science (Vol. , pp. –). Berlin: Springer.
Cohen, W. (). Learning to classify English text with ILP methods. In L. De Raedt (Ed.), Advances in inductive logic programming (pp. –). Amsterdam: IOS Press.
De Raedt, L. (). Interactive theory revision: An inductive logic programming approach. New York: Academic Press.
Dehaspe, L., & De Raedt, L. (). DLAB: A declarative language bias formalism. In Proceedings of the international symposium on methodologies for intelligent systems. Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer.
Knobbe, A. J., Siebes, A., Blockeel, H., & van der Wallen, D. (). Multi-relational data mining, using UML for ILP. In Proceedings of PKDD- – The fourth European conference on principles and practice of knowledge discovery in databases. Lecture notes in artificial intelligence (Vol. , pp. –), Lyon, France. Berlin: Springer.
Lloyd, J. W. (). Logic for learning. Berlin: Springer.
Muggleton, S. (). Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, (–), –.
Nédellec, C., Adé, H., Bergadano, F., & Tausend, B. (). Declarative bias in ILP. In L. De Raedt (Ed.), Advances in inductive logic programming. Frontiers in artificial intelligence and applications (Vol. , pp. –). Amsterdam: IOS Press.

Bias Variance Decomposition

Definition

The bias-variance decomposition is a useful theoretical tool to understand the performance characteristics of a learning algorithm. The following discussion is restricted to the use of squared loss as the performance measure, although similar analyses have been undertaken for other loss functions. The case receiving most attention is the zero-one loss (i.e., classification problems), in which case the decomposition is nonunique and a topic of active research. See Domingos () for details.

The decomposition allows us to see that the mean squared error of a model (generated by a particular learning algorithm) is in fact made up of two components. The bias component tells us how accurate the model is, on average across different possible training sets. The variance component tells us how sensitive the learning algorithm is to small changes in the training set (see the figure).

[Figure: four dartboard panels labeled High bias / High variance, Low bias / High variance, High bias / Low variance, and Low bias / Low variance.]

Bias Variance Decomposition. Figure. The bias-variance decomposition is like trying to hit the bullseye on a dartboard. Each dart is thrown after training our “dart-throwing” model in a slightly different manner. If the darts vary wildly, the learner is high variance. If they are far from the bullseye, the learner is high bias. The ideal is clearly to have both low bias and low variance; however, this is often difficult, giving an alternative terminology as the bias-variance “dilemma” (Dartboard analogy, Moore & McCabe ()).

Mathematically, this can be quantified as a decomposition of the mean squared error function. For a testing example {x, d}, the decomposition is:

$$E_D\{(f(x) - d)^2\} = (E_D\{f(x)\} - d)^2 + E_D\{(f(x) - E_D\{f(x)\})^2\},$$

$$\text{MSE} = \text{bias}^2 + \text{variance},$$

where the expectations are with respect to all possible training sets. In practice, this can be estimated by cross-validation over a single finite training set, enabling a deeper understanding of the algorithm characteristics. For example, efforts to reduce variance often cause increases in bias, and vice versa. A low bias combined with a high variance is an indicator that a learning algorithm is prone to overfitting the model.
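The identity can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the "learner" (`train_and_predict`, a hypothetical function that predicts a shrunken mean of noisy observations) and the target value `d = 1.0` are invented for the example, and the expectation over training sets is approximated by averaging over 2,000 simulated training sets rather than by cross-validation.

```python
import random
from statistics import fmean

def decompose(predictions, d):
    """Return (MSE, squared bias, variance) of predictions f(x) for a fixed target d."""
    mean_pred = fmean(predictions)
    mse = fmean((f - d) ** 2 for f in predictions)
    bias_sq = (mean_pred - d) ** 2
    variance = fmean((f - mean_pred) ** 2 for f in predictions)
    return mse, bias_sq, variance

def train_and_predict(rng, n_train, shrink):
    """Hypothetical learner: draw a training set of noisy targets, predict a shrunken mean."""
    sample = [1.0 + rng.gauss(0.0, 1.0) for _ in range(n_train)]
    return shrink * fmean(sample)

rng = random.Random(0)
# One prediction per simulated training set D.
predictions = [train_and_predict(rng, n_train=5, shrink=0.8) for _ in range(2000)]
mse, bias_sq, variance = decompose(predictions, d=1.0)

print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
print(f"bias^2 + variance = {bias_sq + variance:.4f}")
```

Because the same set of predictions is used for all three quantities, the two sides agree up to floating-point rounding; what varies with the learner is how the total is split between the two terms.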

Cross References

Bias-Variance Trade-offs: Novel Applications

Recommended Reading

Domingos, P. (). A unified bias-variance decomposition for zero-one and squared loss. In Proceedings of national conference on artificial intelligence. Austin, TX: AAAI Press.
Geman, S. (). Neural networks and the bias/variance dilemma. Neural Computation, ().
Moore, D. S., & McCabe, G. P. (). Introduction to the practice of statistics. Michelle Julet

Bias-Variance Trade-offs: Novel Applications

Dev Rajnarayan, David Wolpert
NASA Ames Research Center, Moffett Field, CA, USA

Definition

Consider a given random variable $F$ and a random variable that we can modify, $\hat{F}$. We wish to use a sample of $\hat{F}$ as an estimate of a sample of $F$. The mean squared error (MSE) between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between $F$ and $\hat{F}$ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent noise in $F$ and is independent of the estimator $\hat{F}$. Accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on $\hat{F}$. The third term, called the bias, is independent of the precise samples of both $F$ and $\hat{F}$, and reflects the difference between the means of $F$ and $\hat{F}$. The fourth term, called the variance, is independent of the precise sample of $F$, and reflects the inherent noise in the estimator as one samples it. These last two terms can be modified by changing the choice of the estimator. In particular, on small sample sets, we can often decrease our mean squared error by, for instance, introducing a small bias that causes a large reduction in the variance. Although such bias-variance trade-offs are most commonly used in machine learning, this article shows that they are applicable in a much broader context and in a variety of situations. We also show, using experiments, how existing bias-variance trade-offs can be applied in novel circumstances to improve the performance of a class of optimization algorithms.

Motivation and Background

In its simplest form, the bias-variance decomposition is based on the following idea. Say we have a random variable $F$ taking on values $F$ distributed according to a density function $p(F)$. We want to estimate the value of a sample from $p(F)$. To form our estimate, we sample a different random variable $\hat{F}$ taking on values $\hat{F}$ distributed according to $p(\hat{F})$. Assuming a quadratic loss function, the quality of our estimate is measured by its MSE:

$$\text{MSE}(\hat{F}) \equiv \int p(\hat{F}, F)\,(\hat{F} - F)^2 \, d\hat{F}\, dF.$$

In many situations, $F$ and $\hat{F}$ are dependent variables. For example, in supervised machine learning, $F$ is a “target” conditional distribution, stochastically mapping elements of an input space $X$ into a space $Y$ of output variables. The associated distribution $p(F)$ is the “prior” of $F$. A random sample $D$ of $F$, called “the training set,” is generated, and $D$ is used in a “learning algorithm” to produce $\hat{F}$, which is our estimate of $F$. Clearly, this $F$ and $\hat{F}$ are statistically dependent, via $D$. Indeed, intuitively speaking, the goal in designing a learning algorithm is that the $\hat{F}$'s it produces are positively correlated with $F$'s. In practice this coupling is simply ignored in analyses of bias plus variance, without any justification (one such justification could be that the coupling has little effect on the value of the MSE). We shall follow that practice here. Accordingly, our equation for MSE reduces to

$$\text{MSE}(\hat{F}) = \int p(\hat{F})\,p(F)\,(\hat{F} - F)^2 \, d\hat{F}\, dF.$$ ()

If we were to account for the coupling of $\hat{F}$ and $F$, an additive correction term would need to be added to the right-hand side. For instance, see Wolpert ().

Using simple algebra, the right-hand side of () can be written as the sum of three terms. The first is the variance of $F$. Since this is beyond our control in designing the estimator $\hat{F}$, we ignore it for the rest of this article. The second term involves a mean that describes the deterministic component of the error. This term depends on both the distribution of $F$ and that of $\hat{F}$, and quantifies how close the means of those distributions are. The third term is a variance that describes stochastic variations from one sample to the next. This term is independent of the random variable being estimated. Formally, up to an overall additive constant (the variance of $F$), and writing $F$ for the mean of the random variable being estimated, we can write

$$\begin{aligned}
\text{MSE}(\hat{F}) &= \int p(\hat{F})\,(\hat{F}^2 - 2F\hat{F} + F^2)\, d\hat{F} \\
&= \int p(\hat{F})\,\hat{F}^2\, d\hat{F} - 2F \int p(\hat{F})\,\hat{F}\, d\hat{F} + F^2 \\
&= V(\hat{F}) + [E(\hat{F})]^2 - 2F\,E(\hat{F}) + F^2 \\
&= V(\hat{F}) + [F - E(\hat{F})]^2 \\
&= \text{variance} + \text{bias}^2.
\end{aligned}$$

In light of (), one way to try to reduce expected quadratic error is to modify an estimator to trade off bias and variance. Some of the most famous applications of such bias-variance trade-offs occur in parametric machine learning, where many techniques have been developed to exploit the trade-off. Nonetheless, the trade-off also arises in many other fields, including integral estimation and optimization. In the rest of this article we present a few novel applications of the bias-variance trade-off, and describe some interesting features in each case. A recurring theme is the following: whenever a bias-variance trade-off arises in a particular field, we can use many techniques from parametric machine learning that have been developed for exploiting this trade-off. See Wolpert and Rajnarayan () for further details of many of these applications.
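A worked example may make the trade-off concrete. The sketch below is an illustration that does not come from the article: it assumes the quantity to estimate is a fixed mean `mu`, that the unbiased estimator is the average of `m` independent samples with standard deviation `sigma`, and that we deliberately bias it by multiplying with a shrinkage factor `c` (all names are hypothetical). The closed forms used are bias $= (c-1)\mu$ and variance $= c^2\sigma^2/m$.

```python
def shrinkage_mse(c, mu, sigma, m):
    """MSE, squared bias, and variance of the estimator c * (sample mean of m draws)."""
    bias = (c - 1.0) * mu
    variance = c * c * sigma * sigma / m
    return bias * bias + variance, bias * bias, variance

def best_shrinkage(mu, sigma, m):
    """Factor c minimizing bias^2 + variance; it depends on the unknown mu,
    so in practice it would itself have to be estimated."""
    return mu * mu / (mu * mu + sigma * sigma / m)

mu, sigma, m = 1.0, 2.0, 4
unbiased_mse, _, _ = shrinkage_mse(1.0, mu, sigma, m)
c_star = best_shrinkage(mu, sigma, m)
shrunk_mse, shrunk_bias_sq, shrunk_var = shrinkage_mse(c_star, mu, sigma, m)

print(f"unbiased estimator: MSE = {unbiased_mse}")
print(f"shrunk estimator (c = {c_star}): bias^2 = {shrunk_bias_sq}, "
      f"variance = {shrunk_var}, MSE = {shrunk_mse}")
```

With these particular numbers the biased estimator halves the MSE of the unbiased one, because the variance it removes is larger than the squared bias it introduces.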

Applications

In this section, we describe some applications of the bias-variance trade-off. First, we describe Monte Carlo (MC) techniques for the estimation of integrals, and provide a brief analysis of bias-variance trade-offs in this context. Next, we introduce the field of Monte Carlo optimization (MCO), and illustrate that there are more subtleties involved than in simple MC. Then, we describe the field of parametric machine learning, which, as we will show, is formally identical to MCO. Finally, we describe the application of parametric learning (PL) techniques to improve the performance of MCO algorithms. We do this in the context of an MCO problem that addresses black-box optimization.

Monte Carlo Estimation of Integrals Using Importance Sampling

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. Consider a function $f: \mathcal{X} \to \mathbb{R}$, which we want to integrate over some region $X \subseteq \mathcal{X}$, yielding the value $F$, as given by

$$F = \int_X dx\, f(x).$$ ()

We can view this as a random variable $F$, with density function given by a Dirac delta function centered on $F$. Therefore, the variance of $F$ is 0, and () is exact.

A popular MC method to estimate this integral is importance sampling (see Robert & Casella, ). This exploits the law of large numbers as follows: i.i.d. samples $x^{(i)}$, $i = 1, \ldots, m$, are generated from a so-called importance distribution $h(x)$ that we control, and the associated values of the integrand, $f(x^{(i)})$, are computed. Denote these “data” by

$$D = \{(x^{(i)}, f(x^{(i)})),\ i = 1, \ldots, m\}.$$ ()

Now,

$$F = \int_X dx\, h(x)\, \frac{f(x)}{h(x)} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})} \quad \text{with probability 1}.$$

Denote by $\hat{F}$ the random variable with value given by the sample average for $D$:

$$\hat{F} = \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})}.$$

We use $\hat{F}$ as our statistical estimator for $F$, as we broadly described in the introductory section. Assuming a quadratic loss function, $L(\hat{F}, F) = (F - \hat{F})^2$, the bias-variance decomposition described in () applies exactly. It can be shown that the estimator $\hat{F}$ is unbiased, that is, $E(\hat{F}) = F$, where the mean is over samples of $h$. Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution $h$ that minimizes this variance is given by (see Robert & Casella, )

$$h^\star(x) = \frac{|f(x)|}{\int_X |f(x')|\, dx'}.$$

By itself, this result is not very helpful, since the equation for the optimal importance distribution contains a similar integral to the one we are trying to estimate. For non-negative integrands $f(x)$, the VEGAS algorithm (Lepage, ) describes an adaptive method to find successively better importance distributions, by iteratively estimating $F$, and then using that estimate to generate the next importance distribution $h$. In the case of these unbiased estimators, there is no trade-off between bias and variance, and minimizing MSE is achieved by minimizing variance.

Monte Carlo Optimization

Instead of a fixed integral to evaluate, consider a parametrized integral

$$F(\theta) = \int_X dx\, f_\theta(x).$$

Further, suppose we are interested in finding the value of the parameter $\theta \in \Theta$ that minimizes $F(\theta)$:

$$\theta^\star = \arg\min_{\theta \in \Theta} F(\theta).$$

In the case where the functional form of $f_\theta$ is not explicitly known, one approach to solve this problem is a technique called MCO (see Ermoliev & Norkin, ), involving repeated MC estimation of the integral in question with adaptive modification of the parameter $\theta$. We proceed by analogy to the case with MC. First, we introduce the $\theta$-indexed random variable $F(\theta)$, all of whose components have delta-function distributions about the associated values $F(\theta)$. Next, we introduce a $\theta$-indexed vector random variable $\hat{F}$ with values

$$\hat{F} \equiv \{\hat{F}(\theta)\ \ \forall\, \theta \in \Theta\}.$$ ()

Each real-valued component $\hat{F}(\theta)$ can be sampled and viewed as an estimate of $F(\theta)$. For example, let $D$ be a data set as described in (). Then for every $\theta$, any sample of $D$ provides an associated estimate

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$

That average serves as an estimate of $F(\theta)$. Formally, $\hat{F}$ is a function of the random variable $D$, and is given by such averaging over the elements of $D$. So, a sample of $D$ provides a sample of $\hat{F}$. A priori, we make no restrictions on $\hat{F}$, and so, in general, its components may be statistically coupled with one another. Note that this coupling arises even though we are, for simplicity, treating each function $F(\theta)$ as having a delta-function distribution, rather than as having a non-zero variance that would reflect our lack of knowledge of the $f_\theta$ functions.
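The importance-sampling estimate of $F(\theta)$ for several values of $\theta$ from one shared data set can be sketched as follows. Everything concrete here is an assumption chosen for illustration: the integrand $f_\theta(x) = (x-\theta)^2$ on $[0,1]$ (so that $F(\theta) = 1/3 - \theta + \theta^2$ is known exactly), the importance density $h(x) = (1+x)/1.5$, and the function names.

```python
import math
import random

def f(theta, x):
    """Hypothetical integrand f_theta(x) = (x - theta)^2 on the region [0, 1]."""
    return (x - theta) ** 2

def h(x):
    """Assumed importance density on [0, 1]; it integrates to 1 and is bounded away from 0."""
    return (1.0 + x) / 1.5

def sample_h(rng):
    """Inverse-CDF sampling from h: solve (x + x^2 / 2) / 1.5 = u for x."""
    return -1.0 + math.sqrt(1.0 + 3.0 * rng.random())

def true_F(theta):
    """Exact value of the integral of (x - theta)^2 over [0, 1]."""
    return 1.0 / 3.0 - theta + theta ** 2

def estimate_F(thetas, xs):
    """F_hat(theta) = (1/m) * sum_i f_theta(x_i) / h(x_i), using the same samples for every theta."""
    m = len(xs)
    return {t: sum(f(t, x) / h(x) for x in xs) / m for t in thetas}

rng = random.Random(1)
xs = [sample_h(rng) for _ in range(20000)]
thetas = (0.0, 0.25, 0.5, 0.75, 1.0)
F_hat = estimate_F(thetas, xs)

for t in thetas:
    print(f"theta={t:4.2f}  estimate={F_hat[t]:.4f}  exact={true_F(t):.4f}")
```

Because every component $\hat{F}(\theta)$ is computed from the same samples, the errors of the components are statistically coupled, which is the coupling the text refers to.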

However $\hat{F}$ is defined, given a sample of $\hat{F}$, one way to estimate $\theta^\star$ is

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \hat{F}(\theta).$$

We call this approach “natural” MCO. As an example, say that $D$ is a set of $m$ samples of $h$, and let

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})},$$

as above. Under this choice for $\hat{F}$,

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$ ()

We call this approach “naive” MCO.

Consider any algorithm that estimates $\theta^\star$ as a single-valued function of $\hat{F}$. The estimate of $\theta^\star$ produced by that algorithm is itself a random variable, since it is a function of the random variable $\hat{F}$. Call this random variable $\hat{\theta}^\star$, taking on values $\hat{\theta}^\star$. Any MCO algorithm is defined by $\hat{\theta}^\star$; that random variable encapsulates the output estimate made by the algorithm.

To analyze the error of such an algorithm, consider the associated random variable given by the true parametrized integral $F(\hat{\theta}^\star)$. The difference between a sample of $F(\hat{\theta}^\star)$ and the true minimal value of the integral, $F(\theta^\star) = \min_\theta F(\theta)$, is the error introduced by our estimating that optimal $\theta$ as a sample of $\hat{\theta}^\star$. Since our aim in MCO is to minimize $F(\theta)$, we adopt the loss function $L(\hat{\theta}^\star, \theta^\star) \triangleq F(\hat{\theta}^\star) - F(\theta^\star)$. This is in contrast to our discussion on MC integration, which involved quadratic loss. The current loss function just equals $F(\hat{\theta}^\star)$ up to an additive constant $F(\theta^\star)$ that is fixed by the MCO problem at hand and is beyond our control. Up to that additive constant, the associated expected loss is

$$E(L) = \int d\hat{\theta}^\star\, p(\hat{\theta}^\star)\, F(\hat{\theta}^\star).$$ ()

Now change coordinates in this integral from the values of the scalar random variable $\hat{\theta}^\star$ to the values of the underlying vector random variable $\hat{F}$. The expected loss now becomes

$$E(L) = \int d\hat{F}\, p(\hat{F})\, F(\hat{\theta}^\star(\hat{F})).$$

The natural MCO algorithm provides some insight into these results. For that algorithm,

$$E(L) = \int d\hat{F}\, p(\hat{F})\, F\big(\arg\min_\theta \hat{F}(\theta)\big) = \int d\hat{F}(\theta_1)\, d\hat{F}(\theta_2) \ldots\; p(\hat{F}(\theta_1), \hat{F}(\theta_2), \ldots)\, F\big(\arg\min_\theta \hat{F}(\theta)\big).$$ ()

For any fixed $\theta$, there is an error between samples of $\hat{F}(\theta)$ and the true value $F(\theta)$. Bias-variance considerations apply to this error, exactly as in the discussion of MC above. We are not, however, concerned with $\hat{F}$ for a single component $\theta$, but rather for a set $\Theta$ of $\theta$'s.

The simplest such case is where the components of $\hat{F}(\Theta)$ are independent. Even so, $\arg\min_\theta \hat{F}(\theta)$ is distributed according to the laws for extrema of multiple independent random variables, and this distribution depends on higher-order moments of each random variable $\hat{F}(\theta)$. This means that $E[L]$ also depends on such higher-order moments. Only the first two moments, however, arise in the bias and variance for any single $\theta$. Thus, even in the simplest possible case, the bias-variance considerations for the individual $\theta$ do not provide a complete analysis.

In most cases, the components of $\hat{F}$ are not independent. Therefore, in order to analyze $E[L]$, in addition to higher moments of the distribution for each $\theta$, we must now also consider higher-order moments coupling the estimates $\hat{F}(\theta)$ for different $\theta$.

Due to these effects, it may be quite acceptable for all the components $\hat{F}(\theta)$ to have both a large bias and a large variance, as long as they still order the $\theta$'s correctly with respect to the true $F(\theta)$. In such a situation, large covariances could ensure that if some $\hat{F}(\theta)$ were incorrectly large, then $\hat{F}(\theta')$, $\theta' \neq \theta$, would also be incorrectly large. This coupling between the components of $\hat{F}$ would preserve the ordering of $\theta$'s under $F$. So, even with large bias and variance for each $\theta$, the estimator as a whole would still work well.

Nevertheless, it is sufficient to design estimators $\hat{F}(\theta)$ with sufficiently small bias plus variance for each single $\theta$. More precisely, suppose that those terms are very small on the scale of differences $F(\theta) - F(\theta')$ for any $\theta$ and $\theta'$. Then by Chebychev’s inequality,

we know that the density functions of the random variables $\hat{F}(\theta)$ and $\hat{F}(\theta')$ have almost no overlap. Accordingly, the probability that a sample of $\hat{F}(\theta) - \hat{F}(\theta')$ has the opposite sign of $F(\theta) - F(\theta')$ is almost zero.

Evidently, $E[L]$ is generally determined by a complicated relationship involving bias, variance, covariance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of these effects, and consequently, often perform quite poorly in practice. In the next section we discuss some ways of addressing this problem.

Parametric Machine Learning

There are many versions of the basic MCO problem described in the previous section. Some of the best-explored arise in parametric density estimation and parametric supervised learning, which together comprise the field of parametric machine learning (PL). In particular, parametric supervised learning attempts to solve

$$\arg\min_{\theta \in \Theta} \int dx\, p(x) \int dy\, p(y \mid x)\, f_\theta(x).$$

Here, the values $x$ represent inputs, and the values $y$ represent corresponding outputs, generated according to some stochastic process defined by a set of conditional distributions $\{p(y \mid x),\ x \in X\}$. Typically, one tries to solve this problem by casting it as an MCO problem. For instance, say we adopt a quadratic loss between a predictor $z_\theta(x)$ and the true value of $y$. Using MCO notation, we can express the associated supervised learning problem as finding $\arg\min_\theta F(\theta)$, where

$$l_\theta(x) = \int dy\, p(y \mid x)\,(z_\theta(x) - y)^2, \qquad f_\theta(x) = p(x)\, l_\theta(x), \qquad F(\theta) = \int dx\, f_\theta(x).$$ ()

Next, the argmin is estimated by minimizing a sample-based estimate of the $F(\theta)$'s. More precisely, we are given a “training set” of samples of $p(y \mid x)\, p(x)$, $\{(x^{(i)}, y_i),\ i = 1, \ldots, m\}$. This training set provides a set of associated estimates of $F(\theta)$:

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} l_\theta(x^{(i)}).$$

These are used to estimate $\arg\min_\theta F(\theta)$, exactly as in MCO. In particular, one could estimate the minimizer of $F(\theta)$ by finding the minimum of $\hat{F}(\theta)$, just as in natural MCO. As mentioned above, this MCO algorithm can perform very poorly in practice. In PL, this poor performance is called “overfitting the data.”

There are several formal approaches that have been explored in PL to try to address this “overfitting the data.” Interestingly, none are based on direct consideration of the random variable $F(\hat{\theta}^\star(\hat{F}))$ and the ramifications of its distribution for expected loss (cf. ()). In particular, no work has applied the mathematics of extrema of multiple random variables to analyze the bias-variance-covariance trade-offs encapsulated in ().

The PL approach that perhaps comes closest to such direct consideration of the distribution of $F(\hat{\theta}^\star)$ is uniform convergence theory, which is a central part of computational learning theory (see Angluin, ). Uniform convergence theory starts by crudely encapsulating the quadratic loss formula for expected loss under natural MCO (). It does this by considering the worst-case bound, over possible $p(x)$ and $p(y \mid x)$, of the probability that $F(\hat{\theta}^\star)$ exceeds $\min_\theta F(\theta)$ by more than $\kappa$. It then examines how that bound varies with $\kappa$. In particular, it relates such variation to characteristics of the set of functions $\{f_\theta : \theta \in \Theta\}$, e.g., the “VC dimension” of that set (see Vapnik, , ).

Another, historically earlier approach is to apply bias-plus-variance considerations to the entire PL algorithm $\hat{\theta}^\star$, rather than to each $\hat{F}(\theta)$ separately. This approach is applicable for algorithms that do not use natural MCO, and even for non-parametric supervised learning. As formulated for parametric supervised learning, this approach combines the formulas in () to write

$$F(\theta) = \int dx\, dy\, p(x)\, p(y \mid x)\,(z_\theta(x) - y)^2.$$

This is then substituted into (), giving

$$E[L] = \int d\hat{\theta}^\star\, dx\, dy\; p(x)\, p(y \mid x)\, p(\hat{\theta}^\star)\,(z_{\hat{\theta}^\star}(x) - y)^2 = \int dx\, p(x) \left[ \int d\hat{\theta}^\star\, dy\; p(y \mid x)\, p(\hat{\theta}^\star)\,(z_{\hat{\theta}^\star}(x) - y)^2 \right].$$ ()

The term in square brackets is an $x$-parameterized expected quadratic loss, which can be decomposed into

a bias, variance, etc., in the usual way. This formulation eliminates any direct concern for issues like the distribution of extrema of multiple random variables, covariances between $\hat{F}(\theta)$ and $\hat{F}(\theta')$ for different values of $\theta$, and so on.

There are numerous other approaches for addressing the problems of natural MCO that have been explored in PL. Particularly important among these are Bayesian approaches, e.g., Buntine and Weigend (), Berger (), and Mackay (). Based on these approaches, as well as on intuition, many powerful techniques for addressing data-overfitting have been explored in PL, including regularization, cross-validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO problem, not just PL problems. Since many of these techniques can be justified using (), they provide a way to exploit the bias-variance trade-off in other domains besides PL.

PLMCO

In this section, we illustrate how PL techniques that exploit the bias-variance decomposition of () can be used to improve an MCO algorithm used in a domain outside of PL. This MCO algorithm is a version of adaptive importance sampling, somewhat similar to the CE method (Rubinstein & Kroese, ), and is related to function smoothing on continuous spaces. The PL techniques described are applicable to any other MCO problem, and this particular one is chosen just as an example.

MCO Problem Description

The problem is to find the $\theta$-parameterized distribution $q_\theta$ that minimizes the associated expected value of a function $G: \mathbb{R}^n \to \mathbb{R}$, i.e., find

$$\arg\min_\theta E_{q_\theta}[G].$$

We are interested in versions of this problem where we do not know the functional form of G, but can obtain its value G(x) at any x ∈ X . Similarly we cannot assume that G is smooth, nor can we evaluate its derivatives directly. This scenario arises in many fields, including blackbox optimization (see Wolpert, Strauss, & Rajnarayan, ), and risk minimization (see Ermoliev & Norkin, ).

We begin by expressing this minimization problem as an MCO problem. We know that

$$E_{q_\theta}[G] = \int_X dx\, q_\theta(x)\, G(x).$$

Using MCO terminology, $f_\theta(x) = q_\theta(x)\, G(x)$ and $F(\theta) = E_{q_\theta}[G]$. To apply MCO, we must define a vector-valued random variable $\hat{F}$ with components indexed by $\theta$, and then use a sample of $\hat{F}$ to estimate $\arg\min_\theta E_{q_\theta}[G]$. In particular, to apply naive MCO to estimate $\arg\min_\theta E_{q_\theta}[G]$, we first draw i.i.d. samples from a density function $h(x)$. By evaluating the associated values of $G(x)$ we get a data set

$$D \equiv (D_X, D_G) = (\{x^{(i)} : i = 1, \ldots, m\},\ \{G(x^{(i)}) : i = 1, \ldots, m\}).$$

The associated estimates of $F(\theta)$ for each $\theta$ are

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})}.$$ ()

The associated naive MCO estimate of $\arg\min_\theta E_{q_\theta}[G]$ is

$$\hat{\theta}^\star \equiv \arg\min_\theta \hat{F}(\theta).$$
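A minimal sketch of this naive MCO estimate is given below. The setup is entirely assumed for illustration: a one-dimensional $G(x) = (x-2)^2$, a family $q_\theta = N(\theta, 1)$ restricted to a small grid of candidate means, and a sampling density $h = N(0, 3^2)$. For this choice $E_{q_\theta}[G] = (\theta - 2)^2 + 1$, so the true minimizer on the grid is $\theta = 2$.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def G(x):
    """Hypothetical black-box objective; only its values at sampled points are used."""
    return (x - 2.0) ** 2

def naive_mco(thetas, xs, h_pdf):
    """Return (theta_hat, F_hat) where F_hat[theta] = (1/m) sum_i q_theta(x_i) G(x_i) / h(x_i)."""
    m = len(xs)
    g_over_h = [G(x) / h_pdf(x) for x in xs]  # shared across all theta
    F_hat = {
        theta: sum(normal_pdf(x, theta, 1.0) * w for x, w in zip(xs, g_over_h)) / m
        for theta in thetas
    }
    theta_hat = min(F_hat, key=F_hat.get)  # arg min over the candidate grid
    return theta_hat, F_hat

rng = random.Random(0)
xs = [rng.gauss(0.0, 3.0) for _ in range(5000)]          # i.i.d. samples of h
thetas = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]            # assumed candidate set Theta
theta_hat, F_hat = naive_mco(thetas, xs, lambda x: normal_pdf(x, 0.0, 3.0))

print("naive MCO estimate of theta:", theta_hat)
```

Restricting $\Theta$ to a coarse grid of fixed-width Gaussians acts as a strong constraint on the estimate; the overfitting problem appears once $\Theta$ is allowed to be much richer.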

Suppose $\Theta$ includes all possible density functions over $x$’s. Then the $q_\theta$ minimizing our estimate is a delta function about the $x^{(i)} \in D_X$ with the lowest associated value of $G(x^{(i)})/h(x^{(i)})$. This is clearly a poor estimate in general; it suffers from “data-overfitting.” Proceeding as in PL, one way to address this data-overfitting is to use regularization. In particular, we can use the entropic regularizer, given by the negative of the Shannon entropy $S(q_\theta)$. So we now want to find the minimizer of $E_{q_\theta}[G(x)] - T S(q_\theta)$, where $T$ is the regularization parameter. Equivalently, we can minimize $\beta E_{q_\theta}[G(x)] - S(q_\theta)$, where $\beta = 1/T$. This changes the definition of $\hat{F}$ from the function given in () to

$$\hat{F}(\theta) \triangleq \frac{\beta}{m} \sum_{i=1}^{m} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(q_\theta).$$

Solution Methodology

Unfortunately, it can be difficult to find the $\theta$ globally minimizing this new $\hat{F}$ for an arbitrary $D$. An alternative is to find a close approximation to that optimal $\theta$. One way to do this is as follows. First, we find the minimizer of

$$\frac{\beta}{m} \sum_{i=1}^{m} \frac{p(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(p) \qquad ()$$

over the set of all possible distributions $p(x)$ with domain $X$. We then find the $q_\theta$ that has minimal Kullback–Leibler (KL) divergence from this $p$, evaluated over $D_X$. That serves as our approximation to $\arg\min_\theta \hat{F}(\theta)$, and therefore as our estimate of the $\theta$ that minimizes $E_{q_\theta}(G)$.

The minimizer $p$ of () can be found in closed form; over $D_X$ it is the Boltzmann distribution $p^\beta(x^{(i)}) \propto \exp(-\beta G(x^{(i)}))$. The KL divergence in $D_X$ from this Boltzmann distribution to $q_\theta$ is

$$F(\theta) = \mathrm{KL}(p^\beta \,\|\, q_\theta) = \int_X dx\, p^\beta(x) \log\left(\frac{p^\beta(x)}{q_\theta(x)}\right).$$

The minimizer of this KL divergence is given by

$$\theta^\dagger = \arg\min_\theta \; -\sum_{i=1}^{m} \frac{\exp(-\beta G(x^{(i)}))}{h(x^{(i)})} \log(q_\theta(x^{(i)})). \qquad ()$$

This approach is an approximation to a regularized version of the naive MCO estimate of the $\theta$ that minimizes $E_{q_\theta}(G)$. The application of the technique of regularization in this context has the same motivation as it does in PL: to reduce bias plus variance.

Log-Concave Densities

If $q_\theta$ is log-concave in its parameters $\theta$, then the minimization problem in () is a convex optimization problem, and the optimal parameters can be found in closed form. Denote the likelihood ratios by $s^{(i)} = \exp(-\beta G(x^{(i)}))/h(x^{(i)})$. For a Gaussian $q_\theta$ with mean $\mu$ and covariance $\Sigma$, differentiating () with respect to the parameters $\mu$ and $\Sigma^{-1}$ and setting the derivatives to zero yields

$$\mu^\star = \frac{\sum_D s^{(i)} x^{(i)}}{\sum_D s^{(i)}}, \qquad \Sigma^\star = \frac{\sum_D s^{(i)} (x^{(i)} - \mu^\star)(x^{(i)} - \mu^\star)^T}{\sum_D s^{(i)}}.$$

Mixture Models

The single Gaussian is a fairly restrictive class of models. Mixture models (see →Mixture Modeling) can significantly improve flexibility, but at the cost of convexity of the KL distance minimization problem. However, a plethora of techniques from supervised learning, in particular the expectation maximization (EM) algorithm, can be applied with minor modifications. Suppose $q_\theta$ is a mixture of $M$ Gaussians, that is, $\theta = (\mu, \Sigma, \phi)$ where $\phi$ is the mixing p.m.f. We can then view the problem as one where a hidden variable $z$ decides which mixture component each sample is drawn from. We then have the optimization problem

$$\text{minimize} \; -\sum_D \frac{p(x^{(i)})}{h(x^{(i)})} \log\left(q_\theta(x^{(i)}, z^{(i)})\right).$$

Following the standard EM procedure, we get the algorithm described in (). Since this is a nonconvex problem, one typically runs the algorithm multiple times with random initializations of the parameters.

E-step: For each $i$, set $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)})$, that is,

$$w_j^{(i)} = q_{\mu,\Sigma,\phi}(z^{(i)} = j \mid x^{(i)}), \quad j = 1, \ldots, M.$$

M-step: Set

$$\mu_j = \frac{\sum_D w_j^{(i)} s^{(i)} x^{(i)}}{\sum_D w_j^{(i)} s^{(i)}}, \qquad \Sigma_j = \frac{\sum_D w_j^{(i)} s^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_D w_j^{(i)} s^{(i)}}, \qquad \phi_j = \frac{\sum_D w_j^{(i)} s^{(i)}}{\sum_D s^{(i)}}.$$
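The weighted E-step and M-step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the samples are one-dimensional, so each covariance reduces to a scalar variance, and the function names `likelihood_ratios` and `weighted_em_step` are hypothetical. With a single component the step reduces to the closed-form weighted mean and variance of the single-Gaussian case.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def likelihood_ratios(g_values, h_values, beta):
    """Weights s_i = exp(-beta * G(x_i)) / h(x_i)."""
    return [math.exp(-beta * g) / h for g, h in zip(g_values, h_values)]


def weighted_em_step(xs, s, mus, variances, phis):
    """One EM iteration for a 1-D Gaussian mixture with per-sample weights s."""
    n_components = len(mus)
    # E-step: responsibilities w[i][j] = q(z_i = j | x_i) under current parameters.
    w = []
    for x in xs:
        joint = [phis[j] * normal_pdf(x, mus[j], variances[j])
                 for j in range(n_components)]
        total = sum(joint)
        w.append([p / total for p in joint])
    # M-step: every sum over the data is weighted by w[i][j] * s[i].
    s_total = sum(s)
    new_mus, new_vars, new_phis = [], [], []
    for j in range(n_components):
        ws = [w[i][j] * s[i] for i in range(len(xs))]
        norm = sum(ws)
        mu = sum(a * x for a, x in zip(ws, xs)) / norm
        var = sum(a * (x - mu) ** 2 for a, x in zip(ws, xs)) / norm
        new_mus.append(mu)
        new_vars.append(max(var, 1e-9))   # floor avoids a degenerate component
        new_phis.append(norm / s_total)
    return new_mus, new_vars, new_phis
```

In practice the step would be iterated until the parameters stop changing, and restarted from several random initializations because the problem is nonconvex. The variance floor is an added safeguard, not something prescribed by the text.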

Test Problems

To compare the performance of this algorithm with and without the use of PL techniques, we use a couple of very simple academic problems in two and four dimensions: the Rosenbrock function in two dimensions, given by

$$G_R(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2,$$

and the Woods function in four dimensions, given by

$$G_{\mathrm{Woods}}(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90\,(x_4 - x_3^2)^2 + (1 - x_3)^2 + 10.1\,[(1 - x_2)^2 + (1 - x_4)^2] + 19.8\,(1 - x_2)(1 - x_4).$$
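For reference, the two test functions can be written directly in Python. The sketch below uses the standard textbook coefficients of these benchmarks (100 for Rosenbrock; 100, 90, 10.1, and 19.8 for Woods); the function names are hypothetical.

```python
def rosenbrock(x):
    """Two-dimensional Rosenbrock function (standard coefficients)."""
    x1, x2 = x
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2


def woods(x):
    """Four-dimensional Woods function (standard coefficients)."""
    x1, x2, x3, x4 = x
    return (100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2
            + 90.0 * (x4 - x3 ** 2) ** 2 + (1.0 - x3) ** 2
            + 10.1 * ((1.0 - x2) ** 2 + (1.0 - x4) ** 2)
            + 19.8 * (1.0 - x2) * (1.0 - x4))
```

Both functions vanish at the all-ones vector, which is their global minimum.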


For the Rosenbrock, the optimum value of 0 is achieved at $x = (1, 1)$, and for the Woods problem, the optimum value of 0 is achieved at $x = (1, 1, 1, 1)$.

Application of PL Techniques

As mentioned above, there are many PL techniques beyond regularization that are designed to optimize the trade-off between bias and variance. So having cast the solution of $\arg\min_{q_\theta} E(G)$ as an MCO problem, we can apply those other PL techniques instead of (or in addition to) entropic regularization. This should improve the performance of our MCO algorithm, for the exact same reason that using those techniques to trade off bias and variance improves performance in PL. We briefly mention some of those alternative techniques here.

The overall MCO algorithm is broadly described in Algorithm . For the Woods problem, samples of x are drawn from the updated $q_\theta$ at each iteration, and for the Rosenbrock, samples. For comparing various methods and plotting purposes, , samples of $G(x)$ are drawn to evaluate $E_{q_\theta}[G(x)]$. Note that in an actual optimization, we would not draw these test samples. All the performance results in Fig. are based on runs of the PC algorithm, randomly initialized each time. The sample mean performance across these runs is plotted along with % confidence intervals for this sample mean (shaded regions).

Algorithm Overview of pq minimization using Gaussian mixtures
1. Draw uniform random samples on X
2. Initialize regularization parameter β
3. Compute G(x) values for those samples
4. repeat
5.   Find a mixture distribution $q_\theta$ to minimize sampled pq KL distance
6.   Sample from $q_\theta$
7.   Compute G(x) for those samples
8.   Update β
9. until Termination
10. Sample final $q_\theta$ to get solution(s).

→Cross-Validation for Regularization: We note that we are using regularization to reduce variance, but that regularization introduces bias. As is done in PL, we use standard k-fold cross-validation to trade off this bias and variance. We do this by partitioning the data into k disjoint sets. The held-out data for the ith fold is just the ith partition, and the held-in data is the union of all other partitions. First, we “train” the regularized algorithm on the held-in data $D_t$ to get an optimal set of parameters $\theta^\star$, then “test” this $\theta^\star$ by considering unregularized performance on the held-out data $D_v$. In our context, “training” refers to finding optimal parameters by KL distance minimization using the held-in data, and “testing” refers to estimating $E_{q_\theta}[G(x)]$ on the held-out data using the following formula (Robert & Casella, ):

$$\hat{g}(\theta) = \frac{\sum_{D_v} q_\theta(x^{(i)})\, G(x^{(i)}) / h(x^{(i)})}{\sum_{D_v} q_\theta(x^{(i)}) / h(x^{(i)})}.$$

We do this for several values of the regularization parameter β in the interval $k_1\beta < \beta < k_2\beta$, and choose the one that yields the best held-out performance, averaged over all folds. For our experiments, $k_1$ = ., $k_2$ = , and we use five equally-spaced values in this interval. Having found the best regularization parameter in this range, we then use all the data to minimize KL distance using this optimal value of β. Note that all cross-validation is done without any additional evaluations of $G(x)$.

Cross-validation for β in PC is similar to optimizing the annealing schedule in simulated annealing. This “auto-annealing” is seen in Fig. a, which shows the variation of β with iterations of the Rosenbrock problem. It can be seen that the value of β sometimes decreases from one iteration to the next. This can never happen in any kind of “geometric annealing schedule,” $\beta \leftarrow k_\beta \beta$, $k_\beta > 1$, of the sort that is often used in the literature. In fact, we ran trials of this algorithm on the Rosenbrock and then computed a best-fit geometric variation for β, that is, a nonlinear least squares fit to the variation of β, and a linear least squares fit to the variation of log(β). These are shown in Fig. c and d. As can be seen, neither is a very good fit. We then ran trials of the algorithm with the fixed update rule obtained by best-fit to log(β), and found that the adaptive setting of β using cross-validation performed an order of magnitude better, as shown in Fig. e.
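The cross-validation loop for β can be sketched as follows. This is an illustration under simplifying assumptions: $q_\theta$ is a single one-dimensional Gaussian fitted by the closed-form weighted update, the folds are interleaved index sets, and the helper names (`fit_gaussian`, `held_out_estimate`, `cross_validate_beta`) are hypothetical. Only stored values of $G$ are used, so no additional evaluations of $G(x)$ are needed.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian(xs, gs, hs, beta):
    """'Training': weighted fit with weights s_i = exp(-beta * G_i) / h_i."""
    g_min = min(gs)   # constant shift for numerical stability; it cancels below
    s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(gs, hs)]
    total = sum(s)
    mu = sum(si * x for si, x in zip(s, xs)) / total
    var = sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total
    return mu, max(var, 1e-6)   # variance floor is an added safeguard


def held_out_estimate(theta, xs, gs, hs):
    """'Testing': self-normalized importance-sampling estimate of E_q[G]."""
    mu, var = theta
    ratios = [normal_pdf(x, mu, var) / h for x, h in zip(xs, hs)]
    denom = sum(ratios)
    if denom == 0.0:            # q puts no mass on the held-out points
        return math.inf
    return sum(r * g for r, g in zip(ratios, gs)) / denom


def cross_validate_beta(xs, gs, hs, betas, k=5):
    """Return (best_beta, scores) using k-fold cross-validation."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint index sets
    scores = {}
    for beta in betas:
        total = 0.0
        for held_out in folds:
            out = set(held_out)
            held_in = [i for i in range(n) if i not in out]
            theta = fit_gaussian([xs[i] for i in held_in],
                                 [gs[i] for i in held_in],
                                 [hs[i] for i in held_in], beta)
            total += held_out_estimate(theta,
                                       [xs[i] for i in held_out],
                                       [gs[i] for i in held_out],
                                       [hs[i] for i in held_out])
        scores[beta] = total / k   # held-out performance averaged over folds
    best_beta = min(scores, key=scores.get)
    return best_beta, scores
```

After selecting β, one would refit on all the data with the chosen value, as described above.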

[Figure: eight panels, each plotted against iteration. Panel titles: “Cross-validation for β: log(β) History”; “Cross-validation for β: log[E(G)] History”; “Least-squares Fit to β”; “Least-squares Fit to log(β)”; “Cross-validation for Regularization: Woods Problem”; “Cross-validation for Model-selection: 2-D Rosenbrock”; “Bagging: Noisy Rosenbrock”; “Model Selection Methods: Noisy Rosenbrock”. Vertical axes: log(β), β, or log[E(G)]. Legend entries: best-fit β and cross-validation for β; single Gaussian and mixture model; no bagging and bagging; single Gaussian, cross-validation, and stacking.]

Bias-Variance Trade-offs: Novel Applications. Figure . Various PL techniques improve MCO performance


Cross-Validation for Model Selection: Given a set Θ (sometimes called a model class) to choose θ from, we can find an optimal θ ∈ Θ. But how do we choose the set Θ? In PL, this is done using cross-validation. We choose that set Θ such that $\arg\min_{\theta \in \Theta} \hat{F}(\theta)$ has the best held-out performance. As before, we use that model class Θ that yields the lowest estimate of $E_{q_\theta}[G(x)]$ on the held-out data. We demonstrate the use of this PL technique for minimizing the Rosenbrock problem, which has a long curved valley that is poorly approximated by a single Gaussian. We use cross-validation to choose among Gaussian mixtures with up to four components. The improvement in performance is shown in Fig. d.

Bagging: In bagging (Breiman, a), we generate multiple data sets by resampling the given data set with replacement. These new data sets will, in general, contain replicates. We “train” the learning algorithm on each of these resampled data sets, and average the results. In our case, we average the $q_\theta$ obtained by our KL divergence minimization on each data set. PC works even on stochastic objective functions, and on the noisy Rosenbrock, we implemented PC with bagging by resampling ten times, and obtained significant performance gains, as seen in Fig. g.

Stacking: In bagging, we combine estimates of the same learning algorithm on different data sets generated by resampling, whereas in stacking (Breiman, b; Smyth & Wolpert, ), we combine estimates of different learning algorithms on the same data set. These combined estimates are often better than any of the single estimates. In our case, we combine the $q_\theta$ obtained from our KL divergence minimization algorithm using multiple models Θ. Again, Fig. h shows that cross-validation for model selection performs better than a single model, and stacking performs slightly better than cross-validation.
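The bagging step described above can be sketched as follows. The sketch assumes one-dimensional samples, a single weighted Gaussian fit per resampled data set, and pointwise averaging of the fitted densities; this last choice is one plausible reading of "average the $q_\theta$", not necessarily the authors' exact procedure. The names `fit_gaussian` and `bagged_density` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian(xs, s):
    """Weighted mean and variance with per-sample weights s."""
    total = sum(s)
    mu = sum(si * x for si, x in zip(s, xs)) / total
    var = sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total
    return mu, max(var, 1e-9)


def bagged_density(xs, s, n_bags=10, seed=0):
    """Fit one Gaussian per bootstrap resample and average the densities."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    for _ in range(n_bags):
        # Resample indices with replacement; replicates are expected.
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(fit_gaussian([xs[i] for i in idx], [s[i] for i in idx]))

    def q_bag(x):
        # Pointwise average of the fitted densities (an equal-weight mixture).
        return sum(normal_pdf(x, mu, var) for mu, var in fits) / len(fits)

    return q_bag, fits
```

The averaged density is itself a valid density, so it can be sampled from in the next iteration by first picking one of the fitted Gaussians uniformly at random.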

Conclusions

The conventional goal of reducing bias plus variance has interesting applications in a variety of fields. In straightforward applications, bias-variance trade-offs can decrease the MSE of estimators, reduce the generalization error of learning algorithms, and so on. In this article, we described a novel application of bias-variance trade-offs: we placed bias-variance trade-offs in the context of MCO, and discussed the need for higher moments in the trade-off, such as a bias-variance-covariance trade-off. We also showed a way of applying just a bias-variance trade-off, as used in Parametric Learning, to improve the performance of MCO algorithms.

Recommended Reading

Angluin, D. (). Computational learning theory: Survey and selected bibliography. In Proceedings of the twenty-fourth annual ACM symposium on theory of computing. New York: ACM.
Berger, J. O. (). Statistical decision theory and Bayesian analysis. New York: Springer.
Breiman, L. (a). Bagging predictors. Machine Learning, (), –.
Breiman, L. (b). Stacked regression. Machine Learning, (), –.
Buntine, W., & Weigend, A. (). Bayesian back-propagation. Complex Systems, , –.
Ermoliev, Y. M., & Norkin, V. I. (). Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR--. International Institute for Applied Systems Analysis, Austria.
Lepage, G. P. (). A new algorithm for adaptive multidimensional integration. Journal of Computational Physics, , –.
Mackay, D. (). Information theory, inference, and learning algorithms. Cambridge, UK: Cambridge University Press.
Robert, C. P., & Casella, G. (). Monte Carlo statistical methods. New York: Springer.
Rubinstein, R., & Kroese, D. (). The cross-entropy method. New York: Springer.
Smyth, P., & Wolpert, D. (). Linearly combining density estimators via stacking. Machine Learning, (–), –.
Vapnik, V. N. (). Estimation of dependences based on empirical data. New York: Springer.
Vapnik, V. N. (). The nature of statistical learning theory. New York: Springer.
Wolpert, D. H. (). On bias plus variance. Neural Computation, , –.
Wolpert, D. H., & Rajnarayan, D. (). Parametric learning and Monte Carlo optimization. arXiv:.v [cs.LG].
Wolpert, D. H., Strauss, C. E. M., & Rajnarayan, D. (). Advances in distributed optimization using probability collectives. Advances in Complex Systems, (), –.

Bias-Variance Trade-offs

→Bias-Variance


Bias-Variance-Covariance Decomposition

The bias-variance-covariance decomposition is a theoretical result underlying →ensemble learning algorithms. It is an extension of the →bias-variance decomposition, for linear combinations of models. The expected squared error of the ensemble $\bar{f}(x)$ from a target $d$ is:

$$E_D\{(\bar{f}(x) - d)^2\} = \overline{\mathrm{bias}}^2 + \frac{1}{T}\,\overline{\mathrm{var}} + \left(1 - \frac{1}{T}\right)\overline{\mathrm{covar}},$$

where $T$ is the number of models in the ensemble. The error is composed of the average bias of the models, plus a term involving their average variance, and a final term involving their average pairwise covariance. This shows that while a single model has a two-way bias-variance tradeoff, an ensemble is controlled by a three-way tradeoff. This ensemble tradeoff is often referred to as the accuracy-diversity dilemma for an ensemble. See →ensemble learning for more details.
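The decomposition can be checked numerically. The sketch below assumes a uniformly weighted ensemble of $T$ models evaluated at a single test point, with `preds[n][t]` holding the prediction of model `t` trained on the `n`-th draw of the training set; the function name `decomposition` and the data layout are hypothetical. When all moments are computed with the same $1/N$ normalization, the two sides agree up to floating-point rounding.

```python
def decomposition(preds, d):
    """Return (ensemble_mse, bias^2 + var/T + (1 - 1/T) * covar)."""
    N, T = len(preds), len(preds[0])
    means = [sum(row[t] for row in preds) / N for t in range(T)]

    def cov(a, b):
        # Covariance between models a and b across training-set draws.
        return sum((row[a] - means[a]) * (row[b] - means[b]) for row in preds) / N

    avg_bias = sum(m - d for m in means) / T
    avg_var = sum(cov(t, t) for t in range(T)) / T
    avg_cov = sum(cov(a, b) for a in range(T) for b in range(T) if a != b) / (T * (T - 1))
    # Left-hand side: squared error of the averaged prediction.
    ensemble_mse = sum((sum(row) / T - d) ** 2 for row in preds) / N
    return ensemble_mse, avg_bias ** 2 + avg_var / T + (1.0 - 1.0 / T) * avg_cov
```

Making the models less correlated lowers the covariance term, which is the formal content of the accuracy-diversity dilemma mentioned above.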

Bilingual Lexicon Extraction

Bilingual lexicon extraction is the task of automatically identifying terms in a first language and terms in a second language which are translations of one another. In this context, a term can be either a single word or an expression composed of several words, the full meaning of which cannot be derived compositionally from the meaning of the individual words. Bilingual lexicon extraction is itself a form of →cross-lingual text mining and is an essential preliminary step in many approaches for performing other →cross-lingual text mining tasks.

Binning

→Discretization


Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity

Wulfram Gerstner
Brain Mind Institute, Lausanne EPFL, Switzerland

Synonyms

Correlation-based learning; Hebb rule; Hebbian learning

Definition

The brain of humans and animals consists of a large number of interconnected neurons. Learning in biological neural systems is thought to take place by changes in the connections between these neurons. Since the contact points between two neurons are called synapses, the change in the connection strength is called synaptic plasticity. The mathematical description of synaptic plasticity is called a (biological) learning rule. Most of these biological learning rules can be categorized in the context of machine learning as unsupervised learning rules, and the remaining ones as reward-based or reinforcement learning. The Hebb rule is an example of an unsupervised correlation-based learning rule formulated on the level of neuronal firing rates. Spike-timing-dependent plasticity (STDP) is an unsupervised learning rule formulated on the level of spikes. Modulation of learning rates in a Hebb rule or STDP rule by a diffusive signal carrying reward-related information yields a biologically plausible form of a reinforcement learning rule.

Motivation and Background

Humans and animals can adapt to environmental conditions and learn new tasks. Learning becomes measurable by changes in the behavior: humans and animals get better at seeing and distinguishing visual objects with experience; animals can learn to go to a target location; humans can memorize a list of words and recall the items days later. How learning is implemented in the biological substrate is only partially known.

The brain consists of billions of neurons. Each neuron has long wire-like extensions and makes contacts with thousands of other neurons. This network of neurons is not fixed but constantly changes. Connections can be formed or can disappear, and existing connections can be strengthened or weakened. Neuroscientists have shown in numerous experiments that changes can be induced by stimulating neuronal activity in an appropriate fashion. Moreover, changes in synaptic connections that have been induced in one or a few seconds can persist for hours or days, an effect called long-term potentiation (LTP) or long-term depression (LTD) of synapses.

The question arises of whether such long-lasting changes in connections are useful for learning. To answer this question, research in theoretical and computational neuroscience needs to solve two problems: first, develop a compact but realistic description of the phenomenon of synaptic plasticity observed in biology, i.e., extract learning rules from the biological data; and second, study the functional consequences of these learning rules. An important insight from experiments on LTP is that the activation of a synaptic connection alone does not lead to a long-lasting change; however, if the activation of the synapses by presynaptic signals is combined with some activation of the postsynaptic neuron, then a long-lasting change of the synapse may occur. The coactivation of presynaptic and postsynaptic neurons as a condition for learning is the key ingredient of Hebbian learning rules. Here, activation of the presynaptic neuron means that it fires one or several action potentials; activation of the postsynaptic neuron can be represented by high firing rates, a few well-timed action potentials, or input from other neurons that leads to an increase in the membrane voltage.

Structure of the Learning System

The Hebb Rule

Hebbian learning rules are local, i.e., they depend only on the state of the presynaptic and postsynaptic neurons plus possibly the current value of the synaptic weight itself. Let $w_{ij}$ denote the weight between a presynaptic neuron $j$ and a postsynaptic neuron $i$, and let us describe the activity (e.g., the firing rate) of each neuron by a continuous variable $\nu_j$ and $\nu_i$, respectively. Mathematically, we may therefore write for a local learning rule

$$\frac{d}{dt} w_{ij} = F(w_{ij}; \nu_i, \nu_j) \qquad ()$$

where $F$ is an unknown function. In addition to locality, Hebbian learning requires some kind of cooperation or correlation between the activity of the presynaptic neuron and that of the postsynaptic neuron. At the moment we restrict ourselves to the requirement of simultaneous activity of presynaptic and postsynaptic neurons. Since $F$ is a function of the rates $\nu_i$ and $\nu_j$, we may expand $F$ about $\nu_i = \nu_j = 0$. An expansion to second order of the rates yields

$$\frac{d}{dt} w_{ij}(t) \approx c_0(w_{ij}) + c_1^{\mathrm{pre}}(w_{ij})\, \nu_j + c_1^{\mathrm{post}}(w_{ij})\, \nu_i + c_{\mathrm{corr}}(w_{ij})\, \nu_i \nu_j + c_2^{\mathrm{post}}(w_{ij})\, \nu_i^2 + c_2^{\mathrm{pre}}(w_{ij})\, \nu_j^2 + O(\nu^3). \qquad ()$$

Here, $\nu_i$ and $\nu_j$ are functions of time, i.e., $\nu_i(t)$ and $\nu_j(t)$, and so is the weight $w_{ij}$. The bilinear term $\nu_i(t)\, \nu_j(t)$ is sensitive to the instantaneous correlations between presynaptic and postsynaptic activities. It is this term that makes Hebbian learning a useful concept. The simplest implementation of Hebbian plasticity would be to require $c_{\mathrm{corr}} > 0$ and set all other parameters in the expansion () to zero

$$\frac{d}{dt} w_{ij} = c_{\mathrm{corr}}(w_{ij})\, \nu_i \nu_j. \qquad ()$$
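A small simulation illustrates how this prototype rule behaves and why additional terms are needed for stability. The sketch below integrates the rule in discrete time for one linear postsynaptic neuron with two inputs, and compares it with a variant that adds a negative term quadratic in the postsynaptic rate and proportional to the weight (the standard form of Oja's rule, which is discussed in the text that follows). The zero-mean Gaussian inputs, the linear neuron model, the learning rate, and the function name `simulate` are illustrative assumptions.

```python
import math
import random


def simulate(rule, steps=5000, eta=0.01, seed=0):
    """Integrate a rate-based learning rule for one linear neuron, two inputs."""
    rng = random.Random(seed)
    w = [0.1, 0.0]
    for _ in range(steps):
        # Correlated zero-mean inputs: most variance lies along (1, 1)/sqrt(2).
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.3)
        x = [(u + v) / math.sqrt(2.0), (u - v) / math.sqrt(2.0)]
        y = w[0] * x[0] + w[1] * x[1]          # postsynaptic rate
        for j in range(2):
            dw = y * x[j]                      # Hebbian correlation term
            if rule == "oja":
                dw -= y * y * w[j]             # negative post-squared term
            w[j] += eta * dw
    return w


w_hebb = simulate("hebb")   # weight norm grows without bound
w_oja = simulate("oja")     # weight norm settles near 1
```

Under these assumptions the plain Hebbian weights grow exponentially, while the normalized variant converges to a unit-length vector aligned with the dominant direction of input variance.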

Equation () with fixed parameter $c_{\mathrm{corr}} > 0$ is the prototype of Hebbian learning. However, since the activity variables $\nu_i$ and $\nu_j$ are always positive, such a rule will lead eventually to an increase of all weights in a network. Hence, some of the other terms (e.g., $c_0$ or a presynaptic coefficient $c^{\mathrm{pre}}$) need to have a negative coefficient to make Hebbian learning stable. In passing we note that a learning rule with $c_{\mathrm{corr}} < 0$ is usually called anti-Hebbian.

Oja’s rule. A particularly interesting case is a model with coefficients $c_{\mathrm{corr}} > 0$ and $c_2^{\mathrm{post}} < 0$, since it guarantees the normalization of the set of weights $w_{i1}, \ldots, w_{iN}$ converging onto the same postsynaptic neuron $i$.

BCM rule. The Bienenstock–Cooper–Munro learning rule (also called BCM rule) with

$$\frac{d}{dt} w_{ij} = a(w_{ij})\, \Phi(\nu_i - \vartheta)\, \nu_j \qquad ()$$

where $\Phi$ is some nonlinear function with $\Phi(0) = 0$, is a special case of (). The parameter $\vartheta$ depends on the average firing rate.

Temporally asymmetric Hebbian learning. In the Taylor expansion () we focused on instantaneous correlations. More generally, we can use a Volterra expansion so as to also include temporal correlations with nonzero time lag. With the additional assumption that changes are instantaneous, a Volterra expansion generates terms of the form

$$\frac{d}{dt} w_{ij} \propto \int_0^\infty \left[ W_+(s)\, \nu_i(t)\, \nu_j(t - s) + W_-(s)\, \nu_j(t)\, \nu_i(t - s) \right] ds \qquad ()$$

with some functions $W_+$ and $W_-$. For reasons of causality, $W_+$ and $W_-$ must vanish for $s < 0$. Since $W_+(s) \neq W_-(s)$, learning is asymmetric in time, so that learning rules of the form () are called temporally asymmetric Hebbian learning. In the special case $W_+(s) = -W_-(s)$, we have antisymmetric Hebbian learning. The functions $W_+$ and $W_-$ may depend on the present weight value.

STDP rule. STDP is a form of Hebbian learning with increased temporal resolution. In contrast to rate-based Hebb models, neuronal activity is described by the firing times of the neuron, i.e., the moments when the presynaptic and postsynaptic neurons emit action potentials. Let $t_j^f$ denote the $f$th spike of the presynaptic neuron $j$ and $t_i^n$ the $n$th spike of the postsynaptic neuron $i$. The weight change in an STDP rule depends on the exact timing of presynaptic and postsynaptic spikes

$$\frac{d}{dt} w_{ij} = \sum_n \sum_f \left[ A(w_{ij}; t - t_j^f)\, \delta(t - t_i^n) + B(w_{ij}; t - t_i^n)\, \delta(t - t_j^f) \right] \qquad ()$$

where $A(w_{ij}; x)$ and $B(w_{ij}; x)$ are some real-valued functions with $A(w_{ij}; x) = B(w_{ij}; x) = 0$ for $x < 0$. Thus, at the moment of a postsynaptic spike the synaptic weight is updated by an amount that depends on the time $t_i^n - t_j^f$ since a previous presynaptic spike $t_j^f$. Similarly, at the moment of a presynaptic spike the synaptic weight is updated by an amount that depends on the time $t_j^f - t_i^n$ since a previous postsynaptic spike $t_i^n$. The dependence on the present value $w_{ij}$ can be used to keep the weight in a desired range $0 < w_{ij} < w_{\max}$. A standard choice for the functions $A$ and $B$ is $A(w_{ij}; t - t_j^f) = A_+(w_{ij}) \exp[-(t - t_j^f)/\tau_+]$ for $t - t_j^f > 0$ and zero otherwise. Similarly, $B(w_{ij}; t - t_i^n) = B_-(w_{ij}) \exp[-(t - t_i^n)/\tau_-]$ for $t - t_i^n > 0$ and zero otherwise. Here, $\tau_+$ and $\tau_-$ are time constants in the range of – ms. The case $A_+(x) = (w_{\max} - x)\, c_+$ and $B_-(x) = -c_-\, x$ is called soft bounds. The choice $A_+(x) = c_+ \Theta(w_{\max} - x)$ and $B_-(x) = -c_- \Theta(x)$ is called hard bounds. Here, $c_+$ and $c_-$ are positive constants. The term proportional to $A_+$ causes potentiation (weight increase), the one proportional to $B_-$ causes depression (weight decrease) of synapses. Note that the STDP rule () can be interpreted as a spike-based form of temporally asymmetric Hebbian learning.

Functional Consequences of Hebbian Learning

Sensitivity to correlations. All Hebbian learning rules are sensitive to the correlations between the activity of the presynaptic neuron $j$ and that of the postsynaptic neuron $i$. If the activity of the postsynaptic neuron is given by a linear sum of all input rates, i.e., $\nu_i = \gamma \sum_j w_{ij} \nu_j$, then correlations between presynaptic and postsynaptic activities can be traced back to correlations in the input. A particularly clear example of learning driven by correlations in the input is Oja’s learning rule applied to a statistical ensemble of inputs with zero mean. In this case, the postsynaptic neuron becomes sensitive to the dominant principal component of the input ensemble. If the neuron model is nonlinear, Hebbian learning extracts the independent components of the statistical input ensemble. These two examples show that learning by a Hebbian learning rule makes neurons adapt to the statistics of the input. While the condition of zero-mean input is biologically not realistic (because neuronal firing rates are always positive), this condition can be relaxed so that the same result is also applicable to biologically plausible learning rules.

Receptive fields and cortical maps. Neurons in the primary visual cortex of cats and monkeys respond to visual stimuli in a localized region of the visual field. This small sensitive zone is called the receptive field of the neuron. Neighboring neurons normally have very similar receptive fields. The exact location and properties of the receptive field are not fixed, but can be influenced by sensory stimulation. Models of unsupervised Hebbian learning can explain the development of receptive fields and the adaptation of cortical maps to the statistics of the ensemble of stimuli.

Beyond the Hebb rule. Standard models of Hebbian learning are formulated on the level of neuronal firing rates, a graded variable characterizing neuronal activity. However, real neurons communicate by spikes, short electrical pulses or “action potentials” with a rather stereotyped time course. Experiments have shown that the changes of synaptic efficacy depend not only on the mean firing rate of action potentials but also on the relative timing of presynaptic and postsynaptic spikes on the level of milliseconds. This Spike-Timing Dependent Synaptic Plasticity (STDP) can be considered a temporally more precise form of Hebbian learning. The STDP rule indicated above supposes that pairs of spikes (one presynaptic and one postsynaptic action potential) within some time window cause a weight change. However, experimentally it was shown that at least three spikes are necessary (one presynaptic and two postsynaptic spikes). Moreover, the voltage of the postsynaptic neuron matters even in the absence of spikes.

In most models of Hebbian learning and STDP, the prefactors (the coefficients $c$ of the learning rule) are constant or depend only on the synaptic weight. However, in a biological context the speed of learning is often gated by neuromodulators. Since some of these neuromodulators contain reward-related information, one can think of learning as a three-factor rule where weight changes depend on presynaptic activity, postsynaptic activity, and the presence of a reward-related factor. A prominent neuromodulator linked to reward information is dopamine. Three-factor learning rules fall in the class of reinforcement learning algorithms.
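The pair-based STDP rule with exponential windows and soft bounds can be sketched as an event-driven update. The parameter values below (amplitudes, 20 ms time constants, initial weight) and the function name `stdp_weight` are hypothetical choices for illustration; every earlier spike of the opposite type contributes to each update, and a presynaptic and postsynaptic spike at exactly the same time are processed post-first, which is an arbitrary tie-breaking assumption.

```python
import math


def stdp_weight(pre_spikes, post_spikes, w0=0.5, w_max=1.0,
                c_plus=0.01, c_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Apply pair-based STDP with soft bounds; spike times are in ms."""
    events = sorted([(t, "pre") for t in pre_spikes] +
                    [(t, "post") for t in post_spikes])
    w = w0
    seen_pre, seen_post = [], []
    for t, kind in events:
        if kind == "post":
            # Potentiation: earlier presynaptic spikes, window exp(-dt / tau_plus),
            # amplitude A_plus(w) = c_plus * (w_max - w).
            trace = sum(math.exp(-(t - tp) / tau_plus) for tp in seen_pre)
            w += c_plus * (w_max - w) * trace
            seen_post.append(t)
        else:
            # Depression: earlier postsynaptic spikes, window exp(-dt / tau_minus),
            # amplitude B_minus(w) = -c_minus * w.
            trace = sum(math.exp(-(t - tq) / tau_minus) for tq in seen_post)
            w -= c_minus * w * trace
            seen_pre.append(t)
    return w


w_pre_first = stdp_weight([10.0], [15.0])    # pre leads post: potentiation
w_post_first = stdp_weight([15.0], [10.0])   # post leads pre: depression
```

A triplet rule or a voltage-dependent rule, as suggested by the experiments mentioned above, would need additional state beyond these pairwise traces.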

Cross References

→Dimensionality Reduction
→Reinforcement Learning
→Self-Organizing Maps


Biomedical Informatics

C. David Page, Sriraam Natarajan
University of Wisconsin Medical School, Madison, USA

Introduction

Recent years have witnessed a tremendous increase in the use of machine learning for biomedical applications. This surge in interest has several causes. One is the successful application of machine learning technologies in other fields such as web search, speech and handwriting recognition, agent design, and spatial modeling. Another is the development of technologies that enable the production of large amounts of data in the time it used to take to generate a single data point (run a single experiment). A third, more recent development is the advent of Electronic Medical/Health Records (EMRs/EHRs). The drastic increase in the amount of data generated has led biologists and clinical researchers to adopt algorithms that can construct predictive models from large amounts of data. Naturally, machine learning is emerging as a tool of choice.

In this article, we will present a few data types and tasks involving such large-scale biological data, where machine learning techniques have been applied. For each of these data types and tasks, we first present the required background, followed by the challenges involved in addressing the tasks. Then, we present the machine learning techniques that have been applied to these data sets. Finally and most importantly, we present the lessons learned in these tasks. We hope that these lessons will be helpful to researchers who aim to apply machine learning algorithms to biological applications and equip them with useful knowledge when they collaborate with biological scientists. Some of the data types that we present in this work are:

● Gene expression microarrays
● SNPs and genetic data
● Mass spectrometry and other proteomic data
● High-throughput screening data for drug design
● Electronic Medical Records (EMR) and personalized medicine

Some of the key lessons learned from all these data types include the following: (1) We can often do surprisingly well with far more features than data points if there are many highly predictive features (e.g., predicting cancer from microarray data) and if we use methods that are robust to overfitting such as Voted Decision Stumps (Hardin et al., ; Waddell et al., ) (→Ensemble Learning and →Decision Stumps), →Naive Bayes (Golub et al., ; Listgarten et al., ), or Linear Support Vector Machines (SVMs) (see →Support Vector Machine) (Furey et al., ; Hardin et al., ). (2) Bayes Net learning (Friedman, ) (see →Bayesian Methods) often does not give us causality, but →Active Learning and →Time-Series data help if available (Pe’er, Regev, Elidan, & Friedman, ; Ong, Glassner, & Page, ; Tucker, Vinciotti, Hoen, Liu, & Famili, ; Zou & Conzen, ). (3) Multi-relational methods are useful for EMRs or molecular data as the data in these cases are very highly relational (see →Multi-relational Data Mining). (4) There are more important issues than just increasing the accuracy of the learned model on these data sets. Such issues include how the data was created, its comprehensibility (physicians typically want to understand the model that has been learned), and its privacy (some data sets contain private information that cannot be posted on public web sites and cannot even be downloaded off site).

The rest of the article is organized as follows: First we present gene expression microarrays, followed by SNPs and other genetic data. We then present mass spectrometry (MS) and related proteomic data. Next, we present high-throughput screening data for drug design, followed by EMR data and personalized medicine. For each of these data types, we motivate the problem and survey the different machine learning solutions. Finally, we conclude by outlining the lessons learned from all these data types and presenting some interesting and exciting directions for future research.

Gene Expression Microarrays

This data type was presented in detail in AI Magazine (Molla et al., ), and hence we only summarize it in this section. We encourage the reader to read Molla et al. () for more details on this data type. Genes are contained in the DNA of an organism. The mechanism by which proteins are produced from their corresponding genes is a two-step process. The first step is the transcription of a gene into a messenger RNA (mRNA), and in the second step, called translation, a protein is built using mRNA as a blueprint. One property that DNA and RNA have in common is that each is a chain of chemicals called bases. In the case of DNA, these bases are Adenine, Cytosine, Guanine, and Thymine, commonly referred to as A, C, G, and T, respectively. RNA has the same set of bases, except that in place of Thymine it has Uracil, commonly referred to as U. An important characteristic of DNA and RNA is complementarity, that is, each base only binds well with its complement: A with T (or U) and G with C. As a result of complementarity, a strand of either DNA or RNA has a strong affinity toward what is known as its reverse complement, which is a strand of either DNA or RNA whose bases, read in the opposite direction, are exactly complementary to those of the original strand. Complementarity is central to the processes of replication of the DNA and transcription. In addition, complementarity can be used to detect specific sequences of bases within strands of DNA and RNA. This is done by first synthesizing a probe, a piece of DNA that is the complement of a sequence that one wants to detect, and then introducing this probe to a solution containing the genetic material (DNA or RNA) to be searched. This solution of genetic material is called the sample. In theory, the probe will bind to the sample if and only if the probe finds its complement in the sample (in reality, this process is often imperfect). The act of binding between a sample and probe is called hybridization. Prior to the experiment, a biologist labels the probe using a fluorescent flag.
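The idealized "binds if and only if the complement is present" rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the example sequences are hypothetical, and exact string matching ignores the imperfect binding that occurs in real hybridization.

```python
# Complement of each DNA base; RNA's U pairs with A in the same way T does.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand.upper()))


def to_dna(strand: str) -> str:
    """Map an RNA strand onto the DNA alphabet by replacing U with T."""
    return strand.upper().replace("U", "T")


def probe_hybridizes(probe: str, sample: str) -> bool:
    """Idealized test: the probe binds iff its reverse complement occurs in the sample.

    Real hybridization is imperfect (partial matches can bind), which this
    exact-match sketch deliberately ignores.
    """
    return reverse_complement(probe) in to_dna(sample)


if __name__ == "__main__":
    probe = "GATTACA"        # hypothetical DNA probe
    sample = "CCUGUAAUCGG"   # hypothetical mRNA fragment
    print(reverse_complement(probe))        # TGTAATC
    print(probe_hybridizes(probe, sample))  # True
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the base-pairing table.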
Biomedical Informatics

After the hybridization experiment, one can easily scan to see if the probe has hybridized to its reverse complement in the sample. This allows the molecular biologist to determine the presence or absence of the sequence in the sample.

Gene Chips

DNA probe technology has been adapted for detection of tens of thousands of sequences simultaneously. This has become possible due to a device called a microarray or gene chip, the working of which is illustrated in Fig. . When using the chips, it is more common to label (luminescently) the samples than the probes. Thousands of copies of this labeled sample are spread across the probes, followed by washing away any copies that do not remain bound. Since the probes are attached at specific locations on the chip, if a labeled sample is detected at any position on the chip, the probe that has hybridized to its complement can be easily determined. The most common use of these gene chips is to measure the expression levels of various genes in the organism. Probes are typically on the order of -bases long, whereas samples are usually about times as long, with a large variation due to the process that breaks up long sequences of RNA into small samples (Molla et al., ). To understand the biology of an organism, say to understand human biology in order to design new drugs, lower blood pressure, or cure diabetes, it is necessary to understand the degree to which different genes get expressed as proteins under different conditions and in different cell types. It is much easier to estimate the amount of mRNA for a gene than the protein-production rate. Microarrays provide the

Biomedical Informatics. Figure . Hybridization of sample to probe. [Figure labels: labeled sample (RNA), hybridization, probes (DNA), gene chip surface.]

measurement of RNAs corresponding to the given gene rather than the amounts of protein. In brief, experiments with microarrays are performed as follows: As can be seen from the figure, probes are DNA strands attached to the gene chip surface. A typical probe length is bases (i.e., letters from A, C, G, T to represent a gene). There may be several different subsequences of these bases. Then the mRNA (which is the labeled sample) is passed over the microarray, and the mRNA will bind to the complementary DNA corresponding to the gene better than to the other DNA strands. Then the fluorescence levels of the different gene chip segments are measured, which in turn measures the amount of mRNA on that surface. This mRNA measurement serves as a surrogate for the expression level of the gene.

Machine Learning for Microarrays

The data from microarrays (gene chips) have been analyzed and used by machine learning researchers in two different ways:

1. Data points are genes. This is the case where the examples are genes while the features are the samples (measured expression levels of a single gene under a variety of conditions). The goal of this view is to categorize new genes based on the current set of examples.
2. Data points are samples (e.g., patients). This is the case where the examples are patients and the features are the measured expression levels of genes under one condition.

The problems have been approached in two different ways. In the 7Unsupervised Learning approach, the goal is to cluster the genes according to their expression levels or to cluster the patients (samples) based on their gene expression levels, or both. Hierarchical clustering is especially widely applied. As one of many examples, see Perou et al. (). In the 7Supervised Learning setting, the class labels are the categories of the genes or the samples. The latter is the more common supervised task, each sample being mRNA from a different patient (with the same cell type from each patient) or from an organism under different conditions, and the goal being to learn a model that accurately predicts the class based on the features. The features could be the patient’s expression values for each


gene, while the class labels might be the patient’s disease state. We discuss this task further in the subsequent paragraphs. Yet another widely studied supervised learning task is to predict cancer vs. normal for a wide variety of cancer types. One of the significant lessons learned is that it is easy to predict cancer vs. normal in patients based on gene expression using several machine learning techniques, largely regardless of the type of cancer. The main reason for this is that if cancer is present, many genes in the cancer cells “go haywire” and hence are very predictive of the cancer. The primary challenge in this prediction problem is the noise in the data (impure RNA, cross-hybridization, etc.). Other related tasks that have been addressed include distinguishing related cancer types and distinguishing cancer from a related benign condition. An early success was the work by Golub et al. (), distinguishing acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). They used a weighted voting algorithm similar to Naive Bayes and achieved very high accuracy. This result has been repeated on this data with many other machine learning (ML) approaches. Other work examined multiple myeloma vs. a benign condition. This task is challenging because the benign condition is very similar to the cancer, and hence the machine learning algorithms had a difficult time predicting accurately. We refer to Hardin et al. () for more details on the experiments. Another important lesson for machine learning researchers from this data type is that biologists often do not want one predictive model, but a rank-ordered list of genes that they can explore further with additional lab tests. Hence, there is a need to present a small set of highly interesting genes on which to perform follow-up experiments. Toward this end, statisticians have used mutual information or a t-test to rank the genes.
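As a rough illustration of such a ranking, the sketch below scores each gene with Welch's t statistic between two conditions and sorts the genes by absolute score. The choice of Welch's variant, the function names, and the toy expression values are assumptions made for the example, not details taken from the studies cited here.

```python
import math
from statistics import mean, variance


def welch_t(xs, ys):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se


def rank_genes(expression, labels):
    """Rank genes by |t| between the two classes in `labels`.

    expression: dict mapping gene name -> list of expression values, one per sample
    labels:     list of class labels with exactly two distinct values
    Returns a list of (gene, t) pairs, most discriminative gene first.
    """
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two conditions only"
    scored = []
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == classes[0]]
        b = [v for v, y in zip(values, labels) if y == classes[1]]
        scored.append((gene, welch_t(a, b)))
    return sorted(scored, key=lambda gt: abs(gt[1]), reverse=True)


if __name__ == "__main__":
    labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]
    expression = {  # toy values, not real measurements
        "gene_1": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],
        "gene_2": [2.0, 2.4, 1.9, 2.1, 2.2, 2.0],
        "gene_3": [0.5, 0.7, 0.4, 3.0, 3.3, 2.8],
    }
    for gene, t in rank_genes(expression, labels):
        print(f"{gene}: t = {t:.2f}")
```

Sorting by the absolute value keeps both over-expressed and under-expressed genes near the top of the list, which is usually what a follow-up experiment needs.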
When using a t-test, they check whether the mean expression levels are different under the two conditions (cancer vs. normal), yielding a p-value. But the issue is that when working with a large number of genes (typically in the order of ,), some genes could have low p-values by chance. This is known as the “multiple comparisons problem.” One solution is to do a Bonferroni correction (multiply p-values by the number of genes), but this can be a drastic step and may eliminate all the genes. There are other methods such as false discovery rate (Storey & Tibshirani, ) that use the notion of q-values. We do not go into the details of this method. But the key recommendation we make is that such a method should be used along with the supervised learning method, as the biological collaborators might be interested in the ranking of genes.

One of the most important research directions for the use of microarray data lies in prognosis and treatment. The features are the same as those of diagnosis, but the class value becomes life expectancy for a given treatment (or a positive response vs. no response to a given treatment). The goal is to use the person’s genes to make these predictions. An example of this is the breast cancer prognosis study (Van’t Veer et al., ), where the goal is to predict good prognosis (no metastasis within years of initial diagnosis) vs. poor prognosis. They used an ensemble of voting algorithms and obtained very good results. Nevertheless, an important lesson learned from this experiment and others was that when using 7cross-validation, there is a need to tune parameters and perform feature selection independently on each fold of the cross-validation. There can be a large number of features, and it is natural to want to reduce the size of the data set before working with it. But reducing the number of features by some measure of correlation with the class, such as information gain, using the entire data set means that on each fold of cross-validation, information has leaked from the labeled test set into the training process – labels of test cases were used to eliminate many features from the training set. Hence, selecting features by looking at the entire data set can partially negate the effect of cross-validation, sometimes yielding accuracy estimates that are more than % points overly optimistic.
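The leak described above can be reproduced with a small, self-contained sketch. Everything in it is an assumption chosen for illustration: the data are pure noise (so the true accuracy is chance), features are scored by the difference of class means rather than information gain, and the classifier is a simple nearest-centroid rule. The only point is the contrast between selecting features once on all rows (`leaky=True`) and selecting them inside each fold.

```python
import random


def avg(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def make_noise_data(n_samples, n_features, rng):
    """Gaussian noise features with alternating labels: no real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def select_features(X, y, rows, k):
    """Pick the k features whose class means differ most, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        return abs(avg(X[i][j] for i in pos) - avg(X[i][j] for i in neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid_accuracy(X, y, train, test, features):
    """Fit class centroids on `train`, return accuracy on `test`."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [avg(X[i][j] for i in members) for j in features]

    def predict(i):
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(features, centroids[c]))
                for c in centroids}
        return min(dist, key=dist.get)

    return avg(1.0 if predict(i) == y[i] else 0.0 for i in test)


def cross_validate(X, y, n_folds, k, leaky):
    """n-fold CV; leaky=True selects features once on ALL rows (the mistake)."""
    rows = list(range(len(X)))
    global_features = select_features(X, y, rows, k) if leaky else None
    accuracies = []
    for f in range(n_folds):
        test = rows[f::n_folds]
        train = [i for i in rows if i not in test]
        # Honest variant: feature selection sees only this fold's training rows.
        features = global_features if leaky else select_features(X, y, train, k)
        accuracies.append(nearest_centroid_accuracy(X, y, train, test, features))
    return avg(accuracies)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_noise_data(n_samples=40, n_features=500, rng=rng)
    print("leaky CV accuracy: ", cross_validate(X, y, n_folds=5, k=10, leaky=True))
    print("honest CV accuracy:", cross_validate(X, y, n_folds=5, k=10, leaky=False))
```

On noise data of this shape one would expect the honest estimate to hover around chance while the leaky estimate is inflated, although the exact numbers depend on the random seed.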
Hence the entire training process of selecting features, tuning parameters, and learning a model must be repeated for every fold in cross-validation by looking only at the training data for that fold. An important use of microarrays for prognosis and therapy is in the area of predictive personalized medicine (PPM). While we present the idea of PPM later in the paper, it must be mentioned that combining gene expression data with clinical trials of the patients to recommend the best treatment for the patients is a very exciting problem with promising impact in the area of PPM.

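[Figure content: nodes Gene A, Gene B, Gene C, and Gene D, with the following conditional probability tables.]
Gene A: P(A) = 0.2
Gene B: P(B | A = T) = 0.9, P(B | A = F) = 0.1
Gene C: P(C | A = T) = 0.8, P(C | A = F) = 0.1
Gene D: P(D | B = T, C = T) = 0.9, P(D | B = T, C = F) = 0.2, P(D | B = F, C = T) = 0.3, P(D | B = F, C = F) = 0.1

Biomedical Informatics. Figure . A simple Bayes net. The actual learning task typically involves thousands of variables

[Panel text belonging to the following figure: Problem: Not Causality. A is a good predictor of B. But is A regulating B? Ground truth might be:]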

Bayesian Networks for Regulatory Pathways: 7Bayesian Networks have been one of the successful machine learning methods used for the analysis of microarray data. Recall that a Bayes net is a directed acyclic graph (such as the one shown in Fig. ) that defines a joint distribution over the variables using a set of conditional distributions. Friedman and Halpern (Friedman & Halpern, ) were the first to use Bayes nets for the microarray data type. In particular, the problem that was considered was finding regulatory pathways in genes. This problem can be posed as a supervised learning task as follows:

● Given: A set of microarray experiments for a single organism under different conditions.
● Do: Learn a graphical model that accurately predicts expression of some genes in terms of others.
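To make the phrase "defines a joint distribution using a set of conditional distributions" concrete, the sketch below encodes the four-gene network from the figure and computes probabilities by enumeration. The table values are those shown in the figure; the arc structure (A is the parent of B and C, which are the parents of D) follows the description in the text, and the function names are hypothetical.

```python
from itertools import product

# Conditional probability tables read off the figure: each entry is the
# probability that the gene is expressed (True) given its parents.
P_A = 0.2
P_B_GIVEN_A = {True: 0.9, False: 0.1}
P_C_GIVEN_A = {True: 0.8, False: 0.1}
P_D_GIVEN_BC = {(True, True): 0.9, (True, False): 0.2,
                (False, True): 0.3, (False, False): 0.1}


def bernoulli(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) * P(b|a) * P(c|a) * P(d|b,c)."""
    return (bernoulli(P_A, a)
            * bernoulli(P_B_GIVEN_A[a], b)
            * bernoulli(P_C_GIVEN_A[a], c)
            * bernoulli(P_D_GIVEN_BC[(b, c)], d))


def probability(**evidence):
    """Marginal probability of a partial assignment, e.g. probability(b=True)."""
    total = 0.0
    for a, b, c, d in product([True, False], repeat=4):
        world = {"a": a, "b": b, "c": c, "d": d}
        if all(world[name] == value for name, value in evidence.items()):
            total += joint(a, b, c, d)
    return total


if __name__ == "__main__":
    p_b = probability(b=True)
    p_a_given_b = probability(a=True, b=True) / p_b
    print(f"P(B) = {p_b:.2f}")              # 0.26
    print(f"P(A | B) = {p_a_given_b:.3f}")  # 0.692
```

Observing B raises the probability of A from 0.2 to about 0.69, so in this model B predicts A just as A predicts B. The joint distribution alone does not say which gene regulates which.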

Friedman and Halpern showed that using statistical methods, a Bayes net representing the observations (expression levels of different genes) can be learned automatically. A main advantage of Bayes nets is that they can (potentially) provide insight into the interaction networks within cells that regulate the expression of genes. But one has to exercise caution in interpreting the arcs of a learned Bayes net as representing causality. For example, in Fig. , one might interpret the network to mean that gene A causes gene B and gene C to be expressed, in turn influencing gene D. Note, however, that the Bayes net in this case just denotes correlation and not causality, that is, the direction of an

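Biomedical Informatics. Figure . Why a learned Bayesian network may not be representing regulation of one gene by another. [Figure content not recoverable from the extracted text: several alternative small networks over genes A, B, and C, the last labeled “Or a more complicated variant”.]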

arc merely represents the fact that one variable is a good predictor of the other, as illustrated in Fig. . One possible method of learning causality is to use knock-out methods (Pe’er, Regev, Elidan, & Friedman, ), where for of the genes in S. cerevisiae (baker’s yeast), biologists have created a knock-out mutant, that is, a genetic mutant lacking that gene. If the parent of a gene in the Bayes net is knocked out and the child’s status remains unchanged, then it is unlikely that the arc from the parent to the child captures causality. A key limitation is that the mutants are not available for many organisms. Some other approaches such as RNAi have been proposed for doing knock-outs more efficiently, but a limitation is that RNAi typically reduces rather than eliminates expression of a gene. Ong, Glassner, and Page () used time-series data (data from the same organism at various time points) to partially address the issue of causality. They used these data to learn dynamic Bayesian networks (DBNs) in order to infer temporal direction for gene interactions, thereby getting a potentially better handle on causality. DBNs have been employed by other researchers for time-series gene expression data, and the approach has been extended to learn DBNs with continuous variables (Segal, Pe’er, Regev, Koller, & Friedman, ).

Single Nucleotide Polymorphisms

Single-Nucleotide Polymorphisms (SNPs) are individual base positions (i.e., single-nucleotide positions) in DNA, where people (or the organism of interest) vary. Most of the variation in human DNA is due to SNP variations. (There are other variations, such as copy number, insertions, and deletions, that we do not consider in this article.) There are well over three million known SNPs in humans. Technologies such as Illumina or Affymetrix whole-genome scans can measure a million SNPs in a short time. The measurement of these variations is an order of magnitude faster, easier, and cheaper than sequencing all the genes of the person. It is believed that in the next decade, it will be possible to obtain the entire genome sequence for an individual human for under $, (Mardis, ). If we had every human’s entire sequence, it could be used to predict susceptibility to diseases or adverse reactions to drugs for a certain subset of patients. The idea is illustrated in Fig. . Suppose the red dots in the figure are two copies of nucleotide A, and the green dots denote a different nucleotide, say C. As can be seen from the figure, people who respond to a treatment T (top half of the figure) have two copies of A (for instance, these could be the positive examples), while the people who do not respond to the treatment have at most one copy of A (the negative examples, presented in the bottom half of the figure). Now, we can imagine modeling the sequence to predict the susceptibility to a disease or responsiveness to a treatment. SNP data can serve as a surrogate for the above problem. SNPs allow us to detect the variations among humans. An example of SNP data is presented in Fig.

Biomedical Informatics. Figure . Example application of sequencing human genes. The top half shows patients who are susceptible to disease D or respond to treatment T, and the bottom half shows three patients who are not susceptible or do not respond to the treatment

for the prediction of myeloma, a cancer that is common in older people (with age > ) and is very rare in younger people (age < ). This data set consists of people diagnosed with myeloma at a young age and people who were not diagnosed until they were when the disease is more common. Most SNP positions represent a pair of nucleotides and are typically restricted in the combinations of values they may assume. For example, in the figure, SNP can take values from the three possible combinations < C T, C C, T T > for its two positions. The goal is to use the feature values of the different SNPs to predict the class label, which could be the susceptibility. That is, the goal is to determine the genetic difference between people who got the disease at a young age and people who did not until they were old. There is also the possibility of two patients having the same SNP pattern in the data but not identical DNA. Patients and may have CT for the SNP and GA for SNP, where both SNPs are on chromosome . But Patient has C on SNP in the same copy of chromosome as the G in SNP, whereas Patient has C on the same copy as an A. Hence, while they have the same SNP pattern of CT and GA, they do not have identical DNA. The process of converting unphased data, in the form shown in the figure, into the per-copy form just described is called phasing. From a machine learning perspective, there is a choice between working with the unphased data and using an algorithm for phasing. It turns out that phasing is very difficult and is an active research area. Phasing is especially hard when the patients are unrelated. Hence many machine learning researchers work mainly with unphased data. Admittedly, there is a small loss of information with the unphased data, but avoiding the difficulty of phasing compensates for it. Most biologists and statisticians using SNP data perform genome-wide association studies (GWAS).
The goal in this work is to find individual SNPs that are significantly associated with disease, that is, such that one of the SNP values, or alleles, raises the risk of disease. This is typically measured by “relative risk” or by “odds ratio,” and significance is typically measured by statistical tests such as Wald test, Score test, or LRLR (7logistic regression log likelihood, where each SNP is used individually to predict disease, and log likelihood of the predictive model is compared to guessing under the null hypothesis that the SNP is not associated).
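The two effect-size measures named above can be computed directly from a 2 × 2 table of allele carriers vs. non-carriers and cases vs. controls. The following is a minimal sketch with hypothetical function names and made-up counts; it leaves out the significance tests (Wald, Score, LRLR) mentioned in the text.

```python
def relative_risk(carrier_cases, carrier_controls,
                  noncarrier_cases, noncarrier_controls):
    """Risk of disease among allele carriers divided by risk among non-carriers."""
    risk_carriers = carrier_cases / (carrier_cases + carrier_controls)
    risk_noncarriers = noncarrier_cases / (noncarrier_cases + noncarrier_controls)
    return risk_carriers / risk_noncarriers


def odds_ratio(carrier_cases, carrier_controls,
               noncarrier_cases, noncarrier_controls):
    """Odds of disease among carriers divided by odds among non-carriers."""
    return (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)


if __name__ == "__main__":
    # Made-up counts for a single SNP: 30 of 100 carriers and 10 of 100
    # non-carriers have the disease.
    counts = (30, 70, 10, 90)
    print(f"relative risk = {relative_risk(*counts):.2f}")  # 3.00
    print(f"odds ratio    = {odds_ratio(*counts):.2f}")     # 3.86
```

A value of 1 for either measure corresponds to no association between the allele and the disease.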

Person     SNP 1   SNP 2   SNP 3   ...   Class
Person 1   C T     A G     T T     ...   Old
Person 2   C C     A G     C T     ...   Young
Person 3   T T     A A     C C     ...   Old
Person 4   C T     G G     T T     ...   Young
...        ...     ...     ...     ...   ...

Biomedical Informatics. Figure . Example of SNP data
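The information that unphased data of this kind leaves open can be made explicit by enumerating every pair of chromosome copies (haplotypes) consistent with one person's row. This is only a sketch of the ambiguity, with a hypothetical function name; it is not a phasing algorithm, which would additionally have to choose among these candidates.

```python
from itertools import product


def possible_phasings(genotype):
    """Enumerate haplotype pairs consistent with an unphased genotype.

    genotype: list of unordered nucleotide pairs, one per SNP, e.g. ["CT", "GA"].
    Returns a set of frozensets {haplotype_1, haplotype_2}, where each haplotype
    is the string of nucleotides carried by one copy of the chromosome.
    """
    phasings = set()
    # For every SNP, choose which nucleotide sits on the first chromosome copy.
    for choice in product([0, 1], repeat=len(genotype)):
        first = "".join(pair[c] for pair, c in zip(genotype, choice))
        second = "".join(pair[1 - c] for pair, c in zip(genotype, choice))
        phasings.add(frozenset([first, second]))
    return phasings


if __name__ == "__main__":
    # The CT / GA pattern discussed in the text: two different DNA
    # configurations give rise to the same unphased observation.
    for pair in sorted(sorted(p) for p in possible_phasings(["CT", "GA"])):
        print(pair)
```

With h heterozygous positions (h ≥ 1) there are 2^(h-1) distinct candidates, which hints at why phasing becomes hard as the number of SNPs grows.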

One of many examples is the use of SNPs to predict susceptibility to breast cancer (Easton et al., ). The advantages of SNP data compared to microarray data are the following: (1) Because SNP analysis is typically performed on DNA from saliva or peripheral blood cells, a person’s SNP pattern does not change with time or disease. If the SNPs are collected from a blood sample of a person aged years, the SNP patterns are probably the same as when they were born. This gives more insight into the susceptibility of the person to many diseases. Hence, we do not see the widespread changes in SNP pattern with cancer, for example, that we see in microarray data from tumor samples. (2) It is easier to collect the samples. These can be obtained from blood samples, as opposed to, say, a biopsy of other tissue types. The challenges of SNP data are as follows: (1) As explained earlier, the data is unphased. Algorithms exist for phasing (haplotyping), but they are error prone and do not work well with unrelated patient samples. They require the data to consist of related individuals in order to have a dense coverage. (2) 7Missing Values are more common than in microarray data. The good news is that the amount of missing values is decreasing substantially (down from –% a few years ago to –%). (3) The sheer volume of measurements – currently, it is possible to measure a million SNPs out of over three million SNPs in the human genome. While this provides a tremendous amount of potential information, the resulting high dimensionality causes problems for machine learning. As with gene expression microarray data, we have a multiple comparisons problem, so approaches such as Bonferroni correction or q-values from False Discovery Rate can again be applied. But even when a significant SNP is found, it usually only increases our accuracy at predicting disease by % or % points, because a single SNP typically either has a small effect or small penetrance (the variation is fairly rare – one value of the SNP is strongly predominant). So GWAS are missing a major opportunity to build predictive models by combining multiple SNPs with small effects – this is an exciting opportunity for machine learning. The supervised learning task can be defined as follows:

● Given: A set of SNP profiles, each from a different patient. Phased: Nucleotides at each SNP position on each copy of each chromosome constitute the features, and the patient’s disease susceptibility or drug response constitutes the class. Unphased: The unordered pair of nucleotides at each SNP position constitutes the features, and the patient’s disease susceptibility or drug response constitutes the class.
● Do: Learn a model to predict the class based on the features.
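The two multiple-comparison corrections mentioned in this section can be sketched as follows. One substitution should be noted: the second function implements Benjamini-Hochberg adjusted p-values, which is a simpler relative of the q-value procedure of Storey and Tibshirani (roughly, the q-value computed as if every null hypothesis were true), and not the cited method itself. The p-values in the example are made up.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up p-values for five SNPs
    print([round(v, 5) for v in bonferroni(p)])          # [0.005, 0.04, 0.195, 0.205, 1.0]
    print([round(v, 5) for v in benjamini_hochberg(p)])  # [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

In this toy case the Bonferroni values for the third and fourth tests are roughly four times larger than the false discovery rate values, which illustrates why Bonferroni can be too drastic when very many SNPs or genes are tested.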

We now briefly present one example of supervised learning from SNP data. Waddell, Page, and Shaughnessy () found that there was evidence of a genetic component in predicting the blood cancer multiple myeloma, as it was possible to distinguish the two cases significantly better than chance (% accuracy). The results from using Support Vector Machines (SVMs) are

         Old    Young
Old      31     9
Young    14     26

(Axis label in the figure: Actual)

Biomedical Informatics. Figure . Results on predicting multiple myeloma, young (susceptible) vs. old (less susceptible), , SNPs

presented in Fig. . Similar results were obtained using a Naive Bayes model as well. Listgarten et al. () also used SNP data, with the goal of predicting lung cancer. The accuracy of % obtained by them was remarkably similar to that obtained on the task of predicting multiple myeloma. The best models for predicting lung cancer were also Naive Bayes and SVMs. There is a striking similarity between the two experiments on unrelated tasks using SNPs. When only individual SNPs were considered, the accuracy for both experiments fell to %. The lessons learned from SNP data are the following: (1) 7Supervised learning algorithms such as 7Naive Bayes and 7SVM that can handle a large number of features in the presence of a smaller number of training examples can predict disease susceptibility at rates better than chance and better than individual SNPs. (2) Accuracies are much lower than the ones with microarray data. This is mainly due to the fact that we are predicting the susceptibility to the diseases (or the response to a drug) as opposed to predicting whether a person already has the disease (as with the microarray data). While we are predicting using the genetic component, there are also many environmental components that are responsible for the diseases and the response. We are not considering such components in our model, and hence the accuracies are often not very high. In spite of the relatively lower accuracies, these models give a different and valuable insight into human genes. We now briefly outline a couple of exciting future directions for the use of SNP data. Pharmacogenetics is the problem of predicting drug response from SNP profile and has been gaining momentum over the past few years. This includes predicting drug efficacy and adverse reactions to certain drugs, given a person’s SNP profile. A recent New England Journal of Medicine article showed that the analysis of SNPs can significantly improve the dosing model for the most widely


used orally available blood thinner, Warfarin (IWPC, ). Another exciting direction is the combination of SNP data with other data types such as clinical data that includes the history of the patient and the lab tests and microarray data. The combination of these different data sets will not only improve the accuracy of the learned model but also provide a deeper insight to the different kinds of interactions that occur within a human, such as gene interactions with other drugs. It should be mentioned that other genetic data types are becoming available and may be useful for supervised learning as well. These data types can provide additional information about DNA sequence beyond SNPs but without the expense of full genome sequencing. They include copy-number variations and exon-sequencing.

Mass Spectrometry and Proteomics

Microarrays are useful primarily because mRNA concentrations can serve as surrogates for protein concentrations and they are easier to measure. Though measuring protein concentrations directly is possible, it cannot be done in the same high-throughput manner as measuring mRNA. Recently, techniques such as Mass Spectrometry (MS or mass spec) have been successful in high-throughput measuring of proteins. Mass spec still does not give the complete coverage that microarrays provide, nor as good a quantitation. Mass spectrometry is improving on many fronts, using many technologies. As one example, we present Time-Of-Flight (TOF) Mass Spectrometry, illustrated in Fig. . This measures the time required for an ionized particle starting from the sample plate (bottom of the figure) to hit the detector. The key idea is to place some proteins (indicated as larger circles) into a matrix (smaller circles are the matrix molecules). Because of mass spec limitations, the proteins typically are digested (broken into smaller peptides), for example, by the enzyme trypsin. When struck by a laser, the matrix molecules release protons that attach themselves to the peptides or protein fragments (shown in (a)). Note that the plate where the peptides are present is positively charged. This causes the peptides to migrate toward the detector. As can be seen in (b) of the figure, the molecules with smaller mass move faster toward the detector. The idea is to detect the number of molecules that hit the

Biomedical Informatics. Figure . Time-Of-Flight mass spectrometry. (a) A laser strikes the sample on a plate held at +10 kV; the protons from the matrix molecules get attached to the proteins. (b) Positively charged proteins are repelled towards the detector; smaller mass molecules hit the detector first, while heavier ones are detected later

detector at any given time. This makes it possible to use time as a surrogate for mass of the protein. The experiment is repeated a number of times, counting frequencies of “flight-times.” Plotting time vs. the number of particles hitting the detector yields a spectrum as presented in Fig. . The figure shows three different fractions from the same sample. These kinds of spectra provide us an insight into the different types of proteins in a given sample. A technical detail is that sometimes molecules receive additional charge (additional protons) and hence fly faster. Therefore, the horizontal mass axis in a spectrum is actually a mass/charge ratio. The main issues for machine learning researchers working with mass spectrometry data compared to microarray data are as follows: (1) There is a lot of 7Noise in the data. The noise is due to extra peaks from handling of the sample and from the machine and environment (e.g., electrical noise). Also, the mass/charge values may not exactly align across the spectra; the accuracy of the mass/charge values is the resolution of the mass spec. (2) Intensities (peak heights) are not calibrated across the spectra, making quantification difficult. This is to say that if one spectrum is compared to another, and if one of them has more intensity at a particular mass/charge, it does not necessarily mean that the levels of the peptide at that mass/charge are higher in that spectrum. (3) Another issue is that the mass spectrometry data is not as comprehensive as microarray data, in that it is not possible to measure all peptides (typically only several hundred of them can be obtained). To get the best results, there is a need to fractionate the sample beforehand, getting different groups of proteins in different subsamples (fractions). (4) As already mentioned, the proteins themselves typically must be broken down (digested) into smaller peptides in order to get accurate readings from the mass spec. But this means processing is needed afterward not only to determine from a spectrum which peptides are present but also from that determination which proteins are present. It is worth noting that some of these challenges are being partially addressed by ongoing improvements in mass spectrometry technologies, including the use of “tandem mass spectrometry.” This data type opens up a lot of possibilities for machine learning research. Some of the learning tasks include:

● Learn to predict proteins from spectra, when the organism’s proteome (full set of proteins) is known.
● Learn to identify isotopic distributions (combinations of multiple peaks for a given molecule

Biomedical Informatics. Figure . Example spectra from a competition by Lin et al. [Plot of three spectra (line 1, line 2, line 3); horizontal axis ticks from 0 to 160000, vertical axis ticks from 0 to 7000.]

arising from different isotopes of carbon, nitrogen, and oxygen).
● Learn to predict disease from either proteins, peaks, or isotopic distributions as features.
● Construct pathway models.

We will now present one case study that was successful and generated a lot of interest – Early Detection of Ovarian Cancer (Petricoin et al., ). Ovarian cancer is difficult to detect early, often leading to poor prognosis. The goal of this work was to predict ovarian cancer from blood samples. To this end, the researchers trained and tested on mass spectra from blood serum. They used training cases ( positive) and a held-out test set of cases ( positive). The results were extremely impressive (% sensitivity, % specificity). However, while the machine learning methodology seemed very sound, it turns out that the preprocessing stage of the data may have introduced errors (Baggerly, Morris, & Combes, ). Mass spectrometry is very sensitive to external factors as well. For instance, if we run cancer samples on Monday and normal samples on Wednesday, it is possible that we could get differences

from variations in the machine or nearby electrical equipment that is running on Monday but not Wednesday. Hence, one of the important lessons learned from this data type is the need for careful randomization of the data samples. This is to say that we should sample the positive and negative samples under identical conditions. It should not be the case that the positive examples are run through the machine on one day and the negatives on the other day. Any preprocessing of the data must be performed similarly. While mass spectrometry is a widely used type of high-throughput proteomic data, other types of data are also important and are briefly covered next.
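One simple way to follow this advice is to decide the run schedule by a randomized, class-balanced assignment before any sample touches the machine. The sketch below is one possible scheme under stated assumptions (the function name, the round-robin dealing, and the fixed seed are all illustrative choices), not a procedure taken from the studies discussed above.

```python
import random


def randomized_run_schedule(sample_ids, labels, n_batches, seed=0):
    """Assign samples to batches (e.g., run days) so every batch mixes the classes.

    Samples are shuffled within each class and dealt round-robin to the batches,
    so no batch contains only positives or only negatives as long as each class
    has at least `n_batches` samples. The run order inside each batch is
    shuffled as well.
    """
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    next_batch = 0
    for label in sorted(set(labels)):
        members = [s for s, y in zip(sample_ids, labels) if y == label]
        rng.shuffle(members)
        for sample in members:
            batches[next_batch % n_batches].append(sample)
            next_batch += 1
    for batch in batches:
        rng.shuffle(batch)
    return batches


if __name__ == "__main__":
    ids = [f"s{i}" for i in range(12)]            # hypothetical sample identifiers
    labels = ["cancer"] * 6 + ["normal"] * 6
    for day, batch in enumerate(randomized_run_schedule(ids, labels, n_batches=3)):
        print(f"run day {day}: {batch}")
```

Any preprocessing step that operates per batch can then be applied to every batch in the same way, since each batch contains both classes.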

Protein Structures

X-ray crystallography and nuclear magnetic resonance are widely used to determine the three-dimensional structures of proteins. Predicting protein structures has been a very fertile field for machine learning research for several decades. While the amino acid sequence of a protein is called its primary structure, it is more difficult to determine secondary structure and tertiary (3D) structure. Secondary structure maps subsequences of the primary

B

B

Biomedical Informatics

structure in the three classes of alpha helix (helical structures akin to a telephone cord, often denoted by A), beta strand (which comes together with other strand sections to form planar structures called beta sheets, often denoted by B), and less descript regions referred to as coil, or loop regions, often denoted by C. Predicting secondary structure and tertiary structure has been a popular topic for machine learning for many years, because training data exists yet it is difficult and expensive to experimentally determine structures. We will not attempt to survey all the work in this area. Waltz and colleagues (Zhang, Mesirov, & Waltz, ) showed the benefit of applying neural networks to the task of secondary structure prediction, and the best secondary structure predictors (e.g., Rost & Sander, ) have continued to be constructed by machine learning over the years. Approaches for predicting the tertiary structure have also relied heavily on machine learning and include ab initio prediction (e.g., Bonneau & Baker, ), prediction aided by crystallography data (e.g., DiMaio et al., ), and homology-based prediction (by finding similar proteins). For over a decade, there has been a regular competition in the prediction of protein structures (Critical Assessment of Structure Prediction [CASP]).
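To make the secondary structure task concrete, the sketch below shows one common way to turn it into a per-residue classification problem: each residue is described by a one-hot encoding of the amino acids in a window around it, and labeled A, B, or C. The sliding-window encoding, the window size, and the names `window_features` and `make_examples` are assumptions made for illustration; the text above does not specify how the cited predictors represent their input, and the sequence and labels are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def window_features(sequence, center, half_width=2):
    """One-hot encode the residues in a window around position `center`.

    Positions that fall off either end of the sequence are encoded as all zeros.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        features.extend(one_hot)
    return features

def make_examples(sequence, labels, half_width=2):
    """Pair each residue's window features with its A/B/C class label."""
    assert len(sequence) == len(labels)
    return [(window_features(sequence, i, half_width), labels[i])
            for i in range(len(sequence))]

# Toy (made-up) sequence and labels: A = alpha helix, B = beta strand, C = coil.
examples = make_examples("MKVLAAGIV", "CCAAAABBC")
```

Any standard classifier, such as a neural network, could then be trained on the resulting feature vectors and labels.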

Protein–Protein Interactions
Another proteomics data type is protein–protein interactions. This is illustrated in Fig. . The idea is to identify proteins that interact with the current protein, say P. Generally, this is performed as follows: In the sample, there are some proteins of type X (shown in pink in the figure) and other types of proteins. Proteins that interact with X are bonded to X. Then antibodies (shown as Y-shaped green objects) are introduced in the sample. The role of the antibodies is to collect the proteins of type X. Once the antibodies have collected all of the protein X molecules in the sample, they can be analyzed through mass spectrometry as presented earlier. A particularly high-throughput way of measuring protein–protein interactions is through “ChIP-chip” data. The supervised learning tasks for this data type include:
● Learn to predict protein–protein interactions: Protein three-dimensional structures may be critical.
● Use protein–protein interactions in construction of pathway models.
● Learn to predict protein function from interaction data.

Related Data Types
● Metabolomics measures the concentration of each low-molecular-weight molecule in a sample. These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways. These reactions are typically catalyzed by proteins (specifically, enzymes). This data typically uses mass spectrometry.

Biomedical Informatics. Figure . Schematic of antibody-based identification of protein–protein interactions. (a) The pink objects are protein X and they get attached to other proteins (2 in this figure). The green Y-shaped objects are the antibodies. (b) The antibodies get attached only to protein X and hence collecting the antibodies will result in collecting X’s and the proteins that interact with X.

● ChIP-chip data measures protein–DNA interactions. For example, transcription factors are proteins that interact with DNA in specific locations to alter transcription of a nearby gene.
● Lipomics is analogous to metabolomics, but measures concentrations of lipids rather than metabolites. These potentially help induce biochemical pathway information or help with disease diagnosis or treatment choice.

High-Throughput Screening Data for Drug Design
The typical steps in designing a drug are: (1) Identifying a target protein – for example, while developing an antibiotic, it will be useful to find a protein that belongs to the bacteria that we are interested in and find a small molecule that will bind to that protein. In order to do this, we need knowledge of the proteome/genome and the relevant biological pathways. (2) Determining the target site structure once the protein has been identified – this is typically performed using crystallography. (3) Finding a molecule that will bind to the target site. These steps are presented in Fig. . The molecules that bind to the target may have a number of other problems and hence cannot directly be used as a drug. Some common problems are as follows: (1) They may bind too tightly or not tightly enough. (2) They may be toxic. (3) They may have unanticipated side effects in the body. (4) They may break down as soon as they get into the body or may not leave the body soon enough. (5) They may not get to the right target in the body (e.g., cross the blood–brain barrier). (6) They may not diffuse from gut to bloodstream. Also, since the organisms are different, even if a molecule works in the test tube and in animal studies, it may fail in clinical trials. Also, while a molecule may work for some people, it may not work for others. Conversely, while some molecules may cause harmful side effects in some people, they may not do so in others. Often pharmaceutical companies will use robotic high-throughput screening assays to test many thousands of molecules to see if they bind to the target protein, and then computational chemists will work to determine the commonalities that allow them to bind to the target, as often the structure of the target protein cannot be determined. The process of discovering the commonalities across the different molecules presents a great opportunity for machine learning research. The first study of this task using machine learning was by Dietterich, Lathrop, and Lozano-Perez and led to the formulation of Multiple-Instance Learning. Yet another machine learning task could be to predict the reactions of the patients to the drugs.

High-Throughput Screening: When the target structure is unknown, it is a common practice to test many molecules (,,) to find some that bind to the target. This is called high-throughput screening. Hence, it is important to infer the shape of the target from three-dimensional structural similarities. The shared three-dimensional structure is called a pharmacophore. This is a perfect example of a machine learning task with a spatial target and is presented in Fig. .
Given: A set of molecules, each labeled by activity (binding affinity for a target protein) and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts the activity (may be Boolean or real valued).
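The Given/Do statement above maps naturally onto the multiple-instance setting: each molecule is a "bag" of conformers, only the bag carries a label, and a molecule counts as active if at least one of its conformers binds. The sketch below shows only this bag-level prediction rule for a Boolean activity; it is not the algorithm of Dietterich and colleagues. The feature vectors, the scoring function `toy_score`, and the threshold are hypothetical.

```python
from typing import Callable, List, Sequence

Conformer = Sequence[float]   # hypothetical feature vector for one 3D conformation
Bag = List[Conformer]         # all low-energy conformers of one molecule

def predict_bag(bag: Bag, instance_score: Callable[[Conformer], float],
                threshold: float = 0.5) -> bool:
    """Standard multiple-instance assumption: a molecule is active if
    at least one of its conformers scores above the threshold."""
    return max(instance_score(c) for c in bag) > threshold

def bag_accuracy(bags, labels, instance_score, threshold=0.5):
    """Fraction of molecules whose bag-level prediction matches the label."""
    predictions = [predict_bag(b, instance_score, threshold) for b in bags]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy scorer (an assumption): conformers whose first feature is near 1.0 bind.
def toy_score(conformer):
    return 1.0 - abs(conformer[0] - 1.0)

bags = [
    [[0.1, 0.3], [0.95, 0.2]],   # one conformer fits    -> active
    [[0.2, 0.9], [0.3, 0.1]],    # no conformer fits     -> inactive
]
labels = [True, False]
accuracy = bag_accuracy(bags, labels, toy_score)
```

The learning problem is then to find an instance-level scorer for which this bag-level rule agrees with the observed activities.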

Biomedical Informatics. Figure . Steps involved in drug design (identify target protein; determine target site structure; synthesize a molecule that will bind)

Biomedical Informatics. Figure . An example of structure learning (panel labels: Active, Inactive)


The common machine learning approaches taken toward solving this problem are:
1. Representing a molecule by thousands to millions of features and using standard techniques (KDD, )
2. Representing each low-energy conformer by a feature vector and using multiple-instance learning (Jain et al., )
3. Relational learning – using either Inductive Logic Programming techniques (Finn, Muggleton, Page, & Srinivasan, ) or Graph Mining

Thermolysin Inhibitors: We present some results of relational learning algorithms on a thermolysin inhibitors data set (Davis, a). Thermolysin belongs to the family of metalloproteases and plays roles in physiological processes such as digestion and blood pressure regulation. The molecules in the data set are known inhibitors of thermolysin. Activity for these molecules is measured in pKi = −log Ki, where Ki is a dissociation constant, measuring the ratio of the concentrations of the unbound constituents to that of the bound complex. A higher value of pKi indicates a stronger affinity for binding. The data set that was used had the ten lowest energy conformations (as computed by the SYBYL software package [www.tripos.com]) for each of thermolysin inhibitors along with their activity levels. The key results for this data set using the relational algorithm SAYU (Davis, b) were:
● Ten five-point pharmacophores identified, falling into two groups (/ molecules):
  ● Three “acceptors,” one hydrophobe, and one donor
  ● Four “acceptors,” and one donor
● Common core of Zn ligands, Arg, and Asn interactions identified
● Correct assignments of functional groups
● Correct geometry to Å tolerance
● Increasing tolerance to . Å finds common six-point pharmacophore including one extra interaction

Antibacterial Peptides: This is a data set of pentapeptides showing activity against Pseudomonas aeruginosa (Spatola, Page, Vogel, Blondell, & Crozet, ). There are six active pharmacophores with < µg/ml of IC

Biomedical Informatics. Table Identified Pharmacophore
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that
● M has a hydrophobic group C
● M has a hydrogen acceptor D
● The distance between C and D in conformation B is . Å
● M has a positively charged atom E
● The distance between C and E in conformation B is Å
● The distance between D and E in conformation B is . Å
● M has a positively charged atom F
● The distance between C and F in conformation B is . Å
● The distance between D and F in conformation B is . Å
● The distance between E and F in conformation B is . Å
Tolerance . Å

and five inactives. The pharmacophore that has been identified is presented in Table . Dopamine Agonists: The last data set that we present here consists of dopamine agonists (Martin et al., ). Dopamine works as a neurotransmitter in the brain, where it plays a major role in movement control. Dopamine agonists are molecules that function like dopamine and produce dopamine-like effects, and they can potentially be used to treat diseases such as Parkinson’s disease. The data set had dopamine agonists along with their activity levels. The pharmacophore identified using Inductive Logic Programming is presented in Table .
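The pharmacophore tables in this section all have the same logical form: a molecule is active if some conformation contains the required chemical features at the required pairwise distances, within a tolerance. A minimal sketch of evaluating such a rule follows. The coordinates, target distances, and tolerance are hypothetical placeholders and are not the values learned in the studies; features are simplified to named points, and `math.dist` requires Python 3.8 or later.

```python
import math

def conformer_matches(points, required, tolerance):
    """Check one conformation against a set of pairwise distance constraints.

    points:   dict mapping a feature name to its (x, y, z) position
    required: dict mapping (name1, name2) to a target distance in angstroms
    """
    for (a, b), target in required.items():
        if a not in points or b not in points:
            return False
        if abs(math.dist(points[a], points[b]) - target) > tolerance:
            return False
    return True

def molecule_is_active(conformers, required, tolerance):
    """A molecule is predicted active if any conformation satisfies the rule."""
    return any(conformer_matches(c, required, tolerance) for c in conformers)

# Hypothetical three-point pharmacophore with made-up distances.
required = {("C", "D"): 3.0, ("C", "E"): 4.0, ("D", "E"): 5.0}
good = {"C": (0.0, 0.0, 0.0), "D": (3.0, 0.0, 0.0), "E": (0.0, 4.0, 0.0)}
bad = {"C": (0.0, 0.0, 0.0), "D": (6.0, 0.0, 0.0), "E": (0.0, 4.0, 0.0)}

active = molecule_is_active([bad, good], required, tolerance=0.5)
inactive = molecule_is_active([bad], required, tolerance=0.5)
```

Because only one conformation needs to match, the rule captures the same existential condition ("it has a conformation B such that ...") that appears in the tables.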

Electronic Medical Records (EMR) and Personalized Medicine
Predictive personalized medicine (PPM) is a vision of the future, whose parts are beginning to come into place now. Under this vision, physicians can construct safer and more effective prevention and treatment plans for each patient. This is made possible by predicting the impact of treatments on patients – their effectiveness for different classes of patients, adverse reactions to certain drugs that are prescribed to the patients, and susceptibility of different types of patients to diseases. PPM can become a reality for three reasons: The

Biomedical Informatics. Table Pharmacophore Identified for Dopamine Agonists
Molecule A has the desired activity if
● In conformation B molecule A contains a hydrogen acceptor at C
● In conformation B molecule A contains a basic nitrogen group at D
● The distance between C and D is . ± . Å
● In conformation B molecule A contains a hydrogen acceptor at E
● The distance between C and E is . ± . Å
● The distance between D and E is . ± . Å
● In conformation B molecule A contains a hydrophobic group at F
● The distance between C and F is . ± . Å
● The distance between D and F is . ± . Å
● The distance between E and F is . ± . Å


first is the widespread use by many clinics of Electronic Medical Records (EMR, also called Electronic Health Records, EHR). The second is that whole-genome scan technology makes it possible in one experiment, for well under $,, to measure for one patient a half million to one million SNPs, or individual positions in the DNA where humans vary. The third key reason is the advancement in the past decade of statistical modeling (machine learning) methods that can handle large relational longitudinal databases with a significant amount of noise. The first two reasons make it possible for the clinics to have a relational database of the form presented in Fig. . Given such a database, it is conceivable to use existing machine learning algorithms for achieving the goal of PPM. These algorithms could focus on predicting which patients are at risk (positive and negative examples). Another task is predicting which patients will respond to a specific treatment, using a set of patients who have undergone specific treatments in order to learn predictive models that could be extended to similar patients in the population. Similarly, it is possible to focus on certain drugs and their adverse reactions and use them to predict the adverse reactions of similar drugs that are released in the market. In this work, we focus on the machine learning solutions to predicting adverse drug reactions for different drugs. There are actually at least three different tasks for machine learning in predicting Adverse Drug Events (ADEs).

Biomedical Informatics. Figure . Electronic Health Records (dramatically simplified) – most data currently do not include SNP information but are anticipated in the future. The figure shows example tables, each keyed by Patient ID, with the columns: Gender, Birthdate; Date, Physician, Symptoms, Diagnosis; Date, Lab Test, Result; SNP1, SNP2, …, SNP500K; Date Prescribed, Date Filled, Physician, Medication, Dose, Duration.


Task 1:
Given: Patient data (from claims databases and/or EMRs) and a drug D
Do: Construct a model to predict a minimum efficacious dose of drug D, because a minimum dose is less likely to induce an ADE.
An example of this task is predicting the “stable dose” of the blood-thinner Warfarin (Coumadin) for a patient (McCarty, Wilke, Giampietro, Wesbrook, & Caldwell, ). A stable dose of Warfarin yields the desired degree of anticoagulation, whereas a higher dose can lead to bleeding ADEs; the stable dose for a patient is currently found by trial and error, modifying the dose and measuring the degree of anticoagulation. The cited study shows that a learned dosing model can predict a significantly better starting dose (significantly closer to the final “stable dose”) than the mg/day starting dose currently used in many clinics.

Task 2:
Given: Patient data (from claims databases and/or EMRs), a drug D, and an adverse event E
Do: Construct a model to predict which patients are likely to suffer the adverse event E if they take D.
In this second task, we assume that the association between D and E has already been hypothesized. We seek to construct models that can predict who will suffer a given event if they take the drug. Here, whether the patient will suffer adverse event E is the class variable to be predicted. This task is important for personalized medicine, as accurate models for this task can be used to identify patients who should not be given a particular drug. An earlier study has demonstrated the benefit of a Statistical Relational Learning (SRL) system called SAYU (Davis, b) over standard machine learning approaches with a feature-vector representation of the EHR, for the task of predicting which users of cox inhibitors would have a myocardial infarction (MI).

Task 3:
Given: Patient data (from claims databases and/or EMRs) and a drug D
Do: Determine if evidence exists that associates D with a previously unanticipated adverse event.
This third task is the most challenging because no associated event has been hypothesized. There is a need to identify the response variable to be predicted. In brief, the major approach for this task is to use machine learning “in reverse.” We seek a model that can predict which patients are on drug D using the data after they start the drug (left censored) and also censoring the indications of the drug. If a model can predict (with accuracy better than chance on held-aside data) which patients are taking the drug, there must be some combination of variable settings more common among patients on the drug. Because we have left censored, in theory, this commonality should not consist of common symptoms, but common effects, presumably from the drug. The model can then be examined by the experts to see if it might indicate a possible new adverse event for the drug. The preceding use of machine learning “in reverse” actually can be viewed as Subgroup Discovery (Wrobel, ; Klösgen, ), finding a subgroup of patients on drug D who share some subsequent clinical events. The learned model – say an IF-THEN rule – need not correctly identify everyone on the drug but rather merely a subgroup of those on the drug, while not generating many false positives (individuals not on the drug).

This task poses several different challenges that traditional ML methods will find difficult to handle. First, the data is multi-relational. There are several objects such as doctors, patients, drugs, diseases, and labs that are connected through relations such as visits, prescriptions, diagnoses, etc. If traditional machine learning (ML) techniques are to be employed on this problem, they require flattening the data into a single table. All known flattening techniques, such as computing a join or summary features, result in either (1) changes in frequencies on which machine learning algorithms critically depend or (2) loss of information. They also typically result in loss of some correlations between the objects and explosion in database size. Second, the data is non-i.i.d., as there are relationships between the objects and between different rows within a table.
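The claim that flattening by a join changes frequencies can be seen in a tiny example. The two tables below are invented for illustration and use hypothetical field names; they are not drawn from any clinical data set. Half of the patients are on the drug, but after joining patients to visits the patient with more visits is replicated, so the fraction of rows marked as on the drug drops.

```python
# Two made-up relational tables: one row per patient, one row per visit.
patients = [
    {"id": "P1", "on_drug": True},
    {"id": "P2", "on_drug": False},
]
visits = [
    {"patient": "P1", "diagnosis": "fever"},
    {"patient": "P2", "diagnosis": "fever"},
    {"patient": "P2", "diagnosis": "cough"},
    {"patient": "P2", "diagnosis": "aches"},
]

def fraction_on_drug(rows):
    """Fraction of rows whose on_drug flag is set."""
    return sum(r["on_drug"] for r in rows) / len(rows)

# Flatten by joining every visit to its patient record.
joined = [
    {**p, **v} for p in patients for v in visits if v["patient"] == p["id"]
]

before = fraction_on_drug(patients)   # 1 of 2 patients
after = fraction_on_drug(joined)      # 1 of 4 joined rows
```

A learner that estimates probabilities from row counts would see a class prior of 0.25 in the joined table instead of 0.5, even though no patient changed.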
Third, there are arbitrary numbers of patient visits, diagnoses, and prescriptions for different patients. That is, there is no fixed pattern in the diagnoses and prescriptions of the patients. It is incorrect to assume that the patients are diagnosed a fixed number of times or to assume that only the last diagnosis is relevant. To predict the adverse reactions to a drug, it is important to consider the other drugs that the patient is prescribed or has been prescribed in the past, as well as past diagnoses and laboratory results. To capture these interactions, it is critical to explicitly model time since the interactions are highly temporal. Some drugs taken at the same time can lead to side effects, while in other cases drugs taken one after another cause side effects. It is important to capture such interactions to be able to make useful predictions for physicians and the Food and Drug Administration (FDA). In this work, we focus on this hardest task and present the results on two data sets.

Cox Inhibitors: Recently, a study was performed to see if there were any unanticipated adverse events that occurred when subjects used cox inhibitors (Vioxx, Celebrex, and Bextra). Cox inhibitors are a class of nonsteroidal anti-inflammatory drugs that were used to reduce joint pain. Vioxx, Celebrex, and Bextra were approved for use in the late s and were ranked among the top therapeutic drugs in the USA. Several clinical trials were conducted, and the APPROVe trial (focused on Vioxx outcomes) showed an increase of adverse events from myocardial infarction, stroke, and vascular thrombosis. The manufacturer withdrew Vioxx from the market shortly after the results were published. The other cox inhibitor drugs were discontinued shortly thereafter. This study utilized the Marshfield Clinic’s Personalized Medicine Research Project (PMRP) (McCarty, Wilke, Giampietro, Wesbrook, & Caldwell, ) cohort consisting of approximately , + subjects. The PMRP cohort included adults aged years and older, who reside in the Marshfield Epidemiology Study Area (MESA). Marshfield has one of the oldest internally developed Electronic Medical Records (Cattails MD) in the USA, with coded diagnoses dating back to the early s. Cattails MD has over , users throughout central and northern Wisconsin. Since the data is multi-relational, an Inductive Logic Programming (Muggleton & De Raedt, ) system, Aleph (Srinivasan, ), was used to learn the models. Aleph learns rules in the form of Prolog clauses and scores rules by positive examples covered (P) minus negative examples covered (N). Seventy-five percent of the data was used for training and rule development, while the remaining 25% was used for testing. There were , subjects within the PMRP cohort that had medication records. Within this cohort, almost % of the subjects indicated use of a cox inhibitor, and more specifically, .% indicated the use of Vioxx. Approximately,

Biomedical Informatics. Table Cox Inhibitor Test Data Results (rows: Rule +/−; columns: Actual +/−; Accuracy)

.% of this cohort had an indicated use of clopidogrel bisulfate (Plavix). Aleph generated thousands of rules and selected a subset of the “best” rules based on the scoring algorithm. The authors also developed specific hypotheses to test for known adverse events to validate the approach (indicated by # A). This rule was:
cox(A):- diagnoses(A, _,‘’).
It states that subject A is predicted to be on a cox inhibitor if A has a diagnosis coded as (myocardial infarction). Aleph also provided summary statistics on model performance for identifying subjects on cox inhibitors, as indicated in Table . If we assume that the probability of being on the cox inhibitor is greater than . (the common threshold), then the model has a predictive probability of % to predict cox inhibitor use.

OMOP Challenge: The Observational Medical Outcomes Partnership (OMOP) designed and developed an automated procedure to construct simulated data sets to identify adverse drug events. The simulated data sets are modeled after real observational data sources but are composed of hypothetical persons with fictional drug exposure and health outcomes occurrence. The data sets are constructed such that the relationships between the fictional drugs and fictional outcomes are well characterized as true and false associations. That is, hypothetical persons are created and assigned fictional drug exposure periods and instances of health outcomes based on random sampling from probability distributions that define the relationships between the fictional drugs and outcomes. The relationships created within the simulated data sets are contrived but are representative of the types of relationships observed within real observational data sources. OMOP has made a simulated data set and the simulator itself publicly available as part of the OMOP Cup Data Mining Competition (http://omopcup.orwik.com). Aleph was used to learn rules from a subset of the data (about , patients). Each patient had a record of drugs and diagnoses (conditions) with dates attached. A few examples of the rules learned by Aleph in this data set are:
on_drug(A):- condition_occurrence(B,C,A,D,E,,F,G,H)
on_drug(A):- condition_occurrence(B,C,A,D,E,,F,G,H) condition_occurrence(I,J,A,K,L,,M,N,O)
The first rule identifies drug as interesting, while the second rule identifies two other drugs as interesting when predicting the reaction for person A. With about rules, Aleph was able to achieve a % coverage. The results were compared against a Statistical Relational Learning technique (SRL) (Getoor & Taskar, ) that uses a probability distribution on the rules. The results are presented in Fig. . As expected, with a small number of rules, SRL has a better performance than Aleph, but as the number of rules increases, they converge on the same performance. The leading approaches in the first OMOP Cup include a machine learning approach based on random forests as well as several approaches based on techniques from epidemiology such as disproportionality analysis. At the time of this writing further details, as

Biomedical Informatics. Figure . Results of OMOP data (accuracy plotted against number of rules for Aleph and SRL)

well as plans for future competitions, are available at http://omopcup.orwik.com/. Identifying previously unanticipated ADEs, predicting who is most at risk for an ADE, and predicting safe and efficacious doses of drugs for particular patients are all important needs for society. With the recent advent of “paperless” medical record systems, the pieces are in place for machine learning to help meet these important needs.
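As a small illustration of the rule scoring described for Aleph in the studies above (positive examples covered, P, minus negative examples covered, N), the sketch below scores candidate rules for the "in reverse" task of predicting who is on a drug. Aleph itself is a Prolog system; this Python analogue, the rule names, the diagnosis codes `dx_a`, `dx_b`, `dx_c`, and the patient records are all hypothetical.

```python
def coverage_score(rule, positives, negatives):
    """Score a rule as positives covered (P) minus negatives covered (N)."""
    p = sum(1 for example in positives if rule(example))
    n = sum(1 for example in negatives if rule(example))
    return p - n

def best_rule(rules, positives, negatives):
    """Return the name of the highest-scoring rule."""
    return max(rules, key=lambda name: coverage_score(rules[name], positives, negatives))

# Each patient is represented by the set of (made-up) diagnosis codes recorded
# after the drug start date. Positives are on the drug; negatives are not.
positives = [{"dx_a", "dx_b"}, {"dx_a"}, {"dx_c"}]
negatives = [{"dx_b"}, {"dx_c"}, {"dx_a", "dx_c"}]

rules = {
    "has_dx_a": lambda patient: "dx_a" in patient,   # P = 2, N = 1
    "has_dx_b": lambda patient: "dx_b" in patient,   # P = 1, N = 1
    "has_dx_c": lambda patient: "dx_c" in patient,   # P = 1, N = 2
}

scores = {name: coverage_score(rule, positives, negatives) for name, rule in rules.items()}
winner = best_rule(rules, positives, negatives)
```

A high-scoring rule covers a subgroup of patients on the drug while admitting few patients who are not, which is the subgroup discovery view described earlier; an expert would then inspect the rule body for a possible adverse event.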

Conclusion
In this work, we aim to survey the abundant opportunities in biomedical applications for machine learning researchers by presenting several data types to which machine learning techniques have been applied successfully or are showing tremendous promise. One of the most important developments in biology and medicine over the last few years is the availability of technologies that can produce large volumes of data. This in turn has created the need to process large volumes of data in a reasonable amount of time, presenting the perfect setting for machine learning algorithms to have an impact. We outlined several data types including gene expression microarrays (measuring mRNA), mass spectrometry (measuring proteins), SNP chips (measuring genetic variation), and Electronic Medical/Health Records (EMR/EHRs). The key lessons learned from all these data types are as follows: (1) Even if the number of features is greater than the number of data points (e.g., predicting cancer from microarray data), we can do well provided the features are highly predictive. (2) Careful randomization of data samples is necessary. (3) It is very easy to overfit the data and hence robust techniques such as voted decision stumps, naive Bayes or linear SVMs are in general very useful tools for such data sets. (4) Bayes nets do not give us causality and hence knock-out experiments (active learning) and DBNs with time-series data can help. (5) Multi-relational methods such as SRL and ILP are helpful for predictive personalized medicine due to the relational nature of the data. (6) Mostly, the collaborators are interested in measures other than just accuracy. Comprehensibility, privacy, and ranking are other criteria that are important to biologists. This chapter is necessarily incomplete because so many exciting tasks and data types exist within biology and medicine. While we have touched on many of the leading such data types, other related ones also exist. For example, there are many opportunities in analyzing genomic and protein sequences (Learning Models of Biological Sequences). Other opportunities exist within phylogenetics, for example, see work by Heckerman and colleagues on HIV (Carlson et al., ). New technologies such as optical mapping are constantly being developed and refined (Ananiev et al., ). Machine learning has great potential for developing models for computer-aided diagnosis (CAD), for example, for mammography (Burnside et al., ). Data types such as metabolomics and auxotrophic growth experiments raise opportunities for active learning and for automatic revision of biological network models, for example, as in the Robot Scientist projects (Jones et al., ; Oliver et al., ). Incorporation of multiple data types can further help in mapping out the regulatory entities and networks of an organism (Noto & Craven, ). It is our hope that this article will encourage some machine learning researchers to delve deeper into these and other related opportunities.

Acknowledgment We would like to thank Elizabeth Burnside, Michael Caldwell, Mark Craven, Jesse Davis, Lingjun Li, David Madigan, Sean McIlwain, Michael Molla, Irene Ong, Peggy Peissig, Patrick Ryan, Jude Shavlik, Michael Sussman, Humberto Vidaillet, Michael Waddell and Steve Wesbrook.

Cross References
Learning Models of Biological Sequences

Recommended Reading
Ananiev, G. E., Goldstein, S., Runnheim, R., Forrest, D. K., Zhou, S., Potamousis, K., Churas, C. P., Bergendah, V., Thomson, J. A., & Schwartz, D. C. (). Optical mapping discerns genome wide DNA methylation profiles. BMC Molecular Biology, , doi:./---. Baggerly, K., Morris, J. S., & Combes, K. R. (). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics, , –. Bonneau, R., & Baker, D. (). Ab initio protein structure prediction: Progress and prospects. Annual Review of Biophysics and Biomolecular Structure, , –. Burnside, E. S., Davis, J., Chhatwal, J., Alagoz, O., Lindstrom, M. J., Geller, B. M., Littenberg, B., Kahn, C. E., Shaffer, K., &


Page, D. (). Unique features of hla-mediated hiv evolution in a mexican cohort: A comparative study. Radiology, , –. Carlson, J., Valenzuela-Ponce, H., Blanco-Heredia, J., GarridoRodriguez, D., Garcia-Morales, C., Heckerman, D., et al. (). Unique features of hla-mediated hiv evolution in a mexican cohort: A comparative study. Retrovirology, (), . Davis, J., Costa, V. S., Ray, S., & Page, D. (a). An integrated approach to feature construction and model building for drug activity prediction. In Proceedings of the th international conference on machine learning (ICML). Davis, J., Ong, I., Struyf, J., Burnside, E., Page, D., & Costa, V. S. (b). Change of representation for statistical relational learning. In Proceedings of the th international joint conference on artificial intelligence (IJCAI). DiMaio, F., Kondrashov, D., Bitto, E., Soni, A., Bingman, C., Phillips, G., & Shavlik, J. (). Creating protein models from electron-density maps using particle-filtering methods. Bioinformatics, , –. Easton, D. F., Pooley, K. A., Dunning, A. M., Pharoah, P. D., et al. (). Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, , –. Finn, P., Muggleton, S., Page, D., & Srinivasan, A. (). Discovery of pharmacophores using the inductive logic programming system progol. Machine Learning, (, ), –. Friedman, N. (). Being Bayesian about network structure. In Machine Learning, , –. Friedman, N., & Halpern, J. (). Modeling beliefs in dynamic systems. part ii: Revision and update. Journal of AI Research, , –. Furey, T. S., Cristianini, N., Duffy, N., Bednarski, B. W., Schummer, M., & Haussler, D. (). Support vector classification and validation of cancer tissue samples using microarray expression. Bioinformatics, (), –. Getoor, L., & Taskar, B. (). Introduction to statistical relational learning. Cambridge, MA: MIT Press. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (). 
Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, , –. Hardin, J., Waddell, M., Page, C. D., Zhan, F., Barlogie, B., Shaughnessy, J., et al. (). Evaluation of multiple models to distinguish closely related forms of disease using DNA microarray data: An application to multiple myeloma. Statistical Applications in Genetics and Molecular Biology, (). Jain, A. N., Dietterich, T. G., Lathrop, R. H., Chapman, D., Critchlow, R. E., Bauer, B. E., et al. (). Compass: A shape-based machine learning tool for drug design. Aided Molecular Design, (), –. Jones, K. E., Reiser, F. M., Bryant, P. G. K., Muggleton, C. H., Kell, S., King, D. B., et al. (). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, , –. KDD cup (). http://pages.cs.wisc.edu/ dpage/kddcup/. Klösgen, W. (). Handbook of data mining and knowledge discovery, chapter .: Subgroup discovery. New York: Oxford University Press. Listgarten, J., Damaraju, S., Poulin, B., Cook, L., Dufour, J., Driga, A., et al. (). Predictive models for breast cancer

B

B

Blog Mining

susceptibility from multiple single nucleotide polymorphisms. Clinical Cancer Research, , –. Mardis, E. R. (). Anticipating the , dollar genome. Genome Biology, (), . Martin, Y. C., Bures, M. G., Danaher, E. A., DeLazzer, J., Lico, I. I., & Pavlik, P. A. (). A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists. Journal of Computer Aided Molecular Design, , –. McCarty, C., Wilke, R. A., Giampietro, P. F., Wesbrook, S. D., & Caldwell, M. D. (). Personalized Medicine Research Project (PMRP): Design, methods and recruitment for a large population-based biobank. Personalized Medicine, , –. Molla, M., Waddell, M., Page, D., & Shavlik, J. (). Using machine learning to design and interpret gene expression microarrays. AI Magazine, (), –. Muggleton, S., & De Raedt, L. (). Inductive logic programming: Theory and methods. Journal of Logic Programming, (), –. Noto, K., & Craven, M. (). A specialized learner for inferring structured cis-regulatory modules. BMC Bioinformatics, (), doi:./---. Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., et al. (). The automation of science. Science, , –. Ong, I., Glassner, J., & Page, D. (). Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, , S–S. Pe’er, D., Regev, A., Elidan, G., & Friedman, N. (). Inferring subnetworks from perturbed expression profiles. Bioinformatics, , –. Perou, C., Jeffrey, S., Van De Rijn, M., Rees, C. A., Eisen, M. B., Ross, D. T., et al. (). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Sciences, , –. Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., et al. (). Use of proteomic patterns in serum to identify ovarian cancer. Lancet, , –. Rost, B., & Sander, C. (). Prediction of protein secondary structure at better than accuracy. Journal of Molecular Biology, , –.
Segal, E., Pe’er, D., Regev, A., Koller, D., & Friedman, N. (April ). Learning module networks. Journal of Machine Learning Research, , –. Spatola, A., Page, D., Vogel, D., Blondell, S., & Crozet, Y. (). Can machine learning and combinatorial chemistry co-exist? In Proceedings of the American Peptide Symposium. Kluwer Academic Publishers. Srinivasan, A. (). The aleph manual. http://web.comlab.ox. ac.uk/oucl/research/areas/machlearn/Aleph/. Storey, J. D., & Tibshirani, R. (). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, , –. The International Warfarin Pharmacogenetics Consortium (IWPC) (). Estimation of the Warfarin Dose with Clinical and Pharmacogenetic Data. The New England Journal of Medicine, :–. Tucker, A., Vinciotti, V., Hoen, P. A. C., Liu, X., & Famili, A. F. (). Bayesian network classifiers for time-series microarray data. Advances in Intelligent Data Analysis VI, , –.

Van’t Veer, L. L., Dai, H., van de Vijver, M. M., He, Y., Hart, A., Mao, M., et al. (). Gene expression profiling predicts clinical outcome of breast cancer. Nature, , –. Waddell, M., Page, D., & Shaughnessy, J., Jr. (). Predicting cancer susceptibility from single-nucleotide polymorphism data: A case study in multiple myeloma. BIOKDD’: Proceedings of the fifth international workshop on bioinformatics, Chicago, IL. Wrobel, S. (). An algorithm for multi-relational discovery of subgroups. In European symposium on principles of kdd (pp. –). Lecture notes in computer science, Springer, Norway. Zhang, X., Mesirov, J. P., & Waltz, D. L. (). Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, , –. Zou, M., & Conzen, S. D. (). A new dynamic Bayesian network approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, , –.

Blog Mining Blog mining is the application of data mining (in particular, Web mining) techniques to blogs, adapted to the content, format, and language of the blog medium. A blog is a (more or less) frequently updated publication on the Web, sorted in (usually reverse) chronological order of the constituent blog posts. As in other areas of the Web, mining is applied to the content of blogs, to the various types of links between blogs, and to blog-related behavior. The latter comprises blog authoring including link setting, blog reading and commenting, and querying (often in blog search engines). For more details on blogs and on mining them, see ►Text Mining for News and Blogs Analysis.

Boltzmann Machines Geoffrey Hinton University of Toronto, ON, Canada

Synonyms Boltzmann machines

Definition A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm (Hinton &
Sejnowski, ) that allows them to discover interesting features that represent complex regularities in the training data. The learning algorithm is very slow in networks with many layers of feature detectors, but it is fast in “restricted Boltzmann machines” that have a single layer of feature detectors. Many hidden layers can be learned efficiently by composing restricted Boltzmann machines, using the feature activations of one as the training data for the next. Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent a cost function. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that have low values of the cost function. For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must learn to generate these vectors with high probability. To do this, it must find weights on the connections so that relative to other possible binary vectors, the data vectors have low values of the cost function. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems.

Motivation and Background The brain is very good at settling on a sensible interpretation of its sensory input within a few hundred milliseconds, and it is also very good, over a much longer timescale, at learning the code that is used to express its interpretations. It achieves both the settling and the learning using spiking neurons which, over a period of a few milliseconds, have a state of 1 or 0. These neurons have intrinsic noise caused by the quantal release of vesicles of neurotransmitter at the synapses between the neurons. Boltzmann machines were designed to model both the settling and the learning, and were based on two seminal ideas that appeared in . Hopfield () showed that a neural network composed of binary units would settle to a minimum of a simple, quadratic energy function provided that the units were updated asynchronously and the pairwise connections between units were symmetrically weighted. Kirkpatrick et al. () showed that systems that were settling to energy minima could find deeper minima if noise was added to
the update rule so that the system could occasionally increase its energy to escape from poor local minima. Adding noise to a Hopfield net allows it to find deeper minima that represent more probable interpretations of the sensory data. More significantly, by using the right kind of noise, it is possible to make the log probability of finding the system in a particular global configuration be a linear function of its energy. This makes it possible to manipulate log probabilities by manipulating energies, and since energies are simple local functions of the connection weights, this leads to a simple, local learning rule.

Structure of Learning System The learning procedure for updating the connection weights of a Boltzmann machine is very simple, but to understand why it works it is first necessary to understand how a Boltzmann machine models a probability distribution over a set of binary vectors and how it samples from this distribution.

The Stochastic Dynamics of a Boltzmann Machine

When unit i is given the opportunity to update its binary state, it first computes its total input, $x_i$, which is the sum of its own bias, $b_i$, and the weights on connections coming from other active units:

$$x_i = b_i + \sum_j s_j w_{ij}$$

where $w_{ij}$ is the weight on the connection between i and j, and $s_j$ is 1 if unit j is on and 0 otherwise. Unit i then turns on with a probability given by the logistic function:

$$\mathrm{prob}(s_i = 1) = \frac{1}{1 + e^{-x_i}}$$
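The update rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-unit network, its weights and biases, and the function names `total_input` and `update_unit` are hypothetical and do not come from the original entry.

```python
import math
import random


def total_input(i, state, weights, biases):
    """Total input to unit i: its bias plus the weights from active units."""
    return biases[i] + sum(
        state[j] * weights[i][j] for j in range(len(state)) if j != i
    )


def update_unit(i, state, weights, biases, rng):
    """Resample unit i in place; it turns on with logistic probability."""
    x = total_input(i, state, weights, biases)
    p_on = 1.0 / (1.0 + math.exp(-x))
    state[i] = 1 if rng.random() < p_on else 0
    return p_on


# Hypothetical network: symmetric weights, no self-connections.
weights = [[0.0, 1.0, -2.0],
           [1.0, 0.0, 0.5],
           [-2.0, 0.5, 0.0]]
biases = [0.1, -0.3, 0.2]
state = [1, 0, 1]
rng = random.Random(0)

# Update the units sequentially, in an order that ignores their inputs.
for sweep in range(100):
    for i in range(len(state)):
        update_unit(i, state, weights, biases, rng)
```

The fixed sweep order is one simple choice that does not depend on the units' total inputs; a random order would serve equally well in this sketch.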

If the units are updated sequentially in any order that does not depend on their total inputs, the network will eventually reach a Boltzmann distribution (also called its equilibrium or stationary distribution) in which the probability of a state vector, v, is determined solely by the “energy” of that state vector relative to the energies of all possible binary state vectors:

$$P(v) = \frac{e^{-E(v)}}{\sum_u e^{-E(u)}}$$


As in Hopfield nets, the energy of state vector v is defined as

$$E(v) = -\sum_i s_i^v b_i - \sum_{i<j} s_i^v s_j^v w_{ij}$$

where $s_i^v$ is the binary state assigned to unit i by state vector v.
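For a network small enough to enumerate, the energy and the equilibrium distribution defined above can be computed exactly. The sketch below assumes a hypothetical three-unit network; the names `energy` and `boltzmann_distribution` and all numerical values are chosen for illustration only.

```python
import itertools
import math


def energy(v, weights, biases):
    """Energy of binary state vector v: bias terms plus pairwise terms (i < j)."""
    n = len(v)
    bias_term = sum(v[i] * biases[i] for i in range(n))
    pair_term = sum(
        v[i] * v[j] * weights[i][j] for i in range(n) for j in range(i + 1, n)
    )
    return -bias_term - pair_term


def boltzmann_distribution(weights, biases):
    """Exact P(v) for every binary state vector, by brute-force enumeration."""
    states = list(itertools.product([0, 1], repeat=len(biases)))
    unnormalized = [math.exp(-energy(v, weights, biases)) for v in states]
    partition = sum(unnormalized)
    return {v: u / partition for v, u in zip(states, unnormalized)}


# Hypothetical network: symmetric weights, no self-connections.
weights = [[0.0, 1.0, -2.0],
           [1.0, 0.0, 0.5],
           [-2.0, 0.5, 0.0]]
biases = [0.1, -0.3, 0.2]

distribution = boltzmann_distribution(weights, biases)
most_probable = max(distribution, key=distribution.get)
print(most_probable, distribution[most_probable])
```

Enumeration costs time exponential in the number of units, so it is useful only as a check on very small networks; larger networks have to be sampled with the stochastic updates instead.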

[…] FPcost. Thus, given the values of FNcost and FPcost, a variety of cost-sensitive meta-learning methods can be, and have been, used to solve the class imbalance problem (Japkowicz & Stephen, ; Ling & Li, ). If the values of FNcost and FPcost are not known explicitly, FNcost and FPcost can be assigned to be proportional to the number of positive and negative training cases (Japkowicz & Stephen, ). If the class distributions of the training and test datasets differ (e.g., if the training data is highly imbalanced but the test data is more balanced), an obvious approach is to sample the training data such that its class distribution is the same as that of the test data. This can be achieved by oversampling (creating multiple copies of examples of) the minority class and/or undersampling (selecting a subset of) the majority class (Provost, ). Note that sometimes the number of examples of the minority class is too small for classifiers to learn adequately. This is the problem of insufficient (small) training data and is different from that of imbalanced datasets.
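The two resampling strategies just mentioned can be illustrated with a short sketch. It assumes that each training case is a `(features, label)` pair and that the desired minority-class fraction is known; the function names and the toy dataset are hypothetical.

```python
import math
import random


def split_by_class(data, minority_label):
    """Separate (features, label) pairs into minority and majority cases."""
    minority = [d for d in data if d[1] == minority_label]
    majority = [d for d in data if d[1] != minority_label]
    return minority, majority


def oversample_minority(data, minority_label, target_fraction, rng):
    """Add random copies of minority cases until they reach target_fraction."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must lie strictly between 0 and 1")
    minority, majority = split_by_class(data, minority_label)
    if not minority:
        raise ValueError("no minority examples to copy")
    needed = math.ceil(target_fraction * len(majority) / (1.0 - target_fraction))
    copies = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    return majority + minority + copies


def undersample_majority(data, minority_label, target_fraction, rng):
    """Keep a random subset of majority cases so the minority reaches target_fraction."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must lie strictly between 0 and 1")
    minority, majority = split_by_class(data, minority_label)
    keep = math.floor(len(minority) * (1.0 - target_fraction) / target_fraction)
    return minority + rng.sample(majority, min(len(majority), keep))


# Toy dataset: 5 positive and 95 negative cases, rebalanced to one half each.
rng = random.Random(0)
data = [((i,), "+") for i in range(5)] + [((i,), "-") for i in range(95)]
balanced_up = oversample_minority(data, "+", 0.5, rng)
balanced_down = undersample_majority(data, "+", 0.5, rng)
print(len(balanced_up), len(balanced_down))
```

Oversampling keeps every original case at the cost of duplicates, while undersampling discards majority cases; with very few minority examples, neither addresses the separate problem of insufficient training data noted above.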

Recommended Reading Drummond, C., & Holte, R. (). Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the seventeenth international conference on machine learning (pp. –). Drummond, C., & Holte, R. (). Severe class imbalance: Why better algorithms aren’t the answer. In Proceedings of the sixteenth European conference of machine learning, LNAI (Vol. , pp. –). Japkowicz, N., & Stephen, S. (). The class imbalance problem: A systematic study. Intelligent Data Analysis, (), –. Ling, C. X., & Li, C. (). Data mining for direct marketing – Specific problems and solutions. In Proceedings of fourth international conference on Knowledge Discovery and Data Mining (KDD-) (pp. –). Provost, F. (). Machine learning from imbalanced data sets . In Proceedings of the AAAI’ workshop on imbalanced data.

Classification Chris Drummond National Research Council of Canada

Synonyms Categorization; Generalization; Identification; Induction; Recognition

Definition In common usage, the word classification means to put things into categories, group them together in some useful way. If we are screening for a disease, we would group people into those with the disease and those without. We, as humans, usually do this because things in a group, called a ►class in machine learning, share common characteristics. If we know the class of something, we know a lot about it. In machine learning, the term classification is most commonly associated with a particular type of learning where examples of one or more ►classes, labeled with the name of the class, are given to the learning algorithm. The algorithm produces a classifier which maps the properties of these examples, normally expressed as ►attribute-value pairs, to the class labels. A new example whose class is unknown is classified when it is given a class label by the classifier based on its properties. In machine learning, we use the word classification because we call the grouping of things a class. We should note, however, that other fields use different terms. In philosophy and statistics, the term categorization is more commonly used. In many areas, in fact, classification often refers to what is called ►clustering in machine learning.

Motivation and Background Classification is a common, and important, human activity. Knowing something’s class allows us to predict many of its properties and so act appropriately. Telling other people its class allows them to do the same, making for efficient communication. This emphasizes two commonly held views of the objectives of learning. First, it is a means of ►generalization, to predict accurately the values for previously unseen examples. Second, it is a means of compression, to make transmission or communication more efficient.

Classification is certainly not a new idea and has been studied for some considerable time. From the days of the early Greek philosophers such as Socrates, we have had the idea of categorization. There are essential properties of things that make them what they are. It embodies the idea that there are natural kinds, ways of grouping things, that are inherent in the world. A major goal of learning, therefore, is recognizing natural kinds, establishing the necessary and sufficient conditions for belonging to a category. This “classical” view of categorization, most often attributed to Aristotle, is now strongly disputed. The main competitor is prototype theory; things are categorized by their similarity to a prototypical example (Lakoff, ), either real or imagined. There is also much debate in psychology (Ashby & Maddox, ), where many argue that there is no single method of categorization used by humans.

As much of the inspiration for machine learning originated in how humans learn, it is unsurprising that our algorithms reflect these distinctions. ►Nearest neighbor algorithms would seem to have much in common with prototype theory. These have been part of pattern recognition for some time (Cover & Hart, ) and have become popular in machine learning, more recently, as ►instance-based learners (Aha, Kibler, & Albert, ). In machine learning, we measure the distance to one or more members of a concept rather than to a specially constructed prototype. So, this type of learning is perhaps more a case of the exemplar learning found in the psychological literature, where multiple examples represent a category. The closest we have to prototype learning occurs in clustering, a type of ►unsupervised learning, rather than classification. For example, in ►k-means clustering group membership is determined by closeness to a central value.

In the early days of machine learning, our algorithms (Mitchell, ; Winston, ) had much in common with the classical theory of categorization in philosophy and psychology. It was assumed that the data were consistent: there were no examples with the same attribute values but belonging to different classes. It was quickly realized that, even if the properties were necessary and sufficient to capture the class, there was often noise in the attribute and perhaps the class values. So, complete consistency was seldom attainable in practice. New ►classification algorithms were designed, which could tolerate some noise, such as ►decision trees (Breiman, Friedman, Olshen, & Stone, ; Quinlan, , ) and rule-based learners (see ►Rule Learning) (Clark & Niblett, ; Holte, ; Michalski, ).

Structure of the Learning System Whether one uses instance-based learning, rule-based learning, decision trees, or indeed any other classification algorithm, the end result is the division of the input space into regions belonging to a single class. The input space is defined by the Cartesian product of the attributes, all possible combinations of possible values. As a simple example, Fig. shows two classes + and −, each a random sample of a normal distribution. The attributes are X and Y of real type. The values for each attribute range from −∞ to +∞. The figure shows a couple of alternative ways that the space may be divided into regions. The bold dark lines construct regions using lines that are parallel to the axes. New examples that have Y less than and X less than . will be classified as +, all others classified as −. Decision trees and rules form this type of boundary. A ►linear discriminant function, such as the bold dashed line, would divide the space into half-spaces, with new examples below the line being classified as + and those above as −. Instance-based learning will also divide the space into regions but the boundary is implicit. Classification occurs by choosing the class of the majority of the nearest neighbors to a new example. To make the boundary explicit, we could mark the regions where an example would be classified as + and those classified as −. We would end up with regions bounded by polygons.

What differs among the algorithms is the shape of the regions, and how and when they are chosen. Sometimes the regions are implicit as in lazy learners (see ►Lazy Learning) (Aha, ), where the boundaries are not decided until a new example is being classified. Sometimes the regions are determined by decision theory as in generative classifiers (see ►Generative Learners) (Rubinstein & Hastie, ), which model the full joint distribution of the classes. For all classifiers though, the input space is effectively partitioned into regions representing a single class.

Classification. Figure . Dividing the input space (scatter plot of the + and − examples; X on the horizontal axis, Y on the vertical axis, both running from −4 to 4)
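The three kinds of region described in this section can be contrasted in a short sketch. Everything numerical here is an assumption: the thresholds, the line coefficients, the value of k, and the toy training set are hypothetical and are not the values used in the figure.

```python
from collections import Counter


def axis_parallel_rule(x, y, x_max=1.5, y_max=1.0):
    """Decision-tree or rule style: a box bounded by axis-parallel lines."""
    return "+" if (x < x_max and y < y_max) else "-"


def linear_discriminant(x, y, w=(1.0, 1.0), b=-1.0):
    """Half-space style: '+' below the line w[0]*x + w[1]*y + b = 0."""
    return "+" if w[0] * x + w[1] * y + b < 0 else "-"


def nearest_neighbors(x, y, training, k=3):
    """Instance-based style: majority class among the k closest examples."""
    by_distance = sorted(
        training, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training examples as (x, y, class) triples.
training = [(-1.0, -1.0, "+"), (-1.2, -0.8, "+"), (-0.5, -1.5, "+"),
            (1.0, 1.0, "-"), (1.5, 1.2, "-"), (2.0, 0.5, "-")]

for point in [(-1.0, -1.0), (1.4, 1.0)]:
    print(point,
          axis_parallel_rule(*point),
          linear_discriminant(*point),
          nearest_neighbors(*point, training))
```

The first two functions fix their boundaries in advance, whereas `nearest_neighbors` decides only when a query arrives, which is the lazy behavior described above.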

Applications One of the reasons that classification is an important part of machine learning is that it has proved to be a very useful technique for solving practical problems. Classification has been used to help scientists in the exploration, and comprehension, of their particular domains of interest. It has also been used to help solve significant industrial problems. Over the years a number of authors have stressed the importance of applications to machine learning and listed many successful examples (Brachman, Khabaza, Kloesgen, Piatetsky-Shapiro, & Simoudis, ; Langley & Simon, ; Michie, ). There have also been workshops on applications (Aha & Riddle, ; Engels, Evans, Herrmann, & Verdenius, ; Kodratoff, ) at major machine learning conferences and a special issue of Machine Learning (Kohavi & Provost, ), one of the main journals in the field. There are now conferences that are highly focused on applications. Collocated with major artificial intelligence conferences is the Innovative Applications of Artificial Intelligence conference. Since , this conference has highlighted practical applications of machine learning, including classification (Schorr & Rappaport, ). In addition, there are now at least two major knowledge discovery and ►data mining conferences (Fayyad & Uthurusamy, ; Komorowski & Zytkow, ) with a strong focus on applications.

Future Directions In machine learning, there are already a large number of different classification algorithms, yet new ones still appear. It seems unlikely that there is an end in sight. The “no free lunch theorem” (Wolpert & Macready, ) indicates that there will never be a single best algorithm, better than all others in terms of predictive power. However, apart from their predictive performance, each classifier has its own attractive properties which are important to different groups of people. So, new algorithms are still of value. Further, even if we are solely concerned about performance, it may be useful to have many different algorithms, all with their own biases (see ►Inductive Bias). They may be combined together to form an ensemble classifier (Caruana, Niculescu-Mizil, Crew, & Ksikes, ), which outperforms single classifiers of one type (see ►Ensemble Learning).

Limitations Classification has been a critical part of machine learning research for some time. There is a concern that the emphasis on classification, and more generally on ►supervised learning, is too strong. Certainly much of human learning does not use, or require, labels supplied by an expert. Arguably, unsupervised learning should play a more central role in machine learning research. Although classification does require a label, it does not necessarily need an expert to provide labeled examples. Many successful applications rely on finding some easily identifiable property which stands in for the class.

Recommended Reading Aha, D. W. (). Editorial. Artificial Intelligence Review, (–), –. Aha, D. W., Kibler, D., & Albert, M. K. (). Instance-based learning algorithms. Machine Learning, (), –. Aha, D. W., & Riddle, P. J. (Eds.). (). Workshop on applying machine learning in practice. In Proceedings of the th international conference on machine learning. Ashby, F. G., & Maddox, W. T. (). Human category learning. Annual Review of Psychology, , –. Bishop, C. M. (). Pattern recognition and machine learning. New York: Springer. Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., & Simoudis, E. (). Mining business databases. Communications of the ACM, (), –. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (). Classification and regression trees. Belmont, CA: Wadsworth. Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (). Ensemble selection from libraries of models. In Proceedings of the st international conference on machine learning (pp. –). Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, , –. Cover, T., & Hart, P. (). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, , –. Dietterich, T., & Shavlik, J. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann. Engels, R., Evans, B., Herrmann, J., & Verdenius, F. (Eds.). (). Workshop on machine learning applications in the real world;


methodological aspects and implications. In Proceedings of the th international conference on machine learning. Fayyad, U. M., & Uthurusamy, R. (Eds.). (). Proceedings of the first international conference on knowledge discovery and data mining. Holte, R. C. (). Very simple classification rules perform well on most commonly used datasets. Machine Learning, (), –. Kodratoff, Y. (Ed.). (). Proceedings of MLNet workshop on industrial application of machine learning. Kodratoff, Y., & Michalski, R. S. (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Kohavi, R., & Provost, F. (). Glossary of terms. Editorial for the special issue on applications of machine learning and the knowledge discovery process. Machine Learning, (/). Komorowski, H. J., & Zytkow, J. M. (Eds.). (). Proceedings of the first European conference on principles of data mining and knowledge discovery. Lakoff, G. (). Women, fire and dangerous things. Chicago, IL: University of Chicago Press. Langley, P., & Simon, H. A. (). Applications of machine learning and rule induction. Communications of the ACM, (), –. Michalski, R. S. (). A theory and methodology of inductive learning. In R. S. Michalski, T. J. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. –). Palo Alto, CA: TIOGA Publishing. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga Publishing Company. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Michie, D. (). Machine intelligence and related topics. New York: Gordon and Breach Science Publishers. Mitchell, T. M. (). Version spaces: A candidate elimination approach to rule learning. In Proceedings of the fifth international joint conferences on artificial intelligence (pp. –). Mitchell, T. M. (). 
Machine learning. Boston, MA: McGraw-Hill. Quinlan, J. R. (). Induction of decision trees. Machine Learning, , –. Quinlan, J. R. (). C. programs for machine learning. San Mateo, CA: Morgan Kaufmann. Rubinstein, Y. D., & Hastie, T. (). Discriminative vs informative learning. In Proceedings of the third international conference on knowledge discovery and data mining (pp. –). Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice-Hall. Schorr, H., & Rappaport, A. (Eds.). (). Proceedings of the first conference on innovative applications of artificial intelligence. Winston, P. H. (). Learning structural descriptions from examples. In P. H. Winston (Ed.), The psychology of computer vision (pp. –). New York: McGraw-Hill. Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann. Wolpert, D. H., & Macready, W. G. (). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, (), –.


Classification Algorithms There is a very large number of classification algorithms, including ►decision trees, ►instance-based learners, ►support vector machines, ►rule-based learners, ►neural networks, and ►Bayesian networks. There are also ways of combining them into ►ensemble classifiers such as ►boosting, ►bagging, ►stacking, and ►forests of trees. To delve deeper into classifiers and their role in machine learning, a number of books are recommended covering machine learning (Bishop, ; Mitchell, ; Witten & Frank, ) and artificial intelligence (Russell & Norvig, ) in general. Seminal papers on classifiers can be found in collections of papers on machine learning (Dietterich & Shavlik, ; Kodratoff & Michalski, ; Michalski, Carbonell, & Mitchell, , ).

Recommended Reading Bishop, C. M. (). Pattern recognition and machine learning. New York: Springer. Dietterich, T., & Shavlik, J. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann. Kodratoff, Y., & Michalski, R. S. (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga Publishing Company. Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (). Machine learning: An artificial intelligence approach, (Vol. ). San Mateo, CA: Morgan Kaufmann. Mitchell, T. M. (). Machine learning. Boston, MA: McGraw-Hill. Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice-Hall. Witten, I. H., & Frank, E. (). Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann.

Classification Learning ►Concept Learning

Classification Tree ►Decision Tree


Classifier Systems Pier Luca Lanzi Politecnico di Milano, Milano, Italy

Synonyms Genetics-based machine learning; Learning classifier systems

Definition Classifier systems are rule-based systems that combine ►temporal difference learning or ►supervised learning with a genetic algorithm to solve classification and ►reinforcement learning problems. Classifier systems come in two flavors: Michigan classifier systems, which are designed for online learning, but can also tackle offline problems; and Pittsburgh classifier systems, which can only be applied to offline learning.

In Michigan classifier systems (Holland, ), learning is viewed as an online adaptation process to an unknown environment that represents the problem and provides feedback in terms of a numerical reward. Michigan classifier systems maintain a single candidate solution consisting of a set of rules, or a population of classifiers. Michigan systems apply (1) temporal difference learning to distribute the incoming reward to the classifiers that are accountable for it; and (2) a genetic algorithm to select, recombine, and mutate individual classifiers so as to improve their contribution to the current solution.

In contrast, in Pittsburgh classifier systems (Smith, ), learning is viewed as an offline optimization process in which a genetic algorithm alone is applied to search for the best solution to a given problem. In addition, Pittsburgh classifier systems maintain not one, but a set of candidate solutions. While in the Michigan classifier system each individual classifier represents a part of the overall solution, in the Pittsburgh system each individual is a complete candidate solution (itself consisting of a set of classifiers). The fitness of each Pittsburgh individual is computed offline by testing it on a representative sample of problem instances. The individuals compete among themselves through selection, while crossover and mutation recombine solutions to search for better solutions.

Motivation and Background Machine learning is usually viewed as a search process in which a solution space is explored until an appropriate solution to the target problem is found (Mitchell, ) (see ►Learning as Search). Machine learning methods are characterized by the way they represent solutions (e.g., using ►decision trees, rules), by the way they evaluate solutions (e.g., classification accuracy, information gain) and by the way they explore the solution space (e.g., using a ►general-to-specific strategy or a ►specific-to-general strategy).

Classifier systems are methods of genetics-based machine learning introduced by Holland, the father of ►genetic algorithms. They made their first appearance in Holland () where the first diagram of a classifier system, labeled “cognitive system,” was shown. Subsequently, they were described in detail in the paper “Cognitive Systems based on Adaptive Algorithms” (Holland and Reitman, ). Classifier systems are characterized by a rule-based representation of solutions and a genetics-based exploration of the solution space. While other ►rule learning methods, such as CN (Clark & Niblett, ) and FOIL (Quinlan & Cameron-Jones, ), generate one rule at a time following a sequential covering strategy (see ►Covering Algorithm), classifier systems work on one or more solutions at once, and they explore the solution space by applying the principles of natural selection and genetics.

In classifier systems (Holland, ; Holland and Reitman, ; Wilson, ), machine learning is modeled as an online adaptation process to an unknown environment, which provides feedback in terms of a numerical reward. A classifier system perceives the environment through its detectors and, based on its sensations, it selects an action to be performed in the environment through its effectors. Depending on the efficacy of its actions, the environment may eventually reward the system.
A classifier system learns by trying to maximize the amount of reward it receives from the environment. To pursue such a goal, it maintains a set (a population) of condition-action-prediction rules, called classifiers, which represents the current solution. Each classifier’s condition identifies some part of the problem domain; the classifier’s action represents a decision on the subproblem identified by its condition; and the classifier’s prediction, or strength, estimates the value of the action in terms of future rewards on that subproblem. Two separate components, credit assignment and rule discovery, act on the population with different goals. ►Credit assignment, implemented either by methods of temporal difference or supervised learning, exploits the incoming reward to estimate the action values in each subproblem so as to identify the best classifiers in the population. At the same time, rule discovery, usually implemented by a genetic algorithm, selects, recombines, and mutates the classifiers in the population to improve the current solution.

Classifier systems were initially conceived as modeling tools. Given a real system with unknown underlying dynamics, for instance a financial market, a classifier system would be used to generate a behavior that matched the real system. The evolved rules would provide a plausible, human readable model of the unknown system – a way to look inside the box. Subsequently, with the developments in the area of machine learning and the rise of reinforcement learning, classifier systems have been more and more often studied and presented as alternatives to other machine learning methods. Wilson’s XCS (), the most successful classifier system to date, has proven to be both a valid alternative to other reinforcement learning approaches and an effective approach to classification and data mining (Bull, ; Bull & Kovacs, ; Lanzi, Stolzmann, & Wilson, ).

Kenneth de Jong and his students (de Jong, ; Smith, , ) took a different perspective on genetics-based machine learning and modeled learning as an optimization process rather than an adaptation process as done in Holland (). In this case, the solution space is explored by applying a genetic algorithm to a population of individuals each representing a complete candidate solution – that is, a set of rules (or a production system, de Jong, ; Smith, ).
At each cycle, a critic is applied to each individual (to each set of rules) to obtain a performance measure that is then used by the genetic algorithm to guide the exploration of the solution space. The individuals in the population compete among themselves through selection, while crossover and mutation recombine solutions to search for better ones. The approaches of Holland (Holland, ; Holland and Reitman, ) and de Jong (de Jong, ; Smith, , ) have been extended and improved
in several ways (see Lanzi et al. () for a review). The models of classifier systems that are inspired by the work of Holland () at the University of Michigan are usually called Michigan classifier systems; the ones that are inspired by Smith (, ) and de Jong () at the University of Pittsburgh are usually termed Pittsburgh classifier systems – or briefly, Pitt classifier systems. Pittsburgh classifier systems separate the evaluation of candidate solutions, performed by an external critic, from the genetic search. As they evaluate candidate solutions as a whole, Pittsburgh classifier systems can easily identify and emphasize sequentially cooperating classifiers, which is particularly helpful in problems involving partial observability. In contrast, in Michigan classifier systems the credit assignment is focused, due to identification of the actual classifiers that produce the reward, so learning is much faster but sequentially cooperating classifiers are more difficult to spot. As Pittsburgh classifier systems apply the genetic algorithm to a set of solutions, they only work offline, whereas Michigan classifier systems work online, although they can also tackle offline problems. Finally, the design of Pittsburgh classifier systems involves decisions as to how an entire solution should be represented and how solutions should be recombined – a task which can be daunting. In contrast, the design of Michigan classifier systems involves simpler decisions about how a rule should be represented and how two rules should be recombined. Accordingly, while the representation of solutions and its related issues play a key role in Pittsburgh models, Michigan models easily work with several types of representations (Lanzi, ; Lanzi & Perrucci, ; Mellor, ).

Structure of the Learning System Michigan and Pittsburgh classifier systems were both inspired by the work of Holland on the broadcast language (Holland, ). However, their structures reflect two different ways to model machine learning: as an adaptation process in the case of Michigan classifier systems; and as an optimization problem, in the case of Pittsburgh classifier systems. Thus, the two models, originating from the same idea (Holland’s broadcast language), have radically different structures.

Michigan Classifier Systems Holland’s classifier systems define a general paradigm for genetics-based machine learning. The description in Holland and Reitman () provides a list of principles for online learning through adaptation. Over the years, such principles have guided researchers who developed several models of Michigan classifier systems (Butz, ; Wilson, , , ) and applied them to a large variety of domains (Bull, ; Lanzi & Riolo, ; Lanzi et al., ). These models extended and improved Holland’s original ideas, but kept all the ingredients of the original recipe: a population of classifiers, which represents the current system knowledge; a performance component, which is responsible for the short-term behavior of the system; a credit assignment (or reinforcement) component, which distributes the incoming reward among the classifiers; and a rule discovery component, which applies a genetic algorithm to the classifiers to improve the current knowledge.

Knowledge Representation In Michigan classifier systems, knowledge is represented by a population of classifiers. Each classifier is usually defined by four main parameters: the condition, which identifies some part of the problem domain; the action, which represents a decision on the subproblem identified by its condition; the prediction or strength, which estimates the amount of reward that the system will receive if its action is performed; and finally, the fitness, which estimates how good the classifier is in terms of problem solution. The knowledge representation of Michigan classifier systems is extremely flexible. Each one of the four classifier components can be tailored to fit the needs of a particular application, without modifying the main structure of the system. In problems involving binary inputs, classifier conditions can be simply represented using strings defined over the alphabet {0, 1, #}, as done in Holland and Reitman (), Goldberg (), and Wilson (). In problems involving real inputs, conditions can be represented as disjunctions of intervals, similar to the ones produced by other rule learning methods (Clark & Niblett, ). Conditions can also be represented as general-purpose symbolic expressions (Lanzi, ; Lanzi & Perrucci, ) or first-order logic expressions (Mellor, ). Classifier actions are typically encoded by a set of symbols (either binary strings or simple labels), but continuous real-valued actions are also available (Wilson, ). Classifier prediction (or strength) is usually encoded by a parameter (Goldberg, ; Holland & Reitman, ; Wilson, ). However, classifier prediction can also be computed using a parameterized function (Wilson, ), which results in solutions represented as an ensemble of local approximators – similar to the ones produced in generalized reinforcement learning (Sutton & Barto, ).
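As an illustration of the string-based condition representation for binary inputs, the following minimal sketch assumes the usual ternary alphabet {0, 1, #}, in which # acts as a "don't care" symbol that matches either input bit. The function names `matches` and `generality` are hypothetical and are not taken from any particular classifier system implementation.

```python
def matches(condition: str, state: str) -> bool:
    """Return True if a ternary condition over {0, 1, #} matches a binary input.

    Assumption: '#' is a don't-care symbol that matches both '0' and '1'.
    """
    if len(condition) != len(state):
        raise ValueError("condition and input must have the same length")
    return all(c == "#" or c == s for c, s in zip(condition, state))


def generality(condition: str) -> float:
    """Fraction of don't-care symbols: 0.0 is fully specific, 1.0 matches everything."""
    return condition.count("#") / len(condition)
```

For example, the condition `1#0` covers the inputs `100` and `110`, so a single classifier can represent a whole subproblem rather than one input.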

Performance Component A simplified structure of Michigan classifier systems is shown in Fig. . We refer the reader to Goldberg () and Holland and Reitman () for a detailed description of the original model and to Butz () and Wilson (, , ) for descriptions of recent classifier system models. A classifier system learns through trial and error interactions with an unknown environment. The system and the environment interact continually. At each time step, the classifier system perceives the environment through its detectors; it builds a match set containing all the classifiers in the population whose condition matches the current sensory input. The match set typically contains classifiers that advocate contrasting actions; accordingly, the classifier system evaluates each action in the match set, and selects an action to be performed balancing exploration and exploitation. The selected action is sent to the effectors to be executed in the environment; depending on the effect that the action has in the environment, the system receives a scalar reward.
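The performance cycle described above (build the match set, evaluate the advocated actions, select one) can be sketched as follows. This is only a sketch under stated assumptions: the `Classifier` fields mirror the four parameters named in the text, the fitness-weighted average used to evaluate actions follows the style of XCS, and the epsilon-greedy rule is one common way to balance exploration and exploitation. None of these choices is prescribed by the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str     # ternary string over {0, 1, #} (assumed representation)
    action: int
    prediction: float  # estimated reward if the action is performed
    fitness: float


def build_match_set(population, state):
    """All classifiers whose condition matches the current sensory input."""
    return [cl for cl in population
            if all(c in ("#", s) for c, s in zip(cl.condition, state))]


def evaluate_actions(match_set):
    """Fitness-weighted average prediction for each action in the match set."""
    num, den = {}, {}
    for cl in match_set:
        num[cl.action] = num.get(cl.action, 0.0) + cl.prediction * cl.fitness
        den[cl.action] = den.get(cl.action, 0.0) + cl.fitness
    return {a: num[a] / den[a] for a in num if den[a] > 0}


def select_action(action_values, epsilon=0.1, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(sorted(action_values))
    return max(action_values, key=action_values.get)
```

The selected action would then be sent to the effectors, and the resulting scalar reward passed on to the credit assignment component.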

Credit Assignment The credit assignment component (also called reinforcement component, Wilson ) distributes the incoming reward to the classifiers that are accountable for it. In Holland and Reitman (), credit assignment is implemented by Holland’s bucket brigade algorithm (Holland, ), which was partially inspired by the credit allocation mechanism used by Samuel in his pioneering work on learning checkers-playing programs (Samuel, ). In the early years, classifier systems and the bucket brigade algorithm were confined to the evolutionary computation community. The rise of reinforcement learning increased the connection between classifier systems and temporal difference learning (Sutton, ; Sutton & Barto, ): in particular, Sutton () showed that the bucket brigade algorithm is a kind of temporal difference learning, and similar connections were also made in Watkins () and Dorigo and Bersini (). Later, the connection between classifier systems and reinforcement learning became tighter with the introduction of Wilson’s XCS (), in which credit assignment is implemented by a modification of Watkins’ Q-learning (Watkins, ). As a consequence, in recent years, classifier systems are often presented as methods of reinforcement learning with genetics-based generalization (Bull & Kovacs, ).

Classifier Systems. Figure . Simplified structure of a Michigan classifier system. The system perceives the environment through its detectors and (1) it builds the match set containing the classifiers in the population that match the current sensory inputs; then (2) all the actions in the match set are evaluated, and (3) an action is selected to be performed in the environment through the effectors
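To make the link with Q-learning concrete, the sketch below shows a Q-learning-style target and a simple delta-rule update of classifier predictions toward it. The learning rate `beta`, the discount factor `gamma`, and both function names are illustrative assumptions; the sketch conveys the general idea of temporal difference credit assignment and is not a faithful reproduction of the bucket brigade or of XCS.

```python
def q_target(reward, next_action_values, gamma=0.9):
    """Q-learning-style target: immediate reward plus discounted best next value."""
    best_next = max(next_action_values.values()) if next_action_values else 0.0
    return reward + gamma * best_next


def update_predictions(predictions, target, beta=0.2):
    """Move each prediction a fraction beta of the way toward the target."""
    return [p + beta * (target - p) for p in predictions]
```

Here `predictions` would hold the predictions of the classifiers that advocated the executed action, so that only the classifiers accountable for the reward are updated.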

Rule Discovery Component The rule discovery component is usually implemented by a genetic algorithm that selects classifiers in the population with probability proportional to their fitness; it copies the selected classifiers and applies genetic operators (usually crossover and mutation) to the offspring classifiers; the new classifiers are inserted in the population, while other classifiers are deleted to keep the population size constant. Classifier selection plays a central role in rule discovery. It depends on the definition of classifier fitness and on the subset of classifiers considered during the selection process. In Holland and Reitman (), classifier fitness coincides with classifier prediction, while selection is applied to all the classifiers in the population. This approach results in a pressure toward classifiers predicting high returns, but typically tends to produce overly general solutions. To avoid such solutions, Wilson () introduced the XCS classifier system in which accuracy-based fitness is coupled with a niched genetic algorithm. This approach results in a pressure toward accurate maximally general classifiers, and has made XCS the most successful classifier system to date.
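The selection and variation steps of rule discovery can be sketched as below, assuming classifier conditions are ternary strings over {0, 1, #}. Roulette-wheel selection implements selection with probability proportional to fitness; the one-point crossover and per-symbol mutation operators, their names, and the default mutation rate are illustrative assumptions rather than the operators of any specific system.

```python
import random


def roulette_select(fitnesses, rng=random):
    """Return an index chosen with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.randrange(len(fitnesses))  # fall back to uniform choice
    pick = rng.random() * total
    running = 0.0
    for i, f in enumerate(fitnesses):
        running += f
        if pick < running:
            return i
    return len(fitnesses) - 1  # guard against floating-point round-off


def one_point_crossover(a, b, rng=random):
    """Swap the tails of two equal-length conditions (length must be at least 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(condition, rate=0.05, rng=random):
    """Resample each symbol from {0, 1, #} with probability rate."""
    return "".join(rng.choice("01#") if rng.random() < rate else c
                   for c in condition)
```

In a niched genetic algorithm, the same operators would be applied to a subset of the population (for instance the classifiers that match the current input) instead of the whole population.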

Pittsburgh Classifier Systems The idea underlying the development of Pittsburgh classifier systems was to show that interesting behaviors could be evolved using a simpler model than the one proposed by Holland with Michigan classifier systems (Holland, ; Holland & Reitman, ). In Pittsburgh classifier systems, each individual is a set of rules that encodes an entire candidate solution; each rule has a fixed length, but each rule set (each individual) usually contains a variable number of rules. The genetic operators, crossover and mutation, are tailored to the rule-based, variable-length representation. The individuals in the population compete among themselves, following the selection-recombination-mutation cycle that is typical of genetic algorithms (Goldberg, ; Holland, ). While in Michigan classifier systems individuals in the population (the single rules) cooperate, in Pittsburgh classifier systems there is no cooperation among individuals (the rule sets), so that the genetic algorithm operation is simpler for Pittsburgh models. However, as Pittsburgh classifier systems explore a much larger search space, they usually require more computational resources than Michigan classifier systems. The pseudo-code of a Pittsburgh classifier system is shown in Fig. . At first, the individuals in the population are randomly initialized (line ). At time t, the individuals are evaluated by an external critic, which returns a performance measure that the genetic algorithm exploits to compute the fitness of individuals (lines and ). Following this, selection (line ), recombination, and mutation (line ) are applied to the individuals in the population – as done in a typical genetic algorithm. The process stops when a termination criterion is met (line ), usually when an appropriate solution is found.

The design of Pittsburgh classifier systems follows the typical steps of genetic algorithm design, which means deciding how a rule set should be represented, what genetic operators should be applied, and how the fitness of a set of rules should be calculated. In addition, Pittsburgh classifier systems need to address the bloat phenomenon (Tackett, ) that arises with any variable-sized representation, like the rule sets evolved by Pittsburgh classifier systems. Bloat can be defined as the growth of individuals without an actual fitness improvement. In Pittsburgh classifier systems, bloat increases the size of candidate solutions by adding useless rules to individuals, and it is typically limited by introducing a parsimony pressure that discourages large rule sets (Bassett & de Jong, ). Alternatively, Pittsburgh classifier systems can be combined with multi-objective optimization, so as to separate the maximization of the rule set performance and the minimization of the rule set size. Examples of Pittsburgh classifier systems include SAMUEL (Grefenstette, Ramsey, & Schultz, ), the Genetic Algorithm Batch-Incremental Concept Learner (GABIL) (de Jong & Spears, ), GIL (Janikow, ), GALE (Llorà, ), and GAssist (Bacardit, ).

     1.  t := 0
     2.  Initialize the population P(t)
     3.  Evaluate the rule sets in P(t)
     4.  While the termination condition is not satisfied
     5.  Begin
     6.    Select the rule sets in P(t) and generate Ps(t)
     7.    Recombine and mutate the rule sets in Ps(t)
     8.    P(t+1) := Ps(t)
     9.    t := t+1
    10.    Evaluate the rule sets in P(t)
    11.  End

Classifier Systems. Figure . Pseudo-code of a Pittsburgh classifier system
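The pseudo-code can be turned into a small runnable sketch. Everything specific here is an assumption made for illustration: rules are (ternary condition, binary action) pairs, the external critic is classification accuracy on a list of labeled examples, selection is a binary tournament, recombination splices two variable-length rule lists, mutation appends a random rule, and a hard cap `max_rules` stands in for a parsimony pressure against bloat.

```python
import random


def random_rule(n_bits, rng):
    """A hypothetical rule: ternary condition over {0, 1, #} plus a binary action."""
    return ("".join(rng.choice("01#") for _ in range(n_bits)), rng.randint(0, 1))


def classify(rule_set, x, default=0):
    """The first matching rule decides; fall back to a default action."""
    for condition, action in rule_set:
        if all(c in ("#", b) for c, b in zip(condition, x)):
            return action
    return default


def critic(rule_set, examples):
    """External critic: fraction of examples classified correctly."""
    return sum(classify(rule_set, x) == y for x, y in examples) / len(examples)


def evolve(examples, n_bits, pop_size=30, generations=40, max_rules=8, seed=0):
    rng = random.Random(seed)
    # Initialize and evaluate the population P(t); each individual is a rule set.
    population = [[random_rule(n_bits, rng) for _ in range(rng.randint(1, 4))]
                  for _ in range(pop_size)]
    fitness = [critic(ind, examples) for ind in population]

    def pick():
        # Binary tournament selection on the current population.
        i, j = rng.randrange(pop_size), rng.randrange(pop_size)
        return population[i] if fitness[i] >= fitness[j] else population[j]

    for _ in range(generations):
        if max(fitness) == 1.0:  # termination: a perfect rule set was found
            break
        offspring = []
        while len(offspring) < pop_size:
            a, b = pick(), pick()
            # Variable-length crossover: head of one parent, tail of the other.
            child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
            if not child or rng.random() < 0.2:
                child = child + [random_rule(n_bits, rng)]  # mutation
            offspring.append(child[:max_rules])  # crude limit on bloat
        population = offspring                   # P(t+1) := Ps(t)
        fitness = [critic(ind, examples) for ind in population]
    best = max(range(pop_size), key=fitness.__getitem__)
    return population[best], fitness[best]
```

A call such as `evolve([("00", 0), ("01", 0), ("10", 1), ("11", 1)], n_bits=2)` returns the best rule set of the final population together with its accuracy. Note that each individual is evaluated as a whole, which is what distinguishes this loop from the rule-level credit assignment of Michigan systems.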

Applications Classifier systems have been applied to a large variety of domains, including computational economics (e.g., Arthur, Holland, LeBaron, Palmer, & Talyer, ), autonomous robotics (e.g., Dorigo & Colombetti, ), classification (e.g., Barry, Holmes, & Llora, ), fighter aircraft maneuvering (Bull, ; Smith, Dike, Mehra, Ravichandran, & El-Fallah, ), and many others. Reviews of classifier system applications are available in Lanzi et al. (), Lanzi and Riolo (), and Bull ().

Programs and Data The major sources of information about classifier systems are the LCSWeb maintained by Alwyn Barry, which can be reached through, and www.learningclassifier-systems.org maintained by Xavier Llorà. Several implementations of classifier systems are freely available online. The first standard implementation of Holland’s classifier system in Pascal was described in Goldberg (), and it is available at http://www.illigal.org/; a C version of the same implementation, developed by Robert E. Smith, is available at http://www.etsimo.uniovi.es/ftp/pub/EC/CFS/src/. Another implementation of an extension of Holland’s classifier system in C by Rick L. Riolo is available at http://www.cscs.umich.edu/Software/Contents.html. Implementations of Wilson’s XCS () are distributed by Alwyn Barry at the LCSWeb, by Martin V. Butz (at www.illigal.org), and by Pier Luca Lanzi (at xcslib.sf.net). Among the implementations of Pittsburgh classifier systems, the Samuel system is available from Alan C. Schultz at http://www.nrl.navy.mil/; Xavier Llorà distributes GALE (Genetic and Artificial Life Environment), a fine-grained parallel genetic algorithm for data mining, at www.illigal.org/xllora.

Cross References
→Credit Assignment
→Genetic Algorithms
→Reinforcement Learning
→Rule Learning

Recommended Reading
Arthur, B. W., Holland, J. H., LeBaron, B., Palmer, R., & Talyer, P. (). Asset pricing under endogenous expectations in an artificial stock market. Technical Report, Santa Fe Institute.
Bacardit i Peñarroya, J. (). Pittsburgh genetic-based machine learning in the data mining era: Representations, generalization, and run-time. PhD thesis, Computer Science Department, Enginyeria i Arquitectura La Salle Universitat Ramon Llull, Barcelona.
Barry, A. M., Holmes, J., & Llora, X. (). Data mining using learning classifier systems. In L. Bull (Ed.), Applications of learning classifier systems, studies in fuzziness and soft computing (Vol. , pp. –). Pagg: Springer.
Bassett, J. K., & de Jong, K. A. (). Evolving behaviors for cooperating agents. In Proceedings of the twelfth international symposium on methodologies for intelligent systems, LNAI (Vol. ). Berlin: Springer.
Booker, L. B. (). Triggered rule discovery in classifier systems. In J. D. Schaffer (Ed.), Proceedings of the rd international conference on genetic algorithms (ICGA). San Francisco: Morgan Kaufmann.
Bull, L. (Ed.). (). Applications of learning classifier systems, studies in fuzziness and soft computing (Vol. ). Berlin: Springer, ISBN ----.
Bull, L., & Kovacs, T. (Eds.). (). Foundations of learning classifier systems, studies in fuzziness and soft computing (Vol. ). Berlin: Springer, ISBN ----.
Butz, M. V. (). Anticipatory learning classifier systems. Genetic algorithms and evolutionary computation. Boston, MA: Kluwer Academic Publishers.
Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, (), –.
de Jong, K. (). Learning with genetic algorithms: An overview. Machine Learning, (–), –.
de Jong, K. A., & Spears, W. M. (). Learning concept classification rules using genetic algorithms. In Proceedings of the international joint conference on artificial intelligence (pp. –). San Francisco: Morgan Kaufmann.
Dorigo, M., & Bersini, H. (). A comparison of Q-learning and classifier systems. In D. Cliff, P. Husbands, J.-A. Meyer, & S. W. Wilson (Eds.), From animals to animats : Proceedings of the third international conference on simulation of adaptive behavior (pp. –). Cambridge, MA: MIT Press.
Dorigo, M., & Colombetti, M. (). Robot shaping: An experiment in behavior engineering. Cambridge, MA: MIT Press/Bradford Books.
Goldberg, D. E. (). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.
Grefenstette, J. J., Ramsey, C. L., & Schultz, A. (). Learning sequential decision rules using simulation models and competition. Machine Learning, (), –.
Holland, J. (). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning, an artificial intelligence approach (Vol. II, Chap. ) (pp. –). San Francisco: Morgan Kaufmann.
Holland, J. H. (). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press (Reprinted by the MIT Press in ).
Holland, J. H. (). Adaptation. Progress in Theoretical Biology, , –.
Holland, J. H., & Reitman, J. S. (). Cognitive systems based on adaptive algorithms. In D. A. Waterman & F. Hayes-Roth (Eds.), Pattern-directed inference systems. New York: Academic Press. (Reprinted from Evolutionary computation. The fossil record. D. B. Fogel (Ed.), IEEE Press ()).
Janikow, C. Z. (). A knowledge-intensive genetic algorithm for supervised learning. Machine Learning, (–), –.
Lanzi, P. L. (). Mining interesting knowledge from data with the XCS classifier system. In L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (GECCO) (pp. –). San Francisco: Morgan Kaufmann.
Lanzi, P. L. (). Learning classifier systems: A reinforcement learning perspective. In L. Bull & T. Kovacs (Eds.), Foundations of learning classifier systems, studies in fuzziness and soft computing (pp. –). Berlin: Springer.
Lanzi, P. L., & Perrucci, A. (). Extending the representation of classifier conditions part II: From messy coding to S-expressions. In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, & R. E. Smith (Eds.), Proceedings of the genetic and evolutionary computation conference (GECCO ) (pp. –). Orlando, FL: Morgan Kaufmann.
Lanzi, P. L., & Riolo, R. L. (). Recent trends in learning classifier systems research. In A. Ghosh & S. Tsutsui (Eds.), Advances in evolutionary computing: Theory and applications (pp. –). Berlin: Springer.
Lanzi, P. L., Stolzmann, W., & Wilson, S. W. (Eds.). (). Learning classifier systems: From foundations to applications. Lecture notes in computer science (Vol. ). Berlin: Springer.
Llorà, X. (). Genetics-based machine learning using fine-grained parallelism for data mining. PhD thesis, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona.
Mellor, D. (). A first order logic classifier system. In H. Beyer (Ed.), Proceedings of the conference on genetic and evolutionary computation (GECCO ’), (pp. –). New York: ACM Press.
Quinlan, J. R., & Cameron-Jones, R. M. (). Induction of logic programs: FOIL and related systems. New Generation Computing, (&), –.
Samuel, A. L. (). Some studies in machine learning using the game of checkers. In E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.
Smith, R. E., Dike, B. A., Mehra, R. K., Ravichandran, B., & El-Fallah, A. (). Classifier systems in combat: Two-sided learning of maneuvers for advanced fighter aircraft. Computer Methods in Applied Mechanics and Engineering, (–), –.
Smith, S. F. (). A learning system based on genetic adaptive algorithms. Doctoral dissertation, Department of Computer Science, University of Pittsburgh.
Smith, S. F. (). Flexible learning of problem solving heuristics through adaptive search. In Proceedings of the eighth international joint conference on artificial intelligence (pp. –). Los Altos, CA: Morgan Kaufmann.
Sutton, R. S. (). Learning to predict by the methods of temporal differences. Machine Learning, , –.
Sutton, R. S., & Barto, A. G. (). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tackett, W. A. (). Recombination, selection, and the genetic construction of computer programs. Unpublished doctoral dissertation, University of Southern California.
Watkins, C. (). Learning from delayed rewards. PhD thesis, King’s College.
Wilson, S. W. (). Classifier fitness based on accuracy. Evolutionary Computation, (), –.
Wilson, S. W. (). Classifiers that approximate functions. Natural Computing, (–), –.
Wilson, S. W. (). “Three architectures for continuous action” learning classifier systems. International workshops, IWLCS –, revised selected papers. In T. Kovacs, X. Llorà, K. Takadama, P. L. Lanzi, W. Stolzmann, & S. W. Wilson (Eds.), Lecture notes in artificial intelligence Vol. (pp. –). Berlin: Springer.

Clause A clause is a logical rule in a →logic program. Formally, a clause is a disjunction of (possibly negated) literals, such as

    grandfather(x, y) ∨ ¬father(x, z) ∨ ¬parent(z, y).

In the logic programming language →Prolog this clause is written as

    grandfather(X,Y) :- father(X,Z), parent(Z,Y).

The part to the left of :- (“if”) is the head of the clause, and the right part is its body. Informally, the clause asserts the truth of the head given the truth of the body. A clause with exactly one literal in the head is called a Horn clause or definite clause; logic programs mostly consist of definite clauses. A clause without a body is also called a fact; a clause without a head is also called a denial, or a query in a proof by refutation. The clause without head or body is called the empty clause: it signifies inconsistency or falsehood and is denoted ◻. Given a set of clauses, the resolution inference rule can be used to deduce logical consequences and answer queries (see →First-Order Logic).

In machine learning, clauses can be used to express classification rules for structured individuals. For example, the following definite clause classifies a molecular compound as carcinogenic if it contains a hydrogen atom with charge above a certain threshold.

    carcinogenic(M) :- atom(M,A1), element(A1,h), charge(A1,C1), geq(C1,0.168).
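To illustrate how a definite clause licenses new conclusions, the sketch below applies the grandfather clause to sets of ground facts. It is a hand-written join for this one clause, not a general resolution procedure, and the function name and the individuals in the usage note are hypothetical.

```python
def grandfathers(father, parent):
    """Apply grandfather(X, Y) :- father(X, Z), parent(Z, Y) to ground facts.

    father and parent are sets of (first argument, second argument) pairs;
    the result is the set of derivable grandfather(X, Y) facts.
    """
    return {(x, y)
            for (x, z) in father
            for (z2, y) in parent
            if z == z2}  # the shared variable Z must bind to the same individual
```

With the facts `father(abe, homer)` and `parent(homer, bart)`, the clause derives `grandfather(abe, bart)`: the body literals are satisfied with Z bound to `homer`, so the head follows.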

Cross References
→First-Order Logic
→Inductive Logic Programming
→Learning from Structured Data
→Logic Program
→Prolog

Clause Learning In →speedup learning, clause learning is a →deductive learning technique used for the purpose of →intelligent backtracking in satisfiability solvers. The approach analyzes failures at backtracking points and derives clauses that must be satisfied by the solution. The clauses are added to the set of clauses from the original satisfiability problem and serve to prune new search nodes that violate them.

Click-Through Rate (CTR) CTR measures the success of a ranking of search results or of an advertisement placement. Given the number of impressions (the number of times a web result or ad has been displayed) and the number of clicks (the number of users who clicked on the result or advertisement), CTR is the number of clicks divided by the number of impressions.
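The definition translates directly into a one-line computation. The function name and the choice to reject a non-positive impression count are assumptions made for this sketch.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = number of clicks / number of impressions."""
    if impressions <= 0:
        raise ValueError("CTR is undefined without at least one impression")
    return clicks / impressions
```

For instance, an advertisement displayed 1,000 times and clicked 25 times has a CTR of 25/1000, that is, 2.5%.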

Clonal Selection The clonal selection theory (CST) is the theory used to explain the basic response of the adaptive immune system to an antigenic stimulus. It establishes the idea that only those cells capable of recognizing an antigenic stimulus will proliferate, thus being selected against those that do not. Clonal selection operates on both T-cells and B-cells. When antibodies on a B-cell bind with an antigen, the B-cell becomes activated and begins to proliferate. New B-cell clones are produced that are an exact copy of the parent B-cell, but then they undergo somatic hypermutation and produce antibodies that are specific to the invading antigen. The B-cells, in addition to proliferating or differentiating into plasma cells, can differentiate into long-lived B memory cells. Plasma cells produce large amounts of antibody which will attach themselves to the antigen and act as a type of tag for T-cells to pick up on and remove from the system. This whole process is known as affinity maturation. This process forms the basis of many artificial immune system algorithms such as AIRS and aiNET.

Closest Point →Nearest Neighbor

Cluster Editing The Cluster Editing problem is almost equivalent to Correlation Clustering on complete instances. The idea is to obtain a graph that consists only of cliques. Whereas Cluster Deletion requires us to delete the smallest number of edges to obtain such a graph, in Cluster Editing we are permitted to add as well as remove edges. The final variant is Cluster Completion, in which edges can only be added. Each of these problems can be restricted to building a specified number of cliques.

Cluster Ensembles Cluster ensembles are an unsupervised →ensemble learning method. The principle is to create multiple different clusterings of a dataset, possibly using different algorithms, then aggregate the opinions of the different clusterings into an ensemble result. The final ensemble clustering should, in theory, be more reliable than the individual clusterings.

Cluster Optimization →Evolutionary Clustering

Clustering Clustering is a type of →unsupervised learning in which the goal is to partition a set of →examples into groups called clusters. Intuitively, the examples within a cluster are more similar to each other than to examples from other clusters. In order to measure the similarity between examples, clustering algorithms use various distortion or →distance measures. There are two major types of clustering approaches: generative and discriminative. The former assumes a parametric form of the data and tries to find the model parameters that maximize the probability that the data was generated by the chosen model. The latter represents graph-theoretic approaches that compute a similarity matrix defined over the input data.

Cross References
→Categorical Data Clustering
→Cluster Editing
→Cluster Ensembles
→Clustering from Data Streams
→Constrained Clustering
→Consensus Clustering
→Correlation Clustering
→Cross-Language Document Clustering
→Density-Based Clustering
→Dirichlet Process
→Document Clustering
→Evolutionary Clustering
→Graph Clustering
→k-Means Clustering
→k-Medoids Clustering
→Model-Based Clustering
→Partitional Clustering
→Projective Clustering
→Sublinear Clustering

Clustering Aggregation →Consensus Clustering

Clustering Ensembles →Consensus Clustering

Clustering from Data Streams João Gama University of Porto, Porto, Portugal

Definition →Clustering is the process of grouping objects into different groups, such that the similarity of the data within each group is high and the similarity between different groups is low. The data stream clustering problem is defined as maintaining a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The difficulties arise from the continuously arriving data points and the need to analyze them in real time. These characteristics require incremental clustering, maintaining cluster structures that evolve over time. Moreover, the data stream itself may evolve over time: new clusters might appear and others disappear, reflecting the dynamics of the stream.

Main Techniques Major clustering approaches in data stream cluster analysis include:

● Partitioning algorithms: construct a partition of a set of objects into k clusters that minimizes some objective function (e.g., the sum of squared distances to the centroid representative). Examples include k-means (Farnstrom, Lewis, & Elkan, ) and k-medoids (Guha, Meyerson, Mishra, Motwani, & O’Callaghan, ).
● Microclustering algorithms: divide the clustering process into two phases, where the first phase is online and summarizes the data stream in local models (microclusters) and the second phase generates a global cluster model from the microclusters. Examples of these algorithms include BIRCH (Zhang, Ramakrishnan, & Livny, ) and CluStream (Aggarwal, Han, Wang, & Yu, ).

Basic Concepts A powerful idea in clustering from data streams is the concept of cluster feature, CF. A cluster feature, or microcluster, is a compact representation of a set of points. A CF structure is a triple (N, LS, SS), used to store the sufficient statistics of a set of points:

● N is the number of data points
● LS is a vector, of the same dimension as the data points, that stores the linear sum of the N points
● SS is a vector, of the same dimension as the data points, that stores the square sum of the N points

The properties of cluster features are:

● Incrementality If a point x is added to the cluster A, the sufficient statistics are updated as follows, where $x^2$ is taken component-wise: $LS_A \leftarrow LS_A + x$, $SS_A \leftarrow SS_A + x^2$, $N_A \leftarrow N_A + 1$.
● Additivity If A and B are disjoint sets, merging them into C is equal to the sum of their parts. The additive property allows us to merge subclusters incrementally: $LS_C \leftarrow LS_A + LS_B$, $SS_C \leftarrow SS_A + SS_B$, $N_C \leftarrow N_A + N_B$.
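The two properties can be sketched with a small data structure. The class name `ClusterFeature`, its method names, and the use of plain Python lists for the vectors LS and SS are assumptions made for illustration; the updates themselves follow the incrementality and additivity rules stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterFeature:
    n: int            # N: number of points summarized
    ls: List[float]   # LS: per-dimension linear sum
    ss: List[float]   # SS: per-dimension sum of squares

    @classmethod
    def empty(cls, dim: int) -> "ClusterFeature":
        return cls(0, [0.0] * dim, [0.0] * dim)

    def add(self, x: List[float]) -> None:
        """Incrementality: absorb one point without storing it."""
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [a + v * v for a, v in zip(self.ss, x)]

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        """Additivity: the CF of two disjoint sets is the sum of their CFs."""
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )
```

Because both operations touch only the triple (N, LS, SS), the memory needed per microcluster is independent of the number of points it has absorbed, which is what makes the structure suitable for streams.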

A CF entry has sufficient information to calculate the norms

$$L_1 = \sum_{i=1}^{n} |x_{a_i} - x_{b_i}|, \qquad L_2 = \sqrt{\sum_{i=1}^{n} (x_{a_i} - x_{b_i})^2}$$

and basic measures to characterize a cluster.

● Centroid, defined as the gravity center of the cluster: $\vec{X} = \frac{LS}{N}$.
● Radius, defined as the average distance from member points to the centroid: $R = \sqrt{\frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{X})^2}{N}}$.
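Both measures can be computed from the triple (N, LS, SS) alone, without access to the original points. The sketch below assumes that SS is stored per dimension, as in the definition of the CF structure, and uses the identity that the mean squared distance to the centroid equals the sum over dimensions of SS/N minus (LS/N) squared. The function names are hypothetical. The radius here follows the formula above, that is, the square root of the mean squared distance.

```python
import math


def centroid(n, ls):
    """Centroid X = LS / N, computed per dimension."""
    return [s / n for s in ls]


def radius(n, ls, ss):
    """Radius R from the sufficient statistics only.

    Mean squared distance to the centroid = sum_d (SS_d / N - (LS_d / N)^2).
    """
    variance = sum(q / n - (s / n) ** 2 for s, q in zip(ls, ss))
    return math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off
```

For the two points (0, 0) and (2, 0), the statistics are N = 2, LS = (2, 0), SS = (4, 0), giving the centroid (1, 0) and a radius of 1.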

Partitioning Clustering k-means is the most widely used clustering algorithm. It constructs a partition of a set of objects into k clusters that minimizes some objective function, usually a squared error function, which implies round-shaped clusters. The input parameter k is fixed and must be given in advance, which limits its applicability to streaming and evolving data. Farnstrom et al. () proposed a single pass k-means algorithm. The main idea is to use a buffer where points of the dataset are kept compressed. The data stream is processed in blocks. All available space in the buffer is filled with points from the stream. Using these points, the algorithm finds k centers such that the sum of distances from data points to their closest center is minimized. Only the k centroids (representing the clustering results) are retained, with the corresponding k cluster features. In the following iterations, the buffer is initialized with the k centroids found in the previous iteration, weighted by the k cluster features, and with incoming data points from the stream. The Very Fast k-means (VFKM) algorithm (Domingos & Hulten, ) uses the Hoeffding bound to determine the number of examples needed in each step of a k-means algorithm. VFKM runs as a sequence of k-means runs, with an increasing number of examples until the Hoeffding bound is satisfied. Guha et al. () present an analytical study on k-median clustering of data streams. The proposed algorithm makes a single pass over the data stream and uses small space. It requires $O(nk)$ time and $O(n^{\epsilon})$ space, where k is the number of centers, n is the number of points, and $\epsilon < 1$. They have proved that any k-median algorithm that achieves a constant factor approximation cannot achieve a better run time than $O(nk)$.
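The Hoeffding bound mentioned above states that, with probability at least 1 − δ, the mean of n independent observations of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of its true mean. The sketch below computes this bound and, inverting it, the number of examples needed for a target ε. How VFKM applies the bound to the k-means centroids is not detailed in this entry, so the code illustrates only the bound itself; the function names are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the sample mean is within epsilon of the true mean
    with probability at least 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )
```

The bound shrinks with the square root of n, which is why an algorithm can start with few examples and increase the sample until the bound is tight enough.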

Micro Clustering The idea of dividing the clustering process into two layers, where the first layer generates local models (micro-clusters) and the second layer generates global models from the local ones, is a powerful one that has been used elsewhere. The BIRCH system (Zhang et al., ) builds a hierarchical structure of data, the CF-tree, where each node contains a set of cluster features (CFs). These CFs contain the sufficient statistics describing a set of points in the data set, and all information of the cluster features below in the tree. The system requires two user-defined parameters: B, the branching factor, or the maximum number of entries in each non-leaf node; and T, the maximum diameter (or radius) of any CF in a leaf node. The maximum diameter T determines which examples can be absorbed by a CF. As T increases, more examples can be absorbed by a micro-cluster and smaller CF-trees are generated (Fig. ). When an example becomes available, it traverses the current tree downward from the root until it finds the appropriate leaf. At each non-leaf node, the example follows the closest-CF path, with respect to the L1 or L2 norm. If the closest CF in the leaf cannot absorb the example, a new CF entry is made. If there is no room for a new leaf, the parent node is split. A leaf node might be expanded due to the constraints imposed by B and T. The process consists of taking the two farthest CFs and creating two new leaf nodes. When traversing back up the tree, the CFs are updated.
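A cluster feature can be sketched as follows. The sketch assumes the commonly used triple of sufficient statistics (number of points, linear sum, sum of squared norms), which the text above does not spell out; the class and function names are hypothetical, and the absorption test uses the radius as the quantity bounded by T.

```python
import math

class ClusterFeature:
    """CF = (n, LS, SS): count, per-dimension linear sum, sum of squared norms."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def add_point(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """Additivity: the CF of a union is the component-wise sum of the CFs."""
        out = ClusterFeature(len(self.ls))
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance of the summarized points to the centroid."""
        c = self.centroid()
        var = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(var, 0.0))

def try_absorb(cf, x, threshold):
    """Absorb x into cf only if the resulting radius stays within threshold T."""
    trial = ClusterFeature(len(cf.ls))
    trial.n, trial.ls, trial.ss = cf.n, list(cf.ls), cf.ss
    trial.add_point(x)
    if trial.radius() <= threshold:
        cf.n, cf.ls, cf.ss = trial.n, trial.ls, trial.ss
        return True
    return False
```

Because the three statistics are sums, a parent entry can summarize its children by merging their CFs, and centroid and radius can be recovered without revisiting the raw points.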

The CluStream Algorithm (Aggarwal et al., ) is an extension of the BIRCH system designed for data streams. Here, the CFs include temporal information: the time-stamp of an example is treated as a feature. CFs are initialized offline, using a standard k-means with a large value of k. For each incoming data point, the distances to the centroids of the existing CFs are computed. The data point is absorbed by an existing CF if its distance to the centroid falls within the maximum boundary of the CF, which is defined as a factor t of the radius deviation of the CF; otherwise, the data point starts a new micro-cluster. CluStream can generate approximate clusters for any user-defined time granularity. This is achieved by storing the CFT at regular time intervals, referred to as snapshots. Suppose the user wants to find clusters in the stream based on a history of length h. The off-line component can analyze the snapshots stored at t, the current time, and at (t − h) by using the additive property of the CFT. An important problem is when to store the snapshots of the current set of micro-clusters. For example, the natural time frame (Fig. ) stores snapshots each quarter; four quarters are aggregated into hours, hours are aggregated into days, and so on. The aggregation level is domain-dependent and exploits the additive property of the CFT.

Clustering from Data Streams. Figure . The clustering feature tree in BIRCH. B is the maximum number of CFs in a level of the tree. (Levels shown: root node, non-root node, and leaf nodes, each holding entries CF1, CF2, ..., CFb.)

Clustering from Data Streams. Figure . The figure presents a natural tilted time window. The most recent data is stored with high detail; older data is stored in a compressed way. The degree of detail decreases with time. (Levels shown: 1 year = 12 months; 1 month = 31 days; 1 day = 24 hours; 1 hour = 4 quarters.)

Tracking the Evolution of the Cluster Structure

Tracking change in clusters is a promising line of research. Spiliopoulou, Ntoutsi, Theodoridis, and Schult () present MONIC, a system for detecting and tracking change in clusters. MONIC assumes that a cluster is an object in a geometric space. It encompasses changes that involve more than one cluster, allowing for insights into cluster change across the whole clustering. The transition tracking mechanism is based on the degree of overlap between two clusters. The overlap between two clusters, X and Y, is defined as the normed number of common records weighted by the age of the records. Assume that cluster X was obtained at time $t_1$ and cluster Y at time $t_2$. The degree of overlap between the two clusters is given by: $\mathrm{overlap}(X, Y) = \sum_{a \in X \cap Y} \mathrm{age}(a, t_2) / \sum_{x \in X} \mathrm{age}(x, t_2)$. The degree of overlap allows properties of the underlying data stream to be inferred. A cluster transition at a given time point is a change in a cluster discovered at an earlier time point. MONIC distinguishes internal and external transitions, which reflect the dynamics of the stream. Examples of cluster transitions include: a cluster survives; a cluster is absorbed; a cluster disappears; a new cluster emerges (Fig. ).
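The overlap measure can be computed directly from the record identifiers of two clusters. In the sketch below, the exponential ageing function, its decay rate, and all names are assumptions chosen for illustration; the text above does not specify how age is defined.

```python
import math

def age(timestamp, now, decay=0.1):
    """Hypothetical ageing function: weight decays exponentially with elapsed time."""
    return math.exp(-decay * (now - timestamp))

def overlap(cluster_x, cluster_y, arrival, t2, age_fn=age):
    """Age-weighted share of the records of X that also appear in Y.

    cluster_x, cluster_y: sets of record ids; arrival: id -> arrival time;
    t2: the time point at which cluster Y was obtained.
    """
    denom = sum(age_fn(arrival[r], t2) for r in cluster_x)
    if denom == 0:
        return 0.0
    common = cluster_x & cluster_y
    return sum(age_fn(arrival[r], t2) for r in common) / denom
```

With a constant ageing function the measure reduces to the fraction of X's records found in Y; with a decaying one, recent records that persist in Y count for more than old ones that were lost.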

Recommended Reading
Aggarwal, C., Han, J., Wang, J., & Yu, P. (). A framework for clustering evolving data streams. In Proceedings of the th international conference on very large data bases (pp. –). San Mateo, CA: Morgan Kaufmann.
Domingos, P., & Hulten, G. (). A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of international conference on machine learning (pp. –). San Mateo, CA: Morgan Kaufmann.
Farnstrom, F., Lewis, J., & Elkan, C. (). Scalability for clustering algorithms revisited. SIGKDD Explorations, (), –.
Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, (), –.
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., & Schult, R. (). Monic: Modeling and monitoring cluster transitions. In Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM Press.
Zhang, T., Ramakrishnan, R., & Livny, M. (). Birch: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD international conference on management of data (pp. –). New York: ACM Press.

Clustering of Nonnumerical Data ►Categorical Data Clustering

Clustering with Advice ►Correlation Clustering

Clustering with Constraints ►Correlation Clustering

Clustering with Qualitative Information ►Correlation Clustering

Clustering with Side Information ►Correlation Clustering

CN ►Rule Learning

Co-Training ►Semi-Supervised Learning

Coevolution ►Coevolutionary Learning

Coevolutionary Computation ►Coevolutionary Learning

Coevolutionary Learning R. Paul Wiegand University of Central Florida, Orlando, FL, USA

Synonyms Coevolution; Coevolutionary computation

Definition Coevolutionary learning is a form of evolutionary learning (see ►Evolutionary Algorithms) in which the fitness evaluation is based on interactions between individuals. Since the evaluation of an individual is dependent on interactions with other evolving entities, changes in the set of entities used for evaluation can affect an individual’s ranking in a population. In this sense, coevolutionary fitness is subjective, while fitness in traditional evolutionary learning systems typically uses an objective performance measure.

Motivation and Background Ideally, coevolutionary learning systems focus on relevant areas of a search space by making adaptive changes between interacting, concurrently evolving parts. This can be particularly helpful when problem spaces are very large – infinite search spaces in particular. Additionally, coevolution is useful for problems in which no intrinsic objective measure exists. The interactive nature of evaluation makes coevolutionary methods natural candidates for problems such as the search for game-playing strategies (Fogel, ). Finally, some coevolutionary systems appear natural for search spaces that contain certain kinds of complex structures (Potter, ; Stanley, ), since search on smaller components in a larger structure can be emphasized. In fact, there is reason to believe that coevolutionary systems may be well suited for uncovering complex structures within a problem (Bucci & Pollack, ). Still, the dynamics of coevolutionary learning can be quite complex, and a number of pathologies often plague naïve users. Indeed, because of the subjective nature of coevolution, it can be easy to apply a particular coevolutionary learning system without a clear understanding of what kind of solution one expects a coevolutionary algorithm to produce. Recent theoretical analysis suggests that a clear concept of solution and a careful implementation of an evaluation process consistent with this concept can produce a coevolutionary system capable of addressing many problems (de Jong & Pollack, ; Ficici, ; Panait, ; Wiegand, ). Accordingly, a great deal of research in this area focuses on evaluation and progress measurement.

Structure of Learning System Coevolutionary learning systems work in much the same way that an evolutionary learning system works: individuals encode some aspect of potential solutions to a problem, those representatives are altered during search using genetic-like operators such as mutation and crossover, and the search is directed by selecting better individuals as determined by some kind of fitness assessment. These heuristic methods gradually refine solutions by repeatedly cycling through such steps, using the ideas of heredity and survival of the fittest to produce new generations of individuals, with increased quality of solution. Just as in traditional evolutionary computation, there are many choices available to the engineer in designing such systems. The reader is referred to the chapters relating to evolutionary learning for more details. However, there are some fundamental differences between traditional evolution and coevolution. In coevolution, measuring fitness requires evaluating the interaction between multiple individuals. Interacting individuals may reside in the same population or in different populations; the interactive nature of coevolution evokes notions of cooperation and competition in entirely new ways; the choices regarding how to best conduct evaluation of these interactions for the purposes of selection are particularly important; and there are unique coevolutionary issues surrounding representation. In addition, because of its interactive nature, the dynamics of coevolution can lead to some well-known pathological behaviors, and particularly careful attention to implementation choices to avoid such conditions is generally necessary.

Multiple Versus Single Population Approaches

Coevolutionary systems can be broadly classified according to whether interacting individuals reside in different populations or in the same population.

In the case of multipopulation coevolution, measuring fitness requires evaluating how individuals in one population interact with individuals in another. For example, individuals in each population may represent potential strategies for particular players of a game, they may represent roles in a larger ecosystem (e.g., predators and prey), or they may represent components that are fitted into a composite assembly with other components and then applied to a problem. Though individuals in different populations interact for the purposes of evaluation, they are typically otherwise independent of one another in the coevolutionary search process. In single population coevolution, an individual in the population is evaluated based on its interaction with other individuals in the same population. Such individuals may again represent potential strategies in a game, but evaluation may require them to trade off roles as to which player they represent in that game. Here, individuals not only interact for evaluation but also implicitly compete with one another as resources used in the coevolutionary search process itself. There is some controversy in the field as to whether this latter type qualifies as “coevolution.” Evolutionary biologists often define coevolution exclusively in terms of multiple populations; however, in biological systems, fitness is always subjective, while the vast majority of computational approaches to evolutionary learning involve objective fitness assessment, and this subjective/objective fitness distinction creates a useful classification. To be sure, there are fundamental differences between how single population and multipopulation learning systems behave (Ficici, ). Still, single population systems that employ subjective fitness assessment behave much more like multipopulation coevolutionary systems than like evolution based on objective fitness. Moreover, historically, the field has used the term coevolution whenever fitness assessment is based on interactions between individuals, and a large amount of that research has involved systems with only one population.

Competition and Cooperation

The terms cooperative and competitive have been used to describe aspects of coevolutionary learning in at least three ways.

First and less commonly, these adjectives can describe qualitatively observed behaviors of potential solutions in coevolutionary systems, the results of some evolutionary process (e.g., “tit-for-tat” strategies, Axelrod, ). Second, problems are sometimes considered to be inherently competitive or cooperative. Indeed, game theory provides some guidance for making such distinctions. However, since in many kinds of problems little may be known about the actual structure of the payoff functions involved, we may not actually be able to classify the problem as definitively competitive or cooperative. The final and by far most common use of the terms is to distinguish algorithms themselves. Cooperative algorithms are those in which interacting individuals succeed or fail together, while competitive algorithms are those in which individuals succeed at the expense of other individuals. Because of the ambiguity of the terms, some researchers advocate abandoning them altogether, instead focusing distinguishing terminology on the form a potential solution takes. For example, the term ►compositional coevolution describes an algorithm designed to return a solution composed of multiple individuals (e.g., a multiagent team), and the term ►test-based coevolution describes an algorithm designed to return an individual who performs well against an adaptive set of tests (e.g., a sorting network). This latter pair of terms draws a slightly different, though probably more useful, distinction than the cooperative and competitive terms. Still, it is instructive to survey the algorithms based on how they have been historically classified.

Examples of competitive coevolutionary learning include simultaneously learning sorting networks and challenging data sets in a predator–prey type relationship (Hillis, ). Here, individuals in one population representing potential sorting networks are awarded a fitness score based on how well they sort opponent data sets from the other population. Individuals in the second population represent potential data sets whose fitness is based on how well they distinguish opponent sorting networks. Competitive coevolution has also been applied to learning game-playing strategies (Fogel, ; Rosin & Belew, ). Additionally, competition has played a vital part in the attempts to coevolve complex agent behaviors (Sims, ). Finally, competitive approaches have been applied to a variety of more traditional machine learning problems, for example, learning classifiers in one population and challenging subsets of exemplars in the other (Paredis, ).

Potter developed a relatively general framework for cooperative coevolutionary learning, applying it first to static function optimization and later to neural network learning (Potter, ). Here, each population contains individuals representing a portion of the network, and evolution of these components occurs almost independently, in tandem with one another, interacting only to be assembled into a complete network in order to obtain fitness. The decomposition of the network can be static and a priori, or dynamic in the sense that components may be added or removed during the learning process. Moriarty and Miikkulainen take a different, somewhat more adaptive approach to cooperative coevolution of neural networks (Moriarty & Miikkulainen, ). In this case, one population represents potential network plans, while a second is used to acquire node information. Plans are evaluated based on how well they solve a problem with their collaborating nodes, and the nodes receive a share of this fitness. Thus, a node is rewarded for participating more with successful plans, and so receives fitness only indirectly.

Evaluation

Choices surrounding how interacting individuals in coevolutionary systems are evaluated for the purposes of selection are perhaps the most important choices facing an engineer employing these methods. Designing the evaluation method involves a variety of practical choices, as well as a broader eye to the ultimate purpose of the algorithm itself. Practical concerns in evaluation include determining the number of individuals with whom to interact, how those individuals will be chosen for the interaction, and how the selection will operate on the results of multiple interactions (Wiegand, ). For example, one might determine the fitness of an individual by pairing it with all other individuals in the other populations (or the same population for single population approaches) and taking the average or maximum value of such evaluations as the fitness assessment. Alternatively, one may simply use the single best individual as determined by a previous generation of the algorithm, or a combination of those approaches. Random pairing between individuals is also common. This idea can be extended to tournament evaluation, in which successful individuals from pairwise interactions are promoted and further paired, and fitness is assigned based on how far an individual progresses in the tournament. Many of these methods have been evaluated empirically on a variety of types of problems (Angeline & Pollack, ; Bull, ; Wiegand, ).

However, the design of the evaluation method also speaks to the broader issue of how best to implement the desired ►solution concept (a criterion specifying which locations in the search space are solutions and which are not) (Ficici, ). The key to successful application of coevolutionary learning is to first elicit a clear and precise solution concept and then design an algorithm (an evaluation method in particular) that implements such a concept explicitly. A successful coevolutionary learner capable of achieving reliable progress toward a particular solution concept often makes use of an archive of individuals and an update rule for that archive that insists that the distance to a particular solution concept decrease with every change to the archive. For example, if one is interested in finding game strategies that satisfy Nash equilibrium constraints, one might consider comparing new individuals to an archive of potential individual strategies found so far that together represent a potential Nash mixed strategy (Ficici, ). Alternatively, if one is interested in maximizing the sum of an individual’s outcomes over all tests, one may likewise employ an archive of discovered tests that candidate solutions are able to solve (de Jong, ).

It is useful to note that many coevolutionary learning problems are multiobjective in nature. That is, ►underlying objectives may exist in such problems, each creating a different ranking for individuals depending on the set of tests being considered during evaluation (Bucci & Pollack, ). The set of all possible underlying objectives (were it known) is sufficient to determine the outcomes on all possible tests. A careful understanding of this can yield approaches that create ideal and minimal evaluation sets for such problems (de Jong & Pollack, ). By acknowledging the link between multiobjective optimization and coevolutionary learning, a variety of evaluation and selection methods based on notions of multiobjective optimization have been employed. For example, there are selection methods that use Pareto dominance between candidate solutions and their tests as their basis of comparison (Ficici, ). Additionally, such methods can be combined with archive-based approaches to ensure monotonicity of progress toward a Pareto dominance solution concept (de Jong & Pollack, ).
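The practical evaluation choices discussed in this section (how many partners to interact with, how to aggregate the outcomes, and how to compare candidates by Pareto dominance over tests) can be sketched as follows. All names, the toy `beats` interaction, and the default parameters are hypothetical; the sketch is an illustration under those assumptions, not a method taken from any of the cited works.

```python
import random

def evaluate(individual, partners, interact, sample_size=None,
             aggregate="mean", seed=0):
    """Subjective fitness of `individual` from interactions with `partners`."""
    pool = list(partners)
    if sample_size is not None and sample_size < len(pool):
        # Random pairing: interact with a random subset instead of everyone.
        pool = random.Random(seed).sample(pool, sample_size)
    outcomes = [interact(individual, p) for p in pool]
    if aggregate == "mean":
        return sum(outcomes) / len(outcomes)
    if aggregate == "max":
        return max(outcomes)
    raise ValueError("aggregate must be 'mean' or 'max'")

def pareto_dominates(a, b, tests, interact):
    """True if a does at least as well as b on every test and better on one."""
    oa = [interact(a, t) for t in tests]
    ob = [interact(b, t) for t in tests]
    return (all(x >= y for x, y in zip(oa, ob))
            and any(x > y for x, y in zip(oa, ob)))

def beats(a, b):
    """Toy interaction (an assumption): the larger number wins."""
    return 1.0 if a > b else 0.0
```

Evaluating the same individual against different partner sets, for instance `evaluate(5, [1, 2, 3], beats)` and `evaluate(5, [6, 7, 8], beats)`, yields different fitness values, which is the sense in which coevolutionary fitness is subjective.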

Representation

Perhaps the core representational question in coevolution is the role that an individual plays. In test-based coevolution, an individual typically represents a potential solution to the problem or a test for a potential solution, whereas in compositional coevolution individuals typically represent a candidate component for a composite or ensemble solution. Even in test-based approaches, the true solution to the problem may be expressed as a population of individuals, rather than a single individual. The population may represent a mixed strategy while individuals represent potential pure strategies for a game. Engineers using such approaches should be clear about the form of the final solution produced by the algorithm, and should ensure that this form is consistent with the prescribed solution concept. In compositional approaches, the key issues tend to concern how the problem is decomposed. In some algorithms, this decomposition is performed a priori, having different populations represent explicit components of the problem (Potter, ). In other approaches, the decomposition is intended to be somewhat more dynamic (Moriarty & Miikkulainen, ; Potter, ). Still more recent approaches seek to harness the potential of compositional coevolutionary systems to search open-ended representational spaces by gradually complexifying the representational space during the search (Stanley, ). In addition, a variety of coevolutionary systems have successfully dealt with some inherent pathologies by representing populations in spatial topologies, and restricting selection and interaction using geometric constraints defined by those topologies (Pagie, ). Typically, these systems involve overlaying multiple grids of individuals, applying selection within some neighborhood in a given grid, and evaluating interactions between individuals in different grids using a similar type of cross-population neighborhood. The benefits of these systems are in part due to their ability to naturally regulate loss of diversity and spread of interaction information by explicit control over the size and shape of these neighborhoods.

Pathologies and Remedies

Perhaps the most commonly cited pathology is the so-called loss of gradient problem, in which one population comes to severely dominate the others, thus creating a situation in which individuals cannot be distinguished from one another. The populations become disengaged and evolutionary progress may stall or drift (Watson & Pollack, ). Disengagement most commonly occurs when distinguishing individuals are lost in the evolutionary process (forgetting), and the solution to this problem typically involves somehow retaining potentially informative, though possibly inferior quality, individuals (e.g., archives). Intransitivities in the reward system can cause some coevolutionary systems to exhibit cycling dynamics (Watson & Pollack, ), where reciprocal changes force the system to orbit some part of a potential search space. The remedy to this problem often involves creating coevolutionary systems that change in response to traits in several other populations. Mechanisms introduced to produce such effects include competitive fitness sharing (Rosin & Belew, ). Another challenging problem occurs when individuals in a coevolutionary system overspecialize on one underlying objective at the expense of other necessary objectives (Watson & Pollack, ). In fact, overspecialization can be seen as a form of disengagement on some subset of underlying objectives, and likewise the repair to this problem often involves retaining individuals capable of making distinctions in as many underlying objectives as possible (Bucci & Pollack, ).
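Competitive fitness sharing is named above without being defined. The sketch below assumes its usual formulation, in which each defeated opponent contributes a reward that is divided among all individuals that defeat it; the function name and the data layout are hypothetical.

```python
def shared_fitness(wins):
    """wins[i] = set of opponent ids that individual i defeats.

    Each defeated opponent contributes 1 / (number of individuals that
    defeat it), so that beating a rarely beaten opponent is worth more.
    """
    counts = {}
    for defeated in wins.values():
        for opp in defeated:
            counts[opp] = counts.get(opp, 0) + 1
    return {
        i: sum(1.0 / counts[opp] for opp in defeated)
        for i, defeated in wins.items()
    }
```

For example, with `wins = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {4}}` the raw win counts are 3, 3, and 1, whereas the shared fitness values are 1.5, 1.5, and 1.0. The individual that alone defeats opponent 4 is therefore far less likely to be lost, which helps retain the ability to distinguish opponents.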

Finally, certain kinds of compositional coevolutionary learning algorithms can be prone to relative overgeneralization, a pathology in which components that perform reasonably well in a variety of composite solutions are favored over those that are part of an optimal solution (Wiegand, ). In this case, it is typically possible to bias the evaluation process toward optimal values by evaluating an individual in a variety of composite assemblies and assigning the best objective value found as the fitness (Panait, ). In addition to pathological behaviors in coevolution, the subjective nature of these learning systems creates difficulty in measuring progress. Since fitness is subjective, it is impossible to determine whether these relative measures indicate progress or stagnation when the measurement values do not change much. Without engaging some kind of external or objective measure, it is difficult to understand what the system is really doing. Obviously, if an objective measure exists then it can be employed directly to measure progress (Watson & Pollack, ). A variety of measurement methodologies have been employed when objective measurement is not possible. One method is to compare current individuals against all ancestral opponents (Cliff & Miller, ). Another predator/prey based method holds master tournaments between all the best predators and all the best prey found during the search (Nolfi & Floreano, ). A similar approach suggests maintaining the best individuals from each generation in each population in a hall of fame for comparison purposes (Rosin & Belew, ). Still other approaches seek to record the points during the coevolutionary search in which a new dominant individual was found (Stanley, ). A more recent approach advises looking at the population differential, examining all the information from ancestral generations rather than simply selecting a biased subset (Bader-Natal & Pollack, ). 
Conversely, an alternative idea is to consider how well the dynamics of the best individuals in different populations reflect the fundamental best response curves defined by the problem (Popovici, ). With a clear solution concept, an appropriate evaluation mechanism implementing that concept, and practical progress measures in place, coevolution can be an effective and versatile machine learning tool.
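The relative overgeneralization pathology described earlier in this section, and the remedy of assigning the best value found over several composite assemblies, can be made concrete with a small worked example. The payoff table and the component names are invented for illustration only.

```python
# Hypothetical joint payoffs for a two-population compositional problem.
# "a_opt" is part of the optimal pair but performs badly with other partners;
# "a_safe" performs moderately with every partner.
PAYOFF = {
    ("a_opt", "b_opt"): 10.0, ("a_opt", "b1"): 0.0, ("a_opt", "b2"): 0.0,
    ("a_safe", "b_opt"): 6.0, ("a_safe", "b1"): 6.0, ("a_safe", "b2"): 6.0,
}
PARTNERS = ["b_opt", "b1", "b2"]

def fitness(component, partners, aggregate):
    """Fitness of a component from its payoffs with the given collaborators."""
    values = [PAYOFF[(component, p)] for p in partners]
    return aggregate(values)

def mean(values):
    return sum(values) / len(values)

def preferred(aggregate):
    """Component that selection would favor under the given aggregation."""
    return max(["a_opt", "a_safe"],
               key=lambda c: fitness(c, PARTNERS, aggregate))
```

Under averaging, `preferred(mean)` favors the component that does reasonably well everywhere, while `preferred(max)` favors the component that belongs to the optimal composite, which is the bias toward optimal values that the remedy aims for.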

Cross References ►Evolutionary Algorithms

Recommended Reading Angeline, P., & Pollack, J. (). Competitive environments evolve better solutions for complex tasks. In S. Forest (Ed.), Proceedings of the fifth international conference on genetic algorithms (pp. –). San Mateo, CA: Morgan Kaufmann. Axelrod, R. (). The evolution of cooperation. New York: Basic Books. Bader-Natal, A., & Pollack, J. (). Towards metrics and visualizations sensitive to Coevolutionary failures. In AAAI technical report FS-- coevolutionary and coadaptive systems. AAAI Fall Symposium, Washington, DC. Bucci, A., & Pollack, J. B. (). A mathematical framework for the study of coevolution. In R. Poli, et al. (Eds.), Foundations of genetic algorithms VII (pp. –). San Francisco: Morgan Kaufmann. Bucci, A., & Pollack, J. B. (). Focusing versus intransitivity geometrical aspects of coevolution. In E. Cantú-Paz, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (pp. –). Berlin: Springer. Bull, L. (). Evolutionary computing in multi-agent environments: Partners. In T. Bäck (Ed.), Proceedings of the seventh international conference on genetic algorithms (pp. –). San Mateo, CA: Morgan Kaufmann. Cliff, D., & Miller, G. F. (). Tracking the red queen: Measurements of adaptive progress in co-evolutionary simulations. In Proceedings of the third European conference on artificial life (pp. –). Berlin: Springer. de Jong, E. (). The maxsolve algorithm for coevolution. In H. Beyer, et al. (Eds.), Proceedings of the genetic and evolutionary computation conference (pp. –). New York, NY: ACM Press. de Jong, E., & Pollack, J. (). Ideal evaluation from coevolution. Evolutionary Computation, , –. Ficici, S. G. (). Solution concepts in coevolutionary algorithms. PhD thesis, Brandeis University, Boston, MA. Fogel, D. (). Blondie: Playing at the edge of artificial intelligence. San Francisco: Morgan Kaufmann. Hillis, D. (). Co-evolving parasites improve simulated evolution as an optimization procedure. 
Artificial life II, SFI studies in the sciences of complexity (Vol. , pp. –). Moriarty, D., & Miikkulainen, R. (). Forming neural networks through efficient and adaptive coevolution. Evolutionary Computation, , –. Nolfi, S., & Floreano, D. (). Co-evolving predator and prey robots: Do “arm races” arise in artificial evolution? Artificial Life, , –. Pagie, L. (). Information integration in evolutionary processes. PhD thesis, Universiteit Utrecht, the Netherlands. Panait, L. (). The analysis and design of concurrent learning algorithms for cooperative multiagent systems. PhD thesis, George Mason University, Fairfax, VA. Paredis, J. (). Steps towards co-evolutionary classification networks. In R. A. Brooks & P. Maes (Eds.), Artificial life IV, proceedings of the fourth international workshop on the synthesis and simulation of living systems (pp. –). Cambridge, MA: MIT Press. Popovici, E. (). An analysis of multi-population co-evolution. PhD thesis, George Mason University, Fairfax, VA. Potter, M. (). The design and analysis of a computational model of cooperative co-evolution. PhD thesis, George Mason University, Fairfax, VA. Rosin, C., & Belew, R. (). New methods for competitive coevolution. Evolutionary Computation, , –. Sims, K. (). Evolving D morphology and behavior by competition. In R. A. Brooks & P. Maes (Eds.), Artificial life IV, proceedings of the fourth international workshop on the synthesis and simulation of living systems (pp. –). Cambridge, MA: MIT Press. Stanley, K. (). Efficient evolution of neural networks through complexification. PhD thesis, The University of Texas at Austin, Austin, TX. Watson, R., & Pollack, J. (). Coevolutionary dynamics in a minimal substrate. In L. Spector, et al. (Eds.), Proceedings from the genetic and evolutionary computation conference (pp. –). San Francisco: Morgan Kaufmann. Wiegand, R. P. (). An analysis of cooperative coevolutionary algorithms. PhD thesis, George Mason University, Fairfax, VA.

Collaborative Filtering

Collaborative Filtering (CF) refers to a class of techniques that recommend items to users that other users with similar tastes have liked in the past. CF methods are commonly subdivided into neighborhood-based and model-based approaches. In neighborhood-based approaches, a subset of users is chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for this user. In contrast, model-based approaches assume an underlying structure to users’ rating behavior, and induce predictive models based on the past ratings of all users.
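A neighborhood-based prediction can be sketched in a few lines. The choice of cosine similarity, the neighborhood size k, and all names below are assumptions made for illustration; the entry above does not prescribe a particular similarity measure or weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, active, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings."""
    candidates = []
    for user, user_ratings in ratings.items():
        if user != active and item in user_ratings:
            sim = cosine(ratings[active], user_ratings)
            if sim > 0:
                candidates.append((sim, user_ratings[item]))
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None  # no similar user has rated this item
    total = sum(sim for sim, _ in neighbors)
    return sum(sim * r for sim, r in neighbors) / total
```

The prediction for the active user is pulled toward the ratings of the users whose past ratings most resemble the active user's own.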

Collection

►Class

Collective Classification Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor University of Maryland, MD, USA

Synonyms Iterative classification; Link-based classification

Definition Many real-world ►classification problems can be best described as a set of objects interconnected via links to form a network structure. The links in the network denote relationships among the instances such that the class labels of the instances are often correlated. Thus, knowledge of the correct label for one instance improves our knowledge about the correct assignments to the other instances it connects to. The goal of collective classification is to jointly determine the correct label assignments of all the objects in the network.

Motivation and Background Traditionally, a major focus of machine learning is to solve classification problems: given a corpus of documents, classify each according to its topic label; given a collection of e-mails, determine which are spam; given a sentence, determine the part-of-speech tag for each word; given a hand-written document, determine the characters, etc. However, much of the work in machine learning makes an independent and identically distributed (IID) assumption, and focuses on predicting the class label of each instance in isolation. In many cases, however, the class labels whose values need to be determined can benefit if we know the correct assignments to related class labels. For example, it is easier to predict the topic of a webpage if we know the topics of the webpages that link to it, the chance of a particular word being a verb increases if we know that the previous word in the sentence is a noun, knowing the rest of the characters in a word can make it easier to identify an unknown character, etc. In the last decade, many researchers have proposed techniques that attempt to classify samples in a joint or collective manner instead of treating each sample in isolation, and reported significant gains in classification accuracy.


Theory/Solution Collective classification is a combinatorial optimization problem, in which we are given a set of nodes, $V = \{v_1, \ldots, v_n\}$, and a neighborhood function $N$, where $N_i \subseteq V \setminus \{v_i\}$, which describes the underlying network structure. Each node in $V$ is a random variable that can take a value from an appropriate domain, $L = \{l_1, \ldots, l_q\}$. $V$ is further divided into two sets of nodes: $X$, the nodes for which we know the correct values (observed variables), and $Y$, the nodes whose values need to be determined. Our task is to label the nodes $Y_i \in Y$ with one of a small number of predefined labels in $L$. Even though it is only in the last decade that collective classification has entered the collective consciousness of machine learning researchers, the general idea can be traced further back (Besag, ). As a result, a number of approaches have been proposed. The various approaches to collective classification differ in the kinds of information they aim to exploit to arrive at the correct classification, and in their mathematical underpinnings. We discuss each in turn.

Relational Classification Traditional classification concentrates on using the observed attributes of the instance to be classified. Relational classification (Slattery & Craven, ) attempts to go a step further by classifying the instance using not only the instance’s own attributes but also the instance’s neighbors’ attributes. For example, in a hypertext classification domain where we want to classify webpages, not only would we use the webpage’s own words but we would also look at the webpages linking to this webpage using hyperlinks and their words to arrive at the correct class label. Results obtained using relational classification have been mixed. For example, even though there have been reports of classification accuracy gains using such techniques, in certain cases, these techniques can harm classification accuracy (Chakrabarti, Dom, & Indyk, ).

Iterative Collective Classification with Neighborhood Labels A second approach to collective classification is to use the class labels assigned to the neighbor instead of using the neighbor’s observed attributes. For example, going back to our hypertext classification example, instead of using the linking webpage’s words we would, in this case, use its assigned labels to classify the current webpage. Chakrabarti et al. () illustrated the use of this approach and reported impressive classification accuracy gains. Neville and Jensen () further developed the approach, referred to it as iterative classification, and studied the conditions under which it improved classification performance (Jensen, Neville, & Gallagher, ). Techniques for feature construction from the neighboring labels were developed and studied (Lu & Getoor, ), along with methods that make use of only the label information (Macskassy & Provost, ), as well as a variety of strategies for when to commit the class labels (McDowell, Gupta, & Aha, ). The algorithm listing titled Iterative Classification Algorithm (ICA) gives pseudo-code for a simple version of the algorithm. The basic premise behind ICA is extremely simple. Consider a node $Y_i \in Y$ whose value we need to determine and suppose we know the values of all the other nodes in its neighborhood $N_i$ (note that $N_i$ can contain both observed and unobserved variables). Then, ICA assumes that we are given a local classifier $f$ that takes the values of $N_i$ as arguments and returns a label value for $Y_i$ from the class label set $L$. For local classifiers $f$ that do not return a class label but a goodness/likelihood value given a set of attribute values and a label, we

Algorithm Iterative Classification Algorithm (ICA)

    for each node Y_i ∈ Y do    {bootstrapping}
        {compute label using only observed nodes in N_i}
        compute a_i using only X ∩ N_i
        y_i ← f(a_i)
    end for
    repeat    {iterative classification}
        generate ordering O over nodes in Y
        for each node Y_i ∈ O do
            {compute new estimate of y_i}
            compute a_i using current assignments to N_i
            y_i ← f(a_i)
        end for
    until all class labels have stabilized or a threshold number of iterations have elapsed


simply choose the label that corresponds to the maximum goodness/likelihood value; in other words, we replace $f$ with $\operatorname{argmax}_{l \in L} f$. This makes the local classifier $f$ extremely flexible and we can use anything ranging from a decision tree to a support vector machine (SVM). Unfortunately, it is rare in practice that we know all values in $N_i$, which is why we need to repeat the process iteratively, in each iteration labeling each $Y_i$ using the current best estimates of $N_i$ and the local classifier $f$, and continuing to do so until the assignments to the labels stabilize. Most local classifiers are defined as functions whose argument consists of a fixed-length vector of attribute values, whereas the number of neighbors varies from node to node. A common approach to circumvent this mismatch is to use an aggregation operator such as count, mode, or prop (the last of which measures the proportion of neighbors with a given label). In the algorithm listing, we use $\vec{a}_i$ to denote the vector encoding the values in $N_i$ obtained after aggregation. Note that in the first ICA iteration, all labels $y_i$ are undefined; to initialize them we simply apply the local classifier to the observed attributes in the neighborhood of $Y_i$, a step referred to as “bootstrapping” in the algorithm listing. Researchers in collective classification (Macskassy & Provost, ; McDowell et al., ; Neville & Jensen, ) have extended the simple algorithm described above, and developed a version of Gibbs sampling that is easy to implement and faster than traditional Gibbs sampling approaches. The basic idea behind this algorithm is to assume, just as in the case of ICA, that we have access to a local classifier $f$ that can sample a label estimate for $Y_i$ given all the values for the nodes in $N_i$. We keep doing this repeatedly for a fixed number of iterations (a period known as “burn-in”). After that, not only do we sample labels for each $Y_i \in Y$ but we also maintain count statistics as to how many times we sampled label $l$ for node $Y_i$. After collecting a predefined number of such samples we output the best label assignment for node $Y_i$ by choosing the label that was assigned the maximum number of times to $Y_i$ while collecting samples. One of the benefits of both variants of ICA is that it is fairly simple to make use of any local classifier. Some of the classifiers used include the following: naïve Bayes (Chakrabarti et al., ; Neville & Jensen, ), logistic regression (Lu & Getoor, ), decision trees (Jensen et al., ), and weighted-vote relational


neighbor (Macskassy & Provost, ). There is some evidence to indicate that discriminatively trained local classifiers such as logistic regression tend to produce higher accuracies than others; this is consistent with results in other areas. Other aspects of ICA that have been the subject of investigation include the ordering strategy that determines the order in which nodes are visited and relabeled in each ICA iteration. There is some evidence to suggest that ICA is fairly robust to a number of simple ordering strategies such as random ordering, visiting nodes in ascending order of the diversity of their neighborhood class labels, and labeling nodes in descending order of label confidences (Getoor, ). However, there is also some evidence that certain modifications to the basic ICA procedure tend to produce improved classification accuracies. For example, both Neville and Jensen () and McDowell et al. () propose a strategy where only a subset of the unobserved variables are utilized as inputs for feature construction. More specifically, in each iteration, they choose the top-$k$ most confident predicted labels and use only those unobserved variables in the following iteration’s predictions, thus ignoring the less confident predicted labels. In each subsequent iteration they increase the value of $k$ so that in the last iteration all nodes are used for prediction. McDowell et al. report that such a “cautious” approach leads to improved accuracies.
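As a rough illustration of the iterative procedure described in this section, the following Python sketch runs a simple ICA with a count aggregation on a small hypothetical graph. The graph, the labels ML and DB, and the majority-vote local classifier are assumptions made for this example and are not taken from the cited work; any local classifier that maps the aggregated vector to a label could be substituted.

```python
from collections import Counter


def aggregate(node, neighbors, labels, label_set):
    """Count aggregation: how many currently labeled neighbors carry each label."""
    counts = Counter(labels[n] for n in neighbors[node] if n in labels)
    return [counts[l] for l in label_set]


def local_classifier(features, label_set):
    """Hypothetical local classifier f: return the label with the largest count.

    Ties (including an all-zero vector) fall back to the earliest label.
    """
    return label_set[features.index(max(features))]


def ica(neighbors, observed, label_set, max_iters=10):
    """Simple ICA: bootstrap from observed neighbors, then relabel until stable."""
    unobserved = [v for v in neighbors if v not in observed]
    # Bootstrapping: compute each label using only the observed nodes in N_i.
    bootstrap = {
        v: local_classifier(aggregate(v, neighbors, observed, label_set), label_set)
        for v in unobserved
    }
    current = dict(observed)
    current.update(bootstrap)
    # Iterative classification: re-estimate labels from the current assignments.
    for _ in range(max_iters):
        changed = False
        for v in unobserved:  # ordering O: here simply a fixed order
            new_label = local_classifier(
                aggregate(v, neighbors, current, label_set), label_set
            )
            if new_label != current[v]:
                current[v] = new_label
                changed = True
        if not changed:  # all class labels have stabilized
            break
    return {v: current[v] for v in unobserved}


# Hypothetical graph: b1, b2, a1 are observed; u1, u2, u3 must be labeled.
neighbors = {
    "b1": ["u1"], "b2": ["u1"], "a1": ["u3"],
    "u1": ["b1", "b2", "u2"], "u2": ["u1"], "u3": ["a1"],
}
observed = {"b1": "DB", "b2": "DB", "a1": "ML"}
result = ica(neighbors, observed, ["ML", "DB"])
print(result)
```

In this toy graph u2 has no observed neighbors, so bootstrapping gives it the fallback label; the first iteration then relabels it once its neighbor u1 has an assignment, and the second iteration confirms that the labels are stable.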

Collective Classification with Graphical Models In addition to the approaches described above, which essentially focus on local representations and propagation methods, another approach to collective classification is to first represent the problem with a high-level global graphical model and then use learning and inference techniques for the graphical modeling approach to arrive at the correct classifications. These proposals include the use of both directed graphical models (Getoor, Segal, Taskar, & Koller, ) and undirected graphical models (Lafferty, McCallum, & Pereira, ; Taskar, Abbeel, & Koller, ). See the entry on statistical relational learning and Getoor and Taskar () for a survey of various graphical models that are suitable for collective classification. In general, these techniques can use both neighborhood labels and observed attributes


of neighbors. On the other hand, due to their generality, these techniques also tend to be less efficient than the iterative collective classification techniques. One common way of defining such a global model uses a pairwise Markov random field (pairwise MRF) (Taskar et al., ). Let $G = (V, E)$ denote a graph of random variables as before, where $V$ consists of two types of random variables: the unobserved variables, $Y$, which need to be assigned domain values from label set $L$, and the observed variables, $X$, whose values we know (see the entry on Graphical Models). Let $\Psi$ denote a set of clique potentials. $\Psi$ contains three distinct types of functions:

● For each $Y_i \in Y$, $\psi_i \in \Psi$ is a mapping $\psi_i : L \to \mathbb{R}_{\geq 0}$, where $\mathbb{R}_{\geq 0}$ is the set of nonnegative real numbers.
● For each $(Y_i, X_j) \in E$, $\psi_{ij} \in \Psi$ is a mapping $\psi_{ij} : L \to \mathbb{R}_{\geq 0}$.
● For each $(Y_i, Y_j) \in E$, $\psi_{ij} \in \Psi$ is a mapping $\psi_{ij} : L \times L \to \mathbb{R}_{\geq 0}$.

Let $x$ denote the values assigned to all the observed variables in $V$ and let $x_i$ denote the value assigned to $X_i$. Similarly, let $y$ denote any assignment to all the unobserved variables in $V$ and let $y_i$ denote a value assigned to $Y_i$. For brevity of notation we will denote by $\phi_i$ the clique potential obtained by computing $\phi_i(y_i) = \psi_i(y_i) \prod_{(Y_i, X_j) \in E} \psi_{ij}(y_i)$. We are now in a position to define a pairwise MRF. Definition A pairwise Markov random field (MRF) is given by a pair $\langle G, \Psi \rangle$ where $G$ is a graph and $\Psi$ is a set of clique potentials with $\phi_i$ and $\psi_{ij}$ as defined above. Given an assignment $y$ to all the unobserved variables $Y$, the pairwise MRF is associated with the probability distribution

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{Y_i \in Y} \phi_i(y_i) \prod_{(Y_i, Y_j) \in E} \psi_{ij}(y_i, y_j),$$

where $x$ denotes the observed values of $X$ and

$$Z(x) = \sum_{y'} \prod_{Y_i \in Y} \phi_i(y'_i) \prod_{(Y_i, Y_j) \in E} \psi_{ij}(y'_i, y'_j).$$

Given a pairwise MRF, it is conceptually simple to extract the best assignments to each unobserved variable in the network. For example, we may adopt the criterion that the best label value for $Y_i$ is simply the one corresponding to the highest marginal probability obtained by summing over all other variables from the probability distribution associated with the pairwise MRF. Computationally, however, this is difficult to achieve since computing one marginal probability

requires summing over an exponentially large number of terms. Hence, approximate inference algorithms are typically employed, the two most common being loopy belief propagation (LBP) and mean-field relaxation labeling.
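To make the definition concrete, the Python sketch below computes exact marginals of a tiny pairwise MRF by enumerating every joint assignment, which is the exponential-cost computation that approximate inference is meant to avoid. The three-node chain, the potential values, and the function name `exact_marginals` are assumptions chosen for illustration; the node potentials play the role of the combined potentials $\phi_i$ after the observed neighbors have been folded in.

```python
from itertools import product


def exact_marginals(labels, nodes, phi, psi):
    """Exact marginals P(Y_i = l | x) of a pairwise MRF by brute-force enumeration.

    phi[i][l] is the node potential phi_i(l); psi[(i, j)][(li, lj)] is the
    edge potential psi_ij(li, lj). The loop visits len(labels) ** len(nodes)
    assignments, so this is only feasible for very small networks.
    """
    totals = {i: {l: 0.0 for l in labels} for i in nodes}
    z = 0.0  # partition function Z(x)
    for values in product(labels, repeat=len(nodes)):
        y = dict(zip(nodes, values))
        score = 1.0
        for i in nodes:
            score *= phi[i][y[i]]
        for (i, j), table in psi.items():
            score *= table[(y[i], y[j])]
        z += score
        for i in nodes:
            totals[i][y[i]] += score
    return {i: {l: totals[i][l] / z for l in labels} for i in nodes}


# Hypothetical chain Y1 - Y2 - Y3; only Y1 has evidence favoring label "A".
labels = ["A", "B"]
nodes = ["Y1", "Y2", "Y3"]
phi = {
    "Y1": {"A": 4.0, "B": 1.0},
    "Y2": {"A": 1.0, "B": 1.0},
    "Y3": {"A": 1.0, "B": 1.0},
}
# Edge potential that rewards neighboring nodes for agreeing on a label.
agree = {(a, b): 2.0 if a == b else 1.0 for a in labels for b in labels}
psi = {("Y1", "Y2"): agree, ("Y2", "Y3"): agree}

marginals = exact_marginals(labels, nodes, phi, psi)
best = {i: max(labels, key=lambda l: marginals[i][l]) for i in nodes}
print(marginals)
print(best)
```

The evidence at Y1 propagates along the chain through the agreement potentials, so the marginal probability of "A" decreases with distance from Y1 but remains above one half at every node.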

Applications Due to its general applicability, collective classification has been applied to a number of real-world problems. Foremost in this list is document classification. Chakrabarti et al. () were among the first to apply collective classification to corpora of patents linked via hyperlinks and reported that considering attributes of neighboring documents actually hurts classification performance. Slattery and Craven () also considered the problem of document classification by constructing features from neighboring documents using an inductive logic programming rule learner. Yang, Slattery, and Ghani () conducted an in-depth investigation over multiple datasets commonly used for document classification experiments and identified different patterns. Other applications of collective classification include object labeling in images (Hummel & Zucker, ), analysis of spatial statistics (Besag, ), iterative decoding (Berrou, Glavieux, & Thitimajshima, ), part-of-speech tagging (Lafferty et al., ), classification of hypertext documents using hyperlinks (Taskar et al., ), link prediction (Getoor, Friedman, Koller, & Taskar, ; Taskar, Wong, Abbeel, & Koller, ), optical character recognition (Taskar, Guestrin, & Koller, ), entity resolution in sensor networks (Chen, Wainwright, Cetin, & Willsky, ), predicting disulphide bonds in protein molecules (Taskar, Chatalbashev, Koller, & Guestrin, ), segmentation of 3D scan data (Anguelov et al., ), and classification of e-mail speech acts (Carvalho & Cohen, ). Recently, there have also been attempts to extend collective classification techniques to the semi-supervised learning scenario (Lu & Getoor, b; Macskassy, ; Xu, Wilkinson, Southey, & Schuurmans, ).

Cross References
→ Decision Trees
→ Inductive Logic Programming
→ Learning From Structured Data
→ Relational Learning
→ Semi-Supervised Learning
→ Statistical Relational Learning

Recommended Reading
Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., et al. (). Discriminative learning of Markov random fields for segmentation of 3D scan data. In IEEE computer society conference on computer vision and pattern recognition. IEEE Computer Society, Washington D.C.
Berrou, C., Glavieux, A., & Thitimajshima, P. (). Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proceedings of IEEE international communications conference, Geneva, Switzerland, IEEE.
Besag, J. (). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, B-, –.
Carvalho, V., & Cohen, W. W. (). On the collective classification of email speech acts. In Special interest group on information retrieval, Salvador, Brazil, ACM.
Chakrabarti, S., Dom, B., & Indyk, P. (). Enhanced hypertext categorization using hyperlinks. In International conference on management of data, Seattle, Washington. New York: ACM.
Chen, L., Wainwright, M., Cetin, M., & Willsky, A. (). Multitarget-multisensor data association using the tree-reweighted max-product algorithm. In SPIE Aerosense conference. Orlando, Florida.
Getoor, L. (). Link-based classification. In Advanced methods for knowledge discovery from complex data. New York: Springer.
Getoor, L., & Taskar, B. (Eds.). (). Introduction to statistical relational learning. Cambridge, MA: MIT Press.
Getoor, L., Segal, E., Taskar, B., & Koller, D. (). Probabilistic models of text and link structure for hypertext classification. In Proceedings of the IJCAI workshop on text learning: Beyond supervision, Seattle, WA.
Getoor, L., Friedman, N., Koller, D., & Taskar, B. (). Learning probabilistic models of link structure. Journal of Machine Learning Research, , –.
Hummel, R., & Zucker, S. (). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, , –.
Jensen, D., Neville, J., & Gallagher, B. (). Why collective inference improves relational classification. In Proceedings of the th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA. ACM.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the international conference on machine learning, Washington DC. San Francisco, CA: Morgan Kaufmann.
Lu, Q., & Getoor, L. (a). Link based classification. In Proceedings of the international conference on machine learning. AAAI Press, Washington, D.C.
Lu, Q., & Getoor, L. (b). Link-based classification using labeled and unlabeled data. In ICML workshop on the continuum from labeled to unlabeled data in machine learning and data mining. Washington, D.C.
Macskassy, S., & Provost, F. (). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, , –.
Macskassy, S. A. (). Improving learning in networked data by combining explicit and mined links. In Proceedings of the twenty-second conference on artificial intelligence. AAAI Press, Vancouver, Canada.
McDowell, L. K., Gupta, K. M., & Aha, D. W. (). Cautious inference in collective classification. In Proceedings of AAAI. AAAI Press, Vancouver, Canada.
Neville, J., & Jensen, D. (). Relational dependency networks. Journal of Machine Learning Research, , –.
Neville, J., & Jensen, D. (). Iterative classification in relational data. In Workshop on statistical relational learning, AAAI.
Slattery, S., & Craven, M. (). Combining statistical and relational methods for learning in hypertext domains. In International conferences on inductive logic programming. Springer-Verlag, London, UK.
Taskar, B., Abbeel, P., & Koller, D. (). Discriminative probabilistic models for relational data. In Proceedings of the annual conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, CA.
Taskar, B., Guestrin, C., & Koller, D. (a). Max-margin Markov networks. In Neural information processing systems. MIT Press, Cambridge, MA.
Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (b). Link prediction in relational data. In Neural information processing systems. MIT Press, Cambridge, MA.
Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (). Learning structured prediction models: A large margin approach. In Proceedings of the international conference on machine learning. ACM, New York, NY.
Xu, L., Wilkinson, D., Southey, F., & Schuurmans, D. (). Discriminative unsupervised learning of structured predictors. In Proceedings of the international conference on machine learning. ACM, New York, NY.
Yang, Y., Slattery, S., & Ghani, R. (). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, (–), –.

Commercial Email Filtering → Text Mining for Spam Filtering

Committee Machines → Ensemble Learning

Community Detection → Group Detection


Comparable Corpus A comparable corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are on the same topics as the documents in the others. The prototypical example of a comparable corpus is a collection of newspaper articles written in different languages and reporting on the same events: while they will not be, strictly speaking, translations of one another, they will share most of their semantic content. Some methods for cross-language text mining rely, totally or partially, on the statistical properties of comparable corpora.

Competitive Coevolution → Test-Based Coevolution

Competitive Learning Competitive learning is an artificial neural network learning process in which different neurons or processing elements compete for the right to learn to represent the current input. In its purest form, competitive learning takes place in so-called winner-take-all networks, where only the neuron that best represents the input is allowed to learn. Since each neuron learns to better represent the kinds of inputs it is already good at representing, the neurons become specialized to represent different kinds of inputs. For vector-valued inputs and representations, the input becomes quantized to the unit having the closest representation (model), and the representations are adapted to minimize the representation error using stochastic gradient descent. Competitive learning networks have been studied as models of how receptive fields and feature detectors, such as orientation-selective visual neurons, develop in neural networks. The same process is at work in online K-means clustering, and variants of it in Self-Organizing Maps (SOM) and the EM algorithm of mixture models.
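The winner-take-all update described in this entry can be sketched in a few lines of Python. This is a minimal illustration in the style of online K-means, not a reference implementation; the learning rate, the number of epochs, the initial unit positions, and the two-cluster toy data are all assumptions made for the example.

```python
import random


def winner(units, x):
    """Index of the unit whose weight vector is closest to the input x."""
    dists = [sum((w - xi) ** 2 for w, xi in zip(unit, x)) for unit in units]
    return dists.index(min(dists))


def train(units, inputs, rate=0.1, epochs=20, seed=0):
    """Winner-take-all competitive learning on vector-valued inputs."""
    rng = random.Random(seed)
    units = [list(u) for u in units]
    data = list(inputs)
    for _ in range(epochs):
        rng.shuffle(data)
        for x in data:
            k = winner(units, x)
            # Only the winner learns: a stochastic gradient step that reduces
            # the squared representation error for this input.
            units[k] = [w + rate * (xi - w) for w, xi in zip(units[k], x)]
    return units


# Hypothetical data: one cluster near the origin, one near (10.5, 10.5).
inputs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
trained = train([(2.0, 2.0), (8.0, 8.0)], inputs)
print(trained)
```

Each unit ends up inside the cluster whose points it wins, so a new input is quantized to the nearer of the two learned representations.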

Complex Adaptive System → Complexity in Adaptive Systems

Complexity in Adaptive Systems Jun He Aberystwyth University, Wales, UK

Synonyms Adaptive system; Complex adaptive system

Definition An adaptive system, or complex adaptive system, is a special case of a complex system that is able to adapt its behavior according to changes in its environment or in parts of the system itself. In this way, the system can improve its performance through a continuing interaction with its environment. The concept of complexity in an adaptive system is used to analyze the interactive relationship between the system and its environment, which can be classified into two types: internal complexity for model complexity, and external complexity for data complexity. The external complexity is defined by the amount of input, information, or energy that the system receives from its environment. The internal complexity refers to the complexity of how the system represents these inputs through its internal process.

Motivation and Background Adaptive systems range from natural systems to artificial systems (Holland, , ; Waldrop, ). Examples of natural systems include ant colonies, ecosystems, the brain, neural networks and immune systems, cells, and developing embryos; examples of artificial systems include stock markets, social systems, manufacturing businesses, and human social group-based


endeavor in a cultural and social system, such as political parties or communities. All these systems have a common feature: they can adapt to their environment. An adaptive system is adaptive in that it has the capacity to change its internal structure in order to adapt to its environment. It is complex in the sense that it interacts with its environment. The interaction between an adaptive system and its environment is dynamic and nonlinear. Complexity emerges from the interaction between the system and its environment and among the elements of the system, where the emergent macroscopic patterns are more complex than the sum of the low-level (microscopic) elements encompassed in the system. Understanding the evolution and development of adaptive systems still faces many mathematical challenges (Levin, ). The concepts of external and internal complexities are used to analyze the relation between an adaptive system and its environment. The description given below is based on Jürgen Jost’s () work, which introduced these two concepts and applied the theoretical framework to the construction of learning models, e.g., to design neural network architectures. In the following, the concepts are mainly applied to analyze the interaction between the system and its environment. The interaction among individual elements of the system is discussed less; however, the concepts can be explored in that situation too.

Theory Adaptive System Environment and Regularities

The environment of an adaptive system is more complex than the system itself, and its changes cannot be completely predicted by the system. However, the changes of the environment are not purely random and noisy; there exist regularities in the environment. An adaptive system can recognize these regularities and express them through its internal process in order to adapt to the environment. The input that an adaptive system receives or extracts from its environment usually includes two parts: one is the part with regularities, and the other is the part that appears random to the system. The part with regularities is useful and meaningful. An adaptive system will represent these regularities by internal processes. But the part of random input is useless, and at worst it will be detrimental for an adaptive system. However, which part of the input is meaningful and regular, and which part is random and devoid of meaning and structure, depends on the adaptive system’s internal model of the external environment. An adaptive system will translate the external regularities into its internal ones, and only the regularities are useful to the system. The system tries to extract as many regularities as possible, and to represent these regularities as efficiently as possible in order to make optimal use of its capacity. The notions of external complexity and internal complexity are used to investigate these two complementary aspects conceptually and quantitatively. In terms of these notions, an adaptive system aims to increase its external complexity and reduce its internal complexity. The two processes operate on their own time scales but are intricately linked and mutually dependent. For example, the internal complexity will be reduced only if the external complexity is fixed. Under fixed inputs received from the external environment, an adaptive system can represent these inputs more efficiently and optimize its internal structure. If the external complexity is increased, e.g., if additional new input must be handled by the system, then it is necessary to increase its internal complexity. The increase of internal complexity may occur through the creation of redundancy in the existing adaptive system, e.g., by duplicating some internal structures, which then enables the system to handle more external input. Once the input is fixed, the adaptive system will then represent the input as efficiently as possible and reduce the internal complexity. The decrease of internal complexity can be achieved through discarding some input as meaningless and irrelevant, e.g., leaving some regularities out for the purpose. Since the inputs relevant to the system are those which can be reflected in the internal model, the external complexity is not equivalent to the amount of raw data received from the environment. In fact, it is only relevant to the inputs which can be processed in the internal model, or observations in some adaptive systems. Thus the external complexity ultimately is decided by the internal model constructed by the system.


External and Internal Complexities

External complexity means data complexity, which measures the amount of input received from the environment for the system to handle and process. Such a complexity can be measured by entropy in the sense of information theory. Internal complexity is model complexity, which measures the complexity of a model for representing the input or information received by the system. The aim of the adaptive system is to obtain an efficient model that is as simple as possible, with the capacity to handle as much input as possible. On one hand, the adaptive system will try to maximize its external complexity and thereby adapt to its environment in a maximal way; on the other hand, it will try to minimize its internal complexity and thereby construct a model that processes the input in the most efficient way. These two aims sometimes seem conflicting, but such a conflict can be avoided when the two processes operate on different time scales. Given a model, the system will organize the input data and try to increase its ability to deal with the input from its environment, and thus increase its external complexity. Conversely, given the input, it tries to simplify the model that represents that input and thus to decrease the internal complexity. The meaning of the input depends on the time scale under investigation. On a short time scale, for example, the input may consist of individual signals, but on a long time scale, it will be a sequence of signals that follows a probability distribution. A good internal model tries to express regularities in the input sequence, rather than several individual signals, and the decrease of internal complexity will happen on this longer time scale. A formal definition of the internal and external complexities is based on the concept of entropy from statistical mechanics and information theory. Given a model $\theta$, the system can model data as $X(\theta) = (X_1, \ldots, X_k)$, which is assumed to have an internal probability distribution $P(X(\theta))$ so that entropy can be computed. The external complexity is defined by

$$-\sum_{i=1}^{k} P(X_i(\theta)) \log P(X_i(\theta)).$$

An adaptive system tries to maximize the above external complexity. The probability distribution $P(X(\theta))$ is for quantifying the information value of the data $X(\theta)$. The value of information can be described in other approaches, e.g., by the length of the representation of the data in the internal code of the system (Rissanen, ). In this case, the optimal coding is a consequence of the minimization of internal complexity, and then the length of the representation of data $X_i(\theta)$ behaves like $-\log P(X_i(\theta))$ (Rissanen, ). On a short time scale, for a given model $\theta$, the system tries to increase the amount of meaningful input information $X(\theta)$. On a long time scale, when the input is given, e.g., when the system has gathered a set of inputs on a time scale with a stationary probability distribution of input patterns $\Xi$, then the model should be improved to handle the input as efficiently as possible and reduce the complexity of the model. This complexity, or internal complexity, is defined by

$$-\sum_{i=1}^{k} P(\Xi_i \mid \theta) \log P(\Xi_i \mid \theta) - \log P(\theta),$$

with respect to the model $\theta$. If Rissanen’s () minimum description length principle is applied to the above formula, then the optimal model will satisfy the variational problem

$$\min_{\theta} \left( -\log P(\Xi \mid \theta) - \log P(\theta) \right).$$

In the above minimization problem, there are two terms to minimize. The first term measures how efficiently the model represents or encodes the data, and the second measures how complicated the model is. In computer science, this latter term corresponds to the length of the program required to encode the model. The concepts of external and internal complexities can be applied to a system divided into subsystems. In this case, some internal part of the original whole system will become external to a subsystem. Thus the external input of a subsystem consists of the original external input and also input from the rest of the system, i.e., other subsystems.
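The two quantities used in this section, the entropy that measures external complexity and the two-part objective $-\log P(\Xi \mid \theta) - \log P(\theta)$, can be evaluated directly for small discrete examples. The Python sketch below is only an illustration under stated assumptions: the symbol data, the two candidate models, and in particular the prior probabilities assigned to the models are hypothetical values chosen to show the trade-off, and the data are treated as independent symbols.

```python
import math


def entropy(probs):
    """Entropy -sum_i p_i log p_i (natural log), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def description_length(data, model_probs, model_prior):
    """Two-part objective -log P(data | theta) - log P(theta).

    Assumes the symbols in `data` are independent draws under the model.
    """
    neg_log_likelihood = -sum(math.log(model_probs[s]) for s in data)
    return neg_log_likelihood - math.log(model_prior)


# Two hypothetical models of a binary source, with assumed prior probabilities:
# the simple uniform model is taken to be a priori more probable (shorter code).
uniform = {"probs": {"a": 0.5, "b": 0.5}, "prior": 0.5}
biased = {"probs": {"a": 0.875, "b": 0.125}, "prior": 0.1}


def best_model(data):
    """Name of the candidate model with the smaller two-part objective."""
    scores = {
        "uniform": description_length(data, uniform["probs"], uniform["prior"]),
        "biased": description_length(data, biased["probs"], biased["prior"]),
    }
    return min(scores, key=scores.get)


print(entropy([0.5, 0.5]), entropy([0.875, 0.125]))
print(best_model("ab"), best_model("aaaaaaab"))
```

With only two symbols observed, the cost of the more specific model outweighs its better fit and the uniform model is selected; with eight symbols showing a clear regularity, the better fit of the biased model pays for its higher model cost.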

Application: Learning The discussion of these two concepts, external and internal complexities, can be placed in the context of learning. In statistical learning theory (Vapnik, ), the criterion for evaluating a learning process is the expected prediction error on future data by the model


based on a training data set with partial and incomplete information. The task is to construct a probability distribution drawn from an a priori specified class for representing the distribution underlying the input data received. Usually, if a higher error is produced by a model on the training data, then a higher error will be expected on the future data. The error will depend on two factors: one is the accuracy of the model on the training data set, the other is the simplicity of the model itself. The description of the data set can be split into two parts: the regular part, which is useful in constructing the model, and the random part, which is noise to the model. The learning process fits very well into the theoretical framework of internal and external complexities. If the model is too complicated, it runs the risk of overfitting the training data. In this case, some spurious or putative regularity is incorporated into the model, which will not appear in the future data. The model should therefore be constrained within some model class with bounded complexity. In the context of statistical learning theory, this complexity is measured by the Vapnik-Chervonenkis dimension (see VC Dimension) (Vapnik, ). Under the simplest form of statistical learning theory, the system aims at finding a representation with the smallest error in a class with given complexity constraints; the model should then minimize the expected error on future data and also the over-fitting error. The two concepts of over-fitting and leaving out regularities can be distinguished in the following sense. The former is caused by the noise in the data, i.e., the random part of the data, and this leads to putative regularities, which will not appear in the future data. The latter, leaving out regularities, means that the system can forgo some part of the regularities in the data, or that it is possible to compress the data. Thus, leaving out regularities can be used to simplify the model and reduce the internal complexity. However, a question remains open here: which regularities in the data set are useful for data compression and also meaningful for future prediction, and which parts are random to the model? The internal complexity is the model complexity. If the internal complexity is chosen too small, then the model does not have enough capacity to represent all the important features of the data set. If the internal complexity is too large, on the other hand, then the


model does not represent the data efficiently. The internal complexity is preferably minimized under appropriate constraints on the adequacy of the representation of the data. This is consistent with Rissanen’s principle of Minimum Description Length (Rissanen, ), which represents a given data set in the most efficient way. Thus a good model is both simple in itself and efficient in representing the data. The external complexity is the data complexity, which should be large in order to represent the input accurately. This is related to Jaynes’ principle of maximizing ignorance (Jaynes, ), under which a model for representing data should have the maximal possible entropy subject to the constraint that all regularities can be reproduced. In this way, putative regularities can be eliminated from the model. However, this principle should be applied with some conditions, as argued by Gell-Mann and Lloyd (): it must not eliminate the essential regularities in the data, and an overly complex model should be avoided. For some learning systems, only a selection of the data is gathered and observed by the system. Thus a middle term, observation, is added between model and data. The concept of observation refers to the extraction of the value of some specific quantity from given data or a data pool. What a system can observe depends on its internal structure and its general model of the environment. The system does not have direct access to the raw data; it constructs a model of the environment solely on the basis of the values of its observations. For this kind of learning system, Jaynes’ principle (Jaynes, ) is still applicable for increasing the external complexity: for the given observations made on a data set, the maximum entropy representation should be selected. However, this principle is still subject to the modification of Gell-Mann and Lloyd (), under which the model should not lose the essential regularities observed in the data.
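As a small illustration of the maximum entropy principle discussed above, the following Python sketch computes the maximum entropy distribution over a finite set of outcomes when the only regularity to be reproduced is a prescribed mean. The choice of constraint, the function names, and the bisection bracket are assumptions made for this example; they are not taken from the works cited.

```python
import math

def max_entropy_distribution(values, target_mean, tol=1e-12):
    """Maximum entropy distribution over `values` with a fixed mean.

    The solution has the exponential form p_i proportional to
    exp(lam * v_i); the multiplier `lam` is found by bisection,
    using the fact that the mean increases monotonically with lam.
    """
    def mean_for(lam):
        weights = [math.exp(lam * v) for v in values]
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, values)) / total

    lo, hi = -50.0, 50.0  # assumed bracket for the multiplier
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    weights = [math.exp(lam * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

faces = [1, 2, 3, 4, 5, 6]
unconstrained = max_entropy_distribution(faces, 3.5)  # close to uniform
biased = max_entropy_distribution(faces, 4.5)         # skewed toward high faces
```

With a mean of 3.5 the constraint carries no information beyond symmetry and the result is essentially uniform; a mean of 4.5 forces a lower-entropy distribution, which is the least committal one consistent with that single observed regularity.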
By contrast, the observations should be selected to reduce the internal complexity. Given a model, if observations can be made on a given data set, then they should be selected so as to minimize the resulting entropy of the model, with the purpose of minimizing the uncertainty left about the data. This reduces the complexity. In most cases, the environment is dynamic, i.e., the data set itself can vary, and then the external


complexity should be maximized again. Thus the observations should be chosen for maximal information gain extracted from the data, to increase the external complexity. Jaynes’ principle (Jaynes, ) can be applied in the same way as in the previous discussion. But on a longer time scale, when the inputs reach some stationary distribution, the model should be simplified to reduce its internal complexity.

Recommended Reading
Gell-Mann, M., & Lloyd, S. (). Information measures, effective complexity, and total information. Complexity, (), –.
Holland, J. (). Adaptation in natural and artificial systems. Cambridge, MA: MIT Press.
Holland, J. (). Hidden order: How adaptation builds complexity. Reading, MA: Addison-Wesley.
Jaynes, E. (). Information theory and statistical mechanics. Physical Review, (), –.
Jost, J. (). External and internal complexity of complex adaptive systems. Theory in Biosciences, (), –.
Levin, S. (). Complex adaptive systems: Exploring the known, the unknown and the unknowable. Bulletin of the American Mathematical Society, (), –.
Rissanen, J. (). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Vapnik, V. (). Statistical learning theory. New York: John Wiley & Sons.
Waldrop, M. (). Complexity: The emerging science at the edge of order and chaos. New York: Simon & Schuster.

Complexity of Inductive Inference
Sanjay Jain, Frank Stephan
National University of Singapore, Singapore, Republic of Singapore

Definition

In inductive inference, the complexity of learning can be measured in various ways: by the number of hypotheses issued in the worst case until the correct hypothesis is found; by the number of data items to be consumed or to be memorized in order to learn in the worst case; by the Turing degree of oracles needed to learn the class under a certain criterion; by the intrinsic complexity, which is, like the Turing degrees in recursion theory, a way to measure the complexity of classes by using reducibilities between them.

Detail

We refer the reader to the article Inductive Inference for basic definitions in inductive inference and the notations used below. Let $N$ denote the set of natural numbers. Let $\varphi_0, \varphi_1, \ldots$ denote a fixed acceptable programming system (Rogers, ). Let $W_i = \mathrm{domain}(\varphi_i)$.

Mind Changes and Anomalies

The first measure of complexity of learning is the number of mind changes needed before the learner converges to its final hypothesis in the TxtEx model of learning. The number of mind changes made by a learner $M$ on a text $T$ is $\mathrm{card}(\{m : {?} \neq M(T[m]) \neq M(T[m+1])\})$. A learner $M$ $\mathrm{TxtEx}_n$-learns a class $\mathcal{L}$ of languages iff $M$ TxtEx-learns $\mathcal{L}$ and, for all $L \in \mathcal{L}$ and all texts $T$ for $L$, $M$ makes at most $n$ mind changes on $T$. $\mathrm{TxtEx}_n$ is defined as the collection of language classes that can be $\mathrm{TxtEx}_n$-identified (see Case & Smith () for details). Consider the class of languages $\mathcal{L}_n = \{L : \mathrm{card}(L) \leq n\}$. It can be shown that $\mathcal{L}_{n+1} \in \mathrm{TxtEx}_{n+1} - \mathrm{TxtEx}_n$.

Now consider anomalous learning. A class $\mathcal{C}$ is $\mathrm{TxtEx}^a_b$-learnable iff there is a learner that makes at most $b$ mind changes (where $b = *$ denotes that the number of mind changes is finite on each text for a language in the class, but not necessarily bounded by a constant) and whose final hypothesis is allowed to make up to $a$ errors (where $a = *$ denotes finitely many errors). For these learning criteria, we get a two-dimensional hierarchy of what can be learnt. Let $C_n = \{f : \varphi_{f(0)} =^n f\}$. For a total function $f$, let $L_f = \{\langle x, f(x) \rangle : x \in N\}$, where $\langle \cdot, \cdot \rangle$ denotes a computable pairing function: a bijective mapping from $N \times N$ to $N$. Let $L_C = \{L_f : f \in C\}$. Then one can show that $L_{C_{n+1}} \in \mathrm{TxtEx}^{n+1}_0 - \mathrm{TxtEx}^n$. Similarly, if we consider the class $S_n = \{f : \mathrm{card}(\{m : f(m) \neq f(m+1)\}) \leq n\}$, then one can show that $L_{S_{n+1}} \in \mathrm{TxtEx}^0_{n+1} - \mathrm{TxtEx}^*_n$ (we refer the reader to Case and Smith () for a proof of the above).
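The mind-change count defined above can be made concrete on a finite prefix of a text. The Python sketch below is only illustrative: hypotheses are represented directly as finite sets rather than as grammar indices, the pause symbol is written "#", and the names `count_mind_changes` and `finite_set_learner` are hypothetical.

```python
def count_mind_changes(learner, text):
    """Count positions m with ? != M(T[m]) != M(T[m+1]).

    `text` is a finite prefix of a text; T[m] is its first m items.
    """
    changes = 0
    for m in range(len(text)):
        before = learner(text[:m])
        after = learner(text[:m + 1])
        if before != "?" and before != after:
            changes += 1
    return changes

def finite_set_learner(prefix):
    """Conjecture exactly the set of elements seen so far.

    The frozenset stands in for a grammar of that finite language
    (an assumption of this sketch); '#' marks a pause in the text.
    """
    return frozenset(x for x in prefix if x != "#")

# A prefix of a text for the language {1, 3, 5}, which has 3 elements.
prefix = [3, "#", 3, 5, 5, 1]
changes = count_mind_changes(finite_set_learner, prefix)
```

On any text for a language with at most n elements, this learner starts with the empty set and revises its conjecture once per new element, so it never makes more than n mind changes.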

Data and Time Complexity

Wiehagen () considered complexity in terms of the number of data items needed for learning. Regarding time complexity, one should note the result by Pitt () that any TxtEx-learnable class of languages can be TxtEx-learnt by a learner that has time complexity (with respect to


the size of the input) bounded by a linear function. This result is achieved by a delaying trick, where the learner just repeats its old hypothesis unless it has enough time to compute its later hypothesis. This seriously affects what one can say about the time complexity of learning. One proposal, made by Daley and Smith (), is to consider the total time used by the learner until its sequence of hypotheses converges, resulting in a possibly more reasonable measure of time in the complexity of learning.

Iterative and Memory-Bounded Learning

Another measure of complexity of learning can be considered when one restricts how much past data a learner can remember. Wiehagen introduced the concept of iterative learning, in which the learner cannot remember any past data. Its new hypothesis is based only on its previous conjecture and the new datum it receives. In other words, there exists a recursive function $F$ such that $M(T[n+1]) = F(M(T[n]), T(n))$ for all texts $T$ and all $n$. Here, $M(T[0])$ is some fixed value, say the symbol ‘?’, which is used by the learner to denote the absence of a reasonable conjecture. It can be shown that being iterative restricts the learning capacity of learners. For example, let $L_e = \{2x : x \in N\}$ and let $\mathcal{L} = \{L_e\} \cup \{S \cup \{2n+1\} : n \in N, S \subseteq L_e, \text{ and } \max(S) \leq 2n\}$; then $\mathcal{L}$ can be shown to be TxtEx-learnable but not iteratively learnable.

Memory-bounded learning (see Lange & Zeugmann, ) is an extension of memory-limited learning, where the learner is allowed to memorize up to some fixed number of elements seen in the past. Thus, $M$ is an $m$-memory-bounded learner if there exists a function $\mathrm{mem}$ and two computable functions $mF$ and $F$ such that, for all texts $T$ and all $n$:
– $\mathrm{mem}(T[0]) = \emptyset$;
– $M(T[n+1]) = F(M(T[n]), \mathrm{mem}(T[n]), T(n+1))$;
– $\mathrm{mem}(T[n+1]) = mF(M(T[n]), \mathrm{mem}(T[n]), T(n+1))$;
– $\mathrm{mem}(T[n+1]) - \mathrm{mem}(T[n]) \subseteq \{T(n+1)\}$;
– $\mathrm{card}(\mathrm{mem}(T[n])) \leq m$.
It can be shown that the criteria of inference based on TxtEx-learning by $m$-memory-bounded learners form a proper hierarchy.

Besides memorizing some past elements seen, another way to address this issue is by giving feedback to the learner (see Case, Jain, Lange, & Zeugmann, ) on whether some element has appeared in the past data. A feedback learner is an iterative learner that is additionally allowed to query whether certain elements appeared in earlier data. An $n$-feedback learner is allowed to make $n$ such queries at each stage (when it receives the new input datum). Thus, $M$ is an $m$-feedback learner if there exist computable functions $Q$ and $F$ such that, for all texts $T$ and all $n$:
– $Q(M(T[n]), T(n))$ is defined and is a set of $m$ elements;
– if $Q(M(T[n]), T(n)) = (x_1, x_2, \ldots, x_m)$, then $M(T[n+1]) = F(M(T[n]), T(n), y_1, y_2, \ldots, y_m)$, where $y_i = 1$ iff $x_i \in \mathrm{ctnt}(T[n])$.

Again, it can be shown that allowing more feedback gives greater learning power, and thus one can get a hierarchy based on the amount of feedback allowed.
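The iterative update rule described in this section, in which the new hypothesis depends only on the previous conjecture and the current datum, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the learned class consists of the languages {x : x ≥ n}, a hypothesis is represented by the number n itself rather than by a grammar index, "#" denotes a pause in the text, and the function names are hypothetical.

```python
def run_iterative(F, text, initial="?"):
    """Drive an iterative learner: each new hypothesis is computed
    from the previous hypothesis and the current datum only."""
    hypothesis = initial
    trace = []
    for datum in text:
        hypothesis = F(hypothesis, datum)
        trace.append(hypothesis)
    return trace

def coinit_step(hypothesis, datum):
    """One update for the class of languages {x : x >= n}.

    The hypothesis n stands for the language {x : x >= n}; the
    learner only needs to track the smallest element seen so far.
    """
    if datum == "#":          # a pause carries no information
        return hypothesis
    if hypothesis == "?":     # first real datum
        return datum
    return min(hypothesis, datum)

# A prefix of a text for {x : x >= 3}.
trace = run_iterative(coinit_step, [7, "#", 5, 9, 5, 3, 8])
```

No past data are stored: the minimum seen so far is already encoded in the current conjecture, which is why this class is learnable without any memory or feedback queries.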

Complexity of Final Hypothesis

Another way to assess the complexity of learning is to consider the complexity or size of the final grammar output by the learner. Freivalds () considered the case when the final program/grammar output by the learner is minimal: that is, there is no smaller index that accepts/generates the same language. He showed that this severely restricts the learning capacity of learners. Moreover, the learning capacity depends on the acceptable programming system chosen, unlike most other criteria of learning, such as TxtEx or TxtBc, which are independent of the acceptable programming system chosen. In particular, there are acceptable programming systems in which only classes containing finitely many infinite languages can be learnt using minimal final grammars (see Freivalds, ; Jain and Sharma, ). Chen () considered a modification of this paradigm in which one considers convergence to nearly minimal grammars rather than minimal ones. That is, instead of requiring that the final grammars are minimal, one requires that they are within a recursive function h of minimal. Here h may depend on the class being learnt. Chen showed that this allows one to have the criteria of minimal learnability


to be independent of the acceptable programming system chosen. However, one can show that some simple classes are not minimally learnable. An example of such a class is the class $L_C$ derived from $C = \{f : \forall^\infty x\, [f(x) = 0]\}$, the class of all functions that are 0 almost everywhere.

Intrinsic Complexity

Another way to consider complexity of learning is to consider relative complexity in a way similar to how one considers Turing reductions in computability theory. Such a notion is called the intrinsic complexity of the class. This was first considered by Freivalds et al. () for function learning. Jain and Sharma () considered it for language learning, and the following discussion is from there. An enumeration operator (see Rogers, ), $\Theta$, is an algorithmic mapping from SEQ into SEQ such that the following two conditions are satisfied:
– for all $\sigma, \tau \in \mathrm{SEQ}$, if $\sigma \subseteq \tau$, then $\Theta(\sigma) \subseteq \Theta(\tau)$;
– for all texts $T$, $\lim_{n \to \infty} |\Theta(T[n])| = \infty$.
By extension, we think of $\Theta$ as also mapping texts to texts such that $\Theta(T) = \bigcup_n \Theta(T[n])$. Furthermore, we define $\Theta(L) = \{\mathrm{ctnt}(\Theta(T)) : T \text{ is a text for } L\}$. Intuitively, $\Theta(L)$ denotes the set of languages to whose texts $\Theta$ maps texts of $L$. The reader should note the overloading of this notation, because the type of the argument to $\Theta$ could be a sequence, a text, or a language.

One says that a sequence of grammars $g_0, g_1, \ldots$ is an acceptable TxtEx-sequence for $L$ if the sequence of grammars converges to a grammar for $L$. $\mathcal{L}_1 \leq_{\mathrm{weak}} \mathcal{L}_2$ iff there are two operators $\Theta$ and $\Psi$ such that, for all $L \in \mathcal{L}_1$ and all texts $T$ for $L$, $\Theta(T)$ is a text for some $L' \in \mathcal{L}_2$ such that if $g_0, g_1, \ldots$ is an acceptable TxtEx-sequence for $L'$, then $\Psi(g_0, g_1, \ldots)$ is an acceptable TxtEx-sequence for $L$. Note that different texts for the same language $L$ may be mapped by $\Theta$ to texts for different languages in $\mathcal{L}_2$ above. If we require that different texts for $L$ are mapped to texts for the same language $L'$ in $\mathcal{L}_2$, then we get a stronger notion of reduction called strong reduction: $\mathcal{L}_1 \leq_{\mathrm{strong}} \mathcal{L}_2$ iff $\mathcal{L}_1 \leq_{\mathrm{weak}} \mathcal{L}_2$ and, for all $L \in \mathcal{L}_1$, $\Theta(L)$ contains only one language, where $\Theta$ is as in the definition of $\leq_{\mathrm{weak}}$ reduction.

It can be shown that FIN is a complete class for TxtEx-identification with respect to $\leq_{\mathrm{weak}}$ reduction (see Jain & Sharma, ). Interestingly, it was shown that the class of pattern languages (Angluin, ), the class $SD = \{L : W_{\min(L)} = L\}$, and the class $COINIT = \{\{x : x \geq n\} : n \in N\}$ are all equivalent under $\leq_{\mathrm{strong}}$. Let code be a bijective mapping from non-negative rational numbers to natural numbers. Then one can show that the class $RINIT = \{\{\mathrm{code}(x) : 0 \leq x \leq r, x \text{ is a rational number}\} : 0 \leq r \leq 1, r \text{ is a rational number}\}$ is $\leq_{\mathrm{strong}}$-complete for TxtEx (see Jain, Kinber, & Wiehagen, ). Interestingly, every finite directed acyclic graph can be embedded into the $\leq_{\mathrm{strong}}$ degree structure (Jain & Sharma, ). On the other hand, the degree structure is non-dense in the sense that there exist classes $\mathcal{L}_1$ and $\mathcal{L}_2$ such that $\mathcal{L}_1$ [...]

Year <
    Attribute A = true : Market Rising
    Attribute A = false : Market Falling
Year ≥
    Attribute B = true : Market Rising
    Attribute B = false : Market Falling

This tree contains embedded knowledge about two intervals of time: in one of which, –, attribute A is predictive; in the other, onward, attribute B is predictive. As time (in this case, year) is a monotonically increasing attribute, future classification using this decision tree will only use attribute B. If this domain can be expected to have recurring hidden context, information about the prior interval of time could be valuable. The decision tree in the example above contains information about changes in context. We define context as:

▸ Context is any attribute whose values are largely independent but tend to be stable over contiguous intervals of another attribute known as the environmental attribute.

The ability of decision trees to capture context is associated with the fact that decision tree algorithms use a form of context-sensitive feature selection (CSFS). A number of machine learning algorithms can be regarded as using CSFS, including decision tree algorithms (Quinlan, ), rule induction algorithms (Clark & Niblett, ), and ILP systems (Quinlan, ). All of these systems produce concepts containing local information about context. When contiguous intervals of time reflect a hidden attribute or context, we call time the environmental attribute. The environmental attribute is not restricted to time alone, as it could be any ordinal attribute over which instances of a hidden context are liable to be contiguous. There is also no restriction, in principle, to one dimension. Some alternatives to time as environmental attributes are dimensions of space, and space–time combinations.

Given an environmental attribute, we can utilize a CSFS machine learning algorithm to gain information on likely hidden changes in context. The accuracy of the change points found will depend upon at least the hidden context duration, the number of different contexts, the complexity of each local concept, and noise. The context change points identified by CSFS can be expected to contain errors of the following types:
1. Noise or serial correlation errors. These would take the form of additional incorrect change points.
2. Errors due to the repetition of tests on time in different parts of the concept. These would take the form of a group of values clustered around the actual point where the context changed.
3. Errors of omission, that is, change points that are missed altogether.

The initial set of identified context changes can be refined by contextual clustering. This process combines similar intervals of the dataset, where the similarity of two intervals is based upon the degree to which a partial model is accurate on both intervals.
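One of the error types associated with change points read off a learned concept is a group of candidate values clustered around a single true change. The Python sketch below shows a simple way such groups might be collapsed. It is a plain merging heuristic written for illustration, not the contextual clustering procedure of the cited work; the function name, the tolerance parameter, and the numeric values are all hypothetical.

```python
def merge_change_points(candidates, tolerance):
    """Collapse candidate change points that lie within `tolerance`
    of their predecessor, returning the mean of each group."""
    groups = []
    for point in sorted(candidates):
        if groups and point - groups[-1][-1] <= tolerance:
            groups[-1].append(point)   # extend the current cluster
        else:
            groups.append([point])     # start a new cluster
    return [sum(g) / len(g) for g in groups]

# Hypothetical thresholds on the environmental attribute (e.g., time
# steps) collected from tests in different parts of a learned tree.
candidates = [101, 250, 99, 100]
merged = merge_change_points(candidates, tolerance=3)
```

This addresses only the clustered-duplicate error; spurious change points caused by noise and change points that were missed altogether would need additional evidence, such as the accuracy of partial models on the resulting intervals.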

Recent Advances

With the increasing amount of data being generated by organizations, recent work on concept drift has focused on mining from high-volume data streams (Hulten, Spencer, & Domingos, ; Wang, Fan, Yu, & Han, ; Kolter & Maloof, ; Mierswa, Wurst, Klinkenberg, Scholz, & Euler, ; Chu & Zaniolo, ; Gaber, Zaslavsky, & Krishnaswamy, ). Methods such as that of Hulten et al. combine decision tree learning with incremental methods for efficient updates, thus avoiding relearning large decision trees. Kolter and Maloof also use incremental methods, combined in an ensemble.


Cross References
Decision Trees
Ensemble Methods
Incremental Learning
Inductive Logic Programming
Lazy Learning

Recommended Reading Aha, D. W., Kibler, D., & Albert, M. K. (). Instance-based learning algorithms. Machine Learning, , –. Chu, F., & Zaniolo, C. (). Fast and light boosting for adaptive mining of data streams. In Advances in knowledge discovery and data mining. Lecture notes in computer science (Vol. , pp. –). Springer. Clark, P., & Niblett, T. (). The CN induction algorithm. Machine Learning, , –. Clearwater, S., Cheng, T.-P., & Hirsh, H. (). Incremental batch learning. In Proceedings of the sixth international workshop on machine learning (pp. –). Morgan Kaufmann. Domingos, P. (). Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, , –. [Aha, D. (Ed.). Special issue on lazy learning.] Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (). Mining data streams: A review. SIGMOD Rec., (), –. Harries, M., & Horn, K. (). Learning stable concepts in domains with hidden changes in context. In M. Kubat & G. Widmer (Eds.), Learning in context-sensitive domains (workshop notes). th international conference on machine learning, Bari, Italy. Harries, M. B., Sammut, C., & Horn, K. (). Extracting hidden context. Machine Learning, (), –. Hulten, G., Spencer, L., & Domingos, P. (). Mining timechanging data streams. In KDD ’: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM. Kilander, F., & Jansson, C. G. (). COBBIT – A control procedure for COBWEB in the presence of concept drift. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Kolter, J. Z., & Maloof, M. A. (). Dynamic weighted majority: A new ensemble method for tracking concept drift. In Third IEEE international conference on data mining ICDM- (pp. –). IEEE CS Press. Kubat, M. (). Floating approximation in time-varying knowledge bases. Pattern Recognition Letters, , –. Kubat, M. (). A machine learning based approach to load balancing in computer networks. 
Cybernetics and Systems Journal. Kubat, M. (). Second tier for decision trees. In Machine learning: Proceedings of the th international conference (pp. –). California: Morgan Kaufmann. Kubat, M., & Widmer, G. (). Adapting to drift in continuous domains. In Proceedings of the eighth European conference on machine learning (pp. –). Berlin: Springer. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T. (). Yale: Rapid prototyping for complex data mining tasks. In KDD ’: Proceedings of the th ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM.


Quinlan, J. R. (). Learning logical definitions from relations. Machine Learning, , –. Quinlan, J. R. (). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann. Salganicoff, M. (). Density adaptive learning and forgetting. In Machine learning: Proceedings of the tenth international conference (pp. –). San Mateo: Morgan Kaufmann. Schlimmer, J. C., & Granger, R. I., Jr. (a). Beyond incremental processing: Tracking concept drift. In Proceedings AAAI- (pp. –). Los Altos: Morgan Kaufmann. Schlimmer, J., & Granger, R., Jr. (b). Incremental learning from noisy data. Machine Learning, (), –. Turney, P. D. (a). Exploiting context when learning to classify. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Turney, P. D. (b). Robust classification with context sensitive features. Paper presented at the industrial and engineering applications of artificial intelligence and expert systems. Turney, P., & Halasz, M. (). Contextual normalization applied to aircraft gas turbine engine diagnosis. Journal of Applied Intelligence, , –. Wang, H., Fan, W., Yu, P. S., & Han, J. (). Mining concept-drifting data streams using ensemble classifiers. In KDD ’: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. –). New York: ACM. Widmer, G. (). Recognition and exploitation of contextual clues via incremental meta-learning. In L. Saitta (Ed.), Machine learning: Proceedings of the th international workshop (pp. –). San Francisco: Morgan Kaufmann. Widmer, G., & Kubat, M. (). Effective learning in dynamic environments by explicit concept tracking. In P. B. Brazdil (Ed.), European conference on machine learning (pp. –). Berlin: Springer. Widmer, G., & Kubat, M. (). Learning in the presence of concept drift and hidden contexts. Machine Learning, , –.

Concept Learning
Claude Sammut
The University of New South Wales, Sydney, NSW, Australia

Synonyms

Categorization; Classification learning

Definition

The term concept learning originated in psychology, where it refers to the human ability to learn categories for objects and to recognize new instances of those categories. In machine learning, concept learning is more formally


defined as “inferring a boolean-valued function from training examples of its inputs and outputs” (Mitchell, ).

Background

Bruner, Goodnow, and Austin () published their book A Study of Thinking, which became a landmark in psychology and would later have a major impact on machine learning. The experiments reported by Bruner, Goodnow, and Austin were directed toward understanding a human’s ability to categorize and how categories are learned.

▸ We begin with what seems a paradox. The world of experience of any normal man is composed of a tremendous array of discriminably different objects, events, people, impressions. . . But were we to utilize fully our capacity for registering the differences in things and to respond to each event encountered as unique, we would soon be overwhelmed by the complexity of our environment. . . The resolution of this seeming paradox. . . is achieved by man’s capacity to categorize. To categorize is to render discriminably different things equivalent, to group objects and events and people around us into classes. . . The process of categorizing involves. . . an act of invention. . . If we have learned the class “house” as a concept, new exemplars can be readily recognised. The category becomes a tool for further use. The learning and utilization of categories represents one of the most elementary and general forms of cognition by which man adjusts to his environment.

The first question that they had to deal with was that of representation: what is a concept? They assumed that objects and events could be described by a set of attributes and were concerned with how inferences could be drawn from attributes to class membership. Categories were considered to be of three types: conjunctive, disjunctive, and relational. ▸ . . .when one learns to categorize a subset of events in a certain way, one is doing more than simply learning to recognise instances encountered. One is also learning a rule that may be applied to new instances. The concept or category is basically, this “rule of grouping” and it is

such rules that one constructs in forming and attaining concepts.

The notion of a rule as an abstract representation of a concept influenced research in machine learning. For example, decision tree learning was used as a means of creating a cognitive model of concept learning (Hunt, Marin, & Stone, ). This model later inspired Quinlan’s development of ID3 (Quinlan, ). The learning experience may be in the form of examples from a trainer or the results of trial and error. In either case, the program must be able to represent its observations of the world, and it must also be able to represent hypotheses about the patterns it may find in those observations. Thus, we will often refer to the observation language and the hypothesis language. The observation language describes the inputs and outputs of the program, and the hypothesis language describes the internal state of the learning program, which corresponds to its theory of the concepts or patterns that exist in the data. The input to a learning program consists of descriptions of objects from the universe and, in the case of supervised learning, an output value associated with each example. The universe can be an abstract one, such as the set of all natural numbers, or it may be a subset of the real world. No matter which method of representation we choose, descriptions of objects in the real world must ultimately rely on measurements of some properties of those objects. These may be physical properties such as size, weight, and color, or they may be properties defined for the objects, for example, the length of time a person has been employed, used for the purpose of approving a loan. The accuracy and reliability of a learned concept depend on the accuracy and reliability of the measurements. A program is limited in the concepts that it can learn by the representational capabilities of both the observation and hypothesis languages.
For example, if an attribute/value list is used to represent examples for an induction program, the measurement of certain attributes and not others clearly places bounds on the kinds of patterns that the learner can find. The learner is said to be biased by its observation language (see Language Bias). The hypothesis language also places constraints on what may and may not be learned. For


example, in the language of attributes and values, relationships between objects are difficult to represent, whereas a more expressive language, such as first-order logic, can easily be used to describe relationships. Unfortunately, representational power comes at a price. Learning can be viewed as a search through the space of all sentences in a language for a sentence that best describes the data. The richer the language, the larger the search space. When the search space is small, it is possible to use “brute force” search methods. If the search space is very large, additional knowledge is required to reduce the search.

Rules, Relations, and Background Knowledge

In the early s, there was no discipline called “machine learning.” Instead, learning was considered to be part of “pattern recognition,” which had not yet split from AI. One of the main problems addressed at that time was how to represent patterns so that they could be recognized easily. Symbolic description languages were developed to be expressive and learnable. Banerji (, ) first devised a language, which he called a “description list,” that used an object’s attributes to perform pattern recognition. Pennypacker, a master’s student of Banerji at the Case Institute of Technology, implemented the recognition procedure and also used Bruner, Goodnow, and Austin’s Conservative Focussing Strategy to learn conjunctive concepts (Pennypacker, ). Bruner, Goodnow, and Austin describe the strategy as follows:

▸ . . . this strategy may be described as finding a positive instance to serve as a focus, then making a sequence of choices each of which alters but one attribute value [of the focus] and testing to see whether the change yields a positive or negative instance. Those attributes of the focus which, when changed, still yield positive instance are not part of the concept. Those attributes of the focus that yield negative instances when changed are features of the concept.
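The strategy quoted above can be sketched in a few lines of Python. This is an illustrative reading of the description rather than a reproduction of Pennypacker's implementation: instances are dictionaries of attribute values, `is_positive` plays the role of the trainer who labels each altered instance, and all names and the toy domain are hypothetical. The sketch assumes a purely conjunctive target concept and at least two possible values per attribute.

```python
def conservative_focusing(focus, attribute_values, is_positive):
    """Learn a conjunctive concept from one positive focus instance.

    For each attribute, alter only that attribute of the focus and ask
    whether the altered instance is still positive. Attributes whose
    change yields a negative instance are features of the concept.
    """
    concept = {}
    for attr, values in attribute_values.items():
        # Choose some value different from the focus value
        # (assumes every attribute has at least two values).
        alternative = next(v for v in values if v != focus[attr])
        probe = dict(focus)
        probe[attr] = alternative
        if not is_positive(probe):
            concept[attr] = focus[attr]
    return concept

# Toy domain with a hidden target concept: red circles of any size.
attribute_values = {
    "shape": ["circle", "square"],
    "color": ["red", "green", "blue"],
    "size": ["small", "large"],
}

def trainer(instance):
    return instance["shape"] == "circle" and instance["color"] == "red"

focus = {"shape": "circle", "color": "red", "size": "small"}
learned = conservative_focusing(focus, attribute_values, trainer)
```

One query per attribute suffices here because, for a conjunction of attribute-value tests, changing a relevant attribute to any other value must produce a negative instance.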

The strategy is only capable of learning conjunctive concepts, that is, the concept description can only consist of a simple conjunction of tests on attribute values. Recognizing the limitations of simple attribute/value representations, Banerji () introduced the use of


predicate logic as a description language. Thus, Banerji was one of the earliest advocates of what would, many years later, become Inductive Logic Programming. In the s, a series of algorithms emerged that developed concept learning further. Winston’s ARCH program (Winston, ) was influential as one of the first widely known concept learning programs. Michalski (, ) devised the Aq family of learning algorithms that set some of the early benchmarks for learning programs. Early relational learning programs were developed by Hayes-Roth (), Hayes-Roth and McDermott (), and Vere (, ). Banerji emphasized the importance of a description language that could “grow.” That is, its descriptive power should increase as new concepts are learned. These concepts become background knowledge for future learning. A simple example from Banerji () illustrates the use of background knowledge. There is a language for describing instances of a concept and another for describing concepts. Suppose we wish to represent the binary number, , by a left-recursive binary tree of digits “0” and “1”:

[head : [head : ; tail : nil]; tail : ]

“head” and “tail” are the names of attributes. Their values follow the colon. The concepts of binary digit and binary number are defined as

$x \in \mathit{digit} \equiv x = 0 \lor x = 1$
$x \in \mathit{num} \equiv (\mathit{tail}(x) \in \mathit{digit} \land \mathit{head}(x) = \mathit{nil}) \lor (\mathit{tail}(x) \in \mathit{digit} \land \mathit{head}(x) \in \mathit{num})$

Thus, an object belongs to a particular class or concept if it satisfies the logical expression in the body of the description. Note that the concept above is disjunctive. Predicates in the expression may test the membership of an object in a previously learned concept and can express relations between objects. Cohen and Sammut () devised a learning system based on Banerji’s ideas of a growing concept description language, and this was further extended by Sammut and Banerji ().
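The two concept definitions above translate directly into recursive membership tests, which shows how a previously defined concept (digit) serves as background knowledge for a later one (num). The Python sketch below is an assumption-laden illustration: objects are encoded as dictionaries with the keys "head" and "tail", nil is represented by `None`, and the example instance is invented for this sketch rather than taken from Banerji.

```python
NIL = None  # stands for the value nil in the description language

def is_digit(x):
    """x is in digit iff x = 0 or x = 1."""
    return type(x) is int and x in (0, 1)

def is_num(x):
    """x is in num iff tail(x) is a digit and head(x) is nil or a num."""
    if not isinstance(x, dict) or set(x) != {"head", "tail"}:
        return False
    return is_digit(x["tail"]) and (x["head"] is NIL or is_num(x["head"]))

# Hypothetical instance: a left-recursive tree whose tails read 1 then 0.
example = {"head": {"head": NIL, "tail": 1}, "tail": 0}
```

Here `is_num` calls `is_digit` in the same way that the definition of num refers to the earlier concept digit, and its recursive call mirrors the second disjunct of the definition.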

Concept Learning and Noise

One of the most severe drawbacks of early concept learning systems was that they assumed that data sets


were not noisy. That is, all attribute values and class labels in the training data were assumed to be correct. This is unrealistic in most real applications. Thus, concept learning systems began incorporating statistical measures to minimize the effects of noise and to estimate error rates (Breiman, Friedman, Olshen, & Stone, ; Cohen, ; Quinlan, , ). Learning to classify objects from training examples has gone on to become one of the central themes of machine learning research. As the robustness of classification systems has increased, they have found many applications, particularly in data mining but also in a broad range of other areas.

Cross References Data Mining, Decision Tree Learning, Inductive Logic Programming, Learning as Search, Relational Learning, Rule Learning

Recommended Reading Banerji, R. B. (). An information processing program for object recognition. General Systems, , –. Banerji, R. B. (). The description list of concepts. Communications of the Association for Computing Machinery, (), –. Banerji, R. B. (). A Language for the Description of Concepts. General Systems, , –. Banerji, R. B. (). Artificial intelligence: A theoretical approach. New York: North Holland. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (). Classification and regression trees. Belmont, CA: Wadsworth. Bruner, J. S., Goodnow, J. J., & Austin, G. A. (). A study of thinking. New York: Wiley. Cohen, B. L., & Sammut, C. A. (). Object recognition and concept learning with CONFUCIUS. Pattern Recognition Journal, (), –. Cohen, W. W. (). In fast effective rule induction. In Proceedings of the twelfth international conference on machine learning, Lake Tahoe, California. Menlo Park: Morgan Kaufmann. Hayes-Roth, F. (). A structural approach to pattern learning and the acquisition of classificatory power. In First international joint conference on pattern recognition (pp. –). Washington, D.C. Hayes-Roth, F., & McDermott, J. (). Knowledge acquisition from structural descriptions. In Fifth international joint conference on artificial intelligence (pp. –). Cambridge, MA.

Hunt, E. B., Marin, J., & Stone, P. J. (). Experiments in induction. New York: Academic. Michalski, R. S. (). Discovering classification rules using variable valued logic system VL. In Third international joint conference on artificial intelligence (pp. –). Stanford, CA. Michalski, R. S. (). A theory and methodology of inductive learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto: Tioga. Mitchell, T. M. (). Machine learning. New York: McGraw-Hill. Pennypacker, J. C. (). An elementary information processor for object recognition. SRC No. -I--. Case Institute of Technology. Quinlan, J. R. (). Learning efficient classification procedures and their application to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto: Tioga. Quinlan, J. R. (). The effect of noise on concept learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. ). Los Altos: Morgan Kaufmann. Quinlan, J. R. (). C.: Programs for machine learning. San Mateo, CA: Morgan Kaufmann. Sammut, C. A., & Banerji, R. B. (). Learning concepts by asking questions. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. , pp. –). Los Altos, CA: Morgan-Kaufmann. Vere, S. (). Induction of concepts in the predicate calculus. In Fourth international joint conference on artificial intelligence (pp. –). Tbilisi, Georgia, USSR. Vere, S. A. (). Induction of relational productions in the presence of background information. In Fifth international joint conference on artificial intelligence. Cambridge, MA. Winston, P. H. (). Learning structural descriptions from examples. Unpublished PhD Thesis, MIT Artificial Intelligence Laboratory.

Conditional Random Field A Conditional Random Field is a form of Graphical Model for segmenting and classifying sequential data. It is the discriminative learning counterpart to the generative learning Markov Chain model.

Recommended Reading Lafferty, J., McCallum, A., & Pereira, F. (). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the th international conference on machine learning (pp. –). San Francisco, Morgan Kaufmann.

Confirmation Theory

The branch of philosophy concerned with how (and indeed whether) evidence can confirm a hypothesis, even though typically it does not entail it. A distinction is sometimes drawn between total confirmation (how well confirmed a hypothesis is, given all available evidence) and weight-of-evidence (the amount of extra confirmation added to the total confirmation of a hypothesis by a particular piece of evidence). Confirmation is often measured by the probability of a hypothesis conditional on evidence.

Confusion Matrix. Table 1 An example of three-class confusion matrix (rows: Actual Class A, B, C; columns: Assigned Class A, B, C)

Confusion Matrix. Table 2 The outcomes of classification into positive and negative classes

Confusion Matrix Kai Ming Ting Monash University, Australia

Definition A confusion matrix summarizes the classification performance of a classifier with respect to some test data. It is a two-dimensional matrix, indexed in one dimension by the true class of an object and in the other by the class that the classifier assigns. Table 1 presents an example of a confusion matrix for a three-class classification task, with the classes A, B, and C. The first row of the matrix gives the objects that belong to class A: those correctly classified as belonging to A, two misclassified as belonging to B, and one misclassified as belonging to C. A special case of the confusion matrix is often utilized with two classes, one designated the positive class and the other the negative class. In this context, the four cells of the matrix are designated as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), as indicated in Table 2. A number of measures of classification performance are defined in terms of these four classification outcomes.

Specificity = True negative rate = TN/(TN + FP)
Sensitivity = True positive rate = Recall = TP/(TP + FN)

                 Assigned Class
Actual Class     Positive    Negative
Positive         TP          FN
Negative         FP          TN

Positive predictive value = Precision = TP/(TP + FP)
Negative predictive value = TN/(TN + FN)
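The four measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original entry: the function names and the sample labels are hypothetical, and the sketch assumes every denominator is nonzero.

```python
from collections import Counter

def outcome_counts(actual, assigned, positive):
    """Tally (TP, FP, TN, FN) from paired actual and assigned labels."""
    counts = Counter()
    for a, p in zip(actual, assigned):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts["TP"], counts["FP"], counts["TN"], counts["FN"]

def measures(tp, fp, tn, fn):
    """Evaluate the four ratios defined in the entry (denominators assumed nonzero)."""
    return {
        "specificity": tn / (tn + fp),               # true negative rate
        "sensitivity": tp / (tp + fn),               # true positive rate, recall
        "positive_predictive_value": tp / (tp + fp), # precision
        "negative_predictive_value": tn / (tn + fn),
    }

# Hypothetical test data: "+" is the positive class.
actual   = ["+", "+", "+", "-", "-", "-", "-", "+"]
assigned = ["+", "-", "+", "-", "+", "-", "-", "+"]
tp, fp, tn, fn = outcome_counts(actual, assigned, positive="+")
result = measures(tp, fp, tn, fn)
```

Keeping the tallying step separate from the ratios mirrors the entry: the confusion matrix is the summary, and each measure is a function of its four cells.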

Conjunctive Normal Form Bernhard Pfahringer University of Waikato, Hamilton, New Zealand

Conjunctive normal form (CNF) is an important normal form for propositional logic. A logic formula is in conjunctive normal form if it is a single conjunction of disjunctions of (possibly negated) literals. No more nesting and no other negations are allowed. Examples are:
a
¬b
a ∧ b
(a ∨ ¬b) ∧ (c ∨ d)
¬a ∧ (b ∨ ¬c ∨ d) ∧ (a ∨ ¬d)


Any arbitrary formula in propositional logic can be transformed into conjunctive normal form by application of the laws of distribution, De Morgan’s laws, and by removing double negations. It is important to note that this process can lead to exponentially larger formulas, which implies that the process in the worst case runs in exponential time. An example of this behavior is the following formula given in disjunctive normal form (DNF), which is linear in the number of propositional variables in this form. When transformed into conjunctive normal form (CNF), its size is exponentially larger.

DNF: $(a_0 \wedge a_1) \vee (a_2 \wedge a_3) \vee \ldots \vee (a_{2n} \wedge a_{2n+1})$

CNF: $(a_0 \vee a_2 \vee \ldots \vee a_{2n}) \wedge (a_1 \vee a_2 \vee \ldots \vee a_{2n}) \wedge \ldots \wedge (a_1 \vee a_3 \vee \ldots \vee a_{2n+1})$
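The blow-up can be observed directly. The sketch below, an illustration under stated assumptions rather than a general converter, handles only DNF formulas made of positive literals and applies the distributive law by picking one literal from every conjunction. The helper names and the variable naming scheme a0, a1, ... are hypothetical.

```python
from itertools import product

def paired_dnf(n):
    """DNF with n + 1 two-literal conjunctions over distinct variables a0 .. a(2n+1)."""
    return [(f"a{2 * i}", f"a{2 * i + 1}") for i in range(n + 1)]

def dnf_to_cnf(dnf):
    """Distribute 'or' over 'and': each CNF clause picks one literal per conjunction."""
    return {frozenset(choice) for choice in product(*dnf)}

dnf = paired_dnf(3)          # 4 conjunctions, 8 literals in total
cnf = dnf_to_cnf(dnf)        # one clause per way of choosing a literal from each conjunction
sizes = [len(dnf_to_cnf(paired_dnf(n))) for n in range(6)]
```

With n + 1 conjunctions of two literals each, the resulting CNF has 2^(n+1) clauses, so the clause count doubles each time one more conjunction is added.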

Recommended Reading Russell, S., & Norvig, P. (). Artificial intelligence: A modern approach (p. ). Prentice Hall

Connection Strength 7Weight

Connections Between Inductive Inference and Machine Learning John Case , Sanjay Jain University of Delaware, Newark, USA National University of Singapore, Singapore, Republic of Singapore

Definition Inductive inference is a theoretical framework to model learning in the limit. Here we will discuss some results in inductive inference that are relevant to the machine learning community. The mathematical/theoretical area called Inductive Inference is also known as computability theoretic learning and learning in the limit (Jain, Osherson, Royer, & Sharma, ; Odifreddi, ). It typically, but, as will be seen below, not always, involves the situation depicted just below.

Data $d_0, d_1, d_2, \ldots \xrightarrow{\text{In}} M \xrightarrow{\text{Out}}$ Programs $e_0, e_1, e_2, \ldots$

Let N = the set of nonnegative integers. Strings, including program strings, computer reals, and other data structures, inside computers, are finite bit strings and, hence, can be coded into N. Therefore, mathematically at least, it is without loss of mathematical generality that we sometimes use the data type N where standard practice would use a different type. In the diagram above, $d_0, d_1, d_2, \ldots$ can be, e.g., the successive values of a function f : N → N or the elements of a (formal) language L ⊆ N in some order; M is a machine; the $e_i$’s are from some hypothesis space of programs; and, for M’s successful learning, later $e_i$’s exactly or approximately compute the f or L. Such learning is off-line: in successful cases, one comes away with programs for past and future data. For the related problem of online extrapolation of next values for a function f, suitable $e_i$’s may be the values of f(i)’s based on having seen strictly prior values of f.

Detail We will discuss the off-line case until we say otherwise. It is typical in applied machine learning to present to a learner whatever data one has and to obtain one corresponding program, hopefully for predicting these data and future data. In inductive inference the case where only one program is output is called one-shot learning. More typically, in inductive inference, one allows for mind-changes, i.e., for a succession of output programs, as one receives successively more input data, with the later programs hopefully eventually being useful for predictions. Typically, one does not get success on one’s first conjecture/output program, but rather, one may achieve success eventually, or, as it is said, in the limit after some sequence of trial and error. It is helpful at this juncture to present a problem for which this latter approach makes more sense than the one-shot approach. We will consider some different criteria of successful learning of f or L by M. For example, Ex-style criteria will require that all but finitely many of the $e_i$’s are syntactically the same and do a reasonable job of computing the f or L. Bc-style criteria are more relaxed, more powerful, but less useful (Bārzdiņš, ; Case & Lynes, ; Case & Smith, ): they do not require almost all $e_i$’s be the same syntactically. Here is a well-known regression technique from, e.g., (Hildebrand, ), for exactly “curve-fitting” polynomials. It is the method involving calculating forward differences. We express it as a learning machine M and illustrate with its being fed an example data sequence generated by a cubic polynomial x − x + x + .

See (Hildebrand, ), for how to recover the polynomials themselves. M , fed a finite data sequence of natural numbers, first looks for iterated forward differences to become (apparently) constant, then outputs a rule/program, which uses the (apparent) constant to extrapolate the data sequence for any desired prediction. For example, were M given the data sequence in the top row of Table , it would calculate to be the apparent constant after three differencings, so M then outputs the following informal rule/program.

▸ To generate the level sequence, at level , start with ; at level , start with ; at level , start with ; add the apparent constant from level to get successive level data items; add successive level items to get successive level data items; finally, add successive level items to get as many successive level data items as needed for prediction.

This program, eventually output by M when its input is the whole top row of Table , correctly predicts the elements of the cubic polynomial, on successive values in N – the whole sequence , , , , , , , . . . . Along the way, though, just after the first data point, M thinks the apparent constant is ; just after the second that it is ; just after the third that it is ; and only after more of the data points does it converge for this cubic polynomial to the apparent (and, on this example, actual) constant . In general, M , on a polynomial of degree m, changes its mind up to m times until converging to its final program (of course, on $f(x) = 2^x$, M never converges, and each level of forward differences is just the sequence f again). Hence, M above Ex-learns, e.g., the integer polynomials f : N → N, but it does not in general one-shot learn these polynomials, since the data alone do not disclose the degree of a generating polynomial.

In this entry we survey some results from inductive inference but with an eye to topics having something to say regarding or to applied machine learning. In some cases, the theoretical results lend mathematical support to preexisting empirical observations about the efficacy of known machine learning techniques. In other cases, the theoretical results provide some, typically abstract, suggestions for the machine learning practitioner. In some of these cases, some of the suggestions apparently pay off; in others, intriguingly, we do not know yet.

Connections Between Inductive Inference and Machine Learning. Table Example Sequence and Its Iterated Forward Differences (rows: Sequence, 1st Diffs, 2nd Diffs, 3rd Diffs)
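The forward-difference learner described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the example data come from the cubic f(x) = x^3, chosen only for this sketch and not the polynomial used in the entry.

```python
def forward_differences(seq):
    """One level of forward differences: seq[i + 1] - seq[i]."""
    return [b - a for a, b in zip(seq, seq[1:])]

def conjecture(data):
    """Difference until a level is (apparently) constant; return the first item of each level.

    The last returned item is the apparent constant.
    """
    levels = [list(data)]
    while len(set(levels[-1])) > 1:
        levels.append(forward_differences(levels[-1]))
    return [level[0] for level in levels]

def extrapolate(heads, count):
    """Run the output 'program': rebuild the level-0 sequence by repeated addition."""
    state = list(heads)
    out = []
    for _ in range(count):
        out.append(state[0])
        # Each level advances by adding the current item of the level below it.
        for i in range(len(state) - 1):
            state[i] += state[i + 1]
    return out

cubes = [x ** 3 for x in range(8)]
# Conjectures after seeing 1, 2, 3, ... data points: the learner changes its mind, then converges.
conjectures = [conjecture(cubes[:n]) for n in range(1, 7)]
prediction = extrapolate(conjectures[-1], 8)
```

On this data the conjecture changes after each of the first four data points and stays the same afterwards, which is the convergence in the limit that the text contrasts with one-shot learning.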

Multi-Task or Context Sensitive Learning In empirical, applied machine learning, multitask or context sensitive learning involves trying to learn Y by first (de Garis, a, b; Fahlman, ; Thrun, ; Thrun & Sullivan, ; Tsung & Cottrell, ; Waibel, a, b) or simultaneously (Caruana, , ; Dietterich, Hild, & Bakiri, ; Matwin & Kubat, ; Mitchell, Caruana, Freitag, McDermott, & Zabowski, ; Pratt, Mostow, & Kamm, ; Sejnowski & Rosenberg, ; Bartlmae, Gutjahr, & Nakhaeizadeh, ) trying to learn also X – even in cases where there may be no inherent interest in learning X (see also Transfer Learning). There is, in many cases, an apparent empirical advantage in doing this for some X, Y. It can happen that Y is not apparently or easily learnable by itself, but is learnable if one learns X first or simultaneously. In some cases, X itself can be a sequence of tasks $X_1, \ldots, X_n$. Here the $X_i$s may need to be learned sequentially or simultaneously to learn Y. For example, to teach a robot to drive a car, it is useful to train it also to predict the center of the road markings (see, e.g., Baluja & Pomerleau, ; Caruana, ). For another example: an experimental system to predict the value of German Daimler stock performed better when it was modified to track simultaneously the German stock-index DAX (Bartlmae et al., ). The value of the Daimler stock here was the primary or target concept and the value of the DAX – a related concept – provided useful auxiliary context.

Angluin, Gasarch, and Smith () show mathematically that, in effect, there are (mathematical) learning scenarios for which it is provable that Y cannot be learned without learning X first – and, in other scenarios (Angluin et al., ; Kinber, Smith, Velauthapillai, & Wiehagen, ), Y cannot be learned without simultaneously learning X. These mathematical results provide a kind of evidence that the empirical observations as to the apparent usefulness of multitask or context sensitive learning may not be illusory, luck, or a mere accident of happening to use some data sets but not others. For illustration, here is a particularly simple theoretical example needing to be learned simultaneously and similar to examples in Angluin et al. (). Let R be the set of all computable functions mapping N to N. We use numerical names in N for programs. Let

$S = \{(f, g) \in R \times R \mid f(0) \text{ is a program for } g \wedge g(0) \text{ is a program for } f\}$.

We say (p, q) is a program for (f, g) ∈ R × R iff p is a program for f and q is a program for g. Consider a machine M which, if, as in the learning paradigm depicted in the Definition, M is fed $d_0, d_1, \ldots$, but where each $d_i$ is $(f(i), g(i))$, then M outputs each $e_i = (g(0), f(0))$. Clearly, M one-shot learns S. It can be easily shown that the component f’s and g’s for (f, g) ∈ S are not separately even Bc-learnable. It is important to note that, perhaps quite unlike real-world problems, the definition of this example S employs a simple self-referential coding trick: useful programs are coded into values of the functions at argument zero. A number of inductive inference results have been proved by means of (sometimes more complicated) self-referential coding tricks (see, e.g., Case, ). Bārzdiņš indirectly (see Zeugmann, ) provided a kind of informal robustness idea in his attempt to be rid of such coding tricks in inductive inference.
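The coding trick can be made concrete with a toy model. In the sketch below, offered only as an illustration under strong simplifying assumptions, a small dictionary stands in for a numbering of programs; the names PROGRAMS and one_shot_learner and the two sample functions are hypothetical.

```python
# Hypothetical program numbering: each numerical name denotes a total function N -> N.
PROGRAMS = {
    1: lambda x: 2 if x == 0 else x * x,   # f, with f(0) = 2 naming a program for g
    2: lambda x: 1 if x == 0 else x + 7,   # g, with g(0) = 1 naming a program for f
}

def one_shot_learner(data):
    """Read only the first datum (f(0), g(0)) and output the single conjecture (g(0), f(0))."""
    f0, g0 = next(iter(data))
    return g0, f0

f, g = PROGRAMS[1], PROGRAMS[2]
# Feed the learner the joint data stream (f(i), g(i)).
p, q = one_shot_learner((f(i), g(i)) for i in range(10))
```

Here p names a program for f and q names a program for g. The learner succeeds only because it sees both components together: the name of a program for f is stored in g, and vice versa.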

More formally, Fulk () considered a learnability result involving a witnessing class C of (tuples of) functions to be robust iff each computable scrambling of C also witnesses the learnability result (the allowed computable scramblers are the general recursive operators of (Rogers, ), but we omit the formal details herein). Example: A simple shift scrambler converting each f to f′, where f′(x) = f(x + 1), would eliminate the coding tricks just above, since the values of f at argument zero would be lost in this scrambling. Some inductive inference results hold robustly and some do not (see, e.g., Fulk, ; Jain, ; Jain, Smith, & Wiehagen, ; Jain et al., ; Case, Jain, Ott, Sharma, & Stephan, ). Happily, the S ⊆ R × R above (which is learnable although its components are not) can be replaced by a more complicated class S′ that robustly witnesses the same result. This is better theoretical evidence that the empirically noticed efficacy of multitask or context sensitive learning is not just an accident. It is residually important to note that (Jain et al., ) shows, though, that the computable scramblers cannot get rid of more sophisticated coding tricks they called topological. S′ mentioned just above turns out to employ this latter kind of coding trick. It is hypothesized in (Case et al., ) that nature likely employs some sophisticated coding tricks itself. For a separate informal argument about coding tricks of nature, see (Case, ). Ott and Stephan () introduce a finite invariance constraint on top of robustness. This so-called hyperrobustness does destroy all coding tricks, and the result about the theoretical efficacy of multitask or context sensitive learning is not hyperrobust. However, hyperrobustness, perhaps, leaves unrealistically sparse structure. Final note: Machine learning is an engineering endeavor.
However, philosophers of science as well as practitioners in classical scientific disciplines should likely be considering the relevance of multitask or context sensitive inductive inference to their endeavors.

Special Cases of Inductive Logic Programming In this section we discuss some learning in the limit results for elementary formal systems (EFSs) (Smullyan, ). Essentially, EFSs are programs in a string rewriting system. It is well known (Arikawa, Shinohara, & Yamamoto, ) that EFSs are essentially (pure) logic programs over strings. Hence, the results have possible relevance for inductive logic programming (ILP) (Bratko & Muggleton, ; Lavrač & Džeroski, ; Mitchell, ; Muggleton & De Raedt, ). First we will discuss some important special cases based on Angluin’s pattern languages (Angluin, ). A pattern language is (by definition) one generated by all the positive length substitution instances in a pattern, such as

abXYcbbZXa

where the variables (for substitutions) are depicted in upper case and the constants/terminals in lower case and are from, say, the alphabet {a,b,c}. Just below is an EFS or logic program based on this example pattern.

abXYcbbZXa ← .

It must be understood, though, that in the example pattern and EFS just above, and in the next example EFS below, only positive length strings are allowed to be substituted for the variables. Angluin () showed the Ex-learnability of the class of pattern languages from positive data. For these results, in the learning paradigm depicted in the Definition, $d_0, d_1, d_2, \ldots$ is a listing or presentation of some formal language L over a finite nonempty alphabet and the $e_i$’s are programs that generate languages. In particular, for Angluin’s M, for L a pattern language, the $e_i$’s are patterns, and, for each presentation of L, all but finitely many of the corresponding $e_i$’s are the same correct pattern for L. Much work has been done on the learnability of pattern languages, e.g., Salomaa (a, b); Case, Jain, Kaufmann, Sharma, and Stephan (), and on bounded finite unions thereof, e.g., Shinohara (); Wright (); Kilpeläinen, Mannila, and Ukkonen (); Brazma, Ukkonen, and Vilo (); Case, Jain, Lange, and Zeugmann (). Regarding bounded finite unions of pattern languages: an n-pattern language is the union of the pattern languages for some n patterns $P_1, \ldots, P_n$. Each n-pattern language is also Ex-learnable from positive data (see Wright ()). An EFS or logic program corresponding to the n-patterns $P_1, \ldots, P_n$ and generating the corresponding n-pattern language is just below.

$P_1$ ← .
⋮
$P_n$ ← .
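Membership in a pattern language can be checked mechanically. The Python sketch below is one simple way to do it, assuming single upper-case letters for variables and the alphabet {a,b,c} from the example; it translates a pattern into a regular expression with backreferences. The function names are hypothetical, and this approach is chosen for brevity rather than efficiency.

```python
import re

def pattern_to_regex(pattern, alphabet="abc"):
    """First occurrence of a variable becomes a named group; repeats become backreferences."""
    seen, parts = set(), []
    for symbol in pattern:
        if symbol.isupper():
            if symbol in seen:
                parts.append(f"(?P={symbol})")
            else:
                seen.add(symbol)
                # '+' enforces positive length substitutions only.
                parts.append(f"(?P<{symbol}>[{alphabet}]+)")
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))

def generates(pattern, word):
    """True iff word is a positive length substitution instance of pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None

def in_union(patterns, word):
    """Membership in an n-pattern language: the union of the individual pattern languages."""
    return any(generates(p, word) for p in patterns)

# X = "a", Y = "b", Z = "c" turns abXYcbbZXa into "ababcbbcaa".
member = generates("abXYcbbZXa", "ababcbbcaa")
```

Both occurrences of X must receive the same string, which is what the backreference enforces.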


Pattern language learning algorithms have been successfully applied toward some problems in molecular biology, see, e.g., Shimozono et al. (), Shinohara and Arikawa (). Lange and Wiehagen () presents an interesting iterative (Wiehagen, ) algorithm learning the class of pattern languages – from positive data only and with polynomial time constraints. Iterative learners are Ex-learners for which each output depends only on its just prior output (if any) and the input data element currently seen. Their algorithm works in polynomial time (actually quadratic time) in the length of the latest data item and the previous hypothesis. Furthermore, the algorithm has a linear set of good examples, in the sense that if the input data contains these good examples, then the algorithm already converges to the correct hypothesis. The number of good examples needed is at most ∣P∣ + , where P is a pattern generating the data d , d , d , . . . for the language being learned. This algorithm may be useful in practice due to its fast run time, and being able to converge quickly, if enough good data is available early. Furthermore, due to iterativeness, it does not need to store previous data! Zeugmann () considers total learning time up to convergence of the algorithm just discussed in the just prior paragraph. Note that, for arbitrary presentations, d , d , d , . . ., of a pattern language, this time can be unbounded. In the best case it is polynomial in the length of a generating pattern P, where d , d , d , . . . is based on using P to get good examples early – in fact the time taken in the best case is Θ(∣P∣ logs (s + k)), where P is the pattern, s is the alphabet size, and k is the number of variables in P. Much more interesting is the case of average time taken up to convergence. The probability distribution (called uniform by Zeugmann) considered is as follows. 
A variable X is replaced by a string w with probability (s) ∣w∣ (i.e., all strings of length r together have probability −r , and the distribution is uniform among strings of length r). Different variables are replaced independently of each other. In this case the average total time up to convergence is O(k k s∣P∣ logs (ks)). The main thing is that for average case on probabilistic data (as can be expected in real life, though not necessarily with this kind of uniform distribution), the algorithm converges pretty fast and computations are done efficiently.
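The iterative style of learner discussed above, whose next hypothesis depends only on the previous hypothesis and the current string, can be sketched as follows. The update rule used here (keep the shorter of hypothesis and input; merge equally long ones position by position) follows the usual textbook description of this kind of learner and is an assumption of this sketch, not a transcription of the cited algorithm. Variables are represented by integers, constants by characters, and all names are hypothetical.

```python
def merge(hypothesis, word):
    """Position-wise union of a pattern and an equally long word.

    Positions that agree keep their constant. Positions that disagree become
    variables, and equal (hypothesis symbol, word symbol) pairs share one variable.
    """
    names, merged = {}, []
    for h, c in zip(hypothesis, word):
        if h == c:
            merged.append(c)
        else:
            merged.append(names.setdefault((h, c), len(names)))
    return merged

def update(hypothesis, word):
    """One iterative step: uses only the previous hypothesis and the current word."""
    if hypothesis is None or len(word) < len(hypothesis):
        return list(word)
    if len(word) > len(hypothesis):
        return hypothesis
    return merge(hypothesis, word)

# Hypothetical presentation of the language of the pattern aXbX over {a, b, c}.
hypothesis = None
history = []
for word in ["aaba", "abbb", "accbcc", "acbc"]:
    hypothesis = update(hypothesis, word)
    history.append(list(hypothesis))
```

No earlier data items are stored: the hypothesis itself carries everything the learner needs, which is the iterativeness the text highlights.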


A number of papers consider Ex-learning of EFSs (Krishna Rao, ; Krishna Rao, , , ; Krishna Rao & Sattar, ), including with various bounds on the number of mind-changes until syntactic convergence to correct programs (Jain & Sharma, , ). The EFSs considered are patterns, n-patterns, those with a constant bound on the length of clauses, and some with constant bounds on search trees. The mind-change bounds are typically more dynamic than those given by constants: they involve counting down from finite representations (called notations) for infinite constructive ordinals. An example of this kind of bound: one can algorithmically, based on some input parameters, decide how many mind-changes will be allowed. In other examples, the decision as to how many mind-changes will be allowed can be algorithmically revised some constant number of times. It is possible that not yet created special cases of some of these algorithms could be made feasible enough for practice.

Learning Drifting Concepts A drifting concept to be learned is one which is a moving target (see Concept Drift). In some machine learning applications, concept drift must be dealt with (Bartlett, Ben-David, & Kulkarni, ; Blum & Chalasani, ; Devaney & Ram, ; Freund & Mansour, ; Helmbold & Long, ; Kubat, ; Widmer & Kubat, ; Wrobel, ). An inductive inference contribution is (Case et al., ), in which upper bounds are shown, for online extrapolation by computable martingale betting strategies, on the “speed” of the moving target that permit success at all. Here success is to make unbounded amounts of “money” betting on correctness of one’s extrapolations. Here is an illustrative result from (Case et al., ). For the pattern languages considered in the previous section, only positive length strings of terminals can be substituted for a variable in an associated pattern. The (difficult to learn) pattern languages with erasing are just the languages obtained by also allowing the substitution of the empty string for variables in a pattern. For our example, we restrict the terminal alphabet to be {0,1}. With each pattern language with erasing L (over this terminal alphabet) we associate its characteristic function $\chi_L$, which is 1 on terminal strings in L and 0 on those not in L. For ε denoting the empty string, and for the terminal strings in length-lexicographical order, ε, 0, 1, 00, 01, 10, 11, 000, . . ., we would input a $\chi_L$ itself to a potential extrapolating machine as the bit string $\chi_L(ε), \chi_L(0), \chi_L(1), \chi_L(00), \chi_L(01), \ldots$. Let E be the class of these characteristic functions. Pick a positive integer constant p. To model drift with permanence p, we imagine that a potential extrapolator for E receives successive bits from a member of E but keeps switching to the next bits of another, etc., but it must see at least p bits in a row of each member of E it sees before it can see the next bits of another. p is, then, a speed limit on drift. The result is that some suitably clever computable martingale betting strategy is successful at extrapolating E with drift permanence (speed limit on drift) of p = .
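The input stream fed to the extrapolator can be generated explicitly. The sketch below assumes the terminal alphabet {0, 1}, enumerates strings in length-lexicographical order, and computes the characteristic bits of a pattern language with erasing. The function names and the sample pattern 0X1X are hypothetical, and the regular-expression test is only one convenient way to decide membership.

```python
import re
from itertools import count, islice, product

def length_lex(alphabet="01"):
    """Yield all strings over the alphabet: shorter first, then lexicographically."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def erasing_regex(pattern, alphabet="01"):
    """Like a pattern language test, but variables may also be replaced by the empty string."""
    seen, parts = set(), []
    for symbol in pattern:
        if symbol.isupper():
            if symbol in seen:
                parts.append(f"(?P={symbol})")
            else:
                seen.add(symbol)
                parts.append(f"(?P<{symbol}>[{alphabet}]*)")  # '*' permits erasing
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))

def characteristic_bits(pattern, n):
    """First n bits of the characteristic function, in length-lexicographical order."""
    regex = erasing_regex(pattern)
    return [1 if regex.fullmatch(w) else 0 for w in islice(length_lex(), n)]

first_words = list(islice(length_lex(), 8))
bits = characteristic_bits("0X1X", 7)
```

Among the first seven strings, only "01" (obtained by erasing X) belongs to the language of 0X1X, so only its bit is 1.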

Behavioral Cloning Kummer and Ott () and Case, Ott, Sharma, and Stephan () studied learning in the limit of winning control strategies for closed computable games. These games nicely model reactive process-control problems. Included are such example process-control games as regulating temperature of a room to be in a desired interval, forever after no more than some fixed number of moves between the thermostat and processes disturbing the temperature. (Roughly, closed computable games are those so that one can tell algorithmically when one has lost. A temperature control game that requires stability forever after some undetermined finite number of moves is not a closed computable game. For a more formal treatment, see Cenzer and Remmel (); Maler, Pnueli, and Sifakis (); Thomas (); Kummer and Ott ().) In machine learning, there are cases where one wants to teach a machine some motor skill possessed by human experts and where these human experts do not have access to verbalizable knowledge about how they perform expertly. Piloting an aircraft or expert operation of a swinging shipyard crane provide examples, and machine learning employs, in these cases, behavioral cloning, which uses direct performance data from the experts (Bain & Sammut, ; Bratko, Urbančič, & Sammut, ; Šuc, ). Case et al. () studies the effects on learning in the limit closed computable games where the learning procedures also had access to the behavioral performance (but not the algorithms) of masters/experts at the games. For example, it is shown that, in some cases, there is better performance cloning n + 1 disparate masters over cloning only n. For a while it was not known in machine learning how to clone multiple experts even after Case et al. () was known to some; however, independently of Case et al., , and later, Dorian Šuc (Šuc, ) found a way to clone behaviorally more than one human expert simultaneously (for the free-swinging shipyard crane problem) – by having more than one level of feedback control, and he got enhanced performance from cloning the multiple experts!

Learning To Coordinate

Montagna and Osherson () begin the study of learning in the limit to coordinate (digital) moves between at least two agents. The machines of Montagna and Osherson () are, in effect, general extrapolating devices (Montagna & Osherson, ; Case et al., ). Technically, and without loss of generality of the results, we restrict the moves of each coordinator to bits, i.e., zeros and ones. Coordination is achieved between two coordinators iff each, reacting to the bit sequence of the other, eventually (in the limit) matches it bit for bit. Montagna and Osherson () give an example of two people who show up in a park each day at one of noon (bit ) or pm (bit ); each silently watches the other’s past behavior; and each tries, based on the past behavior of the other, to show up eventually exactly when the other shows up. If they manage it, they have learned to coordinate. A blind coordinator is one that reacts only to the presence of a bit from another process, not to which bit the other process has played (Montagna and Osherson, ). In Case et al. (), the notion of probabilistically correct algorithmic coordinators is developed and studied. Next is a sample of theorems to the effect that just a few random bits can enhance learning to coordinate.

Theorem (Case et al., ) Suppose $0 \le p < 1$. There exists a class of deterministic algorithmic coordinators C such that
(1) No deterministic algorithmic coordinator can coordinate with all of C; and
(2) For k chosen so that $1 - 2^{-k} \ge p$, there exists a blind, probabilistic algorithmic coordinator PM, such that: (i) For each member of C, PM can coordinate with it with probability $1 - 2^{-k} \ge p$; and (ii) PM is k-memory limited in the sense of (Osherson, Stob, & Weinstein, , P. ); more specifically, PM needs to remember whether it is outputting one of its first k bits, which are its only random bits (e.g., , a mere k = random bits for p = suffice.).

Regarding possible eventual applicability: Maye, Hsieh, Sugihara, and Brembs () cite finding deterministic chaos but not randomness in the behavior of animals. Hence, animals may not be exploiting random bits in learning anything, including to coordinate. However, one might build artifactual devices to exploit randomness, say, from radioactive decay, including, then, for enhancing learning to coordinate.

Learning Geometric Clustering Case, Jain, Martin, Sharma, and Stephan () showed that learnability in the limit of clustering, with or without additional information, depends strongly on geometric constraints on the shape of the clusters. In this approach the hypothesis space of possible clusters is pre-given in each setting. It was hoped to obtain thereby insight into the difficulty of clustering when the clusters are restricted to preassigned geometrically defined classes. This is interestingly complementary to the conceptual clustering approach (see, e.g., Mishra, Ron, & Swaminathan, ; Pitt & Reinke, ) where one restricts the possible clusters to have good “verbal” descriptions in some language. Clustering of many of the geometric classes investigated was shown to require information in addition to a presentation, $d_0, d_1, d_2, \ldots$, of the set of points to be clustered. For example, for clusters as convex hulls of finitely many points in a rational vector space, clustering can be done – but with the number of clusters as additional information. Let S consist of all polygons including their interiors – in the rational two-dimensional plane without intersections and degenerate angles. (Attention was restricted to spaces of rationals since: (1) computer reals are rationals, (2) this avoids the uncountability of the set of reals, and (3) this avoids dealing with uncomputable real points.) The class S can be clustered – but with the number of vertices of the polygons of the clusters involved as additional information. Correspondingly, then, it was shown that the class S′ containing S together with all such polygons but with one hole (the nondegenerate differences of two members in S) cannot be clustered with the number of vertices as additional information, yet S′ can be clustered with area as additional information – and this even in higher dimensions and with any number of holes (Case et al., ). It remains to be seen if some forms of geometrically constrained clustering can be usefully complementary to, say, conceptually/verbally constrained clustering.

Insights for Limitations of Science

We briefly treat below some problems regarding parsimonious, refutable, and consistent hypotheses.

It is common wisdom in science that one should fit parsimonious explanations, hypotheses, or programs to data. In machine learning, this has been successfully applied (e.g., Wallace, ; Wallace & Dowe, ).

Curiously, though, there are many results in inductive inference in which we see sometimes severe degradations of learning power caused by demanding parsimonious predictive programs (see, e.g., Freivalds (); Kinber (); Chen (); Case, Jain, and Sharma (); Ambainis, Case, Jain, and Suraj ()).

It is an interesting problem to resolve the seeming, but likely not actual, contradiction between the two preceding paragraphs.

Popper’s Refutability (Popper, ) asserts that hypotheses in science should be subject to refutation. Besides the well-known difficulties of Duhem–Quine (Harding, ) of knowing which component hypothesis to throw out when a compound hypothesis badly fails to make correct predictions, inductive inference theorems have provided very different difficulties. Case and Smith () outline cases of usefully incomplete (hence wrong) hypotheses that cannot be refuted, and Case and Suraj () (see also Case, ) provide cases of inductively inferable higher-order hypotheses not totally subject to refutation in cases where ordinary hypotheses subject to full refutation cannot be inductively inferred.

While Duhem–Quine may impact machine learning eventually, it remains to be seen whether the inductive inference results of the preceding paragraph will do so.

Requiring 7inductive inference procedures always to output a hypothesis that is, in various senses, consistent with (e.g., not ignoring) the data on which that hypothesis is based seems like mere common sense. However, from Bārzdiņš (a); Blum and Blum (); Wiehagen (); and Case, Jain, Stephan, and Wiehagen () we see that strict adherence to various consistency principles can severely attenuate the learning power of inductive inference machines. Furthermore, even when inductive inference is polytime constrained, we see similar counterintuitive results to the effect that a kind of consistency can strictly attenuate learning power (Wiehagen & Zeugmann, ). A machine learning analog might be Breiman’s bagging (Breiman, ) and random forests (Breiman, ), where data is purposely ignored. However, in these cases, the purpose of ignoring data is to avoid overfitting to noise. It remains to be seen whether, in applied machine learning involving cases of practically noiseless data, one can also obtain some advantage by ignoring some consistency principles. Again, the potential lesson from inductive inference is abstract and provides only a hint of something to work out in real machine learning problems.

Cross References 7Behavioural Cloning 7Clustering 7Concept Drift 7Inductive Logic Programming 7Transfer Learning

Recommended Reading Ambainis, A., Case, J., Jain, S., & Suraj, M. (). Parsimony hierarchies for inductive inference. Journal of Symbolic Logic, , –. Angluin, D., Gasarch, W., & Smith, C. (). Training sequences. Theoretical Computer Science, (), –. Angluin, D. (). Finding patterns common to a set of strings. Journal of Computer and System Sciences, , –. Arikawa, S., Shinohara, T., & Yamamoto, A. (). Learning elementary formal systems. Theoretical Computer Science, , –. Bain, M., & Sammut, C. (). A framework for behavioural cloning. In K. Furukawa, S. Muggleton, & D. Michie (Eds.), Machine intelligence . Oxford: Oxford University Press.

Baluja, S., & Pomerleau, D. (). Using the representation in a neural network’s hidden layer for task specific focus of attention. Technical Report CMU-CS--, School of Computer Science, CMU, May . Appears in Proceedings of the IJCAI. Bartlett, P., Ben-David, S., & Kulkarni, S. (). Learning changing concepts by exploiting the structure of change. In Proceedings of the ninth annual conference on computational learning theory, Desenzano del Garda, Italy. New York: ACM Press. Bartlmae, K., Gutjahr, S., & Nakhaeizadeh, G. (). Incorporating prior knowledge about financial markets through neural multitask learning. In Proceedings of the fifth international conference on neural networks in the capital markets. Bārzdiņš, J. (a). Inductive inference of automata, functions and programs. In Proceedings of the international congress of mathematicians, Vancouver (pp. –). Bārzdiņš, J. (b). Two theorems on the limiting synthesis of functions. In Theory of algorithms and programs (Vol. , pp. –). Latvian State University, Riga. Blum, L., & Blum, M. (). Toward a mathematical theory of inductive inference. Information and Control, , –. Blum, A., & Chalasani, P. (). Learning switching concepts. In Proceedings of the fifth annual conference on computational learning theory, Pittsburgh, Pennsylvania, (pp. –). New York: ACM Press. Bratko, I., & Muggleton, S. (). Applications of inductive logic programming. Communications of the ACM, (), –. Bratko, I., Urbančič, T., & Sammut, C. (). Behavioural cloning of control skill. In R. S. Michalski, I. Bratko, & M. Kubat (Eds.), Machine learning and data mining: Methods and applications, (pp. –). New York: Wiley. Brazma, A., Ukkonen, E., & Vilo, J. (). Discovering unbounded unions of regular pattern languages from positive examples. In Proceedings of the seventh international symposium on algorithms and computation (ISAAC’), Lecture notes in computer science, (Vol. , pp. –), Berlin: Springer-Verlag. Breiman, L. (). Bagging predictors.
Machine Learning, (), –. Breiman, L. (). Random forests. Machine Learning, (), –. Caruana, R. (). Multitask connectionist learning. In Proceedings of the connectionist models summer school (pp. –). NJ: Lawrence Erlbaum. Caruana, R. (). Algorithms and applications for multitask learning. In Proceedings th international conference on machine learning (pp. –). San Francisco, CA: Morgan Kaufmann. Case, J. (). Infinitary self-reference in learning theory. Journal of Experimental and Theoretical Artificial Intelligence, , –. Case, J. (). The power of vacillation in language learning. SIAM Journal on Computing, (), –. Case, J. (). Directions for computability theory beyond pure mathematical. In D. Gabbay, S. Goncharov, & M. Zakharyaschev (Eds.), Mathematical problems from applied logic II. New logics for the XXIst century, International Mathematical Series, (Vol. ). New York: Springer. Case, J., & Lynes, C. (). Machine inductive inference and language identification. In M. Nielsen & E. Schmidt, (Eds.), Proceedings of the th International Colloquium on Automata, Languages and Programming, Lecture notes in computer science, (Vol. , pp. –). Berlin: Springer-Verlag. Case, J., & Smith, C. (). Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, , –.

Case, J., & Suraj, M. (). Weakened refutability for machine learning of higher order definitions, . (Working paper for eventual journal submission). Case, J., Jain, S., Kaufmann, S., Sharma, A., & Stephan, F. (). Predictive learning models for concept drift (Special Issue for ALT’). Theoretical Computer Science, , –. Case, J., Jain, S., Lange, S., & Zeugmann, T. (). Incremental concept learning for bounded data mining. Information and Computation, , –. Case, J., Jain, S., Montagna, F., Simi, G., & Sorbi, A. (). On learning to coordinate: Random bits help, insightful normal forms, and competency isomorphisms (Special issue for selected learning theory papers from COLT’, FOCS’, and STOC’). Journal of Computer and System Sciences, (), –. Case, J., Jain, S., Martin, E., Sharma, A., & Stephan, F. (). Identifying clusters from positive data. SIAM Journal on Computing, (), –. Case, J., Jain, S., Ott, M., Sharma, A., & Stephan, F. (). Robust learning aided by context (Special Issue for COLT’). Journal of Computer and System Sciences, , –. Case, J., Jain, S., & Sharma, A. (). Machine induction without revolutionary changes in hypothesis size. Information and Computation, , –. Case, J., Jain, S., Stephan, F., & Wiehagen, R. (). Robust learning – rich and poor. Journal of Computer and System Sciences, (), –. Case, J., Ott, M., Sharma, A., & Stephan, F. (). Learning to win process-control games watching gamemasters. Information and Computation, (), –. Cenzer, D., & Remmel, J. (). Recursively presented games and strategies. Mathematical Social Sciences, , –. Chen, K. (). Tradeoffs in the inductive inference of nearly minimal size programs. Information and Control, , –. de Garis, H. (a). Genetic programming: Building nanobrains with genetically programmed neural network modules. In IJCNN: International Joint Conference on Neural Networks, (Vol. , pp. –). Piscataway, NJ: IEEE Service Center. de Garis, H. (b). Genetic programming: Modular neural evolution for Darwin machines. 
In M. Caudill (Ed.), IJCNN--WASH DC; International joint conference on neural networks (Vol. , pp. –). Hillsdale, NJ: Lawrence Erlbaum Associates. de Garis, H. (). Genetic programming: Building artificial nervous systems with genetically programmed neural network modules. In B. Soušek, & The IRIS group (Eds.), Neural and intelligent systems integration: Fifth and sixth generation integrated reasoning information systems (Chap. , pp. –). New York: Wiley. Devaney, M., & Ram, A. (). Dynamically adjusting concepts to accommodate changing contexts. In M. Kubat & G. Widmer (Eds.), Proceedings of the ICML- Pre-conference workshop on learning in context-sensitive domains, Bari, Italy (Journal submission). Dietterich, T., Hild, H., & Bakiri, G. (). A comparison of ID and backpropagation for English text-to-speech mapping. Machine Learning, (), –. Fahlman, S. (). The recurrent cascade-correlation architecture. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems (Vol. , pp. –). San Mateo, CA: Morgan Kaufmann Publishers. Freivalds, R. (). Minimal Gödel numbers and their identification in the limit. In Lecture notes in computer science (Vol. , pp. –). Berlin: Springer-Verlag.

Freund, Y., & Mansour, Y. (). Learning under persistent drift. In S. Ben-David (Ed.), Proceedings of the third European conference on computational learning theory (EuroCOLT’), Lecture notes in artificial intelligence, (Vol. , pp. –). Berlin: Springer-Verlag. Fulk, M. (). Robust separations in inductive inference. In Proceedings of the st annual symposium on foundations of computer science (pp. –). St. Louis, Missouri. Washington, DC: IEEE Computer Society. Harding, S. (Ed.). (). Can theories be refuted? Essays on the Duhem-Quine thesis. Dordrecht: Kluwer Academic Publishers. Helmbold, D., & Long, P. (). Tracking drifting concepts by minimizing disagreements. Machine Learning, , –. Hildebrand, F. (). Introduction to numerical analysis. New York: McGraw-Hill. Jain, S. (). Robust behaviorally correct learning. Information and Computation, (), –. Jain, S., & Sharma, A. (). Elementary formal systems, intrinsic complexity, and procrastination. Information and Computation, , –. Jain, S., & Sharma, A. (). Mind change complexity of learning logic programs. Theoretical Computer Science, (), –. Jain, S., Osherson, D., Royer, J., & Sharma, A. (). Systems that learn: An introduction to learning theory (nd ed.). Cambridge, MA: MIT Press. Jain, S., Smith, C., & Wiehagen, R. (). Robust learning is rich. Journal of Computer and System Sciences, (), –. Kilpeläinen, P., Mannila, H., & Ukkonen, E. (). MDL learning of unions of simple pattern languages from positive examples. In P. Vitányi (Ed.), Computational learning theory, second European conference, EuroCOLT’, Lecture notes in artificial intelligence, (Vol. , pp. –). Berlin: Springer-Verlag. Kinber, E. (). On a theory of inductive inference. In Lecture notes in computer science (Vol. , pp. –). Berlin: Springer-Verlag. Kinber, E., Smith, C., Velauthapillai, M., & Wiehagen, R. (). On learning multiple concepts in parallel. Journal of Computer and System Sciences, , –. Krishna Rao, M. ().
A class of prolog programs inferable from positive data. In A. Arikawa & A. Sharma (Eds.), Seventh international conference on algorithmic learning theory (ALT’ ), Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer-Verlag. Krishna Rao, M. (). Some classes of prolog programs inferable from positive data (Special Issue for ALT’). Theoretical Computer Science A, , –. Krishna Rao, M. (). Inductive inference of term rewriting systems from positive data. In S. Ben-David, J. Case, & A. Maruoka (Eds.), Algorithmic learning theory: Fifteenth international conference (ALT’ ), Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer-Verlag. Krishna Rao, M. (). A class of prolog programs with nonlinear outputs inferable from positive data. In S. Jain, H. U. Simon, & E. Tomita (Eds.), Algorithmic learning theory: Sixteenth international conference (ALT’ ), Lecture notes in artificial intelligence, (Vol. , pp. –). Berlin: Springer-Verlag. Krishna Rao, M., & Sattar, A. (). Learning from entailment of logic programs with local variables. In M. Richter, C. Smith, R. Wiehagen, & T. Zeugmann (Eds.), Ninth international conference on algorithmic learning theory (ALT’ ), Lecture notes in

artificial intelligence (Vol. , pp. –). Berlin: Springer-Verlag. Kubat, M. (). A machine learning based approach to load balancing in computer networks. Cybernetics and Systems, , –. Kummer, M., & Ott, M. (). Learning branches and learning to win closed recursive games. In Proceedings of the ninth annual conference on computational learning theory, Desenzano del Garda, Italy. New York: ACM Press. Lange, S., & Wiehagen, R. (). Polynomial time inference of arbitrary pattern languages. New Generation Computing, , –. Lavrač, N., & Džeroski, S. (). Inductive logic programming: Techniques and applications. New York: Ellis Horwood. Maler, O., Pnueli, A., & Sifakis, J. (). On the synthesis of discrete controllers for timed systems. In Proceedings of the annual symposium on the theoretical aspects of computer science, LNCS (Vol. , pp. –). Berlin: Springer-Verlag. Matwin, S., & Kubat, M. (). The role of context in concept learning. In M. Kubat & G. Widmer (Eds.), Proceedings of the ICML- pre-conference workshop on learning in context-sensitive domains, Bari, Italy, (pp. –). Maye, A., Hsieh, C., Sugihara, G., & Brembs, B. (). Order in spontaneous behavior. PLoS One, May, . See: http://brembs.net/spontaneous/ Mishra, N., Ron, D., & Swaminathan, R. (). A new conceptual clustering framework. Machine Learning, (–), –. Mitchell, T. (). Machine learning. New York: McGraw Hill. Mitchell, T., Caruana, R., Freitag, D., McDermott, J., & Zabowski, D. (). Experience with a learning, personal assistant. Communications of the ACM, , –. Montagna, F., & Osherson, D. (). Learning to coordinate: A recursion theoretic perspective. Synthese, , –. Muggleton, S., & De Raedt, L. (). Inductive logic programming: Theory and methods. Journal of Logic Programming, /, – . Odifreddi, P. (). Classical recursion theory (Vol. II). Amsterdam: Elsevier. Osherson, D., Stob, M., & Weinstein, S. (). Systems that learn: An introduction to learning theory for cognitive and computer scientists.
Cambridge, MA: MIT Press. Ott, M., & Stephan, F. (). Avoiding coding tricks by hyperrobust learning. Theoretical Computer Science, (), –. Pitt, L., & Reinke, R. (). Criteria for polynomial-time (conceptual) clustering. Machine Learning, , –. Popper, K. (). Conjectures and refutations: The growth of scientific knowledge. New York: Basic Books. Pratt, L., Mostow, J., & Kamm, C. (). Direct transfer of learned information among neural networks. In Proceedings of the th national conference on artificial intelligence (AAAI-), Anaheim, California. Menlo Park, CA: AAAI press. Rogers, H. (). Theory of recursive functions and effective computability. New York: McGraw Hill (Reprinted, MIT Press, ). Salomaa, A. (a). Patterns (The formal language theory column). EATCS Bulletin, , –. Salomaa, A. (b). Return to patterns (The formal language theory column). EATCS Bulletin, , –. Sejnowski, T., & Rosenberg, C. (). NETtalk: A parallel network that learns to read aloud. Technical Report JHU-EECS--, Johns Hopkins University.

Shimozono, S., Shinohara, A., Shinohara, T., Miyano, S., Kuhara, S., & Arikawa, S. (). Knowledge acquisition from amino acid sequences by machine learning system BONSAI. Transactions of Information Processing Society of Japan, , –. Shinohara, T. (). Inferring unions of two pattern languages. Bulletin of Informatics and Cybernetics, , –. Shinohara, T., & Arikawa, S. (). Pattern inference. In K. P. Jantke & S. Lange (Eds.), Algorithmic learning for knowledge-based systems, Lecture notes in artificial intelligence (Vol. , pp. –). Berlin: Springer-Verlag. Smullyan, R. (). Theory of formal systems. In Annals of Mathematics Studies (Vol. ). Princeton, NJ: Princeton University Press. Šuc, D. (). Machine reconstruction of human control strategies. Frontiers in artificial intelligence and applications (Vol. ). Amsterdam: IOS Press. Thomas, W. (). On the synthesis of strategies in infinite games. In Proceedings of the annual symposium on the theoretical aspects of computer science, LNCS (Vol. , pp. –). Berlin: Springer-Verlag. Thrun, S. (). Is learning the n-th thing any easier than learning the first? In Advances in neural information processing systems, . San Mateo, CA: Morgan Kaufmann. Thrun, S., & Sullivan, J. (). Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the thirteenth international conference on machine learning (ICML) (pp. –). San Francisco, CA: Morgan Kaufmann. Tsung, F., & Cottrell, G. (). A sequential adder using recurrent networks. In IJCNN--WASHINGTON DC: International joint conference on neural networks June – (Vol. , pp. –). Piscataway, NJ: IEEE Service Center. Waibel, A. (a). Connectionist glue: Modular design of neural speech systems. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the connectionist models summer school (pp. –). San Mateo, CA: Morgan Kaufmann. Waibel, A. (b). Consonant recognition by modular construction of large phonemic time-delay neural networks. In D. S.
Touretzky (Ed.), Advances in neural information processing systems I (pp. –). San Mateo, CA: Morgan Kaufmann. Wallace, C. (). Statistical and inductive inference by minimum message length. (Information Science and Statistics). New York: Springer (Posthumously published). Wallace, C., & Dowe, D. (). Minimum message length and Kolmogorov complexity (Special Issue on Kolmogorov Complexity). Computer Journal, (), –. http://comjnl.oxfordjournals.org/cgi/reprint///. Widmer, G., & Kubat, M. (). Learning in the presence of concept drift and hidden contexts. Machine Learning, , –. Wiehagen, R. (). Limes-Erkennung rekursiver Funktionen durch spezielle Strategien. Elektronische Informationsverarbeitung und Kybernetik, , –. Wiehagen, R., & Zeugmann, T. (). Ignoring data may be the only way to learn efficiently. Journal of Experimental and Theoretical Artificial Intelligence, , –. Wright, K. (). Identification of unions of languages drawn from an identifiable class. In R. Rivest, D. Haussler, & M. Warmuth (Eds.), Proceedings of the second annual workshop on computational learning theory, Santa Cruz, California, (pp. –). San Mateo, CA: Morgan Kaufmann Publishers.

Wrobel, S. (). Concept formation and knowledge revision. Dordrecht: Kluwer Academic Publishers. Zeugmann, T. (). On Bārzdiņš’ conjecture. In K. P. Jantke (Ed.), Analogical and inductive inference, Proceedings of the international workshop, Lecture notes in computer science, (Vol. , pp. –). Berlin: Springer-Verlag. Zeugmann, T. (). Lange and Wiehagen’s pattern language learning algorithm: An average case analysis with respect to its total learning time. Annals of Mathematics and Artificial Intelligence, , –.

Connectivity 7Topology of a Neural Network

Consensus Clustering

Synonyms

Clustering aggregation; Clustering ensembles

Definition

In Consensus Clustering we are given a set of $n$ objects $V$ and a set of $m$ clusterings $\{C_1, C_2, \ldots, C_m\}$ of the objects in $V$. The aim is to find a single clustering $C$ that disagrees least with the input clusterings, that is, $C$ minimizes

$$D(C) = \sum_{C_i} d(C, C_i),$$

for some metric $d$ on clusterings of $V$. Meilă () proposed the principled variation of information metric on clusterings, but it has been difficult to analyze theoretically. The Mirkin metric is the most widely used, in which $d(C, C')$ is the number of pairs of objects $(u, v)$ that are clustered together in $C$ and apart in $C'$, or vice versa; it can be calculated in time $O(mn)$.

We can interpret each of the clusterings $C_i$ in Consensus Clustering as evidence that pairs ought to be put together or separated. That is, $w^+_{uv}$ is the number of $C_i$ in which $C_i[u] = C_i[v]$ and $w^-_{uv}$ is the number of $C_i$ in which $C_i[u] \neq C_i[v]$. It is clear that $w^+_{uv} + w^-_{uv} = m$ and that Consensus clustering is an instance of Correlation clustering in which the $w^-_{uv}$ weights obey the triangle inequality.
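The quantities in this definition can be made concrete with a short sketch. The code below is illustrative only: the function names (`mirkin_distance`, `consensus_cost`, `pair_weights`) and the toy data are assumptions, and the distance is computed by direct pair enumeration, which is quadratic in the number of objects rather than the faster bound quoted above. A clustering is represented as a dictionary from object to cluster label.

```python
from itertools import combinations


def mirkin_distance(a, b):
    """Count object pairs that are together in one clustering and apart in the other."""
    objects = sorted(a)
    return sum(
        (a[u] == a[v]) != (b[u] == b[v])
        for u, v in combinations(objects, 2)
    )


def consensus_cost(candidate, clusterings):
    """D(C): total Mirkin distance from a candidate to all input clusterings."""
    return sum(mirkin_distance(candidate, c) for c in clusterings)


def pair_weights(clusterings):
    """Return (w_plus, w_minus): for each pair, how many inputs join or separate it."""
    objects = sorted(clusterings[0])
    w_plus, w_minus = {}, {}
    for u, v in combinations(objects, 2):
        together = sum(c[u] == c[v] for c in clusterings)
        w_plus[(u, v)] = together
        w_minus[(u, v)] = len(clusterings) - together
    return w_plus, w_minus


# Hypothetical toy input: three clusterings of four objects.
clusterings = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
]
w_plus, w_minus = pair_weights(clusterings)
costs = [consensus_cost(c, clusterings) for c in clusterings]
```

On this toy input the first clustering has the lowest cost among the three inputs. The same cost can be read off the pair weights: each pair placed together in the candidate contributes its $w^-_{uv}$, and each pair placed apart contributes its $w^+_{uv}$, which is the Correlation clustering view mentioned above.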

Constrained Clustering

Kiri L. Wagstaff
Pasadena, CA, USA

Definition

Constrained clustering is a semisupervised approach to 7clustering data while incorporating domain knowledge in the form of constraints. The constraints are usually expressed as pairwise statements indicating that two items must, or cannot, be placed into the same cluster. Constrained clustering algorithms may enforce every constraint in the solution, or they may use the constraints as guidance rather than hard requirements.

Motivation and Background

7Unsupervised learning operates without any domain-specific guidance or preexisting knowledge. Supervised learning requires that all training examples be associated with labels. Yet it is often the case that existing knowledge for a problem domain fits neither of these extremes. Semisupervised learning methods fill this gap by making use of both labeled and unlabeled data. Constrained clustering, a form of semisupervised learning, was developed to extend clustering algorithms to incorporate existing domain knowledge, when available. This knowledge may arise from labeled data or from more general rules about the concept to be learned. One of the original motivating applications was noun phrase coreference resolution, in which noun phrases in a text must be clustered together to represent distinct entities (e.g., “Mr. Obama” and “the President” and “he”, separate from “Sarah Palin” and “she” and “the Alaska governor”). This problem domain contains several natural rules for when noun phrases should (such as appositive phrases) or should not (such as a mismatch on gender) be clustered together. These rules can be translated into a collection of pairwise constraints on the data to be clustered. Constrained clustering algorithms have now been applied to a rich variety of domain areas, including hyperspectral image analysis, road lane divisions from GPS data, gene expression microarray analysis, video object identification, document clustering, and web search result grouping.

Structure of the Learning System

Constrained clustering arises out of existing work with unsupervised clustering algorithms. In this description, we focus on clustering algorithms that seek a partition of the data into disjoint clusters, using a distance or similarity measure to place similar items into the same cluster. Usually, the desired number of clusters, k, is specified as an input to the algorithm. The most common clustering algorithms are k-means (MacQueen, ) and expectation maximization or EM (Dempster, Laird, & Rubin, ) (Fig. ).

Constrained Clustering. Figure . The constrained clustering algorithm takes in nine items and two pairwise constraints (one must-link and one cannot-link). The output clusters respect the specified constraints. (Diagram labels: Input data; Domain knowledge; Constraints; Constrained clustering; Output clusters.)

A constrained clustering algorithm takes the same inputs as a regular (unsupervised) clustering algorithm and also accepts a set of pairwise constraints. Each constraint is a 7must-link or 7cannot-link constraint. The must-link constraints form an equivalence relation, which permits the inference of additional transitively implied must-links as well as additional entailed cannot-link constraints between items from distinct must-link cliques. Specifying a significant number of pairwise constraints might be tedious for large data sets, so they are often generated from a manually labeled subset of the data or from domain-specific rules. The algorithm may interpret the constraints as hard constraints that must be satisfied in the output or as soft preferences that can be violated, if necessary. The former approach was used in the first constrained clustering algorithms, COP-COBWEB (Wagstaff & Cardie, ) and COP-kmeans (Wagstaff, Cardie, Rogers, & Schroedl, ). COP-kmeans accommodates the constraints by restricting item assignments to exclude any constraint violations. If a solution that satisfies the constraints is not found, COP-kmeans terminates without a solution. Later, algorithms such as PCK-means and MPCK-means (Bilenko, Basu, & Mooney, ) permitted the violation of constraints when necessary by introducing a violation penalty. This is useful when the constraints may contain noise or internal inconsistencies, which are especially relevant in real-world domains. Constrained versions of other clustering algorithms such as EM (Shental, Bar-Hillel, Hertz, & Weinshall, ) and spectral clustering (Kamvar, Klein, & Manning, ) also exist. Penalized probabilistic clustering (PPC) is a modified version of EM that interprets the constraints as (soft) probabilistic priors on the relationships between items (Lu & Leen, ).

In addition to constraining the assignment of individual items, constraints can be used to learn a better distance metric for the problem at hand (Bar-Hillel, Hertz, Shental, & Weinshall, ; Klein, Kamvar, & Manning, ; Xing, Ng, Jordan, & Russell, ). Must-link constraints hint that the effective distance between those items should be low, while cannot-link constraints suggest that their pairwise distance should be high. Modifying the metric accordingly permits the subsequent application of a regular clustering algorithm, which need not explicitly work with the constraints at all. The MPCK-means algorithm fuses these approaches together, providing both constraint satisfaction and metric learning simultaneously (Basu, Bilenko, & Mooney, ; Bilenko et al., ).

More information about subsequent advances in constrained clustering algorithms, theory, and novel applications can be found in a compilation edited by Basu, Davidson, and Wagstaff ().

Programs and Data

The MPCK-means algorithm is available in a modified version of the Weka machine learning toolkit (Java) at http://www.cs.utexas.edu/users/ml/risc/code/.
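For readers without access to that toolkit, the hard-constraint assignment idea described for COP-kmeans can be sketched in a few lines. This is a minimal illustration written in the spirit of that description, not the published implementation: the names `violates` and `cop_kmeans_sketch`, the caller-supplied initial centers, and the toy data are all assumptions made for the example. Each item is offered clusters from nearest to farthest and takes the first one that breaks no constraint with items already placed in the current pass; if no cluster is legal, the sketch gives up and returns `None`.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def violates(item, cluster, assignment, must_link, cannot_link):
    """True if placing `item` in `cluster` conflicts with items already assigned."""
    for a, b in must_link:
        other = b if a == item else a if b == item else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == item else a if b == item else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False


def cop_kmeans_sketch(points, initial_centers, must_link, cannot_link, iters=20):
    """k-means with hard pairwise constraints; returns {item index: cluster} or None."""
    centers = list(initial_centers)  # assumption: caller supplies starting centers
    k = len(centers)
    assignment = {}
    for _ in range(iters):
        new_assignment = {}
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest and keep the first legal one.
            order = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_assignment, must_link, cannot_link):
                    new_assignment[i] = c
                    break
            else:
                return None  # no constraint-respecting cluster exists for item i
        if new_assignment == assignment:
            break  # assignments are stable
        assignment = new_assignment
        for c in range(k):
            members = [points[i] for i, a in assignment.items() if a == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment


# Hypothetical toy data: two tight groups and one in-between point (index 5).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (2.4, 2.4)]
start = [(0.0, 0.0), (5.0, 5.0)]
constrained = cop_kmeans_sketch(points, start, must_link=[(3, 5)], cannot_link=[(1, 5)])
unconstrained = cop_kmeans_sketch(points, start, must_link=[], cannot_link=[])
infeasible = cop_kmeans_sketch(points, start, must_link=[(0, 1)], cannot_link=[(0, 1)])
```

Without constraints the in-between point joins the cluster around the origin; with the must-link to item 3 it is pulled into the other cluster. The contradictory constraint pair in the last call makes the sketch return `None`, mirroring the behavior of terminating without a solution. A soft-constraint variant would replace the `violates` test with a penalty added to the distance.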

Recommended Reading Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, , –.

Basu, S., Bilenko, M., & Mooney, R. J. (). A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. –). Seattle, WA. Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, FL: CRC Press. Bilenko, M., Basu, S., & Mooney, R. J. (). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-first International Conference on Machine Learning (pp. –). Banff, AB, Canada. Dempster, A. P., Laird, N. M., & Rubin, D. B. (). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, (), –. Kamvar, S., Klein, D., & Manning, C. D. (). Spectral learning. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. –). Acapulco, Mexico. Klein, D., Kamvar, S. D., & Manning, C. D. (). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. –). Sydney, Australia. Lu, Z. & Leen, T. (). Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press. MacQueen, J. B. (). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Symposium on Math, Statistics, and Probability (Vol. , pp. –). California: University of California Press. Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (). Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press. Wagstaff, K. & Cardie, C. (). Clustering with instance-level constraints.
In Proceedings of the Seventeenth International Conference on Machine Learning (pp. –). San Francisco: Morgan Kaufmann. Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. –). San Francisco: Morgan Kaufmann. Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (). Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (Vol. , pp. –). Cambridge, MA: MIT Press.

Constraint-Based Mining Siegfried Nijssen Katholieke Universiteit Leuven, Leuven, Belgium

Definition

Constraint-based mining is the research area studying the development of data mining algorithms that search through a pattern or model space restricted by constraints. The term is usually used to refer to algorithms that search for patterns only. The best-known instance of constraint-based mining is the mining of 7frequent patterns. Constraints are needed in pattern mining algorithms to increase the efficiency of the search and to reduce the number of patterns that are presented to the user, thus making knowledge discovery more effective and useful.

Motivation and Background

Constraint-based pattern mining is a generalization of frequent itemset mining. For an introduction to frequent itemset mining, see 7Frequent Patterns. A constraint-based mining problem is specified by providing the following elements:

● A database D, usually consisting of independent transactions (or instances)
● A 7hypothesis space L of patterns
● A constraint q(θ, D) expressing criteria that a pattern θ in the hypothesis space should fulfill on the database

The general constraint-based mining problem is to find the set Th(D, L, q) = {θ ∈ L ∣ q(θ, D) = true}. Alternative problem settings are obtained by making different choices for D, L and q. For instance,

● If the database and hypothesis space consist of itemsets, and the constraint checks if the support of a pattern exceeds a predefined threshold in the data, the frequent itemset mining problem is obtained (see 7Frequent Patterns)
● If the database and the hypothesis space consist of graphs or trees instead of itemsets, a graph mining or a tree mining problem is obtained. For more information about these topics, see 7Graph Mining and 7Tree Mining
● Additional syntactic constraints can be imposed
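The set Th(D, L, q) can be computed directly for a tiny itemset example. The sketch below is a brute-force illustration, not a practical miner: it enumerates the whole hypothesis space, which grows exponentially with the number of items. The names `support`, `theory`, and `all_itemsets`, the toy transactions, and the choice of "at least `min_support`" as the threshold test are assumptions made for the example.

```python
from itertools import combinations


def support(pattern, database):
    """Number of transactions that contain every item of `pattern`."""
    return sum(pattern <= transaction for transaction in database)


def theory(database, language, constraint):
    """Th(D, L, q): all patterns in the language that satisfy the constraint."""
    return [theta for theta in language if constraint(theta, database)]


def all_itemsets(items):
    """Hypothesis space L: every non-empty subset of the given items."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# Hypothetical toy database D of four transactions.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
L = list(all_itemsets(set().union(*D)))
min_support = 2  # assumed threshold; "at least" is one common convention


def is_frequent(theta, db):
    """Constraint q: the pattern is covered by at least min_support transactions."""
    return support(theta, db) >= min_support


frequent = theory(D, L, is_frequent)
```

Swapping in a different `constraint` function (for example, one that also bounds the pattern size) changes the problem setting without touching `theory`, which is the separation between solver and constraint that the definition suggests.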

An overview of important types of constraints is given below.

One can generalize the constraint-based mining problem beyond pattern mining. Models, such as Decision Trees, could also be seen as languages of interest. In the broadest sense, topics such as Constrained Clustering, Cost-Sensitive Learning, and even learning Support Vector Machines (SVMs) may be seen as constraint-based mining problems. However, it is currently not common to categorize these topics as constraint-based mining; in practice, the term refers to constraint-based pattern mining.

From the perspective of constraint-based mining, the knowledge discovery process can be seen as a process in which a user repeatedly specifies constraints for data mining algorithms; the data mining system is a solver that finds patterns or models that satisfy the constraints. This approach to data mining is very similar to querying relational databases. Whereas relational databases are usually queried using operations such as projections, selections, and joins, in the constraint-based mining framework data is queried to find patterns or models that satisfy constraints that cannot be expressed in these primitives. A database that supports constraint-based mining queries, stores patterns and models, and allows later reuse of patterns and models is sometimes also called an inductive database (Imielinski & Mannila, ).

Structure of the Learning System

Constraints
Frequent pattern mining algorithms can be generalized along several dimensions. One way to generalize pattern mining algorithms is to allow them to deal with arbitrary coverage relations, which determine when a pattern matches a transaction in the data. In the example of mining itemsets, the subset relation determines the coverage relation. The coverage relation is at the basis of constraints such as minimum support; an alternative coverage relation would be the superset relation. From the coverage relation follows a generality relationship. A pattern $\theta_1$ is defined to be more specific than a pattern $\theta_2$ (denoted by $\theta_1 \succ \theta_2$) if any transaction that is covered by $\theta_1$ is also covered by $\theta_2$ (see Generalization). In frequent itemset mining, itemset $I_1$ is more general than itemset $I_2$ if and only if $I_1 \subseteq I_2$. Generalization and coverage relationships can be used to identify the following types of constraints.

Monotonic and Anti-Monotonic Constraints
An essential property which is exploited in frequent pattern mining is that all subsets of a frequent pattern are also frequent. This is a property that can be generalized:

● A constraint is called monotonic if any generalization of a pattern that satisfies the constraint also satisfies the constraint
● A constraint is called anti-monotonic if any specialization of a pattern that satisfies the constraint also satisfies the constraint

In some publications, the definitions of monotonic and anti-monotonic are reversed. The following are examples of monotonic constraints:

● Minimum support
● Syntactic constraints, for instance: a constraint that requires that patterns specializing a given pattern x are excluded; a constraint requiring patterns to be small, given a definition of pattern size
● Disjunctions or conjunctions of monotonic constraints
● Negations of anti-monotonic constraints

The following are examples of anti-monotonic constraints:

● Maximum support
● Syntactic constraints, for instance, a constraint that requires that patterns generalizing a given pattern x are excluded
● Disjunctions or conjunctions of anti-monotonic constraints
● Negations of monotonic constraints
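As a sanity check of these definitions (in the convention of this entry, where generalizations of an itemset are its subsets), the following sketch tests both properties exhaustively on a toy database. The data, the thresholds, and the helper names are assumptions made for illustration; passing the check on one dataset is evidence, not a proof.

```python
from itertools import combinations

def support(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in database if pattern <= t)

def all_itemsets(items):
    """Every itemset over the given items, including the empty one."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(sorted(items), r)]

def is_monotonic(q, patterns):
    """Every generalization (subset) of a satisfying pattern satisfies q."""
    return all(q(g) for p in patterns if q(p) for g in patterns if g <= p)

def is_anti_monotonic(q, patterns):
    """Every specialization (superset) of a satisfying pattern satisfies q."""
    return all(q(s) for p in patterns if q(p) for s in patterns if s >= p)

# Toy data (hypothetical) and two support constraints.
D = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BD")]
P = all_itemsets("ABCD")
min_support = lambda p: support(p, D) >= 2
max_support = lambda p: support(p, D) <= 2

checks = {
    "min_support monotonic": is_monotonic(min_support, P),
    "min_support anti-monotonic": is_anti_monotonic(min_support, P),
    "max_support monotonic": is_monotonic(max_support, P),
    "max_support anti-monotonic": is_anti_monotonic(max_support, P),
}
```

On this data minimum support passes only the monotonic check and maximum support only the anti-monotonic one, matching the two lists above.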

Succinct Constraints
Constraints that can be pushed into the mining process by adapting the pattern space or the data are called succinct constraints. An example of a succinct constraint is the monotonic constraint that an itemset should contain the item A. This constraint could be dealt with by deleting all transactions that do not contain A. For any frequent itemset found in the new dataset, it is now known that the item A can be added to it.

Convertible Constraints
Some constraints that are not monotonic can still be convertible monotonic (Pei & Han, ). A constraint is convertible monotonic if for every pattern θ one least general generalization θ′ can be identified such that if θ satisfies the constraint, then θ′ also satisfies the constraint. An example of a convertible constraint is a maximum average cost constraint. Assume that every item in an itemset has a cost as defined by a function c(i). The constraint $c(I) = \sum_{i \in I} c(i) / |I| \leq \mathit{maxcost}$ is not monotonic. However, for every itemset I with $c(I) \leq \mathit{maxcost}$, if an item i with $c(i) = \max_{j \in I} c(j)$ is removed, an itemset with $c(I - \{i\}) \leq c(I) \leq \mathit{maxcost}$ is obtained. Maximum average cost has the desirable property that no access to the data is needed to identify the generalization that should satisfy the constraints. If it is not possible to identify the necessary least general generalization before accessing the data, the convertible constraint is also sometimes called weak (anti-)monotone (Zhu, Yan, Han, & Yu, ).

Boundable Constraints
Constraints on non-monotonic measures for which a monotonic bound exists are called boundable. An example of such a constraint is a minimum accuracy constraint in a database with binary class labels. Assume that every itemset is interpreted as a rule "if I then 1 else 0" (thus, class label 1 is predicted if a transaction contains itemset I, and class label 0 otherwise; see Supervised Descriptive Rule Discovery). A minimum accuracy constraint can be formalized by the formula $(fr(I, D_1) + |D_0| - fr(I, D_0)) / |D| \geq \mathit{minacc}$, where $D_k$ is the database containing only the examples labeled with class label k. It can be derived from this that $fr(I, D_1) \geq |D|\,\mathit{minacc} - |D_0| + fr(I, D_0) \geq |D|\,\mathit{minacc} - |D_0|$. In other words, if a high accuracy is desirable, a minimum number of examples of class 1 is required to be covered, and a minimum frequency constraint can thus be derived. Therefore, minimum support can be used as a bound for minimum accuracy. The principle of deriving bounds for non-monotonic measures can be applied widely (Bayardo, Agrawal, & Gunopulos, ; Morishita & Sese, ).

Borders
If constraints are not restrictive enough, the number of patterns can be huge. Ignoring statistics about patterns such as their exact frequency, the set of patterns can be represented more compactly by listing only the patterns in the border(s) (Mannila & Toivonen, ), similar to the idea of version spaces. An example of a border is the set of maximal frequent itemsets (see Frequent Patterns). Borders can be computed for other types of both monotonic and anti-monotonic constraints as well. There are several complications compared to the simple frequent pattern mining setting:

● If there is an anti-monotonic constraint, such as maximum support, it is necessary to compute not only a border for the most specific elements in the set (S-Set), but also a border for the most general elements in the set (G-Set)
● If the formula is a disjunction of conjunctions, the result of a query becomes a union of version spaces, which is called a multi-dimensional version space (see Fig. 1) (De Raedt, Jaeger, Lee, & Mannila, ); the G-Set of one version space may be more general than the G-Set of another version space

Both the S-Set and the G-Set can be represented by listing elements just within the version space (the positive border), or elements just outside the version space (the negative border). For instance, the positive border of the G-Set consists of those patterns which are part of the version space, and for which no generalizations exist which are part of the version space. Similarly, there may exist several representations of multi-dimensional version spaces; optimizing the representation of multi-dimensional version spaces is analogous to optimizing queries in relational databases (De Raedt et al., ). Borders form a condensed representation, that is, they compactly represent the solution space; see Frequent Patterns.

Algorithms
For many of the constraints specified in the previous section, specialized algorithms have been developed in combination with specific hypothesis spaces. It is beyond the scope of this chapter to discuss all these algorithms; only the most common ideas are provided here.

The main idea is that Apriori can easily be updated to deal with general monotonic constraints in arbitrary hypothesis spaces. The concept of a specialization refinement operator is essential to operate on hypothesis spaces other than itemsets. A specialization operator ρ(θ) computes a set of specializations in the hypothesis space for a given input pattern. In pattern mining, this operator should have the following properties:

● Completeness: every pattern in the hypothesis space should be reachable by repeated application of the refinement operator starting from the most general pattern in the hypothesis space
● Nonredundancy: every pattern in the hypothesis space should be reachable in only one way starting from the most general pattern in the hypothesis space

In itemset mining, optimal refinement is usually obtained by first ordering the items (for instance, alphabetically, or by frequency), and then adding items that are higher in the chosen order to a set than the items already in the set. For instance, for the itemset {A, C}, the specialization operator returns ρ({A, C}) = {{A, C, D}, {A, C, E}}, assuming that the domain of items {A, B, C, D, E} is considered. Other refinement operators are needed when dealing with other hypothesis spaces, such as in graph mining.

The search in Apriori proceeds breadth-first. At each level, the specialization operator is applied to patterns satisfying the monotonic constraints to generate candidates for the next level. For every new candidate it is checked whether its generalizations satisfy the monotonic constraints. To create a set of generalizations, a generalization refinement operator can be used. In frequent itemset mining, usually single items are removed from the itemset to generate generalizations.

More changes are required to deal with anti-monotonic constraints. A simple way of dealing with both monotonic and anti-monotonic constraints is to first compute all patterns that satisfy the monotonic constraints, and then to prune the patterns that fail to satisfy the anti-monotonic constraints. More challenging is to "push" anti-monotonic constraints into the mining process. An observation which is often exploited is that generalizations of patterns that do not satisfy the anti-monotonic constraint need not be considered. Well-known strategies are:

● In a breadth-first setting: traverse the lattice in reverse order for monotonic constraints, after the patterns satisfying the anti-monotonic constraints have been determined (De Raedt et al., )
● In a depth-first setting: during the search for patterns, try to guess the largest pattern that can still be reached, and prune a branch in the search if the pattern does not satisfy the monotonic constraint on this pattern (Bucila, Gehrke, Kifer, & White, ; Kifer, Gehrke, Bucila, & White, )

Constraint-Based Mining. Figure 1. Version spaces: (a) a 1-dimensional version space; (b) a 2-dimensional version space
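The levelwise (Apriori-style) search with a specialization operator, combined with the simple strategy of filtering on the anti-monotonic constraint afterwards, might be sketched as follows. This is a minimal illustration for itemsets only; the function names `refine` and `levelwise`, the tuple representation of patterns, and the toy data are assumptions, and none of the more advanced constraint-pushing strategies listed above are implemented.

```python
def support(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in database if set(pattern) <= t)

def refine(pattern, items):
    """Specialization operator: append only items later in the item order."""
    start = items.index(pattern[-1]) + 1 if pattern else 0
    return [pattern + (i,) for i in items[start:]]

def levelwise(items, monotonic, anti_monotonic=lambda p: True):
    """Breadth-first search on the monotonic constraint, then post-filtering."""
    if not monotonic(()):
        return []  # if the most general pattern fails, every pattern fails
    kept = {()}
    level = [()]
    while level:
        next_level = []
        for pattern in level:
            for cand in refine(pattern, items):
                # Generalizations: remove one item at a time from the candidate.
                gens = [cand[:k] + cand[k + 1:] for k in range(len(cand))]
                if all(g in kept for g in gens) and monotonic(cand):
                    kept.add(cand)
                    next_level.append(cand)
        level = next_level
    # Simple strategy: prune on the anti-monotonic constraint at the end.
    survivors = [p for p in kept if p and anti_monotonic(p)]
    return sorted(survivors, key=lambda p: (len(p), p))

# Toy data (hypothetical): support between 2 and 2 inclusive.
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
result = levelwise(["A", "B", "C", "D"],
                   monotonic=lambda p: support(p, D) >= 2,
                   anti_monotonic=lambda p: support(p, D) <= 2)
```

Because `refine` only appends items that come later in the order, each itemset is generated exactly once, which is the nonredundancy property described earlier.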

It is beyond the scope of this chapter to discuss how to deal with other types of constraints; however, it should be pointed out that not all combinations of constraints and hypothesis spaces have been studied; it is not obvious whether all constraints can be pushed usefully in a pattern search for any hypothesis space, for instance, when boundable constraints in more complex hypothesis spaces (such as graphs) are involved. Research in this area is ongoing.

Cross References
Constrained Clustering
Frequent Pattern Mining
Graph Mining
Tree Mining

Recommended Reading
Bayardo, R. J., Jr., Agrawal, R., & Gunopulos, D. (). Constraint-based rule mining in large, dense databases. In Proceedings of the th international conference on data engineering (ICDE) (pp. –). Sydney, Australia.
Bucila, C., Gehrke, J., Kifer, D., & White, W. M. (). DualMiner: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, (), –.
De Raedt, L., Jaeger, M., Lee, S. D., & Mannila, H. (). A theory of inductive query answering (extended abstract). In Proceedings of the second IEEE international conference on data mining (ICDM) (pp. –). Los Alamitos, CA: IEEE Press.
Imielinski, T., & Mannila, H. (). A database perspective on knowledge discovery. Communications of the ACM, , –.
Kifer, D., Gehrke, J., Bucila, C., & White, W. M. (). How to quickly find a witness. In Proceedings of the twenty-second ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (pp. –). San Diego, CA: ACM Press.
Mannila, H., & Toivonen, H. (). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, (), –.
Morishita, S., & Sese, J. (). Traversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGACT-SIGMOD-SIGART symposium on database systems (PODS) (pp. –). San Diego, CA: ACM Press.
Pei, J., & Han, J. (). Constrained frequent pattern mining: A pattern-growth view. SIGKDD Explorations, (), –.
Zhu, F., Yan, X., Han, J., & Yu, P. S. (). gPrune: A constraint pushing framework for graph pattern mining. In Proceedings of the sixth Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Lecture notes in computer science (Vol. , pp. –). Berlin: Springer.

Constructive Induction
Constructive induction is any form of induction that generates new descriptors not present in the input data (Dietterich & Michalski, ).

Recommended Reading
Dietterich, T. G., & Michalski, R. S. (). A comparative review of selected methods for learning from examples. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). Machine learning: An artificial intelligence approach, pp. –. Tioga.

Content Match
See Text Mining for Advertising

Content-Based Filtering

Synonyms
Content-based recommending

Definition
Content-based filtering is prevalent in Information Retrieval, where the text and multimedia content of documents is used to select documents relevant to a user's query. Here, the term refers to content-based recommenders, which provide recommendations by comparing representations of content describing an item to representations of content that interests a user.

Content-Based Recommending
See Content-Based Filtering

Context-Sensitive Learning
See Concept Drift

Contextual Advertising
See Text Mining for Advertising

Continual Learning

Synonyms
Life-Long Learning

Definition
A learning system that can continue adding new data without the need to ever stop or freeze the updating. Usually continual learning requires incremental and online learning as a component, but not every incremental learning system has the ability to achieve continual learning, i.e., the learning may deteriorate after some time.

Cross References
Cumulative Learning

Continuous Attribute
A continuous attribute can assume all values on the number line within the value range. See Attribute and Measurement Scales.

Contrast Set Mining

Definition
Contrast set mining is an area of supervised descriptive rule induction. The contrast set mining problem is defined as finding contrast sets, which are conjunctions of attributes and values that differ meaningfully in their distributions across groups (Bay & Pazzani, ). In this context, groups are the properties of interest.

Recommended Reading
Bay, S. D., & Pazzani, M. J. (). Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, (), –.

Cooperative Coevolution
See Compositional Coevolution

Co-Reference Resolution
See Entity Resolution

Correlation Clustering Anthony Wirth The University of Melbourne, Victoria, Australia

Synonyms Clustering with advice; Clustering with constraints; Clustering with qualitative information; Clustering with side information

Definition
In its rawest form, correlation clustering is a graph optimization problem. Consider a clustering C to be a mapping from the elements to be clustered, V, to the set {1, . . . , ∣V∣}, so that u and v are in the same cluster if and only if C[u] = C[v]. Given a collection of items in which each pair (u, v) has two weights $w^+_{uv}$ and $w^-_{uv}$, we must find a clustering C that minimizes

$$\sum_{C[u]=C[v]} w^-_{uv} + \sum_{C[u]\neq C[v]} w^+_{uv}, \qquad (1)$$

or, equivalently, maximizes

$$\sum_{C[u]=C[v]} w^+_{uv} + \sum_{C[u]\neq C[v]} w^-_{uv}. \qquad (2)$$

Note that although $w^+_{uv}$ and $w^-_{uv}$ may be thought of as positive and negative evidence towards coassociation, the actual weights are nonnegative.
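The two objectives can be evaluated directly from their definitions. The sketch below is illustrative only: the dictionary-based representation (weights keyed by alphabetically ordered pairs, missing pairs treated as weight 0), the function names, and the toy weights are assumptions. It also shows why minimizing the first objective and maximizing the second are equivalent: for any clustering the two values add up to the total weight.

```python
from itertools import combinations

def disagreement(clustering, w_plus, w_minus):
    """Minimization objective: w- on co-clustered pairs, w+ on separated pairs."""
    cost = 0.0
    for u, v in combinations(sorted(clustering), 2):
        if clustering[u] == clustering[v]:
            cost += w_minus.get((u, v), 0.0)
        else:
            cost += w_plus.get((u, v), 0.0)
    return cost

def agreement(clustering, w_plus, w_minus):
    """Maximization objective: w+ on co-clustered pairs, w- on separated pairs."""
    gain = 0.0
    for u, v in combinations(sorted(clustering), 2):
        if clustering[u] == clustering[v]:
            gain += w_plus.get((u, v), 0.0)
        else:
            gain += w_minus.get((u, v), 0.0)
    return gain

# Toy instance (hypothetical weights).
w_plus = {("a", "b"): 2.0, ("c", "d"): 1.0, ("b", "c"): 0.5}
w_minus = {("a", "c"): 1.0, ("a", "d"): 1.0, ("b", "d"): 1.0}
C = {"a": 0, "b": 0, "c": 1, "d": 1}

cost = disagreement(C, w_plus, w_minus)
gain = agreement(C, w_plus, w_minus)
total = sum(w_plus.values()) + sum(w_minus.values())
```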

Motivation and Background
The notion of clustering with advice, that is, nonmetric-driven relations between items, had been studied in other communities (Ferligoj & Batagelj, ) prior to its appearance in theoretical computer science. Traditional clustering problems, such as k-median and k-center, assume that there is some type of distance measure (metric) on the data items, and often specify the number of clusters that should be formed. In the clustering with advice framework, however, the number of clusters to be built need not be specified in advance: it can be an outcome of the objective function. Furthermore, instead of, or in addition to, a distance function, we are given advice as to which pairs of items are similar. The two weights $w^+_{uv}$ and $w^-_{uv}$ correspond to external advice about whether the pair should be clustered together or separately.

Bansal, Blum, and Chawla () introduced the problem to the theoretical computer science and machine-learning communities. They were motivated by database consistency problems, in which the same entity appeared in different forms in various databases. Given a collection of such records from multiple databases, the aim is to cluster together the records that appear to correspond to the same entity. From this viewpoint, the log odds ratio from some classifier,

$$\log\left(\frac{\Pr(\text{same})}{\Pr(\text{different})}\right),$$

corresponds to a label $w_{uv}$ for the pair. In many applications only one of the + and − weights for the pair is nonzero, that is

$$(w^+_{uv}, w^-_{uv}) = \begin{cases} (w_{uv}, 0) & \text{for } w_{uv} \geq 0 \\ (0, -w_{uv}) & \text{for } w_{uv} \leq 0. \end{cases}$$
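As a small worked instance of this mapping, the sketch below turns a classifier's estimate of Pr(same) into the pair of nonnegative weights. The function name `weights_from_probability` is hypothetical, the natural logarithm is an assumption (the entry does not fix a base), and the probability is assumed to lie strictly between 0 and 1.

```python
import math

def weights_from_probability(p_same):
    """Map Pr(same) in (0, 1) to (w_plus, w_minus) via the log odds ratio."""
    w = math.log(p_same / (1.0 - p_same))
    # Only one of the two weights is nonzero, and both are nonnegative.
    return (w, 0.0) if w >= 0 else (0.0, -w)

likely_same = weights_from_probability(0.9)       # positive evidence only
likely_different = weights_from_probability(0.1)  # negative evidence only
undecided = weights_from_probability(0.5)         # no evidence either way
```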

In addition, if every pair has weight ±1, then the instance is called complete; otherwise it is referred to as general.

Demaine, Emanuel, Fiat, and Immorlica () suggest the following motivation. Suppose we have a set of guests at a party. Each guest has preferences for whom they would like to sit with, and for whom they would like to avoid. We must group the guests into tables in a way that enhances the amicability of the party.

The notion of producing good clusterings when given inconsistent advice first appeared in the work of Ben-Dor, Shamir, and Yakhini (). A canonical example of inconsistent advice is this: items u and v are similar, items v and y are similar, but u and y are dissimilar. It is impossible to find a clustering that satisfies all the advice. Figure 1 shows a very simple example of inconsistent advice. In addition, although correlation clustering is an NP-hard problem, recent algorithms for clustering with advice guarantee that their solutions are only a specified factor worse than the optimal: that is, they are approximation algorithms.
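The canonical three-item example of inconsistent advice can be checked exhaustively. The sketch below enumerates every partition of {u, v, y} and counts the violated pieces of advice; the helper names `partitions` and `violations` are illustrative assumptions, and exhaustive enumeration is only practical for a handful of items.

```python
def partitions(items):
    """Yield every partition of a list of items as a list of clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put the first item into each existing cluster in turn ...
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        # ... or into a new singleton cluster.
        yield [[first]] + part

def violations(partition, similar, dissimilar):
    """Count similar pairs that are split plus dissimilar pairs kept together."""
    label = {x: k for k, cluster in enumerate(partition) for x in cluster}
    bad = sum(1 for u, v in similar if label[u] != label[v])
    bad += sum(1 for u, v in dissimilar if label[u] == label[v])
    return bad

similar = [("u", "v"), ("v", "y")]
dissimilar = [("u", "y")]
scores = [violations(p, similar, dissimilar)
          for p in partitions(["u", "v", "y"])]
```

Every one of the five partitions violates at least one piece of advice, so the best achievable number of disagreements here is 1.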

Correlation Clustering. Figure 1. Top left is a toy clustering with advice example showing three similar pairs (solid edges) and three dissimilar pairs (dashed edges). Bottom left is a clustering solution for this example with four singleton clusters, while bottom right has one cluster. Top right is a partitioning into two clusters that appears to best respect the advice

Theory
In setting out the correlation clustering framework, Bansal et al. () noted that the following algorithm produces a 2-approximation for the maximization problem:

▸ If the total of the positive weights exceeds the total of the negative weights, then place all the items in a single cluster; otherwise, make each item a singleton cluster.

They then showed that complete instances are NP-hard to optimize, and how to minimize the penalty (1) with a constant factor approximation. The constant for this combinatorial algorithm was rather large. The algorithm relied heavily on the completeness of the instance; it iteratively cleans clusters until every cluster is δ-clean. That is, for each item at most a fraction δ (0 < δ < 1) of the other items in its cluster have a negative relation with it, and at most δ outside its cluster a positive relation. Bansal et al. also demonstrated that the minimization problem on general instances is APX-hard: there is some constant, larger than 1, below which approximation is NP-hard. Finally, they provided a polynomial time approximation scheme (PTAS) for maximizing (2) in complete instances.

The constant factor for minimizing (1) on complete instances was improved to by Charikar, Guruswami, and Wirth (). They employed a region-growing type procedure to round the solution of a linear programming relaxation of the problem:

$$\begin{aligned}
\text{minimize} \quad & \sum_{ij} w^+_{ij} \cdot x_{ij} + w^-_{ij} \cdot (1 - x_{ij}) \\
\text{subject to} \quad & x_{ik} \leq x_{ij} + x_{jk} && \text{for all } i, j, k \\
& x_{ij} \in [0, 1] && \text{for all } i, j
\end{aligned} \qquad (3)$$

In this setting, $x_{ij} = 1$ implies i and j's separation, while $x_{ij} = 0$ implies coclustering, with values in between representing partial evidence. In practice solving this linear program is very slow and has huge memory demands (Bertolacci & Wirth, ). Charikar et al. also showed that this version of the problem is APX-hard. For the maximization problem (2), they showed that instances with general weights were APX-hard and provided a rounding of the following semidefinite program (SDP) that yields a . factor approximation algorithm.

$$\begin{aligned}
\text{maximize} \quad & \sum_{+(ij)} w_{ij} (v_i \cdot v_j) + \sum_{-(ij)} w_{ij} (1 - v_i \cdot v_j) \\
\text{subject to} \quad & v_i \cdot v_i = 1 && \text{for all } i \\
& v_i \cdot v_j \geq 0 && \text{for all } i, j
\end{aligned} \qquad (4)$$

In this case we interpret $v_i \cdot v_j = 1$ as evidence that i and j are in the same cluster, but $v_i \cdot v_j = 0$ as evidence toward separation.

Emanuel and Fiat () extended the work of Bansal et al. by drawing a link between Correlation Clustering and the Minimum Multicut problem. This reduction to Multicut provided an O(log n) approximation algorithm for minimizing general instances of Correlation Clustering. Interestingly, Emanuel and Fiat also showed that there was a reduction in the opposite direction: an optimal solution to Correlation Clustering induced an optimal solution to Minimum Multicut. Demaine and Immorlica () also drew the link from Correlation Clustering to Minimum Multicut and its O(log n) approximation algorithm. In addition, they described an O(r )-approximation algorithm for graphs that exclude the complete bipartite graph $K_{r,r}$ as a minor.
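The simple rule attributed to Bansal et al. earlier in this section (one big cluster if the positive weights dominate, otherwise all singletons) is short enough to sketch directly. The function name and the dictionary representation of the weights are assumptions made for illustration. The reasoning behind the guarantee is that the returned clustering agrees with either all positive or all negative weight, whichever is larger, and therefore with at least half of the total weight, which in turn bounds the optimum of the maximization objective.

```python
def trivial_max_agreement(items, w_plus, w_minus):
    """Return a clustering (item -> cluster label) by the all-or-nothing rule."""
    if sum(w_plus.values()) > sum(w_minus.values()):
        # Positive advice dominates: one cluster containing everything.
        return {x: 0 for x in items}
    # Otherwise: every item becomes its own singleton cluster.
    return {x: k for k, x in enumerate(items)}

items = ["a", "b", "c"]
mostly_similar = trivial_max_agreement(
    items, {("a", "b"): 1.0, ("b", "c"): 1.0}, {("a", "c"): 1.0})
mostly_dissimilar = trivial_max_agreement(
    items, {("a", "b"): 1.0}, {("a", "c"): 1.0, ("b", "c"): 1.0})
```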

Swamy (), using the same SDP (4) as Charikar et al., but different rounding techniques, showed how to maximize (2) within factor . in general instances. The factor approximation for minimization (1) of complete instances was lowered to . by Ailon, Charikar, and Newman (). Using the distances obtained by solving the linear program (3), they repeat the following steps:

▸ form a cluster around random item i by including each (unclustered) j with probability $1 - x_{ij}$; set the cluster aside.
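The repeated pivoting step can be sketched as a small loop. This is an illustration under stated assumptions rather than the authors' implementation: the function name `pivot_clustering`, the `join_probability` callback, and the seeded random generator are all hypothetical. For the rounding step described above, the callback would return 1 − x_ij for fractional distances x obtained elsewhere; the deterministic usage shown here instead joins j to i's cluster exactly when the pair carries positive advice.

```python
import random

def pivot_clustering(items, join_probability, rng=None):
    """Repeatedly pick a random pivot and build a cluster around it."""
    rng = rng or random.Random(0)
    remaining = list(items)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        cluster = [pivot]
        for j in remaining:
            # Each unclustered item joins the pivot with the given probability.
            if j != pivot and rng.random() < join_probability(pivot, j):
                cluster.append(j)
        clusters.append(cluster)
        # Set the cluster aside and continue with the unclustered items.
        remaining = [x for x in remaining if x not in cluster]
    return clusters

# Deterministic usage on a toy instance (hypothetical advice): the pairs in
# `positive` are similar, every other pair is dissimilar.
positive = {frozenset(("a", "b")), frozenset(("c", "d"))}
clusters = pivot_clustering(
    ["a", "b", "c", "d"],
    lambda i, j: 1.0 if frozenset((i, j)) in positive else 0.0)
```

On a consistent instance like this one the result does not depend on the order of the pivots; on inconsistent advice different pivot orders can give different clusterings.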

Since solving the linear program is highly resource hungry, Ailon et al. provided a combinatorial alternative: add j to i's cluster if $w^+_{ij} > w^-_{ij}$. Not only is this algorithm very fast, it is actually a factor approximation. Recently, Tan () has shown that the / + є inapproximability for maximizing (2) on general weighted graphs extends to general unweighted graphs.

A further variant in the Correlation Clustering family of problems is the maximization of (2) − (1), known as maximizing correlation. Charikar and Wirth () proved an Ω(1/ log n) approximation for the general problem of maximizing

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j, \quad \text{s.t. } x_i \in \{-1, 1\} \text{ for all } i, \qquad (5)$$

for a matrix A with null diagonal entries, by rounding the canonical SDP relaxation. This effectively maximized correlation with the requirement that two clusters be formed; it was not hard to extend this to general instances. The gap between the vector SDP solution and the integral solution to maximizing the quadratic program (5) was in fact shown to be Θ(1/ log n) in general (Alon, Makarychev, Makarychev, & Naor, ). However, in other instances such as those with a bounded number of nonzero weights for each item, a constant factor approximation was possible. Arora, Berger, Hazan, Kindler, and Safra () went further and showed that it is quasi-NP-hard to approximate the maximization to a factor better than $\Omega(1/\log^{\gamma} n)$ for some γ > 0.

Shamir, Sharan, and Tsur () showed that Cluster Editing and p-Cluster Editing, in which p clusters must be formed, are NP-complete (for p ≥ ). Gramm, Guo, Hüffner, and Niedermeier () took an innovative approach to solving the Cluster Editing problem exactly. They had previously produced an O(.k + n ) time hand-made search tree algorithm, where k is the number of edges that need to be modified. This "awkward and error-prone work" was then replaced with a computer program that itself designed a search tree algorithm, involving automated case analysis, that ran in O(.k + n ) time.

Kulis, Basu, Dhillon, and Mooney () unify various forms of clustering, correlation clustering, spectral clustering, and clustering with constraints in their kernel-based approach to k-means. In this, they have a general objective function that includes penalties for violating pairwise constraints and for having points spread far apart from their cluster centers, where the spread is measured in some high-dimensional space.

Applications
The work of Demaine and Immorlica () on Correlation Clustering was closely linked with that of Bejerano et al. on Location Area Planning. This problem is concerned with the allocation of cells in a cellular network to clusters known as location areas. There are costs associated with traffic between the location areas (cuts between clusters) and with the size of clusters themselves (related to paging phones within individual cells). These costs drive the clustering solution in opposite directions, on top of which there are constraints on cells that must (or cannot) be in the same cluster. The authors show that the same O(log n) region-growing algorithm for minimizing Correlation Clustering and Multicut applies to Location Area Planning.

Correlation clustering has been directly applied to the coreference problem in natural language processing and other instances in which there are multiple references to the same object (Daume, ; McCallum & Wellner, ). Assuming some sort of undirected graphical model, such as a Conditional Random Field, algorithms for correlation clustering are used to partition a graph whose edge weights correspond to log-potentials between node pairs. The machine learning community has applied some of the algorithms for Correlation Clustering to problems such as email clustering and image segmentation. With similar applications in mind, Finley and Joachims () explore the idea of adapting the pairwise input information to fit example clusterings given by a user. Their objective function is the same as Correlation Clustering (1), but their main tool is the Support Vector Machine.

There has been considerable interest in the consensus clustering problem, which is an excellent application of Correlation Clustering techniques. Gionis, Mannila, and Tsaparas () note several sources of motivation for Consensus Clustering; these include identifying the correct number of clusters and improving clustering robustness. They adapt Charikar et al.'s region-growing algorithm to create a three-approximation that performs reasonably well in practice, though not as well as local search techniques. Gionis et al. also suggest using sampling as a tool for handling large data sets. Bertolacci and Wirth () extended this study by implementing Ailon et al.'s algorithms with sampling, and therefore a variety of ways of developing a full clustering from the clustering of the sample. They noted that LP-based methods performed best, but placed a significant strain on resources.
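One common way to view consensus clustering as a correlation clustering instance is to let each input clustering vote on every pair of items. The sketch below is an assumption-laden illustration, not a description of any particular paper's implementation: the function name `consensus_weights` and the choice of raw vote counts as weights are hypothetical. With these weights, the disagreement objective of a candidate clustering counts its pairwise disagreements with all input clusterings combined.

```python
from itertools import combinations

def consensus_weights(clusterings):
    """Build (w_plus, w_minus) from several clusterings of the same items.

    Each clustering maps item -> cluster label. For a pair (u, v), w_plus
    counts the clusterings that put u and v together, and w_minus counts
    those that separate them.
    """
    items = sorted(clusterings[0])
    w_plus, w_minus = {}, {}
    for u, v in combinations(items, 2):
        together = sum(1 for c in clusterings if c[u] == c[v])
        w_plus[(u, v)] = together
        w_minus[(u, v)] = len(clusterings) - together
    return w_plus, w_minus

# Three hypothetical input clusterings of the items a, b, c.
inputs = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
]
w_plus, w_minus = consensus_weights(inputs)
```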

Applications of Clustering with Advice
The k-means clustering algorithm is perhaps the most-used clustering technique: Wagstaff et al. incorporated constraints into a highly cited k-means variant called COP-KMEANS. They applied this algorithm to the task of identifying lanes of traffic based on input GPS data. In the constrained-clustering framework, the constraints are usually assumed to be consistent (noncontradictory) and hard. In addition to the usual must- and cannot-link constraints, Davidson and Ravi () added constraints enforcing various requirements on the distances between points in particular clusters. They analyzed the computational feasibility of the problem of establishing the (in)feasibility of a set of constraints, for various constraint types. Their constrained k-means algorithms were used to help a robot discover objects in a scene.
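For hard must-link and cannot-link constraints alone, consistency in the sense used above can be checked with a union-find structure: merge all must-linked items, then verify that no cannot-link pair ended up in the same group. This sketch deliberately ignores any limit on the number of clusters and the distance-based constraints mentioned above, both of which change the problem; the function name and the example constraints are assumptions.

```python
def constraints_consistent(items, must_link, cannot_link):
    """True if some clustering satisfies every must- and cannot-link pair."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Must-link is transitive: merge the groups of every must-linked pair.
    for u, v in must_link:
        parent[find(u)] = find(v)

    # A cannot-link pair inside one merged group is a contradiction.
    return all(find(u) != find(v) for u, v in cannot_link)

items = ["u", "v", "y"]
contradictory = constraints_consistent(
    items, must_link=[("u", "v"), ("v", "y")], cannot_link=[("u", "y")])
satisfiable = constraints_consistent(
    items, must_link=[("u", "v")], cannot_link=[("u", "y")])
```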

Recommended Reading Ailon, N., Charikar, M., & Newman, A. (). Aggregating inconsistent information: Ranking and clustering. In Proceedings of the Thirty-Seventh ACM Symposium on the Theory of Computing (pp. –). New York: ACM Press.

Alon, N., Makarychev, K., Makarychev, Y., & Naor, A. (). Quadratic forms on graphs. Inventiones Mathematicae, (), –. Arora, S., Berger, E., Hazan, E., Kindler, G., & Safra, S. (). On non-approximability for quadratic programs. In Proceedings of Forty-Sixth Symposium on Foundations of Computer Science. (pp. –). Washington DC: IEEE Computer Society. Bansal, N., Blum, A., & Chawla, S. (). Correlation clustering. In Correlation clustering (pp. –). Washington, DC: IEEE Computer Society. Ben-Dor, A., Shamir, R., & Yakhini, Z. (). Clustering gene expression patterns. Journal of Computational Biology, , –. Bertolacci, M., & Wirth, A. (). Are approximation algorithms for consensus clustering worthwhile? In Proceedings of Seventh SIAM International Conference on Data Mining. (pp. –). Philadelphia: SIAM. Charikar, M., Guruswami, V., & Wirth, A. (). Clustering with qualitative information. In Proceedings of forty fourth FOCS (pp. –). Charikar, M., & Wirth, A. (). Maximizing quadratic programs: Extending Grothendieck’s inequality. In Proceedings of forty fifth FOCS (pp. –). Daume, H. (). Practical structured learning techniques for natural language processing. PhD thesis, University of Southern California. Davidson, I., & Ravi, S. (). Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of Fifth SIAM International Conference on Data Mining. Demaine, E., Emanuel, D., Fiat, A., & Immorlica, N. (). Correlation clustering in general weighted graphs. Theoretical Computer Science, (), –. Demaine, E., & Immorlica, N. (). Correlation clustering with partial information. In Proceedings of Sixth Workshop on Approximation Algorithms for Combinatorial Optimization Problems. (pp. –). Emanuel, D., & Fiat, A. (). Correlation clustering – minimizing disagreements on arbitrary weighted graphs. In Proceedings of Eleventh European Symposium on Algorithms (pp. –). Ferligoj, A., & Batagelj, V. (). Clustering with relational constraint. Psychometrika, (), –. 
Finley, T., & Joachims, T. (). Supervised clustering with support vector machines. In Proceedings of Twenty-Second International Conference on Machine Learning. Gionis, A., Mannila, H., & Tsaparas, P. (). Clustering aggregation. In Proceedings of Twenty-First International Conference on Data Engineering. To appear. Gramm, J., Guo, J., Hüffner, F., & Niedermeier, R. (). Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, (), –. Kulis, B., Basu, S., Dhillon, I., & Mooney, R. (). Semi-supervised graph clustering: A kernel approach. In Proceedings of TwentySecond International Conference on Machine Learning (pp. –). McCallum, A., & Wellner, B. (). Conditional models of identity uncertainty with application to noun coreference. In L. Saul,

Y. Weiss, & L. Bottou, (Eds.), Advances in neural information processing systems (pp. –). Cambridge, MA: MIT Press. Meil˘a, M. (). Comparing clusterings by the variation of information. In Proceedings of Sixteenth Conference on Learning Theory (pp. –). Shamir, R., Sharan, R., & Tsur, D. (). Cluster graph modification problems. Discrete Applied Mathematics, , –. Swamy, C. (). Correlation Clustering: Maximizing agreements via semidefinite programming. In Proceedings of Fifteenth ACM-SIAM Symposium on Discrete Algorithms (pp. –). Tan, J. (). A Note on the inapproximability of correlation clustering. Technical Report ., eprint arXiv, .

Correlation-Based Learning
→ Biological Learning: Synaptic Plasticity, Hebb Rule and Spike Timing Dependent Plasticity

Cost
In Markov decision processes, negative rewards are often expressed as costs. A reward of −x is expressed as a cost of x. In supervised learning, cost is used as a synonym for loss.

Cross References
→ Loss

Cost Function
→ Loss Function

Cost-Sensitive Classification
→ Cost-Sensitive Learning

Cost-Sensitive Learning
Charles X. Ling, Victor S. Sheng
The University of Western Ontario, Canada

Synonyms
Cost-sensitive classification; Learning with different classification costs

Definition
Cost-Sensitive Learning is a type of learning that takes the misclassification costs (and possibly other types of cost) into consideration. The goal of this type of learning is to minimize the total cost. The key difference between cost-sensitive learning and cost-insensitive learning is that cost-sensitive learning treats different misclassifications differently. That is, the cost for labeling a positive example as negative can be different from the cost for labeling a negative example as positive. Cost-insensitive learning does not take misclassification costs into consideration.

Motivation and Background
Classification is an important task in inductive learning and machine learning. A classifier, trained from a set of training examples with class labels, can then be used to predict the class labels of new examples. The class label is usually discrete and finite. Many effective classification algorithms have been developed, such as naïve Bayes, decision trees, neural networks, and support vector machines. However, most classification algorithms seek to minimize the error rate: the percentage of class labels that are predicted incorrectly. They ignore the difference between types of misclassification errors. In particular, they implicitly assume that all misclassification errors have equal cost.

In many real-world applications, this assumption is not true. The differences between different misclassification errors can be quite large. For example, in medical diagnosis of a certain cancer (where having cancer is regarded as the positive class, and non-cancer (healthy) as negative), misdiagnosing a cancer patient as healthy (the patient is actually positive but is classified as negative; thus it is also called a “false negative”) is much more serious (thus expensive) than a false-positive error. The patient could lose his/her life because of a delay in correct diagnosis and treatment. Similarly, if carrying a bomb is positive, then it is much more expensive to miss a terrorist who carries a bomb onto a flight than to search an innocent person.

Cost-sensitive learning takes costs, such as the misclassification cost, into consideration. Turney () provides a comprehensive survey of a large variety of different types of costs in data mining and machine learning, including misclassification costs, data acquisition cost (instance costs and attribute costs), active learning costs, computation cost, human–computer interaction cost, and so on. The misclassification cost is singled out as the most important cost, and it has received the most attention in recent years.

Theory
The theory of cost-sensitive learning (Elkan, ; Zadrozny and Elkan, ) describes how the misclassification cost plays its essential role in various cost-sensitive learning algorithms. Without loss of generality, binary classification is assumed (i.e., positive and negative class) in this paper. In cost-sensitive learning, the costs of false positive (actual negative but predicted as positive; denoted as FP), false negative (FN), true positive (TP), and true negative (TN) can be given in a cost matrix, as shown in Table 1. In the table, the notation $C(i, j)$ is also used to represent the misclassification cost of classifying an instance from its actual class $j$ into the predicted class $i$ (1 is used for positive, and 0 for negative). These misclassification cost values can be given by domain experts, or learned via other approaches. In cost-sensitive learning, it is usually assumed that such a cost matrix is given and known. For multiple classes, the cost matrix can be easily extended by adding more rows and more columns. Note that $C(i, i)$ (TP and TN) is usually regarded as the “benefit” (i.e., negated cost) when an instance is predicted correctly.

Cost-Sensitive Learning. Table 1 An Example of Cost Matrix for Binary Classification

                    Actual negative     Actual positive
Predict negative    C(0, 0), or TN      C(0, 1), or FN
Predict positive    C(1, 0), or FP      C(1, 1), or TP

In addition, cost-sensitive learning is often used to deal with datasets with very imbalanced class distributions (see Class Imbalance Problem) (Japkowicz & Stephen, ). Usually (and without loss of generality), the minority or rare class is regarded as the positive class, and it is often more expensive to misclassify an actual positive example as negative than an actual negative example as positive. That is, the value of FN = $C(0, 1)$ is usually larger than that of FP = $C(1, 0)$. This is true for the cancer example mentioned earlier (cancer patients are usually rare in the population, but predicting an actual cancer patient as negative is usually very costly) and the bomb example (terrorists are rare).

Given the cost matrix, an example should be classified into the class that has the minimum expected cost. This is the minimum expected cost principle. The expected cost $R(i \mid x)$ of classifying an instance $x$ into class $i$ (by a classifier) can be expressed as:

$$R(i \mid x) = \sum_{j} P(j \mid x)\, C(i, j), \tag{1}$$

where $P(j \mid x)$ is the estimated probability that instance $x$ belongs to class $j$. That is, the classifier will classify an instance $x$ into the positive class if and only if:

$$P(0 \mid x)\, C(1, 0) + P(1 \mid x)\, C(1, 1) \le P(0 \mid x)\, C(0, 0) + P(1 \mid x)\, C(0, 1).$$

This is equivalent to:

$$P(0 \mid x)\, \bigl(C(1, 0) - C(0, 0)\bigr) \le P(1 \mid x)\, \bigl(C(0, 1) - C(1, 1)\bigr).$$

Thus, the decision (of classifying an example into positive) will not be changed if a constant is added to a column of the original cost matrix. The original cost matrix can therefore always be converted to a simpler one by subtracting $C(0, 0)$ from the first column, and $C(1, 1)$ from the second column. After such conversion, the simpler cost matrix is shown in Table 2. Thus, any given cost matrix can be converted to one with $C(0, 0) = C(1, 1) = 0$. (Here it is assumed that the misclassification cost is the same for all examples. This property is a special case of the one discussed in Elkan ().) In the rest of the paper, it will be assumed that $C(0, 0) = C(1, 1) = 0$.

Cost-Sensitive Learning. Table 2 A Simpler Cost Matrix with an Equivalent Optimal Classification

                    True negative         True positive
Predict negative    0                     C(0, 1) – C(1, 1)
Predict positive    C(1, 0) – C(0, 0)     0

Under this assumption, the classifier will classify an instance $x$ into the positive class if and only if:

$$P(0 \mid x)\, C(1, 0) \le P(1 \mid x)\, C(0, 1).$$

As $P(0 \mid x) = 1 - P(1 \mid x)$, this inequality can be rearranged into $P(1 \mid x)\,\bigl(C(1, 0) + C(0, 1)\bigr) \ge C(1, 0)$, so a threshold $p^*$ can be obtained for the classifier to classify an instance $x$ into positive if $P(1 \mid x) \ge p^*$, where

$$p^* = \frac{C(1, 0)}{C(1, 0) + C(0, 1)} = \frac{FP}{FP + FN}. \tag{2}$$
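The minimum expected cost rule and the threshold $p^*$ derived above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the cited papers: the function names and the example costs (FP = 1, FN = 10) are assumptions made for this sketch, and the matrix is indexed as `cost[predicted][actual]` with 1 for positive and 0 for negative.

```python
def expected_cost(predicted, p_pos, cost):
    """Expected cost of predicting class `predicted` (1 = positive, 0 = negative).

    `p_pos` is the estimated probability that the example is positive, and
    cost[i][j] is the cost of predicting class i when the actual class is j.
    """
    prob = (1.0 - p_pos, p_pos)  # prob[j] = P(j | x)
    return sum(prob[j] * cost[predicted][j] for j in (0, 1))


def classify_min_cost(p_pos, cost):
    """Predict positive iff its expected cost does not exceed that of negative."""
    return 1 if expected_cost(1, p_pos, cost) <= expected_cost(0, p_pos, cost) else 0


def cost_threshold(cost):
    """Threshold p* = FP / (FP + FN), assuming zero cost for correct predictions."""
    fp, fn = cost[1][0], cost[0][1]
    return fp / (fp + fn)


# Illustrative (assumed) costs: a false negative costs ten times a false positive.
COST = [[0.0, 10.0],  # predict negative: TN, FN
        [1.0, 0.0]]   # predict positive: FP, TP

p_star = cost_threshold(COST)         # 1 / 11
label = classify_min_cost(0.2, COST)  # positive, because 0.2 >= p_star
```

With these assumed costs an example is labeled positive as soon as its estimated probability of being positive reaches 1/11, far below the default threshold of one half.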

Thus, if a cost-insensitive classifier can produce a posterior probability estimate $P(1 \mid x)$ for each test example $x$, one can make the classifier cost-sensitive by simply choosing the classification threshold according to (2), and classifying any example as positive whenever $P(1 \mid x) \ge p^*$. This is what several cost-sensitive meta-learning algorithms, such as Relabeling, are based on (see later for details). However, some cost-insensitive classifiers, such as C4.5, may not be able to produce accurate probability estimates; they return a class label without a probability estimate. Empirical Thresholding (Sheng & Ling, ) does not require accurate estimation of probabilities; an accurate ranking is sufficient. It simply uses cross-validation to search for the best probability value $p^*$ to use as a threshold.

Traditional cost-insensitive classifiers are designed to predict the class in terms of a default, fixed threshold of 0.5. Elkan () shows that one can “rebalance” the original training examples by sampling, such that a classifier with the 0.5 threshold is equivalent to a classifier with the $p^*$ threshold as in (2), in order to achieve cost-sensitivity. The rebalance is done as follows. If all positive examples (as they are assumed to be the rare class) are kept, then the number of negative examples should be multiplied by $C(1, 0)/C(0, 1) = FP/FN$. Note that as usually FP < FN, the multiple is less than 1. This is, thus, often called “under-sampling the majority class.” This is also equivalent to “proportional sampling,” where positive and negative examples are sampled by the ratio of:

$$p(1)\, FN : p(0)\, FP, \tag{3}$$

where $p(1)$ and $p(0)$ are the prior probabilities of the positive and negative examples in the original training set. That is, the prior probabilities and the costs are interchangeable: doubling $p(1)$ has the same effect as doubling FN, or halving FP (Drummond & Holte, ). Most sampling meta-learning methods, such as costing (Zadrozny, Langford, & Abe, ), are based on (3) above (see later for details). Almost all meta-learning approaches are based on either (2) or (3), for the thresholding- and sampling-based meta-learning methods, respectively, to be discussed in the next section.
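The rebalancing argument can be checked numerically. The sketch below is an illustration under stated assumptions, not code from Elkan's paper: the helper names are hypothetical, the costs FP = 1 and FN = 10 are example values, and each negative example is kept independently with probability FP/FN. The second function uses the fact that thinning the negatives by a factor r divides the posterior odds against the positive class by the same factor.

```python
import random


def undersample_negatives(examples, fp_cost, fn_cost, seed=0):
    """Keep every positive example; keep each negative with probability FP/FN.

    `examples` is a list of (features, label) pairs with label 1 for positive
    and 0 for negative. Assumes 0 < fp_cost <= fn_cost.
    """
    rng = random.Random(seed)
    keep_prob = fp_cost / fn_cost
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_prob]


def rebalanced_posterior(p_pos, fp_cost, fn_cost):
    """Posterior of the positive class after negatives are thinned by FP/FN."""
    ratio = fp_cost / fn_cost
    return p_pos / (p_pos + (1.0 - p_pos) * ratio)


# With the assumed costs, the threshold p* = 1/11 on the original data
# corresponds to the default threshold 0.5 on the rebalanced data.
p_star = 1.0 / (1.0 + 10.0)
at_threshold = rebalanced_posterior(p_star, 1.0, 10.0)
```

An example whose original posterior equals $p^*$ ends up with a posterior of one half after rebalancing, which is why a cost-insensitive learner trained on the rebalanced sample behaves like a thresholded learner on the original data.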

Structure of Learning System
Broadly speaking, cost-sensitive learning methods fall into two categories. The first is to design classifiers that are cost-sensitive in themselves; these are called direct methods. Examples of direct cost-sensitive learning are ICET (Turney, ) and cost-sensitive decision trees (Drummond & Holte, ; Ling, Yang, Wang, & Zhang, ). The other category is to design a “wrapper” that converts any existing cost-insensitive (or cost-blind) classifier into a cost-sensitive one. The wrapper method is also called the cost-sensitive meta-learning method, and it can be further categorized into thresholding and sampling. Here is a hierarchy of cost-sensitive learning and some typical methods. This paper will focus on cost-sensitive meta-learning that considers the misclassification cost only.

Cost-sensitive learning
– Direct methods
  ● ICET (Turney, )
  ● Cost-sensitive decision trees (Drummond & Holte, ; Ling et al., )
– Meta-learning
  ● Thresholding
      MetaCost (Domingos, )
      CostSensitiveClassifier (CSC in short) (Witten & Frank, )
      Cost-sensitive naïve Bayes (Chai, Deng, Yang, & Ling, )
      Empirical Thresholding (ET in short) (Sheng & Ling, )
  ● Sampling
      Costing (Zadrozny et al., )
      Weighting (Ting, )

Direct Cost-Sensitive Learning

The main idea of building a direct cost-sensitive learning algorithm is to introduce and use misclassification costs directly in the learning algorithm. There are several works on direct cost-sensitive learning algorithms, such as ICET (Turney, ) and cost-sensitive decision trees (Ling et al., ). ICET (Turney, ) incorporates misclassification costs in the fitness function of genetic algorithms. On the other hand, the cost-sensitive decision tree (Ling et al., ), called CSTree here, uses the misclassification costs directly in its tree building process. That is, instead of minimizing entropy in attribute selection as in C4.5, CSTree selects the best attribute by the expected total cost reduction: an attribute is selected as the root of the (sub)tree if it minimizes the total misclassification cost. Note that as both ICET and CSTree take costs directly into model building, they can also easily take attribute costs (and perhaps other costs) into consideration, while meta cost-sensitive learning algorithms generally cannot. Drummond and Holte () investigate the cost-sensitivity of the four commonly used attribute selection criteria of decision tree learning: accuracy, Gini, entropy, and DKM. They claim that the sensitivity to cost is highest with accuracy, followed by Gini, entropy, and DKM.

Cost-Sensitive Meta-Learning

Cost-sensitive meta-learning converts existing cost-insensitive classifiers into cost-sensitive ones without modifying them. Thus, it can be regarded as a middleware component that preprocesses the training data, or post-processes the output, from the cost-insensitive learning algorithms. Cost-sensitive meta-learning can be further classified into two main categories: thresholding and sampling, based on (2) and (3) respectively, as discussed in the theory section.

Thresholding uses (2) as a threshold to classify examples into positive or negative if the cost-insensitive classifiers can produce probability estimations. MetaCost (Domingos, ) is a thresholding method. It first uses bagging on decision trees to obtain reliable probability estimations of training examples, relabels the classes of training examples according to (1), and then uses the relabeled training instances to build a cost-insensitive classifier. CSC (Witten & Frank, ) also uses (1) to predict the class of test instances. More specifically, CSC uses a cost-insensitive algorithm to obtain the probability estimations $P(j \mid x)$ of each test instance. (CSC is a meta-learning method and can be applied to any classifier.) Then it uses (1) to predict the class label of the test examples. Cost-sensitive naïve Bayes (Chai et al., ) uses (2) to classify test examples based on the posterior probability produced by naïve Bayes. As seen, all thresholding-based meta-learning methods rely on accurate probability estimations of $P(1 \mid x)$ for the test example $x$. To achieve this, Zadrozny and Elkan () propose several methods to improve the calibration of probability estimates. ET (Empirical Thresholding) (Sheng and Ling, ) is a thresholding-based meta-learning method. It does not require accurate estimation of probabilities; an accurate ranking is sufficient. ET simply uses cross-validation to search for the best probability from the training instances as the threshold, and uses the resulting threshold to predict the class label of test instances.

On the other hand, sampling first modifies the class distribution of the training data according to (3), and then applies cost-insensitive classifiers on the sampled data directly. There is no need for the classifiers to produce probability estimations, as long as they can classify positive or negative examples accurately. Zadrozny et al. () show that proportional sampling with replacement produces duplicated cases in the training data, which in turn produces overfitting in model building. Instead, Zadrozny et al. () propose to use “rejection sampling” to avoid duplication.
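The threshold search behind empirical thresholding can be sketched as follows. This is a simplified illustration, not the procedure of Sheng and Ling: it assumes the held-out scores have already been collected (for instance, pooled from cross-validation folds), and the function names and the toy scores, labels, and costs are hypothetical. Only the ranking of the scores matters, so every observed score is tried as a candidate threshold.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Total misclassification cost when predicting positive iff score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += fp_cost  # false positive
        elif predicted == 0 and label == 1:
            cost += fn_cost  # false negative
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost):
    """Return the candidate threshold with the lowest total cost on held-out data."""
    # Each distinct score is a candidate; infinity means "never predict positive".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Toy held-out scores (assumed): higher means more likely positive.
scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
chosen = best_threshold(scores, labels, fp_cost=1.0, fn_cost=10.0)
```

Because a false negative is assumed to cost ten times a false positive, the search settles on a low threshold that accepts one false positive rather than missing the positive example scored 0.3.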
More specifically, each instance in the original training set is drawn once, and accepted into the sample with the accepting probability $C(j, i)/Z$, where $C(j, i)$ is the cost of misclassifying an instance of class $i$, and $Z$ is an arbitrary constant such that $Z \ge \max_{i,j} C(j, i)$. When $Z = \max_{i,j} C(j, i)$, this is equivalent to keeping all examples of the rare class, and sampling the majority class without replacement according to $C(1, 0)/C(0, 1)$, in accordance with (3). Bagging is applied after rejection sampling to improve the results further. The resulting method is called Costing.

Weighting (Ting, ) can also be viewed as a sampling method. It assigns a normalized weight to each instance according to the misclassification costs specified in (3). That is, examples of the rare class (which carries a higher misclassification cost) are assigned proportionally high weights. Examples with high weights can be viewed as duplicated examples, and thus as oversampling. Weighting then induces cost-sensitivity by integrating the instances’ weights directly into C4.5, as C4.5 can take example weights directly in the entropy calculation. It works whenever the original cost-insensitive classifiers can accept example weights directly. (Thus, it can be said that Weighting is a semi meta-learning method.) In addition, Weighting does not rely on bagging as Costing does, as it “utilizes” all examples in the training set.
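The rejection sampling step of Costing can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the toy data are hypothetical, each example carries its own misclassification cost, Z is taken to be the largest cost, and the bagging step that Costing applies afterwards is omitted.

```python
import random


def rejection_sample(examples, costs, seed=0):
    """Draw each example once and accept it with probability cost / Z.

    `costs[k]` is the misclassification cost of `examples[k]`; Z is set to the
    largest cost, so the most costly examples are always accepted.
    """
    rng = random.Random(seed)
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


# Toy data (assumed): 20 rare positives with cost 10, 200 negatives with cost 1.
examples = [("pos", i) for i in range(20)] + [("neg", i) for i in range(200)]
costs = [10.0] * 20 + [1.0] * 200
sample = rejection_sample(examples, costs)
```

Since every example is considered exactly once, the sample contains no duplicates, which is the property that distinguishes this scheme from proportional sampling with replacement.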

Recommended Reading
Chai, X., Deng, L., Yang, Q., & Ling, C. X. (). Test-cost sensitive naïve Bayesian classification. In Proceedings of the fourth IEEE international conference on data mining. Brighton: IEEE Computer Society Press.
Domingos, P. (). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth international conference on knowledge discovery and data mining, San Diego (pp. –). New York: ACM.
Drummond, C., & Holte, R. (). Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the th international conference on machine learning (pp. –).
Elkan, C. (). The foundations of cost-sensitive learning. In Proceedings of the th international joint conference of artificial intelligence (pp. –). Seattle: Morgan Kaufmann.
Japkowicz, N., & Stephen, S. (). The class imbalance problem: A systematic study. Intelligent Data Analysis, (), –.
Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (). Decision trees with minimal costs. In Proceedings of international conference on machine learning (ICML’).
Sheng, V. S., & Ling, C. X. (). Thresholding for making classifiers cost-sensitive. In Proceedings of the st national conference on artificial intelligence (pp. –), – July , Boston, Massachusetts.
Ting, K. M. (). Inducing cost-sensitive trees via instance weighting. In Proceedings of the second European symposium on principles of data mining and knowledge discovery (pp. –). Heidelberg: Springer.
Turney, P. D. (). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, , –.
Turney, P. D. (). Types of cost in inductive concept learning. In Proceedings of the workshop on cost-sensitive learning at the th international conference on machine learning, Stanford University, California.
Witten, I. H., & Frank, E. (). Data mining – Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.
Zadrozny, B., & Elkan, C. (). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the seventh international conference on knowledge discovery and data mining (pp. –).
Zadrozny, B., Langford, J., & Abe, N. (). Cost-sensitive learning by cost-proportionate instance weighting. In Proceedings of the third International conference on data mining.

Cost-to-Go Function Approximation
→ Value Function Approximation

Covariance Matrix
Xinhua Zhang
Australian National University, Canberra, Australia

Definition
It is convenient to define a covariance matrix by using multi-variate random variables (mrv): $X = (X_1, \ldots, X_d)^\top$. For univariate random variables $X_i$ and $X_j$, their covariance is defined as:

$$\mathrm{Cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right],$$

where $\mu_i$ is the mean of $X_i$: $\mu_i = E[X_i]$. As a special case, when $i = j$, we get the variance of $X_i$, $\mathrm{Var}(X_i) = \mathrm{Cov}(X_i, X_i)$. Now in the setting of mrv, assuming that each component random variable $X_i$ has finite variance under its marginal distribution, the covariance matrix $\mathrm{Cov}(X, X)$ can be defined as a d-by-d matrix whose $(i, j)$-th entry is the covariance:

$$(\mathrm{Cov}(X, X))_{ij} = \mathrm{Cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right].$$

Its inverse is also called the precision matrix.

Motivation and Background
The covariance between two univariate random variables measures how much they change together, and as a special case, the covariance of a random variable with itself is exactly its variance. It is important to note that covariance is an unnormalized measure of the correlation between the random variables. As a generalization to multi-variate random variables $X = (X_1, \ldots, X_d)^\top$, the covariance matrix is a d-by-d matrix whose $(i, j)$-th component is the covariance between $X_i$ and $X_j$. In many applications, it is important to characterize the relations between a set of factors, hence the covariance matrix plays an important role in practice, especially in machine learning.

Theory
It is easy to rewrite the element-wise definition into the matrix form:

$$\mathrm{Cov}(X, X) = E\left[(X - E[X])(X - E[X])^\top\right], \tag{1}$$

which naturally generalizes the variance of univariate random variables: $\mathrm{Var}(X) = E[(X - E[X])^2]$. Moreover, it is also straightforward to extend the covariance of a single mrv $X$ to two mrvs $X$ (d-dimensional) and $y$ (s-dimensional), under the name cross-covariance. It quantifies how much the component random variables in $X$ and $y$ change together. The cross-covariance matrix is defined as a $d \times s$ matrix $\mathrm{Cov}(X, y)$ whose $(i, j)$-th entry is

$$(\mathrm{Cov}(X, y))_{ij} = \mathrm{Cov}(X_i, Y_j) = E\left[(X_i - E[X_i])(Y_j - E[Y_j])\right].$$

$\mathrm{Cov}(X, y)$ can also be written in the matrix form as

$$\mathrm{Cov}(X, y) = E\left[(X - E[X])(y - E[y])^\top\right],$$

where the expectation is with respect to the joint distribution of $(X, y)$. Obviously, $\mathrm{Cov}(X, y)$ becomes $\mathrm{Cov}(X, X)$ when $y = X$.

Properties

Covariance $\mathrm{Cov}(X, X)$ has the following properties:
1. Positive semi-definiteness: It follows from (1) that $\mathrm{Cov}(X, X)$ is positive semi-definite. $\mathrm{Cov}(X, X) = 0$ if, and only if, $X$ is a constant almost surely, i.e., there exists a constant $x$ such that $\Pr(X \ne x) = 0$. $\mathrm{Cov}(X, X)$ is not positive definite if, and only if, there exists a nonzero constant vector $\alpha$ such that $\langle \alpha, X \rangle$ is constant almost surely.
2. Relating cumulant to moments: $\mathrm{Cov}(X, X) = E[X X^\top] - E[X] E[X]^\top$.
3. Linear transform: If $y = AX + b$ where $A \in \mathbb{R}^{s \times d}$ and $b \in \mathbb{R}^{s}$, then $\mathrm{Cov}(y, y) = A\, \mathrm{Cov}(X, X)\, A^\top$.

Cross-covariance $\mathrm{Cov}(X, y)$ has the following properties:
1. Symmetry: $\mathrm{Cov}(X, y) = \mathrm{Cov}(y, X)^\top$.
2. Linearity: $\mathrm{Cov}(X_1 + X_2, y) = \mathrm{Cov}(X_1, y) + \mathrm{Cov}(X_2, y)$.
3. Relating to covariance: If $X$ and $y$ have the same dimension, then $\mathrm{Cov}(X + y, X + y) = \mathrm{Cov}(X, X) + \mathrm{Cov}(y, y) + \mathrm{Cov}(X, y) + \mathrm{Cov}(y, X)$.
4. Linear transform: $\mathrm{Cov}(AX, By) = A\, \mathrm{Cov}(X, y)\, B^\top$.

It is highly important to note that $\mathrm{Cov}(X, y) = 0$ is a necessary but not sufficient condition for $X$ and $y$ to be independent.

Correlation Coefficient

Entries in the covariance matrix are sometimes presented in a normalized form by dividing each entry by its corresponding standard deviations. This quantity is called the correlation coefficient, represented as $\rho_{X_i, X_j}$, and defined as

$$\rho_{X_i, X_j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Cov}(X_i, X_i)^{1/2}\, \mathrm{Cov}(X_j, X_j)^{1/2}}.$$
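A short derivation, added here as an illustration, shows why this normalized quantity cannot exceed 1 in absolute value. It relies only on the Cauchy–Schwarz inequality for expectations and assumes that both variances are strictly positive; the shorthand $U$ and $V$ is introduced for this sketch only.

```latex
\begin{gather*}
U := X_i - E[X_i], \qquad V := X_j - E[X_j], \\
\mathrm{Cov}(X_i, X_j)^2 = \bigl(E[UV]\bigr)^2 \le E[U^2]\, E[V^2]
  = \mathrm{Cov}(X_i, X_i)\, \mathrm{Cov}(X_j, X_j), \\
\bigl|\rho_{X_i, X_j}\bigr|
  = \frac{\bigl|\mathrm{Cov}(X_i, X_j)\bigr|}
         {\mathrm{Cov}(X_i, X_i)^{1/2}\, \mathrm{Cov}(X_j, X_j)^{1/2}} \le 1 .
\end{gather*}
```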

The corresponding matrix is called the correlation matrix. With $\Gamma_X$ set to $\mathrm{Cov}(X, X)$ with all nondiagonal entries zeroed, and $\Gamma_Y$ likewise, the correlation matrix is given by

$$\mathrm{Corr}(X, y) = \Gamma_X^{-1/2}\, \mathrm{Cov}(X, y)\, \Gamma_Y^{-1/2}.$$

The correlation coefficient takes on values in $[-1, 1]$.

Parameter Estimation

Given observations $x_1, \ldots, x_n$ of a mrv $X$, an unbiased estimator of $\mathrm{Cov}(X, X)$ is:

$$S = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. The denominator $n - 1$ reflects the fact that the mean is unknown and the sample mean is used in its place. Note that the maximum likelihood estimator in this case replaces the denominator $n - 1$ by $n$.
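The estimator can be written out directly. The following pure-Python sketch is illustrative only: the function names are hypothetical, observations are plain lists of numbers, and the correlation helper assumes that every variance on the diagonal is strictly positive.

```python
import math


def sample_covariance(xs, unbiased=True):
    """Sample covariance matrix of the observations in `xs` (a list of d-vectors).

    Divides by n - 1 when `unbiased` is True and by n otherwise.
    """
    n, d = len(xs), len(xs[0])
    mean = [sum(x[k] for x in xs) / n for k in range(d)]
    denom = n - 1 if unbiased else n
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / denom
             for j in range(d)]
            for i in range(d)]


def correlation_from_covariance(cov):
    """Normalize each entry by the two corresponding standard deviations."""
    d = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(d)]
            for i in range(d)]


# Toy observations (assumed): the second coordinate is twice the first.
xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
S = sample_covariance(xs)                       # divides by n - 1
S_ml = sample_covariance(xs, unbiased=False)    # divides by n
R = correlation_from_covariance(S)
```

Because the two coordinates in the toy data are perfectly linearly related, the off-diagonal correlation is 1, and the two estimators differ only by the factor (n − 1)/n.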

Conjugate Priors

A covariance matrix is used to define the Gaussian distribution. In this case, the inverse Wishart distribution is the conjugate prior for the covariance matrix. Since the Gamma distribution is a 1-D version of the Wishart distribution, in the 1-D case the Gamma is the conjugate prior for the precision.

Applications
Several key uses of the covariance matrix are reviewed here.

Correlation and Kernel Methods

In many machine learning problems, we often need to quantify the correlation of two mrvs which may be from two different spaces. For example, we may want to study how much the image stream of a movie is correlated with the comments it receives. For simplicity, we consider a d-dimensional mrv $X$ and an s-dimensional mrv $y$. To study their correlation, suppose we have $n$ pairs of observations $\{(x_i, y_i)\}_{i=1}^{n}$ drawn iid from a certain underlying joint distribution of $(X, y)$. Let $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, and stack $\{x_i\}$ and $\{y_i\}$ into $\tilde{X} = (x_1, \ldots, x_n)^\top$ and $\tilde{Y} = (y_1, \ldots, y_n)^\top$ respectively. Then the cross-covariance matrix $\mathrm{Cov}(X, y)$ can be estimated by $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^\top$. To quantify the cross-correlation by a real number, we need to apply some norm of the cross-covariance matrix, and the simplest one is the Frobenius norm, $\|A\|_F^2 = \sum_{ij} A_{ij}^2$. Therefore, we obtain a measure of cross-correlation,

$$\left\| \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})^\top \right\|_F^2 = \frac{1}{n^2}\, \mathrm{tr}\left( H \tilde{X} \tilde{X}^\top H \tilde{Y} \tilde{Y}^\top \right), \tag{2}$$

where $H_{ij} = \delta_{ij} - \frac{1}{n}$, and $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. It is important to notice two things. First, in this measure the inner product is performed only in the space of $X$ and of $y$ separately, i.e., no transformation between $X$ and $y$ is required. Second, the data points affect the measure only via inner products $x_i^\top x_j$ as the $(i, j)$-th entry of $\tilde{X} \tilde{X}^\top$ (and similarly for $y_i$). Hence we can endow new inner products on $X$ and $y$, which eventually allows us to apply kernels, e.g., Gretton, Herbrich, Smola, Bousquet, & Schölkopf (). In a nutshell, kernel methods (Schölkopf & Smola, ) redefine the inner product $x_i^\top x_j$ by mapping $x_i$ to a richer feature space via $\phi(x_i)$ and then compute the inner product there: $k(x_i, x_j) := \phi(x_i)^\top \phi(x_j)$. Since the measure in (2) only needs inner products, one can even directly define $k(\cdot, \cdot)$ without explicitly specifying $\phi$. This allows us to

● Implicitly use a rich feature space whose dimension can be infinitely high.
● Apply this measure of cross correlation to non-Euclidean spaces as long as a kernel $k(x_i, x_j)$ can be defined on it.

Correlation and Least Squares Approximation

The measure of (2) can be equivalently motivated by least squares linear regression. That is, we look for a linear transform $T : \mathbb{R}^d \to \mathbb{R}^s$ which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \left\| (y_i - \bar{y}) - T(x_i - \bar{x}) \right\|^2.$$

One can show that its minimum objective value is exactly equal to (2) up to a constant, as long as all $y_i - \bar{y}$ and $x_i - \bar{x}$ have unit length. In practice, this can be achieved by normalization. Alternatively, the measure in (2) itself can be normalized by replacing the covariance matrix with the correlation matrix.

Principal Component Analysis

The covariance matrix plays a key role in principal component analysis (PCA). Assume that we are given $n$ iid observations $x_1, \ldots, x_n$ of a mrv $X$, and let $\bar{x} = \frac{1}{n} \sum_{i} x_i$. PCA tries to find a set of orthogonal directions $w_1, w_2, \ldots$, such that the projection of $X$ onto the direction $w_1$, $w_1^\top X$, has the highest variance among all possible directions in the d-dimensional space. After subtracting from $X$ the projection onto $w_1$, $w_2$ is chosen as the highest variance projection direction for the remainder. This procedure goes on for the required number of components. To find $w_1 := \operatorname{argmax}_{w} \mathrm{Var}(w^\top X)$, we need an empirical estimate of $\mathrm{Var}(w^\top X)$. Estimating $E[(w^\top X)^2]$ by $w^\top \left( \frac{1}{n} \sum_{i} x_i x_i^\top \right) w$, and $E[w^\top X]$ by $\frac{1}{n} \sum_{i} w^\top x_i$, we get $w_1 = \operatorname{argmax}_{w : \|w\| = 1} w^\top S w$, where

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,$$

i.e., $S$ is $\frac{n-1}{n}$ times the unbiased empirical estimate of the covariance of $X$, based on samples $x_1, \ldots, x_n$. $w_1$ turns out to be exactly the eigenvector of $S$ corresponding to the greatest eigenvalue. Note that PCA does not depend on any assumption about the distribution of $X$. More details on PCA can be found in Jolliffe ().

Gaussian Processes

Gaussian processes are another important framework in machine learning that relies on the covariance matrix. A Gaussian process is a distribution over functions $f(\cdot)$ from a certain space $\mathcal{X}$ to $\mathbb{R}$, such that for any $n \in \mathbb{N}$ and any $n$ points $\{x_i \in \mathcal{X}\}_{i=1}^{n}$, the set of values of $f$ evaluated at $\{x_i\}_i$, $\{f(x_1), \ldots, f(x_n)\}$, will have an n-dimensional Gaussian distribution. Different choices of the covariance matrix of the multi-variate Gaussian lead to different stochastic processes such as Wiener process, Brownian motion, Ornstein–Uhlenbeck process, etc. In these cases, it makes more sense to define a covariance function $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, such that given any set $\{x_i \in \mathcal{X}\}_{i=1}^{n}$ for any $n \in \mathbb{N}$, the n-by-n matrix $(C(x_i, x_j))_{ij}$ is positive semi-definite and can be used as the covariance matrix. This further allows straightforward kernelization of a Gaussian process by using the kernel function as the covariance function.

Although the space of functions is infinite dimensional, the marginalization property of multi-variate Gaussian distributions guarantees that the user of the model only needs to consider the observed $x_i$, and ignore all the other possible $x \in \mathcal{X}$. This important property says that for a mrv $X = (X_1^\top, X_2^\top)^\top \sim \mathcal{N}(\mu, \Sigma)$, the marginal distribution of $X_1$ is $\mathcal{N}(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the submatrix of $\Sigma$ corresponding to $X_1$ (and similarly for $\mu_1$). So taking into account the random variable $X_2$ will not change the marginal distribution of $X_1$.

For a complete treatment of covariance matrix from a statistical perspective, see Casella and Berger (), and Mardia, Kent, and Bibby () provides details for the multi-variate case. PCA is comprehensively discussed in Jolliffe (), and kernel methods are introduced in Schölkopf and Smola (). Williams & Rasmussen () gives the state of the art on how Gaussian processes can be utilized for machine learning.
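The role of a covariance function and the marginalization property discussed for Gaussian processes can be illustrated with a small sketch. The squared exponential function used here is only one possible choice and is an assumption of this example, as are the function names and the input points; the sketch builds the covariance matrix for a finite set of inputs and shows that dropping an input simply removes its row and column.

```python
import math


def squared_exponential(x, y, length_scale=1.0):
    """An illustrative covariance function C(x, y) on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def covariance_matrix(points, cov_fn=squared_exponential):
    """n-by-n matrix whose (i, j) entry is cov_fn(points[i], points[j])."""
    return [[cov_fn(p, q) for q in points] for p in points]


points = [0.0, 0.5, 2.0]
full = covariance_matrix(points)

# Marginalizing out f(2.0) leaves the top-left 2-by-2 block, which is exactly
# the matrix built from the first two points alone.
marginal = [row[:2] for row in full[:2]]
```

This is why a user of the model only needs the covariance function evaluated at the inputs of interest: the remaining inputs never enter the computation.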

Cross References
→ Gaussian Distribution
→ Gaussian Processes
→ Kernel Methods

Recommended Reading
Casella, G., & Berger, R. (). Statistical inference (nd ed.). Pacific Grove, CA: Duxbury.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Schölkopf, B. (). Kernel methods for measuring independence. Journal of Machine Learning Research, , –.
Jolliffe, I. T. (). Principal component analysis (nd ed.). Springer series in statistics. New York: Springer.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (). Multivariate analysis. London: Academic Press.
Schölkopf, B., & Smola, A. (). Learning with kernels. Cambridge, MA: MIT Press.
Williams, C. K. I., & Rasmussen, C. E. (). Gaussian processes for regression. Cambridge, MA: MIT Press.

Covering Algorithm
→ Rule Learning

Credit Assignment
Claude Sammut
The University of New South Wales

Synonyms
Structural credit assignment; Temporal credit assignment

Definition
When a learning system employs a complex decision process, it must assign credit or blame for the outcomes to each of its decisions. Where it is not possible to directly attribute an individual outcome to each decision, it is necessary to apportion credit and blame between each of the combinations of decisions that contributed to the outcome. We distinguish two cases in the credit assignment problem. Temporal credit assignment refers to the assignment of credit for outcomes to actions. Structural credit assignment refers to the assignment of credit for actions to internal decisions. The first subproblem involves determining when the actions that deserve credit were taken and the second involves assigning credit to the internal structure of actions (Sutton, ).

Motivation
Consider the problem of learning to balance a pole that is hinged on a cart (Michie & Chambers, ; Anderson & Miller, ). The cart is constrained to run along a track of finite length and a fixed force can be applied to push the cart left or right. A controller for the pole and cart system must make a decision whether to push left or right at frequent, regular time intervals, for example, times a second. Suppose that this controller is capable of learning from trial-and-error. If the pole falls over, then it must determine which of the actions it took helped or hurt its performance. Determining this is the problem of temporal credit assignment. Although the actions are directly responsible for the outcome of a trial, the internal process for choosing the action indirectly affects the outcome. Assigning credit or blame to those internal processes that lead to the choice of action is the structural credit assignment problem.

In the case of pole balancing, the learning system will typically keep statistics such as how long, on average, the pole remained balanced after taking a particular action in a particular state, or after a failure, it may count back and determine the average amount of time to failure after taking a particular action in a particular state. Using these statistics, the learner attempts to determine the best action for a given state.

The above example is typical of many problems in reinforcement learning (Sutton & Barto, ), where an agent interacts with its environment and, through that interaction, learns to improve its performance in a task. Although Samuel () was the first to use a form of reinforcement learning in his checkers playing program, Minsky () first articulated the credit assignment problem, as follows:

▸ Using devices that also learn which events are associated with reinforcement, i.e., reward, we can build more autonomous “secondary reinforcement” systems. In applying such methods to complex problems, one encounters a serious difficulty – in distributing credit for success of a complex strategy among the many decisions that were involved.

The BOXES algorithm of Michie and Chambers () learned to control a pole balancer and performed credit assignment, but the problem of credit assignment later became central to reinforcement learning, particularly following the work of Sutton (). Although credit assignment has become most strongly identified with reinforcement learning, it may appear in any learning system that attempts to assess and revise its decision-making process.

Structural Credit Assignment

The setting for our learning system is that we have an agent that interacts with an environment. The environment may be a virtual one, as in game playing, or it may be physical, as in a robot performing some task. The agent receives input, possibly through sensing devices, that allows it to characterize the state of the world. Somehow, the agent must map these inputs to appropriate responses. These responses may change the state of the world. In reinforcement learning, we assume that the agent will receive some reward signal after an action or sequence of actions. Its job is to maximize these rewards over time. Structural credit assignment is associated with generalization over the input space of the agent. For example, a game player may have to respond to a very large number of potential board positions or a robot may have to respond to a stream of camera images. It is infeasible to learn a complete mapping from every possible input to every possible output. Therefore, a learning agent will typically use some means of grouping input signals. In the case of the BOXES pole balancer, Michie and Chambers discretized the state space. The state is characterized by the cart’s position and velocity and the pole’s angle and angular velocity. These parameters create a four-dimensional space, which was broken into three regions (left, center, right) for the pole angle, five for the angular velocity, and three for the cart position and velocity. These choices were arbitrary and other combinations also worked. Having divided the input space into non-overlapping regions, Michie and Chambers associated a push-left and a push-right action with each region, or box. The learning algorithm maintains a score for each action and chooses the next action based on that score. BOXES was an early and simple example of creating an internal representation for mapping inputs to outputs.
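The discretization described above can be sketched as a lookup table keyed by box. This is only an illustration, not the original BOXES program: the bin thresholds in `EDGES`, the reading of "three for the cart position and velocity" as three regions each, and the running-average update in `update` are all assumptions made for the example.

```python
import bisect
import random

# Hypothetical bin edges. The entry gives only the number of regions per
# variable, not the thresholds, so these values are purely illustrative.
EDGES = {
    "x": [-0.8, 0.8],                     # cart position    -> 3 regions
    "x_dot": [-0.5, 0.5],                 # cart velocity    -> 3 regions
    "theta": [-0.02, 0.02],               # pole angle       -> 3 regions
    "theta_dot": [-0.9, -0.3, 0.3, 0.9],  # angular velocity -> 5 regions
}
ORDER = ["x", "x_dot", "theta", "theta_dot"]
ACTIONS = ("left", "right")


def box_of(state):
    """Map a continuous state (dict of four floats) to a box (tuple of bin indices)."""
    return tuple(bisect.bisect_right(EDGES[k], state[k]) for k in ORDER)


def num_boxes():
    """Number of non-overlapping regions the state space is divided into."""
    n = 1
    for k in ORDER:
        n *= len(EDGES[k]) + 1
    return n


class BoxesTable:
    """Lookup table holding one score per (box, action) pair."""

    def __init__(self):
        self.score = {}

    def choose(self, state):
        """Pick the action with the higher score in the current box (random on ties)."""
        box = box_of(state)
        scores = [self.score.get((box, a), 0.0) for a in ACTIONS]
        if scores[0] == scores[1]:
            return random.choice(ACTIONS)
        return ACTIONS[scores.index(max(scores))]

    def update(self, state, action, value, rate=0.1):
        """Move the stored score toward an observed value (assumed update rule)."""
        key = (box_of(state), action)
        old = self.score.get(key, 0.0)
        self.score[key] = old + rate * (value - old)
```

Because every state inside a box shares the same two scores, whatever credit an action earns in one state is automatically generalized to all other states in that box, which is the structural side of the problem.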
The problem with the BOXES method is that the structure of the decision-making system is fixed at the start and the learner is incapable of changing the representation. A change may be needed if, for example, the subdivisions that were chosen do not correspond to a real decision boundary. A learning system that can adapt its representation has an advantage in this case. The BOXES representation can be thought of as a lookup table that implements a function mapping an input to an output. The fixed lookup table can be replaced by a function approximator that, given examples from the desired function, generalizes from them to construct an approximation of that function. Different function approximation techniques can be used. For example, Moore’s () function approximator was a nearest-neighbor algorithm, implemented using kd-trees to improve efficiency. Other function approximation methods may also be used, e.g., Albus’ CMAC algorithm (), locally weighted regression (Atkeson, Schaal, & Moore, ), perceptrons (Rosenblatt, ), multi-layer networks (Hinton, Rumelhart, & Williams, ), radial basis functions, etc. Structural credit assignment is also addressed in the creation of hierarchical representations. See hierarchical reinforcement learning. Other approaches to structural credit assignment include value function approximation (Bertsekas & Tsitsiklis, ) and automatic basis generation (Mahadevan, ). See the entry on Gaussian Processes for examples of recent Bayesian and kernel method based approaches to solving the credit assignment problem.

Temporal Credit Assignment

In the pole balancing example described above, the learning system receives a signal when the pole has fallen over. How does it know which actions leading up to the failure contributed to the fall? The system will receive a high-level punishment in the event of a failure or a reward in tasks where there is a goal to be achieved. In either case, it makes sense to assign the greatest credit or blame to the most recent actions and assign progressively less to the preceding actions. Each time a learning trial is repeated, the value of an action is updated so that if it leads to another action of higher value, its weight is increased. Thus, the reward or punishment propagates back through the sequence of decisions taken by the system. The credit assignment problem was addressed by Michie and Chambers in the BOXES algorithm, but many other solutions have subsequently been proposed. See the entries on Q-learning (Watkins, ; Watkins & Dayan, ) and temporal difference learning (Barto, Sutton, & Anderson, ; Sutton, ).

Although temporal credit assignment is usually associated with reinforcement learning, it also appears in other forms of learning. In learning by imitation or behavioral cloning, an agent observes the actions of another agent and tries to learn from traces of behaviors. In this case, the learner must judge which actions of the other agent should receive credit or blame. Plan learning also encounters the same problem (Benson & Nilsson, ; Wang, Simon, & Lehman, ), as does explanation-based learning (Mitchell, Keller, & Kedar-Cabelli, ; Dejong & Mooney, ; Laird, Newell, & Rosenbloom, ).

To illustrate the connection with explanation-based learning, we use one of the earliest examples of this kind of learning, Mitchell and Utgoff’s LEX program (Mitchell, Utgoff, & Banerji, ). The program was intended to learn heuristics for performing symbolic integration. Given a mathematical expression that included an integral sign, the program tried to transform the expression into one that did not. The standard symbolic integration operators were known to the program, but not when it was best to apply them. The task of the learning system was to learn the heuristics for when to apply the operators. This was done by experimentation. If no heuristics were available, the program attempted a brute force search. If the search was successful, all the operators applied on the path to success were assumed to be positive examples for a heuristic, whereas operators applied during a failed attempt became negative examples. Thus, LEX performed a simple form of credit assignment, which is typical of any system that learns how to improve sequences of decisions.

Genetic algorithms can also be used to evolve rules that perform sequences of actions (Holland, ). When situation-action rules are applied in a sequence, we have a credit assignment problem similar to the one in reinforcement learning. That is, how do we know which rules were responsible for success or failure and to what extent? Grefenstette () describes a bucket brigade algorithm in which rules are given strengths that are adjusted to reflect credit or blame. This is similar to temporal difference learning except that in the bucket brigade the strengths apply to rules rather than states. See Classifier Systems; for a more comprehensive survey of bucket brigade methods, see Goldberg ().
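The principle stated at the start of this section, that the most recent actions receive the greatest credit or blame and earlier ones progressively less, can be sketched as follows. The geometric `decay` factor and the running-average update with `rate` are assumptions chosen for illustration; this is not the BOXES, Q-learning, or temporal difference update itself.

```python
def assign_credit(trajectory, outcome, decay=0.9):
    """Distribute a final outcome over a sequence of (state, action) steps.

    The last step receives the full outcome; each earlier step receives
    the outcome scaled by one more factor of `decay` (assumed scheme).
    """
    credits = []
    weight = 1.0
    for state, action in reversed(trajectory):
        credits.append((state, action, outcome * weight))
        weight *= decay
    credits.reverse()  # restore chronological order
    return credits


def update_values(values, trajectory, outcome, decay=0.9, rate=0.5):
    """Move each stored (state, action) value toward its share of the outcome."""
    for state, action, credit in assign_credit(trajectory, outcome, decay):
        old = values.get((state, action), 0.0)
        values[(state, action)] = old + rate * (credit - old)
    return values


# A failed pole-balancing trial: the pole fell after the third push.
trial = [("s0", "left"), ("s1", "right"), ("s2", "left")]
blame = assign_credit(trial, outcome=-1.0, decay=0.5)
```

Repeating `update_values` over many trials lets blame for a failure accumulate on the state-action pairs that regularly precede it, which is the backward propagation the text describes.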

Transfer Learning

After a person has learned to perform some task, learning a new but related task is usually easier because knowledge from the first learning episode is transferred to the new task. Transfer learning is particularly useful for acquiring new concepts or behaviors when only a small amount of training data is available. It can be viewed as a form of credit assignment because successes or failures in previous learning episodes bias future learning. Reid (, ) identifies three forms of inductive bias involved in transfer learning for rules: language bias, which determines what kinds of rules can be constructed by the learner; search bias, which determines the order in which rules will be searched; and evaluation bias, which determines how the quality of the rules will be assessed. Note that learning language bias is a form of structural credit assignment. Similarly, where rules are applied sequentially, evaluation bias becomes temporal credit assignment. Taylor and Stone () give a comprehensive survey of transfer in reinforcement learning, in which they describe a variety of techniques for transferring the structure of an RL task from one case to another. They also survey methods for transferring evaluation bias. Transfer learning can be applied in many different settings. Caruana () developed a system for transferring inductive bias in neural networks performing multitask learning, and more recent research has been directed toward transfer learning in Bayesian networks (Niculescu-mizil & Caruana, ). See Transfer Learning, Silver et al. (), and Banerjee, Liu, and Youngblood () for recent work on transfer learning.

Cross References

Bayesian Network
Classifier Systems
Genetic Algorithms
Hierarchical Reinforcement Learning
Inductive Bias
kd-Trees
Locally Weighted Regression
Nearest-Neighbor
Perceptrons
Radial Basis Function
Reinforcement Learning
Temporal Difference Learning
Transfer Learning

Recommended Reading Albus, J. S. (). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, Transactions ASME, (), –. Anderson, C. W., & Miller, W. T. (). A set of challenging control problems. In W. Miller, R. S. Sutton, & P. J. Werbos (Eds.), Neural Networks for Control. Cambridge: MIT Press. Atkeson, C., Schaal, S., & Moore, A. (). Locally weighted learning. AI Review, , –. Banerjee, B., Liu, Y., & Youngblood, G. M. (Eds.), (). Proceedings of the ICML workshop on “Structural knowledge transfer for machine learning.” Pittsburgh, PA. Barto, A., Sutton, R., & Anderson, C. (). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-, –. Benson, S., & Nilsson, N. J. (). Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, & S. Muggleton (Eds.), Machine Intelligence . Oxford: Oxford University Press. Bertsekas, D. P., & Tsitsiklis, J. (). Neuro-dynamic programming. Nashua, NH: Athena Scientific. Caruana, R. (). Multitask learning. Machine Learning, , –. Dejong, G., & Mooney, R. (). Explanation-based learning: An alternative view. Machine Learning, , –. Goldberg, D. E. (). Genetic algorithms in search, optimization and machine learning. Boston: Addison-Wesley Longman Publishing. Grefenstette, J. J. (). Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, (–), –. Hinton, G., Rumelhart, D., & Williams, R. (). Learning internal representation by back-propagating errors. In D. Rumelhart, J. McClelland, & T. P. R. Group (Eds.), Parallel distributed computing: Explorations in the microstructure of cognition (Vol. ., pp. –). Cambridge: MIT Press.

Holland, J. (). Escaping brittleness: The possibilities of generalpurpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. ). Los Altos: Morgan Kaufmann. Laird, J. E., Newell, A., & Rosenbloom, P. S. (). SOAR: An architecture for general intelligence. Artificial Intelligence, (), –. Mahadevan, S. (). Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, (), –. Michie, D., & Chambers, R. (). Boxes: An experiment in adaptive control. In E. Dale & D. Michie (Eds.), Machine Intelligence . Edinburgh: Oliver and Boyd. Minsky, M. (). Steps towards artificial intelligence. Proceedings of the IRE, , –. Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. (). Explanation based generalisation: A unifying view. Machine Learning, , –. Mitchell, T. M., Utgoff, P. E., & Banerji, R. B. (). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. Michalski, J. Carbonell, & T. Mitchell (Eds.), Machine kearning: An artificial intelligence approach. Palo Alto: Tioga. Moore, A. W. (). Efficient memory-based learning for robot control. Ph.D. Thesis, UCAM-CL-TR-, Computer Laboratory, University of Cambridge, Cambridge. Niculescu-mizil, A., & Caruana, R. (). Inductive transfer for Bayesian network structure learning. In Proceedings of the th International Conference on AI and Statistics (AISTATS ). San Juan, Puerto Rico. Reid, M. D. (). Improving rule evaluation using multitask learning. In Proceedings of the th International Conference on Inductive Logic Programming (pp. –). Porto, Portugal. Reid, M. D. (). DEFT guessing: Using inductive transfer to improve rule evaluation from limited data. Ph.D. thesis, School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia. Rosenblatt, F. (). 
Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanics. Washington, DC: Spartan Books. Samuel, A. (). Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, (), –. Silver, D., Bakir, G., Bennett, K., Caruana, R., Pontil, M., Russell, S., et al. (). NIPS workshop on “Inductive transfer: years later”. Whistler, Canada. Sutton, R. (). Temporal credit assignment in reinforcement learning. Ph.D. thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA. Sutton, R., & Barto, A. (). Reinforcement learning: An introduction. Cambridge: MIT Press. Taylor, M. E., & Stone, P. (). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, , –. Wang, X., Simon, H. A., Lehman, J. F., & Fisher, D. H. (). Learning planning operators by observation and practice. In Proceedings of the Second International Conference on AI Planning Systems, AIPS- (pp. –). Chicago, IL.

Watkins, C. (). Learning with delayed rewards. Ph.D. thesis, Psychology Department, University of Cambridge, Cambridge. Watkins, C., & Dayan, P. (). Q-learning. Machine Learning, (–), –.

Cross-Language Document Categorization

Document categorization is the task of assigning a document to zero, one, or more categories in a predefined taxonomy. Cross-language document categorization describes the specific case in which one is interested in automatically categorizing documents in the same taxonomy regardless of which of several languages each document is written in. For more details on the methods used to perform this task, see cross-lingual text mining.

Cross-Language Information Retrieval

Cross-language information retrieval (CLIR) is the task of retrieving the subset of a document collection D relevant to a query q, in the special case in which D contains documents written in more than one language. Generally, it is additionally assumed that the subset of relevant documents must be returned as an ordered list, in decreasing order of relevance. For more details on methods and applications, see cross-lingual text mining.

Cross-Language Question Answering

Question answering is the task of finding, in a document collection, the answer to a question. Cross-language question answering is the specific case in which the question and the documents can be in different languages. For more details on the methods used to perform this task, see cross-lingual text mining.

Cross-Lingual Text Mining

Nicola Cancedda, Jean-Michel Renders
Xerox Research Centre Europe, Meylan, France

Definition Cross-lingual text mining is a general category denoting tasks and methods for accessing the information in sets of documents written in several languages, or whenever the language used to express an information need is different from the language of the documents. A distinguishing feature of cross-lingual text mining is the necessity to overcome some language translation barrier.

Motivation and Background

Advances in mass storage and network connectivity make enormous amounts of information easily accessible to an increasingly large fraction of the world population. Such information is mostly encoded in the form of running text which, in most cases, is written in a language different from the native language of the user. This state of affairs creates many situations in which the main barrier to the fulfillment of an information need is not technological but linguistic. For example, in some cases the user has some knowledge of the language in which the text containing a relevant piece of information is written, but does not have sufficient command of this language to express his/her information needs. In other cases, documents in many different languages must be categorized in the same categorization schema, but manually categorized examples are available for only one language. While the automatic translation of text from one natural language into another (machine translation) is one of the oldest problems to which computers have been applied, a palette of other tasks has become relevant only more recently, due to the technological advances mentioned above. Most of them were originally motivated by the needs of government intelligence communities, but received a strong impulse from the diffusion of the World-Wide Web and of the Internet in general.

Tasks and Methods

A number of specific tasks fall under the term cross-lingual text mining (CLTM), including:

● Cross-language information retrieval
● Cross-language document categorization
● Cross-language document clustering
● Cross-language question answering

These tasks can in principle be performed using methods which do not involve any text mining, but as a matter of fact all of them have been successfully approached by relying on the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they are all characterized by the need to reliably measure the similarity of two text spans written in different languages. There are essentially two families of approaches for doing this:

1. In translation-based approaches, one of the two text spans is first translated into the language of the other. Similarity is then computed based on any measure used in monolingual cases. As a variant, both text spans can be translated into a third, pivot language.
2. In latent semantics approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in this latent semantic space, where any similarity measure for vector spaces can be used.

The rest of this entry is organized as follows: first translation-based approaches will be introduced, followed by latent semantic approaches. Finally, each of the specific CLTM tasks will be discussed in turn.

Translation-Based Approaches

The simplest approach consists in using a manually written machine-readable bilingual dictionary: words from the first span are looked up and replaced with words in the second language (see e.g., Zhang & Vines, ). Since dictionaries typically contain entries for “citation forms” only (e.g., the singular for nouns, the infinitive for verbs, etc.), words in both spans are preliminarily lemmatized, i.e., replaced with the corresponding
citation form. In all cases when the lexica and morphological analyzers required to perform lemmatization are not available, a frequently adopted crude alternative consists in stemming (i.e., truncating by taking away a suffix) both the words in the span to be translated and in the corresponding side in the lexicon. Some languages (e.g., Germanic languages) are characterized by a very productive compounding: simpler words are connected together to form complex words. Compound words are rarely in dictionaries as such: in order to find them it is first necessary to break compounds into their elements. This can be done based on additional linguistic resources or by means of heuristics, but in all cases it is a challenging operation in itself. If the method used afterward to compare the two spans in the target language can take weights into account, translations are “normalized” in such a way that the cumulative weight of all translations of a word is the same regardless of the number of alternative translations. Most often, the weight is simply distributed uniformly among all alternative translations. Sometimes, only the first translation for each word is kept, or the first two or three. A second approach consists in extracting a bilingual lexicon from a parallel corpus instead of using a manually-written one. Methods for extracting probabilistic lexica look at the frequencies with which a word s in one language was translated with a word t to estimate the translation probability p(t∣s). In order to determine which word is the translation of which other word in the available examples, these examples are preliminarily aligned, first at the sentence level (to know what sentence is the translation of what other sentence) and then at the word level. Several methods for aligning sentences at the word level have been proposed, and this problem is a lively research topic in itself (see Brown, Della Pietra, Della Pietra, & Mercer, for a seminal paper). 
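The two weighting schemes just described, uniform weights over dictionary translations and relative-frequency estimates of p(t∣s) from word-aligned examples, can be sketched as follows. The function names, the toy dictionary, and the assumption that word alignment has already been carried out (so that the input is a list of aligned (s, t) pairs) are all illustrative.

```python
from collections import Counter, defaultdict


def translate_uniform(bag, dictionary, max_translations=None):
    """Translate a bag of words {word: weight} with a bilingual dictionary.

    Each word's weight is split uniformly over its translations, so the
    cumulative weight does not depend on how many translations it has.
    Optionally keep only the first `max_translations` entries per word.
    """
    out = defaultdict(float)
    for word, weight in bag.items():
        translations = dictionary.get(word, [])
        if max_translations is not None:
            translations = translations[:max_translations]
        for t in translations:
            out[t] += weight / len(translations)
    return dict(out)


def estimate_translation_probs(aligned_pairs):
    """Estimate p(t|s) by relative frequency from word-aligned (s, t) pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(s for s, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        probs[s][t] = count / source_counts[s]
    return dict(probs)


def translate_weighted(bag, probs):
    """Translate a bag of words, distributing weight according to p(t|s)."""
    out = defaultdict(float)
    for word, weight in bag.items():
        for t, p in probs.get(word, {}).items():
            out[t] += weight * p
    return dict(out)


# Toy English-to-French example (hypothetical data).
dictionary = {"bank": ["banque", "rive"], "river": ["rivière"]}
pairs = [("bank", "banque")] * 3 + [("bank", "rive")]
probs = estimate_translation_probs(pairs)
```

Words missing from the dictionary are simply dropped in this sketch; a real system would need a policy for out-of-vocabulary words such as proper nouns.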
Once a probabilistic bilingual dictionary is available, it can be used much in the same way as human-written dictionaries, with the notable difference that the estimated conditional probabilities provide a natural way to distribute weight across translations. When the example documents used for extracting the bilingual dictionaries are of the same style and domain as the text spans to be translated, this can result in a significant increase in accuracy for the final task, whatever this is.

It is often the case that a parallel corpus sufficiently similar in topic and style to the spans to be translated is unavailable, or is too small to be used for reliably estimating translation probabilities. In such cases, it may be possible to replace or complement the parallel corpus with a “comparable” corpus. A comparable corpus is a pair of collections of documents, one in each of the languages of interest, which are known to be similar in content, although not translations of one another. A typical case might be two sets of articles from corresponding sections of different newspapers collected during the same period of time. If some additional bilingual seed dictionary (human-written or extracted from a parallel corpus) is also available, then the comparable corpus can be leveraged as well: a word t is likely to be the translation of a word s if it turns out that the words often appearing near s are translations of the words often appearing near t. Using this observation it is thus possible to estimate the probability that t is a valid translation of s even though they are not contained in the original dictionary. Most approaches proceed by associating with s a context vector. This vector, with one component for each word in the source language, can simply be formed by summing together the count histograms of the words occurring within a fixed window centered on all occurrences of s in the corpus, but is often constructed using statistically more robust association measures, such as mutual information. After a possible normalization step, the context vector CV(s) is translated using the seed dictionary into the target language. A context vector is also extracted from the corpus for all target words t. Eventually, writing D for the set of entries (s′, t′) of the seed dictionary, a translation score between s and t is computed as the inner product ⟨Tr(CV(s)), CV(t)⟩:

$$S(s, t) = \langle \mathrm{Tr}(CV(s)), CV(t) \rangle = \sum_{(s', t') \in D} a(s, s')\, a(t, t'),$$

where a is the association score used to construct the context vector. While effective in many cases, this approach can provide inaccurate similarity values when polysemous words and synonyms appear in the corpus. To deal with this problem, Gaussier, Renders, Matveeva, Goutte, and Déjean () propose the following extension:

$$S(s, t) = \sum_{(s', t') \in D} \Big( \sum_{s''} a(s', s'')\, a(s, s'') \Big) \Big( \sum_{t''} a(t', t'')\, a(t, t'') \Big),$$

which is more robust in cases when the entries in the seed bilingual dictionary do not cover all senses
actually present in the two sides of the comparable corpus. Although these methods for building bilingual dictionaries can be (and often are) used in isolation, it can be more effective to combine them. Using a bilingual dictionary directly is not the only way for translating a span from one language into another. A second alternative consists in using a machine translation (MT) system. While the MT system, in turn, relies on a bilingual dictionary of some sort, it is in general in the position of leveraging contextual clues to select the correct words and put them in the right order in the translation. This can be more or less useful depending on the specific task. MT systems fall, broadly speaking, into two classes: rule-based and statistical. Systems in the first class rely on sets of hand-written rules describing how words and syntactic structures should be translated. Statistical machine translation (SMT) systems learn this mapping by performing a statistical analysis of a parallel corpus. Some authors (e.g., Savoy & Berger, ) also experimented with combining translation from multiple machine translation systems.
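The basic context-vector score introduced earlier in this section, S(s, t) as a sum over seed dictionary pairs (s′, t′) of a(s, s′)a(t, t′), can be sketched on a toy comparable corpus. Raw co-occurrence counts stand in for the association score a here; this is an assumption for brevity, since the text notes that measures such as mutual information are statistically more robust. The window size, corpus, and seed dictionary are all made up for the example.

```python
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Build context vectors: a(w, w') = number of times w' occurs within
    `window` tokens of w (raw counts, used here as the association score)."""
    cv = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cv[w][tokens[j]] += 1
    return cv


def translation_score(s, t, cv_src, cv_tgt, seed):
    """S(s, t) = sum over seed pairs (s', t') of a(s, s') * a(t, t')."""
    return sum(cv_src[s][sp] * cv_tgt[t][tp] for sp, tp in seed)


# Toy comparable corpus: similar content, not sentence-by-sentence translations.
src = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
tgt = [["le", "chien", "aboie"], ["le", "chat", "dort"]]
seed = [("barks", "aboie"), ("sleeps", "dort"), ("the", "le")]

cv_src, cv_tgt = context_vectors(src), context_vectors(tgt)
score_chien = translation_score("dog", "chien", cv_src, cv_tgt, seed)
score_chat = translation_score("dog", "chat", cv_src, cv_tgt, seed)
```

Neither "dog" nor "chien" is in the seed dictionary, yet "chien" scores higher than "chat" as a candidate translation because both words co-occur with seed pairs that translate each other.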

Latent Semantic Approaches

In CLTM, latent semantic approaches rely on some interlingua (language-independent) representation. Most of the time, this interlingua representation is obtained by linear or non-linear statistical analysis techniques, and more specifically by dimensionality reduction methods with ad hoc optimization criteria and constraints. Others adopt a more manual approach, exploiting multilingual thesauri or even multilingual ontologies in order to map textual objects onto a list, possibly weighted, of interlingua concepts. For any textual object (typically a document or a section of a document), the interlingua concept representation is derived from a sequence of operations that encompasses:

1. Linguistic preprocessing (as explained in previous sections, this step amounts to extracting the relevant, normalized “terms” of the textual objects, by tokenisation, word segmentation/decompounding, lemmatisation/stemming, part-of-speech tagging, stopword removal, corpus-based term filtering, noun-phrase extraction, etc.).
2. Semantic enrichment and/or monolingual dimensionality reduction.
3. Interlingua semantic projection.

A typical semantic enrichment method is the generalized vector space model, which adds related terms (or neighbour terms) to each term of the textual object, neighbour terms being defined by some co-occurrence measure (for instance, mutual information). Semantic enrichment can alternatively be achieved by using a (monolingual) thesaurus, exploiting relationships such as synonymy, hyperonymy and hyponymy. Monolingual dimensionality reduction typically consists in performing some latent semantic analysis (LSA), a form of principal component analysis on the textual object/term matrix. Dimensionality reduction techniques such as LSA, or their discrete/probabilistic variants such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), offer to some extent a semantic robustness to deal with the effects of polysemy/synonymy, adopting a language-dependent concept representation in a space of dimension much smaller than the size of the vocabulary in a language.

Of course, steps (1) and (2) are highly language-dependent. Textual objects written in different languages will not follow the same linguistic processing or semantic enrichment/dimensionality reduction. The last step (3), however, aims at projecting textual objects into the same language-independent concept space, for any source language. This is done by first extracting these common concepts, typically from a parallel corpus that offers a natural multiple-view representation of the same objects. Starting from these multiple-view observations, common factors are extracted through the use of canonical correlation analysis (CCA), cross-language latent semantic analysis, their kernelized variants (e.g., kernel CCA) or their discrete, probabilistic extensions (cross-language latent Dirichlet allocation, multinomial CCA, …).
All these methods try to discover latent factors that simultaneously explain as much as possible of the “intra-language” variance and the “inter-language” correlation. They differ in the choice of the underlying distributions and in how precisely they define and combine these two criteria. The following subsections describe them in more detail.

As already emphasized, CLTM mainly relies on defining appropriate similarities between textual objects expressed in different languages. Numerous categorization, clustering and retrieval algorithms focus on defining efficient and powerful measures of similarity between objects, a trend strengthened recently by the development of kernel methods for textual information access. We will see that the (linear) statistical algorithms used for performing steps () and () can most of the time be embedded into one valid (Mercer) kernel, so that non-linear variants of these algorithms can be obtained very easily, just by adopting some standard non-linear kernels.

Cross-Language Semantic Analysis

This amounts to concatenating the vectorial representations of each view of the objects of the parallel collection (typically, objects are aligned sentences), and then performing a standard singular value decomposition of the global object/term matrix. Equivalently, if a dual formulation is adopted and the kernel similarity matrix between all pairs of multi-view objects is defined as the sum of the monolingual textual similarity matrices, this amounts to performing the eigenvalue decomposition of the corresponding kernel Gram matrix. The number of eigenvalues/eigenvectors retained to define the latent factors and the corresponding projections typically ranges from several hundred to several thousand components, still much fewer than the original size of the vocabulary. Note that this process does not really control the formation of interlingua concepts: nothing prevents the method from extracting factors that are linear combinations of terms in one language only.
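A minimal sketch of the primal procedure, assuming NumPy and small dense term-count matrices: the two views of each aligned object are concatenated, a singular value decomposition is taken, and the leading right singular vectors are split into one projection matrix per language. The function names and the toy matrices are illustrative, and folding in a monolingual document by multiplying with its own language's block of the latent directions is one simple choice among several.

```python
import numpy as np


def fit_cl_lsa(X1, X2, k):
    """Fit cross-language LSA on N aligned objects.

    X1: (N, n1) term matrix in language 1; X2: (N, n2) in language 2.
    Returns one (n_i, k) projection matrix per language.
    """
    X = np.hstack([X1, X2])                       # concatenate the two views
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                                  # (n1 + n2, k) latent directions
    n1 = X1.shape[1]
    return V[:n1], V[n1:]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Toy parallel collection: term 0 in each language belongs to topic A,
# term 1 to topic B (hypothetical counts).
X1 = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X2 = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P1, P2 = fit_cl_lsa(X1, X2, k=2)

# Fold in monolingual documents and compare them in the latent space.
doc_lang1_topic_a = np.array([1.0, 0.0]) @ P1
doc_lang2_topic_a = np.array([1.0, 0.0]) @ P2
doc_lang2_topic_b = np.array([0.0, 1.0]) @ P2
```

In this toy case the documents about the same topic end up aligned across languages, while documents about different topics are orthogonal. On real data the matrices are large and sparse, so a truncated or sparse SVD routine would be the more practical choice.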

Cross-Language Latent Dirichlet Allocation

The extraction of interlingua components is realised by using LDA to model the set of parallel objects, imposing the same proportion of components (topics) for all views of the same object. This is represented in Fig. . LDA performs a form of clustering, with a predefined number of components (K) and with the constraint that the two views of the same object belong to the clusters with the same membership values. This results in 2K component profiles that are then used for “folding in” (projecting) new documents by launching some form of EM to derive their posterior probabilities of belonging to each of the language-independent components. The similarity between two documents written in different languages is obtained by comparing their posterior distributions over these latent classes. Note that this approach could easily integrate supervised topic information and provides a nice framework for semi-supervised interlingua concept extraction.

Cross-Lingual Text Mining. Figure . Latent Dirichlet allocation of a parallel corpus (graphical model; labels recovered from the extracted text: α, θ, Z1, Z2, W1, W2, β1, β2, N1, N2, Nseg)

Cross-Language Canonical Correlation Analysis

The Primal Formulation

CCA is a standard statistical method to perform multi-block multivariate analysis, the goal being to find linear combinations of variables for each block (i.e., each language) that are maximally correlated. In other words, CCA is able to enforce the commonality of latent concept formations by extracting maximally correlated projections. Starting from a set of paired views of the same objects (typically, aligned sentences of a parallel corpus) in languages L1 and L2, the algebraic formulation of this optimization problem leads to a generalized eigenvalue problem of size (n1 + n2), where n1 and n2 are the sizes of the vocabularies in L1 and L2, respectively. For obvious scalability reasons, the dual (or kernel) formulation, of size N, the number of paired objects in the training set, is often preferred.

Kernel Canonical Correlation Analysis

Basically, kernel canonical correlation analysis amounts to doing CCA on some implicit, but more complex, feature space and to expressing the projection coefficients as linear combinations of the training paired objects. This results in the dual formulation, which is a generalized eigenvalue/vector
problem of size N, that involves only the monolingual kernel gram matrices K and K (matrices of monolingual textual similarities between all pairs of objects in the training set in language L and L respectively). Note that it is easy to show that the eigenvalues go by pairs: we always have two symmetrical eigenvalues +λ and −λ. This kernel formulation has the advantage to include any text specific prior properties in the kernel (e.g., use of N-gram kernels, word-sequence kernels, and any semantically-smoothed kernel). After extraction of the first k generalized eigenvalues/eigenvectors, the similarity between any pair of test objects in languages L and L can be computed by using projection matrices composed of extracted eigenvector as well as the (monolingual) kernels of the test objects with the training objects. Regularization and Partial Least Squares Solution When

the number of training examples (N) is less than n and n (the dimensions of the monolingual feature spaces), the eigenvalue spectrum of the KCCA problem has generally two null eigenvalues (due to data centering), (N −) eigenvalues in + and (N −) eigenvalues in −, so that, as such, the KCCA problem only results in trivial solutions and is useless. When using kernel methods, the case (N < n , n ) is frequent, so that some regularization scheme is needed. One way of realizing this regularization is to resort to finding the directions of maximum covariance (instead of correlation): this can be considered as a partial least squares (PLS) problem, whose formulation is very similar to the CCA problem. Adopting a mixed criterion CCA/PLS (trying to maximize a combination of covariance and correlation between projections) turns out to both avoid overfitting (or spurious solutions) and to enhance numerical stability. Approximate Solutions Both CCA and KCCA suffer from a lack of scalability, due to the fact the complexity of generalized eigenvalue/vector decomposition is O(N ) for KCCA or O(min(n , n ) ) for CCA. As it can be shown that performing a complete KCCA (or KPLS) analysis amounts to do first complete PCA’s, and then a linear CCA (or PLS) on the resulting new projections, it is obvious that we could reduce the complexity by working on a reduced-rank approximation (incomplete

C

KPCA) of the kernel matrices. However, the implicit projections derived from incomplete KPCA may be not optimal with respect to cross-correlation or covariance criteria. Another idea to decrease the complexity is to perform some incomplete Cholesky decomposition of the (monolingual) kernel matrices K and K (that is equivalent to partial Gram-Schmit orthogonalisation in the feature space): K = G .Gt and K = G .Gt , with Gi of rank k ≪ N. Considering Gi as the new representation of the training data, KCCA now reduces to solving a generalized eigenvalue problem of size .k.
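The incomplete Cholesky route described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the greedy pivoting rule, the ridge term `reg` added to the within-view blocks (a simple stand-in for the regularization discussed earlier, not the mixed CCA/PLS criterion itself), and the synthetic data are all hypothetical choices.

```python
import numpy as np

def incomplete_cholesky(K, k, tol=1e-10):
    """Pivoted incomplete Cholesky: returns G (N x r, r <= k) with K ~ G @ G.T."""
    N = K.shape[0]
    d = np.diag(K).astype(float).copy()          # residual diagonal
    G = np.zeros((N, k))
    for j in range(k):
        i = int(np.argmax(d))                    # greedy pivot: largest residual
        if d[i] <= tol:                          # K already well approximated
            return G[:, :j]
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d = np.maximum(d - G[:, j] ** 2, 0.0)
    return G

def reduced_cca(G1, G2, reg=1e-3):
    """Regularized CCA on the reduced representations (problem of size k1 + k2)."""
    G1 = G1 - G1.mean(axis=0)                    # centering in feature space
    G2 = G2 - G2.mean(axis=0)
    k1, k2 = G1.shape[1], G2.shape[1]
    C12 = G1.T @ G2
    A = np.block([[np.zeros((k1, k1)), C12],
                  [C12.T, np.zeros((k2, k2))]])
    B = np.block([[G1.T @ G1 + reg * np.eye(k1), np.zeros((k1, k2))],
                  [np.zeros((k2, k1)), G2.T @ G2 + reg * np.eye(k2)]])
    # Solve A w = lam B w by whitening with the Cholesky factor of B.
    L = np.linalg.cholesky(B)                    # B = L L^T
    S = np.linalg.solve(L, np.linalg.solve(L, A).T).T   # L^{-1} A L^{-T}
    lam, U = np.linalg.eigh((S + S.T) / 2)
    W = np.linalg.solve(L.T, U)                  # back-transform: w = L^{-T} u
    return lam[::-1], W[:, ::-1]                 # sorted by decreasing eigenvalue

# Synthetic paired views generated from shared latent content (assumption).
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 2))
X1 = Z @ rng.normal(size=(2, 5))
X2 = Z @ rng.normal(size=(2, 4))
K1, K2 = X1 @ X1.T, X2 @ X2.T                    # linear kernels, rank 2
G1 = incomplete_cholesky(K1, k=2)
G2 = incomplete_cholesky(K2, k=2)
lam, W = reduced_cca(G1, G2)
```

The returned eigenvalues illustrate the ±λ pairing mentioned in the text, and the eigenproblem has size 2k rather than a size that grows with N.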

Specific Applications

The previous sections illustrated a number of different ways of solving the core problem of cross-language text mining: quantifying the similarity between two spans of text in different languages. In this section we turn to describing some actual applications relying on these methods.

Cross-Language Information Retrieval (CLIR)

Given a collection of documents in several languages and a single query, the CLIR problem consists in producing a single ranking of all documents according to their relevance to the query. CLIR is particularly useful whenever a user has some knowledge of the languages in which documents are written, but not enough to express his/her information needs in those languages by means of a precise query. Sometimes CLIR engines are coupled with translation tools to help the user access the content of relevant documents written in languages unknown to him/her. In this case document collections in an even larger number of languages can be effectively queried.

It is probably fair to say that the vast majority of CLIR systems use a translation-based approach. In most cases it is the query that is translated into all languages before being sent to monolingual search engines. While this limits the amount of translation work that needs to be done, it requires doing it online at query time. Moreover, short queries can be difficult to translate correctly, since there is little context to help identify the sense in which words are used. For these reasons several groups have also proposed translating all documents at indexing time instead. Regardless of whether queries or documents are translated, whenever similarity scores between (possibly translated) queries and (possibly translated) documents are not directly comparable, all methods then face the problem of merging multiple monolingual rankings into a single multilingual ranking.

Research in CLIR and cross-language question answering (see below) has been significantly stimulated by at least three government-sponsored evaluation campaigns:

● The NII Test Collection for IR Systems (NTCIR) (http://research.nii.ac.jp/ntcir/), running yearly since , focusing on Asian languages (Japanese, Chinese, Korean) and English.
● The Cross-Language Evaluation Forum (CLEF) (http://www.clef-campaign.org), running yearly since , focusing on European languages.
● A cross-language track at the Text Retrieval Conference (TREC) (http://trec.nist.gov/), which was run until , focused on querying documents in Arabic using queries in English.

The respective websites are ideal starting points for any further exploration of the subject.

Cross-Language Question Answering (CLQA)

Question answering is the task of automatically finding the answer to a specific question in a document collection. While in practice this vague description can be instantiated in many different ways, the sense in which the term is mostly understood is strongly influenced by the task specification formulated by the National Institute of Standards and Technology (NIST) of the United States for its TREC evaluation conferences (see above). In this sense, the task consists in identifying a text snippet, i.e., a substring, of a predefined maximal length (e.g., characters, or characters) within a document in the collection containing the answer. Different classes of questions are considered:

● Questions around facts and events.
● Questions requiring the definition of people, things and organizations.
● Questions requiring as answer lists of people, objects or data.

Most proposals for solving the QA problem proceed by first identifying promising documents (or document segments) by using information retrieval techniques, treating the question as a query, and then performing some finer-grained analysis to converge on a sufficiently short snippet. Questions are classified in a hierarchy of possible “question types.” Also, documents are preliminarily indexed to identify elements (e.g., person names) that are potential answers to questions of relevant types (e.g., “Who” questions).

Cross-language question answering (CLQA) is the extension of this task to the case where the collection contains documents in a language different from the language of the question. In this task a CLIR step replaces the monolingual IR step to shortlist promising documents. The classification of the question is generally done in the source language. Both CLEF and NTCIR (see above) organize cross-language question answering comparative evaluations on an annual basis.

Cross-Language Categorization (CLCat) and Clustering (CLCLu)

Cross-language categorization tackles the problem of categorizing documents in different languages within a single categorization scheme. The vast majority of document categorization systems rely on machine learning techniques to automatically acquire the necessary knowledge (often referred to as a model) from a possibly large collection of manually categorized documents. Most often the model is based on frequency counts of words, and is thus intrinsically language-dependent. The most direct way to perform categorization in different languages would consist in manually categorizing a sufficient number of documents in all languages of interest and then training a set of independent categorizers. In some cases, however, it is impractical to manually categorize enough documents to ensure accurate categorization in all languages, while it can be easier to identify bilingual dictionaries or parallel (or comparable) corpora for the language pairs and the application domain of interest. In such cases it is preferable to obtain manually categorized documents only for a single language A and use them to train a monolingual categorizer. Any of the translation-based approaches described above can then be used to translate a document originally in language B – or most often its representation as a bag of words – into language A. Once the document is translated, it can be categorized using the monolingual A system.

As an alternative, latent-semantics approaches can be used as well. An existing parallel corpus can be used to identify an abstract vector space common to A and B. The manually categorized documents in A can then be represented in this space, and a model can be learned that operates directly on this latent-semantic representation. Whenever a document in B needs to be categorized, it is first projected into the common semantic space and then categorized using the same model.

All these considerations carry over unchanged to the cross-language clustering task, which consists in identifying subsets of documents in a multilingual document collection that are mutually similar according to some criterion. Again, this task can be effectively solved either by translating all documents into a single language or by learning a common semantic space and performing the clustering task there. While CLCat and Clustering are relevant tasks in many real-world situations, it is probably fair to say that less effort has been devoted to them by the research community than to CLIR and CLQA.
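The translation-based route to categorization described above (translate the bag of words of a language-B document into language A, then apply the monolingual categorizer) can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the tiny English training set, the French-to-English dictionary, the uniform split of counts over alternative translations, and the choice of a multinomial naive Bayes categorizer with add-one smoothing.

```python
from collections import Counter
import math

def train_nb(bags, labels):
    """Train a multinomial naive Bayes model on bags of words (Counters)."""
    vocab = {w for bag in bags for w in bag}
    counts = {c: Counter() for c in set(labels)}
    for bag, c in zip(bags, labels):
        counts[c].update(bag)
    return vocab, counts, Counter(labels)

def classify(model, bag):
    """Return the most probable category for a bag of words."""
    vocab, counts, prior = model
    n_docs = sum(prior.values())
    best, best_score = None, -math.inf
    for c in sorted(counts):
        total = sum(counts[c].values())
        score = math.log(prior[c] / n_docs)
        for w, f in bag.items():
            if w in vocab:                       # ignore out-of-vocabulary words
                score += f * math.log((counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

def translate_bag(bag, dictionary):
    """Translate a bag of words, spreading each count over its translations."""
    out = Counter()
    for w, f in bag.items():
        translations = dictionary.get(w, [])     # untranslatable words are dropped
        for t in translations:
            out[t] += f / len(translations)
    return out

# Hypothetical labelled documents in language A (English).
train_docs = [Counter("the team won the match".split()),
              Counter("football match tonight".split()),
              Counter("the bank raised interest rates".split()),
              Counter("stock market and bank shares".split())]
train_labels = ["sport", "sport", "finance", "finance"]
model = train_nb(train_docs, train_labels)

# Hypothetical bilingual dictionary from language B (French) to A.
fr_en = {"match": ["match"], "football": ["football"],
         "banque": ["bank"], "taux": ["rates", "rate"]}

label_sport = classify(model, translate_bag(Counter(["football", "match", "hier"]), fr_en))
label_finance = classify(model, translate_bag(Counter(["banque", "taux"]), fr_en))
```

Only the language-A side needs labelled data; the language-B documents are handled entirely through the dictionary lookup.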

Recommended Reading

Brown, P. E., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, (), –.
Gaussier, E., Renders, J.-M., Matveeva, I., Goutte, C., & Déjean, H. (). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the nd annual meeting of the association for computational linguistics, Barcelona, Spain. Morristown, NJ: Association for Computational Linguistics.
Savoy, J., & Berger, P. Y. (). Report on CLEF- evaluation campaign: Monolingual, bilingual and GIRT information retrieval. In Proceedings of the cross-language evaluation forum (CLEF) (pp. –). Heidelberg: Springer.
Zhang, Y., & Vines, P. (). Using the web for translation disambiguation. In Proceedings of the NTCIR- workshop meeting, Tokyo, Japan.

Cross-Validation

Definition

Cross-validation is a process for creating a distribution of pairs of training and test sets out of a single data set. In cross-validation the data are partitioned into k subsets, $S_1, \ldots, S_k$, each called a fold. The folds are usually of approximately the same size. The learning algorithm is then applied k times, for i = 1 to k, each time using the union of all subsets other than $S_i$ as the training set and using $S_i$ as the test set.
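The partitioning scheme in this definition can be sketched in a few lines of Python. The function name `k_fold_splits`, the seeded shuffle before partitioning, and the round-robin assignment of items to folds are assumptions of this sketch; the definition itself only requires k folds of roughly equal size.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (training_set, test_set) pairs for k-fold cross-validation."""
    items = list(data)
    random.Random(seed).shuffle(items)           # assumption: shuffle before splitting
    folds = [items[i::k] for i in range(k)]      # k folds of near-equal size
    for i in range(k):
        test = folds[i]                          # fold S_i is the test set
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test                        # union of the other folds trains

splits = list(k_fold_splits(range(10), k=3))
```

Each item appears in exactly one test set, so the k test sets together cover the whole data set. Setting k equal to the number of items gives test sets of size one.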

Cross References

Algorithm Evaluation
Leave-One-Out Cross-Validation

Cumulative Learning

Pietro Michelucci, Daniel Oblinger
Strategic Analysis, Inc., Arlington, VA, USA
DARPA/IPTO, Arlington, VA, USA

Synonyms

Continual learning; Lifelong learning; Sequential inductive transfer

Definition

Cumulative learning (CL) exploits knowledge acquired on prior tasks to improve learning performance on subsequent related tasks. Consider, for example, a CL system that is learning to play chess. Here, one might expect the system to learn from prior games concepts (e.g., favorable board positions, standard openings, end games, etc.) that can be used for future learning. This is in contrast to base learning (Vilalta & Drissi, ), in which a fixed learning algorithm is applied to a single task and performance tends to improve only with more exemplars. So, in CL there tends to be explicit reuse of learned knowledge to constrain new learning, whereas base learning depends entirely upon new external inputs. Relevant techniques for CL operate over multiple tasks, often at higher levels of abstraction, such as new problem space representations, task-based selection of learning algorithms, dynamic adjustment of learning parameters, and iterative analysis and modification of the learning algorithms themselves. Though actual usage of this term is varied and evolving, CL typically connotes sequential inductive transfer. It should be noted that the word “inductive” in this connotation qualifies the transfer of knowledge to new tasks, not the underlying learning algorithms.

Related Terminology

The terms “meta-learning” and “learning to learn” are sometimes used interchangeably with CL. However, each of these concepts has a specific relationship to CL. Meta-learning (Brazdil et al., ; Vilalta & Drissi, ) involves the application of learning algorithms to meta-data, which are abstracted representations of input data or learning system knowledge. In the case that abstractions of system knowledge are themselves learning algorithms, meta-learning involves assessing the suitability of these algorithms for previous tasks and, on that basis, selecting algorithms for new tasks (see entry on “meta-learning”). In general, the sharing of abstracted knowledge across tasks in a CL system implies the use of meta-learning techniques. However, the converse is not true: meta-learning can and does occur in learning systems that do not accumulate and transfer knowledge across tasks.

Learning to learn is a synonym for inductive transfer. Thus, learning to learn is more general than CL. Though it specifies the application of knowledge learned in one domain to another, it does not stipulate whether that knowledge is accumulated and applied sequentially or shared in a parallel learning context.

Motivation and Background

Traditional supervised learning approaches require large datasets and extensive training in order to generalize to new inputs in a single task. Furthermore, traditional (non-CL) reinforcement learning approaches require tightly constrained environments to ensure a tractable state space. In contrast, humans are able to generalize across tasks in dynamic environments from brief exposure to small datasets. The human advantage seems to derive from the ability to draw upon prior task and context knowledge to constrain hypothesis development for new tasks. Recognition of this disparity between human learning and traditional machine learning has led to the pursuit of methods that seek to emulate the accumulation and exploitation of task-based knowledge that is observed in humans. A coarse evolution of this work is depicted in Fig. .

History

Advancements in CL have resulted from two classes of innovation: the development of techniques for inductive transfer and the integration of those techniques into autonomous learning systems.

Alan Turing () was the first to propose a cumulative learning system. His paper is best remembered for the imitation game, later known as the Turing test. However, the final sections of the paper address the question of how a machine could be made sufficiently complex to pass the test. He posited that programming it would be too difficult a task. Therefore, it should be instructed as one might teach a child, starting with simple concepts and working up to more complex ones.

Banerji () introduced the use of predicate logic as a description language for machine learning. Thus, Banerji was one of the earliest advocates of what would later become ILP. His concept description language allowed the use of background knowledge and was therefore an extensible language. The first implementation of a cumulative learning system based on Banerji’s ideas was Cohen’s CONFUCIUS (Cohen, ; Cohen & Sammut, ). In this work, an instructor teaches the system concepts that are stored in a long-term memory. When examples of a new concept are seen, their descriptions are matched against stored concepts, which allows the system to re-describe the examples in terms of the background knowledge. Thus, as more concepts are accumulated, the system is capable of describing complex objects more compactly than if it had not had the background knowledge. Compact representations generally allow complex concepts to be learned more efficiently. In many cases, learning would be intractable without the prior knowledge. See the entries on Inductive Logic Programming, which describe the use of background knowledge further.

Cumulative Learning. Figure . Evolution of cumulative learning (diagram labels: Supervised Learning; Reinforcement Learning; Inductive Transfer; Parallel: Inductive Bias; Multi-Task Learning; Sequential/Hybrid: Cumulative Learning)

Independent of the research in symbolic learning, much of the inductive transfer research that underlies CL took root in artificial neural network research, a traditional approach to supervised learning. For example, Abu-Mostafa () introduced the notion of reducing the hypothesis space of a neural network by introducing “hints” either as hard-wired additions to the network or via examples designed to teach a particular invariance. The task of a neural network can be thought of as the determination of a function that maps exemplars into a classification space. So, in this context, hints constitute an articulation of some aspect of the target mapping function. For example, if a neural network is tasked with mapping numbers into primes and composites, one “hint” would be that all even numbers (besides 2) are composite. Leveraging such a priori knowledge about the mapping function may facilitate convergence on a solution. An inherent limitation of neural networks, however, is their immutable architecture, which does not lend itself to the continual accumulation of knowledge.
Consequently, Ring () introduced a neural network that constructs new nodes on demand in a reinforcement learning context in order to support ongoing hierarchical knowledge acquisition and transfer. In this model, nodes called “bions” correspond simultaneously to the enactment and perception of a single behavior. If two bions are activated in sequence repeatedly, a new bion is created to join the coincident pair and represent their collective functionality.

Contemporaneously, Pratt, Mostow, and Kamm () investigated the hypothesis that knowledge acquired by one neural network could be used to help another neural network learn a related task. In the speech recognition domain, they trained three separate networks, each corresponding to speech segments of a different length, such that each network was optimized to learn certain types of phonemes. They then demonstrated that a direct transfer of information encoded as network weights from these three specialized networks to a single, combined speech recognition network resulted in a tenfold reduction in training epochs for the combined network compared with the number of training epochs required when no knowledge was transferred. This was one of the first empirical results in neural network-based transfer learning.

Caruana () extended this work to demonstrate the performance benefits associated with the simultaneous transfer of inductive bias in a “Multitask Learning” (MTL) methodology. In this work, Caruana hypothesized that training the same neural network simultaneously on related tasks would naturally induce additional constraints on learning for each individual task. The intuition was that converging on a mapping in support of multiple tasks with shared representations might best reveal aspects of the input that are invariant across tasks, thus obviating within-task regularities, which might be less relevant to classification. Those empirical results are supported by Baxter (), who proved that the number of examples required by a representation learner for learning a single task is an inverse linear function of the number of simultaneous tasks being learned.

Though the innovative underpinnings of inductive transfer that critically underlie CL evolved in a supervised learning context, it was the integration of those methods with classical reinforcement learning that has led to current models of CL.
Early integration of this type comes from Thrun and Mitchell (), who applied an extension of explanation-based learning (EBL), called explanation-based neural networks (EBNN) (Mitchell & Thrun, ), to an agent-based “lifelong learning framework.” This framework provides for the acquisition of different control policies for different environments and reward functions. Since the robot actuators, sensors, and the environment (largely) remain invariant, this framework allows knowledge acquired from one control problem to be applied to another. By using EBNN to allow learning from previous control problems to constrain learning on new control problems, learning is accelerated over the lifetime of the robot.

More recently, Silver and Mercer () introduced a hybrid model that involves a combination of parallel and sequential inductive transfer in an autonomous agent framework. The so-called task rehearsal method (TRM) uses MTL to combine new training inputs with relevant exemplars that are generated from prior task knowledge. Thus, inductive bias is achieved by training the neural networks on new tasks while simultaneously rehearsing learned task knowledge.

Structure of the Learning System

CL is characterized by systems that use prior knowledge to bias future learning. The canonical interpretation is that knowledge transfer occurs at the task level. Although this description encompasses a broad research space, it is not boundless. In particular, CL systems must be able to (1) retain knowledge and (2) use that knowledge to restrict the hypothesis space for new learning. Nonetheless, learning systems can vary widely across numerous orthogonal dimensions and still meet these criteria.

Toward a CL Specification

Recognizing the empirical utility of a more specific delineation of CL systems, Silver and Poirier () introduced a set of functional requirements, classification criteria, and performance specifications that characterize more precisely the scope of machines capable of lifelong learning. Any system that meets these requirements is considered a machine lifelong learning (ML) system. A general CL architecture that conforms to the ML standard is depicted in Fig. .

Two basic memory constructs are typical of CL systems. Long term memory (LTM) is required for storing domain knowledge (DK) that can be used to bias new learning. Short term memory (STM) provides a working memory for building representations and testing hypotheses associated with new task learning. Most of the ML requirements specify the interplay of these constructs. LTM and STM are depicted in Fig. , along with a comparison process, an assessment process, and the learning environment. In this model, the comparison process evaluates the training input in the context of LTM to determine the most relevant domain knowledge that can be used to constrain short term learning. The comparison process also determines the weight assigned to domain knowledge that is used to bias short term learning. Once the rate of performance improvement on the primary task falls below a threshold, the assessment process compares the state of STM to the environment to determine which domain knowledge to extract and store in LTM.

Classification of CL Systems

The simplicity of the architecture shown in Fig. belies the richness of the feature space for CL systems. The following classification dimensions are derived largely from the ML specification. This list includes both qualitative and quantitative dimensions. They are presented in three overlapping categories: architectural features, characteristics of the knowledge base, and learning capabilities.
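The interplay of LTM, STM, the comparison process, and the assessment process described under “Toward a CL Specification” can be made concrete with a deliberately small Python sketch. All specifics are assumptions for illustration: tasks are one-parameter regressions y = w·x, domain knowledge is a stored weight, the comparison process simply selects the stored weight with the lowest error on the new data and uses it as the starting hypothesis (the weighting of domain knowledge is not modelled), and the assessment process stores a hypothesis only if its error is below a fixed bound.

```python
def mse(w, data):
    """Mean squared error of the hypothesis y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def learn_task(data, ltm, lr=0.05, threshold=1e-8, max_steps=10_000):
    """Learn one task in STM, biased by LTM; returns (weight, steps used)."""
    # Comparison process: pick the most relevant stored hypothesis, if any.
    w = min(ltm, key=lambda h: mse(h, data)) if ltm else 0.0
    prev = cur = mse(w, data)
    steps = 0
    # Short-term learning: gradient descent until improvement stalls.
    while steps < max_steps:
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
        steps += 1
        cur = mse(w, data)
        if prev - cur < threshold:               # improvement rate below threshold
            break
        prev = cur
    # Assessment process: retain only sufficiently accurate knowledge.
    if cur < 1e-3:
        ltm.append(w)
    return w, steps

# Two related hypothetical tasks: y = 2.0 x and y = 2.1 x.
task1 = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
task2 = [(x, 2.1 * x) for x in (1.0, 2.0, 3.0)]

ltm = []                                         # long-term memory of weights
w1, steps1 = learn_task(task1, ltm)              # learned from scratch
w2, steps2 = learn_task(task2, ltm)              # biased by knowledge from task1
_, steps2_scratch = learn_task(task2, [])        # same task without transfer
```

In this toy setting the second task converges in fewer steps when started from retained knowledge than when learned from scratch, which is the kind of effect the architecture is meant to produce.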

Architecture

The following architectural dimensions for a CL system range from paradigm choices to low-level interface considerations.

Learning paradigm – The learning paradigm(s) may include supervised learning (e.g., neural network, SVM, ILP, etc.), unsupervised learning (e.g., clustering), reinforcement learning (e.g., automated agent), or some combination thereof. Figure depicts a general architecture with processes that are common across these learning paradigms, and which could be elaborated to reflect the details of each.

Task order – CL systems may learn tasks sequentially (Thrun & Mitchell, ), in parallel (e.g., MTL (Caruana, )), or via a hybrid methodology (e.g., TRM (Silver & Mercer, )). One hybrid approach is to engage in practice (i.e., revisiting prior learned tasks). Transferring knowledge between learned tasks through practice may serve to improve generalization accuracy. Task order would be reflected in the sequence of events within and among process arrows in the Fig. architecture. For example, a system may alternate between processing new exemplars and “practicing” with old, stored exemplars.

Transfer method – Knowledge transfer can be representational or functional. Functional transfer provides implicit pressure from related training exemplars. For example, the environmental input in Fig. may take the form of training exemplars drawn randomly from data representing two related tasks, such that learning to classify exemplars from one task implicitly constrains learning on the other task. Representational knowledge transfer involves the direct or indirect (Pratt et al., ) assignment of a hypothesis representation. A direct inductive transfer entails the assignment of an original hypothesis representation, such as a vector of trained neural network activation weights. This might take the form of a direct injection to LTM in Fig. . Indirect transfer implies that some level of abstraction analysis has been applied to the hypothesis representation prior to assignment.

Learning stages – A learning system may implement learning in a single stage or in a series of stages. An example of a two-stage system is one that waits to initiate the long-term storage of domain knowledge until after primary task learning in short-term memory is complete. Like task order, learning stages would be reflected in the sequence of events within and among process arrows in the Fig. architecture. But in this case, ordering pertains to the manner in which learning is staged across encoding processes.

Interface cardinality – The interface cardinality can be fixed or variable. Fixing the number of inputs and outputs has the advantage of providing a consistent interface without posing restrictions on the growth of the internal representation.

Data type – The input and output data types can be fixed or variable. A type-flexible system can produce both categorical and scalar predictions.

Scalability – CL systems may or may not scale on a variety of dimensions including inputs, outputs, training examples, and tasks.

Cumulative Learning. Figure . Typical CL system (diagram components: Comparison Process, Assessment Process, LTM, STM, Environment; arrow labels: State, Relevant DK, Extracted DK)

Knowledge

This category pertains to the long-term storage of learned knowledge. Thus, the following CL dimensions characterize knowledge representation, storage, and retrieval.

Knowledge representation – Stored knowledge can manifest as functional or representational. Functional knowledge retention involves the storage of specific exemplars or parameter values, which tends to be more accurate, whereas representational knowledge retention involves the storage of hypotheses derived from training on exemplars, which has the advantage of storage economy.

Retention efficacy – The efficacy of long term retention varies across CL systems. Effective retention implies that only domain knowledge with an acceptable level of accuracy is retained, so that errors are not propagated to future hypotheses. A related consideration is whether or not the consolidation of new domain knowledge degrades the accuracy of current or prior hypotheses.

Retention efficiency – The retention efficiency of long term memory can vary according to both economy of representation and computational efficiency.

Indexing method – The input to the comparison process used to select appropriate knowledge for biasing new learning may simply be exemplars (as provided by LTM in Fig. ) or may take a representational form (e.g., a vector of neural network weights).

Indexing efficiency – CL systems vary in terms of the speed and accuracy with which they can identify related prior knowledge that is suitable for inductive transfer during short term learning. The input to this selection process is the indexing method.

Meta-knowledge – CL systems differentially exhibit the ability to abstract, store, and utilize meta-knowledge, such as characteristics of the input space, learning system parameter values, etc.

Cumulative Learning. Table . CL System Dimensions (ML guidance is indicated by ✓)

Category | Dimension | Values
Architecture | Learning paradigm | Supervised learning; Reinforcement learning; Unsupervised learning; ✓ Hybrid
Architecture | Task order | Sequential; Parallel; ✓ Revisit (practice); Hybrid
Architecture | Transfer method | Functional; Representational – direct; Representational – indirect
Architecture | Learning stages | ✓ Single (computational retention efficiency); Multiple
Architecture | Interface cardinality | ✓ Fixed; Variable
Architecture | Data type | Fixed; Variable
Architecture | Scalability | ✓ Inputs; ✓ Outputs; ✓ Exemplars; ✓ Tasks
Knowledge | Representation | Functional; Representational – disjoint; ✓ Representational – continuous
Knowledge | Retention efficacy | ✓ Improves prior task performance; ✓ Improves new task performance
Knowledge | Retention efficiency | ✓ Space (memory usage); ✓ Time (computational processing)
Knowledge | Indexing method | ✓ Deliberative – functional; ✓ Deliberative – representational; Reflexive
Knowledge | Indexing efficiency | ✓ Time < $O(n^c)$, c > (n = tasks)
Knowledge | Meta-knowledge | ✓ Probability distribution of input space; Learning curve; Error rate
Learning | Agency | Single learning method; Task-based selection of learning method
Learning | Utility | Single learning method; Task-based selection of learning method
Learning | Task awareness | Task boundary identification (begin/end)
Learning | Bias modulation | ✓ Estimated sample complexity; ✓ Number of task exemplars; ✓ Generalization accuracy of retained knowledge; ✓ Relatedness of retained knowledge
Learning | Learning efficacy | ✓ Generalization ∣ bias ≥ generalization ∣ no bias
Learning | Learning efficiency | ✓ Time ∣ bias ≤ time ∣ no bias

Learning

While all of the dimensions listed herein impact learning, the following dimensions correspond to specific learning capabilities or learning performance metrics.

Agency – The agency of a learning system is the degree of sophistication exhibited by its top-level controller. For example, a learning system may be on the low end of the agency continuum if it always applies one predetermined learning method to one task, or on the high end if it selects among many learning methods as a function of the learning task. One might imagine, for example, two process diagrams such as the one depicted in Fig. , that share the same LTM but are otherwise distinct and differentially activated by a governing controller as a function of qualitative aspects of the input.

Utility – Domain knowledge acquisition can be deliberative, in the sense that the learning system decides which hypotheses to incorporate based upon their estimated utility, or reflexive, in which case all hypotheses are stored irrespective of utility considerations.

Task awareness – Task awareness characterizes the system’s ability to identify the beginning and end of a new task.

Bias modulation – A CL system may have the ability to determine the extent to which short-term learning would benefit from inductive transfer and, on that basis, assign a relevant weight. The depth of this analysis can vary and might consider factors such as the estimated sample complexity, the number of exemplars, the generalization accuracy of retained knowledge, and the relatedness of retained knowledge.

Learning efficacy – A measure of learning efficacy is derived by comparing generalization performance in the presence and absence of an inductive bias. Learning is considered effective when the application of an inductive bias results in greater generalization performance on the primary task than when the bias is absent.

Learning efficiency – Similarly, learning efficiency is assessed by comparing the computational time needed to generate a hypothesis in the presence and absence of an inductive bias. Lower computational time in the presence of bias signifies greater learning efficiency.
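The two performance criteria just described amount to pairwise comparisons between a run that uses an inductive bias and one that does not. The following sketch states them as predicates. The record type, function names, and numbers are hypothetical; the strict inequalities follow the wording of the prose ("greater" generalization, "lower" time), whereas the table of dimensions states the corresponding guidance in non-strict form.

```python
from dataclasses import dataclass


@dataclass
class LearningRun:
    """Outcome of learning one primary task (hypothetical record type)."""
    generalization: float  # e.g., accuracy on held-out exemplars
    seconds: float         # computational time needed to generate the hypothesis


def is_effective(with_bias: LearningRun, without_bias: LearningRun) -> bool:
    """Effective: the inductive bias yields greater generalization performance."""
    return with_bias.generalization > without_bias.generalization


def is_efficient(with_bias: LearningRun, without_bias: LearningRun) -> bool:
    """Efficient: the inductive bias yields lower computational time."""
    return with_bias.seconds < without_bias.seconds


# Made-up measurements, for illustration only.
biased = LearningRun(generalization=0.91, seconds=12.0)
unbiased = LearningRun(generalization=0.86, seconds=20.0)
print(is_effective(biased, unbiased), is_efficient(biased, unbiased))
```

In practice both quantities would be estimated over repeated runs on the same primary task, since a single comparison can be dominated by noise.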

The Research Space

Table summarizes the classification dimensions, providing an overview of the research space, an evaluative framework for assessing and contrasting CL approaches, and a generative framework for identifying new areas of exploration. In addition, checked items in the Values column indicate ML guidance. Specifically, an ideal ML system would correspond functionally to the called-out items and performance criteria. However, Silver and Poirier () suggest that it would be nearly impossible to build a strictly compliant ML system, since some of the recommended criteria do not coexist easily. For example, effective and efficient learning are mutually incompatible because they require different forms of knowledge transfer. Nonetheless, a CL system that falls within the scope of the majority of the ML criteria would be well positioned to exhibit lifelong learning behavior.

Future Directions

Emergent work (Oblinger, ; Swarup, Lakkaraju, Ray, & Gasser, ) in instructable computing has given rise to a new CL paradigm that is largely ML compliant and involves high degrees of task awareness and agency sophistication. Swarup et al. () describe an approach in which domain knowledge is represented in the form of structured graphs. Short-term (primary task) learning occurs via a genetic algorithm, after which domain knowledge is extracted by mining frequent subgraphs. The accumulated domain knowledge forms an ontology to which the learning system grounds symbols as a result of structured interactions with instructional agents. Subsequent interactions occur using the symbol system as a shared lexicon for communication between the instructor and the learning system. Knowledge acquired from these interactions bootstraps future learning.

The Bootstrapped Learning framework proposed by Oblinger () provides for hierarchical, domain-independent learning that, like the effort described above, is also premised on a model of building concepts from structured lessons. In this case, however, there is no a priori knowledge acquisition. Instead, some “common” knowledge about the world is provided explicitly to the learning system, and then lessons are taught by a human teacher using the same natural instruction methods that would be used to teach another human. Rather than requiring a specific learning algorithm, this framework provides a context for evaluating and comparing learning algorithms. It includes a knowledge representation language that supports syntactic, logical, procedural, and functional knowledge; an interaction language for communication among the learning system, instructor, and environment; and an integration architecture that evaluates, processes, and responds to interaction language communiqués in the context of existing knowledge and through the selective utilization of available learning algorithms. The learning performance advantages anticipated by these proposals for instructional computing seem to stem from the economy of representation afforded by hierarchical knowledge combined with the tremendous learning bias imposed by explicit instruction.

Recommended Reading

Abu-Mostafa, Y. (). Learning from hints in neural networks (invited). Journal of Complexity, (), –.
Banerji, R. B. (). A Language for the Description of Concepts. General Systems, , –.
Baxter, J. (). Learning internal representations. In (COLT): Proceeding of the workshop on computational learning theory, Santa Cruz, California. Morgan Kaufmann.
Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (). Metalearning – Applications to Data Mining, Springer.
Caruana, R. (). Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the tenth international conference on machine learning, University of Massachusetts, Amherst (pp. –).
Caruana, R. (). Algorithms and applications for multitask learning. In Machine learning: Proceedings of the th international conference on machine learning (ICML ), Bari, Italy (pp. –). Morgan Kaufmann.
Cohen, B. L. (). A Theory of Structural Concept Formation and Pattern Recognition. Ph.D. Thesis, Department of Computer Science, The University of New South Wales.
Cohen, B. L., & Sammut, C. A. (). Object Recognition and Concept Learning with CONFUCIUS. Pattern Recognition Journal, (), –.
Mitchell, T. (). The need for biases in learning generalizations. Rutgers TR CBM-TR-.
Mitchell, T. M., & Thrun, S. B. (). Explanation-based neural network learning for robot control. In Hanson, Cowan, & Giles (Eds.), Advances in neural information processing systems (pp. –). San Francisco, CA: Morgan Kaufmann.
Nilsson, N. J. (). Introduction to machine learning: An early draft of a proposed textbook (p. ). Online at http://ai.stanford.edu/~nilsson/MLBOOK.pdf. Accessed on July , .
Oblinger, D. (). Bootstrapped learning proposer information pamphlet for broad agency announcement -. Online at http://fs.fbo.gov/EPSData/ODA/Synopses//BAA/BLPIPfinal.pdf.
Pratt, L. Y., Mostow, J., & Kamm, C. A. (). Direct transfer of learned information among neural networks. In Proceedings of the ninth national conference on artificial intelligence (AAAI-), Anaheim, CA (pp. –).
Ring, M. (). Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In Proceedings of the eighth international workshop (ML), San Mateo, California.
Silver, D., & Mercer, R. (). The task rehearsal method of lifelong learning: Overcoming impoverished data. In R. Cohen & B. Spencer (Eds.), Advances in artificial intelligence, th conference of the Canadian society for computational studies of intelligence (AI ), Calgary, Canada, May –, . Lecture notes in computer science (Vol. , pp. –). London: Springer.
Silver, D., & Poirier, R. (). Requirements for machine lifelong learning. JSOCS Technical Report TR--, Acadia University.
Swarup, S., Lakkaraju, K., Ray, S. R., & Gasser, L. (). Symbol grounding through cumulative learning. In P. Vogt et al. (Eds.), Symbol grounding and beyond: Proceedings of the third international workshop on the emergence and evolution of linguistic communication, Rome, Italy (pp. –). Berlin: Springer.
Swarup, S., Mahmud, M. M. H., Lakkaraju, K., & Ray, S. R. (). Cumulative learning: Towards designing cognitive architectures for artificial agents that have a lifetime. Tech. Rep. UIUCDCS-R--.
Thrun, S. (). Lifelong learning algorithms. In S. Thrun & L. Y. Pratt (Eds.), Learning to learn. Norwell, MA: Kluwer Academic.
Thrun, S., & Mitchell, T. (). Lifelong robot learning. Robotics and Autonomous Systems, , –.
Turing, A. M. (). Computing Machinery and Intelligence. Mind, (), –.
Vilalta, R., & Drissi, Y. (). A perspective view and survey of meta-learning. Artificial Intelligence Review, , –.

Curse of Dimensionality

Eamonn Keogh, Abdullah Mueen
University of California-Riverside, Riverside, CA, USA

Definition

The curse of dimensionality is a term introduced by Bellman to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to Euclidean space (Bellman, ). For example, evenly-spaced sample points suffice to sample a unit interval with no more than . distance between points; an equivalent sampling of a -dimensional unit hypercube with a grid with a spacing of . between adjacent points would require sample points: thus, in some sense, the D hypercube can be said to be a factor of “larger” than the unit interval. Informally, the phrase curse of dimensionality is often used simply to refer to the fact that one’s intuitions about how data structures, similarity measures, and algorithms behave in low dimensions do not typically generalize well to higher dimensions.
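The grid-sampling argument can be made concrete with a few lines of arithmetic. The specific figures in the example above did not survive in this copy of the text, so the values used below (100 points per axis, 10 dimensions) are assumptions chosen purely for illustration, and the function name is hypothetical. The point is only that a regular grid with k points per axis needs k^d points in d dimensions.

```python
def grid_sample_count(points_per_axis: int, dim: int) -> int:
    """Number of points in a regular grid over the unit hypercube."""
    return points_per_axis ** dim


# Assumed illustrative values (not taken from the text above).
POINTS_PER_AXIS = 100
DIM = 10

interval = grid_sample_count(POINTS_PER_AXIS, 1)
hypercube = grid_sample_count(POINTS_PER_AXIS, DIM)

print(f"unit interval: {interval} points")
print(f"{DIM}-dimensional hypercube: {hypercube:.3e} points")
print(f"growth factor: {hypercube // interval:.3e}")
```

Keeping the grid spacing fixed while adding dimensions multiplies the sample count by the per-axis count each time, which is the exponential growth the definition refers to.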

Background

Another way to envisage the vastness of high-dimensional Euclidean space is to compare the size of the unit sphere with that of the unit cube as the dimension of the space increases. As we can see in Fig. , as the dimension increases, the unit sphere becomes an insignificant volume relative to that of the unit cube. In other words, almost all of the high-dimensional space is far away from the center.

In research papers, the phrase curse of dimensionality is often used as shorthand for one of its many implications for machine learning algorithms. Examples of these implications include:

● Nearest neighbor searches can be made significantly faster for low-dimensional data by indexing the data with an R-tree, a KD-tree, or a similar spatial access method. However, for high-dimensional data all such methods degrade to the performance of a simple linear scan across the data.
● For machine learning problems, a small increase in dimensionality generally requires a large increase in the numerosity of the data, in order to keep the same level of performance for regression, clustering, etc.
● In high-dimensional spaces, the normally intuitive concept of proximity or similarity may not be qualitatively meaningful. This is because the ratio of the distance to an object’s nearest neighbor over the distance to its farthest neighbor approaches one for high-dimensional spaces (Aggarwal, Hinneburg, & Keim, ). In other words, all objects are approximately equidistant from each other.

Curse of Dimensionality. Figure . The ratio r of the volume of the hypersphere enclosed by the unit hypercube to the volume of the hypercube, plotted against dimension (0 to 20). The most intuitive example, the unit square and unit circle, is shown as an inset. Note that the volume of the hypersphere quickly becomes irrelevant for higher dimensionality.

There are many ways to attempt to mitigate the curse of dimensionality, including feature selection and dimensionality reduction. However, there is no single solution to the many difficulties caused by the effect.
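Two of the claims above are easy to check numerically: the shrinking volume of the hypersphere relative to the enclosing unit hypercube, and the tendency of nearest and farthest neighbor distances to converge. The sketch below is illustrative only. It assumes the hypersphere is inscribed in the unit hypercube (radius 0.5), uses the standard closed form for the volume of a d-dimensional ball, and measures Euclidean distances from one random query point to uniformly distributed random points; the function names, sample size, and seed are hypothetical choices.

```python
import math
import random


def inscribed_ball_volume_ratio(dim: int) -> float:
    """Volume of the radius-0.5 ball divided by the unit hypercube's volume (1).

    Uses V = pi^(d/2) * r^d / Gamma(d/2 + 1), evaluated in log space so that
    large dimensions do not overflow.
    """
    log_volume = (dim / 2) * math.log(math.pi) + dim * math.log(0.5)
    log_volume -= math.lgamma(dim / 2 + 1)
    return math.exp(log_volume)


def nearest_to_farthest_ratio(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Ratio of nearest to farthest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


for dim in (2, 10, 100):
    volume_ratio = inscribed_ball_volume_ratio(dim)
    distance_ratio = nearest_to_farthest_ratio(dim)
    print(f"dim={dim:4d}  volume ratio={volume_ratio:.3e}  near/far={distance_ratio:.3f}")
```

For two dimensions the volume ratio is pi/4, the familiar circle-in-a-square case; as the dimension grows the volume ratio falls toward zero while the nearest-to-farthest distance ratio rises toward one. The distance experiment depends on the uniform-data assumption, and real data with lower intrinsic dimensionality may behave less severely.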

Recommended Reading

The major database (SIGMOD, VLDB, PODS), data mining (SIGKDD, ICDM, SDM), and machine learning (ICML, NIPS) conferences typically feature several papers each year that explicitly address the curse of dimensionality.

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (). On the surprising behavior of distance metrics in high dimensional spaces. In ICDT (pp. –). London, England.
Bellman, R. E. (). Dynamic programming. Princeton, NJ: Princeton University Press.
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., & Keogh, E. (). Querying and mining of time series data: Experimental comparison of representations and distance measures. In Proceedings of the VLDB endowment (Vol. , pp. –). Auckland, New Zealand.

D

Data Mining On Text

See Text Mining

Data Preparation

Geoffrey I. Webb
Monash University, Victoria, Australia

Synonyms

Data preprocessing; Feature construction

Definition

Before data can be analyzed, they must be organized into an appropriate form. Data preparation is the process of manipulating and organizing data prior to analysis.

Motivation and Background

Data are collected for many purposes, not necessarily with machine learning in mind. Consequently, there is often a need to identify and extract relevant data for the given analytic purpose. Every learning system has specific requirements about how data must be presented for analysis, and hence data must be transformed to fulfill those requirements. Further, the selection of the specific data to be analyzed can greatly affect the models that are learned. For these reasons, data preparation is a critical part of any machine learning exercise. Data preparation is often the most time-consuming part of any nontrivial machine learning project.

Processes and Techniques

The manner in which data are prepared varies greatly depending upon the analytic objectives for which they are required and the specific learning techniques and software by which they are to be analyzed. The following are a number of key processes and techniques.

Sourcing, Selecting, and Auditing Appropriate Data

It is necessary to review the data that are already available, assess their suitability to the task at hand, and investigate the feasibility of sourcing new data collected specifically for the desired task. Much of the theory on which learning systems are based assumes that the training data are a random sample of the population about which the user wishes to learn a model. However, much historical data represent biased samples, for example, data that have been easy to collect or that have been considered interesting for some other purpose. It is desirable to consider whether the available data are sufficiently representative of the future data to which a learned model is to be applied. It is important to assess whether there is sufficient data to realistically obtain the desired machine learning outcomes. Data quality should be investigated. Much data is